fix(gemini): gemini input token calculation when implicit cache is hit using langchain #1451
Context
For our Gemini usage (using Langchain through Vertex AI), we learned that costs for cached tokens are not correctly calculated. We traced this back to cached tokens not being correctly subtracted from the `input` token count, because the input tokens were reported in `input_modality_1`, where cached tokens are not being subtracted at all.
Observations (Current state)
- `input_modality_1` contains the tokens, while the `input` token count is 0.
- Cached tokens are subtracted from `input`, when they should be subtracted from `input_modality_1`.

Before fix:

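For illustration, a usage payload in this broken state looks roughly like the following. The values are hypothetical (not the real numbers from the report above); the field names follow the `input_modality_1` / `cached_modality_1` convention described here:

```python
# Hypothetical values illustrating the pre-fix state: the prompt tokens sit in
# input_modality_1, input itself is 0, and the cached tokens were only
# subtracted from the (already zero) input field.
observed_usage = {
    "input": 0,                   # cached tokens subtracted here, floored at 0
    "input_modality_1": 100_000,  # full prompt size, cache NOT subtracted
    "cached_modality_1": 60_000,  # tokens served from the implicit cache
    "output": 2_000,
}
```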
Impact
Price calculations are significantly off when caching is used via Vertex AI (with Langchain). In the above example we're talking about a 23% deviation, but in cases where input tokens are the main cost and we're making heavy use of caching, the calculation can be off by more than 50%.
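To show how the deviation grows with cache usage, here is a purely illustrative calculation with hypothetical prices and token counts (not the figures from our traces); the real-world deviation also depends on how much of the total cost comes from output tokens:

```python
# Illustrative only: cached tokens that are not subtracted from
# input_modality_1 end up billed twice, once at the full input rate
# and once at the cached rate.
INPUT_PRICE = 1.25e-6     # hypothetical price per input token
CACHED_PRICE = 0.3125e-6  # hypothetical cached-token price (25% of input)
OUTPUT_PRICE = 5e-6       # hypothetical price per output token

input_modality_1 = 100_000  # total input tokens reported for the modality
cached = 60_000             # tokens served from the implicit cache
output = 10_000

correct = ((input_modality_1 - cached) * INPUT_PRICE
           + cached * CACHED_PRICE
           + output * OUTPUT_PRICE)
buggy = (input_modality_1 * INPUT_PRICE
         + cached * CACHED_PRICE
         + output * OUTPUT_PRICE)

print(f"correct={correct:.5f} buggy={buggy:.5f} "
      f"deviation={buggy / correct - 1:.0%}")  # roughly +63% here
```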
Proposed fix
Subtract `cache_tokens_details` from the corresponding `input_modality` in addition to subtracting it from `input` (sketched below). Since this is only applied to the specific `input_modality`, we do not expect any unexpected side effects from this change.

After fix:

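As a minimal sketch, the adjustment could look like the following. This is not the actual `_parse_usage_model()` code; the helper name `subtract_cached_tokens` and the exact shape of the detail entries are assumptions for illustration:

```python
def subtract_cached_tokens(usage_model: dict, cache_tokens_details: list) -> dict:
    """Sketch: remove implicit-cache tokens from both input buckets."""
    for detail in cache_tokens_details:
        # The modality suffix is whatever Vertex AI reports (e.g. "1").
        modality = str(detail.get("modality", "")).lower()
        cached = detail.get("token_count", 0) or 0

        # Record cached tokens under their own modality key.
        usage_model[f"cached_modality_{modality}"] = cached

        # Existing behaviour: subtract from the generic "input" bucket.
        if "input" in usage_model:
            usage_model["input"] = max(0, usage_model["input"] - cached)

        # Proposed fix: also subtract from the modality-specific bucket,
        # which is where the tokens are reported when "input" is 0.
        modality_key = f"input_modality_{modality}"
        if modality_key in usage_model:
            usage_model[modality_key] = max(0, usage_model[modality_key] - cached)

    return usage_model
```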
Verification
I've validated this with a modified version of `langfuse.langchain.CallbackHandler.py` against our Langfuse Cloud app.
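A small self-contained check of the same behaviour (written against the hypothetical `subtract_cached_tokens` sketch above, not the real `CallbackHandler` internals) could look like:

```python
def test_cached_tokens_subtracted_from_modality_field():
    usage = {"input": 0, "input_modality_1": 100_000, "output": 2_000}
    details = [{"modality": "1", "token_count": 60_000}]

    result = subtract_cached_tokens(usage, details)

    # Cached tokens are recorded separately and removed from the modality bucket.
    assert result["cached_modality_1"] == 60_000
    assert result["input_modality_1"] == 40_000
    # The generic input bucket never goes negative.
    assert result["input"] == 0


def test_no_change_without_cache_details():
    # Non-cached / non-Vertex-AI usage passes through unchanged.
    usage = {"input": 1_000, "output": 200}
    assert subtract_cached_tokens(dict(usage), []) == usage
```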
Important

Fixes token cost calculation by subtracting cached tokens from `input_modality_1` in `CallbackHandler.py`, correcting significant price deviations with Vertex AI caching.

- Subtracts `cache_tokens_details` from `input_modality_1` in `_parse_usage_model()` in `CallbackHandler.py`.
- Cached tokens are now subtracted from both `input` and `input_modality_1`.
- Validated with a modified `langfuse.langchain.CallbackHandler.py` against the Langfuse Cloud app.

This description was created for 5726697 and will automatically update as commits are pushed.
Disclaimer: Experimental PR review
Greptile Overview
Greptile Summary
Fixes a critical bug in Gemini/Vertex AI cached token calculation when using Langchain. When cached tokens are present and input tokens are reported in `input_modality_{modality}` fields (rather than the generic `input` field), the previous code only subtracted cached tokens from `input`, leaving `input_modality_{modality}` inflated. This caused cost calculations to be off by 23-50%+ when caching was used.

Key changes:
- Subtracts cached tokens from the `input_modality_{modality}` field in addition to the `input` field
- `prompt_tokens_details` and `candidates_tokens_details` are already handled
- Uses `max(0, ...)` to prevent negative token counts

Impact:
Confidence Score: 5/5
The fix follows the existing `max(0, ...)` safeguard pattern. The change only affects Vertex AI/Gemini scenarios where `cache_tokens_details` AND `input_modality` fields both exist, making it highly isolated with no risk to other providers or non-cached scenarios.

Important Files Changed
File Analysis
Sequence Diagram
```mermaid
sequenceDiagram
    participant LC as Langchain
    participant CB as CallbackHandler
    participant PU as _parse_usage_model
    participant LF as Langfuse

    LC->>CB: on_llm_end(response)
    CB->>PU: _parse_usage(response)
    Note over PU: Extract usage data from response

    alt Has cache_tokens_details (Vertex AI)
        PU->>PU: Extract cache token details
        PU->>PU: Create cached_modality_{modality} field
        alt input field exists
            PU->>PU: Subtract cached tokens from input
        end
        alt input_modality_{modality} exists
            PU->>PU: Subtract cached tokens from input_modality
            Note over PU: FIX: Ensures accurate token<br/>count when input is in modality
        end
    end

    PU-->>CB: Return usage_model with corrected tokens
    CB->>LF: Update generation with usage
    Note over LF: Cost calculated from<br/>corrected token counts
```