Conversation

@Genesis929 Genesis929 commented Nov 17, 2025

This PR introduces BigQueryCallbackHandler and AsyncBigQueryCallbackHandler to log LangChain run lifecycle events (LLM, Chain, Tool, Agent, Retriever, and Chat Model activities) directly to Google Cloud BigQuery.

Key features include:

  • Logging of all standard LangChain event types (e.g., on_llm_start, on_tool_end, on_chain_error).
  • Automatic dataset and table creation during handler initialization if they don't exist.
  • Use of BigQuery's specialized write clients for both synchronous and asynchronous data ingestion.

Issue: N/A

Dependencies: google-cloud-bigquery (Optional dependency)
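For context, a hypothetical usage sketch; the constructor parameter names below are illustrative assumptions, not necessarily the handler's actual signature:

from langchain_community.callbacks import BigQueryCallbackHandler  # assumed import path for the handler added in this PR
from langchain_core.language_models import FakeListLLM

# Hypothetical parameters: a GCP project plus the target dataset/table.
handler = BigQueryCallbackHandler(
    project_id="my-project",
    dataset_id="langchain_logs",
    table_id="agent_runs",
)

# Attach like any other LangChain callback handler; lifecycle events are
# then streamed as rows to the BigQuery table.
llm = FakeListLLM(responses=["Hello!"])
llm.invoke("Hi", config={"callbacks": [handler]})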

@Genesis929 Genesis929 changed the title Add: BigQuery Callbacks feat(callbacks): Add BigQuery Callbacks (Sync and Async) Nov 17, 2025
@haiyuan-eng-google

💡 Review Comment: Addressing Large Content Fields in BigQuery Logging

This is a crucial point regarding the scalability and reliability of our BigQuery logging plugin, especially when integrating with large context window LLMs. The current approach of using json.dumps on the full prompts or generations content could easily lead to failed streaming writes or significant, unexpected storage costs due to BigQuery's row size limits (even if the hard limit is high, practical limits for smooth streaming are much lower).

My primary recommendation is to implement content truncation before serialization to ensure the integrity of our logging pipeline and optimize storage.

✍️ Proposed Implementation Changes

I suggest the following steps be integrated into the BigQuery logging plugin's __init__ and data formatting logic:

1. Introduce a Configuration Parameter

  • Action: Add a max_content_length parameter to the plugin's __init__ method.
  • Recommendation: Set a conservative, production-safe default value (e.g., 50 KB) to prevent immediate write failures while still capturing sufficient contextual data.
def __init__(
    self,
    ...,
    max_content_length: int = 50 * 1024,  # Default to 50KB
):
    self._max_content_length = max_content_length

2. Implement Safe Content Truncation

  • Action: Create a helper method (similar to _format_content_safely in the ADK) to handle the truncation logic. This function should operate on the raw content string before it is passed to json.dumps.
def _truncate_content_safely(self, content: str) -> str:
    """Truncate the content string if it exceeds the configured max length."""
    encoded = content.encode("utf-8")
    if len(encoded) > self._max_content_length:
        # Slice on bytes, then decode with errors="ignore" so a multi-byte
        # character split at the boundary is dropped rather than raising.
        truncated_content = encoded[: self._max_content_length].decode("utf-8", "ignore")
        return f"{truncated_content} [TRUNCATED_MAX_BYTES:{self._max_content_length}]"
    return content

# ... then use it before serialization. Note that prompts is a list of
# strings in on_llm_start, so each entry is truncated individually:
# "content": json.dumps({"prompts": [self._truncate_content_safely(p) for p in prompts]}),

3. Update the BigQuery Schema

  • Action: Add a new boolean field to the BigQuery schema for the relevant logging table (e.g., agent_runs). This field will act as a flag for downstream analytics; a sketch follows after this list.
  • Field: is_truncated (type BOOLEAN)
  • Purpose: This is vital for analysts to understand when the logged content is partial. We need to be able to tell if a low-token prompt was genuinely short or if a massive prompt was cut off.
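To make the change concrete, a minimal sketch with the google-cloud-bigquery client; the surrounding field names are taken from examples in this thread and are assumptions:

from google.cloud import bigquery

# Hypothetical schema for the agent_runs logging table; only is_truncated is new.
AGENT_RUNS_SCHEMA = [
    bigquery.SchemaField("run_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("parent_run_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("content", "STRING", mode="NULLABLE"),
    # Flags rows whose content was cut by _truncate_content_safely, so analysts
    # can tell a genuinely short prompt from one that was truncated.
    bigquery.SchemaField("is_truncated", "BOOLEAN", mode="NULLABLE"),
]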


haiyuan-eng-google commented Nov 17, 2025

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).
  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon). A sketch follows after this list.
  • Rationale: This simple addition enables critical product-level queries like:
    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

@haiyuan-eng-google

🛡️ Robust Error Handling for Schema Mismatches

Schema evolution in BigQuery can lead to silent, hard-to-debug failures in our streaming write pipeline if the table schema has been modified independently of the plugin. To ensure robust logging, we must anticipate and clearly handle these failures.

  • Action: In the _perform_write method (or wherever the BigQuery Write API call is executed), wrap the write operation in a try...except block.
  • Implementation: Specifically catch exceptions that indicate a schema mismatch (e.g., fields missing, type errors) when attempting to write data.
  • Suggestion: When a schema error is caught, perform the following:
    1. Do not crash the entire agent run.
    2. Log a prominent, helpful warning that clearly instructs the user, such as:

      ⚠️ BigQuery Logging Warning: Schema mismatch detected for table [table_name]. This write failed. Please ensure your BigQuery table schema matches the current version of the ADK analytics plugin.

    3. Continue the agent run.

This prevents a minor mismatch from taking down production logs and provides an immediate, clear diagnostic path for the engineer who manages the BigQuery table.
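A sketch of what this could look like; the exception type and the use of insert_rows_json here are illustrative assumptions (the PR's specialized write clients may surface schema errors differently):

import logging

from google.api_core import exceptions as gapi_exceptions

logger = logging.getLogger(__name__)

def _perform_write(self, rows: list[dict]) -> None:
    try:
        # insert_rows_json reports per-row problems as a list instead of raising.
        errors = self._client.insert_rows_json(self._table, rows)
        if errors:
            logger.warning("BigQuery Logging Warning: row-level write errors: %s", errors)
    except gapi_exceptions.BadRequest as exc:
        # Schema mismatch (missing fields, type errors): warn loudly, do not
        # crash the agent run, and let execution continue.
        logger.warning(
            "BigQuery Logging Warning: Schema mismatch detected for table %s. "
            "This write failed. Please ensure your BigQuery table schema matches "
            "the current version of the plugin. Details: %s",
            self._table,
            exc,
        )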

@Genesis929
Author

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

@haiyuan-eng-google

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

Then we need to think about how to unify the schema for ADK and LangChain telemetry logging.

@Genesis929
Author

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

Then we need to think about how to unify the schema for ADK and LangChain telemetry logging.

In that case we may need something like a STRUCT column or a JSON string for everything that only exists in one library.
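One possible shape for that, sketched with a hypothetical framework_extras JSON column that absorbs whatever is unique to either library:

from google.cloud import bigquery

# Hypothetical unified schema: columns shared by ADK and LangChain telemetry
# are first-class; anything library-specific goes into one JSON catch-all.
UNIFIED_SCHEMA = [
    bigquery.SchemaField("run_id", "STRING"),
    bigquery.SchemaField("parent_run_id", "STRING"),
    bigquery.SchemaField("session_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("is_truncated", "BOOLEAN"),
    bigquery.SchemaField("framework_extras", "JSON"),
]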

@Genesis929 Genesis929 marked this pull request as ready for review November 25, 2025 19:06
Comment on lines 116 to 121
# [project.optional-dependencies]
# bigquery = [
# "google-cloud-bigquery",
# "google-cloud-bigquery-storage",
# "pyarrow",
# ]
Member

commented out?

Author

Removed, thanks!

mdrxy (Member) commented Nov 27, 2025

I think this might be more appropriate in our langchain-google-community package -- let me know what you think

https://github.com/langchain-ai/langchain-google/tree/main/libs/community/langchain_google_community

@Genesis929
Author

I think this might be more appropriate in our langchain-google-community package -- let me know what you think

https://github.com/langchain-ai/langchain-google/tree/main/libs/community/langchain_google_community

Hi Mason, we discussed this and think it's better to keep it in the langchain_community package, for two main reasons: 1. it puts our callback in the same place as the other callbacks; 2. langchain_community is much more popular compared to langchain_google_community.

Please let me know what you think.

@Genesis929 Genesis929 requested a review from mdrxy December 3, 2025 17:27