Conversation

@Genesis929 Genesis929 commented Nov 17, 2025

This PR introduces BigQueryCallbackHandler and AsyncBigQueryCallbackHandler to log LangChain run lifecycle events (LLM, Chain, Tool, Agent, Retriever, and Chat Model activities) directly to Google Cloud BigQuery.

Key features include:

  • Logging of all standard LangChain event types (e.g., on_llm_start, on_tool_end, on_chain_error).
  • Automatic dataset and table creation during handler initialization if they don't exist.
  • Use of BigQuery's specialized write clients for both synchronous and asynchronous data ingestion.

Issue: N/A

Dependencies: google-cloud-bigquery (Optional dependency)
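For context, a hypothetical usage sketch; the constructor parameter names below are illustrative assumptions, not necessarily the handler's actual signature:

from langchain_community.callbacks import BigQueryCallbackHandler  # assumed import path for the handler added in this PR
from langchain_core.language_models import FakeListLLM

# Hypothetical parameters: a GCP project plus the target dataset/table.
handler = BigQueryCallbackHandler(
    project_id="my-project",
    dataset_id="langchain_logs",
    table_id="agent_runs",
)

# Attach like any other LangChain callback handler; lifecycle events are
# then streamed as rows to the BigQuery table.
llm = FakeListLLM(responses=["Hello!"])
llm.invoke("Hi", config={"callbacks": [handler]})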

@Genesis929 Genesis929 changed the title Add: BigQuery Callbacks feat(callbacks): Add BigQuery Callbacks (Sync and Async) Nov 17, 2025
@haiyuan-eng-google

💡 Review Comment: Addressing Large Content Fields in BigQuery Logging

This is a crucial point regarding the scalability and reliability of our BigQuery logging plugin, especially when integrating with large context window LLMs. The current approach of using json.dumps on the full prompts or generations content could easily lead to failed streaming writes or significant, unexpected storage costs due to BigQuery's row size limits (even if the hard limit is high, practical limits for smooth streaming are much lower).

My primary recommendation is to implement content truncation before serialization to ensure the integrity of our logging pipeline and optimize storage.

✍️ Proposed Implementation Changes

I suggest the following steps be integrated into the BigQuery logging plugin's __init__ and data formatting logic:

1. Introduce a Configuration Parameter

  • Action: Add a max_content_length parameter to the plugin's __init__ method.
  • Recommendation: Set a conservative, production-safe default value (e.g., 50 KB) to prevent immediate write failures while still capturing sufficient contextual data.
def __init__(
    self,
    ...,
    max_content_length: int = 50 * 1024,  # Default to 50KB
):
    self._max_content_length = max_content_length

2. Implement Safe Content Truncation

  • Action: Create a helper method (similar to _format_content_safely in the ADK) to handle the truncation logic. This function should operate on the raw content string before it is passed to json.dumps.
def _truncate_content_safely(self, content: str) -> str:
    """Truncate the content string if it exceeds the configured max length."""
    encoded = content.encode("utf-8")
    if len(encoded) > self._max_content_length:
        # Slice on bytes, then decode with errors="ignore" so a multi-byte
        # character split at the boundary is dropped rather than raising.
        truncated_content = encoded[: self._max_content_length].decode("utf-8", "ignore")
        return f"{truncated_content} [TRUNCATED_MAX_BYTES:{self._max_content_length}]"
    return content

# ... then use it before serialization. Note that prompts is a list of
# strings in on_llm_start, so each entry is truncated individually:
# "content": json.dumps({"prompts": [self._truncate_content_safely(p) for p in prompts]}),

3. Update the BigQuery Schema

  • Action: Add a new boolean field to the BigQuery schema for the relevant logging table (e.g., agent_runs). This field will act as a flag for downstream analytics; a sketch follows after this list.
  • Field: is_truncated (type BOOLEAN)
  • Purpose: This is vital for analysts to understand when the logged content is partial. We need to be able to tell if a low-token prompt was genuinely short or if a massive prompt was cut off.
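To make the change concrete, a minimal sketch with the google-cloud-bigquery client; the surrounding field names are taken from examples in this thread and are assumptions:

from google.cloud import bigquery

# Hypothetical schema for the agent_runs logging table; only is_truncated is new.
AGENT_RUNS_SCHEMA = [
    bigquery.SchemaField("run_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("parent_run_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("content", "STRING", mode="NULLABLE"),
    # Flags rows whose content was cut by _truncate_content_safely, so analysts
    # can tell a genuinely short prompt from one that was truncated.
    bigquery.SchemaField("is_truncated", "BOOLEAN", mode="NULLABLE"),
]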


haiyuan-eng-google commented Nov 17, 2025

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).
  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon). A sketch follows after this list.
  • Rationale: This simple addition enables critical product-level queries like:
    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

@haiyuan-eng-google

🛡️ Robust Error Handling for Schema Mismatches

Schema evolution in BigQuery can lead to silent, hard-to-debug failures in our streaming write pipeline if the table schema has been modified independently of the plugin. To ensure robust logging, we must anticipate and clearly handle these failures.

  • Action: In the _perform_write method (or wherever the BigQuery Write API call is executed), wrap the write operation in a try...except block.
  • Implementation: Specifically catch exceptions that indicate a schema mismatch (e.g., fields missing, type errors) when attempting to write data.
  • Suggestion: When a schema error is caught, perform the following:
    1. Do not crash the entire agent run.
    2. Log a prominent, helpful warning that clearly instructs the user, such as:

      ⚠️ BigQuery Logging Warning: Schema mismatch detected for table [table_name]. This write failed. Please ensure your BigQuery table schema matches the current version of the ADK analytics plugin.

    3. Continue the agent run.

This prevents a minor mismatch from taking down production logs and provides an immediate, clear diagnostic path for the engineer who manages the BigQuery table.
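A sketch of what this could look like; the exception type and the use of insert_rows_json here are illustrative assumptions (the PR's specialized write clients may surface schema errors differently):

import logging

from google.api_core import exceptions as gapi_exceptions

logger = logging.getLogger(__name__)

def _perform_write(self, rows: list[dict]) -> None:
    try:
        # insert_rows_json reports per-row problems as a list instead of raising.
        errors = self._client.insert_rows_json(self._table, rows)
        if errors:
            logger.warning("BigQuery Logging Warning: row-level write errors: %s", errors)
    except gapi_exceptions.BadRequest as exc:
        # Schema mismatch (missing fields, type errors): warn loudly, do not
        # crash the agent run, and let execution continue.
        logger.warning(
            "BigQuery Logging Warning: Schema mismatch detected for table %s. "
            "This write failed. Please ensure your BigQuery table schema matches "
            "the current version of the plugin. Details: %s",
            self._table,
            exc,
        )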

@Genesis929
Author

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

@haiyuan-eng-google

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

Then we need to think about how to unify the schema for ADK and LangChain telemetry logging.

@Genesis929
Author

📈 Enable High-Value Product Analytics (Session & User Context)

The current logging fields (run_id, parent_run_id) are excellent for technical traceability, but they fall short for genuine product analytics. To unlock queries that directly inform product development (the core value of this plugin), we need context about who is running the agent and within which session.

  • Action: Add two new optional columns to the BigQuery schema: session_id (STRING) and user_id (STRING).

  • Implementation: Implement logic to attempt extracting these values from the metadata dictionary provided during the run. Use a safe metadata.get("session_id") and metadata.get("user_id") (or equivalent keys as agreed upon).

  • Rationale: This simple addition enables critical product-level queries like:

    • "How many distinct users interacted with the agent last week?"
    • "What was the average number of turns per session?"
    • "How did User X's sentiment change over Session Y?"

There is no official support, so we use the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331

Then we need to think about how to unify the schema for ADK and LangChain telemetry logging.

In that case we may need something like a STRUCT column or a JSON string for everything that only exists in one library.
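One possible shape for that, sketched with a hypothetical framework_extras JSON column that absorbs whatever is unique to either library:

from google.cloud import bigquery

# Hypothetical unified schema: columns shared by ADK and LangChain telemetry
# are first-class; anything library-specific goes into one JSON catch-all.
UNIFIED_SCHEMA = [
    bigquery.SchemaField("run_id", "STRING"),
    bigquery.SchemaField("parent_run_id", "STRING"),
    bigquery.SchemaField("session_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("is_truncated", "BOOLEAN"),
    bigquery.SchemaField("framework_extras", "JSON"),
]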

@Genesis929 Genesis929 marked this pull request as ready for review November 25, 2025 19:06
Comment on lines 116 to 121
# [project.optional-dependencies]
# bigquery = [
# "google-cloud-bigquery",
# "google-cloud-bigquery-storage",
# "pyarrow",
# ]
Member

commented out?

Author

Removed, thanks!

mdrxy (Member) commented Nov 27, 2025

I think this might be more appropriate in our langchain-google-community package -- let me know what you think

https://github.com/langchain-ai/langchain-google/tree/main/libs/community/langchain_google_community

@Genesis929
Author

I think this might be more appropriate in our langchain-google-community package -- let me know what you think

https://github.com/langchain-ai/langchain-google/tree/main/libs/community/langchain_google_community

Hi Mason, we discussed this and think it's better to keep it in the langchain_community package, for two main reasons: 1. it puts our callback in the same place as the other callbacks; 2. langchain_community is much more popular compared to langchain_google_community.

Please let me know what you think.

@Genesis929 Genesis929 requested a review from mdrxy December 3, 2025 17:27