feat(callbacks): Add BigQuery Callbacks (Sync and Async) #439
base: main
Conversation
💡 Review Comment: Addressing Large Content Fields in BigQuery Logging
This is a crucial point regarding the scalability and reliability of our BigQuery logging plugin, especially when integrating with large-context-window LLMs. The current approach of using … My primary recommendation is to implement content truncation before serialization, to ensure the integrity of our logging pipeline and optimize storage.
✍️ Proposed Implementation Changes
I suggest the following steps be integrated into the BigQuery logging plugin:
1. Introduce a Configuration Parameter
def __init__(
    self,
    ...,
    max_content_length: int = 50 * 1024,  # Default to 50 KB
):
    self._max_content_length = max_content_length
2. Implement Safe Content Truncation
def _truncate_content_safely(self, content: str) -> str:
    """Truncates the content string if it exceeds the configured max byte length."""
    if len(content.encode('utf-8')) > self._max_content_length:
        # Truncate at the byte limit and drop any partial trailing character
        truncated_content = content.encode('utf-8')[:self._max_content_length].decode('utf-8', 'ignore')
        return f"{truncated_content} [TRUNCATED_MAX_BYTES:{self._max_content_length}]"
    return content

# ... then use it before serialization
# "content": json.dumps({"prompts": self._truncate_content_safely(prompts)}),
3. Update the BigQuery Schema
📈 Enable High-Value Product Analytics (Session & User Context)
The current logging fields (…
🛡️ Robust Error Handling for Schema Mismatches
Schema evolution in BigQuery can lead to silent, hard-to-debug failures in our streaming write pipeline if the table schema has been modified independently of the plugin. To ensure robust logging, we must anticipate and clearly handle these failures.
This prevents a minor mismatch from taking down production logs and provides an immediate, clear diagnostic path for the engineer who manages the BigQuery table.
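As a minimal sketch of the kind of handling meant here (the helper and its names are hypothetical, not part of this PR; insert_rows_json returning a per-row error list is real google-cloud-bigquery behavior), the write path could log schema errors instead of raising:

import logging
from google.cloud import bigquery

logger = logging.getLogger(__name__)

def _safe_insert(client: bigquery.Client, table_id: str, rows: list[dict]) -> None:
    """Stream rows to BigQuery, surfacing schema mismatches without crashing the caller."""
    try:
        # insert_rows_json returns a list of per-row error mappings (empty on success).
        errors = client.insert_rows_json(table_id, rows)
    except Exception:  # table missing, auth failure, network error, etc.
        logger.exception("BigQuery logging failed for table %s", table_id)
        return
    if errors:
        # A schema mismatch typically surfaces here, e.g. {'index': 0, 'errors': [{'reason': 'invalid', ...}]}
        logger.error("BigQuery rejected %d row(s) for %s: %s", len(errors), table_id, errors)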
There is no official support, so we are using the metadata field if users want to include such information: https://github.com/orgs/langfuse/discussions/7331
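For illustration only (the metadata keys are user-chosen examples and bigquery_handler stands in for an already-constructed handler; config={"metadata": ...} is standard LangChain runnable configuration), session and user context could be attached like this and picked up by the callback from the run metadata:

from langchain_core.runnables import RunnableLambda

# `bigquery_handler` is assumed to be a BigQueryCallbackHandler instance from this PR.
chain = RunnableLambda(lambda text: f"summary of {text}")

chain.invoke(
    "quarterly report",
    config={
        "callbacks": [bigquery_handler],
        "metadata": {"session_id": "abc-123", "user_id": "u-42"},
    },
)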
Then we need to think about how to unify the schema for ADK and LangChain telemetry logging.
In that case we may need something like a struct or a JSON string for everything that only exists in one library.
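To make that concrete, here is a hedged sketch (column names are assumptions, not something agreed in this thread) of a unified table where framework-specific attributes land in a single JSON column:

from google.cloud import bigquery

# Hypothetical unified telemetry schema: shared columns are typed, and anything
# that exists only in one framework (ADK vs. LangChain) goes into `extra` as JSON.
UNIFIED_SCHEMA = [
    bigquery.SchemaField("timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("run_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("content", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("extra", "JSON", mode="NULLABLE"),  # library-specific fields
]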
libs/community/pyproject.toml
Outdated
# [project.optional-dependencies]
# bigquery = [
#     "google-cloud-bigquery",
#     "google-cloud-bigquery-storage",
#     "pyarrow",
# ]
commented out?
Removed, thanks!
I think this might be more appropriate in our https://github.com/langchain-ai/langchain-google/tree/main/libs/community/langchain_google_community
Hi Mason, we discussed this and think it's better to keep it in the langchain_community package, for two main reasons: 1. it puts our callback in the same place as other callbacks; 2. langchain_community is much more popular compared to langchain_google_community. Please let me know what you think.
This PR introduces BigQueryCallbackHandler and AsyncBigQueryCallbackHandler to log LangChain run lifecycle events (LLM, Chain, Tool, Agent, Retriever, and Chat Model activities) directly to Google Cloud BigQuery.
Key features include:
- Logging of all standard LangChain event types (e.g., on_llm_start, on_tool_end, on_chain_error).
- Automatic dataset and table creation during handler initialization if they don't exist.
- Use of BigQuery's specialized write clients for both synchronous and asynchronous data ingestion (see the usage sketch below).
Issue: N/A
Dependencies: google-cloud-bigquery (Optional dependency)
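A minimal usage sketch for readers skimming the thread; the import path and constructor arguments (project_id, dataset_id, table_id) are assumptions, since the exact signature lives in the PR's code rather than in this conversation:

from langchain_community.callbacks import BigQueryCallbackHandler  # import path assumed
from langchain_core.runnables import RunnableLambda

# Argument names are hypothetical; per the description, the handler creates the
# dataset and table on initialization if they don't already exist.
handler = BigQueryCallbackHandler(
    project_id="my-gcp-project",
    dataset_id="langchain_logs",
    table_id="runs",
)

chain = RunnableLambda(lambda q: f"echo: {q}")
chain.invoke("hello", config={"callbacks": [handler]})

# The async variant, AsyncBigQueryCallbackHandler, would pair with `await chain.ainvoke(...)`.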