fix(metadata_service): ensure database rollback on exceptions during metadata operations #27462
base: main
Conversation
Summary of Changes

Hello @liugddx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the stability and data integrity of metadata operations within the system. It introduces mechanisms to ensure proper database transaction rollbacks in the event of failures, preventing inconsistent states. Additionally, it improves the handling of metadata fields by safely resolving data source types and providing a fallback for potentially missing uploader information, thereby making the system more resilient to unexpected data conditions.
Pull Request Overview
This PR enhances the metadata service by adding proper exception handling with database rollbacks and improving data source type handling. The changes prevent database inconsistencies when metadata operations fail and handle edge cases for missing or unknown data source types.
Key Changes:
- Added database rollback calls in exception handlers across metadata operations to ensure transaction consistency
- Introduced get_safe_data_source_value() function to handle unknown data source types gracefully (see the sketch after this list)
- Added fallback value "Unknown" for cases where uploader might be None
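Based on the reviewed hunk quoted further down (the try/return around MetadataDataSource[data_source_type].value), the new helper roughly takes the following shape. This is a sketch: the enum members listed here are illustrative, and only the lookup-with-fallback behavior is taken from the PR.

```python
from enum import Enum


class MetadataDataSource(Enum):
    # Illustrative members; the PR also adds an online_drive entry to this enum.
    upload_file = "file_upload"
    website_crawl = "website_crawl"
    notion_import = "notion_import"
    online_drive = "online_drive"


def get_safe_data_source_value(data_source_type: str) -> str:
    """Return the mapped enum value if it exists, otherwise the original value."""
    try:
        # Bracket notation on an Enum class looks up a member by name and raises KeyError if missing.
        return MetadataDataSource[data_source_type].value
    except KeyError:
        # Unknown data source types fall back to the raw string instead of raising.
        return data_source_type
```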
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| api/services/metadata_service.py | Added db.session.rollback() calls in exception handlers for metadata operations and replaced direct enum access with safe accessor function |
| api/services/dataset_service.py | Updated to use safe data source accessor and added None-coalescing for uploader field |
| api/core/rag/index_processor/constant/built_in_field.py | Added online_drive to MetadataDataSource enum and implemented get_safe_data_source_value() helper function |
| api/core/app/apps/pipeline/pipeline_generator.py | Updated to use safe data source accessor and added None-coalescing for uploader field |
```python
        The mapped enum value if exists, otherwise the original value
    """
    try:
        return MetadataDataSource[data_source_type].value
```
Copilot AI · Oct 26, 2025
The function accesses the enum using bracket notation, which raises KeyError when the key doesn't exist, while the docstring says the original value should be returned in that case. Since this is member lookup on an Enum class rather than a dict, there is no .get() alternative; catching KeyError in the try-except block and falling back to the original value is the correct behavior, and the exception handling here is appropriate.
```diff
 doc_metadata = copy.deepcopy(document.doc_metadata)
 doc_metadata[BuiltInField.document_name] = document.name
-doc_metadata[BuiltInField.uploader] = document.uploader
+doc_metadata[BuiltInField.uploader] = document.uploader or "Unknown"
```
Copilot AI · Oct 26, 2025
[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.
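For illustration, the suggested constant could be defined once and reused at the three call sites touched by this PR. This is a sketch of the reviewer's suggestion; the module placement is an assumption.

```python
# Hypothetical shared constant, e.g. alongside BuiltInField in built_in_field.py.
UNKNOWN_UPLOADER = "Unknown"

# The assignment in the hunk above would then read:
# doc_metadata[BuiltInField.uploader] = document.uploader or UNKNOWN_UPLOADER
```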
```diff
 doc_metadata = {
     BuiltInField.document_name: name,
-    BuiltInField.uploader: account.name,
+    BuiltInField.uploader: account.name or "Unknown",
```
Copilot AI · Oct 26, 2025
[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.
```diff
 doc_metadata = {
     BuiltInField.document_name: name,
-    BuiltInField.uploader: account.name,
+    BuiltInField.uploader: account.name or "Unknown",
```
Copilot AI · Oct 26, 2025
[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.
Code Review
This pull request introduces crucial fixes to enhance the reliability and robustness of metadata operations. The primary improvement is ensuring database transactions are rolled back upon exceptions in metadata_service, which prevents data inconsistency. The changes also correctly handle potential None values for uploaders and use a safe method for accessing data source types to avoid KeyError exceptions. Additionally, a missing data source type has been added to the MetadataDataSource enumeration. The code is well-structured, and the fixes are valuable. I've added one suggestion to further improve transaction atomicity in update_documents_metadata by consolidating multiple commits into a single one.
/gemini review
Code Review
This pull request introduces several important fixes to enhance robustness and error handling, particularly within the metadata_service. The key changes include ensuring database rollbacks on exceptions, safely handling potentially missing data, and expanding the MetadataDataSource enumeration. The addition of db.session.rollback() in exception handlers is a crucial improvement for maintaining data integrity. While the changes are positive, I've identified a bug in the name uniqueness check, a potential risk with breaking API contract changes, and an opportunity to further improve the batch update logic for better resilience. My detailed comments are below.
```diff
 @staticmethod
-def update_metadata_name(dataset_id: str, metadata_id: str, name: str) -> DatasetMetadata:  # type: ignore
+def update_metadata_name(dataset_id: str, metadata_id: str, name: str) -> DatasetMetadata | None:
```
The return types of update_metadata_name and delete_metadata have been changed to return ... | None instead of raising a ValueError when an item is not found. This is a breaking change to their contracts. It's crucial that all callers are updated to handle the None return value to prevent TypeError or other unexpected behavior. This pull request does not seem to include updates to the call sites, which is risky.
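To illustrate the risk, a call site written for the old contract would now need an explicit None check. This is a hypothetical sketch; only the changed method signature comes from this PR, and the error type raised by the caller is an assumption.

```python
from werkzeug.exceptions import NotFound  # assumed error type for an API-layer caller

metadata = MetadataService.update_metadata_name(dataset_id, metadata_id, name)
if metadata is None:
    # The service no longer raises ValueError for a missing record, so the caller must handle None itself.
    raise NotFound("Metadata not found.")
```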
```python
except Exception:
    db.session.rollback()
    logger.exception("Update documents metadata failed")
    raise
```
The update_documents_metadata function processes a batch of metadata updates. Inside the loop, if an exception occurs during the update of a single document's metadata, the transaction for that document is rolled back, and the exception is re-raised, which terminates the entire batch operation. This is a "fail-fast" strategy.
For a more robust batch operation, you might consider a strategy where the failure of one document does not stop the processing of others. You could collect the errors for each failed document and report them all at the end, allowing successful updates to proceed.
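A collect-and-report variant could look roughly like the following. This is a sketch of the suggestion rather than the PR's code: the per-document helper and the aggregate error type are hypothetical, and db is the service module's existing session object.

```python
# Sketch: keep processing the batch when a single document fails, then report all failures at once.
errors: list[tuple[str, str]] = []
for operation in operations:
    try:
        apply_metadata_update(operation)  # hypothetical per-document update
        db.session.commit()
    except Exception as e:
        db.session.rollback()  # undo only this document's changes
        errors.append((operation.document_id, str(e)))

if errors:
    # Successful documents stay committed; failures are surfaced together.
    raise MetadataBatchUpdateError(errors)  # hypothetical aggregate exception
```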
/gemini review
Code Review
This pull request introduces several important fixes for the metadata service. The primary improvement is ensuring database transactions are rolled back on exceptions, which is critical for data integrity. The changes are well-implemented across multiple methods. Additionally, the PR enhances robustness by safely handling potentially missing data sources and None values for uploader names. The batch update logic has also been improved to be more resilient to individual document failures. I've added a couple of suggestions for future refactoring to further improve maintainability.
```python
except Exception:
    db.session.rollback()
    logger.exception("Update metadata name failed")
    raise
finally:
    redis_client.delete(lock_key)
```
The pattern of try...except...finally for database transactions and Redis locks is repeated in update_metadata_name, delete_metadata, enable_built_in_field, disable_built_in_field, and update_documents_metadata. This code duplication can make future maintenance difficult.
To adhere to the DRY (Don't Repeat Yourself) principle, you could abstract this pattern into a reusable context manager. This would make the intent of the code clearer and reduce the chance of errors.
Here's an example of how it could look:
```python
from contextlib import contextmanager


@contextmanager
def managed_transaction(lock_key: str, error_message: str):
    try:
        yield
        db.session.commit()
    except Exception:
        db.session.rollback()
        logger.exception(error_message)
        raise
    finally:
        if redis_client.get(lock_key):
            redis_client.delete(lock_key)
```

Using this, the update_metadata_name method would be simplified, improving readability.
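Applied to update_metadata_name, the suggested context manager might be used along these lines. This is a hypothetical sketch of the refactor, not code from this PR; the lock key and query are illustrative, and db/DatasetMetadata are the module's existing objects.

```python
@staticmethod
def update_metadata_name(dataset_id: str, metadata_id: str, name: str) -> DatasetMetadata | None:
    lock_key = f"dataset_metadata_lock_{dataset_id}"  # illustrative lock key
    with managed_transaction(lock_key, "Update metadata name failed"):
        metadata = (
            db.session.query(DatasetMetadata)
            .filter_by(id=metadata_id, dataset_id=dataset_id)
            .first()
        )
        if metadata is None:
            return None
        metadata.name = name
        return metadata  # commit, rollback on error, and lock release are handled by managed_transaction
```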
```python
except Exception as e:
    logger.warning("Failed to get document %s: %s", operation.document_id, str(e))
    continue
```
Catching a broad Exception and then continuing the loop can mask serious underlying problems. For instance, if the database connection is lost, this will log a warning for every document in the batch and continue, instead of failing fast. It's better to catch more specific, expected exceptions for a single-item failure and let unexpected exceptions (like database connection errors) propagate to the outer except block. This would correctly halt the entire batch operation on a fatal error, which is the safer default behavior.
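As a rough illustration of that split, assuming NoResultFound is the "expected" per-document failure (the exception choice and lookup helper are assumptions, not what the PR does):

```python
from sqlalchemy.orm.exc import NoResultFound

for operation in operations:
    try:
        document = get_document(operation.document_id)  # hypothetical lookup helper
    except NoResultFound as e:
        # Expected single-item failure: log it and continue with the rest of the batch.
        logger.warning("Failed to get document %s: %s", operation.document_id, str(e))
        continue
    # Anything else (e.g. a lost database connection) is left uncaught here,
    # so it propagates to the outer handler and halts the whole batch.
```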
Related to #27427
Important
Fixes #<issue number>.
Summary
Screenshots
Checklist
I've run dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods