fix(metadata_service): ensure database rollback on exceptions during metadata operations #27462

liugddx · 2025-10-26T15:31:38Z

Related to #27427

Fix the MetadataDataSource enumeration by adding the missing data source types.
Use safe dictionary access methods to prevent KeyError.
Handle the case where the uploader might be None.

gemini-code-assist · 2025-10-26T15:31:57Z

Summary of Changes

Hello @liugddx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and data integrity of metadata operations within the system. It introduces mechanisms to ensure proper database transaction rollbacks in the event of failures, preventing inconsistent states. Additionally, it improves the handling of metadata fields by safely resolving data source types and providing a fallback for potentially missing uploader information, thereby making the system more resilient to unexpected data conditions.

Highlights

Database Transaction Rollback: Implemented db.session.rollback() and re-raised exceptions across various metadata operations to ensure data integrity and proper transaction handling in case of failures.
Safe Data Source Handling: Introduced a new utility function, get_safe_data_source_value, to robustly retrieve data source types from the MetadataDataSource enum, preventing KeyError by falling back to the original value and logging a warning if the type is unknown.
Uploader Information Robustness: Modified document metadata creation to handle cases where the uploader's name might be None, defaulting it to 'Unknown' to prevent potential errors and ensure consistent data.
MetadataDataSource Expansion: Extended the MetadataDataSource enumeration to include online_drive, accommodating new data source types within the system.

Copilot

Pull Request Overview

This PR enhances the metadata service by adding proper exception handling with database rollbacks and improving data source type handling. The changes prevent database inconsistencies when metadata operations fail and handle edge cases for missing or unknown data source types.

Key Changes:

Added database rollback calls in exception handlers across metadata operations to ensure transaction consistency
Introduced get_safe_data_source_value() function to handle unknown data source types gracefully
Added fallback value "Unknown" for cases where uploader might be None

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
api/services/metadata_service.py	Added `db.session.rollback()` calls in exception handlers for metadata operations and replaced direct enum access with safe accessor function
api/services/dataset_service.py	Updated to use safe data source accessor and added None-coalescing for uploader field
api/core/rag/index_processor/constant/built_in_field.py	Added `online_drive` to `MetadataDataSource` enum and implemented `get_safe_data_source_value()` helper function
api/core/app/apps/pipeline/pipeline_generator.py	Updated to use safe data source accessor and added None-coalescing for uploader field

Copilot · 2025-10-26T15:32:24Z

api/core/rag/index_processor/constant/built_in_field.py

+        The mapped enum value if exists, otherwise the original value
+    """
+    try:
+        return MetadataDataSource[data_source_type].value


The function looks up the enum member by name with bracket notation, which raises KeyError if the name does not exist, while the docstring says it should return the original value in that case. The try/except block catches the KeyError, so the behavior matches the docstring. Bracket notation on an Enum class (not a dict) is the correct way to access a member by name, and calling .value on the result is correct. The exception handling is appropriate.
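
For readers following along, the lookup-with-fallback behavior discussed above can be sketched as below. This is a minimal illustration and not the PR's actual code: the member names and values in this MetadataDataSource are placeholder assumptions, and the warning text is invented for the example.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class MetadataDataSource(Enum):
    # Member names and values are illustrative assumptions,
    # not the project's real mapping.
    upload_file = "file_upload"
    online_drive = "online_drive"


def get_safe_data_source_value(data_source_type: str) -> str:
    """Return the mapped enum value if it exists, otherwise the original value."""
    try:
        # Brackets on an Enum class look a member up by *name*;
        # an unknown name raises KeyError.
        return MetadataDataSource[data_source_type].value
    except KeyError:
        logger.warning("Unknown data source type: %s", data_source_type)
        return data_source_type
```

An unknown type such as "mystery" is returned unchanged, so callers never see the KeyError.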

Copilot · 2025-10-26T15:32:24Z

api/services/metadata_service.py

                        doc_metadata = copy.deepcopy(document.doc_metadata)
                    doc_metadata[BuiltInField.document_name] = document.name
-                    doc_metadata[BuiltInField.uploader] = document.uploader
+                    doc_metadata[BuiltInField.uploader] = document.uploader or "Unknown"


[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.

Copilot · 2025-10-26T15:32:25Z

api/services/dataset_service.py

            doc_metadata = {
                BuiltInField.document_name: name,
-                BuiltInField.uploader: account.name,
+                BuiltInField.uploader: account.name or "Unknown",


[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.

Copilot · 2025-10-26T15:32:25Z

api/core/app/apps/pipeline/pipeline_generator.py

            doc_metadata = {
                BuiltInField.document_name: name,
-                BuiltInField.uploader: account.name,
+                BuiltInField.uploader: account.name or "Unknown",


[nitpick] Using a hardcoded string 'Unknown' for missing uploader data may cause issues with data consistency and internationalization. Consider defining this as a constant (e.g., UNKNOWN_UPLOADER = 'Unknown') at the module level or in a constants file to ensure consistency across the codebase and easier maintenance.
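
One way the suggested constant could look is sketched below. The constant name comes from the review comment; the helper resolve_uploader is hypothetical and only exists to show the fallback in one place.

```python
from typing import Optional

# Module-level constant as proposed in the review; its location is an assumption.
UNKNOWN_UPLOADER = "Unknown"


def resolve_uploader(name: Optional[str]) -> str:
    """Return the uploader name, or the shared fallback when it is missing.

    `or` also replaces an empty string, the same as `account.name or "Unknown"`.
    """
    return name or UNKNOWN_UPLOADER
```

With this in place, the three call sites would share one definition of the fallback instead of repeating the literal.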

gemini-code-assist

Code Review

This pull request introduces crucial fixes to enhance the reliability and robustness of metadata operations. The primary improvement is ensuring database transactions are rolled back upon exceptions in metadata_service, which prevents data inconsistency. The changes also correctly handle potential None values for uploaders and use a safe method for accessing data source types to avoid KeyError exceptions. Additionally, a missing data source type has been added to the MetadataDataSource enumeration. The code is well-structured, and the fixes are valuable. I've added one suggestion to further improve transaction atomicity in update_documents_metadata by consolidating multiple commits into a single one.

api/services/metadata_service.py

liugddx · 2025-10-27T03:16:30Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces several important fixes to enhance robustness and error handling, particularly within the metadata_service. The key changes include ensuring database rollbacks on exceptions, safely handling potentially missing data, and expanding the MetadataDataSource enumeration. The addition of db.session.rollback() in exception handlers is a crucial improvement for maintaining data integrity. While the changes are positive, I've identified a bug in the name uniqueness check, a potential risk with breaking API contract changes, and an opportunity to further improve the batch update logic for better resilience. My detailed comments are below.

gemini-code-assist · 2025-10-27T03:18:29Z

api/services/metadata_service.py


    @staticmethod
-    def update_metadata_name(dataset_id: str, metadata_id: str, name: str) -> DatasetMetadata:  # type: ignore
+    def update_metadata_name(dataset_id: str, metadata_id: str, name: str) -> DatasetMetadata | None:


The return types of update_metadata_name and delete_metadata have been changed to return ... | None instead of raising a ValueError when an item is not found. This is a breaking change to their contracts. It's crucial that all callers are updated to handle the None return value to prevent TypeError or other unexpected behavior. This pull request does not seem to include updates to the call sites, which is risky.

gemini-code-assist · 2025-10-27T03:18:29Z

api/services/metadata_service.py

            except Exception:
+                db.session.rollback()
                logger.exception("Update documents metadata failed")
+                raise


The update_documents_metadata function processes a batch of metadata updates. Inside the loop, if an exception occurs during the update of a single document's metadata, the transaction for that document is rolled back, and the exception is re-raised, which terminates the entire batch operation. This is a "fail-fast" strategy.

For a more robust batch operation, you might consider a strategy where the failure of one document does not stop the processing of others. You could collect the errors for each failed document and report them all at the end, allowing successful updates to proceed.
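
The collect-and-report strategy described here might look roughly like the following. It is a sketch under assumptions: update_batch and its callback are hypothetical, and ValueError stands in for whatever per-document failure the service would treat as recoverable.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def update_batch(
    document_ids: Iterable[str],
    update_one: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Apply update_one to each id; record failures instead of stopping."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for document_id in document_ids:
        try:
            update_one(document_id)
            succeeded.append(document_id)
        except ValueError as exc:  # assumed per-document failure type
            # In a real service, a per-document rollback would go here.
            failed[document_id] = str(exc)
    return succeeded, failed
```

The caller receives both lists and can decide whether a partial success should be reported as an error.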

liugddx · 2025-10-27T03:49:16Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces several important fixes for the metadata service. The primary improvement is ensuring database transactions are rolled back on exceptions, which is critical for data integrity. The changes are well-implemented across multiple methods. Additionally, the PR enhances robustness by safely handling potentially missing data sources and None values for uploader names. The batch update logic has also been improved to be more resilient to individual document failures. I've added a couple of suggestions for future refactoring to further improve maintainability.

gemini-code-assist · 2025-10-27T03:51:25Z

api/services/metadata_service.py

        except Exception:
+            db.session.rollback()
            logger.exception("Update metadata name failed")
+            raise
        finally:
            redis_client.delete(lock_key)


The pattern of try...except...finally for database transactions and Redis locks is repeated in update_metadata_name, delete_metadata, enable_built_in_field, disable_built_in_field, and update_documents_metadata. This code duplication can make future maintenance difficult.

To adhere to the DRY (Don't Repeat Yourself) principle, you could abstract this pattern into a reusable context manager. This would make the intent of the code clearer and reduce the chance of errors.

Here's an example of how it could look:

    from contextlib import contextmanager

    @contextmanager
    def managed_transaction(lock_key: str, error_message: str):
        try:
            yield
            db.session.commit()
        except Exception:
            db.session.rollback()
            logger.exception(error_message)
            raise
        finally:
            if redis_client.get(lock_key):
                redis_client.delete(lock_key)

Using this, the update_metadata_name method would be simplified, improving readability.
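
A runnable sketch of how a method could use such a context manager is shown below. FakeSession and FakeRedis are stand-ins invented for this example (the real code uses db.session and redis_client), and rename is a hypothetical, much reduced version of update_metadata_name.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)


class FakeSession:
    """Stand-in for db.session that records which calls were made."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")


class FakeRedis:
    """Stand-in for redis_client, backed by a dict."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)


session = FakeSession()
redis_client = FakeRedis()


@contextmanager
def managed_transaction(lock_key: str, error_message: str):
    try:
        yield
        session.commit()  # only reached if the body did not raise
    except Exception:
        session.rollback()
        logger.exception(error_message)
        raise
    finally:
        # Release the lock on both the success and the failure path.
        if redis_client.get(lock_key):
            redis_client.delete(lock_key)


def rename(lock_key: str, fail: bool = False) -> None:
    redis_client.set(lock_key, 1)  # acquire the lock before the transaction
    with managed_transaction(lock_key, "Update metadata name failed"):
        if fail:
            raise RuntimeError("simulated failure")
```

The method body shrinks to the lock acquisition plus the work itself; commit, rollback, logging, and lock release live in one place.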

gemini-code-assist · 2025-10-27T03:51:25Z

api/services/metadata_service.py

+                except Exception as e:
+                    logger.warning("Failed to get document %s: %s", operation.document_id, str(e))
+                    continue


Catching a broad Exception and then continuing the loop can mask serious underlying problems. For instance, if the database connection is lost, this will log a warning for every document in the batch and continue, instead of failing fast. It's better to catch more specific, expected exceptions for a single-item failure and let unexpected exceptions (like database connection errors) propagate to the outer except block. This would correctly halt the entire batch operation on a fatal error, which is the safer default behavior.
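
The narrower handling suggested above could be sketched as follows. DocumentNotFoundError, load_documents, and the get_document callback are hypothetical names for illustration; the point is only that the expected per-item error is skipped while anything else propagates.

```python
from typing import Callable, Iterable, List


class DocumentNotFoundError(Exception):
    """Hypothetical exception for an expected, per-document failure."""


def load_documents(
    document_ids: Iterable[str],
    get_document: Callable[[str], dict],
) -> List[dict]:
    """Load each document, skipping only those that are known to be missing."""
    documents: List[dict] = []
    for document_id in document_ids:
        try:
            documents.append(get_document(document_id))
        except DocumentNotFoundError:
            # Expected single-item failure: skip this document only.
            continue
        # Any other exception (for example ConnectionError) is not caught
        # here, so it propagates and halts the whole batch.
    return documents
```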

liugddx added 2 commits October 26, 2025 23:23

fix(metadata_service): ensure database rollback on exceptions during metadata operations

1a00a83

fix(metadata_service): ensure database rollback on exceptions during metadata operations

0f18aa2

Copilot AI review requested due to automatic review settings October 26, 2025 15:31

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 26, 2025

Copilot AI reviewed Oct 26, 2025

gemini-code-assist bot reviewed Oct 26, 2025

liugddx added 3 commits October 27, 2025 09:41

fix: update TypeError match patterns for improved clarity in tests

90068c3

fix: update TypeError match patterns for improved clarity in tests

400f287

fix: update TypeError match patterns for improved clarity in tests

af72a5c

gemini-code-assist bot reviewed Oct 27, 2025

fix: update TypeError match patterns for improved clarity in tests

eaa7de0

gemini-code-assist bot reviewed Oct 27, 2025
