-
Notifications
You must be signed in to change notification settings - Fork 9
Mcp auto expand datasets #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Mcp auto expand datasets #25
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances the MCP retrieval tool by auto-expanding dataset_ids when omitted or empty, adds a new dataset summary tool, and removes duplicate code while improving formatting consistency.
Key Changes:
- Automatic expansion of
dataset_idsto include all available datasets when the field is not provided or empty - Introduction of
ragflow_dataset_summarytool for quick dataset metadata retrieval - Removal of duplicate dataset auto-expansion logic from the
retrievalmethod
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if name == "ragflow_retrieval": | ||
| document_ids = arguments.get("document_ids", []) | ||
| dataset_ids = arguments.get("dataset_ids", []) | ||
| dataset_ids = arguments.get("dataset_ids") or [] |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression arguments.get("dataset_ids") or [] will treat an empty list as falsy and replace it with a new empty list. This is redundant but harmless. However, for consistency with document_ids on line 445 which uses arguments.get("document_ids", []), consider using the same pattern: arguments.get("dataset_ids", []).
| dataset_ids = arguments.get("dataset_ids") or [] | |
| dataset_ids = arguments.get("dataset_ids", []) |
| dataset_ids = arguments.get("dataset_ids", []) | ||
| dataset_ids = arguments.get("dataset_ids") or [] | ||
|
|
||
|
|
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two consecutive blank lines here. Consider removing one to maintain consistent spacing throughout the file, as single blank lines are used elsewhere (e.g., between the comment and the if statement on lines 459-460).
| try: | ||
| dataset_ids.append(json.loads(line)["id"]) | ||
| except Exception: | ||
| continue |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The broad exception handler except Exception: silently ignores all parsing errors, including potentially serious issues like network errors or unexpected JSON structure. Consider at least logging these exceptions for debugging purposes, or being more specific about which exceptions to catch (e.g., json.JSONDecodeError, KeyError).
| try: | ||
| datasets.append(json.loads(line)) | ||
| except Exception: | ||
| continue |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The broad exception handler except Exception: silently ignores all parsing errors. Similar to the issue in the ragflow_retrieval handler, consider logging these exceptions or being more specific about which exceptions to catch (e.g., json.JSONDecodeError).
| "dataset_ids": { | ||
| "type": "array", | ||
| "items": {"type": "string"}, | ||
| "description": "Optional array of dataset IDs to search. If not provided or empty, all datasets will be searched." | ||
| }, | ||
| "document_ids": { | ||
| "type": "array", | ||
| "items": {"type": "string"}, | ||
| "description": "Optional array of document IDs to search within." | ||
| }, | ||
| "question": { | ||
| "type": "string", | ||
| "description": "The question or query to search for." | ||
| }, | ||
| "page": { | ||
| "type": "integer", | ||
| "description": "Page number for pagination", | ||
| "default": 1, | ||
| "minimum": 1, | ||
| }, | ||
| "page_size": { | ||
| "type": "integer", | ||
| "description": "Number of results to return per page (default: 10, max recommended: 50 to avoid token limits)", | ||
| "default": 10, | ||
| "minimum": 1, | ||
| "maximum": 100, | ||
| }, | ||
| "similarity_threshold": { | ||
| "type": "number", | ||
| "description": "Minimum similarity threshold for results", | ||
| "default": 0.2, | ||
| "minimum": 0.0, | ||
| "maximum": 1.0, | ||
| }, | ||
| "vector_similarity_weight": { | ||
| "type": "number", | ||
| "description": "Weight for vector similarity vs term similarity", | ||
| "default": 0.3, | ||
| "minimum": 0.0, | ||
| "maximum": 1.0, | ||
| }, | ||
| "keyword": { | ||
| "type": "boolean", | ||
| "description": "Enable keyword-based search", | ||
| "default": False, | ||
| }, | ||
| "top_k": { | ||
| "type": "integer", | ||
| "description": "Maximum results to consider before ranking", | ||
| "default": 1024, | ||
| "minimum": 1, | ||
| "maximum": 1024, | ||
| }, | ||
| "rerank_id": { | ||
| "type": "string", | ||
| "description": "Optional reranking model identifier", | ||
| }, | ||
| "force_refresh": { | ||
| "type": "boolean", | ||
| "description": "Set to true only if fresh dataset and document metadata is explicitly required. Otherwise, cached metadata is used (default: false).", | ||
| "default": False, | ||
| }, |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inputSchema properties are missing their "description" fields. While the properties have appropriate defaults and constraints, adding descriptions would improve the developer experience by providing inline documentation about what each parameter does. This is especially important for parameters like "similarity_threshold", "vector_similarity_weight", and "top_k" that require domain knowledge to use effectively.
| dataset_ids = [] | ||
|
|
||
| # Parse the dataset list to extract IDs | ||
|
|
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty line 462 immediately followed by another empty line on 463 creates inconsistent spacing. Consider removing one of these blank lines to maintain uniform code style.
Summary
Auto-expand
dataset_idsin the MCP retrieval tool when the field isomitted or an empty list is provided.
Solution Description
When
dataset_idsis not supplied by the client, the MCP server nowautomatically retrieves all available datasets and uses their IDs
for the retrieval request.
This aligns the runtime behaviour with the tool description, improves
developer experience, and removes the need for clients to manually
fetch dataset IDs before querying.