Skip to content

Conversation

@BurhanAbdullah
Copy link

Summary

Auto-expand dataset_ids in the MCP retrieval tool when the field is
omitted or an empty list is provided.

Solution Description

When dataset_ids is not supplied by the client, the MCP server now
automatically retrieves all available datasets and uses their IDs
for the retrieval request.

This aligns the runtime behaviour with the tool description, improves
developer experience, and removes the need for clients to manually
fetch dataset IDs before querying.

@CLAassistant
Copy link

CLAassistant commented Jan 4, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the MCP retrieval tool by auto-expanding dataset_ids when omitted or empty, adds a new dataset summary tool, and removes duplicate code while improving formatting consistency.

Key Changes:

  • Automatic expansion of dataset_ids to include all available datasets when the field is not provided or empty
  • Introduction of ragflow_dataset_summary tool for quick dataset metadata retrieval
  • Removal of duplicate dataset auto-expansion logic from the retrieval method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if name == "ragflow_retrieval":
document_ids = arguments.get("document_ids", [])
dataset_ids = arguments.get("dataset_ids", [])
dataset_ids = arguments.get("dataset_ids") or []
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression arguments.get("dataset_ids") or [] will treat an empty list as falsy and replace it with a new empty list. This is redundant but harmless. However, for consistency with document_ids on line 445 which uses arguments.get("document_ids", []), consider using the same pattern: arguments.get("dataset_ids", []).

Suggested change
dataset_ids = arguments.get("dataset_ids") or []
dataset_ids = arguments.get("dataset_ids", [])

Copilot uses AI. Check for mistakes.
dataset_ids = arguments.get("dataset_ids", [])
dataset_ids = arguments.get("dataset_ids") or []


Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two consecutive blank lines here. Consider removing one to maintain consistent spacing throughout the file, as single blank lines are used elsewhere (e.g., between the comment and the if statement on lines 459-460).

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +466 to +469
try:
dataset_ids.append(json.loads(line)["id"])
except Exception:
continue
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad exception handler except Exception: silently ignores all parsing errors, including potentially serious issues like network errors or unexpected JSON structure. Consider at least logging these exceptions for debugging purposes, or being more specific about which exceptions to catch (e.g., json.JSONDecodeError, KeyError).

Copilot uses AI. Check for mistakes.
Comment on lines +491 to +494
try:
datasets.append(json.loads(line))
except Exception:
continue
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad exception handler except Exception: silently ignores all parsing errors. Similar to the issue in the ragflow_retrieval handler, consider logging these exceptions or being more specific about which exceptions to catch (e.g., json.JSONDecodeError).

Copilot uses AI. Check for mistakes.
Comment on lines 378 to 428
"dataset_ids": {
"type": "array",
"items": {"type": "string"},
"description": "Optional array of dataset IDs to search. If not provided or empty, all datasets will be searched."
},
"document_ids": {
"type": "array",
"items": {"type": "string"},
"description": "Optional array of document IDs to search within."
},
"question": {
"type": "string",
"description": "The question or query to search for."
},
"page": {
"type": "integer",
"description": "Page number for pagination",
"default": 1,
"minimum": 1,
},
"page_size": {
"type": "integer",
"description": "Number of results to return per page (default: 10, max recommended: 50 to avoid token limits)",
"default": 10,
"minimum": 1,
"maximum": 100,
},
"similarity_threshold": {
"type": "number",
"description": "Minimum similarity threshold for results",
"default": 0.2,
"minimum": 0.0,
"maximum": 1.0,
},
"vector_similarity_weight": {
"type": "number",
"description": "Weight for vector similarity vs term similarity",
"default": 0.3,
"minimum": 0.0,
"maximum": 1.0,
},
"keyword": {
"type": "boolean",
"description": "Enable keyword-based search",
"default": False,
},
"top_k": {
"type": "integer",
"description": "Maximum results to consider before ranking",
"default": 1024,
"minimum": 1,
"maximum": 1024,
},
"rerank_id": {
"type": "string",
"description": "Optional reranking model identifier",
},
"force_refresh": {
"type": "boolean",
"description": "Set to true only if fresh dataset and document metadata is explicitly required. Otherwise, cached metadata is used (default: false).",
"default": False,
},
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inputSchema properties are missing their "description" fields. While the properties have appropriate defaults and constraints, adding descriptions would improve the developer experience by providing inline documentation about what each parameter does. This is especially important for parameters like "similarity_threshold", "vector_similarity_weight", and "top_k" that require domain knowledge to use effectively.

Copilot uses AI. Check for mistakes.
dataset_ids = []

# Parse the dataset list to extract IDs

Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty line 462 immediately followed by another empty line on 463 creates inconsistent spacing. Consider removing one of these blank lines to maintain uniform code style.

Suggested change

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants