
Conversation

@luis5tb
Contributor

@luis5tb luis5tb commented Nov 28, 2025

Description

Implements the function parse_referenced_documents_from_responses_api, which inspects the Responses API output for the following (a sketch of these shapes follows below):

  • file_search_call objects (filename and attributes)
  • annotations within message content (type, url, title)
    • two types of annotations: url_citation and file_citation
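
For orientation, a minimal sketch of the two output shapes the function inspects; the field names follow the description above, while the values and exact nesting are illustrative assumptions rather than the authoritative Responses API schema:

# Hypothetical file_search_call output item (values are made up):
file_search_item = {
    "type": "file_search_call",
    "results": [
        {"filename": "guide.pdf", "attributes": {"url": "https://example.com/guide"}},
    ],
}

# Hypothetical message output item carrying both annotation types:
message_item = {
    "type": "message",
    "content": [
        {
            "text": "See the referenced documents.",
            "annotations": [
                {"type": "url_citation", "url": "https://example.com/docs", "title": "Docs"},
                {"type": "file_citation", "filename": "guide.pdf"},
            ],
        }
    ],
}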

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude Code and Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Extracts and deduplicates referenced documents found in search results and message annotations, with normalized URL handling.
    • File search result payloads are now preserved as JSON strings to better retain structure.
  • Tests

    • Added unit tests validating extraction and deduplication of referenced documents from annotated messages and search results.


@coderabbitai
Contributor

coderabbitai bot commented Nov 28, 2025

Walkthrough

Serialize file_search_call tool responses as JSON strings and add a new utility parse_referenced_documents_from_responses_api to extract and deduplicate referenced documents from response outputs and message annotations.

Changes

Query endpoint (src/app/endpoints/query_v2.py)
Added import json. _build_tool_call_summary now JSON-serializes response_payload for file_search_call results (stores a JSON string or None). Removed the local stub; now imports and calls parse_referenced_documents_from_responses_api (replacing the internal parse function).

Referenced-documents parsing utilities (src/utils/responses.py)
New parsing utilities and public API parse_referenced_documents_from_responses_api(response: Any) -> list[ReferencedDocument]. Implements parsing of file_search_call results and message url_citation/file_citation annotations, normalizes URLs (empty → None), validates via pydantic.AnyUrl, deduplicates by (doc_url, doc_title), and returns models.responses.ReferencedDocument instances (a usage sketch follows below).

Unit tests (tests/unit/app/endpoints/test_query_v2.py)
Updated test_retrieve_response_parses_output_and_tool_calls to use multipart content with text and annotations. Added helper _create_message_output_with_annotations. Added test_retrieve_response_parses_referenced_documents to assert extraction and deduplication of referenced documents from annotated message parts and file_search_call results (including entries lacking URLs).
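
A minimal usage sketch of the new public API under the shapes described above; SimpleNamespace stands in for the real Responses API objects, and the import path assumes the module location named in this summary:

from types import SimpleNamespace

from utils.responses import parse_referenced_documents_from_responses_api

# Hypothetical response with one annotated message part (shape assumed for illustration)
annotation = {"type": "url_citation", "url": "https://example.com/docs", "title": "Docs"}
part = SimpleNamespace(text="See docs.", annotations=[annotation])
message = SimpleNamespace(type="message", content=[part])
response = SimpleNamespace(output=[message])

documents = parse_referenced_documents_from_responses_api(response)
for doc in documents:
    print(doc.doc_title, doc.doc_url)  # doc_url may be None for file_citation entries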

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect query_v2.py consumers of response_payload to ensure they handle JSON strings (no code expects original dicts).
  • Review parse_referenced_documents_from_responses_api for edge cases: missing keys, annotation shapes (dict vs object), and URL validation/fallback behavior.
  • Verify deduplication strategy (seen keying) matches intended uniqueness semantics and doesn't drop distinct docs with similar titles/URLs.
  • Confirm new tests cover both annotated message parsing and file_search_call result variations.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description check: Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title check: Passed. The pull request title directly matches the main change: implementing the parse_referenced_documents_from_responses_api function across the codebase.
  • Docstring coverage: Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/endpoints/query_v2.py (1)

398-495: Refactor to address pipeline failures: too many locals, branches, and nested blocks.

The linter reports multiple violations that block CI:

  • Too many local variables (18/15)
  • Too many branches (19/12)
  • Too many nested blocks (7/5)

Also, remove # pylint: disable=unused-argument on line 399 since response is actually used.

Extract helper functions to reduce complexity:

+def _parse_file_search_results(
+    results: list[Any],
+    documents: list[ReferencedDocument],
+    seen_docs: set[tuple[str | None, str | None]],
+) -> None:
+    """Extract referenced documents from file_search_call results."""
+    for result in results:
+        if isinstance(result, dict):
+            filename = result.get("filename")
+            attributes = result.get("attributes", {}) or {}
+        else:
+            filename = getattr(result, "filename", None)
+            attributes = getattr(result, "attributes", {}) or {}
+
+        doc_url = (
+            attributes.get("link")
+            or attributes.get("url")
+            or attributes.get("doc_url")
+        )
+        final_url = doc_url if doc_url else None
+
+        if (filename or doc_url) and (final_url, filename) not in seen_docs:
+            documents.append(ReferencedDocument(doc_url=final_url, doc_title=filename))
+            seen_docs.add((final_url, filename))
+
+
+def _parse_message_annotations(
+    content: list[Any],
+    documents: list[ReferencedDocument],
+    seen_docs: set[tuple[str | None, str | None]],
+) -> None:
+    """Extract referenced documents from message content annotations."""
+    for part in content:
+        if isinstance(part, str):
+            continue
+        for annotation in getattr(part, "annotations", []) or []:
+            _process_annotation(annotation, documents, seen_docs)
+
+
+def _process_annotation(
+    annotation: Any,
+    documents: list[ReferencedDocument],
+    seen_docs: set[tuple[str | None, str | None]],
+) -> None:
+    """Process a single annotation and add to documents if applicable."""
+    if isinstance(annotation, dict):
+        anno_type = annotation.get("type")
+        anno_url = annotation.get("url")
+        anno_title = annotation.get("title") or annotation.get("filename")
+    else:
+        anno_type = getattr(annotation, "type", None)
+        anno_url = getattr(annotation, "url", None)
+        anno_title = getattr(annotation, "title", None) or getattr(
+            annotation, "filename", None
+        )
+
+    final_url = anno_url if anno_url else None
+
+    if anno_type == "url_citation" and (final_url, anno_title) not in seen_docs:
+        documents.append(ReferencedDocument(doc_url=final_url, doc_title=anno_title))
+        seen_docs.add((final_url, anno_title))
+    elif anno_type == "file_citation" and (None, anno_title) not in seen_docs:
+        documents.append(ReferencedDocument(doc_url=None, doc_title=anno_title))
+        seen_docs.add((None, anno_title))
+
+
 def parse_referenced_documents_from_responses_api(
-    response: OpenAIResponseObject,  # pylint: disable=unused-argument
+    response: OpenAIResponseObject,
 ) -> list[ReferencedDocument]:
-    """
-    Parse referenced documents from OpenAI Responses API response.
-    ...
-    """
+    """Parse referenced documents from OpenAI Responses API response."""
     documents: list[ReferencedDocument] = []
     seen_docs: set[tuple[str | None, str | None]] = set()
 
     if not response.output:
         return documents
 
     for output_item in response.output:
         item_type = getattr(output_item, "type", None)
 
-        # 1. Parse from file_search_call results
         if item_type == "file_search_call":
-            results = getattr(output_item, "results", []) or []
-            for result in results:
-                # Handle both object and dict access
-                ...
+            _parse_file_search_results(
+                getattr(output_item, "results", []) or [],
+                documents,
+                seen_docs,
+            )
-        # 2. Parse from message content annotations
         elif item_type == "message":
             content = getattr(output_item, "content", None)
             if isinstance(content, list):
-                for part in content:
-                    ...
+                _parse_message_annotations(content, documents, seen_docs)
 
     return documents
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cfaeab and 26ed602.

📒 Files selected for processing (2)
  • src/app/endpoints/query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Use absolute imports for internal modules in LCS project (e.g., from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Use logger = logging.getLogger(__name__) pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) and Optional[Type] or Type | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Use async def for I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes: Configuration for config classes, Error/Exception for exceptions, Resolver for strategy patterns, Interface for abstract base classes
Abstract classes must use ABC with @abstractmethod decorators
Include complete type annotations for all class attributes in Python classes
Use import logging and module logger pattern with standard log levels: debug, info, warning, error

Files:

  • src/app/endpoints/query_v2.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/query_v2.py
src/**/{client,app/endpoints/**}.py

📄 CodeRabbit inference engine (CLAUDE.md)

Handle APIConnectionError from Llama Stack in integration code

Files:

  • src/app/endpoints/query_v2.py
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests; do not use unittest framework
Unit tests must achieve 60% code coverage; integration tests must achieve 10% coverage

Files:

  • tests/unit/app/endpoints/test_query_v2.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest-mock with AsyncMock objects for mocking in tests

Files:

  • tests/unit/app/endpoints/test_query_v2.py
🧬 Code graph analysis (2)
src/app/endpoints/query_v2.py (1)
src/models/responses.py (1)
  • ReferencedDocument (326-336)
tests/unit/app/endpoints/test_query_v2.py (2)
src/models/requests.py (1)
  • QueryRequest (73-233)
src/app/endpoints/query_v2.py (1)
  • retrieve_response (265-395)
🪛 GitHub Actions: Python linter
src/app/endpoints/query_v2.py

[error] 398-398: R0914: Too many local variables (18/15) (too-many-locals)


[error] 398-398: R0912: Too many branches (19/12) (too-many-branches)


[error] 417-417: R1702: Too many nested blocks (7/5) (too-many-nested-blocks)


[error] 417-417: R1702: Too many nested blocks (7/5) (too-many-nested-blocks)

tests/unit/app/endpoints/test_query_v2.py

[error] 721-721: R0914: Too many local variables (18/15) (too-many-locals)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests (ci)
  • GitHub Check: e2e_tests (azure)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🔇 Additional comments (2)
src/app/endpoints/query_v2.py (1)

136-136: LGTM on JSON serialization change.

Serializing response_payload to JSON string ensures consistent data format for storage in ToolCallSummary.response.
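
The pattern being approved is roughly the following; the variable names are assumptions, only the JSON-string-or-None behavior is taken from the change itself:

import json

# Hypothetical file_search_call results (the real values come from the API)
results = [{"filename": "guide.pdf", "attributes": {"url": "https://example.com/guide"}}]

# Serialize to a JSON string for storage in ToolCallSummary.response; keep None when empty
response_payload = json.dumps(results) if results else None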

tests/unit/app/endpoints/test_query_v2.py (1)

201-209: LGTM on adding annotations attribute to mock parts.

This prevents iteration errors when the code accesses part.annotations in the updated parsing logic.
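
A hedged sketch of the mock-part setup this refers to; the project's tests use pytest-mock, but plain unittest.mock.Mock illustrates the same point:

from unittest.mock import Mock

part = Mock()
part.text = "Hello"
part.annotations = []  # explicit empty list so the annotation parser can iterate safely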

@luis5tb luis5tb force-pushed the parse_referenced_documents_from_responses_api branch from 26ed602 to 212b294 on November 28, 2025 13:52
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/app/endpoints/query_v2.py (3)

398-400: Remove incorrect pylint: disable=unused-argument comment.

The response parameter is used throughout the function body, so this disable comment is unnecessary and misleading.

-def parse_referenced_documents_from_responses_api(
-    response: OpenAIResponseObject,  # pylint: disable=unused-argument
-) -> list[ReferencedDocument]:
+def parse_referenced_documents_from_responses_api(
+    response: OpenAIResponseObject,
+) -> list[ReferencedDocument]:

417-494: Refactor to address pipeline linting failures.

The pipeline reports:

  • Too many local variables (18/15)
  • Too many nested blocks (7/5)
  • Too many branches (19/12)

Consider extracting helper functions to reduce complexity:

+def _parse_file_search_results(
+    results: list[Any],
+    documents: list[ReferencedDocument],
+    seen_docs: set[tuple[str | None, str | None]],
+) -> None:
+    """Extract documents from file_search_call results."""
+    for result in results:
+        if isinstance(result, dict):
+            filename = result.get("filename")
+            attributes = result.get("attributes", {}) or {}
+        else:
+            filename = getattr(result, "filename", None)
+            attributes = getattr(result, "attributes", {}) or {}
+
+        doc_url = (
+            attributes.get("link")
+            or attributes.get("url")
+            or attributes.get("doc_url")
+        )
+
+        if filename or doc_url:
+            final_url = doc_url if doc_url else None
+            if (final_url, filename) not in seen_docs:
+                documents.append(
+                    ReferencedDocument(doc_url=final_url, doc_title=filename)
+                )
+                seen_docs.add((final_url, filename))
+
+
+def _parse_message_annotations(
+    content: list[Any],
+    documents: list[ReferencedDocument],
+    seen_docs: set[tuple[str | None, str | None]],
+) -> None:
+    """Extract documents from message content annotations."""
+    for part in content:
+        if isinstance(part, str):
+            continue
+
+        annotations = getattr(part, "annotations", []) or []
+        for annotation in annotations:
+            if isinstance(annotation, dict):
+                anno_type = annotation.get("type")
+                anno_url = annotation.get("url")
+                anno_title = annotation.get("title") or annotation.get("filename")
+            else:
+                anno_type = getattr(annotation, "type", None)
+                anno_url = getattr(annotation, "url", None)
+                anno_title = getattr(annotation, "title", None) or getattr(
+                    annotation, "filename", None
+                )
+
+            if anno_type == "url_citation":
+                final_url = anno_url if anno_url else None
+                if (final_url, anno_title) not in seen_docs:
+                    documents.append(
+                        ReferencedDocument(doc_url=final_url, doc_title=anno_title)
+                    )
+                    seen_docs.add((final_url, anno_title))
+            elif anno_type == "file_citation":
+                if (None, anno_title) not in seen_docs:
+                    documents.append(
+                        ReferencedDocument(doc_url=None, doc_title=anno_title)
+                    )
+                    seen_docs.add((None, anno_title))

Then simplify the main function:

def parse_referenced_documents_from_responses_api(
    response: OpenAIResponseObject,
) -> list[ReferencedDocument]:
    """Parse referenced documents from OpenAI Responses API response."""
    documents: list[ReferencedDocument] = []
    seen_docs: set[tuple[str | None, str | None]] = set()

    if not response.output:
        return documents

    for output_item in response.output:
        item_type = getattr(output_item, "type", None)

        if item_type == "file_search_call":
            results = getattr(output_item, "results", []) or []
            _parse_file_search_results(results, documents, seen_docs)
        elif item_type == "message":
            content = getattr(output_item, "content", None)
            if isinstance(content, list):
                _parse_message_annotations(content, documents, seen_docs)

    return documents

486-493: Consider handling file_citation with URL if available.

Currently, file_citation always sets doc_url=None. If the annotation has a URL attribute (some implementations might include one), it would be discarded.

                         elif anno_type == "file_citation":
-                            if (None, anno_title) not in seen_docs:
+                            final_url = anno_url if anno_url else None
+                            if (final_url, anno_title) not in seen_docs:
                                 documents.append(
                                     ReferencedDocument(
-                                        doc_url=None, doc_title=anno_title
+                                        doc_url=final_url, doc_title=anno_title
                                     )
                                 )
-                                seen_docs.add((None, anno_title))
+                                seen_docs.add((final_url, anno_title))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26ed602 and 212b294.

📒 Files selected for processing (2)
  • src/app/endpoints/query_v2.py (3 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/app/endpoints/test_query_v2.py
🧰 Additional context used
📓 Path-based instructions (3)
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Use absolute imports for internal modules in LCS project (e.g., from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Use logger = logging.getLogger(__name__) pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) and Optional[Type] or Type | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Use async def for I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes: Configuration for config classes, Error/Exception for exceptions, Resolver for strategy patterns, Interface for abstract base classes
Abstract classes must use ABC with @abstractmethod decorators
Include complete type annotations for all class attributes in Python classes
Use import logging and module logger pattern with standard log levels: debug, info, warning, error

Files:

  • src/app/endpoints/query_v2.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/query_v2.py
src/**/{client,app/endpoints/**}.py

📄 CodeRabbit inference engine (CLAUDE.md)

Handle APIConnectionError from Llama Stack in integration code

Files:

  • src/app/endpoints/query_v2.py
🧬 Code graph analysis (1)
src/app/endpoints/query_v2.py (1)
src/models/responses.py (1)
  • ReferencedDocument (326-336)
🪛 GitHub Actions: Python linter
src/app/endpoints/query_v2.py

[error] 398-398: pylint: Too many local variables (18/15) (R0914). Command: uv run pylint src tests.


[error] 417-417: pylint: Too many nested blocks (7/5) (R1702). Command: uv run pylint src tests.


[error] 398-398: pylint: Too many branches (19/12) (R0912). Command: uv run pylint src tests.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: e2e_tests (ci)
  • GitHub Check: e2e_tests (azure)
🔇 Additional comments (2)
src/app/endpoints/query_v2.py (2)

3-3: LGTM!

Import is correctly added and used for JSON serialization.


136-136: LGTM!

JSON serialization ensures consistent string storage and avoids potential issues with object references.

@luis5tb luis5tb force-pushed the parse_referenced_documents_from_responses_api branch from 212b294 to bad689e on November 28, 2025 14:09
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
src/utils/responses.py (3)

63-89: Handle attributes being None in _parse_file_search_result.

attributes = getattr(result, "attributes", {}) will leave attributes as None if the attribute exists but is None, and attributes.get(...) will then raise at runtime. Using an or {} guard would make this path more robust against partial/malformed tool responses.

-    else:
-        filename = getattr(result, "filename", None)
-        attributes = getattr(result, "attributes", {})
+    else:
+        filename = getattr(result, "filename", None)
+        attributes = getattr(result, "attributes", {}) or {}

141-160: File search output parsing is sound; consider small clarity tweak.

The loop over results and the (filename or doc_url) guard correctly avoids emitting empty entries. With the suggested attributes or {} change above, this helper is reasonably robust against partially populated results.


196-229: Public parser is cohesive; consider minor resilience improvement.

The orchestration in parse_referenced_documents_from_responses_api cleanly combines file_search and message-annotation sources with deduping by (doc_url, doc_title). One minor hardening you might consider is guarding access to response.output via getattr so accidental misuse with a dict-like response doesn’t raise:

-    if not response.output:
-        return documents
-
-    for output_item in response.output:
+    output_items = getattr(response, "output", None) or []
+    if not output_items:
+        return documents
+
+    for output_item in output_items:
tests/unit/app/endpoints/test_query_v2.py (1)

720-743: Good extraction of annotation-message setup into a helper.

_create_message_output_with_annotations nicely centralizes the annotated message construction, reduces local variables in the test, and mirrors the shapes your parser expects (url_citation and file_citation with filename fallback).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 212b294 and bad689e.

📒 Files selected for processing (3)
  • src/app/endpoints/query_v2.py (3 hunks)
  • src/utils/responses.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/app/endpoints/query_v2.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests; do not use unittest framework
Unit tests must achieve 60% code coverage; integration tests must achieve 10% coverage

Files:

  • tests/unit/app/endpoints/test_query_v2.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest-mock with AsyncMock objects for mocking in tests

Files:

  • tests/unit/app/endpoints/test_query_v2.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Use absolute imports for internal modules in LCS project (e.g., from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Use logger = logging.getLogger(__name__) pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) and Optional[Type] or Type | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Use async def for I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes: Configuration for config classes, Error/Exception for exceptions, Resolver for strategy patterns, Interface for abstract base classes
Abstract classes must use ABC with @abstractmethod decorators
Include complete type annotations for all class attributes in Python classes
Use import logging and module logger pattern with standard log levels: debug, info, warning, error

Files:

  • src/utils/responses.py
🧬 Code graph analysis (2)
tests/unit/app/endpoints/test_query_v2.py (2)
src/models/requests.py (1)
  • QueryRequest (73-233)
src/app/endpoints/query_v2.py (1)
  • retrieve_response (268-398)
src/utils/responses.py (1)
src/models/responses.py (1)
  • ReferencedDocument (326-336)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: e2e_tests (ci)
  • GitHub Check: e2e_tests (azure)
🔇 Additional comments (4)
src/utils/responses.py (2)

92-115: Annotation parsing logic looks correct and flexible.

The _parse_annotation helper correctly supports both dict and object shapes and sensibly falls back from title to filename for file citations; this aligns with the test expectations.


162-194: Message annotation parsing matches spec and avoids false positives.

_parse_message_annotations safely:

  • ignores non-list content,
  • skips pure-string parts,
  • handles missing/empty annotations,
  • distinguishes url_citation vs file_citation, normalizing empty URLs to None.

This aligns with the new tests and should not interfere with messages that have no annotations.

tests/unit/app/endpoints/test_query_v2.py (2)

201-205: Content parts setup is compatible with annotation parsing.

Initializing part1/part2 with text and explicit annotations = [] ensures they behave as simple text parts without spurious citations, while remaining safe if the message-annotation parser inspects annotations.


746-807: Referenced-documents test accurately exercises all supported sources.

The new test_retrieve_response_parses_referenced_documents:

  • Covers both message annotations (url_citation + file_citation) and file_search_call results (with and without URLs).
  • Asserts on doc_title/doc_url combinations in an order-independent way (one phrasing is sketched after this list).
  • Ensures four unique ReferencedDocument entries, matching the dedupe-by-(url,title) semantics.
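
One order-independent way to phrase those assertions, as referenced in the list above; the titles, URLs, and the documents variable are illustrative stand-ins for the test's actual data:

# Compare as a set of (title, url) pairs so ordering does not matter
pairs = {(doc.doc_title, str(doc.doc_url) if doc.doc_url else None) for doc in documents}
assert len(documents) == 4  # four unique entries after dedup by (url, title)
assert ("Docs", "https://example.com/docs") in pairs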

One follow-up to watch: this refactor should also resolve the previous too-many-locals lint error; if that warning persists, consider similarly extracting the file_search output construction into a small helper.

Implements the function parse_referenced_documents_from_responses_api,
inspecting the Responses API output for:
- file_search_call objects (filename and attributes)
- annotations within message content (type, url, title)
   - two types of annotations, url_citation and file_citation
@luis5tb luis5tb force-pushed the parse_referenced_documents_from_responses_api branch from bad689e to e45a1f4 on November 28, 2025 16:37
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/unit/app/endpoints/test_query_v2.py (1)

763-766: Consider testing object-based file search results.

The test uses dict-based results, which exercises result.get() paths. The parsing logic also handles object-based results via getattr(). Consider adding a test case with mock objects to ensure both code paths are covered.

Example additional test data using mock objects:

# Object-based result for broader coverage
result_mock = mocker.Mock()
result_mock.filename = "object_file.pdf"
result_mock.attributes = {"url": "http://example.com/object"}
output_item_2.results = [result_mock, ...]
src/utils/responses.py (2)

118-143: Good: ValidationError handling prevents crashes on malformed URLs.

The try/except ValidationError block correctly handles invalid URLs by skipping the document, addressing the concern from the past review. Consider adding debug logging for skipped invalid URLs to aid troubleshooting.

             try:
                 validated_url = AnyUrl(doc_url)  # type: ignore[arg-type]
             except ValidationError:
-                # Skip documents with invalid URLs
+                # Skip documents with invalid URLs - log for debugging
+                import logging
+                logging.getLogger(__name__).debug(
+                    "Skipping document with invalid URL: %s", doc_url
+                )
                 return

Note: If adding logging, move the import to the module level to follow the logger = logging.getLogger(__name__) pattern from coding guidelines.


201-234: Consider guarding against None response.

The function accesses response.output directly on line 221. If response itself is None, this will raise AttributeError. Consider adding a guard or documenting that response must not be None.

 def parse_referenced_documents_from_responses_api(
     response: Any,
 ) -> list[ReferencedDocument]:
     ...
     documents: list[ReferencedDocument] = []
     # Use a set to track unique documents by (doc_url, doc_title) tuple
     seen_docs: set[tuple[str | None, str | None]] = set()

-    if not response.output:
+    if not response or not response.output:
         return documents

Alternatively, if callers always pass a valid response, document this precondition in the docstring.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bad689e and e45a1f4.

📒 Files selected for processing (3)
  • src/app/endpoints/query_v2.py (3 hunks)
  • src/utils/responses.py (2 hunks)
  • tests/unit/app/endpoints/test_query_v2.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
tests/{unit,integration}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/{unit,integration}/**/*.py: Use pytest for all unit and integration tests; do not use unittest framework
Unit tests must achieve 60% code coverage; integration tests must achieve 10% coverage

Files:

  • tests/unit/app/endpoints/test_query_v2.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest-mock with AsyncMock objects for mocking in tests

Files:

  • tests/unit/app/endpoints/test_query_v2.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Use absolute imports for internal modules in LCS project (e.g., from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Use logger = logging.getLogger(__name__) pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) and Optional[Type] or Type | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Use async def for I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes: Configuration for config classes, Error/Exception for exceptions, Resolver for strategy patterns, Interface for abstract base classes
Abstract classes must use ABC with @abstractmethod decorators
Include complete type annotations for all class attributes in Python classes
Use import logging and module logger pattern with standard log levels: debug, info, warning, error

Files:

  • src/app/endpoints/query_v2.py
  • src/utils/responses.py
src/app/endpoints/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/query_v2.py
src/**/{client,app/endpoints/**}.py

📄 CodeRabbit inference engine (CLAUDE.md)

Handle APIConnectionError from Llama Stack in integration code

Files:

  • src/app/endpoints/query_v2.py
🧬 Code graph analysis (3)
tests/unit/app/endpoints/test_query_v2.py (1)
src/models/requests.py (1)
  • QueryRequest (73-233)
src/app/endpoints/query_v2.py (1)
src/utils/responses.py (2)
  • extract_text_from_response_output_item (10-60)
  • parse_referenced_documents_from_responses_api (201-234)
src/utils/responses.py (1)
src/models/responses.py (1)
  • ReferencedDocument (326-336)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: e2e_tests (azure)
  • GitHub Check: e2e_tests (ci)
🔇 Additional comments (11)
src/app/endpoints/query_v2.py (3)

3-3: LGTM! Import changes are clean and follow project conventions.

The new imports use absolute paths as required by coding guidelines, and the json module import is correctly placed with standard library imports.

Also applies to: 42-45


135-140: LGTM! JSON serialization ensures consistent response format.

Storing response_payload as a JSON string (via json.dumps) ensures the file search results are serialized consistently for downstream consumers. The conditional correctly preserves None when no results are present.


386-392: Good refactoring: Delegation to centralized parsing utility.

Moving document parsing logic to parse_referenced_documents_from_responses_api promotes code reuse and maintainability. The integration is clean.

tests/unit/app/endpoints/test_query_v2.py (3)

201-209: Good fix: Adding empty annotations prevents iteration errors.

Adding annotations = [] to mock parts ensures the parsing logic doesn't fail when iterating over annotations in _parse_message_annotations. This defensive setup is appropriate.


720-743: Good refactoring: Helper function reduces test complexity.

Extracting mock setup into _create_message_output_with_annotations addresses the previous review's "too many local variables" concern and improves test readability.


746-807: Comprehensive test for referenced document parsing.

The test validates:

  1. URL citations from message annotations
  2. File citations from message annotations
  3. File search results with URL attributes
  4. File search results without URL attributes

The refactoring keeps the local variable count at 14, under the pylint limit of 15.

src/utils/responses.py (5)

5-7: LGTM! Imports follow project conventions.

Using absolute imports for internal modules (models.responses) and standard Pydantic imports is correct per coding guidelines.


63-89: LGTM! Robust handling of file search results.

The function correctly handles:

  • Both dict and object access patterns
  • None attributes via or {} fallback (line 81)
  • Empty string URL normalization to None

92-115: LGTM! Clean annotation parsing with appropriate fallbacks.

The function handles both dict and object annotations, with sensible fallback from title to filename for the document title.


146-164: LGTM! Defensive iteration with proper null handling.

The or [] fallback on line 159 prevents TypeError when results is None, and the condition on line 163 ensures documents have at least one identifier.


167-198: LGTM! Proper handling of annotation types.

The function correctly differentiates between:

  • url_citation: Extracts URL and title
  • file_citation: Uses only title (URL set to None)

The empty string normalization on line 195 ensures consistent None handling.

