Skip to content

Conversation

@Akrog
Copy link
Contributor

@Akrog Akrog commented Nov 21, 2025

Description

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers.

For example, the following text:

# Configuration

## cinder.conf

Blah, blah

Will result in 2 chunks:

# Configuration

and

## cinder.conf

Blah, blah

With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding):

header_path: /Configuration

In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.

Type of change

  • New feature

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Summary by CodeRabbit

  • New Features

    • Enhanced content validation to better identify and filter meaningful text within documents, with improved handling of header-only sections in markdown.
  • Tests

    • Added comprehensive unit tests for text node validation and content filtering logic across multiple document type scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

When processing documents using the markdown chunker we may end up with
chunks that don't really have any useful information, as in having only
headers.

For example, the following text:

    # Configuration

    ## cinder.conf

    Blah, blah

Will result in 2 chunks:

    # Configuration

and

    ## cinder.conf

    Blah, blah

With the first chunk being useless, moreover considering the second
chunk has this in its metadata (that is used for the embedding):

    header_path: /Configuration

In this patch we change how we process markdown doc types and remove
chunks that have no real content because they only have headers or only
have space related characters.
@coderabbitai
Copy link

coderabbitai bot commented Nov 21, 2025

Walkthrough

The PR refactors text validation in the document processor by introducing two new validation methods (_got_non_headers static helper and _valid_text_node instance method), converting two classmethods to instance methods, and adding comprehensive unit test coverage for the validation logic.

Changes

Cohort / File(s) Summary
Text validation refactoring
src/lightspeed_rag_content/document_processor.py
Added static helper _got_non_headers() to detect non-header markdown content. Added instance method _valid_text_node() to validate TextNode by checking whitespace and non-header content for markdown. Converted _filter_out_invalid_nodes() and _split_and_filter() from classmethods to instance methods. Updated filtering logic to use _valid_text_node() and changed log message from "Skipping node without whitespace" to "Skipping invalid node".
Unit test coverage
tests/test_document_processor_llama_index.py
Added tests for _valid_text_node() behavior across document types (plain and markdown). Added tests for _got_non_headers() with content-present and headers-only markdown scenarios. Added tests for _filter_out_invalid_nodes() logic with Node type differentiation for text validation and filtering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Method signature changes: Verify that conversions from classmethod to instance method (_filter_out_invalid_nodes and _split_and_filter) don't break existing call sites throughout the codebase.
  • Validation logic: Ensure _valid_text_node() correctly handles both plain and markdown document types and that _got_non_headers() accurately identifies non-header content in markdown.
  • Test coverage completeness: Confirm that test cases comprehensively cover edge cases for the new validation methods, including empty strings, whitespace-only content, and various markdown header formats.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: modifying the markdown chunker to skip header-only chunks, which is the core objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_rag_content/document_processor.py (1)

102-111: LGTM - Validation correctly updated to use new logic.

The conversion to an instance method and delegation to _valid_text_node is correct. The generic log message is appropriate given that validation now includes multiple conditions (whitespace and header checks).

Optional: Consider logging more details about why a node is invalid (e.g., "no whitespace", "only headers") to aid debugging, but this is a minor enhancement.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b854493 and dfd3085.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py (1 hunks)
  • tests/test_document_processor_llama_index.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_document_processor_llama_index.py (1)
src/lightspeed_rag_content/document_processor.py (3)
  • _valid_text_node (96-100)
  • _got_non_headers (88-94)
  • _filter_out_invalid_nodes (102-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pylinter
  • GitHub Check: build-and-push-dev
  • GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request
🔇 Additional comments (7)
src/lightspeed_rag_content/document_processor.py (3)

87-94: LGTM - Clean header detection logic.

The implementation correctly identifies non-header content in markdown by checking if any non-empty line doesn't start with "#". The early return optimization is good.

Note: Code blocks containing lines starting with "#" will be treated as headers, but this is acceptable given the goal of filtering header-only chunks.


96-100: LGTM - Correct validation logic for markdown chunks.

The validation correctly enforces that markdown chunks must contain non-header content in addition to whitespace. The early return when markdown has only headers is efficient.

This implementation aligns with the PR objective to filter header-only and whitespace-only chunks specifically for markdown documents.


113-116: LGTM - Method correctly converted to instance method.

The conversion from classmethod to instance method is necessary and correctly implemented. All call sites (lines 155 and 357) are properly updated to invoke this as an instance method.

tests/test_document_processor_llama_index.py (4)

22-22: LGTM - Import correctly updated for test requirements.

The addition of Node import is necessary to properly spec the non-TextNode mock in test__filter_out_invalid_nodes, allowing the test to verify that only TextNode instances are validated.


75-120: LGTM - Comprehensive test coverage for validation logic.

These tests thoroughly cover the _valid_text_node method behavior across different document types and content scenarios:

  • Plain doc type correctly checks only whitespace
  • Markdown doc type correctly checks both non-headers and whitespace
  • Early return for header-only markdown is properly tested

The mock assertions ensure the correct methods are called in the right sequence.


122-152: LGTM - Excellent parametrized test coverage with edge cases.

These parametrized tests provide comprehensive coverage of the _got_non_headers method:

  • Content detection tests verify various patterns of headers mixed with content
  • Header-only tests cover important edge cases: leading/trailing whitespace, whitespace-only text, multiple headers, and headers without spaces

The edge case coverage (lines 143-147) is particularly thorough, including whitespace-only text and headers with various spacing patterns.


154-175: LGTM - Test correctly verifies updated filtering behavior.

The test properly validates that:

  • Only TextNode instances are checked for validity via _valid_text_node
  • Non-TextNode instances are filtered out without validation
  • The validation results correctly determine which nodes are kept

The use of side_effect to return different validation results for different nodes is an effective testing approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant