doc processor: Markdown skip only header chunks #48

Akrog · 2025-11-21T09:48:03Z

Description

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers.

For example, the following text:

# Configuration

## cinder.conf

Blah, blah

Will result in 2 chunks:

# Configuration

and

## cinder.conf

Blah, blah

With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding):

header_path: /Configuration

In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.

Type of change

New feature

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Summary by CodeRabbit

New Features
- Enhanced content validation to better identify and filter meaningful text within documents, with improved handling of header-only sections in markdown.
Tests
- Added comprehensive unit tests for text node validation and content filtering logic across multiple document type scenarios.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

When processing documents using the markdown chunker we may end up with chunks that don't really have any useful information, as in having only headers. For example, the following text: # Configuration ## cinder.conf Blah, blah Will result in 2 chunks: # Configuration and ## cinder.conf Blah, blah With the first chunk being useless, moreover considering the second chunk has this in its metadata (that is used for the embedding): header_path: /Configuration In this patch we change how we process markdown doc types and remove chunks that have no real content because they only have headers or only have space related characters.

coderabbitai · 2025-11-21T09:48:19Z

Walkthrough

The PR refactors text validation in the document processor by introducing two new validation methods (_got_non_headers static helper and _valid_text_node instance method), converting two classmethods to instance methods, and adding comprehensive unit test coverage for the validation logic.

Changes

Cohort / File(s)	Summary
Text validation refactoring `src/lightspeed_rag_content/document_processor.py`	Added static helper `_got_non_headers()` to detect non-header markdown content. Added instance method `_valid_text_node()` to validate TextNode by checking whitespace and non-header content for markdown. Converted `_filter_out_invalid_nodes()` and `_split_and_filter()` from classmethods to instance methods. Updated filtering logic to use `_valid_text_node()` and changed log message from "Skipping node without whitespace" to "Skipping invalid node".
Unit test coverage `tests/test_document_processor_llama_index.py`	Added tests for `_valid_text_node()` behavior across document types (plain and markdown). Added tests for `_got_non_headers()` with content-present and headers-only markdown scenarios. Added tests for `_filter_out_invalid_nodes()` logic with Node type differentiation for text validation and filtering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Method signature changes: Verify that conversions from classmethod to instance method (_filter_out_invalid_nodes and _split_and_filter) don't break existing call sites throughout the codebase.
Validation logic: Ensure _valid_text_node() correctly handles both plain and markdown document types and that _got_non_headers() accurately identifies non-header content in markdown.
Test coverage completeness: Confirm that test cases comprehensively cover edge cases for the new validation methods, including empty strings, whitespace-only content, and various markdown header formats.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: modifying the markdown chunker to skip header-only chunks, which is the core objective of this PR.
Docstring Coverage	✅ Passed	Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

Provide your own instructions using the high_level_summary_instructions setting.
Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

📝 Description — Summarize the main change in 50–60 words, explaining what was done.

📓 References — List relevant issues, discussions, documentation, or related PRs.

📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.

📊 Contributor Summary — Include a Markdown table showing contributions:
| Contributor | Lines Added | Lines Removed | Files Changed |

✔️ Additional Notes — Add any extra reviewer context.
Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/lightspeed_rag_content/document_processor.py (1)

102-111: LGTM - Validation correctly updated to use new logic.

The conversion to an instance method and delegation to _valid_text_node is correct. The generic log message is appropriate given that validation now includes multiple conditions (whitespace and header checks).

Optional: Consider logging more details about why a node is invalid (e.g., "no whitespace", "only headers") to aid debugging, but this is a minor enhancement.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b854493 and dfd3085.

📒 Files selected for processing (2)

src/lightspeed_rag_content/document_processor.py (1 hunks)
tests/test_document_processor_llama_index.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_document_processor_llama_index.py (1)

src/lightspeed_rag_content/document_processor.py (3)

_valid_text_node (96-100)

_got_non_headers (88-94)

_filter_out_invalid_nodes (102-111)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Pylinter
GitHub Check: build-and-push-dev
GitHub Check: Konflux kflux-prd-rh02 / rag-content-on-pull-request

🔇 Additional comments (7)

src/lightspeed_rag_content/document_processor.py (3)

87-94: LGTM - Clean header detection logic.

The implementation correctly identifies non-header content in markdown by checking if any non-empty line doesn't start with "#". The early return optimization is good.

Note: Code blocks containing lines starting with "#" will be treated as headers, but this is acceptable given the goal of filtering header-only chunks.

96-100: LGTM - Correct validation logic for markdown chunks.

The validation correctly enforces that markdown chunks must contain non-header content in addition to whitespace. The early return when markdown has only headers is efficient.

This implementation aligns with the PR objective to filter header-only and whitespace-only chunks specifically for markdown documents.

113-116: LGTM - Method correctly converted to instance method.

The conversion from classmethod to instance method is necessary and correctly implemented. All call sites (lines 155 and 357) are properly updated to invoke this as an instance method.

tests/test_document_processor_llama_index.py (4)

22-22: LGTM - Import correctly updated for test requirements.

The addition of Node import is necessary to properly spec the non-TextNode mock in test__filter_out_invalid_nodes, allowing the test to verify that only TextNode instances are validated.

75-120: LGTM - Comprehensive test coverage for validation logic.

These tests thoroughly cover the _valid_text_node method behavior across different document types and content scenarios:

Plain doc type correctly checks only whitespace

Markdown doc type correctly checks both non-headers and whitespace

Early return for header-only markdown is properly tested

The mock assertions ensure the correct methods are called in the right sequence.

122-152: LGTM - Excellent parametrized test coverage with edge cases.

These parametrized tests provide comprehensive coverage of the _got_non_headers method:

Content detection tests verify various patterns of headers mixed with content

Header-only tests cover important edge cases: leading/trailing whitespace, whitespace-only text, multiple headers, and headers without spaces

The edge case coverage (lines 143-147) is particularly thorough, including whitespace-only text and headers with various spacing patterns.

154-175: LGTM - Test correctly verifies updated filtering behavior.

The test properly validates that:

Only TextNode instances are checked for validity via _valid_text_node

Non-TextNode instances are filtered out without validation

The validation results correctly determine which nodes are kept

The use of side_effect to return different validation results for different nodes is an effective testing approach.

coderabbitai bot reviewed Nov 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

doc processor: Markdown skip only header chunks #48

doc processor: Markdown skip only header chunks #48

Uh oh!

Akrog commented Nov 21, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 21, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

doc processor: Markdown skip only header chunks #48

Are you sure you want to change the base?

doc processor: Markdown skip only header chunks #48

Uh oh!

Conversation

Akrog commented Nov 21, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Checklist before requesting a review

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Akrog commented Nov 21, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 21, 2025 •

edited

Loading