@mdashti commented Nov 18, 2025

What

Changes ExpUnrolledLinkedList::block_num from u16 to u32 to prevent integer overflow when indexing large datasets. At the 32 KB block-size cap, the structure now supports up to ~4 billion blocks (about 128 TB) instead of 65,535 blocks (about 2.1 GB).

Why

Index creation was failing with the error "mid > len" when users built BM25 indexes on tables with large integer arrays (100k rows × 6,700 elements ≈ 670M operations). This workload required ~103,000 blocks, exceeding the u16::MAX limit of 65,535 and causing:

  • Integer overflow in release builds → memory corruption → "mid > len" errors
  • Direct overflow panic in debug builds → "attempt to add with overflow"
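
To illustrate the failure mode, here is a minimal sketch of a u16 counter wrapping once the ~103,000 blocks mentioned above are requested. The function names are hypothetical and are not taken from the codebase; `wrapping_add` stands in for what a plain `+ 1` does in a release build with overflow checks disabled.

```rust
/// Hypothetical stand-in for the pre-fix counter step on a u16 block count.
/// `wrapping_add` mimics a plain `+ 1` in a release build (overflow checks off).
fn next_block_num_wrapping(block_num: u16) -> u16 {
    block_num.wrapping_add(1)
}

/// The same step with an explicit overflow check:
/// `None` means the counter cannot represent one more block.
fn next_block_num_checked(block_num: u16) -> Option<u16> {
    block_num.checked_add(1)
}

/// Replays `blocks_needed` increments on a wrapping u16 counter and
/// returns the value the counter ends up holding.
fn simulate_u16_counter(blocks_needed: u32) -> u16 {
    let mut n: u16 = 0;
    for _ in 0..blocks_needed {
        n = next_block_num_wrapping(n);
    }
    n
}
```

In this sketch, `simulate_u16_counter(103_000)` leaves the counter at 37,464 (103,000 mod 65,536), so any later read that trusts the counter would see far fewer blocks than were actually written.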

How

  1. Changed block_num type: u16 → u32 (supports 65,536× more blocks)
  2. Added safety measures:
    • Overflow protection with checked_add() in increment_num_blocks()
    • Metadata corruption detection with assert!() in read_to_end()
  3. Maintained compatibility: Block sizes still cap at 32 KB; only the count limit increased
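
The steps above can be sketched roughly as follows. This is not the actual diff: `BlockCounter`, `checked_prefix`, `max_capacity_bytes`, and the panic messages are hypothetical names chosen for illustration. Only `block_num`, `increment_num_blocks()`, `checked_add()`, `assert!()`, and the 32 KB cap come from the description.

```rust
/// Assumed block-size cap of 32 KB, per the description.
const MAX_BLOCK_LEN: u64 = 32 * 1024;

/// Hypothetical, reduced stand-in for the list's bookkeeping.
#[derive(Debug, Default)]
struct BlockCounter {
    block_num: u32, // previously u16
}

impl BlockCounter {
    /// Fails loudly instead of wrapping when the counter is exhausted.
    fn increment_num_blocks(&mut self) {
        self.block_num = self
            .block_num
            .checked_add(1)
            .expect("block_num overflowed u32");
    }
}

/// Hypothetical read-side guard: reject a length that exceeds the
/// available data before slicing, rather than failing deep inside the read.
fn checked_prefix(data: &[u8], len: usize) -> &[u8] {
    assert!(
        len <= data.len(),
        "corrupted metadata: block length {} exceeds remaining {} bytes",
        len,
        data.len()
    );
    &data[..len]
}

/// Upper bound on addressable bytes if every block is at the size cap.
fn max_capacity_bytes(max_blocks: u64) -> u64 {
    max_blocks * MAX_BLOCK_LEN
}
```

With these assumptions, `max_capacity_bytes(u16::MAX as u64)` is 2,147,450,880 bytes (about 2.1 GB), while `max_capacity_bytes(u32::MAX as u64)` is just under 2^47 bytes (about 128 TiB), which matches the limits quoted in this description.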

Tests

Added 8 tests to verify the fix.
