Index size optimization #929

efritz · 2025-03-21T19:28:06Z

Internally, some field types in Elasticsearch/OpenSearch/Lucene follow a flow similar to Zoekt's:

1. Explode an input regular expression or wildcard query into trigrams.
2. Search an inverted index for the sets of trigrams that need to occur together.
3. Produce a list of candidate documents over which a more expensive regular-language automaton is evaluated.

Basically the Google code search algorithm. Interestingly, Lucene 8 stored the full document source text LZ4-compressed, which yields much smaller indexes than Zoekt's.

Zoekt's internals rely on memory-mapped files and fast access to ranges within those files. I know this is true for the table of contents and posting lists, but I'm less familiar with how document text is brought back.

I had assumed that the full document is read to apply the regex pattern, but I think that's a misunderstanding on my part. Assuming Zoekt does only bring back a subsection of the document for matching, how does it determine the range of the original document? The first/last trigram offsets? Newline analysis offsets?

Atlassian is looking to re-index an ES cluster of ~300TB, and the increase in size from moving to Zoekt is substantial. I'm interested in exploring what it would take to compress the document content in Zoekt indexes. Depending on how document bodies are read in the first place, the effort involved could vary widely.

I'm initially looking for pointers/validation on where sections of document bodies are read back, and I can perform some feasibility analysis on my end. It could turn out that compression of document bodies creates too much additional CPU load during searches to be an appropriate trade-off. But it could be a net win if the increase in CPU load from LZ4 is offset by the savings in data read from disk.

My initial understanding is that an implementation would likely require:

1. (If documents are read piece-wise...) Chunking document bodies into fixed-size chunks
2. Compressing each chunk
3. Storing an indirection table of offset boundaries -> chunk boundaries
4. When reading, determining the chunks that need to be read, then extracting a substring from the decoded text
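
A rough sketch of those steps in Go, to make the idea easier to evaluate. Everything here is hypothetical (`chunkedDoc`, `compressDoc`, `readRange`, and the tiny `chunkSize`), and it uses `compress/flate` from the standard library as a stand-in codec because Go's standard library has no LZ4; a real experiment would swap in an LZ4 implementation.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

const chunkSize = 16 // deliberately tiny for the demo; a real value would be much larger

// chunkedDoc stores a document as independently compressed fixed-size chunks.
// Chunk i covers uncompressed bytes [i*chunkSize, (i+1)*chunkSize).
type chunkedDoc struct {
	data []byte   // compressed chunks, stored back to back
	ends []uint32 // indirection table: ends[i] is where compressed chunk i ends in data
	size int      // uncompressed length
}

func compressDoc(body []byte) (*chunkedDoc, error) {
	d := &chunkedDoc{size: len(body)}
	var buf bytes.Buffer
	for off := 0; off < len(body); off += chunkSize {
		end := off + chunkSize
		if end > len(body) {
			end = len(body)
		}
		w, err := flate.NewWriter(&buf, flate.BestSpeed)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(body[off:end]); err != nil {
			return nil, err
		}
		if err := w.Close(); err != nil {
			return nil, err
		}
		d.ends = append(d.ends, uint32(buf.Len()))
	}
	d.data = buf.Bytes()
	return d, nil
}

// readRange decodes only the chunks overlapping [start, end) of the
// uncompressed document and returns that substring.
func (d *chunkedDoc) readRange(start, end int) ([]byte, error) {
	if start < 0 || end > d.size || start > end {
		return nil, fmt.Errorf("range [%d,%d) out of bounds", start, end)
	}
	if start == end {
		return nil, nil
	}
	first, last := start/chunkSize, (end-1)/chunkSize
	var plain []byte
	for i := first; i <= last; i++ {
		lo := uint32(0)
		if i > 0 {
			lo = d.ends[i-1]
		}
		r := flate.NewReader(bytes.NewReader(d.data[lo:d.ends[i]]))
		b, err := io.ReadAll(r)
		r.Close()
		if err != nil {
			return nil, err
		}
		plain = append(plain, b...)
	}
	base := first * chunkSize
	return plain[start-base : end-base], nil
}

func main() {
	body := []byte("the quick brown fox jumps over the lazy dog")
	d, err := compressDoc(body)
	if err != nil {
		panic(err)
	}
	got, err := d.readRange(10, 25)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d chunks, range = %q\n", len(d.ends), got)
}
```

Unlike slicing a memory-mapped file, every read here allocates and decodes, so the chunk size directly trades compression ratio against CPU and copying per match.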

jtibshirani · 2025-03-25T17:29:56Z

@efritz compression is an important topic, great you're looking into this! As requested, here are some pointers on how the file contents are read.

Let's take a regexp query as an example.

1. First, Zoekt creates a trigram iterator to produce the list of candidate file matches.
2. As it iterates over each candidate file match, it needs to confirm whether the file actually matches.
3. To do so, it looks up the file contents in the memory-mapped index file. It knows the start and end offsets because we keep a mapping of file index to offsets in the boundaries map.
4. Then, it runs the regexp engine over the memory-mapped byte array. This should incur minimal data copying.
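
As a simplified illustration of the lookup and match steps, here is a sketch in Go. The `shard` type, its fields, and `confirm` are hypothetical stand-ins rather than Zoekt's actual types, and a plain in-memory byte slice takes the place of the memory-mapped file.

```go
package main

import (
	"fmt"
	"regexp"
)

// shard is a hypothetical stand-in for an index shard.
type shard struct {
	content    []byte   // all file contents, stored back to back
	boundaries []uint32 // file i occupies content[boundaries[i]:boundaries[i+1]]
}

func newShard(files []string) *shard {
	s := &shard{boundaries: []uint32{0}}
	for _, f := range files {
		s.content = append(s.content, f...)
		s.boundaries = append(s.boundaries, uint32(len(s.content)))
	}
	return s
}

// fileData returns a sub-slice of content for file i; no bytes are copied.
func (s *shard) fileData(i int) []byte {
	return s.content[s.boundaries[i]:s.boundaries[i+1]]
}

// confirm keeps only the candidate files whose contents match pattern.
func (s *shard) confirm(candidates []int, pattern string) []int {
	re := regexp.MustCompile(pattern)
	var out []int
	for _, i := range candidates {
		if re.Match(s.fileData(i)) {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	s := newShard([]string{"alpha beta", "gamma", "beta gamma"})
	fmt.Printf("file 1 = %q\n", s.fileData(1))
	fmt.Println("confirmed:", s.confirm([]int{0, 2}, `^beta`))
}
```

Because `fileData` only re-slices the backing array, the regexp engine reads the stored bytes directly. This is the property that compressed document bodies would give up.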

Some more background to help make sense of the above links:

All file contents for a repo are stored contiguously in the index file.

My example omits a bunch of optimizations around substring matching, offsets, etc. that help avoid searching in the entire file for matches. Here's an example where we check only a portion of the file's contents for a substring match:

zoekt/index/matchtree.go

Lines 948 to 957 in 47287ad

    for _, m := range t.current {
        if m.byteOffset == 0 && m.runeOffset > 0 {
            m.byteOffset = cp.findOffset(m.fileName, m.runeOffset)
        }
        if m.matchContent(cp.data(m.fileName)) {
            pruned = append(pruned, m)
        }
    }
    t.current = pruned
    t.contEvaluated = true
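
To illustrate the general idea of checking only a portion of a file, here is a standalone sketch in Go. It is not the code behind `matchContent`: the `candidate` type, `matchAt`, and `prune` are hypothetical, and the sketch assumes each candidate already carries the byte offset at which the trigram index suggests the substring might start.

```go
package main

import (
	"bytes"
	"fmt"
)

// candidate is a hypothetical match position suggested by the trigram index.
type candidate struct {
	byteOffset int
}

// matchAt compares only len(pattern) bytes at off instead of scanning the file.
func matchAt(content, pattern []byte, off int) bool {
	if off < 0 || off+len(pattern) > len(content) {
		return false
	}
	return bytes.Equal(content[off:off+len(pattern)], pattern)
}

// prune keeps the candidates whose offset really starts an occurrence of pattern.
func prune(content, pattern []byte, cands []candidate) []candidate {
	var pruned []candidate
	for _, c := range cands {
		if matchAt(content, pattern, c.byteOffset) {
			pruned = append(pruned, c)
		}
	}
	return pruned
}

func main() {
	content := []byte("foo bar foobar")
	// The trigram "foo" occurs at offsets 0 and 8; only one starts "foobar".
	cands := []candidate{{byteOffset: 0}, {byteOffset: 8}}
	fmt.Println(prune(content, []byte("foobar"), cands))
}
```

With fixed-size compressed chunks, a check like this would only need to decode the chunk or chunks that cover `[off, off+len(pattern))`.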
