Replies: 1 comment
@efritz compression is an important topic, great you're looking into this! As requested, here are some pointers on how the file contents are read. Let's take a regexp query as an example.
Some more background to help make sense of the above links:
-
Internally, there are field types in Elasticsearch/OpenSearch/Lucene that have a similar flow to Zoekt:
Basically, this is the Google Code Search algorithm. Interestingly, Lucene 8 stored the full document source text LZ4-compressed, which produces much smaller indexes than Zoekt's.
Zoekt's internals rely on memory-mapped files and fast access to byte ranges within those files. I know this is true for the table of contents and the posting lists, but I'm less familiar with how document text is read back.
I had assumed that the full document is read to apply the regex pattern, but I think that's a misunderstanding on my part. Assuming Zoekt does only bring back a subsection of the document for matching, how does it determine the range of the original document? The first/last trigram offsets? Newline analysis offsets?
Atlassian is looking to re-index an ES cluster of ~300TB, and the increase in size from moving to Zoekt is substantial. I'm interested in exploring what it would take to compress the document content in Zoekt indexes. Depending on how document bodies are read in the first place, the effort involved could vary widely.
I'm initially looking for pointers/validation on where sections of document bodies are read back, and I can perform some feasibility analysis on my end. It could turn out that compression of document bodies creates too much additional CPU load during searches to be an appropriate trade-off. But it could be a net win if the increase in CPU load from LZ4 is offset by the savings in data read from disk.
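One way the trade-off above could play out is block-wise compression, where content is split into fixed-size blocks that are compressed independently, so a range read only decodes the blocks it overlaps. The sketch below is an assumption about a possible design, not an existing Zoekt feature. It uses `compress/flate` from the Go standard library as a stand-in because LZ4 is not in the standard library; `blockStore`, `compressBlocks`, `readRange`, and the 4 KiB `blockSize` are all hypothetical.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// blockSize is a hypothetical uncompressed block size. Smaller blocks mean
// less wasted decompression per range read but a worse compression ratio.
const blockSize = 4096

// blockStore holds content as independently compressed fixed-size blocks.
type blockStore struct {
	blocks [][]byte // compressed blocks, in order
	size   int      // total uncompressed size
}

// compressBlocks splits content into blockSize chunks and compresses each one
// on its own (DEFLATE here as a stand-in for LZ4).
func compressBlocks(content []byte) (*blockStore, error) {
	s := &blockStore{size: len(content)}
	for start := 0; start < len(content); start += blockSize {
		end := start + blockSize
		if end > len(content) {
			end = len(content)
		}
		var buf bytes.Buffer
		w, err := flate.NewWriter(&buf, flate.BestSpeed)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(content[start:end]); err != nil {
			return nil, err
		}
		if err := w.Close(); err != nil {
			return nil, err
		}
		s.blocks = append(s.blocks, buf.Bytes())
	}
	return s, nil
}

// readRange decompresses only the blocks overlapping [off, off+length).
func (s *blockStore) readRange(off, length int) ([]byte, error) {
	if off < 0 || length < 0 || off > s.size || length > s.size-off {
		return nil, fmt.Errorf("range [%d,+%d) outside content of size %d", off, length, s.size)
	}
	if length == 0 {
		return nil, nil
	}
	first, last := off/blockSize, (off+length-1)/blockSize
	out := make([]byte, 0, (last-first+1)*blockSize)
	for i := first; i <= last; i++ {
		r := flate.NewReader(bytes.NewReader(s.blocks[i]))
		raw, err := io.ReadAll(r)
		r.Close()
		if err != nil {
			return nil, err
		}
		out = append(out, raw...)
	}
	skip := off - first*blockSize
	return out[skip : skip+length], nil
}
```

Under this scheme the extra CPU cost per match is bounded by the number of blocks the requested range touches, which is why knowing the size of the ranges the matcher actually requests matters for the feasibility analysis.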
My initial understanding is that an implementation would likely require: