Description
What new feature do you want?
To my understanding, a call to hdt::Hdt::new() reads the entire HDT file into memory (header, dictionary, triples), and all operations performed on the HDT file afterwards run against that in-memory data model. If so, has there been any design consideration of having the library memory-map the source HDT file, starting from the triples section offset, instead of loading it eagerly?
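To make the idea concrete, here is a rough sketch of the kind of loading path I have in mind. This is purely illustrative: the use of the memmap2 crate, the function name, and the triples-offset handling are my assumptions, not existing hdt API.

```rust
use std::fs::File;
use memmap2::Mmap;

/// Hypothetical loading path: map the HDT file instead of reading it into owned buffers.
fn open_hdt_mapped(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Map the file read-only; the OS pages data in on demand, so the triples
    // section would not need to be copied into owned structures up front.
    let mmap = unsafe { Mmap::map(&file)? };
    // The header and dictionary could still be parsed eagerly, while the
    // triples structures (adjacency lists, bitmaps) could borrow from
    // &mmap[triples_offset..] rather than owning their data.
    Ok(mmap)
}
```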
For context, I'm working on integrating the HDT library with oxigraph (oxigraph/oxigraph#1087). On larger datasets there is significant overhead in the initial call to hdt::Hdt::new(). For instance, running against a dump of DBLP (~500 million triples):
```
[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 365063110
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 5.9 GB, details:
[DEBUG hdt::hdt] Hdt {
    dict: FourSectDict {
        shared: total size 265.1 MB, 56011771 strings, sequence 12.3 MB with 3500737 entries, 28 bits per entry, packed data 252.9 MB,
        subjects: total size 292.0 KB, 41119 strings, sequence 6.1 KB with 2571 entries, 19 bits per entry, packed data 285.9 KB,
        predicates: total size 1.3 KB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.3 KB,
        objects: total size 1.5 GB, 56297252 strings, sequence 13.6 MB with 3518580 entries, 31 bits per entry, packed data 1.5 GB,
    },
    triples: total size 4.1 GB
    adjlist_z AdjList {
        sequence: 1.8 GB with 490171629 entries, 29 bits per entry,
        bitmap: 79.4 MB,
    }
    op_index total size 1.9 GB {
        sequence: 1.8 GB with 29 bits,
        bitmap: 76.6 MB
    }
    wavelet_y 419.3 MB,
}
Load Time: 169.108882004s
```
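For reference, the numbers above were collected with something along these lines (a rough sketch: the env_logger setup and file path are placeholders, and the Hdt::new(BufReader) call is based on my reading of the crate's examples, so the exact signature may differ between versions):

```rust
use std::io::BufReader;
use std::time::Instant;
use hdt::Hdt;

fn main() {
    // Enables the [DEBUG hdt::*] output shown above (placeholder logger setup).
    env_logger::init();

    let file = std::fs::File::open("dblp.hdt").expect("error opening file");
    let start = Instant::now();
    // Hdt::new() eagerly reads header, dictionary and triples and builds the
    // wavelet matrix / OPS index, which dominates the load time.
    let _hdt = Hdt::new(BufReader::new(file)).expect("error loading HDT");
    println!("Load Time: {:?}", start.elapsed());
}
```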
About 6 GB of memory is used, and if multiple HDT files of similar size are loaded, that memory consumption quickly becomes unwieldy: a database has to choose between keeping the files resident in memory or accepting the long load times.