Description
What new feature do you want?
To my understanding, a call to hdt::Hdt::new() reads the entire HDT file into memory (header, dictionary, triples), and all operations performed on the HDT file afterwards run against that in-memory data model. If so, has there been any design consideration of having the library memory-map the source HDT file, starting from the triples section offset, instead of loading it eagerly?
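To make the idea concrete, here is a rough sketch of the kind of loading path I have in mind. This is purely illustrative: the use of the memmap2 crate, the function name, and the triples-offset handling are my assumptions, not existing hdt API.

```rust
use std::fs::File;
use memmap2::Mmap;

/// Hypothetical loading path: map the HDT file instead of reading it into owned buffers.
fn open_hdt_mapped(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Map the file read-only; the OS pages data in on demand, so the triples
    // section would not need to be copied into owned structures up front.
    let mmap = unsafe { Mmap::map(&file)? };
    // The header and dictionary could still be parsed eagerly, while the
    // triples structures (adjacency lists, bitmaps) could borrow from
    // &mmap[triples_offset..] rather than owning their data.
    Ok(mmap)
}
```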
For context, I'm working on integrating the HDT library with oxigraph (oxigraph/oxigraph#1087). On larger datasets there is significant overhead in the initial call to hdt::Hdt::new(). For instance, running against a dump of DBLP (~500 million triples):
```
[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 365063110
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 5.9 GB, details:
[DEBUG hdt::hdt] Hdt {
    dict: FourSectDict {
        shared: total size 265.1 MB, 56011771 strings, sequence 12.3 MB with 3500737 entries, 28 bits per entry, packed data 252.9 MB,
        subjects: total size 292.0 KB, 41119 strings, sequence 6.1 KB with 2571 entries, 19 bits per entry, packed data 285.9 KB,
        predicates: total size 1.3 KB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.3 KB,
        objects: total size 1.5 GB, 56297252 strings, sequence 13.6 MB with 3518580 entries, 31 bits per entry, packed data 1.5 GB,
    },
    triples: total size 4.1 GB
    adjlist_z AdjList {
        sequence: 1.8 GB with 490171629 entries, 29 bits per entry,
        bitmap: 79.4 MB,
    }
    op_index total size 1.9 GB {
        sequence: 1.8 GB with 29 bits,
        bitmap: 76.6 MB
    }
    wavelet_y 419.3 MB,
}
Load Time: 169.108882004s
```
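For reference, the numbers above were collected with something along these lines (a rough sketch: the env_logger setup and file path are placeholders, and the Hdt::new(BufReader) call is based on my reading of the crate's examples, so the exact signature may differ between versions):

```rust
use std::io::BufReader;
use std::time::Instant;
use hdt::Hdt;

fn main() {
    // Enables the [DEBUG hdt::*] output shown above (placeholder logger setup).
    env_logger::init();

    let file = std::fs::File::open("dblp.hdt").expect("error opening file");
    let start = Instant::now();
    // Hdt::new() eagerly reads header, dictionary and triples and builds the
    // wavelet matrix / OPS index, which dominates the load time.
    let _hdt = Hdt::new(BufReader::new(file)).expect("error loading HDT");
    println!("Load Time: {:?}", start.elapsed());
}
```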
About 6 GB of memory is used, and if multiple HDT files of similar size are loaded, that memory consumption quickly becomes unwieldy: a database has to choose between keeping the files resident in memory or accepting the long load times.