
Memory optimizations on larger HDT files #49

@GregHanson

Description


What new feature do you want?

To my understanding, a call to hdt::Hdt::new() reads the entire HDT file into memory (header, dictionary, triples), and all operations performed on the HDT file afterwards run over that in-memory data model. Is that correct? If so, have there been any design considerations around creating a memory-mapped view of the source HDT file, starting from the triples section offset?
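
To make the idea concrete, here is a minimal sketch of what a memory-mapped triples section could look like, assuming the memmap2 crate. TriplesView, the file path, and the hard-coded offset are hypothetical and for illustration only; in practice the offset would come from parsing the header and dictionary sections first.

use std::fs::File;
use std::io::Result;

use memmap2::Mmap;

/// Hypothetical read-only view over the triples section of an HDT file,
/// backed by the OS page cache instead of a heap allocation.
struct TriplesView {
    mmap: Mmap,
    triples_offset: usize,
}

impl TriplesView {
    fn open(path: &str, triples_offset: usize) -> Result<Self> {
        let file = File::open(path)?;
        // SAFETY: the file must not be truncated or modified while mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        Ok(TriplesView { mmap, triples_offset })
    }

    /// Raw bytes of the triples section; decoding would happen lazily on access,
    /// so only the pages that are actually touched get pulled into memory.
    fn triples_bytes(&self) -> &[u8] {
        &self.mmap[self.triples_offset..]
    }
}

fn main() -> Result<()> {
    // Path and offset are placeholders for illustration only.
    let view = TriplesView::open("dblp.hdt", 0)?;
    println!("triples section is {} bytes", view.triples_bytes().len());
    Ok(())
}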

For context, I'm working on integrating the HDT library with oxigraph (oxigraph/oxigraph#1087). On larger datasets there is significant overhead on the initial call to hdt::Hdt::new(). For instance, running against a dump of DBLP (~500 million triples):

[DEBUG hdt::triples] Building wavelet matrix...
[DEBUG hdt::triples] Building OPS index...
[DEBUG hdt::triples] Built wavelet matrix with length 365063110
[DEBUG hdt::triples] built OPS index
[DEBUG hdt::hdt] HDT size in memory 5.9 GB, details:
[DEBUG hdt::hdt] Hdt {
        dict: FourSectDict {
            shared: total size 265.1 MB, 56011771 strings, sequence 12.3 MB with 3500737 entries, 28 bits per entry, packed data 252.9 MB,
            subjects: total size 292.0 KB, 41119 strings, sequence 6.1 KB with 2571 entries, 19 bits per entry, packed data 285.9 KB,
            predicates: total size 1.3 KB, 90 strings, sequence 16 B with 7 entries, 11 bits per entry, packed data 1.3 KB,
            objects: total size 1.5 GB, 56297252 strings, sequence 13.6 MB with 3518580 entries, 31 bits per entry, packed data 1.5 GB,
        },
        triples: total size 4.1 GB
        adjlist_z AdjList {
            sequence: 1.8 GB with 490171629 entries, 29 bits per entry,
            bitmap: 79.4 MB,
        }
        op_index total size 1.9 GB {
            sequence: 1.8 GB with 29 bits,
            bitmap: 76.6 MB
        }
        wavelet_y 419.3 MB,
    }
Load Time: 169.108882004s

About 6 GB of memory is used, and if multiple HDT files of similar size are loaded, that memory consumption quickly becomes unwieldy: a database has to choose between keeping the files resident in memory or accepting the longer response times of loading them on demand.
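
For reference, the load time above was measured with something along these lines; a minimal sketch that times hdt::Hdt::new(), assuming the crate's documented pattern of passing a buffered reader, with the file path as a placeholder.

use std::fs::File;
use std::io::BufReader;
use std::time::Instant;

use hdt::Hdt;

fn main() {
    let start = Instant::now();
    let file = File::open("dblp.hdt").expect("error opening HDT file");
    // Hdt::new parses the header, dictionary and triples and builds the
    // wavelet matrix and OPS index up front, all held on the heap.
    let hdt = Hdt::new(BufReader::new(file)).expect("error loading HDT");
    println!("Load Time: {:?}", start.elapsed());
    // Every query afterwards runs against the in-memory structures in `hdt`.
    drop(hdt);
}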


Labels

enhancement (New feature or request), optimize (reduce memory or CPU usage), ram (reduce memory usage)
