
Conversation

@GregHanson
Collaborator

@KonradHoeffner I'm not sure if you and I were on a similar thought process with regard to rayon, because I found your issue here, but this PR has performance optimizations for the following:

  • parallel dictionary compression
  • parallel triple encoding

@KonradHoeffner
Owner

KonradHoeffner commented Sep 25, 2025

Great! I originally wanted to parallelize this part:

        nt::parse_bufread(r)
            .for_each_triple(|q| {
                // execute this in parallel using into_par_iter()
                [...]
            })

Unfortunately that didn't work because Sophia doesn't yield an actual iterator at this step.
Parallelizing the dictionary compression is great as well!
Now we just need to adapt the benchmark so we can measure the impact it has on a large file.
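For illustration, here is a minimal sketch of that intended pattern, assuming the triples were first collected into a Vec (the function and the collection step are hypothetical, not Sophia's actual API):

        use rayon::prelude::*;

        // Hypothetical sketch: since the parser callback cannot be parallelized
        // directly, collect the triples first, then fan out with rayon.
        fn encode_parallel(triples: Vec<[String; 3]>) {
            triples.into_par_iter().for_each(|t| {
                // per-triple encoding work would run on rayon's thread pool here
                let _ = t;
            });
        }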

@GregHanson
Collaborator Author

I apparently need to get better at criterion; for now I have some metrics from manually timing the different parts of FourDictSect::read_nt using logs:

[screenshot: timing log output]

I used a larger file from https://download.bio2rdf.org/files/release/4/taxonomy/taxonomy-nodes.nq.gz, which I had riot convert to NT format; it has about 23.5 million triples. I ran 5 converts. Writes, header build, and BitmapTriples::from_triples() are included in the "total" timings but not in the individual breakdowns.

Just watching my top output, I verified that the rayon crate is utilizing multiple CPUs during the tests.

So the encoding actually performs better, but building the dictionary takes longer when run in parallel. Maybe trying to run all four compress() calls at once isn't a good idea.

@GregHanson
Collaborator Author

I made an attempt at separating the 3 core components of FourDictSect::read_nt so that the functions can be tested individually via criterion.

I used a larger file from https://download.bio2rdf.org/files/release/4/taxonomy/taxonomy-nodes.nq.gz which I had riot convert to NT format

The benchmarks are using this NT file. I tried using the persondata_en from https://github.com/KonradHoeffner/hdt/releases/tag/benchmarkdata , but I got a bunch of validation errors on the resulting NT when I tried the WIP hdt-to-rdf CLI branch, and also from the original ttl.gz file.

$ riot --count -q persondata_en.ttl
14:26:29 WARN  riot            :: [line: 4, col: 94] Lexical form '1860-2-21' not valid for datatype XSD date
14:26:29 WARN  riot            :: [line: 8, col: 94] Lexical form '1927-11-3' not valid for datatype XSD date
14:26:29 WARN  riot            :: [line: 13, col: 91] Lexical form '1884-11-5' not valid for datatype XSD date
...

They take a long time to run, so a different sample NT file may have to be used.

@KonradHoeffner
Owner

Great work! Which CPU did you run those measurements on? Does it have enough cores to run all the threads on physical cores?
I'm working on a dual-core laptop CPU (i3-1115G4) right now, so I can't measure the parallelism very well from here, but I will try it on my office machine with the Core i9 12900k when I get to it.

@KonradHoeffner
Owner

If the speedup is minor even for very large files, the question is also whether it is worth the developer time and the increase in compilation time and binary size from adding the rayon dependency, especially if this only helps with converting NT to HDT, which is probably not the main focus of the library.
However, if the CLI gets adopted by a mainstream audience, then the convert function could actually be used a lot.
@GregHanson: What do you think?

@KonradHoeffner
Owner

The benchmarks are using this NT file. I tried using the persondata_en from https://github.com/KonradHoeffner/hdt/releases/tag/benchmarkdata , but I got a bunch of validation errors on the resulting NT when I tried the WIP hdt-to-rdf CLI branch, and also from the original ttl.gz file.
[...]
They take a long time to run, so a different sample NT file may have to be used

Can you try it again with the CLI branch?
I just rebased it to the current state of main and also print more info now: file size of input and output and also duration of the conversion.

On my low-end dual-core laptop (still serial processing) it takes a while but does not print errors:

hdt$ cargo build --release && cp target/release/hdt rdf
hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 11.11s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 60.35s

@KonradHoeffner
Owner

Testing it on my Intel i9 10900k with 10 cores / 20 threads (serial original version):

hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 8.44s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 54.08s

Interesting that it is so close in timing to the low-end laptop CPU.

@KonradHoeffner
Owner

Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 53.77s

I tried it with scoped threads as well to skip the rayon dependency.
But at least for the persondata_en case, this part seems really small and doesn't cause any improvement in the runtime.

        let [shared, predicates, subjects, objects] = thread::scope(|s| {
            [&shared_terms, &predicate_terms_ref, &unique_subject_terms, &unique_object_terms]
                .map(|terms| s.spawn(|| DictSectPFC::compress(terms, block_size)))
                .map(|t| t.join().unwrap())
        });
        println!("compression finished");
        let dict = FourSectDict { shared, predicates, subjects, objects };

@GregHanson
Collaborator Author

Can you try it again with the CLI branch?

I tried again with the updated CLI branch; same errors from riot. But you're right, the generated NT is still consumable by the HDT crate and queries work against it, so I guess it is an ignorable error.

Which CPU did you run those measurements on

I am unfortunately running on WSL, so my numbers are going to be on the high side no matter what. Tomorrow I should be able to try running it on an HPC node and get some more juicy numbers.

@GregHanson
Collaborator Author

I was surprised it didn't lead to better performance, to be honest. If the numbers are negligible, I'm OK with skipping it for now.

@KonradHoeffner
Owner

I am unfortunately running on WSL, so my numbers are going to be on the high side no matter what. Tomorrow I should be able to try running it on an HPC node and get some more juicy numbers

I would assume that the kind of CPU would not change the relative numbers much, only the absolute numbers.
But can you report how many cores those machines have?

@KonradHoeffner
Owner

I was surprised it didn't lead to better performance, to be honest. If the numbers are negligible, I'm OK with skipping it for now

I think it's still great to have a benchmark for that, and we could try it with Rust threads first, as then we don't have the downside of an additional dependency.

@KonradHoeffner KonradHoeffner force-pushed the add-rayon branch 2 times, most recently from 2a111d2 to 41cf010 on October 1, 2025 09:02
@KonradHoeffner
Owner

Wow, the Intel i9 12900k is quite fast (still serial):

hdt$ rdf convert tests/resources/persondata_en_100k.hdt tests/resources/persondata_en_100k.nt
Successfully converted "tests/resources/persondata_en_100k.hdt" (1.6 MiB) to "tests/resources/persondata_en_100k.nt" (11.2 MiB) in 0.08s
hdt$ rdf convert tests/resources/persondata_en_1M.hdt tests/resources/persondata_en_1M.nt
Successfully converted "tests/resources/persondata_en_1M.hdt" (11.1 MiB) to "tests/resources/persondata_en_1M.nt" (111.1 MiB) in 0.59s
hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 5.99s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 37.60s

@KonradHoeffner
Owner

KonradHoeffner commented Oct 1, 2025

build_dict_from_terms 12900k persondata_en_1M.nt

Using 1M because 1 run of just one part takes 40 seconds already.

single thread

        let (shared, subjects, predicates, objects) = (
            DictSectPFC::compress(&shared_terms, block_size),
            DictSectPFC::compress(&unique_subject_terms, block_size),
            DictSectPFC::compress(&predicate_terms_ref, block_size),
            DictSectPFC::compress(&unique_object_terms, block_size),
        );
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
    Finished `bench` profile [optimized] target(s) in 0.03s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [278.28 ms 297.21 ms 317.14 ms]
                        change: [−6.6888% −0.0491% +7.0741%] (p = 0.99 > 0.05)
                        No change in performance detected.

cargo bench --bench criterion -- dictionary_read_nt  40.00s user 0.49s system 101% cpu 40.059 total

rayon

        let ((shared, predicates), (subjects, objects)) = rayon::join(
            || {
                rayon::join(
                    || DictSectPFC::compress(&shared_terms, block_size),
                    || DictSectPFC::compress(&predicate_terms_ref, block_size),
                )
            },
            || {
                rayon::join(
                    || DictSectPFC::compress(&unique_subject_terms, block_size),
                    || DictSectPFC::compress(&unique_object_terms, block_size),
                )
            },
        );  
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.37s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [300.92 ms 304.38 ms 308.23 ms]
                        change: [−4.1078% +2.4132% +9.4324%] (p = 0.50 > 0.05)
                        No change in performance detected.
cargo bench --bench criterion -- dictionary_read_nt  49.85s user 0.81s system 121% cpu 41.809 total

I don't know why, but I only saw one active thread in htop.
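One thing worth ruling out (an assumption, not something verified here): rayon may have ended up with a single-threaded pool. The pool size can be checked and pinned explicitly:

        // Hypothetical check: print rayon's global pool size and force a
        // dedicated pool to see whether the joins actually spread out.
        fn check_pool() {
            println!("rayon threads: {}", rayon::current_num_threads());
            let pool = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();
            pool.install(|| {
                // rayon::join / par_iter calls in here use this 4-thread pool
            });
        }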

std::thread

        use std::thread;
        let [shared, predicates, subjects, objects] = thread::scope(|s| {
            [&shared_terms, &predicate_terms_ref, &unique_subject_terms, &unique_object_terms]
                .map(|terms| s.spawn(|| DictSectPFC::compress(terms, block_size)))
                .map(|t| t.join().unwrap())
        });
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.34s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [241.45 ms 245.12 ms 248.71 ms]
                        change: [−18.237% −17.093% −15.955%] (p = 0.00 < 0.05)
                        Performance has improved.

cargo bench --bench criterion -- dictionary_read_nt  49.56s user 0.74s system 122% cpu 41.142 total

hdt$ time cargo bench --bench criterion -- dictionary_read_nt
    Finished `bench` profile [optimized] target(s) in 0.11s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [296.98 ms 299.35 ms 301.95 ms]
                        change: [+20.124% +22.126% +24.255%] (p = 0.00 < 0.05)
                        Performance has regressed.

cargo bench --bench criterion -- dictionary_read_nt  38.53s user 0.52s system 102% cpu 38.169 total

Conclusions

We should test with the complete file as well because there is a lot of variation between runs, maybe because of the CPU getting hot or reaching its boost limits?
Also, htop does not show multiple cores being used; maybe there is some shared resource that blocks effective parallelization.

@KonradHoeffner
Owner

KonradHoeffner commented Oct 1, 2025

build_dict_from_terms 12900k persondata_en.nt

single thread

hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.27s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
Benchmarking dictionary_read_nt/dict_building: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 44.5s.
dictionary_read_nt/dict_building
                        time:   [4.3421 s 4.3909 s 4.4389 s]
                        change: [+1346.3% +1366.8% +1386.3%] (p = 0.00 < 0.05)
                        Performance has regressed.

cargo bench --bench criterion -- dictionary_read_nt  272.16s user 3.61s system 102% cpu 4:27.85 total

rayon

std::thread

@KonradHoeffner
Owner

KonradHoeffner commented Oct 1, 2025

serial set operations

It seems to me as if most of the time is spent in this spot:

        let shared_terms: BTreeSet<&str> =
            subject_terms.intersection(object_terms).map(std::ops::Deref::deref).collect();
        let unique_subject_terms: BTreeSet<&str> =
            subject_terms.difference(object_terms).map(std::ops::Deref::deref).collect();
        let unique_object_terms: BTreeSet<&str> =
            object_terms.difference(subject_terms).map(std::ops::Deref::deref).collect();
dictionary_read_nt/dict_building
                        time:   [4.2262 s 4.2380 s 4.2554 s]
                        change: [+6.3368% +6.6206% +7.0208%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

cargo bench --bench criterion -- dictionary_read_nt  265.32s user 3.46s system 103% cpu 4:20.89 total

parallel set operations

        use std::thread;
        let [shared_terms, unique_subject_terms, unique_object_terms]: [BTreeSet<&str>; 3] = thread::scope(|s| {
            [
                s.spawn(|| subject_terms.intersection(object_terms).map(std::ops::Deref::deref).collect()),
                s.spawn(|| subject_terms.difference(object_terms).map(std::ops::Deref::deref).collect()),
                s.spawn(|| object_terms.difference(subject_terms).map(std::ops::Deref::deref).collect()),
            ]
            .map(|t| t.join().unwrap())
        });
dictionary_read_nt/dict_building
                        time:   [3.7513 s 3.7591 s 3.7686 s]
                        change: [−11.706% −11.299% −10.952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

cargo bench --bench criterion -- dictionary_read_nt  262.04s user 3.57s system 104% cpu 4:14.50 total

Heaptrack serial peak RSS 35.4MB

[heaptrack screenshot: hdt-build-dict-serial]

Heaptrack parallel peak RSS 35.5MB

[heaptrack screenshot: hdt-build-dict-parallel]

Conclusions

Performing the set operations in parallel has a positive impact on performance of around 11%, and it does not cause a significant increase in RAM consumption, so I'm going to merge that.

@KonradHoeffner
Owner

KonradHoeffner commented Oct 1, 2025

Visualize performance of serial conversion to see bottlenecks

Note: "rdf" is my alias to the CLI binary.

hdt$ perf record --call-graph=dwarf rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 38.26s
[ perf record: Woken up 4917 times to write data ]
Warning:
Processed 184626 events and lost 1 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 1230.130 MB perf.data (152806 samples) ]
hdt$ perf script > /tmp/convert.pdf
Warning: [...]

@KonradHoeffner
Owner

I think I have to merge in the CLI branch; otherwise it's too cumbersome to test those changes.

@KonradHoeffner
Owner

[screenshot: profiling visualization]

There is indeed some speedup, going from ~38s to ~32.5s, but the amount saved by the dictionary compression seems small in comparison to the triple encoding.
Interesting that the triple encoding does not create more time savings.

@KonradHoeffner
Owner

memory usage for conversion persondata_en_1M.nt > persondata_en_1M.hdt

With Rayon: 523 MB
Without Rayon: 492 MB

@KonradHoeffner
Owner

Using two threads with a channel for parsing also saves a lot of time, down to around 23s.

[screenshot: profiling visualization]
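For reference, a minimal sketch of the two-thread channel pattern (the names are illustrative, not the actual read_nt code):

        use std::sync::mpsc;
        use std::thread;

        // Hypothetical sketch: one thread parses input lines and sends them
        // through a channel while the receiving thread does the encoding work.
        fn pipeline(lines: Vec<String>) {
            let (tx, rx) = mpsc::channel::<String>();
            let parser = thread::spawn(move || {
                for line in lines {
                    tx.send(line).unwrap(); // real code would parse here first
                }
                // tx is dropped here, which closes the channel
            });
            for triple in rx {
                let _ = triple; // encode/collect on this thread
            }
            parser.join().unwrap();
        }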

@KonradHoeffner
Owner

The BTreeSet also seems to be problematic because of so many relatively slow insertions; using a HashSet speeds it up a bit, to 22.9s.

@KonradHoeffner
Owner

This also takes a few seconds; I wonder if it's faster to use a HashSet instead:

        raw_triples.sort_unstable();
        raw_triples.dedup();

Result: no, that takes 25.5s.
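For reference, the HashSet variant that was measured at 25.5s presumably looked something like this (a sketch, not the exact code that was tried):

        use std::collections::HashSet;

        // Hypothetical sketch: deduplicate via a HashSet, then sort for the
        // later encoding steps.
        fn dedup_via_hashset(raw_triples: Vec<[usize; 3]>) -> Vec<[usize; 3]> {
            let set: HashSet<[usize; 3]> = raw_triples.into_iter().collect();
            let mut v: Vec<[usize; 3]> = set.into_iter().collect();
            v.sort_unstable();
            v
        }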

@KonradHoeffner
Owner

Let's just sort the raw triples in another thread while the main thread builds the dictionary.
Now it's down to less than 22s, enough for today.

[screenshot: profiling visualization]
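A minimal sketch of that overlap, assuming the triples are already encoded as integer IDs (build_dictionary is a hypothetical stand-in):

        use std::thread;

        // Hypothetical sketch: sort and dedup the encoded triples on a helper
        // thread while the current thread keeps building the dictionary.
        fn overlap_sort(mut raw_triples: Vec<[usize; 3]>) -> Vec<[usize; 3]> {
            thread::scope(|s| {
                let sorter = s.spawn(move || {
                    raw_triples.sort_unstable();
                    raw_triples.dedup();
                    raw_triples
                });
                // build_dictionary() would run here, concurrently with the sort
                sorter.join().unwrap()
            })
        }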

@KonradHoeffner
Owner

KonradHoeffner commented Oct 15, 2025

I finally managed to rebase this branch onto the current state of main.
Your index optimizations are great; however, I think the intermediate functions are not that reusable anymore, so I think it would be better to merge them together again (though I was on the fence about this before anyway).
While this does not allow for your detailed benchmarking, I think that given that profiling allows measuring the steps of the read_nt operation, this is not necessary anymore.
On the other hand, separating a large function into multiple smaller ones does allow for easier understanding.
However, return types like Result<(Vec<[usize; 3]>, HashSet<usize>, HashSet<usize>, HashSet<usize>, Vec<String>)> suggest to me that this is not an easy and natural separation.
What do you think @GregHanson?

@KonradHoeffner
Owner

KonradHoeffner commented Oct 16, 2025

By the way, could the hashing performance be optimized by using some other data structure?
Because the keys are already integers, and we don't care about hashing as a security measure, it seems a bit redundant to convert one integer to another, especially as they are all in the range of 0 to the length of the string vector minus 1.
Maybe a bitmap :-)
I guess then some of the set operations could also be optimized using AND/OR/XOR; see the sketch below.
But this is probably not going to be worth the effort, because your approach is already so much more optimized than the previous way; I'm just curious whether that is the default approach for such a problem.
Alternatively, the HashSet could also contain pointers into the vec, but I guess that is unsafe (does that even work?).
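A minimal sketch of the bitmap idea, assuming all keys fall in 0..n (not the crate's actual implementation):

        // Hypothetical sketch: membership of integer keys in 0..n stored as one
        // bit each, with set intersection reduced to a bitwise AND per word.
        struct BitSet {
            words: Vec<u64>,
        }

        impl BitSet {
            fn new(n: usize) -> Self {
                Self { words: vec![0; (n + 63) / 64] }
            }
            fn insert(&mut self, i: usize) {
                self.words[i / 64] |= 1 << (i % 64);
            }
            fn contains(&self, i: usize) -> bool {
                self.words[i / 64] & (1 << (i % 64)) != 0
            }
            // intersection of two sets over the same 0..n universe
            fn and(&self, other: &Self) -> Self {
                Self { words: self.words.iter().zip(&other.words).map(|(a, b)| a & b).collect() }
            }
        }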

@KonradHoeffner
Owner

KonradHoeffner commented Oct 16, 2025

[screenshot: profiling visualization]

The bottleneck seemed to be the initial reading of the NTriples file and the string interning, so I used Rayon for that part as well.
Unfortunately, both the string interning function and the rio-based Sophia parser were not suitable for parallel processing, so I had to bring in oxttl with its parallel NT reading support and the lasso string interning library; see the sketch after this paragraph.
I know you were bringing up oxttl a while ago, so I'm sorry for not believing in it earlier :-)
Binary size is still at 5.3MB, probably because a few other things like HashSets are not used anymore now.
But it could make sense to just put all the NT-to-HDT code into a separate file and feature flag.
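A sketch of the general shape of that parallel interning, using lasso's ThreadedRodeo shared across rayon workers (the line splitting is a placeholder for the real oxttl parsing, which is not shown):

        use lasso::{Spur, ThreadedRodeo};
        use rayon::prelude::*;

        // Hypothetical sketch: ThreadedRodeo can be shared by reference across
        // rayon worker threads, so each parsed term is interned concurrently.
        fn intern_terms(lines: &[String]) -> Vec<[Spur; 3]> {
            let rodeo = ThreadedRodeo::default();
            lines
                .par_iter()
                .map(|line| {
                    // real code parses with oxttl; splitting stands in here
                    let mut it = line.split(' ');
                    [
                        rodeo.get_or_intern(it.next().unwrap_or("")),
                        rodeo.get_or_intern(it.next().unwrap_or("")),
                        rodeo.get_or_intern(it.next().unwrap_or("")),
                    ]
                })
                .collect()
        }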

The speedup with a Core i9-12900k is quite drastic: from ~14s using lasso single-threaded, back up to around 15.5s with rayon and a single thread (overhead), and then down to ~5.7s when using the maximum amount of parallelism (apparently 24 threads with this CPU).

I also tried to limit the number of NT readers, but even using 16 threads was noticeably slower, so I just gave it everything.
Given that the persondata_en.nt file used for those measurements is 1.2 GB in size, I think a sub-6s time is quite good and might be a good reason to publish the CLI, as this could be a real benefit for users over the C++ and Java implementations.

The C++ implementation took 27.5s in a test with the same file.

However, I noticed that when converting the file to and from HDT, certain strings had their double backslashes increased. I'm not sure if this has been the case all along or is due to the changes, and whether we can put this in the test cases.

191c191
< <http://dbpedia.org/resource/%22Bassy%22_Bob_Brockmann> <http://xmlns.com/foaf/0.1/name> "\\\\\\\\Bassy\\\\\\\\ Bob Brockmann"@en.
---
> <http://dbpedia.org/resource/%22Bassy%22_Bob_Brockmann> <http://xmlns.com/foaf/0.1/name> "\\\\\\\\\\\\\\\\Bassy\\\\\\\\\\\\\\\\ Bob Brockmann"@en.
197c197
< <http://dbpedia.org/resource/%22Big%22_Donnie_MacLeod> <http://xmlns.com/foaf/0.1/name> "\\\\\\\\Big\\\\\\\\ Donnie MacLeod"@en.
---
> <http://dbpedia.org/resource/%22Big%22_Donnie_MacLeod> <http://xmlns.com/foaf/0.1/name> "\\\\\\\\\\\\\\\\Big\\\\\\\\\\\\\\\\ Donnie MacLeod"@en.

@GregHanson
Collaborator Author

However return types like Result<(Vec<[usize; 3]>, HashSet<usize>, HashSet<usize>, HashSet<usize>, Vec<String>)> suggest to me that this is not an easy and natural separation.

Agreed. The separation is nice from a benchmark standpoint, but the complexity of the return types doesn't make sense in the long term.

you were bringing up oxttl a while ago, so I'm sorry for not believing in it earlier :-)

Well, originally I did a time comparison between the sophia and oxrdf parsers and they were close in performance, with sophia being just a little faster, so I was willing to drop oxrdf - BUT I never tested which library handled parallelization better. I had wondered if oxttl might handle the parallelization better as we started diving into this investigation, but I backburnered it since I thought switching libraries was off the table :D I was about to try a parallel BufReader implementation.

@GregHanson
Collaborator Author

However, I noticed that when converting the file to and from HDT, certain strings had their double backslashes increased.

I would not be surprised. Quote escaping was something I had quite a few problems/hacks with when I was originally using the C++ version for conversion, oxrdf for query evaluation, and also the underlying HDT string representation/conversion.

@GregHanson
Collaborator Author

GregHanson commented Oct 16, 2025

Actually, @KonradHoeffner, how was the original persondata_en.hdt file generated? Because Tpt pointed out that the original C++ RDF2HDT implementation was NOT escaping quotes properly (link). So the additional backslashes may be ripple effects of that original improper conversion, if it was generated using the C++ tooling. What was the original source for the RDF?

@GregHanson
Collaborator Author

GregHanson commented Oct 16, 2025

And wow, the latest version more than halves the conversion time:

latest as of 1e331d3

$ for i in {1..5}; do target/release/hdt convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt; done
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 14.79s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 14.66s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 14.60s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 14.58s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 14.87s

numbers reported earlier for 0d00c15

$ for i in {1..5}; do target/release/hdt convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt; done
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 37.01s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 34.43s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 41.27s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 38.07s
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 41.69s

@KonradHoeffner
Owner

KonradHoeffner commented Oct 17, 2025

Actually, @KonradHoeffner, how was the original persondata_en.hdt file generated? Because Tpt pointed out that the original C++ RDF2HDT implementation was NOT escaping quotes properly (link). So the additional backslashes may be ripple effects of that original improper conversion, if it was generated using the C++ tooling. What was the original source for the RDF?

I copied that over from the Sophia RDF benchmark by @pchampin, but unfortunately I don't know why that file was chosen in particular or if there are known issues for it, only that it was downloaded from http://downloads.dbpedia.org/2016-10/core-i18n/en/persondata_en.ttl.bz2 and then converted to N-Triples.

@KonradHoeffner KonradHoeffner merged commit 26c8890 into KonradHoeffner:main Oct 22, 2025
1 check passed
KonradHoeffner added a commit that referenced this pull request Oct 22, 2025
* parallel triple encoding, parallel dict build
* create separate function in FourDictSect::read_nt
* make Clippy happy
* parallelize set operations
* save a lot of time using channels
* more parallelization
* avoid passing String's during read_nt, use HashSet for predicates to avoid duplicates
* adapt benchmark to new read_nt helper functions
* get rid of the mpsc channel again
* use bitsets instead of hashsets for string indices
* drastically speed up NT -> HDT conversion using Rayon, oxttl and ThreadedRodeo
* remove now unnecessary sophia feature flags behind read_nt functions
* remove now unnecessary sophia feature flags behind read_nt functions
* start refactoring read_nt code into its own file
* more refactoring
* fix hdt.rs path import
* remove cli and nt from default features
* feature gate benchmark
* make clippy happy
* refactor index pool
* upgrade version to 0.5.0

---------

Co-authored-by: Konrad Höffner <[email protected]>