Add rayon crate for parallel processing #91
Great! I originally wanted to parallelize this part:

```rust
nt::parse_bufread(r).for_each_triple(|q| {
    // execute this in parallel using into_par_iter()
    [...]
})
```

Unfortunately that didn't work, because Sophia doesn't yield an actual iterator at this step.
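One workaround, not part of this PR and only a minimal sketch, would be to buffer the parsed triples into owned values first and only then hand them to Rayon. The function and types below are placeholders (Sophia's actual triple types are not used); the point is just the collect-then-`into_par_iter()` pattern:

```rust
use rayon::prelude::*;

// Hypothetical sketch: collect owned triples into a Vec first (e.g. inside
// for_each_triple), then process the buffer in parallel with Rayon.
fn encode_in_parallel(triples: Vec<[String; 3]>) {
    triples
        .into_par_iter() // Rayon turns the Vec into a parallel iterator
        .for_each(|[s, p, o]| {
            // per-triple work would go here, e.g. dictionary encoding
            let _ = (s.len(), p.len(), o.len());
        });
}
```

The obvious trade-off is the extra memory for the buffer, which is probably why streaming parsing and parallel processing don't combine easily here.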
I apparently need to get better at criterion; for now I have some metrics from manually capturing the timing of the different parts of FourDictSect::read_nt using logs.
I used a larger file from https://download.bio2rdf.org/files/release/4/taxonomy/taxonomy-nodes.nq.gz which I had [...]. Just watching my [...]: the encoding actually performs better, but building the dictionary takes longer when run in parallel. Maybe trying to run all 4 [...]
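For reference, the manual timing mentioned above is just wrapping each phase in an `Instant` measurement and logging the elapsed time; a minimal sketch, with the phases themselves only as commented placeholders rather than the actual read_nt internals:

```rust
use std::time::Instant;

// Sketch of manual phase timing via logs; the phases are placeholders.
fn timed_read_nt() {
    let start = Instant::now();
    // ... collect and deduplicate terms ...
    log::info!("term collection took {:?}", start.elapsed());

    let start = Instant::now();
    // ... build the four dictionary sections ...
    log::info!("dictionary building took {:?}", start.elapsed());
}
```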
I made an attempt at separating the 3 core components of FourDictSect::read_nt, so that the functions can be tested individually via criterion.
The benchmarks are using this NT file. I tried using the [...]:

```
$ riot --count -q persondata_en.ttl
14:26:29 WARN riot :: [line: 4, col: 94] Lexical form '1860-2-21' not valid for datatype XSD date
14:26:29 WARN riot :: [line: 8, col: 94] Lexical form '1927-11-3' not valid for datatype XSD date
14:26:29 WARN riot :: [line: 13, col: 91] Lexical form '1884-11-5' not valid for datatype XSD date
...
```

They take a long time to run, so a different sample NT file may have to be used.
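Separating the phases means each one can get its own criterion benchmark. Purely as illustration, a minimal bench skeleton could look like the following; the input setup is hypothetical and the measured closure only contains a placeholder where the real phase (e.g. dictionary building) would be called:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical benchmark skeleton for one separated phase.
fn bench_dict_building(c: &mut Criterion) {
    // Prepare the input once, outside the measured closure.
    let terms: Vec<String> = (0..10_000).map(|i| format!("http://example.org/{i}")).collect();

    c.bench_function("dictionary_read_nt/dict_building", |b| {
        b.iter(|| {
            // placeholder for the real call, e.g. building the dictionary from `terms`
            black_box(terms.len())
        })
    });
}

criterion_group!(benches, bench_dict_building);
criterion_main!(benches);
```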
Great work! Which CPU did you run those measurements on? Does it have enough cores to run all the threads on physical cores?
If the speedup is minor even for very large files, the question is also whether it is worth the developer time and the increase in compilation time and binary size to add the rayon dependency, especially if this only helps with converting NT to HDT, which is probably not the main focus of the library.
Can you try it again with the CLI branch? On my low-end dual-core laptop (still serial processing) it takes a while but does not print errors:

```
hdt$ cargo build --release && cp target/release/hdt rdf
hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 11.11s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 60.35s
```
Testing it on my Intel i9 10900k with 10 cores / 20 threads (serial original one, cli branch):

```
hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 8.44s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 54.08s
```

Interesting that it is so close in timing to the low-end laptop CPU.
```
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 53.77s
```

I tried it with scoped threads as well, to skip the rayon dependency:

```rust
let [shared, predicates, subjects, objects] = thread::scope(|s| {
    [&shared_terms, &predicate_terms_ref, &unique_subject_terms, &unique_object_terms]
        .map(|terms| s.spawn(|| DictSectPFC::compress(terms, block_size)))
        .map(|t| t.join().unwrap())
});
println!("compression finished");
let dict = FourSectDict { shared, predicates, subjects, objects };
```
I tried again with the updated CLI branch, same errors from [...].
I am unfortunately running on WSL, so my numbers are going to be on the high side no matter what. Tomorrow I should be able to try running it on an HPC node and get some more juicy numbers.
I was surprised it didn't lead to better performance, to be honest. If the numbers are negligible, I'm OK with skipping it for now.
I would assume that the kind of CPU would not change the relative numbers much, only the absolute numbers.
I think it's still great to have a benchmark for that, and we could try it with Rust threads first, as then we don't have the downside of an additional dependency.
(force-pushed from 2a111d2 to 41cf010)
Wow, the Intel i9 12900k is quite fast (still serial):

```
hdt$ rdf convert tests/resources/persondata_en_100k.hdt tests/resources/persondata_en_100k.nt
Successfully converted "tests/resources/persondata_en_100k.hdt" (1.6 MiB) to "tests/resources/persondata_en_100k.nt" (11.2 MiB) in 0.08s
hdt$ rdf convert tests/resources/persondata_en_1M.hdt tests/resources/persondata_en_1M.nt
Successfully converted "tests/resources/persondata_en_1M.hdt" (11.1 MiB) to "tests/resources/persondata_en_1M.nt" (111.1 MiB) in 0.59s
hdt$ rdf convert tests/resources/persondata_en.hdt tests/resources/persondata_en.nt
Successfully converted "tests/resources/persondata_en.hdt" (85.6 MiB) to "tests/resources/persondata_en.nt" (1.1 GiB) in 5.99s
hdt$ rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 37.60s
```
**build_dict_from_terms (12900k, persondata_en_1M.nt)**

Using 1M because one run of just one part already takes 40 seconds.

**single thread**

```rust
let (shared, subjects, predicates, objects) = (
    DictSectPFC::compress(&shared_terms, block_size),
    DictSectPFC::compress(&unique_subject_terms, block_size),
    DictSectPFC::compress(&predicate_terms_ref, block_size),
    DictSectPFC::compress(&unique_object_terms, block_size),
);
```

```
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
    Finished `bench` profile [optimized] target(s) in 0.03s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [278.28 ms 297.21 ms 317.14 ms]
                        change: [−6.6888% −0.0491% +7.0741%] (p = 0.99 > 0.05)
                        No change in performance detected.
cargo bench --bench criterion -- dictionary_read_nt  40.00s user 0.49s system 101% cpu 40.059 total
```

**rayon**

```rust
let ((shared, predicates), (subjects, objects)) = rayon::join(
    || {
        rayon::join(
            || DictSectPFC::compress(&shared_terms, block_size),
            || DictSectPFC::compress(&predicate_terms_ref, block_size),
        )
    },
    || {
        rayon::join(
            || DictSectPFC::compress(&unique_subject_terms, block_size),
            || DictSectPFC::compress(&unique_object_terms, block_size),
        )
    },
);
```

```
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.37s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [300.92 ms 304.38 ms 308.23 ms]
                        change: [−4.1078% +2.4132% +9.4324%] (p = 0.50 > 0.05)
                        No change in performance detected.
cargo bench --bench criterion -- dictionary_read_nt  49.85s user 0.81s system 121% cpu 41.809 total
```

I don't know why, but I only saw one active thread in htop.

**std::thread** (add-rayon branch)

```rust
use std::thread;

let [shared, predicates, subjects, objects] = thread::scope(|s| {
    [&shared_terms, &predicate_terms_ref, &unique_subject_terms, &unique_object_terms]
        .map(|terms| s.spawn(|| DictSectPFC::compress(terms, block_size)))
        .map(|t| t.join().unwrap())
});
```

```
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.34s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [241.45 ms 245.12 ms 248.71 ms]
                        change: [−18.237% −17.093% −15.955%] (p = 0.00 < 0.05)
                        Performance has improved.
cargo bench --bench criterion -- dictionary_read_nt  49.56s user 0.74s system 122% cpu 41.142 total

hdt$ time cargo bench --bench criterion -- dictionary_read_nt
    Finished `bench` profile [optimized] target(s) in 0.11s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
dictionary_read_nt/dict_building
                        time:   [296.98 ms 299.35 ms 301.95 ms]
                        change: [+20.124% +22.126% +24.255%] (p = 0.00 < 0.05)
                        Performance has regressed.
cargo bench --bench criterion -- dictionary_read_nt  38.53s user 0.52s system 102% cpu 38.169 total
```

**Conclusions**

We should test with the complete file as well, because there is a lot of variation between runs, maybe because of the CPU getting hot or reaching its boost limits?
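Regarding only one active thread showing up in htop: a quick sanity check (not part of this PR, just a hedged sketch) is to print the size of Rayon's global thread pool; if it reports more than one worker, the apparent serialization comes from somewhere else, e.g. one section dominating the total runtime:

```rust
// Sanity-check sketch: how many worker threads does Rayon's global pool have,
// and does rayon::join actually get to use them?
fn main() {
    println!("rayon worker threads: {}", rayon::current_num_threads());

    // rayon::join may run both closures in parallel on that pool.
    let (a, b) = rayon::join(
        || (0..1_000_000u64).sum::<u64>(),
        || (0..1_000_000u64).map(|x| x ^ 0xdead_beef).sum::<u64>(),
    );
    println!("{a} {b}");
}
```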
**build_dict_from_terms (12900k, persondata_en.nt)**

**single thread** (add-rayon branch)

```
hdt$ time cargo bench --bench criterion -- dictionary_read_nt
   Compiling hdt v0.4.0 (/home/konrad/projekte/rust/hdt)
    Finished `bench` profile [optimized] target(s) in 2.27s
     Running benches/criterion.rs (target/release/deps/criterion-ffa779d4837a26db)
Benchmarking dictionary_read_nt/dict_building: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 44.5s.
dictionary_read_nt/dict_building
                        time:   [4.3421 s 4.3909 s 4.4389 s]
                        change: [+1346.3% +1366.8% +1386.3%] (p = 0.00 < 0.05)
                        Performance has regressed.
cargo bench --bench criterion -- dictionary_read_nt  272.16s user 3.61s system 102% cpu 4:27.85 total
```

**rayon**

[...]

**std::thread**

[...]
**serial set operations**

It seems to me as if most of the time is spent in this spot here:

```rust
let shared_terms: BTreeSet<&str> =
    subject_terms.intersection(object_terms).map(std::ops::Deref::deref).collect();
let unique_subject_terms: BTreeSet<&str> =
    subject_terms.difference(object_terms).map(std::ops::Deref::deref).collect();
let unique_object_terms: BTreeSet<&str> =
    object_terms.difference(subject_terms).map(std::ops::Deref::deref).collect();
```

```
dictionary_read_nt/dict_building
                        time:   [4.2262 s 4.2380 s 4.2554 s]
                        change: [+6.3368% +6.6206% +7.0208%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
cargo bench --bench criterion -- dictionary_read_nt  265.32s user 3.46s system 103% cpu 4:20.89 total
```

**parallel set operations**

```rust
use std::thread;

let [shared_terms, unique_subject_terms, unique_object_terms]: [BTreeSet<&str>; 3] = thread::scope(|s| {
    [
        s.spawn(|| subject_terms.intersection(object_terms).map(std::ops::Deref::deref).collect()),
        s.spawn(|| subject_terms.difference(object_terms).map(std::ops::Deref::deref).collect()),
        s.spawn(|| object_terms.difference(subject_terms).map(std::ops::Deref::deref).collect()),
    ]
    .map(|t| t.join().unwrap())
});
```

```
dictionary_read_nt/dict_building
                        time:   [3.7513 s 3.7591 s 3.7686 s]
                        change: [−11.706% −11.299% −10.952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
cargo bench --bench criterion -- dictionary_read_nt  262.04s user 3.57s system 104% cpu 4:14.50 total
```

**Heaptrack**

Serial peak RSS: 35.4 MB. Parallel peak RSS: 35.5 MB.

**Conclusions**

Performing the set operations in parallel has a positive impact on performance of around 11%, thus I'm going to merge that.
(force-pushed from d6f0fd2 to ab7566f)
**Visualize performance of serial conversion to see bottlenecks**

Note: "rdf" is my alias for the CLI binary (add-rayon branch).

```
hdt$ perf record --call-graph=dwarf rdf convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt
Successfully converted "tests/resources/persondata_en.nt" (1.1 GiB) to "/tmp/persondata_en.hdt" (81.9 MiB) in 38.26s
[ perf record: Woken up 4917 times to write data ]
Warning:
Processed 184626 events and lost 1 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 1230.130 MB perf.data (152806 samples) ]
hdt$ perf script > /tmp/convert.pdf
Warning: [...]
```
I think I have to merge in the CLI branch, otherwise it's too cumbersome to test those changes.
(force-pushed from ab7566f to aef4e0c)
**Memory usage for conversion persondata_en_1M.nt > persondata_en_1M.hdt**

With Rayon: 523 MB
The BTreeSet also seems to be problematic because of the many relatively slow insertions; using a HashSet speeds it up a bit, to 22.9s.
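For illustration, the idea is roughly the following (a minimal sketch, not the actual read_nt code; it assumes the terms only need to be sorted once, after deduplication):

```rust
use std::collections::{BTreeSet, HashSet};

// Sketch: BTreeSet keeps the terms sorted but pays O(log n) per insertion,
// while HashSet inserts in O(1) on average and can be sorted once at the end.
fn collect_sorted_terms(terms: impl Iterator<Item = String>) -> Vec<String> {
    let set: HashSet<String> = terms.collect(); // cheap deduplicating insertions
    let mut sorted: Vec<String> = set.into_iter().collect();
    sorted.sort_unstable(); // a single sort instead of keeping order on every insert
    sorted
}

// Equivalent BTreeSet version for comparison: ordered during insertion.
fn collect_sorted_terms_btree(terms: impl Iterator<Item = String>) -> Vec<String> {
    let set: BTreeSet<String> = terms.collect();
    set.into_iter().collect() // already sorted
}
```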
This also takes a few seconds; I wonder if it's faster to use a HashSet instead. Result: no, that takes 25.5s.
(force-pushed from dd00edc to a7e9e21)
(force-pushed from 0d00c15 to 355c307)
(force-pushed from 5c7fc00 to c72f09f)
I finally managed to rebase this branch onto the current state of main.
By the way, could the hashing performance be optimized by using some other data structure? |
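Not something this PR settled on, but one common option is to keep the HashSet and only swap the hasher for a faster non-cryptographic one; a hedged sketch, assuming the `ahash` crate were added as a dependency:

```rust
use std::collections::HashSet;

// Sketch: same HashSet API, but with ahash's faster hasher instead of the default SipHash.
// Assumes `ahash` is in Cargo.toml; not part of this PR.
type FastSet = HashSet<String, ahash::RandomState>;

fn dedup_terms(terms: impl Iterator<Item = String>) -> FastSet {
    let mut set = FastSet::default(); // ahash::RandomState implements Default
    for t in terms {
        set.insert(t);
    }
    set
}
```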
Agreed. The separation is nice from a benchmark standpoint, but the complexity of the return types doesn't make sense in the long term.
Well, originally I did a time comparison between the sophia and oxrdf parsers and they were close in performance, with sophia being just a little faster, so I was willing to drop it. BUT I never tested which library handled parallelization better. I had wondered if oxttl might handle the parallelization better as we started diving into this investigation, but I backburnered it since I thought switching libraries was off the table :D I was about to try a parallel BufReader implementation.
I would not be surprised. Quote escaping was something I had quite a few problems/hacks to incorporate when I was originally using the C++ version for conversion, oxrdf for query evaluation, and also the underlying hdt string representation/conversion.
Actually, @KonradHoeffner, how was the original [...]?
And wow, the latest more than halves the conversion time:

latest as of 1e331d3:

```
$ for i in {1..5}; do target/release/hdt convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt; done
[...]
```

numbers reported earlier for 0d00c15:

```
$ for i in {1..5}; do target/release/hdt convert tests/resources/persondata_en.nt /tmp/persondata_en.hdt; done
[...]
```
I copied that over from the Sophia RDF benchmark by @pchampin, but unfortunately I don't know why that file was chosen in particular or if there are known issues for it, only that it was downloaded from http://downloads.dbpedia.org/2016-10/core-i18n/en/persondata_en.ttl.bz2 and then converted to N-Triples. |
(force-pushed from 84a57b7 to 83f451f)
(force-pushed from 83f451f to a06c8ab)
(force-pushed from 608f481 to 0fd01c0)
* parallel triple encoding, parallel dict build
* create separate function in FourDictSect::read_nt
* make Clippy happy
* parallelize set operations
* save a lot of time using channels
* more parallelization
* avoid passing Strings during read_nt, use HashSet for predicates to avoid duplicates
* adapt benchmark to new read_nt helper functions
* get rid of the mpsc channel again
* use bitsets instead of hashsets for string indices
* drastically speed up NT -> HDT conversion using Rayon, oxttl and ThreadedRodeo
* remove now unnecessary sophia feature flags behind read_nt functions
* remove now unnecessary sophia feature flags behind read_nt functions
* start refactoring read_nt code into its own file
* more refactoring
* fix hdt.rs path import
* remove cli and nt from default features
* feature gate benchmark
* make clippy happy
* refactor index pool
* upgrade version to 0.5.0

---------

Co-authored-by: Konrad Höffner <[email protected]>
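The ThreadedRodeo mentioned above is the concurrent string interner from the lasso crate; purely as illustration (a hedged sketch, not the PR's actual code), its basic usage looks roughly like this:

```rust
use lasso::ThreadedRodeo;

// Sketch: intern term strings concurrently and work with small integer keys afterwards.
fn main() {
    let rodeo = ThreadedRodeo::default();

    // get_or_intern is safe to call from multiple threads (e.g. inside Rayon closures).
    let key_a = rodeo.get_or_intern("http://example.org/subject");
    let key_b = rodeo.get_or_intern("http://example.org/subject");
    assert_eq!(key_a, key_b); // identical strings map to the same key

    // keys can be resolved back to &str when writing the dictionary sections
    println!("{}", rodeo.resolve(&key_a));
}
```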





@KonradHoeffner I'm not sure if you and I were on a similar thought process with regard to rayon, because I found your issue here, but this PR has some performance optimizations for the following: [...]