as_bytes() panics if string is not utf8 #57

@KonradHoeffner

[...] found an unhandled bug while testing with dblp; pretty sure as_bytes() barfs if the string is not UTF-8?

./rdf2hdt convert -i dblp.ttl -o dblp-rust.hdt -vvv
[DEBUG hdt::rdf2hdt::rdf_reader] converting dblp.ttl to nt format
[DEBUG hdt::rdf2hdt::rdf_reader] RDF to NTriple convert time: 939.855296477s
[DEBUG hdt::rdf2hdt::dictionary] Four Section Dictions sort time: 1837.413708669s
[DEBUG hdt::rdf2hdt::dictionary] Encoding triples time: 1299.179455131s
[DEBUG hdt::rdf2hdt::dictionary] Dictionary build time: 3136.655445115s
[DEBUG hdt::rdf2hdt::bitmap_triples] BitmapTriples build time: 81.938644095s
[DEBUG hdt::rdf2hdt::builder] HDT build time: 4187.132880571s

thread 'main' panicked at src/rdf2hdt/dictionary.rs:237:52:
byte index 8 is not a char boundary; it is inside 'μ' (bytes 7..9) of `"$\\mu$μBench: An Open-Source Factory of Benchmark Microservice Applications."`
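
For reference, this class of panic can be reproduced in isolation. A Rust `&str` is always valid UTF-8, and the message above is what the standard library prints when a `str` is sliced at a byte offset that falls inside a multi-byte character. The sketch below is only an illustration: what the code at `dictionary.rs:237` actually does is not shown in this report, and `TITLE` and `safe_prefix` are hypothetical names introduced here.

```rust
/// The literal from the panic message, shortened. `μ` occupies bytes 7..9.
const TITLE: &str = r#""$\\mu$μBench""#;

/// Hypothetical helper: longest prefix of `s` that is at most `max_bytes`
/// long and ends on a UTF-8 character boundary.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // Byte 8 is the second byte of `μ`, so it is not a char boundary.
    assert!(TITLE.is_char_boundary(7));
    assert!(!TITLE.is_char_boundary(8));

    // `&TITLE[..8]` would panic with "byte index 8 is not a char boundary".
    // `str::get` returns `None` instead of panicking.
    assert!(TITLE.get(..8).is_none());

    // Slicing the raw bytes never panics on char boundaries,
    // but the result is a `&[u8]`, not a `&str`.
    assert_eq!(TITLE.as_bytes()[..8].len(), 8);

    // Backing off to the previous boundary keeps a valid `&str`.
    assert_eq!(safe_prefix(TITLE, 8), r#""$\\mu$"#);
}
```

If the code needs a fixed-length prefix for comparison only, working on `as_bytes()` may be enough. If it needs a `&str`, `str::get` or a boundary check avoids the panic.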

Looking at the converted NT triples:

$ grep "An Open-Source Factory of Benchmark Microservice Applications." dblp.nt
<https://dblp.org/rec/journals/tpds/DettiFP23> <http://www.w3.org/2000/01/rdf-schema#label> "Andrea Detti et al.: $\\mu$μBench: An Open-Source Factory of Benchmark Microservice Applications. (2023)".
<https://dblp.org/rec/journals/tpds/DettiFP23> <https://dblp.org/rdf/schema#title> "$\\mu$μBench: An Open-Source Factory of Benchmark Microservice Applications.".

and the source TTL:

$ grep "An Open-Source Factory of Benchmark Microservice Applications." dblp.ttl
        rdfs:label "Andrea Detti et al.: $\\mu$\u03BCBench: An Open-Source Factory of Benchmark Microservice Applications. (2023)" ;
        dblp:title "$\\mu$\u03BCBench: An Open-Source Factory of Benchmark Microservice Applications." ;
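
The difference between the two files is that the Turtle source writes the character as the escape `\u03BC`, while the converted N-Triples contain the decoded `μ`, which takes two bytes in UTF-8. A minimal sketch of that decoding step follows. `decode_u_escape` is a hypothetical helper that handles only the four-hex-digit form, and it is not taken from the converter.

```rust
/// Hypothetical helper: decodes a single `\uXXXX` escape into a `char`.
/// Returns `None` for anything that is not exactly `\u` plus four hex digits.
fn decode_u_escape(escape: &str) -> Option<char> {
    let hex = escape.strip_prefix("\\u")?;
    if hex.len() != 4 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}

fn main() {
    // The escape from the TTL source decodes to GREEK SMALL LETTER MU.
    let mu = decode_u_escape("\\u03BC").expect("valid escape");
    assert_eq!(mu, 'μ');

    // In UTF-8 the character is two bytes long, which is why byte
    // offsets and character offsets diverge after it.
    assert_eq!(mu.len_utf8(), 2);
    let mut buf = [0u8; 4];
    assert_eq!(mu.encode_utf8(&mut buf).as_bytes(), &[0xCE, 0xBC]);
}
```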

Originally posted by @GregHanson in #56 (comment)

Labels: bug (Something isn't working)
