[Variant] Introduce parquet-variant-compute crate to transform batches of JSON strings to and from Variants #7884
Conversation
parquet-variant-compute/Cargo.toml
Outdated
repository = { workspace = true }
authors = { workspace = true }
keywords = ["arrow", "parquet", "variant"]
readme = "README.md"
@alamb Let me know if I should create a readme.
I think a placeholder that says "part of arrow-rs" and what it contains is probably enough
I would basically follow along the model of other crates in the repo
I just removed the readme line like some other crates in the repo.
// Zero-copy builder
// The size per JSON string is assumed to be 128 bytes. If this holds true, resizing could be
// minimized to improve performance.
@alamb I've tried to minimize copying here. Took help from AI figuring this out. Please see if this looks good to you.
alamb
left a comment
Thanks @harshmotw-db -- I think this is definitely the right direction and pretty close
I have some thoughts mostly on the arrow array types involved, but otherwise everything looks 👌
Major thoughts:
Different Output Type
- I suggest the output (and what will become the canonical representation of Variant in arrow) be Struct<BinaryView, BinaryView> rather than Struct<Binary, Binary> -- BinaryView allows both 1) multiple rows to point to the same underlying bytes, and 2) storing the values in multiple buffers. BinaryArray requires one large buffer.
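For reference, a minimal sketch of what that suggested layout could look like, assuming current arrow-rs APIs (BinaryViewBuilder, StructArray::new); the empty builders are just placeholders:

use std::sync::Arc;
use arrow::array::{ArrayRef, BinaryViewBuilder, StructArray};
use arrow::datatypes::{DataType, Field, Fields};

fn empty_variant_struct() -> StructArray {
    // each field is a BinaryView column: views can inline short values or
    // reference shared data buffers, unlike BinaryArray's single large buffer
    let metadata: ArrayRef = Arc::new(BinaryViewBuilder::new().finish());
    let value: ArrayRef = Arc::new(BinaryViewBuilder::new().finish());
    let fields = Fields::from(vec![
        Field::new("metadata", DataType::BinaryView, false),
        Field::new("value", DataType::BinaryView, false),
    ]);
    StructArray::new(fields, vec![metadata, value], None)
}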
Different Input types
- Eventually I think it would be good to have versions of this cast function for the (many) different kinds of string arrays in Arrow (StringArray, LargeStringArray, StringViewArray, etc). However, I think we can do that as a follow-on PR. I can help with the generics too, as I am pretty familiar with how they work in arrow.
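A rough sketch of what those generics could look like, assuming the usual OffsetSizeTrait pattern (StringViewArray has a different layout and would need its own impl or a further level of abstraction):

use arrow::array::{GenericStringArray, OffsetSizeTrait};

fn for_each_json<O: OffsetSizeTrait>(input: &GenericStringArray<O>) {
    // one body shared by StringArray (i32 offsets) and LargeStringArray (i64)
    for row in input.iter() {
        match row {
            Some(json) => { /* parse `json` into a Variant here */ }
            None => { /* append a null row */ }
        }
    }
}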
parquet-variant-compute/Cargo.toml
Outdated
[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
parquet-variant = { path = "../parquet-variant" }
Can you please make this "workspace=true" (otherwise cargo publish gets angsty)?
- parquet-variant = { path = "../parquet-variant" }
+ parquet-variant = { workspace = true }
error: failed to parse manifest at `/Users/harsh.motwani/arrow-rs/parquet-variant-compute/Cargo.toml`
Caused by:
error inheriting `parquet-variant` from workspace root manifest's `workspace.dependencies.parquet-variant`
Caused by:
`dependency.parquet-variant` was not found in `workspace.dependencies`
Adding parquet-variant = { path = "./parquet-variant" } to the root Cargo.toml gets it working, so that's what I will do for now.
    DataType::Struct(fields.into())
}

pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<StructArray, ArrowError> {
We should definitely add some docs / example to this kernel
I also might suggest calling it cast_to_variant but that is more of a personal preference
parse_json is a name we could consider for this. It's the name of the function that does this (json string to variant) in spark/databricks.
I imagine we'll eventually want a top-level cast_to_variant that converts strong types to variant? And passing a string array to such a function should produce a variant column full of string (or short-string) variant values?
This method here is json-parsing strings and casting the result to variant, not casting strings directly to variant?
Yeah, I agree with @scovich. As for parse_json, it makes sense to name it that way if it is a well defined SQL function. However, from a library point-of-view, I think parse_json is too vague.
Well technically, it would be parquet_variant_compute::parse_json, which is a little clearer? Maybe we should add some nested module structure like arrow-compute does, so we get e.g. parquet_variant_compute::variant::parse_json. People would use that as variant::parse_json or parse_json if they prefer parsimony?
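A tiny sketch of that module idea (hypothetical names, not an existing API):

pub mod variant {
    use arrow::array::{ArrayRef, StructArray};
    use arrow_schema::ArrowError;

    /// JSON strings -> variant struct array (body elided in this sketch)
    pub fn parse_json(_input: &ArrayRef) -> Result<StructArray, ArrowError> {
        todo!("sketch only")
    }
}

Callers could then write variant::parse_json(...) or import parse_json directly, as suggested above.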
Maybe we can call it cast_json_to_variant and cast_variant_to_json 🤔
parquet-variant-compute/src/lib.rs
Outdated
mod from_json;
mod to_json;

/// Parse a batch of JSON strings into a batch of Variants represented as
Typically these comments would go on the function itself, not its pub use
    value_builder.append_null();
    validity.append(false);
} else {
    let mut vb = VariantBuilder::new();
This pattern will cause the variant values to be copied twice -- once into the builder's buffers and then once into the output binary builder, which is probably ok for the first version.
With some care I think we will be able to avoid copying the values, though it will take using the lower level APIs (and building offsets directly)
Oh, so just to be clear the two copies you are referring to are metadata_builder.append_value(&metadata); and metadata_builder.finish() right? If so, I'll take care of it in this PR itself.
Not sure I followed that one -- maybe it's a second issue?
I was just referring to all the try_new_with_metadata calls. Today VariantMetadata is Copy, so it's easy to forget that each such call is quietly copying an ~80 byte object. That cost could perhaps add up when iterating over hundreds/thousands of child objects? Or maybe the compiler is really good at making it ~free, since really only one meaningfully exists at a time?
Well, would this resolve the issue you're talking about:

pub fn try_new(metadata: &'m [u8], value: &'v [u8]) -> Result<Self, ArrowError> {
    // let metadata = VariantMetadata::try_new(metadata)?;
    Self::try_new_with_metadata(VariantMetadata::try_new(metadata)?, value)
}
Edit: Oh no never mind. There are more copies downstream. I suppose that is more of a library issue that can potentially be fixed separately
I have reduced copying in this function by manually constructing binary buffers.
The above would definitely not help, because each VariantMetadata::try_new would re-validate the same byte buffer!
Oh, so just to be clear the two copies you are referring to are metadata_builder.append_value(&metadata); and metadata_builder.finish() right? If so, I'll take care of it in this PR itself.
What I was really imagining was updating VariantBuilder so it could take a pre-existing buffer (Vec) and append to it, rather than writing into a new buffer and then copying that into the output bytes.
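A hypothetical sketch of that direction (illustrative names, not the current VariantBuilder API): the builder borrows a caller-owned Vec<u8> and appends in place, so finishing a row just records its byte range instead of copying bytes out:

struct InPlaceVariantBuilder<'a> {
    out: &'a mut Vec<u8>,
    start: usize,
}

impl<'a> InPlaceVariantBuilder<'a> {
    fn new(out: &'a mut Vec<u8>) -> Self {
        let start = out.len();
        Self { out, start }
    }

    /// append already-encoded variant bytes directly to the shared buffer
    fn append_encoded(&mut self, encoded: &[u8]) {
        self.out.extend_from_slice(encoded);
    }

    /// the output array's offsets can be derived from this range; no second copy
    fn finish(self) -> std::ops::Range<usize> {
        self.start..self.out.len()
    }
}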
variant_to_json(&mut json_buffer, &variant)?;
let written = (json_buffer.len() - start_len) as i32;
current_offset += written;
offsets.push(current_offset);
this looks good to me
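As a self-contained illustration of the pattern (assuming arrow's OffsetBuffer/ScalarBuffer types), each row's JSON lands in one shared buffer and only the running end offset is recorded per row:

use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer, ScalarBuffer};

fn strings_from_rows(rows: &[&str]) -> StringArray {
    let mut json_buffer: Vec<u8> = Vec::new();
    let mut offsets: Vec<i32> = vec![0]; // offsets[0] is always 0
    for row in rows {
        json_buffer.extend_from_slice(row.as_bytes());
        offsets.push(json_buffer.len() as i32); // end offset of this row
    }
    // the buffer is moved, not copied, into the array
    StringArray::new(
        OffsetBuffer::new(ScalarBuffer::from(offsets)),
        Buffer::from_vec(json_buffer),
        None, // no nulls in this sketch
    )
}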
let value_buffer = Buffer::from_vec(json_buffer);
let null_buffer = NullBuffer::new(validity.finish());

Ok(StringArray::new(
I think if possible we should consider using StringViewArray here instead of StringArray
I don't know about that. According to this documentation, the string must be stored in the view itself if it is short (<= 12 bytes). But how would we know that in advance? I think StringArray can more readily be implemented in a zero-copy fashion.

@alamb Wouldn't BinaryView have the same issues as the StringView issues I have highlighted here, since both are just views? The best middle ground is a byte array which also stores lengths with offsets, so we can have multiple offsets point to the same strings.
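For what it's worth, the view builders hide that decision: append_value picks inline vs. buffered per value, so callers never need to know lengths in advance. A small sketch:

use arrow::array::StringViewBuilder;

fn build_views() {
    let mut builder = StringViewBuilder::new();
    builder.append_value("short"); // <= 12 bytes: inlined in the view itself
    builder.append_value("a much longer string that spills into a shared data buffer");
    let array = builder.finish();
    assert_eq!(array.value(0), "short");
}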
value_buffer.extend(value);
value_offsets.push(value_current_offset);
value_validity.append(true);
println!("{value_current_offset} {metadata_current_offset}");
Don't we use tracing for logging in arrow?
This was some debugging code I forgot to remove. Thanks for catching it.
let input = StringArray::from(vec![
    Some("1"),
    None,
    Some("{\"a\": 32}"),
    Some("null"),
    None,
]);
Can we please cover more cases here?
I mean all the types that JSON supports. I see you already added int, nulls, and a simple dict with string and int here.
If it's a happy-path test, can we please add:
- default values (e.g., 0 for int, because some engines can represent NULL as a default value)
- booleans (both true/false)
- more nested JSON, not only 1-level nested JSON, e.g. "{{{{true: false}, "-1": "+1"}, 0: 1}}"
- some long strings to ensure we don't have any string size-based logic
default values
Not quite sure what you mean by "default values" or how an engine's NULL handling relates to string (json) -> variant parsing?
"{{{{true: false}, "-1": "+1"}, 0: 1}}"
I'm pretty sure JSON objects require string field names?
@gotocoding-DB The batch functions in this PR just run some underlying scalar functions on a whole batch of data. The underlying scalar functions have been validated on all sorts of inputs (this PR). I don't think the logical breadth of JSON test cases needs to be tested again.
I agree full coverage here is redundant -- maybe we can add a comment that says "full json parsing coverage is handled by the tests for json_to_variant"
@harshmotw-db You're right. If the internally used parsing function is tested, we don't need to duplicate the tests here. I'm personally using tests as another example of function usage (if it's not fully covered in function docs).
    validity.append(false);
} else {
    let mut vb = VariantBuilder::new();
    json_to_variant(input_string_array.value(i), &mut vb)?;
Can we please add a non-happy-path test, for example with invalid JSON, so this function fails and returns a well-known message to the user?
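Something like this sketch, assuming batch_json_string_to_variant surfaces the parse failure as an Err (the exact message is not asserted here):

#[test]
fn batch_json_string_to_variant_invalid_json() {
    use std::sync::Arc;
    use arrow::array::{ArrayRef, StringArray};

    // "{not valid json" cannot be parsed, so the whole batch should fail
    let input: ArrayRef = Arc::new(StringArray::from(vec![Some("{not valid json")]));
    let result = batch_json_string_to_variant(&input);
    assert!(result.is_err());
}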
}?;

// Zero-copy builders
let mut metadata_buffer: Vec<u8> = Vec::with_capacity(input.len() * 128);
Where does the 128 scale factor come from? It seems a stretch to assume that a batch of N JSON strings will produce 128*N bytes worth of unique field names? Especially as N grows and that scale factor becomes more and more dangerous?
If we really feel it's important to pre-allocate capacity, I'd recommend capping it at e.g. 1MB. But honestly, I'd just allocate a normal vec and let it grow normally, unless/until we have some proof that the guaranteed O(n) cost to append n bytes to a vec isn't good enough, and that pre-allocation actually helps in a wide variety of situations.
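A sketch of the capped variant, if pre-allocation is kept at all (the 128 bytes/row comes from the code above; the 1 MiB cap is an illustrative number, not a measured one):

fn initial_capacity(num_rows: usize) -> usize {
    const ASSUMED_BYTES_PER_ROW: usize = 128;
    const CAP: usize = 1024 * 1024; // 1 MiB; bounds worst-case over-allocation
    (num_rows * ASSUMED_BYTES_PER_ROW).min(CAP)
}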
I agree we should allocate the normal vec and let it grow -- if we want to pre-allocate I think we should allow the user to pass in the allocation size as an argument
Maybe something like:

let variant_array = JsonToVariant::new()
    .with_capacity(...)
    .parse(input_array)

use parquet_variant::{json_to_variant, VariantBuilder};

fn variant_arrow_repr() -> DataType {
    // The subfields are expected to be non-nullable according to the parquet variant spec.
The shredding spec makes value optional, with the possibility -- but not the requirement -- to have a typed_value instead when value is missing. Should we recognize that situation and throw a suitable "shredded variants not supported" error, instead of blowing up with an obscure schema mismatch error?
... but now that I think about it, that would be an issue when reading from parquet, which has nothing to do with this PR.
let input_string_array = match input.as_any().downcast_ref::<StringArray>() {
    Some(string_array) => Ok(string_array),
    None => Err(ArrowError::CastError(
        "Expected reference to StringArray as input".into(),
    )),
}?;
Why not use ok_or_else?
- let input_string_array = match input.as_any().downcast_ref::<StringArray>() {
-     Some(string_array) => Ok(string_array),
-     None => Err(ArrowError::CastError(
-         "Expected reference to StringArray as input".into(),
-     )),
- }?;
+ let input_string_array = input
+     .as_any()
+     .downcast_ref::<StringArray>()
+     .ok_or_else(|| ArrowError::CastError("Expected a StringArray as input".into()))?;
let mut validity = BooleanBufferBuilder::new(input.len());
for i in 0..input.len() {
    if input.is_null(i) {
        // The subfields are expected to be non-nullable according to the parquet variant spec.
We already know they're non-nullable... maybe the comment can explain that we're pushing valid-but-empty subfields, to maintain proper positioning?
Also: Could we create nullable sub-field arrays even tho the schema says they're non-nullable, and rely on nested null masks? Does that save space in case the variant column has a lot of null entries?
But I guess we still have to push something to the offset arrays, to maintain proper positioning... so valid-but-empty is probably the best we can do?
But I guess we still have to push something to the offset arrays, to maintain proper positioning... so valid-but-empty is probably the best we can do?
This offset layout is required by the Arrow spec for StringArrays
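A sketch of that valid-but-empty convention, assuming offset-vector bookkeeping like the code under review: a null row still pushes the current end offsets (i.e. a zero-length slice) so positions stay aligned, while nullness itself lives in the struct-level validity buffer:

fn append_null_row(
    metadata_offsets: &mut Vec<i32>,
    value_offsets: &mut Vec<i32>,
    metadata_end: i32,
    value_end: i32,
) {
    metadata_offsets.push(metadata_end); // empty metadata for this row
    value_offsets.push(value_end); // empty value for this row
}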
let variant_fields = match variant_arrow_repr() {
    DataType::Struct(fields) => fields,
    _ => unreachable!("variant_arrow_repr is hard-coded and must match the expected schema"),
};
Should variant_arrow_repr just return Fields instead of DataType? Then we don't have to unpack it.
It's also a 4-line method with a single call site (3 lines if we don't convert it to a DataType), so we might also consider removing the method entirely? Could also make the relationship with the StructArray more explicit with something like:
- let variant_fields = match variant_arrow_repr() {
-     DataType::Struct(fields) => fields,
-     _ => unreachable!("variant_arrow_repr is hard-coded and must match the expected schema"),
- };
+ let metadata_field = Field::new("metadata", metadata_array.data_type().clone(), false);
+ let value_field = Field::new("value", value_array.data_type().clone(), false);
+ let variant_fields = vec![metadata_field, value_field].into();
Should variant_arrow_repr just return Fields instead of DataType? Then we don't have to unpack it.
Yes, please
assert_eq!(metadata_array.value(3), &[1, 0, 0]);
assert_eq!(value_array.value(3), &[0]);

// Ensure that the subfields are not actually nullable
I'm not sure this is a useful thing to enforce? I'm pretty sure the arrow spec forbids making any assumptions about the values of child arrays at positions an ancestor has marked invalid?
let mut metadata_current_offset: i32 = 0;
metadata_offsets.push(metadata_current_offset);

let mut value_buffer: Vec<u8> = Vec::with_capacity(input.len() * 128);
As above... I'm not sure how helpful it really is to pre-allocate capacity based only on input length. Some variants will be very small, and an unbounded over-allocation could hog a lot of memory. Others will be much (much) larger than 128B each, and the vec will anyway end up making multiple capacity increases along the way.
I plan to give this another review tomorrow -- sorry I ran out of time today.
alamb
left a comment
Thank you @harshmotw-db , @scovich @gotocoding-DB and @Samyak2
I would like to get this PR merged sooner rather than later as I think then we can start working on some of the other compute kernels in parallel
Here are some follow on items I saw (I can file tickets to handle them in follow on PRs too):
- Use BinaryViewArray rather than BinaryArray
- Handle other types of input/output String arrays
- Remove hard coded 128*len capacity
- Avoid a second Variant copy via VariantBuilder
- A test for error paths
@harshmotw-db what is your preferred approach?
- Do you want to take another pass at this PR and address feedback?
- I'll merge this PR and we can make some follow on tickets / follow on PRs to address?
I merged this PR up to main and resolved some conflicts.

Let's keep the train moving -- thanks again @harshmotw-db 🚀

@alamb Sorry I was busy with something else yesterday. Thanks for merging it! What do you think about this comment related to …
I incorporated that suggestion into this PR:
…rrays of Variants (#7905)

Which issue does this PR close?
- Part of #6736
- Part of #7895

Rationale for this change
As we begin to add operations on Variants stored in arrays, we need some better abstractions for working with those arrays. This PR builds on the great work of @harshmotw-db in #7884 to start adding them.

What changes are included in this PR?
1. Add VariantArray, which wraps a StructArray and adds useful accessors
2. Add VariantArrayBuilder, as described in #7895, to construct VariantArrays
3. Rework batch_json_string_to_variant to use the new builder and array wrapper

Note: while these APIs have no shredding support yet, I think shredding can be added in a straightforward way.

Are these changes tested?
Yes, unit tests and doc examples are included.

Are there any user-facing changes?
New VariantArray and VariantArrayBuilder.
Which issue does this PR close?
Rationale for this change
Explained in the ticket.
Note: This PR will go through changes once that PR is merged.
What changes are included in this PR?
This PR introduces two new functions, batch_json_string_to_variant and batch_variant_to_json_string, which can be used to transform batches of JSON strings to batches of Variant structs and vice versa. This PR attempts to implement batch_variant_to_json_string in a zero-copy way (@alamb see if you agree) since variant_to_json takes an output sink implementing a Write interface. batch_json_string_to_variant should also eventually be zero-copy once this issue is resolved.
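A sketch of that zero-copy write path, assuming the variant_to_json signature shown in the review diff above (it serializes one variant into any std::io::Write sink):

use arrow_schema::ArrowError;
use parquet_variant::{variant_to_json, Variant};

fn append_row(json_buffer: &mut Vec<u8>, variant: &Variant<'_, '_>) -> Result<i32, ArrowError> {
    let start = json_buffer.len();
    variant_to_json(json_buffer, variant)?; // write directly into the shared buffer
    Ok((json_buffer.len() - start) as i32) // bytes written, for this row's end offset
}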
Are these changes tested?
Simple unit tests, since the underlying functions have already been tested.
Are there any user-facing changes?
Yes, it introduces the batch_json_string_to_variant and batch_variant_to_json_string APIs in a new crate.