
Conversation

@harshmotw-db
Contributor

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

Explained in the ticket.
Note: This PR will go through changes once a related PR is merged.

What changes are included in this PR?

This PR introduces two new functions, batch_json_string_to_variant and batch_variant_to_json_string, which transform batches of JSON strings into batches of Variant structs and vice versa. This PR attempts to implement batch_variant_to_json_string in a zero-copy way (@alamb, see if you agree), since variant_to_json can write to any output implementing the Write trait. batch_json_string_to_variant should also eventually be zero-copy once this issue is resolved.
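For illustration, here is a minimal usage sketch. It assumes batch_variant_to_json_string mirrors the signature of batch_json_string_to_variant shown later in this PR (taking &ArrayRef and returning a StringArray); treat it as a sketch of the intended flow, not the exact API surface.

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};
use arrow_schema::ArrowError;
use parquet_variant_compute::{batch_json_string_to_variant, batch_variant_to_json_string};

fn roundtrip() -> Result<(), ArrowError> {
    // Two JSON documents and one null row
    let input: ArrayRef = Arc::new(StringArray::from(vec![
        Some("{\"a\": 32}"),
        Some("null"),
        None,
    ]));
    // JSON strings -> StructArray of Variant {metadata, value}
    let variant_array = batch_json_string_to_variant(&input)?;
    // Variant structs -> JSON strings; null rows stay null
    let json_array = batch_variant_to_json_string(&(Arc::new(variant_array) as ArrayRef))?;
    assert_eq!(json_array.null_count(), 1);
    Ok(())
}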

Are these changes tested?

Simple unit tests, since the underlying functions have already been tested.

Are there any user-facing changes?

Yes, it introduces the batch_json_string_to_variant and batch_variant_to_json_string APIs in a new crate.

@harshmotw-db harshmotw-db changed the title Harsh motwani data/parquet variant compute [Variant] Introduce parquet-variant-compute crate to transform batches of JSON strings to and from Variants Jul 8, 2025
@harshmotw-db
Contributor Author

@alamb @scovich @Samyak2

repository = { workspace = true }
authors = { workspace = true }
keywords = ["arrow", "parquet", "variant"]
readme = "README.md"
Contributor Author

@alamb Let me know if I should create a readme.

Contributor

I think a placeholder that says "part of arrow-rs" and what it contains is probably enough

I would basically follow the model of the other crates in the repo

Contributor Author

I just removed the readme line like some other crates in the repo.


// Zero-copy builder
// The size per JSON string is assumed to be 128 bytes. If this holds true, resizing could be
// minimized to improve performance.
Contributor Author

@alamb I've tried to minimize copying here. Took help from AI figuring this out. Please see if this looks good to you.

Contributor

@alamb alamb left a comment

Thanks @harshmotw-db -- I think this is definitely the right direction and pretty close

I have some thoughts mostly on the arrow array types involved, but otherwise everything looks 👌

Major thoughts:

Different Output Type

  1. I suggest the output (and what will become the canonical representation of Variant in arrow) be Struct<BinaryView, BinaryView> rather than Struct<Binary, Binary> -- BinaryView allows both 1) multiple rows to point to the same underlying bytes, and 2) storing the values in multiple buffers, whereas BinaryArray requires one large buffer (see the sketch after this list)

Different Input types

  1. Eventually I think it might be good to have versions of this cast function for the (many) different kinds of string arrays in Arrow (StringArray, LargeStringArray, StringViewArray, etc). However, I think we can do that as a follow-on PR. I can help with the generics too, as I am pretty familiar with how it works in arrow.
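As a concrete sketch of the suggested output type (not code from this PR; it mirrors the PR's variant_arrow_repr but swaps in arrow-schema's DataType::BinaryView):

use arrow_schema::{DataType, Field, Fields};

fn variant_view_repr() -> DataType {
    // Same non-nullable {metadata, value} pair as variant_arrow_repr(),
    // but BinaryView lets rows share bytes and span multiple buffers
    let fields: Fields = vec![
        Field::new("metadata", DataType::BinaryView, false),
        Field::new("value", DataType::BinaryView, false),
    ]
    .into();
    DataType::Struct(fields)
}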


[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
parquet-variant = { path = "../parquet-variant" }
Contributor

Can you please make this "workspace=true" (otherwise cargo publish gets angsty)?

Suggested change
parquet-variant = { path = "../parquet-variant" }
parquet-variant = { workspace = true }

Contributor Author

error: failed to parse manifest at `/Users/harsh.motwani/arrow-rs/parquet-variant-compute/Cargo.toml`

Caused by:
  error inheriting `parquet-variant` from workspace root manifest's `workspace.dependencies.parquet-variant`

Caused by:
  `dependency.parquet-variant` was not found in `workspace.dependencies`

Contributor Author

Adding parquet-variant = { path = "./parquet-variant" } to the root Cargo.toml gets it working, so that's what I will do for now.

DataType::Struct(fields.into())
}

pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<StructArray, ArrowError> {
Contributor

We should definitely add some docs / example to this kernel

I also might suggest calling it cast_to_variant but that is more of a personal preference

Contributor

parse_json is a name we could consider for this. It's the name of the function that does this (JSON string to variant) in Spark/Databricks.

Contributor

I imagine we'll eventually want a top-level cast_to_variant that converts strong types to variant? And passing a string array to such a function should produce a variant column full of string (or short-string) variant values?

This method here is json-parsing strings and casting the result to variant, not casting strings directly to variant?

Contributor Author

Yeah, I agree with @scovich. As for parse_json, it makes sense to name it that way if it is a well-defined SQL function. However, from a library point of view, I think parse_json is too vague.

Contributor

Well technically, it would be parquet_variant_compute::parse_json, which is a little clearer? Maybe we should add some nested module structure like arrow-compute does, so we get e.g. parquet_variant_compute::variant::parse_json. People would use that as variant::parse_json or parse_json if they prefer parsimony?

Contributor

Maybe we can call it cast_json_to_variant and cast_variant_to_json 🤔

mod from_json;
mod to_json;

/// Parse a batch of JSON strings into a batch of Variants represented as
Contributor

Typically these comments would go on the function itself, not its pub use

value_builder.append_null();
validity.append(false);
} else {
let mut vb = VariantBuilder::new();
Contributor

This pattern will cause the variant values to be copied twice -- once into the builder's buffers and then once into the output binary builder -- which is probably OK for the first version.

With some care I think we will be able to avoid copying the values, though it will take using the lower level APIs (and building offsets directly)

Contributor Author

Oh, so just to be clear the two copies you are referring to are metadata_builder.append_value(&metadata); and metadata_builder.finish() right? If so, I'll take care of it in this PR itself.

Contributor

Not sure I followed that one -- maybe it's a second issue?

I was just referring to all the try_new_with_metadata calls. Today VariantMetadata is Copy, so it's easy to forget that each such call is quietly copying an ~80 byte object. That cost could perhaps add up when iterating over hundreds/thousands of child objects? Or maybe the compiler is really good at making it ~free, since really only one meaningfully exists at a time?

Contributor Author

@harshmotw-db harshmotw-db Jul 8, 2025

Well, would this resolve the issue you're talking about:

pub fn try_new(metadata: &'m [u8], value: &'v [u8]) -> Result<Self, ArrowError> {
      // let metadata = VariantMetadata::try_new(metadata)?;
      Self::try_new_with_metadata(VariantMetadata::try_new(metadata)?, value)
  }

Edit: Oh no never mind. There are more copies downstream. I suppose that is more of a library issue that can potentially be fixed separately

Contributor Author

I have reduced copying in this function by manually constructing binary buffers.

Contributor

The above would definitely not help, because each VariantMetadata::try_new would re-validate the same byte buffer!

Contributor

Oh, so just to be clear the two copies you are referring to are metadata_builder.append_value(&metadata); and metadata_builder.finish() right? If so, I'll take care of it in this PR itself.

What I was really imagining was updating VariantBuilder so it could take a pre-existing buffer (Vec) and append to it, rather than writing into a new buffer and then copying that into the output bytes.
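A hypothetical sketch of that buffer-reuse idea (no such VariantBuilder API exists yet): derive each row's offsets from the growth of a batch-level buffer, so a finished row never needs a second copy.

fn append_rows(rows: &[&[u8]]) -> (Vec<u8>, Vec<i32>) {
    let mut value_buffer: Vec<u8> = Vec::new();
    let mut offsets: Vec<i32> = vec![0];
    for row in rows {
        // In the real design, a VariantBuilder borrowing value_buffer would
        // encode the variant in place; extend_from_slice stands in for that
        value_buffer.extend_from_slice(row);
        offsets.push(value_buffer.len() as i32);
    }
    // value_buffer plus offsets can back a BinaryArray directly, copy-free
    (value_buffer, offsets)
}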

variant_to_json(&mut json_buffer, &variant)?;
let written = (json_buffer.len() - start_len) as i32;
current_offset += written;
offsets.push(current_offset);
Contributor

this looks good to me

let value_buffer = Buffer::from_vec(json_buffer);
let null_buffer = NullBuffer::new(validity.finish());

Ok(StringArray::new(
Contributor

I think if possible we should consider using StringViewArray here instead of StringArray

Contributor Author

I don't know about that. According to this documentation, a short string (<= 12 bytes) must be stored inline in the view itself. But how would we know that in advance? I think StringArray can more readily be implemented in a zero-copy fashion.

@harshmotw-db
Contributor Author

harshmotw-db commented Jul 8, 2025

@alamb Wouldn't BinaryView have the same issues as the StringView issues I have highlighted here, since both are just GenericByteViewArray? I suppose these xxViewArrays might be much more complicated to implement in a zero-copy manner. I'll leave the types as Binary and String for now unless I'm missing something.

The best middle ground is a byte array which also stores lengths with offsets so we can have multiple offsets point to the same strings.
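For reference, a rough sketch of the 16-byte view layout (modeled on arrow_data::ByteView). The struct below describes the long case; for lengths <= 12 the twelve bytes after length hold the value inline instead, which is why short strings cannot simply point into an existing buffer:

#[repr(C)]
struct ByteView {
    length: u32,       // length of the value in bytes
    prefix: u32,       // first 4 bytes of the value (length > 12 case)
    buffer_index: u32, // which data buffer holds the value
    offset: u32,       // offset of the value within that buffer
}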

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 9, 2025
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Jul 9, 2025
@harshmotw-db harshmotw-db requested a review from alamb July 9, 2025 06:58
value_buffer.extend(value);
value_offsets.push(value_current_offset);
value_validity.append(true);
println!("{value_current_offset} {metadata_current_offset}");


Don't we use tracing for logging in arrow?

Contributor Author

This was some debugging code I forgot to remove. Thanks for catching it.

Comment on lines +130 to +136
let input = StringArray::from(vec![
Some("1"),
None,
Some("{\"a\": 32}"),
Some("null"),
None,
]);


Can we please cover more cases here?
I mean all the types that JSON supports. I see you already added int, nulls, and a simple dict with string and int here.

If it's a test for the happy path, can we please add:

  • default values (e.g., 0 for int, because some engines can represent NULL as a default value)
  • booleans (both true/false)
  • more nested JSON, not only 1-level nested JSON, e.g. "{{{{true: false}, "-1": "+1"}, 0: 1}}"
  • some long strings to ensure we don't have any string-size-based logic

Contributor

default values

Not quite sure what you mean by "default values" or how an engine's NULL handling relates to string (json) -> variant parsing?

"{{{{true: false}, "-1": "+1"}, 0: 1}}"

I'm pretty sure JSON objects require string field names?

Contributor Author

@gotocoding-DB The batch functions in this PR just run some underlying scalar functions on a whole batch of data. The underlying scalar functions have been validated on all sorts of inputs (this PR). I don't think the logical breadth of JSON test cases needs to be tested again.

Contributor

I agree full coverage here is redundant -- maybe we can add a comment that says "full json parsing coverage is handled by the tests for json_to_variant"


@harshmotw-db You're right. If the internally used parsing function is tested, we don't need to duplicate the tests here. I'm personally using tests as another example of function usage (if it's not fully covered in function docs).

validity.append(false);
} else {
let mut vb = VariantBuilder::new();
json_to_variant(input_string_array.value(i), &mut vb)?;


Can we please add a non-happy-path test as well, for example with invalid JSON, where this function should fail and return a well-known error message to the user?

}?;

// Zero-copy builders
let mut metadata_buffer: Vec<u8> = Vec::with_capacity(input.len() * 128);
Contributor

Where does the 128 scale factor come from? It seems a stretch to assume that a batch of N JSON strings will produce 128*N bytes worth of unique field names? Especially as N grows and that scale factor becomes more and more dangerous?

If we really feel it's important to pre-allocate capacity, I'd recommend capping it at e.g. 1MB. But honestly, I'd just allocate a normal vec and let it grow normally, unless/until we have some proof that the guaranteed O(n) cost to append n bytes to a vec isn't good enough, and that pre-allocation actually helps in a wide variety of situations.
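A minimal sketch of that capped pre-allocation (the 1 MiB figure is the example cap above, not a measured value):

fn prealloc_buffer(input_len: usize) -> Vec<u8> {
    const PREALLOC_CAP: usize = 1024 * 1024; // cap the guess at 1 MiB
    Vec::with_capacity(input_len.saturating_mul(128).min(PREALLOC_CAP))
}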

Contributor

I agree we should allocate the normal vec and let it grow -- if we want to pre-allocate I think we should allow the user to pass in the allocation size as an argument

Maybe something like

let variant_array = JsonToVariant::new()
  .with_capacity(...)
  .parse(input_array)
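A hypothetical fleshing-out of that builder (none of these items exist in the crate yet; parse would delegate to the kernel from this PR):

use arrow::array::{ArrayRef, StructArray};
use arrow_schema::ArrowError;
use parquet_variant_compute::batch_json_string_to_variant;

struct JsonToVariant {
    capacity: Option<usize>,
}

impl JsonToVariant {
    fn new() -> Self {
        Self { capacity: None }
    }

    // Caller-supplied estimate of total output bytes, replacing the
    // hard-coded input.len() * 128
    fn with_capacity(mut self, capacity: usize) -> Self {
        self.capacity = Some(capacity);
        self
    }

    fn parse(self, input: &ArrayRef) -> Result<StructArray, ArrowError> {
        // A real version would seed the kernel's buffers with self.capacity
        let _ = self.capacity;
        batch_json_string_to_variant(input)
    }
}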

use parquet_variant::{json_to_variant, VariantBuilder};

fn variant_arrow_repr() -> DataType {
// The subfields are expected to be non-nullable according to the parquet variant spec.
Contributor

The shredding spec makes value optional, with the possibility -- but not the requirement -- to have a typed_value instead when value is missing. Should we recognize that situation and throw a suitable "shredded variants not supported" error, instead of blowing up with an obscure schema mismatch error?

Contributor

... but now that I think about it, that would be an issue when reading from parquet, which has nothing to do with this PR.

Comment on lines +41 to +46
let input_string_array = match input.as_any().downcast_ref::<StringArray>() {
Some(string_array) => Ok(string_array),
None => Err(ArrowError::CastError(
"Expected reference to StringArray as input".into(),
)),
}?;
Contributor

Why not use ok_or_else?

Suggested change
let input_string_array = match input.as_any().downcast_ref::<StringArray>() {
Some(string_array) => Ok(string_array),
None => Err(ArrowError::CastError(
"Expected reference to StringArray as input".into(),
)),
}?;
let input_string_array = input
.as_any()
.downcast_ref::<StringArray>()
.ok_or_else(|| ArrowError::CastError("Expected a StringArray as input".into()))?;

let mut validity = BooleanBufferBuilder::new(input.len());
for i in 0..input.len() {
if input.is_null(i) {
// The subfields are expected to be non-nullable according to the parquet variant spec.
Contributor

We already know they're non-nullable... maybe the comment can explain that we're pushing valid-but-empty subfields, to maintain proper positioning?

Also: Could we create nullable sub-field arrays even tho the schema says they're non-nullable, and rely on nested null masks? Does that save space in case the variant column has a lot of null entries?

Contributor

But I guess we still have to push something to the offset arrays, to maintain proper positioning... so valid-but-empty is probably the best we can do?

Contributor

But I guess we still have to push something to the offset arrays, to maintain proper positioning... so valid-but-empty is probably the best we can do?

This offset layout is required by the Arrow spec for StringArrays
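A tiny illustration of that requirement (the values here are invented for the example): even a null row consumes an offset slot, so the kernel pushes valid-but-empty entries to keep positions aligned.

use arrow::array::StringArray;
use arrow::buffer::{NullBuffer, OffsetBuffer};

fn offsets_with_null_row() -> StringArray {
    // Rows: "a", null, "bc" -- the null row still gets an (empty) offset range
    let offsets = OffsetBuffer::new(vec![0i32, 1, 1, 3].into());
    let values = b"abc".to_vec().into();
    let nulls = NullBuffer::from(vec![true, false, true]);
    StringArray::new(offsets, values, Some(nulls))
}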

Comment on lines +108 to +111
let variant_fields = match variant_arrow_repr() {
DataType::Struct(fields) => fields,
_ => unreachable!("variant_arrow_repr is hard-coded and must match the expected schema"),
};
Contributor

Should variant_arrow_repr just return Fields instead of DataType? Then we don't have to unpack it.

It's also a 4-line method with a single call site (3 lines if we don't convert it to a DataType), so we might also consider removing the method entirely. Could also make the relationship with the StructArray more explicit with something like:

Suggested change
let variant_fields = match variant_arrow_repr() {
DataType::Struct(fields) => fields,
_ => unreachable!("variant_arrow_repr is hard-coded and must match the expected schema"),
};
let metadata_field = Field::new("metadata", metadata_array.data_type(), false);
let value_field = Field::new("value", value_array.data_type(), false);
let variant_fields = vec![metadata_field, value_field].into();

Contributor

Should variant_arrow_repr just return Fields instead of DataType? Then we don't have to unpack it.

Yes, please


assert_eq!(metadata_array.value(3), &[1, 0, 0]);
assert_eq!(value_array.value(3), &[0]);

// Ensure that the subfields are not actually nullable
Contributor

I'm not sure this is a useful thing to enforce? I'm pretty sure the arrow spec forbids making any assumptions about the values of child arrays at positions an ancestor has marked invalid?


let mut metadata_current_offset: i32 = 0;
metadata_offsets.push(metadata_current_offset);

let mut value_buffer: Vec<u8> = Vec::with_capacity(input.len() * 128);
Contributor

As above... I'm not sure how helpful it really is to pre-allocate capacity based only on input length. Some variants will be very small, and an unbounded over-allocation could hog a lot of memory. Others will be much (much) larger than 128B each, and the vec will anyway end up making multiple capacity increases along the way.

@alamb
Contributor

alamb commented Jul 10, 2025

I plan to give this another review tomorrow -- sorry I ran out of time today

Contributor

@alamb alamb left a comment

Thank you @harshmotw-db , @scovich @gotocoding-DB and @Samyak2

I would like to get this PR merged sooner rather than later as I think then we can start working on some of the other compute kernels in parallel

Here are some follow-on items I saw (I can file tickets to handle them in follow-on PRs too):

  1. Use BinaryViewArray rather than BinaryArray
  2. Handle other types of input/output String arrays
  3. Remove hard coded 128*len capacity
  4. Avoid a second Variant copy via VariantBuilder
  5. A test for error paths

@harshmotw-db what is your preferred approach?

  1. Do you want to take another pass at this PR and address feedback?
  2. I'll merge this PR and we can make some follow on tickets / follow on PRs to address?


@alamb
Contributor

alamb commented Jul 11, 2025

I merged this PR up to main and resolved some conflicts

@alamb
Contributor

alamb commented Jul 11, 2025

Let's keep the train moving -- thanks again @harshmotw-db 🚀

@alamb alamb merged commit 058243a into apache:main Jul 11, 2025
10 checks passed
@harshmotw-db
Contributor Author

@alamb Sorry I was busy with something else yesterday. Thanks for merging it! What do you think about this comment related to BinaryViewArray and StringViewArray?

@alamb
Contributor

alamb commented Jul 11, 2025

@alamb Sorry I was busy with something else yesterday. Thanks for merging it! What do you think about this comment related to BinaryViewArray and StringViewArray?

I incorporated that suggestion into this PR:

alamb added a commit that referenced this pull request Jul 12, 2025
…rrays of Variants (#7905)

# Which issue does this PR close?

- Part of #6736
- Part of #7895


# Rationale for this change

As we begin to add operations on Variants stored in arrays, we need some better abstractions for working with those arrays

This PR builds on the great work of @harshmotw-db in
#7884 to start adding t


# What changes are included in this PR?

1. Add `VariantArray` that wraps a `StructArray` and adds useful accessors
2. Add `VariantArrayBuilder` as described in #7895 to construct `VariantArrays`
3. Rework `batch_json_string_to_variant` to use the new builder and array wrapper
Note while these APIs have no shredding support yet, I think shredding
can be added in a straightforward way

# Are these changes tested?

Yes, unit tests and doc examples are included


# Are there any user-facing changes?

New VariantArray and VariantArrayBuilder


Development

Successfully merging this pull request may close these issues.

[Variant][Compute] Add batch processing for Variant-JSON String conversion
