-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] [Pathfinding] Variant builder rollback, variant array builder, and read-only metadata builder #8167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Attn @alamb @carpecodeum @friendlymatthew , very interested in your thoughts? |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some hopefully helpful context for reviewers
| nulls: NullBufferBuilder, | ||
| /// buffer for all the metadata | ||
| metadata_buffer: Vec<u8>, | ||
| metadata_builder: MetadataBuilderXX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the metadata builder is now "reusable" (each new finish call just appends another chunk of bytes to the underlying byte buffer), we can instantiate it directly and just pass it by reference to the VariantArrayVariantBuilder helper class.
| metadata_offsets: Vec<usize>, | ||
| /// buffer for values | ||
| value_buffer: Vec<u8>, | ||
| value_builder: ValueBuilder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be called ValueBuffer, and it turns out to have a better API for our purposes than VariantBuilder. So we no longer create a VariantBuilder at all, and just pass the value and metadata builders around by reference as needed.
| metadata_builder: MetadataBuilderXX, | ||
| /// (offset, len) pairs for locations of metadata in the buffer | ||
| metadata_locations: Vec<(usize, usize)>, | ||
| metadata_offsets: Vec<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Track starting offsets (similar to normal variant code), and infer the length of each entry as the difference between adjacent offsets. Ditto for value_offsets below.
| let metadata_buffer = std::mem::take(&mut self.metadata_buffer); | ||
| let value_buffer = std::mem::take(&mut self.value_buffer); | ||
| VariantArrayVariantBuilder::new(self, metadata_buffer, value_buffer) | ||
| VariantArrayVariantBuilder::new(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all the magic into the constructor (this could be done as a separate PR)
| impl VariantBuilderExt for VariantArrayVariantBuilder<'_> { | ||
| fn append_value<'m, 'v>(&mut self, value: impl Into<Variant<'m, 'v>>) { | ||
| self.variant_builder.append_value(value); | ||
| ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed one
| ValueBuilder::try_append_variant(self.parent_state(), value.into()).unwrap() | |
| ValueBuilder::append_variant(self.parent_state(), value.into()) |
| buffer.try_append_variant(value.into(), metadata_builder)?; | ||
| Ok(()) | ||
| let (state, _) = self.parent_state(key)?; | ||
| ValueBuilder::try_append_variant(state, value.into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this used to not roll back fully on failure.
| fn new_list(&mut self) -> ListBuilder<'_> { | ||
| self.try_new_list().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
| self.try_new_object().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
| fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
|
||
| fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I just changed the new_XX methods to return Result, but that churned a lot of unit test code.
I'm still on the fence whether this is better in the long run tho?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)
Another thought I had was we could potentially defer errors that could arise into the call to build() -- but then that might results in hard to track down errors 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState stuff even further by "fixing" it, if we can just eliminate it.
| .new_object() | ||
| .with_field("a", 1) | ||
| .with_field("b", 2) | ||
| .try_with_field("a", 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicate field is now detected immediately upon insertion, not at finish
| let nested_result = inner_list | ||
| .new_object() | ||
| .with_field("x", 1) | ||
| .try_with_field("x", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, duplicates are now detected immediately.
|
|
||
| /// Test reusing buffers with nested objects | ||
| #[test] | ||
| fn test_with_existing_buffers_nested() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was testing support code for VariantArrayBuilder that has been deleted.
|
I will review this PR carefully today |
|
Hi, I have to catch up on all the variant progress (work is busy). I'll review this by the end of this week, but no need to wait for me. If I have questions, I will leave them post-merge |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a pretty close look at this PR and I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)
cc @klion26 and @codephage2020 who may also be interested and have some additional comments to share
| let value_length = 0; | ||
| self.value_locations.push((value_offset, value_length)); | ||
| self.metadata_offsets.push(self.metadata_builder.offset()); | ||
| self.value_offsets.push(self.value_builder.offset()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as part of this PR, but I think the Null elements are supposed to have a Variant::Null rather than empty for all elements that are marked as null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For top-level (which this is), I believe you are correct. Good catch!
| // Size of innermost_list: 1 + 1 + 2*(128 + 1) + 2*128 = 516 | ||
| // Size of inner object: 1 + 4 + 2*256 + 3*(256 + 1) + 256 * 516 = 133384 | ||
| // Size of json: 1 + 4 + 2*256 + 4*(256 + 1) + 256 * 133384 = 34147849 | ||
| assert_eq!(value.len(), 34147849); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this analysis -- the test seems overly sensitive to small pturbatons
| /// Same as `Index::index`, but with the correct lifetime. | ||
| pub(crate) fn get_infallible(&self, i: usize) -> &'m str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment.
Minor nit: directly using get(i).unwrap() at relevant locations rather than add a new function might be easier to understand than understanding when to use get_infallible 🤔
| impl From<Vec<u8>> for ValueBuffer { | ||
| fn from(value: Vec<u8>) -> Self { | ||
| Self(value) | ||
| } | ||
| } | ||
|
|
||
| impl From<ValueBuffer> for Vec<u8> { | ||
| fn from(value_buffer: ValueBuffer) -> Self { | ||
| value_buffer.0 | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it isn't needed I support removing it (we can always add it back later)
| fn append_object(state: ParentState<'_>, obj: VariantObject) { | ||
| let mut object_builder = ObjectBuilder::new(state, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea -- it makes a lot of sense to me
| pub trait MetadataBuilder: std::fmt::Debug { | ||
| fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; | ||
| fn field_name(&self, field_id: usize) -> &str; | ||
| fn num_field_names(&self) -> usize; | ||
| fn truncate_field_names(&mut self, new_size: usize); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API makes sense to me -- I think it would be nice to add some comments, but in general 👍
| /// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. | ||
| #[derive(Default, Debug)] | ||
| struct MetadataBuilder { | ||
| pub struct MetadataBuilderXX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about OwnedMetadataBuilder or DefaultMetadataBuilder
| /// The starting offset in the parent's buffer where this object starts | ||
| parent_value_offset_base: usize, | ||
| /// The starting offset in the parent's metadata buffer where this object starts | ||
| /// used to truncate the written fields in `drop` if the current object has not been finished | ||
| parent_metadata_offset_base: usize, | ||
| /// Whether the object has been finished, the written content of the current object | ||
| /// will be truncated in `drop` if `has_been_finished` is false | ||
| has_been_finished: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is very nice that the parent state handles this nicely now
|
|
||
| /// Returns an object builder that can be used to append a new (nested) object to this object. | ||
| /// | ||
| /// Panics if the requested key cannot be inserted (e.g. because it is a duplicate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason it could panic is that the builder has read only metadata, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way we could avoid panic's would be to defer errors until the final finish() is called 🤔
I think this is fine for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could panic [if] the builder has read only metadata, right?
Yep! And having the same method fail for both error cases is really appealing to me.
(I just wrote the code comment before adding the read-only builder to this PR)
We could avoid panic's [by deferring] errors until the final finish() is called
The goal was to make finish signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
the only disadvantage is that it is annoying if you have to do error checking for operations that can't fail - like making a new object with owned metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top-level builder was, and remains infallible in both its new_object and finish methods?
Here we're dealing with the object builder's (nested) new_object, which takes a field name and so could fail due to duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And actually... it wasn't even checking in the first place:
| fn new_list(&mut self) -> ListBuilder<'_> { | ||
| self.try_new_list().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
| self.try_new_object().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
| fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
|
||
| fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, me too -- it is pretty annoying to have to check for errors on every single call to creating these objects, even when it can not fail (e.g. creating in memory variants)
Another thought I had was we could potentially defer errors that could arise into the call to build() -- but then that might results in hard to track down errors 🤔
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the direction makes a lot of sense to me. I left some inline comments, but basically I think it is ready to polish up and get ready to merge (if you can break it into smaller PRs that would be nice)
Thanks for the quick review!
This will definitely become a stream of smaller/scoped PR, I just wanted a sanity check before starting down that path.
| /// Same as `Index::index`, but with the correct lifetime. | ||
| pub(crate) fn get_infallible(&self, i: usize) -> &'m str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TheIndex::index method has the following signature:
fn index(&self, i: usize) -> &Self::Outputwhich is really
fn index<'a>(&'a self, i: usize) -> &'a Self::Outputand because the (elided) 'a is not related to 'm, we can't use the return value in a place that requires &'m str.
You're right that self.get(i).unwrap() would also work; I was just trying to preserve the expect at both call sites. Should we just unwrap everywhere and be done with it?
NOTE: The get_infallible should be private, except then Index can't access it.
| let value_length = 0; | ||
| self.value_locations.push((value_offset, value_length)); | ||
| self.metadata_offsets.push(self.metadata_builder.offset()); | ||
| self.value_offsets.push(self.value_builder.offset()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For top-level (which this is), I believe you are correct. Good catch!
| pub trait MetadataBuilder: std::fmt::Debug { | ||
| fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; | ||
| fn field_name(&self, field_id: usize) -> &str; | ||
| fn num_field_names(&self) -> usize; | ||
| fn truncate_field_names(&mut self, new_size: usize); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs comments for sure!
| /// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. | ||
| #[derive(Default, Debug)] | ||
| struct MetadataBuilder { | ||
| pub struct MetadataBuilderXX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah my other PR used DefaultMetadataBuilder, but I wasn't sure the name conveyed what it actually does differently than a "not default" (or "not owned") metadata builder might do? Maybe BasicMetadataBuilder would convey the fact that any other impl must be doing something fancy/special (= not basic)?
|
|
||
| /// Returns an object builder that can be used to append a new (nested) object to this object. | ||
| /// | ||
| /// Panics if the requested key cannot be inserted (e.g. because it is a duplicate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could panic [if] the builder has read only metadata, right?
Yep! And having the same method fail for both error cases is really appealing to me.
(I just wrote the code comment before adding the read-only builder to this PR)
We could avoid panic's [by deferring] errors until the final finish() is called
The goal was to make finish signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause.
Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately?
| fn new_list(&mut self) -> ListBuilder<'_> { | ||
| self.try_new_list().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_> { | ||
| self.try_new_object().unwrap() | ||
| } | ||
|
|
||
| fn new_object(&mut self) -> ObjectBuilder<'_>; | ||
| fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; | ||
|
|
||
| fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the ParentState stuff even further by "fixing" it, if we can just eliminate it.
klion26
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. This is very nice; we can now handle the rollback logic more elegantly.
| } | ||
|
|
||
| // Performs any parent-specific aspects of rolling back a builder if an insertion failed. | ||
| fn rollback(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will rollback_if_needed be better here?
| "Field name '{field_name}' not found in metadata", | ||
| ))); | ||
| }; | ||
| self.known_field_names.insert(self.metadata.get_infallible(field_id), field_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we not filling known_field_names in at initialization due to performance considerations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- this specific object's field names are an arbitrarily small subset of the metadata dictionary. In an extreme case, the dictionary could contain thousands of entries while the specific object being manipulated contains only a handful of fields.
|
@klion26 -- thanks for the review! This pathfinding PR cannot merge (too big and messy), but I pulled out the first round of rollback changes as a cleaned up and better scoped PR here: PTAL, if you have time? |
|
I'll go ahead and close this now. The work items are all tracked by And almost everything has merged except |
Which issue does this PR close?
Pathfinding. Even if we like what we see here, it needs to split into a half dozen other PR in order to actually merge. But first we need to see the big picture to know whether it's worth going further.
Several flavors of pathfinding rolled up in one, but with the overarching goal of:
An alternate approach to
Rationale for this change
While trying to rebase #7915, I kept running into more and more ugliness. And it was already kind of ugly.
Also, I realized that the
ValueBufferuse of builders with a genericParentState::Variantwas incorrect in some rollback scenarios where the attempt to insert a value fails (try_insert_objectortry_insert_list).Further, the interaction between
VariantBuilderandVariantArrayBuilderwas quite convoluted.It turns out all of those issues are a bit related, and this PR is the result of my exploration.
What changes are included in this PR?
A lot! In no particular order:
ValueBufferasValueBuilderand make it publicParentStateto be much clearer and hopefully easier to usefinishedstatus on behalf of the builders that use it (their drop glue goes away)finishis not called. In particular, the value and metadata buffer offsets are tracked and rolled back uniformly.finishafter all.ObjectBuildermethods become fallible, because they immediately detect and report duplicated field names when that checking is enabled.VariantBuilder::finishbecomes infallible (because any failed field insertion is caught immediately by the now-fallibleObjectBuildermethods). However, the method still returnsResult::Okfor now, to avoid excessive unit test churn (removing dozens ofunwrapand?calls scattered all over the place)ValueBuilder::[try_]append_variantto be an associated function instead of a method, taking aParentStatefrom its caller in order to ensure correct rollback semantics.ParentStatepublicMetadataBuilderstruct becomes reusable, similar to normal arrow array builders -- itsfinishmethod takes&mut self.VariantArrayBuildernow directly instantiates aValueBuilderandMetadataBuilderpair up front, and converts them to byte slices only when its ownfinishmethod is called. It no longer attempts to create aVariantBuilderat any point, and works directly with theValueBuilderAPI instead. No moremem::takemagic needed -- theVariantArrayVariantBuilderhelper class just takes a mut ref to its owner and its finish method handles the rest; the drop glue becomes empty (just to help catch forgottenfinishcalls) because theParentStatenow deals with all needed cleanup.VariantArrayBuilderalso tracks starting offsets instead of(offset, len)pairs -- the latter is easily derived from the former.MetadataBuilderinto a trait (the original struct is renamed asMetadataBuilderXXfor now, final naming still TBD). Most methods that previously took aMetadataBuilderstruct now take a&mut dyn MetadataBuilderinstead.VariantArrayBuilder, now that it's no longer needed.new_objectandnew_listtoVariantBuilderExttrait. That way, the existing unit test code that calls those methods can keep using the infallible versions (panicking if there's a duplicate field name). Prod code that actually cares calls the new infallible versions instead.ReadOnlyMetadataBuilderthat implements theMetadataBuildertrait but "mounts" an existingVariantMetadata. It takes advantage of the fact that object field insertions are now fallible, in order to return an error if the requested field name does not exist in the underlying metadata dictionary.Are these changes tested?
Partly. I need to port over the unit tests from #7915 to fully exercise the new
ReadOnlyVariantBuilder.Are there any user-facing changes?
Yes, lots of stuff became public.
Also, the semantics of some functions changed (breaking change)