Leverage dsbulk string codecs to convert from strings to Java types (and back again) #496
base: 1.x
Conversation
…onversion.. still need to sort out the JSON node version
```java
// representation
// known to work with vectors but it certainly isn't obligated to do so.
if (s == null || s.isEmpty() || s.equalsIgnoreCase("NULL")) return null;
ArrayList<SubtypeT> vals =
```
This doesn't look 100% right to me.
If you are going down the path of having a subcodec – which I think is the right thing to do, btw – then I would suggest going even further: here, you are using a string codec to parse N elements of the vector. Because of that, you need to handle the collection manually, by stripping the first and last character of the input, which are assumed to be the collection delimiters.
But instead, I think you should use a collection codec that would parse the entire collection for you. That collection codec would in turn use another inner codec to parse the elements.
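A minimal sketch of that layering, with a plain Function standing in for the inner dsbulk codec; all names here are illustrative, not dsbulk's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the suggested layering: the collection codec owns the
// bracket/delimiter handling and delegates each element to an inner codec.
final class SketchCollectionCodec<E> {
  private final Function<String, E> elementCodec; // stands in for the inner dsbulk codec

  SketchCollectionCodec(Function<String, E> elementCodec) {
    this.elementCodec = elementCodec;
  }

  List<E> parse(String s) {
    if (s == null || s.isEmpty()) return null;
    // The collection codec, not its callers, strips the enclosing delimiters...
    String body = s.substring(1, s.length() - 1).trim();
    List<E> result = new ArrayList<>();
    if (body.isEmpty()) return result;
    for (String element : body.split(",")) {
      // ...and each element is handed to the inner codec.
      result.add(elementCodec.apply(element.trim()));
    }
    return result;
  }
}

// Usage: new SketchCollectionCodec<>(Float::valueOf).parse("[1.1, 2.2, 3.3]")
```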
If I understand correctly you're thinking of something more like this then?
```java
public class StringToVectorCodec<SubtypeT extends Number>
    extends StringToCollectionCodec<SubtypeT, CqlVector<SubtypeT>> {
  ...
}
```

If so, then we have an issue with the underlying Java type: CqlVector implements Iterable but doesn't implement Collection, so it can't satisfy the type bounds in question. The rationale here was that CqlVector is closer to an array than to a Collection type, so implementing Collection would be somewhat confusing. I readily agree there's room for debate here, but it is the state of the world as of 4.18.1.
Does that fact change your analysis? It seems like there are at least three paths forward here:
1. Make CqlVector implement Collection. This would require a new Java driver release plus a dsbulk update to use the new version.
2. Adapt the JSON parsing logic in StringToCollectionCodec to be used here instead of the manual process. This provides similar parsing behaviour to what dsbulk currently exhibits for collections, without the need for a driver upgrade, at the cost of some code duplication.
3. Leave it as-is, possibly with a ticket for future work.
I'm not opposed to any of these options, although (1) will certainly take more time than the other two.
Thoughts?
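For reference, a minimal illustration of the type-bound clash described above. The bound on StringToCollectionCodec is an assumption here, and CqlVector is stubbed as Iterable-only to match its 4.18.1 shape:

```java
import java.util.Collection;

// Stub mirroring the driver's 4.18.1 shape: Iterable, but not Collection.
interface StubCqlVector<T> extends Iterable<T> {}

// Assumed shape of the type bound under discussion.
abstract class StubStringToCollectionCodec<E, C extends Collection<E>> {}

// Does not compile: StubCqlVector<Float> cannot satisfy C extends Collection<Float>.
// class StringToVectorCodec extends StubStringToCollectionCodec<Float, StubCqlVector<Float>> {}
```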
@absurdfarce here is what I had in mind:
issue484...adutra:dsbulk:issue484
It's loosely inspired by how StringToMapCodec is designed.
Also, it adheres to the general principle in dsbulk that all complex codecs rely on a standard JSON representation of the data, not their CQL representation. So here StringToVectorCodec relies on a JSON codec that outputs a JSON array of the vector dimensions.
Let me know if that looks better.
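Paraphrasing the linked diff as a standalone sketch, with plain Jackson in place of the dsbulk codec plumbing (the real code delegates the per-element step to a dsbulk JSON codec instead):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

// Sketch of the JSON-first approach: treat the input string as a JSON array,
// then convert each dimension individually. In dsbulk the per-element step is
// a codec that enforces the usual conversion policies (overflow checks etc.).
final class JsonFirstVectorParsingSketch {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  static List<Float> parse(String s) throws Exception {
    JsonNode array = MAPPER.readTree(s); // "[1.1, 2.2, 3.3]" -> JSON array
    List<Float> dimensions = new ArrayList<>();
    for (JsonNode element : array) {
      dimensions.add((float) element.asDouble()); // per-element conversion step
    }
    return dimensions; // dsbulk then wraps this via CqlVector.newInstance(...)
  }
}
```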
Ohhhhh, I see what you mean now. Yeah, this makes a ton more sense than my original suggestion! I'm gonna make a PR out of that diff and merge it in here; there are several improvements in there I'd like to get into this work.
Many thanks, my friend; that diff absolutely crystallized exactly what you were talking about!
Co-authored-by: Alexandre Dutra <[email protected]>
```java
(StringToVectorCodec<Float>)
    codecFactory.<String, CqlVector<Float>>createConvertingCodec(
        DataTypes.vectorOf(DataTypes.FLOAT, 5), GenericType.STRING, true);
}
```
This is a much cleaner way to get to a workable ConversionContext! 👍
```java
try {
  JsonNode node = objectMapper.readTree(StringUtils.ensureBrackets(s));
  List<SubtypeT> vals = jsonCodec.externalToInternal(node);
  return CqlVector.newInstance(vals);
```
Use JSON codecs to parse input strings as JSON, build a list from the result, and then build a CqlVector from that list. This makes the behaviour of the vector codec consistent with the codecs for collection types by enforcing a common policy around string representations of these types (i.e. they have to be JSON-friendly).
Idea (and implementation) provided by @adutra
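A standalone illustration of what the ensureBrackets call in the snippet buys: bare and bracketed inputs parse to the same JSON array. StringUtils.ensureBrackets is a dsbulk helper; the version below is an assumed stand-in for illustration, not its actual source:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

final class EnsureBracketsDemo {
  // Assumed behaviour of dsbulk's StringUtils.ensureBrackets, for illustration only.
  static String ensureBrackets(String s) {
    String trimmed = s.trim();
    return trimmed.startsWith("[") ? trimmed : "[" + trimmed + "]";
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode bare = mapper.readTree(ensureBrackets("1.1, 2.2, 3.3"));
    JsonNode bracketed = mapper.readTree(ensureBrackets("[1.1, 2.2, 3.3]"));
    System.out.println(bare.equals(bracketed)); // true: same JSON array either way
  }
}
```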
```java
codecFactory.createConvertingCodec(
    DataTypes.listOf(vectorType.getElementType()), JSON_NODE_TYPE, false);
return new StringToVectorCodec<>(
    vectorCodec, jsonCodec, context.getAttribute(OBJECT_MAPPER), nullStrings);
```
See StringToVectorCodec changes below. jsonCodec is here to convert raw string values into Lists; StringToVectorCodec builds CqlVectors out of them.
```java
// arithmetic overflow.
@Test
void should_not_convert_too_much_precision() {
  assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
```
I'd argue this is incorrect. This test is intended to mirror the equivalent test in JsonNodeToVectorCodecTest. In that case we're trying to confirm that the JSON representation for an otherwise valid vector (really just a JSON array in that case) fails to convert because precision policies in the dsbulk codecs are being enforced. If we want to model the same thing here, this should be the string representation for an otherwise valid vector... which means it should be something like:
```java
// Issue 484: now that we're using the dsbulk string-to-subtype converters we should get
// enforcement of existing dsbulk policies. For our purposes that means the failure on
// arithmetic overflow.
@Test
void should_not_convert_too_much_precision() {
  assertThatThrownBy(() -> codec.encode("[1.1, 2.2, 3.3, 6.646329843]", ProtocolVersion.DEFAULT))
      .isInstanceOf(ArithmeticException.class);
}
```
Your example is fine, but the current one is too. In DSBulk, enclosing brackets and braces are generally optional, so codec.encode("6.646329843", ProtocolVersion.DEFAULT) should behave like codec.encode("[6.646329843]", ProtocolVersion.DEFAULT). You can add both tests, btw.
But note that this test is almost identical to should_not_convert_from_invalid_external. Maybe merge both into one single test?
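A sketch of that merged test, reusing the fixture names from the quoted snippet (codec, ProtocolVersion, AssertJ, JUnit); the exact shape in the real suite may differ:

```java
@Test
void should_not_convert_too_much_precision() {
  // Bare and bracketed forms should behave identically, since enclosing
  // brackets are optional in DSBulk.
  assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
      .isInstanceOf(ArithmeticException.class);
  assertThatThrownBy(() -> codec.encode("[6.646329843]", ProtocolVersion.DEFAULT))
      .isInstanceOf(ArithmeticException.class);
}
```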
```java
void should_not_convert_from_invalid_internal() {
  assertThat(dsbulkCodec).cannotConvertFromInternal("not a valid vector");
```

```java
void should_not_convert_from_invalid_external() {
  assertThat(codec).cannotConvertFromExternal("[6.646329843]");
```
This effectively winds up duplicating should_not_convert_too_much_precision() in a way that isn't very clear. The original intent of this method was to do something similar to JsonNodeToVectorCodecTest.should_not_convert_from_invalid_internal(): given something that isn't a CqlVector, this method fails completely. We could certainly add a few more cases, but I'd argue it's worthwhile to preserve the symmetry.
```java
void should_not_convert_too_much_precision() {
  ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
  tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
  assertThat(dsbulkCodec).cannotConvertFromInternal(tooPreciseNode);
```
Hmm, I don't get why you are trying to use dsbulkCodec to convert from an internal type that would be... JsonNode? The internal type of this codec is CqlVector, so calling cannotConvertFromInternal would only make sense if you had some instance of CqlVector that is somehow "invalid" – maybe a CqlVector with the wrong number of dimensions, or something like that.

But calling cannotConvertFromInternal(tooPreciseNode) does not make sense to me. It's only possible, btw, because dsbulkCodec is of the raw type JsonNodeToVectorCodec. It should be JsonNodeToVectorCodec<Float> – but in that case, I bet this statement wouldn't compile anymore.

If you are trying to check whether the external type tooPreciseNode causes a runtime error, then call cannotConvertFromExternal(tooPreciseNode) – I fixed something similar for StringToVectorCodecTest; see the test should_not_convert_from_invalid_external.
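A sketch of that direction fix, reusing the names from the quoted test. The JSON array is asserted as the external value, which is the side where the precision policy can reject it:

```java
@Test
void should_not_convert_too_much_precision() {
  ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
  tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
  // external -> internal is the direction that enforces the precision policy
  assertThat(dsbulkCodec).cannotConvertFromExternal(tooPreciseNode);
}
```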
dsbulk codecs for Strings enforce additional constraints on their input values (such as overflow checks for numerical values) that we want to be able to leverage when dealing with vectors. This change accomplishes that by implementing a variant of the CqlVector.from() logic directly within the codec infrastructure of dsbulk.
The JSON vector codec was already using dsbulk JSON codecs so that path appears to be fine.
Also added some test cases to demonstrate the expected behaviour.
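A standalone illustration of the kind of overflow/precision check being leveraged. The logic below is a simplified stand-in for the policy, not dsbulk's actual implementation:

```java
// Simplified stand-in for dsbulk's precision policy: reject a narrowing
// conversion if the float cannot represent the parsed double exactly.
final class OverflowCheckDemo {
  static float toFloatExact(String s) {
    double d = Double.parseDouble(s);
    float f = (float) d;
    if ((double) f != d) {
      throw new ArithmeticException("Cannot convert " + s + " to float exactly");
    }
    return f;
  }

  public static void main(String[] args) {
    System.out.println(toFloatExact("2.5"));         // exact in binary: prints 2.5
    System.out.println(toFloatExact("6.646329843")); // throws ArithmeticException
  }
}
```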