
Conversation

@absurdfarce (Collaborator):

dsbulk codecs for Strings enforce additional constraints on their input values (like overflow checks for numerical values) that we want to be able to leverage when dealing with vectors. This change accomplishes that by implementing a variant of the CQLVector.from() logic directly within the codec infrastructure of dsbulk.

The JSON vector codec was already using dsbulk JSON codecs, so that path appears to be fine.

Also added some test cases to demonstrate the expected behaviour.
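To make the constraint concrete: plain Java float parsing silently drops excess precision, which is exactly what the dsbulk codecs guard against. A standalone illustration (not dsbulk code; the ArithmeticException behaviour is shown by the tests later in this thread):

    public class PrecisionDemo {
      public static void main(String[] args) {
        // Plain Java parsing never complains about excess precision; the value
        // is silently rounded to the nearest representable float.
        float f = Float.parseFloat("6.646329843");
        System.out.println(f); // prints a rounded value, not 6.646329843
        // dsbulk's string-to-number codecs instead reject such input with an
        // ArithmeticException, which is the behaviour this PR extends to vectors.
      }
    }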

    // representation
    // known to work with vectors but it certainly isn't obligated to do so.
    if (s == null || s.isEmpty() || s.equalsIgnoreCase("NULL")) return null;
    ArrayList<SubtypeT> vals =
@adutra (Contributor):

This doesn't look 100% right to me.

If you are going down the path of having a subcodec – which I think is the right thing to do btw – then I would suggest to go even further: here, you are using a string codec to parse N elements of the vector. Because of that, you need to handle the collection manually, by stripping the first and last character of the input, which are assumed to be the collection delimiters.

But instead, I think you should use a collection codec that would parse the entire collection for you. That collection codec would in turn use another inner codec to parse the elements.
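For illustration, the layering might look roughly like this; every name below is a placeholder, not an actual dsbulk class:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Sketch of the suggested layering: the collection codec owns the
    // delimiter handling and delegates each element to an inner codec.
    final class SketchCollectionCodec<E> {
      private final Function<String, E> elementCodec; // inner per-element codec

      SketchCollectionCodec(Function<String, E> elementCodec) {
        this.elementCodec = elementCodec;
      }

      List<E> parse(String input) {
        String s = input.trim();
        // The collection codec, not the caller, strips the delimiters.
        if (s.startsWith("[") && s.endsWith("]")) {
          s = s.substring(1, s.length() - 1).trim();
        }
        List<E> out = new ArrayList<>();
        if (!s.isEmpty()) {
          for (String token : s.split(",")) {
            out.add(elementCodec.apply(token.trim()));
          }
        }
        return out;
      }
    }
    // Usage: new SketchCollectionCodec<>(Float::parseFloat).parse("[1.1, 2.2]")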

@absurdfarce (Collaborator, Author):

If I understand correctly, you're thinking of something more like this?

public class StringToVectorCodec<SubtypeT extends Number>
    extends StringToCollectionCodec<SubtypeT, CqlVector<SubtypeT>> {
...
}

If so, then we have an issue with the underlying Java type: CqlVector implements Iterable but doesn't implement Collection, so it can't satisfy the type bounds in question (sketched below). The rationale here was that CqlVector is closer to an array than a Collection type, so implementing Collection would be somewhat confusing. I readily agree there's room for debate here, but that is the state of the world as of 4.18.1.
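To illustrate the mismatch (assuming StringToCollectionCodec bounds its collection parameter by java.util.Collection):

    import java.util.Collection;

    // Placeholder standing in for dsbulk's StringToCollectionCodec:
    abstract class SketchStringToCollectionCodec<E, C extends Collection<E>> {}

    // CqlVector<E> implements Iterable<E> but not Collection<E> (as of driver
    // 4.18.1), so a declaration like the following fails to compile:
    //
    //   class StringToVectorCodec<T extends Number>
    //       extends SketchStringToCollectionCodec<T, CqlVector<T>> {} // bound error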

Does that fact change your analysis? It seems like there are at least three paths forward here:

  1. Make CqlVector implement Collection. Would require a new Java driver release + a dsbulk update to use the new version.
  2. Adapt the JSON parsing logic in StringToCollectionCodec to be used here instead of the manual process. Provides similar parsing behaviour to what dsbulk currently exhibits for collections without the need for a driver upgrade at the cost of some code duplication.
  3. Leave it as-is, possibly with a ticket for future work.

I'm not opposed to any of these options, although (1) will certainly take more time than the other two.

Thoughts?

@adutra (Contributor):

@absurdfarce here is what I had in mind:

issue484...adutra:dsbulk:issue484

It's loosely inspired by how StringToMapCodec is designed.

Also, it adheres to the general principle in dsbulk that mandates that all complex codecs rely on a standard JSON representation of the data, and not their CQL representation. So here StringToVectorCodec relies on a JSON codec that outputs a JSON array of the vector dimensions.

Let me know if that looks better.
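The principle in miniature, as a standalone Jackson illustration (not the branch's actual code): the input string is parsed as a JSON array, and each element is then handed off for validation:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonPrincipleDemo {
      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // The vector's string form is treated as a JSON array of its dimensions...
        JsonNode node = mapper.readTree("[1.1, 2.2, 3.3]");
        for (JsonNode element : node) {
          // ...and each element would then go through a dsbulk element codec,
          // which applies the usual validation (e.g. overflow checks).
          System.out.println(element.floatValue());
        }
      }
    }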

@absurdfarce (Collaborator, Author):

Ohhhhh, I see what you mean now. Yeah, this makes a ton more sense than my original suggestion! I'm gonna make a PR out of that diff and merge it in here; there are several improvements in there I'd like to get into this work.

Many thanks, my friend; that diff absolutely crystallized exactly what you were talking about!

    (StringToVectorCodec<Float>)
        codecFactory.<String, CqlVector<Float>>createConvertingCodec(
            DataTypes.vectorOf(DataTypes.FLOAT, 5), GenericType.STRING, true);
    }
@absurdfarce (Collaborator, Author):

This is a much cleaner way to get to a workable ConversionContext! 👍

    try {
      JsonNode node = objectMapper.readTree(StringUtils.ensureBrackets(s));
      List<SubtypeT> vals = jsonCodec.externalToInternal(node);
      return CqlVector.newInstance(vals);
@absurdfarce (Collaborator, Author), Jul 10, 2024:

Use JSON codecs to evaluate input strings as JSON, build a list from that, and then build a CqlVector from that list. This makes the behaviour of the vector codec consistent with the codecs for the collection types by enforcing a common policy around string representations of these types (i.e. they have to be JSON-friendly).

Idea (and implementation) provided by @adutra
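One detail worth noting is StringUtils.ensureBrackets, which is what makes the enclosing brackets optional; a behavioural sketch, assuming the utility wraps bare inputs in brackets and leaves bracketed ones alone:

    // Hypothetical stand-in for dsbulk's StringUtils.ensureBrackets; not the
    // utility's actual source.
    final class EnsureBracketsSketch {
      static String ensureBrackets(String s) {
        String trimmed = s.trim();
        return trimmed.startsWith("[") ? trimmed : "[" + trimmed + "]";
      }
      // ensureBrackets("1.1, 2.2")   -> "[1.1, 2.2]"
      // ensureBrackets("[1.1, 2.2]") -> "[1.1, 2.2]" (unchanged)
    }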

    codecFactory.createConvertingCodec(
        DataTypes.listOf(vectorType.getElementType()), JSON_NODE_TYPE, false);
    return new StringToVectorCodec<>(
        vectorCodec, jsonCodec, context.getAttribute(OBJECT_MAPPER), nullStrings);
@absurdfarce (Collaborator, Author):

See StringToVectorCodec changes below. jsonCodec is here to convert raw string values into Lists; StringToVectorCodec builds CqlVectors out of them.
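Putting the two pieces together, the conversion pipeline reads roughly as:

    raw String --objectMapper--> JsonNode --jsonCodec--> List<SubtypeT> --> CqlVector<SubtypeT>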

    // arithmetic overflow.
    @Test
    void should_not_convert_too_much_precision() {
      assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
@absurdfarce (Collaborator, Author):

I'd argue this is incorrect. This test is intended to mirror the equivalent test in JsonNodeToVectorCodecTest. In that case we're trying to confirm that the JSON representation of an otherwise valid vector (really just a JSON array in that case) fails to convert because the precision policies in the dsbulk codecs are being enforced. If we want to model the same thing here, this should be the string representation of an otherwise valid vector... which means it should be something like:

  // Issue 484: now that we're using the dsbulk string-to-subtype converters we should get
  // enforcement of existing dsbulk policies.  For our purposes that means the failure on
  // arithmetic overflow.
  @Test
  void should_not_convert_too_much_precision() {
    assertThatThrownBy(() -> codec.encode("[1.1, 2.2, 3.3, 6.646329843]", ProtocolVersion.DEFAULT))
        .isInstanceOf(ArithmeticException.class);
  }

@adutra (Contributor):

Your example is fine, but the current one is too. In DSBulk, enclosing brackets and braces are generally optional, so codec.encode("6.646329843", ProtocolVersion.DEFAULT) should behave like codec.encode("[6.646329843]", ProtocolVersion.DEFAULT). You can add both tests, btw.

But note that this test is almost identical to should_not_convert_from_invalid_external. Maybe merge both into a single test, as sketched below?
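A merged test along those lines might look like this (a sketch combining both forms; the bracket-optional behaviour is as described above):

    @Test
    void should_not_convert_too_much_precision() {
      // With explicit brackets...
      assertThatThrownBy(() -> codec.encode("[6.646329843]", ProtocolVersion.DEFAULT))
          .isInstanceOf(ArithmeticException.class);
      // ...and without, since enclosing brackets are generally optional in DSBulk.
      assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
          .isInstanceOf(ArithmeticException.class);
    }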

    void should_not_convert_from_invalid_internal() {
      assertThat(dsbulkCodec).cannotConvertFromInternal("not a valid vector");

    void should_not_convert_from_invalid_external() {
      assertThat(codec).cannotConvertFromExternal("[6.646329843]");
@absurdfarce (Collaborator, Author):

This effectively winds up duplicating should_not_convert_too_much_precision() in a way that isn't very clear. The original intent of this method was to do something similar to JsonNodeToVectorCodecTest.should_not_convert_from_invalid_internal(): specifically, given something that isn't a CqlVector, this method should fail completely. We could certainly add a few more cases, but I'd argue it's worthwhile to preserve the symmetry.

@absurdfarce (Collaborator, Author):

Pulled in changes from @adutra's branch; this makes the usage of codecs in vector parsing much more consistent with existing dsbulk code. I have a few nits about tests that I'd like to hash out with @adutra, but that's all that's left.

    void should_not_convert_too_much_precision() {
      ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
      tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
      assertThat(dsbulkCodec).cannotConvertFromInternal(tooPreciseNode);
@adutra (Contributor):

Hmm, I don't get why you are trying to use dsbulkCodec to convert from an internal type that would be... JsonNode?

The internal type of this codec is CqlVector, so calling cannotConvertFromInternal would make sense only if you had some instance of CqlVector that is somehow "invalid" – maybe a CqlVector with the wrong number of dimensions, or something like that.

But calling cannotConvertFromInternal(tooPreciseNode) does not make sense to me. It's only possible, btw, because dsbulkCodec is of the raw type JsonNodeToVectorCodec. It should be JsonNodeToVectorCodec<Float> – but in that case, I bet this statement wouldn't compile anymore.

If you are trying to check whether the external type tooPreciseNode causes a runtime error, then call cannotConvertFromExternal(tooPreciseNode) – I fixed something similar for StringToVectorCodecTest, see test should_not_convert_from_invalid_external.
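Concretely, the suggested fix asserts on the external side (a sketch built from the assertion quoted above):

    @Test
    void should_not_convert_too_much_precision() {
      ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
      tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
      // tooPreciseNode is the external (JSON) representation, so exercise
      // the external-to-internal direction:
      assertThat(dsbulkCodec).cannotConvertFromExternal(tooPreciseNode);
    }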


Linked issue (may be closed by this pull request): Parsing vector data from JSON fails for "floats" with too many digits (aka doubles)
