
Conversation

@absurdfarce (Collaborator):

dsbulk codecs for Strings enforce additional constraints on their input values (like overflow checks for numerical values) that we want to be able to leverage when dealing with vectors. This change accomplishes that by implementing a variant of the CQLVector.from() logic directly within the codec infrastructure of dsbulk.

The JSON vector codec was already using dsbulk JSON codecs, so that path appears to be fine.

Also added some test cases to demonstrate the expected behaviour.
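To make the constraint concrete: plain Java float parsing silently drops excess precision, which is exactly what the dsbulk codecs guard against. A standalone illustration (not dsbulk code; the ArithmeticException behaviour is shown by the tests later in this thread):

    public class PrecisionDemo {
      public static void main(String[] args) {
        // Plain Java parsing never complains about excess precision; the value
        // is silently rounded to the nearest representable float.
        float f = Float.parseFloat("6.646329843");
        System.out.println(f); // prints a rounded value, not 6.646329843
        // dsbulk's string-to-number codecs instead reject such input with an
        // ArithmeticException, which is the behaviour this PR extends to vectors.
      }
    }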

    // representation
    // known to work with vectors but it certainly isn't obligated to do so.
    if (s == null || s.isEmpty() || s.equalsIgnoreCase("NULL")) return null;
    ArrayList<SubtypeT> vals =
@adutra (Contributor):

This doesn't look 100% right to me.

If you are going down the path of having a subcodec – which I think is the right thing to do btw – then I would suggest to go even further: here, you are using a string codec to parse N elements of the vector. Because of that, you need to handle the collection manually, by stripping the first and last character of the input, which are assumed to be the collection delimiters.

But instead, I think you should use a collection codec that would parse the entire collection for you. That collection codec would in turn use another inner codec to parse the elements.
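For illustration, the layering might look roughly like this; every name below is a placeholder, not an actual dsbulk class:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Sketch of the suggested layering: the collection codec owns the
    // delimiter handling and delegates each element to an inner codec.
    final class SketchCollectionCodec<E> {
      private final Function<String, E> elementCodec; // inner per-element codec

      SketchCollectionCodec(Function<String, E> elementCodec) {
        this.elementCodec = elementCodec;
      }

      List<E> parse(String input) {
        String s = input.trim();
        // The collection codec, not the caller, strips the delimiters.
        if (s.startsWith("[") && s.endsWith("]")) {
          s = s.substring(1, s.length() - 1).trim();
        }
        List<E> out = new ArrayList<>();
        if (!s.isEmpty()) {
          for (String token : s.split(",")) {
            out.add(elementCodec.apply(token.trim()));
          }
        }
        return out;
      }
    }
    // Usage: new SketchCollectionCodec<>(Float::parseFloat).parse("[1.1, 2.2]")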

@absurdfarce (Collaborator, Author):

If I understand correctly, you're thinking of something more like this?

public class StringToVectorCodec<SubtypeT extends Number>
    extends StringToCollectionCodec<SubtypeT, CqlVector<SubtypeT>> {
...
}

If so, then we have an issue with the underlying Java type: CqlVector implements Iterable but doesn't implement Collection, so it can't satisfy the type bounds in question (sketched below). The rationale here was that CqlVector is closer to an array than a Collection type, so implementing Collection would be somewhat confusing. I readily agree there's room for debate here, but that is the state of the world as of 4.18.1.
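To illustrate the mismatch (assuming StringToCollectionCodec bounds its collection parameter by java.util.Collection):

    import java.util.Collection;

    // Placeholder standing in for dsbulk's StringToCollectionCodec:
    abstract class SketchStringToCollectionCodec<E, C extends Collection<E>> {}

    // CqlVector<E> implements Iterable<E> but not Collection<E> (as of driver
    // 4.18.1), so a declaration like the following fails to compile:
    //
    //   class StringToVectorCodec<T extends Number>
    //       extends SketchStringToCollectionCodec<T, CqlVector<T>> {} // bound error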

Does that fact change your analysis? It seems like there are at least three paths forward here:

  1. Make CqlVector implement Collection. Would require a new Java driver release + a dsbulk update to use the new version.
  2. Adapt the JSON parsing logic in StringToCollectionCodec to be used here instead of the manual process. Provides similar parsing behaviour to what dsbulk currently exhibits for collections without the need for a driver upgrade at the cost of some code duplication.
  3. Leave it as-is, possibly with a ticket for future work.

I'm not opposed to any of these options, although (1) will certainly take more time than the other two.

Thoughts?

@adutra (Contributor):

@absurdfarce here is what I had in mind:

issue484...adutra:dsbulk:issue484

It's loosely inspired by how StringToMapCodec is designed.

Also, it adheres to the general principle in dsbulk that mandates that all complex codecs rely on a standard JSON representation of the data, and not their CQL representation. So here StringToVectorCodec relies on a JSON codec that outputs a JSON array of the vector dimensions.

Let me know if that looks better.
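The principle in miniature, as a standalone Jackson illustration (not the branch's actual code): the input string is parsed as a JSON array, and each element is then handed off for validation:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonPrincipleDemo {
      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // The vector's string form is treated as a JSON array of its dimensions...
        JsonNode node = mapper.readTree("[1.1, 2.2, 3.3]");
        for (JsonNode element : node) {
          // ...and each element would then go through a dsbulk element codec,
          // which applies the usual validation (e.g. overflow checks).
          System.out.println(element.floatValue());
        }
      }
    }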

@absurdfarce (Collaborator, Author):

Ohhhhh, I see what you mean now. Yeah, this makes a ton more sense than my original suggestion! I'm gonna make a PR out of that diff and merge it in here; there are several improvements in there I'd like to get into this work.

Many thanks, my friend; that diff absolutely crystallized exactly what you were talking about!

    (StringToVectorCodec<Float>)
        codecFactory.<String, CqlVector<Float>>createConvertingCodec(
            DataTypes.vectorOf(DataTypes.FLOAT, 5), GenericType.STRING, true);
    }
@absurdfarce (Collaborator, Author):

This is a much cleaner way to get to a workable ConversionContext! 👍

    try {
      JsonNode node = objectMapper.readTree(StringUtils.ensureBrackets(s));
      List<SubtypeT> vals = jsonCodec.externalToInternal(node);
      return CqlVector.newInstance(vals);
@absurdfarce (Collaborator, Author), Jul 10, 2024:

Use JSON codecs to evaluate input strings as JSON, build a list from that, and then build a CqlVector from that list. This makes the behaviour of the vector codec consistent with the codecs for the collection types by enforcing a common policy around string representations of these types (i.e. they have to be JSON-friendly).

Idea (and implementation) provided by @adutra
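One detail worth noting is StringUtils.ensureBrackets, which is what makes the enclosing brackets optional; a behavioural sketch, assuming the utility wraps bare inputs in brackets and leaves bracketed ones alone:

    // Hypothetical stand-in for dsbulk's StringUtils.ensureBrackets; not the
    // utility's actual source.
    final class EnsureBracketsSketch {
      static String ensureBrackets(String s) {
        String trimmed = s.trim();
        return trimmed.startsWith("[") ? trimmed : "[" + trimmed + "]";
      }
      // ensureBrackets("1.1, 2.2")   -> "[1.1, 2.2]"
      // ensureBrackets("[1.1, 2.2]") -> "[1.1, 2.2]" (unchanged)
    }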

    codecFactory.createConvertingCodec(
        DataTypes.listOf(vectorType.getElementType()), JSON_NODE_TYPE, false);
    return new StringToVectorCodec<>(
        vectorCodec, jsonCodec, context.getAttribute(OBJECT_MAPPER), nullStrings);
@absurdfarce (Collaborator, Author):

See StringToVectorCodec changes below. jsonCodec is here to convert raw string values into Lists; StringToVectorCodec builds CqlVectors out of them.
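Putting the two pieces together, the conversion pipeline reads roughly as:

    raw String --objectMapper--> JsonNode --jsonCodec--> List<SubtypeT> --> CqlVector<SubtypeT>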

    // arithmetic overflow.
    @Test
    void should_not_convert_too_much_precision() {
      assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
@absurdfarce (Collaborator, Author):

I'd argue this is incorrect. This test is intended to mirror the equivalent test in JsonNodeToVectorCodecTest. In that case we're trying to confirm that the JSON representation of an otherwise valid vector (really just a JSON array in that case) fails to convert because the precision policies in the dsbulk codecs are being enforced. If we want to model the same thing here, this should be the string representation of an otherwise valid vector... which means it should be something like:

  // Issue 484: now that we're using the dsbulk string-to-subtype converters we should get
  // enforcement of existing dsbulk policies.  For our purposes that means the failure on
  // arithmetic overflow.
  @Test
  void should_not_convert_too_much_precision() {
    assertThatThrownBy(() -> codec.encode("[1.1, 2.2, 3.3, 6.646329843]", ProtocolVersion.DEFAULT))
        .isInstanceOf(ArithmeticException.class);
  }

@adutra (Contributor):

Your example is fine, but the current one is too. In DSBulk, enclosing brackets and braces are generally optional, so codec.encode("6.646329843", ProtocolVersion.DEFAULT) should behave like codec.encode("[6.646329843]", ProtocolVersion.DEFAULT). You can add both tests, btw.

But note that this test is almost identical to should_not_convert_from_invalid_external. Maybe merge both into a single test, as sketched below?
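A merged test along those lines might look like this (a sketch combining both forms; the bracket-optional behaviour is as described above):

    @Test
    void should_not_convert_too_much_precision() {
      // With explicit brackets...
      assertThatThrownBy(() -> codec.encode("[6.646329843]", ProtocolVersion.DEFAULT))
          .isInstanceOf(ArithmeticException.class);
      // ...and without, since enclosing brackets are generally optional in DSBulk.
      assertThatThrownBy(() -> codec.encode("6.646329843", ProtocolVersion.DEFAULT))
          .isInstanceOf(ArithmeticException.class);
    }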

    void should_not_convert_from_invalid_internal() {
      assertThat(dsbulkCodec).cannotConvertFromInternal("not a valid vector");

    void should_not_convert_from_invalid_external() {
      assertThat(codec).cannotConvertFromExternal("[6.646329843]");
@absurdfarce (Collaborator, Author):

This effectively winds up duplicating should_not_convert_too_much_precision() in a way that isn't very clear. The original intent of this method was to do something similar to JsonNodeToVectorCodecTest.should_not_convert_from_invalid_internal(): specifically, given something that isn't a CqlVector, this method should fail completely. We could certainly add a few more cases, but I'd argue it's worthwhile to preserve the symmetry.

@absurdfarce (Collaborator, Author):

Pulled in changes from @adutra's branch; this makes the usage of codecs in vector parsing much more consistent with existing dsbulk code. I have a few nits about tests that I'd like to hash out with @adutra, but that's all that's left.

    void should_not_convert_too_much_precision() {
      ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
      tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
      assertThat(dsbulkCodec).cannotConvertFromInternal(tooPreciseNode);
@adutra (Contributor):

Hmm, I don't get why you are trying to use dsbulkCodec to convert from an internal type that would be... JsonNode?

The internal type of this codec is CqlVector, so calling cannotConvertFromInternal would make sense only if you had some instance of CqlVector that is somehow "invalid" – maybe a CqlVector with the wrong number of dimensions, or something like that.

But calling cannotConvertFromInternal(tooPreciseNode) does not make sense to me. It's only possible, btw, because dsbulkCodec is of the raw type JsonNodeToVectorCodec. It should be JsonNodeToVectorCodec<Float> – but in that case, I bet this statement wouldn't compile anymore.

If you are trying to check whether the external type tooPreciseNode causes a runtime error, then call cannotConvertFromExternal(tooPreciseNode) – I fixed something similar for StringToVectorCodecTest, see test should_not_convert_from_invalid_external.
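Concretely, the suggested fix asserts on the external side (a sketch built from the assertion quoted above):

    @Test
    void should_not_convert_too_much_precision() {
      ArrayNode tooPreciseNode = JSON_NODE_FACTORY.arrayNode();
      tooPreciseNode.add(JSON_NODE_FACTORY.numberNode(6.646329843));
      // tooPreciseNode is the external (JSON) representation, so exercise
      // the external-to-internal direction:
      assertThat(dsbulkCodec).cannotConvertFromExternal(tooPreciseNode);
    }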


Linked issue (may be closed by this pull request): Parsing vector data from JSON fails for "floats" with too many digits (aka doubles)
