-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark: Handle complex type in expression in RemoteScanPlanning #14882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| Expression filter = null; | ||
| if (jsonNode.has(RESIDUAL_FILTER)) { | ||
| filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL_FILTER)); | ||
| filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL_FILTER), spec.schema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was caught during the execution phase of spark, need to pass schema for residual
core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java
Outdated
Show resolved
Hide resolved
| } else if (defaultValue.isIntegralNumber() && defaultValue.canConvertToLong()) { | ||
| return defaultValue.longValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is mostly required now since we are now binding with schema, need to think this through
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now : Integer → SingleValueParser tries to parse as DATE → Expects string
f5ae0f6 to
7d7dcaa
Compare
7d7dcaa to
51d4aab
Compare
51d4aab to
f071179
Compare
| } | ||
| if (request.filter() != null) { | ||
| configuredScan = configuredScan.filter(request.filter()); | ||
| Expression filter = request.filter(schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this change is probably not needed
| * @param schema the table schema to use for type-aware deserialization of filter values | ||
| * @return the filter expression, or null if no filter was specified | ||
| */ | ||
| public Expression filter(Schema schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me think about this a bit more. I also think we have a few more cases across the codebase where we also ser/de Expression without a Schema and theoretically we would have the same issue in those places as well.
Whatever approach we pick, we'd want to follow up in those other places too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other thing we might need to consider is how we would be lazily binding this in other client implementations. @Fokko does pyiceberg have examples of how it does a late-binding similar to this one?
The issue that we have here is that we deserialize an Expression where we can only correctly do so when we bind it to a Schema
| private static Object parseDateValue(Type type, JsonNode value) { | ||
| if (value.isTextual()) { | ||
| return DateTimeUtil.isoDateToDays(value.textValue()); | ||
| } else if (value.isIntegralNumber() && value.canConvertToInt()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these changes are probably not required to fix the underlying issue, so we might want to separate them out and test them individually

About the change
Presently the expression when serialized doesn't capture the type so even binary when de-serialized its used as string which later fails. For the parsers its important to know the schema so that they could de-serialize stuff correctly, a part of it is handled in the SDK during response de-serialization via parser context but while the client can set this since its making the call the same can't be assumed by the server which would be doing the same deserialization of the request.
There is 3 ways to solve this problem :
filter(schema)and then wire the schema while deserialization in ExpressionParser.Considering where we are i implemented Approach#1
Testing
New test and existing tests