-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: correct date_trunc for times before the epoch #18356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
|
@mhilton notes this is very similar to the fix in |
I think we need to have correctness before performance. I'll run the benchmarks on this PR to be sure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mhilton
I kicked off some benchmarks for this PR as well, so we'll get a second opinion.
I also have some ideas on how to make this PR faster which I'll share shortly
|
I have also created a PR to try and get the performance back |
| let input = arrow::compute::cast(array, &DataType::Int64)?; | ||
| let array = arrow::compute::kernels::numeric::div(&input, &unit)?; | ||
| let array = arrow::compute::kernels::numeric::mul(&array, &unit)?; | ||
| // For timestamps before 1970-01-01T00:00:00Z (negative values) | ||
| // it is possible that the truncated value is actually later | ||
| // than the original value. Correct any such cases by | ||
| // subtracting `unit`. | ||
| let too_late = arrow::compute::kernels::cmp::gt(&array, &input)?; | ||
| let array = if too_late.true_count() > 0 { | ||
| let earlier = arrow::compute::kernels::numeric::sub(&array, &unit)?; | ||
| arrow::compute::kernels::zip::zip(&too_late, &earlier, &array)? | ||
| } else { | ||
| array | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in scalar terms, what we're computing it
value - (value floor mod unit)
can we maybe express this logic directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in rust terms "floor mod" is probably rem_euclid
https://doc.rust-lang.org/std/primitive.i64.html#method.rem_euclid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we maybe express this logic directly
That is an excellent point -- I think we could use the code in #18360 to do so directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#18360 looks more complicated to me.
We can fix semantics fix without that, or with that. Either way works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#18360 turns unit into a primitive, which really helps write the logic value - (value floor mod unit). it can become a one-step vectorized operation. that's the part of that PR we should copy over here.
#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up
Agreed -- I will work on that as a follow up
| SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00') | ||
| FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's test the whole lot:
| SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00') | |
| FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d); | |
| select d as datetime, | |
| DATE_TRUNC('year', d) as year, | |
| DATE_TRUNC('quarter', d) as quarter, | |
| DATE_TRUNC('month', d) as month, | |
| DATE_TRUNC('week', d) as week, | |
| DATE_TRUNC('day', d) as day, | |
| DATE_TRUNC('hour', d) as hour, | |
| DATE_TRUNC('minute', d) as minute, | |
| DATE_TRUNC('second', d) as second | |
| DATE_TRUNC('microsecond', d) as microsecond | |
| DATE_TRUNC('millisecond', d) as millisecond | |
| from (values (timestamp '1900-06-15 07:31:23'), | |
| (timestamp '1970-01-01 00:00:00'), | |
| (timestamp '2024-12-31 23:39:01')) as T(d) |
|
🤖 |
|
🤖: Benchmark completed Details
|
Add some additional test cases to the DATE_TRUNC logic tests. These were suggested by @sadboy.
|
Running a totally unscientific benchmark on my laptop suggests that the newer version based on review suggestions is significantly faster than main. |
Sweet! I have queued up a benchmark to reproduce |
| array | ||
| .values() | ||
| .iter() | ||
| .map(|v| *v - i64::rem_euclid(*v, unit)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty fancy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Another example for LLVM understand SIMD better than developers 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, thank you!
|
🤖 |
|
🤖: Benchmark completed Details
|
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #. ## Rationale for this change Found when was testing #18356 ``` > select date_trunc('YY', now()); Execution error: Unsupported date_trunc granularity: yy ``` Which is confusing, I would like to get a list of supported values <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
DATE_TRUNCexpression regression for values before epoch #18334.Rationale for this change
What changes are included in this PR?
The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
Running the date_trunc benchmark suggests this fix introduces an ~9% performance cost.
Are these changes tested?
Yes, an SLT is added based on the issue.
Are there any user-facing changes?