Conversation

@mhilton
Contributor

@mhilton mhilton commented Oct 29, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
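To make the failure mode concrete, here is a minimal scalar sketch (not the PR's array code). It assumes timestamps are plain `i64` counts of some time unit, and the function names `trunc_toward_zero` and `trunc_with_fixup` are hypothetical, chosen only for illustration. Rust's integer `/` rounds toward zero, so for negative inputs the divide-then-multiply result can land later than the input.

```rust
/// Truncate by dividing and multiplying back. Integer `/` in Rust rounds
/// toward zero, so a negative `value` that is not a multiple of `unit`
/// is rounded up (later in time) rather than down.
fn trunc_toward_zero(value: i64, unit: i64) -> i64 {
    (value / unit) * unit
}

/// Scalar analogue of the compensation described above: if the truncated
/// value ended up later than the input, step back by one `unit`.
/// Assumes `unit > 0`; overflow near `i64::MIN` is not handled here.
fn trunc_with_fixup(value: i64, unit: i64) -> i64 {
    let truncated = trunc_toward_zero(value, unit);
    if truncated > value {
        truncated - unit
    } else {
        truncated
    }
}
```

With seconds as the unit, `-90` is 1969-12-31T23:58:30Z. `trunc_toward_zero(-90, 60)` gives `-60` (23:59:00, later than the input), while `trunc_with_fixup(-90, 60)` gives `-120` (23:58:00).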

Running the date_trunc benchmark suggests this fix introduces a ~9% performance cost.

date_trunc_minute_1000  time:   [1.7424 µs 1.7495 µs 1.7583 µs]
                        change: [+7.9289% +8.5950% +9.1955%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Are these changes tested?

Yes, an SLT is added based on the issue.

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Oct 29, 2025
@alamb
Contributor

alamb commented Oct 29, 2025

@mhilton notes this is very similar to the fix in

@alamb
Contributor

alamb commented Oct 29, 2025

Running the date_trunc benchmark suggests this fix introduces a ~9% performance cost.

I think we need to have correctness before performance. I'll run the benchmarks on this PR to be sure

Contributor

@alamb alamb left a comment

Thank you @mhilton

I kicked off some benchmarks for this PR as well, so we'll get a second opinion.

I also have some ideas on how to make this PR faster which I'll share shortly

@alamb
Contributor

alamb commented Oct 29, 2025

cc @waynexia @sadboy

@alamb
Contributor

alamb commented Oct 29, 2025

I have also created a PR to try and get the performance back

Comment on lines 485 to 498
let input = arrow::compute::cast(array, &DataType::Int64)?;
let array = arrow::compute::kernels::numeric::div(&input, &unit)?;
let array = arrow::compute::kernels::numeric::mul(&array, &unit)?;
// For timestamps before 1970-01-01T00:00:00Z (negative values)
// it is possible that the truncated value is actually later
// than the original value. Correct any such cases by
// subtracting `unit`.
let too_late = arrow::compute::kernels::cmp::gt(&array, &input)?;
let array = if too_late.true_count() > 0 {
    let earlier = arrow::compute::kernels::numeric::sub(&array, &unit)?;
    arrow::compute::kernels::zip::zip(&too_late, &earlier, &array)?
} else {
    array
};
Member

in scalar terms, what we're computing is

value - (value floor mod unit)

can we maybe express this logic directly?

Member

in rust terms "floor mod" is probably rem_euclid

https://doc.rust-lang.org/std/primitive.i64.html#method.rem_euclid
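The difference between the `%` operator and `rem_euclid` is what makes the direct formula work. A minimal sketch follows, assuming `unit > 0` (both operations panic when `unit` is zero). The helper names `remainders` and `floor_trunc` are hypothetical.

```rust
/// Return `(value % unit, value.rem_euclid(unit))` for comparison.
/// `%` keeps the sign of `value`; `rem_euclid` is always in `0..unit`
/// when `unit` is positive.
fn remainders(value: i64, unit: i64) -> (i64, i64) {
    (value % unit, value.rem_euclid(unit))
}

/// `value - (value floor mod unit)`: rounds down to a multiple of `unit`
/// for negative and non-negative inputs alike. Assumes `unit > 0`.
fn floor_trunc(value: i64, unit: i64) -> i64 {
    value - value.rem_euclid(unit)
}
```

For example, `remainders(-90, 60)` is `(-30, 30)`, so `floor_trunc(-90, 60)` is `-90 - 30 = -120`, with no separate comparison or correction step.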

Contributor

can we maybe express this logic directly

That is an excellent point -- I think we could use the code in #18360 to do so directly

Member

#18360 looks more complicated to me.
We can fix the semantics without that, or with that. Either way works.

Member

#18360 turns unit into a primitive, which really helps write the logic value - (value floor mod unit). it can become a one-step vectorized operation. that's the part of that PR we should copy over here.

#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up
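The point about a primitive `unit` can be illustrated without Arrow. The sketch below uses plain slices and `Vec`s as stand-ins for Arrow arrays (an assumption; null handling and the real kernels are not modeled, and unlike the PR it always builds the `earlier` buffer). The names `multi_pass` and `single_pass` are hypothetical. The first function mirrors the shape of the kernel-by-kernel version, where every step materializes an intermediate buffer; the second does the whole computation in one pass.

```rust
/// Kernel-style version: each step allocates a full intermediate buffer.
fn multi_pass(values: &[i64], unit: i64) -> Vec<i64> {
    let divided: Vec<i64> = values.iter().map(|&v| v / unit).collect();
    let multiplied: Vec<i64> = divided.iter().map(|&q| q * unit).collect();
    // Mark entries where truncation toward zero moved the value later.
    let too_late: Vec<bool> = multiplied
        .iter()
        .zip(values)
        .map(|(&t, &v)| t > v)
        .collect();
    let earlier: Vec<i64> = multiplied.iter().map(|&t| t - unit).collect();
    // Pick the corrected value wherever the mask is set.
    too_late
        .iter()
        .zip(earlier.iter().zip(&multiplied))
        .map(|(&late, (&e, &t))| if late { e } else { t })
        .collect()
}

/// One-step version: with `unit` as a plain `i64`, the whole computation
/// is a single map over the values and allocates only the output.
fn single_pass(values: &[i64], unit: i64) -> Vec<i64> {
    values.iter().map(|&v| v - v.rem_euclid(unit)).collect()
}
```

Both functions return the same result for positive `unit`; the second one simply avoids the four intermediate buffers.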

Contributor

#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up

Agreed -- I will work on that as a follow up

Comment on lines 1692 to 1693
SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00')
FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d);
Contributor

@sadboy sadboy Oct 30, 2025

Let's test the whole lot:

Suggested change
SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00')
FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d);
select d as datetime,
DATE_TRUNC('year', d) as year,
DATE_TRUNC('quarter', d) as quarter,
DATE_TRUNC('month', d) as month,
DATE_TRUNC('week', d) as week,
DATE_TRUNC('day', d) as day,
DATE_TRUNC('hour', d) as hour,
DATE_TRUNC('minute', d) as minute,
DATE_TRUNC('second', d) as second,
DATE_TRUNC('microsecond', d) as microsecond,
DATE_TRUNC('millisecond', d) as millisecond
from (values (timestamp '1900-06-15 07:31:23'),
(timestamp '1970-01-01 00:00:00'),
(timestamp '2024-12-31 23:39:01')) as T(d)
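As a rough cross-check of what the fixed-width granularities in this query should produce, the sketch below works in whole seconds since the epoch and assumes UTC. The constants and helper names are hypothetical, and calendar-based units (year, quarter, month, week) are deliberately not covered. It relies on 1900-06-15 being 25,402 days before 1970-01-01 and 2024-12-31 being 20,088 days after it.

```rust
const SECS_PER_MINUTE: i64 = 60;
const SECS_PER_HOUR: i64 = 3_600;
const SECS_PER_DAY: i64 = 86_400;

/// Round `ts` down to a multiple of `unit` (assumes `unit > 0`).
fn floor_trunc(ts: i64, unit: i64) -> i64 {
    ts - ts.rem_euclid(unit)
}

/// 1900-06-15 07:31:23 UTC as seconds since the epoch (negative).
fn ts_1900() -> i64 {
    -25_402 * SECS_PER_DAY + 7 * SECS_PER_HOUR + 31 * SECS_PER_MINUTE + 23
}

/// 2024-12-31 23:39:01 UTC as seconds since the epoch.
fn ts_2024() -> i64 {
    20_088 * SECS_PER_DAY + 23 * SECS_PER_HOUR + 39 * SECS_PER_MINUTE + 1
}
```

Truncating `ts_1900()` to the hour should give 1900-06-15 07:00:00, that is `-25_402 * SECS_PER_DAY + 7 * SECS_PER_HOUR`, whereas divide-then-multiply with truncating division would give 08:00:00.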

@alamb
Contributor

alamb commented Oct 30, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-18334 (f70abf1) to 6cc73fa diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-18334
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 30, 2025

🤖: Benchmark completed

Details

group                     issue-18334                            main
-----                     -----------                            ----
date_trunc_minute_1000    1.20      5.8±0.02µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec

Add some additional test cases to the DATE_TRUNC logic tests. These
were suggested by @sadboy.
Apply the review suggestions from @alamb to avoid unnecessary
allocations, and from @findepi to perform the calculation directly using
rem_euclid.

This is a fairly naive iterator-based implementation, but it is showing
a significant speed improvement, probably due to avoiding memory
allocations.
@mhilton
Contributor Author

mhilton commented Oct 30, 2025

Running a totally unscientific benchmark on my laptop suggests that the newer version based on review suggestions is significantly faster than main.

date_trunc_minute_1000  time:   [643.37 ns 644.49 ns 645.64 ns]
                        change: [-59.158% -59.017% -58.870%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

@alamb
Contributor

alamb commented Oct 30, 2025

Running a totally unscientific benchmark on my laptop suggests that the newer version based on review suggestions is significantly faster than main.

Sweet! I have queued up a benchmark to reproduce

array
    .values()
    .iter()
    .map(|v| *v - i64::rem_euclid(*v, unit)),
Contributor

this is pretty fancy

Member

Interesting. Another example of LLVM understanding SIMD better than developers 🙂‍↕️

Member

@waynexia waynexia left a comment

Looking great, thank you!

@alamb
Contributor

alamb commented Oct 30, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-18334 (587856d) to 6cc73fa diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-18334
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 30, 2025

🤖: Benchmark completed

Details

group                     issue-18334                            main
-----                     -----------                            ----
date_trunc_minute_1000    1.00      2.9±0.01µs        ? ?/sec    1.66      4.8±0.01µs        ? ?/sec

@alamb
Contributor

alamb commented Oct 30, 2025

Fixing bug and making it 60% faster. Great team effort. Thank you @mhilton @findepi @waynexia

@alamb alamb added this pull request to the merge queue Oct 30, 2025
Merged via the queue into apache:main with commit 52894db Oct 30, 2025
28 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2025
## Which issue does this PR close?

- Closes #.

## Rationale for this change
Found while testing #18356

```
> select date_trunc('YY', now());
Execution error: Unsupported date_trunc granularity: yy

```

This is confusing; I would like the error to list the supported values.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Non-constant DATE_TRUNC expression regression for values before epoch

5 participants