fix: correct date_trunc for times before the epoch #18356

mhilton · 2025-10-29T15:51:10Z

Which issue does this PR close?

Closes Non-constant DATE_TRUNC expression regression for values before epoch #18334.

Rationale for this change

What changes are included in this PR?

The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
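As a minimal scalar sketch of the failure mode (not the Arrow kernel code itself): integer division in Rust rounds toward zero, so a divide-then-multiply truncation moves negative values forward in time. The helper name `truncate_toward_zero` and the 60-second unit are hypothetical, chosen only for illustration.

```rust
/// Truncate `value` to a multiple of `unit` with a plain
/// divide-then-multiply. Rust's integer `/` rounds toward zero.
fn truncate_toward_zero(value: i64, unit: i64) -> i64 {
    (value / unit) * unit
}

fn main() {
    let minute = 60; // seconds per minute (hypothetical unit for this sketch)

    // 90 seconds after the epoch truncates down to 60, as expected.
    assert_eq!(truncate_toward_zero(90, minute), 60);

    // 90 seconds before the epoch should truncate down to -120,
    // but rounding toward zero yields -60, which is later than the input.
    assert_eq!(truncate_toward_zero(-90, minute), -60);
    assert!(truncate_toward_zero(-90, minute) > -90);
}
```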

Running the date_trunc benchmark suggests this fix introduces an ~9% performance cost.

date_trunc_minute_1000  time:   [1.7424 µs 1.7495 µs 1.7583 µs]
                        change: [+7.9289% +8.5950% +9.1955%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Are these changes tested?

Yes, an SLT is added based on the issue.

Are there any user-facing changes?

The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.

alamb · 2025-10-29T16:44:03Z

@mhilton notes this is very similar to the fix in

fix: date_bin() on timstamps before 1970 #13204

alamb · 2025-10-29T16:45:19Z

Running the date_trunc benchmark suggests this fix introduces an ~9% performance cost.

I think we need to have correctness before performance. I'll run the benchmarks on this PR to be sure

alamb

Thank you @mhilton

I kicked off some benchmarks for this PR as well, so we'll get a second opinion.

I also have some ideas on how to make this PR faster which I'll share shortly

alamb · 2025-10-29T18:31:27Z

cc @waynexia @sadboy

alamb · 2025-10-29T18:31:39Z

I have also created a PR to try and get the performance back

Optimize date_trunc function by avoiding allocations #18360

findepi · 2025-10-29T20:32:18Z

datafusion/functions/src/datetime/date_trunc.rs

+        let input = arrow::compute::cast(array, &DataType::Int64)?;
+        let array = arrow::compute::kernels::numeric::div(&input, &unit)?;
        let array = arrow::compute::kernels::numeric::mul(&array, &unit)?;
+        // For timestamps before 1970-01-01T00:00:00Z (negative values)
+        // it is possible that the truncated value is actually later
+        // than the original value. Correct any such cases by
+        // subtracting `unit`.
+        let too_late = arrow::compute::kernels::cmp::gt(&array, &input)?;
+        let array = if too_late.true_count() > 0 {
+            let earlier = arrow::compute::kernels::numeric::sub(&array, &unit)?;
+            arrow::compute::kernels::zip::zip(&too_late, &earlier, &array)?
+        } else {
+            array
+        };


in scalar terms, what we're computing is

value - (value floor mod unit)

can we maybe express this logic directly?

in rust terms "floor mod" is probably rem_euclid

https://doc.rust-lang.org/std/primitive.i64.html#method.rem_euclid
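A scalar sketch of why the two formulations should agree, assuming a positive `unit`. The function names are hypothetical; the first mirrors the compare-and-subtract correction in the diff above, the second is the direct `value - (value floor mod unit)` form.

```rust
/// Divide-then-multiply followed by the correction from the diff:
/// if the result is later than the input, step back by one `unit`.
fn truncate_with_correction(value: i64, unit: i64) -> i64 {
    let truncated = (value / unit) * unit;
    if truncated > value { truncated - unit } else { truncated }
}

/// The same result expressed directly as `value - (value floor mod unit)`.
/// For a positive `unit`, `rem_euclid` is always in `0..unit`.
fn truncate_floor(value: i64, unit: i64) -> i64 {
    value - value.rem_euclid(unit)
}

fn main() {
    let unit = 3_600; // one hour in seconds, chosen only for illustration
    for value in [-7_201, -7_200, -3_599, -1, 0, 1, 3_599, 3_600, 7_201] {
        assert_eq!(truncate_with_correction(value, unit), truncate_floor(value, unit));
    }
    // One second before the epoch belongs to the hour starting at -3600.
    assert_eq!(truncate_floor(-1, unit), -3_600);
}
```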

can we maybe express this logic directly

That is an excellent point -- I think we could use the code in #18360 to do so directly

#18360 looks more complicated to me.
We can fix the semantics without that, or with that. Either way works.

#18360 turns unit into a primitive, which really helps write the logic value - (value floor mod unit). it can become a one-step vectorized operation. that's the part of that PR we should copy over here.

#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up

#18360 goes further with try_unary_mut_or_clone. that's the part orthogonal to semantic fix and can be a follow-up

Agreed -- I will work on that as a follow up

sadboy · 2025-10-30T00:00:53Z

datafusion/sqllogictest/test_files/timestamps.slt

+SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00')
+FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d);


Let's test the whole lot:

Suggested change

-SELECT d, DATE_TRUNC('hour', d), DATE_TRUNC('hour', TIMESTAMP '1900-06-15 07:09:00')
-FROM (VALUES (TIMESTAMP '1900-06-15 07:09:00')) AS t(d);
+select d as datetime,
+       DATE_TRUNC('year', d) as year,
+       DATE_TRUNC('quarter', d) as quarter,
+       DATE_TRUNC('month', d) as month,
+       DATE_TRUNC('week', d) as week,
+       DATE_TRUNC('day', d) as day,
+       DATE_TRUNC('hour', d) as hour,
+       DATE_TRUNC('minute', d) as minute,
+       DATE_TRUNC('second', d) as second,
+       DATE_TRUNC('microsecond', d) as microsecond,
+       DATE_TRUNC('millisecond', d) as millisecond
+from (values (timestamp '1900-06-15 07:31:23'),
+             (timestamp '1970-01-01 00:00:00'),
+             (timestamp '2024-12-31 23:39:01')) as T(d)
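For reference, a rough property check over the same three timestamps, written as a standalone Rust sketch rather than an SLT. The epoch-second values are hand-computed and should be treated as assumptions, `truncate_floor` is a hypothetical helper, and only the fixed-width granularities (second, minute, hour, day) are covered, since calendar units such as month or year are not a constant number of seconds.

```rust
/// Floor `value` to a multiple of `unit` (both in seconds here).
fn truncate_floor(value: i64, unit: i64) -> i64 {
    value - value.rem_euclid(unit)
}

fn main() {
    // Hand-computed epoch seconds (assumptions of this sketch) for
    // 1900-06-15 07:31:23, 1970-01-01 00:00:00 and 2024-12-31 23:39:01 UTC.
    let timestamps: [i64; 3] = [-2_194_705_717, 0, 1_735_688_341];
    // Fixed-width granularities only: second, minute, hour, day.
    let units: [i64; 4] = [1, 60, 3_600, 86_400];

    for &ts in &timestamps {
        for &unit in &units {
            let truncated = truncate_floor(ts, unit);
            // Truncation never moves a timestamp later ...
            assert!(truncated <= ts);
            // ... never moves it by a full unit or more ...
            assert!(ts - truncated < unit);
            // ... and always lands on a unit boundary.
            assert_eq!(truncated % unit, 0);
        }
    }
}
```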

alamb · 2025-10-30T08:19:29Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-18334 (f70abf1) to 6cc73fa diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-18334
Results will be posted here when complete

alamb · 2025-10-30T08:24:44Z

🤖: Benchmark completed

Details

group                     issue-18334                            main
-----                     -----------                            ----
date_trunc_minute_1000    1.20      5.8±0.02µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec

@sadboy

Add some additional test cases to the DATE_TRUNC logic tests. These were suggested by @sadboy.

@alamb

Apply the review suggestions from @alamb to avoid unnecessary allocations, and from @findepi to perform the calculation directly using rem_euclid. This is a fairly naive iterator-based implementation, but it shows a significant speed improvement, probably due to avoiding memory allocations.

mhilton · 2025-10-30T10:44:03Z

Running a totally unscientific benchmark on my laptop suggests that the newer version based on review suggestions is significantly faster than main.

date_trunc_minute_1000  time:   [643.37 ns 644.49 ns 645.64 ns]
                        change: [-59.158% -59.017% -58.870%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

alamb · 2025-10-30T10:47:25Z

Running a totally unscientific benchmark on my laptop suggests that the newer version based on review suggestions is significantly faster than main.

Sweet! I have queued up a benchmark to reproduce

alamb · 2025-10-30T10:48:37Z

datafusion/functions/src/datetime/date_trunc.rs

+            array
+                .values()
+                .iter()
+                .map(|v| *v - i64::rem_euclid(*v, unit)),


this is pretty fancy

Interesting. Another example of LLVM understanding SIMD better than developers 🙂‍↕️
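A self-contained sketch of the same iterator shape, assuming a plain `&[i64]` slice stands in for the Arrow array's value buffer. Null handling and timestamp types are omitted, and `truncate_all` is a hypothetical name. Because `unit` is a primitive `i64`, the closure is a single arithmetic step per element.

```rust
/// Floor every value in `values` to a multiple of `unit`.
fn truncate_all(values: &[i64], unit: i64) -> Vec<i64> {
    values
        .iter()
        .map(|v| *v - i64::rem_euclid(*v, unit))
        .collect()
}

fn main() {
    // One minute expressed in nanoseconds, chosen only for illustration.
    let minute_in_nanos = 60_000_000_000_i64;
    let input = [-1, 0, 61_000_000_000, -60_000_000_001];
    let output = truncate_all(&input, minute_in_nanos);
    assert_eq!(
        output,
        vec![-60_000_000_000, 0, 60_000_000_000, -120_000_000_000]
    );
}
```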

waynexia

Looking great, thank you!

alamb · 2025-10-30T11:28:47Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue-18334 (587856d) to 6cc73fa diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=issue-18334
Results will be posted here when complete

alamb · 2025-10-30T11:33:24Z

🤖: Benchmark completed

Details

group                     issue-18334                            main
-----                     -----------                            ----
date_trunc_minute_1000    1.00      2.9±0.01µs        ? ?/sec    1.66      4.8±0.01µs        ? ?/sec

alamb · 2025-10-30T13:17:38Z

Fixing bug and making it 60% faster. Great team effort. Thank you @mhilton @findepi @waynexia

## Which issue does this PR close?

- Closes #.

## Rationale for this change

Found while testing #18356:

```
> select date_trunc('YY', now());
Execution error: Unsupported date_trunc granularity: yy
```

This is confusing; I would like to get a list of supported values.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Oct 29, 2025

fix: correct date_trunc for times before the epoch

f70abf1

The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.

mhilton force-pushed the issue-18334 branch from 6e3e069 to f70abf1 Compare October 29, 2025 15:59

alamb mentioned this pull request Oct 29, 2025

Non-constant DATE_TRUNC expression regression for values before epoch #18334

Closed

alamb approved these changes Oct 29, 2025

alamb mentioned this pull request Oct 29, 2025

Optimize date_trunc function by avoiding allocations #18360

Draft

alamb requested a review from waynexia October 29, 2025 18:28

findepi reviewed Oct 29, 2025

sadboy reviewed Oct 30, 2025

mhilton added 2 commits October 30, 2025 09:04

test: add additional date_trunc cases

0bd2f92

Add some additional test cases to the DATE_TRUNC logic tests. These were suggested by @sadboy.

alamb reviewed Oct 30, 2025

waynexia approved these changes Oct 30, 2025

alamb added this pull request to the merge queue Oct 30, 2025

Merged via the queue into apache:main with commit 52894db Oct 30, 2025
28 checks passed

comphead mentioned this pull request Oct 30, 2025

chore: use enum as date_trunc granularity #18390

Merged


5 participants
