Skip to content

Optimize casts at the logical layer #18326

@adriangb

Description

@adriangb

Is your feature request related to a problem or challenge?

COPY (SELECT '2025-10-25T00:15:00Z'::timestamptz AS ts) TO 'ts.parquet';
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'ts.parquet';
EXPLAIN
SELECT * FROM t WHERE ts = '1761630189642';

Results in

+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ ts = CAST(1761630189642 AS│ |
|               | │    Timestamp(Nanosecond,  │ |
|               | │      Some("+00:00")))     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          1 -> 12          │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │    RoundRobinBatch(12)    │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          files: 1         │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │ ts = CAST(1761630189642 AS│ |
|               | │    Timestamp(Nanosecond,  │ |
|               | │      Some("+00:00")))     │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

Note that we keep the cast all the way down into the physical plan.

This makes it harder to catch and surface errors. For example if you remove the EXPLAIN on this query you get:

Arrow error: Parser error: Error parsing timestamp from '1761630189642': error parsing date

This is actually an error that happens inside of ParquetOpener(ArrowRowFilter(ParquetRowFilter(...))). If we applied casts to literals during logical plan optimization we would:

  1. Catch the error earlier and within DataFusion itself before we start calling into arrow-rs and it calls us back.
  2. Possibly be more efficient (I guess the cast gets evaluated for each batch?).

Describe the solution you'd like

Resolve literal casts during logical plan optimizations.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions