Contributor

@colin-ho colin-ho commented Oct 29, 2025

Changes Made

This PR changes our estimate for the JSON inflation factor based on a real workload. It also lets users configure the estimated inflation factor via config.

This required threading the execution config through to the optimizer.
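
To make the effect of the new default concrete, here is a minimal sketch of the arithmetic. It assumes the in-memory estimate is simply the on-disk size multiplied by the inflation factor; the function name `estimate_in_memory_size_bytes` here is an illustrative stand-in, not Daft's actual implementation. The values 0.5 (old default) and 0.25 (new default) are the ones quoted in the review summary on this PR.

```python
def estimate_in_memory_size_bytes(size_on_disk_bytes: int, inflation_factor: float) -> int:
    """Hypothetical model: scale the on-disk size by the inflation factor."""
    return int(size_on_disk_bytes * inflation_factor)


ONE_GIB = 1024 ** 3
MIB = 1024 ** 2

# Same 1 GiB JSON file, estimated with the old and the new default factor.
old_estimate = estimate_in_memory_size_bytes(ONE_GIB, 0.5)
new_estimate = estimate_in_memory_size_bytes(ONE_GIB, 0.25)

print(old_estimate // MIB, new_estimate // MIB)  # 512 256
```

Under this assumption, halving the factor halves every size estimate the optimizer sees for JSON sources, which is why the value matters for decisions such as join ordering.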

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

@github-actions github-actions bot added the fix label Oct 29, 2025

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 69.36937% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.07%. Comparing base (668fd5b) to head (0fab4b6).
⚠️ Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...src/optimization/rules/reorder_joins/join_graph.rs | 66.66% | 8 Missing ⚠️ |
| src/common/daft-config/src/lib.rs | 58.82% | 7 Missing ⚠️ |
| src/daft-logical-plan/src/builder/mod.rs | 45.45% | 6 Missing ⚠️ |
| daft/dataframe/display.py | 0.00% | 3 Missing ⚠️ |
| src/common/daft-config/src/python.rs | 57.14% | 3 Missing ⚠️ |
| ...l-plan/src/optimization/rules/enrich_with_stats.rs | 72.72% | 3 Missing ⚠️ |
| ...daft-physical-plan/src/physical_planner/planner.rs | 0.00% | 3 Missing ⚠️ |
| daft/dataframe/dataframe.py | 83.33% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5461      +/-   ##
==========================================
- Coverage   71.72%   70.07%   -1.65%     
==========================================
  Files         997      996       -1     
  Lines      126858   128630    +1772     
==========================================
- Hits        90984    90138     -846     
- Misses      35874    38492    +2618     
| Files with missing lines | Coverage Δ |
|---|---|
| daft/context.py | 87.64% <ø> (ø) |
| daft/logical/builder.py | 92.57% <100.00%> (ø) |
| daft/runners/native_runner.py | 75.36% <100.00%> (ø) |
| daft/runners/ray_runner.py | 55.32% <100.00%> (ø) |
| src/daft-logical-plan/src/logical_plan.rs | 72.54% <100.00%> (-2.87%) ⬇️ |
| src/daft-logical-plan/src/ops/source.rs | 91.83% <100.00%> (-4.09%) ⬇️ |
| ...rc/daft-logical-plan/src/optimization/optimizer.rs | 95.23% <100.00%> (+1.19%) ⬆️ |
| ...tion/rules/reorder_joins/brute_force_join_order.rs | 99.80% <100.00%> (ø) |
| ...l-plan/src/optimization/rules/reorder_joins/mod.rs | 76.00% <100.00%> (+3.27%) ⬆️ |
| .../rules/reorder_joins/naive_left_deep_join_order.rs | 94.54% <100.00%> (ø) |

... and 9 more

... and 162 files with indirect coverage changes


@colin-ho colin-ho marked this pull request as ready for review October 31, 2025 02:15
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR lowers the default JSON inflation factor from 0.5 to 0.25 based on real-world workload testing and makes it configurable via API and environment variables. It threads the execution config through the optimizer to ensure size estimations use the correct inflation factors during query planning.

Key Changes:

  • Changed default json_inflation_factor from 0.5 to 0.25 in DaftExecutionConfig
  • Added json_inflation_factor parameter to Python API (set_execution_config)
  • Added environment variable support: DAFT_JSON_INFLATION_FACTOR, DAFT_PARQUET_INFLATION_FACTOR, DAFT_CSV_INFLATION_FACTOR
  • Fixed JSON file size estimation to use json_inflation_factor instead of csv_inflation_factor
  • Threaded execution_config through optimizer → EnrichWithStatsSourceScanTask for accurate size estimates
  • Added test coverage for JSON inflation factor configuration
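
As a rough illustration of the environment-variable override listed above, the sketch below reads `DAFT_JSON_INFLATION_FACTOR` and falls back to the 0.25 default. The helper name `json_inflation_factor_from_env` is hypothetical, and the fallback on unparsable input is an assumption; this PR page does not show how Daft itself parses or validates the variable.

```python
import os
from typing import Mapping

DEFAULT_JSON_INFLATION_FACTOR = 0.25  # new default stated in this PR


def json_inflation_factor_from_env(environ: Mapping[str, str] = os.environ) -> float:
    """Hypothetical helper: resolve the JSON inflation factor from the environment."""
    raw = environ.get("DAFT_JSON_INFLATION_FACTOR")
    if raw is None:
        return DEFAULT_JSON_INFLATION_FACTOR
    try:
        return float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_JSON_INFLATION_FACTOR


print(json_inflation_factor_from_env({"DAFT_JSON_INFLATION_FACTOR": "0.4"}))  # 0.4
```

Passing the environment as a mapping keeps the helper easy to exercise without mutating the real process environment.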

Issues Found:

  • Critical bug in optimizer.rs:221 where ReorderJoins doesn't receive the passed config, using default values instead

Confidence Score: 3/5

  • PR has a critical logic bug preventing join reordering from using configured inflation factors
  • The implementation correctly adds the new config parameter and threads it through most of the codebase, but there's a critical bug where ReorderJoins ignores the passed config parameter and uses default values. This means join reordering decisions won't use the configured inflation factors, defeating part of the PR's purpose. The rest of the changes are sound.
  • src/daft-logical-plan/src/optimization/optimizer.rs requires immediate attention to fix the config threading bug

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/daft-logical-plan/src/optimization/optimizer.rs | 2/5 | Added execution config threading through optimizer; critical bug where ReorderJoins doesn't receive config |
| src/common/daft-config/src/lib.rs | 5/5 | Changed default json_inflation_factor from 0.5 to 0.25; added env var support for inflation factors |
| src/daft-scan/src/lib.rs | 5/5 | Fixed JSON file size estimation to use json_inflation_factor instead of csv_inflation_factor |
| tests/test_size_estimations.py | 5/5 | Added new test for json_inflation_factor configuration |
| src/daft-logical-plan/src/builder/mod.rs | 4/5 | Updated optimize methods to accept and pass execution config to optimizer and join reordering |

Sequence Diagram

sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant Optimizer
    participant EnrichWithStats
    participant ReorderJoins
    participant JoinGraphBuilder
    participant Source
    participant ScanTask
    
    User->>DataFrame: set_execution_config(json_inflation_factor=0.25)
    User->>DataFrame: read_json() / optimize()
    DataFrame->>LogicalPlanBuilder: optimize(execution_config)
    LogicalPlanBuilder->>Optimizer: optimize with execution_config
    
    Note over Optimizer: Default optimizations (pushdowns, etc.)
    
    Optimizer->>EnrichWithStats: enrich_with_stats(Some(execution_config))
    EnrichWithStats->>Source: with_materialized_stats(cfg)
    Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
    Note over ScanTask: Uses cfg.json_inflation_factor=0.25<br/>instead of 0.5
    ScanTask-->>Source: estimated size
    Source-->>EnrichWithStats: stats with updated size
    EnrichWithStats-->>Optimizer: enriched plan
    
    Optimizer->>ReorderJoins: reorder_joins(Some(execution_config))
    Note over ReorderJoins: BUG: Uses default config<br/>instead of passed config
    ReorderJoins->>JoinGraphBuilder: from_logical_plan(plan, cfg)
    JoinGraphBuilder->>Source: with_materialized_stats(cfg)
    Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
    ScanTask-->>Source: estimated size
    Source-->>JoinGraphBuilder: stats
    JoinGraphBuilder-->>ReorderJoins: reordered joins
    ReorderJoins->>EnrichWithStats: enrich_with_stats(Some(execution_config))
    EnrichWithStats-->>Optimizer: final enriched plan
    
    Optimizer-->>LogicalPlanBuilder: optimized plan
    LogicalPlanBuilder-->>DataFrame: optimized plan
    DataFrame-->>User: result

21 files reviewed, 1 comment


pub fn reorder_joins(mut self, cfg: Option<Arc<DaftExecutionConfig>>) -> Self {
    self.rule_batches.push(RuleBatch::new(
        vec![
            Box::new(ReorderJoins::new()),

logic: ReorderJoins::new() ignores the cfg parameter passed to the function; it should use with_config instead.

Suggested change
- Box::new(ReorderJoins::new()),
+ Box::new(ReorderJoins::with_config(cfg.clone().unwrap_or_default())),
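
The difference between the two constructor calls can be modeled outside Rust. The Python sketch below is only an analogy for the review comment: `ExecutionConfig`, `reorder_joins_buggy`, and `reorder_joins_fixed` are hypothetical names, and the class is a stand-in for the real `ReorderJoins` rule rather than a port of it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutionConfig:
    """Hypothetical stand-in for the execution config."""
    json_inflation_factor: float = 0.25


class ReorderJoins:
    """Hypothetical stand-in for the join-reordering rule."""

    def __init__(self, cfg: ExecutionConfig) -> None:
        self.cfg = cfg

    @classmethod
    def new(cls) -> "ReorderJoins":
        # Always builds the rule with default settings.
        return cls(ExecutionConfig())

    @classmethod
    def with_config(cls, cfg: ExecutionConfig) -> "ReorderJoins":
        return cls(cfg)


def reorder_joins_buggy(cfg: Optional[ExecutionConfig]) -> ReorderJoins:
    # Mirrors the flagged line: cfg is accepted but never used.
    return ReorderJoins.new()


def reorder_joins_fixed(cfg: Optional[ExecutionConfig]) -> ReorderJoins:
    # Mirrors the suggestion: use cfg, falling back to defaults when absent.
    return ReorderJoins.with_config(cfg if cfg is not None else ExecutionConfig())


user_cfg = ExecutionConfig(json_inflation_factor=0.125)
print(reorder_joins_buggy(user_cfg).cfg.json_inflation_factor)  # 0.25
print(reorder_joins_fixed(user_cfg).cfg.json_inflation_factor)  # 0.125
```

In the buggy variant the user's factor never reaches the rule, so any size estimate made during join reordering would silently use the default.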

Collaborator

@desmondcheongzx desmondcheongzx left a comment


Oh hm do we really want to pass execution configs into the optimizer?

I'm also kind of curious, why were we able to handle CSV and Parquet inflation factors without having to do this?

@colin-ho
Contributor Author

> Oh hm do we really want to pass execution configs into the optimizer?
>
> I'm also kind of curious, why were we able to handle CSV and Parquet inflation factors without having to do this?

We weren't. Currently the optimizer has an enrich_with_stats rule, which internally calls with_materialized_stats.

The with_materialized_stats implementation for source nodes calls functions like approx_num_rows or estimate_size_bytes, which accept an execution config in order to get the estimated inflation factor. However, the rule currently passes None as the execution config because it doesn't have one, so the default estimate is always used and users cannot configure it.
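
The behavior described above can be sketched as follows. This is a simplified model, not Daft's code: the signatures of `estimate_size_bytes` and the two `enrich_with_stats_*` helpers are hypothetical, and the estimate is again assumed to be on-disk size times the factor.

```python
from typing import Optional

DEFAULT_JSON_INFLATION_FACTOR = 0.25  # built-in default used when no config is given


def estimate_size_bytes(size_on_disk: int, json_inflation_factor: Optional[float] = None) -> int:
    """Hypothetical estimator: falls back to the default when no factor is supplied."""
    factor = DEFAULT_JSON_INFLATION_FACTOR if json_inflation_factor is None else json_inflation_factor
    return int(size_on_disk * factor)


def enrich_with_stats_before(size_on_disk: int, user_factor: Optional[float]) -> int:
    # Before this PR: the rule has no config to hand over, so it passes None
    # and the user's setting is never consulted.
    return estimate_size_bytes(size_on_disk, None)


def enrich_with_stats_after(size_on_disk: int, user_factor: Optional[float]) -> int:
    # After this PR: the configured factor is threaded through to the estimator.
    return estimate_size_bytes(size_on_disk, user_factor)


print(enrich_with_stats_before(1_000_000, 0.125))  # 250000
print(enrich_with_stats_after(1_000_000, 0.125))   # 125000
```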
