-
Notifications
You must be signed in to change notification settings - Fork 328
fix: Lower json inflation factor #5461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5461 +/- ##
==========================================
- Coverage 71.72% 70.07% -1.65%
==========================================
Files 997 996 -1
Lines 126858 128630 +1772
==========================================
- Hits 90984 90138 -846
- Misses 35874 38492 +2618
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR lowers the default JSON inflation factor from 0.5 to 0.25 based on real-world workload testing and makes it configurable via API and environment variables. It threads the execution config through the optimizer to ensure size estimations use the correct inflation factors during query planning.
Key Changes:
- Changed default
json_inflation_factorfrom 0.5 to 0.25 inDaftExecutionConfig - Added
json_inflation_factorparameter to Python API (set_execution_config) - Added environment variable support:
DAFT_JSON_INFLATION_FACTOR,DAFT_PARQUET_INFLATION_FACTOR,DAFT_CSV_INFLATION_FACTOR - Fixed JSON file size estimation to use
json_inflation_factorinstead ofcsv_inflation_factor - Threaded
execution_configthrough optimizer →EnrichWithStats→Source→ScanTaskfor accurate size estimates - Added test coverage for JSON inflation factor configuration
Issues Found:
- Critical bug in
optimizer.rs:221whereReorderJoinsdoesn't receive the passed config, using default values instead
Confidence Score: 3/5
- PR has a critical logic bug preventing join reordering from using configured inflation factors
- The implementation correctly adds the new config parameter and threads it through most of the codebase, but there's a critical bug where ReorderJoins ignores the passed config parameter and uses default values. This means join reordering decisions won't use the configured inflation factors, defeating part of the PR's purpose. The rest of the changes are sound.
- src/daft-logical-plan/src/optimization/optimizer.rs requires immediate attention to fix the config threading bug
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/daft-logical-plan/src/optimization/optimizer.rs | 2/5 | Added execution config threading through optimizer; critical bug where ReorderJoins doesn't receive config |
| src/common/daft-config/src/lib.rs | 5/5 | Changed default json_inflation_factor from 0.5 to 0.25; added env var support for inflation factors |
| src/daft-scan/src/lib.rs | 5/5 | Fixed JSON file size estimation to use json_inflation_factor instead of csv_inflation_factor |
| tests/test_size_estimations.py | 5/5 | Added new test for json_inflation_factor configuration |
| src/daft-logical-plan/src/builder/mod.rs | 4/5 | Updated optimize methods to accept and pass execution config to optimizer and join reordering |
Sequence Diagram
sequenceDiagram
participant User
participant DataFrame
participant LogicalPlanBuilder
participant Optimizer
participant EnrichWithStats
participant ReorderJoins
participant JoinGraphBuilder
participant Source
participant ScanTask
User->>DataFrame: set_execution_config(json_inflation_factor=0.25)
User->>DataFrame: read_json() / optimize()
DataFrame->>LogicalPlanBuilder: optimize(execution_config)
LogicalPlanBuilder->>Optimizer: optimize with execution_config
Note over Optimizer: Default optimizations (pushdowns, etc.)
Optimizer->>EnrichWithStats: enrich_with_stats(Some(execution_config))
EnrichWithStats->>Source: with_materialized_stats(cfg)
Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
Note over ScanTask: Uses cfg.json_inflation_factor=0.25<br/>instead of 0.5
ScanTask-->>Source: estimated size
Source-->>EnrichWithStats: stats with updated size
EnrichWithStats-->>Optimizer: enriched plan
Optimizer->>ReorderJoins: reorder_joins(Some(execution_config))
Note over ReorderJoins: BUG: Uses default config<br/>instead of passed config
ReorderJoins->>JoinGraphBuilder: from_logical_plan(plan, cfg)
JoinGraphBuilder->>Source: with_materialized_stats(cfg)
Source->>ScanTask: estimate_in_memory_size_bytes(Some(cfg))
ScanTask-->>Source: estimated size
Source-->>JoinGraphBuilder: stats
JoinGraphBuilder-->>ReorderJoins: reordered joins
ReorderJoins->>EnrichWithStats: enrich_with_stats(Some(execution_config))
EnrichWithStats-->>Optimizer: final enriched plan
Optimizer-->>LogicalPlanBuilder: optimized plan
LogicalPlanBuilder-->>DataFrame: optimized plan
DataFrame-->>User: result
21 files reviewed, 1 comment
| pub fn reorder_joins(mut self, cfg: Option<Arc<DaftExecutionConfig>>) -> Self { | ||
| self.rule_batches.push(RuleBatch::new( | ||
| vec![ | ||
| Box::new(ReorderJoins::new()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: ReorderJoins::new() ignores the cfg parameter passed to the function, should use with_config
| Box::new(ReorderJoins::new()), | |
| Box::new(ReorderJoins::with_config(cfg.clone().unwrap_or_default())), |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daft-logical-plan/src/optimization/optimizer.rs
Line: 221:221
Comment:
**logic:** `ReorderJoins::new()` ignores the `cfg` parameter passed to the function, should use `with_config`
```suggestion
Box::new(ReorderJoins::with_config(cfg.clone().unwrap_or_default())),
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hm do we really want to pass execution configs into the optimizer?
I'm also kind of curious, why were we able to handle CSV and Parquet inflation factors without having to do this?
We weren't. Currently the optimizer has an The |
Changes Made
This PR changes our estimate for json inflation factor based on a real workload. Additionally it allows users to configure the estimated inflation factor via config.
This required threading the execution config through to the optimizer.
Related Issues
Checklist
docs/mkdocs.ymlnavigation