feat: Add Ice Chunk support for high-performance cloud data access #292
base: main
Conversation
devsjc left a comment
Firstly, thanks for putting this together, Dakshbir! I can see a lot of work has gone into this, and I know there's more still behind the scenes: trial-and-erroring the optimal values and picking up the problem.
There's quite a lot going on in this PR, so I'm going to review it in stages. I'll suggest several changes over these stages, but please don't take it as a slight on your hard work! It's more about getting the code lined up with the organisational structure and layout, so it's as easy as possible for the active maintainers to understand.
For this first review stage, I'm ignoring the scripts and the benchmark functions for now and just focussing on the core logic.
The main changes I've suggested are a consolidation or removal of many of the new configuration options - I suspect the fewer changes to the config file structure, the better for the ML team. I think we can assume many of the values should be the optimal defaults you've found, without allowing explicit configuration - rather, leave them as inputs to the functions so they can be changed in code if necessary.
This consolidation into the zarr_path requires a bit of new logic, similar to what you have in the load_dataset function, but I'm suggesting a regex-based approach, and moving those branch conditionals into the open_sat_data function.
Thanks for making my requested changes!
I'm now going to revise them a bit in this next set, by suggesting that:
- All the conditional logic joins the existing conditionals in get_single_sat_data in a new match case
- _open_sat_data_icechunk changes its signature to reflect the nature of the computation
- open_sat_data returns to as-was
I know I previously said to put the logic elsewhere from what I'm saying now, so apologies for being a bit confusing - but as we chip away at the code pathways it becomes easier to spot where to put things sensibly!
This commit introduces high-performance satellite data loading from cloud storage using the icechunk library and optimized GCS streaming. Includes new data loaders, configuration model updates, a data conversion pipeline, and a comprehensive benchmarking suite.
This commit introduces Ice Chunk support for significantly faster satellite data loading, achieving up to 1.9x performance improvement over plain Zarr. Key changes:
- Unified satellite data loading using a single function that dispatches to either standard Zarr or Ice Chunk based on the path.
- Simplified configuration by removing Ice Chunk-specific parameters and using a single field.
- Optimized the Ice Chunk conversion process using the library.
- Added comprehensive benchmarking scripts to compare Ice Chunk and plain Zarr performance.

The benchmark results show a significant performance boost with Ice Chunk. This enhancement reduces data loading times and improves overall efficiency, bringing the OCF Data Sampler closer to production readiness. The main files that were changed are:
- ocf_data_sampler/config/model.py
- ocf_data_sampler/load/satellite.py
- ocf_data_sampler/load/load_dataset.py
- scripts/full_dataset_icechunk_conversion.py
- scripts/production_benchmark_comparison.py
✅ Complete Ice Chunk integration with match/case patterns:
- Consolidated all conditional logic in get_single_sat_data()
- Updated _open_sat_data_icechunk signature per Sol's requirements
- Removed fallback logic; proper error handling for missing commits
- Fixed Ice Chunk commit access via snapshot verification
- Applied code formatting with black and isort

✅ All tests passing:
- Standard Zarr: ✅ (187, 11, 3712, 1392)
- Ice Chunk main: ✅ (187, 11, 3712, 1392)
- Ice Chunk commit: ✅ (187, 11, 3712, 1392)

Ready for Sol's final review.
Force-pushed 8f5dc98 to 5773715
Just one more code suggestion for the main logic - thanks for all the changes!
Can you also move the benchmarks into the scripts folder, and the tests into the tests folder?
devsjc left a comment
In order to line the tests up with the current implementations, it would be good to refactor test cases 2 and 3 from test_loading into new test_<test-name> functions in tests/load/test_load_satellite.py. These should test on a small, custom-created local icechunk store that has the expected dimensions, created and passed in to the test functions using a new conftest.py fixture (perhaps called sat_icechunk_path). Test case 1 (plain zarr) can go; it is covered by the existing zarr loading tests, since it uses the open_sat_zarr function and we have now kept that as the entrypoint.
devsjc left a comment
Just a few suggestions removing superfluous comments!
devsjc left a comment
A few more changes to make after speaking to the lead dev on this. Logic-wise, it's reverting the channel selection functionality back to how it was before, and removing some more comments.
Also, since this is the production codebase, we've decided that the benchmarking scripts shouldn't be a part of the codebase changes (since they are more for verification of the results). I'm not saying the code should be lost though, as it represents a lot of your hard work during this project! So my suggestion is:
- Move torch_datasets/utils/benchmark.py, scripts/benchmark_cli.py, scripts/full_dataset_icechunk_conversion.py, scripts/production_benchmark_comparison.py, test_satellite/configs/production_icechunk_2024-02_config.yaml and test_satellite/configs/test_plain_zarr_clean.yaml to GitHub gists, similar to the report
- Then, in the PR description, mention that the results of the PR can be verified using the following code, and link to the relevant gists.
That way, your investigation work and code is still a part of this PR (and therefore, still linked to in your report). It's just where that code is saved that is changed!
We need to pass the linting checks, can you run …
This reverts commit 93bf294. Trigger CI checks after rollback
Pull Request Description
🎓 GSoC 2025: High-Performance Ice Chunk Integration for OCF Data Sampler
Google Summer of Code 2025 Project | Organization: Open Climate Fix
Contributor: [Your Name] | Mentors: Solomon Cotton, Peter Dudfield
📋 Project Overview
This Pull Request implements a comprehensive cloud-native satellite data streaming solution for OCF Data Sampler using Ice Chunk technology. This work was completed as part of Google Summer of Code 2025 to modernize OCF's climate ML infrastructure and eliminate the bottleneck of downloading massive Zarr datasets.
🔗 Complete Project Documentation: GSoC 2025 Ice Chunk Integration Gist
🎯 Project Goals Achieved
- zarr_path architecture

🏗️ Technical Implementation
Unified Architecture Design
Implemented clean, suffix-based dispatching using a single zarr_path field.

Core Components Added
- ocf_data_sampler/load/satellite.py

📊 Performance Results
🧪 Testing & Validation
Integration Tests
Comprehensive Test Coverage
Production Benchmarking
🔧 Files Changed
Core Implementation
- ocf_data_sampler/load/satellite.py: Major refactor, unified loader with Ice Chunk integration
- ocf_data_sampler/load/load_dataset.py: Updated to use the unified zarr_path approach
- ocf_data_sampler/config/model.py: Enhanced configuration model

Production Tools (Just for Benchmarking - not added in the codebase)
- scripts/full_dataset_icechunk_conversion.py: New, dataset migration tool
- scripts/benchmark_cli.py: New, CLI benchmarking interface
- scripts/production_benchmark_comparison.py: New, performance comparison utility
- ocf_data_sampler/torch_datasets/utils/benchmark.py: New, benchmarking framework

Testing & Configuration
- tests/load/test_load_satellite.py: Comprehensive test coverage for all scenarios
- tests/conftest.py: Ice Chunk test fixtures and utilities
- tests/test_satellite/configs/production_icechunk_2024-02_config.yaml: New, production config
- tests/test_satellite/configs/test_plain_zarr_clean.yaml: New, baseline config

🚀 Major Challenges Overcome
OCF-Blosc2 Codec Compatibility
Memory Management for Large Datasets
API Version Compatibility
Performance Optimization
🎯 Real-World Impact
Climate ML Acceleration
📋 Future Work & Extensibility
This implementation provides a solid foundation for:
🔗 Related Links
📊 Benchmarking & Verification Scripts
To maintain a clean production codebase, the benchmarking scripts and configuration files have been included below for full verification of the performance claims:
Benchmarking Framework
File: torch_datasets/utils/benchmark.py

CLI Benchmarking Interface
File: scripts/benchmark_cli.py

Dataset Conversion Pipeline
File: scripts/full_dataset_icechunk_conversion.py

Performance Comparison Tool
File: scripts/production_benchmark_comparison.py

Production Ice Chunk Config
File: test_satellite/configs/production_icechunk_2024-02_config.yaml

Baseline Plain Zarr Config
File: test_satellite/configs/test_plain_zarr_clean.yaml

Note: These external verification resources allow complete reproduction of the 2.09x performance improvement while maintaining a clean production codebase.
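The core timing approach behind these comparison scripts can be sketched as a small harness. This is illustrative only: the function names are hypothetical, and the real scripts load full satellite datasets rather than toy callables.

```python
import statistics
import time
from typing import Callable

def benchmark(loader: Callable[[], object], repeats: int = 5) -> dict[str, float]:
    """Time repeated calls to a data-loading callable and summarise them."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader()  # e.g. a closure that opens and reads the dataset
        durations.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(durations), "min_s": min(durations)}

def speedup(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Ratio of baseline mean time to candidate mean time,
    e.g. plain zarr over icechunk."""
    return baseline["mean_s"] / candidate["mean_s"]
```

A reported "2.09x improvement" in this framing means the plain-zarr mean load time is 2.09 times the icechunk mean load time.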
🏆 GSoC 2025 Completion Status
- [commit-hash]: All core functionality implemented and tested

This Google Summer of Code 2025 project successfully establishes the technical foundation for OCF's next generation of cloud-native climate ML workflows.