Save ICON-EU GRIB files from DWD's FTP server to cloud object storage #258

@JackKelly

Motivation

It seems that people are generally in favour of the idea of us archiving ICON-EU GRIB files on cloud object storage. For example, in this comment, @aldenks says:

I like the idea of archiving the complete gribs (e.g. on Source Coop) and then starting with a more limited surface variable/level set in the Zarr and seeing what people ask to add in. For AI WP model training I'd imagine almost everyone wants the whole domain at each timestep, meaning the native "chunking" of the grib files is actually optimal and we could expose it as a virtualzarr/virtual icechunk dataset and get equivalent read performance.

And, TBH, it's turning out to be quite hard to find a high-quality archive of ICON-EU data. (DWD delete GRIBs from their HTTP server after 24 hours). So archiving the GRIBs would be a useful contribution in its own right. And, before we know it, a year will have elapsed and we'll have a reasonable-sized archive.

And, if I've understood correctly, RegionJob.operational_update_jobs kind of expects to have access to an archive of GRIB files that spans from whenever operational_update_jobs was last run until now, which could be more than 24 hours if, for example, the ICON-EU reformatters code fails on a Friday night and I don't fix it until Monday. So I'm pausing work on PR #226 until we have an operational service that copies GRIBs and stores them on object storage.

Implementation

How would you like me to implement the script that grabs GRIBs from DWD's HTTP server and copies them to Source Coop?

I'm imagining a fairly simple Python script that you could run using cron (or similar). The script would:

  • Take command line arguments for the source and destination. (e.g. source = https://opendata.dwd.de/weather/nwp/icon-eu/grib/, destination = s3://sourcecoop/dynamical/dwd/icon-eu)
  • Check what's already in the destination
  • Copy anything that's in the source that's not already in the destination
  • Compare the filesize of each file in the source with the filesize of its counterpart in the destination, and re-copy any destination file that's smaller than its source (which indicates a partial transfer).
  • Log using Python's standard logging mechanism

I'd probably use obstore instead of fsspec.
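To make that concrete, here's a minimal sketch of the check / diff / copy / verify loop. It uses requests (with a crude scrape of DWD's HTML index pages) and boto3 purely for illustration; the real script would likely use obstore as noted above. The source URL, bucket, and prefix below are placeholders taken from the example earlier, and it only looks at a single variable directory rather than walking DWD's whole run-hour/variable tree.

```python
"""Illustrative sketch: mirror DWD ICON-EU GRIBs to S3-compatible object storage.

Not the real implementation -- the actual script would likely use obstore,
walk DWD's nested run-hour/variable directories, and add retries.
"""
import argparse
import logging
import re

import boto3
import requests

log = logging.getLogger("icon_eu_mirror")


def list_source(url: str) -> dict[str, int]:
    """Return {filename: size_in_bytes} for GRIBs linked from an HTML index page."""
    html = requests.get(url, timeout=60).text
    # DWD's opendata ICON-EU GRIBs are bz2-compressed .grib2 files.
    names = re.findall(r'href="([^"]+\.grib2\.bz2)"', html)
    sizes = {}
    for name in names:
        head = requests.head(url.rstrip("/") + "/" + name, timeout=60)
        sizes[name] = int(head.headers["Content-Length"])
    return sizes


def list_destination(s3, bucket: str, prefix: str) -> dict[str, int]:
    """Return {filename: size_in_bytes} for objects already under the prefix."""
    sizes = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sizes[obj["Key"].removeprefix(prefix).lstrip("/")] = obj["Size"]
    return sizes


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    # Placeholder defaults, based on the example source/destination above.
    parser.add_argument("--source", default="https://opendata.dwd.de/weather/nwp/icon-eu/grib/00/t_2m/")
    parser.add_argument("--dest-bucket", default="sourcecoop")
    parser.add_argument("--dest-prefix", default="dynamical/dwd/icon-eu/")
    args = parser.parse_args()
    logging.basicConfig(level=logging.INFO)

    s3 = boto3.client("s3")
    src = list_source(args.source)
    dst = list_destination(s3, args.dest_bucket, args.dest_prefix)

    for name, src_size in src.items():
        # Copy anything missing, and re-copy anything whose size is too small
        # (which indicates a partial transfer).
        if dst.get(name) == src_size:
            continue
        log.info("Copying %s (%d bytes)", name, src_size)
        body = requests.get(args.source.rstrip("/") + "/" + name, timeout=300).content
        s3.put_object(Bucket=args.dest_bucket, Key=args.dest_prefix + name, Body=body)


if __name__ == "__main__":
    main()
```

In practice it would need to stream files rather than buffer them in memory, cover every run hour and variable, and retry on transient failures, but the overall shape (list destination, diff against source, copy, verify sizes) would stay the same.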

Does that sound like something I should build? Or would you like it implemented differently? (No worries either way!)
