Save ICON-EU GRIB files from DWD's FTP server to cloud object storage #258

@JackKelly

Motivation

It seems that people are generally in favour of the idea of us archiving ICON-EU GRIB files on cloud object storage. For example, in this comment, @aldenks says:

I like the idea of archiving the complete gribs (e.g. on Source Coop) and then starting with a more limited surface variable/level set in the Zarr and seeing what people ask to add in. For AI WP model training I'd imagine almost everyone wants the whole domain at each timestep, meaning the native "chunking" of the grib files is actually optimal and we could expose it as a virtualzarr/virtual icechunk dataset and get equivalent read performance.

And, TBH, it's turning out to be quite hard to find a high-quality archive of ICON-EU data. (DWD delete GRIBs from their HTTP server after 24 hours). So archiving the GRIBs would be a useful contribution in its own right. And, before we know it, a year will have elapsed and we'll have a reasonable-sized archive.

And, if I've understood correctly, RegionJob.operational_update_jobs kind of expects to have access to an archive of GRIB files that spans from whenever operational_update_jobs was last run until now, which could be more than 24 hours if, for example, the ICON-EU reformatters code fails on a Friday night and I don't fix it until Monday. So I'm pausing work on PR #226 until we have an operational service that copies GRIBs and stores them on object storage.

Implementation

How would you like me to implement the script that grabs GRIBs from DWD's HTTP server and copies them to Source Coop?

I'm imagining a fairly simple Python script that you could run using cron (or similar). The script would:

  • Take command line arguments for the source and destination. (e.g. source = https://opendata.dwd.de/weather/nwp/icon-eu/grib/, destination = s3://sourcecoop/dynamical/dwd/icon-eu)
  • Check what's already in the destination
  • Copy anything that's in the source that's not already in the destination
  • Compare the filesize of each file in the source with the filesize of its counterpart in the destination, and re-copy any destination file that's smaller than its source (which indicates a partial transfer).
  • Log using Python's standard logging mechanism

I'd probably use obstore instead of fsspec.
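To make that concrete, here's a minimal sketch of the check / diff / copy / verify loop. It uses requests (with a crude scrape of DWD's HTML index pages) and boto3 purely for illustration; the real script would likely use obstore as noted above. The source URL, bucket, and prefix below are placeholders taken from the example earlier, and it only looks at a single variable directory rather than walking DWD's whole run-hour/variable tree.

```python
"""Illustrative sketch: mirror DWD ICON-EU GRIBs to S3-compatible object storage.

Not the real implementation -- the actual script would likely use obstore,
walk DWD's nested run-hour/variable directories, and add retries.
"""
import argparse
import logging
import re

import boto3
import requests

log = logging.getLogger("icon_eu_mirror")


def list_source(url: str) -> dict[str, int]:
    """Return {filename: size_in_bytes} for GRIBs linked from an HTML index page."""
    html = requests.get(url, timeout=60).text
    # DWD's opendata ICON-EU GRIBs are bz2-compressed .grib2 files.
    names = re.findall(r'href="([^"]+\.grib2\.bz2)"', html)
    sizes = {}
    for name in names:
        head = requests.head(url.rstrip("/") + "/" + name, timeout=60)
        sizes[name] = int(head.headers["Content-Length"])
    return sizes


def list_destination(s3, bucket: str, prefix: str) -> dict[str, int]:
    """Return {filename: size_in_bytes} for objects already under the prefix."""
    sizes = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sizes[obj["Key"].removeprefix(prefix).lstrip("/")] = obj["Size"]
    return sizes


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    # Placeholder defaults, based on the example source/destination above.
    parser.add_argument("--source", default="https://opendata.dwd.de/weather/nwp/icon-eu/grib/00/t_2m/")
    parser.add_argument("--dest-bucket", default="sourcecoop")
    parser.add_argument("--dest-prefix", default="dynamical/dwd/icon-eu/")
    args = parser.parse_args()
    logging.basicConfig(level=logging.INFO)

    s3 = boto3.client("s3")
    src = list_source(args.source)
    dst = list_destination(s3, args.dest_bucket, args.dest_prefix)

    for name, src_size in src.items():
        # Copy anything missing, and re-copy anything whose size is too small
        # (which indicates a partial transfer).
        if dst.get(name) == src_size:
            continue
        log.info("Copying %s (%d bytes)", name, src_size)
        body = requests.get(args.source.rstrip("/") + "/" + name, timeout=300).content
        s3.put_object(Bucket=args.dest_bucket, Key=args.dest_prefix + name, Body=body)


if __name__ == "__main__":
    main()
```

In practice it would need to stream files rather than buffer them in memory, cover every run hour and variable, and retry on transient failures, but the overall shape (list destination, diff against source, copy, verify sizes) would stay the same.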

Does that sound like something I should build? Or would you like it implemented differently? (No worries either way!)
