Skip to content

Conversation

@d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Aug 4, 2025

This PR will give each codec a v2 and v3 JSON de/serialization routines.

depends on #3318

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 29, 2025

I think the last thing I need to do here is write a test that ensures compatibility between these changes and older versions of zarr python 3.x.

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 30, 2025

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

@maxrjones
Copy link
Member

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 3, 2025

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

This is super useful feedback. I'll add virtual-tiff as a dev dependency while I work out how to make these changes non-breaking.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 3, 2025

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

More context for this:

>>> from numcodecs.zarr3 import Zlib as ZlibV3
>>> from numcodecs import Zlib
>>> Zlib().get_config()
{'id': 'zlib', 'level': 1}
>>> ZlibV3().to_dict()
/Users/d-v-b/.cache/uv/archive-v0/RPIFUeEX8IUCTWZnqf1cL/lib/python3.12/site-packages/numcodecs/zarr3.py:164: UserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
  super().__init__(**codec_config)
{'name': 'numcodecs.zlib', 'configuration': {}}
>>> ZlibV3(fake_param=10).to_dict()
{'name': 'numcodecs.zlib', 'configuration': {'fake_param': 10}}

What you see here is a massive flaw in the slapdash design of the codecs in numcodecs.zarr3, which is that __init__ does not inspect the parameters at all! Zlib is configured with a level parameter, which has a default value of 1 for numcodecs.Zlib. But numcodecs.zarr3.Zlib doesn't know anything about the codec it wraps, and so it doesn't generate the default level parameter. This means numcodecs.zarr3.Zlib generates invalid zlib metadata! Great stuff.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 28, 2025

Recent changes in this PR:

  • all the codecs in zarr.codecs.numcodecs produce JSON like {name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
  • all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.

I think this breakage is warranted for a few reasons:

  • we do not want to normalize the "numcodecs" name prefix. It is confusing, and potentially a source of needless divergence in the codec ecosystem. For example, we currently have "numcodecs.blosc", which produces different, incompatible zarr v3 JSON from the regular blosc codec defined in the zarr v3 spec. I find this extremely problematic and worth stopping. This PR effectively makes the zarr.codecs.numcodecs.Blosc codec an alias for zarr.codecs.BloscCodec.
  • we do not want abstraction leakage from zarr v2 into zarr v3. We have several codecs that use data type identifiers as part of their definition, like astype, delta, and fixedscaleoffset. The zarr.codecs.numcodecs versions of these codecs currently use the zarr v2 data type identifiers that numcodecs understands. This is highly problematic, because we have perfectly good zarr v3 data type identifiers that we chose specifically to decouple ourselves from numpy. Disseminating codecs that use the old numpy-style data type identifiers is asking for headaches.

So a basic question for @zarr-developers/python-core-devs: how much do we value preserving the current, problematic JSON serialization of the codecs in zarr.codecs.numcodecs, versus the value of keeping the codec ecosystem simpler?

@maxrjones
Copy link
Member

  • all the codecs in zarr.codecs.numcodecs produce JSON like {name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
  • all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.

@d-v-b this behavior is consistent with the alias solution discussed in zarr-developers/zarr-extensions#2, right? I support using the more interoperable alias for serialization moving forward, but ideally would like zarr-developers/zarr-extensions#2 to be finalized/merged at the same time.

I have two other questions:

@d-v-b
Copy link
Contributor Author

d-v-b commented Nov 3, 2025

whether the configuration key should be omitted or the object should be None or empty.

For a codec that takes no configuration at all, {'name': 'foo', 'configuration': {}} or {'name': 'foo'} or 'foo' (plain string) are all synonyms, as per the 3.1 spec.

Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {id: "astype", "dtype": "int8", "astype": "uint8"})?

In this PR yes, all the codecs that handle dtype stuff in their configuration will take the v2 dtypes metadata, even for a v3 codec. this is necessary for compatibility with existing data. But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.

@maxrjones
Copy link
Member

But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.

I agree with making this breaking change, favoring a simpler ecosystem over bug-for-bug compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants