Add v2 and v3 metadata support to codecs #3332

d-v-b · 2025-08-04T14:13:44Z

This PR will give each codec v2 and v3 JSON de/serialization routines.

depends on #3318

d-v-b · 2025-09-29T15:26:41Z

I think the last thing I need to do here is write a test that ensures compatibility between these changes and older versions of zarr python 3.x.

d-v-b · 2025-09-30T15:52:10Z

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g., anyone who defined a class that inherited from a base class in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

maxrjones · 2025-10-03T19:36:52Z

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g., anyone who defined a class that inherited from a base class in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

d-v-b · 2025-10-03T20:42:03Z

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g., anyone who defined a class that inherited from a base class in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

This is super useful feedback. I'll add virtual-tiff as a dev dependency while I work out how to make these changes non-breaking.

d-v-b · 2025-10-03T21:31:11Z

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

More context for this:

>>> from numcodecs.zarr3 import Zlib as ZlibV3
>>> from numcodecs import Zlib
>>> Zlib().get_config()
{'id': 'zlib', 'level': 1}
>>> ZlibV3().to_dict()
/Users/d-v-b/.cache/uv/archive-v0/RPIFUeEX8IUCTWZnqf1cL/lib/python3.12/site-packages/numcodecs/zarr3.py:164: UserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
  super().__init__(**codec_config)
{'name': 'numcodecs.zlib', 'configuration': {}}
>>> ZlibV3(fake_param=10).to_dict()
{'name': 'numcodecs.zlib', 'configuration': {'fake_param': 10}}

What you see here is a massive flaw in the slapdash design of the codecs in numcodecs.zarr3, which is that __init__ does not inspect the parameters at all! Zlib is configured with a level parameter, which has a default value of 1 for numcodecs.Zlib. But numcodecs.zarr3.Zlib doesn't know anything about the codec it wraps, and so it doesn't generate the default level parameter. This means numcodecs.zarr3.Zlib generates invalid zlib metadata! Great stuff.
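To make the failure mode concrete, here is a minimal sketch of a wrapper that does inspect its parameters by delegating to the codec it wraps. The classes ToyZlib and ToyZlibV3 are hypothetical stand-ins written for illustration; they are not the numcodecs classes and not the fix implemented in this PR.

```python
from typing import Any


class ToyZlib:
    """Hypothetical stand-in for a v2-style codec that knows its own parameters."""

    codec_id = "zlib"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_config(self) -> dict[str, Any]:
        return {"id": self.codec_id, "level": self.level}


class ToyZlibV3:
    """Hypothetical v3 wrapper that validates kwargs through the wrapped codec."""

    name = "numcodecs.zlib"

    def __init__(self, **codec_config: Any) -> None:
        # Building the wrapped codec rejects unknown parameters with a
        # TypeError and fills in defaults such as level=1.
        self._codec = ToyZlib(**codec_config)

    def to_dict(self) -> dict[str, Any]:
        config = self._codec.get_config()
        config.pop("id")  # the v3 form carries the identifier in "name"
        return {"name": self.name, "configuration": config}
```

With this shape, ToyZlibV3().to_dict() includes the default level, and ToyZlibV3(fake_param=10) fails at construction time instead of writing a bogus parameter into the metadata.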

d-v-b · 2025-10-28T17:02:42Z

Recent changes in this PR:

- All the codecs in zarr.codecs.numcodecs produce JSON like {"name": "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
- All the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.
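As a rough illustration of the read path described above, the sketch below normalizes both the old and the new JSON style to the new form. The function name normalize_codec_json and the two-entry dtype table are assumptions made for this example; the PR's actual alias and data type handling lives in zarr-python and covers far more cases.

```python
from typing import Any

# Illustrative subset only: v2 (NumPy-style) dtype strings -> Zarr v3 names.
V2_TO_V3_DTYPE = {"|i1": "int8", "|u1": "uint8"}
# Configuration keys assumed to hold data type identifiers in this example.
DTYPE_KEYS = ("dtype", "astype")
LEGACY_PREFIX = "numcodecs."


def normalize_codec_json(data: dict[str, Any]) -> dict[str, Any]:
    """Return the prefix-free, v3-dtype form of a codec JSON document."""
    # Accept the legacy "numcodecs.<name>" alias as well as the bare name.
    name = data["name"].removeprefix(LEGACY_PREFIX)
    config = dict(data.get("configuration", {}))
    for key in DTYPE_KEYS:
        if key in config:
            # Translate known v2 identifiers; leave v3 identifiers untouched.
            config[key] = V2_TO_V3_DTYPE.get(config[key], config[key])
    return {"name": name, "configuration": config}
```

Both {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}} and the new-style document normalize to the same result, which is the sense in which old data stays readable while only the new form is written.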

I think this breakage is warranted for a few reasons:

- We do not want to normalize the "numcodecs" name prefix. It is confusing, and potentially a source of needless divergence in the codec ecosystem. For example, we currently have "numcodecs.blosc", which produces different, incompatible zarr v3 JSON from the regular blosc codec defined in the zarr v3 spec. I find this extremely problematic and worth stopping. This PR effectively makes the zarr.codecs.numcodecs.Blosc codec an alias for zarr.codecs.BloscCodec.
- We do not want abstraction leakage from zarr v2 into zarr v3. We have several codecs that use data type identifiers as part of their definition, like astype, delta, and fixedscaleoffset. The zarr.codecs.numcodecs versions of these codecs currently use the zarr v2 data type identifiers that numcodecs understands. This is highly problematic, because we have perfectly good zarr v3 data type identifiers that we chose specifically to decouple ourselves from numpy. Disseminating codecs that use the old numpy-style data type identifiers is asking for headaches.

So a basic question for @zarr-developers/python-core-devs: how much do we value preserving the current, problematic JSON serialization of the codecs in zarr.codecs.numcodecs, versus the value of keeping the codec ecosystem simpler?

maxrjones · 2025-11-03T15:50:54Z

all the codecs in zarr.codecs.numcodecs produce JSON like {"name": "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.

all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.

@d-v-b this behavior is consistent with the alias solution discussed in zarr-developers/zarr-extensions#2, right? I support using the more interoperable alias for serialization moving forward, but ideally would like zarr-developers/zarr-extensions#2 to be finalized/merged at the same time.

I have two other questions:

1. How do you handle a codec that doesn't have a configuration object? I couldn't quite tell from https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#extension-definition whether the configuration key should be omitted or the object should be None or empty.
2. Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {"id": "astype", "dtype": "int8", "astype": "uint8"})?

d-v-b · 2025-11-03T16:14:26Z

whether the configuration key should be omitted or the object should be None or empty.

For a codec that takes no configuration at all, {'name': 'foo', 'configuration': {}} or {'name': 'foo'} or 'foo' (plain string) are all synonyms, as per the 3.1 spec.
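A small sketch of what treating those three forms as synonyms could look like on the parsing side. The helper parse_named_config is a hypothetical name used only for this example, not a function from zarr-python.

```python
from typing import Any, Union

# A codec reference is either a bare name or a {"name": ..., "configuration": ...} object.
CodecJSON = Union[str, dict[str, Any]]


def parse_named_config(data: CodecJSON) -> tuple[str, dict[str, Any]]:
    """Reduce any of the three equivalent spellings to (name, configuration)."""
    if isinstance(data, str):
        # Plain string: a name with no configuration at all.
        return data, {}
    # Object form: a missing "configuration" key means the same as an empty one.
    return data["name"], dict(data.get("configuration", {}))
```

All of 'foo', {'name': 'foo'} and {'name': 'foo', 'configuration': {}} come out as ('foo', {}), so downstream code only has to handle one shape.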

Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {"id": "astype", "dtype": "int8", "astype": "uint8"})?

In this PR, yes: all the codecs that handle dtype stuff in their configuration will take the v2 dtype metadata, even for a v3 codec. This is necessary for compatibility with existing data. But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.
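For readers less familiar with the two shapes being discussed, here is a hedged sketch of the envelope difference between the v2 form (a flat object keyed by "id") and the v3 form (a "name" plus a nested "configuration"). The helpers v2_to_v3 and v3_to_v2 are hypothetical and only reshape the object; they do not rename codecs or translate data type identifiers, both of which the PR handles separately.

```python
from typing import Any


def v2_to_v3(config: dict[str, Any]) -> dict[str, Any]:
    """Move a flat v2 codec config into the v3 name/configuration envelope."""
    rest = dict(config)  # copy so the caller's dict is not mutated
    name = rest.pop("id")
    return {"name": name, "configuration": rest}


def v3_to_v2(data: dict[str, Any]) -> dict[str, Any]:
    """Flatten a v3 codec document back into the v2 'id' form."""
    return {"id": data["name"], **data.get("configuration", {})}
```

For example, v2_to_v3({'id': 'zlib', 'level': 1}) gives {'name': 'zlib', 'configuration': {'level': 1}}, and v3_to_v2 reverses it.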

maxrjones · 2025-11-03T16:41:34Z

But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.

I agree with making this breaking change, favoring a simpler ecosystem over bug-for-bug compatibility.

d-v-b added 30 commits July 31, 2025 15:44

add numcodec protocol

a367268

add tests for numcodecs compatibility

1d424c0

changelog

41dd6ff

ignore unknown key

c435a59

remove re-implementation of get_codec

8e50ef8

Merge branch 'main' into feat/numcodecs-protocol

ef31c5b

Merge branch 'main' into feat/numcodecs-protocol

4ba7914

Merge branch 'main' into feat/numcodecs-protocol

ab52539

Merge branch 'main' into feat/numcodecs-protocol

95c9c8b

add to_json methods to codecs

156134f

add codecvalidationerror

486f837

fix v2 codec json models to avoid inheritance

dd53981

add blosc json test

678889a

distinguish namedconfig from namedrequiredconfig

dfca3ec

lint

262e369

make codecvalidationerror effectively single-argument

4c7fe8a

rename test_endian to test_bytes

1e23a91

bring in update codec abc

d7d4e02

add to_json_tests

9980823

Merge branch 'main' into feat/numcodecs-protocol

fcf84b3

lint

cbb32d7

fix broken tests that used invalid codec JSON

e2d4df8

update test_info

d91b0e9

avoid circular imports by moving numcodec protocol to codec abc

1eb5b3c

use Numcodec instead of numcodecs.abc.Codec

94ba77a

Wip implementation of v2 / v3 codec behavior

f1ca290

Merge branch 'main' into feat/numcodecs-protocol

5b0c3ac

avoid circular imports by importing lower-level routines exactly where needed

84c9780

push numcodec prototol into abcs; remove all numcodecs.abc.Codec type annotations

9a2f35b

add tests for codecjson typeguard

0d0712f

d-v-b added 5 commits September 25, 2025 22:10

make numcodecs codecs backwards compatible

d0d5e92

lint

f4598c8

update test to use to_json output

79f4e31

lint

7a47ef7

fix crc32c json decoding

a483153

d-v-b added 3 commits September 29, 2025 17:39

use explicit integer dtype in blosc test

e72f905

ensure forward compatibility for older versions of zarr python 3.x

172e01f

fix astype test

b2526d2

d-v-b added 2 commits October 3, 2025 23:03

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into feat/v2-v3-codecs

e1afd34

Merge branch 'main' into feat/v2-v3-codecs

1111c93

d-v-b added 7 commits October 22, 2025 14:49

Merge branch 'feat/v2-v3-codecs' of https://github.com/d-v-b/zarr-python into feat/v2-v3-codecs

a11d875

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into feat/v2-v3-codecs

0817e74

add alias logic for numcodecs codecs

4f17ee6

handle crc32c changes

1ca85b7

Merge branch 'main' of https://github.com/zarr-developers/zarr-python into feat/v2-v3-codecs

ff7a1a5

fix aliases, deprecate old codecs

e3b3fe2

full json alias logic

4e03381

d-v-b mentioned this pull request Oct 31, 2025

Feature request: Mechanism to extend codec/filter registry manzt/zarrita.js#310

Open

d-v-b mentioned this pull request Nov 3, 2025

adds codecs that numcodecs defines zarr-developers/zarr-extensions#2

Draft

15 tasks

maxrjones mentioned this pull request Nov 3, 2025

Bug: Handle synonyms of V3 codecs when serializing to V2 formats zarr-developers/VirtualiZarr#823

Open

maxrjones mentioned this pull request Nov 3, 2025

Imagecodecs support NASA-IMPACT/veda-odd#214

Open
