map_over_datasets: skip empty nodes #10042
Conversation
An interpolation use case that doesn't crash with this PR:

```python
import numpy as np
import xarray as xr

number_of_files = 700
number_of_groups = 5
number_of_variables = 10

datasets = {}
for f in range(number_of_files):
    for g in range(number_of_groups):
        # Create random data
        time = np.linspace(0, 50 + f, 1 + 1000 * g)
        y = f * time + g
        # Create dataset:
        ds = xr.Dataset(
            data_vars={
                f"temperature_{g}{i}": ("time", y)
                for i in range(number_of_variables // number_of_groups)
            },
            coords={"time": ("time", time)},
        ).chunk()
        # Prepare for xr.DataTree:
        name = f"file_{f}/group_{g}"
        datasets[name] = ds

dt = xr.DataTree.from_dict(datasets)

# %% Interpolate to same time coordinate
def ds_interp(ds, *args, **kwargs):
    return ds.interp(*args, **kwargs)

new_time = np.linspace(0, 100, 50)
dt_interp = dt.map_over_datasets(
    ds_interp, kwargs=dict(time=new_time, assume_sorted=True)
)
```
Thanks for the example. This PR would also close #10013, which would be a huge plus for me. Not being able to subtract a Dataset from a DataTree is extremely cumbersome. However, this implies that the binary ops are implemented using …
@TomNicholas do you see any chance this PR might get merged (after adding tests etc., obviously)? Are there discussions besides #9693 that I am missing?
Hey @mathause - sorry for forgetting about this - I've been busy. I think something like this should get merged, but there are various small and fairly arbitrary choices to quibble over. They are basically all already mentioned in #9693 though.

I don't understand this statement though - aren't binary ops already implemented using the code at `xarray/core/datatree.py` line 1590 (commit 5ea1e81)?

We're changing the behaviour, but changing it to be closer to the old datatree, which is what a lot of users expect anyway.
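For intuition only, here is a toy sketch (plain Python dictionaries, not the actual xarray implementation) of how a tree binary op reduces to mapping the operator over every node's dataset, which is why empty nodes end up being handed to the operator in the first place:

```python
from operator import sub

# Toy stand-in for a DataTree: a {path: dataset-like} mapping.
def map_over_datasets(func, tree):
    # Apply func to every node's "dataset", preserving the tree layout.
    return {path: func(ds) for path, ds in tree.items()}

def tree_sub(tree, other):
    # `dt - other` becomes a per-node call of the operator, so every
    # node (including empty ones) is passed to `sub`.
    return map_over_datasets(lambda ds: sub(ds, other), tree)

print(tree_sub({"a": 5, "b": 7}, 2))  # {'a': 3, 'b': 5}
```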
I think this is ready for review.

No worries! Thanks for considering this PR!

The one unclear choice from #9693 was the comment by @shoyer in #9693 (comment):

> I currently use …

Yes, sorry, that was not clear. I just wanted to say that binary ops are also affected.
Gentle ping @TomNicholas (apologies for bothering you again - I am unfortunately currently blocked by this in another project). Or is there someone else who could potentially review this?
No worries, replied here: #9693 (comment)
Any updates on this? #10013 makes binary operations on datatrees quite impractical, and a fix would be very welcome!
`xarray/core/datatree_mapping.py` (outdated diff)
```python
node_dataset_args.insert(i, arg)
with add_path_context_to_errors(path):
    results = func(*node_dataset_args, **kwargs)
if node_tree_args[0].has_data:
```
I would check every argument here, instead of just the first one.
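A minimal sketch of this suggestion, using stand-in objects (`has_data` mirrors the `DataTree` attribute; the class and function names here are hypothetical):

```python
# Check every tree argument for data, not just node_tree_args[0].
class FakeNode:
    """Stand-in for a DataTree node with only the has_data flag."""
    def __init__(self, has_data: bool):
        self.has_data = has_data

def any_node_has_data(node_tree_args):
    # Call the mapped function if *any* of the passed nodes carries data.
    return any(node.has_data for node in node_tree_args)

print(any_node_has_data([FakeNode(False), FakeNode(True)]))   # True
print(any_node_has_data([FakeNode(False), FakeNode(False)]))  # False
```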
```python
elif node_tree_args[0].has_attrs:
    # propagate attrs
    results = node_tree_args[0].dataset
    func_called[path] = False
else:
    # use Dataset instead of None to ensure it has copy method
    results = Dataset()
    func_called[path] = False
```
Assigning a single dataset as the result implicitly assumes that the result of `map_over_datasets` is a single `Dataset`, which may not be the case.

In cases with at least one non-empty node we can infer the number of outputs, but this isn't possible in general; consider, e.g., `map_over_datasets(np.divmod, DataTree(), DataTree())`.

I think we now need an explicit `num_return_values` argument which can be supplied to disambiguate this.
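To illustrate the ambiguity with plain NumPy (outside any tree): a mapped function can return several values, which `map_over_datasets` unpacks into that many result trees. When every node is empty the function is never called, so the output arity cannot be inferred from the data:

```python
import numpy as np

# divmod-style functions return two arrays; mapping one over a tree
# would therefore produce two result trees, not one.
q, r = np.divmod(np.array([7, 9]), 2)
print(q.tolist(), r.tolist())  # [3, 4] [1, 1]
```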
Thanks a lot for the review! I updated the function to check all passed datatree nodes and added …
Files changed: `whats-new.rst`, `api.rst`