Initial implementation of np.unique(return_index=True) #1138
base: branch-24.03
Conversation
rohany left a comment
In general, please add more comments to the code, especially in places where you have changed some logic in a non-trivial manner. Another good place for comments is when you introduce (reasonable) code duplication -- a small comment saying why this is the case is good for future readers.
cunumeric/array.py (Outdated)

```python
"""
thunk = self._thunk.unique()
return ndarray(shape=thunk.shape, thunk=thunk)
thunk = self._thunk.unique(return_index)
```
[nit] -- I wouldn't call this `thunk` anymore: when `return_index=True` it isn't a single thunk, it's a tuple of thunks.
```python
task.add_scalar_arg(return_index, ty.bool_)

result = None
# Assuming legate core will always choose GPU variant
```
[nit]: Add a comment as to why the branches here are creating the thunk in different ways (i.e. different types)
Extra nit -- comments should be complete sentences, so punctuation and periods please.
```python
if return_index:
    returned_indices = self.runtime.create_unbound_thunk(ty.int64)
    if self.runtime.num_gpus > 0:
        task.add_output(returned_indices.base)
```
Again, please add a comment on this asymmetry between the CPU and GPU implementations.
```python
)
if return_index:
    task = self.context.create_auto_task(CuNumericOpCode.UNZIP)
    task.add_input(result.base)
```
This is an unclear way to write the code: adding `result.base` as an input, overwriting `result`, and then adding the new `result.base` as an output.
```cpp
  int64_t index;
};

// Surprisingly it seems as though thrust can't figure out this comparison
```
You might not need this if you either:
- implement comparison overloads on `ZippedIndex`
- use a builtin type like `std::pair` instead, which has equality implemented already
```cpp
if (return_index) {
  indices.first.destroy();
  other_index.first.destroy();
```
Should you be early releasing this above like other_piece?
I originally wanted to destroy the buffers here, largely for the sake of clean code: it minimizes the number of special blocks for the `return_index` case. Are you saying we should free these buffers closer to where we release `my_piece` and `other_piece`, to potentially free up space before
`my_piece.first = output.create_output_buffer<VAL, 1>(buf_size);`
on line 182?
```cpp
                sizeof(int64_t) * my_piece.second,
                cudaMemcpyDeviceToDevice,
                stream));
merged_index.destroy();
```
I do not think it is safe for you to do a cudaMemcpyAsync and then destroy merged_index before the cudaMemcpyAsync is guaranteed to be done.
Updating here: it is presently safe to do this in Legate, as all operations on a given GPU use the same stream. Since this is a tree reduction, I think it's best to destroy without manually synchronizing, to avoid excess latency and duplicating data across multiple iterations of the reduction.
```cpp
    case CUNUMERIC_MATVECMUL:
    case CUNUMERIC_UNIQUE_REDUCE: {
    case CUNUMERIC_UNIQUE_REDUCE:
    case CUNUMERIC_UNZIP_INDICES: {
```
Based on looking at the implementation of `unzip_indices` below, you don't need any special mapping for it, and it should fall through to the default case at the bottom here.
PR motivating nv-legate/legate#942. I also decided to create a special task for unzipping indices and values, as opposed to modifying legate.core's Reduce implementation, after speaking with @magnatelee.