Conversation

koskampt commented Nov 23, 2025

…62501)

@koskampt requested a review from rhshadrach as a code owner on November 23, 2025 15:13
Member

@rhshadrach left a comment

Thanks for the PR! Please always add tests. Does this also handle the tuple case on L667?

@koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from edd8a1f to 8d2126a on November 25, 2025 19:57
@rhshadrach
Member

rhshadrach commented Nov 25, 2025

@koskampt - I opened #63202 to give some idea of what I'm thinking. If you like that approach, you can incorporate it here. I'm still open to alternative solutions that do not iterate through indices within _get_indices, for the reasons provided.

Even with such a solution, I'll still want to see the results of running the groupby ASVs to evaluate the performance impact. I can also help here if desired.

@koskampt
Author

@rhshadrach I had a look at your pull request and incorporated your suggestions into mine. I also changed the signature _get_indices(self, names) to _get_indices(self, name).

I am not familiar with the (groupby) ASVs, but I guess it refers to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will go through the docs by myself first.

Member

@rhshadrach left a comment

I am not familiar with the (groupby) ASVs, but I guess it refers to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will go through the docs by myself first.

Correct - if you're using conda for your virtual environment, then this should be sufficient:

asv continuous -f 1.1 upstream/main HEAD -b ^groupby

@koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from 2ef342b to f7c5e23 on November 29, 2025 16:04
@koskampt
Author

I got asv up and running a couple of days ago. I will run the benchmarks with the command below and report back the results.

asv continuous -f 1.1 upstream/main HEAD -b ^groupby

@koskampt
Author

Just checking: I also went through the asv_bench/benchmarks/groupby.py file but couldn't see a specific benchmark that covers the case where dropna = False, there are no values and indices is called. Am I missing something, or should we add a new benchmark to measure the performance impact of the change in this pull request?
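
If a dedicated benchmark turns out to be needed, a minimal sketch could look like the class below. The class name GroupbyIndicesWithNone, the key values, and the row count are all assumptions made for illustration; nothing like this exists in asv_bench/benchmarks/groupby.py as far as I could see. It only relies on the usual asv convention of a setup method plus methods prefixed with time_.

```python
import numpy as np
import pandas as pd


class GroupbyIndicesWithNone:
    """Hypothetical asv benchmark (illustrative only, not in the pandas repo)."""

    def setup(self):
        n = 100_000  # assumed size; tune to keep the run time reasonable
        rng = np.random.default_rng(0)
        # Object keys where roughly a quarter of the entries are None.
        choices = np.array(["a", "b", "c", None], dtype=object)
        keys = choices[rng.integers(0, len(choices), size=n)]
        self.df = pd.DataFrame({"key": keys, "val": rng.standard_normal(n)})

    def time_indices(self):
        # Build a fresh groupby object on every call so that cached
        # results from a previous call are not reused.
        self.df.groupby("key", dropna=False).indices
```

Whether this belongs in a new class or as an extra parameter on an existing one is a judgment call for the maintainers.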

@koskampt
Author

@rhshadrach, I was able to run asv with the command you mentioned. During the benchmark run I did not use my computer. The results can be found below:

| Change | Before [b6d67b7] | After [f7c5e23] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
| + | 16.6±0.1ms | 19.9±2ms | 1.2 | groupby.String.time_str_func('string[python]', 'max') |
| + | 6.91±0.09ms | 8.20±1ms | 1.19 | groupby.String.time_str_func('string[python]', 'sum') |
| + | 7.89±0.08ms | 9.31±0.7ms | 1.18 | groupby.GroupByNumbaAgg.time_frame_agg('float64', 'max') |
| + | 272±6μs | 320±30μs | 1.18 | groupby.RankWithTies.time_rank_ties('datetime64', 'average') |
| + | 3.67±0.07ms | 4.33±0.5ms | 1.18 | groupby.String.time_str_func('str', 'all') |
| + | 16.7±0.3ms | 19.6±2ms | 1.18 | groupby.String.time_str_func('string[python]', 'min') |
| + | 1.05±0.01ms | 1.24±0.2ms | 1.18 | groupby.Transform.time_transform_str_max |
| + | 885±2μs | 1.03±0.08ms | 1.17 | groupby.CountMultiInt.time_multi_int_nunique |
| + | 12.0±0.06ms | 14.0±0.7ms | 1.17 | groupby.GroupByNumbaAgg.time_frame_agg('float64', 'mean') |
| + | 53.3±0.9ms | 62.2±7ms | 1.17 | groupby.Groups.time_series_groups('object_large') |
| + | 35.0±0.2ms | 41.1±3ms | 1.17 | groupby.Groups.time_series_indices('object_large') |
| + | 16.1±0.1ms | 18.6±2ms | 1.16 | groupby.GroupByNumbaAgg.time_frame_agg('float64', 'var') |
| + | 17.7±0.2ms | 20.6±2ms | 1.16 | groupby.Groups.time_series_indices('object_small') |
| + | 3.64±0.03ms | 4.22±0.4ms | 1.16 | groupby.String.time_str_func('str', 'any') |
| + | 3.81±0.04ms | 4.43±0.5ms | 1.16 | groupby.String.time_str_func('string[python]', 'any') |
| + | 3.33±0.02ms | 3.88±0.4ms | 1.16 | groupby.String.time_str_func('string[python]', 'first') |
| + | 3.38±0.01ms | 3.91±0.4ms | 1.16 | groupby.String.time_str_func('string[python]', 'last') |
| + | 12.7±0.08ms | 14.6±1ms | 1.15 | groupby.Groups.time_series_indices('int64_large') |
| + | 7.77±0.09ms | 8.96±0.9ms | 1.15 | groupby.Groups.time_series_indices('int64_small') |
| + | 55.9±0.7ms | 64.0±6ms | 1.15 | groupby.Int64.time_overflow |
| + | 6.17±0.05ms | 7.08±0.7ms | 1.15 | groupby.TransformEngine.time_dataframe_cython(False) |
| + | 16.0±0.1μs | 18.3±1μs | 1.14 | groupby.GroupByMethods.time_dtype_as_group('uint', 'var', 'direct', 1, 'cython') |
| + | 613±6μs | 698±50μs | 1.14 | groupby.MultipleCategories.time_groupby_extra_cat_sort |
| + | 2.23±0.01ms | 2.55±0.2ms | 1.14 | groupby.SumTimeDelta.time_groupby_sum_int |
| + | 33.4±0.1ms | 38.1±3ms | 1.14 | groupby.Transform.time_transform_lambda_max |
| + | 3.56±0.01ms | 4.01±0.3ms | 1.13 | groupby.AggFunctions.time_different_str_functions_singlecol |
| + | 7.35±0.04μs | 8.32±0.6μs | 1.13 | groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'direct', 1, 'cython') |
| + | 16.2±0.1μs | 18.2±1μs | 1.13 | groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'direct', 1, 'cython') |
| + | 12.9±0.04ms | 14.6±1ms | 1.13 | groupby.Groups.time_series_groups('object_small') |
| + | 45.1±0.3ms | 50.8±5ms | 1.13 | groupby.MultiColumn.time_col_select_lambda_sum |
| + | 3.24±0.01ms | 3.66±0.3ms | 1.13 | groupby.Size.time_multi_size |
| + | 18.5±0.1μs | 20.7±0.6μs | 1.12 | groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'direct', 1, 'cython') |
| + | 6.10±0.01ms | 6.84±0.6ms | 1.12 | groupby.GroupByMethods.time_dtype_as_group('uint', 'describe', 'direct', 1, 'cython') |
| + | 4.76±0.01ms | 5.34±0.3ms | 1.12 | groupby.MultiColumn.time_cython_sum |
| + | 86.5±0.3ms | 96.8±7ms | 1.12 | groupby.MultiColumn.time_lambda_sum |
| + | 5.49±0.03ms | 6.13±0.6ms | 1.12 | groupby.Nth.time_series_nth_any('datetime') |
| + | 268±3μs | 300±30μs | 1.12 | groupby.RankWithTies.time_rank_ties('int64', 'average') |
| + | 3.19±0.01ms | 3.59±0.3ms | 1.12 | groupby.String.time_str_func('str', 'first') |
| + | 16.1±0.2ms | 18.0±1ms | 1.12 | groupby.String.time_str_func('str', 'max') |
| + | 5.47±0.08ms | 6.12±0.5ms | 1.12 | groupby.String.time_str_func('str', 'sum') |
| + | 12.6±0.05ms | 14.1±1ms | 1.12 | groupby.Transform.time_transform_lambda_max_wide |
| + | 201±1ms | 225±20ms | 1.12 | groupby.TransformEngine.time_dataframe_numba(True) |
| + | 21.8±0.5μs | 24.2±1μs | 1.11 | groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'direct', 1, 'cython') |
| + | 258±1μs | 286±20μs | 1.11 | groupby.RankWithTies.time_rank_ties('int64', 'dense') |
| + | 7.38±0.05ms | 8.17±0.6ms | 1.11 | groupby.TransformEngine.time_series_cython(False) |
| + | 247±0.4μs | 274±20μs | 1.11 | groupby.TransformNaN.time_first |
| + | 3.71±0.03ms | 4.09±0.3ms | 1.1 | groupby.AggFunctions.time_different_str_functions_multicol |
| + | 17.0±0.1μs | 18.7±0.5μs | 1.1 | groupby.GroupByMethods.time_dtype_as_field('int16', 'mean', 'direct', 1, 'cython') |
| + | 24.7±0.08μs | 27.3±2μs | 1.1 | groupby.GroupByMethods.time_dtype_as_group('uint', 'tail', 'direct', 1, 'cython') |
| + | 2.25±0.02ms | 2.48±0.2ms | 1.1 | groupby.SumTimeDelta.time_groupby_sum_timedelta |
| + | 129±0.5ms | 142±10ms | 1.1 | groupby.TransformEngine.time_dataframe_numba(False) |
| - | 22.0±2ms | 19.9±0.2ms | 0.91 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'var') |
| - | 32.8±2ms | 29.7±0.4ms | 0.91 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'var') |
| - | 572±20μs | 517±10μs | 0.9 | groupby.Categories.time_groupby_nosort(False) |
| - | 18.3±1μs | 16.5±0.4μs | 0.9 | groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct', 1, 'cython') |
| - | 49.2±5ms | 43.7±0.8ms | 0.89 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'mean') |
| - | 114±10ms | 101±0.3ms | 0.89 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'median') |
| - | 30.4±2μs | 27.2±0.5μs | 0.89 | groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 1, 'cython') |
| - | 17.9±2ms | 15.6±0.1ms | 0.88 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'sum') |
| - | 18.5±0.9ms | 16.2±0.2ms | 0.88 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'sum') |
| - | 450±30μs | 398±6μs | 0.88 | groupby.GroupByMethods.time_dtype_as_field('int', 'quantile', 'direct', 1, 'cython') |
| - | 3.94±0.4ms | 3.43±0.02ms | 0.87 | groupby.Float32.time_sum |
| - | 18.5±2ms | 16.1±0.5ms | 0.87 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'any') |
| - | 20.0±2ms | 17.4±0.3ms | 0.87 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'var') |
| - | 20.2±3ms | 17.6±0.1ms | 0.87 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'max') |
| - | 20.6±2ms | 17.7±0.4ms | 0.86 | groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'sum') |
| - | 12.2±1ms | 10.4±0.2ms | 0.85 | groupby.GroupByCythonAgg.time_frame_agg('float64', 'max') |
| - | 3.56±0.8ms | 2.77±0.06ms | 0.78 | groupby.Cumulative.time_frame_transform('int64', 'cumsum', False) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
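
As a sanity check on how to read these numbers, here is a small sketch that recomputes a few ratios. It assumes that the Ratio column is After divided by Before and that -f 1.1 flags ratios above 1.1 or below 1/1.1; that is my reading of the asv option, not something verified against its source. The inputs are the rounded values reported above, so other rows can differ from the reported ratio in the last digit.

```python
FACTOR = 1.1  # the value passed via -f in the command above


def ratio(before: float, after: float) -> float:
    """Return after / before; both values must be in the same unit."""
    return after / before


def classify(r: float, factor: float = FACTOR) -> str:
    """Mimic the Change column: '+' is slower, '-' is faster (assumed rule)."""
    if r > factor:
        return "+"
    if r < 1 / factor:
        return "-"
    return " "


# (benchmark, before in ms, after in ms), copied from the reported results.
rows = [
    ("String.time_str_func('string[python]', 'max')", 16.6, 19.9),
    ("Groups.time_series_groups('object_large')", 53.3, 62.2),
    ("Cumulative.time_frame_transform('int64', 'cumsum', False)", 3.56, 2.77),
]

for name, before, after in rows:
    r = ratio(before, after)
    print(f"{classify(r)} {r:.2f} {name}")
```

Note that the uncertainty on the After column is mostly several times larger than on the Before column for the slowdowns, and the reverse holds for the speedups, so part of the difference may be measurement noise rather than an effect of this change.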

@rhshadrach
Member

couldn't see a specific benchmark that covers the case where dropna = False, there are no values and indices is called.

Indeed, in addition to indices itself, I'm seeing that this PR only hits:

  • sample
  • __iter__
  • get_group
  • filter

So this seems like a pretty limited surface area for performance impact, and I do not see a more performant way to do this that would be limited in scope. I would like another pair of eyes here - cc @jbrockmendel.

@rhshadrach added the labels Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Filters (e.g. head, tail, nth) on Dec 2, 2025

Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent behavior of Groupby with None values with filter