BUG: Inconsistent behavior of Groupby with None values with filter (#… #63178

koskampt · 2025-11-23T15:13:32Z

…62501)

closes BUG: Inconsistent behavior of Groupby with None values with filter #62501
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/v2.3.4.rst file if fixing a bug or adding a new feature.

rhshadrach

Thanks for the PR! Please always add tests. Does this also handle the tuple case on L667?

pandas/core/groupby/groupby.py

rhshadrach · 2025-11-25T22:42:50Z

@koskampt - I opened #63202 to give some idea of what I'm thinking. If you like that, can incorporate it here. But still open to alternative solutions that do not iterate through indices within _get_indices for the reasons provided.

Even with such a solution, will still want to see the result of running the groupby ASVs to evaluate performance impact. I can also help assist here if desired.
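To illustrate the concern about iterating through indices inside _get_indices, here is a minimal pure-Python sketch. It is not the pandas implementation: the `indices` dict, `is_null`, `lookup_by_scan`, and `NullAwareIndices` are hypothetical names, and the sketch assumes the null group is stored under a NaN key while the caller asks for `None`. Under that assumption a plain dict lookup misses, so one option scans the keys on every call, and another locates the null key once up front.

```python
import math


def is_null(key):
    """Treat None and float NaN as the same 'missing' group label."""
    return key is None or (isinstance(key, float) and math.isnan(key))


# Hypothetical stand-in for a mapping of group label -> row positions.
indices = {"a": [0, 2], "b": [1], float("nan"): [3, 4]}


def lookup_by_scan(indices, name):
    """Resolve a null label by scanning all keys: O(number of groups) per call."""
    if not is_null(name):
        return indices.get(name, [])
    for key, rows in indices.items():
        if is_null(key):
            return rows
    return []


class NullAwareIndices:
    """Locate the null key once, then answer every lookup in constant time."""

    def __init__(self, indices):
        self._indices = indices
        self._null_rows = next(
            (rows for key, rows in indices.items() if is_null(key)), []
        )

    def get(self, name):
        if is_null(name):
            return self._null_rows
        return self._indices.get(name, [])
```

The scan is harmless for a single lookup. It becomes costly when the lookup runs once per group, which is the kind of impact the groupby ASVs are meant to surface.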

koskampt · 2025-11-26T22:16:47Z

@rhshadrach I had a look at your pull request and incorporated your suggestions in mine. I also made the change _get_indices(self, names) to _get_indices(self, name).

I am not familiar with the (groupby) ASVs, but I guess it is referring to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will go through the docs by myself first.

rhshadrach

I am not familiar with the (groupby) ASVs, but I guess it is referring to this: https://pandas.pydata.org/community/benchmarks.html. Help would be greatly appreciated, although I will go through the docs by myself first.

Correct - if you're using conda for your virtual environment, then this should be sufficient:

asv continuous -f 1.1 upstream/main HEAD -b ^groupby
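As a rough guide to what `-f 1.1` means in the command above: a benchmark is reported when its after/before ratio moves beyond that factor in either direction. The sketch below covers only the ratio check; `classify` is a hypothetical helper, and asv may apply further statistical checks that are ignored here.

```python
def classify(before, after, factor=1.1):
    """Return '+' for a slowdown, '-' for a speedup, ' ' if within the factor.

    `before` and `after` are timings in the same unit.
    """
    ratio = after / before
    if ratio > factor:
        return "+"
    if ratio < 1 / factor:
        return "-"
    return " "


# Timings in milliseconds, chosen for illustration.
print(classify(16.6, 19.9))   # slower by about 20%
print(classify(3.56, 2.77))   # faster by about 22%
print(classify(10.0, 10.5))   # within the 1.1 factor
```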

pandas/core/groupby/groupby.py

pandas/core/groupby/ops.py


koskampt · 2025-11-29T16:10:02Z

I was able to get the asv up and running (a couple of days ago). I will run the benchmark with the below command and report back the results.

asv continuous -f 1.1 upstream/main HEAD -b ^groupby

koskampt · 2025-11-29T16:33:46Z

Just checking: I also went through the asv_bench/benchmarks/groupby.py file but couldn't see a specific benchmark test that covers the case where dropna = False, there are no values and indices is called. Am I missing something, or should we add a new benchmark in order to test the performance impact of the change in this pull request?

koskampt · 2025-11-30T19:16:56Z

@rhshadrach, I was able to run asv with the command you mentioned. During the benchmark run I did not use my computer. The results can be found below:

Change	Before [`b6d67b7`]	After [`f7c5e23`]	Ratio	Benchmark (Parameter)
+	16.6±0.1ms	19.9±2ms	1.2	groupby.String.time_str_func('string[python]', 'max')
+	6.91±0.09ms	8.20±1ms	1.19	groupby.String.time_str_func('string[python]', 'sum')
+	7.89±0.08ms	9.31±0.7ms	1.18	groupby.GroupByNumbaAgg.time_frame_agg('float64', 'max')
+	272±6μs	320±30μs	1.18	groupby.RankWithTies.time_rank_ties('datetime64', 'average')
+	3.67±0.07ms	4.33±0.5ms	1.18	groupby.String.time_str_func('str', 'all')
+	16.7±0.3ms	19.6±2ms	1.18	groupby.String.time_str_func('string[python]', 'min')
+	1.05±0.01ms	1.24±0.2ms	1.18	groupby.Transform.time_transform_str_max
+	885±2μs	1.03±0.08ms	1.17	groupby.CountMultiInt.time_multi_int_nunique
+	12.0±0.06ms	14.0±0.7ms	1.17	groupby.GroupByNumbaAgg.time_frame_agg('float64', 'mean')
+	53.3±0.9ms	62.2±7ms	1.17	groupby.Groups.time_series_groups('object_large')
+	35.0±0.2ms	41.1±3ms	1.17	groupby.Groups.time_series_indices('object_large')
+	16.1±0.1ms	18.6±2ms	1.16	groupby.GroupByNumbaAgg.time_frame_agg('float64', 'var')
+	17.7±0.2ms	20.6±2ms	1.16	groupby.Groups.time_series_indices('object_small')
+	3.64±0.03ms	4.22±0.4ms	1.16	groupby.String.time_str_func('str', 'any')
+	3.81±0.04ms	4.43±0.5ms	1.16	groupby.String.time_str_func('string[python]', 'any')
+	3.33±0.02ms	3.88±0.4ms	1.16	groupby.String.time_str_func('string[python]', 'first')
+	3.38±0.01ms	3.91±0.4ms	1.16	groupby.String.time_str_func('string[python]', 'last')
+	12.7±0.08ms	14.6±1ms	1.15	groupby.Groups.time_series_indices('int64_large')
+	7.77±0.09ms	8.96±0.9ms	1.15	groupby.Groups.time_series_indices('int64_small')
+	55.9±0.7ms	64.0±6ms	1.15	groupby.Int64.time_overflow
+	6.17±0.05ms	7.08±0.7ms	1.15	groupby.TransformEngine.time_dataframe_cython(False)
+	16.0±0.1μs	18.3±1μs	1.14	groupby.GroupByMethods.time_dtype_as_group('uint', 'var', 'direct', 1, 'cython')
+	613±6μs	698±50μs	1.14	groupby.MultipleCategories.time_groupby_extra_cat_sort
+	2.23±0.01ms	2.55±0.2ms	1.14	groupby.SumTimeDelta.time_groupby_sum_int
+	33.4±0.1ms	38.1±3ms	1.14	groupby.Transform.time_transform_lambda_max
+	3.56±0.01ms	4.01±0.3ms	1.13	groupby.AggFunctions.time_different_str_functions_singlecol
+	7.35±0.04μs	8.32±0.6μs	1.13	groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'direct', 1, 'cython')
+	16.2±0.1μs	18.2±1μs	1.13	groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'direct', 1, 'cython')
+	12.9±0.04ms	14.6±1ms	1.13	groupby.Groups.time_series_groups('object_small')
+	45.1±0.3ms	50.8±5ms	1.13	groupby.MultiColumn.time_col_select_lambda_sum
+	3.24±0.01ms	3.66±0.3ms	1.13	groupby.Size.time_multi_size
+	18.5±0.1μs	20.7±0.6μs	1.12	groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'direct', 1, 'cython')
+	6.10±0.01ms	6.84±0.6ms	1.12	groupby.GroupByMethods.time_dtype_as_group('uint', 'describe', 'direct', 1, 'cython')
+	4.76±0.01ms	5.34±0.3ms	1.12	groupby.MultiColumn.time_cython_sum
+	86.5±0.3ms	96.8±7ms	1.12	groupby.MultiColumn.time_lambda_sum
+	5.49±0.03ms	6.13±0.6ms	1.12	groupby.Nth.time_series_nth_any('datetime')
+	268±3μs	300±30μs	1.12	groupby.RankWithTies.time_rank_ties('int64', 'average')
+	3.19±0.01ms	3.59±0.3ms	1.12	groupby.String.time_str_func('str', 'first')
+	16.1±0.2ms	18.0±1ms	1.12	groupby.String.time_str_func('str', 'max')
+	5.47±0.08ms	6.12±0.5ms	1.12	groupby.String.time_str_func('str', 'sum')
+	12.6±0.05ms	14.1±1ms	1.12	groupby.Transform.time_transform_lambda_max_wide
+	201±1ms	225±20ms	1.12	groupby.TransformEngine.time_dataframe_numba(True)
+	21.8±0.5μs	24.2±1μs	1.11	groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'direct', 1, 'cython')
+	258±1μs	286±20μs	1.11	groupby.RankWithTies.time_rank_ties('int64', 'dense')
+	7.38±0.05ms	8.17±0.6ms	1.11	groupby.TransformEngine.time_series_cython(False)
+	247±0.4μs	274±20μs	1.11	groupby.TransformNaN.time_first
+	3.71±0.03ms	4.09±0.3ms	1.1	groupby.AggFunctions.time_different_str_functions_multicol
+	17.0±0.1μs	18.7±0.5μs	1.1	groupby.GroupByMethods.time_dtype_as_field('int16', 'mean', 'direct', 1, 'cython')
+	24.7±0.08μs	27.3±2μs	1.1	groupby.GroupByMethods.time_dtype_as_group('uint', 'tail', 'direct', 1, 'cython')
+	2.25±0.02ms	2.48±0.2ms	1.1	groupby.SumTimeDelta.time_groupby_sum_timedelta
+	129±0.5ms	142±10ms	1.1	groupby.TransformEngine.time_dataframe_numba(False)
-	22.0±2ms	19.9±0.2ms	0.91	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'var')
-	32.8±2ms	29.7±0.4ms	0.91	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'var')
-	572±20μs	517±10μs	0.9	groupby.Categories.time_groupby_nosort(False)
-	18.3±1μs	16.5±0.4μs	0.9	groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct', 1, 'cython')
-	49.2±5ms	43.7±0.8ms	0.89	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'mean')
-	114±10ms	101±0.3ms	0.89	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'median')
-	30.4±2μs	27.2±0.5μs	0.89	groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'direct', 1, 'cython')
-	17.9±2ms	15.6±0.1ms	0.88	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'sum')
-	18.5±0.9ms	16.2±0.2ms	0.88	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int64', 'sum')
-	450±30μs	398±6μs	0.88	groupby.GroupByMethods.time_dtype_as_field('int', 'quantile', 'direct', 1, 'cython')
-	3.94±0.4ms	3.43±0.02ms	0.87	groupby.Float32.time_sum
-	18.5±2ms	16.1±0.5ms	0.87	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'any')
-	20.0±2ms	17.4±0.3ms	0.87	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'var')
-	20.2±3ms	17.6±0.1ms	0.87	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Int32', 'max')
-	20.6±2ms	17.7±0.4ms	0.86	groupby.GroupByCythonAggEaDtypes.time_frame_agg('Float64', 'sum')
-	12.2±1ms	10.4±0.2ms	0.85	groupby.GroupByCythonAgg.time_frame_agg('float64', 'max')
-	3.56±0.8ms	2.77±0.06ms	0.78	groupby.Cumulative.time_frame_transform('int64', 'cumsum', False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
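One rough way to read a table like the one above is to check whether the before and after timings still differ once the reported `±` spreads are taken into account. The sketch below is only a heuristic under that assumption, not the test asv itself applies; `parse_timing` and `spreads_overlap` are hypothetical helpers.

```python
import re

# Both the Greek mu and the micro sign are accepted for microseconds.
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "μs": 1e-6, "µs": 1e-6, "ns": 1e-9}
TIMING_RE = re.compile(r"([\d.]+)±([\d.]+)(s|ms|[μµ]s|ns)")


def parse_timing(text):
    """Parse a cell such as '16.6±0.1ms' into (value, spread) in seconds."""
    match = TIMING_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    value, spread, unit = match.groups()
    scale = UNIT_TO_SECONDS[unit]
    return float(value) * scale, float(spread) * scale


def spreads_overlap(before, after):
    """True if the two timings differ by no more than their combined spread."""
    before_value, before_spread = parse_timing(before)
    after_value, after_spread = parse_timing(after)
    return abs(after_value - before_value) <= before_spread + after_spread


# Two rows taken from the table above.
print(spreads_overlap("16.6±0.1ms", "19.9±2ms"))   # String.time_str_func max
print(spreads_overlap("1.05±0.01ms", "1.24±0.2ms"))  # time_transform_str_max
```

Rows where the spreads overlap are weaker evidence of a real change than rows where they do not.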

rhshadrach · 2025-12-02T22:08:40Z

couldn't see a specific benchmark test that covers the case where dropna = False, there are no values and indices is called.

Indeed, in addition to indices itself, I'm seeing that this PR only hits:

sample
__iter__
get_group
filter

So seems like a pretty limited surface area for performance impact, and I do not see a more performant way to do this that would be limited in scope. Would like another eye here - cc @jbrockmendel.

koskampt requested a review from rhshadrach as a code owner November 23, 2025 15:13

rhshadrach reviewed Nov 23, 2025


koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from edd8a1f to 8d2126a Compare November 25, 2025 19:57

rhshadrach requested changes Nov 29, 2025


T. Koskamp added 4 commits November 29, 2025 17:04

b5b447e BUG: Inconsistent behavior of Groupby with None values with filter (pandas-dev#62501)

d2046e9 BUG: Inconsistent behavior of Groupby with None values with filter (pandas-dev#62501) - Add test cases - Add tuple support - Incorporate feedback

74057eb Update indices property from groupby

f7c5e23 Incorporate review suggestion for issue pandas-dev#63178: BUG: Inconsistent behavior of Groupby with None values with filter

koskampt force-pushed the bug-fix-grouby-with-none-values-with-filter branch from 2ef342b to f7c5e23 Compare November 29, 2025 16:04

rhshadrach added the Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Filters (e.g. head, tail, nth) labels Dec 2, 2025
