-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: for ordered categorical data implements correct computation of kendall/spearman correlations #62880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| self, | ||
| method, | ||
| ): | ||
| pytest.importorskip("scipy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you are going to use the import, you can just add this as a @td.skip_if_no("scipy") decorator to the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed
| tm.assert_almost_equal(df.transpose().corr(method=my_corr), expected) | ||
|
|
||
| @pytest.mark.parametrize("method", ["kendall", "spearman"]) | ||
| def test_corr_rank_ordered_categorical( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is pretty long, to the point where its unclear what its intent is. Maybe its worth breaking up into a few tests? Or adding parameterization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/frame.py
Outdated
| cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns | ||
|
|
||
| if len(cols_convert) > 0: | ||
| data = self.copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit wary of taking an entire copy of the dataframe in instances where there might be ordered categoricals; that's a potentially large performance hit, and the usage of this seems pretty niche
I see @rhshadrach commented on the original issue, so lets see what his thoughts are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach are you suggesting an asv benchmark or to profile it and paste the results in the description of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For benchmarking, we don't have any ASVs that hit this case. You can just setup an example that hits this case and use timeit to compare this PR to main. Aim for 10-100ms in runtime so we aren't merely benchmarking overhead. If you want any assistance in setting this up, just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time profiling stats added to the description of the PR, please let me know if it makes sense or something else is needed as well.
pandas/core/frame.py
Outdated
| cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns | ||
|
|
||
| if len(cols_convert) > 0: | ||
| data = self.copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.
pandas/core/frame.py
Outdated
| data[cols_convert] = data[cols_convert].transform( | ||
| lambda x: x.cat.codes.replace(-1, np.nan) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will fail when a DataFrame has duplicate column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for catching this, fixing this!
pandas/core/frame.py
Outdated
|
|
||
| return correl | ||
|
|
||
| def _transform_ord_cat_cols_to_coded_cols(self) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this a bit and make it more performant.
result = self
made_copy = False
for idx, dtype in enumerate(self.dtypes):
if not dtype == "category" or not dtype.ordered:
continue
col = result._ixs(idx, axis=1)
if not made_copy:
made_copy = True
result = result.copy(deep=False)
result._iset_item(idx, col.cat.codes.replace(-1, np.nan))
return resultCan you move this to pandas.core.methods.corr (this file does not yet exist) and make it take a DataFrame as input - we can move the remaining parts of the implementation in a later PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/methods/corr.py
Outdated
| result = df | ||
| made_copy = False | ||
| for idx, dtype in enumerate(df.dtypes): | ||
| if not dtype == "category" or not dtype.ordered: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I should have used is_catgorical_dtype(dtype) here - you can import this from pandas.core.dtypes.common. I think that should also fix the pyright issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing pandas.errors.Pandas4Warning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead - will switch to the recommended one, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah! Yep - thanks.
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! A few requests on simplifying the tests here.
pandas/core/methods/corr.py
Outdated
| def transform_ord_cat_cols_to_coded_cols(df: DataFrame) -> DataFrame: | ||
| """ | ||
| any ordered categorical columns are transformed to the respective | ||
| categorical codes while other columns remain untouched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our docstring standards require a single (not multi) first line. I think a one-line here is sufficient, can just make this more concise. E.g.
Replace ordered categoricals with their codes, making a shallow copy if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| "ord_int": Series([0, 1, 2, 3]), | ||
| "ord_float": Series([2.0, 3.0, 4.5, 6.5]), | ||
| "ord_float_nan": Series([2.0, 3.0, 4.5, np.nan]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the value in testing these, aren't these tested elsewhere? Can you remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| "ord_cat": Series( | ||
| pd.Categorical( | ||
| ["low", "m", "h", "vh"], | ||
| categories=["low", "m", "h", "vh"], | ||
| ordered=True, | ||
| ) | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me these test are unnecessarily long. Can you simplify - e.g. remove the Series call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| df = DataFrame( | ||
| { | ||
| "a": [1, 2, 3, 4], | ||
| "b": [4, 3, 2, 1], | ||
| "c": [4, 3, 2, 1], | ||
| "d": [10, 20, 30, 40], | ||
| "e": [100, 200, 300, 400], | ||
| } | ||
| ) | ||
| df["a"] = ( | ||
| df["a"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True) | ||
| ) | ||
| df["b"] = ( | ||
| df["b"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True) | ||
| ) | ||
| df["c"] = ( | ||
| df["c"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cat = pd.CategoricalDtype(categories=[4, 3, 2, 1], ordered=True)
df = DataFrame(
{
"a": pd.array([1, 2, 3, 4], dtype=cat),
"b": pd.array([4, 3, 2, 1], dtype=cat),
...
},
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| "a": [1, 2, 3, 4], | ||
| "b": [4, 3, 2, 1], | ||
| "c": [4, 3, 2, 1], | ||
| "d": [10, 20, 30, 40], | ||
| "e": [100, 200, 300, 400], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me just a and b are sufficient here; what case does adding the other columns test which we don't already have coverage for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated columns names could be all categorical or a mix of categorical and other non-categorical, wanted to capture this
pandas/tests/methods/corr.py
Outdated
| "dup": Series( | ||
| Categorical( | ||
| ["low", "m", "h"], | ||
| categories=["low", "m", "h"], | ||
| ordered=True, | ||
| ) | ||
| ), | ||
| "dup": Series([5, 6, 7]), # duplicate name, later column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates a DataFrame with a single columns since a dictionary can only hold one entry per unique key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah saw that - fixed
pandas/tests/methods/corr.py
Outdated
| ), | ||
| ], | ||
| ) | ||
| def test_transform_ord_cat_cols_to_coded_cols(input_df, expected_df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this test is necessary; your other tests are sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function in itself can also be potentially used for things other than correlation as it is a specific type of transformation. Correlation is one use case of transforming to these codes, so to me it seems like this function should be anyway tested for what it is supposed to do irrespective of its use in correlation. Please lmk what do you think.
| ): | ||
| stats = pytest.importorskip("scipy.stats") | ||
| method_scipy_func = {"kendall": stats.kendalltau, "spearman": stats.spearmanr} | ||
| ord_ser_cat_codes = ord_cat_series.cat.codes.replace(-1, np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicating the code from within pandas. I think we'd prefer less cases but hard coded results. I'd suggest breaking this up into two tests: one with a fixed Categorical with no NA value and one with a fixed Categorical with an NA value. You can parametrize this with two Series (one categorical, one not) that both give rise to the same answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Profiling main vs PR time taken with varying number of rows in the data frame. Median time computed for each correlation over 5 runs for each data frame size:
This PR was picked up and inspired from https://github.com/pandas-dev/pandas/pull/60493/files since that PR became stale almost a year ago