BUG: for ordered categorical data implements correct computation of kendall/spearman correlations #62880

pandeconscious · 2025-10-27T13:44:55Z

closes BUG: spearman correlation doesn't work on non-numeric data #60306
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Profiling main vs PR time taken with varying number of rows in the data frame. Median time computed for each correlation over 5 runs for each data frame size:

This PR picks up and builds on https://github.com/pandas-dev/pandas/pull/60493/files, since that PR went stale almost a year ago.

WillAyd · 2025-11-05T18:35:29Z

pandas/tests/frame/methods/test_cov_corr.py

+        self,
+        method,
+    ):
+        pytest.importorskip("scipy")


Unless you are going to use the import, you can just add this as a @td.skip_if_no("scipy") decorator to the test

thanks, fixed

WillAyd · 2025-11-05T18:36:17Z

pandas/tests/series/methods/test_cov_corr.py

        tm.assert_almost_equal(df.transpose().corr(method=my_corr), expected)
+
+    @pytest.mark.parametrize("method", ["kendall", "spearman"])
+    def test_corr_rank_ordered_categorical(


This test is pretty long, to the point where it's unclear what its intent is. Maybe it's worth breaking up into a few tests? Or adding parameterization?

WillAyd · 2025-11-05T19:20:24Z

pandas/core/frame.py

+        cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns
+
+        if len(cols_convert) > 0:
+            data = self.copy(deep=False)


I'm a bit wary of taking an entire copy of the dataframe in instances where there might be ordered categoricals; that's a potentially large performance hit, and the usage of this seems pretty niche

I see @rhshadrach commented on the original issue, so let's see what his thoughts are.

deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

@rhshadrach are you suggesting an asv benchmark or to profile it and paste the results in the description of the PR?

For benchmarking, we don't have any ASVs that hit this case. You can just set up an example that hits this case and use timeit to compare this PR to main. Aim for 10-100ms in runtime so we aren't merely benchmarking overhead. If you want any assistance in setting this up, just let me know.

Time profiling stats added to the description of the PR, please let me know if it makes sense or something else is needed as well.
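As a rough illustration of the shallow-copy point and of the timeit approach suggested above, the sketch below compares `copy(deep=False)` with a deep copy. The frame shape, column names, and repeat counts are assumptions chosen for illustration; this is not the setup behind the numbers in the PR description.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame: four numeric columns plus one ordered categorical column.
rng = np.random.default_rng(0)
n_rows = 100_000
levels = ["low", "m", "h", "vh"]
df = pd.DataFrame(rng.standard_normal((n_rows, 4)), columns=list("abcd"))
df["cat"] = pd.Categorical(
    rng.choice(levels, size=n_rows), categories=levels, ordered=True
)

shallow = df.copy(deep=False)

# The shallow copy is a new DataFrame object that still points at the same data.
shares_data = np.shares_memory(df["a"].to_numpy(), shallow["a"].to_numpy())

# Median over 5 repeats of 100 copies each.
shallow_time = float(
    np.median(timeit.repeat(lambda: df.copy(deep=False), number=100, repeat=5))
)
deep_time = float(
    np.median(timeit.repeat(lambda: df.copy(deep=True), number=100, repeat=5))
)
print(f"shallow: {shallow_time:.4f}s, deep: {deep_time:.4f}s per 100 copies")
```

To compare this PR against main, the same harness could wrap the `corr` call on each checkout instead of the copy.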

rhshadrach · 2025-11-06T18:46:00Z

pandas/core/frame.py

+        cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns
+
+        if len(cols_convert) > 0:
+            data = self.copy(deep=False)


deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

rhshadrach · 2025-11-06T18:51:24Z

pandas/core/frame.py

+            data[cols_convert] = data[cols_convert].transform(
+                lambda x: x.cat.codes.replace(-1, np.nan)
+            )


I think this will fail when a DataFrame has duplicate column names.

thanks for catching this, fixing this!
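To illustrate why label-based assignment can be fragile here, this small sketch (column names and values are made up) shows that selecting a duplicated label returns a DataFrame rather than a Series, while positional access stays unambiguous.

```python
import pandas as pd

df = pd.DataFrame([[1, 4.0, "x"], [2, 5.0, "y"]], columns=["dup", "dup", "other"])

by_label = df["dup"]         # both "dup" columns come back, as a DataFrame
by_position = df.iloc[:, 0]  # exactly one column, as a Series

print(type(by_label).__name__, by_label.shape)
print(type(by_position).__name__, by_position.tolist())
```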

rhshadrach · 2025-11-18T02:32:24Z

pandas/core/frame.py


        return correl

+    def _transform_ord_cat_cols_to_coded_cols(self) -> DataFrame:


I think we can simplify this a bit and make it more performant.

    result = self
    made_copy = False
    for idx, dtype in enumerate(self.dtypes):
        if not dtype == "category" or not dtype.ordered:
            continue
        col = result._ixs(idx, axis=1)
        if not made_copy:
            made_copy = True
            result = result.copy(deep=False)
        result._iset_item(idx, col.cat.codes.replace(-1, np.nan))
    return result

Can you move this to pandas.core.methods.corr (this file does not yet exist) and make it take a DataFrame as input - we can move the remaining parts of the implementation in a later PR.
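For readers outside the pandas codebase, here is a minimal standalone sketch of the same idea using only public API. It assumes `DataFrame.iloc` and `DataFrame.isetitem` as stand-ins for the private `_ixs` and `_iset_item` calls, and the function name is hypothetical; it is not the code that was merged.

```python
import numpy as np
import pandas as pd


def ordered_categoricals_to_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Replace ordered categoricals with their codes, copying lazily."""
    result = df
    made_copy = False
    # Iterate by position so duplicate column names are handled.
    for idx, dtype in enumerate(df.dtypes):
        if not isinstance(dtype, pd.CategoricalDtype) or not dtype.ordered:
            continue
        if not made_copy:
            # Shallow copy only once, and only if something must change.
            result = df.copy(deep=False)
            made_copy = True
        # Code -1 marks a missing value; turn it into NaN.
        codes = df.iloc[:, idx].cat.codes.replace(-1, np.nan)
        result.isetitem(idx, codes)
    return result


dtype = pd.CategoricalDtype(["low", "high"], ordered=True)
df = pd.DataFrame(
    {"a": pd.Series(["high", "low", None], dtype=dtype), "b": [1.0, 2.0, 3.0]}
)
coded = ordered_categoricals_to_codes(df)
plain = pd.DataFrame({"b": [1, 2, 3]})
print(coded)
```

When no ordered categorical column is present, the input frame is returned as-is and no copy is made.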

rhshadrach · 2025-11-18T21:19:36Z

pandas/core/methods/corr.py

+    result = df
+    made_copy = False
+    for idx, dtype in enumerate(df.dtypes):
+        if not dtype == "category" or not dtype.ordered:


Ah, I should have used is_categorical_dtype(dtype) here - you can import this from pandas.core.dtypes.common. I think that should also fix the pyright issue.

I am seeing pandas.errors.Pandas4Warning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead - will switch to the recommended one, right?

Hah! Yep - thanks.
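A short sketch of the recommended `isinstance` check follows. The helper name `is_ordered_categorical` is hypothetical and only serves the illustration.

```python
import numpy as np
import pandas as pd


def is_ordered_categorical(dtype) -> bool:
    # isinstance narrows the type first, so accessing .ordered is safe.
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


ordered = pd.CategoricalDtype(["low", "high"], ordered=True)
unordered = pd.CategoricalDtype(["low", "high"])
numeric = np.dtype("int64")

print(
    is_ordered_categorical(ordered),
    is_ordered_categorical(unordered),
    is_ordered_categorical(numeric),
)
```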

rhshadrach

Looking good! A few requests on simplifying the tests here.

rhshadrach · 2025-11-18T21:48:01Z

pandas/core/methods/corr.py

+def transform_ord_cat_cols_to_coded_cols(df: DataFrame) -> DataFrame:
+    """
+    any ordered categorical columns are transformed to the respective
+    categorical codes while other columns remain untouched


Our docstring standards require a single-line (not multi-line) summary as the first line. I think a one-liner is sufficient here; it can just be made more concise. E.g.

Replace ordered categoricals with their codes, making a shallow copy if necessary.

rhshadrach · 2025-11-18T21:53:17Z

pandas/tests/frame/methods/test_cov_corr.py

+                "ord_int": Series([0, 1, 2, 3]),
+                "ord_float": Series([2.0, 3.0, 4.5, 6.5]),
+                "ord_float_nan": Series([2.0, 3.0, 4.5, np.nan]),


I don't see the value in testing these; aren't they tested elsewhere? Can you remove them?

rhshadrach · 2025-11-18T21:54:27Z

pandas/tests/frame/methods/test_cov_corr.py

+                "ord_cat": Series(
+                    pd.Categorical(
+                        ["low", "m", "h", "vh"],
+                        categories=["low", "m", "h", "vh"],
+                        ordered=True,
+                    )
+                ),


It seems to me these tests are unnecessarily long. Can you simplify - e.g. remove the Series call here.

rhshadrach · 2025-11-18T21:59:19Z

pandas/tests/frame/methods/test_cov_corr.py

+        df = DataFrame(
+            {
+                "a": [1, 2, 3, 4],
+                "b": [4, 3, 2, 1],
+                "c": [4, 3, 2, 1],
+                "d": [10, 20, 30, 40],
+                "e": [100, 200, 300, 400],
+            }
+        )
+        df["a"] = (
+            df["a"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True)
+        )
+        df["b"] = (
+            df["b"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True)
+        )
+        df["c"] = (
+            df["c"].astype("category").cat.set_categories([4, 3, 2, 1], ordered=True)
+        )


    cat = pd.CategoricalDtype(categories=[4, 3, 2, 1], ordered=True)
    df = DataFrame(
        {
            "a": pd.array([1, 2, 3, 4], dtype=cat),
            "b": pd.array([4, 3, 2, 1], dtype=cat),
            ...
        },
    )
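As a worked illustration of what this construction produces, the sketch below keeps only columns `a` and `b` and inspects the category codes. The rank correlation is computed by hand as the Pearson correlation of the ranks, which is an assumption made so the example does not depend on `corr(method="spearman")` handling categoricals.

```python
import pandas as pd

# Categories are listed in descending numeric order, so 4 is the lowest level.
cat = pd.CategoricalDtype(categories=[4, 3, 2, 1], ordered=True)
df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3, 4], dtype=cat),
        "b": pd.array([4, 3, 2, 1], dtype=cat),
    }
)

codes_a = df["a"].cat.codes  # position of each value in the category list
codes_b = df["b"].cat.codes

# Spearman's rho by hand: Pearson correlation of the ranks of the codes.
rho = codes_a.rank().corr(codes_b.rank())
print(codes_a.tolist(), codes_b.tolist(), round(rho, 6))
```

The codes follow the declared category order rather than the numeric values, which is why the order passed to `categories` matters for rank-based correlations.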

rhshadrach · 2025-11-18T22:02:01Z

pandas/tests/frame/methods/test_cov_corr.py

+                "a": [1, 2, 3, 4],
+                "b": [4, 3, 2, 1],
+                "c": [4, 3, 2, 1],
+                "d": [10, 20, 30, 40],
+                "e": [100, 200, 300, 400],


It seems to me just a and b are sufficient here; what case does adding the other columns test which we don't already have coverage for?

Duplicated column names could be all categorical or a mix of categorical and non-categorical columns; I wanted to capture this.

rhshadrach · 2025-11-18T22:05:26Z

pandas/tests/methods/corr.py

+                    "dup": Series(
+                        Categorical(
+                            ["low", "m", "h"],
+                            categories=["low", "m", "h"],
+                            ordered=True,
+                        )
+                    ),
+                    "dup": Series([5, 6, 7]),  # duplicate name, later column


This creates a DataFrame with a single column, since a dictionary can only hold one entry per unique key.

yeah saw that - fixed
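The dictionary behaviour can be seen in a few lines. The second half of this sketch shows one possible way to get genuinely duplicated column names, using `pd.concat` along the columns; this is an illustration, not necessarily how the test was fixed.

```python
import pandas as pd

# A dict literal keeps only the last value for a repeated key.
data = {"dup": [1, 2, 3], "dup": [5, 6, 7]}
single = pd.DataFrame(data)

# Concatenating two Series with the same name keeps both columns.
cat = pd.Categorical(["low", "m", "h"], categories=["low", "m", "h"], ordered=True)
duplicated = pd.concat(
    [pd.Series(cat, name="dup"), pd.Series([5, 6, 7], name="dup")], axis=1
)

print(single.shape, list(duplicated.columns))
```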

rhshadrach · 2025-11-18T22:07:06Z

pandas/tests/methods/corr.py

+        ),
+    ],
+)
+def test_transform_ord_cat_cols_to_coded_cols(input_df, expected_df):


I don't think this test is necessary; your other tests are sufficient.

I think this function could also be used for things other than correlation, since it is a specific type of transformation. Correlation is just one use case for these codes, so it seems to me the function should be tested for what it is supposed to do, irrespective of its use in correlation. Please let me know what you think.

rhshadrach · 2025-11-18T22:13:19Z

pandas/tests/series/methods/test_cov_corr.py

+    ):
+        stats = pytest.importorskip("scipy.stats")
+        method_scipy_func = {"kendall": stats.kendalltau, "spearman": stats.spearmanr}
+        ord_ser_cat_codes = ord_cat_series.cat.codes.replace(-1, np.nan)


This is duplicating the code from within pandas. I think we'd prefer fewer cases but hard-coded results. I'd suggest breaking this up into two tests: one with a fixed Categorical with no NA value and one with a fixed Categorical with an NA value. You can parametrize this with two Series (one categorical, one not) that both give rise to the same answer.
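A sketch of the two fixed cases described above, with expected values worked out by hand. The data, the helper names, and the rank-based Spearman computation are all assumptions made for illustration; the actual tests would call `Series.corr` directly.

```python
import numpy as np
import pandas as pd

dtype = pd.CategoricalDtype(["low", "m", "h", "vh"], ordered=True)
no_na = pd.Series(["low", "m", "h", "vh"], dtype=dtype)
with_na = pd.Series(["low", "m", None, "vh"], dtype=dtype)
other = pd.Series([10, 20, 30, 5])


def to_codes(ser: pd.Series) -> pd.Series:
    # Code -1 marks a missing category value; treat it as NaN.
    return ser.cat.codes.replace(-1, np.nan)


def spearman_by_ranks(x: pd.Series, y: pd.Series) -> float:
    # Drop pairs with a missing value, then correlate the ranks (Pearson).
    mask = x.notna() & y.notna()
    return float(x[mask].rank().corr(y[mask].rank()))


rho_no_na = spearman_by_ranks(to_codes(no_na), other)      # hand-computed: -0.2
rho_with_na = spearman_by_ranks(to_codes(with_na), other)  # hand-computed: -0.5
print(round(rho_no_na, 6), round(rho_with_na, 6))
```

In the NA case the third pair is dropped before ranking, so only three observations contribute to the result.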

pandeconscious added 7 commits October 23, 2025 10:46

init commit kendall spearman ordinal cats

1f8c628

Merge branch 'pandas-dev:main' into ordered_cat_corr

906f1e4

series test update and fixes

497dc7e

cat desc longer in tests

583aca6

testing frame corr

e069810

pre commit fixes v2

b90726f

cleanup

65a506c

pandeconscious changed the title ~~BUG: ordered categorical data now calculates right kendall/spearman correlations~~ BUG: for ordered categorical data implements correct computation of kendall/spearman correlations Oct 27, 2025

pandeconscious added 5 commits November 4, 2025 15:00

Merge branch 'pandas-dev:main' into ordered_cat_corr

ab3b8b9

test import scipy fix

e93ed83

rst sorting autofix

ec4d97e

Merge branch 'pandas-dev:main' into ordered_cat_corr

ebfc3b0

Merge branch 'pandas-dev:main' into ordered_cat_corr

8cfacef

pandeconscious marked this pull request as ready for review November 5, 2025 14:29

pandeconscious mentioned this pull request Nov 5, 2025

BUG: spearman correlation doesn't work on non-numeric data #60306

Open

1 task

WillAyd requested changes Nov 5, 2025


rhshadrach requested changes Nov 6, 2025


pandeconscious marked this pull request as draft November 7, 2025 21:03

pandeconscious added 6 commits November 12, 2025 12:07

Merge branch 'pandas-dev:main' into ordered_cat_corr

7ef7fb2

refactor

588808a

fix dtype for duplicates

c484552

Merge branch 'pandas-dev:main' into ordered_cat_corr

216475c

clean up

e997747

Merge branch 'pandas-dev:main' into ordered_cat_corr

4184167

pandeconscious marked this pull request as ready for review November 17, 2025 14:33

pandeconscious requested review from WillAyd and rhshadrach November 17, 2025 14:35

rhshadrach requested changes Nov 18, 2025


Merge branch 'pandas-dev:main' into ordered_cat_corr

8bcd3dc

pandeconscious marked this pull request as draft November 18, 2025 14:46

pandeconscious added 4 commits November 18, 2025 15:10

clean up

2673281

import fix

ff48847

test tranform ordered cat func

1c69e29

tests and mypy fixes

8b26a7d

rhshadrach reviewed Nov 18, 2025


type check fix

a625520

rhshadrach requested changes Nov 18, 2025


addressing review comments

259424e

pandeconscious requested a review from rhshadrach November 18, 2025 23:42

pandeconscious added 6 commits November 18, 2025 18:42

Merge branch 'main' into ordered_cat_corr

f141e6a

type fix corr.py

d2d0f71

ruff format

858d0c2

mypy fix

a8c88c7

Merge branch 'main' into ordered_cat_corr

1a472e3

scipy unavailable fix in test

71305aa

pandeconscious marked this pull request as ready for review November 19, 2025 16:28

