Having .raw objects in the constituent modalities can cause incorrect embedding plots when var_names are different #162

@timslittle

Describe the bug
This may be an AnnData issue and is arguably not a bug. However, it is more likely to occur when working with Muon and is difficult to diagnose, so I believe it warrants at least a warning.

When plotting features with mu.pl.embedding and the default use_raw=True, the plotted values can be wrong if the var_names of a modality do not match the var_names of its .raw object.

This may seem like an unusual situation, but it is easy to end up in it inadvertently by calling mdata.var_names_make_unique() after combining two modality datasets that already have .raw set.
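A quick way to check whether an object is in this state is to compare each modality's var_names against its .raw.var_names. The sketch below is only an illustration: raw_name_mismatches is a hypothetical helper, not part of muon or mudata, and the toy objects are plain stand-ins that mimic the .mod, .var_names and .raw attributes, so the snippet runs without the real libraries. It only flags names that are absent from .raw, because .raw may legitimately hold more variables than the modality itself.

```python
from types import SimpleNamespace


def raw_name_mismatches(mdata):
    """Map each modality name to the var_names that are absent from its .raw.

    Hypothetical helper: relies only on .mod, .var_names and .raw.var_names.
    """
    report = {}
    for mod_name, adata in mdata.mod.items():
        if adata.raw is None:
            # Nothing to compare if .raw was never set for this modality.
            continue
        raw_names = set(adata.raw.var_names)
        missing = [name for name in adata.var_names if name not in raw_names]
        if missing:
            report[mod_name] = missing
    return report


# Stand-ins for the state described above: var_names carry a modality
# prefix, while the .raw objects still hold the un-prefixed names.
rna = SimpleNamespace(
    var_names=["rna:CD4", "rna:CD8A"],
    raw=SimpleNamespace(var_names=["CD4", "CD8A"]),
)
prot = SimpleNamespace(
    var_names=["prot:CD4", "prot:CD8a"],
    raw=SimpleNamespace(var_names=["CD4", "CD8a"]),
)
toy = SimpleNamespace(mod={"rna": rna, "prot": prot})

print(raw_name_mismatches(toy))
```

With a real MuData object the same function could be called as raw_name_mismatches(mdata); an empty dict would mean that every modality's var_names are present in its .raw.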

To Reproduce

import numpy as np
import pandas as pd
import re
import scanpy as sc
import muon as mu
# 10x PBMC public data
mdata = mu.read_10x_h5('pbmc5k_protein/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5')
mdata.var_names_make_unique()
# Create pointers to each modality and set raw objects.
rna = mdata['rna']
mdata['rna'].raw = rna.copy()
prot = mdata['prot']
mdata['prot'].raw = prot.copy()
# Run UMAP
sc.pp.pca(rna)
sc.pp.pca(prot)
sc.pp.neighbors(rna)
sc.pp.neighbors(prot)
sc.tl.umap(rna)

Works great:

mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['CD4', 'CD8A', 'CD4_TotalSeqB', 'CD8a_TotalSeqB'],
                s = 50,
                vmax = "p99"
               )

Now let's assume that the antibody data is not annotated with the '_TotalSeqB' suffix, and we need to make the var_names unique between modalities:

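# Strip the '_TotalSeqB' suffix from the protein names
mdata['prot'].var_names = [re.search(r".+(?=_TotalSeqB)", i).group(0) for i in mdata['prot'].var_names]
# Reset .raw so that it carries the renamed (un-suffixed) var_names
mdata['prot'].raw = prot.copy()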
# Need to make the var_names unique between modalities
mdata.var_names_make_unique()

Now the embedding plot shows incorrect values for both modalities:

mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['rna:CD4', 'rna:CD8A', 'prot:CD4', 'prot:CD8a'],
                s = 50,
                vmax = "p99",
               )

Note that specifying use_raw = False will fix this.

It can also be fixed by resetting .raw after var_names_make_unique(), so that the var_names in the raw objects match again:

mdata['rna'].raw = rna.copy()
mdata['prot'].raw = prot.copy()
mu.pl.embedding(mdata, 
                basis = 'rna:umap',
                color = ['rna:CD4', 'rna:CD8A', 'prot:CD4', 'prot:CD8a'],
                s = 50,
                vmax = "p99"
               )

It is worth noting that in Scanpy, a similar attempt to plot var_names that do not match between the layer in use and the raw object returns an error when use_raw=True, albeit not one that explains where the discrepancy lies:

sc.pl.embedding(mdata['rna'], 
                basis = 'umap',
                color = ['rna:CD4', 'rna:CD8A'],
                s = 50,
                vmax = "p99",
                use_raw = True
               )

Expected behaviour
When plotting from the raw object while the var_names differ between the current layer and raw, the function should raise a descriptive error or emit a warning, e.g. "Warning: Var_names between 'raw' and current layer do not match, may lead to unwanted behaviour".
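To make the request concrete, here is a minimal sketch of such a guard. It is an assumption about how a check could look, not muon's actual code: warn_on_raw_mismatch is a hypothetical function, and it only inspects .var_names and .raw.var_names. It warns when a requested feature is found in var_names but is missing from .raw, which is the situation that currently produces a wrong plot without any message.

```python
import warnings
from types import SimpleNamespace


def warn_on_raw_mismatch(adata, keys, use_raw=True):
    """Warn if requested keys are in adata.var_names but not in adata.raw.

    Hypothetical guard: returns the list of affected keys (empty if none).
    """
    if not use_raw or adata.raw is None:
        return []
    current = set(adata.var_names)
    raw = set(adata.raw.var_names)
    missing = [key for key in keys if key in current and key not in raw]
    if missing:
        warnings.warn(
            f"var_names {missing} are present in the current object but not "
            "in .raw; plotting with use_raw=True may show the wrong features.",
            UserWarning,
            stacklevel=2,
        )
    return missing


# Stand-in for a modality whose var_names were prefixed after .raw was set.
rna = SimpleNamespace(
    var_names=["rna:CD4", "rna:CD8A"],
    raw=SimpleNamespace(var_names=["CD4", "CD8A"]),
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    affected = warn_on_raw_mismatch(rna, ["rna:CD4", "rna:CD8A"])

print(affected)
print(len(caught))
```

Checking only the requested keys, rather than requiring the two name sets to be identical, avoids false alarms when .raw intentionally holds more variables than the filtered object.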

System
Python 3.12.9 | packaged by conda-forge | (main, Mar 4 2025, 22:44:42) [Clang 18.1.8 ]
macOS-15.5-arm64-arm-64bit
anndata 0.11.3
mudata 0.3.1
muon 0.1.7
numpy 2.1.3
pandas 2.2.3
scanpy 1.11.0
