Programmatic Access to Cell Ontology Mappings for All CellTypist Model Cell Types #162

@nick-youngblut

Is your feature request related to a problem?

CellTypist currently provides 58 pre-trained models containing 2,789 unique cell type labels (as of version 1.7.1), but Cell Ontology (CL) mappings are only available for a tiny fraction of these cell types. The [celltypist_wiki repository](https://github.com/Teichlab/celltypist_wiki) contains Basic_celltype_information.xlsx files with CL IDs, but these are:

  1. Severely limited in coverage: The Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx file contains only ~100 cell types, or roughly 3.6% of all 2,789 CellTypist cell type labels
  2. Not programmatically accessible: Users must manually download and parse Excel files from a separate repository
  3. Not embedded in model objects: The Python Model class provides no access to ontology information despite CellTypist being described as "a working group in charge of curating models and ontologies"

This creates a critical barrier for users who need to standardize CellTypist annotations to Cell Ontology terms, a step that is increasingly required for data submission to major platforms such as CZ CELLxGENE (85M+ cells), HuBMAP, and the Human Cell Atlas.

Scope of the problem

import celltypist
from celltypist import models

# Download all models
celltypist.models.download_models(force_update=False)

# Extract all unique cell types across all 58 models
ct_cell_types = []
for model_name in models.get_all_models():
    model = models.Model.load(model_name)
    ct_cell_types.extend(model.cell_types)

unique_cell_types = sorted(set(ct_cell_types))
print(f"Total pre-trained models: {len(models.get_all_models())}")  # 58
print(f"Total unique cell types: {len(unique_cell_types)}")  # 2789

# Problem: No way to get Cell Ontology IDs for these 2789 cell types
# model.ontology_ids  # ❌ Does not exist
# model.get_cl_id('Memory B cells')  # ❌ Does not exist

Cell types lacking ontology mappings

The vast majority (roughly 96.4%) of CellTypist's 2,789 cell type labels have no programmatically accessible CL IDs, including:

  • Brain models: Highly granular Allen Institute nomenclature like '001 CLA-EPd-CTX Car3 Glut', '016 CA1-ProS Glut'
  • Organ-specific models: Lung, kidney, intestine, heart, muscle, pancreas, liver, spleen models
  • Developmental models: Fetal lung, fetal intestine, pan-fetal human, developing mouse brain
  • Disease models: COVID-19 immune landscape, pulmonary fibrosis
  • Mouse models: Adult mouse gut, developing mouse brain
  • Novel cell states: 'vCM3_stressed', 'tissue_repair mac', 'cycling B cells', 'valve_down_lec'

Only the Pan Immune atlas (~100 cell types) has publicly accessible CL mappings, leaving ~2,689 cell types (96.4%) without ontology standardization.
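To make the coverage figures above reproducible, here is a minimal sketch of how the mapped and unmapped fractions could be computed. The `mapping_coverage` helper and the toy `cl_mapping` dictionary are hypothetical and not part of CellTypist; in practice the labels would come from `model.cell_types` and the mapping from a parsed copy of the wiki spreadsheet.

```python
def mapping_coverage(labels, cl_mapping):
    """Return (mapped, unmapped, fraction_mapped) for a collection of labels.

    `cl_mapping` is a plain dict of cell type label -> CL ID (hypothetical
    input, e.g. built by hand from the celltypist_wiki spreadsheet).
    """
    unique = set(labels)
    mapped = {label for label in unique if cl_mapping.get(label)}
    unmapped = unique - mapped
    fraction = len(mapped) / len(unique) if unique else 0.0
    return sorted(mapped), sorted(unmapped), fraction


# Toy inputs for illustration only; duplicates are collapsed before counting.
labels = ["Memory B cells", "vCM3_stressed", "tissue_repair mac", "Memory B cells"]
cl_mapping = {"Memory B cells": "CL:0000787"}

mapped, unmapped, fraction = mapping_coverage(labels, cl_mapping)
print(f"{len(mapped)} of {len(mapped) + len(unmapped)} labels mapped ({fraction:.1%})")

# The same arithmetic applied to the approximate counts quoted in this issue.
approx_mapped, total = 100, 2789
print(f"{approx_mapped / total:.1%} mapped, {(total - approx_mapped) / total:.1%} unmapped")
```

Because the ~100 figure is itself approximate, the resulting percentages should be read as estimates rather than exact counts.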

Describe the solution you'd like

Provide comprehensive Cell Ontology mappings for all 2,789 cell type labels across all 58 pre-trained models, accessible through the Python API:

import celltypist
from celltypist import models

model = models.Model.load('Immune_All_Low.pkl')

# Proposed API options:

# Option 1: Model attribute with full metadata
model.ontology_metadata
# Returns DataFrame with columns: 
# ['cell_type', 'cl_id', 'cl_label', 'description', 'hierarchy_level', 'curated_markers']

# Option 2: Lookup methods
model.get_ontology_id('Memory B cells')  # Returns 'CL:0000787'
model.get_ontology_label('Memory B cells')  # Returns 'memory B cell'
model.get_description('Memory B cells')  # Returns full CL description

# Option 3: Global mapping export
models.export_all_ontology_mappings('celltypist_cl_mappings.csv')
# Exports comprehensive mapping for all 2789 cell types across all 58 models
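As a rough illustration of how Options 2 and 3 might behave, the sketch below prototypes the lookup and export calls as a standalone wrapper around a small mapping table. Everything here is hypothetical: `OntologyRecord`, `OntologyLookup`, and `export` are illustrative names, not existing CellTypist classes or methods, and returning `None` for unmapped labels is just one possible design choice.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Optional, TextIO


@dataclass(frozen=True)
class OntologyRecord:
    """One row of a (hypothetical) label-to-Cell-Ontology mapping table."""
    cell_type: str
    cl_id: str
    cl_label: str


class OntologyLookup:
    """Hypothetical stand-in for the lookup methods proposed on Model."""

    def __init__(self, records: Iterable[OntologyRecord]):
        self._by_label = {rec.cell_type: rec for rec in records}

    def get_ontology_id(self, cell_type: str) -> Optional[str]:
        # Return None rather than raising, so unmapped labels are easy to detect.
        rec = self._by_label.get(cell_type)
        return rec.cl_id if rec else None

    def get_ontology_label(self, cell_type: str) -> Optional[str]:
        rec = self._by_label.get(cell_type)
        return rec.cl_label if rec else None

    def export(self, handle: TextIO) -> None:
        # Write the mapping as CSV, sorted by label for stable output.
        writer = csv.writer(handle)
        writer.writerow(["cell_type", "cl_id", "cl_label"])
        for rec in sorted(self._by_label.values(), key=lambda r: r.cell_type):
            writer.writerow([rec.cell_type, rec.cl_id, rec.cl_label])


lookup = OntologyLookup(
    [OntologyRecord("Memory B cells", "CL:0000787", "memory B cell")]
)
print(lookup.get_ontology_id("Memory B cells"))
print(lookup.get_ontology_id("vCM3_stressed"))  # unmapped label

buffer = io.StringIO()
lookup.export(buffer)
print(buffer.getvalue())
```

A per-model attribute (Option 1) could expose the same records as a DataFrame; the wrapper above only shows the lookup and export semantics, including explicit handling of labels that have no CL term yet.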
