Description
Is your feature request related to a problem?
CellTypist currently provides 58 pre-trained models containing 2,789 unique cell type labels (as of version 1.7.1), but Cell Ontology (CL) mappings are only available for a tiny fraction of these cell types. The [celltypist_wiki repository](https://github.com/Teichlab/celltypist_wiki) contains Basic_celltype_information.xlsx files with CL IDs, but these are:
- Severely limited in coverage: The `Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx` file contains only ~100 cell types, representing just 3.6% of all 2,789 CellTypist cell type labels
- Not programmatically accessible: Users must manually download and parse Excel files from a separate repository
- Not embedded in model objects: The Python `Model` class provides no access to ontology information despite CellTypist being described as "a working group in charge of curating models and ontologies"
This creates a critical barrier for users who need to standardize CellTypist annotations to Cell Ontology terms, which is increasingly required for data submission to major platforms such as CZ CELLxGENE (85M+ cells), HuBMAP, and the Human Cell Atlas.
Scope of the problem
import celltypist
from celltypist import models
# Download all models
celltypist.models.download_models(force_update=False)
# Extract all unique cell types across all 58 models
ct_cell_types = []
for model_name in models.get_all_models():
    model = models.Model.load(model_name)
    ct_cell_types.extend(model.cell_types)
unique_cell_types = sorted(list(set(ct_cell_types)))
print(f"Total pre-trained models: {len(models.get_all_models())}") # 58
print(f"Total unique cell types: {len(unique_cell_types)}") # 2789
# Problem: No way to get Cell Ontology IDs for these 2789 cell types
# model.ontology_ids # ❌ Does not exist
# model.get_cl_id('Memory B cells') # ❌ Does not exist
Cell types lacking ontology mappings
The vast majority (96.4%) of CellTypist's 2,789 cell type labels have no programmatically accessible CL IDs, including:
- Brain models: Highly granular Allen Institute nomenclature like `'001 CLA-EPd-CTX Car3 Glut'`, `'016 CA1-ProS Glut'`
- Organ-specific models: Lung, kidney, intestine, heart, muscle, pancreas, liver, spleen models
- Developmental models: Fetal lung, fetal intestine, pan-fetal human, developing mouse brain
- Disease models: COVID-19 immune landscape, pulmonary fibrosis
- Mouse models: Adult mouse gut, developing mouse brain
- Novel cell states: `'vCM3_stressed'`, `'tissue_repair mac'`, `'cycling B cells'`, `'valve_down_lec'`
Only the Pan Immune atlas (~100 cell types) has publicly accessible CL mappings, leaving ~2,689 cell types (96.4%) without ontology standardization.
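The coverage figures above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: `coverage_report` is a hypothetical helper, and the label lists are placeholders sized to match the counts quoted in this issue (2,789 labels in total, roughly 100 of them mapped), not real CellTypist labels.

```python
def coverage_report(all_labels, mapped_labels):
    """Summarize how many labels have an ontology mapping (hypothetical helper)."""
    all_labels = set(all_labels)
    # Only count mappings that correspond to a label we actually have.
    mapped = all_labels & set(mapped_labels)
    total = len(all_labels)
    pct_mapped = round(100 * len(mapped) / total, 1) if total else 0.0
    pct_unmapped = round(100 - pct_mapped, 1) if total else 0.0
    return {
        "total": total,
        "mapped": len(mapped),
        "unmapped": total - len(mapped),
        "pct_mapped": pct_mapped,
        "pct_unmapped": pct_unmapped,
    }


# Placeholder labels sized to the counts quoted in this issue.
all_labels = [f"label_{i}" for i in range(2789)]
mapped_labels = all_labels[:100]

report = coverage_report(all_labels, mapped_labels)
print(report)
```

With real data, `all_labels` would be the `unique_cell_types` list built earlier and `mapped_labels` the cell type column parsed from the wiki spreadsheet.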
Describe the solution you'd like
Provide comprehensive Cell Ontology mappings for all 2,789 cell type labels across all 58 pre-trained models, accessible through the Python API:
import celltypist
from celltypist import models
model = models.Model.load('Immune_All_Low.pkl')
# Proposed API options:
# Option 1: Model attribute with full metadata
model.ontology_metadata
# Returns DataFrame with columns:
# ['cell_type', 'cl_id', 'cl_label', 'description', 'hierarchy_level', 'curated_markers']
# Option 2: Lookup methods
model.get_ontology_id('Memory B cells') # Returns 'CL:0000787'
model.get_ontology_label('Memory B cells') # Returns 'memory B cell'
model.get_description('Memory B cells') # Returns full CL description
# Option 3: Global mapping export
models.export_all_ontology_mappings('celltypist_cl_mappings.csv')
# Exports comprehensive mapping for all 2789 cell types across all 58 models
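To make the intended behavior of Option 2 concrete, here is a minimal standalone sketch of the lookup logic, assuming the mappings ship as a simple table with `cell_type`, `cl_id`, and `cl_label` columns. The `OntologyLookup` class, its method names, and the inline CSV are all hypothetical; none of this exists in CellTypist today. The only mapping used is the `Memory B cells` example from above, and `vCM3_stressed` stands in for a label with no CL term yet.

```python
import csv
import io

# Hypothetical mapping table; an empty cl_id means "no CL term assigned yet".
MAPPING_CSV = """cell_type,cl_id,cl_label
Memory B cells,CL:0000787,memory B cell
vCM3_stressed,,
"""


class OntologyLookup:
    """Label -> Cell Ontology lookup (illustrative, not part of CellTypist)."""

    def __init__(self, rows):
        self._by_label = {}
        for row in rows:
            cl_id = (row.get("cl_id") or "").strip()
            if cl_id:  # skip labels that have no mapping
                cl_label = (row.get("cl_label") or "").strip()
                self._by_label[row["cell_type"]] = (cl_id, cl_label)

    @classmethod
    def from_csv_text(cls, text):
        return cls(csv.DictReader(io.StringIO(text)))

    def get_ontology_id(self, cell_type):
        """Return the CL ID for a label, or None if it is unmapped."""
        entry = self._by_label.get(cell_type)
        return entry[0] if entry else None

    def get_ontology_label(self, cell_type):
        """Return the CL term name for a label, or None if it is unmapped."""
        entry = self._by_label.get(cell_type)
        return entry[1] if entry else None

    def unmapped(self, cell_types):
        """List the labels that still lack a CL mapping."""
        return [ct for ct in cell_types if ct not in self._by_label]


lookup = OntologyLookup.from_csv_text(MAPPING_CSV)
print(lookup.get_ontology_id("Memory B cells"))
print(lookup.get_ontology_label("Memory B cells"))
print(lookup.unmapped(["Memory B cells", "vCM3_stressed"]))
```

Returning `None` for unmapped labels, rather than raising, is one possible design choice: it lets users annotate whatever is mapped and report the remainder via `unmapped`.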