
XMR4EL – eXtreme Multi-Label Ranking for Entity Linking

XMR4EL is a research-friendly framework that extends the PECOS pipeline for extreme multi-label ranking (XMR). It is designed to help you build, evaluate, and iterate on entity linking systems that must choose from very large label spaces. Although our primary focus is biomedical entity linking, every component is modular so you can plug in alternative algorithms or apply the framework to any XMR-ready dataset.

Table of contents

  1. Key ideas
  2. Installation
  3. Repository layout
  4. How the pipeline works
  5. Training a model
  6. Running inference
  7. Saving and loading models
  8. Customising components
  9. Input data expectations
  10. Working with Docker
  11. Status

Key ideas

  • Hierarchical modelling – Large label sets are handled by recursively clustering labels into a tree. At prediction time the tree is traversed top-down to focus computation on the most promising label subsets (xmr4el/xmr/base.py); a sketch of this traversal follows the list.

  • Flexible featurisation – Text is converted to dense and sparse features through a configurable pipeline composed of vectorisers, transformers, and dimensionality reducers (xmr4el/featurization).

  • Label-aware ranking – The framework creates label embeddings (PIFA) and trains ranking models that fuse matcher and ranker scores for better retrieval quality (xmr4el/models).

  • Swappable components – Every stage (clustering, matcher, ranker) is configured through lightweight wrappers so you can experiment without editing the core training loop.
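
To make the traversal concrete, the sketch below shows a beam-style, top-down walk over a label tree. It is illustrative pseudocode only (the node methods are invented names, not the framework's API); the real logic lives in xmr4el/xmr/base.py:

# Illustrative traversal sketch; `is_leaf`, `matcher_score`, and
# `rank_labels` are hypothetical names, not XMR4EL's actual API.
def traverse(node, query, beam_size):
    # Leaf: score the labels stored at this node with the ranker.
    if node.is_leaf():
        return node.rank_labels(query)
    # Internal node: let the matcher score each child cluster...
    scored = [(child, child.matcher_score(query)) for child in node.children]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # ...and recurse only into the beam_size most promising branches.
    results = []
    for child, _ in scored[:beam_size]:
        results.extend(traverse(child, query, beam_size))
    return results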

Installation

Python: 3.12

Install the package directly from GitHub:

pip install git+https://github.com/lasigeBioTM/XMR4EL.git

All runtime dependencies are listed in requirements.txt.

GPU support

The repository contains CUDA-ready configurations (11.4 – 11.8). RAPIDS-backed GPU models have been validated with the provided Docker image.

Repository layout

xmr4el/
├── featurization/   # Text encoders, preprocessing utilities, label embeddings
├── clustering/      # Clustering wrappers used to build the hierarchical tree
├── matcher/         # Candidate generation models
├── ranker/          # Ranking algorithms for leaf scoring
├── models/          # Component factories and configuration helpers
└── xmr/             # Hierarchical model, training loop, persistence utilities

Test fixtures live under test/test_data and are useful when experimenting with the API.

How the pipeline works

  1. Preprocessing – Raw training files are grouped by label, producing lists of synonyms per concept. See Preprocessor.load_data_labels_from_file for the exact behaviour (xmr4el/featurization/preprocessor.py).

  2. Text encoding – TextEncoder builds sparse (e.g., TF–IDF) and dense (e.g., BioBERT) representations according to the chosen configuration (xmr4el/featurization/text_encoder.py). Optional dimensionality reduction can be applied before training.

  3. Label embeddings – LabelEmbeddingFactory converts grouped texts into a binary label matrix and produces PIFA label embeddings (xmr4el/featurization/label_embedding_factory.py); a short sketch of PIFA follows this list.

  4. Hierarchical tree building – HierarchicaMLModel recursively clusters labels, trains matchers to route queries, and fits ranking models (xmr4el/xmr/base.py).

  5. Prediction – Queries are encoded with the trained TextEncoder, routed through the tree, and scored. Fusion strategies combine matcher and ranker outputs to produce the final top-k predictions.
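
As a point of reference for step 3, a PIFA (Positive Instance Feature Aggregation) embedding is the normalized sum of the feature vectors of a label's positive instances. The snippet below is a minimal dense-matrix illustration of that idea, not the framework's implementation (which lives in xmr4el/featurization/label_embedding_factory.py and works with sparse matrices):

import numpy as np

# Minimal PIFA illustration: each label embedding aggregates the feature
# vectors of the instances annotated with that label, then L2-normalizes.
def pifa(X, Y):
    """X: (n_instances, dim) features; Y: (n_instances, n_labels) binary."""
    Z = Y.T @ X                                    # sum positives per label
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)            # guard against zero rows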

The orchestration class XModel glues these stages together so you can train and evaluate an end-to-end system with a handful of method calls.

Training a model

from xmr4el.featurization.preprocessor import Preprocessor
from xmr4el.xmr.model import XModel

# 1. Load synonym groups and labels from disk
paths = {
    "train": "data/raw/mesh_data/bc5cdr/train_bc5cdr.txt",
    "labels": "data/raw/mesh_data/medic/labels.txt",
}
dataset = Preprocessor.load_data_labels_from_file(paths["train"], paths["labels"])
X_text, Y_labels = dataset["corpus"], dataset["labels"]

# 2. Describe the components you want to use
model = XModel(
    vectorizer_config={"type": "tfidf"},
    transformer_config={"type": "sentencetbiobert", "kwargs": {"batch_size": 400}},
    clustering_config={"type": "sklearnminibatchkmeans", "kwargs": {"n_clusters": 256}},
    matcher_config={"type": "linear_l2"},
    ranker_config={"type": "sklearnlogisticregression"},
    min_leaf_size=20,
    ranker_every_layer=True,
    n_workers=8,
)

# 3. Train
model.train(X_text, Y_labels)

During training the model:

  • Persists the raw texts and labels temporarily so they can be restored after fitting.
  • Encodes the corpus, builds label embeddings, and prepares sparse training matrices.
  • Constructs a hierarchical tree of classifiers and ranking models.

Running inference

queries = [
    "chromosome 10p deletion",
    "13q deletion syndrome",
]

# Request the top 5 labels per query
scores = model.predict(queries, topk=5)

# `scores` is a scipy CSR matrix with label scores per query
# Additional metadata (paths, fused scores) is returned when requesting global mode

You can adjust the traversal strategy with parameters such as beam_size, fusion (geometric vs. arithmetic), topk_mode, and topk_inside_global for exhaustive scoring.
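
For example, a call that sets these options explicitly might look like the sketch below. Only the parameter names listed above come from the source; the values and the exact signature are assumptions, so verify them against XModel.predict before relying on this:

# Hypothetical usage of the traversal options named above.
scores = model.predict(
    queries,
    topk=5,
    beam_size=10,             # tree branches kept at each level
    fusion="geometric",       # geometric vs. arithmetic score fusion
    topk_mode="global",       # global mode also returns extra metadata
    topk_inside_global=True,  # exhaustive scoring inside global mode
)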

Saving and loading models

model.save("artifacts/")
restored = XModel.load("artifacts/xmodel_2024-05-01_12-00-00")

The saved directory contains:

  • Serialized hierarchical models (tree structure + trained components).
  • Vectoriser/transformer artefacts used by the TextEncoder.
  • Cached training metadata (label mappings, embeddings, configuration).

Customising components

The following implementations are known to work well and are enabled out of the box (README.md in each submodule lists additional options):

  • Vectorisers – tfidf
  • Transformers – biobert, sentencetbiobert
  • Clustering – sklearnminibatchkmeans, balancedkmeans
  • Rankers – sklearnlogisticregression, sklearnrandomforestclassifier, sgdclassifier

To introduce a new algorithm, add a wrapper under xmr4el/models that exposes a fit/predict compatible interface, then reference it in the configuration dictionaries used when instantiating XModel.
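
As an illustration, a new ranker wrapper could be as small as the sketch below. The class name and config key are hypothetical; mirror an existing wrapper in xmr4el/models for the exact interface and registration mechanism the framework expects:

# Hypothetical wrapper sketch around scikit-learn; not part of XMR4EL.
from sklearn.linear_model import RidgeClassifier

class RidgeRankerWrapper:
    """Exposes the fit/predict-compatible interface described above."""

    def __init__(self, **kwargs):
        self._model = RidgeClassifier(**kwargs)

    def fit(self, X, y):
        self._model.fit(X, y)
        return self

    def predict(self, X):
        return self._model.predict(X)

Once registered, it would be referenced the same way as the built-ins, e.g. ranker_config={"type": "ridgerankerwrapper"} (key name assumed).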

Input data expectations

  1. Label file – One label identifier per line.
  2. Training file – Tab-separated values where each row begins with the index of the corresponding label and is followed by a text mention.

Example label file:

C538288
C535484
C579849

Example training file:

0\t10p deletion syndrome (partial)
0\tchromosome 10, 10p- partial
1\t13q deletion syndrome

When loading with Preprocessor.load_data_labels_from_file the method:

  • Groups mentions by the numeric label index.
  • Aligns the resulting groups with label IDs.
  • Optionally truncates the dataset for quick experiments.

Ensure the number of labels matches the number of grouped entries; otherwise an exception is raised.
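
Conceptually, the grouping and alignment steps behave like this standalone snippet (an illustration of the behaviour described above, not the library code):

from collections import defaultdict

rows = [(0, "10p deletion syndrome (partial)"),
        (0, "chromosome 10, 10p- partial"),
        (1, "13q deletion syndrome")]
label_ids = ["C538288", "C535484"]

groups = defaultdict(list)
for idx, mention in rows:                     # group mentions by label index
    groups[idx].append(mention)

corpus = [groups[i] for i in sorted(groups)]  # synonym lists per concept
assert len(corpus) == len(label_ids)          # a mismatch raises an exception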

Working with Docker

Use the included dockerfile to spin up a CUDA-enabled development container.

docker build -t xmr4el .
docker run -v "$(pwd)":/app --name xmr4el -it xmr4el bash

Inside the container you may want to activate a virtual environment and set the PYTHONPATH:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

Add the export command to .venv/bin/activate if you prefer a persistent configuration.

Status

  • Paper: in progress.
  • Implementation: functional but still evolving. Contributions and experimental feedback are welcome – see CONTRIBUTING.md for guidelines.
