🧩 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems
DataRec focuses on the data management phase of recommender systems, promoting standardization, interoperability, and best practices for data filtering, splitting, analysis, and export.
Official repository of the paper:
📄 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems (SIGIR 2025)
- Features
- Installation
- Quickstart
- Datasets
- Documentation
- Contributing
- Citation
- Authors and Contributors
- Related Projects
- License
- Dataset Management: multi-format I/O with dynamic schema specification.
- Reference Datasets: curated, versioned, and traceable datasets.
- Filtering Strategies: widely used user/item interaction filters.
- Splitting Strategies: temporal and random splits for reproducible evaluation.
- Data Characteristics: compute dataset-level statistics (e.g., sparsity, popularity).
- Interoperability: export datasets to external recommendation frameworks.
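To make the filtering, splitting, and data-characteristics features above concrete, here is a minimal, library-independent sketch in plain Python. It is not DataRec's implementation: the function names (`iterative_k_core`, `random_hold_out`, `sparsity`) and the toy interaction list are hypothetical and chosen only for illustration, and the sparsity definition (one minus the share of observed user-item pairs) is the common convention, assumed here rather than taken from the library.

```python
import random
from collections import Counter


def iterative_k_core(interactions, k):
    """Drop users and items with fewer than k interactions, repeating until stable."""
    current = set(interactions)  # a set also discards duplicated (user, item) pairs
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = {
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        }
        if len(kept) == len(current):  # nothing was removed in this round
            return kept
        current = kept


def random_hold_out(interactions, test_ratio, val_ratio, seed):
    """Shuffle with a fixed seed, then slice into test, validation, and train parts."""
    rows = sorted(interactions)  # fixed starting order, so the seed determines the split
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    n_val = int(len(rows) * val_ratio)
    return {
        "test": rows[:n_test],
        "val": rows[n_test:n_test + n_val],
        "train": rows[n_test + n_val:],
    }


def sparsity(interactions):
    """Return 1 minus the fraction of observed pairs among all user-item pairs."""
    pairs = set(interactions)
    n_users = len({user for user, _ in pairs})
    n_items = len({item for _, item in pairs})
    if n_users == 0 or n_items == 0:
        return 0.0  # assumption: treat an empty dataset as having no sparsity
    return 1.0 - len(pairs) / (n_users * n_items)


# Hypothetical toy data: (user, item) pairs, with one duplicated interaction.
toy = [
    ("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"),
    ("u3", "i1"), ("u3", "i3"), ("u1", "i1"),
]

core = iterative_k_core(toy, k=2)
splits = random_hold_out(core, test_ratio=0.25, val_ratio=0.25, seed=42)
toy_sparsity = sparsity(toy)
```

In this toy run, removing the rarely seen item `i3` leaves user `u3` with a single interaction, so a second pass removes `u3` as well. That cascade is what makes the k-core filter iterative. Sorting before the seeded shuffle keeps the split identical across runs regardless of the input order.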
git clone https://github.com/sisinflab/DataRec.git
cd DataRec
python3.9 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# editable mode + optional dependency groups (defined in pyproject.toml)
pip install -e '.[dev,docs]'

from datarec.datasets import AmazonOffice
from datarec.processing import FilterOutDuplicatedInteractions, UserItemIterativeKCore
from datarec.splitters import RandomHoldOut
# 1️⃣ Load a reference dataset
data = AmazonOffice(version='2014').prepare_and_load()
# 2️⃣ Apply preprocessing filters
data = FilterOutDuplicatedInteractions().run(data)
data = UserItemIterativeKCore(cores=5).run(data)
# 3️⃣ Split into train/validation/test
splitter = RandomHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
splits = splitter.run(data)
train, val, test = splits['train'], splits['val'], splits['test']

The complete and up-to-date list of datasets (with metadata and statistics) is available in the documentation:
Full documentation available at: https://sisinflab.github.io/DataRec/
Includes API reference, guides, tutorials, and dataset overview.
Contributions are welcome!
To contribute:
- Create a feature/fix branch.
- Add tests and documentation updates as needed.
- Run tests before pushing.
- Open a pull request describing your changes clearly.
The project also receives updates from a private development repository maintained by SisInfLab.
If you use DataRec in your research, please cite our SIGIR 2025 paper:
@inproceedings{DBLP:conf/sigir/MancinoBF0MPN25,
author = {Alberto Carlo Maria Mancino and
Salvatore Bufi and
Angela Di Fazio and
Antonio Ferrara and
Daniele Malitesta and
Claudio Pomo and
Tommaso Di Noia},
title = {DataRec: {A} Python Library for Standardized and Reproducible Data
Management in Recommender Systems},
booktitle = {{SIGIR}},
pages = {3478--3487},
publisher = {{ACM}},
year = {2025}
}

Authors
- Alberto Carlo Maria Mancino (Politecnico di Bari)
- Salvatore Bufi
- Angela Di Fazio
- Daniele Malitesta
- Antonio Ferrara
- Claudio Pomo
- Tommaso Di Noia
Contributors
- Alberto C. M. Mancino
- Angela Di Fazio
- Salvatore Bufi
- Giuseppe Fasano
- Gianluca Colonna
- Maria L. N. De Bonis
- Marco Valentini
- Ducho — library for multimodal representation learning: https://github.com/sisinflab/Ducho
- D&D4Rec Tutorial (RecSys 2025) — Standard Practices for Data Processing and Multimodal Feature Extraction in Recommendation with DataRec and Ducho:
https://sites.google.com/view/dd4rec-tutorial/home
Distributed under the MIT License.
See LICENSE.
Maintained with ❤️ by SisInfLab


