This repository contains the code for the paper "Leveraging neural network interatomic potentials for a foundation model of chemistry" by Kim et al. (2025).
📄 Paper: arXiv:2506.18497
This codebase implements a framework for using neural network interatomic potentials (NNIPs) as foundation models for chemistry. The repository provides tools for:
- Structure relaxation: Using pretrained NNIPs (ORB, EquiformerV2, MACE) to relax molecular and crystal structures
- Feature extraction: Extracting graph-level features from relaxed structures
- Property prediction: Training downstream ML models for various chemical properties
- Benchmark evaluation: Standardized evaluation on multiple datasets including MoleculeNet, Materials Project, and Matbench
- Support for multiple datasets:
  - MoleculeNet: BACE, BBBP, ClinTox, ESOL, FreeSolv, HIV, Lipophilicity, SIDER, Tox21
  - Materials Project: Band gap prediction, trajectory analysis (MPtrj)
  - Amorphous materials: Diffusivity prediction
  - Matbench: 8 standardized materials science benchmarks
- Three NNIP backends:
  - ORB (Orbital Materials)
  - EquiformerV2
  - MACE (Machine Learning of Atomic Cluster Expansion)
- Automated preprocessing and evaluation pipelines
- Parallel processing support for large-scale experiments
HackNIP/
├── src/
│ ├── preprocessing_relaxation_*.py # Data preprocessing scripts
│ ├── train_eval_*.py # Model training and evaluation
│ ├── run_preprocessing.py # Batch preprocessing runner
│ ├── run_evaluation.py # Batch evaluation runner
│ ├── utils.py # Utility functions
│ ├── matbench/ # Matbench benchmark scripts
│ │ ├── 1_retrieve_data.py # Download Matbench datasets
│ │ ├── 2_build_sc.py # Build supercells
│ │ ├── 3_featurize_orb2.py # Extract ORB features
│ │ ├── 4_construct_pkl.py # Create pickle files
│ │ ├── 5_train_modnet.py # Train MODNet models
│ │ ├── 6_opt_hp_modnet.py # Hyperparameter optimization
│ │ └── 7_get_parity_data.py # Generate parity plots
│ ├── pnas_ce.ipynb # PNAS analysis notebook
│ └── visualization.ipynb # Visualization tools
└── README.md
conda create -n hacknip python=3.9
conda activate hacknip
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html
pip install torch_geometric pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
python -m pip install lightning
pip install hydra-core joblib wandb matplotlib scikit-learn python-dotenv jarvis-tools pymatgen ase rdkit tqdm transformers datasets diffusers fairchem-core
pip install orb-models
pip install "pynanoflann@git+https://github.com/dwastberg/pynanoflann#egg=af434039ae14bedcbb838a7808924d6689274168"
pip3 install auto-sklearn
pip install git+https://github.com/DavidWalz/diversipy.git
pip install matbench-discovery
pip install matbench
pip install openpyxl
conda create -n matbench python=3.11
conda activate matbench
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
pip install "pynanoflann@git+https://github.com/dwastberg/pynanoflann#egg=af434039ae14bedcbb838a7808924d6689274168"
pip install matbench-discovery typer
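After installing, it can help to confirm that the main packages are importable in the active environment. The following is a minimal sketch, not part of the repository; the module names in `REQUIRED_MODULES` are assumptions about each package's import name (for example, `orb-models` is assumed to import as `orb_models` and `fairchem-core` as `fairchem`).

```python
from importlib.util import find_spec

# Assumed import names; a package's import name can differ from its pip name
# (for example, scikit-learn is imported as `sklearn`).
REQUIRED_MODULES = [
    "torch", "torch_geometric", "ase", "pymatgen", "rdkit",
    "sklearn", "lightning", "orb_models", "fairchem",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in names:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # Raised for some broken or partially installed packages.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only locates modules without importing them, so it stays fast and does not initialize CUDA.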
The preprocessing scripts convert various data sources into a unified .pkl format containing both unrelaxed and NNIP-relaxed structures.
# Band gap prediction (Materials Project)
python src/preprocessing_relaxation_bandgap.py \
--device cuda:0 \
--data_path MP \
--property_cols '["Eg(eV)"]'
# Molecular properties (MoleculeNet)
python src/preprocessing_relaxation_moleculenet.py \
--device cuda:0 \
--data_path /path/to/ESOL_dataset.csv \
--property_cols '["measured log solubility in mols per litre"]'
# Diffusivity prediction (amorphous materials)
python src/preprocessing_relaxation_diffusivity.py \
--device cuda:0 \
--data_path /path/to/diffusivity_data.parquet \
--property_cols '["diffusivity"]'
# Materials Project trajectories
python src/preprocessing_relaxation_mptrj.py \
--device cuda:0 \
--data_path /path/to/mptrj_data \
--property_cols '["energy_per_atom"]'
For processing multiple datasets in parallel:
python src/run_preprocessing.py
Edit the script to customize DEVICES, DATA_FILES, and PROPERTY_COLS lists.
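To illustrate how three lists like these can map onto the single-dataset commands shown earlier, here is a hedged sketch. It is not the code in `run_preprocessing.py`: the function `build_commands`, the round-robin device assignment, and the placeholder paths and column names are all assumptions made for illustration.

```python
import json
from itertools import cycle


def build_commands(script, devices, data_files, property_cols):
    """Build one argv list per dataset, assigning devices round-robin."""
    if len(data_files) != len(property_cols):
        raise ValueError("data_files and property_cols must have the same length")
    commands = []
    for device, data_path, cols in zip(cycle(devices), data_files, property_cols):
        commands.append([
            "python", script,
            "--device", device,
            "--data_path", data_path,
            # The CLI examples pass property columns as a JSON list string.
            "--property_cols", json.dumps(cols),
        ])
    return commands


# Hypothetical configuration; replace with real paths and column names.
DEVICES = ["cuda:0", "cuda:1"]
DATA_FILES = ["/path/to/dataset_a.csv", "/path/to/dataset_b.csv", "/path/to/dataset_c.csv"]
PROPERTY_COLS = [["target_a"], ["target_b"], ["target_c"]]

for argv in build_commands(
    "src/preprocessing_relaxation_moleculenet.py", DEVICES, DATA_FILES, PROPERTY_COLS
):
    print(" ".join(argv))
```

Each argv list could then be passed to `subprocess.Popen` to run the jobs concurrently, one process per dataset.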
Train downstream ML models using features extracted from NNIP-relaxed structures.
python src/train_eval_orb.py \
--device cuda:0 \
--data_path preprocessed_data/ESOL_dataset_relaxed.pkl \
--task_type regression \
--split_type random
python src/train_eval_eqV2.py \
--device cuda:0 \
--data_path preprocessed_data/BACE_dataset_relaxed.pkl \
--task_type classification \
--split_type scaffold
python src/train_eval_mace.py \
--device cuda:0 \
--data_path preprocessed_data/bandgap_relaxed.pkl \
--task_type regression \
--split_type random
python src/run_evaluation.py
Follow the sequential pipeline for Matbench evaluation:
cd src/matbench
# Step 1: Retrieve datasets from Matbench
python 1_retrieve_data.py
# Step 2: Build supercells for materials
python 2_build_sc.py
# Step 3: Extract ORB features from structures
python 3_featurize_orb2.py
# Step 4: Construct pickle files for training
python 4_construct_pkl.py
# Step 5: Train MODNet models
python 5_train_modnet.py
# Step 6: Hyperparameter optimization
python 6_opt_hp_modnet.py
# Step 7: Generate parity plots and analysis
python 7_get_parity_data.py
Regression tasks:
- ESOL: Water solubility prediction
- FreeSolv: Solvation free energy
- Lipophilicity: Octanol/water partition coefficient
- Band gap: Electronic band gap of materials
- Diffusivity: Ion diffusion in amorphous materials
- Matbench regression: Formation energy, elastic moduli, phonon properties
Classification tasks:
- BACE: β-secretase inhibition
- BBBP: Blood-brain barrier permeability
- ClinTox: Clinical trial toxicity
- HIV: HIV inhibition
- Tox21: Nuclear receptor signaling toxicity
- SIDER: Side effect prediction (27 targets)
All preprocessing scripts generate .pkl files containing:
- X: List of unrelaxed ASE atoms (JSON-encoded)
- XR: List of NNIP-relaxed ASE atoms (JSON-encoded)
- Y: Dictionary of property values keyed by property name
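A quick sanity check on a preprocessed file can catch mismatched lengths before training. The sketch below assumes the pickle holds a dictionary with the keys `X`, `XR`, and `Y`, and that each entry of `Y` has one value per structure; the helper `load_preprocessed` and the toy records are hypothetical, and the toy JSON strings are stand-ins rather than the real ASE encoding.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_preprocessed(path):
    """Load a preprocessed .pkl and check that X, XR, and Y are aligned."""
    with open(path, "rb") as fh:
        data = pickle.load(fh)  # only unpickle files you trust
    for key in ("X", "XR", "Y"):
        if key not in data:
            raise KeyError(f"missing key {key!r}")
    n = len(data["X"])
    if len(data["XR"]) != n:
        raise ValueError("X and XR have different lengths")
    for name, values in data["Y"].items():
        if len(values) != n:
            raise ValueError(f"property {name!r} has {len(values)} values, expected {n}")
    return data


# Toy file for illustration only; values are made up.
toy = {
    "X": [json.dumps({"symbols": "H2"}), json.dumps({"symbols": "O2"})],
    "XR": [json.dumps({"symbols": "H2"}), json.dumps({"symbols": "O2"})],
    "Y": {"energy_per_atom": [-1.0, -2.0]},
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_relaxed.pkl"
    toy_path.write_bytes(pickle.dumps(toy))
    loaded = load_preprocessed(toy_path)

print(len(loaded["X"]), list(loaded["Y"]))
```

If the structures were serialized with ASE's JSON utilities, each string could presumably be turned back into an `Atoms` object with the matching decoder, but the exact encoding is not specified here.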
Training scripts write results to the results/ directory, including performance metrics (MAE, R², ROC-AUC, Accuracy).
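For reference, the reported metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the repository's evaluation code, which may rely on library implementations; the function names are chosen here for illustration.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is O(n^2) and meant for clarity.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(r2([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))             # 0.5
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))     # 0.75
```

MAE and R² apply to the regression tasks, while ROC-AUC and accuracy apply to the classification tasks.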
- Materials Project Contribs page
- CSV data download
- CSV with crystalline structures
- Reference: The ab initio amorphous materials database: Empowering machine learning to decode diffusivity
- Query and download contributed data
- Band gap data from Materials Project database
- MPtrj trajectory datasets
Standard molecular property prediction benchmarks available through:
- Matbench package
- 8 standardized materials property prediction tasks
- Automated data loading via the matbench Python package
If you use this code in your research, please cite:
@article{kim2025leveraging,
title={Leveraging neural network interatomic potentials for a foundation model of chemistry},
author={Kim et al.},
year={2025},
eprint={2506.18497},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Paper: https://arxiv.org/abs/2506.18497
- Neural Network Potentials:
  - orb-models: ORB pretrained foundation model
  - fairchem-core: EquiformerV2 implementation
  - MACE models
- Structure manipulation:
  - ase: Atomic Simulation Environment
  - pymatgen: Materials analysis
  - rdkit: Molecular informatics
- Machine Learning:
  - torch, torch_geometric: Deep learning
  - sklearn: Traditional ML models
  - lightning: Training framework
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact the corresponding author of the paper.