This package provides several molecular datasets for use in a machine learning (ML) context. Each dataset consists of randomly selected molecules from an original collection of xyz files, augmented with tensor quantities obtained from DFT calculations.
For a description of each dataset, see quantum-machine.org. Attention: the basis set file used for 6-31G(2df,p) was modified.
- Qm9: 134k small organic molecules composed of C, H, O, N, and F. Molecules are uncharged and closed-shell. The train/val/test sets are roughly stratified with respect to molecular size. Calculated at the B3LYP/6-31G(2df,p) level.
- Qm9Isomeres: 6k constitutional isomers of C7H10O2 taken from Qm9. Calculated at the B3LYP/6-31G(2df,p) level.
- Qm9IsomeresMd: Molecular dynamics trajectories at 500 K of 113 molecules randomly selected from Qm9Isomeres. Calculated at the B3LYP/6-31G(2df,p) level.
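
The Qm9 entry above mentions splits that are roughly stratified by molecular size. How this package builds them is not described here; the following is only a minimal sketch of one way to get such a split, assuming each sample is identified by a key and its atom count is known. The function `stratified_split` and its parameters are hypothetical and not part of scf_guess_datasets.

```python
import random
from collections import defaultdict


def stratified_split(sizes, val=0.1, test=0.1, seed=0):
    """Split sample keys into train/val/test, stratified by molecular size.

    `sizes` maps a sample key to its atom count. All names are illustrative.
    """
    rng = random.Random(seed)

    # Group keys by atom count so that every size is split on its own.
    groups = defaultdict(list)
    for key, n_atoms in sizes.items():
        groups[n_atoms].append(key)

    train, val_keys, test_keys = [], [], []
    for n_atoms in sorted(groups):
        keys = sorted(groups[n_atoms])
        rng.shuffle(keys)
        n_val = round(len(keys) * val)
        n_test = round(len(keys) * test)
        val_keys += keys[:n_val]
        test_keys += keys[n_val:n_val + n_test]
        train += keys[n_val + n_test:]
    return train, val_keys, test_keys


# Toy input: 100 molecules with 5 to 9 atoms, 20 of each size.
sizes = {i: 5 + i % 5 for i in range(100)}
train, val_keys, test_keys = stratified_split(sizes)
print(len(train), len(val_keys), len(test_keys))
```

Splitting each size group separately keeps the size distribution of the three sets close to that of the full collection, at the cost of rounding effects in very small groups.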
- Clone this repository to scf_guess_datasets
- Invoke cd scf_guess_datasets && pip install -e .
from scf_guess_datasets import Qm9Isomeres

dataset = Qm9Isomeres(
    "/home/bob/datasets",  # data stored in /home/bob/datasets/qm9_isomeres
    size=10,  # number of molecules (optional, just for testing)
    val=0.1,  # fraction of validation samples (optional, just for testing)
    test=0.1,  # fraction of test samples (optional, just for testing)
)
dataset.build()  # just once, omit if /home/bob/datasets/qm9_isomeres exists

for key in dataset.train_keys:  # same for val_keys or test_keys
    sample = dataset.solution(key)  # DFT result for that molecule
    print(sample.overlap)  # NDArray from PySCF
    print(sample.hcore)  # NDArray from PySCF
    print(sample.density)  # NDArray from PySCF
    print(sample.fock)  # NDArray from PySCF
    print(sample.status)  # Status(converged=True, iterations=11)

    for scheme, sample in dataset.guesses(key).items():
        # sample has the same structure as returned by dataset.solution
        # matrices correspond to the initial guess
        # status describes the calculation starting from that guess
        print(scheme, sample.status)

# Let's score some custom-made density matrix for a given molecule
from scf_guess_datasets import solve
import numpy as np
solver = dataset.solver(3) # obtain a new solver for molecule 3
guess = np.ones_like(solver.get_ovlp())
overlap, hcore, density, fock, status = solve(solver, guess)
print(density) # the converged density
print(status) # Status(converged=True, iterations=19)

Each dataset provided by this package implements the scf_guess_datasets.Dataset
interface. Each implementation lives in its own subpackage, which contains an
xyz directory and, optionally, a basis.gbs basis set file.
To add a new dataset, create a new subpackage for it and adapt it to your
needs, e.g. by specifying a custom basis or functional.
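
Since each dataset subpackage is built around a directory of xyz files, it can help to check that new files parse before adding them. The following is a minimal sketch, assuming the plain XYZ layout (atom count, comment line, then one "element x y z" row per atom). `parse_xyz` is a hypothetical helper and not part of scf_guess_datasets; files with extra lines or columns would need additional handling.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> List[Atom]:
    """Parse plain XYZ text: atom count, comment line, then 'El x y z' rows."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    atoms = []
    # Skip the count and comment lines, then read exactly n_atoms rows.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return atoms


# Water, coordinates in angstrom (illustrative geometry).
water = """3
water
O  0.000000  0.000000  0.117300
H  0.000000  0.757200 -0.469200
H  0.000000 -0.757200 -0.469200
"""

atoms = parse_xyz(water)
print(atoms[0])
```

The check on the atom count catches truncated files early, before any expensive calculation is started on them.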