All experiments can be run in a Docker container. Requirements:

- Docker
- GPU/CUDA environment (for training)

Dependencies are installed automatically while building the Docker image.
```sh
# on host
git clone https://github.com/OnizukaLab/Scardina.git
cd Scardina
docker build -t scardina .
docker run --rm --gpus all -v `pwd`:/workspaces/scardina -it scardina bash

# in container
poetry shell

# in poetry env in container
./scripts/dowload_imdb.sh
```

Choose between hyperparameter search with Optuna and manually specified parameters.
```sh
# train w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4
```
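The Transformer model is selected with `-t=trm` and trained the same way. The command below is only a sketch using the Transformer-specific options listed further down; all parameter values are illustrative, not tuned.

```sh
# hypothetical example: train the Transformer model w/o hyperparameter search
# (all parameter values are illustrative)
python scardina/run.py --train -d=imdb -t=trm -e=20 --d-word=64 --d-ff=256 --n-blocks=4 --n-heads=4 --warmups=2
```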
```sh
# evaluation
# Note: When default (-s=cin), model path should be like:
# "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt".
# "{}" is literally "{}", a placeholder string to specify multiple models
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt}
```

You can find results in `results/<benchmark_name>` after the trial.
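As a concrete illustration of the note above, an evaluation run with the default `-s=cin` might pass a model path like the following. The timestamp directory `20230101120000` is hypothetical and should be replaced with your own training run, while `{}` stays literal:

```sh
# hypothetical example: evaluate a cin model; the timestamp is a placeholder for your own run
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp \
    -m="models/imdb/mlp-cin/20230101120000/nar-mlp-imdb-{}-20230101120000.pt"
```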
Common options:

- `-d`/`--dataset`: Dataset name
- `-t`/`--model-type`: Internal model type (`mlp` for MLP or `trm` for Transformer)
- `-s`/`--schema-strategy`: Internal subschema type (`cin` for Closed In-neighborhood Partitioning (Scardina) or `ur` for Universal Relation; see the example after this list)
- `--seed`: Random seed (Default: `1234`)
- `--n-blocks`: The number of blocks (for Transformer)
- `--n-heads`: The number of heads (for Transformer)
- `--d-word`: Embedding dimension
- `--d-ff`: Width of feedforward networks
- `--n-ff`: The number of feedforward networks (for MLP)
- `--fact-threshold`: Column factorization threshold (Default: `2000`)

Training options:

- `-e`/`--epochs`: The number of training epochs
- `--batch-size`: Batch size (Default: `1024`)

(w/ hyperparameter search)

- `--n-trials`: The number of trials for hyperparameter search

(w/ specified parameters)

- `--lr`: Learning rate
- `--warmups`: Warm-up epochs (for Transformer) (`lr` and `warmups` are exclusive)

Evaluation options:

- `-m`/`--model`: Path to model
- `-b`/`--benchmark`: Benchmark name
- `--eval-sample-size`: Sample size for evaluation
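For example, the schema-strategy option can be combined with the flags above to train a Universal Relation baseline instead of the default CIN partitioning. This is only a sketch; the remaining values are illustrative:

```sh
# hypothetical example: universal-relation (ur) baseline w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -s=ur --n-trials=10 -e=20
```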
- Datasets
  - IMDb
    - `imdb`: (almost) All data of IMDb
    - `imdb-job-light`: Subset of IMDb for the JOB-light benchmark (see the example after this list)
- Benchmarks
  - IMDb
- Models
  - `mlp`: MLP-based denoising autoencoder
  - `trm`: Transformer-based denoising autoencoder
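For instance, to train on the JOB-light subset instead of the full IMDb data, only the dataset name changes; the other values below are illustrative:

```sh
# hypothetical example: train on the JOB-light subset of IMDb
python scardina/run.py --train -d=imdb-job-light -t=mlp --n-trials=10 -e=20
```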
```bibtex
@article{scardina,
  author  = {Ito, Ryuichi and Sasaki, Yuya and Xiao, Chuan and Onizuka, Makoto},
  title   = {{Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators}},
  journal = {{arXiv preprint arXiv:2303.18042}},
  year    = {2023}
}
```