Overview:
.
├── evaluation/
│   ├── benchmark/                  # benchmark set of 16 probabilistic models
│   │   └── generated/              # sub-programs generated by the generate_factorisation.py script
│   ├── custom/                     # implementation of custom factorisations (GMM + LDA)
│   ├── finite/                     # implementation of the fixed-finite factorisation
│   ├── unrolled/                   # programs whose while loop can be syntactically unrolled
│   │   └── generated/              # sub-programs generated by the generate_factorisation.py script
│   ├── models/                     # your models go here
│   ├── lmh_standard.jl             # standard implementation of LMH
│   ├── lmh_factorised.jl           # implementation of LMH that leverages factorisation into sub-programs
│   ├── bench_lmh.jl                # script to run the LMH benchmark for a model
│   ├── bench_smc.jl                # script to run the SMC benchmark for a model
│   ├── bench_vi.jl                 # script to run the VI benchmark for a model
│   ├── paper_lmh_results.csv       # LMH runtime results reported in the paper
│   ├── paper_smc_results.csv       # SMC runtime results reported in the paper
│   ├── paper_vi_results.csv        # VI gradient variance results reported in the paper
│   ├── ppl.jl                      # implementation of the research PPL
│   ├── smc.jl                      # implementation of SMC (naive and iterative versions)
│   └── vi.jl                       # implementation of variational inference (standard and variance-reduced versions)
│
├── compare/
│   ├── gen/                        # benchmark set of 16 probabilistic models re-implemented in Gen.jl
│   │   ├── run_lmh.py              # script to run combinator inference for every model
│   │   ├── paper_gen_results.csv   # Gen runtime results reported in the paper
│   │   └── geometric_recurse.jl    # demonstrates why we could not evaluate the Recurse combinator
│   ├── pyro_bbvi/                  # benchmark set re-implemented in Pyro
│   │   ├── bbvi.py                 # black-box variational inference implementation in Pyro
│   │   ├── run_vi.py               # script to run BBVI for every model
│   │   ├── paper_vi_results.csv    # VI gradient variance results reported in the paper
│   │   └── geometric.jl            # demonstrates why we cannot reduce variance with control dependencies in Pyro
│   └── webppl/                     # benchmark set of 16 probabilistic models re-implemented in WebPPL
│       ├── smc/                    # JavaScript implementation of naive SMC
│       ├── run_lmh.py              # script to run C3 inference for every model
│       ├── run_smc.py              # script to run SMC inference for every model
│       ├── paper_lmh_results.csv   # WebPPL C3 runtime results reported in the paper
│       └── paper_smc_results.csv   # WebPPL SMC runtime results reported in the paper
│
├── src/
│   ├── formal/
│   │   ├── ast.jl                  # obtains the abstract syntax tree of a Julia program
│   │   ├── factorisation_builder.py # uses model_graph.py to write sub-programs for each factor of a probabilistic program
│   │   └── formal_cfg.py           # computes the CFG of a probabilistic program
│   └── static/
│       ├── cfg.py                  # types for building the CFG
│       ├── ir.py                   # intermediate representation of a probabilistic program
│       ├── model_graph.py          # computes dependencies between sample statements as a graph
│       └── rd_bp.py                # functionality to compute reaching definitions and branch parents
│
├── run_all.sh                      # script to run all benchmarks
├── sanity_check.py                 # script to check whether the artifact installation was successful
├── sanity_check_output.txt         # log of the output of sanity_check.py
├── generate_factorisation.py       # script to generate sub-programs for each factor of every benchmark model
├── run_lmh_benchmark.py            # script to run bench_lmh.jl for every benchmark model
├── run_smc_benchmark.py            # script to run bench_smc.jl for every benchmark model
├── run_vi_benchmark.py             # script to run bench_vi.jl for every benchmark model
├── generate_factorisation_for_model.py # script to generate sub-programs for your model
└── run_benchmark_for_model.py      # script to run benchmarks for your model
No special hardware is needed for installation.
Recommendations:
- Hardware: >= 3.5 GHz dual-core CPU, >= 8 GB RAM, and >= 10 GB storage
- OS: tested on macOS (Sequoia 15.2, M2 Pro), 64-bit Linux (Fedora 43, 5950X), and 64-bit Windows 11 (5950X)

Installation with Docker:
Install Docker.
You can download the Docker image provided on Zenodo and load it with
docker load -i pplstaticfactor-amd64.tar
or
docker load -i pplstaticfactor-arm64.tar
depending on your system. The images were saved (with Docker version 28.3.0) using
docker buildx build --platform linux/amd64 -t pplstaticfactor-amd64 .
docker buildx build --platform linux/arm64 -t pplstaticfactor-arm64 .
docker image save pplstaticfactor-amd64 > pplstaticfactor-amd64.tar
docker image save pplstaticfactor-arm64 > pplstaticfactor-arm64.tar
Run those images with
docker run -it --name pplstaticfactor-amd64 --rm pplstaticfactor-amd64
or
docker run -it --name pplstaticfactor-arm64 --rm pplstaticfactor-arm64
Alternatively, build the pplstaticfactor image from scratch (this may take several minutes):
docker build -t pplstaticfactor .
If the build was successful, run the docker image:
docker run -it --name pplstaticfactor --rm pplstaticfactor
The artifact uses the following software versions:

Python 3.10.12 with requirements.txt:
    sexpdata==1.0.2
    pandas==2.2.3
    numpy==2.2.4
    pyro-ppl==1.9.1
    torch==2.7.1
    tqdm==4.67.1

Julia 1.9.2 with Project.toml:
    Distributions = "0.25.112"
    JuliaSyntax = "0.4.10"
    Gen = "0.4.7"

Node.js 24.13
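If you set up the environment manually instead of using Docker, the dependencies can usually be installed from the listed manifests; it is assumed here that requirements.txt and Project.toml sit at the repository root:

pip install -r requirements.txt
julia --project=. -e 'using Pkg; Pkg.instantiate()'

(The Docker image is intended to ship with these dependencies preinstalled, so this step is only needed for a manual setup.)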
To test if your installation was successful, run python3 sanity_check.py and compare output against sanity_check_output.txt.
Time measurements will differ, but you should not see any errors.
This concludes the instructions for the kick-the-tires phase.
For convenience, you may execute bash run_all.sh to run all experiments needed for reproducing the empirical evaluation.
This will take ~12 hours to complete on systems comparable to an M2 Pro.
In the next sections, we describe each of the experiments in detail.
(Re-) Generate the sub-programs:
python3 generate_factorisation.py
This produces factorised versions of the probabilistic programs stored in evaluation/benchmark and writes them to evaluation/benchmark/generated. (It also factorises the unrolled HMM version evaluation/unrolled/hmm.jl into evaluation/unrolled/generated/hmm.jl.)
Expected output: Same programs as currently stored in evaluation/benchmark/generated.
If you would like to verify this, run
mv evaluation/benchmark/generated evaluation/benchmark/generated_existing
python3 generate_factorisation.py
diff -r evaluation/benchmark/generated evaluation/benchmark/generated_existing
This should only complain about a missing .gitkeep file, indicating that there is no difference between the existing and the newly generated Julia files.
To clean up, run
touch evaluation/benchmark/generated/.gitkeep
rm -rf evaluation/benchmark/generated_existing
Run LMH benchmarks N times (we set N = 10):
python3 run_lmh_benchmark.py N
The results are written to evaluation/lmh_results.csv and aggregated in evaluation/lmh_results_aggregated.csv.
The results reported in the paper can be found in evaluation/paper_lmh_results.csv (measured on an M2 Pro CPU).
This script runs bench_lmh.jl, which measures the runtime of the factorised and non-factorised versions of the LMH algorithm and asserts that all LMH implementations produce the same samples.
Expected output: Absolute runtime measurements may differ on your hardware, but the relative speed-up reported in columns rel_static, rel_finite and rel_custom should be similar.
Estimated runtime: 30min (on system comparable to M2 Pro)
To compare our approach against existing methods, run (we set N = 10):
python3 compare/gen/run_lmh.py N
and
python3 compare/webppl/run_lmh.py N
The results are written to compare/gen/lmh_results.csv and compare/webppl/lmh_results.csv and aggregated in compare/gen/lmh_results_aggregated.csv and compare/webppl/lmh_results_aggregated.csv, respectively.
The results reported in the paper can be found in compare/gen/paper_lmh_results.csv and compare/webppl/paper_lmh_results.csv (measured on an M2 Pro CPU).
Expected output: Again, the relative speed-ups reported in the rel column should be similar on your hardware.
Estimated runtime: 1h and 30min, respectively (on system comparable to M2 Pro)
python3 print_table_1.py will generate Table 1 of the manuscript from the paper_lmh_results.csv files.
python3 print_table_2.py will generate Table 2 of the manuscript from the paper_lmh_results.csv files.
Run BBVI benchmarks N times (we set N = 10):
python3 run_vi_benchmark.py N
The results are written to evaluation/vi_results.csv and aggregated in evaluation/vi_results_aggregated.csv.
The results reported in the paper can be found in evaluation/paper_vi_results.csv (measured on an M2 Pro CPU).
This script runs bench_vi.jl, which measures the gradient variance for the factorised and non-factorised versions of the VI algorithm.
Expected output: Runtime may differ, but the gradient variance reported in the none and static columns, as well as the relative improvement in rel_static should be similar to the reported results.
Estimated runtime: 1h30min (on system comparable to M2 Pro)
To compare our approach against existing methods, run (we set N = 1):
python3 compare/pyro_bbvi/run_vi.py N
The results are written to compare/pyro_bbvi/vi_results.csv and aggregated in compare/pyro_bbvi/vi_results_aggregated.csv.
The results reported in the paper can be found in compare/pyro_bbvi/paper_vi_results.csv (measured on an M2 Pro CPU).
Expected output: Again, the gradient variance and relative improvement (columns none, graph, rel_graph) should be similar to the paper results.
Estimated runtime: This may take a long time to complete, ~6-8 hours on a system comparable to an M2 Pro. You may lower N_ITER or L in the script for faster completion.
python3 print_table_3.py will generate Table 3 of the manuscript from the paper_vi_results.csv files.
Run SMC benchmarks N times (we set N = 10):
python3 run_smc_benchmark.py N
The results are written to evaluation/smc_results.csv and aggregated in evaluation/smc_results_aggregated.csv.
The results reported in the paper can be found in evaluation/paper_smc_results.csv (measured on an M2 Pro CPU).
This script runs bench_smc.jl, which measures the runtime for the naive and iterative versions of the SMC algorithm.
Expected output: Absolute runtime measurements may differ on your hardware, but the relative speed-up reported in column rel_static should be similar.
Estimated runtime: 30 min (on system comparable to M2 Pro)
To compare our approach against existing methods, run (we set N = 10):
python3 compare/webppl/run_smc.py N
The results are written to compare/webppl/smc_results.csv and aggregated in compare/webppl/smc_results_aggregated.csv.
The results reported in the paper can be found in compare/webppl/paper_smc_results.csv (measured on an M2 Pro CPU).
Expected output: Absolute runtime measurements may differ on your hardware, but the relative speed-up reported in column cps_rel should be similar.
Estimated runtime: 5min (on system comparable to M2 Pro)
python3 print_table_4.py will generate Table 4 of the manuscript from the paper_smc_results.csv files.
This artifact can be used to reproduce the empirical evaluation of Section 5.
For the interested reader, we briefly list where the sub-program generation described in Section 5 is implemented.
- src/formal/formal_cfg.py: implements a function that translates a Julia program (of the restricted grammar described in Section 2; see also below) into the control-flow graph (CFG) representation (src/static/cfg.py) of Sections 3.1 and 4.1.
- src/static/model_graph.py: implements a function that constructs a dependency graph based on Algorithm 1 (factorisation).
- src/formal/factorisation_builder.py: uses this dependency graph to generate the sub-programs.
The claims in this section are based on the data presented in Table 1 and Table 2.
Table 1 and Table 2 can be reproduced by following the instructions outlined in 2. LMH: Single-site Metropolis-Hastings.
The claims in this section are based on the data presented in Table 3.
Table 3 can be reproduced by following the instructions outlined in 3. BBVI: Black-box Variational Inference.
The claims in this section are based on the data presented in Table 4.
Table 4 can be reproduced by following the instructions outlined in 4. SMC: Sequential Monte Carlo.
The artifact may be reused to test the static factorisation approach on new models, i.e., to generate sub-programs and to run the benchmarks for the new models.
To do so, you may implement a new model in a single file in evaluation/models.
The static analysis only accepts programs written in Julia that follow the formal syntax described in the manuscript, with some minor adaptations, which we summarise in the following.
A probabilistic program is defined by a Julia function annotated with @model, with body statement S and arguments A1,...,An:
modelname = "..."
proposers = Dict{String, Distribution}(...)

@model function model_fn(ctx::SampleContext, A1, ..., An)
    S
end

In general, we allow arbitrary pure (side-effect-free) Julia expressions E in statements.
Further, you may define pure Julia helper functions outside of this model block to be used in expressions.
Also, container types like arrays may only be used in an immutable fashion.
You may also give your model a name with the global variable modelname and custom proposal distributions for LMH with the global proposers dictionary.
For reference, take a look at the models in evaluation/benchmark.
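For illustration only, a complete model file might look like the following. This is a hypothetical sketch that follows the statement syntax described below; the model name, addresses, and distributions are made up, and the actual conventions are best taken from the models in evaluation/benchmark.

modelname = "coin"
proposers = Dict{String, Distribution}("p" => Beta(2.0, 2.0))

@model function coin(ctx::SampleContext, flips::Vector{Int})
    # latent success probability with a custom LMH proposal defined in proposers
    p = sample(ctx, "p", Beta(1.0, 1.0))
    i::Int = 1
    while i <= length(flips)
        # observed coin flips, conditioned via the observed keyword
        sample(ctx, "flip_$i", Bernoulli(p), observed=flips[i])
        i = i + 1
    end
end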
Next, we describe the set of allowed statements S:
- Assignment, where x is a variable:
  x = E
  Importantly, the first use of a variable has to be type-annotated, e.g.
  x::Vector{Float64} = E
  and the type of the variable has to be static throughout the program.
- Sequence of two statements:
  S1
  S2
- If statements:
  if E
      S1
  end

  if E
      S1
  else
      S2
  end
- While statement:
  while E
      S
  end
end- sample statement, where
ctxis theSampleContext,xis a variable andf(E1,...En)is the constructor of a distribution
x = sample(ctx, E, f(E1,...En))We also support to inline observed values with expression C that does not depend on random variables.
x = sample(ctx, E, f(E1,...En), observed=C)You may also omit the assignment in this case:
sample(ctx, E, f(E1,...En), observed=C)Generate the sub-programs with
usage: generate_factorisation_for_model.py [-h] filename
positional arguments:
filename    name of your program in evaluation/models/, without the path but with the file extension
E.g. python3 generate_factorisation_for_model.py test.jl generates evaluation/models/generated/test.jl for evaluation/models/test.jl.
Run a benchmark with
usage: run_benchmark_for_model.py [-h] filename algorithm repetitions
positional arguments:
filename    name of your program in evaluation/models/, without the path but with the file extension
algorithm benchmark to run (lp|is|lmh|smc|vi)
repetitions number of experiment runs
E.g. python3 run_benchmark_for_model.py test.jl lmh 3.
You can configure the number of iterations for is and lmh in evaluation/N_iters.jl based on modelname.
The lmh option runs the benchmark described in 2. LMH: Single-site Metropolis-Hastings, the vi option runs the benchmark described in 3. BBVI: Black-box Variational Inference, and the smc option runs the benchmark described in 4. SMC: Sequential Monte Carlo.
The lp option benchmarks just the density function of the model and the is option benchmarks importance sampling.
Note that in order to compare the factorised SMC implementation to a naive implementation, you have to implement a data-annealed version of your model in evaluation/smc.jl. (This only makes sense if your model can make use of incrementally adding observed data points during SMC inference; for examples, see evaluation/smc.jl.)
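To illustrate the data-annealing idea only (this is a hypothetical sketch, not the actual interface used in evaluation/smc.jl), the t-th annealing step conditions the model on the first t observations, so SMC can add one observed data point per step:

@model function annealed_model(ctx::SampleContext, ys::Vector{Float64}, t::Int)
    # latent location parameter shared by all observations
    mu = sample(ctx, "mu", Normal(0.0, 10.0))
    i::Int = 1
    # condition only on the first t observations; SMC increases t from step to step
    while i <= t
        sample(ctx, "y_$i", Normal(mu, 1.0), observed=ys[i])
        i = i + 1
    end
end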