Hi!
Recently I started evaluating using Profile-Guided Optimization (PGO) for optimizing different kinds of software - all my current results are available in my GitHub repo. Since PGO helps with achieving better runtime efficiency in many cases, I decided to perform some PGO tests on Lace. I performed some benchmarks and want to share my results here.
Test environment
- Fedora 39
- Linux kernel 6.6.13
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.75
- Lace version: the latest for now from the master branch (commit 66e5a67688c76437a9ae5ec1bcadc4c1d0c7b604)
- Disabled Turbo Boost (for more stable results across benchmark runs)
Benchmark
For benchmarking purposes, I use two things:
- Built-in benchmarks
- Manual lace-cli invocations with manual time measurements
Built-in benchmarks are invoked with cargo bench --all-features --workspace. The PGO instrumentation phase for the benchmarks is done with cargo pgo bench -- --all-features --workspace, and the PGO optimization phase with cargo pgo optimize bench -- --all-features --workspace.
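For clarity, here is the whole built-in-benchmark sequence in one place. This is only a sketch of the steps described above; it assumes the cargo-pgo tool and the llvm-tools-preview rustup component are already installed:

# Instrumented benchmark run (collects .profraw profiles under target/pgo-profiles)
cargo pgo bench -- --all-features --workspace
# Rebuild and re-run the benchmarks with the collected profiles applied
cargo pgo optimize bench -- --all-features --workspace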
For lace-cli, the Release build is done with cargo build --release. The PGO instrumented build is done with cargo pgo build, and the PGO optimized build with cargo pgo optimize build. The PGO training phase is done with LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace (see the "Results" section for more details about using different training sets and their impact on the actual performance numbers).
For lace-cli I use taskset -c 0 to reduce the OS scheduler's impact on the results. The seed is fixed for the same reason.
All PGO steps are done with the cargo-pgo tool.
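Put together, the lace-cli workflow looks roughly like this. It is a sketch of the steps above, not an official recipe; the dataset path and the profile path are the ones from my setup and would differ on another machine:

# 1. Build the instrumented lace-cli binary
cargo pgo build
# 2. Training run: execute a representative workload so the instrumented binary writes .profraw profiles
LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
# 3. Build the final binary optimized with the collected profiles
cargo pgo optimize build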
Results
First, here are the results for the built-in benchmarks:
- Release: https://gist.github.com/zamazan4ik/d4bc743b2beb7e6f4bcf8c3c7fcab41b
- PGO optimized compared to Release: https://gist.github.com/zamazan4ik/27734cb744ce2cd57e12ad8eda95e318
- (just for reference) PGO instrumentation compared to Release (you can estimate the slowdown from the instrumentation phase): https://gist.github.com/zamazan4ik/446ac5486058cb3bb9a12c100a8c3e56
According to these benchmarks, PGO helps achieve better performance in many cases. However, as you can see, performance regresses in some cases. This is expected: the benchmarks cover different scenarios, and some scenarios can have "optimization conflicts", where the same optimization decision improves one scenario and regresses another. That's why using benchmarks for the PGO training phase can be risky. Even so, we see many improvements.
For a more real-life scenario, I also performed PGO benchmarks on lace-cli.
Release vs PGO optimized (trained on the satellites dataset) on the satellites dataset:
hyperfine --warmup 10 --min-runs 50 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Time (mean ± σ): 1.469 s ± 0.006 s [User: 1.386 s, System: 0.063 s]
Range (min … max): 1.464 s … 1.507 s 50 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Time (mean ± σ): 1.382 s ± 0.001 s [User: 1.299 s, System: 0.064 s]
Range (min … max): 1.380 s … 1.388 s 50 runs
Summary
taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace ran
1.06 ± 0.00 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Release vs PGO optimized (trained on the satellites dataset) on the animals dataset:
hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 682.7 ms ± 3.6 ms [User: 608.5 ms, System: 65.8 ms]
Range (min … max): 680.4 ms … 706.4 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 652.4 ms ± 2.9 ms [User: 579.8 ms, System: 64.3 ms]
Range (min … max): 648.2 ms … 672.5 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.05 ± 0.01 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Just for reference, here is the slowdown from PGO instrumentation:
hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 681.7 ms ± 0.7 ms [User: 608.1 ms, System: 65.8 ms]
Range (min … max): 681.0 ms … 683.1 ms 10 runs
Benchmark 2: taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 841.0 ms ± 4.7 ms [User: 754.1 ms, System: 77.3 ms]
Range (min … max): 835.2 ms … 853.1 ms 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.23 ± 0.01 times faster than taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
I also decided to test one more thing: how much does performance differ if we use different PGO training sets?
PGO optimized (trained on the satellites dataset) vs PGO optimized (trained on the animals dataset) on the animals dataset:
hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 653.0 ms ± 1.4 ms [User: 579.7 ms, System: 65.4 ms]
Range (min … max): 649.4 ms … 655.9 ms 100 runs
Benchmark 2: taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 622.7 ms ± 1.8 ms [User: 550.3 ms, System: 64.1 ms]
Range (min … max): 618.6 ms … 626.3 ms 100 runs
Summary
taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.05 ± 0.00 times faster than taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
As you can see, the difference is measurable (5% is a good improvement).
Summing up all the results above, I can say that PGO helps achieve better performance with Lace.
For anyone who cares about the binary size, I also did some measurements on lace-cli:
- Release: 28184240 bytes
- PGO optimized (animals dataset): 28085792 bytes
- PGO optimized (satellites dataset): 27785576 bytes
- PGO instrumented: 116176688 bytes
Possible further steps
I can suggest the following things to consider:
- Perform more PGO benchmarks on Lace. If they show improvements, add a note to the documentation about possible performance improvements with PGO (I guess somewhere in the README file will be enough).
- Providing an easier way (e.g. a build option or script) to build Lace with PGO could be helpful for end users and maintainers, since they would be able to optimize Lace for their own workloads (see the sketch after this list).
- Optimize pre-built binaries (if any)
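As an illustration of the second point, such a helper could be a small wrapper script along these lines. This is purely hypothetical: the script name, the argument handling and the paths are placeholders, not something that exists in the Lace repo today:

#!/usr/bin/env bash
# pgo_build.sh (hypothetical): build lace-cli optimized for a user-provided training workload
set -euo pipefail
TRAIN_CMD=${1:?"usage: pgo_build.sh '<command that runs the instrumented binary on a training workload>'"}
cargo pgo build            # 1. instrumented build
eval "$TRAIN_CMD"          # 2. training run fills target/pgo-profiles with .profraw files
cargo pgo optimize build   # 3. final optimized build using the collected profiles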
Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT as an addition to PGO), but I recommend starting with the usual LTO and PGO.
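For reference, cargo-pgo also has BOLT support, so a combined PGO + BOLT experiment could look roughly like the sketch below. I have not run this on Lace, it additionally requires llvm-bolt to be available, and the exact flags are an assumption based on the cargo-pgo documentation:

# Build a BOLT-instrumented binary on top of the PGO profiles
cargo pgo bolt build --with-pgo
# ... run the training workload with the produced binary ...
# Build the final binary optimized with both PGO and BOLT
cargo pgo bolt optimize --with-pgo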
Here are some examples of how PGO optimization is integrated into other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
I would be happy to answer any questions about PGO! Much more material about PGO (actual performance numbers across many other projects, the PGO state across the ecosystem, PGO traps, and tricky details) can be found in https://github.com/zamazan4ik/awesome-pgo