Hi!
Recently I started evaluating using Profile-Guided Optimization (PGO) for optimizing different kinds of software - all my current results are available in my GitHub repo. Since PGO helps with achieving better runtime efficiency in many cases, I decided to perform some PGO tests on Lace. I performed some benchmarks and want to share my results here.
Test environment
- Fedora 39
- Linux kernel 6.6.13
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.75
- Lace version: the latest for now from the master branch (commit 66e5a67688c76437a9ae5ec1bcadc4c1d0c7b604)
- Disabled Turbo Boost (for more stable results across benchmark runs)
Benchmark
For benchmarking purposes, I use two things:
- Built-in benchmarks
- Manual lace-cli invocations with manual time measurements
Built-in benchmarks are invoked with cargo bench --all-features --workspace. The PGO instrumentation phase for the benchmarks is done with cargo pgo bench -- --all-features --workspace, and the PGO optimization phase with cargo pgo optimize bench -- --all-features --workspace.
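For clarity, here is the whole built-in-benchmark sequence in one place. This is only a sketch of the steps described above; it assumes the cargo-pgo tool and the llvm-tools-preview rustup component are already installed:

# Instrumented benchmark run (collects .profraw profiles under target/pgo-profiles)
cargo pgo bench -- --all-features --workspace
# Rebuild and re-run the benchmarks with the collected profiles applied
cargo pgo optimize bench -- --all-features --workspace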
For lace-cli, the Release build is done with cargo build --release. The PGO instrumented build is done with cargo pgo build, and the PGO optimized build with cargo pgo optimize build. The PGO training phase is done with LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace (see the "Results" section for more details about using different training sets and their impact on the actual performance numbers).
For lace-cli I use taskset -c 0 to reduce the OS scheduler's impact on the results. The seed is fixed for the same reason.
All PGO steps are done with the cargo-pgo tool.
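Put together, the lace-cli workflow looks roughly like this. It is a sketch of the steps above, not an official recipe; the dataset path and the profile path are the ones from my setup and would differ on another machine:

# 1. Build the instrumented lace-cli binary
cargo pgo build
# 2. Training run: execute a representative workload so the instrumented binary writes .profraw profiles
LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
# 3. Build the final binary optimized with the collected profiles
cargo pgo optimize build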
Results
First, here are the results for the built-in benchmarks:
- Release: https://gist.github.com/zamazan4ik/d4bc743b2beb7e6f4bcf8c3c7fcab41b
- PGO optimized compared to Release: https://gist.github.com/zamazan4ik/27734cb744ce2cd57e12ad8eda95e318
- (just for reference) PGO instrumentation compared to Release (you can estimate the slowdown from the instrumentation phase): https://gist.github.com/zamazan4ik/446ac5486058cb3bb9a12c100a8c3e56
According to these benchmarks, PGO helps achieve better performance in many cases. However, as you can see, performance regresses in some cases. This is expected: the benchmarks cover different scenarios, and some scenarios can have "optimization conflicts", where the same optimization decision improves one scenario and regresses another. That's why using benchmarks for the PGO training phase can be risky. Even so, we see many improvements.
For a more real-life scenario, I also performed PGO benchmarks on lace-cli.
Release vs PGO optimized (trained on the satellites dataset) on the satellites dataset:
hyperfine --warmup 10 --min-runs 50 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Time (mean ± σ): 1.469 s ± 0.006 s [User: 1.386 s, System: 0.063 s]
Range (min … max): 1.464 s … 1.507 s 50 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Time (mean ± σ): 1.382 s ± 0.001 s [User: 1.299 s, System: 0.064 s]
Range (min … max): 1.380 s … 1.388 s 50 runs
Summary
taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace ran
1.06 ± 0.00 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
Release vs PGO optimized (trained on the satellites dataset) on the animals dataset:
hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 682.7 ms ± 3.6 ms [User: 608.5 ms, System: 65.8 ms]
Range (min … max): 680.4 ms … 706.4 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 652.4 ms ± 2.9 ms [User: 579.8 ms, System: 64.3 ms]
Range (min … max): 648.2 ms … 672.5 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.05 ± 0.01 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Just for reference, here is the slowdown from PGO instrumentation:
hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 681.7 ms ± 0.7 ms [User: 608.1 ms, System: 65.8 ms]
Range (min … max): 681.0 ms … 683.1 ms 10 runs
Benchmark 2: taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 841.0 ms ± 4.7 ms [User: 754.1 ms, System: 77.3 ms]
Range (min … max): 835.2 ms … 853.1 ms 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.23 ± 0.01 times faster than taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
I also decided to test one more thing: how much does performance differ if we use different PGO training sets?
PGO optimized (trained on the satellites dataset) vs PGO optimized (trained on the animals dataset) on the animals dataset:
hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 653.0 ms ± 1.4 ms [User: 579.7 ms, System: 65.4 ms]
Range (min … max): 649.4 ms … 655.9 ms 100 runs
Benchmark 2: taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
Time (mean ± σ): 622.7 ms ± 1.8 ms [User: 550.3 ms, System: 64.1 ms]
Range (min … max): 618.6 ms … 626.3 ms 100 runs
Summary
taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
1.05 ± 0.00 times faster than taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
As you can see, the difference is measurable (5% is a good improvement).
Summing up all the results above, I can say that PGO helps achieve better performance with Lace.
For anyone who cares about the binary size, I also did some measurements on lace-cli:
- Release: 28184240 bytes
- PGO optimized (animals dataset): 28085792 bytes
- PGO optimized (satellites dataset): 27785576 bytes
- PGO instrumented: 116176688 bytes
Possible further steps
I can suggest the following things to consider:
- Perform more PGO benchmarks on Lace. If they show improvements, add a note to the documentation about possible performance improvements with PGO (I guess somewhere in the README file will be enough).
- Providing an easier way (e.g. a build option or script) to build Lace with PGO could be helpful for end users and maintainers, since they would be able to optimize Lace for their own workloads (see the sketch after this list).
- Optimize pre-built binaries (if any)
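As an illustration of the second point, such a helper could be a small wrapper script along these lines. This is purely hypothetical: the script name, the argument handling and the paths are placeholders, not something that exists in the Lace repo today:

#!/usr/bin/env bash
# pgo_build.sh (hypothetical): build lace-cli optimized for a user-provided training workload
set -euo pipefail
TRAIN_CMD=${1:?"usage: pgo_build.sh '<command that runs the instrumented binary on a training workload>'"}
cargo pgo build            # 1. instrumented build
eval "$TRAIN_CMD"          # 2. training run fills target/pgo-profiles with .profraw files
cargo pgo optimize build   # 3. final optimized build using the collected profiles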
Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT as an addition to PGO), but I recommend starting with the usual LTO and PGO.
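For reference, cargo-pgo also has BOLT support, so a combined PGO + BOLT experiment could look roughly like the sketch below. I have not run this on Lace, it additionally requires llvm-bolt to be available, and the exact flags are an assumption based on the cargo-pgo documentation:

# Build a BOLT-instrumented binary on top of the PGO profiles
cargo pgo bolt build --with-pgo
# ... run the training workload with the produced binary ...
# Build the final binary optimized with both PGO and BOLT
cargo pgo bolt optimize --with-pgo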
Here are some examples of how PGO optimization is integrated into other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
I would be happy to answer any questions about PGO! Much more material about PGO (actual performance numbers across many other projects, the PGO state across the ecosystem, PGO traps, and tricky details) can be found in https://github.com/zamazan4ik/awesome-pgo