@okaneco okaneco commented Nov 5, 2025

Remove unsafe transmute in AVX2 YCbCr conversion

Prior to this, the conversion was scalarized and used a bswap.
Rewriting the code to avoid reversing the array resulted in worse
codegen that extracted the bytes and manually re-inserted them
back into the SIMD register to store 8 bytes at once.

This is stacked on top of #17.
Only the last commit is relevant.
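
The diff itself is not reproduced in this conversation, so the following is only a portable sketch of the general idea behind dropping a `transmute`: moving lane values into bytes with safe conversions instead of reinterpreting memory. The helpers `narrow_lanes` and `lanes_to_ne_bytes` are hypothetical names, plain arrays stand in for the AVX2 register, and none of this is claimed to match the code in the PR.

```rust
/// Hypothetical helper: narrow eight 32-bit lanes to eight bytes.
/// Out-of-range values are saturated rather than wrapped.
fn narrow_lanes(lanes: [i32; 8]) -> [u8; 8] {
    lanes.map(|v| v.clamp(0, 255) as u8)
}

/// Hypothetical helper: produce the raw native-endian bytes of the lanes.
/// This yields the same bytes a `transmute::<[i32; 8], [u8; 32]>` would,
/// but needs no `unsafe`.
fn lanes_to_ne_bytes(lanes: [i32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (dst, lane) in out.chunks_exact_mut(4).zip(lanes.iter()) {
        dst.copy_from_slice(&lane.to_ne_bytes());
    }
    out
}
```

Whether the compiler turns such safe code back into a single vector store depends on the surrounding code, which is exactly the kind of codegen difference the description above is about.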

@okaneco okaneco mentioned this pull request Nov 5, 2025
Shnatsel and others added 3 commits November 5, 2025 21:41
@okaneco okaneco force-pushed the remove_transmute_ycbcr branch from ae27b16 to 97c3575 Compare November 5, 2025 23:17
@Shnatsel Shnatsel left a comment

The tests comparing against the scalar implementation pass. The benchmarks show up to 4.5% improvement on Zen 4 and no regressions:

Benchmarks on Zen 4
encode rgb/encode rgb 100
                        time:   [56.018 ms 56.023 ms 56.028 ms]
                        change: [+0.0905% +0.1598% +0.2284%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking encode rgb/encode rgb 4x1: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 174.6s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb 4x1
                        time:   [34.628 ms 34.630 ms 34.631 ms]
                        change: [-1.1374% -1.0220% -0.9103%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
Benchmarking encode rgb/encode rgb progressive: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 175.3s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb progressive
                        time:   [34.762 ms 34.763 ms 34.765 ms]
                        change: [-1.9688% -1.7338% -1.4983%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
encode rgb/encode rgb optimized
                        time:   [116.74 ms 116.76 ms 116.78 ms]
                        change: [-2.6467% -2.5590% -2.4677%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
encode rgb/encode rgb optimized progressive
                        time:   [118.35 ms 118.38 ms 118.41 ms]
                        change: [-4.6270% -4.5451% -4.4616%] (p = 0.00 < 0.05)
                        Performance has improved.
encode rgb/encode rgb mixed
                        time:   [245.04 ms 245.09 ms 245.15 ms]
                        change: [-3.1994% -3.1084% -3.0202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high mild

     Running benches/fdct.rs (target/release/deps/fdct-3a322e62af936f7b)
fdct/default fdct       time:   [39.240 ns 39.271 ns 39.303 ns]
                        change: [-2.0367% -1.9016% -1.7679%] (p = 0.00 < 0.05)
                        Performance has improved.
fdct/fdct avx2          time:   [19.228 ns 19.229 ns 19.229 ns]
                        change: [-2.3674% -2.2783% -2.1903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

This may understate the improvements on other systems: Zen 4 tends to handle just about any instruction sequence well and does not benefit as much from SIMD as older Intel chips.
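
For readers unfamiliar with the kind of check mentioned at the top of this review, comparing an optimized conversion against a scalar one could look roughly like the sketch below. It is not the crate's test suite: `rgb_to_ycbcr_ref`, `rgb_to_ycbcr_fixed` and `max_abs_diff` are hypothetical names, the floating-point coefficients are the standard JFIF ones, and the 16-bit fixed-point routine merely stands in for an optimized path such as the AVX2 one.

```rust
/// Hypothetical scalar reference: JFIF RGB -> YCbCr in floating point.
fn rgb_to_ycbcr_ref(r: u8, g: u8, b: u8) -> [u8; 3] {
    let (r, g, b) = (r as f32, g as f32, b as f32);
    let y = 0.299 * r + 0.587 * g + 0.114 * b;
    let cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0;
    let cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0;
    [y, cb, cr].map(|v| v.round().clamp(0.0, 255.0) as u8)
}

/// Hypothetical stand-in for an optimized path: the same conversion with
/// coefficients scaled by 2^16 and rounded (an assumption, not the crate's constants).
fn rgb_to_ycbcr_fixed(r: u8, g: u8, b: u8) -> [u8; 3] {
    const SHIFT: i32 = 16;
    const HALF: i32 = 1 << (SHIFT - 1); // rounding term added before the shift
    let (r, g, b) = (r as i32, g as i32, b as i32);
    let y = (19595 * r + 38470 * g + 7471 * b + HALF) >> SHIFT;
    let cb = ((-11059 * r - 21709 * g + 32768 * b + HALF) >> SHIFT) + 128;
    let cr = ((32768 * r - 27439 * g - 5329 * b + HALF) >> SHIFT) + 128;
    [y, cb, cr].map(|v| v.clamp(0, 255) as u8)
}

/// Largest per-channel difference between the two routines over a grid
/// of RGB inputs sampled every `step` values (`step` must be non-zero).
fn max_abs_diff(step: usize) -> u8 {
    let mut worst = 0u8;
    for r in (0..=255u8).step_by(step) {
        for g in (0..=255u8).step_by(step) {
            for b in (0..=255u8).step_by(step) {
                let reference = rgb_to_ycbcr_ref(r, g, b);
                let fixed = rgb_to_ycbcr_fixed(r, g, b);
                for (x, y) in reference.iter().zip(fixed.iter()) {
                    worst = worst.max(x.abs_diff(*y));
                }
            }
        }
    }
    worst
}
```

A test would then assert that `max_abs_diff` stays within a small tolerance (the two routines here can differ by at most one level because of rounding), or demand exact equality when both paths use the same integer arithmetic.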
