Explicitly vectorize i32x8 to u8x8 conversion for storing into YCbCr buffers #22

okaneco · 2025-11-05T21:05:40Z

Remove unsafe transmute in AVX2 YCbCr conversion

Prior to this, the conversion was scalarized and used a bswap.
Rewriting the code to avoid reversing the array resulted in worse
codegen that extracted the bytes and manually re-inserted them
back into the SIMD register to store 8 bytes at once.

This is stacked on top of #17.
Only the last commit is relevant.
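For readers unfamiliar with the technique, here is a minimal sketch of what an explicitly vectorized i32x8 to u8x8 narrowing can look like with AVX2 intrinsics. It is not the code from this PR: the function names `low_bytes_scalar`, `low_bytes_avx2`, and `i32x8_to_u8x8` are hypothetical, and the sketch assumes every lane already holds a value in 0..=255, so that keeping only the low byte of each lane is the intended conversion.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: keep the low byte of each 32-bit lane.
fn low_bytes_scalar(lanes: [i32; 8]) -> [u8; 8] {
    lanes.map(|v| v as u8)
}

/// Hypothetical AVX2 version: a byte shuffle inside each 128-bit half,
/// then a cross-half permute, so all 8 bytes land in the low 64 bits.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn low_bytes_avx2(lanes: [i32; 8]) -> [u8; 8] {
    unsafe {
        let v = _mm256_loadu_si256(lanes.as_ptr() as *const __m256i);
        // Per 128-bit half: move byte 0 of each 32-bit lane into bytes 0..4.
        // An index of -1 has its high bit set, which zeroes that output byte.
        let byte_idx = _mm256_setr_epi8(
            0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        );
        let packed = _mm256_shuffle_epi8(v, byte_idx);
        // Bring 32-bit element 0 (low half) and element 4 (high half) together.
        let dword_idx = _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0);
        let gathered = _mm256_permutevar8x32_epi32(packed, dword_idx);
        // Read the low 64 bits; x86 is little-endian, so lane 0 comes first.
        let low64 = _mm_cvtsi128_si64(_mm256_castsi256_si128(gathered));
        low64.to_le_bytes()
    }
}

/// Dispatch to AVX2 when the CPU supports it, otherwise use the scalar path.
pub fn i32x8_to_u8x8(lanes: [i32; 8]) -> [u8; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { low_bytes_avx2(lanes) };
        }
    }
    low_bytes_scalar(lanes)
}
```

Calling `i32x8_to_u8x8` with eight in-range lanes should return the same eight bytes in the same order as `low_bytes_scalar`, with no byte swap and no transmute involved.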

Shnatsel

The tests comparing against the scalar implementation pass. The benchmarks show up to 4.5% improvement on Zen 4 and no regressions:

Benchmarks on Zen 4

encode rgb/encode rgb 100
                        time:   [56.018 ms 56.023 ms 56.028 ms]
                        change: [+0.0905% +0.1598% +0.2284%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking encode rgb/encode rgb 4x1: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 174.6s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb 4x1
                        time:   [34.628 ms 34.630 ms 34.631 ms]
                        change: [-1.1374% -1.0220% -0.9103%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
Benchmarking encode rgb/encode rgb progressive: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 175.3s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb progressive
                        time:   [34.762 ms 34.763 ms 34.765 ms]
                        change: [-1.9688% -1.7338% -1.4983%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
encode rgb/encode rgb optimized
                        time:   [116.74 ms 116.76 ms 116.78 ms]
                        change: [-2.6467% -2.5590% -2.4677%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
encode rgb/encode rgb optimized progressive
                        time:   [118.35 ms 118.38 ms 118.41 ms]
                        change: [-4.6270% -4.5451% -4.4616%] (p = 0.00 < 0.05)
                        Performance has improved.
encode rgb/encode rgb mixed
                        time:   [245.04 ms 245.09 ms 245.15 ms]
                        change: [-3.1994% -3.1084% -3.0202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) high mild

     Running benches/fdct.rs (target/release/deps/fdct-3a322e62af936f7b)
fdct/default fdct       time:   [39.240 ns 39.271 ns 39.303 ns]
                        change: [-2.0367% -1.9016% -1.7679%] (p = 0.00 < 0.05)
                        Performance has improved.
fdct/fdct avx2          time:   [19.228 ns 19.229 ns 19.229 ns]
                        change: [-2.3674% -2.2783% -2.1903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

This may understate the improvements on other systems: Zen 4 tends to handle just about any sequence of instructions well and benefits less from SIMD than older Intel chips do.

Shnatsel added 12 commits October 29, 2025 16:29

Bump MSRV to 1.86

92110c5

AVX ycbcr: safety: always access the slice in &self with bounds checks instead of raw pointers

333c101

Split the transmute into a helper function

af429c7

AVX ycbcr: safety: Always perform bounds checks on the ycbcr output buffers. Verify that there is enough capacity to hold the data before filling them to avoid buffer overflows. Treat them as MaybeUninit until they're filled to avoid exposing uninitialized memory.

04a0c27

AVX ycbcr: the function no longer needs to be unsafe, all unsafe operations are now wrapped in safe abstractions that verify preconditions

4a97518

refactor for slightly better codegen

5590004

cargo fmt

5642f24

expand safety comment

df6173b

Replace manual bookkeeping and unsafe{set_len()} with safe Vec methods

fc320d8

Add a test verifying that AVX2 result is identical to scalar

f431b60

add a TODO

fb3f9d5

Add a note about scalar loads

120c5be
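The buffer-handling commits in this list (capacity checks, `MaybeUninit`, and the later switch to safe `Vec` methods) describe a pattern that can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code from these commits: `fill_spare` and `fill_safe` are hypothetical names, and the output buffer is modelled as a plain `Vec<u8>`.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: copy `src` into the uninitialized tail of `dst`.
/// The spare capacity is checked first, so a too-small buffer causes a
/// panic instead of an out-of-bounds write.
fn fill_spare(dst: &mut Vec<u8>, src: &[u8]) {
    let spare: &mut [MaybeUninit<u8>] = dst.spare_capacity_mut();
    assert!(
        spare.len() >= src.len(),
        "output buffer has too little spare capacity"
    );
    for (slot, &byte) in spare.iter_mut().zip(src) {
        // The tail is only ever written through MaybeUninit, never read.
        slot.write(byte);
    }
    // SAFETY: the first `src.len()` spare elements were initialized above,
    // and the assert guarantees they lie inside the allocation.
    unsafe { dst.set_len(dst.len() + src.len()) };
}

/// Fully safe alternative: `Vec` does the bookkeeping (and grows if needed).
fn fill_safe(dst: &mut Vec<u8>, src: &[u8]) {
    dst.extend_from_slice(src);
}
```

The second function shows why replacing manual bookkeeping with safe `Vec` methods is attractive: the length update and the initialization can no longer drift apart.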

okaneco mentioned this pull request Nov 5, 2025

Safe AVX YCbCr #17

Open

Shnatsel and others added 3 commits November 5, 2025 21:41

Don't attempt to run AVX2 test on machines without AVX2

8d2d642

Address compiler warning

c2b36b1

okaneco force-pushed the remove_transmute_ycbcr branch from ae27b16 to 97c3575 Compare November 5, 2025 23:17

Shnatsel approved these changes Nov 7, 2025
