Explicitly vectorize i32x8 to u8x8 conversion for storing into YCbCr buffers #22
…s instead of raw pointers
…uffers. Verify that there is enough capacity to hold the data before filling them to avoid buffer overflows. Treat them as MaybeUninit until they're filled to avoid exposing uninitialized memory.
…erations are now wrapped in safe abstractions that verify preconditions
Remove unsafe transmute in AVX2 YCbCr conversion
Prior to this, the conversion was scalarized and used a bswap. Rewriting the code to avoid reversing the array resulted in worse codegen that extracted the bytes and manually re-inserted them back into the SIMD register to store 8 bytes at once.
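For readers unfamiliar with the pattern, here is a minimal sketch of what an explicitly vectorized i32x8 to u8x8 narrowing can look like with AVX2 intrinsics. It is not the code from this PR: the function names `pack_scalar`, `pack_avx2`, and `pack_low_bytes` are hypothetical, and it assumes every lane already holds a value in 0..=255, so that keeping the low byte of each lane is the intended result.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference (hypothetical name): keep the low byte of each 32-bit lane.
fn pack_scalar(v: [i32; 8]) -> [u8; 8] {
    v.map(|x| x as u8)
}

/// AVX2 sketch (hypothetical name): narrow eight i32 lanes to eight bytes
/// and write them with a single 64-bit store.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack_avx2(v: [i32; 8]) -> [u8; 8] {
    let x = _mm256_loadu_si256(v.as_ptr() as *const __m256i);
    // vpshufb works per 128-bit half: gather bytes 0, 4, 8, 12 (the low byte
    // of each dword) into the bottom dword of that half; -1 zeroes a byte.
    let mask = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    );
    let shuffled = _mm256_shuffle_epi8(x, mask);
    // Cross-lane permute: place dword 0 and dword 4 side by side in the low 64 bits.
    let idx = _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0);
    let packed = _mm256_permutevar8x32_epi32(shuffled, idx);
    let mut out = [0u8; 8];
    // Store the low 8 bytes of the register in one go.
    _mm_storel_epi64(
        out.as_mut_ptr() as *mut __m128i,
        _mm256_castsi256_si128(packed),
    );
    out
}

/// Safe wrapper (hypothetical name): checks the CPU feature before
/// dispatching, and falls back to the scalar path otherwise.
fn pack_low_bytes(v: [i32; 8]) -> [u8; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { pack_avx2(v) };
        }
    }
    pack_scalar(v)
}
```

Calling `pack_low_bytes([0, 1, 127, 128, 200, 255, 16, 64])` should produce the same bytes as `pack_scalar` on the same input, which mirrors how the PR's tests compare the SIMD path against the scalar implementation.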
The tests comparing against the scalar implementation pass. The benchmarks show up to 4.5% improvement on Zen 4 and no regressions:
Benchmarks on Zen 4
encode rgb/encode rgb 100
time: [56.018 ms 56.023 ms 56.028 ms]
change: [+0.0905% +0.1598% +0.2284%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
3 (3.00%) high mild
7 (7.00%) high severe
Benchmarking encode rgb/encode rgb 4x1: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 174.6s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb 4x1
time: [34.628 ms 34.630 ms 34.631 ms]
change: [-1.1374% -1.0220% -0.9103%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) high mild
9 (9.00%) high severe
Benchmarking encode rgb/encode rgb progressive: Warming up for 15.000 s
Warning: Unable to complete 100 samples in 90.0s. You may wish to increase target time to 175.3s, enable flat sampling, or reduce sample count to 50.
encode rgb/encode rgb progressive
time: [34.762 ms 34.763 ms 34.765 ms]
change: [-1.9688% -1.7338% -1.4983%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) high mild
10 (10.00%) high severe
encode rgb/encode rgb optimized
time: [116.74 ms 116.76 ms 116.78 ms]
change: [-2.6467% -2.5590% -2.4677%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild
encode rgb/encode rgb optimized progressive
time: [118.35 ms 118.38 ms 118.41 ms]
change: [-4.6270% -4.5451% -4.4616%] (p = 0.00 < 0.05)
Performance has improved.
encode rgb/encode rgb mixed
time: [245.04 ms 245.09 ms 245.15 ms]
change: [-3.1994% -3.1084% -3.0202%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
12 (12.00%) high mild
Running benches/fdct.rs (target/release/deps/fdct-3a322e62af936f7b)
fdct/default fdct time: [39.240 ns 39.271 ns 39.303 ns]
change: [-2.0367% -1.9016% -1.7679%] (p = 0.00 < 0.05)
Performance has improved.
fdct/fdct avx2 time: [19.228 ns 19.229 ns 19.229 ns]
change: [-2.3674% -2.2783% -2.1903%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) low mild
2 (2.00%) high mild
6 (6.00%) high severe
This may understate the improvements on other systems: Zen 4 tends to handle just about any instruction sequence well and does not benefit as much from SIMD as older Intel chips do.
This is stacked on top of #17.
Only the last commit is relevant.