Contributor

@jammychiou1 jammychiou1 commented Nov 6, 2025

This approach was suggested by Hanno Becker in #411 (comment), when we implemented the same function for AArch64.

The speedup is barely noticeable: less than 50 cycles per call on my laptop.
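
For readers new to the function: decompose splits a coefficient into a high part and a low part. Below is a minimal scalar sketch of that split as specified by FIPS 204 (Decompose). It is not the AVX2 code changed in this PR and not the approach from #411. The name `decompose_ref` is hypothetical, and `GAMMA2` is fixed to the (Q - 1) / 32 parameter choice for brevity. The sketch is branchy and not constant-time, so it is only meant as a reference to compare against.

```c
#include <stdint.h>

#define Q 8380417
/* Assumption for this sketch: the (Q - 1) / 32 choice used by ML-DSA-65/87.
 * ML-DSA-44 uses (Q - 1) / 88 instead. */
#define GAMMA2 ((Q - 1) / 32)

/* Hypothetical reference helper, not the PR's implementation.
 * For a in [0, Q), returns a1 and writes a0 such that
 *   a == a1 * 2 * GAMMA2 + a0 (mod Q)  and  -GAMMA2 <= a0 <= GAMMA2.
 * The lower bound -GAMMA2 is only reached in the wrap-around case. */
static int32_t decompose_ref(int32_t *a0, int32_t a)
{
  int32_t r0 = a % (2 * GAMMA2); /* in [0, 2 * GAMMA2) */
  if (r0 > GAMMA2)
  {
    r0 -= 2 * GAMMA2; /* centered representative in (-GAMMA2, GAMMA2] */
  }
  if (a - r0 == Q - 1)
  {
    /* Wrap-around: the high part would be (Q - 1) / (2 * GAMMA2),
     * which is mapped to 0 while the low part is decremented. */
    *a0 = r0 - 1;
    return 0;
  }
  *a0 = r0;
  return (a - r0) / (2 * GAMMA2);
}
```

A vectorized version can be checked against such a reference over all inputs in [0, Q), since the input range is small enough to test exhaustively.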

@jammychiou1 jammychiou1 marked this pull request as ready for review November 7, 2025 01:53
@jammychiou1 jammychiou1 requested a review from a team as a code owner November 7, 2025 01:53
Contributor

@mkannwischer mkannwischer left a comment

Thanks @jammychiou1.
I agree with the changes you made - it's easier to explain.

Please also remove the check-magic annotations concerning your comments and instead add these constants to the whitelist.

@jammychiou1 jammychiou1 force-pushed the decompose-explanation branch 2 times, most recently from 5082756 to fa07d1f Compare November 8, 2025 08:13
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 46416 cycles | 46421 cycles | 1.00 |
| ML-DSA-44 sign | 132718 cycles | 132738 cycles | 1.00 |
| ML-DSA-44 verify | 47837 cycles | 47840 cycles | 1.00 |
| ML-DSA-65 keypair | 81452 cycles | 81443 cycles | 1.00 |
| ML-DSA-65 sign | 219217 cycles | 219207 cycles | 1.00 |
| ML-DSA-65 verify | 80136 cycles | 80134 cycles | 1.00 |
| ML-DSA-87 keypair | 132753 cycles | 132758 cycles | 1.00 |
| ML-DSA-87 sign | 280934 cycles | 280953 cycles | 1.00 |
| ML-DSA-87 verify | 130316 cycles | 130326 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 115272 cycles | 115262 cycles | 1.00 |
| ML-DSA-44 sign | 431782 cycles | 431720 cycles | 1.00 |
| ML-DSA-44 verify | 122176 cycles | 122167 cycles | 1.00 |
| ML-DSA-65 keypair | 197436 cycles | 197490 cycles | 1.00 |
| ML-DSA-65 sign | 700971 cycles | 701274 cycles | 1.00 |
| ML-DSA-65 verify | 197693 cycles | 197702 cycles | 1.00 |
| ML-DSA-87 keypair | 325389 cycles | 325412 cycles | 1.00 |
| ML-DSA-87 sign | 884468 cycles | 884484 cycles | 1.00 |
| ML-DSA-87 verify | 328634 cycles | 328655 cycles | 1.00 |

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 115718 cycles | 115652 cycles | 1.00 |
| ML-DSA-44 sign | 377201 cycles | 377357 cycles | 1.00 |
| ML-DSA-44 verify | 120344 cycles | 120215 cycles | 1.00 |
| ML-DSA-65 keypair | 200127 cycles | 200073 cycles | 1.00 |
| ML-DSA-65 sign | 623016 cycles | 622766 cycles | 1.00 |
| ML-DSA-65 verify | 198223 cycles | 198195 cycles | 1.00 |
| ML-DSA-87 keypair | 327615 cycles | 326756 cycles | 1.00 |
| ML-DSA-87 sign | 791103 cycles | 789971 cycles | 1.00 |
| ML-DSA-87 verify | 325264 cycles | 324409 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 35631 cycles | 35116 cycles | 1.01 |
| ML-DSA-44 sign | 120705 cycles | 120958 cycles | 1.00 |
| ML-DSA-44 verify | 38074 cycles | 38274 cycles | 0.99 |
| ML-DSA-65 keypair | 61818 cycles | 62757 cycles | 0.99 |
| ML-DSA-65 sign | 199188 cycles | 201252 cycles | 0.99 |
| ML-DSA-65 verify | 62198 cycles | 62387 cycles | 1.00 |
| ML-DSA-87 keypair | 94415 cycles | 94461 cycles | 1.00 |
| ML-DSA-87 sign | 230678 cycles | 230993 cycles | 1.00 |
| ML-DSA-87 verify | 94054 cycles | 95279 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 95091 cycles | 95412 cycles | 1.00 |
| ML-DSA-44 sign | 349043 cycles | 349579 cycles | 1.00 |
| ML-DSA-44 verify | 100848 cycles | 101012 cycles | 1.00 |
| ML-DSA-65 keypair | 165116 cycles | 165049 cycles | 1.00 |
| ML-DSA-65 sign | 566948 cycles | 567954 cycles | 1.00 |
| ML-DSA-65 verify | 165483 cycles | 165700 cycles | 1.00 |
| ML-DSA-87 keypair | 267238 cycles | 267808 cycles | 1.00 |
| ML-DSA-87 sign | 723156 cycles | 723827 cycles | 1.00 |
| ML-DSA-87 verify | 272344 cycles | 272309 cycles | 1.00 |

@jammychiou1 jammychiou1 changed the title from "Add bounds reasoning comments to AVX2 decompose" to "Update AVX2 decompose to use a more explainable (and very slightly faster) approach, along with bounds reasoning comments." Nov 8, 2025
@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 227025 cycles | 233223 cycles | 0.97 |
| ML-DSA-44 sign | 656604 cycles | 673055 cycles | 0.98 |
| ML-DSA-44 verify | 226548 cycles | 231292 cycles | 0.98 |
| ML-DSA-65 keypair | 399858 cycles | 399604 cycles | 1.00 |
| ML-DSA-65 sign | 1093277 cycles | 1092503 cycles | 1.00 |
| ML-DSA-65 verify | 382610 cycles | 378979 cycles | 1.01 |
| ML-DSA-87 keypair | 668662 cycles | 662585 cycles | 1.01 |
| ML-DSA-87 sign | 1457596 cycles | 1442394 cycles | 1.01 |
| ML-DSA-87 verify | 632700 cycles | 631363 cycles | 1.00 |

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 214068 cycles | 213795 cycles | 1.00 |
| ML-DSA-44 sign | 781499 cycles | 782133 cycles | 1.00 |
| ML-DSA-44 verify | 230065 cycles | 230257 cycles | 1.00 |
| ML-DSA-65 keypair | 385084 cycles | 385239 cycles | 1.00 |
| ML-DSA-65 sign | 1326386 cycles | 1314084 cycles | 1.01 |
| ML-DSA-65 verify | 375339 cycles | 375765 cycles | 1.00 |
| ML-DSA-87 keypair | 606587 cycles | 606848 cycles | 1.00 |
| ML-DSA-87 sign | 1621233 cycles | 1623082 cycles | 1.00 |
| ML-DSA-87 verify | 617288 cycles | 617742 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 69198 cycles | 69604 cycles | 0.99 |
| ML-DSA-44 sign | 184949 cycles | 187462 cycles | 0.99 |
| ML-DSA-44 verify | 69047 cycles | 69269 cycles | 1.00 |
| ML-DSA-65 keypair | 120248 cycles | 119917 cycles | 1.00 |
| ML-DSA-65 sign | 295658 cycles | 297151 cycles | 0.99 |
| ML-DSA-65 verify | 115575 cycles | 115546 cycles | 1.00 |
| ML-DSA-87 keypair | 202548 cycles | 202342 cycles | 1.00 |
| ML-DSA-87 sign | 386766 cycles | 386965 cycles | 1.00 |
| ML-DSA-87 verify | 193569 cycles | 193643 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 57336 cycles | 56956 cycles | 1.01 |
| ML-DSA-44 sign | 179376 cycles | 180499 cycles | 0.99 |
| ML-DSA-44 verify | 60900 cycles | 61247 cycles | 0.99 |
| ML-DSA-65 keypair | 99751 cycles | 99457 cycles | 1.00 |
| ML-DSA-65 sign | 296170 cycles | 296461 cycles | 1.00 |
| ML-DSA-65 verify | 99941 cycles | 100169 cycles | 1.00 |
| ML-DSA-87 keypair | 153766 cycles | 154195 cycles | 1.00 |
| ML-DSA-87 sign | 352782 cycles | 352935 cycles | 1.00 |
| ML-DSA-87 verify | 152815 cycles | 153194 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton2

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 116044 cycles | 116677 cycles | 0.99 |
| ML-DSA-44 sign | 377724 cycles | 379707 cycles | 0.99 |
| ML-DSA-44 verify | 120646 cycles | 121174 cycles | 1.00 |
| ML-DSA-65 keypair | 200451 cycles | 200327 cycles | 1.00 |
| ML-DSA-65 sign | 623509 cycles | 623378 cycles | 1.00 |
| ML-DSA-65 verify | 198593 cycles | 198489 cycles | 1.00 |
| ML-DSA-87 keypair | 328191 cycles | 327340 cycles | 1.00 |
| ML-DSA-87 sign | 792035 cycles | 790697 cycles | 1.00 |
| ML-DSA-87 verify | 325645 cycles | 324830 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 42462 cycles | 42166 cycles | 1.01 |
| ML-DSA-44 sign | 129986 cycles | 130558 cycles | 1.00 |
| ML-DSA-44 verify | 44008 cycles | 44242 cycles | 0.99 |
| ML-DSA-65 keypair | 72320 cycles | 72946 cycles | 0.99 |
| ML-DSA-65 sign | 210845 cycles | 210881 cycles | 1.00 |
| ML-DSA-65 verify | 72922 cycles | 72765 cycles | 1.00 |
| ML-DSA-87 keypair | 109431 cycles | 111282 cycles | 0.98 |
| ML-DSA-87 sign | 248355 cycles | 252306 cycles | 0.98 |
| ML-DSA-87 verify | 109568 cycles | 110921 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 135055 cycles | 136238 cycles | 0.99 |
| ML-DSA-44 sign | 539959 cycles | 544574 cycles | 0.99 |
| ML-DSA-44 verify | 148401 cycles | 149362 cycles | 0.99 |
| ML-DSA-65 keypair | 228535 cycles | 230475 cycles | 0.99 |
| ML-DSA-65 sign | 893053 cycles | 895163 cycles | 1.00 |
| ML-DSA-65 verify | 238247 cycles | 239847 cycles | 0.99 |
| ML-DSA-87 keypair | 373776 cycles | 376850 cycles | 0.99 |
| ML-DSA-87 sign | 1108012 cycles | 1112531 cycles | 1.00 |
| ML-DSA-87 verify | 387508 cycles | 389978 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Graviton3

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 74308 cycles | 74255 cycles | 1.00 |
| ML-DSA-44 sign | 228603 cycles | 228755 cycles | 1.00 |
| ML-DSA-44 verify | 78250 cycles | 78127 cycles | 1.00 |
| ML-DSA-65 keypair | 130496 cycles | 130420 cycles | 1.00 |
| ML-DSA-65 sign | 378316 cycles | 378291 cycles | 1.00 |
| ML-DSA-65 verify | 129294 cycles | 129164 cycles | 1.00 |
| ML-DSA-87 keypair | 209590 cycles | 211688 cycles | 0.99 |
| ML-DSA-87 sign | 479315 cycles | 479661 cycles | 1.00 |
| ML-DSA-87 verify | 208641 cycles | 210182 cycles | 0.99 |

@oqs-bot oqs-bot left a comment

Graviton4

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 70000 cycles | 69864 cycles | 1.00 |
| ML-DSA-44 sign | 214918 cycles | 215244 cycles | 1.00 |
| ML-DSA-44 verify | 72777 cycles | 72692 cycles | 1.00 |
| ML-DSA-65 keypair | 124033 cycles | 123579 cycles | 1.00 |
| ML-DSA-65 sign | 353286 cycles | 353468 cycles | 1.00 |
| ML-DSA-65 verify | 120824 cycles | 120718 cycles | 1.00 |
| ML-DSA-87 keypair | 202214 cycles | 201648 cycles | 1.00 |
| ML-DSA-87 sign | 451358 cycles | 451997 cycles | 1.00 |
| ML-DSA-87 verify | 198404 cycles | 198649 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 159164 cycles | 157982 cycles | 1.01 |
| ML-DSA-44 sign | 569001 cycles | 567777 cycles | 1.00 |
| ML-DSA-44 verify | 170753 cycles | 169763 cycles | 1.01 |
| ML-DSA-65 keypair | 271354 cycles | 271455 cycles | 1.00 |
| ML-DSA-65 sign | 926525 cycles | 925734 cycles | 1.00 |
| ML-DSA-65 verify | 275640 cycles | 275498 cycles | 1.00 |
| ML-DSA-87 keypair | 451014 cycles | 451543 cycles | 1.00 |
| ML-DSA-87 sign | 1182715 cycles | 1183249 cycles | 1.00 |
| ML-DSA-87 verify | 460835 cycles | 460624 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 120143 cycles | 120533 cycles | 1.00 |
| ML-DSA-44 sign | 453777 cycles | 456002 cycles | 1.00 |
| ML-DSA-44 verify | 130320 cycles | 132129 cycles | 0.99 |
| ML-DSA-65 keypair | 204830 cycles | 207895 cycles | 0.99 |
| ML-DSA-65 sign | 732904 cycles | 742729 cycles | 0.99 |
| ML-DSA-65 verify | 209363 cycles | 211005 cycles | 0.99 |
| ML-DSA-87 keypair | 337665 cycles | 337927 cycles | 1.00 |
| ML-DSA-87 sign | 923416 cycles | 923041 cycles | 1.00 |
| ML-DSA-87 verify | 344913 cycles | 345844 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 214842 cycles | 214202 cycles | 1.00 |
| ML-DSA-44 sign | 782554 cycles | 794999 cycles | 0.98 |
| ML-DSA-44 verify | 230631 cycles | 229962 cycles | 1.00 |
| ML-DSA-65 keypair | 385817 cycles | 385876 cycles | 1.00 |
| ML-DSA-65 sign | 1310148 cycles | 1307768 cycles | 1.00 |
| ML-DSA-65 verify | 376009 cycles | 376256 cycles | 1.00 |
| ML-DSA-87 keypair | 607294 cycles | 607000 cycles | 1.00 |
| ML-DSA-87 sign | 1624685 cycles | 1625772 cycles | 1.00 |
| ML-DSA-87 verify | 617770 cycles | 617491 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 138816 cycles | 138783 cycles | 1.00 |
| ML-DSA-44 sign | 493083 cycles | 493854 cycles | 1.00 |
| ML-DSA-44 verify | 148367 cycles | 148389 cycles | 1.00 |
| ML-DSA-65 keypair | 242529 cycles | 242264 cycles | 1.00 |
| ML-DSA-65 sign | 809972 cycles | 809969 cycles | 1.00 |
| ML-DSA-65 verify | 240719 cycles | 240614 cycles | 1.00 |
| ML-DSA-87 keypair | 396675 cycles | 396621 cycles | 1.00 |
| ML-DSA-87 sign | 1027482 cycles | 1027277 cycles | 1.00 |
| ML-DSA-87 verify | 401597 cycles | 401369 cycles | 1.00 |

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 133144 cycles | 133258 cycles | 1.00 |
| ML-DSA-44 sign | 498479 cycles | 498179 cycles | 1.00 |
| ML-DSA-44 verify | 144897 cycles | 144918 cycles | 1.00 |
| ML-DSA-65 keypair | 227070 cycles | 226755 cycles | 1.00 |
| ML-DSA-65 sign | 812705 cycles | 812078 cycles | 1.00 |
| ML-DSA-65 verify | 231517 cycles | 231580 cycles | 1.00 |
| ML-DSA-87 keypair | 374798 cycles | 375108 cycles | 1.00 |
| ML-DSA-87 sign | 1020759 cycles | 1020839 cycles | 1.00 |
| ML-DSA-87 verify | 383690 cycles | 383524 cycles | 1.00 |

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 289514 cycles | 290746 cycles | 1.00 |
| ML-DSA-44 sign | 930458 cycles | 937533 cycles | 0.99 |
| ML-DSA-44 verify | 291385 cycles | 291943 cycles | 1.00 |
| ML-DSA-65 keypair | 491840 cycles | 493090 cycles | 1.00 |
| ML-DSA-65 sign | 1538201 cycles | 1526359 cycles | 1.01 |
| ML-DSA-65 verify | 477106 cycles | 476058 cycles | 1.00 |
| ML-DSA-87 keypair | 833556 cycles | 843754 cycles | 0.99 |
| ML-DSA-87 sign | 2048886 cycles | 2088455 cycles | 0.98 |
| ML-DSA-87 verify | 813904 cycles | 818519 cycles | 0.99 |

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 keypair | 309445 cycles | 304061 cycles | 1.02 |
| ML-DSA-44 sign | 1226595 cycles | 1204370 cycles | 1.02 |
| ML-DSA-44 verify | 347072 cycles | 331394 cycles | 1.05 |
| ML-DSA-65 keypair | 574923 cycles | 577955 cycles | 0.99 |
| ML-DSA-65 sign | 2020807 cycles | 1998603 cycles | 1.01 |
| ML-DSA-65 verify | 550026 cycles | 552006 cycles | 1.00 |
| ML-DSA-87 keypair | 869120 cycles | 870486 cycles | 1.00 |
| ML-DSA-87 sign | 2508517 cycles | 2493534 cycles | 1.01 |
| ML-DSA-87 verify | 894364 cycles | 896546 cycles | 1.00 |

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: fa07d1f | Previous: 100e446 | Ratio |
| --- | --- | --- | --- |
| ML-DSA-44 verify | 347072 cycles | 331394 cycles | 1.05 |
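
The arithmetic behind the alert can be reproduced by hand: 347072 / 331394 is roughly 1.047, which is reported as 1.05 and is above the 1.03 threshold. A minimal sketch of that check follows; `exceeds_threshold` is a hypothetical helper written for illustration and is not taken from github-action-benchmark.

```c
#include <stdbool.h>

/* Hypothetical helper: reports a possible regression when the ratio of
 * the current cycle count to the previous one is above the threshold. */
static bool exceeds_threshold(double current, double previous,
                              double threshold)
{
  return current / previous > threshold;
}
```

With the numbers from this thread, `exceeds_threshold(347072, 331394, 1.03)` is true, while the ML-DSA-44 sign result on the same machine (1226595 against 1204370, a ratio of about 1.02) stays below the threshold.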

Contributor

@mkannwischer mkannwischer left a comment

Thanks @jammychiou1. I am good with the changes now.

WDYT @hanno-becker?

Contributor

@hanno-becker hanno-becker left a comment

Overall LGTM, and it really helps in understanding the code. Thank you @jammychiou1. A few smaller change requests; see comments.

@jammychiou1
Contributor Author

Thank you @hanno-becker for your suggestions (and your wonderful work on #659)!

Please let me know if there are more things to change. If not, I'll clean up the commit history to prepare for the merge into main.

The new approach is adapted from our Neon implementation. See
<#411 (comment)>
for more information on the idea.

Bounds reasoning comments are also added.

Signed-off-by: jammychiou1 <[email protected]>

Edit some comments while we're at it.

Signed-off-by: jammychiou1 <[email protected]>
@jammychiou1 jammychiou1 force-pushed the decompose-explanation branch from d96ec5d to b79682b Compare November 11, 2025 09:28
Contributor

@hanno-becker hanno-becker left a comment

LGTM, thank you @jammychiou1.

@hanno-becker hanno-becker merged commit 8e74a84 into main Nov 11, 2025
259 checks passed
@hanno-becker hanno-becker deleted the decompose-explanation branch November 11, 2025 11:14
Development

Successfully merging this pull request may close these issues.

Add bounds reasoning comments to AVX2 backend

5 participants