@MaodiMa commented on Nov 21, 2025

When 16 bytes or fewer remain, crc32_iscsi_by8_02 obtains the final result by folding and Barrett reduction with VPCLMULQDQ. On x86 platforms this is less efficient than using the CRC32 instruction directly.
In this commit, we use the CRC32 instruction to calculate:

  • short data of no more than 16 bytes
  • the remainder of long data once folding has reduced it to 16 bytes
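
As an illustration of why the CRC32 instruction can handle a short buffer directly, here is a minimal Python model, not the assembly from this PR. It assumes the instruction's documented behaviour (a CRC-32C update over a 1-, 2-, 4- or 8-byte operand) and shows that consuming up to 16 bytes in 8/4/2/1-byte steps gives the same value as a byte-by-byte update. The helper names `crc32c_step`, `crc32c_bytewise` and `crc32c_short` are hypothetical, and the chunk order is an assumption; the PR does not spell out the exact sequence it uses.

```python
POLY_REFLECTED = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, bit-reflected


def crc32c_step(crc: int, chunk: bytes) -> int:
    """Model of one CRC32 instruction consuming len(chunk) bytes (1, 2, 4 or 8)."""
    crc ^= int.from_bytes(chunk, "little")
    for _ in range(8 * len(chunk)):
        crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc


def crc32c_bytewise(crc: int, data: bytes) -> int:
    """Reference update: one byte per step."""
    for b in data:
        crc = crc32c_step(crc, bytes([b]))
    return crc


def crc32c_short(crc: int, data: bytes) -> int:
    """Update over at most 16 bytes using 8-, 4-, 2- and 1-byte steps (assumed order)."""
    if len(data) > 16:
        raise ValueError("this sketch only covers inputs of 16 bytes or fewer")
    pos = 0
    for width in (8, 8, 4, 2, 1):
        if len(data) - pos >= width:
            crc = crc32c_step(crc, data[pos:pos + width])
            pos += width
    return crc


# Standard CRC-32C check value, using the conventional all-ones init and final XOR.
assert crc32c_short(0xFFFFFFFF, b"123456789") ^ 0xFFFFFFFF == 0xE3069283
```

At most five instruction steps cover any length from 0 to 16 bytes, which is the reason the direct route can beat a fold plus Barrett reduction for such inputs.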

The _barrett code and other obsolete code are removed.

In addition, we add several fast paths for short data at the very beginning of the procedure. They detect short data as early as possible so that it bypasses the long control flow. The entry points (i.e. 4/8/16 bytes) were chosen to balance performance across all data lengths up to 16 bytes.
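
The early dispatch can be sketched as follows. This is a hypothetical Python model rather than the PR's assembly: the thresholds 4/8/16 come from the description above, but the branch order, the path labels in `trace`, and the helper names are assumptions, and the long path is replaced by a plain 8-byte loop standing in for the VPCLMULQDQ folding.

```python
POLY = 0x82F63B78  # reflected CRC-32C polynomial


def step(crc, chunk):
    """Model of a CRC32 instruction consuming len(chunk) bytes (1, 2, 4 or 8)."""
    crc ^= int.from_bytes(chunk, "little")
    for _ in range(8 * len(chunk)):
        crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc


def tail(crc, data):
    """Finish fewer than 8 bytes with 4-, 2- and 1-byte steps."""
    pos = 0
    for width in (4, 2, 1):
        if len(data) - pos >= width:
            crc = step(crc, data[pos:pos + width])
            pos += width
    return crc


def crc32c_update(crc, data, trace=None):
    """Length checks come first, so short inputs never reach the long path."""
    trace = trace if trace is not None else []
    n = len(data)
    if n <= 16:
        if n == 16:
            trace.append("fast16")
            return step(step(crc, data[:8]), data[8:])
        if n >= 8:
            trace.append("fast8")
            return tail(step(crc, data[:8]), data[8:])
        if n >= 4:
            trace.append("fast4")
            return tail(step(crc, data[:4]), data[4:])
        trace.append("fast_lt4")
        return tail(crc, data)
    trace.append("long")
    # Stand-in for the folding loop of the real routine (assumption).
    whole = n - n % 8
    for pos in range(0, whole, 8):
        crc = step(crc, data[pos:pos + 8])
    return tail(crc, data[whole:])
```

Passing a list as `trace` records which path was taken; for example, a 3-byte input records only `fast_lt4` and never touches the long-path setup.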

Performance data are listed below (iterations per second):

| Data len (bytes) | Orig (Hygon 7490) | Opt (Hygon 7490) | Orig (AMD Zen5 forced) | Opt (AMD Zen5 forced) | Orig (Intel Xeon(R) Platinum 8480+ forced) | Opt (Intel Xeon(R) Platinum 8480+ forced) |
|---|---|---|---|---|---|---|
| 1 | 60.8 | 213.71 | 123.33 | 624.75 | 78.81 | 187.4 |
| 2 | 60.315 | 230.195 | 123.385 | 692.85 | 75.84 | 199.885 |
| 3 | 60.67 | 213.9167 | 123.41 | 611.0367 | 77.13667 | 199.9133 |
| 4 | 51.2975 | 213.7325 | 90.5325 | 574.13 | 63.665 | 187.4225 |
| 5 | 51.3 | 213.714 | 89.4 | 630.712 | 61.766 | 187.416 |
| 6 | 51.32333 | 199.5017 | 90.90833 | 690.815 | 62.205 | 199.8883 |
| 7 | 51.32571 | 166.2457 | 89.42571 | 610.2614 | 60.73 | 199.9086 |
| 8 | 51.2875 | 166.2363 | 90.4475 | 543.13 | 63.6775 | 158.0863 |
| 9 | 51.26444 | 166.2011 | 89.42889 | 502.1789 | 61.65667 | 158.0233 |
| 10 | 51.272 | 175.999 | 90.983 | 552.159 | 62.198 | 176.317 |
| 11 | 51.30455 | 157.52 | 89.52 | 502.4264 | 60.74182 | 176.0864 |
| 12 | 51.30167 | 175.9483 | 90.92333 | 545.6992 | 62.1875 | 176.3975 |
| 13 | 51.29692 | 157.4792 | 89.43308 | 545.5031 | 60.73692 | 176.3815 |
| 14 | 51.23714 | 149.55 | 89.42714 | 627.9607 | 61.14857 | 187.4407 |
| 15 | 51.22867 | 130.122 | 87.99733 | 579.0333 | 61.25467 | 187.418 |
| 16 | 156.7456 | 213.7519 | 426.4888 | 692.675 | 164.5294 | 214.25 |
| 32 | 163.8588 | 186.7644 | 369.6678 | 602.8903 | 159.65 | 187.6084 |
| 33 | 147.3003 | 175.877 | 291.9685 | 455.4342 | 166.0924 | 187.7615 |
| 34 | 147.1832 | 175.8962 | 291.9426 | 456.2132 | 166.6847 | 188.6229 |
| 35 | 147.2469 | 176.0066 | 291.9271 | 457.2131 | 166.4914 | 188.3 |
| 36 | 147.365 | 175.3875 | 291.9078 | 456.8592 | 166.47 | 187.7475 |
| 37 | 147.3995 | 175.5854 | 291.9 | 454.7443 | 165.7959 | 187.9919 |
| 38 | 147.3842 | 175.5974 | 291.9111 | 457.6768 | 166.4245 | 187.9037 |
| 39 | 147.2685 | 175.3803 | 291.9208 | 455.7831 | 166.7226 | 188.0136 |
| 248 | 46.91202 | 53.59121 | 63.74879 | 80.16802 | 62.69609 | 68.2371 |
| 249 | 46.90333 | 53.66181 | 63.74261 | 80.1592 | 62.71562 | 68.26221 |
| 250 | 46.97656 | 53.64704 | 63.75324 | 80.1802 | 62.6408 | 67.89964 |
| 251 | 46.99442 | 53.64036 | 63.76171 | 80.16928 | 62.69586 | 68.13741 |
| 252 | 46.98548 | 53.68766 | 63.75278 | 80.15647 | 62.66 | 67.93214 |
| 253 | 46.91992 | 53.67925 | 63.74443 | 80.17787 | 62.69016 | 68.51415 |
| 254 | 46.92409 | 53.71902 | 63.75252 | 80.16445 | 62.66961 | 68.10441 |
| 255 | 46.96302 | 53.63333 | 63.74784 | 80.15486 | 62.67047 | 68.05439 |
| 1024 | 23.02245 | 23.75864 | 21.36499 | 22.03716 | 22.88934 | 23.61054 |
| 2048 | 11.60119 | 11.787 | 10.7657 | 10.9329 | 11.57854 | 11.76069 |
| 3072 | 7.75623 | 7.838092 | 7.195329 | 7.270322 | 7.749375 | 7.829681 |
| 4096 | 5.824614 | 5.870693 | 5.403274 | 5.445383 | 5.823433 | 5.868193 |
| 5120 | 4.663133 | 4.693289 | 4.322949 | 4.348883 | 4.659406 | 4.693359 |
| 6144 | 3.887625 | 3.908548 | 3.604282 | 3.622699 | 3.889852 | 3.910246 |
| 7168 | 3.33365 | 3.348873 | 3.090539 | 3.104657 | 3.335957 | 3.350896 |
| 8192 | 2.918 | 2.929668 | 2.705631 | 2.715931 | 2.920171 | 2.93166 |
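
To read the figures as relative gains, the following small Python sketch divides Opt by Orig for a few rows. The numbers are copied from the Hygon 7490 columns above; the variable and function names are illustrative only.

```python
# (data length in bytes, Orig, Opt) from the Hygon 7490 columns
hygon_rows = [
    (1, 60.8, 213.71),
    (15, 51.22867, 130.122),
    (16, 156.7456, 213.7519),
    (255, 46.96302, 53.63333),
    (8192, 2.918, 2.929668),
]


def speedup(orig: float, opt: float) -> float:
    """Ratio of optimized to original throughput."""
    return opt / orig


for length, orig, opt in hygon_rows:
    print(f"{length:>5} bytes: {speedup(orig, opt):.2f}x")
```

On these rows the ratio is roughly 3.5x at 1 byte and about 1.0x at 8192 bytes, so the gain is concentrated in short inputs.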

- Use CRC32 instruction to calculate:
    a. short data no more than 16 bytes;
    b. long data when folding until 16 bytes.
- Add fastpath for short data to make the procedure more efficient.

Signed-off-by: Maodi Ma <[email protected]>
@pablodelara

Thanks for the PR. This is now merged.
