Add support for fp16 iGEMM with SME2 #9005

JonathanC-ARM · 2025-10-20T08:30:18Z

Initial prototype for fp16 iGEMM support with SME2,
continuing the work from #8687.

dsharlet · 2025-10-21T04:01:46Z

This isn't building for us:

test/gemm-microkernel-tester.cc:2455:40: error: no viable overloaded '+='
 2455 |         c_ref[m_index * n() + n_index] +=
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
 2456 |             xnn_float16_to_float(input_f16[m_index * k() + k_index]) *
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 2457 |             xnn_float16_to_float(weights[n_index * k() + k_index]);
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test/gemm-microkernel-tester.cc:2459:38: error: no viable overloaded '+='
 2459 |       c_ref[m_index * n() + n_index] += xnn_float16_to_float(bias[n_index]);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
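
A minimal sketch of one way such an error can arise and be avoided, under the assumption (not confirmed in this thread) that the reference buffer's element type is a storage-only half-precision struct with no arithmetic operators on some toolchains. `Half`, `half_to_float`, `half_from_float`, and `reference_gemm` are hypothetical stand-ins, not XNNPACK names; the indexing mirrors the lines quoted in the error. The idea is to accumulate the reference result in `float` and convert only at the edges:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage-only half type: it deliberately has no operator+=,
// so `Half += float` would fail to compile, much like the error above.
// A real fp16 type would hold 16 bits; a float keeps this toy exact.
struct Half {
  float value;
};

inline float half_to_float(Half h) { return h.value; }
inline Half half_from_float(float f) { return Half{f}; }

// Reference GEMM with bias, accumulated in float.
// input is m x k, weights is n x k, bias has n entries, result is m x n.
inline std::vector<float> reference_gemm(const std::vector<Half>& input,
                                         const std::vector<Half>& weights,
                                         const std::vector<Half>& bias,
                                         std::size_t m, std::size_t n,
                                         std::size_t k) {
  assert(input.size() == m * k);
  assert(weights.size() == n * k);
  assert(bias.size() == n);
  std::vector<float> c_ref(m * n, 0.0f);
  for (std::size_t mi = 0; mi < m; ++mi) {
    for (std::size_t ni = 0; ni < n; ++ni) {
      for (std::size_t ki = 0; ki < k; ++ki) {
        c_ref[mi * n + ni] += half_to_float(input[mi * k + ki]) *
                              half_to_float(weights[ni * k + ki]);
      }
      c_ref[mi * n + ni] += half_to_float(bias[ni]);
    }
  }
  return c_ref;
}
```

Because `c_ref` holds `float`, the `+=` in the loop is always well-formed regardless of how the half type is defined on a given platform.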

JonathanC-ARM · 2025-10-21T11:14:24Z

Hi @dsharlet, could you give a bit more context on the error, particularly the build command used? The strange thing is that I can't reproduce it on my end.

I am compiling on an M4, though; I'm going to try cross-compiling on an x86_64 machine shortly.

bazel build -c opt --enable_bzlmod --define xnn_enable_arm_sme=true --define xnn_enable_arm_sme2=true //test:gemm_microkernel_tester

I tried a few variations on the command, cleaned my environment, etc. I also synced my fork with master in case anything has changed since.

-- c69ccdb by Gian Marco Iodice <[email protected]>: Prototype: Add support for fp16 iGEMM with SME2
   - Initial prototype to enable fp16 iGEMM with SME2 in conv2d
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- a3537a1 by Gian Marco Iodice <[email protected]>: Include missing files
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- 232826c by Gian Marco Iodice <[email protected]>: Update FP16 iGEMM based on review comments
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- 03bccaa by Jonathan Clohessy <[email protected]>: Updated FP16 iGemm Review with Fixes
   Signed-off-by: Jonathan Clohessy <[email protected]>

-- 9cd6e88 by Jonathan Clohessy <[email protected]>: Fix rebase issues
   Signed-off-by: Jonathan Clohessy <[email protected]>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 69ccf09
PiperOrigin-RevId: 821598958

JonathanC-ARM · 2025-10-21T18:06:12Z

@dsharlet thanks for telling me about the build problem; it seemed to show up only on Linux machines. I was able to fix it in the latest commit.

I will resolve the conflicts with master shortly.

gonnet · 2025-10-22T12:51:41Z

This is still failing to build for the CI workflows, e.g. https://github.com/google/XNNPACK/actions/runs/18713631848/job/53367695764.

JonathanC-ARM · 2025-10-23T11:52:16Z

@gonnet I have resolved the build failures and tested on both an Ubuntu machine and a Mac with clean environments, so it should be building now.

JonathanC-ARM · 2025-10-27T21:16:27Z

Hi @dsharlet, thanks for the approval. I saw there were a few failures in the workflows. Two of them appear to be AVX-related tests, so I'm not sure they are related to this change. The other was an unused-variable warning on one GCC version, so I've gone ahead and removed that variable; I think it should be OK next time around.

-- c69ccdb by Gian Marco Iodice <[email protected]>: Prototype: Add support for fp16 iGEMM with SME2
   - Initial prototype to enable fp16 iGEMM with SME2 in conv2d
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- a3537a1 by Gian Marco Iodice <[email protected]>: Include missing files
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- 232826c by Gian Marco Iodice <[email protected]>: Update FP16 iGEMM based on review comments
   Signed-off-by: Gian Marco Iodice <[email protected]>

-- 03bccaa by Jonathan Clohessy <[email protected]>: Updated FP16 iGemm Review with Fixes
   Signed-off-by: Jonathan Clohessy <[email protected]>

-- 9cd6e88 by Jonathan Clohessy <[email protected]>: Fix rebase issues
   Signed-off-by: Jonathan Clohessy <[email protected]>

-- 7eb618d by Misha Gutman <[email protected]>: Added multiple_of to handle all multiples in reductions simply.
   No significant performance loss:
   bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]  1.720µ ± 0%  1.719µ ± 17%  ~ (p=0.485 n=6)
   bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]  1.744µ ± 3%  1.753µ ± 14%  ~ (p=0.310 n=6)
   bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1]  1.218µ ± 1%  1.216µ ± 17%  ~ (p=0.818 n=6)
   bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1]  1.217µ ± 0%  1.216µ ± 15%  ~ (p=0.699 n=6)
   bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]  2.263µ ± 1%  2.268µ ± 0%  ~ (p=0.394 n=6)
   bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]  4.342µ ± 0%  4.357µ ± 0%  ~ (p=0.065 n=6)
   bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1]  2.221µ ± 0%  2.285µ ± 8%  ~ (p=0.065 n=6)
   bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1]  2.219µ ± 1%  2.279µ ± 2%  +2.70% (p=0.002 n=6)
   bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]  2.344µ ± 0%  2.345µ ± 7%  ~ (p=0.485 n=6)
   bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1]  4.318µ ± 0%  4.328µ ± 0%  +0.22% (p=0.015 n=6)
   bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1]  4.319µ ± 0%  4.325µ ± 1%  ~ (p=0.394 n=6)
   bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]  8.790µ ± 0%  8.795µ ± 0%  ~ (p=0.394 n=6)
   bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1]  3.966µ ± 0%  3.995µ ± 0%  +0.73% (p=0.002 n=6)
   bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1]  5.382µ ± 1%  5.410µ ± 1%  +0.52% (p=0.041 n=6)
   bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1]  3.977µ ± 0%  3.994µ ± 1%  +0.44% (p=0.004 n=6)
   bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1]  5.373µ ± 0%  5.412µ ± 2%  +0.72% (p=0.002 n=6)
   PiperOrigin-RevId: 821549068

-- e5cb8c0 by Misha Gutman <[email protected]>: Changed K1_1 strategy for f32 to go with single accumulator and maximally long multiple, this significantly improved performance.
   Since contiguous case tiles became different from discontiguous changed the naming to not include tiles information.
   bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]  2.259µ ± 1%
   bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]  4.339µ ± 0%
   bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]  8.787µ ± 1%
   bench/sum_fp32/real_time [256x1x256x1]  3.255µ ± 7%
   bench/sum_fp32_avx512f/real_time [256x1x256x1]  1.441µ ± 17%
   bench/sum_fp32_avx2/real_time [256x1x256x1]  1.761µ ± 14%
   bench/sum_fp32_sse2/real_time [256x1x256x1]  3.435µ ± 13%
   bench/sum_fp32/real_time [256x1x256x1]  3.261µ ± 13%
   bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]  1.722µ ± 1%
   bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1]  1.703µ ± 1%
   bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]  1.749µ ± 0%
   bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1]  1.744µ ± 0%
   bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]  2.341µ ± 1%
   bench/sum_fp16_fp32_f16c/real_time [256x1x256x1]  1.652µ ± 7%
   PiperOrigin-RevId: 821556723

-- aeeca5d by Dillon Sharlet <[email protected]>: Remove threadpool library and just build threadpool.cc as part of subgraph
   PiperOrigin-RevId: 821566586

-- 7304027 by Dillon Sharlet <[email protected]>: Disable SME when msan is enabled
   PiperOrigin-RevId: 821694771

-- 89a72e3 by Dillon Sharlet <[email protected]>: Don't bother disabling KleidiAI if using YNNPACK
   This causes builds to fail, and it's harmless to leave it enabled.
   PiperOrigin-RevId: 821704594

-- 0c5edfc by Dillon Sharlet <[email protected]>: Disable SME on older Apple compilers
   PiperOrigin-RevId: 821708108

-- 9b29972 by Dillon Sharlet <[email protected]>: Fix usage of `sv{ld,st}1_hor_vnum_za32`
   According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).
   This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.
   PiperOrigin-RevId: 821730217

-- 0d3dc09 by Dillon Sharlet <[email protected]>: Fix correctness of dot benchmarks for transpose_a kernels
   PiperOrigin-RevId: 821808685

-- 4b73eb1 by Pedro Gonnet <[email protected]>: Update `pthreadpool` dependency.
   PiperOrigin-RevId: 821857188

-- 66d084b by Dillon Sharlet <[email protected]>: Fix flaky quantize tests
   PiperOrigin-RevId: 821867761

-- 6fc5696 by Quentin Khan <[email protected]>: Add missing `gemm_config` `.element_size` initializations.
   PiperOrigin-RevId: 821984759

-- 923b7f9 by Jonathan Clohessy <[email protected]>: Fix build issues and guard against sme2 specific path
   Signed-off-by: Jonathan Clohessy <[email protected]>

-- 06a44d2 by Jonathan Clohessy <[email protected]>: Refactor Convolution to new structure and fix build failures
   Signed-off-by: Jonathan Clohessy <[email protected]>

-- 175903d by Jonathan Clohessy <[email protected]>: Remove unused gemm config structure init
   Signed-off-by: Jonathan Clohessy <[email protected]>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 175903d
PiperOrigin-RevId: 821598958
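
The `sv{ld,st}1_hor_vnum_za32` behaviour described in commit 9b29972 can be illustrated with a toy model in plain C++. This is only a sketch of the semantics as stated in that commit message (`vnum` offsets both the slice and the pointer); it uses no ACLE intrinsics, and `Tile`, `kSlices`, `kLanes`, `load_hor_vnum`, and `load_tile` are hypothetical names. The wrap-around of the slice index is an assumption of the toy model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlices = 4;  // toy value: horizontal slices per tile
constexpr std::size_t kLanes = 4;   // toy value: 32-bit lanes per vector

using Tile = std::array<std::array<std::int32_t, kLanes>, kSlices>;

// Toy model of a horizontal "vnum" load: vnum is added to BOTH the
// destination slice index and the source pointer (in whole vectors).
inline void load_hor_vnum(Tile& za, std::size_t slice_base,
                          const std::int32_t* ptr, std::size_t vnum) {
  assert(ptr != nullptr);
  const std::size_t slice = (slice_base + vnum) % kSlices;  // toy wrap-around
  const std::int32_t* src = ptr + vnum * kLanes;
  for (std::size_t i = 0; i < kLanes; ++i) {
    za[slice][i] = src[i];
  }
}

// Fill a whole tile from contiguous memory. With the semantics above,
// slice_base stays 0 and vnum alone selects slice s and vector s.
inline Tile load_tile(const std::int32_t* ptr) {
  Tile za{};
  for (std::size_t s = 0; s < kSlices; ++s) {
    load_hor_vnum(za, 0, ptr, s);
  }
  return za;
}
```

In this model, calling `load_hor_vnum(za, 1, ptr, 1)` while expecting only the pointer to advance writes the second vector into slice 2 rather than slice 1, which is the kind of mismatch the commit message describes.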

dsharlet · 2025-10-28T17:56:12Z

The failures on this branch look real: a lot of tests that run SME code are crashing.

JonathanC-ARM · 2025-10-28T20:23:17Z

@dsharlet yeah, that seems to have been the case. I spent a good bit of today sorting through things and should have a new push shortly. The crashes appear to come from missing SME1 or SME2 variants of pack functions. For example, in the crashing batch matmul test case some gemm config pointers were null, which led me to the missing packs.
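
A minimal sketch of the failure mode described above, assuming a config struct whose pack function pointer stays null when no variant is registered for the selected architecture. `GemmConfig`, `PackFn`, `pack_stub`, `init_config`, and `try_pack` are hypothetical names for illustration, not XNNPACK's actual structures:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pack-function signature: returns the packed size for n elements.
using PackFn = std::size_t (*)(std::size_t);

// Hypothetical config: the pointer stays null if no variant is registered.
struct GemmConfig {
  PackFn pack_weights = nullptr;
};

// Stand-in for an architecture-specific pack routine.
inline std::size_t pack_stub(std::size_t n) { return n * 2; }

// Only registers the pack routine when a variant exists for this build.
inline GemmConfig init_config(bool has_pack_variant) {
  GemmConfig config;
  if (has_pack_variant) {
    config.pack_weights = pack_stub;
  }
  return config;
}

// Checks the pointer before calling it, so a missing variant is reported
// to the caller instead of crashing on a null function pointer.
inline bool try_pack(const GemmConfig& config, std::size_t n,
                     std::size_t* packed_size) {
  assert(packed_size != nullptr);
  if (config.pack_weights == nullptr) {
    return false;
  }
  *packed_size = config.pack_weights(n);
  return true;
}
```

A check like this turns a crash into a reportable error, though the underlying fix is still to register the missing variant.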

Also, a quick question that has come out of this, which you or @gonnet might be able to answer: the tools/update_microkernels.py script autogenerates the Bazel and CMake files for the various microkernels. One thing I've encountered is that, for most SME2 kernels, the pack functions are SME, and the file naming convention dictates what is included in those Bazel or CMake files. Without modifying the logic of the Python script, the only way around this is to duplicate files across neonsme and neonsme2. That seems to be what has been done so far, but I wanted to ask your thoughts on it.
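
To make the naming question concrete, here is a purely hypothetical sketch of name-based grouping. It is not the logic of tools/update_microkernels.py, and the file names and the `arch_from_filename` helper are invented for illustration; it only assumes that the architecture tag is the text between the last `-` and the file extension:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical rule: the architecture tag is whatever sits between the
// last '-' and the last '.' in the file name. Returns "" if there is none.
inline std::string arch_from_filename(const std::string& path) {
  const std::size_t dash = path.rfind('-');
  const std::size_t dot = path.rfind('.');
  if (dash == std::string::npos || dot == std::string::npos || dot < dash) {
    return "";
  }
  return path.substr(dash + 1, dot - dash - 1);
}
```

Under a rule like this, a pack routine in a file tagged `neonsme` would only ever land in the `neonsme` source list, so an SME2-only build would miss it unless the file is duplicated under a `neonsme2` name or the grouping logic changes.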

JonathanC-ARM · 2025-10-29T14:41:59Z

Hi @dsharlet, I made some additional changes and ran all of //test/... with SME2 both enabled and disabled. Everything passed. I was able to replicate the original failures and work through them, so I think it should be all good now.

Thanks

dsharlet · 2025-10-30T01:47:35Z

There were some issues with build timeouts earlier. I re-ran the failed builds, and there is a remaining real build issue:

C:\Users\runneradmin\.cargo\bin\ccache.exe C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1444~1.352\bin\Hostx64\arm64\cl.exe  /nologo /TP -DNOMINMAX -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ARM_BF16=0 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=0 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ARM_I8MM=1 -DXNN_ENABLE_ARM_SME2=0 -DXNN_ENABLE_ARM_SME=1 -DXNN_ENABLE_ASSEMBLY=0 -DXNN_ENABLE_AVX256SKX=1 -DXNN_ENABLE_AVX256VNNI=1 -DXNN_ENABLE_AVX256VNNIGFNI=1 -DXNN_ENABLE_AVX2=1 -DXNN_ENABLE_AVX512AMX=1 -DXNN_ENABLE_AVX512BF16=0 -DXNN_ENABLE_AVX512F=1 -DXNN_ENABLE_AVX512FP16=0 -DXNN_ENABLE_AVX512SKX=1 -DXNN_ENABLE_AVX512VBMI=1 -DXNN_ENABLE_AVX512VNNI=1 -DXNN_ENABLE_AVX512VNNIGFNI=1 -DXNN_ENABLE_AVX=1 -DXNN_ENABLE_AVXVNNI=1 -DXNN_ENABLE_AVXVNNIINT8=0 -DXNN_ENABLE_CPUINFO=1 -DXNN_ENABLE_F16C=1 -DXNN_ENABLE_FMA3=1 -DXNN_ENABLE_HVX=1 -DXNN_ENABLE_KLEIDIAI=0 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -DXNN_ENABLE_SSE2=1 -DXNN_ENABLE_SSE41=1 -DXNN_ENABLE_SSE=1 -DXNN_ENABLE_SSSE3=1 -DXNN_ENABLE_VSX=1 -DXNN_ENABLE_WASM_REVECTORIZE=0 -DXNN_LOG_LEVEL=0 -IC:\a\XNNPACK\XNNPACK\include -IC:\a\XNNPACK\XNNPACK\build\windows\arm64\pthreadpool-source\include -external:IC:\a\XNNPACK\XNNPACK\. -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest -external:W0 /UNDEBUG  /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -std:c++14 -MD /wd4146 /bigobj /wd4190 /O2 /DEBUG:FASTLINK /Zi /showIncludes /Fotest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.cc.obj /Fdtest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.pdb /FS -c C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2864): error C3861: 'xnn_packed_size_kai_f16_conv_goki_w': identifier not found
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2870): error C3861: 'xnn_pack_kai_f16_conv_goki_w_sme': identifier not found

JonathanC-ARM · 2025-10-30T11:41:39Z

I just made some small tweaks to the ifdefs, which had let this code get into non-KleidiAI builds.

bazel test --compilation_mode=opt --define xnn_enable_assembly=false --define xnn_enable_arm_fp16_scalar=false --define xnn_enable_arm_bf16=false --define xnn_enable_kleidiai=false //test/...

With this command I was able to see the failure and resolve it, and from what I could test on my end it should be working now.
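
A minimal sketch of the kind of guard described above, assuming the feature is controlled by the `XNN_ENABLE_KLEIDIAI` define that appears in the failing compile command. `kai_packed_size_stub` and `run_kai_conv_case` are hypothetical stand-ins, not the real packing functions named in the error:

```cpp
#include <cassert>
#include <cstddef>

// Default to "disabled" if the build system did not define the macro.
#ifndef XNN_ENABLE_KLEIDIAI
#define XNN_ENABLE_KLEIDIAI 0
#endif

#if XNN_ENABLE_KLEIDIAI
// Hypothetical stand-in for a helper that only exists in KleidiAI builds.
inline std::size_t kai_packed_size_stub(std::size_t n, std::size_t k) {
  return n * k * 2;
}
#endif

// Every reference to the KleidiAI-only helper sits inside the same guard,
// so non-KleidiAI builds never see an undeclared identifier.
inline bool run_kai_conv_case(std::size_t n, std::size_t k,
                              std::size_t* packed_size) {
  assert(packed_size != nullptr);
#if XNN_ENABLE_KLEIDIAI
  *packed_size = kai_packed_size_stub(n, k);
  return true;
#else
  (void)n;
  (void)k;
  *packed_size = 0;  // nothing to pack; the caller can skip the case
  return false;
#endif
}
```

The point is that the declaration and every call site share one guard, which is what keeps an "identifier not found" error from appearing when the feature is switched off.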

gmiodice and others added 4 commits October 20, 2025 09:24

Prototype: Add support for fp16 iGEMM with SME2

c69ccdb

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d Signed-off-by: Gian Marco Iodice <[email protected]>

Include missing files

a3537a1

Signed-off-by: Gian Marco Iodice <[email protected]>

Update FP16 iGEMM based on review comments

232826c

Signed-off-by: Gian Marco Iodice <[email protected]>

Updated FP16 iGemm Review with Fixes

03bccaa

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM mentioned this pull request Oct 20, 2025

Prototype: Add support for fp16 iGEMM with SME2 #8687

Open

Fix rebase issues

9cd6e88

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM force-pushed the f16_igemm branch from 77362fd to 9cd6e88 Compare October 20, 2025 09:44

copybara-service bot mentioned this pull request Oct 21, 2025

Copybara import of the project: #9017

Open

Aelphy and others added 12 commits October 21, 2025 16:16

Remove threadpool library and just build threadpool.cc as part of sub…

aeeca5d

…graph PiperOrigin-RevId: 821566586

Disable SME when msan is enabled

7304027

PiperOrigin-RevId: 821694771

Don't bother disabling KleidiAI if using YNNPACK

89a72e3

This causes builds to fail, and it's harmless to leave it enabled. PiperOrigin-RevId: 821704594

Disable SME on older Apple compilers

0c5edfc

PiperOrigin-RevId: 821708108

Fix correctness of dot benchmarks for transpose_a kernels

0d3dc09

PiperOrigin-RevId: 821808685

Update pthreadpool dependency.

4b73eb1

PiperOrigin-RevId: 821857188

Fix flaky quantize tests

66d084b

PiperOrigin-RevId: 821867761

Add missing gemm_config .element_size initializations.

6fc5696

PiperOrigin-RevId: 821984759

Fix build issues and guard against sme2 specific path

923b7f9

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM force-pushed the f16_igemm branch from 6eb7ad7 to 56ee7cb Compare October 22, 2025 10:32

JonathanC-ARM force-pushed the f16_igemm branch from 56ee7cb to bf9d731 Compare October 22, 2025 10:45

Merge remote-tracking branch 'origin/master' into f16_igemm

22beb50

JonathanC-ARM force-pushed the f16_igemm branch from bf9d731 to 22beb50 Compare October 22, 2025 10:46

Refactor Convolution to new structure and fix build failures

06a44d2

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet approved these changes Oct 25, 2025

Remove unused gemm config structure init

175903d

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet approved these changes Oct 27, 2025

Merge branch 'google:master' into f16_igemm

9efa3d6

JonathanC-ARM added 2 commits October 29, 2025 14:06

Updated code with sme variants of kernels and fixed tests

999f4e3

Signed-off-by: Jonathan Clohessy <[email protected]>

Merge branch 'f16_igemm' of github.com:JonathanC-ARM/XNNPACK into f16…

892eee1

…_igemm

dsharlet approved these changes Oct 29, 2025

Updated ifdef guards and yml file

a2bd7aa

Signed-off-by: Jonathan Clohessy <[email protected]>


7 participants
