@JonathanC-ARM

Initial prototype for FP16 iGEMM support for SME2,
continuing work from #8687

gmiodice and others added 4 commits October 20, 2025 09:24
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>
Signed-off-by: Gian Marco Iodice <[email protected]>
Signed-off-by: Jonathan Clohessy <[email protected]>
Signed-off-by: Jonathan Clohessy <[email protected]>
@dsharlet
Collaborator

This isn't building for us:

test/gemm-microkernel-tester.cc:2455:40: error: no viable overloaded '+='
 2455 |         c_ref[m_index * n() + n_index] +=
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
 2456 |             xnn_float16_to_float(input_f16[m_index * k() + k_index]) *
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 2457 |             xnn_float16_to_float(weights[n_index * k() + k_index]);
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test/gemm-microkernel-tester.cc:2459:38: error: no viable overloaded '+='
 2459 |       c_ref[m_index * n() + n_index] += xnn_float16_to_float(bias[n_index]);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
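
For readers unfamiliar with this class of error: Clang reports `no viable overloaded '+='` when the left-hand side is a class type that defines no compound-assignment operator. The following is a minimal sketch, assuming `c_ref` holds a half-precision wrapper type. `Half`, `half_to_float`, `half_from_float`, and `reference_dot` are hypothetical stand-ins, not XNNPACK's actual definitions. It shows one common workaround: accumulate in `float` and convert once at the end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a half-precision storage type. A real type would
// store 16 bits; a float is used here only to keep the sketch short.
// Note that it deliberately defines no operator+=.
struct Half {
  float value;
};

inline float half_to_float(Half h) { return h.value; }
inline Half half_from_float(float f) { return Half{f}; }

// Reference dot product. Writing `acc += ...` with a Half accumulator would
// fail to compile (no operator+=), so the accumulator is a plain float and
// the result is converted back to Half once at the end.
inline Half reference_dot(const std::vector<Half>& a,
                          const std::vector<Half>& b) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
    acc += half_to_float(a[i]) * half_to_float(b[i]);
  }
  return half_from_float(acc);
}
```

Whether the actual test should accumulate in `float` or add an operator to the type is a design choice for the PR author; the sketch only illustrates why the compiler rejects the original expression.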

@JonathanC-ARM
Author

JonathanC-ARM commented Oct 21, 2025

Hi @dsharlet, could you give a bit more context on the error, particularly the build command used? The strange thing is that I can't reproduce it on my end.

I am compiling on an M4, though; I'm going to try on an x86_64 machine shortly and cross-compile.

bazel build -c opt --enable_bzlmod --define xnn_enable_arm_sme=true --define xnn_enable_arm_sme2=true //test:gemm_microkernel_tester

I tried a few variations on the command, cleaned my environment, etc. I also synced my fork with master in case anything had changed since.

copybara-service bot pushed a commit that referenced this pull request Oct 21, 2025
--
c69ccdb by Gian Marco Iodice <[email protected]>:

Prototype: Add support for fp16 iGEMM with SME2

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>

--
a3537a1 by Gian Marco Iodice <[email protected]>:

Include missing files

Signed-off-by: Gian Marco Iodice <[email protected]>

--
232826c by Gian Marco Iodice <[email protected]>:

Update FP16 iGEMM based on review comments

Signed-off-by: Gian Marco Iodice <[email protected]>

--
03bccaa by Jonathan Clohessy <[email protected]>:

Updated FP16 iGemm Review with Fixes

Signed-off-by: Jonathan Clohessy <[email protected]>

--
9cd6e88 by Jonathan Clohessy <[email protected]>:

Fix rebase issues

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 69ccf09
PiperOrigin-RevId: 821598958
Aelphy and others added 12 commits October 21, 2025 16:16
@JonathanC-ARM
Author

@dsharlet thanks for telling me about the build problem; it seemed to show up only on Linux machines. I was able to fix the build issue in the latest commit.

I will be resolving the conflicts with master shortly.

copybara-service bot pushed a commit that referenced this pull request Oct 22, 2025
--
7eb618d by Misha Gutman <[email protected]>:

Added multiple_of to handle all multiples in reductions simply.

No significant performance loss:

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0%   1.719µ ± 17%       ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3%   1.753µ ± 14%       ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1%   1.216µ ± 17%       ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time  [256x1x256x1] 1.217µ ± 0%   1.216µ ± 15%       ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1] 2.263µ ± 1%   2.268µ ±  0%       ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1] 4.342µ ± 0%   4.357µ ±  0%       ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time     [256x1x256x1] 2.221µ ± 0%   2.285µ ±  8%       ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time      [256x1x256x1] 2.219µ ± 1%   2.279µ ±  2%  +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.344µ ± 0%   2.345µ ±  7%       ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time    [256x1x256x1] 4.318µ ± 0%   4.328µ ±  0%  +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time     [256x1x256x1] 4.319µ ± 0%   4.325µ ±  1%       ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1] 8.790µ ± 0%   8.795µ ±  0%       ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time     [256x1x256x1] 3.966µ ± 0%   3.995µ ±  0%  +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time      [256x1x256x1] 5.382µ ± 1%   5.410µ ±  1%  +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time    [256x1x256x1] 3.977µ ± 0%   3.994µ ±  1%  +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time     [256x1x256x1] 5.373µ ± 0%   5.412µ ±  2%  +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman <[email protected]>:

Changed K1_1 strategy for f32 to go with single accumulator and maximally
long multiple, this significantly improved performance.
Since contiguous case tiles became different from discontiguous changed the
naming to not include tiles information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time     [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time     [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time              [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1]      1.441µ ± 17%
bench/sum_fp32_avx2/real_time    [256x1x256x1]      1.761µ ± 14%
bench/sum_fp32_sse2/real_time    [256x1x256x1]      3.435µ ± 13%
bench/sum_fp32/real_time         [256x1x256x1]      3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1]      1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1]      1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time       [256x1x256x1]      1.652µ ± 7%

PiperOrigin-RevId: 821556723

--
aeeca5d by Dillon Sharlet <[email protected]>:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet <[email protected]>:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771

--
89a72e3 by Dillon Sharlet <[email protected]>:

Don't bother disabling KleidiAI if using YNNPACK

This causes builds to fail, and it's harmless to leave it enabled.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet <[email protected]>:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet <[email protected]>:

Fix usage of `sv{ld,st}1_hor_vnum_za32`

According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).

This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.

PiperOrigin-RevId: 821730217

--
0d3dc09 by Dillon Sharlet <[email protected]>:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet <[email protected]>:

Update `pthreadpool` dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet <[email protected]>:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan <[email protected]>:

Add missing `gemm_config` `.element_size` initializations.

PiperOrigin-RevId: 821984759

--
923b7f9 by Jonathan Clohessy <[email protected]>:

Fix build issues and guard against sme2 specific path

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 56ee7cb
PiperOrigin-RevId: 821598958
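
The `sv{ld,st}1_hor_vnum_za32` fix described in commit 9b29972 above may be easier to follow with a toy model. The sketch below is plain C++, not ACLE: `ToyZa`, `kVectorWords`, `load_hor_vnum`, and `load_tile` are hypothetical names, and the tile is just a small 2D array. It models only the behavior the commit message describes, namely that `vnum` offsets both the ZA slice index and the memory pointer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed for illustration: 4 x 32-bit elements per vector, so the toy tile
// has 4 horizontal slices of 4 words each.
constexpr std::size_t kVectorWords = 4;

struct ToyZa {
  std::array<std::array<std::uint32_t, kVectorWords>, kVectorWords> rows{};
};

// Models a horizontal load with a vnum offset: the source is the vector at
// ptr + vnum * kVectorWords, and the destination is slice (slice + vnum).
// The modulo only keeps this toy model in bounds.
inline void load_hor_vnum(ToyZa& za, std::size_t slice,
                          const std::uint32_t* ptr, std::size_t vnum) {
  const std::uint32_t* src = ptr + vnum * kVectorWords;
  const std::size_t dst = (slice + vnum) % kVectorWords;
  for (std::size_t i = 0; i < kVectorWords; ++i) {
    za.rows[dst][i] = src[i];
  }
}

// Fills every slice of the tile from consecutive vectors in memory.
inline void load_tile(ToyZa& za, const std::uint32_t* ptr) {
  for (std::size_t v = 0; v < kVectorWords; ++v) {
    // Keep the base slice at 0 and let vnum advance both slice and pointer.
    // Passing slice = v as well would count the slice offset twice.
    load_hor_vnum(za, /*slice=*/0, ptr, /*vnum=*/v);
  }
}
```

Under this model, code that advances the slice index itself and also passes a non-zero `vnum` lands in the wrong slice, which matches the kind of mistake the commit message describes.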
@gonnet
Collaborator

gonnet commented Oct 22, 2025

This is still failing to build for the CI workflows, e.g. https://github.com/google/XNNPACK/actions/runs/18713631848/job/53367695764.

@JonathanC-ARM
Author

@gonnet I have resolved the build failures and tested the fix on both an Ubuntu machine and a Mac with clean environments, so it should be building now.

@JonathanC-ARM
Author

Hi @dsharlet, thanks for the approval. I saw there were a few failures in the workflows. Two of them appear to be AVX-related tests, so I'm not sure they are related to this change. The other was an unused-variable warning on one GCC version; I've gone ahead and removed that variable, so I think it should be OK next time around.

copybara-service bot pushed a commit that referenced this pull request Oct 27, 2025
--
06a44d2 by Jonathan Clohessy <[email protected]>:

Refactor Convolution to new structure and fix build failures

Signed-off-by: Jonathan Clohessy <[email protected]>

--
175903d by Jonathan Clohessy <[email protected]>:

Remove unused gemm config structure init

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 175903d
PiperOrigin-RevId: 821598958
@dsharlet
Collaborator

The failures on this branch look real: a lot of tests that run SME code are crashing.

@JonathanC-ARM
Author

@dsharlet yeah, that seems to have been the case. I spent a good bit of today sorting through things and should have a new push shortly. The crashes appear to come from missing SME1 or SME2 variants of pack functions. For example, in the crashing batch matmul test case, some gemm config pointers were null, which led me to the missing packs.

Also, a quick question that has come out of this, which you or @gonnet might be able to answer: the tools/update_microkernels.py script autogenerates the Bazel and CMake files for the various microkernels, and the file naming convention dictates what is included in those files. One thing I've encountered is that, for most SME2 kernels, the pack functions are SME. Without modifying the logic of the Python script, the only option is to duplicate files across neonsme and neonsme2. That seems to be what has been done so far, but I wanted to ask your thoughts on it.
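
The null-config crash described above can be illustrated with a small sketch. Everything here is hypothetical (`ToyGemmConfig`, `toy_get_config`, `toy_try_pack`, and the trivial `toy_pack`); it is not XNNPACK's config API. It only shows the general pattern of a lookup that may return no config for a given variant, and a caller that checks before dereferencing.

```cpp
#include <cstddef>

// Hypothetical pack-function signature and config struct.
using PackFn = void (*)(const float* weights, float* packed, std::size_t n);

struct ToyGemmConfig {
  PackFn pack;
};

// Trivial "packing" that just copies, standing in for a real pack routine.
inline void toy_pack(const float* weights, float* packed, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    packed[i] = weights[i];
  }
}

// Returns nullptr when no kernel/pack pair is registered for the variant,
// mimicking a missing SME or SME2 pack function.
inline const ToyGemmConfig* toy_get_config(bool variant_available) {
  static const ToyGemmConfig config{&toy_pack};
  return variant_available ? &config : nullptr;
}

// Returns false instead of dereferencing a null config or pack pointer.
inline bool toy_try_pack(const ToyGemmConfig* config, const float* weights,
                         float* packed, std::size_t n) {
  if (config == nullptr || config->pack == nullptr) {
    return false;
  }
  config->pack(weights, packed, n);
  return true;
}
```

A guard like this turns a crash into a reportable failure, but it does not replace registering the missing pack variants.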

copybara-service bot pushed a commit that referenced this pull request Oct 29, 2025
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 9efa3d6
PiperOrigin-RevId: 821598958
@JonathanC-ARM
Author

Hi @dsharlet, I made some additional changes and ran all of //test/... with SME2 both on and off. I was able to replicate the original failures and work through them, and everything now seems to pass, so I think it should be all good.

Thanks

Signed-off-by: Jonathan Clohessy <[email protected]>
@dsharlet
Collaborator

There were some issues with build timeouts earlier. I re-ran the failed builds, and a real build issue remains:

C:\Users\runneradmin\.cargo\bin\ccache.exe C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1444~1.352\bin\Hostx64\arm64\cl.exe  /nologo /TP -DNOMINMAX -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ARM_BF16=0 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=0 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ARM_I8MM=1 -DXNN_ENABLE_ARM_SME2=0 -DXNN_ENABLE_ARM_SME=1 -DXNN_ENABLE_ASSEMBLY=0 -DXNN_ENABLE_AVX256SKX=1 -DXNN_ENABLE_AVX256VNNI=1 -DXNN_ENABLE_AVX256VNNIGFNI=1 -DXNN_ENABLE_AVX2=1 -DXNN_ENABLE_AVX512AMX=1 -DXNN_ENABLE_AVX512BF16=0 -DXNN_ENABLE_AVX512F=1 -DXNN_ENABLE_AVX512FP16=0 -DXNN_ENABLE_AVX512SKX=1 -DXNN_ENABLE_AVX512VBMI=1 -DXNN_ENABLE_AVX512VNNI=1 -DXNN_ENABLE_AVX512VNNIGFNI=1 -DXNN_ENABLE_AVX=1 -DXNN_ENABLE_AVXVNNI=1 -DXNN_ENABLE_AVXVNNIINT8=0 -DXNN_ENABLE_CPUINFO=1 -DXNN_ENABLE_F16C=1 -DXNN_ENABLE_FMA3=1 -DXNN_ENABLE_HVX=1 -DXNN_ENABLE_KLEIDIAI=0 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -DXNN_ENABLE_SSE2=1 -DXNN_ENABLE_SSE41=1 -DXNN_ENABLE_SSE=1 -DXNN_ENABLE_SSSE3=1 -DXNN_ENABLE_VSX=1 -DXNN_ENABLE_WASM_REVECTORIZE=0 -DXNN_LOG_LEVEL=0 -IC:\a\XNNPACK\XNNPACK\include -IC:\a\XNNPACK\XNNPACK\build\windows\arm64\pthreadpool-source\include -external:IC:\a\XNNPACK\XNNPACK\. -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest -external:W0 /UNDEBUG  /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -std:c++14 -MD /wd4146 /bigobj /wd4190 /O2 /DEBUG:FASTLINK /Zi /showIncludes /Fotest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.cc.obj /Fdtest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.pdb /FS -c C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2864): error C3861: 'xnn_packed_size_kai_f16_conv_goki_w': identifier not found
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2870): error C3861: 'xnn_pack_kai_f16_conv_goki_w_sme': identifier not found

@JonathanC-ARM
Author

I just made some small tweaks to the #ifdefs; as they were, this code was getting into non-KleidiAI builds.
bazel test --compilation_mode=opt --define xnn_enable_assembly=false --define xnn_enable_arm_fp16_scalar=false --define xnn_enable_arm_bf16=false --define xnn_enable_kleidiai=false //test/...
With this command I was able to see the failure and resolve it, and from what I was able to test on my end it should be working now.
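
The MSVC `C3861: identifier not found` errors reported above are typical of test code that calls functions whose declarations are compiled out. A minimal sketch of the guarding pattern follows. `XNN_ENABLE_KLEIDIAI` is the macro visible in the build log, while `toy_kai_packed_size` and `toy_run_kai_test` are hypothetical stand-ins rather than the real XNNPACK or KleidiAI functions.

```cpp
#include <cstddef>

// Default to "disabled" when the build system does not define the macro.
#ifndef XNN_ENABLE_KLEIDIAI
#define XNN_ENABLE_KLEIDIAI 0
#endif

#if XNN_ENABLE_KLEIDIAI
// Hypothetical KleidiAI-only helper; it exists only in KleidiAI builds.
inline std::size_t toy_kai_packed_size(std::size_t n, std::size_t k) {
  return n * k * 2;  // placeholder arithmetic, not a real packed size
}
#endif

// Returns true if the KleidiAI-only path ran and wrote *packed_size.
// In non-KleidiAI builds the call site is compiled out as well, so the
// compiler never sees an undeclared identifier.
inline bool toy_run_kai_test(std::size_t n, std::size_t k,
                             std::size_t* packed_size) {
#if XNN_ENABLE_KLEIDIAI
  *packed_size = toy_kai_packed_size(n, k);
  return true;
#else
  (void)n;
  (void)k;
  (void)packed_size;
  return false;  // path skipped in this configuration
#endif
}
```

The key point is that the declaration and every call site sit under the same condition; guarding only one of the two reproduces the kind of error shown in the log.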
