--
c69ccdb by Gian Marco Iodice <[email protected]>:
Prototype: Add support for fp16 iGEMM with SME2
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d
Signed-off-by: Gian Marco Iodice <[email protected]>
--
a3537a1 by Gian Marco Iodice <[email protected]>:
Include missing files
Signed-off-by: Gian Marco Iodice <[email protected]>
--
232826c by Gian Marco Iodice <[email protected]>:
Update FP16 iGEMM based on review comments
Signed-off-by: Gian Marco Iodice <[email protected]>
--
03bccaa by Jonathan Clohessy <[email protected]>:
Updated FP16 iGemm Review with Fixes
Signed-off-by: Jonathan Clohessy <[email protected]>
--
9cd6e88 by Jonathan Clohessy <[email protected]>:
Fix rebase issues
Signed-off-by: Jonathan Clohessy <[email protected]>
--
7eb618d by Misha Gutman <[email protected]>:
Added multiple_of to handle all multiples in reductions simply.
No significant performance loss:
bench/sum_bf16_fp32_4x32_avx512bf16/real_time  [256x1x256x1]  1.720µ ± 0%  1.719µ ± 17%  ~      (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time  [256x1x256x1]  1.744µ ± 3%  1.753µ ± 14%  ~      (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time  [256x1x256x1]  1.218µ ± 1%  1.216µ ± 17%  ~      (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time   [256x1x256x1]  1.217µ ± 0%  1.216µ ± 15%  ~      (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time          [256x1x256x1]  2.263µ ± 1%  2.268µ ±  0%  ~      (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time              [256x1x256x1]  4.342µ ± 0%  4.357µ ±  0%  ~      (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time      [256x1x256x1]  2.221µ ± 0%  2.285µ ±  8%  ~      (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time       [256x1x256x1]  2.219µ ± 1%  2.279µ ±  2%  +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time        [256x1x256x1]  2.344µ ± 0%  2.345µ ±  7%  ~      (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time     [256x1x256x1]  4.318µ ± 0%  4.328µ ±  0%  +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time      [256x1x256x1]  4.319µ ± 0%  4.325µ ±  1%  ~      (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time              [256x1x256x1]  8.790µ ± 0%  8.795µ ±  0%  ~      (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time      [256x1x256x1]  3.966µ ± 0%  3.995µ ±  0%  +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time       [256x1x256x1]  5.382µ ± 1%  5.410µ ±  1%  +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time     [256x1x256x1]  3.977µ ± 0%  3.994µ ±  1%  +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time      [256x1x256x1]  5.373µ ± 0%  5.412µ ±  2%  +0.72% (p=0.002 n=6)
PiperOrigin-RevId: 821549068
--
e5cb8c0 by Misha Gutman <[email protected]>:
Changed the K1_1 strategy for f32 to use a single accumulator and a maximally
long multiple, which significantly improved performance.
Because the tiles for the contiguous case now differ from those for the
discontiguous case, the naming no longer includes tile information.
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1] 3.261µ ± 13%
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1] 1.652µ ± 7%
PiperOrigin-RevId: 821556723
--
aeeca5d by Dillon Sharlet <[email protected]>:
Remove threadpool library and just build threadpool.cc as part of subgraph
PiperOrigin-RevId: 821566586
--
7304027 by Dillon Sharlet <[email protected]>:
Disable SME when msan is enabled
PiperOrigin-RevId: 821694771
--
89a72e3 by Dillon Sharlet <[email protected]>:
Don't bother disabling KleidiAI if using YNNPACK
Disabling it causes builds to fail, and it's harmless to leave it enabled.
PiperOrigin-RevId: 821704594
--
0c5edfc by Dillon Sharlet <[email protected]>:
Disable SME on older Apple compilers
PiperOrigin-RevId: 821708108
--
9b29972 by Dillon Sharlet <[email protected]>:
Fix usage of `sv{ld,st}1_hor_vnum_za32`
According to the ACLE documentation, these intrinsics increment *both* the slice and the pointer by `vnum` vectors. The previous usage treated `vnum` as if it only advanced the pointer to read from/write to by 1 vector, leaving the slice unchanged.
Interestingly, this code worked on QEMU but fails on real (Apple M4) hardware. I think this indicates a bug in QEMU's implementation of these instructions.
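The difference between the two readings of `vnum` can be sketched with a plain scalar model. This is not the ACLE intrinsic and does not touch ZA; it only mirrors the index arithmetic described above. The names `ZaAccess`, `vnum_access`, and `pointer_only_access` are hypothetical, and the vector length in bytes is passed in as an assumed parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of which ZA slice and which memory offset one
// sv{ld,st}1_hor_vnum_za32-style access ends up touching.
struct ZaAccess {
  uint32_t slice;              // slice index actually accessed
  std::ptrdiff_t byte_offset;  // offset from the base pointer, in bytes
};

// Reading described in this commit: `vnum` advances BOTH the slice index
// and the pointer (by `vnum` whole vectors).
inline ZaAccess vnum_access(uint32_t slice, int64_t vnum,
                            std::size_t vector_bytes) {
  assert(vector_bytes > 0);
  const int64_t bytes = vnum * static_cast<int64_t>(vector_bytes);
  return ZaAccess{static_cast<uint32_t>(static_cast<int64_t>(slice) + vnum),
                  static_cast<std::ptrdiff_t>(bytes)};
}

// The mistaken assumption: `vnum` advances only the pointer and the
// slice index stays where it was.
inline ZaAccess pointer_only_access(uint32_t slice, int64_t vnum,
                                    std::size_t vector_bytes) {
  assert(vector_bytes > 0);
  const int64_t bytes = vnum * static_cast<int64_t>(vector_bytes);
  return ZaAccess{slice, static_cast<std::ptrdiff_t>(bytes)};
}
```

Under this model, a call with `slice = 0` and `vnum = 1` lands in slice 1, not slice 0, while the memory offset is the same in both readings. Code written against the pointer-only assumption would therefore read or write the wrong slice.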
PiperOrigin-RevId: 821730217
--
0d3dc09 by Dillon Sharlet <[email protected]>:
Fix correctness of dot benchmarks for transpose_a kernels
PiperOrigin-RevId: 821808685
--
4b73eb1 by Pedro Gonnet <[email protected]>:
Update `pthreadpool` dependency.
PiperOrigin-RevId: 821857188
--
66d084b by Dillon Sharlet <[email protected]>:
Fix flaky quantize tests
PiperOrigin-RevId: 821867761
--
6fc5696 by Quentin Khan <[email protected]>:
Add missing `gemm_config` `.element_size` initializations.
PiperOrigin-RevId: 821984759
--
923b7f9 by Jonathan Clohessy <[email protected]>:
Fix build issues and guard against sme2 specific path
Signed-off-by: Jonathan Clohessy <[email protected]>
--
06a44d2 by Jonathan Clohessy <[email protected]>:
Refactor Convolution to new structure and fix build failures
Signed-off-by: Jonathan Clohessy <[email protected]>
--
175903d by Jonathan Clohessy <[email protected]>:
Remove unused gemm config structure init
Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 175903d
PiperOrigin-RevId: 821598958