
Commit 2597292

JonathanC-ARM authored and xnnpack-bot committed
Copybara import of the project:
-- c69ccdb by Gian Marco Iodice <[email protected]>:
Prototype: Add support for fp16 iGEMM with SME2 - Initial prototype to enable fp16 iGEMM with SME2 in conv2d
Signed-off-by: Gian Marco Iodice <[email protected]>

-- a3537a1 by Gian Marco Iodice <[email protected]>:
Include missing files
Signed-off-by: Gian Marco Iodice <[email protected]>

-- 232826c by Gian Marco Iodice <[email protected]>:
Update FP16 iGEMM based on review comments
Signed-off-by: Gian Marco Iodice <[email protected]>

-- 03bccaa by Jonathan Clohessy <[email protected]>:
Updated FP16 iGemm Review with Fixes
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 9cd6e88 by Jonathan Clohessy <[email protected]>:
Fix rebase issues
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 7eb618d by Misha Gutman <[email protected]>:
Added multiple_of to handle all multiples in reductions simply. No significant performance loss:
  bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]  1.720µ ± 0%  1.719µ ± 17%  ~       (p=0.485 n=6)
  bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]  1.744µ ± 3%  1.753µ ± 14%  ~       (p=0.310 n=6)
  bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1]  1.218µ ± 1%  1.216µ ± 17%  ~       (p=0.818 n=6)
  bench/sum_int8_int32_4x64_avx512bw/real_time  [256x1x256x1]  1.217µ ± 0%  1.216µ ± 15%  ~       (p=0.699 n=6)
  bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1]  2.263µ ± 1%  2.268µ ± 0%   ~       (p=0.394 n=6)
  bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1]  4.342µ ± 0%  4.357µ ± 0%   ~       (p=0.065 n=6)
  bench/sum_uint8_int32_4x32_avx2/real_time     [256x1x256x1]  2.221µ ± 0%  2.285µ ± 8%   ~       (p=0.065 n=6)
  bench/sum_int8_int32_4x32_avx2/real_time      [256x1x256x1]  2.219µ ± 1%  2.279µ ± 2%   +2.70%  (p=0.002 n=6)
  bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1]  2.344µ ± 0%  2.345µ ± 7%   ~       (p=0.485 n=6)
  bench/sum_uint8_int32_4x16_sse41/real_time    [256x1x256x1]  4.318µ ± 0%  4.328µ ± 0%   +0.22%  (p=0.015 n=6)
  bench/sum_int8_int32_4x16_sse41/real_time     [256x1x256x1]  4.319µ ± 0%  4.325µ ± 1%   ~       (p=0.394 n=6)
  bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1]  8.790µ ± 0%  8.795µ ± 0%   ~       (p=0.394 n=6)
  bench/sum_uint8_int32_4x16_sse2/real_time     [256x1x256x1]  3.966µ ± 0%  3.995µ ± 0%   +0.73%  (p=0.002 n=6)
  bench/sum_int8_int32_4x16_sse2/real_time      [256x1x256x1]  5.382µ ± 1%  5.410µ ± 1%   +0.52%  (p=0.041 n=6)
  bench/sum_uint8_int32_4x16_ssse3/real_time    [256x1x256x1]  3.977µ ± 0%  3.994µ ± 1%   +0.44%  (p=0.004 n=6)
  bench/sum_int8_int32_4x16_ssse3/real_time     [256x1x256x1]  5.373µ ± 0%  5.412µ ± 2%   +0.72%  (p=0.002 n=6)
PiperOrigin-RevId: 821549068

-- e5cb8c0 by Misha Gutman <[email protected]>:
Changed the K1_1 strategy for f32 to use a single accumulator and a maximally long multiple, which significantly improved performance. Since the contiguous-case tiles became different from the discontiguous ones, the naming was changed to no longer include tile information.
  bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1]  2.259µ ± 1%
  bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1]  4.339µ ± 0%
  bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1]  8.787µ ± 1%
  bench/sum_fp32/real_time                      [256x1x256x1]  3.255µ ± 7%
  bench/sum_fp32_avx512f/real_time              [256x1x256x1]  1.441µ ± 17%
  bench/sum_fp32_avx2/real_time                 [256x1x256x1]  1.761µ ± 14%
  bench/sum_fp32_sse2/real_time                 [256x1x256x1]  3.435µ ± 13%
  bench/sum_fp32/real_time                      [256x1x256x1]  3.261µ ± 13%
  bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]  1.722µ ± 1%
  bench/sum_bf16_fp32_avx512bf16/real_time      [256x1x256x1]  1.703µ ± 1%
  bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]  1.749µ ± 0%
  bench/sum_fp16_fp32_avx512fp16/real_time      [256x1x256x1]  1.744µ ± 0%
  bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1]  2.341µ ± 1%
  bench/sum_fp16_fp32_f16c/real_time            [256x1x256x1]  1.652µ ± 7%
PiperOrigin-RevId: 821556723

-- aeeca5d by Dillon Sharlet <[email protected]>:
Remove threadpool library and just build threadpool.cc as part of subgraph
PiperOrigin-RevId: 821566586

-- 7304027 by Dillon Sharlet <[email protected]>:
Disable SME when msan is enabled
PiperOrigin-RevId: 821694771

-- 89a72e3 by Dillon Sharlet <[email protected]>:
Don't bother disabling KleidiAI if using YNNPACK. This causes builds to fail, and it's harmless to leave it enabled.
PiperOrigin-RevId: 821704594

-- 0c5edfc by Dillon Sharlet <[email protected]>:
Disable SME on older Apple compilers
PiperOrigin-RevId: 821708108

-- 9b29972 by Dillon Sharlet <[email protected]>:
Fix usage of `sv{ld,st}1_hor_vnum_za32`. According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice). This is interesting because the code worked on QEMU but fails on real (Apple M4) hardware; I think this indicates a bug in the implementation of these instructions in QEMU.
PiperOrigin-RevId: 821730217

-- 0d3dc09 by Dillon Sharlet <[email protected]>:
Fix correctness of dot benchmarks for transpose_a kernels
PiperOrigin-RevId: 821808685

-- 4b73eb1 by Pedro Gonnet <[email protected]>:
Update `pthreadpool` dependency.
PiperOrigin-RevId: 821857188

-- 66d084b by Dillon Sharlet <[email protected]>:
Fix flaky quantize tests
PiperOrigin-RevId: 821867761

-- 6fc5696 by Quentin Khan <[email protected]>:
Add missing `gemm_config` `.element_size` initializations.
PiperOrigin-RevId: 821984759

-- 923b7f9 by Jonathan Clohessy <[email protected]>:
Fix build issues and guard against sme2 specific path
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 06a44d2 by Jonathan Clohessy <[email protected]>:
Refactor Convolution to new structure and fix build failures
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 175903d by Jonathan Clohessy <[email protected]>:
Remove unused gemm config structure init
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 999f4e3 by Jonathan Clohessy <[email protected]>:
Updated code with sme variants of kernels and fixed tests
Signed-off-by: Jonathan Clohessy <[email protected]>

-- a2bd7aa by Jonathan Clohessy <[email protected]>:
Updated ifdef guards and yml file
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 551cfde by Jonathan Clohessy <[email protected]>:
Add new test case and fix issue with LHS pack
Signed-off-by: Jonathan Clohessy <[email protected]>

-- bcc62a0 by Jonathan Clohessy <[email protected]>:
Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead
Signed-off-by: Jonathan Clohessy <[email protected]>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm bcc62a0
PiperOrigin-RevId: 833326167
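As an aside on the `sv{ld,st}1_hor_vnum_za32` fix (9b29972): below is a minimal sketch of the semantics described in that message, assuming a compiler with SME ACLE support (`arm_sme.h`, `__arm_streaming`/`__arm_inout("za")` keyword attributes). The helper name `copy_rows_into_za` and the loop shape are illustrative, not code taken from this commit.

#include <stdint.h>
#include <arm_sme.h>

// Hedged sketch: per ACLE, svld1_hor_vnum_za32(tile, slice, pg, base, vnum)
// advances BOTH the target ZA slice (slice + vnum) and the address
// (base + vnum * SVL bytes). The bug described above treated vnum as if it
// advanced only the pointer and left the slice unchanged.
void copy_rows_into_za(const float* src, int64_t rows)
    __arm_streaming __arm_inout("za") {
  const svbool_t pg = svptrue_b32();
  for (int64_t i = 0; i < rows; ++i) {
    // slice and base stay fixed; stepping vnum moves both together.
    svld1_hor_vnum_za32(/*tile=*/0, /*slice=*/0, pg, src, /*vnum=*/i);
  }
}

Per the commit description, the old pattern happened to pass on QEMU but failed on Apple M4, which is what suggests a QEMU emulation bug for these instructions.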
1 parent 26cebbb commit 2597292

38 files changed: +1689 additions, −142 deletions

bench/gemm-benchmark.cc

Lines changed: 1 addition & 1 deletion

@@ -1189,7 +1189,7 @@ void GEMMBenchmark(benchmark::State& state,
       const uint32_t mb = min(mc - m, mr);
       gemm(mb, nc, kc * sizeof(xnn_float16),
            input_packed.data() +
-               xnn_x16_pack_lh_offset__neonsme2(m, kc, mr_packed, kr, sr),
+               xnn_x16_pack_lh_offset__neonsme(m, kc, mr_packed, kr, sr),
            w.data() + packed_w_size * buffer_index,
            &c[c_elements * buffer_index], nc * sizeof(xnn_float16),
            sizeof(xnn_float16), &minmax_params);

build_srcs.bzl

Lines changed: 1 addition & 0 deletions

@@ -284,6 +284,7 @@ MICROKERNEL_DEFS = [
     "src/x64-transposec/x64-transposec.inc",
     "src/x8-pack-lh/x8-pack-lh.inc",
     "src/x8-pack-lh/x8-pack-lh-igemm.inc",
+    "src/x16-pack-lh/x16-pack-lh-igemm.inc",
     "src/x8-packq/x8-packq.inc",
     "src/x8-packw/x8-packw.inc",
     "src/x8-transposec/x8-transposec.inc",

cmake/gen/neonsme2_microkernels.cmake

Lines changed: 4 additions & 1 deletion

@@ -10,6 +10,7 @@


 SET(PROD_NEONSME2_MICROKERNEL_SRCS
+  src/pf16-f16-f16-igemm/pf16-f16-f16-igemm-32x32c2-minmax-neonsme2.c
   src/pf16-gemm/pf16-gemm-1x32c2-minmax-neonsme2.c
   src/pf16-gemm/pf16-gemm-32x32c2-minmax-neonsme2.c
   src/pf32-gemm/pf32-gemm-1x32-minmax-neonsme2.c
@@ -23,7 +24,9 @@ SET(PROD_NEONSME2_MICROKERNEL_SRCS
   src/qp8-f32-qc8w-gemm/qp8-f32-qc8w-gemm-minmax-16x64c4-neonsme2.c
   src/x8-pack-lh/x8-packlh-igemm-neonsme2.c
   src/x8-pack-lh/x8-packlh-neonsme2.c
-  src/x16-pack-lh/x16-packlh-neonsme2.c)
+  src/x16-pack-lh/x16-packlh-igemm-neonsme2.c
+  src/x16-pack-lh/x16-packlh-neonsme2.c
+  src/x32-pack-lh/x32-packlh-neonsme2.c)

 SET(NON_PROD_NEONSME2_MICROKERNEL_SRCS)

cmake/gen/neonsme_microkernels.cmake

Lines changed: 4 additions & 1 deletion

@@ -12,8 +12,11 @@
 SET(PROD_NEONSME_MICROKERNEL_SRCS
   src/pf32-gemm/pf32-gemm-1x32-minmax-neonsme.c
   src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme.c
+  src/x16-pack-lh/x16-packlh-igemm-neonsme.c
+  src/x16-pack-lh/x16-packlh-neonsme.c
   src/x32-pack-lh/x32-packlh-neonsme.c)

-SET(NON_PROD_NEONSME_MICROKERNEL_SRCS)
+SET(NON_PROD_NEONSME_MICROKERNEL_SRCS
+  src/pf16-f16-f16-igemm/pf16-f16-f16-igemm-32x32c2-minmax-neonsme.c)

 SET(ALL_NEONSME_MICROKERNEL_SRCS ${PROD_NEONSME_MICROKERNEL_SRCS} + ${NON_PROD_NEONSME_MICROKERNEL_SRCS})

gen/neonsme2_microkernels.bzl

Lines changed: 3 additions & 0 deletions

@@ -6,6 +6,7 @@ Auto-generated file. Do not edit!
 """

 PROD_NEONSME2_MICROKERNEL_SRCS = [
+    "src/pf16-f16-f16-igemm/pf16-f16-f16-igemm-32x32c2-minmax-neonsme2.c",
     "src/pf16-gemm/pf16-gemm-1x32c2-minmax-neonsme2.c",
     "src/pf16-gemm/pf16-gemm-32x32c2-minmax-neonsme2.c",
     "src/pf32-gemm/pf32-gemm-1x32-minmax-neonsme2.c",
@@ -19,7 +20,9 @@ PROD_NEONSME2_MICROKERNEL_SRCS = [
     "src/qp8-f32-qc8w-gemm/qp8-f32-qc8w-gemm-minmax-16x64c4-neonsme2.c",
     "src/x8-pack-lh/x8-packlh-igemm-neonsme2.c",
     "src/x8-pack-lh/x8-packlh-neonsme2.c",
+    "src/x16-pack-lh/x16-packlh-igemm-neonsme2.c",
     "src/x16-pack-lh/x16-packlh-neonsme2.c",
+    "src/x32-pack-lh/x32-packlh-neonsme2.c",
 ]

 NON_PROD_NEONSME2_MICROKERNEL_SRCS = [

gen/neonsme_microkernels.bzl

Lines changed: 3 additions & 0 deletions

@@ -8,10 +8,13 @@ Auto-generated file. Do not edit!
 PROD_NEONSME_MICROKERNEL_SRCS = [
     "src/pf32-gemm/pf32-gemm-1x32-minmax-neonsme.c",
     "src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme.c",
+    "src/x16-pack-lh/x16-packlh-igemm-neonsme.c",
+    "src/x16-pack-lh/x16-packlh-neonsme.c",
     "src/x32-pack-lh/x32-packlh-neonsme.c",
 ]

 NON_PROD_NEONSME_MICROKERNEL_SRCS = [
+    "src/pf16-f16-f16-igemm/pf16-f16-f16-igemm-32x32c2-minmax-neonsme.c",
 ]

 ALL_NEONSME_MICROKERNEL_SRCS = PROD_NEONSME_MICROKERNEL_SRCS + NON_PROD_NEONSME_MICROKERNEL_SRCS

include/xnnpack.h

Lines changed: 40 additions & 0 deletions

@@ -3068,6 +3068,46 @@ enum xnn_status xnn_create_convolution2d_nhwc_f16(
     xnn_weights_cache_t weights_cache,
     xnn_operator_t* convolution_op_out);

+enum xnn_status xnn_create_convolution2d_nhwc_pf16(
+    uint32_t input_padding_top,
+    uint32_t input_padding_right,
+    uint32_t input_padding_bottom,
+    uint32_t input_padding_left,
+    uint32_t kernel_height,
+    uint32_t kernel_width,
+    uint32_t subsampling_height,
+    uint32_t subsampling_width,
+    uint32_t dilation_height,
+    uint32_t dilation_width,
+    uint32_t groups,
+    size_t group_input_channels,
+    size_t group_output_channels,
+    size_t input_channel_stride,
+    size_t output_channel_stride,
+    const void* kernel,
+    const void* bias,
+    float output_min,
+    float output_max,
+    uint32_t flags,
+    xnn_weights_cache_t weights_cache,
+    xnn_operator_t* convolution_op_out);
+
+enum xnn_status xnn_reshape_convolution2d_nhwc_pf16(
+    xnn_operator_t convolution_op,
+    size_t batch_size,
+    size_t input_height,
+    size_t input_width,
+    size_t* workspace_size,
+    size_t* output_height_out,
+    size_t* output_width_out,
+    pthreadpool_t threadpool);
+
+enum xnn_status xnn_setup_convolution2d_nhwc_pf16(
+    xnn_operator_t convolution_op,
+    void* workspace,
+    const void* input,
+    void* output);
+
 enum xnn_status xnn_reshape_convolution2d_nhwc_f16(
     xnn_operator_t convolution_op,
     size_t batch_size,
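The three new declarations follow XNNPACK's usual create/reshape/setup operator lifecycle. Below is a hedged sketch of how a caller might drive them, assuming the existing xnn_initialize/xnn_run_operator/xnn_delete_operator entry points behave for pf16 as they do for f16; the shapes (3x3 kernel, stride 1, 64 channels) are illustrative and workspace alignment is glossed over.

#include <math.h>
#include <stdlib.h>
#include <xnnpack.h>

// Hedged sketch, not code from this commit: drives the new pf16 convolution
// operator through create -> reshape -> setup -> run.
static enum xnn_status run_pf16_conv(
    const void* kernel, const void* bias, const void* input, void* output,
    size_t batch_size, size_t input_height, size_t input_width) {
  enum xnn_status status = xnn_initialize(/*allocator=*/NULL);
  if (status != xnn_status_success) return status;

  xnn_operator_t conv_op = NULL;
  status = xnn_create_convolution2d_nhwc_pf16(
      /*input_padding_top=*/1, /*input_padding_right=*/1,
      /*input_padding_bottom=*/1, /*input_padding_left=*/1,
      /*kernel_height=*/3, /*kernel_width=*/3,
      /*subsampling_height=*/1, /*subsampling_width=*/1,
      /*dilation_height=*/1, /*dilation_width=*/1,
      /*groups=*/1, /*group_input_channels=*/64, /*group_output_channels=*/64,
      /*input_channel_stride=*/64, /*output_channel_stride=*/64,
      kernel, bias, /*output_min=*/-INFINITY, /*output_max=*/INFINITY,
      /*flags=*/0, /*weights_cache=*/NULL, &conv_op);
  if (status != xnn_status_success) return status;

  size_t workspace_size = 0, output_height = 0, output_width = 0;
  status = xnn_reshape_convolution2d_nhwc_pf16(
      conv_op, batch_size, input_height, input_width,
      &workspace_size, &output_height, &output_width, /*threadpool=*/NULL);
  if (status != xnn_status_success) {
    xnn_delete_operator(conv_op);
    return status;
  }

  // Plain malloc for brevity; production code should respect XNNPACK's
  // allocation alignment for operator workspaces.
  void* workspace = workspace_size ? malloc(workspace_size) : NULL;
  status = xnn_setup_convolution2d_nhwc_pf16(conv_op, workspace, input, output);
  if (status == xnn_status_success) {
    status = xnn_run_operator(conv_op, /*threadpool=*/NULL);
  }
  xnn_delete_operator(conv_op);
  free(workspace);
  return status;
}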

scripts/generate-tests.sh

Lines changed: 1 addition & 0 deletions

@@ -50,6 +50,7 @@ tools/generate-gemm-test.py --spec test/qs8-qc4w-gemm-minmax-fp32.yaml --output-
 tools/generate-gemm-test.py --spec test/qs8-qc8w-gemm-minmax-fp32.yaml --output-test test/qs8-qc8w-gemm-minmax-fp32.cc --output-test test/qs8-qc8w-gemm-minmax-fp32-2.cc --output-test test/qs8-qc8w-gemm-minmax-fp32-3.cc --output-bench bench/qs8-qc8w-gemm-fp32.cc &

 ### Tests for IGEMM micro-kernels
+tools/generate-gemm-test.py --spec test/pf16-f16-igemm-minmax.yaml --output-test test/pf16-f16-igemm-minmax.cc &
 tools/generate-gemm-test.py --spec test/f16-igemm-minmax.yaml --output-test test/f16-igemm-minmax.cc &
 tools/generate-gemm-test.py --spec test/f16-f32acc-igemm-minmax.yaml --output-test test/f16-f32acc-igemm-minmax.cc &

src/configs/gemm-config.c

Lines changed: 31 additions & 17 deletions

@@ -326,27 +326,40 @@ static void init_pf16_gemm_config(void) {
   pf16_gemm_config.bias_element_size = sizeof(xnn_float16);
 #if XNN_ARCH_ARM64 && XNN_ENABLE_KLEIDIAI
   const struct xnn_hardware_config* hardware_config =
-      xnn_init_hardware_config();
+    xnn_init_hardware_config();
   assert(hardware_config != NULL);
-  if (XNN_ENABLE_ARM_SME2 && (hardware_config->arch_flags & xnn_arch_arm_sme2)) {
-    #if XNN_ENABLE_ARM_SME2
-    const size_t mr = xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2_get_mr();
-    const size_t nr = xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2_get_nr();
-    pf16_gemm_config.arch = xnn_arch_arm_sme2;
-    pf16_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = XNN_INIT_HMP_GEMM_UKERNEL(xnn_pf16_gemm_minmax_ukernel_1x32c2__neonsme2);
-    pf16_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(mr)] = XNN_INIT_HMP_GEMM_UKERNEL(xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2);
-    pf16_gemm_config.init.f16 = xnn_init_f16_minmax_scalar_params;
-    pf16_gemm_config.pack_weights_and_biases = xnn_pack_kai_f16_weights_and_biases;
-    pf16_gemm_config.packed_stride_weights_and_biases = xnn_packed_stride_kai_f16_weights_and_biases;
-    pf16_gemm_config.mr = mr;
-    pf16_gemm_config.mr_packed = mr;
-    pf16_gemm_config.nr = nr;
-    pf16_gemm_config.log2_kr = 1;
-    #endif  // XNN_ENABLE_ARM_SME2
+  if ((hardware_config->arch_flags & xnn_arch_arm_sme2)) {
+    #if XNN_ENABLE_ARM_SME2
+    const size_t mr = xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2_get_mr();
+    size_t nr = xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2_get_nr();
+    const size_t nstep_min = 16;
+    pf16_gemm_config.arch = xnn_arch_arm_sme2;
+    pf16_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = XNN_INIT_HMP_GEMM_UKERNEL(xnn_pf16_gemm_minmax_ukernel_1x32c2__neonsme2);
+    pf16_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(mr)] = XNN_INIT_HMP_GEMM_UKERNEL(xnn_pf16_gemm_minmax_ukernel_32x32c2__neonsme2);
+    pf16_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(mr)] =
+        xnn_init_hmp_packed_igemm_ukernel(
+            (xnn_packed_lhs_igemm_ukernel_fn)
+                xnn_pf16_f16_igemm_minmax_fp16_ukernel_32x32c2__neonsme2);
+    pf16_gemm_config.init.f16 = xnn_init_f16_minmax_scalar_params;
+    pf16_gemm_config.pack_weights_and_biases = xnn_pack_kai_f16_weights_and_biases;
+    pf16_gemm_config.packed_stride_weights_and_biases = xnn_packed_stride_kai_f16_weights_and_biases;
+    pf16_gemm_config.pack_igemm_goki =
+        (xnn_pack_conv_goki_w_fn)xnn_pack_kai_f16_conv_goki_w_sme;  // both sme and sme2 use the same packing kernel
+    pf16_gemm_config.pack_igemm_kgo =
+        (xnn_pack_conv_kgo_w_fn)xnn_pack_f16_conv_kgo_w;
+    pf16_gemm_config.mr = mr;
+    pf16_gemm_config.mr_packed = mr;
+    pf16_gemm_config.nr = nr < nstep_min ? nstep_min : nr;
+    pf16_gemm_config.log2_kr = 1;
+    #endif
+  } else {
+    /* no action */
   }
-#endif  // XNN_ARCH_ARM64 && XNN_ENABLE_KLEIDIAI
+  assert(pf16_gemm_config.mr <= XNN_MAX_MR);
+#endif  // XNN_ARCH_ARM64 && XNN_ENABLE_KLEIDIAI
 }

+
 static void init_bf16_f32_gemm_config(void) {
   // Common parameters.
   bf16_f32_gemm_config.log2_input_element_size = XNN_LOG2_SIZEOF_BFLOAT16;
@@ -5635,6 +5648,7 @@ const struct xnn_gemm_config* xnn_init_pf16_gemm_config() {
     return NULL;
   }
   XNN_INIT_ONCE(pf16_gemm);
+
   return pf16_gemm_config.mr ? &pf16_gemm_config : NULL;
 }
