Conversation

@gmiodice (Contributor) commented Jul 8, 2025

  • Initial prototype to enable fp16 iGEMM with SME2 in conv2d

gmiodice added 2 commits July 8, 2025 16:10
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>
Signed-off-by: Gian Marco Iodice <[email protected]>
#if XNN_ENABLE_KLEIDIAI
assert(kr == 2);

return kai_get_lhs_packed_offset_lhs_imatmul_pack_x16p2vlx2_x16p_sme(
Contributor:

This appears to be an SME kernel, so SME2 is not required?
If so, we have emulator support for SME that could test this function with SME, but not SME2.
Is there a unit test we can run for the x16-pack-lh microkernels?

Collaborator:

This is the KleidiAI naming convention, so not something that needs to be addressed here.

#endif // XNN_ENABLE_KLEIDIAI
}

size_t xnn_x16_pack_lh_offset__igemm_neonsme2(size_t m, size_t kc, size_t ks,
Contributor:

`neon` should be removed from these function/file ISA names. In this case, the function it calls is `_sme`, so cpuinfo should detect SME (1.0), and typically the compiler would need to support SME, but not SME2, to build the microkernels.

Collaborator:

This is OK as long as it follows the naming convention for the other pack-lh kernels, which it does.

}
}

void transpose_weights_x16(const uint16_t* in, uint16_t* out, size_t height,
Contributor:

Is this 'x16' meant to be 16 bits per element? If so, it should be at the start of the function/file name: x16_transpose_weights().
Unless you're proposing a public API for transpose, this should be static or inlined into the calling function.
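
A minimal sketch of the suggested change, assuming the fourth parameter is named width and the body is a plain row-major transpose (neither is shown in the snippet above):

#include <stddef.h>
#include <stdint.h>

// Internal-only helper: static keeps it out of the exported symbol set.
static void transpose_weights_x16(const uint16_t* in, uint16_t* out,
                                  size_t height, size_t width) {
  for (size_t y = 0; y < height; ++y) {
    for (size_t x = 0; x < width; ++x) {
      out[x * height + y] = in[y * width + x];
    }
  }
}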

Collaborator:

I think this function is meant to be internal-only.

Yeah, that seemed to be the intention, so I have made it static.

src/subgraph.c (outdated)

case xnn_node_type_convolution_2d:
case xnn_node_type_deconvolution_2d: {

Contributor:

remove blank line

Collaborator:

Please run clang-format using this project's config on all modified files.

clang-format was run against all the files in the latest push.

size_t mr_packed, size_t kr,
size_t sr) {
#if XNN_ENABLE_KLEIDIAI
assert(kr == 2);
Contributor:

The kai function appears to be for x16 (NR=16)?
I see kai struggles with the overuse of 'x' just like XNNPACK :-)
One of the x16s appears to be bits per element; in XNNPACK that goes first in the name.
The x in pack functions is from MRxNR, but MR is not applicable, so x16 is NR=16.
We also used to have x2 meaning unrolled by 2, but renamed that to u2.
Anyway, the kai name implies a specific tiling.
In packw microkernels it is the same: the function/file are specific to a set of parameters (NR, KR, SR).
The function is specialized and expected to fail if called with anything else.
If kai is the same, then beyond the assert for KR=2 (ah, x2 — we would call that c2), the assert should check all applicable parameters, but especially NR.
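
A minimal sketch of the suggested check (the expected sr value below is a placeholder, not taken from the KleidiAI kernel's actual tiling):

#if XNN_ENABLE_KLEIDIAI
  // Assert every tiling parameter this packed layout is specialized for,
  // not just kr; the expected sr here is illustrative only.
  assert(kr == 2);
  assert(sr == 1);
#endif  // XNN_ENABLE_KLEIDIAI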

Collaborator:

KleidiAI has very extensive documentation of its type naming conventions on its GitLab/GitHub pages.

@gonnet self-assigned this Aug 19, 2025
@gonnet (Collaborator) left a comment:

Sorry for the delay in reviewing this!

I think it's really important to get all the microkernel and subgraph tests hooked up to make sure that they all pass.

Comment on lines 250 to 251
x16_igemm_pack_lh_config.log2_input_element_size = 0;
x16_igemm_pack_lh_config.log2_packed_element_size = 0;
Collaborator:

These values should be 1 since fp16 is 1 << 1 bytes.
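
A minimal sketch of the corrected assignments (field and variable names copied from the snippet above):

// An fp16 element is 2 bytes, i.e. 1 << 1, so both log2 sizes are 1.
x16_igemm_pack_lh_config.log2_input_element_size = 1;
x16_igemm_pack_lh_config.log2_packed_element_size = 1;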

Updated to the correct value

return status;
}

enum xnn_status xnn_create_convolution2d_nhwc_pf16(
Collaborator:

This function and xnn_create_convolution2d_nhwc_f16 (below it) differ only in the gemm_config and operator_type.

Could you create a single create_convolution2d_nhwc_f16 that takes the gemm_config and operator_type as arguments, and have this function and xnn_create_convolution2d_nhwc_f16 call it?
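
A minimal, shape-only sketch of the suggested factoring (parameter lists are elided, and the pf16 operator-type and gemm-config names are assumptions, not XNNPACK's confirmed identifiers):

// Single shared creation routine; both public entry points delegate to it.
static enum xnn_status create_convolution2d_nhwc_f16(
    /* ...shared convolution parameters..., */
    const struct xnn_gemm_config* gemm_config,
    enum xnn_operator_type operator_type,
    xnn_operator_t* convolution_op_out);

enum xnn_status xnn_create_convolution2d_nhwc_f16(/* ... */) {
  return create_convolution2d_nhwc_f16(
      /* ..., */ f16_gemm_config, xnn_operator_type_convolution_nhwc_f16,
      convolution_op_out);
}

enum xnn_status xnn_create_convolution2d_nhwc_pf16(/* ... */) {
  return create_convolution2d_nhwc_f16(
      /* ..., */ pf16_gemm_config, xnn_operator_type_convolution_nhwc_pf16,
      convolution_op_out);
}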

Added a single common creation function, create_convolution2d_nhwc_f16, which takes the gemm config and op type as you suggested.

"`XNN_ENABLE_KLEIDIAI`." &&
0);
#endif // XNN_ENABLE_KLEIDIAI
} No newline at end of file
Collaborator:

Please add a final newline.

Newline added in the latest push.

xnn_pqs8_qc8w_igemm_minmax_fp32_ukernel_32x32c4__neonsme2)


#define DECLARE_PF16_F16_PACKED_IGEMM_MINMAX_UKERNEL_FUNCTION(fn_name) \
Collaborator:

Please fix formatting.

size_t k_stride, //
size_t extra_bytes);

XNN_INTERNAL void xnn_pack_kai_f16_conv_goki_w_sme2(
Collaborator:

Please fix formatting.

xnnpack::Buffer<xnn_float16> c_ref(m() * n(), 0, /*extra_bytes=*/{0},
"c_ref");

#if 0
Collaborator:

Does this test pass?

xnn_pack_weights_and_biases_fn pack,
xnn_packed_stride_weights_and_biases_fn packed_stride);

void Test_PF16(xnn_packed_f16_lhs_igemm_ukernel_fn packed_igemm,
Collaborator:

For this function to be called, you need to add your new kernel to the corresponding .yaml files and likely also add a few lines to the test generator script (see e.g. XNNPACK/test/qp8-f32-qc8w-gemm-minmax.cc).

Added the yml files and updated the tests; they should now run and pass.

@dsharlet (Collaborator):

@gmiodice It looks like this needs conflicts resolved

@gonnet @fbarchard can you please follow up? Your comments have been responded to.

xnnpack_unit_test(
    name = "pf16_f16_igemm_minmax_test",
    srcs = [
        "pf16-f16-igemm-minmax.cc",
Collaborator:

This also needs to be added to the corresponding CMakeLists.txt file.

}

size_t xnn_packed_size_kai_f16_conv_goki_w(size_t nc, size_t ks, size_t kc) {
#if XNN_ENABLE_KLEIDIAI
Collaborator:

I think you are already inside an #if XNN_ENABLE_KLEIDIAI, so the #else below is not needed.
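
A simplified sketch of the redundancy being pointed out (kai_based_size is a placeholder, not a real KleidiAI call):

#if XNN_ENABLE_KLEIDIAI  // outer guard around this whole section
size_t xnn_packed_size_kai_f16_conv_goki_w(size_t nc, size_t ks, size_t kc) {
#if XNN_ENABLE_KLEIDIAI  // redundant: the outer guard already holds here
  return kai_based_size(nc, ks, kc);  // placeholder for the real computation
#else  // dead branch, never compiled; drop it along with the inner #if
  return 0;
#endif
}
#endif  // XNN_ENABLE_KLEIDIAI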

True, removed the additional #if


for (size_t g_idx = 0; g_idx < g; ++g_idx) {
transpose_weights_x16(k, tmp_data, nc, ks * kc);
kai_run_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme(
Collaborator:

The answer to both would really be a new function variant, which would require the KleidiAI team to implement new function versions and provide them in a release. The majority of their function signatures are similar in terms of the bias, so I suspect they would hesitate to remove it.

Would it be sufficient for the function to handle b == NULL as if it were full of zeros? That would not require a change to the API.
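
A minimal sketch of that idea, with hypothetical names (the real KleidiAI packing entry point takes the bias alongside the weights):

#include <stdint.h>
#include <string.h>

// If the caller passes no bias, substitute zeros from a scratch buffer so
// the packing routine's signature does not have to change. An fp16 zero is
// all-zero bits, so memset is sufficient.
static const uint16_t* bias_or_zeros(const uint16_t* bias, size_t nc,
                                     uint16_t* scratch) {
  if (bias != NULL) {
    return bias;
  }
  memset(scratch, 0, nc * sizeof(uint16_t));
  return scratch;
}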

As for the transpose: this variant of the pack kernel is kxn; if we had an nxk version, we could remove the manual transpose, but for right now we will need it.

OK, could you add a TODO to remind us? Thanks!

@JonathanC-ARM:

Hi @gonnet, I have picked up this work from @gmiodice. I've created a new PR with the updates requested above in the latest commit #9005
