Prototype: Add support for fp16 iGEMM with SME2 #8687

base: master
Conversation

gmiodice commented Jul 8, 2025
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>
```c
#if XNN_ENABLE_KLEIDIAI
  assert(kr == 2);

  return kai_get_lhs_packed_offset_lhs_imatmul_pack_x16p2vlx2_x16p_sme(
```
This appears to be an SME kernel, so SME2 is not required? If so, we have emulator support for SME that could test this function with SME, but not SME2.
Is there a unit test we can run for x16-pack-lh microkernels?
This is the KleidiAI naming convention, so not something that needs to be addressed here.
```c
#endif  // XNN_ENABLE_KLEIDIAI
}

size_t xnn_x16_pack_lh_offset__igemm_neonsme2(size_t m, size_t kc, size_t ks,
```
'neon' should be removed from these function/file ISA names. In this case, the function it calls is _sme, so cpuinfo should detect SME (1.0), and typically the compiler would need to support SME, but not SME2, to build these microkernels.
This is OK as long as it follows the naming convention for the other pack-lh kernels, which it does.
src/reference/packing.cc (Outdated)
```c
  }
}

void transpose_weights_x16(const uint16_t* in, uint16_t* out, size_t height,
```
Is this 'x16' meant to be 16 bits per element? If so, it should be at the start of the function/file name: x16_transpose_weights().
Also, unless you're proposing a public API for transpose, this should be static or inlined into the calling function.
I think this function is meant to be internal-only.
Yeah, that seemed to be the intention, so I have made it static.
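As a rough illustration, a static internal-only transpose might look like the sketch below (the signature follows the diff above; the body is an assumption, not the PR's actual implementation):

```c
#include <stddef.h>
#include <stdint.h>

// Transposes a height x width matrix of 16-bit elements (row-major in,
// row-major out). `static` keeps it internal to packing.cc.
static void transpose_weights_x16(const uint16_t* in, uint16_t* out,
                                  size_t height, size_t width) {
  for (size_t h = 0; h < height; h++) {
    for (size_t w = 0; w < width; w++) {
      // Element (h, w) of the input lands at (w, h) in the output.
      out[w * height + h] = in[h * width + w];
    }
  }
}
```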
src/subgraph.c (Outdated)
```c

    case xnn_node_type_convolution_2d:
    case xnn_node_type_deconvolution_2d: {

```
remove blank line
Please run clang-format using this project's config on all modified files.
clang-format was run against all the files in the latest push.
```c
    size_t mr_packed, size_t kr,
    size_t sr) {
#if XNN_ENABLE_KLEIDIAI
  assert(kr == 2);
```
The kai function appears to be for x16 (NR=16)?
I see kai struggles with the overuse of 'x' just like XNNPack :-)
One of the x16s appears to be bits per element; in XNNPack that goes first in the name. The x in pack functions comes from MRxNR, but MR is not applicable here, so x16 is NR=16. We also used to have x2 meaning unrolled by 2, but renamed that to u2.
Anyway, the kai name implies a specific tiling. In packw microkernels it is the same: the function/file are specific to a set of parameters (NR, KR, SR), and the function is specialized and expected to fail if called with anything else. If kai is the same, then beyond the assert for KR=2 (ah, x2; we would call that c2), the assert should check all applicable parameters, but especially NR.
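To make the suggestion concrete, a minimal sketch of checking all applicable parameters (the exact values for nr and sr are assumptions inferred from the kai name, not verified against the kai kernel):

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical hardened check: the kai packing routine is specialized
// for one tiling, so assert every parameter it is tied to, not just kr.
static void check_pack_lh_params(size_t nr, size_t kr, size_t sr) {
  assert(nr == 16);  // the kai name's x16 implies NR == 16 (assumed)
  assert(kr == 2);   // ...2vlx2 implies KR == 2; c2 in XNNPACK terms
  assert(sr == 1);   // no split-row packing for this variant (assumed)
}
```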
KleidiAI has very extensive documentation of their type naming conventions on their GitLab/GitHub pages.
Sorry for the delay in reviewing this!
I think it's really important to get all the microkernel and subgraph tests hooked up to make sure that they all pass.
src/configs/pack-lh-config.c (Outdated)
```c
  x16_igemm_pack_lh_config.log2_input_element_size = 0;
  x16_igemm_pack_lh_config.log2_packed_element_size = 0;
```
These values should be 1 since fp16 is 1 << 1 bytes.
Updated to the correct values.
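So the corrected assignments read (a minimal sketch based on the hunk above):

```c
// fp16 elements are 2 bytes, i.e. 1 << 1, so both log2 sizes are 1.
x16_igemm_pack_lh_config.log2_input_element_size = 1;
x16_igemm_pack_lh_config.log2_packed_element_size = 1;
```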
src/operators/convolution-nhwc.c (Outdated)
```c
  return status;
}

enum xnn_status xnn_create_convolution2d_nhwc_pf16(
```
This function and xnn_create_convolution2d_nhwc_f16 (below it) differ only in the gemm_config and operator_type.
Could you create a single create_convolution2d_nhwc_f16 that takes the gemm_config and operator_type as an argument, and have this function and xnn_create_convolution2d_nhwc_f16 call it?
Added a single common creation function, create_convolution2d_nhwc_f16, which takes the gemm config and op type as you suggested.
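As an illustration of the factoring, here is a toy sketch in C (all names below are hypothetical stand-ins, not the actual XNNPACK signatures): one static helper holds the shared creation logic, and the two public entry points differ only in the config and operator type they pass.

```c
#include <stdio.h>

typedef struct { const char* name; } gemm_config_t;            // hypothetical
typedef enum { OP_CONV_NHWC_F16, OP_CONV_NHWC_PF16 } op_type_t; // hypothetical

// Shared helper: all common validation and operator setup lives here.
static int create_conv2d_nhwc_f16_common(const gemm_config_t* cfg,
                                         op_type_t type) {
  printf("creating operator %d with gemm config %s\n", (int) type, cfg->name);
  return 0;
}

// The public creators become thin wrappers over the shared helper.
int create_conv2d_nhwc_f16(void) {
  static const gemm_config_t cfg = {"f16"};
  return create_conv2d_nhwc_f16_common(&cfg, OP_CONV_NHWC_F16);
}

int create_conv2d_nhwc_pf16(void) {
  static const gemm_config_t cfg = {"pf16"};
  return create_conv2d_nhwc_f16_common(&cfg, OP_CONV_NHWC_PF16);
}
```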
```c
      "`XNN_ENABLE_KLEIDIAI`." &&
      0);
#endif  // XNN_ENABLE_KLEIDIAI
}
```
(no newline at end of file)
Please add a final newline.
Newline added in the latest push.
src/xnnpack/gemm.h (Outdated)
```c
    xnn_pqs8_qc8w_igemm_minmax_fp32_ukernel_32x32c4__neonsme2)


#define DECLARE_PF16_F16_PACKED_IGEMM_MINMAX_UKERNEL_FUNCTION(fn_name) \
```
Please fix formatting.
src/xnnpack/pack.h (Outdated)
```c
    size_t k_stride,  //
    size_t extra_bytes);

XNN_INTERNAL void xnn_pack_kai_f16_conv_goki_w_sme2(
```
Please fix formatting.
test/gemm-microkernel-tester.cc (Outdated)
```cpp
  xnnpack::Buffer<xnn_float16> c_ref(m() * n(), 0, /*extra_bytes=*/{0},
                                     "c_ref");

#if 0
```
Does this test pass?
test/gemm-microkernel-tester.h (Outdated)
```cpp
    xnn_pack_weights_and_biases_fn pack,
    xnn_packed_stride_weights_and_biases_fn packed_stride);

void Test_PF16(xnn_packed_f16_lhs_igemm_ukernel_fn packed_igemm,
```
For this function to be called, you need to add your new kernel to the corresponding .yaml files and likely also add a few lines to the test generator script (see e.g. XNNPACK/test/qp8-f32-qc8w-gemm-minmax.cc).
Added the .yaml files and updated the tests; they should now run and pass.
Signed-off-by: Gian Marco Iodice <[email protected]>
@gmiodice It looks like this needs conflicts resolved. @gonnet @fbarchard, can you please follow up? Your comments have been responded to.
```python
xnnpack_unit_test(
    name = "pf16_f16_igemm_minmax_test",
    srcs = [
        "pf16-f16-igemm-minmax.cc",
```
This also needs to be added to the corresponding CMakeLists.txt file.
```c
}

size_t xnn_packed_size_kai_f16_conv_goki_w(size_t nc, size_t ks, size_t kc) {
#if XNN_ENABLE_KLEIDIAI
```
I think you are already inside an #if XNN_ENABLE_KLEIDIAI, so the #else below is not needed.
True, removed the additional #if
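For illustration, the redundancy looks roughly like this (the function body and the kai_packed_size helper are hypothetical; only the guard structure is taken from the review):

```c
#include <stddef.h>

size_t kai_packed_size(size_t nc, size_t ks, size_t kc);  // hypothetical

#if XNN_ENABLE_KLEIDIAI  // outer guard already covers the whole body

size_t xnn_packed_size_kai_f16_conv_goki_w(size_t nc, size_t ks, size_t kc) {
#if XNN_ENABLE_KLEIDIAI  // redundant inner check: always true here,
  return kai_packed_size(nc, ks, kc);
#else                    // so this branch is dead and can be dropped
  return 0;
#endif
}

#endif  // XNN_ENABLE_KLEIDIAI
```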
```c

  for (size_t g_idx = 0; g_idx < g; ++g_idx) {
    transpose_weights_x16(k, tmp_data, nc, ks * kc);
    kai_run_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme(
```
The answer to both really would be to have a new function variant, which would require the KleidiAI team to implement new function versions and provide them in a release. The majority of their function signatures are similar in terms of the bias, so I suspect they would hesitate to remove it.
Would it be sufficient for the function to handle b == NULL as if it were full of zeros? That would not require a change in the API.
But as for the transpose: this variant of the pack kernel is kxn; if we had an nxk version we could remove the manual transpose, but for right now we will need it.
OK, could you add a TODO to remind us? Thanks!
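A sketch of the b == NULL idea on the XNNPACK side, with the requested TODO (the helper and the zero-scratch approach are assumptions, not KleidiAI API behavior):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical caller-side workaround: if the bias pointer is NULL,
// substitute a zero-filled scratch buffer so a kai pack function whose
// signature requires a bias behaves as if there were none.
// TODO: drop the manual transpose once an nxk variant of the kai
// rhs_imatmul pack kernel is available.
static const uint16_t* bias_or_zeros(const uint16_t* b, size_t nc,
                                     uint16_t* scratch) {
  if (b != NULL) return b;
  memset(scratch, 0, nc * sizeof(uint16_t));  // zeros act as "no bias"
  return scratch;
}
```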