@jeffbolznv (Collaborator)
Change `ggml_vk_mul_mat_vec_id_q_f16` to loop over the batch dimension and update the indexing calculations in `get_offsets`.

Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this change the cost is linear in n, rather than n > 1 being far slower than n == 1.
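To make the looping idea concrete, here is a minimal host-side sketch. All the names (`DispatchFn`, the base/stride parameters) are hypothetical, not the real ggml-vulkan helpers; this is the shape of the change, not the actual `ggml_vk_mul_mat_vec_id_q_f16`/`get_offsets` code:

```cpp
#include <cstddef>
#include <cstdint>

// Hedged sketch: dispatch the mat-vec shader once per batch column
// instead of routing n > 1 to the mat-mat path. Names and signatures
// are illustrative only.
using DispatchFn = void (*)(size_t src0_off, size_t src1_off,
                            size_t ids_off, size_t dst_off);

void mul_mat_vec_id_batched(DispatchFn dispatch, uint32_t n,
                            size_t src0_off,
                            size_t src1_base, size_t src1_col_stride,
                            size_t ids_base,  size_t ids_col_stride,
                            size_t dst_base,  size_t dst_col_stride) {
    // One mat-vec dispatch per batch column: the cost grows linearly in
    // n, at the price of re-reading the weights each iteration (no
    // reuse as in the non-ID path). The expert-ID selection is per
    // token, so its offset advances per column as well; in the real
    // code this indexing lives in get_offsets.
    for (uint32_t col = 0; col < n; ++col) {
        dispatch(src0_off,
                 src1_base + col * src1_col_stride,
                 ids_base  + col * ids_col_stride,
                 dst_base  + col * dst_col_stride);
    }
}
```

Before the change, n > 1 went through the mat-mat path instead, which is why n=4 was roughly 14x slower than n=1 rather than roughly 4x.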

Perf on 5090:

`test-backend-ops.exe perf -o MUL_MAT_ID -p mxfp4`

before:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                75400 runs -    13.37 us/run -  66.36 MFLOP/run -   4.96 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                 5278 runs -   190.94 us/run - 265.42 MFLOP/run -   1.39 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                 4347 runs -   239.34 us/run - 530.84 MFLOP/run -   2.22 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1878 runs -   533.28 us/run -  33.97 GFLOP/run -  63.71 TFLOPS

after:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                73892 runs -    13.57 us/run -  66.36 MFLOP/run -   4.89 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                21112 runs -    48.15 us/run - 265.42 MFLOP/run -   5.51 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                10395 runs -    96.23 us/run - 530.84 MFLOP/run -   5.52 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1941 runs -   515.45 us/run -  33.97 GFLOP/run -  65.91 TFLOPS
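The after numbers show the near-linear scaling: n=4 runs in 48.15 us, about 3.5x the 13.57 us of n=1, and n=8 takes roughly twice the n=4 time, whereas before the change n=4 was about 14x slower than n=1.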

This came up when testing #18892. There are small batches that dominate the runtime:

before:

  MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71436 x 218.869 us = 1.56352e+07 us (1212.48 GFLOPS/s)

after:

  MUL_MAT_ID_ADD_ID MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71508 x 53.676 us = 3.83833e+06 us (4943.93 GFLOPS/s)

That test also shows a lot of time spent in small batches of flash attention (FA), but I'll fix that separately.
