@jeffbolznv (Collaborator)
Change `ggml_vk_mul_mat_vec_id_q_f16` to loop over the batch dimension and update the indexing calculations in `get_offsets`.

Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this change the cost is linear in n, rather than n > 1 being far slower than n == 1.
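To make the looping idea concrete, here is a minimal host-side sketch. All the names (`DispatchFn`, the base/stride parameters) are hypothetical, not the real ggml-vulkan helpers; this is the shape of the change, not the actual `ggml_vk_mul_mat_vec_id_q_f16`/`get_offsets` code:

```cpp
#include <cstddef>
#include <cstdint>

// Hedged sketch: dispatch the mat-vec shader once per batch column
// instead of routing n > 1 to the mat-mat path. Names and signatures
// are illustrative only.
using DispatchFn = void (*)(size_t src0_off, size_t src1_off,
                            size_t ids_off, size_t dst_off);

void mul_mat_vec_id_batched(DispatchFn dispatch, uint32_t n,
                            size_t src0_off,
                            size_t src1_base, size_t src1_col_stride,
                            size_t ids_base,  size_t ids_col_stride,
                            size_t dst_base,  size_t dst_col_stride) {
    // One mat-vec dispatch per batch column: the cost grows linearly in
    // n, at the price of re-reading the weights each iteration (no
    // reuse as in the non-ID path). The expert-ID selection is per
    // token, so its offset advances per column as well; in the real
    // code this indexing lives in get_offsets.
    for (uint32_t col = 0; col < n; ++col) {
        dispatch(src0_off,
                 src1_base + col * src1_col_stride,
                 ids_base  + col * ids_col_stride,
                 dst_base  + col * dst_col_stride);
    }
}
```

Before the change, n > 1 went through the mat-mat path instead, which is why n=4 was roughly 14x slower than n=1 rather than roughly 4x.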

Perf on 5090:

`test-backend-ops.exe perf -o MUL_MAT_ID -p mxfp4`

before:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                75400 runs -    13.37 us/run -  66.36 MFLOP/run -   4.96 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                 5278 runs -   190.94 us/run - 265.42 MFLOP/run -   1.39 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                 4347 runs -   239.34 us/run - 530.84 MFLOP/run -   2.22 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1878 runs -   533.28 us/run -  33.97 GFLOP/run -  63.71 TFLOPS

after:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                73892 runs -    13.57 us/run -  66.36 MFLOP/run -   4.89 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                21112 runs -    48.15 us/run - 265.42 MFLOP/run -   5.51 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                10395 runs -    96.23 us/run - 530.84 MFLOP/run -   5.52 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1941 runs -   515.45 us/run -  33.97 GFLOP/run -  65.91 TFLOPS
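The after numbers show the near-linear scaling: n=4 runs in 48.15 us, about 3.5x the 13.57 us of n=1, and n=8 takes roughly twice the n=4 time, whereas before the change n=4 was about 14x slower than n=1.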

This came up when testing #18892. There are small batches that dominate the runtime:

before:

  MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71436 x 218.869 us = 1.56352e+07 us (1212.48 GFLOPS/s)

after:

  MUL_MAT_ID_ADD_ID MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71508 x 53.676 us = 3.83833e+06 us (4943.93 GFLOPS/s)

That test also shows a lot of time spent in small batches of flash attention (FA), but I'll fix that separately.
