@zhang-hui-yulo
Contributor

@zhang-hui-yulo zhang-hui-yulo commented Jan 17, 2026

Add mmf for CDNA. Tests pass on CDNA3; it would be very helpful if anyone could test it on CDNA2 and CDNA1. Thank you.

  • Refactor mmf to take rows_per_block as an input parameter.
  • Pass MUL_MAT and MUL_MAT_ID.
    - [x] Extend tile size to support shared memory loading 128.
  • Perf tuning.

Perf data is attached. It looks like MUL_MAT currently does not reach mmf on either CDNA or RDNA; I am not sure why.

MUL_MAT_ID
Backend GGML op Op parameters TFLOPS master TFLOPS mmf_for_cdna Speedup
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.33 2.77 0.83
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 0.25 17.76 72.16
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 0.49 29.49 60.43
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 0.07 6.05 84.84
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.03 3.45 98.80
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 0.96 30.28 31.42
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 0.13 9.92 79.10
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 0.04 4.64 107.09
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.99 2.99 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 1.21 39.38 32.48
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 2.39 2.56 1.07
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 0.33 15.21 46.78
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 0.09 3.78 44.37
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 4.38 4.38 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 0.65 28.57 44.03
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 0.13 6.36 49.30
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 1.84 1.98 1.07
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 2.42 11.73 4.84
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 3.48 16.17 4.64
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 0.93 3.67 3.96
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.55 2.10 3.80
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 5.13 19.87 3.87
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 1.41 6.26 4.46
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 0.52 2.14 4.08
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.03 2.03 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 4.48 17.26 3.86
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 5.31 5.28 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 2.64 6.11 2.32
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 0.92 2.31 2.51
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 10.29 10.28 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 4.89 12.00 2.46
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 1.18 3.37 2.86
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 2.66 2.68 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 7.97 8.00 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 14.16 14.17 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 4.76 4.81 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.81 0.85 1.04
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 26.72 26.77 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 7.20 7.17 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 1.58 1.60 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.97 2.99 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 15.66 15.73 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 24.48 24.48 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 9.04 8.91 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 1.51 1.51 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 46.58 46.75 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 13.85 13.83 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 2.95 2.97 1.01
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880 5.60 5.70 1.02
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880 3.38 3.41 1.01
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880 71.06 70.96 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880 5.31 5.37 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.50 3.47 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 12.20 12.20 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 22.04 22.08 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.61 7.67 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.40 1.40 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 40.82 40.87 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 10.80 10.80 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.64 2.61 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.88 3.93 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 22.78 22.87 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 37.29 37.45 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 13.73 13.81 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.36 2.36 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 70.28 70.30 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 20.29 20.17 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 4.54 4.57 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.27 3.23 0.99
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 13.71 13.74 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 24.53 24.58 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.90 7.98 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.54 1.50 0.97
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 44.81 45.02 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 11.25 11.25 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.88 2.83 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.65 3.70 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 25.52 25.88 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 41.16 41.68 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 14.44 14.43 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.53 2.56 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 77.15 77.76 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 21.01 21.08 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 5.00 4.92 0.98
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.05 3.09 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 7.70 7.71 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 13.75 13.75 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 4.97 5.12 1.03
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.00 1.04 1.04
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 26.03 26.04 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 6.78 6.79 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 1.96 1.99 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.44 3.47 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 14.95 14.99 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 23.93 23.99 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 9.41 9.47 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 1.81 1.82 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 45.48 45.56 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 12.91 12.92 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 3.38 3.43 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.35 3.33 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 12.86 12.89 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 23.54 23.60 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.55 7.53 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.63 1.38 0.85
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 42.28 42.16 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 10.86 10.71 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.64 2.59 0.98
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.65 3.69 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 24.15 24.22 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 40.83 41.04 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 13.49 13.31 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.31 2.30 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 75.89 76.14 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 20.10 19.98 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 4.44 4.41 0.99
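
For readers who want to check rows of the table above, here is a minimal Python sketch that splits one flattened row into fields and recomputes the speedup. It assumes the Speedup column is the ratio of the mmf_for_cdna TFLOPS to the master TFLOPS; that is inferred from the header, not stated in the PR. The names `parse_row` and `recomputed_speedup` are hypothetical helpers for illustration. Because the table shows rounded TFLOPS, a ratio recomputed from them can differ slightly from the reported value, especially where the master number is very small.

```python
def parse_row(line):
    """Split one flattened benchmark row into named fields.

    Assumed layout (from the table header): backend, op, parameters,
    TFLOPS master, TFLOPS mmf_for_cdna, reported speedup.
    """
    parts = line.split()
    backend, op, params = parts[0], parts[1], parts[2]
    master, branch, reported = (float(x) for x in parts[3:6])
    return {
        "backend": backend,
        "op": op,
        # "type_a=f16,type_b=f32,..." becomes {"type_a": "f16", ...}
        "params": dict(kv.split("=") for kv in params.split(",")),
        "master": master,
        "branch": branch,
        "reported": reported,
    }


def recomputed_speedup(row):
    """Ratio of branch TFLOPS to master TFLOPS (assumed definition)."""
    return row["branch"] / row["master"]


# One row copied from the table above.
row = parse_row(
    "ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,"
    "b=0,m=1792,n=128,k=2048 1.21 39.38 32.48"
)
print(row["params"]["n"], recomputed_speedup(row), row["reported"])
```

The same helper can be applied line by line to the whole table to list the rows where the two branches actually differ.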

@github-actions github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Jan 17, 2026
@zhang-hui-yulo
Contributor Author

Added an LDS 128 version; performance is similar to the LDS 64 version.

MUL_MAT_ID
Backend GGML op Op parameters TFLOPS master TFLOPS mmf_for_cdna Speedup
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.22 3.23 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 0.25 18.02 73.24
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 0.49 29.25 59.90
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 0.07 6.63 95.51
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.04 3.52 91.41
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 0.96 30.93 32.14
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 0.12 10.09 81.15
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 0.04 4.31 95.76
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.91 3.05 1.05
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 1.29 37.26 28.80
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 2.39 2.55 1.07
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 0.32 14.67 45.18
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 0.09 3.73 40.97
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 4.38 4.37 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 0.65 27.81 42.80
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 0.12 5.71 46.41
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 1.82 1.96 1.07
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 2.43 11.91 4.89
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 3.54 16.88 4.77
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 0.88 3.46 3.92
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.52 1.91 3.69
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 5.12 18.71 3.66
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 1.38 6.17 4.47
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 0.52 2.18 4.21
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.00 2.02 1.01
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 4.83 18.71 3.87
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 5.33 5.30 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 2.66 6.08 2.28
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 1.05 2.16 2.06
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 10.28 10.40 1.01
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 4.90 11.87 2.42
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 1.32 3.43 2.60
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 2.66 2.68 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 7.66 7.65 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 13.67 13.65 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 4.64 4.62 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 0.85 0.82 0.96
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 25.82 25.78 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 6.90 6.92 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 1.59 1.61 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 2.96 2.97 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 15.04 15.02 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 24.08 23.91 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 8.60 8.54 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 1.51 1.51 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 45.62 45.53 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 13.27 13.21 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 2.96 2.97 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880 5.45 5.52 1.01
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880 2.80 2.79 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880 69.17 68.95 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880 5.14 5.25 1.02
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.50 3.54 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 11.73 11.66 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 21.36 21.16 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.41 7.24 0.98
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.40 1.40 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 39.64 39.43 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 10.42 10.49 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.62 2.65 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.91 3.91 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 22.08 21.87 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 36.57 36.23 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 13.07 13.16 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.36 2.36 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 68.45 68.32 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 19.56 19.52 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 4.57 4.51 0.99
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.31 3.30 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 13.21 13.13 0.99
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 23.82 23.68 0.99
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.76 7.59 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.47 1.59 1.08
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 43.51 43.39 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 10.93 10.85 0.99
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.84 2.95 1.04
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.69 3.69 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 24.56 24.67 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 40.59 40.52 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 13.95 14.01 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.58 2.59 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 75.84 74.57 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 20.48 20.46 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 4.95 4.91 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.11 3.11 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 7.39 7.37 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 13.35 13.30 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 4.89 4.89 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.10 1.00 0.92
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 25.25 25.20 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 6.49 6.49 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 1.98 1.95 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.41 3.42 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 14.35 14.27 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 23.48 23.29 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 9.00 9.11 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 1.79 1.83 1.02
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 44.50 44.44 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 12.44 12.48 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 3.45 3.31 0.96
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048 3.36 3.37 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048 12.35 12.27 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048 22.66 22.59 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048 7.33 7.42 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048 1.37 1.36 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048 40.86 40.78 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048 10.40 10.41 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048 2.59 2.57 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048 3.77 3.67 0.97
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048 23.31 23.11 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048 40.30 39.89 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048 12.75 12.70 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048 2.28 2.33 1.02
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048 74.69 74.06 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048 19.34 19.29 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048 4.42 4.46 1.01
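
One way to compare the two runs is a geometric mean of the per-shape TFLOPS ratios. The sketch below does this in Python for the f16, n_mats=128 rows only, using the mmf_for_cdna values copied from the two tables in this thread. It assumes the first table corresponds to the LDS 64 version and the second to the LDS 128 version, which the thread implies but does not state outright. The dictionary names and `geomean_ratio` are hypothetical. The two tables come from separate runs, so small differences may be run-to-run noise.

```python
import math

# mmf_for_cdna TFLOPS keyed by n, for type_a=f16, n_mats=128, m=768, k=2048.
# Values are copied from the two tables; the LDS 64 / LDS 128 labels are an assumption.
lds64 = {1: 2.77, 4: 3.45, 8: 4.64, 32: 6.05,
         64: 9.92, 128: 17.76, 256: 29.49, 512: 30.28}
lds128 = {1: 3.23, 4: 3.52, 8: 4.31, 32: 6.63,
          64: 10.09, 128: 18.02, 256: 29.25, 512: 30.93}


def geomean_ratio(new, old):
    """Geometric mean of new[n] / old[n] over the shapes present in old."""
    logs = [math.log(new[n] / old[n]) for n in sorted(old)]
    return math.exp(sum(logs) / len(logs))


print(geomean_ratio(lds128, lds64))
```

A geometric mean is used here because the ratios span shapes with very different absolute throughput; an arithmetic mean of ratios would weight gains and losses asymmetrically.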

@zhang-hui-yulo zhang-hui-yulo marked this pull request as ready for review January 19, 2026 03:10
@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Jan 19, 2026

Hello @JohannesGaessler ,

I just reverted the LDS 128 version because it gave little perf improvement. This PR is ready for review but still needs some basic tuning.

As I only have limited time on the MI300X, I need to settle on the test models first. Are llama-8b and granite-3.1-1b enough? Thank you.

I have only tested on CDNA3. Could you run a test on your CDNA2 and CDNA1 hardware if possible? If you don't have the resources, I will enable it only on CDNA3.

Best Regards
Hui
