
Conversation

@jeffbolznv (Collaborator)
Just need to add the fusion detection logic; this is a combination of existing modes (early softmax, bias, norm, scale) and is covered by the existing backend tests.
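For context, here is a minimal, self-contained sketch of what pattern-based fusion detection in a backend graph pass can look like. All types and names below (`Node`, `can_fuse_scale_softmax`, `fuse_pass`) are hypothetical illustrations, not the actual ggml/ggml-vulkan code; the real change combines existing fusion modes (early softmax, bias, norm, scale) rather than introducing a new pass.

```cpp
// Hypothetical sketch of pattern-based op fusion detection. None of these
// types are ggml's; they only illustrate scanning a topologically ordered
// graph for an adjacent op pair (here SCALE -> SOFT_MAX) and replacing it
// with one fused node when the intermediate result has no other consumers.
#include <cstddef>
#include <vector>

enum class Op { Scale, SoftMax, FusedScaleSoftMax, Other };

struct Node {
    Op op = Op::Other;
    Node* src = nullptr;   // single producer, for simplicity
    int n_consumers = 0;   // how many nodes read this node's output
};

// True if nodes[i] and nodes[i+1] form a fusable SCALE -> SOFT_MAX pair:
// the softmax must read the scale's output, and nothing else may read it.
static bool can_fuse_scale_softmax(const std::vector<Node*>& nodes, size_t i) {
    if (i + 1 >= nodes.size()) return false;
    Node* a = nodes[i];
    Node* b = nodes[i + 1];
    return a->op == Op::Scale &&
           b->op == Op::SoftMax &&
           b->src == a &&
           a->n_consumers == 1;   // intermediate is not used elsewhere
}

// Walk the node list; when a fusable pair is found, emit one fused node
// (reusing the consumer) and skip the producer that was folded in.
static std::vector<Node*> fuse_pass(std::vector<Node*>& nodes) {
    std::vector<Node*> out;
    for (size_t i = 0; i < nodes.size(); ++i) {
        if (can_fuse_scale_softmax(nodes, i)) {
            nodes[i + 1]->op  = Op::FusedScaleSoftMax;
            nodes[i + 1]->src = nodes[i]->src;  // read the original input
            out.push_back(nodes[i + 1]);
            ++i;  // skip the folded-in scale node
        } else {
            out.push_back(nodes[i]);
        }
    }
    return out;
}

int main() {
    Node x;                        // some upstream tensor
    Node s{Op::Scale, &x, 1};      // scale(x), read only by the softmax
    Node m{Op::SoftMax, &s, 1};    // softmax(scale(x))
    std::vector<Node*> g = {&s, &m};
    return fuse_pass(g).size() == 1 ? 0 : 1;  // pair collapses to one node
}
```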

before:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -r 10 -fa 1 -p 512 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8434.22 ± 37.67 |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           tg128 |       185.26 ± 16.12 |
```

after:

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -r 10 -fa 1 -p 512 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8504.07 ± 57.02 |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           tg128 |       206.38 ± 16.00 |
```
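From these numbers, the fused path is roughly a 0.8% prompt-processing gain (8504.07 / 8434.22 ≈ 1.008) and about an 11% token-generation gain (206.38 / 185.26 ≈ 1.114), though the tg128 standard deviations (±16 t/s on both runs) overlap substantially.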

@jeffbolznv (Collaborator, Author)

This may not be needed after #18980; I'll check once that lands.
