Misc. bug: GLM-4.7-Flash Inference Much slower as compared to other A3B Models #19081

@engrtipusultan

Name and Version

$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 7822 (8f91ca5)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench, llama-server

Command line

./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1

Problem description & steps to reproduce

GLM-4.7-Flash was just released, and many issues have been raised around it. On my build the model gives good output, but its TG and PP throughput is much lower than that of other A3B models.
Its TG and PP also drop too much as the context grows.
Since the MLA PR #18986 has been merged, I am not sure whether any further PR is intended to speed up the model.

I would like to know whether this slowdown is due to the model's architecture, whether the backends used on my machine need an implementation update, or whether something will be improved in llama.cpp to make inference with this model faster.

I have mesa vulkan for GPU inference and AOCL BLAS installed for anything that goes to CPU.

AOCL: aocl-linux-gcc-5.2.0
VULKAN SDK: LunarG 1.4.335.0
VULKAN Driver: mesa-vulkan-drivers:amd64 25.3.3kisak1n amd64

Make options used:

cmake -S "$SOURCE_DIR" -B "$BUILD_DIR" -DGGML_NATIVE=ON -DSD_VULKAN=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=AOCL -DBLAS_INCLUDE_DIRS=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc/include -DCMAKE_PREFIX_PATH=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc
cmake --build "$BUILD_DIR" --config Release --clean-first --parallel $(nproc)

llama-bench build: 8f91ca5 (7822)

FA = off

| Test (t/s)    | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| ------------- | --------------------: | ----------------------------: | --------------------: | ----------------: |
| pp512         |         147.08 ± 1.49 |                 113.02 ± 0.32 |         119.46 ± 0.18 |     102.81 ± 0.31 |
| tg128         |          16.17 ± 0.00 |                  12.05 ± 0.01 |          12.95 ± 0.01 |      10.77 ± 0.00 |
| pp512 @ d1024 |         136.19 ± 1.73 |                 111.24 ± 0.13 |         105.93 ± 0.34 |      86.65 ± 0.31 |
| tg128 @ d1024 |          15.78 ± 0.03 |                  11.84 ± 0.01 |          12.06 ± 0.06 |       7.29 ± 0.05 |
| pp512 @ d2048 |         128.45 ± 1.21 |                 108.86 ± 0.40 |          94.63 ± 0.51 |      73.20 ± 0.48 |
| tg128 @ d2048 |          15.20 ± 0.03 |                  11.50 ± 0.00 |          11.23 ± 0.00 |       5.28 ± 0.03 |
| pp512 @ d8096 |          95.64 ± 0.76 |                  98.47 ± 0.93 |          56.28 ± 0.18 |      38.71 ± 0.02 |
| tg128 @ d8096 |          12.28 ± 0.01 |                   9.17 ± 0.05 |           5.89 ± 0.05 |       2.19 ± 0.02 |

FA = on

| Test (t/s)    | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| ------------- | --------------------: | ----------------------------: | --------------------: | ----------------: |
| pp512         |         146.69 ± 0.87 |                 112.54 ± 0.65 |         114.26 ± 0.87 |      86.09 ± 0.12 |
| tg128         |          16.64 ± 0.01 |                  12.12 ± 0.01 |          13.39 ± 0.01 |      10.97 ± 0.01 |
| pp512 @ d1024 |         132.76 ± 0.39 |                 107.09 ± 0.32 |          77.43 ± 0.10 |      50.39 ± 0.10 |
| tg128 @ d1024 |          16.36 ± 0.08 |                  12.05 ± 0.01 |          12.29 ± 0.00 |       9.76 ± 0.01 |
| pp512 @ d2048 |         120.38 ± 0.10 |                 101.26 ± 0.28 |          55.47 ± 0.35 |      35.40 ± 0.02 |
| tg128 @ d2048 |          16.11 ± 0.08 |                  11.98 ± 0.00 |          11.66 ± 0.01 |       8.79 ± 0.00 |
| pp512 @ d8096 |          77.32 ± 0.34 |                  77.85 ± 0.48 |          20.76 ± 0.17 |      12.94 ± 0.01 |
| tg128 @ d8096 |          14.91 ± 0.01 |                  11.52 ± 0.00 |           8.92 ± 0.00 |       5.58 ± 0.00 |
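
To make the "drops too much as context grows" point easier to compare across models, here is a small sketch that computes how much of the depth-0 tg128 throughput each model retains at depth 8096. The numbers are the mean tg128 values copied from the two tables above; the names `TG128`, `retention_pct` and `retention` are only illustrative and not part of llama.cpp.

```python
# Mean tg128 throughput in tokens/s, copied from the FA = off and FA = on tables.
# Each pair is (value at depth 0, value at depth 8096).
TG128 = {
    "gpt-oss 20B MXFP4 MoE": {"fa_off": (16.17, 12.28), "fa_on": (16.64, 14.91)},
    "nemotron_h_moe 31B.A3.5B Q8_0": {"fa_off": (12.05, 9.17), "fa_on": (12.12, 11.52)},
    "qwen3moe 30B.A3B Q8_0": {"fa_off": (12.95, 5.89), "fa_on": (13.39, 8.92)},
    "GLM4.7 Flash Q8_0": {"fa_off": (10.77, 2.19), "fa_on": (10.97, 5.58)},
}


def retention_pct(at_d0: float, at_depth: float) -> float:
    """Percentage of the depth-0 throughput that is still available at depth."""
    return 100.0 * at_depth / at_d0


# Per-model retention for both flash-attention settings.
retention = {
    model: {fa: retention_pct(*pair) for fa, pair in runs.items()}
    for model, runs in TG128.items()
}

for model, r in retention.items():
    print(f"{model:32s} FA off: {r['fa_off']:5.1f}%   FA on: {r['fa_on']:5.1f}%")
```

With these figures GLM-4.7-Flash keeps roughly 20% of its depth-0 token generation speed at depth 8096 with FA off and roughly 51% with FA on, while qwen3moe keeps roughly 45% and 67% respectively.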

First Bad Commit

None

Relevant log output

Logs
$ ./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | threads | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        147.08 ± 1.49 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         16.17 ± 0.00 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        136.19 ± 1.73 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         15.78 ± 0.03 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |        128.45 ± 1.21 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         15.20 ± 0.03 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         95.64 ± 0.76 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |         12.28 ± 0.01 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        146.69 ± 0.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         16.64 ± 0.01 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |        132.76 ± 0.39 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         16.36 ± 0.08 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |        120.38 ± 0.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         16.11 ± 0.08 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         77.32 ± 0.34 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |         14.91 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        119.46 ± 0.18 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         12.95 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        105.93 ± 0.34 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         12.06 ± 0.06 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |         94.63 ± 0.51 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         11.23 ± 0.00 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         56.28 ± 0.18 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          5.89 ± 0.05 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        114.26 ± 0.87 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         13.39 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |         77.43 ± 0.10 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         12.29 ± 0.00 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |         55.47 ± 0.35 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         11.66 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         20.76 ± 0.17 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |          8.92 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        113.02 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        111.24 ± 0.13 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         11.84 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |        108.86 ± 0.40 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         11.50 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         98.47 ± 0.93 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          9.17 ± 0.05 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        112.54 ± 0.65 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         12.12 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |        107.09 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |        101.26 ± 0.28 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         11.98 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         77.85 ± 0.48 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |         11.52 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        102.81 ± 0.31 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         10.77 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |         86.65 ± 0.31 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |          7.29 ± 0.05 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |         73.20 ± 0.48 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |          5.28 ± 0.03 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         38.71 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          2.19 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |         86.09 ± 0.12 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         10.97 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |         50.39 ± 0.10 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |          9.76 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |         35.40 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |          8.79 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         12.94 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |          5.58 ± 0.00 |

build: 8f91ca54e (7822)
