Misc. bug: GLM-4.7-Flash Inference Much slower as compared to other A3B Models

### Name and Version

bash  ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 7822 (8f91ca54e)
built with GNU 13.3.0 for Linux x86_64

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-bench, llama-server

### Command line

```shell
./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1
```

### Problem description & steps to reproduce

GLM-4.7-Flash was just released, and many issues have been raised around it. On my shared build the model gives good output, but its TG and PP throughput is much lower than that of the other A3B models.
Its TG and PP also drop much more sharply as the context grows: with FA off, TG falls from 10.77 t/s at depth 0 to 2.19 t/s at depth 8096.
Since the MLA PR #18986 is already merged, I am not sure whether any further PR is intended to speed up the model.

I would like to know whether this slowdown is due to the architecture of the model, whether the backends used on my machine need an implementation update, or whether something will be improved in llama.cpp to make inference with this model faster.

I use Mesa Vulkan (RADV) for GPU inference and have AOCL BLAS installed for anything that runs on the CPU.

**AOCL:** aocl-linux-gcc-5.2.0
**VULKAN SDK:** LunarG 1.4.335.0
**VULKAN Driver:** mesa-vulkan-drivers:amd64 25.3.3~kisak1~n amd64

CMake options used:

```
cmake -S "$SOURCE_DIR" -B "$BUILD_DIR" -DGGML_NATIVE=ON -DSD_VULKAN=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=AOCL -DBLAS_INCLUDE_DIRS=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc/include -DCMAKE_PREFIX_PATH=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc
cmake --build "$BUILD_DIR" --config Release --clean-first --parallel $(nproc)
```

llama-bench build: 8f91ca54e (7822)

FA = off

| Test (t/s)    | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| ------------- | --------------------- | ----------------------------- | --------------------- | ----------------- |
| pp512         | 147.08 ± 1.49         | 113.02 ± 0.32                 | 119.46 ± 0.18         | 102.81 ± 0.31     |
| tg128         | 16.17 ± 0.00          | 12.05 ± 0.01                  | 12.95 ± 0.01          | 10.77 ± 0.00      |
| pp512 @ d1024 | 136.19 ± 1.73         | 111.24 ± 0.13                 | 105.93 ± 0.34         | 86.65 ± 0.31      |
| tg128 @ d1024 | 15.78 ± 0.03          | 11.84 ± 0.01                  | 12.06 ± 0.06          | 7.29 ± 0.05       |
| pp512 @ d2048 | 128.45 ± 1.21         | 108.86 ± 0.40                 | 94.63 ± 0.51          | 73.20 ± 0.48      |
| tg128 @ d2048 | 15.20 ± 0.03          | 11.50 ± 0.00                  | 11.23 ± 0.00          | 5.28 ± 0.03       |
| pp512 @ d8096 | 95.64 ± 0.76          | 98.47 ± 0.93                  | 56.28 ± 0.18          | 38.71 ± 0.02      |
| tg128 @ d8096 | 12.28 ± 0.01          | 9.17 ± 0.05                   | 5.89 ± 0.05           | 2.19 ± 0.02       |
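
To make the "drops too much with context" observation concrete, here is a small sketch that computes how much of the depth-0 throughput each model keeps at depth 8096 with FA off. The numbers are copied from the table above; the short model labels and the helper names (`retention`, `retention_table`) are mine and purely illustrative.

```python
# Throughput (t/s) copied from the FA = off table: (depth 0, depth 8096).
FA_OFF = {
    "gpt-oss 20B":    {"pp": (147.08, 95.64), "tg": (16.17, 12.28)},
    "nemotron_h_moe": {"pp": (113.02, 98.47), "tg": (12.05, 9.17)},
    "qwen3moe 30B":   {"pp": (119.46, 56.28), "tg": (12.95, 5.89)},
    "GLM-4.7-Flash":  {"pp": (102.81, 38.71), "tg": (10.77, 2.19)},
}


def retention(at_depth0: float, at_depth: float) -> float:
    """Percentage of the depth-0 throughput that remains at the deeper context."""
    return 100.0 * at_depth / at_depth0


def retention_table(results: dict) -> dict:
    """Map each model to its pp/tg retention percentage, rounded to one decimal."""
    return {
        model: {test: round(retention(*pair), 1) for test, pair in tests.items()}
        for model, tests in results.items()
    }


if __name__ == "__main__":
    for model, row in retention_table(FA_OFF).items():
        print(f"{model:16s} pp512: {row['pp']:5.1f}%   tg128: {row['tg']:5.1f}%")
```

By this measure GLM-4.7-Flash keeps only about a fifth of its depth-0 TG at depth 8096, while qwen3moe keeps a bit under half and the other two models keep roughly three quarters.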

FA = on

| Test (t/s)    | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
| ------------- | --------------------- | ----------------------------- | --------------------- | ----------------- |
| pp512         | 146.69 ± 0.87         | 112.54 ± 0.65                 | 114.26 ± 0.87         | 86.09 ± 0.12      |
| tg128         | 16.64 ± 0.01          | 12.12 ± 0.01                  | 13.39 ± 0.01          | 10.97 ± 0.01      |
| pp512 @ d1024 | 132.76 ± 0.39         | 107.09 ± 0.32                 | 77.43 ± 0.10          | 50.39 ± 0.10      |
| tg128 @ d1024 | 16.36 ± 0.08          | 12.05 ± 0.01                  | 12.29 ± 0.00          | 9.76 ± 0.01       |
| pp512 @ d2048 | 120.38 ± 0.10         | 101.26 ± 0.28                 | 55.47 ± 0.35          | 35.40 ± 0.02      |
| tg128 @ d2048 | 16.11 ± 0.08          | 11.98 ± 0.00                  | 11.66 ± 0.01          | 8.79 ± 0.00       |
| pp512 @ d8096 | 77.32 ± 0.34          | 77.85 ± 0.48                  | 20.76 ± 0.17          | 12.94 ± 0.01      |
| tg128 @ d8096 | 14.91 ± 0.01          | 11.52 ± 0.00                  | 8.92 ± 0.00           | 5.58 ± 0.00       |
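
The two tables also show that flash attention pulls GLM-4.7-Flash in opposite directions for PP and TG. The sketch below computes the FA-on / FA-off ratio per test from the GLM column of both tables; the values are copied from above, and `fa_ratio` is just an illustrative helper name.

```python
# GLM-4.7-Flash Q8_0 throughput (t/s), copied from the FA = off and FA = on tables.
GLM_FA_OFF = {
    "pp512": 102.81, "tg128": 10.77,
    "pp512 @ d1024": 86.65, "tg128 @ d1024": 7.29,
    "pp512 @ d2048": 73.20, "tg128 @ d2048": 5.28,
    "pp512 @ d8096": 38.71, "tg128 @ d8096": 2.19,
}
GLM_FA_ON = {
    "pp512": 86.09, "tg128": 10.97,
    "pp512 @ d1024": 50.39, "tg128 @ d1024": 9.76,
    "pp512 @ d2048": 35.40, "tg128 @ d2048": 8.79,
    "pp512 @ d8096": 12.94, "tg128 @ d8096": 5.58,
}


def fa_ratio(off: dict, on: dict) -> dict:
    """Ratio of FA-on to FA-off throughput per test (>1 means FA helps)."""
    return {test: round(on[test] / off[test], 2) for test in off}


if __name__ == "__main__":
    for test, ratio in fa_ratio(GLM_FA_OFF, GLM_FA_ON).items():
        print(f"{test:14s} FA on / FA off = {ratio:.2f}x")
```

On this machine FA makes every GLM TG test faster (about 2.5x at depth 8096) and every GLM PP test slower (about a third of the FA-off speed at depth 8096), so neither setting is better for both.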

### First Bad Commit

None

### Relevant log output

<details>
<summary>Logs</summary>


```console
$ ./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | threads | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        147.08 ± 1.49 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         16.17 ± 0.00 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        136.19 ± 1.73 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         15.78 ± 0.03 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |        128.45 ± 1.21 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         15.20 ± 0.03 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         95.64 ± 0.76 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |         12.28 ± 0.01 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        146.69 ± 0.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         16.64 ± 0.01 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |        132.76 ± 0.39 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         16.36 ± 0.08 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |        120.38 ± 0.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         16.11 ± 0.08 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         77.32 ± 0.34 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |         14.91 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        119.46 ± 0.18 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         12.95 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        105.93 ± 0.34 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         12.06 ± 0.06 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |         94.63 ± 0.51 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         11.23 ± 0.00 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         56.28 ± 0.18 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          5.89 ± 0.05 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        114.26 ± 0.87 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         13.39 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |         77.43 ± 0.10 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         12.29 ± 0.00 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |         55.47 ± 0.35 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         11.66 ± 0.01 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         20.76 ± 0.17 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |          8.92 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        113.02 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |        111.24 ± 0.13 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |         11.84 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |        108.86 ± 0.40 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |         11.50 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         98.47 ± 0.93 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          9.17 ± 0.05 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |        112.54 ± 0.65 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         12.12 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |        107.09 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |         12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |        101.26 ± 0.28 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |         11.98 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         77.85 ± 0.48 |
| nemotron_h_moe 31B.A3.5B Q8_0  |  31.27 GiB |    31.58 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |         11.52 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        102.81 ± 0.31 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |           tg128 |         10.77 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d1024 |         86.65 ± 0.31 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d1024 |          7.29 ± 0.05 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d2048 |         73.20 ± 0.48 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d2048 |          5.28 ± 0.03 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   pp512 @ d8096 |         38.71 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          2.19 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |           pp512 |         86.09 ± 0.12 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |           tg128 |         10.97 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d1024 |         50.39 ± 0.10 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d1024 |          9.76 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d2048 |         35.40 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d2048 |          8.79 ± 0.00 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   pp512 @ d8096 |         12.94 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |          5.58 ± 0.00 |

build: 8f91ca54e (7822)

```
</details>
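
For anyone who wants to rebuild the summary tables from the raw log, here is a minimal sketch of a parser for the markdown rows printed by llama-bench. It assumes the column names shown in the log header above (`model`, `fa`, `test`, `t/s`); `parse_bench_rows` and `pivot` are hypothetical helper names, and `SAMPLE` holds a few rows copied from the log. Note that the log labels GLM-4.7-Flash as `deepseek2 ?B Q8_0`.

```python
from collections import defaultdict

# A few rows copied from the log above, including its header and separator.
SAMPLE = """
| model                          |       size |     params | backend    | threads | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  0 |   1 |           pp512 |        147.08 ± 1.49 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |         14.91 ± 0.01 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  0 |   1 |   tg128 @ d8096 |          2.19 ± 0.02 |
| deepseek2 ?B Q8_0              |  29.65 GiB |    29.94 B | Vulkan,BLAS |       8 |  1 |   1 |   tg128 @ d8096 |          5.58 ± 0.00 |
"""


def parse_bench_rows(text: str):
    """Yield (model, fa, test, mean_tps) for each data row of the markdown table."""
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip device info, build line, blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line names the columns
            continue
        if set(cells[0]) <= set("-: "):
            continue  # separator row made of dashes and colons
        row = dict(zip(header, cells))
        mean = float(row["t/s"].split("±")[0])  # drop the "± x.xx" spread
        yield row["model"], int(row["fa"]), row["test"], mean


def pivot(rows, fa: int) -> dict:
    """Group mean t/s as table[test][model] for one flash-attention setting."""
    table = defaultdict(dict)
    for model, row_fa, test, mean in rows:
        if row_fa == fa:
            table[test][model] = mean
    return dict(table)


if __name__ == "__main__":
    rows = list(parse_bench_rows(SAMPLE))
    for fa in (0, 1):
        print(f"FA = {fa}: {pivot(rows, fa)}")
```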

