Description
Name and Version
./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 7822 (8f91ca5)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server
Command line
./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1
Problem description & steps to reproduce
GLM-4.7-Flash was just released, and many issues have been raised around it. On my build the model gives good output, but its TG and PP speeds are much lower than those of the other A3B models.
As the context grows, its TG and PP also drop far more steeply (for example, with FA off, tg128 falls from 10.77 t/s at an empty context to 2.19 t/s at depth 8096).
Since MLA PR #18986 has been merged, I am not sure whether any further PR is planned to speed up this model.
I would like to know whether this slowdown is due to the model's architecture, whether the backends used on my machine need an updated implementation, or whether something in llama.cpp will be improved to make inference with this model faster.
I use Mesa Vulkan (RADV) for GPU inference and have AOCL BLAS installed for anything that runs on the CPU.
AOCL: aocl-linux-gcc-5.2.0
VULKAN SDK: LunarG 1.4.335.0
VULKAN Driver: mesa-vulkan-drivers:amd64 25.3.3kisak1n amd64
CMake options used:
cmake -S "$SOURCE_DIR" -B "$BUILD_DIR" -DGGML_NATIVE=ON -DSD_VULKAN=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=AOCL -DBLAS_INCLUDE_DIRS=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc/include -DCMAKE_PREFIX_PATH=/opt/AMD/aocl/aocl-linux-gcc-5.2.0/gcc
cmake --build "$BUILD_DIR" --config Release --clean-first --parallel $(nproc)
llama-bench build: 8f91ca5 (7822)
FA = off
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |
FA = on
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
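To put a number on how steeply each model slows down, the tg128 rows of the two tables can be reduced to a "retention" ratio: throughput at depth 8096 divided by throughput at an empty context. The sketch below is only an illustration. The values are copied from the tables above, while the shortened model labels and the `retention` helper are names made up for this example. By this measure GLM4.7 Flash keeps about 20% of its TG speed with FA off and about 51% with FA on, compared with about 45% and 67% for qwen3moe.

```python
# tg128 throughput in tokens/s, copied from the FA = off and FA = on tables.
# Inner keys are context depths; 0 means an empty context.
# The shortened model labels are illustrative, not llama-bench output.
TG128 = {
    "fa_off": {
        "gpt-oss 20B": {0: 16.17, 1024: 15.78, 2048: 15.20, 8096: 12.28},
        "nemotron_h_moe": {0: 12.05, 1024: 11.84, 2048: 11.50, 8096: 9.17},
        "qwen3moe": {0: 12.95, 1024: 12.06, 2048: 11.23, 8096: 5.89},
        "GLM4.7 Flash": {0: 10.77, 1024: 7.29, 2048: 5.28, 8096: 2.19},
    },
    "fa_on": {
        "gpt-oss 20B": {0: 16.64, 1024: 16.36, 2048: 16.11, 8096: 14.91},
        "nemotron_h_moe": {0: 12.12, 1024: 12.05, 2048: 11.98, 8096: 11.52},
        "qwen3moe": {0: 13.39, 1024: 12.29, 2048: 11.66, 8096: 8.92},
        "GLM4.7 Flash": {0: 10.97, 1024: 9.76, 2048: 8.79, 8096: 5.58},
    },
}


def retention(series, depth=8096):
    """Fraction of the empty-context throughput that remains at `depth`."""
    return series[depth] / series[0]


# Print one line per (FA setting, model) pair.
for fa, models in TG128.items():
    for model, series in models.items():
        print(f"{fa:7s} {model:15s} {retention(series):.0%}")
```

The same helper works for the pp512 rows if they are entered in the same shape.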
First Bad Commit
None
Relevant log output
Logs
tipu-dev-machine ~/Development/GH/llama.cpp/build/bin b7822 ≢ 1 22:23:32
bash ./llama-bench -m /home/tipu/AI/models/ggml-org/GPT-OSS-20B/gpt-oss-20b-mxfp4.gguf -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf -m /home/tipu/AI/models/unsloth/Nemotron-3-Nano/Nemotron-3-Nano-30B-A3B-Q8_0.gguf -m /home/tipu/AI/models/unsloth/GLM-4.7-Flash/GLM-4.7-Flash-Q8_0.gguf -ngl 100 --ubatch-size 512 --batch-size 2048 --mmap 0 -fa 0,1 -d 0,1024,2048,8096 -r 3 -dio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | threads | fa | dio | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 | 147.08 ± 1.49 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 | 16.17 ± 0.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d1024 | 136.19 ± 1.73 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d1024 | 15.78 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d2048 | 128.45 ± 1.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d2048 | 15.20 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d8096 | 95.64 ± 0.76 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 12.28 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 | 146.69 ± 0.87 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 | 16.64 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d1024 | 132.76 ± 0.39 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d1024 | 16.36 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d2048 | 120.38 ± 0.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d2048 | 16.11 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d8096 | 77.32 ± 0.34 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d8096 | 14.91 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 | 119.46 ± 0.18 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 | 12.95 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d1024 | 105.93 ± 0.34 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d1024 | 12.06 ± 0.06 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d2048 | 94.63 ± 0.51 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d2048 | 11.23 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d8096 | 56.28 ± 0.18 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 5.89 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 | 114.26 ± 0.87 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 | 13.39 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d1024 | 77.43 ± 0.10 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d1024 | 12.29 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d2048 | 55.47 ± 0.35 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d2048 | 11.66 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d8096 | 20.76 ± 0.17 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d8096 | 8.92 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 | 113.02 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 | 12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d1024 | 111.24 ± 0.13 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d1024 | 11.84 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d2048 | 108.86 ± 0.40 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d2048 | 11.50 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d8096 | 98.47 ± 0.93 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 9.17 ± 0.05 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 | 112.54 ± 0.65 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 | 12.12 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d1024 | 107.09 ± 0.32 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d1024 | 12.05 ± 0.01 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d2048 | 101.26 ± 0.28 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d2048 | 11.98 ± 0.00 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d8096 | 77.85 ± 0.48 |
| nemotron_h_moe 31B.A3.5B Q8_0 | 31.27 GiB | 31.58 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d8096 | 11.52 ± 0.00 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 | 102.81 ± 0.31 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 | 10.77 ± 0.00 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d1024 | 86.65 ± 0.31 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d1024 | 7.29 ± 0.05 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d2048 | 73.20 ± 0.48 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d2048 | 5.28 ± 0.03 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | pp512 @ d8096 | 38.71 ± 0.02 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 2.19 ± 0.02 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 | 86.09 ± 0.12 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 | 10.97 ± 0.01 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d1024 | 50.39 ± 0.10 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d1024 | 9.76 ± 0.01 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d2048 | 35.40 ± 0.02 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d2048 | 8.79 ± 0.00 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | pp512 @ d8096 | 12.94 ± 0.01 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d8096 | 5.58 ± 0.00 |
build: 8f91ca54e (7822)
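For anyone who wants to rebuild the per-model summary tables from this raw log, a minimal parsing sketch follows. It assumes the markdown layout shown above (pipe-delimited rows, one header row, a dashed separator row, and a `t/s` cell of the form `mean ± stddev`). The function names `parse_bench_table` and `pivot` are hypothetical helpers written for this example, not part of llama.cpp. Note that the log labels GLM-4.7-Flash as `deepseek2 ?B Q8_0`.

```python
def parse_bench_table(text):
    """Parse llama-bench markdown rows into dicts keyed by column name."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip device banners, shell prompts and the build line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
            continue
        if all(set(c) <= set("-:") for c in cells):
            continue  # dashed separator row under the header
        row = dict(zip(header, cells))
        mean, _, std = row["t/s"].partition("±")
        row["t/s"] = float(mean)
        row["stddev"] = float(std) if std.strip() else 0.0
        rows.append(row)
    return rows


def pivot(rows, fa):
    """Return {test: {model: mean t/s}} for one flash-attention setting."""
    table = {}
    for row in rows:
        if row["fa"] == fa:
            table.setdefault(row["test"], {})[row["model"]] = row["t/s"]
    return table


# A few lines taken from the log above, used as sample input.
SAMPLE = """\
ggml_vulkan: Found 1 Vulkan devices:
| model | size | params | backend | threads | fa | dio | test | t/s |
| --- | ---: | ---: | --- | ---: | -: | --: | ---: | ---: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 5.89 ± 0.05 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 0 | 1 | tg128 @ d8096 | 2.19 ± 0.02 |
| deepseek2 ?B Q8_0 | 29.65 GiB | 29.94 B | Vulkan,BLAS | 8 | 1 | 1 | tg128 @ d8096 | 5.58 ± 0.00 |
build: 8f91ca54e (7822)
"""

rows = parse_bench_table(SAMPLE)
print(pivot(rows, fa="0"))
```

The `fa` column is kept as a string here, so `pivot` is called with `"0"` or `"1"`.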