feat: support trtllm_mha FP8 query attention kernel #12307

elvischenv · 2025-10-29T01:46:52Z

Motivation

TRTLLM_MHA already supports an FP8-qkv, BF16-out attention kernel. Using it achieves better performance than the original BF16-q, FP8-kv kernel.
#9782

Modifications

This PR converts the query to FP8 when the KV cache dtype is also FP8.
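The dtype rule can be sketched in plain Python, without torch. This is only an illustration of the selection logic: the helper `select_attention_dtypes`, the `AttentionDtypes` class, and the string dtype names are hypothetical and do not exist in SGLang. The actual change is the two-line cast in trtllm_mha_backend.py.

```python
from dataclasses import dataclass

# Dtype names are plain strings so the sketch runs without torch installed.
BF16 = "bfloat16"
FP8_E4M3 = "float8_e4m3fn"


@dataclass(frozen=True)
class AttentionDtypes:
    q: str    # dtype the query is passed to the kernel in
    kv: str   # dtype of the KV cache
    out: str  # dtype of the attention output


def select_attention_dtypes(model_dtype: str, kv_cache_dtype: str) -> AttentionDtypes:
    """Hypothetical helper: cast q to FP8 only when the KV cache is FP8."""
    q_dtype = FP8_E4M3 if kv_cache_dtype == FP8_E4M3 else model_dtype
    # The output stays in the model dtype (the FP8-qkv kernel is BF16-out).
    return AttentionDtypes(q=q_dtype, kv=kv_cache_dtype, out=model_dtype)


fp8_case = select_attention_dtypes(BF16, FP8_E4M3)
bf16_case = select_attention_dtypes(BF16, BF16)
print(fp8_case)
print(bf16_case)
```

With a BF16 model and an FP8 KV cache, the query is selected as FP8 and the output stays BF16. With a BF16 KV cache nothing changes.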

Accuracy Tests

lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-120b,base_url=http://127.0.0.1:18000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8682|±  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.6091|±  |0.0190|

main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8394|±  |0.0143|
|     |       |strict-match    |     5|exact_match|↑  |0.6318|±  |0.0188|
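A rough way to read the two tables is to express each difference in units of the combined standard error. The sketch below does that with the values copied from the tables. Treating the two runs as independent samples is an assumption (both runs use the same gsm8k subset), so this is only an approximate sanity check, not a formal significance test.

```python
import math

# (value, stderr) pairs copied from the gsm8k tables above.
results = {
    "flexible-extract": {"pr": (0.8682, 0.0132), "main": (0.8394, 0.0143)},
    "strict-match": {"pr": (0.6091, 0.0190), "main": (0.6318, 0.0188)},
}


def z_score(pr, main):
    """Difference (PR - main) divided by the combined standard error."""
    diff = pr[0] - main[0]
    combined_stderr = math.sqrt(pr[1] ** 2 + main[1] ** 2)
    return diff / combined_stderr


z_scores = {name: z_score(r["pr"], r["main"]) for name, r in results.items()}
for name, z in z_scores.items():
    diff = results[name]["pr"][0] - results[name]["main"][0]
    print(f"{name}: diff = {diff:+.4f}, z = {z:+.2f}")
```

Both differences stay below two combined standard errors in magnitude, and they point in opposite directions, which is consistent with run-to-run noise rather than an accuracy regression.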

Benchmarking and Profiling

python3 -m sglang.bench_serving --model openai/gpt-oss-120b --host 127.0.0.1 --port 18000 --backend sglang-oai --dataset-name random --random-range-ratio 1 --random-input-len 1024 --random-output-len 1024 --max-concurrency 512 --num-prompts 2560

PR (about 5% higher throughput):

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 512
Successful requests:                     2560
Benchmark duration (s):                  153.18
Total input tokens:                      2621440
Total input text tokens:                 2621440
Total input vision tokens:               0
Total generated tokens:                  2621440
Total generated tokens (retokenized):    2552374
Request throughput (req/s):              16.71
Input token throughput (tok/s):          17113.94
Output token throughput (tok/s):         17113.94
Total token throughput (tok/s):          34227.89
Concurrency:                             510.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   30531.42
Median E2E Latency (ms):                 30537.68
---------------Time to First Token----------------
Mean TTFT (ms):                          2720.10
Median TTFT (ms):                        2655.91
P99 TTFT (ms):                           4974.16
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.24
Median ITL (ms):                         24.75
P95 ITL (ms):                            26.51
P99 ITL (ms):                            153.28
Max ITL (ms):                            4132.87
==================================================

main:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 512
Successful requests:                     2560
Benchmark duration (s):                  161.57
Total input tokens:                      2621440
Total input text tokens:                 2621440
Total input vision tokens:               0
Total generated tokens:                  2621440
Total generated tokens (retokenized):    2539762
Request throughput (req/s):              15.84
Input token throughput (tok/s):          16225.25
Output token throughput (tok/s):         16225.25
Total token throughput (tok/s):          32450.49
Concurrency:                             510.02
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   32188.29
Median E2E Latency (ms):                 32219.19
---------------Time to First Token----------------
Mean TTFT (ms):                          2632.67
Median TTFT (ms):                        2608.20
P99 TTFT (ms):                           4823.79
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.98
Median ITL (ms):                         26.49
P95 ITL (ms):                            28.89
P99 ITL (ms):                            154.62
Max ITL (ms):                            4009.70
==================================================
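The relative changes between the two reports can be computed directly from the numbers above. The sketch below uses values copied from the PR and main runs. These are single runs, so small differences (TTFT in particular) may be within noise.

```python
# Values copied from the two bench_serving reports above.
pr = {
    "total_tok_s": 34227.89,
    "mean_e2e_ms": 30531.42,
    "mean_ttft_ms": 2720.10,
    "mean_itl_ms": 27.24,
}
main = {
    "total_tok_s": 32450.49,
    "mean_e2e_ms": 32188.29,
    "mean_ttft_ms": 2632.67,
    "mean_itl_ms": 28.98,
}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


changes = {key: pct_change(pr[key], main[key]) for key in pr}
for key, value in changes.items():
    print(f"{key}: {value:+.2f}%")
```

Total token throughput rises by roughly 5.5%, mean E2E latency and mean ITL fall by roughly 5% and 6%, and mean TTFT is about 3% higher in the PR run.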

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

gemini-code-assist · 2025-10-29T01:46:55Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Fridge003

LGTM

Fridge003 · 2025-10-29T19:11:16Z

python/sglang/srt/layers/attention/trtllm_mha_backend.py

            )

+        if self.data_type == torch.float8_e4m3fn:
+            q = q.to(torch.float8_e4m3fn)


One more question: in this case, is self.q_data_type different from self.dtype?
It's a little bit confusing... What's your command for launching the model with an FP8 query?

I'm running with --kv-cache-dtype fp8_e4m3. From the code, self.q_data_type is always the model dtype. The original behavior with an FP8 KV cache was to use a model-dtype query with FP8 k/v. With this PR, the query is cast as well, so q, k, and v all use the dtype set by --kv-cache-dtype when that dtype is FP8.

sglang/python/sglang/srt/layers/attention/trtllm_mha_backend.py

Lines 75 to 76 in 7ed8ba0

    self.data_type = model_runner.kv_cache_dtype
    self.q_data_type = model_runner.dtype

This is also what other attention backends do, e.g. flashmla:

sglang/python/sglang/srt/layers/attention/flashmla_backend.py

Lines 356 to 357 in 7ed8ba0

    if self.data_type == torch.float8_e4m3fn:
        reshape_q_fp8 = reshape_q.to(torch.float8_e4m3fn)

@elvischenv Sorry for the unrelated question, but FP8 KV with trtllm mha was supported before this PR, right? I ask because I recently ran into an error like #12372, and I'm not sure whether you have seen the same thing. I may be wrong, but I believe it was supported before, so maybe SGLang recently started passing the wrong params.

@b8zhong On current main, --kv-cache-dtype fp8_e4m3 uses the BF16-query FP8-kv kernel. Flashinfer only has a BF16-q FP8-kv decode kernel; it does NOT have a BF16-q FP8-kv prefill kernel.

With this PR, --kv-cache-dtype fp8_e4m3 will always use an FP8 query. This is good because Flashinfer has FP8-qkv kernels for both prefill and decode.
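The kernel availability described in these two comments can be written down as a small lookup table. This is a sketch that encodes only what the thread states; the `KERNEL_SUPPORT` table and `missing_phases` helper are hypothetical names, not part of SGLang or Flashinfer.

```python
ALL_PHASES = {"prefill", "decode"}

# (query dtype, KV dtype) -> phases for which the thread says a kernel exists.
KERNEL_SUPPORT = {
    ("bf16", "fp8"): {"decode"},             # no prefill kernel, per the comment above
    ("fp8", "fp8"): {"prefill", "decode"},   # FP8-qkv covers both phases
}


def missing_phases(q_dtype, kv_dtype):
    """Return the phases with no kernel for this dtype combination.

    Raises KeyError for combinations the thread does not discuss.
    """
    return ALL_PHASES - KERNEL_SUPPORT[(q_dtype, kv_dtype)]


before_pr = missing_phases("bf16", "fp8")  # main: BF16 query with FP8 KV cache
after_pr = missing_phases("fp8", "fp8")    # this PR: FP8 query with FP8 KV cache
print("missing on main:", sorted(before_pr))
print("missing with PR:", sorted(after_pr))
```

On main the FP8 KV cache path has no prefill kernel to dispatch to, while casting the query to FP8 leaves no phase uncovered.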

@elvischenv Thanks!! That makes a lot of sense & resolves my issue. Thank you very much.

"Flashinfer only has BF16-q FP8-kv decode kernel BUT does NOT have BF16-q FP8-kv prefill kernel."
I did not know this; it's good information.

support trtllm_mha FP8 query

078bbbd

elvischenv requested review from BBuf, Edwardf0t1, HaiShaw, Ying1123, ch-wan, ispobock, kushanam, merrymercy and zhyncs as code owners October 29, 2025 01:46

b8zhong added the run-ci label Oct 29, 2025

Fridge003 approved these changes Oct 29, 2025

Fridge003 reviewed Oct 29, 2025

b8zhong mentioned this pull request Oct 30, 2025

[Bug] TRT-LLM gen MHA + FP8 KV cache issue #12372

Closed

Merge branch 'main' into elvischenv/support-trtllm-mha-fp8-query

e63efc3

Fridge003 merged commit 069e490 into sgl-project:main Oct 31, 2025
53 of 73 checks passed

b8zhong mentioned this pull request Oct 31, 2025

Revert "fix llama4 kv cache layout" #12437

Merged
