
Conversation

b8zhong (Collaborator) commented Oct 31, 2025

Because #12307 is actually the correct solution

  • Update FP8 column for trtllm mha.

gemini-code-assist (Contributor) commented

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

b8zhong requested a review from Copilot on October 31, 2025 at 05:30

Fridge003 (Collaborator) left a comment

Cool

Fridge003 merged commit a076ec1 into sgl-project:main on Oct 31, 2025 (6 checks passed)

Copilot AI left a comment

Pull Request Overview

This PR removes a temporary workaround that was forcing bfloat16 KV cache for Llama4 models using the TRTLLM MHA backend on SM100 hardware, now that FP8 KV cache support is available.

  • Removes automatic downgrade from FP8 to bfloat16 for Llama4 with trtllm_mha backend
  • Updates documentation to reflect FP8 KV Cache support in TRTLLM MHA backend

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Reviewed files

  • python/sglang/srt/server_args.py: Removes the workaround that forced kv_cache_dtype to bfloat16 for Llama4 with trtllm_mha on SM100
  • docs/advanced_features/attention_backend.md: Updates the support matrix to indicate that TRTLLM MHA now supports FP8 KV cache
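For readers who want a concrete picture of what kind of guard was removed, here is a minimal, hypothetical sketch of the downgrade logic described above. It is not the actual diff: the `ServerArgs` fields, the `is_sm100()` helper, and the architecture string are illustrative assumptions, shown only to make the "FP8 silently downgraded to bfloat16" behavior concrete.

```python
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # Hypothetical subset of sglang's server arguments, for illustration only.
    model_architecture: str = "Llama4ForConditionalGeneration"
    attention_backend: str = "trtllm_mha"
    kv_cache_dtype: str = "fp8_e4m3"


def is_sm100() -> bool:
    # Stand-in for a real compute-capability check (e.g. via torch.cuda).
    return True


def apply_old_workaround(args: ServerArgs) -> None:
    """Sketch of the pre-#12307 behavior: silently downgrade an FP8 KV cache
    request to bfloat16 for Llama4 + trtllm_mha on SM100."""
    if (
        "Llama4" in args.model_architecture
        and args.attention_backend == "trtllm_mha"
        and is_sm100()
        and args.kv_cache_dtype.startswith("fp8")
    ):
        args.kv_cache_dtype = "bfloat16"


args = ServerArgs()
apply_old_workaround(args)
# With the old workaround: prints "bfloat16".
# After this PR, no downgrade is applied and the requested FP8 dtype is kept.
print(args.kv_cache_dtype)
```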

b8zhong deleted the revert-12347-fix-llama4-kv-cache-layout branch on October 31, 2025 at 05:35
mingfeima pushed a commit to mingfeima/sglang that referenced this pull request Nov 6, 2025