
Conversation

@h-guo18 (Contributor) commented on Nov 14, 2025

What does this PR do?

Type of change: new feature

Overview:

  • New trainer with different base model backends available for online training.

    • The base model can now run on SGLang or HF (with TP), while the student model trains with DDP on a separate set of devices.
    • The trainer is based on the previous experimental PR352.
    • The SGL wrapper is based on specforge PR239.
  • Other improvements:

    • Moved train_acc.item() out of the EAGLE forward to avoid a CUDA graph break during torch.compile (see the sketch after this list).
    • Optionally turn off the AR validation progress bar to avoid log overflow.
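
A minimal sketch of the train_acc.item() change (function names below are illustrative, not the PR's actual code): calling .item() inside a torch.compile'd region forces a device-to-host sync and a graph break, so the accuracy stays a tensor inside the compiled forward and is converted to a Python float only outside it.

import torch

@torch.compile
def eagle_step(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Keep everything as tensors inside the compiled region;
    # calling .item() here would force a graph break.
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean()

def training_step(logits, targets):
    train_acc = eagle_step(logits, targets)
    # Convert to a Python float only outside the compiled function, e.g. for logging.
    return train_acc.item()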

Usage

export NCCL_CUMEM_ENABLE=1;
python train.py \
 --out_path $OUT \
 --data_path $DATA \
 --model_path $MODEL \
 --teacher_backend <"sglang" or "hf"> \
 --teacher_devices 0,1,2,3 \
 --student_devices 4,5,6,7 \
 --teacher_ep_size 1  # SGLang backend only

Testing

Parallelism

  • Tested base model TP on HF backend;
  • Tested base model TP and EP on SGLang backend;

Training Quality Test

Compared the previous HF trainer, the new trainer with HF backend, and the new trainer with SGL backend.
Setting: Llama3.2-1B, magpie, bs=8, lr=1e-4, seqlen=1k.

  • The SGL backend is worse than the HF backend at steps > 0;
  • The new trainer with HF backend is almost identical to the previous HF trainer;

(Image: training accuracy/loss comparison.)

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@copy-pr-bot (bot) commented Nov 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@copy-pr-bot (bot) commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@h-guo18 h-guo18 self-assigned this Nov 17, 2025
@codecov (bot) commented Nov 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.47%. Comparing base (e3e399a) to head (6f8fa51).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #564   +/-   ##
=======================================
  Coverage   74.47%   74.47%           
=======================================
  Files         182      182           
  Lines       18225    18225           
=======================================
  Hits        13573    13573           
  Misses       4652     4652           

☔ View full report in Codecov by Sentry.

@h-guo18 changed the title from "Feat: SGL backend for SD training" to "Feat: SGL backend for online SD training" on Nov 17, 2025
@h-guo18 h-guo18 marked this pull request as ready for review November 17, 2025 02:57
@h-guo18 h-guo18 requested a review from a team as a code owner November 17, 2025 02:57
torch.manual_seed(0)


def _setup_distributed(rank, args, backend="nccl"):
Contributor review comment: Why do we need to manually set this up?
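
For context, a minimal sketch of what such a manual setup usually contains (the body below is an assumption, not the PR's code; the diff's signature takes (rank, args), while this sketch takes an explicit world_size):

import os
import torch
import torch.distributed as dist

def _setup_distributed(rank, world_size, backend="nccl"):
    # Hypothetical sketch: each training process sets the rendezvous address,
    # joins the default process group, and binds to its own GPU before any
    # DDP wrapping happens.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)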

# See the License for the specific language governing permissions and
# limitations under the License.

# MIT License
Contributor review comment: Is this code borrowed from OSS? If so, you will need to follow the OSS code procedure to file a ticket and get approval.

@yeyu-nvidia (Contributor) commented:

According to the figures, there is a large accuracy/loss gap between the SGL trainer and our current trainer. We need to figure it out before this can be merged. Also, since this PR may fundamentally change our training accuracy, we need to train larger models than Llama 1B.

@h-guo18 (Contributor, Author) left a comment:

We still do not know the reason for the accuracy gap between SGLang and HF. I suspect a numeric error that propagates along the TTT steps, leading to decreasing accuracy at later TTT steps (see the comparison sketch after the list below).

So far, we have eliminated several possibilities through experiments:

  • TP
  • enable_fp32_lm_head in SGLang
  • triton_attention_reduce_in_fp32
  • attention backend
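
To localize where the error enters, one option is a minimal comparison sketch like the one below, assuming teacher hidden states (or logits) for the same batch can be dumped to disk from both backends; compare_teacher_outputs and the file paths are hypothetical, not part of this PR.

import torch

def compare_teacher_outputs(hf_path, sgl_path, atol=1e-3):
    # Load dumped tensors of shape [batch, seq_len, hidden] from each backend.
    hf = torch.load(hf_path).float()
    sgl = torch.load(sgl_path).float()
    diff = (hf - sgl).abs()
    print(f"max abs diff: {diff.max().item():.3e}, mean abs diff: {diff.mean().item():.3e}")
    # If the per-position error grows along the sequence, that is consistent
    # with a numeric error compounding across TTT steps.
    per_pos = diff.amax(dim=(0, 2))
    bad = (per_pos > atol).nonzero().flatten()
    print("first position above atol:", bad[0].item() if bad.numel() else "none")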
