
Conversation

@h-guo18 (Contributor) commented on Nov 14, 2025

What does this PR do?

Type of change: new feature

Overview:

  • New trainer with different base model backends available for online training.

    • The base model can now run on SGLang or HF (with TP), while the student model trains with DDP on a separate set of devices.
    • The trainer is based on the previous experimental PR352.
    • The SGL wrapper is based on specforge PR239.
  • Other improvements:

    • Moved train_acc.item() out of the EAGLE forward to avoid a CUDA graph break during torch.compile (see the sketch after this list).
    • Optionally turn off the AR validation progress bar to avoid log overflow.
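
A minimal sketch of the train_acc.item() change (function names below are illustrative, not the PR's actual code): calling .item() inside a torch.compile'd region forces a device-to-host sync and a graph break, so the accuracy stays a tensor inside the compiled forward and is converted to a Python float only outside it.

import torch

@torch.compile
def eagle_step(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Keep everything as tensors inside the compiled region;
    # calling .item() here would force a graph break.
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean()

def training_step(logits, targets):
    train_acc = eagle_step(logits, targets)
    # Convert to a Python float only outside the compiled function, e.g. for logging.
    return train_acc.item()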

Usage

export NCCL_CUMEM_ENABLE=1;
python train.py \
 --out_path $OUT \
 --data_path $DATA \
 --model_path $MODEL \
 --teacher_backend <"sglang" or "hf"> \
 --teacher_devices 0,1,2,3 \
 --student_devices 4,5,6,7 \
 --teacher_ep_size 1  # SGLang backend only

Testing

Parallelism

  • Tested base model TP on HF backend;
  • Tested base model TP and EP on SGLang backend;

Training Quality Test

Compared the previous HF trainer, the new trainer with HF backend, and the new trainer with SGL backend.
Setting: Llama3.2-1B, magpie, bs=8, lr=1e-4, seqlen=1k.

  • The SGL backend is worse than the HF backend at steps > 0;
  • The new trainer with HF backend is almost identical to the previous HF trainer;

(Image: training accuracy/loss comparison.)

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@copy-pr-bot (bot) commented Nov 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@copy-pr-bot (bot) commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@h-guo18 h-guo18 self-assigned this Nov 17, 2025
@codecov (bot) commented Nov 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.47%. Comparing base (e3e399a) to head (6f8fa51).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #564   +/-   ##
=======================================
  Coverage   74.47%   74.47%           
=======================================
  Files         182      182           
  Lines       18225    18225           
=======================================
  Hits        13573    13573           
  Misses       4652     4652           

☔ View full report in Codecov by Sentry.

@h-guo18 changed the title from "Feat: SGL backend for SD training" to "Feat: SGL backend for online SD training" on Nov 17, 2025
@h-guo18 h-guo18 marked this pull request as ready for review November 17, 2025 02:57
@h-guo18 h-guo18 requested a review from a team as a code owner November 17, 2025 02:57
torch.manual_seed(0)


def _setup_distributed(rank, args, backend="nccl"):
Contributor review comment: Why do we need to manually set this up?
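
For context, a minimal sketch of what such a manual setup usually contains (the body below is an assumption, not the PR's code; the diff's signature takes (rank, args), while this sketch takes an explicit world_size):

import os
import torch
import torch.distributed as dist

def _setup_distributed(rank, world_size, backend="nccl"):
    # Hypothetical sketch: each training process sets the rendezvous address,
    # joins the default process group, and binds to its own GPU before any
    # DDP wrapping happens.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)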

# See the License for the specific language governing permissions and
# limitations under the License.

# MIT License
Contributor review comment: Is this code borrowed from OSS? If so, you will need to follow the OSS code procedure to file a ticket and get approval.

@yeyu-nvidia (Contributor) commented:

According to the figures, there is a large accuracy/loss gap between the SGL trainer and our current trainer. We need to figure it out before this can be merged. Also, since this PR may fundamentally change our training accuracy, we need to train larger models than Llama 1B.

@h-guo18 (Contributor, Author) left a comment:

We still do not know the reason for the accuracy gap between SGLang and HF. I suspect a numeric error that propagates along the TTT steps, leading to decreasing accuracy at later TTT steps (see the comparison sketch after the list below).

So far, we have eliminated several possibilities through experiments:

  • TP
  • enable_fp32_lm_head in SGLang
  • triton_attention_reduce_in_fp32
  • attention backend
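
To localize where the error enters, one option is a minimal comparison sketch like the one below, assuming teacher hidden states (or logits) for the same batch can be dumped to disk from both backends; compare_teacher_outputs and the file paths are hypothetical, not part of this PR.

import torch

def compare_teacher_outputs(hf_path, sgl_path, atol=1e-3):
    # Load dumped tensors of shape [batch, seq_len, hidden] from each backend.
    hf = torch.load(hf_path).float()
    sgl = torch.load(sgl_path).float()
    diff = (hf - sgl).abs()
    print(f"max abs diff: {diff.max().item():.3e}, mean abs diff: {diff.mean().item():.3e}")
    # If the per-position error grows along the sequence, that is consistent
    # with a numeric error compounding across TTT steps.
    per_pos = diff.amax(dim=(0, 2))
    bad = (per_pos > atol).nonzero().flatten()
    print("first position above atol:", bad[0].item() if bad.numel() else "none")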
