Feat: SGL backend for online SD training #564
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #564   +/-   ##
=======================================
  Coverage   74.47%   74.47%
=======================================
  Files         182      182
  Lines       18225    18225
=======================================
  Hits        13573    13573
  Misses       4652     4652
Review comment on the following diff context:

torch.manual_seed(0)
...
def _setup_distributed(rank, args, backend="nccl"):
Why do we need to manually set this up?
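For context, a minimal sketch of what such a manual setup typically contains, assuming a per-rank spawn and an `args.world_size` attribute (hypothetical; not the PR's actual code):

```python
import os

import torch
import torch.distributed as dist


# Hypothetical sketch of a manual per-rank setup; the actual body of
# _setup_distributed in this PR may differ, and args.world_size is an
# assumed attribute.
def _setup_distributed(rank, args, backend="nccl"):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Create the default process group for this rank.
    dist.init_process_group(backend=backend, rank=rank, world_size=args.world_size)
    # Pin this process to its GPU so NCCL collectives use the right device.
    torch.cuda.set_device(rank)
```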
Review comment on the following diff context:

# See the License for the specific language governing permissions and
# limitations under the License.
...
# MIT License
Is this code borrowed from OSS? If so, you will need to follow the OSS code procedure to file a ticket and get approval.
According to the figures, there is a large accuracy/loss gap between the SGL trainer and our current trainer. We need to find the cause before this can be merged. Also, since this PR may fundamentally change our training accuracy, we need to train models larger than Llama 1B.
We still do not know the reason for the accuracy gap between SGLang and HF. I suspect a numeric error that propagates along the TTT steps, leading to decreasing accuracy at later TTT steps.
So far, we have eliminated several possibilities by experiment:
- TP
- enable_fp32_lm_head in SGLang
- triton_attention_reduce_in_fp32
- attention backend
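As an illustration only (not code from this PR), one way to localize where the divergence enters is to compare per-TTT-step logits from the two backends on identical inputs; the tensor lists below are assumed to have been collected from each backend's forward pass:

```python
import torch


# Hypothetical debugging harness: feed identical inputs through both
# backends, collect per-TTT-step logit tensors, and report where they
# start to diverge. hf_steps / sgl_steps are assumed lists of tensors.
def compare_ttt_steps(hf_steps, sgl_steps, atol=1e-3):
    for step, (hf, sgl) in enumerate(zip(hf_steps, sgl_steps)):
        diff = (hf.float() - sgl.float()).abs()
        print(
            f"TTT step {step}: max|diff|={diff.max().item():.3e} "
            f"mean|diff|={diff.mean().item():.3e} "
            f"allclose={torch.allclose(hf.float(), sgl.float(), atol=atol)}"
        )
```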
What does this PR do?
Type of change: new feature
Overview: New trainer with different base model backends available for online training.
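As a rough, hypothetical illustration of what pluggable base-model backends could look like (the PR's real interface and names may differ):

```python
from typing import Protocol

import torch


# Hypothetical sketch of a pluggable base-model backend; not the PR's
# actual interface.
class BaseModelBackend(Protocol):
    def hidden_states(self, input_ids: torch.Tensor) -> torch.Tensor:
        """Return base-model hidden states for the draft model to train on."""
        ...


# The trainer would then be constructed with either implementation, e.g.:
#   trainer = OnlineTrainer(backend=HFBackend(model))      # hypothetical
#   trainer = OnlineTrainer(backend=SGLBackend(endpoint))  # hypothetical
```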
Other improvements:
- Moved train_acc.item() out of the eagle forward pass to avoid a CUDA graph break during torch compile (a hedged sketch of this pattern is shown below).
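For reference, a sketch of the pattern described in the bullet above, with hypothetical names: inside a `torch.compile`'d region, `.item()` forces a host sync and a graph break, so the metric is kept as a tensor and converted outside:

```python
import torch
import torch.nn.functional as F


# Hedged sketch: keep the metric as a tensor inside the compiled forward,
# convert it to a Python float only outside the compiled region.
@torch.compile
def eagle_step(logits, labels):
    loss = F.cross_entropy(logits, labels)
    train_acc = (logits.argmax(dim=-1) == labels).float().mean()
    return loss, train_acc  # returning tensors keeps the graph intact


# Outside the compiled region:
# loss, train_acc = eagle_step(logits, labels)
# acc_value = train_acc.item()  # the sync happens here, outside compile
```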
Usage
Testing
Parallelism
Training Quality Test
Compared the previous HF trainer, the new trainer with HF backend, and the new trainer with SGL backend.
Setting: Llama3.2-1B, magpie, bs=8, lr=1e-4, seqlen=1k.
Before your PR is "Ready for review"
Additional Information