Add muon optimizer #3924

@NewBornRustacean

What's up Burn community!
I'd like to suggest adding a new optimizer, Muon.

Feature description

Add the Muon optimizer to burn-optim. Muon (MomentUm Orthogonalized by Newton-Schulz) is a recent optimizer designed for the 2D weight matrices of neural network hidden layers, and it is reported to be particularly effective for large language models and transformers.

Muon combines SGD with momentum and a Newton-Schulz iteration, which replaces each 2D parameter's update with an approximation of the nearest semi-orthogonal matrix. This approach provides:

  • Faster convergence compared to Adam/AdamW
  • Better stability in large-scale training
  • Memory efficiency (one momentum buffer per parameter, as in SGD with momentum)
  • Automatic learning rate transfer across different model scales
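
To make the update rule above concrete, here is a minimal sketch in plain Rust, without any Burn types. The names `Matrix`, `newton_schulz`, and `muon_step` are hypothetical and are not part of burn-optim. The quintic coefficients and the default of five iterations are the ones commonly published with the reference Muon implementation; plain (non-Nesterov) momentum and the absence of any shape-dependent scaling are simplifying assumptions.

```rust
// Hypothetical dense matrix type for illustration only (not a Burn tensor).
type Matrix = Vec<Vec<f64>>;

fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn transpose(a: &Matrix) -> Matrix {
    let (n, m) = (a.len(), a[0].len());
    let mut out = vec![vec![0.0; n]; m];
    for i in 0..n {
        for j in 0..m {
            out[j][i] = a[i][j];
        }
    }
    out
}

/// Approximately orthogonalizes `g` with a quintic Newton-Schulz iteration.
/// The result has singular values near 1, not exactly 1.
fn newton_schulz(g: &Matrix, steps: usize) -> Matrix {
    // Coefficients assumed from the reference Muon implementation.
    const A: f64 = 3.4445;
    const B: f64 = -4.7750;
    const C: f64 = 2.0315;

    // Work on the wide orientation so X * X^T is the smaller Gram matrix.
    let tall = g.len() > g[0].len();
    let mut x = if tall { transpose(g) } else { g.clone() };

    // Normalize by the Frobenius norm so every singular value is at most 1.
    let norm = x.iter().flatten().map(|v| v * v).sum::<f64>().sqrt() + 1e-7;
    for row in x.iter_mut() {
        for v in row.iter_mut() {
            *v /= norm;
        }
    }

    for _ in 0..steps {
        let xxt = matmul(&x, &transpose(&x));
        let xxt2 = matmul(&xxt, &xxt);
        // poly = B * (X X^T) + C * (X X^T)^2
        let poly: Matrix = xxt
            .iter()
            .zip(&xxt2)
            .map(|(r1, r2)| r1.iter().zip(r2).map(|(p, q)| B * p + C * q).collect())
            .collect();
        let px = matmul(&poly, &x);
        // X <- A * X + poly * X
        for (row, prow) in x.iter_mut().zip(&px) {
            for (v, p) in row.iter_mut().zip(prow) {
                *v = A * *v + p;
            }
        }
    }

    if tall { transpose(&x) } else { x }
}

/// One simplified Muon step for a single 2D parameter (plain momentum assumed).
fn muon_step(param: &mut Matrix, momentum: &mut Matrix, grad: &Matrix, lr: f64, beta: f64) {
    // Momentum accumulation, as in SGD with momentum.
    for (mrow, grow) in momentum.iter_mut().zip(grad) {
        for (m, g) in mrow.iter_mut().zip(grow) {
            *m = beta * *m + g;
        }
    }
    // Orthogonalize the momentum buffer, then apply it as the update.
    let update = newton_schulz(momentum, 5);
    for (prow, urow) in param.iter_mut().zip(&update) {
        for (p, u) in prow.iter_mut().zip(urow) {
            *p -= lr * u;
        }
    }
}
```

Calling `muon_step(&mut w, &mut buf, &grad, 0.02, 0.95)` once per training step would update `w` in place; the learning rate and momentum values here are placeholders, not recommendations. The only optimizer state is the single `momentum` buffer per parameter, which is where the memory comparison with SGD-momentum comes from. A real burn-optim implementation would operate on Burn tensors and would need a fallback for non-2D parameters.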

References

Feature motivation

Adding Muon to Burn would benefit:

  • Large language model training projects
  • Users seeking faster convergence with less hyperparameter tuning
  • Applications where memory efficiency is critical
  • ...and more!

(Optional) Suggest a Solution

Labels: feature
