Releases: NVIDIA-NeMo/RL
Release 0.4.0
🚀 Release v0.4.0
📝 Blog
✨ Highlights
Container
A linux/amd64 Docker container is available on NGC as nvcr.io/nvidia/nemo-rl:v0.4.0. We plan to include a linux/arm64 container for the next NeMo-RL release. Here are the major software components included in the container:
| Software Component | Version |
|---|---|
| NeMo-RL | 0.4.0 |
| NeMo-Automodel | 0.2.0.rc0+277a8a8 |
| Megatron-Bridge | 0.1.0.rc0+62f4704 |
| Megatron-Core | 0.15.0.rc3+af73aa2 |
| PyTorch | 2.7.1 |
| vLLM | 0.10.0 |
The NeMo-RL container is built on top of the 25.05 cuda-dl-base devel image: https://docs.nvidia.com/deeplearning/frameworks/cuda-dl-release-notes/rel-25-05.html#rel-25-05
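To use the released image directly, you can pull it from NGC with a standard docker pull of the tag listed above (this assumes your environment can reach nvcr.io and is authorized for the registry, if required):
```bash
# Pull the v0.4.0 release container from NGC
docker pull nvcr.io/nvidia/nemo-rl:v0.4.0
```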
If you would like to build this container, or nightly containers, yourself, we provide the exact instructions we use at https://docs.nvidia.com/nemo/rl/latest/docker.html#release-image.
Megatron-Bridge and NeMo-Automodel
We are excited to share two major backend changes in v0.4, aimed at migrating to the latest training libraries in the NeMo ecosystem.
Megatron-Bridge
NeMo RL v0.4 is built on top of Megatron-Bridge, our official training library based on the Megatron-Core backend. You can read more about Megatron-Bridge here: https://docs.nvidia.com/nemo/megatron-bridge/latest/.
As before, to enable the Megatron-Core backend, simply set:
```bash
policy.megatron_cfg.enabled=True
```
See PR #905 for details on the migration.
NeMo Automodel
Automodel is the new backend powering the DTensorPolicyWorkerV2 implementation. Future work on accelerating training with native PyTorch parallelisms will propagate to NeMo RL through the Automodel integration. DTensorPolicyWorkerV1 will be deprecated in Q1 2026.
- DTensorPolicyWorkerV2 has feature parity (including model coverage) with DTensorPolicyWorkerV1
- DTensorPolicyWorkerV2 also has native support for a distributed safetensors format
- Beginning in v0.4, DTensorPolicyWorkerV2 is the default backend for the pure PyTorch path.
  - If you need the original DTensorPolicyWorker, set policy.dtensor_cfg._v2=False (see the sketch after this list).
  - See "DTensorPolicy (v1) Deprecation" below for details of our plans for these two backends.
- Features we plan to enable in upcoming RL releases include:
- Efficient kernels
- Custom model path to enable EP and PP in large MoE models
You can read more about Automodel here: https://docs.nvidia.com/nemo/automodel/latest/index.html.
- All models supported by NeMoAutoModelForCausalLM and NeMoAutoModelForSequenceClassification should work inside NeMo RL. (There are some known issues with MoE models in RL; they will be supported in the next release.)
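For reference, a minimal YAML sketch of falling back to the original DTensorPolicyWorker (v1). The `_v2` flag is the toggle mentioned above, and the surrounding keys follow the existing `policy.dtensor_cfg` layout used elsewhere in the configs:
```yaml
policy:
  dtensor_cfg:
    enabled: true
    _v2: false  # default is true in v0.4 (DTensorPolicyWorkerV2); set false to use the original worker
```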
Multimodal (DTensor and Mcore backends)
Vision-Language Model (VLM) training is now supported for both DTensor and Megatron backends.
DTensor example:
```bash
uv run examples/run_vlm_grpo.py --config examples/configs/vlm_grpo_3B.yaml
```
Megatron example:
```bash
uv run examples/run_vlm_grpo.py --config examples/configs/vlm_grpo_3B_megatron.yaml
```
DAPO and GSPO
NeMo RL now has support for the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) and GSPO (Group Sequence Policy Optimization) algorithms! Both algorithms can be run with simple config changes.
To run GSPO:
```yaml
loss_fn:
  sequence_level_importance_ratios: True
  token_level_loss: False
```
To run DAPO:
```yaml
grpo:
  ## enable DAPO dynamic sampling
  use_dynamic_sampling: true
  batch_multiplier: 3
  dynamic_sampling_max_gen_batches: 10
  ## enable DAPO reward shaping
  reward_shaping:
    enabled: true
    overlong_buffer_length: 4096 # Threshold before penalties apply (paper uses 4096)
    overlong_buffer_penalty: 1.0 # Penalty per excess token
    max_response_length: 20480
```
For more details on the DAPO algorithm and how to configure your DAPO run, refer to the documentation.
On-Policy Knowledge Distillation (DTensor and Mcore backends)
NeMo RL now supports On-Policy Knowledge Distillation. This enables a student/reference model to further improve its policy using rich supervision from the logits of a larger/better teacher model. For full details and setup instructions, see our Quickstart guide and latest blog post.
DTensor example:
```bash
uv run python examples/run_distillation_math.py --config examples/configs/distillation_math.yaml
```
Megatron example:
```bash
uv run python examples/run_distillation_math.py --config examples/configs/distillation_math_megatron.yaml
```
Native HF Reward Model Environments
The RewardModelEnvironment evaluates rollouts using Hugging Face reward models and returns scores that can be used as rewards during GRPO training. You can enable the reward model environment through the env configuration field and launch training with the reward model environment using the following command:
```bash
uv run examples/run_grpo_rm.py --config=examples/configs/grpo_rm_1B.yaml
```
For more details on the reward model environment and how to use it, refer to our design documentation.
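As a rough illustration of where this lives in the config, here is a hypothetical sketch; the key names under `env` are assumptions made for illustration only, and examples/configs/grpo_rm_1B.yaml is the authoritative reference:
```yaml
# Hypothetical sketch only: the key names under `env` are assumptions.
# See examples/configs/grpo_rm_1B.yaml for the real schema.
env:
  reward_model:
    enabled: true
    model_name: "Skywork/Skywork-Reward-Llama-3.1-8B"  # illustrative HF reward model
```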
Furthermore, users can train their own reward model using the NeMo RL PyTorch backend.
Async RL
NeMo RL v0.4 supports asynchronous RL with the following features:
- max_trajectory_age: controls how stale a rollout sample can be and still be used in a training step. If a trajectory was generated with the weights from step k and max_trajectory_age is set to 2, it can be used in training steps k + 1 and k + 2.
- In-flight weight update: allows the trainer to update the weights of rollout workers during generation, so that rollout workers pause at an in-progress decoding step, load the new weights, and continue with subsequent decoding steps (this technique is similar to pipeline RL, https://arxiv.org/abs/2509.19128). NeMo-RL lets users choose whether the KV cache is invalidated after each weight update.
Note:
Asynchronous RL is only supported with a non-colocated setup (i.e., distinct workers for rollout and training).
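For orientation, a hedged sketch of how these knobs might appear in a config. The nesting under `grpo` is an assumption; only max_trajectory_age and the non-colocated generation keys come from this page, so consult the async RL documentation for the exact schema:
```yaml
# Hypothetical sketch: allow trajectories up to 2 training steps stale.
# The `async_grpo` nesting is an assumption; max_trajectory_age is the knob described above.
grpo:
  async_grpo:
    enabled: true
    max_trajectory_age: 2
policy:
  generation:
    colocated:
      enabled: false  # async RL requires dedicated (non-colocated) generation resources
```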
FP8
NeMo RL v0.4 gives users the flexibility to use FP8 either for the entire end-to-end RL pipeline or only for rollouts, for dense models. FP8 support for MoE models is upcoming.
FP8 Rollouts
v0.4 supports using the FP8 (block-wise) quantization method in vLLM to accelerate generation. The supported FP8 quantization method is identical to what is described in the DeepSeek-V3 technical report. You can turn on FP8 generation by adding or modifying the following fields in the GRPO config file. Note that it is required to use importance sampling correction in the loss function when you use FP8 generation.
```yaml
loss_fn:
  use_importance_sampling_correction: true
policy:
  generation:
    vllm_cfg:
      precision: "fp8"
```
This feature currently only works for dense models. Support for FP8 generation in MoE models will be available in the next release.
E2E FP8 (Training and Generation)
v0.4 supports FP8 training in the Megatron training backend. There are three forms of FP8 training: per-tensor delayed scaling, per-tensor current scaling, and block-wise scaling as described in the DeepSeek-V3 technical report. You can use FP8 training in SFT or GRPO by adding or modifying the following fields in the config file.
```yaml
policy:
  megatron_cfg:
    fp8_cfg:
      enabled: true
      fp8: "e4m3" # choices: "e4m3" for block-wise scaling, "hybrid" for both delayed and current per-tensor scaling
      fp8_recipe: "blockwise" # choices: "blockwise" for block-wise scaling, "tensorwise" for per-tensor current scaling, "delayed" for delayed scaling
    env_vars:
      NVTE_FP8_BLOCK_SCALING_FP32_SCALES: "1" # this is required for block-wise scaling
```
The recommended FP8 GRPO recipe is to use FP8 block-wise quantization in vLLM and FP8 block-wise scaling in the Megatron training backend.
Note that FP8 block-wise scaling in Megatron training requires the NGC PyTorch-based NeMo-RL container built from docker/Dockerfile.ngc_pytorch, because the CUDA-based NeMo-RL container built from docker/Dockerfile does not contain the cuBLAS version needed for block-wise GEMM kernels. This is not a requirement for vLLM FP8 generation, or for Megatron FP8 training schemes other than block-wise scaling, and it will not be a requirement in the next release.
To build and push docker/Dockerfile.ngc_pytorch in one command:
```bash
docker buildx build -f docker/Dockerfile.ngc_pytorch --build-arg NRL_GIT_REF=r0.4.0 --tag <registry>/nemo-rl:r0.4.0-ngc --push https://github.com/NVIDIA-NeMo/RL.git
```
KL divergence (between training and generation)
Enhanced KL divergence (K3 estimator) metric monitoring for training-inference mismatch, along with a case study for explanation. We find the KL metric (gen_kl_error) to be an informative metric for divergence in an...
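For reference, the K3 estimator commonly used for this kind of monitoring (whether gen_kl_error uses exactly this form is an assumption here) estimates the KL divergence between the generation policy and the training policy from generated tokens:

$$
k_3(x) \;=\; \frac{\pi_{\text{train}}(x)}{\pi_{\text{gen}}(x)} \;-\; 1 \;-\; \log\frac{\pi_{\text{train}}(x)}{\pi_{\text{gen}}(x)}, \qquad x \sim \pi_{\text{gen}},
$$

which, averaged over sampled tokens, gives an unbiased, non-negative estimate of $\mathrm{KL}(\pi_{\text{gen}} \,\|\, \pi_{\text{train}})$.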
Release v0.3.1
🚀 Release v0.3.1
This is a patch release to fix a bug with sequence-packing and a memory optimization in the softmax.
Please read about the major features in v0.3 on the release notes.
What's Changed
- cp: fix: Enforce minimum packed sequence bin count and multiple of bin count (748) into r0.3.1 by @chtruong814 in #759
- cp: feat: avoid softmax deepcopy in logprobs (tpoisonooo) (761) into r0.3.1 by @chtruong814 in #763
- chore: 0.3.0 -> 0.3.1 by @terrykong in #825
Full Changelog: v0.3.0...v0.3.1
Release 0.3.0
🚀 Release v0.3.0
📝 Blog
Our latest blog post shares highlights and progress from recent work—take a look!
✨ Highlights
🏗️ Improved Training Throughput and Scalability via Megatron-Core Backend
In addition to the PyT DTensor backend, which seamlessly supports 🤗HuggingFace models, this release adds a Megatron-Core backend ("Megatron backend") to enable large-scale dense and MoE model training. It includes efficient parallelisms (data, tensor, pipeline, context, expert, and sequence) and distributed optimizers for efficient training, and is our recommendation for RL at large model sizes and compute scale.
To use the Megatron backend, ensure you have initialized the submodules of NeMo RL:
```bash
git submodule update --init --recursive
```
You can try out the Megatron backend using predefined configs:
```bash
# Example 1 GPU
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml
```
Or by enabling it from the command line:
```bash
# Example 1 GPU
uv run examples/run_sft.py policy.megatron_cfg.enabled=True
```
To learn more about the different backends and their configuration, visit our documentation on Training Backends.
For FAQ using the Megatron backend, see this section.
⚡ Context Parallelism and Sequence Packing
Users can now train with longer sequences at enhanced GPU utilization via Context Parallelism ("CP") and Sequence Packing support for both Megatron-Core and PyT DTensor backends.
For the Megatron backend, both Context Parallelism and sequence packing can be enabled together:
```yaml
policy:
  megatron_cfg:
    context_parallel_size: 2
  sequence_packing:
    enabled: True
```
The DTensor backend also supports CP and sequence packing, but they cannot yet be used together; progress is tracked in #520. There is also a known issue with CP and sequence parallelism, tracked in #659. For more information about CP and some of its current limitations in the DTensor backend, visit our documentation.
```yaml
policy:
  dtensor_cfg:
    context_parallel_size: 2
  # CP and sequence packing cannot be used together (to enable sequence packing, set context_parallel_size=1)
  sequence_packing:
    enabled: False
```
We recommend sequence packing to avoid extra padding and accelerate your training run, but if your model cannot use sequence packing (e.g., due to an unsupported attention kernel), we recommend using dynamic_batching instead (see config). Dynamic batching is mutually exclusive with sequence packing, so it should be enabled on its own.
For more details on sequence packing and dynamic batching and how to use them, refer to our design documentation.
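For example, if your model cannot use sequence packing, a minimal sketch of switching to dynamic batching instead (keys follow the policy config shown above; remember the two options are mutually exclusive):
```yaml
policy:
  sequence_packing:
    enabled: False   # mutually exclusive with dynamic batching
  dynamic_batching:
    enabled: True
```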
💎 Expanded Model Support
🔹 Qwen3 Support
Full support for Qwen3 model family with optimized configurations is available on the Megatron backend.
Qwen3 dense variants and the smallest MoE variant (Qwen/Qwen3-30B-A3B) are also available in the DTensor backend. If you need full N-d parallelism and the largest scale, we recommend the Megatron backend.
🔹 DeepSeekV3 Support
DeepSeekV3 (671B) is now supported on the Megatron backend. See #591 for more details on how to launch. We are continuing to optimize performance on DeepSeekV3 and other large MoE models, which we hope to land in our next release.
🚀 Async vLLM Engine
We have added async vLLM engine (v1) support in v0.3, which enables two important features that were not possible before:
- Multi-node VLLM rollouts (for large MoEs like DSV3)
- Pipeline Parallelism
Async engine can be enabled with the following config change:
```yaml
generation:
  backend: "vllm"
  vllm_cfg:
    async_engine: true
```
With the async vLLM engine enabled, multi-turn rollouts are now much faster since we no longer block at each turn to wait for all the batch elements to complete.
📍 Non-colocated Generation ("Split Placement")
NeMo RL now supports placing the training backend on a different set of GPUs than the generation backend. This is currently supported in the DTensor backend, with support in the Megatron backend coming soon (#613).
This feature can be useful if:
- training and generation have incompatible parallelism/world sizes
- the memory after offloading for training or generation is still not low enough
Non-colocated generation can be enabled with the following config changes:
```yaml
generation:
  backend: "vllm"
  colocated:
    # true: generation shares training GPUs
    # false: uses dedicated generation resources
    enabled: false
    # only relevant when enabled is false
    resources:
      gpus_per_node: null # Number of GPUs dedicated to generation when the cluster has a single node (cluster.num_nodes == 1)
      num_nodes: 1 # Number of nodes dedicated to generation
```
An example multi-node command is:
```bash
# 5 nodes with 8 GPUs each: 4 nodes for training and 1 node for inference
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.num_nodes=1 \
    cluster.num_nodes=5 \
    cluster.gpus_per_node=8
```
Non-colocated generation is also an important prerequisite for our continued work on async RL.
📊 MLflow Integration for Experiment Tracking
NeMo RL now supports MLflow integration for comprehensive experiment tracking and management. This extends our suite of loggers, which already included TensorBoard and wandb.
Enable MLFlow tracking in your configuration:
```yaml
logger:
  mlflow_enabled: true
  mlflow:
    experiment_name: "grpo-dev"
    run_name: "grpo-dev-logger"
```
⚡ Performance Optimizations
🚀 Refit Optimizations
Multiple improvements to the refit process (weight updates from the training backend to the generation backend) led to a several-fold speedup. For large MoE models this has a significant effect on E2E step time: on DeepSeekV3, these optimizations brought refit time down from 850 seconds to 51 seconds (a ~16x improvement). The improvements are particularly beneficial for extra-large models with large TP sizes in vLLM.
The core engineering team plans to share some of the insights and optimization techniques behind this in a blog post, so stay tuned!
📊 vLLM CUDA Graphs
In v0.3, CUDA graphs are enabled in vLLM by default.
🚫 FSDP1 Deprecation
NeMo RL has officially removed the original FSDP1 path used for multi-GPU, multi-node training in pure PyTorch. For training in pure PyTorch without the Megatron backend, we now recommend the DTensor path, which uses FSDP2 by default and is strictly better in terms of functionality and performance.
For more information on the deprecation and the burn testing done before its removal see #614.
🛠️ Known Issues
- There is a known convergence issue with the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B base model used in the DeepScaleR recipe when CUDA graphs are enabled in vLLM. This is fixed on main with #857. This divergence has not been observed in other models.
- Qwen 32B needs a minimum of 32 nodes on the DTensor backend #656
  - We have observed increased memory pressure on Qwen 32B, requiring a much higher node count than expected.
  - This bug does not seem to affect the model on the Megatron backend.
- Qwen/Qwen3-30B-A3B needs 8 nodes to run, causing overheads due to extra parallelisms.
- Llama 70B performance on long context (>=32k) is slower than expected on the Megatron-Core backend.
- DeepSeekV3 and Qwen3-235B E2E performance is still WIP, with improvements targeted for v0.4.
- On the Megatron backend, training with MoE models using EP>1 and DP>1 can hang if sequence packing is enabled #718.
  - Our recommendation if you are using EP>1 and DP>1 is to disable sequence packing.
- DPO does not support sequence packing or dynamic batching #719.
📊 Release Runs
We have provided Tensorboard logs for release runs to give you a head start on what to expect from our recipes.
To view these Tensorboard logs easily, we've provided a Google Colab notebook to download and serve them.
What's Changed
- fix: update the comment about why we init in fp32 by @parthchadha in #354
- feat: add and log a very rough entropy approximation by @SahilJain314 in #342
- fix: fix issues preventing running grpo on volta by @parthchadha in #294
- docs: remove license that was erroneously copy-pasted by @terrykong in #357
- fix: recipes missing args by @terrykong in #365
- test: make dpo functional test threshold higher until flakiness resolved by @terrykong in #371
- ci: Migrate ...
Release v0.2.1
🚀 Release v0.2.1
🎉 Official Open Source Release!
We are thrilled to announce that NeMo RL is now officially open source! We welcome the community to use and contribute to it to help shape the future of reinforcement learning.
✨ Highlights
🎯 DeepScaleR Reproducer in NeMo RL
This release features a reproducer for the DeepScaleR work by Agentica AI, where a 1.5B parameter model surpassed O1-Preview on the AIME benchmark (Pass@1). Our implementation replicates this by iteratively scaling DeepSeek's GRPO algorithm from 8K → 16K → 24K context lengths.
You can start the first stage of training (8K context window) using the following command:
```bash
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
```
For the complete 3-stage iterative training instructions and more details, please see our GRPO on DeepScaleR guide.
📐 OpenMathInstruct-2 SFT in NeMo RL
This release includes a Supervised Fine-Tuning (SFT) recipe that follows the OpenMathInstruct-2 paper. Using this recipe, training a Llama-3.1-8B model on the train_1M split of the nvidia/OpenMathInstruct-2 dataset achieves a score of 0.5020 on the MATH-500 benchmark, matching the reference implementation in NeMo-Skills.
You can run the OpenMathInstruct-2 recipe using the following command:
```bash
uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml
```
For more details on dataset splits, training times, and evaluation, please see our SFT on OpenMathInstruct-2 guide.
⚡ Faster GRPO with Dynamic Batching
GRPO E2E performance has been significantly improved with the introduction of dynamic batching. This feature optimizes GPU utilization by sorting variable-length responses by sequence length and bucketing them into microbatches. These microbatches aim to have a total number of tokens close to train_mb_tokens and logprob_mb_tokens for the training and logprob stages, respectively.
Important: Dynamic batching requires dtensor to be enabled.
You can enable dynamic batching and dtensor in your YAML configuration like so:
```yaml
policy:
  # Enable DTensor (required for dynamic batching)
  dtensor_cfg:
    enabled: True
    # Other dtensor settings like tensor_parallel_size, sequence_parallel, etc.
    # tensor_parallel_size: 1
    # sequence_parallel: False
    # activation_checkpointing: True

  # Dynamic batching settings
  dynamic_batching:
    enabled: True
    # Target number of tokens for training microbatches
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    # Target number of tokens for logprob microbatches
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    # Round sequence lengths to the nearest multiple of this value for bucketing
    sequence_length_round: 64

  # Other policy settings like max_total_sequence_length, train_micro_batch_size, etc.
  # max_total_sequence_length: 4096
  # train_micro_batch_size: 4
  # logprob_batch_size: 8
```
Alternatively, you can enable these features and configure them via command-line overrides when running a script (e.g., run_grpo_math.py):
```bash
uv run examples/run_grpo_math.py \
  --config=<your_base_config.yaml> \
  policy.dtensor_cfg.enabled=True \
  policy.dynamic_batching.enabled=True
  # Optionally override other dynamic batching or dtensor parameters:
  # policy.dynamic_batching.train_mb_tokens=16384 \
  # policy.dynamic_batching.logprob_mb_tokens=32768 \
  # policy.dtensor_cfg.tensor_parallel_size=2
```
Make sure to adjust train_mb_tokens, logprob_mb_tokens, and other parameters according to your sequence length and batch size configuration.
💎 Broad Model Support (including Gemma3)
NeMo RL enables users to leverage powerful open models from families such as Qwen, Llama, and Gemma for reinforcement learning. For this v0.2.1 release, we've enhanced support, particularly for Gemma3 models, addressing their unique characteristics like tied weights across all model sizes (which require special handling for tensor parallelism) and specific vLLM initialization needs. NeMo RL automatically handles these model quirks to ensure seamless training and inference. For more details on this, please see our Model Quirks guide.
🛠️ Bug Fixes
- Gradient Accumulation: Resolved a common issue where naive averaging of per-microbatch losses during gradient accumulation, especially with varying sequence lengths, led to inaccurate loss calculations. This fix (see #266) ensures loss calculations and training runs are accurate (sketched below).
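Roughly, the distinction is the following (a simplified sketch, not the exact code in #266): with microbatches $m = 1..M$ containing $T_m$ tokens each and per-token losses $\ell_{m,t}$, averaging the per-microbatch means over-weights tokens from short microbatches, whereas the corrected loss weights every token equally:

$$
\mathcal{L}_{\text{naive}} = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{T_m}\sum_{t=1}^{T_m}\ell_{m,t}
\qquad\text{vs.}\qquad
\mathcal{L}_{\text{correct}} = \frac{\sum_{m=1}^{M}\sum_{t=1}^{T_m}\ell_{m,t}}{\sum_{m=1}^{M}T_m},
$$

and the two agree only when every microbatch has the same token count.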
📊 Release Runs
We have provided Tensorboard logs for release runs to give you a head start on what to expect from our recipes.
To view these Tensorboard logs easily, we've provided a Google Colab notebook to download and serve them.
What's Changed
- fix: Fix fsdp1 grad clipping and log grad norm by @ashors1 in #251
- fix: Update DPO and SFT configs to use dtensor by @ashors1 in #256
- chore: better logging when insufficient resources by @terrykong in #271
- feat: E2E multi-turn RL example with a sliding puzzle game by @SahilJain314 in #242
- docs: instruct users to git clone before beginning by @terrykong in #257
- fix: add bibtex entry by @parthchadha in #273
- feat: Updated Name to NeMo RL by @SahilJain314 in #265
- docs: Correcting build issues and CI by @aschilling-nv in #270
- fix: improve port selection and exiting early from ray.sub by @terrykong in #272
- feat: publish convergence/release runs by @terrykong in #214
- fix: fixes #264 where tied weights check didn't work on fsdp1 by @parthchadha in #284
- feat: Add hydra style overrides to SFT by @hemildesai in #208
- feat: rename ratio_eps_{min/max} to ratio_clip_{min/max} for clarity by @SahilJain314 in #283
- ci: add eval functional test by @YUki-666 in #269
- chore: add isort rules and pyflakes in ruff/precommit by @terrykong in #291
- test: add a test that checks if recipes can be merged into the base config by @terrykong in #288
- feat: Remove 'last 100' hack for math verifier by @SahilJain314 in #287
- chore: Remove online hf checkpointing by @ashors1 in #285
- fix: Fixed max seqlen not respected correctly by @SahilJain314 in #299
- chore: Remove outdated comment in DPO config by @ashors1 in #293
- fix: fix dtype of empty token_ids for consistency by @ashors1 in #290
- ci: Add initial code coverage report by @chtruong814 in #268
- feat: add qwen3 support by @gshennvm in #289
- feat: config.json -> config.yaml to keep configs in the same representation by @terrykong in #314
- fix: Step LR scheduler once per grpo step by @ashors1 in #305
- perf: update sft and dpo recipes to use bf16 by @ashors1 in #302
- fix: Add division by temperature in training model by @parthchadha in #316
- ci: Add DPO convergence recipes by @ashors1 in #297
- docs: large tech doc edit by @terrykong in #303
- feat: mute math verify to dev null by @SahilJain314 in #319
- docs: Add an example for saving a HF checkpoint E2E by @terrykong in #320
- fix: Fixed capitalization of 'NVIDIA/nemo-rl' -> 'NVIDIA/NeMo-RL' in URL refs. by @SahilJain314 in #330
- feat: Add support for gemma-3 by @yfw in #298
- test: switch tests to qwen3 0.6B by @terrykong in #315
- docs: fix the front page readme heading levels by @terrykong in #336
- fix: Loosen threshold for dpo functional test by @ashors1 in #344
- feat: Add deepscaler dataset by @abukharin-nv in #335
- fix: reinitialize ray cluster if required by @parthchadha in #341
- feat: dual-clip in grpo loss by @ZhiyuLi-Nvidia in #311
- feat: improve eval by @YUki-666 in #325
- fix: sliding_window_overwrite by @ZhiyuLi-Nvidia in #331
- docs: add docs for local concurrent clusters and fix paths by @terrykong in #346
- feat: p...
Release v0.2.0
🚀 Release v0.2.0
⚙️ Advanced Parallelism — FSDP 2, TP & SP for Efficient Training
The headline feature of v0.2 is the new DTensorPolicyWorker.
It enables advanced parallelisms—FSDP 2, Tensor Parallelism, and Sequence Parallelism—letting us scale to 32 B-parameter models.
Enable it via YAML or CLI overrides:
```bash
policy.dtensor_cfg.enabled=True \
policy.dtensor_cfg.tensor_parallel_size=8 \
policy.dtensor_cfg.sequence_parallel=True \
policy.dtensor_cfg.activation_checkpointing=True
```
🧠 Learning Algorithms — DPO (Direct Preference Optimization)
Our algorithm suite now includes DPO, compatible with both FSDP 1 and DTensor.
```bash
uv run examples/run_dpo.py
```
More examples live in the docs.
🔄 Multi-Turn RL — Tool Use, Games & Beyond
We now support multi-turn generation and training with GRPO.
An E2E example of training to play a sliding puzzle game will be available in the next release, but you can try it by cherry-picking this PR: #242
```bash
# 8x80GB GPUs recommended
uv run python examples/run_grpo_sliding_puzzle.py
```
🏋️♂️ Large-Model Support — Native PyTorch up to 32B @ 16k sequence length
FSDP 2 + TP + SP make RL and SFT on 32 B models possible:
```bash
uv run ./examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B.yaml \
  policy.model_name='Qwen/Qwen2.5-32B' \
  policy.generation.vllm_cfg.tensor_parallel_size=4 \
  policy.max_total_sequence_length=16384 \
  cluster.num_nodes=16 \
  policy.dtensor_cfg.enabled=True \
  policy.dtensor_cfg.tensor_parallel_size=8 \
  policy.dtensor_cfg.sequence_parallel=True \
  policy.dtensor_cfg.activation_checkpointing=True
```
Full multi-node walkthrough in the docs.
🛡️ Environment Isolation — Per-Worker Deps with uv
In NeMo RL, workers can now launch cached, isolated uv virtual environments with their own Python dependencies—a setup we’ve found to be significantly faster than Ray’s builtin conda/pip/uv flow. Details here.
🐞 Known Issues
- FSDP 1 gradient-clipping bug — see #251
- Qwen 32 B perf tweaks coming in the next patch
- Gemma3 convergence: #236
- SFT/DPO configs default to FSDP1, which is not recommended for 1B models with tied embeddings. #256. Enabling DTensor manually will resolve the error.
- V100 configuration: #259
- The default SFT and DPO configs in examples/configs set policy.dtensor_cfg.enabled=False, but dtensor must be enabled to run with the default 1B models. Please make sure to set policy.dtensor_cfg.enabled=True when running with the default SFT and DPO configs.
📊 Release Runs
We have provided tensorboard logs to release runs to give you a head start on what to expect from our recipes.
You may download them here and serve them with tensorboard:
```bash
mkdir v0.2.0
tar -xzf release_runs.tar.gz -C v0.2.0/
tensorboard serve --logdir v0.2.0/
```
🚧 Coming soon…: In future releases, we will share a tensorboard viewer to make it easier to view and compare release runs.
What's Changed
- fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
- feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
- ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
- fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
- fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
- fix: Remove reference of tokenizer from generation backend (#75) by @parthchadha in #82
- feat: unit test metric tracking by @terrykong in #40
- fix: unit test error when coverage wasn't specified by @terrykong in #88
- ci: temporarily disable CI on main since PRs must be up to date before merge by @terrykong in #91
- fix: error out early if ray cluster does not have resources by @parthchadha in #89
- ci: skip functional until more capacity available and/or tests speed up by @terrykong in #94
- feat: evaluation implement by @YUki-666 in #16
- fix: gradient should be averaged instead of summed across mbs by @parthchadha in #86
- fix: Use separate step_metric for GPU Monitoring by @yfw in #92
- feat: Update sft config to use single GPU by @ashors1 in #90
- fix: Grammar nit by @SahilJain314 in #98
- feat: add capability to set min/max eps separately as proposed in the… by @parthchadha in #95
- fix: change format messages to out of place by @KiddoZhu in #77
- fix: correct version and use setuptools.dynamic metadata for version/readme by @terrykong in #104
- fix: remove usage of vllm to get device uuid and instead use nvidia-m… by @parthchadha in #105
- fix: Change optional-dependencies to dependency-groups by @hemildesai in #81
- feat: Add support for hydra style overrides by @hemildesai in #80
- fix: Do not initialize reference model for sft by @ashors1 in #71
- fix: change grpo default to use 64 prompts per step and 32 generation… by @parthchadha in #111
- feat: use cuda_graph by default for vllm by @parthchadha in #116
- fix: ensure that we check for pad_token and not assume pad_token==eos… by @parthchadha in #120
- ci: Consolidate tests by @chtruong814 in #27
- feat: support local venvs for dependency isolation by @terrykong in #102
- fix: make message formatting compatible with tokenizers with no bos/eos token by @ashors1 in #118
- fix: reset prefix cache when sleep is called to ensure prefix cache i… by @parthchadha in #112
- ci: Fix unit test summary by @chtruong814 in #128
- fix: fix error padding by @YUki-666 in #87
- feat: Distributed checkpointing by @ashors1 in #99
- ci: Add DCO placeholder check for merge queue by @chtruong814 in #147
- ci: Clarify DCO check in merge_group by @chtruong814 in #154
- fix: host ip resolution uses ray vs socket by @terrykong in #153
- test: Add grpo/reinforce/ppo loss tests (prep for incoming vocab parallel changes) by @SahilJain314 in #162
- fix: always test vllm by @parthchadha in #167
- docs: Fix doc build warnings and add external CI config by @mckimn in #157
- fix: allow configuring ray ports in ray.sub in case conflict on cluster by @terrykong in #173
- feat: support arbitrary end_strings by @YUki-666 in #96
- ci: labels for docs/L0/L1/L2 and run even if only doc test by @terrykong in #181
- fix: don't use cuda-graphs for vllm generation by @parthchadha in #187
- ci: Update to include public/ folder for pages deployment by @mckimn in #182
- docs: run tests with --group test to avoid missing test deps by @terrykong in #188
- fix: default to less verbose logging + uv-venv log once per worker by @terrykong in #141
- docs: Correcting file names by @aschilling-nv in #161
- fix: convert DCP to HF script works without ray cluster by @terrykong in #185
- docs: remove backticks from uv.md title by @terrykong in #179
- feat: add a unique seed for each vllm llm engine by @parthchadha in #171
- fix: unit test script halts on first failure by @terrykong in #189
- feat: Upgrade to vllm v1 runtime by @parthchadha in #170
- ci: R...
Release v0.1.1
Release v0.1.1
Patch release on top of v0.1.0
🛠️ More stable mixed-precision configurations; resolves OOMs observed in Llama 8B
🛠️ Fixes race condition in ray.sub where pyxis can fail if subsequent srun commands are run too early (with --overlap)
What's Changed
- fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
- feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
- ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
- fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
- fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
Known Issues
- gpu memory and utilization in wandb/tensorboard has a bug when enabled. This is tracked in #83
Full Changelog: v0.1.0...v0.1.1
Release v0.1.0
Release v0.1.0
- ✅ Fast Generation - vLLM backend for optimized inference
- ✅ HuggingFace Integration - Works with 1-8B models (Qwen1.5, Llama)
- ✅ Distributed Training - FSDP support and Ray-based infrastructure
- ✅ Environment Support - Support for multi-environment training.
- ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization) and SFT (Supervised Fine-Tuning)
- ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state)
What's Changed
- ci: Add initial GHA by @chtruong814 in #1
- feat: reinforcer initial commit by @terrykong in #3
- Checkpointing fixes by @ashors1 in #9
- docs: Move adding_new_models doc to guides section by @parthchadha in #11
- fix: disable mixed precision training until #13 is resolved by @parthchadha in #14
- docs: Small update to sft documentation by @ashors1 in #12
- ci: Update unit tests to run on self-hosted runner by @chtruong814 in #6
- feat: SFT improvements: refactor and add validation and checkpointing by @ashors1 in #15
- docs: GRPO documentation and Configuration cleanup by @SahilJain314 in #7
- feat: lots of fixes by @terrykong in #17
- feat: Configurable precision by @SahilJain314 in #19
- ci: OPTIONAL -> IS_OPTIONAL by @terrykong in #22
- feat: disable ray usage collection stats be default by @terrykong in #24
- docs: refresh our PR template by @terrykong in #23
- docs: micro doc update with a helpful reminder on environment variables by @SahilJain314 in #20
- fix: disable usage stats more forcefully since container env took precedence by @terrykong in #25
- feat: Enable amp with autocast (fix poor bf16 convergence on GRPO) by @SahilJain314 in #26
- feat: Use openmathinstruct2 training in grpo math example by @parthchadha in #18
- docs: Updated adding models docs to fix latex rendering errors and fix math by @SahilJain314 in #28
- fix: updated stale cluster.md by @terrykong in #30
- feat: SFT convergence run changes by @yfw in #21
- docs: Add SFT quickstart by @ashors1 in #29
- feat: Change vllm frac to 0.6 by @parthchadha in #31
New Contributors
- @chtruong814 made their first contribution in #1
- @terrykong made their first contribution in #3
- @ashors1 made their first contribution in #9
- @parthchadha made their first contribution in #11
- @yfw made their first contribution in #21
Known Issues
- There is a known bug with SFT checkpointing that requires the full model to be gathered on GPU before saving a checkpoint. This causes OOM for larger model sizes. If you run into OOM when checkpointing, disable checkpointing by adding checkpointing.enabled=False to your run command.
Full Changelog: https://github.com/NVIDIA/NeMo-RL/commits/v0.1.0