Implement SimKO to add entropy in TopK token sampling during RL #13
base: dev-updated-again
Conversation
…into add-grpo

# Conflicts:
#   .github/CODEOWNERS
#   .github/workflows/integration_test_8gpu_h100.yaml
#   .github/workflows/integration_test_8gpu_models.yaml
#   .github/workflows/integration_test_8gpu_torchft.yaml
#   torchtitan/components/checkpoint.py
#   torchtitan/experiments/__init__.py
#   torchtitan/models/attention.py
#   torchtitan/models/deepseek_v3/infra/parallelize.py
#   torchtitan/models/llama3/__init__.py
#   torchtitan/models/llama3/model/args.py
#   torchtitan/models/llama3/model/model.py
#   torchtitan/tools/logging.py
#   torchtitan/train.py
add grpo :)
add inference logp IS
feat: qwen3-next
feat: seed-oss support
…ing; add compute_token_entropy and apply_simko_adjustment functions in GRPO step.
    return loss


def compute_token_entropy(pred, mask):
we have a vocab parallel entropy fn in utils, that should be used instead
sure will do in the following commit.
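For reference, a minimal sketch of what a vocab-parallel entropy helper can look like when the vocab dimension of the logits is sharded across a tensor-parallel group. This is not the actual fn in torchtitan's utils; the function name and the `tp_group` argument are assumptions, and it is intended to run under `no_grad` as discussed below.

```python
import torch
import torch.distributed as dist


def vocab_parallel_entropy(logits: torch.Tensor, tp_group=None) -> torch.Tensor:
    """Per-token entropy when `logits` is sharded along the vocab dim.
    Hypothetical sketch; pass tp_group=None for the unsharded case."""
    # Global max across vocab shards for numerical stability.
    local_max = logits.max(dim=-1, keepdim=True).values
    if tp_group is not None:
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX, group=tp_group)
    exp_shifted = (logits - local_max).exp()

    # Partial sums on each shard, then reduce to the global sums.
    sum_exp = exp_shifted.sum(dim=-1, keepdim=True)
    sum_exp_logits = (exp_shifted * logits).sum(dim=-1, keepdim=True)
    if tp_group is not None:
        dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=tp_group)
        dist.all_reduce(sum_exp_logits, op=dist.ReduceOp.SUM, group=tp_group)

    # H = logZ - E_p[logits], with logZ = max + log(sum_v exp(l_v - max)).
    log_z = local_max + sum_exp.log()
    entropy = log_z - sum_exp_logits / sum_exp
    return entropy.squeeze(-1)
```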
def apply_simko_adjustment(
    ratio, pred, labels, reward, mask,
    alpha=0.01, K=3, lambda_top1=1.1, tau_percentile=80
):
this fn does not work with tensor parallel
ok fixing it
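One way to make the top-K step tensor-parallel-safe is to take a local top-K on each vocab shard, all-gather the candidates, and re-select globally. A rough sketch under the assumption of contiguous, equal-size vocab shards; the names are hypothetical and this is not the PR's final fix. Since only the indices/selection matter here, it can also run under `no_grad`.

```python
import torch
import torch.distributed as dist


def vocab_parallel_topk(logits: torch.Tensor, k: int, tp_group=None):
    """Top-k over vocab-sharded logits: local top-k per shard, all-gather
    the candidates, then reduce to the global top-k. Indices are returned
    in the global vocab numbering. Hypothetical sketch."""
    local_vals, local_idx = torch.topk(logits, k=k, dim=-1)
    if tp_group is None:
        return local_vals, local_idx

    rank = dist.get_rank(tp_group)
    world = dist.get_world_size(tp_group)
    # Assumes each rank holds a contiguous, equal-size slice of the vocab.
    local_idx = local_idx + rank * logits.size(-1)

    gathered_vals = [torch.empty_like(local_vals) for _ in range(world)]
    gathered_idx = [torch.empty_like(local_idx) for _ in range(world)]
    dist.all_gather(gathered_vals, local_vals, group=tp_group)
    dist.all_gather(gathered_idx, local_idx, group=tp_group)

    all_vals = torch.cat(gathered_vals, dim=-1)  # [..., k * world]
    all_idx = torch.cat(gathered_idx, dim=-1)
    top_vals, pos = torch.topk(all_vals, k=k, dim=-1)
    top_idx = torch.gather(all_idx, -1, pos)
    return top_vals, top_idx
```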
    Returns:
        adjusted_ratio: Ratio with SimKO adjustments
    """
    # 1. Identify forking tokens (high-entropy)
does this need to be backpropagated through, or can you wrap it with no_grad?
No, this won't need backprop, and I haven't seen the paper mention it anywhere, so yeah, no_grad is the way to go
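Something like the following is what that suggests, assuming the entropy step is only used to build a mask of forking tokens. `compute_token_entropy` and `tau_percentile` come from the PR; the quantile-based thresholding shown here is an assumption about how the percentile is applied.

```python
import torch

# Sketch: forking-token identification without gradient tracking.
with torch.no_grad():
    token_entropy = compute_token_entropy(pred, mask)  # per-token entropy, [B, T]
    # Threshold at the tau_percentile-th percentile over valid tokens.
    tau = torch.quantile(token_entropy[mask.bool()].float(), tau_percentile / 100.0)
    forking_mask = (token_entropy >= tau) & mask.bool()  # high-entropy ("forking") tokens
```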
    # 2. Get top-K tokens and compute their ratios
    _, topk_indices = torch.topk(pred, k=K, dim=-1)
    new_log_probs_full = torch.nn.functional.log_softmax(pred, dim=-1)
this uses up a lot of memory
sure, do you think doing the log_softmax just once would do? If we are doing no_grad on this, it could be reused for the old-policy top-K probs as well
well, this needs grads, since the grads flow through this softmax. we need to combine the way we're currently getting logprobs with the top-k logprob method, as the logprobs we need for the GRPO/GSPO loss are not guaranteed to be within the top-k. the quick and dirty way may be to wrap it in a checkpoint and chunk it, so that we "save" the memory by having it re-execute on backward in manageable chunks. even if it's not optimal, it should still be just a tiny portion of the compute, so we can add it to the backlog as i'm mostly worried about memory 😎
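A sketch of that quick-and-dirty option: chunk the full-vocab log_softmax and wrap each chunk in activation checkpointing, so the log-probs are recomputed on backward rather than kept alive. The function name, chunking over the flattened batch/sequence dim, and chunk size are assumptions, not the PR's final code.

```python
import torch
from torch.utils.checkpoint import checkpoint


def chunked_topk_logprobs(pred: torch.Tensor, k: int, chunk_size: int = 1024):
    """Memory-saving sketch: log_softmax + top-k over chunks of the flattened
    [B*T, V] logits, each wrapped in activation checkpointing so the full-vocab
    log-probs are re-executed on backward instead of being stored."""

    def _chunk_fn(chunk):
        logp = torch.nn.functional.log_softmax(chunk.float(), dim=-1)
        return torch.topk(logp, k=k, dim=-1)  # (values, indices)

    flat = pred.reshape(-1, pred.size(-1))  # [B*T, V]
    vals_out, idx_out = [], []
    for chunk in flat.split(chunk_size, dim=0):
        vals, idx = checkpoint(_chunk_fn, chunk, use_reentrant=False)
        vals_out.append(vals)
        idx_out.append(idx)

    top_vals = torch.cat(vals_out, dim=0).reshape(*pred.shape[:-1], k)
    top_idx = torch.cat(idx_out, dim=0).reshape(*pred.shape[:-1], k)
    # Note: the labels' logprobs still need to be gathered separately,
    # since they are not guaranteed to fall within the top-k.
    return top_vals, top_idx
```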
921ffb2 to 94d6abd
This implementation is directly inspired by this paper: https://arxiv.org/pdf/2510.14807v1
I'm still actively testing it out and will share results; the top-K sampling metrics during training are yet to be logged in W&B.