EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control
The Hong Kong University of Science and Technology
The 1.5B-parameter model trained with EntroPIC sets a new state-of-the-art (SOTA) among models of its size, surpassing strong baselines and achieving the best results on both pass@1 and pass@N evaluations.
- Model Access: You can find and use the model at: https://huggingface.co/yangkaiSIGS/EntroPIC-Nemotron-1.5b
- Training Curves: View the training logs and curves on Weights & Biases: https://wandb.ai/1658198604/entropy?nw=nwuser1658198604
Long-term training of LLMs requires maintaining stable exploration to prevent collapse into sub-optimal behaviors.
Entropy plays a key role in this process by regulating exploration and preventing premature convergence.
However, RL methods often struggle to maintain an appropriate entropy level, since positive and negative samples affect entropy in opposite ways during training.
We introduce EntroPIC (Entropy stabilization via Proportional-Integral Control) — a simple yet effective approach that dynamically balances the influence of positive and negative samples through adaptive loss weighting.
⭐ EntroPIC provides a principled and lightweight entropy control mechanism, enabling more efficient exploration, smoother optimization, and long-term stability in LLM reinforcement learning.
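Concretely, the adaptive weighting follows a standard proportional-integral (PI) update on the entropy error. As a sketch in simplified notation (see the implementation snippets below for the exact form), the control coefficient at step $t$ is

$$
\alpha_t \;=\; K_p\,\big(H_t - H^{*}\big) \;+\; K_i \sum_{\tau < t} \big(H_\tau - H^{*}\big),
$$

where $H_t$ is the current policy entropy, $H^{*}$ is the target entropy, and $K_p$, $K_i$ are the proportional and integral gains. The coefficient $\alpha_t$ is clipped to $[-1, 1]$ and used to reweight the contributions of positive and negative samples in the policy loss, steering the entropy back toward $H^{*}$.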
Follow the steps below to start single-machine training with Qwen3-8B-Base using the EntroPIC setup.
- Edit `run_entropic.sh`:
  - Replace `$TRAIN_DATASET_PATH` with your training dataset path.
  - Replace `$VALIDATION_DATASET_PATH` with your validation dataset path.
  - Replace `$YOUR_WANDB_API_KEY` with your personal Weights & Biases API key.
- Start training:

```bash
bash run_entropic.sh
```
We perform comprehensive evaluations on both Reasoning Models (Chain-of-Thought based) and Standard Models (Answer-only).
We apply EntroPIC to OpenReasoning-Nemotron-1.5B. The resulting model achieves state-of-the-art performance among 1.5B-parameter models, outperforming strong baselines such as QuestA and JustRL on challenging mathematical benchmarks while maintaining robust generalization.
Comparison on Mathematical Benchmarks (Avg@N)
| Models | Math | AMC | AIME24 | AIME25 | Olympiad | Minerva | HMMT | BRUMO | CMIMC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | 88.7 | 86.6 | 51.7 | 46.4 | 62.1 | 25.5 | 30.9 | 49.6 | 26.8 | 52.0 |
| QuestA | 93.2 | 94.1 | 72.5 | 63.1 | 71.1 | 25.3 | 42.1 | 70.0 | 42.1 | 63.7 |
| JustRL | 94.2 | 95.4 | 69.6 | 61.5 | 70.5 | 23.9 | 37.5 | 67.2 | 39.2 | 62.1 |
| EntroPIC | 93.2 | 96.4 | 74.9 | 68.3 | 70.1 | 36.4 | 42.7 | 63.8 | 43.0 | 65.4 |
Robust Generalization
Unlike other RL methods that suffer from "alignment tax" (forgetting general capabilities), EntroPIC improves performance on non-mathematical tasks.
For standard (answer-only) models, we report results based on Qwen3-8B-Base.
Evaluation uses DeepScaler and IFEval protocols.
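For reference, the metrics reported below are read as follows (a minimal sketch, assuming avg@N is the mean accuracy over N sampled solutions per problem and pass@N credits a problem if any of the N samples is correct):

```python
# Hypothetical helpers illustrating avg@N vs. pass@N for a single problem;
# `correct` holds one boolean per sampled solution.
def avg_at_n(correct: list[bool]) -> float:
    # Fraction of the N samples that are correct.
    return sum(correct) / len(correct)

def pass_at_n(correct: list[bool]) -> float:
    # 1.0 if at least one of the N samples is correct, else 0.0.
    return float(any(correct))

# Example: 3 of 8 samples correct -> avg@8 = 0.375, pass@8 = 1.0
samples = [True, False, False, True, False, True, False, False]
print(avg_at_n(samples), pass_at_n(samples))
```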
On-policy Training Results

All entries are reported as avg@N / pass@N.

| Models | Math | AMC | AIME24 | AIME25 | OlympiadBench | Omni-MATH | Overall |
|---|---|---|---|---|---|---|---|
| Initial Model | 86.1 / 97.0 | 58.4 / 81.6 | 23.4 / 60.0 | 23.0 / 53.0 | 49.9 / 68.7 | 32.0 / 49.3 | 45.5 / 68.3 |
| GRPO | 91.2 / 97.4 | 75.1 / 88.0 | 34.3 / 70.0 | 31.0 / 53.3 | 59.1 / 72.7 | 40.7 / 57.6 | 55.2 / 73.2 |
| NSR | 91.5 / 96.4 | 74.1 / 89.2 | 34.7 / 63.3 | 30.0 / 46.7 | 58.5 / 71.3 | 39.7 / 56.2 | 54.8 / 70.5 |
| AEC | 92.5 / 97.8 | 77.6 / 89.2 | 37.1 / 73.3 | 31.6 / 60.0 | 60.9 / 72.5 | 42.0 / 58.5 | 56.9 / 75.2 |
| EntroPIC | 92.4 / 97.2 | 80.1 / 91.6 | 42.3 / 76.7 | 34.6 / 66.7 | 60.0 / 71.3 | 42.7 / 58.4 | 58.7 / 77.0 |
Off-policy Training Results

All entries are reported as avg@N / pass@N. EntroPIC (P) uses only the proportional term of the controller, while EntroPIC (PI) uses the full proportional-integral controller.

| Models | Math | AMC | AIME24 | AIME25 | OlympiadBench | Omni-MATH | Overall |
|---|---|---|---|---|---|---|---|
| GRPO | 88.7 / 93.6 | 64.0 / 87.9 | 28.9 / 63.3 | 25.5 / 50.0 | 53.2 / 69.0 | 35.3 / 52.4 | 49.3 / 69.4 |
| EntroPIC (P) | 89.8 / 96.4 | 67.8 / 90.4 | 34.8 / 66.7 | 27.5 / 53.3 | 56.4 / 71.1 | 36.6 / 54.9 | 52.2 / 72.2 |
| EntroPIC (PI) | 91.9 / 97.0 | 75.3 / 90.4 | 34.7 / 70.0 | 27.6 / 53.3 | 58.8 / 71.9 | 40.0 / 56.8 | 54.7 / 73.2 |
```python
# Aggregate the per-token entropy over response tokens (the measured entropy)
entropy_loss = agg_loss(loss_mat=entropy, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)
entropy_loss_item = entropy_loss.detach().item()

if self.config.target_entropy >= 0:
    # Integral term: accumulated entropy error, scaled by the integral gain K_i
    control_alpha_i = self.accumulate_entropy_error * K_i
    # Proportional term: current entropy error, scaled by the proportional gain K_p
    control_alpha_p = (entropy_loss_item - self.config.target_entropy) * K_p
    # Combine both terms and clip the control signal to a safe range
    control_alpha = control_alpha_i + control_alpha_p
    control_alpha = np.clip(control_alpha, -1.0, 1.0)
    # Accumulate the entropy error, weighted by this batch's share of the PPO mini-batch
    self.accumulate_entropy_error += (
        entropy_loss_item - self.config.target_entropy
    ) * response_mask.shape[0] / self.config.ppo_mini_batch_size
else:
    # A negative target_entropy disables entropy control
    control_alpha = 0.0
# Compute the policy-gradient loss, passing the PI control signal through
pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    response_mask=response_mask,
    loss_agg_mode=loss_agg_mode,
    config=self.config,
    rollout_is_weights=rollout_is_weights,
    control_alpha=control_alpha,
)
```

Inside the policy loss function, `control_alpha` reweights the loss on high-probability tokens:

```python
def compute_policy_loss_entropic(..., control_alpha):
    pg_loss = ...
    # EntroPIC adjustment: select response tokens whose probability exceeds the threshold
    prob_high = (log_prob > torch.log(torch.tensor(high_prob_thresh, device=log_prob.device)))
    prob_mask = response_mask * prob_high
    # Weight the adjustment by the PI control signal and the advantage magnitude
    pg_loss_adjust = control_alpha * agg_loss(
        -ratio * advantages.abs(),
        loss_mask=prob_mask,
        loss_agg_mode=loss_agg_mode,
    )
    pg_loss = pg_loss + pg_loss_adjust
```
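As a further illustration, here is a minimal, self-contained sketch of the control loop in isolation. It is not the training code: the entropy dynamics, gain values, and step size below are purely illustrative assumptions, meant only to show how the proportional and integral terms drive the entropy toward its target.

```python
# Toy simulation of the PI entropy controller (illustrative values, not the training setup).
import numpy as np

target_entropy = 0.3   # target entropy H* (assumed value)
K_p, K_i = 0.5, 0.05   # proportional / integral gains (assumed values)

entropy = 1.0           # toy "current entropy" of the policy
accumulated_error = 0.0

for step in range(50):
    error = entropy - target_entropy
    # PI control signal, clipped to [-1, 1] as in the implementation above
    control_alpha = float(np.clip(K_p * error + K_i * accumulated_error, -1.0, 1.0))
    accumulated_error += error
    # Toy dynamics: a positive control signal suppresses entropy, a negative one raises it.
    entropy -= 0.2 * control_alpha
    if step % 10 == 0:
        print(f"step {step:2d}: entropy={entropy:.3f}, control_alpha={control_alpha:.3f}")
```

With these assumed gains the simulated entropy settles near the target within a few dozen steps, which is the qualitative behavior EntroPIC aims for during long-term training.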
If you have any questions or would like to discuss collaboration, please feel free to contact:

- Kai Yang — [email protected]
- Saiyong Yang — [email protected]
If you find our work helpful for your research, please consider citing our paper:
@article{yang2025entropic,
title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
journal={arXiv preprint arXiv:2511.15248},
year={2025}
}


