
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

[arXiv](https://arxiv.org/abs/2511.15248) · Hugging Face · WandB

Tencent Hunyuan · The Hong Kong University of Science and Technology

🏆 Highlight

The 1.5B-parameter model trained with EntroPIC sets a new state of the art among models of this size, surpassing top baselines on both pass@1 and pass@N evaluations.

*Figure: reasoning performance of the EntroPIC-trained 1.5B model versus baselines.*

🧠 Overview

*Figure: overview of entropy control during training.*

Long-term training of LLMs requires maintaining stable exploration to prevent collapse into sub-optimal behaviors.
Entropy plays a key role in this process by regulating exploration and preventing premature convergence.

However, RL methods often struggle to maintain an appropriate entropy level, since positive and negative samples affect entropy in opposite ways during training.

We introduce EntroPIC (Entropy stabilization via Proportional-Integral Control) — a simple yet effective approach that dynamically balances the influence of positive and negative samples through adaptive loss weighting.

EntroPIC provides a principled and lightweight entropy control mechanism, enabling more efficient exploration, smoother optimization, and long-term stability in LLM reinforcement learning.
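
Concretely, here is a sketch of the controller consistent with the code excerpts later in this README. Writing $H_t$ for the measured entropy at update $t$, $H^{*}$ for the target, and $K_p$, $K_i$ for the controller gains, the loss-weighting coefficient is

```math
e_t = H_t - H^{*}, \qquad
\alpha_t = \mathrm{clip}\Big(K_p\, e_t + K_i \sum_{\tau \le t} e_\tau,\; -1,\; 1\Big)
```

so a persistent entropy error keeps growing the integral term until the policy is steered back toward the target.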

*Figure: evolution of the entropy control coefficient over training.*


🚀 Quick Start

Follow the steps below to start single-machine training with Qwen3-8B-Base using the EntroPIC setup.

1. Edit `run_entropic.sh` (a hypothetical filled-in example follows these steps):
   - Replace `$TRAIN_DATASET_PATH` with your training dataset path.
   - Replace `$VALIDATION_DATASET_PATH` with your validation dataset path.
   - Replace `$YOUR_WANDB_API_KEY` with your personal Weights & Biases API key.
2. Start training:

   ```bash
   bash run_entropic.sh
   ```
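
For reference, a filled-in version of those placeholders might look like the following (all paths and the key are made-up examples, not values shipped with the repo):

```bash
# Hypothetical edit of run_entropic.sh — example values only.
TRAIN_DATASET_PATH=/data/math/train.parquet
VALIDATION_DATASET_PATH=/data/math/valid.parquet
YOUR_WANDB_API_KEY="<your-wandb-api-key>"
```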

📊 Evaluation

We perform comprehensive evaluations on both Reasoning Models (Chain-of-Thought based) and Standard Models (Answer-only).

🏆 Reasoning Model Results (Nemotron-1.5B)

We applied EntroPIC to OpenReasoning-Nemotron-1.5B. The model demonstrates state-of-the-art performance among 1.5B parameter models, outperforming strong baselines like QuestA and JustRL on challenging mathematical benchmarks while maintaining robust generalization.

Comparison on Mathematical Benchmarks (Avg@N)

| Models | Math | AMC | AIME24 | AIME25 | Olympiad | Minerva | HMMT | BRUMO | CMIMC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | 88.7 | 86.6 | 51.7 | 46.4 | 62.1 | 25.5 | 30.9 | 49.6 | 26.8 | 52.0 |
| QuestA | 93.2 | 94.1 | 72.5 | 63.1 | 71.1 | 25.3 | 42.1 | 70.0 | 42.1 | 63.7 |
| JustRL | 94.2 | 95.4 | 69.6 | 61.5 | 70.5 | 23.9 | 37.5 | 67.2 | 39.2 | 62.1 |
| EntroPIC | 93.2 | 96.4 | 74.9 | 68.3 | 70.1 | 36.4 | 42.7 | 63.8 | 43.0 | 65.4 |

**Robust Generalization.** Unlike other RL methods, which suffer from an "alignment tax" (forgetting general capabilities), EntroPIC improves performance on non-mathematical tasks.

*Figure: generalization performance on non-mathematical benchmarks.*

🧩 Non-thinking Model Results (Qwen3-8B-Base)

For non-reasoning models, we report results on Qwen3-8B-Base.
Evaluation follows the DeepScaler and IFEval protocols.

On-policy Training Results

All cells report avg@N / pass@N.

| Models | Math | AMC | AIME24 | AIME25 | OlympiadBench | Omni-math | Overall |
|---|---|---|---|---|---|---|---|
| Initial Model | 86.1 / 97.0 | 58.4 / 81.6 | 23.4 / 60.0 | 23.0 / 53.0 | 49.9 / 68.7 | 32.0 / 49.3 | 45.5 / 68.3 |
| GRPO | 91.2 / 97.4 | 75.1 / 88.0 | 34.3 / 70.0 | 31.0 / 53.3 | 59.1 / 72.7 | 40.7 / 57.6 | 55.2 / 73.2 |
| NSR | 91.5 / 96.4 | 74.1 / 89.2 | 34.7 / 63.3 | 30.0 / 46.7 | 58.5 / 71.3 | 39.7 / 56.2 | 54.8 / 70.5 |
| AEC | 92.5 / 97.8 | 77.6 / 89.2 | 37.1 / 73.3 | 31.6 / 60.0 | 60.9 / 72.5 | 42.0 / 58.5 | 56.9 / 75.2 |
| EntroPIC | 92.4 / 97.2 | 80.1 / 91.6 | 42.3 / 76.7 | 34.6 / 66.7 | 60.0 / 71.3 | 42.7 / 58.4 | 58.7 / 77.0 |

Off-policy Training Results

All cells report avg@N / pass@N.

| Models | Math | AMC | AIME24 | AIME25 | OlympiadBench | Omni-math | Overall |
|---|---|---|---|---|---|---|---|
| GRPO | 88.7 / 93.6 | 64.0 / 87.9 | 28.9 / 63.3 | 25.5 / 50.0 | 53.2 / 69.0 | 35.3 / 52.4 | 49.3 / 69.4 |
| EntroPIC (P) | 89.8 / 96.4 | 67.8 / 90.4 | 34.8 / 66.7 | 27.5 / 53.3 | 56.4 / 71.1 | 36.6 / 54.9 | 52.2 / 72.2 |
| EntroPIC (PI) | 91.9 / 97.0 | 75.3 / 90.4 | 34.7 / 70.0 | 27.6 / 53.3 | 58.8 / 71.9 | 40.0 / 56.8 | 54.7 / 73.2 |

*Figures: training curves for entropy, reward, and evaluation accuracy.*


⚙️ Key Code Modifications

`dp_actor.py`: Computing the Entropy Control Parameter α via PI Control

```python
# Excerpt from dp_actor.py (assumes `import numpy as np`; K_p and K_i are the
# proportional and integral gains of the PI controller).

# Measure the current policy entropy over the response tokens.
entropy_loss = agg_loss(loss_mat=entropy, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)
entropy_loss_item = entropy_loss.detach().item()

if self.config.target_entropy >= 0:
    # Integral term: scaled sum of past entropy errors.
    control_alpha_i = self.accumulate_entropy_error * K_i
    # Proportional term: current deviation from the target entropy.
    control_alpha_p = (entropy_loss_item - self.config.target_entropy) * K_p
    control_alpha = control_alpha_i + control_alpha_p
    control_alpha = np.clip(control_alpha, -1.0, 1.0)
    # Accumulate the current error, scaled by the mini-batch fraction it represents.
    self.accumulate_entropy_error += (
        entropy_loss_item - self.config.target_entropy
    ) * response_mask.shape[0] / self.config.ppo_mini_batch_size
else:
    # A negative target entropy disables the controller.
    control_alpha = 0.0

pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    response_mask=response_mask,
    loss_agg_mode=loss_agg_mode,
    config=self.config,
    rollout_is_weights=rollout_is_weights,
    control_alpha=control_alpha,
)
```
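
To see the controller in isolation, here is a minimal, self-contained toy loop. It is illustrative only: the gains, the target, and the linear response of entropy to α are made-up stand-ins, not the repo's settings.

```python
import numpy as np

# Hypothetical gains and target, chosen only for this demo.
K_p, K_i = 1.0, 0.05
target_entropy = 0.2

entropy = 1.0            # stand-in for the measured policy entropy
accumulated_error = 0.0  # integral state of the controller

for step in range(100):
    error = entropy - target_entropy
    alpha = np.clip(K_p * error + K_i * accumulated_error, -1.0, 1.0)
    accumulated_error += error
    # Stand-in dynamics: a positive alpha sharpens the policy and lowers
    # entropy; a negative alpha does the opposite.
    entropy -= 0.1 * alpha

# After enough steps, entropy should settle near the target.
print(f"final entropy = {entropy:.3f} (target {target_entropy})")
```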

`core_algos`: Adjusting Update Weights with the Control Coefficient

```python
def compute_policy_loss_entropic(..., control_alpha):
    pg_loss = ...  # standard clipped policy-gradient loss, unchanged

    # EntroPIC adjustment: act only on tokens the current policy already
    # assigns high probability (above `high_prob_thresh`).
    prob_high = log_prob > torch.log(torch.tensor(high_prob_thresh, device=log_prob.device))
    prob_mask = response_mask * prob_high

    # Extra loss term scaled by the PI controller's output `control_alpha`
    # and weighted by each masked token's advantage magnitude.
    pg_loss_adjust = control_alpha * agg_loss(
        -ratio * advantages.abs(),
        loss_mask=prob_mask,
        loss_agg_mode=loss_agg_mode,
    )
    pg_loss = pg_loss + pg_loss_adjust
```
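
In effect, when the measured entropy sits above the target, `control_alpha` is positive and the extra term rewards raising the likelihood of already high-probability tokens, which concentrates probability mass and lowers entropy; when entropy drops below the target, the sign flips and the same term pushes mass away from those tokens, raising entropy. As in standard PPO-style losses, `ratio` here denotes the importance ratio `exp(log_prob - old_log_prob)`.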

📮 Contact

If you have any questions or would like to discuss collaboration, please feel free to contact:
Kai Yang — [email protected]
Saiyong Yang — [email protected]

📚 Citation

If you find our work helpful for your research, please consider citing our paper:

```bibtex
@article{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
  journal={arXiv preprint arXiv:2511.15248},
  year={2025}
}
```
