✨ TL;DR
This paper addresses a critical problem in reinforcement learning for large language models: when base models are already very accurate on training benchmarks, standard RL methods fail because there aren't enough errors to learn from, causing models to collapse into repetitive solutions. The authors propose CUTS, a novel sampling strategy that maintains solution diversity even when models are highly accurate, improving generalization on challenging out-of-domain math problems by up to 15.1%.
As large language models become stronger, they increasingly saturate standard reasoning benchmarks like MATH, producing correct but nearly identical solutions. This creates a paradox for reinforcement learning: group-relative algorithms like GRPO rely on comparing good and bad solutions to compute advantage signals that guide learning. When models are already highly accurate, there are few failure cases to learn from, causing the advantage signal to vanish. This leads to mode collapse where the policy degenerates into producing homogeneous, repetitive solutions rather than exploring diverse reasoning paths. The fundamental issue is that traditional RL approaches designed for weaker models break down when applied to already-capable base models, preventing further improvement and limiting out-of-domain generalization.
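The vanishing-advantage problem described above is easy to see numerically. Below is a minimal sketch of a GRPO-style group-relative advantage (each rollout's reward z-scored within its group), assuming binary correctness rewards; the function name and the `eps` stabilizer are illustrative, not the paper's exact formulation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: z-score each rollout's reward within its group.
    `eps` avoids division by zero when all rewards are identical."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A weaker model with mixed outcomes yields a usable learning signal:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1.0, -1.0, 1.0, -1.0]

# A near-saturated model where every rollout is correct: the group mean
# equals every reward, so the advantage signal vanishes entirely.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # → [0.0, 0.0, 0.0, 0.0]
```

With zero advantages, the policy gradient is zero for the whole group, so an already-accurate model receives no pressure to change, which is exactly the failure mode the paper targets.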
The authors propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy that enforces exploration while maintaining solution quality. Unlike standard sampling methods that follow the model's probability distribution (and thus its biases), CUTS flattens the local optimization landscape by sampling uniformly from a constrained set of high-confidence candidate tokens. This preserves structural diversity in generated solutions without sacrificing correctness. They integrate CUTS into Mixed-CUTS, a training framework that combines exploitative rollouts (using standard sampling) with exploratory rollouts (using CUTS). Mixing the two rollout types amplifies intra-group advantage variance, providing meaningful learning signals even when most solutions are correct. The method maintains diversity within the semantic manifold of valid solutions, allowing the model to explore different reasoning paths while staying within the space of correct answers.
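The contrast between the two sampling modes can be sketched as follows. This is a simplified approximation, not the paper's implementation: the summary does not specify how the constrained candidate set is built (CUTS is described as parameter-free), so the fixed `k` here is an illustrative stand-in for that constraint, and both function names are hypothetical.

```python
import math
import random

def standard_sample(logits, rng=random):
    """Standard sampling: draw a token index proportionally to softmax
    probabilities, so the model's own biases shape which token is picked."""
    m = max(logits)                                   # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

def cuts_sample(logits, k=5, rng=random):
    """CUTS-style sampling (sketch): restrict to the k highest-confidence
    tokens, then pick UNIFORMLY among them. Within the candidate set, the
    probability landscape is flattened, so near-tied alternatives are
    explored equally instead of the single dominant token being repeated."""
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return rng.choice(top_k)
```

In a Mixed-CUTS-style rollout group, some completions would be generated token-by-token with `standard_sample` (exploitation) and others with `cuts_sample` (exploration); the exploratory rollouts introduce structurally different solutions, which restores reward variance within the group.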