✨ TL;DR
This paper investigates when reinforcement learning with verifiable rewards (RLVR) enables large language models to generalize under weak supervision (scarce data, noisy rewards, or self-supervised signals). The key finding is that models generalize when they exhibit prolonged pre-saturation training dynamics, which is predicted by reasoning faithfulness—the degree to which intermediate reasoning steps logically support final answers.
Large language models have improved reasoning through reinforcement learning with verifiable rewards, but creating high-quality reward signals becomes harder as models advance. Understanding when RLVR succeeds with weaker supervision is critical for scaling these methods. The paper addresses three challenging weak supervision scenarios: limited training data, noisy reward signals, and self-supervised proxy rewards that may not perfectly align with true task objectives. Without a clear understanding of what enables generalization in these settings, practitioners risk training models that memorize training patterns rather than learning generalizable reasoning strategies.
The authors conduct a systematic empirical study across multiple model families and reasoning domains under three weak supervision conditions. They analyze training dynamics by tracking how training reward saturation relates to downstream generalization performance. The study examines pre-RL model properties, specifically reasoning faithfulness (whether intermediate steps logically support final answers) and output diversity, to identify predictors of generalization success. They then disentangle the effects of continual pre-training on domain data from those of supervised fine-tuning on explicit reasoning traces. Finally, they validate their findings by applying the identified interventions to Llama3.2-3B-Base, transforming a non-generalizing model into one that succeeds across all weak supervision settings.
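To make the "pre-saturation dynamics" analysis concrete, here is a minimal sketch of how one might locate the saturation point of a training reward curve: the first step where the smoothed reward settles near its final plateau. This is an illustrative toy, not the paper's code; the function name, smoothing window, and tolerance are all assumptions.

```python
# Hypothetical illustration (not the paper's implementation): find the
# "saturation step" of a training reward curve -- the first step where the
# moving-average reward is within a tolerance of its final plateau value.
# A longer pre-saturation phase is the dynamic the paper links to
# generalization under weak supervision.

def saturation_step(rewards, window=5, tol=0.05):
    """Return the index (in smoothed steps) where training reward saturates,
    or None if the curve is shorter than the smoothing window."""
    if len(rewards) < window:
        return None
    # simple moving average to damp step-to-step noise
    smoothed = [
        sum(rewards[i : i + window]) / window
        for i in range(len(rewards) - window + 1)
    ]
    plateau = smoothed[-1]  # treat the final smoothed value as the plateau
    for i, r in enumerate(smoothed):
        if abs(r - plateau) <= tol:
            return i
    return None

# Toy curves: a run that saturates almost immediately vs. one with a
# prolonged pre-saturation phase (the generalizing regime).
fast = [0.1, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
slow = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]

print(saturation_step(fast))  # saturates early
print(saturation_step(slow))  # saturates much later
```

Under the paper's finding, the `slow`-style curve (late saturation) is the profile associated with models that generalize, while `fast`-style curves signal memorization of training patterns.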