✨ TL;DR
This paper develops a principled reinforcement learning framework for deciding when language models should stop generating text mid-response to avoid wasting compute on incorrect outputs. The approach uses value function estimation to determine optimal stopping points, improving accuracy-compute tradeoffs compared to existing methods.
Large language models using chain-of-thought reasoning frequently expend significant computational resources generating long responses that ultimately prove incorrect. While abstention (refusing to answer) can help by withholding low-confidence outputs, most existing methods make this decision either before generation begins or after it completes. Dynamic mid-generation abstention, which considers early termination at each token position during generation, has been explored empirically but lacks principled theoretical guidance for when to abstain. Without a formal framework, it remains unclear how to optimally balance the tradeoff between computational cost and the value of continuing generation.
The authors formalize dynamic abstention as an explicit action within a regularized reinforcement learning framework. They model generation as a sequential decision process in which, at each token position, the model chooses either to continue generating or to abstain. An abstention reward parameter explicitly controls the tradeoff between compute cost and information gain. The key theoretical contribution is proving that abstaining when the value function (the expected future reward) falls below the abstention reward strictly outperforms natural baseline strategies under general conditions. To make this practical, they derive an efficient method to approximate the value function during generation, enabling real-time abstention decisions without requiring expensive rollouts or a separate value model.
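The stopping rule can be sketched as a simple decoding loop: at each position, compare an estimate of the value of continuing against the abstention reward, and stop early if continuing is no longer worth it. This is a minimal illustration, not the paper's implementation; `step_fn`, `value_fn`, and `abstain_reward` are hypothetical stand-ins for the model's next-token sampler, the efficient value approximation the authors derive, and the abstention reward parameter.

```python
def generate_with_abstention(step_fn, value_fn, abstain_reward, max_tokens=256):
    """Greedy decoding with dynamic mid-generation abstention.

    step_fn(tokens)  -> next token, or None at end-of-sequence (hypothetical)
    value_fn(tokens) -> estimated expected future reward V(s) (hypothetical)
    abstain_reward   -> scalar controlling the compute/information tradeoff
    Returns (tokens, status) where status is "abstained" or "completed".
    """
    tokens = []
    for _ in range(max_tokens):
        # The paper's rule: abstain as soon as the estimated value of
        # continuing falls below the abstention reward.
        if value_fn(tokens) < abstain_reward:
            return tokens, "abstained"
        tok = step_fn(tokens)
        if tok is None:
            return tokens, "completed"
        tokens.append(tok)
    return tokens, "completed"
```

Raising `abstain_reward` makes the policy abstain earlier (saving compute on likely-incorrect responses), while lowering it lets generation run longer; the comparison against the value estimate is exactly the tradeoff the abstention reward parameter controls.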