✨ TL;DR
This paper introduces Bounded Ratio Reinforcement Learning (BRRL), a theoretical framework that bridges the gap between trust region methods and PPO's clipped objective, leading to a new algorithm called Bounded Policy Optimization (BPO) that provides monotonic improvement guarantees while matching or exceeding PPO's performance. The framework also extends to Group-relative BPO (GBPO) for large language model fine-tuning.
Proximal Policy Optimization (PPO) has become the dominant on-policy reinforcement learning algorithm thanks to its empirical success, yet a fundamental disconnect separates the theory of trust region methods from PPO's heuristic clipped objective. As a result, PPO works well in practice, but its theoretical justification is incomplete and the reasons for its success are not fully understood. The field has lacked a principled framework that both explains PPO's effectiveness and provides stronger theoretical guarantees for policy optimization.
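For reference, the heuristic clipped objective in question is PPO's standard surrogate, which takes the pessimistic minimum of the unclipped and ratio-clipped advantage terms. A minimal NumPy sketch (the function name and the default ε = 0.2 are choices made here, not taken from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """PPO's clipped surrogate: E[min(r * A, clip(r, 1-eps, 1+eps) * A)].

    ratio: probability ratios pi_theta(a|s) / pi_old(a|s) for sampled actions
    adv:   advantage estimates for those same actions
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the minimum removes the incentive to push the ratio
    # outside [1-eps, 1+eps] when that would increase the objective.
    return np.minimum(unclipped, clipped).mean()
```

Note that the clipping is one-sided in effect: a ratio of 2.0 with a negative advantage still contributes its full (unclipped) penalty, since the minimum keeps the worse of the two terms.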
The authors develop the Bounded Ratio Reinforcement Learning (BRRL) framework by formulating a novel regularized and constrained policy optimization problem. They derive an analytical optimal solution to this problem and prove that it guarantees monotonic performance improvement. For practical use with parameterized policies, they introduce Bounded Policy Optimization (BPO), which minimizes an advantage-weighted divergence between the current policy and BRRL's analytical optimal solution, and they establish a lower bound on expected performance in terms of the BPO loss. Finally, they extend the approach to Group-relative BPO (GBPO) for fine-tuning large language models.
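The paper's exact BRRL solution and divergence are not reproduced in this summary, but the recipe above can be sketched for a discrete action distribution. The sketch below assumes an exponentiated-advantage target with the probability ratio bounded to [1−ε, 1+ε] and a forward KL as the divergence; the names `brrl_target` and `bpo_loss`, the temperature β, and the specific bounding rule are all illustrative assumptions, not the authors' definitions:

```python
import numpy as np

def brrl_target(pi_old, adv, beta=1.0, eps=0.2):
    """Hypothetical bounded-ratio target policy (illustrative only).

    Tilts the old policy toward high-advantage actions via exp(A / beta),
    bounds the resulting probability ratio to [1-eps, 1+eps], and
    renormalizes. The paper's analytical solution may differ.
    """
    ratio = np.clip(np.exp(adv / beta), 1.0 - eps, 1.0 + eps)
    target = pi_old * ratio
    return target / target.sum()

def bpo_loss(pi_theta, pi_old, adv, beta=1.0, eps=0.2):
    """Illustrative BPO-style loss: forward KL from the bounded target
    to the current policy. Zero iff pi_theta matches the target."""
    target = brrl_target(pi_old, adv, beta, eps)
    return np.sum(target * (np.log(target) - np.log(pi_theta)))
```

Under these assumptions, driving the loss to zero moves the parameterized policy onto the bounded target, so the ratio bound plays the role that clipping plays in PPO while the objective remains a proper divergence rather than a heuristic.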