✨ TL;DR
Platypoos is a planning algorithm for deterministic environments with stochastic rewards that automatically adapts to unknown reward scales and smoothness, with no prior knowledge of discount factors or reward bounds. Its sample complexity is optimal: the upper bound matches the lower bound.
Planning in reinforcement learning environments with deterministic dynamics but stochastic rewards is difficult when the reward function's scale and smoothness are unknown. Traditional planning algorithms typically require reward bounds or the discount factor in advance, which limits their practical applicability: when rewards are unbounded and their scale varies widely, existing methods fail to adapt efficiently and incur suboptimal sample complexity. The problem is compounded by the need to handle discounted returns across different discount factors simultaneously. Without knowing the scale of the reward function or the effective planning horizon implied by the discount factor, an algorithm must either make conservative assumptions that hurt performance or be tuned by hand for each problem instance.
Platypoos is a scale-free adaptive planning algorithm that requires no prior knowledge of reward scales, smoothness parameters, or discount factors. Instead of taking these quantities as inputs, it discovers the relevant scale and smoothness of the rewards through sampling during planning, and it operates across a broad range of discount factors and reward scales simultaneously. This scale-free property means the algorithm's performance guarantees hold uniformly across problem instances without manual parameter tuning. The method is designed specifically for deterministic dynamics, exploiting that structure to achieve improved sample complexity.
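To make the scale-free idea concrete, here is a minimal, hypothetical sketch of optimistic planning on a deterministic action tree with noisy rewards. This is not the paper's actual procedure: the function name, the UCB-style rollout, and the empirical-range confidence width are all illustrative assumptions. The one idea it demonstrates is that confidence widths can be scaled by the reward range *observed so far* rather than by a reward bound supplied in advance.

```python
import math
import random

def platypoos_sketch(reward_fn, actions, depth, gamma, budget, seed=0):
    """Hypothetical scale-free planning sketch (not the paper's algorithm).

    reward_fn(path, rng) returns a stochastic reward for reaching `path`,
    a tuple of actions; dynamics are deterministic, so a path identifies
    a state. No reward bound or scale is given to the planner.
    """
    rng = random.Random(seed)
    stats = {}                       # path -> (visit count, running mean)
    r_min, r_max = math.inf, -math.inf

    def sample(path):
        nonlocal r_min, r_max
        r = reward_fn(path, rng)
        n, m = stats.get(path, (0, 0.0))
        n += 1
        m += (r - m) / n             # incremental mean update
        stats[path] = (n, m)
        r_min, r_max = min(r_min, r), max(r_max, r)

    def ucb(path, t):
        n, m = stats.get(path, (0, 0.0))
        if n == 0:
            return math.inf          # unvisited nodes are maximally optimistic
        # Scale-free hedge: width uses the *empirical* reward range,
        # not a known bound on rewards.
        scale = max(r_max - r_min, 1e-12)
        return m + scale * math.sqrt(2.0 * math.log(t + 1) / n)

    for t in range(budget):
        # Roll out one path, choosing optimistically at each depth.
        path = ()
        for _ in range(depth):
            best = max(actions, key=lambda a: ucb(path + (a,), t))
            path = path + (best,)
            sample(path)

    # Extract the greedy (mean-based) path and its discounted value.
    path, total = (), 0.0
    for d in range(depth):
        best = max(actions,
                   key=lambda a: stats.get(path + (a,), (0, -math.inf))[1])
        path = path + (best,)
        total += (gamma ** d) * stats.get(path, (0, 0.0))[1]
    return path, total
```

Because the confidence width tracks `r_max - r_min`, the same code behaves sensibly whether rewards live in [0, 1] or in [0, 1000], which is the uniformity-over-scales property the prose describes (here shown only informally, without the paper's guarantees).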