✨ TL;DR
This paper analyzes active sequential prediction-powered mean estimation, where ground-truth labels are selectively queried and ML predictions fill in the gaps. The authors find that, contrary to intuition, a nearly constant query probability (one that largely ignores model uncertainty) often yields tighter confidence intervals than adaptive uncertainty-based querying.
In prediction-powered inference, researchers estimate population means using a combination of expensive ground-truth labels and cheap ML model predictions. The key challenge is deciding when to query a true label versus relying on the prediction. Prior work proposed mixing an uncertainty-based adaptive query strategy with a constant baseline probability, but the optimal mixing weight remained unclear. The fundamental question is how to allocate a limited labeling budget across sequential samples so as to minimize the width of the confidence interval around the mean estimate.
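To make the setup concrete, here is a minimal sketch of a prediction-powered mean estimate with selective labeling. This is not the paper's exact estimator; it uses the standard inverse-probability-weighted correction, and all names and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_pp_mean(y, preds, query_prob, rng):
    """Sketch: prediction-powered mean estimate with Bernoulli label queries.

    Each sample contributes its model prediction plus an inverse-probability-
    weighted correction on the randomly queried samples, which keeps the
    estimate unbiased even when the predictions are systematically off.
    """
    n = len(y)
    queried = rng.random(n) < query_prob                # which labels we pay for
    correction = np.where(queried, (y - preds) / query_prob, 0.0)
    return np.mean(preds + correction)

# Toy stream: population mean 1.0, predictions biased upward by 0.3.
y = rng.normal(1.0, 1.0, size=10_000)
preds = y + 0.3 + rng.normal(0.0, 0.2, size=10_000)

est = active_pp_mean(y, preds, query_prob=0.2, rng=rng)
```

Despite labeling only about 20% of the stream and using a biased model, the correction term removes the bias; the query probability controls the variance of that correction, which is exactly what the interval-width analysis turns on.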
The authors conduct both theoretical and empirical analysis of the query-probability selection mechanism. They develop a non-asymptotic analysis that yields data-dependent bounds on the confidence interval for the mean estimator, and they examine how different mixing weights between uncertainty-based and constant query probabilities affect performance. Additionally, they analyze what happens when a no-regret learning algorithm adaptively chooses the query probability, treating the confidence bound as a loss to minimize over time.
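The no-regret idea can be sketched with a generic multiplicative-weights (Hedge) update over a grid of candidate query probabilities. The loss below is a hypothetical proxy for the confidence-bound width (inverse-probability-weighted prediction-error variance plus a labeling-cost term), not the paper's actual bound; the grid, learning rate, and cost are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

grid = np.array([0.05, 0.1, 0.2, 0.4, 0.8])   # candidate query probabilities
weights = np.ones_like(grid)                   # Hedge weights over candidates
eta, cost = 0.1, 1.0                           # learning rate, per-label cost

for t in range(1000):
    err2 = rng.normal(0.0, 0.5) ** 2           # squared prediction error (toy stream)
    losses = err2 / grid + cost * grid         # width proxy: variance term + label cost
    weights *= np.exp(-eta * losses)           # multiplicative-weights update
    weights /= weights.sum()                   # renormalize for numerical stability

chosen = grid[np.argmax(weights)]              # probability the learner converges to
```

Under this proxy loss the weights concentrate on a single fixed probability, which mirrors the paper's observation: when the bound being minimized does not reward uncertainty-adaptive queries much, the learner effectively settles on a near-constant query rate.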