✨ TL;DR
AdaLeZO accelerates zeroth-order optimization for fine-tuning large language models by intelligently sampling layers based on their sensitivity rather than uniformly perturbing all parameters. This adaptive approach achieves 1.7-3.0x speedup over existing methods while maintaining memory efficiency and acting as a universal plug-in for existing ZO optimizers.
Zeroth-order (ZO) optimization offers a memory-efficient alternative to backpropagation for fine-tuning large language models, since it relies only on forward passes. However, its practical deployment faces two severe challenges: slow wall-clock convergence and high gradient-estimation variance, which together make it impractical for real-world applications. The authors identify that over 40% of training latency comes from generating perturbations and updating parameters. The root cause is the standard uniform exploration strategy, which treats all layers equally even though deep networks exhibit heterogeneous sensitivity across layers. This uniformity wastes computation: much of the limited perturbation budget is spent on less sensitive parameters that contribute little to optimization progress.
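To make the uniform scheme concrete, here is a minimal NumPy sketch of the standard two-point ZO gradient estimator that such methods build on. The function name and toy loss are illustrative, not from the paper; the point is that a full-size perturbation must be generated and applied to every parameter at every step, which is the per-step cost the authors highlight.

```python
import numpy as np

def zo_grad_estimate(loss_fn, params, eps=1e-3, rng=None):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    A Gaussian direction z the size of the full parameter vector is drawn
    and applied to *all* parameters, then
        grad ~= (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z.
    Generating and applying z over every parameter each iteration is the
    overhead the source attributes over 40% of training latency to.
    """
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(params.shape)      # uniform: perturb everything
    loss_plus = loss_fn(params + eps * z)      # forward pass 1
    loss_minus = loss_fn(params - eps * z)     # forward pass 2
    return (loss_plus - loss_minus) / (2 * eps) * z

# Toy quadratic loss ||theta||^2, whose true gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
g = zo_grad_estimate(lambda p: float(np.sum(p ** 2)), theta)
```

The estimate is a random projection of the true gradient onto the direction z, so a single sample is noisy; in expectation over z it recovers the gradient, which is why variance (and how the perturbation budget is spent) matters so much.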
AdaLeZO introduces an adaptive layer-wise sampling framework that formulates layer selection as a non-stationary Multi-Armed Bandit (MAB) problem. Instead of uniformly perturbing all parameters, the method dynamically allocates the perturbation budget to the most sensitive layers based on their contribution to the optimization objective. The framework employs sampling with replacement to select which layers to perturb at each iteration. To ensure unbiased gradient estimation despite non-uniform sampling, AdaLeZO incorporates an Inverse Probability Weighting (IPW) mechanism that reweights the gradient estimates according to their sampling probabilities. This IPW mechanism also functions as a temporal denoiser, reducing estimation variance. The approach is designed as a universal plug-and-play module that can enhance any existing ZO optimizer without requiring additional memory.