✨ TL;DR
This paper develops a random matrix theory model that explains why neural networks exhibit a transient learning window in which signal is detectable before overfitting sets in. The key mechanism is that anisotropy in the input data creates fast and slow learning directions, causing a learnable eigenvalue to temporarily separate from the noise bulk before being reabsorbed.
Empirical observations of neural network training reveal a puzzling transient regime: there is a finite time window during gradient descent in which the model successfully captures signal, but that signal later disappears as overfitting takes over. In practice the phenomenon is commonly handled by early stopping, yet it lacks a theoretical explanation. The challenge is to explain why and when this transient learning window occurs, and which factors control its duration and existence. Understanding this requires analyzing the time-dependent spectral properties of the learned weight matrices in the presence of noise and structured input data.
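As a rough illustration of this transient window, here is a minimal sketch (not from the paper; the dimensions, noise level, and step size are assumed values) that trains a linear student by gradient descent on noisy labels from a unit-norm teacher and tracks the excess test risk: the risk first dips as signal is picked up, then rises again as noise is fit, so early stopping at the dip is optimal.

```python
# Minimal sketch of the transient learning window (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 200, 100, 2.0                  # over-parameterized, noisy labels
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)             # unit-norm teacher

X = rng.standard_normal((n, d))              # isotropic inputs for simplicity
y = X @ w_star + sigma * rng.standard_normal(n)

w = np.zeros(d)
lr = 0.1 * n / np.linalg.norm(X, 2) ** 2     # conservative step size
risks = []
for t in range(20000):
    w -= lr * X.T @ (X @ w - y) / n          # full-batch gradient step
    risks.append(np.sum((w - w_star) ** 2))  # excess test risk (isotropic test data)

best_t = int(np.argmin(risks))
print(f"risk at t=0:  {np.sum(w_star**2):.2f}")
print(f"best risk     {risks[best_t]:.2f} at step {best_t}")
print(f"final risk    {risks[-1]:.2f}  (overfitting erases the transient gain)")
```

The dip-then-rise of the risk curve is the finite window the paper sets out to explain; the anisotropic model below makes its opening and closing analytically trackable.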
The authors construct an analytically tractable random matrix model of gradient flow in a linear teacher-student setting with anisotropic input covariance. The input covariance is modeled as a two-block structure that creates fast and slow learning directions. The analysis focuses on the time-dependent bulk spectrum of the symmetrized weight matrix, which they derive from a 2×2 Dyson equation. For a rank-one teacher signal, they obtain an explicit outlier condition via a rank-two determinant formula. This framework lets them track when an isolated eigenvalue (representing learned signal) separates from the noisy bulk spectrum and when it is reabsorbed, characterizing a time-dependent Baik-Ben Arous-Péché (BBP) phase transition.
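The sketch below is a simplified numerical analogue of this setup, not the paper's analysis: the block sizes, variances, noise level, and teacher strength are assumed values, the exact gradient-flow solution is used in place of the 2×2 Dyson equation, and the second singular value of the student weights stands in as a crude proxy for the bulk edge. It shows the qualitative mechanism: a rank-one signal outlier separates early via the fast directions and is later overtaken as the noise-dominated slow directions get fit.

```python
# Illustrative linear teacher-student with two-block (fast/slow) input covariance.
# Tracks the top singular value of the student weights against the rest of the
# spectrum over gradient-flow time: separation, then reabsorption of the outlier.
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out = 400, 200, 200
var_fast, var_slow, sigma, signal = 1.0, 0.01, 0.5, 3.0   # assumed values

# Two-block input covariance: fast directions first, slow directions second.
stds = np.sqrt(np.r_[np.full(d_in // 2, var_fast), np.full(d_in // 2, var_slow)])
X = rng.standard_normal((n, d_in)) * stds

# Rank-one teacher supported on the fast block.
v = np.zeros(d_in)
v[: d_in // 2] = rng.standard_normal(d_in // 2)
v /= np.linalg.norm(v)
u = rng.standard_normal(d_out)
u /= np.linalg.norm(u)
Y = signal * np.outer(X @ v, u) + sigma * rng.standard_normal((n, d_out))

# Closed-form gradient flow from zero init on the squared loss:
#   B(t) = V diag((1 - exp(-t * lam)) / lam) V^T (X^T Y / n)
Sigma_hat = X.T @ X / n
lam, V = np.linalg.eigh(Sigma_hat)
G = V.T @ (X.T @ Y / n)

for t in [1.0, 10.0, 100.0, 1000.0, 10000.0]:
    f = (1.0 - np.exp(-t * lam)) / np.maximum(lam, 1e-12)
    B_t = V @ (f[:, None] * G)               # student weights at time t
    s = np.linalg.svd(B_t, compute_uv=False)
    print(f"t={t:>8.0f}   top sv = {s[0]:6.2f}   2nd sv (bulk-edge proxy) = {s[1]:6.2f}")
```

At intermediate times the top singular value sits well above the rest of the spectrum (the outlier phase of the BBP picture), while at late times the slowly learned, noise-dominated directions inflate the bulk past the signal strength and the gap closes, mirroring the reabsorption the paper characterizes analytically.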