✨ TL;DR
This paper shows that randomly initialized neural networks can learn useful representations through simple peer-to-peer consensus (self-distillation) alone, without projectors, predictors, or pretext tasks. The findings suggest that self-distillation itself is a key mechanism driving learning in self-supervised methods, independent of other architectural components.
State-of-the-art self-supervised learning methods, particularly those based on self-distillation, achieve impressive performance but rely on complex ensembles of mechanisms with many empirically motivated design choices that lack theoretical grounding. It remains unclear which components are essential for learning and which are merely auxiliary. In particular, the role of self-distillation itself within the learning dynamics has not been isolated or well understood, because it is typically bundled with projectors, predictors, momentum encoders, and various pretext tasks.
The authors isolate the effect of self-distillation with a minimal experimental setup that removes all of these common auxiliary components. They train a group of randomly initialized networks that learn solely through peer-to-peer consensus: each network distills knowledge from its peers, with no projectors, predictors, or pretext tasks. This stripped-down approach lets them study the pure effect of self-distillation on learning dynamics. They evaluate the learned representations on downstream tasks and analyze how performance varies with different hyperparameters to understand what the models learn under these minimal conditions.
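The peer-to-peer consensus mechanism can be sketched in miniature. The toy below is not the authors' architecture: it stands in linear maps for the networks, uses a mean-squared consensus loss, and treats the peers' average output as a fixed (stop-gradient) distillation target; all names, sizes, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_nets, steps, lr = 8, 4, 3, 200, 0.1

# Toy stand-in: each "network" is a random linear map W_k (illustrative only)
Ws = [rng.normal(size=(d_in, d_out)) for _ in range(n_nets)]
x = rng.normal(size=(64, d_in))  # one fixed batch of inputs

def disagreement(Ws, x):
    """Mean squared deviation of each peer's output from the group mean."""
    outs = [x @ W for W in Ws]
    mean = sum(outs) / len(outs)
    return float(np.mean([(o - mean) ** 2 for o in outs]))

before = disagreement(Ws, x)
for _ in range(steps):
    outs = [x @ W for W in Ws]
    target = sum(outs) / len(outs)  # consensus target, treated as stop-gradient
    for k in range(n_nets):
        # Gradient of 0.5 * ||x @ W_k - target||^2 w.r.t. W_k, averaged over the batch
        grad = x.T @ (outs[k] - target) / len(x)
        Ws[k] = Ws[k] - lr * grad
after = disagreement(Ws, x)
```

Under these assumptions, each gradient step pulls every network toward the peers' mean prediction, so the group disagreement shrinks over training; the paper's question is what representations emerge while this consensus forms.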