✨ TL;DR
This paper challenges the Platonic Representation Hypothesis by showing that apparent alignment between vision and language models is an artifact of small-scale evaluation. When tested at scale with millions of samples and realistic many-to-many settings, cross-modal alignment degrades substantially, suggesting different modalities learn different representations of reality.
The Platonic Representation Hypothesis proposes that neural networks trained on different modalities (text, images, etc.) converge toward the same underlying representation of reality. This hypothesis has important implications for AI development, suggesting that modality choice may not fundamentally matter. However, the experimental evidence supporting it comes from evaluations on small datasets (approximately 1,000 samples) using mutual nearest neighbor metrics in constrained one-to-one image-caption settings. If the hypothesis is correct, different modalities are merely different paths to the same representational endpoint. If, instead, the evidence is fragile and dependent on specific evaluation conditions, different modalities may actually learn fundamentally different representations, which would have significant implications for multimodal AI system design and for our understanding of how neural networks represent knowledge.
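The mutual nearest neighbor metric referenced above can be sketched roughly as follows: embed the same paired samples in two representation spaces, compute each sample's k nearest neighbors independently in each space, and score the average overlap of the two neighbor sets. This is a minimal illustrative sketch, not the paper's implementation; the function names and the brute-force distance computation are assumptions.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbors (self excluded),
    using brute-force Euclidean distances."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def mutual_knn_alignment(A, B, k=10):
    """Average k-NN set overlap between two embedding spaces.

    A and B hold embeddings of the SAME n samples (row i of A and
    row i of B describe the same underlying item, e.g. an image
    and its caption). Returns a score in [0, 1]; 1.0 means the
    neighborhood structure is identical in both spaces.
    """
    na, nb = knn_indices(A, k), knn_indices(B, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(na, nb)]
    return float(np.mean(overlap))
```

For random unrelated embeddings this score stays near the chance level of roughly k/(n-1), which is one reason small-sample evaluations can be hard to interpret: the gap between chance and perfect alignment is measured over very few neighborhoods.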
The authors systematically re-evaluate the evidence for cross-modal alignment by scaling up the evaluation regime. They test alignment using mutual nearest neighbor metrics across datasets ranging from the original small scale (approximately 1,000 samples) to millions of samples, examining how the metrics change with scale and whether the measured alignment reflects fine-grained structural correspondence or merely coarse semantic overlap. They also move beyond the constrained one-to-one image-caption setting of prior work to more realistic many-to-many scenarios, where multiple captions can describe the same image and vice versa. Finally, they test whether the reported trend of stronger language models increasingly aligning with vision models holds for newer, more capable models released since the original hypothesis was proposed.
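One way to picture the many-to-many relaxation is a group-aware variant of the mutual k-NN score: instead of requiring neighbor indices to match exactly, two neighbors count as a match when they refer to the same underlying item (e.g. two different captions of one image). The sketch below is a hypothetical illustration of that idea under stated assumptions; the grouping scheme and function names are not from the paper.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbors (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def group_mutual_knn_alignment(A, B, groups, k=5):
    """Mutual k-NN overlap scored at the group level.

    `groups[i]` labels the underlying item sample i belongs to
    (e.g. all captions of one image share a label), so neighbors
    agree if their GROUPS overlap, not their exact indices.
    """
    na, nb = knn_indices(A, k), knn_indices(B, k)
    g = np.asarray(groups)
    scores = []
    for a, b in zip(na, nb):
        ga, gb = set(g[a]), set(g[b])  # neighbor groups per space
        scores.append(len(ga & gb) / k)
    return float(np.mean(scores))
```

With one sample per group this reduces to the exact-index score; with several captions per image it credits semantically equivalent neighbors that a strict one-to-one protocol would count as misses.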