Are Models Converging Towards the Same Representation of the World?
Are AI models converging in the way they represent the world? This recent paper says yes.
The authors analyze several language and vision models, their learned latent spaces and the way they measure distances between data points, and conclude that there is a noticeable and growing alignment between all of them.
The paper makes the hypothesis that different models are all trying to arrive at the same representation of reality. This shared statistical representation of the world is named after Plato’s concept of an ideal reality, and referred to as the platonic representation.
What is intriguing is that neural networks, even if they are trained with different objectives on different data and modalities, seem to be converging to a shared statistical model of reality. Given the variability of training settings (loss functions, datasets, modalities, architectures…), it is not obvious why this would be the case.
You might wonder how we can measure the similarity between two latent spaces, or representations, for two different foundational models.
This paper focuses on vector embeddings and looks at the similarity structure they induce, meaning how the distance/similarity between data points is measured for each model.
In other words, this paper looks at how each foundational model measures similarity between its data points, or its kernel. It then introduces a kernel-alignment metric that determines how close two kernels of two different foundational models are.
Some results that argue in favor of the hypothesis include:
Representation alignment in vision and language models:
Vision models solving a higher percentage of VTAB tasks show greater alignment with each other. This suggests that as models become more competent and general, their internal representations converge.
Larger models show greater alignment with each other compared to smaller ones. This is consistent with the hypothesis that scaling model size increases representational alignment.
Representation alignment across modalities:
There is a linear relationship between language model performance and their alignment with vision models on the Wikipedia caption dataset. This indicates that more capable language models tend to align better with more capable vision models.
Previous work showed that a single linear projection could stitch a vision model to a language model and achieve good performance on visual question answering and image captioning tasks. This also supports the idea that representations are converging across different data modalities.
Several selective pressures are identified as driving this convergence, including the scaling of model size, data diversity, and the need to solve a wide range of tasks.
There are broader implications of this potential convergence than just scientific curiosity. This work shows the potential for more efficient transfer learning, better integration of multimodal data, and the development of more general AI systems.