The future is multi-modal
Google announced its in-house large language model, Gemini, a direct competitor to OpenAI’s GPT models. A pretty mandatory move for Google, coming more than a year after the release of ChatGPT, which unfortunately makes the announcement prone to over-promises right from the start.
In its release report, Google makes a head-to-head comparison between Gemini and GPT, benchmarking them on various datasets. While Gemini slightly edges out GPT, the differences are mostly on the order of a few percentage points. It’s worth noting that some of these benchmarks carry their own margin of error, so such small performance gains may not mean much in this context.
But let’s shift focus. I’d argue that the direct comparison is not the most interesting thing to analyze; the “recipe” for training the model is. It is what offers a glimpse into the trends and directions of the field of AI in general, and into what future models will look like, however good or bad the present ones are. And one thing is for sure: the future is multi-modal, meaning that the same model handles not only text, but also images and audio.
Gemini is described as “anything-to-anything”. The presentation argues that development focused from the start on a shared representation for multiple types of input. This is important because, while GPT-4 also supports images, we do not know for sure how organically they are represented alongside the text input. And this “organic”, homogeneous blend of information is precisely where the next great advancement in AI lies. Allow me to explain.
When a model is trained on multiple types of input, a very interesting phenomenon happens: its performance on text improves because of the image data, and vice versa. This is called transfer learning and, in a way, it mirrors how human learning works, because knowledge is no longer constrained to one particular medium.
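To make the idea concrete, here is a minimal, illustrative sketch in PyTorch. This is not Gemini’s actual architecture, and the names and dimensions below (D_MODEL, the patch and frame sizes) are arbitrary choices of mine: each modality gets a small adapter, but all of them land in the same token space and flow through one shared transformer. Transfer learning can happen because that shared backbone is updated by text, image, and audio examples alike.

```python
# Conceptual sketch of a shared multi-modal representation (not Gemini's architecture):
# each modality has its own lightweight adapter, but everything is projected into
# the same D_MODEL-sized token space and processed by one shared transformer.
import torch
import torch.nn as nn

D_MODEL = 256  # shared embedding size, an arbitrary choice for illustration

class MultiModalBackbone(nn.Module):
    def __init__(self, vocab_size=32000, image_patch_dim=768, audio_frame_dim=128):
        super().__init__()
        # Modality-specific "adapters" into the shared space
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)
        self.image_proj = nn.Linear(image_patch_dim, D_MODEL)  # e.g. ViT-style patch features
        self.audio_proj = nn.Linear(audio_frame_dim, D_MODEL)  # e.g. mel-spectrogram frames
        # One shared transformer consumes the mixed token sequence
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_patches, audio_frames):
        tokens = torch.cat([
            self.text_embed(text_ids),        # (B, T_text, D_MODEL)
            self.image_proj(image_patches),   # (B, T_img,  D_MODEL)
            self.audio_proj(audio_frames),    # (B, T_aud,  D_MODEL)
        ], dim=1)
        return self.backbone(tokens)          # one homogeneous sequence

model = MultiModalBackbone()
out = model(
    torch.randint(0, 32000, (1, 16)),  # 16 text tokens
    torch.randn(1, 9, 768),            # 9 image patches
    torch.randn(1, 20, 128),           # 20 audio frames
)
print(out.shape)  # torch.Size([1, 45, 256])
```

The point of the sketch is simply that, once everything is expressed as tokens in the same space, the bulk of the parameters are shared across modalities, so training on one kind of data shapes the representations used by the others.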
This is why I find the performance comparison almost irrelevant (also because I own neither Google nor OpenAI stock). Right now, I think the most remarkable feature of Gemini is that it is multi-modal. The same principles of transfer learning apply not only to a shared space for text and images, but to audio, too. Voice is not merely transcribed into text; it is processed directly as it is, preserving differences in tone of voice and accent. Add to that the immense “vocabulary” of sounds, from birds singing to musical instruments.
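For the audio point specifically, the contrast looks roughly like this (again just an illustrative sketch with made-up dimensions): transcription throws away everything that words do not capture, while encoding the signal directly keeps tone, accent, and non-speech sounds available to the model.

```python
import torch
import torch.nn as nn

# (a) Transcribe-then-embed: only the words survive; tone, accent and
#     background sounds are gone before the model ever sees the input.
transcript_ids = torch.randint(0, 32000, (1, 12))  # token ids from some ASR step

# (b) Encode the audio directly into the shared space: the full signal
#     (here simplified as mel-spectrogram frames) stays available.
audio_frames = torch.randn(1, 200, 128)   # 200 frames, 128 mel bins (made-up sizes)
audio_proj = nn.Linear(128, 256)          # project into the shared 256-dim token space
audio_tokens = audio_proj(audio_frames)   # (1, 200, 256), ready to mix with text tokens
print(audio_tokens.shape)
```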
This is why the future is multi-modal: you need a thousand words to describe an image, and who knows how many to describe a sound. Projecting all knowledge into mere words is a constraint, however many words we allow for each entry. The way forward is a shared, homogeneous representation of all modalities, processed directly.