Debunking the AI-Art Hype
As AI-generated imagery is now plastered everywhere, it’s worth understanding the story. In the spring, the DALL·E 2 model appears, capable of generating images from a text description. Trained on hundreds of millions of image-description pairs (this exact association between an input and its label is what’s needed), the model brings visual understanding together with natural-language understanding (an inheritance from the GPT-3 line). The results are a clear step up from previous generations, and the model gains popularity among practitioners.
But DALL·E 2 is not open-source, even though OpenAI, the company behind it, started from exactly that promise. (OpenAI is backed by various giants and investors: Musk, Amazon, Microsoft, Bedrock, Sequoia.) In the end, its output stays largely confined to the community of ML engineers and researchers.
Towards the end of the summer, a much smaller company, Stability AI, releases a model similar to DALL·E 2, called Stable Diffusion, with one important difference: it’s completely open-source. Both the training code and the model weights become accessible to anyone with an internet connection and a decent computer. The move pays off: the results go viral, everyone is talking about Stability AI, and a few months later the company raises $100 million in funding.
From a technical point of view, with ~20 personal pictures you can fine-tune the model enough for it to learn the association between your own looks and a name, a token (I’ll leave the link in the comments). That is, just as I can generate an image of George Washington eating an apple in space because the model understands who GW is, after this training I can do the same with yours truly. Part of the public takes notice.
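For the technically curious, here is a minimal sketch of the inference side once such a fine-tuned checkpoint exists, using the Hugging Face diffusers library. The model id and the placeholder token “sks person” are illustrative assumptions, not the exact recipe from the linked tutorial.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the base Stable Diffusion weights, or a checkpoint fine-tuned
# (DreamBooth-style) on ~20 personal photos bound to a rare placeholder token.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # hypothetical: swap in your fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Before fine-tuning: the model already "knows" public figures.
pipe("George Washington eating an apple in space").images[0].save("gw_in_space.png")

# After fine-tuning: the rare token "sks person" now stands in for yours truly.
pipe("a portrait of sks person eating an apple in space").images[0].save("me_in_space.png")
```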
Technological opportunists waste no time. These #hustlers, who, if they didn’t know computers, would be opening car washes, launch within a month the first generation of AI “art” applications, their only contribution being a thin interface over Stable Diffusion. The winners of this arms race get their start-ups off the ground in less than two weeks. Social media feeds fill with ads promising the fulfillment of every visual fantasy, all just a description away. At the same time, in Colorado, an art competition is won by an AI-generated painting. The hype begins.
A few months later, artificial avatars are everywhere. History repeats itself: the new I’m-so-artistic-nobody-understands-me profile-picture filters are AIs, not the banal textures and Sobel operators of ten years ago (remember those?). There is, rightly, the problem of data. Apparently, portrait apps pretty much reserve full rights to your face and its derivatives. But a good question is: what good is Karen’s selfie to those who already have an automatic face factory?
This data is not necessarily valuable through its link to Karen’s actual identity; nobody is hunting for that particular person on the street to replace her with a clone. The idea is that each image contributes infinitesimally to the next generation of images, because the next generation of DALL·E and Stable Diffusion will be trained on the images of now. And as I said, an image alone, unaccompanied by a proper label, is not useful. Question: what is the indirect label that emerges from Karen’s pictures? Well, when the user uploads their personal pictures, the app generates (say) 10 virtual portraits. From these, the user chooses: a) the most faithful replicas (they have to look like you) and b) the “coolest” ones (the ones that would collect the most likes). Those choices are the useful labels that emerge. They will ensure that the avatars of future extra-verses are photo-realistic, charismatic, and perhaps even physically attractive. I won’t go into the worlds these avatars would inhabit; one can imagine.
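To make the idea of an indirect label concrete, here is a purely hypothetical sketch of what such an app could log per session; the field names are invented for illustration and are not taken from any real product.

```python
from dataclasses import dataclass

@dataclass
class AvatarFeedback:
    """One generated portrait plus the implicit labels created by the user's choices."""
    prompt: str               # text prompt used to generate the portrait
    image_path: str           # the generated image itself
    chosen_as_faithful: bool  # "this looks like me": a likeness/identity signal
    chosen_as_cool: bool      # picked as a favourite: an aesthetic-preference signal

# One upload session might yield records like these; in aggregate they become
# (image, preference) pairs usable for training future, better-liked generators.
session = [
    AvatarFeedback("cyberpunk portrait of sks person", "gen_01.png", True, True),
    AvatarFeedback("cyberpunk portrait of sks person", "gen_02.png", False, False),
    AvatarFeedback("oil painting of sks person", "gen_03.png", True, False),
]

faithful = [r.image_path for r in session if r.chosen_as_faithful]
cool = [r.image_path for r in session if r.chosen_as_cool]
print(f"likeness labels: {faithful}, aesthetic labels: {cool}")
```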
One thing we can be sure of: whatever the invention, the market will immediately turn it into an advertisement or a tool to optimize (and market) obscenity. It’s what sells. And it’s also quite sad that, at an individual level, the first thing we build for ourselves with this tool of descriptive imagination is a virtual mirror. But looking further, these models open up a new avenue of artificial understanding. Natural language is brought together with the visual, both projected purely mathematically into a common space. In this space, the representation of the text “red car” is algebraically close to the representations of images of actual red cars. And the representation of “red truck” is close to that of red cars along some dimensions and to that of gray trucks along others.
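A model like CLIP is one concrete instance of such a common space, and Stable Diffusion itself uses a CLIP text encoder. Below is a small sketch, via the Hugging Face transformers library, that probes only the text side of that space for brevity; the checkpoint name is one public option and the exact similarity values will vary.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["a red car", "a red truck", "a gray truck", "a bowl of soup"]
inputs = processor(text=phrases, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)   # one embedding vector per phrase
emb = emb / emb.norm(dim=-1, keepdim=True)    # unit-normalize

print(emb @ emb.T)  # cosine similarities: expect "a red truck" to sit between
                    # "a red car" and "a gray truck", and the soup far from all three
```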
And so we arrive at the picture above. Not chronologically, because the results belong to the Flamingo model (DeepMind) and its paper is from April, but I liked these examples. First, extracting text from images has always been a somewhat frustrating goal. The generic solution is to detect letters and group them into words; from there, multiple technical difficulties follow. Flamingo “gets” pretty much what you want from a few examples and responds quite well (I’m curious how it handles successive lines). There is also a degree of symbolic understanding, see the arithmetic example (but even purely linguistic models end up making mistakes once you take them out of the generic regime, so I don’t think this leads to integrals just yet). And from a purely pragmatic point of view, I find it absolutely impressive that we may have in front of us an early version of the universal encyclopedia (line 2).
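For contrast, the “generic solution” mentioned above looks roughly like the classic OCR pipeline sketched below (Tesseract wrapped by pytesseract; the image path is a placeholder and the Tesseract binary must be installed). A visual-language model like Flamingo skips the explicit detect-and-group stage entirely: you interleave a few (image, question, answer) examples in the prompt and it answers for the new image.

```python
from PIL import Image
import pytesseract  # thin wrapper around the Tesseract OCR engine

# Classic route: detect characters in the image and group them into words/lines.
text = pytesseract.image_to_string(Image.open("receipt.png"))  # placeholder path
print(text)
```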
So we can be sure of one more thing. Understanding is not one-dimensional. Intelligence comes from the interpenetration of multiple planes. Huge potential energy sits in our information silos. And yes, energy can mean both the rocket fuel that reaches the moon and the gunpowder of the explosion that destroys us. So, individually and collectively, we owe it to ourselves to make wise choices.
References:
- Stable Diffusion: https://huggingface.co/spaces/stabilityai/stable-diffusion
- Flamingo: https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model