Video Generation Models vs. World Simulators
The performance of the latest OpenAI model, Sora, is hardly news. It's obvious to the naked eye how impressive it is, and I am sure you're already familiar with some of its feats.
But I'm more interested in something else. "Video generation models as world simulators": that is the title of the official technical report. And here's the debate: are they really?
A world simulator requires a world model: a set of explicit constraints, rules, and principles that govern the world. We cannot simulate physics without Newton's Laws, and we cannot simulate Newton's Laws just by mimicking their visual effects.
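To make the distinction concrete, here is a toy example of my own (nothing from the report): an explicit world model for a thrown ball, where the state and Newton's second law are written down directly rather than inferred from how the motion happens to look.

```python
# Toy world model: a ball thrown upward, advanced step by step with explicit
# rules. The state (position, velocity) and the update rule (Newton's second
# law) are written down directly; nothing is inferred from appearances.

GRAVITY = -9.81  # m/s^2
DT = 0.01        # integration step, seconds

def step(position, velocity, mass=1.0):
    """Advance the state by one time step using F = m * a."""
    force = mass * GRAVITY        # only gravity acts in this toy world
    acceleration = force / mass
    velocity += acceleration * DT
    position += velocity * DT
    return position, velocity

# Throw the ball upward at 5 m/s and integrate until it comes back down.
pos, vel, t = 0.0, 5.0, 0.0
while pos >= 0.0:
    pos, vel = step(pos, vel)
    t += DT

print(f"The ball lands after roughly {t:.2f} s")  # ~1.02 s, close to 2*v/g
```

A video model never manipulates a state like this; it only predicts what the next pixels should look like.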
It seems to me we're entering a rather existential dilemma. On one hand, we have LLMs and the argument that language alone cannot support intelligence. On the other, we have visual generators and the argument that merely displaying something cannot be enough to count as understanding it.
Can it, though? We too communicate through representations. Books sum up a good deal of our collective knowledge, and they're just words and pictures, after all. As long as they're coherent, how can we conclude that the representation does not stand for deeper reasoning?
Getting back to the model, the most important breakthrough is not the resolution of the video. Rather, it's the temporal consistency: movements in the video unfold realistically, and an object might be occluded from view for a few frames, but when it reappears it looks the same as before. Previous models rarely accomplished this; objects in the video would be forever lost behind an unfortunate obstruction.
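If you want to make "the same as before" testable, one simple probe is to compare the object's feature embedding just before the occlusion and just after it reappears. The sketch below assumes a hypothetical embed_object function that crops and encodes the tracked object in a frame; it illustrates the evaluation idea, not anything Sora actually does.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_preserved(frames, embed_object, last_visible, reappears, threshold=0.9):
    """Check whether a tracked object looks the same before and after an occlusion.

    `embed_object` is a hypothetical callable that returns a feature vector for
    the tracked object in a given frame; `last_visible` and `reappears` are the
    frame indices that bracket the occlusion.
    """
    before = embed_object(frames[last_visible])
    after = embed_object(frames[reappears])
    return cosine_similarity(before, after) >= threshold
```

A generator with strong temporal consistency should pass this kind of check for most occlusions; the earlier models I mentioned would not.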
This is a major step forward. So it's tempting to pin it on a latent world model that supposedly keeps the video coherent with spatial and temporal rules; hence the title of the report. However, I argue that the complexity of a world model cannot emerge from image analysis alone.
Vision, like language, is a lower-dimensional projection. Just as a sphere casts a circular shadow, the projection only allows for estimates of the higher-dimensional object. If we were to ask the Greek philosopher Plato, whose prisoners mistook shadows on the cave wall for the world itself, the conclusion would be obvious: the current models cannot be World Simulators.
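The information loss is easy to demonstrate. In the sketch below (my own illustration), two different 3D objects cast exactly the same 2D shadow, so the shadow alone cannot tell you which object produced it.

```python
import numpy as np

# Two distinct 3D objects: points on a unit sphere, and the same points
# squashed to half their height along the z-axis.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 1000)
z = rng.uniform(-1.0, 1.0, 1000)
r = np.sqrt(1.0 - z**2)
sphere = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
squashed = sphere * np.array([1.0, 1.0, 0.5])

# Project both onto the xy-plane (drop z): this is the "shadow".
shadow_of_sphere = sphere[:, :2]
shadow_of_squashed = squashed[:, :2]

print(np.allclose(shadow_of_sphere, shadow_of_squashed))  # True: same shadow
print(np.allclose(sphere, squashed))                      # False: different objects
```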
This is not to claim that AI models are forever constrained to the surface of knowledge. But the current architectures, trained on discontinuous patches with the criterion of "what seems to fit", do not make strong candidates. They might produce images that look impressive, or sentences that sound intelligent. For anything deeper than that, we must build the same depth into their very fabric.
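For the curious, this is roughly what those patches are: the report describes compressing videos and slicing them into spacetime patches that the model learns to denoise. Below is a rough sketch of just the slicing step; the tensor layout and patch sizes are my assumptions, not values from the report.

```python
import numpy as np

def to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Slice a video tensor (frames, height, width, channels) into spacetime patches.

    Each patch covers `patch_t` frames and a `patch_h` x `patch_w` region, so the
    model only ever sees small, local chunks of space and time. The patch sizes
    here are illustrative assumptions, not values from the report.
    """
    T, H, W, C = video.shape
    video = video[:T - T % patch_t, :H - H % patch_h, :W - W % patch_w]  # crop remainders
    T, H, W, _ = video.shape
    return (video
            .reshape(T // patch_t, patch_t, H // patch_h, patch_h, W // patch_w, patch_w, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)
            .reshape(-1, patch_t * patch_h * patch_w * C))  # one row per patch

# A dummy 16-frame, 64x64 RGB clip becomes a flat sequence of 64 patch tokens.
clip = np.zeros((16, 64, 64, 3), dtype=np.float32)
print(to_spacetime_patches(clip).shape)  # (64, 3072)
```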