Video Generation Models 🎥 vs. World Simulators 🌎

Vlad Ștefan
3 min read · Feb 22, 2024


The performance of OpenAI's latest model, Sora, is hardly news anymore. It's obvious to the naked eye how impressive it is, and I am sure you're already familiar with some of its feats.

Some of the feats of OpenAI's Sora model

But I'm more interested in something else. The official technical report is titled "Video generation models as world simulators." And here's the debate: are they really?

A world simulator requires a world model: a set of explicit constraints, rules, and principles that govern the world. We cannot simulate physics without Newton's Laws, and we cannot simulate Newton's Laws just by mimicking their visual effects.
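To make the distinction concrete, here is a minimal sketch in Python (the function names and numbers are my own illustration, not anything from the report): a simulator that integrates Newton's second law step by step, next to a "generator" that can only replay outcomes it has already seen.

```python
GRAVITY = -9.81  # m/s^2: an explicit rule of the world model

def simulate_fall(height, dt=0.01):
    """World model: integrate Newton's second law (F = m*a) step by step."""
    y, v, t = height, 0.0, 0.0
    while y > 0:
        v += GRAVITY * dt  # the law updates velocity...
        y += v * dt        # ...and velocity updates position
        t += dt
    return t

# "Visual mimicry": a lookup table of previously seen outcomes.
SEEN_CLIPS = {10.0: 1.43, 20.0: 2.02}  # drop height (m) -> fall time (s)

def mimic_fall(height):
    """No physics inside: it can only replay what it has already seen."""
    return SEEN_CLIPS.get(height)

print(simulate_fall(15.0))  # generalizes to a novel height: ~1.75 s
print(mimic_fall(15.0))     # nothing memorized for 15.0: None
```

The simulator answers questions it was never shown because the rule itself is encoded; the mimic has nothing to fall back on.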

It seems to me we're entering a rather existential dilemma. On one hand, we have LLMs and the argument that language alone cannot support intelligence. On the other, we have visual generators and the argument that merely depicting something is not enough to understand it.

Is language a sufficient projection space for a world model?

Can it, though? We too communicate through representations. Books sum up a good deal of our collective knowledge, and they're just words and pictures, after all. As long as they're coherent, how can we conclude that the representation does not stand for deeper reasoning?

Getting back to the model, the most important breakthrough is not the resolution of the video. Rather, it's the temporal consistency. This means that motion in the video unfolds realistically: an object may be occluded for a few frames, but when it reappears it is the same as before. Previous models rarely accomplished this; objects would be lost for good behind an unfortunate obstruction.

An example of temporal consistency
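A toy way to see why occlusion is hard (my own illustration in Python, not a description of Sora's internals): object permanence amounts to carrying state through frames in which nothing is visible.

```python
ball = {"x": 0.0, "vx": 2.0}         # persistent state: position and velocity
occluded = range(3, 6)               # frames where a wall hides the ball

for frame in range(10):
    ball["x"] += ball["vx"]          # the world advances every frame...
    visible = frame not in occluded  # ...whether or not we can see it
    print(frame, f"x={ball['x']:.1f}", "visible" if visible else "hidden")
```

When the ball reappears at frame 6, it is exactly where the dynamics put it. A model that merely continues the pixels of the last visible frame carries no such state, which is why objects used to vanish for good.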

This is a major step forward. So it's tempting to pin it on a latent world model that supposedly keeps the video coherent with spatial and temporal rules. Hence the title of the report. However, I argue that the complexity of a world model cannot emerge from image analysis alone.

Vision, like language, is a lower-dimensional projection. Just as a sphere casts only a circle for a shadow, a projection allows only estimates of the higher-dimensional object behind it. If we were to ask the Greek philosopher Plato, the conclusion would be obvious: the current models cannot be World Simulators.
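The shadow metaphor can be made precise. The sketch below is just the standard orthogonal projection written out: a many-to-one map, hence one with no inverse.

```latex
% Orthogonal projection onto the plane: many-to-one, hence not invertible.
\[
  \pi : \mathbb{R}^3 \to \mathbb{R}^2, \qquad \pi(x, y, z) = (x, y)
\]
% Every shadow point has an entire line of preimages:
\[
  \pi^{-1}(x, y) = \{\, (x, y, z) : z \in \mathbb{R} \,\}
\]
% The unit sphere and a flat unit disc cast the very same shadow:
\[
  \pi\bigl(S^2\bigr) = \pi\bigl(\{ (x, y, 0) : x^2 + y^2 \le 1 \}\bigr)
  = \{ (x, y) : x^2 + y^2 \le 1 \}
\]
```

From the shadow alone, a sphere and a disc are indistinguishable; the observer must bring extra structure to tell them apart.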

Plato's Allegory of the Cave: the models do not learn from direct world phenomena but from our projections of them.

This is not to claim that AI models are forever constrained to the surface of knowledge. But the current architectures, trained on discontinuous patches with the criterion of "what seems to fit," do not make strong candidates. They might produce images that look impressive, or sentences that sound intelligent. For anything deeper than that, we must build the same level of depth into their very fabric.
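For what it's worth, the report itself describes decomposing videos into spacetime patches before modeling them. Here is a rough sketch of that decomposition (the shapes and patch sizes are my own choices, not the report's):

```python
import numpy as np

video = np.random.rand(16, 64, 64, 3)  # (frames, height, width, channels)
pt, ph, pw = 4, 16, 16                 # patch size in time, height, width

T, H, W, C = video.shape
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)  # group the patch axes together
           .reshape(-1, pt * ph * pw * C))  # one flat token per patch

print(patches.shape)  # (64, 3072): 4*4*4 = 64 separate patch tokens
```

Each token is then modeled by how well it fits its neighbors; nothing in that objective demands a consistent world underneath.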
