A Very Good Question
And a plausible answer about why language models and video models are on different paths
A thoughtful reader asked a very good question about my last essay, Sending Sora to Film School. I wrote that improving AI video models would require a change in how we train them, and that brute-forcing the task (adding more GPUs) won’t be enough because the current training paradigm lacks clear goals. It’s hard to express a goal for video models beyond ‘make it more photorealistic’.
This thoughtful reader read thoughtfully and posed the following Very Good Question:
Language does not have a clear goal, yet Large Language Models (LLMs) became super fluent with enough compute. Why won’t the same approach work for video?
It is a very good question. Indeed a good enough question that it deserves a long response.
The answer comes down to goals and structure. We can explain it using three examples:
Chess, a game of math
Language, a symbolic system
Video, a… sensory medium? An audio-visual representation? A flat time sculpture?
This last part, as we’ll see, is kinda the issue.
Modeling chess, a simple game
AI is really good at chess. It’s actually superhuman! It can beat any person. How did it get so good?
Chess has a well-defined objective. The win condition is really clear: the king is either in checkmate or not. This clarity allows the training process to be really effective. Researchers can reward the model when it wins and create feedback loops.
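To make that concrete, here’s a minimal sketch of the feedback signal chess offers. This is my illustration, not how any real engine trains, and it assumes the python-chess library (pip install chess); the point is just that every game collapses into an unambiguous number a training loop can chase.

```python
# Sketch: chess gives training an unambiguous reward signal.
# Assumes the python-chess library; random play stands in for a policy.
import random
import chess

def play_random_game(max_moves=200):
    board = chess.Board()
    while not board.is_game_over() and board.fullmove_number < max_moves:
        board.push(random.choice(list(board.legal_moves)))
    return board

def reward_for_white(board):
    # +1 if White delivered checkmate, -1 if White got mated, 0 otherwise.
    outcome = board.outcome()
    if outcome is None or outcome.winner is None:
        return 0.0
    return 1.0 if outcome.winner == chess.WHITE else -1.0

board = play_random_game()
print(board.result(), reward_for_white(board))
```

A system like AlphaZero wraps that signal in self-play and search, but the reward it optimizes stays this simple: win, lose, or draw.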
This progress isn’t hard for people to understand. Chess is a game of finite, well-defined mathematical possibilities. It makes sense that a big enough computer could master it.
Modeling language, a symbolic system
Language seems less straightforward. Neither you nor I can define a clear objective for the English language. Regardless, we have LLMs that can read and write more fluently than most people. How did they get so good?
LLMs’ progress is not incomprehensible magic. Language is a symbolic system. It has structure. We can predictably combine words according to grammar and syntax to create meaning. That makes the training process much easier. Errors, for example, are quite clear. A sentence where a cat hatches out of its egg is false. A frabjous day is nonsensical. And qwerty fsads isn’t even language. Dr. Seuss might argue, but these are clear cases for AI researchers.
Because language is a symbolic system, big enough computers were able to search for and discover its underlying structure. Once LLMs understood the core patterns of language, they could use them to do neat stuff like recite facts about the world, psychoanalyze a text message, even tell you how to declare bankruptcy in Canada. Today we have general-purpose LLMs that feel magical.
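To give a toy flavor of what “discovering structure” means (my illustration, not a claim from the essay): even a crude model that just counts which word follows which starts to absorb grammar from raw text. Real LLMs use neural networks, next-token prediction, and vastly more data, but the flavor is similar.

```python
# Toy bigram "language model": count which word follows which,
# then predict the most frequent continuation. Even this picks up
# syntactic patterns from nothing but raw text.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(word):
    return follows[word].most_common(1)[0][0]

print(most_likely_next("the"))  # 'cat' (seen most often after 'the')
print(most_likely_next("sat"))  # 'on'  (the model has absorbed a pattern)
```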
So why can't the same approach work for video? Scale up compute, discover the secret structure of video, and master it?
Modeling video, a flat representation
Video is very different. It is not a symbolic system. Video is many 2D images that represent 3D space over time. And video models are… pretty good. How did they get pretty good?
Video models watch millions of videos and learn to predict the next visually plausible frame. That method can get you surprisingly far, but not far enough. The models can generate shots that look completely plausible but violate physics, time, and basic common sense.
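Written down, that objective looks something like the sketch below (shapes and names are illustrative assumptions, not any particular model’s code). Notice what the loss can see and what it can’t: it rewards “looks like the real next frame” and says nothing about whether the motion was physically possible.

```python
# Sketch: scoring a predicted frame purely by pixel similarity.
# Low loss means "visually close," not "physically correct."
import numpy as np

def next_frame_loss(predicted_frame, actual_frame):
    # Mean squared error over pixels.
    return float(np.mean((predicted_frame - actual_frame) ** 2))

# Two 64x64 RGB frames with values in [0, 1] (stand-ins for real video frames).
rng = np.random.default_rng(0)
actual = rng.random((64, 64, 3))
predicted = actual + rng.normal(scale=0.05, size=actual.shape)

print(round(next_frame_loss(predicted, actual), 4))
```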
As video models train along this path, they face two big obstacles.
First, video models have a very unclear goal. They’re designed to predict the next plausible frame. While this goal often approximates realism, it just as often leads the model to strange and incorrect places. Visually plausible does not equal physically plausible, narratively plausible, or other important plausibles.
Second, video, unlike language, is not a structured system. Models will struggle to discover the underlying structure of video by only watching video. That’s because they’re studying 2D frames that are a lossy representation of something else: the world. The underlying structure of the world is physics, as well as biology, human behavior, society, etc. Video is not a good way of representing these things to an AI model.
Put another way, we’ve set up a training paradigm where models can master the grammar of light (what things look like) but not the grammar of the world (physics, object permanence, time). Even if we add more compute, the models will still be studying the wrong thing.
A Summing Up
Video models will probably become as good as language models. The path will just be different. To understand the underlying structure of video, they need to model the world (physics) and people (agency) among many other things. This is what AI companies mean by "world models". It’s an idea they’re all thinking about and working on.
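To sketch the distinction (this is my toy reading of the term, not any company’s architecture): a world model predicts the next state of the world (positions, velocities, intentions) and only then derives pixels from that state, instead of predicting pixels directly.

```python
# Toy "world model": the physics lives in the state transition,
# and a frame is merely rendered from the state afterward.
from dataclasses import dataclass

@dataclass
class BallState:
    height: float    # meters above the ground
    velocity: float  # meters per second, positive = upward

GRAVITY = -9.8  # m/s^2

def step(state: BallState, dt: float = 1 / 30) -> BallState:
    v = state.velocity + GRAVITY * dt
    h = state.height + v * dt
    if h <= 0.0:
        h, v = 0.0, -v * 0.8  # crude bounce with energy loss
    return BallState(height=h, velocity=v)

def render(state: BallState) -> str:
    # Stand-in for a renderer: pixels are derived from the state.
    return f"frame: ball at {state.height:.2f} m"

state = BallState(height=2.0, velocity=0.0)
for _ in range(5):
    state = step(state)
    print(render(state))
```

A pixel-only model has to rediscover gravity from appearances; a world model gets to state it once and reuse it in every frame.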
So to the Thoughtful Reader, there you have an answer to your Very Good Question. I won’t pretend it’s a complete answer. I’m still considering the topic myself. Your question is a good one, and it reminds me to think deeply about first principles and goals. Thinking along those lines often gives me a better sense of AI development.
For AI-generated video, a good interim solution might be video generation with more "frame by frame" direction from the human creator, where the AI is more of a "helpful paintbrush" than a generator of a first draft. And, of course, as billions of sessions accumulated, it would make better predictions about what users do and do not want.
I don't fully buy this argument. First, most video models don't "predict the next frame"; they generate full videos at once given a text prompt.
That aside, there is no convincing argument that a model needs to understand 3D/physics in order to produce very convincing videos (models like Veo 2 are already getting close). Remember 12 months ago when image models kept generating 6 fingers on hands? People expected that something had to be done explicitly to handle this, or else it wouldn't be solved. All that mattered in the end was more data and more compute (problem solved). Read The Bitter Lesson by Sutton.