A Very Good Question

Feb 20

And a plausible answer about why language and video scale differently

5 Comments

For AI-generated video, a good interim solution might be generation with more "frame by frame" direction from the human creator, where the AI is more of a "helpful paintbrush" than a generator of a first draft. And, of course, as billions of sessions accumulate, it would make better predictions about what users do and do not want.

Slicey Me Likey

Feb 24

I don't fully buy this argument. First, most video models don't "predict the next frame"; they generate full videos at once given a text prompt.

That aside, there is no convincing argument that a model needs to understand 3D/physics in order to produce very convincing videos (models like Veo 2 are already getting close). Remember 12 months ago when image models kept generating 6 fingers on hands? People expected that something had to be done explicitly to handle this, or else it wouldn't be solved. All that mattered in the end was more data and more compute (problem solved). Read The Bitter Lesson by Sutton.

Mike Gioia

Feb 24

Sutton's The Bitter Lesson is actually the impetus for this article. My previous post talks about that essay at length. I corresponded with Rich while writing that post asking him several questions. His response to the final draft was simply "Not too bad", which is about as good as I can expect to get from him on an essay that eschewed the bitter lesson's lesson.

Predicting the next frame is definitely a simplification, but I think it's not a bad metaphor. In reality I think the models compress the data and do the prediction in latent space, but they're still predicting how those pixels plausibly change over the course of the video. I could be off, but it feels like a helpful example for readers.

Damiano Vukotic

Feb 21

Would AI-gen video LLMs ingest 3D video to solve it?

Mike Gioia

Feb 21

Labs are doing several clever things. Some train on video game data. Many games run on Unreal Engine, which includes a physics engine. The labs create tons and tons of simulated scenarios using the engine and then train the model on them.

There are other methods too, which I understand less well.

Intelligent Jello
