It’s been an exhausting 18 months in the AI video space. Every week a new AI demo blows the internet away. Social media feeds waterfall with reposts, reactions, and statements that “Everything just changed”. And the casual observer, punch-drunk with AI news, can’t help but agree that “Indeed, things just changed forever”.
However, the following week, the sobered observer finds that everything has not changed. That the world is… kinda the same. Even disappointing compared to the dizzying highs they were riding just days ago.
This outsized excitement over technical demos is typical in the generative AI space. We routinely overestimate how our lives will change. In reality change is iterative and rarely comes in single, sweeping advancements. AI companies, though, would have you believe otherwise.
Demos that made a big splash
The Rabbit R1 is an AI-in-a-box, an ever-present handheld assistant for the physical world. About the size of a Post-It note, this orange, plastic box was hailed as an iPhone killer after its announcement in January 2024. The demo careened across the internet as AI-in-a-box products were declared the next major pillar of the AI boom. We’d all have an AI-in-a-box separate from our phone-in-a-box. It would be hard to even remember your life before AI-in-a-boxes! But six months later the Rabbit R1 is popularly considered a waste of money and “barely reviewable”.
The Humane AI Pin followed a similar story. This small wearable device was greeted messianically in November 2023 as a boldly reimagined form of AI. It wasn’t in your phone, it wasn’t in a handheld box, it was exactly where you wanted AI to be— pinned to your shirt. It was AI-in-a-box-on-your-shirt. But as the product reached real consumer shirts, the appeal became less clear. It’s now called “the worst product I’ve ever reviewed” by more than a few influential tech reviewers.
It may be a while until any AI-in-a-box product changes your or my life. But other AI products, like video models, are much more immediately impactful. Still, these products enjoy no shortage of outsized and distorted hype.
Early demos of OpenAI’s video model Sora launched a thousand op-eds, sent filmmakers scrambling in uncertainty, and reportedly made Tyler Perry halt $800 million of spending on his Atlanta studio. However, Sora is still only available to a very small group of filmmakers hand-selected by OpenAI.
While Sora preens aloof on the sidelines, other AI video innovations arrive almost weekly. In the month of June, three big video models were released, including Kling, a new Chinese video model that produces videos that look ‘better than Sora’ to many eyes. It’s strangely good at videos of people eating, a much-ridiculed weak spot of other models.
After watching several Kling videos with a deep feeling of “Everything just changed”, I coincidentally stumbled onto a news story about the Yuntai waterfall in China. The natural wonder, which draws millions of visitors annually, was discovered by one tourist to be, as it turns out, not so natural. At the top, a large pipe was secretly adding gallons of water to the stream. It was being juiced to make a bigger splash.
This got me thinking about how AI products make a big splash. And the ease of obfuscating actual functionality in demos. To be clear, AI products are both powerful and real. I’m not saying they’re fake. Or can’t generate anything good. Or are secretly powered by a pipe from the Chinese government. Filmmakers are already using Sora to make real commercials.
However, the technology’s real impact will be different than the anticipation causing demo-watchers’ heads to ring with “Everything just changed!”. A lot of the most trumpeted advancements are not actually the things filmmakers want most. Increasing the realism of tree leaves from a 10 to an 11 is, in my opinion, a red herring. And that’s what I’ll write about today: why AI tech demos look so amazing. And why the reality is often less than promised.
How do they juice it?
Every AI company is alike. They want a demo that dominates social media feeds for a news cycle. And they optimize for this outcome when assembling their demos.
Let’s take the example of video models. In the last month alone we’ve gotten three new ones: Kling, Runway’s Gen-3, and LumaLabs’s Dream Machine. And of course, they all sit in the shadow of Sora.
To get a little more precise, let’s extend “demos” to mean “impressive clips floating around social media that demonstrate a model’s capability”. It’s not only companies: consumers and influencers also release demonstration videos with little context and plenty of trumpeting.
For artists naively grabbing AI tools to apply to real-world use cases, there’s a disconnect between reality and advertised reality.
In reality, AI video models are less useful in applied use cases than the demos would make you think. Here’s why:
The example clips are cherry-picked.
The demos avoid the model’s weaknesses and emphasize its strengths.
The demos conveniently blur the lines of what you’re looking at (text to video, image to video, video to video).
You can’t see the filmmaker’s intention! You only see the result.
You can hardly blame someone for cherry-picking the best shot from many takes. But demos also cherry-pick the kinds of shots they show. And our brains generalize from what’s included, overlooking what’s excluded. When we see videos of football and baseball, we think the model does all sports well. In reality, it may be utterly unable to do basketball or hockey. The same principle applies to things like camera movements and character movement.
Companies understand the strengths and weaknesses of their video models. Unsurprisingly they build demos around amazing visuals in the strength categories and exclude any weaknesses. An honest demo might be more like a progress report showing where the model is excelling and where it’s struggling. But that doesn’t excite people the same way. Demos simply are not in the business of working with the garage door up.
Video model demos also often obscure the workflow behind each clip. Whether a video is generated purely by a text prompt or whether it’s AI-rotoscoped is an important distinction. With these demos, you never quite know what you’re looking at.
Finally, viewers have no idea what the creator intended. I often prompt a model dozens of times for something specific and never get it. Any individual image pulled from my outputs might look impressive, but each falls short of what I actually wanted. When you hide the intention behind an image, you hide the shortcomings. The ability to reflect an artistic intention is, in my opinion, among the most important challenges video models face.
You can summarize all the above into a single point: applied use cases are way more involved than indiscriminately generating cool images.
Try to do something hard and specific with generative AI, and you’ll hit roadblocks very quickly.
Fidelity vs. Control: The cinema of sun-dappled foliage
I’d like to emphasize: the demos are not fake. They include actual video that was actually generated with AI. It’s not a Yuntai waterfall. So with all these amazing, real advancements, why hasn’t everything changed? It’s because the models keep improving fidelity rather than control. They’re less focused on the practical roadblocks filmmakers hit than on pumping up the photorealism of sun-dappled foliage to an 11.
Current video models simply can’t yet achieve what narrative film demands. While they nail sun-dappled foliage swaying in a breeze, they can’t really do a person talking. Or walking. Or walk-and-talking (Aaron Sorkin would be at a loss!). And in my opinion, the cinema of people walking and talking and doing stuff is better than the cinema of sun-dappled foliage.
Generative video models are research technology that has only been lightly productized. Advancements are foundational rather than practical. This is one reason why we still lack AI videos people want to watch. Real-world use cases hit weird problems, and researchers are rarely focused on the right ones. While most AI labs chase higher levels of photorealism, filmmakers would readily forgo photorealism for better lip syncing or facial movement. Demos conveniently elide these desired traits, instead showing fantastically crisp visuals.
Because there’s a large swath of filmmaking grammar that’s totally unavailable to the artists using these models, the best existing AI videos are all basically demos— suggestions of what could be, emphasizing the amazing and hiding the weak spots.
The models’ limitations are reflected in the current dominant genre of AI videos— trailers for movies that don’t exist. Filmmakers love making AI trailers for movies that they can’t make yet with the current technology. What is a movie trailer but a demo? A video that glides over the imperfections of a movie? That skips the substance and shows you “just the cool stuff”?
Just the cool stuff
In many ways this sums up where we are. The AI video space shows off all the coolest stuff being generated, while showing little of the practical problems holding back narrative storytelling in the medium.
That’s not a bad thing. Stunning demo videos excite people. But don’t use them as a yardstick to measure AI video model progress. Because it’s just the cool stuff.
As for this new generation of 2024 video models— Dream Machine, Runway Gen-3, Kling, Sora— they’re all big technical achievements. If you haven’t already, you’re guaranteed to see many an amazing output scroll by on social media.
But will they be tools that filmmakers can use fluidly? And will they move AI filmmaking towards videos people actually want to watch? We don’t know. Hopefully.
But we do know that for now seeing is not quite believing.
Thanks for the summary of the current pros and cons of AI-generated video clips!
It reminded me of PC Gamer magazine in the 1990s. Each issue of PC Gamer reviewed recently released computer games. The fun extra bit of each issue was that a CD was included, and it had little demos of upcoming computer game releases.
I'd play these demos, and many of the games looked awesome. Of course, when the actual games came out months (or years) later, most of them were okay, some were terrible, and only a handful were actually as cool as the demo I'd played.
It's always easier to generate hype than to follow through completely on your initial promise. AI video seems no different from computer game demos from the 1990s.
Just remember the best AI models will never be available to the public. The late Jim Simons and his team at Renaissance Technologies built the Medallion Fund on the most aggressive machine-learning algorithms. Those algorithms generated hundreds of billions of dollars, so far surpassing expectations that they had to buy investors out of the fund and keep it completely private, solely for employees.