Full of Hot Air
I’m not a fan of video generators. That’s not to say that I hate “AI” being used in video; it can be a powerful tool. What I hate are these things that are not tools.
Sora and Google’s recently announced Veo produce RGB pixel output based on the videos they were trained on, which seems to be a lot of stock footage.
They don’t make elements, assets, bits, or pieces; they bake the whole shebang, flattened out, into a rectangle. If you don’t like the completed video, you can keep on prompting, but there’s no file for you to open and tweak.
It’s no surprise that working with the output of these video generators is like working with stock footage. There’s nothing shameful about using stock footage; that’s why it’s there. You can buy explosions, fire, dust hits, blood squirts, wooden debris falling, arc welding sparks, rain, aerial views of cities, etc.
A person has to decide how to combine those elements, which is where Sora and Veo come in: they decide how to combine the stock footage they were trained on and flatten it out into a new result.
Sometimes the results are uncanny, but they only hold up for that one shot. Sometimes the results are warbling bodies and boiling fingers. A lot of the time, it’s in slow motion.
Air Head
Air Head, made by Shy Kids, was released a month ago. The people at Shy Kids are clever: they knew the most difficult thing to do would be maintaining a consistent main character with facial features, expressions, and lipsync, so they made the head a featureless balloon. Even then, they couldn’t get Sora to give them the same yellow balloon, attached to the same head, and scaled proportionately to the body from shot to shot. That’s why it plays like a found-footage documentary.
Instead, continuity is imposed on this montage of Sora output in the edit. Shy Kids had to slice and dice the individual clips to force even a little consistency, even though the only thing credited at the end of the video is Sora.
Here’s a video breakdown released by Shy Kids, on their YouTube channel, not on the OpenAI channel where Air Head is:
“How do you maintain a character, and look consistent, even though Sora is very much a slot machine as to what you get back?” — Walter Woodman, Director at Shy Kids
You didn’t, Walter. You cut shots together, back to back, that aren’t even close to matching.
Shy Kids followed up that first short with Deflated, where they combined live-action footage with Sora output and explicitly said that “VFX software” was used.
This looks terrible on a technical level. If you don’t notice the problems, trust me and watch it again. Look at the edges. Look at the neck. Look at the black levels (the darkest darks) and the white levels (check out those posters and billboards).
It also still has the problems from the first video, with the balloon, its scale, and its attachment point changing wildly throughout, but now there are matting issues from trying to combine the Sora output with the live action. Sora isn’t a compositing tool, so it can’t insert that footage itself.
In a typical production pipeline you could shoot the reporter and the balloon boy walking down the street together, then paint out the balloon boy’s head (a relatively trivial matter these days, I assure you), matchmove the actor’s head and neck, render a yellow balloon asset, and composite it over the painted plate. The balloon would look the same in every single shot, and it would match the actor’s performance instead of just wobbling.
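To make that concrete, here’s a minimal sketch of what the final assembly step could look like as a Nuke Python script. The file paths and node layout are hypothetical, and the hard parts (the head paint-out and the matchmove driving the balloon render) happen upstream in their own tools; this only shows the comp.

```python
# A rough sketch of the comp assembly described above (Nuke Python).
# All file paths are placeholders; the paint-out and matchmove happen upstream.
import nuke

# The live-action plate with the actor's real head already painted out.
clean_plate = nuke.nodes.Read(file="plates/street_cleanplate.####.exr")

# The CG yellow balloon, rendered to match the matchmoved head/neck track,
# with an alpha channel for compositing.
balloon_cg = nuke.nodes.Read(file="renders/balloon_head.####.exr")

# Composite the balloon over the painted plate: input 0 is the background (B),
# input 1 is the foreground (A), and "over" uses the balloon's alpha.
comp = nuke.nodes.Merge2(operation="over")
comp.setInput(0, clean_plate)
comp.setInput(1, balloon_cg)

# Write out the finished shot.
out = nuke.nodes.Write(file="comp/street_balloon_comp.####.exr")
out.setInput(0, comp)
```

Because the balloon is its own rendered element, swapping in a different balloon or fixing the track means re-rendering one input, not re-prompting the entire shot.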
They close out the video montage with the reporter teasing an upcoming segment with a doctor. “Why are so many people moving in slow motion? And what’s going on with everyone’s hands?” At least they have some humor about it.
Yesterday, a week after Deflated was released, OpenAI’s YouTube channel posted a BTS video from Shy Kids.
They don’t include any botched Sora output, like they did in the BTS for the first short on their own YouTube channel. This time around they show how they stabilize, rotoscope, composite, and color correct the Sora output to fit with the live-action reporter. They also replace the balloon in a few shots, like the magazine covers.
The most effective uses of Sora in the short are the random montage clips inserted in the “Tough Questions” opener and the montage of people wearing balloons on a runway. It’s no wonder those are more successful: they’re based on the kind of input you get from stock footage, and they’re being used in place of stock footage.
What about the aerial shots of the balloon deflating? The first shot worked well, but then they cut to other shots and you could see that it wasn’t the same location, and it wasn’t the same balloon. Sure, they didn’t have to rent a helicopter and gear, but people very rarely do that and instead use … stock footage you can buy online and then put your matching balloon asset on top of.
OpenAI and Google are both selling these video generators as technological breakthroughs in filmmaking. The reality is that the output is artifact-ridden stock footage.
Bad clients and bad producers will tell their editors to put Sora or Veo output into the initial edit, and then they’ll turn to a VFX house and say that the shots are “90% there” and they “just need someone to take it across the finish line.”
How do I know this? Because it already happens with stock footage, and with the weird composites and retimes editors make in Avid when clients want something in the edit so they can figure it out. Even if the client agrees to replace it, they can get married to how the stock footage or temp looked, or how it was timed (remember that playback speed is a factor).
That’s why these companies talk about how their tools empower directors and small-scale productions. Until those directors want to change something just a little bit.
As a VFX artist I’m all for using AI/ML to improve the work, but that kind of improvement is in better tracking tools. It’s in better keying and extraction tools. It’s in generative fill for paint purposes. It’s in screen inserts for funky monitors and round-rect smartphones. It’s in better photogrammetry with Gaussian splatting, radiance fields, etc.
Tools are parts of a pipeline where you can adjust individual elements upstream or downstream without starting over from scratch, because the result isn’t just one flattened hallucination. It’s not a piece of stock footage.
CopyCat and Cattery
Foundry has a tool called CopyCat that lets you train a model on a repetitive task, as explained here:
In this example, tracking markers are removed from a stack of green paper cards by painting only 5 of the frames to train the model.
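For a rough idea of how that sits in a comp script, here’s a minimal sketch of the setup in Nuke Python. The node class names (“CopyCat”, “Inference”) come from NukeX’s ML toolset, but the input ordering and file paths here are my assumptions for illustration; in practice you’d wire this up in the node graph.

```python
# A rough sketch of the CopyCat marker-removal setup described above.
# Assumptions: CopyCat/Inference node class names from NukeX's ML toolset;
# input ordering and file paths are placeholders for illustration.
import nuke

# The full sequence of green cards with tracking markers still visible.
plate = nuke.nodes.Read(file="plates/green_cards.####.exr")

# The handful of frames (5 here) that were painted clean by hand.
painted_frames = nuke.nodes.Read(file="paint/green_cards_clean.####.exr")

# CopyCat trains a small model that learns to turn "plate" frames into
# "painted" frames, and saves the result as a .cat file.
copycat = nuke.createNode("CopyCat")
copycat.setInput(0, plate)           # assumed: source frames
copycat.setInput(1, painted_frames)  # assumed: painted ground-truth frames

# Once trained, an Inference node loads the .cat file and applies the
# cleanup to every frame of the sequence, not just the 5 painted ones.
inference = nuke.createNode("Inference")
inference.setInput(0, plate)
```

As with the balloon comp above, every piece stays editable: paint another frame, retrain, and the rest of the pipeline doesn’t change.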
Here’s Foundry’s brief explanation of Cattery, a library of pre-trained models ready to use in Nuke alongside CopyCat:
This is how you use machine learning to do filmmaking. These aren’t for making choppy montages.
New breakthroughs in stock footage are being misrepresented by OpenAI and Google, hungry for investment and sweaty to establish their products before too many people ask questions. It repulses me as a VFX artist, and it should repulse you too.