Google DeepMind unveils a new video model to rival Sora


Google DeepMind, Google’s flagship AI research lab, wants to beat OpenAI at the video generation game — and it might just, at least for a little while.

On Monday, DeepMind announced Veo 2, a next-gen video-generating AI and the successor to Veo, which powers a growing number of products across Google’s portfolio. Veo 2 can create two-minute-plus clips in resolutions up to 4K (4096 x 2160 pixels).

Notably, that’s 4x the resolution — and over 6x the duration — OpenAI’s Sora can achieve.
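
For context, 4096 x 2160 works out to roughly 8.8 million pixels per frame, versus about 2.1 million pixels at Sora’s 1080p ceiling (a bit over 4x), and a two-minute-plus clip runs at least 120 seconds against Sora’s 20-second cap (hence over 6x).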

It’s a theoretical advantage for now, granted. In Google’s experimental video creation tool, VideoFX, where Veo 2 is now exclusively available, videos are capped at 720p and eight seconds in length. (Sora can produce up to 1080p, 20-second-long clips.)

Veo 2 in VideoFX. Image Credits: Google

VideoFX is behind a waitlist, but Google says it’s expanding the number of users who can access it this week.

Eli Collins, VP of product at DeepMind, also told TechCrunch that Google will make Veo 2 available via its Vertex AI developer platform “as the model becomes ready for use at scale.”

“Over the coming months, we’ll continue to iterate based on feedback from users,” Collins said, “and [we’ll] look to integrate Veo 2’s updated capabilities into compelling use cases across the Google ecosystem … [W]e expect to share more updates next year.”

More controllable

Like Veo, Veo 2 can generate videos given a text prompt (e.g. “A car racing down a freeway”) or text and a reference image.

So what’s new in Veo 2? Well, DeepMind says the model, which can generate clips in a range of styles, has an improved “understanding” of physics and camera controls, and produces “clearer” footage.

By clearer, DeepMind means textures and images in clips are sharper — especially in scenes with a lot of movement. As for the improved camera controls, they enable Veo 2 to position the virtual “camera” in the videos it generates more precisely, and to move that camera to capture objects and people from different angles.

DeepMind also claims that Veo 2 can more realistically model motion, fluid dynamics (like coffee being poured into a mug), and properties of light (such as shadows and reflections). That includes different lenses and cinematic effects, DeepMind says, as well as “nuanced” human expression.

Google Veo 2 sample. Note that the compression artifacts were introduced in the clip’s conversion to a GIF. Image Credits: Google

DeepMind shared a few cherry-picked samples from Veo 2 with TechCrunch last week. For AI-generated videos, they looked pretty good — exceptionally good, even. Veo 2 seems to have a strong grasp of refraction and tricky liquids, like maple syrup, and a knack for emulating Pixar-style animation.

But despite DeepMind’s insistence that the model is less likely to hallucinate elements like extra fingers or “unexpected objects,” Veo 2 can’t quite clear the uncanny valley.

Note the lifeless eyes in this cartoon dog-like creature:

Google Veo 2 sample. Image Credits: Google

And the weirdly slippery road in this footage — plus the pedestrians in the background blending into each other and the buildings with physically impossible facades:

Google Veo 2 sample. Image Credits: Google

Collins admitted that there’s work to be done.

“Coherence and consistency are areas for growth,” he said. “Veo can consistently adhere to a prompt for a couple minutes, but [it can’t] adhere to complex prompts over long horizons. Similarly, character consistency can be a challenge. There’s also room to improve in generating intricate details, fast and complex motions, and continuing to push the boundaries of realism.”

DeepMind’s continuing to work with artists and producers to refine its video generation models and tooling, added Collins.

“We started working with creatives like Donald Glover, the Weeknd, d4vd, and others since the beginning of our Veo development to really understand their creative process and how technology could help bring their vision to life,” Collins said. “Our work with creators on Veo 1 informed the development of Veo 2, and we look forward to working with trusted testers and creators to get feedback on this new model.”

Safety and training

Veo 2 was trained on lots of videos. That’s generally how AI models work: Provided with example after example of some form of data, the models pick up on patterns in the data that allow them to generate new data.

DeepMind won’t say exactly where it scraped the videos to train Veo 2, but YouTube is one possible source; Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo “may” be trained on some YouTube content.

“Veo has been trained on high-quality video-description pairings,” Collins said. “Video-description pairs are a video and associated description of what happens in that video.”
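
Collins didn’t spell out the format, but a video-description pair can be pictured as a simple record coupling a clip with a caption of what happens in it. A minimal sketch in Python, with hypothetical field names and file paths that are not Google’s actual training schema:

```python
from dataclasses import dataclass

# Illustrative only: the field names and file path below are hypothetical,
# not Google's actual training-data schema.
@dataclass
class VideoDescriptionPair:
    video_path: str   # pointer to the source clip
    description: str  # text describing what happens in the video

example = VideoDescriptionPair(
    video_path="clips/coffee_pour.mp4",
    description="Coffee is poured into a white mug; steam rises from the surface.",
)
```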

Google Veo 2 sample. Image Credits: Google

While DeepMind, through Google, hosts tools to let webmasters block the lab’s bots from extracting training data from their websites, DeepMind doesn’t offer a mechanism to let creators remove works from its existing training sets. The lab and its parent company maintain that training models using public data is fair use, meaning that DeepMind believes it isn’t obligated to ask permission from data owners.

Not all creatives agree — particularly in light of studies estimating that tens of thousands of film and TV jobs could be disrupted by AI in the coming years. Several AI companies, including the eponymous startup behind the popular AI art app Midjourney, are in the crosshairs of lawsuits accusing them of infringing on artists’ rights by training on content without consent.

“We’re committed to working collaboratively with creators and our partners to achieve common goals,” Collins said. “We continue to work with the creative community and people across the wider industry, gathering insights and listening to feedback, including those who use VideoFX.”

Because of the way today’s generative models behave when trained, they carry certain risks, like regurgitation, in which a model generates a mirror copy of its training data. DeepMind’s solution is prompt-level filters, including for violent, graphic, and explicit content.
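
Google hasn’t published how these filters work, but the general shape of a prompt-level filter can be sketched with a toy example; a production system would lean on trained classifiers rather than the hypothetical keyword list below.

```python
# Illustrative only: a toy prompt-level filter. A real system would use
# trained classifiers, not this hypothetical keyword list.
BLOCKED_TERMS = {"gore", "beheading", "explicit"}

def is_prompt_allowed(prompt: str) -> bool:
    """Reject prompts containing any blocked term (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(is_prompt_allowed("A car racing down a freeway"))  # True
print(is_prompt_allowed("An explicit scene"))            # False
```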

Google’s indemnity policy, which provides a defense for certain customers against allegations of copyright infringement stemming from the use of its products, won’t apply to Veo 2 until it’s generally available, Collins said.

Google Veo 2 sample. Image Credits: Google

To mitigate the risk of deepfakes, DeepMind says it’s using its proprietary watermarking technology, SynthID, to embed invisible markers into frames Veo 2 generates. However, like all watermarking tech, SynthID isn’t foolproof.
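
SynthID’s actual technique is proprietary and far more robust than anything this simple, but the general concept of hiding an imperceptible marker in pixel data can be illustrated with a toy least-significant-bit sketch:

```python
import numpy as np

# Illustrative only: this is NOT how SynthID works. It just shows the idea
# of an invisible per-frame marker, by rewriting each pixel's least
# significant bit with a repeating bit pattern.
def embed_toy_watermark(frame: np.ndarray, bits: list[int]) -> np.ndarray:
    flat = frame.flatten()  # flatten() copies, so the input frame is untouched
    pattern = np.resize(np.array(bits, dtype=np.uint8), flat.shape)
    return ((flat & 0xFE) | pattern).reshape(frame.shape)

def read_toy_watermark(frame: np.ndarray, n_bits: int) -> list[int]:
    return list((frame.flatten()[:n_bits] & 1).astype(int))

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
marked = embed_toy_watermark(frame, [1, 0, 1, 1])
assert read_toy_watermark(marked, 4) == [1, 0, 1, 1]
assert np.abs(marked.astype(int) - frame.astype(int)).max() <= 1  # visually invisible
```

A bit-level trick like this wouldn’t survive re-encoding or the GIF conversion mentioned above, which is part of why real watermarking systems are harder to build, and why, as noted, none of them is foolproof.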

Imagen upgrades

In addition to Veo 2, Google DeepMind this morning announced upgrades to Imagen 3, its commercial image generation model.

A new version of Imagen 3 is rolling out to users of ImageFX, Google’s image-generating tool, beginning today. It can create “brighter, better-composed” images and photos in styles like photorealism, impressionism, and anime, per DeepMind.

“This upgrade [to Imagen 3] also follows prompts more faithfully, and renders richer details and textures,” DeepMind wrote in a blog post provided to TechCrunch.

Google ImageFX. Image Credits: Google

Rolling out alongside the model are UI updates to ImageFX. Now, when users type prompts, key terms in those prompts will become “chiplets” with a drop-down menu of suggested, related words. Users can use the chiplets to iterate on what they’ve written, or select from a row of auto-generated descriptors beneath the prompt.


