Meta’s Movie Gen model puts out realistic video with sound, so we can finally have infinite Moo Deng

October 4, 2024

No one really knows what generative video models are useful for just yet, but that hasn’t stopped companies like Runway, OpenAI, and Meta from pouring millions into developing them. Meta’s latest is called Movie Gen, and true to its name, it turns text prompts into relatively realistic video with sound… but thankfully no voice just yet. And wisely, they are not giving this one a public release.

Movie Gen is actually a collection (or “cast” as they put it) of foundation models, the largest of which is the text-to-video bit. Meta claims it outperforms the likes of Runway’s Gen3, LumaLabs’ latest, and Kling1.5, though as always this type of thing is more to show that they are playing the same game than that Movie Gen wins. The technical particulars can be found in the paper Meta put out describing all the components.

Audio is generated to match the contents of the video, adding for instance engine noises that correspond with car movements, or the rush of a waterfall in the background, or a crack of thunder halfway through the video when it’s called for. It’ll even add music if that seems relevant.

It was trained on “a combination of licensed and publicly available datasets” that they called “proprietary/commercially sensitive” and would provide no further details on. We can only guess that this means a lot of Instagram and Facebook videos, plus some partner stuff and a lot of others that are inadequately protected from scrapers, AKA “publicly available.”

What Meta is clearly aiming for here, however, is not simply capturing the “state of the art” crown for a month or two, but a practical, soup-to-nuts approach where a solid final product can be produced from a very simple, natural-language prompt. Stuff like “imagine me as a baker making a shiny hippo cake in a thunderstorm.”

For instance, one sticking point for these video generators has been how difficult they usually are to edit. If you ask for a video of someone walking across the street, then realize you want them walking right to left instead of left to right, there’s a good chance the whole shot will look different when you repeat the prompt with that additional instruction. Meta is adding a text-based editing method where you can simply say “change the background to a busy intersection” or “change her clothes to a red dress” and it will attempt to make that change, but only that change.

Camera movements are also generally understood, with things like “tracking shot” and “pan left” taken into account when generating the video. This is still pretty clumsy compared with real camera control, but it’s a lot better than nothing.

The limitations of the model are a little weird. It generates video 768 pixels wide, a dimension familiar to most from the famous but outdated 1024×768, but which is also three times 256, making it play well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it generates that resolution. Not really true, but we’ll give them a pass because upscaling is surprisingly effective.

Weirdly, it generates up to 16 seconds of video… at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also do 10 seconds of video at 24 FPS. Lead with that one!
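For what it's worth, the arithmetic makes the two modes look less arbitrary. Here is a quick sketch that counts the frames in each; the function name is ours, and the idea that the model works from a roughly fixed frame budget is our guess, not something Meta has confirmed.

```python
# Frame counts for the two output modes quoted above.
# Assumption: a "frame budget" reading is our interpretation, not Meta's stated design.

def frame_count(seconds: int, fps: int) -> int:
    """Total number of frames in a clip of the given length and frame rate."""
    return seconds * fps

modes = {
    "16 s @ 16 FPS": frame_count(16, 16),
    "10 s @ 24 FPS": frame_count(10, 24),
}

for label, frames in modes.items():
    print(f"{label}: {frames} frames")
```

That works out to 256 frames for the longer mode and 240 for the shorter one, so either way you are getting roughly the same amount of generated video, just stretched differently.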

As for why it doesn’t do voice… well, there are likely two reasons. First, it’s super hard. Generating speech is easy now, but matching it to lip movements, and those lips to face movements, is a much more complicated proposition. I don’t blame them for leaving this one until later, since it would be a minute-one failure case. Someone could say “generate a clown delivering the Gettysburg Address while riding a tiny bike in circles,” and the result would be nightmare fuel primed to go viral.

The second reason is likely political: putting out what amounts to a deepfake generator a month before a major election is… not the best for optics. Crimping its capabilities a bit, so that malicious actors would have to do some real work to misuse it, is a practical preventive step. One certainly could combine this generative model with a speech generator and an open lip-syncing one, but you can’t just have it generate a candidate making wild claims.

“Movie Gen is purely an AI research concept right now, and even at this early stage, safety is a top priority as it has been with all of our generative AI technologies,” said a Meta rep in response to TechCrunch’s questions.

Unlike, say, the Llama large language models, Movie Gen won’t be publicly available. You can replicate its techniques somewhat by following the research paper, but the code won’t be published, except for the “underlying evaluation prompt dataset,” which is to say the record of what prompts were used to generate the test videos.
