VideoPoet
A large language model for zero-shot video generation
VideoPoet can output high-motion, variable-length videos given a text prompt.
Video-to-audio
VideoPoet can also output audio to match an input video without using any text as guidance. Unmute the videos to play the audio.
Using generative models to tell visual stories
To showcase VideoPoet's capabilities, we have produced a short movie composed of many short clips generated by the model. For the script, we asked Bard to write a series of prompts to detail a short story about a traveling raccoon. We then generated video clips for each prompt, and stitched together all resulting clips to produce the final YouTube Short below.
Introduction
VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It contains a few simple components:
- A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips of variable length into a sequence of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating integration with other modalities, such as text.
- An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.
- A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. Furthermore, these tasks can be composed for additional zero-shot capabilities (e.g., text-to-audio); a minimal sketch of the end-to-end pipeline follows this list.
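The tokenize-predict-decode pipeline described above can be summarized in a few lines. The sketch below is illustrative only: the `text_tokenizer`, `video_tokenizer`, and `lm` objects, their methods, and the token budget are hypothetical stand-ins, not a released API.

```python
# A minimal, hypothetical sketch of the pipeline described above.
# All objects and method names are illustrative placeholders,
# not VideoPoet's actual interface.

def generate_video(prompt, text_tokenizer, video_tokenizer, lm,
                   num_video_tokens=1280):
    """Encode text, autoregressively sample video tokens, decode to pixels."""
    # 1. The text prompt is encoded with an ordinary text tokenizer.
    tokens = list(text_tokenizer.encode(prompt))
    prefix_len = len(tokens)

    # 2. The LM predicts discrete video tokens from the unified vocabulary,
    #    one token at a time, conditioned on everything sampled so far.
    for _ in range(num_video_tokens):
        tokens.append(lm.sample_next(tokens))

    # 3. The MAGVIT V2 tokenizer's decoder maps the sampled video tokens
    #    back to RGB frames.
    return video_tokenizer.decode(tokens[prefix_len:])
```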
This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency. VideoPoet demonstrates state-of-the-art video generation, in particular in producing a wide range of large, interesting, and high-fidelity motions. The VideoPoet model supports generating videos in square or portrait orientation to tailor generations toward short-form content, and also supports audio generation from a video input.
Quick Links
To view additional results, please also visit our other pages:
Text-to-Video - Image-to-Video - Video Editing - Stylization - Inpainting
Visual narratives
Prompts can be changed over time to tell visual stories.
See the Video Editing page for additional results.
Long(er) video generation
By default, VideoPoet outputs 2-second videos. But the model is also capable of long video generation by predicting 1 second of video output given an input of a 1-second video clip. This process can be repeated indefinitely to produce a video of any duration. Despite the short input context, the model shows strong object identity preservation not seen in prior works, as demonstrated in these longer duration clips.
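This chunked extension scheme is easy to express in code. In the sketch below, `model.extend` and `video_tokenizer` are hypothetical placeholders, assuming `extend()` maps the tokens of a 1-second clip to the tokens of the following second and clips are arrays of frames.

```python
import numpy as np

# A minimal sketch of the extension loop described above. `model` and
# `video_tokenizer` are hypothetical placeholders, not a real API.

def extend_video(model, video_tokenizer, first_clip, extra_seconds):
    """Grow a video one second at a time, conditioning on the last second."""
    clips = [first_clip]  # each element holds one second of frames
    for _ in range(extra_seconds):
        context = video_tokenizer.encode(clips[-1])  # tokens of last second
        predicted = model.extend(context)            # tokens of next second
        clips.append(video_tokenizer.decode(predicted))
    return np.concatenate(clips, axis=0)  # stitch along the time axis
```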
See the Video Editing page for additional results.
Controllable video editing
The VideoPoet model can edit a subject to follow different motions, such as dance styles. In the example below, the model processes the same input clip with different prompts.
See the Video Editing page for additional results.
Interactive video editing
Interactive editing is also possible: the model extends an input video by a short duration, and the user selects from a list of candidate continuations. By selecting the best video from the list of candidates, we can finely control the type of motion that appears in the final generated video. Here we generate three samples without text conditioning and the final one with text conditioning.
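A sketch of this interactive loop, with hypothetical `model.extend` and `choose` hooks (none of these names come from a released API):

```python
# Illustrative sketch of interactive editing: sample several short
# continuations, let a human (or heuristic) pick one, and repeat.
# `model.extend` and `choose` are hypothetical placeholders.

def interactive_extend(model, clip, rounds=4, candidates_per_round=3,
                       prompt=None, choose=None):
    """Extend a clip round by round, keeping the preferred candidate."""
    for _ in range(rounds):
        # Conditioning on a text prompt is optional; pass prompt=None to
        # sample unconditioned continuations.
        candidates = [model.extend(clip, prompt=prompt)
                      for _ in range(candidates_per_round)]
        # The user in the loop selects the candidate with the desired
        # motion; default to the first sample if no chooser is given.
        clip = choose(candidates) if choose else candidates[0]
    return clip
```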
See the Video Editing page for additional results.
Image to video generation
VideoPoet can take any input image and generate a video matching a given text prompt.
Image sources: Wikimedia Commons, see footnote**
Zero-shot stylization
VideoPoet is also capable of stylizing input videos guided by a text prompt, adhering closely to the prompt while producing visually pleasing styles.
See the Stylization page for additional results.
Applying Visual Styles and Effects
Styles and effects can easily be composed in text-to-video generation. We start with a base prompt and append a style to it.
Prompt: "An astronaut riding a horse in a lush forest".
See the Stylization page for additional results.
Zero-shot controllable camera motions
One emergent property of VideoPoet's pre-training is that a large degree of high-quality camera motion customization is possible by specifying the type of camera shot in the text prompt.
Prompt: "Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river."
Authors
Dan Kondratyuk*, Lijun Yu*, Xiuye Gu*, José Lezama*, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold*, Lu Jiang*
*Equal technical contribution
Acknowledgements
We give special thanks to Alex Siegman, Victor Gomes, and Brendan Jou for managing computing resources. We also thank Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, and John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.
**Referenced works:
- Old Faithful, public domain.
- Pillars of Creation, public domain.
- The Storm on the Sea of Galilee, public domain.
- Raising the Flag on Iwo Jima, public domain.
- Wanderer above the Sea of Fog, public domain.
- Mona Lisa, public domain.