A large language model for zero-shot video generation

A dog listening to music with headphones, highly detailed, 8k.

A large blob of exploding splashing rainbow paint, with an apple emerging, 8k

A robot cat eating spaghetti, digital art.

A pumpkin exploding, slow motion.

Two pandas playing cards.

A vaporwave fashion dog in Miami looks around and barks, digital art.

An astronaut riding a galloping horse.

A family of raccoons living in a small cabin, tilt shift, arc shot.

A golden retriever wearing VR goggles and eating pizza in Paris.

A tree walking through the forest, tilt shift.

A walking figure made out of water

A shark with a laser beam coming out of its mouth.

Teddy bears holding hands, walking down rainy 5th ave

A chicken lifting weights.

An origami fox walking through the forest.

Robot emerging from a large column of billowing black smoke, high quality.

A t-rex jumping over a cactus, with water gushing after the t-rex falls.

A mouse eating cheese in a royal dress, arc shot.

An alien enjoys food, 8k.

A lion with a mane made out of yellow dandelion petals roars.

A massive explosion on the surface of the earth.

A horse galloping through Van Gogh's 'starry night'.

A squirrel in armor riding a goose, action shot.

A panda taking a selfie.

An octopus attacks New York.

A bear with the head of an owl screeches loudly

An astronaut typing on a keyboard, arc shot.

A rabbit eating grass, soft lighting.

Flag of the US on top of a tall white mountain, rotating panorama.

Motorcyclist on a racing track, highly detailed.

A massive tidal wave crashes dramatically against a rugged coastline, digital art.

Humans building a highway on Mars, cinematic.

A skeleton drinking a glass of soda.

The orient express driving through a fantasy landscape, animated oil on canvas.

VideoPoet can output high-motion variable length videos given a text prompt.


VideoPoet can also output audio to match an input video without using any text as guidance. Unmute the videos to play the audio.

A dog eating popcorn at the cinema.

A teddy bear with a cap, sunglasses, and leather jacket playing drums.

A teddy bear in a leather jacket, baseball cap, and sunglasses playing guitar in front of a waterfall.

A pink cat playing piano in the forest.

The orient express driving through a fantasy landscape, oil on canvas.

A dragon breathing fire, cinematic.

Using generative models to tell visual stories

To showcase VideoPoet's capabilities, we have produced a short movie composed of many short clips generated by the model. For the script, we asked Bard to write a series of prompts to detail a short story about a traveling raccoon. We then generated video clips for each prompt, and stitched together all resulting clips to produce the final YouTube Short below.


VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It contains a few simple components:

  • A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, videos, and audio clips of variable length into sequences of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating integration with other modalities, such as text.

  • An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.

  • A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities (e.g., text-to-audio).
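The components above can be sketched as a single decoding loop over one shared discrete vocabulary. Everything in this sketch (the vocabulary sizes, the character-level "tokenizer", and the model stub) is a hypothetical stand-in for illustration, not the actual MAGVIT V2, SoundStream, or VideoPoet implementation.

```python
# Toy sketch of the unified-vocabulary data flow: text, video, and audio
# codes share one token space, and a single autoregressive model predicts
# the next token. All sizes and functions here are illustrative stand-ins.

TEXT_VOCAB = 128    # hypothetical vocabulary sizes for illustration
VIDEO_VOCAB = 256
AUDIO_VOCAB = 64

# Offsets place each modality in disjoint ranges of the shared vocabulary.
TEXT_OFFSET = 0
VIDEO_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VIDEO_VOCAB

def tokenize_text(prompt: str) -> list[int]:
    # Toy "tokenizer": one token per character, mapped into the text range.
    return [TEXT_OFFSET + (ord(c) % TEXT_VOCAB) for c in prompt]

def toy_next_token(context: list[int]) -> int:
    # Stand-in for the LLM: deterministically emits a video-range token.
    return VIDEO_OFFSET + (sum(context) % VIDEO_VOCAB)

def generate_video_tokens(prompt: str, n_tokens: int) -> list[int]:
    # Autoregressive loop: each predicted token is appended to the context.
    context = tokenize_text(prompt)
    generated = []
    for _ in range(n_tokens):
        token = toy_next_token(context)
        context.append(token)
        generated.append(token)
    return generated

tokens = generate_video_tokens("A panda taking a selfie.", n_tokens=8)
```

In the real system, the generated video tokens would then be decoded back into pixels by the video tokenizer's decoder; the key point the sketch shows is that conditioning and generation are both just sequences in one vocabulary.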

This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency. VideoPoet demonstrates state-of-the-art video generation, in particular in producing a wide range of large, interesting, and high-fidelity motions. The VideoPoet model supports generating videos in square or portrait orientation to tailor generations toward short-form content, as well as generating audio from a video input.

An overview of the VideoPoet model, which is capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

Quick Links

To view additional results, please also visit our other pages:

Text-to-Video - Image-to-Video - Video Editing - Stylization - Inpainting

Visual narratives

Prompts can be changed over time to tell visual stories.

Input Video

A walking figure made out of water.

Extended Video

A walking figure made out of water. Lightning flashes in the background. Purple smoke emits from the figure of water.

Input Video

Two raccoons on motorbikes on a mountain road surrounded by pine trees, 8k.

Extended Video

Two raccoons on motorbikes. A meteor shower falls behind the raccoons. The meteors impact the earth and explode.

See the Video Editing page for additional results.

Long(er) video generation

By default, VideoPoet outputs 2-second videos. But the model is also capable of longer video generation by predicting 1 second of video output conditioned on the final 1-second clip of the video so far. This process can be repeated indefinitely to produce a video of any duration. Despite the short input context, the model shows strong object identity preservation not seen in prior works, as demonstrated in these longer duration clips.
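The chunked extension process described above can be sketched as a simple loop. Here `extend_one_second`, `FPS`, and the integer "frames" are hypothetical stand-ins for the real model call and token stream; only the sliding 1-second context window is the point being illustrated.

```python
# Sketch of iterative long-video generation: condition on the last second
# of video, predict the next second, and repeat to reach any duration.

FPS = 8  # assumed frames per second of context, for illustration only

def extend_one_second(context_frames: list[int]) -> list[int]:
    # Toy stand-in for the model: each new "frame" continues the last id.
    last = context_frames[-1]
    return [last + i + 1 for i in range(FPS)]

def generate_long_video(seed_frames: list[int], seconds: int) -> list[int]:
    frames = list(seed_frames)
    for _ in range(seconds):
        context = frames[-FPS:]  # only the most recent 1 s is re-fed
        frames.extend(extend_one_second(context))
    return frames

# 1 s of seed frames extended by 3 s of generated frames.
video = generate_long_video(list(range(FPS)), seconds=3)
```

Because only the trailing second is re-fed each step, memory stays constant regardless of total length; the identity preservation noted above has to come from the model itself rather than from a long context.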

An astronaut starts dancing on Mars as colorful fireworks explode in the background.

FPV drone footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.

Teddy bears holding hands, walking down rainy 5th ave

FPV drone footage entering a cyberpunk city at night with many neon lights and reflective surfaces.

A large blob of exploding splashing rainbow paint, with an apple emerging, 8k

FPV drone footage of an ancient city in autumn.

See the Video Editing page for additional results.

Controllable video editing

The VideoPoet model can edit a subject to follow different motions, such as dance styles. In the example below, the model processes the same input clip with different prompts.

Input Video

A raccoon dancing in Times Square.

A raccoon dancing the robot in Times Square.

A raccoon dancing the griddy in Times Square.

A raccoon dancing freestyle in Times Square.

See the Video Editing page for additional results.

Interactive video editing

Interactive editing is also possible: an input video can be extended by a short duration, with the user choosing from a list of candidate continuations. By selecting the best video from the list of candidates, we can finely control the type of motion in a larger generated video. Here we generate three samples without text conditioning and a final one with text conditioning.

Input Video

Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass.

Sample 1 (no prompt)

Sample 2 (no prompt)

Sample 3 (no prompt)

Powering up with smoke in the background.

See the Video Editing page for additional results.

Image to video generation

VideoPoet can take any input image and generate a video matching a given text prompt.


A geyser spraying water into the air.


Flying through a nebula with many twinkling stars.


White milk splashing in a ring, a drop above the ring falls down, making a splash.


A ship navigating the rough seas with several passengers on board, thunderstorm and lightning, animated oil on canvas.


A green man riding a green horse with the wind blowing.


Soldiers raising the united states flag on a windy day.


A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day.


A woman yawning.

Images source: Wikimedia Commons, see footnote**

Zero-shot stylization

VideoPoet is also capable of stylizing input videos guided by a text prompt, producing visually pleasing results that adhere closely to the prompt.


Wombat wearing sunglasses holding a beach ball on a sunny beach.


Teddy bears ice skating on a crystal clear frozen lake.


A metal lion roaring in the light of a forge.


A pink and blue confetti geyser with candy-coated trees.


A red and white woodcut print of a man overlooking a stormy sea.


A magical snow-covered forest of dense pine trees.

See the Stylization page for additional results.

Applying visual styles and effects

Styles and effects can easily be composed in text-to-video generation. We start with a base prompt and append a style to it.

Prompt: "An astronaut riding a horse in a lush forest".


Digital art

Pencil art

Ink wash

Double exposure

Small world

See the Stylization page for additional results.

Zero-shot controllable camera motions

One emergent property of VideoPoet's pre-training is that a large degree of high-quality camera motion customization is possible by specifying the type of camera shot in the text prompt.

Prompt: "Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river."

Zoom out

Dolly zoom

Pan left

Arc shot

Crane shot

FPV drone shot



Dan Kondratyuk*, Lijun Yu*, Xiuye Gu*, José Lezama*, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold*, Lu Jiang*

*Equal technical contribution


We give special thanks to Alex Siegman, Victor Gomes, and Brendan Jou for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

**Referenced works: