VideoPoet – Google Research

Visual narratives

Prompts can be changed over time to tell visual stories.

Input Video

A walking figure made out of water.

Extended Video

A walking figure made out of water. Lightning flashes in the background. Purple smoke emits from the figure of water.

Input Video

Two raccoons on motorbikes on a mountain road surrounded by pine trees, 8k.

Extended Video

Two raccoons on motorbikes. A meteor shower falls behind the raccoons. The meteors impact the earth and explode.

Long(er) video generation

By default, VideoPoet outputs 2-second videos. The model can also generate longer videos by predicting 1 second of video output from a 1-second input clip. Repeating this step extends the video to any duration. Despite the short input context, the model shows strong object identity preservation not seen in prior works, as demonstrated in these longer-duration clips.
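The repeated extension step can be illustrated with a minimal sketch. This is not VideoPoet's code: `extend_video`, `predict_next_second`, and the `fps` value are hypothetical names chosen for illustration, frames are represented by plain integers, and the sketch assumes each step conditions on the most recent second of the video generated so far.

```python
from typing import Callable, List, Sequence

Frame = int  # stand-in for a real frame (e.g. an image array or token block)


def extend_video(
    initial_clip: Sequence[Frame],
    predict_next_second: Callable[[Sequence[Frame]], Sequence[Frame]],
    target_seconds: int,
    fps: int = 8,  # hypothetical frame rate, used only for this illustration
) -> List[Frame]:
    """Grow a clip one second at a time until it lasts target_seconds."""
    if len(initial_clip) < fps:
        raise ValueError("initial clip must contain at least one second of frames")

    video = list(initial_clip)
    target_frames = target_seconds * fps

    while len(video) < target_frames:
        # Assumption: the model only sees the last second as context.
        context = video[-fps:]
        new_frames = list(predict_next_second(context))
        if len(new_frames) != fps:
            raise ValueError("predictor must return exactly one second of frames")
        video.extend(new_frames)

    # Trim in case the initial clip was not a whole number of seconds.
    return video[:target_frames]


def toy_predictor(context: Sequence[Frame]) -> List[Frame]:
    """Toy stand-in for the model: continues the integer sequence."""
    last = context[-1]
    return [last + i + 1 for i in range(len(context))]


if __name__ == "__main__":
    two_second_clip = list(range(16))  # 2 seconds at 8 frames per second
    long_video = extend_video(two_second_clip, toy_predictor, target_seconds=5)
    print(len(long_video))
```

Because each step only sees one second of context, any long-range consistency has to be carried by that final second, which is why the identity preservation noted above is notable.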

An astronaut starts dancing on Mars as colorful fireworks explode in the background.

FPV drone footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.

Teddy bears holding hands, walking down rainy 5th ave

FPV drone footage entering a cyberpunk city at night with many neon lights and reflective surfaces.

A large blob of exploding splashing rainbow paint, with an apple emerging, 8k

FPV drone footage of an ancient city in autumn.

Interactive video editing

Interactive editing is also possible: an input video is extended by a short duration, and the extension is chosen from a list of candidates. By selecting the best candidate at each step, we can finely control the types of motion that appear in the longer generated video. Here we generate three samples without text conditioning and a final one with text conditioning.
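One round of this select-and-extend workflow might look like the sketch below. It is an illustration under stated assumptions rather than the actual interface: `extend_with_choice`, `sample_extension`, and `choose` are hypothetical names, clips are lists of integers, and a prompt of `None` stands for generation without text conditioning.

```python
from typing import Callable, List, Optional, Sequence

Clip = List[int]  # stand-in for a short sequence of frames


def extend_with_choice(
    video: Sequence[int],
    sample_extension: Callable[[Sequence[int], Optional[str], int], Clip],
    choose: Callable[[List[Clip]], int],
    candidate_prompts: Sequence[Optional[str]],
) -> Clip:
    """Generate one candidate extension per prompt and append the chosen one."""
    # Each candidate gets its own seed so unconditioned samples can differ.
    candidates = [
        list(sample_extension(video, prompt, seed))
        for seed, prompt in enumerate(candidate_prompts)
    ]

    # `choose` stands in for the human picking the preferred candidate.
    picked = choose(candidates)
    if not 0 <= picked < len(candidates):
        raise IndexError("chosen candidate index is out of range")

    return list(video) + candidates[picked]


def toy_sampler(video: Sequence[int], prompt: Optional[str], seed: int) -> Clip:
    """Toy stand-in for the model: output depends on the seed or the prompt."""
    return [seed, seed] if prompt is None else [99, 99]


if __name__ == "__main__":
    # Three samples without text conditioning, plus one with a text prompt.
    prompts = [None, None, None, "smoke rises in the background"]
    result = extend_with_choice([1, 2], toy_sampler, lambda c: len(c) - 1, prompts)
    print(result)
```

Calling the function repeatedly, feeding each result back in as the next `video`, gives the user a decision point at every short extension.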