Text-to-Video
VideoPoet can generate high-motion, variable-length videos from a text prompt.
Video-to-audio
VideoPoet can also generate audio to match an input video, without using any text as guidance.
Using generative models to tell visual stories
VideoPoet's capabilities are showcased in a short movie composed of many short clips generated by the model.
Introduction
VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator.
Visual narratives
Prompts can be changed over time to tell visual stories.
Long(er) video generation
VideoPoet outputs 2-second videos by default but can generate longer videos by predicting 1 second of video output given an input of a 1-second video clip.
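The extension scheme above can be sketched as a simple loop: repeatedly condition on the trailing one-second window and append the newly predicted second. This is an illustrative sketch only; `generate_next_second`, the frame rate, and the list-of-frames representation are assumptions, not the actual VideoPoet API.

```python
def extend_video(generate_next_second, seed_clip, target_seconds, fps=8):
    """Chunked autoregressive extension: grow a clip one second at a
    time, conditioning each step on the last second generated so far.

    generate_next_second: hypothetical callable mapping a 1-second list
    of frames to the next 1-second list of frames.
    seed_clip: the initial clip (e.g. a default 2-second generation).
    """
    clip = list(seed_clip)
    while len(clip) < target_seconds * fps:
        last_second = clip[-fps:]          # trailing 1-second context window
        clip.extend(generate_next_second(last_second))
    return clip
```

Because each step sees only the most recent second, the loop can in principle run for an arbitrary number of iterations, trading off long-range consistency for unbounded length.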
Controllable video editing
The VideoPoet model can edit a subject to follow different motions, such as dance styles.
Interactive video editing
Interactive editing is also possible: an input video can be extended by a short duration, selecting among a list of candidate continuations.
Image to video generation
VideoPoet can take any input image and generate a video matching a given text prompt.
Zero-shot stylization
VideoPoet can stylize input videos guided by a text prompt, producing results that adhere closely to the requested style.
Applying Visual Styles and Effects
Styles and effects can easily be composed in text-to-video generation by starting with a base prompt and appending a style to it.
Zero-shot controllable camera motions
High-quality camera motion customization is possible by specifying the type of camera shot in the text prompt.