Veo

Introducing Veo 3, our video generation model with expanded creative controls – including native audio and extended videos.

Re-designed for greater realism

Greater realism and fidelity, made possible by Veo 3’s real world physics and audio.

Follows prompts like never before

Improved prompt adherence, meaning more accurate responses to your instructions.

Improved creative control

Offers new levels of control, consistency, and creativity – now across audio.

Prompt: A medium shot opens on a seasoned, grey-bearded man in sunglasses and a paisley shirt, his gaze fixed off-camera with a contemplative expression. His gold chain glints subtly. Beside him, a younger man in a tank top, also looking forward, suggests a shared moment of observation or reflection. The camera slowly pushes in, subtly emphasizing their quiet focus. In the background, a vibrant mural splashes across a wall, hinting at an urban setting. Faint city murmurs and distant chatter drift in, accompanied by a mellow, soulful hip-hop beat that adds a contemplative yet grounded atmosphere. "The city always got a story," the older man murmurs, a slight nod of his head. "Just gotta listen."

Veo 3 lets you add sound effects, ambient noise, and even dialogue to your creations, generating all audio natively. It also delivers best-in-class quality, excelling in physics, realism, and prompt adherence.

Greater control, consistency, and creativity than ever before.

Flow

Built with creatives, for creatives. Flow enables you to create seamless cinematic clips, scenes, and stories using our most capable generative AI models.

Text-to-video

T2V Overall preference

Participants viewed 1,003 prompts and respective videos on MovieGenBench, a benchmark dataset released by Meta. Veo 3.1 performs best on overall preference.

Text-to-video

T2V Text alignment

Participants viewed 1,003 prompts and respective videos on MovieGenBench, a benchmark dataset released by Meta. Veo 3.1 performs best on its capability to follow prompts accurately.

Text-to-video

T2V Visual quality

Participants viewed 1,003 prompts and respective videos on MovieGenBench, a benchmark dataset released by Meta. Participants rate the visual quality of Veo’s outputs more highly than other models.

Note: We were unable to compare image to video with Sora 2 Pro because it currently does not support realistic human images.

Image-to-video

I2V Overall preference

When participants viewed 355 image and text pairs from the VBench I2V benchmark, Veo 3’s outputs were preferred overall compared to other models.

Image-to-video

I2V Text alignment

When participants viewed 355 image and text pairs from the VBench I2V benchmark, Veo 3.1’s outputs were preferred to other models for capturing the intent of the prompt.

Image-to-video

I2V Visual quality

When participants viewed 355 image and text pairs from the VBench I2V benchmark, Veo 3.1’s outputs were preferred to other models for visual quality.

Text-to-video and audio

T2VA Audio-visual overall preference

Participants viewed 527 prompts from MovieGenBench and preferred Veo’s outputs with audio overall to those of other models.

Text-to-video and audio

T2VA Audio-video alignment

Participants viewed 527 prompts from MovieGenBench, and chose Veo 3.1’s outputs over other models for having audio that is better synchronized with the video content.

Text-to-video

T2V Visually realistic physics

Participants chose Veo 3.1’s outputs over other models for having visually realistic physics on the physics subset of MovieGenBench prompts.

[1] Human raters conducted direct side-by-side comparisons across 364 diverse examples (each including a prompt and 1-3 reference images and evaluating a single generated video per prompt + reference images). All comparisons were done at 1280x720 resolution. Veo videos are 8 seconds long. All other videos are 10 seconds long and shown at full length to raters.
To ensure a fair visual comparison, all tests were conducted without sound. Audio was only enabled for the Overall Preference metric, and only when competing models had native sound support for the capability. We have indicated when audio was an active part of the comparison on the labels in the chart.

Ingredients to video

Overall preference and visual quality

Veo’s “Ingredients to Video” capability has achieved state-of-the-art results for: Overall Preference and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]

[1] Human raters conducted direct side-by-side comparisons across 80 diverse examples (each including an initial text prompt and an extension prompt), evaluating one generated video per example. All comparisons were done at 720x1280 resolution. Veo videos are 8 seconds long. All other videos are 6 seconds long and shown at full length to raters.

To ensure a fair visual comparison, all tests were conducted without sound. Audio was only enabled for the Overall Preference metric, and only when competing models had native sound support for the capability. We have indicated when audio was an active part of the comparison on the labels in the chart.

Ingredients to video

Scene extension

Veo’s “Scene Extension” capability has achieved state-of-the-art results for: Overall Preference, Prompt Alignment and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]

[1] Human raters conducted direct side-by-side comparisons across 106 diverse examples (each including a prompt and start and end images), evaluating one generated video per example. All comparisons were done at 720x1280 resolution. Veo videos are 8 seconds long. All other videos are 10 seconds long and shown at full length to raters.

Ingredients to video

First and last frame

Veo’s “First and Last Frame” capability has achieved state-of-the-art results for: Overall Preference, Prompt Alignment and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]

[1] Human raters conducted direct side-by-side comparisons across 124 diverse examples (each including a video and a prompt specifying which object to insert), evaluating one generated video per example.

All comparisons were done at 1280x720 (or 720x1280) resolution. Veo videos are 6 seconds long. All competing model videos are 5 seconds long and shown at full length to raters. All videos had no sound.

Ingredients to video

Object insertion

Veo’s “Object Insertion” capability has achieved state-of-the-art results for Overall Preference and Visual Quality, in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks [1].

Promise

Promise Studios uses Veo 3.1 within its MUSE Platform to enhance generative storyboarding and previsualization for director-driven storytelling at production quality.

Volley

Volley powers its new AI-powered RPG, Wit's End, with Veo 3.1 to deliver static cinematics and dynamically generated assets narrating player progress.

OpusClip

OpusClip leverages Veo 3.1 within its Agent Opus to boost motion graphics and create realistic promotional videos for SMBs.

Gemini

Supercharge your creativity and productivity

Flow

An AI filmmaking tool built with and for creatives

Google AI Studio

The fastest path from prompt to production

Gemini API

Get started building with cutting-edge AI models

Vertex AI Studio

Test, tune, and deploy enterprise-ready generative AI