Video models are zero-shot learners and reasoners

Google DeepMind

TL;DR

Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models—just like LLMs became foundation models for language.

Capability overview: Perception, Modeling, Manipulation, Reasoning.

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding?
We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
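Each demo on this page is produced by giving the video model an input image and a text instruction, then reading the result off the generated video. As a rough illustration only (not the authors' evaluation code), the following sketch shows how one might issue such a zero-shot prompt through the publicly available google-genai Python SDK; the model ID, prompt wording, and file names are placeholder assumptions.

import time

from google import genai
from google.genai import types

# The client reads the API key from the environment (e.g. GOOGLE_API_KEY).
client = genai.Client()

# Zero-shot perception prompt: start from an input frame and ask the model
# to act as a segmenter. Prompt wording and file names are placeholders.
with open("input_frame.png", "rb") as f:
    frame = types.Image(image_bytes=f.read(), mime_type="image/png")

operation = client.models.generate_videos(
    model="veo-3.0-generate-001",  # assumed model ID; use whichever Veo model your account exposes
    prompt="Highlight the salient object by painting it solid green; keep the rest of the scene unchanged.",
    image=frame,
    config=types.GenerateVideosConfig(aspect_ratio="16:9"),
)

# Video generation runs as a long-running operation; poll until it finishes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Save the clip; the task's answer is read off the generated frames (e.g. the last frame).
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("segmentation_attempt.mp4")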

Podcast

On a run and want to get the gist of our paper? Listen to the podcast below!

Perception

Demos: edge detection, segmentation, keypoint localization, super-resolution, blind deblurring, blind denoising, low-light enhancement, conjunctive search (binding problem), Dalmatian illusion understanding, shape cue-conflict understanding, and Rorschach blot interpretation.

Modeling

Demos: material properties (flammability), rigid body transform, soft body transform, gravity (earth), gravity (moon), buoyancy (bottle cap), buoyancy (rock), visual Jenga, object packing, material optics (glass), material optics (mirror), color mixing (additive), color mixing (subtractive), categorizing objects, Omniglot (recognition), Omniglot (generation), Omniglot (parsing), and memory of world states.

Manipulation

Demos: background removal, style transfer, colorization, inpainting, outpainting, text manipulation, image editing with doodles, scene composition, novel view synthesis, 3D-aware reposing, transfiguration, professional headshot, dexterous manipulation (jar), dexterous manipulation (throw/catch), dexterous manipulation (baoding balls), affordance recognition, drawing, and visual instruction (burrito).

Reasoning

Demos: graph traversal, tree BFS, sequence (dots), sequence (arrows), sequence (circles), sequence (squares), connecting colors, shape fitting, sorting numbers, tool use, simple Sudoku completion, water puzzle solving, maze solving (mouse), robot navigation, rule extrapolation, analogy (color), analogy (resize), analogy (reflect), analogy (rotate), maze (5x5), maze (7x7), maze (9x9), maze (irregular), symmetry (shape), and symmetry (random).

Failure cases

BibTeX

@article{wiedemer2025video,
  title={Video models are zero-shot learners and reasoners},
  author={Wiedemer, Thadd{\"a}us and Li, Yuxuan and Vicol, Paul and Gu, Shixiang Shane and Matarese, Nick and Swersky, Kevin and Kim, Been and Jaini, Priyank and Geirhos, Robert},
  journal={arXiv preprint arXiv:2509.20328},
  year={2025}
}