An hour with Gemini Omni

The headline is not the image quality. Generating from zero, Omni feels on par with Veo, maybe a touch better. The real thing here is the control. Precise edits to footage I already had. A scene I shot, asking the model to keep the camera move and the geometry while it swapped out the world. That feeling is what I want to walk through.

I tried it on Flow. The Higgsfields of the world have so many sliders my eyes bleed. Flow is just a brief, a clip, a button. That's what I want from an interface in 2026.

A slow pan across a dashboard.

Sitting in my GMC in a parking lot. Three seconds. A simple horizontal pan across the cabin. The kind of clip nobody would post.

Original. Three-second handheld pan, shot on phone, no edit.

Transformation 01

Background swapped for a zombie scene.

Same shot, same lens, same pan. Outside the windscreen is now a different world. The inside stays mine.

Transformation 02

Same again, this time a hellscape.

One source clip, two worlds. The constant across both is the geometry I shot.

Me in the driver's seat, turning the camera.

Same parking lot, different brief. Holding the phone in front of me, then turning around to take in the cabin. A person in frame and a camera move at the same time.. the kind of shot that usually wrecks generative video.

Original. Me in the driver's seat, holding the phone out and turning the camera around the cabin.

Transformation 01

Same camera move, dropped into a jungle. American voiceover.

The pan rate holds. The person stays in the seat. The world outside the windows is new.

Transformation 02

Same selfie pan, dropped into the desert at Giza.

Pyramids and camels through the windscreen. Watch the nav screen.. the model rebuilt the route to look like the desert too. That kind of incidental coherence is the bit that surprised me most.

A finger reaches for a mirror.

Six seconds. Me in a hoodie, holding my phone, reaching out to tap the glass. The brief I gave the model was simple.. at the moment the finger lands, transform me.

Original. Phone-shot mirror selfie, finger approaching the glass.

Transformation 01

On contact, the reflection becomes a lizard alien.

The phone stays. The pose stays. The room in the mirror becomes something else. The reflection is the only thing that changes.

Transformation 02

On contact, the reflection pixelates.

Same trigger frame as the lizard. Different output. The model fumbled the trigger moment slightly.. a small complaint, but I noticed it.

What it actually feels like.

The phrase I keep landing on is nano banana for video. Google have said something close to this themselves. It means.. you bring a real thing and you ask the model for a precise edit on top of it. Not "make me a clip of a car in a jungle." But "take this car, in this pan, and put it in a jungle." The model holds the source. The change rides on top.

That's where it lands for anyone working in motion graphics or 3D. The boring middle of the pipeline.. set extension, sky replacement, the in-between fixes.. starts to collapse into one prompt and a clip. You stay in the director's chair. The execution moves.

Generating from zero is fine. The control is the story.

Where this is going.

Imagine the round-trip from prompt to output is near zero. You're watching a clip in real time. You say "put a hellscape in" and it appears. You say "remove the guy" and he's gone. That isn't editing anymore. That's directing. Live. As the footage is in front of you.

Three years feels about right.

Caveats, said quickly.

Forty-five minutes is not a review. I didn't push the audio. I didn't test long-form consistency. The image fidelity from scratch I barely touched. If you want a proper deep dive into capability and edge cases, Atomic Gains has the best survey I've seen.. plenty of detail, use cases, tests. Worth your time.

I'll come back to this when I've put a real brief through it. For now.. impressed. The bit that landed wasn't the pixels. It was the feeling of holding a clip and editing the world inside it without losing the shot.