Image to Video | Grok Automation

When to use this mode

Image to Video is the right pick when:

You have a still image and want it animated with a motion prompt (“camera dolly forward, mist rolling in”).
You want a controlled transition between two images (start in a wide shot, end on a close-up of the same subject).

If you want video generated from scratch with no source frame, use Text to Video. If you want a multi-character or multi-component composite, use Reference to Video.

Frame mode: the only choice you really need to make

When you click the Image to Video mode tile, a Frame mode picker appears just below the prompts textarea. Two options:

Option	What it means	Use when
Start frame	One image per prompt. That image is the first frame of the video; the prompt describes what happens.	The motion is “and then…” from a single still.
Start + End frame	Two images per prompt. The first is the start, the second is the end. The prompt fills in the middle.	You want a controlled transition between two known states.

Screenshot (pending): Frame mode picker with Start frame and Start + End frame options.

Start frame uses one image per prompt. Start + End uses two. The picker decides how the queue chunks your library.

How the library gets chunked

This is the thing that surprises people on first use, so it’s worth saying clearly.

In Start frame mode, each prompt consumes one image from the library, in order. A library of 6 images with 6 prompts means prompt 1 ↔ image 1, prompt 2 ↔ image 2, and so on. A library of 6 images with 3 prompts? Only the first 3 images are used.

In Start + End frame mode, each prompt consumes two images. A library of 6 images with 3 prompts means prompt 1 gets images 1+2, prompt 2 gets 3+4, prompt 3 gets 5+6. A library of 6 images with 2 prompts uses images 1–4 only.

Drag-reorder the library tiles to control which images go with which prompt. The order in the dropzone is the assignment order.
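The chunking rule can be sketched in a few lines of Python. This is an illustration of the assignment order described above, not the extension's actual code: the function name chunk_library and the FRAMES_PER_PROMPT table are hypothetical, and the sketch assumes that a prompt whose images run out simply gets a shorter (or empty) group.

```python
from typing import Dict, List, Tuple

# Hypothetical names for the two Frame mode options and how many
# library images each one consumes per prompt.
FRAMES_PER_PROMPT: Dict[str, int] = {"start": 1, "start_end": 2}


def chunk_library(images: List[str], prompt_count: int, frame_mode: str) -> List[Tuple[str, ...]]:
    """Assign library images to prompts, in library order.

    Returns one tuple of images per prompt. Images left over after the
    last prompt are ignored; a prompt past the end of the library gets
    a short or empty tuple (an assumption of this sketch).
    """
    step = FRAMES_PER_PROMPT[frame_mode]
    groups = []
    for i in range(prompt_count):
        start = i * step
        groups.append(tuple(images[start:start + step]))
    return groups


library = ["img1", "img2", "img3", "img4", "img5", "img6"]
print(chunk_library(library, 3, "start"))      # prompts 1-3 get images 1, 2, 3
print(chunk_library(library, 3, "start_end"))  # prompts 1-3 get images 1+2, 3+4, 5+6
```

Because the grouping depends only on position, reordering the library list changes which prompt each image goes to, which is why drag-reordering the tiles matters.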

Set up a run

Click the Image to Video tile.
In the Reference image(s) dropzone, upload your stills.
Pick Frame mode — Start frame or Start + End frame.
In Prompts, write one prompt per shot (blank-line separated). For Start + End, the prompt should describe the journey between the two frames.
In Refine, set Length (6s / 10s), Quality (480p / 720p), and Aspect. The 480p + upscale combo from Text to Video works the same way here.
Click Run →.
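If you draft prompts in a text file before pasting them in, a quick way to check how many shots the text contains is to split it on blank lines yourself. The sketch below is a rough stand-in for that check: split_prompts is a hypothetical helper, and it assumes that lines inside one block belong to the same prompt. How the tool itself parses the Prompts box is not documented here.

```python
import re
from typing import List


def split_prompts(text: str) -> List[str]:
    """Split a prompts draft into one prompt per blank-line-separated block.

    Assumption of this sketch: consecutive non-blank lines are part of
    the same prompt and are joined with a single space.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    prompts = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            prompts.append(" ".join(lines))
    return prompts


draft = """Slow dolly forward from the empty plaza.

Camera tilts up to the sky,
clouds drifting left to right.
"""
for number, prompt in enumerate(split_prompts(draft), start=1):
    print(number, prompt)
```

Counting the prompts this way makes it easy to confirm the library holds enough images before clicking Run.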

A worked example: Start + End

Library, in order:

01-wideshot.jpg — A wide shot of an empty plaza at dawn.
02-closeup.jpg — A close-up of a coffee cup on a café table in the same plaza.

Prompts (single prompt because we have one transition):

Slow dolly forward from the empty plaza, light gradually warming, ending on the steam rising from the coffee cup. Continuous take, no cuts.

Run. One 10-second clip lands in your folder that starts on the wide shot and ends on the close-up, with the middle filled in by Grok.

Per-row status when running

The prompt list mid-run shows:

The prompt text.
A row of small thumbnails for the image(s) being used (1 in Start frame mode, 2 in Start + End).
Status: queued → generating · N% → done / failed.

If a row shows failed with a “no image attached” error, the library has fewer images than the prompts need: at least one image per prompt in Start frame mode, and at least 2 × the prompt count in Start + End mode.
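The image-count rule lends itself to a pre-flight check. The following sketch predicts which rows would come up short for a given library size; rows_short_of_images is a hypothetical helper, and treating a half-filled Start + End pair as a failure is an assumption rather than documented behaviour.

```python
from typing import List


def rows_short_of_images(image_count: int, prompt_count: int, frames_per_prompt: int) -> List[int]:
    """Return the 1-based prompt numbers that would not get all their images.

    frames_per_prompt is 1 for Start frame mode and 2 for Start + End.
    Assumption: a prompt with only part of its images counts as short.
    """
    return [
        row
        for row in range(1, prompt_count + 1)
        if row * frames_per_prompt > image_count
    ]


# 5 images, 3 prompts, Start + End: the third prompt needs images 5 and 6.
print(rows_short_of_images(5, 3, 2))
# 6 images, 3 prompts, Start + End: every prompt is covered.
print(rows_short_of_images(6, 3, 2))
```

An empty result means the library is large enough; otherwise add images or remove prompts before running.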

Chain prompts on Image to Video

The Chain prompts checkbox is available here too. With chain on, the last frame of each output video becomes the start frame for the next prompt, regardless of what’s in the library. This is the cleanest way to build a 4-shot sequence from a single starting still. See Chain prompts.
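To make the chaining order concrete, here is a small sketch that lists where each shot's start frame would come from. It is only a planning illustration: plan_chain is a hypothetical function, it assumes the first prompt starts from the first library image, and it does not model end frames or call Grok.

```python
from typing import List, Tuple


def plan_chain(first_image: str, prompts: List[str]) -> List[Tuple[str, str]]:
    """Pair each prompt with a description of its start frame under chaining.

    Assumption: shot 1 starts from the library still; every later shot
    starts from the last frame of the clip generated just before it.
    """
    plan = []
    for index, prompt in enumerate(prompts):
        if index == 0:
            source = first_image
        else:
            source = f"last frame of clip {index}"
        plan.append((prompt, source))
    return plan


shots = ["Dolly forward", "Pan left", "Tilt up", "Pull back"]
for prompt, source in plan_chain("01-wideshot.jpg", shots):
    print(f"{prompt}: starts from {source}")
```

With four prompts and one still, only the first shot reads from the library; the other three depend on the previous clip finishing first.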