Grok Automation
06 · Mode · 4 min read

Image to Video

Frame mode is the only new concept here. The rest of the workflow is identical to Image to Image.

When to use this mode

Image to Video is the right pick when:

  • You have a still image and want it animated with a motion prompt (“camera dolly forward, mist rolling in”).
  • You want a controlled transition between two images (start in a wide shot, end on a close-up of the same subject).

If you want video generated from scratch with no source frame, use Text to Video. If you want a multi-character / multi-component composite, use Reference to Video.

Frame mode: the only choice you really need to make

When you click the Image to Video mode tile, a Frame mode picker appears just below the prompts textarea. Two options:

Option | What it means | Use when
Start frame | One image per prompt. That image is the first frame of the video; the prompt describes what happens. | The motion is “and then…” from a single still.
Start + End frame | Two images per prompt. The first is the start, the second is the end. The prompt fills in the middle. | You want a controlled transition between two known states.
[Screenshot pending: Frame mode picker with Start frame and Start + End frame options]
Start frame uses one image per prompt. Start + End uses two. The picker decides how the queue chunks your library.

How the library gets chunked

This is the thing that surprises people on first use, so it’s worth saying clearly.

In Start frame mode, each prompt consumes one image from the library, in order. A library of 6 images with 6 prompts means prompt 1 ↔ image 1, prompt 2 ↔ image 2, and so on. A library of 6 images with 3 prompts? Only the first 3 images are used.

In Start + End frame mode, each prompt consumes two images. A library of 6 images with 3 prompts means prompt 1 gets images 1+2, prompt 2 gets 3+4, prompt 3 gets 5+6. A library of 6 images with 2 prompts uses images 1–4 only.

Drag-reorder the library tiles to control which images go with which prompt. The order in the dropzone is the assignment order.
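
The chunking rule above can be sketched in a few lines of Python. This is an illustration of the behavior described here, not the extension's actual code; the function name `assign_images` and the mode keys `"start"` and `"start_end"` are made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical mode keys: how many library images each prompt consumes.
FRAMES_PER_PROMPT: Dict[str, int] = {"start": 1, "start_end": 2}


def assign_images(
    library: List[str], prompts: List[str], mode: str
) -> List[Tuple[str, List[str]]]:
    """Pair each prompt with the next 1 or 2 library images, in library order."""
    step = FRAMES_PER_PROMPT[mode]
    pairs = []
    for i, prompt in enumerate(prompts):
        # Prompt i takes the slice [i*step, (i+1)*step) of the library.
        chunk = library[i * step:(i + 1) * step]
        pairs.append((prompt, chunk))
    return pairs


library = [f"img{n}.jpg" for n in range(1, 7)]  # 6 images
for prompt, images in assign_images(library, ["p1", "p2", "p3"], "start_end"):
    print(prompt, images)
```

Reordering the `library` list before calling the function plays the same role as drag-reordering tiles in the dropzone: only the order changes, and the assignment follows it.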

Set up a run

  1. Click the Image to Video tile.
  2. In the Reference image(s) dropzone, upload your stills.
  3. Pick Frame mode: Start frame or Start + End frame.
  4. In Prompts, write one prompt per shot (blank-line separated). For Start + End, the prompt should describe the journey between the two frames.
  5. In Refine, set Length (6s / 10s), Quality (480p / 720p), and Aspect. The 480p + upscale combo from Text to Video works the same way here.
  6. Click Run →.
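
To make the "one prompt per shot, blank-line separated" rule from step 4 concrete, here is a minimal Python sketch of how such a textarea could be split into prompts. It is an assumption about the format, not the extension's parser: the name `split_prompts` is hypothetical, and collapsing line breaks inside a prompt is a choice made for the example.

```python
import re
from typing import List


def split_prompts(text: str) -> List[str]:
    """Split textarea content into prompts on one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Assumption: line breaks inside a prompt are collapsed to single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


textarea = """Slow dolly forward from the empty plaza,
light gradually warming.

Camera orbits the coffee cup, steam rising.
"""
for number, prompt in enumerate(split_prompts(textarea), start=1):
    print(number, prompt)
```

Under this reading, a prompt can span several lines as long as no blank line interrupts it.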

A worked example: Start + End

Library, in order:

  1. 01-wideshot.jpg — A wide shot of an empty plaza at dawn.
  2. 02-closeup.jpg — A close-up of a coffee cup on a café table in the same plaza.

Prompts (single prompt because we have one transition):

Slow dolly forward from the empty plaza, light gradually warming, ending on the steam rising from the coffee cup. Continuous take, no cuts.

Click Run. With Length set to 10s, one 10-second clip lands in your folder. It starts on the wide shot and ends on the close-up, with Grok filling in the middle.

Per-row status when running

The prompt list mid-run shows:

  • The prompt text.
  • A row of small thumbnails for the image(s) being used (1 in Start frame mode, 2 in Start + End).
  • Status: queued → generating · N% → done / failed.

If a row says failed with a no image attached error, your library has fewer images than prompts need — for Start + End that means fewer than 2 × prompt count.
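
A quick way to check the image count before a run is to do the arithmetic up front. The Python sketch below is only that arithmetic; the helper names `images_required` and `missing_images` are hypothetical and not part of the extension.

```python
def images_required(prompt_count: int, start_and_end: bool) -> int:
    """Images the queue needs: 1 per prompt, or 2 per prompt in Start + End."""
    return prompt_count * (2 if start_and_end else 1)


def missing_images(image_count: int, prompt_count: int, start_and_end: bool) -> int:
    """How many more images the library needs (0 means it is large enough)."""
    return max(0, images_required(prompt_count, start_and_end) - image_count)


# 5 images, 3 prompts, Start + End frame: 6 are needed, so 1 is missing.
print(missing_images(image_count=5, prompt_count=3, start_and_end=True))
```

A result above zero means at least one row is likely to hit the no image attached error.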

Chain prompts on Image to Video

The Chain prompts checkbox is available here too. With chain on, the output video’s last frame becomes the start frame for the next prompt, regardless of what’s in the library. This is the cleanest way to build a 4-shot sequence from a single starting still. See Chain prompts.
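
As a rough illustration of chaining, the Python sketch below lists where each shot's start frame comes from. It assumes the first prompt starts from the library still and every later prompt starts from the previous clip's last frame; `chain_plan` and the source labels are invented for the example.

```python
from typing import List, Tuple


def chain_plan(first_still: str, prompts: List[str]) -> List[Tuple[str, str]]:
    """Return (start_frame_source, prompt) for each shot with chaining on."""
    plan = []
    source = first_still  # assumption: shot 1 starts from the library still
    for n, prompt in enumerate(prompts, start=1):
        plan.append((source, prompt))
        # The next shot starts where this clip ends.
        source = f"last frame of clip {n}"
    return plan


shots = ["wide establishing", "push in", "orbit subject", "pull back"]
for source, prompt in chain_plan("01-wideshot.jpg", shots):
    print(f"{prompt}: starts from {source}")
```

Only the first entry refers to a library image, which is why a single still is enough for a 4-shot sequence.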


Grok Automation is an independent browser extension for Grok users. Not affiliated with xAI.