The Path to Large Scale Dense Video Captioning

TL;DR:

Tested dozens of approaches for dense video captioning of hand-object manipulation.
How you represent the video matters more than fancy prompting.
Hand-cropped past/present/future collages gave the strongest Flash baseline; raw video performed worst.
Temporal tokens, visual overlays, smarter sampling, sequential context, and CoT on smaller models mostly added noise.
Best config (Pro + hand-collage + one targeted Stack rule + lightweight reasoning) reached 63.7% rubric acceptable rate.
Remaining failures are mostly visual grounding errors: the model understands the action but names the wrong object or destination.

At Scale, we are building the largest corpus of real-world robot manipulation data in the world, with more than 1,000 hours of new demonstrations landing on our platform every day from data factories, homes, and industrial sites. Every episode has to be annotated with precise, action-level captions before it becomes useful training signal. We do this with a mix of human labelers and automated vision-language model pipelines.

In this phase, each episode is already segmented into short subgoals. The task is to generate, for each segment, a precise caption that names the action, the objects, and the destination. Examples include “Stack the green gelatin packet on top of the violet packet” or “Scoop the ice cream from the tub into the cone.” This captioning stage is the foundation for a larger dense video captioning pipeline that will also include temporal clipping and timestamp localization in future phases. These captions support downstream robot training workflows, from fine-tuning foundation models to building reward functions, which means a wrong verb or misidentified object can propagate into everything downstream.

Getting VLMs to caption robot/human manipulation reliably is harder than it might seem. Academic benchmarks often give models credit when the generated caption is roughly semantically similar to the reference. Production annotation needs to be correct on action verb, object identity, destination, and temporal ordering, simultaneously, on noisy real-world video, at scale. So we built an internal benchmark and ran ablation experiments over several weeks. We tested several commonly used temporal-understanding strategies, including temporal-position encoding, fps and motion-guided sampling, marker-based visual prompts, structured reasoning prompts, and context-conditioned caption generation across neighboring events.

Some of it worked, but a lot of it didn’t. The failures, however, were instructive. What looks elegant in a paper often becomes brittle in production, and the gap between academic robotics benchmarks and production annotation data shows up as a fundamentally different distribution of failures, rather than a worse score on the same one.

What follows documents the surprising failures, the incremental wins, and the simple approach that beats every fancy technique we tried. We also work through why several methods from prior work and several recommendations from Apollo’s systematic Video-LMM study (Zohar et al., 2024), failed to generalize to our setting, and what that implies for the field.

The Task: Dense Manipulation Captioning

What We’re Annotating

Real-world manipulation is messy, diverse, and long-tailed. Our captioning pipeline turns these demonstrations into structured action-level supervision for robot learning.

Our dataset spans hundreds of unique environments and includes from both robot and human demonstrations, with episodes ranging from 2 to 20 minutes. Each episode is pre-segmented into subgoals using human-annotated timestamps. The task: given a video segment [t_start, t_end] corresponding to one subgoal, generate a precise natural-language caption describing the action performed.

Each demonstration is segmented into subgoals, then labeled with precise action captions for robot training.

Captions follow a strict schema:

Action verb from a fixed vocabulary of 30+ verbs including pick up, put, stack, insert, open, close, retrieve, scoop, place, and others.
Object (the primary manipulandum): full name including color/material identifiers
Destination or source (for transfer verbs): where the object ends up

A correct caption looks like: “Stack the green gelatin packet on top of the violet packet.” A caption fails on a wrong verb (“Put the green gelatin packet on top of the violet packet.”), a wrong object (“Stack the orange packet on top of the violet packet.”), or a wrong destination (“Stack the green packet on the table”).

This annotation is much harder than general video QA. The model must distinguish between Put (on a surface) and Stack (on top of a specific object), between visually similar objects differing only in color, and between actions that look nearly identical at the frame level. Timestamps are also imperfect: they can be half a second to a second off, which sometimes puts two actions in the same window and leaves the model to figure out which one to caption.

Our Evaluation Methodology

We ran experiments on a held-out benchmark of ~2.3k subgoals from diverse manipulation tasks. Each prediction was graded by an LLM judge using a default temperature of 0. A caption scores 1.0 only if it gets all three fields right, and 0.0 otherwise:

Acceptable (score 1.0): correct verb, correct object, correct destination/source
Not acceptable (score 0.0): any of the above wrong
Egregious (score 0.0): hallucinated objects, completely wrong scene description

We report rubric acceptable rate as our primary metric. We also collect a text similarity score (weighted combination of object, color, action, intent match) for diagnostics and failure analysis, but we optimize for rubric acceptable rate because it directly maps to data quality.

Phase 1: What Do We Feed the Model?

Our earliest experiments explored the most basic question: how do we show the model a manipulation subgoal? We tested three representations:

Raw video clip -> trimmed to the subgoal window, passed as a video file to Gemini’s video API
Keyframes only -> a grid of N frames sampled from the subgoal window, no video

Temporal collage -> three separate collages: a grid from the previous subgoal (context), the current subgoal, and the next subgoal (future context). Plus hand-cropped close-ups of the current subgoal.

Data representation: rubric scores by input type — Comparing rubric acceptable rate across input representation strategies. Video-only performs worst despite containing the most temporal information. The hand-collage approach: past/present/future grids with hand-crop focus, delivers the largest single gain.

The results were stark, but highly model-dependent. For the models we tested, simply passing in video was worse than converting the video into structured image inputs. Video-only performed worst (28.7% rubric score), despite containing all the temporal information. Raw frame grids improved substantially (49.2%) but still missed critical context.

Our best-performing format combined past/future collages with explicit present-frame sequences and tight hand-crop keyframes, reaching 61.0%, a 32-point improvement over video-only. One likely reason is that video inputs are internally converted into image tokens using model-specific sampling strategies, giving less control over which details are preserved.

past context, current window, future context — Hand-collage representation used for captioning. Past, current, and future frames are shown as separate grids so the model can compare the scene before, during, and after the target subgoal.

Why Video Hurt

This finding runs counter to the intuition that “more information is better.” The raw video experiment (28.7% rubric acceptable rate) was a worst-case, video-only with no collage context. But we later ran a more controlled test: taking our best hand-collage baseline and adding video on top. The result was still negative: −2.52pp (58.5% from 61.0%).

Four distinct failure mechanisms emerged:

Temporal overreach (17 action-flip failures). Sub-goal clips include the approach phase before the action and a release phase after, while static keyframes naturally land at the action peak. Full video lets the model watch every frame and anchor on the wrong temporal moment:

Here, GT means ground truth: the human-written reference action label from our production annotation pipeline, used as the evaluation target.

GT: "Put the silver knife into the utensil organizer" Collage: "Put the knife into the cutlery tray." [PASS -- saw placement frame] Video: "Pick up the knife from the table." [FAIL -- anchored on approach frame] GT: "Retrieve the brown bottle from the ziploc bag" Collage: "Remove the bottle from the plastic bag." [PASS] Video: "Put the bottle into the plastic bag." [FAIL -- described return frame]

Over-identification (101 of 161 regressions - 63%). Video gives the model richer detail. It now sees brand logos, exact colors, and secondary objects clearly, producing more confident but wrong names. The GT labels are often deliberately generic (“brown pack”), but the model, seeing the video clearly, names the specific brand:

GT: "Retrieve the brown pack from the white grocery bag" Collage: "Retrieve a packet of chocolate gelatin from the plastic bag." [PASS -- generic enough] Video: "Put the chocolate D'Gari packet into the blue plastic bag." [FAIL -- brand detail + direction flip]

Where Video Genuinely Helps

For directional actions where motion is the only discriminator, video solves cases that static frames cannot:

Verb	Collage	Video	Delta
pour	45%	73%	+27pp
turn on/off	48%	61%	+12pp
adjust	17%	23%	+6pp

Video’s signal is real for Pour and Turn. The problem is the surrounding temporal noise outweighs it 2:1 (161 regressions vs 103 improvements). Tighter clip trimming (removing the approach and release phases before uploading) is the most promising path to making video net-positive.

The hand-cropped collage addresses this by:

Cropping to the hand region focuses the model’s attention on the manipulation area, where critical object interactions happen
Including temporal neighbors through past and future collages provide action context without the resolution penalty of video
Showing multiple frames from each window captures approach, grasp, and release phases

The collage approach was originally developed by our annotation engineers as a human-labeler aid. It turns out VLMs benefit from the same framing that helps humans.

This hand-collage baseline became our reference point for all subsequent ablations: 61.0% rubric acceptable rate on ~770 subgoals from a diverse set of manipulation tasks.

Temporal Fusion: The Interleaved Token Paradox

Recent video-LMM work suggests that explicit temporal structure can help models reason over video. Apollo (Zohar et al., 2024) finds that adding text or learned separator tokens between video tokens from different frames or clips improves token integration, while Qwen3-VL (Bai et al., 2025) uses explicit textual timestamp alignment for more precise temporal grounding in video.

But we found that this benefit did not transfer to prompt-based image interleaving in frontier APIs. Adding temporal progress tags such as [t=0.0] or [t=0.33] to Gemini Flash decreased rubric acceptable rate from 61.0% to 59.1%.

Our hypothesis: temporal markers help when they are part of the model’s video-specific architecture or training recipe, where they organize dense spatiotemporal token streams. In prompt-based image interleaving, the same markers may behave more like noise: they compete for attention or push the prompt away from the model’s pretrained multimodal formatting priors.

temporal interleaved tokens: flash vs pro — Effect of temporal interleaved tokens and related reasoning variants on Flash vs Pro models. On Flash, temporal interleaving hurt relative to the hand-collage baseline. The strongest result came from Pro 3.1 with stack-rule reasoning and mini-CoT.

The technique helps on bigger models, but hurt on smaller ones. If you aim for cheap, cost-effective solutions, we recommend against using it.

What “Temporal Normalized Progression” Actually Means

To understand why this happens, consider what the model must do with a [t=0.33] token. It must:

Parse the numeric value
Relate it to the other frames’ time positions
Infer what phase of the action is depicted
Use that phase information to disambiguate the action verb

This is a multi-step reasoning chain that requires the model to hold and integrate information across multiple frames. Cheap flash-class models handle single-frame queries well but struggle with cross-frame temporal integration when the integration is explicit (token-form) rather than implicit (spatial layout in a collage).

The collage format itself provides implicit temporal ordering, frames are arranged left-to-right in time, which models have learned to read from web data. Explicit temporal tokens add a second, potentially conflicting encoding that confuses smaller models.

normalized temporal progress in a manipulation subgoal — Diagram showing how temporal progress tokens map to action phases in a manipulation subgoal. Each frame gets a normalized position t ϵ [0,1] relative to the subgoal window. The production challenge: action boundaries are never perfectly localized, so t=0 and t=1 frames often contain irrelevant approach/release motion.

Why Temporal Progress Fails in Production Specifically

There is a second reason normalized temporal progress underperforms in our setting that is specific to production (as opposed to academic benchmarks): action boundary noise.

Temporal progress tokens assume the clip boundaries are meaningful. In an ideal clip, [t=0.0] shows the start of the interaction, and [t=1.0] shows the action’s endpoint.

Our production clips are not always that clean. They are segmented by humans watching video at 1× speed, so the boundaries are approximate. Many clips include extra frames before or after the actual manipulation:

Pre-contact frames at [t=0.0]: the hand is moving toward the object but has not touched it yet.
Post-release frames at [t=1.0]: the hand or arm is retracting after the object has already been released or settled.

This makes the boundary tokens noisy. The model is supposed to use [t=0.0] and [t=1.0] as anchors for action direction, but in production those anchors can point to the wrong moment. For example, labeling an empty-hand retraction frame as [t=1.0] tells the model that the endpoint of the action is the hand moving away, not the actual release.

screen grabs — Production boundaries are not frame-perfect. The model needs the precise action boundary, but human-labeled segments often include extra approach or release frames, making temporal progress tags noisy.

Sampling Methods

The Premise We Tested

One of the most persistent ideas in video understanding for robotics is that smarter keyframe selection should improve action recognition. The reasoning is intuitive: uniform sampling might hit redundant frames, or miss the critical grasp moment during fast motion. A smarter sampler would focus on the informative frames.

We tested six keyframe sampling strategies, all using 6 frames from the current subgoal window:

Uniform: evenly spaced, the naive baseline
DP (Dynamic Programming): longest-path selection maximizing visual diversity
DP-FPS: FPS-constrained DP variant
Middle: uniform over the middle 60% of the window, trimming the approach/release phases
Late-last: skip first 20%, pin the final frame to the exact last frame (to capture placement state)
MG-RGB: motion-guided sampling using per-frame L1 difference
MG-Feature: motion-guided sampling using learned feature distances

keyframe sampling methods vs uniform baseline — Rubric acceptable rate for each sampling method, relative to the uniform baseline (61.0%). No method consistently outperforms naive uniform sampling. DP and MG-RGB come closest but remain within noise. DP-FPS hurts significantly.

After five distinct sampling strategies, including motion-guided approaches with learned feature representations, not a single one beat naive uniform sampling. The information content for this task is not concentrated in “interesting” frames; it is distributed across the action.

Why Better Sampling Doesn’t Help

There are three reasons smarter sampling fails to improve manipulation captioning:

1. The critical signal is often at the boundary, not the peak motion moment. The object’s identity is best visible at the very moment it is picked up or placed, these are often the lowest-motion frames. Motion-guided methods preferentially select high-change frames, which in manipulation often correspond to the demonstrator’s arm swinging through space (not carrying an object), not the critical grasp or place moments.

2. The collage format already compensates for bad sampling. Our baseline includes both past and future collages, giving the model 18 frames total (6 per collage × 3 temporal windows). Within this rich context, the exact 6 frames selected from the current window matter much less. The model can triangulate from the neighboring windows.

3. Object identity doesn’t require the “right” frame. For our primary failure mode (wrong object), what matters is seeing enough of the scene to identify which specific item is being manipulated. Any frame showing the hand near the object accomplishes this. Uniform sampling reliably delivers one such frame; sophisticated methods don’t do meaningfully better.

Mark-Based Visual Prompting: When Visual Grounding Hurts

The Hypothesis

Recent LMM work introduced Set-of-Marks (SoM) prompting to improve visual grounding by overlaying numbered marks on candidate image regions and referring to targets by mark (Yang et al., 2023). Related robotics manipulation work, for example MOKA, adapts mark-based visual prompting for action or affordance grounding by marking candidate points or regions where the robot might act.

Our dominant failure mode is wrong-target (37% of errors, the model uses the correct verb but names the wrong object or destination). SoM seemed tailor-made to fix this. The following clip illustrates the exact kind of error we were trying to solve, the model says “Put the book on the table” when the correct action is “Stack the blue book on top of the green book”:

We implemented two mark-based prompting variants:

Hand-keypoint / trajectory overlay: We overlaid MediaPipe hand keypoints and connecting lines on each keyframe, optionally preserving motion traces across frames. This was intended to expose fingertip-to-object contact cues, hand identity, and wrist trajectory.
Object-label overlay: a two-pass Gemini Robotics-ER pipeline that detects objects, returns normalized (x, y) coordinates with labels, and renders those labels onto each keyframe before captioning.

mark-based visual prompting vs baseline — Rubric acceptance rate for mark-based visual prompting variants vs the hand-collage baseline. Both the MediaPipe hand-motion overlay and Gemini object-label overlay hurt performance. The degradation is consistent across error types: visual overlays did not fix wrong-target errors: they introduced new ones.

Mark-based visual prompting did not fix wrong-target errors. Instead, it introduced new errors by adding noisy, incorrect, or ambiguous visual cues that the model sometimes anchored on instead of the underlying visual evidence.

Why the Overlays Backfired

The failure came from a cascade of reliability issues:

Detector errors: In our evaluation, the Gemini Robotics-ER detection pass labeled objects correctly on roughly 70% of frames. The remaining errors propagated downstream. When the detector labeled a blue packet as “green packet,” the captioning model often inherited that mistake. The two-pass pipeline introduced an additional source of error rather than removing the original one.

Overlay interference: Both variants added visual marks to the images. For manipulation involving small objects, such as packets, cable connectors, and earphone components, these marks often obscured the exact regions needed for identification. A label or keypoint trace placed over the contact region between two similarly colored objects degraded the visual evidence rather than augmenting it.

There may be a path to making mark-based prompting work for manipulation: use more reliable 3D perception, incorporate depth, or apply labels only when detection confidence exceeds a high threshold. But in our setting, RGB-only ego video at production scale, the overhead and error rate of reliable marking were prohibitive.

The Broader Lesson

Mark-based visual prompting is a meta-strategy: it replaces uncertain implicit visual grounding with explicit visual or symbolic cues, betting that the inserted cues are more reliable than the model’s own visual processing. When those cues are reliable, this can help. When they are not, as in our case, the method adds a new failure mode without fixing the original one.

This generalizes beyond mark-based prompting: any technique that inserts an intermediate processing step can compound errors. A two-pass system helps only when the intermediate representation is more reliable than the model behavior it replaces. For production annotation at scale, simpler pipelines may be more robust.

We tested sequential context injection, an autoregressive approach where subgoal 0 is predicted without context, subgoal 1 receives the model’s own prediction for subgoal 0, and subgoal 2 receives predictions for 0 and 1. This implements within-episode coherence: each prediction conditions on prior predictions.

Result: 57.8%, a −3.2pp drop from the baseline. The breakdown by subgoal index reveals the mechanism:

Sub-goal index	Context injected	Improvements	Regressions	Net
0	None (cold start)	+45	−44	+1
1	pred_0	+57	−101	−44
2	pred_0 + pred_1	+57	−103	−46

Sub-goal 0 is essentially unchanged. All damage comes from the context-injected subgoals. The obvious explanation, “pred_0 was wrong, so it poisoned sub-goal 1,” accounts for only 33% of the idx=1 regressions. The other 67% happened when pred_0 was correct:

INJECTED (correct): "Put the headphones on the table." GT for sub-goal 1: "Put the headphones on the table" pred: "Put the headphones on the table." [PASS] pred: "Return to Home" [FAIL] → Model concluded: "Headphones just placed → task finished → RTH" INJECTED (correct): "Screw the lightbulb into the socket." GT for sub-goal 1: "Screw the new light bulb into the socket" pred: "Insert the lightbulb into the socket." [PASS] pred: "Turn on the lightbulb." [FAIL] → Model reasoned: "Bulb inserted → next step is turn on", predicted the future action

The model shifted from “describe what you see” to “what comes next in a logical task sequence.” Its language priors about task progression overrode the visual evidence. This distraction mechanism, even correct context causing regressions, is particularly insidious because it is not fixable by improving prediction accuracy. It requires a different prompt framing: rather than “the previous action was X,” the prompt must say “the previous action X is already finished and must NOT appear in your annotation.”

The distraction is amplified by the repetitive structure of robot teleoperation data. Tasks repeat 8 to 20 times per recording session (e.g., “Put the can into the bag” → “Adjust the bag” → “Put the can into the bag” → …). The model’s training prior says “why would you Put the same item twice?” so when context shows a Put, it predicts Pick Up or RTH instead of another Put.

This echoes findings from multi-turn LLM research: conditioning on the model’s own outputs in a chain increases coherence (the model stays consistent with itself) but not accuracy (the model stays consistently wrong). The distraction failure is subtler, even oracle context causes regressions because the model’s task-completion priors override its visual grounding.

The Lesson

Chain-of-thought (CoT), sequential context, and mini-CoT (a lightweight variant where the model first predicts the past, present, and future frames before writing the final action caption) all share a common failure mode on smaller models: on large, capable models, the reasoning chain catches mistakes and on smaller models, the reasoning chain creates mistakes.

For production annotation with Flash-class models, the pragmatic answer is: don’t ask for reasoning. Provide a clean, structured prompt with the relevant context (hand collages, disambiguation rules) and let the model answer directly.

Model Capacity: The Clearest Signal

The Model Ladder

After several experiments on prompt engineering, sampling strategies, and visual augmentation, we ran the simplest possible ablation: swap the model while holding everything else constant.

We compared five models at equivalent configuration (hand-collage, uniform sampling, no CoT, and no visual overlays):

Model	Rubric Score	Notes
Gemini 3.1 Flash	61.0%	Production-scale baseline
Gemini Robotics ER 1.6 Preview	62.1%	Domain-specialized
Gemini 2.5 Pro Preview	63.0%	+2pp over Flash
Gemini 2.5 Pro + stack rule	63.2%	+0.2pp over Pro
Gemini 3.1 Pro + stack + CoT	63.7%	Best result

Model Capacity vs Score -- The Clearest Signal in Dozens of Experiments

rubric acceptable rate graph — Best observed rubric acceptable rate by model family. For each family, we compare the highest-performing tested configuration: Flash + Stack rule at 61.4%, Gemini Robotics ER 1.6 at 62.1%, and Pro + stack + reason + mini-CoT at 63.7%. This reflects the strongest setup tested per family, not identical prompting configurations across models.

A +2.7pp gain from upgrading the model dwarfs every single prompt-engineering technique we tried. The best single-technique improvement (stack rule: +0.4pp on Flash) is an order of magnitude smaller.

If you only do one thing to improve VLM annotation quality, upgrade the model. Prompt engineering on a smaller model cannot close the gap with a larger model; it can only extract the last few percentage points once you are on the right model tier.

Targeted Rules: The Exception That Proves the Rule

One prompt engineering technique did provide a reliable, consistent benefit: targeted disambiguation rules. The Stack rule (appended to the system prompt) produced +0.4pp on Flash and +0.2pp on Pro:
SPECIAL CASE -- Stack: When one item is being placed ON TOP OF another specific item (not a flat table/surface), use "Stack", not "Put". Example: "Stack the green gelatin packet on top of the violet packet."
This works because it addresses a specific, well-characterized failure mode (Stack/Put confusion) with a minimal, targeted intervention. The rule only fires on Stack-relevant scenarios; it cannot degrade performance on non-Stack actions.

The Scoop rule (destination vs source naming) showed a more complicated story. Within Scoop pairs it worked well (+21 improvements, −7 regressions, net +14 Scoop pairs). But the directive “name where the item ends up” leaked into how the model resolves directional ambiguity in other verbs: adding the Scoop rule to the Stack rule (exp22) caused −42 Put regressions and −19 Pick regressions, the model started second-guessing destination/source in unrelated actions. Net over the Stack-rule-only baseline: +5 pairs — barely measurable.

Lesson: rules must be verb-scoped. A rule written in directional terms bleeds into every verb that involves motion direction. The Stack rule (“when placing ON TOP OF another item, use Stack not Put”) is narrowly visual, it only fires on spatial-layering observations. The Scoop rule as initially written was too semantic and rewired the model’s direction-of-motion prior globally.

How the Failure Mass Actually Decomposes

We analyzed the 890 failing pairs from the Flash + Stack + Scoop rules condition. The failures split almost exactly into thirds:

Failure type	Count	% of failures	Fixable by
Wrong verb (action recognition failure)	~326	37%	Prompt rules, context injection, sampling
Correct verb, wrong object/destination	332	37%	Better visual grounding, higher-res frames
Egregious / completely off	~232	26%	Stronger model, temporal context

The 332 correct-verb-wrong-object failures are the most informative: the model knows what is happening but cannot identify which specific item is being manipulated. Object identification is the primary failure signal, 82% of all failing pairs involve wrong object, destination, or source, whereas action (verb) errors account for only 63%.

Scoop is the sharpest illustration: 100% of Scoop failures are correct-verb-wrong-destination, the model always knows it is a scooping action, but consistently names the wrong vessel. Verb rules address the 37% action-recognition failures and leave the 37% object-grounding failures untouched.

failure mode breakdown, theoretical performance ceilings — Failure breakdown and performance ceiling analysis. Most remaining errors come from wrong-verb predictions or correct-verb/wrong-object grounding failures; fixing both could raise performance from 63.7% to ~85–88%, while structurally ambiguous clips set the residual ceiling.

This decomposition sets the theoretical ceiling for different intervention types:

Perfect prompt rules only: ~75% (eliminates wrong-verb third, grounding failures remain)
Prompt rules + better object grounding (bounding boxes, higher-res crops, depth): ~85%
True ceiling (best possible with 6 static RGB frames): ~85-90% (minus the ~10–15% structurally ambiguous clips where motion direction cannot be determined from static frames alone)

The 25.9% always-fail floor, pairs that fail across every experiment regardless of prompt, is confirmed to be structural. It includes: visually identical objects (honey bottle vs white lid container), actions only determinable from motion direction (Retrieve vs Put from a bag), and temporal contamination from adjacent sub-goals that bleed into the clip window.

The Best Configuration

Our best result is a 63.7% rubric score, combined:

Gemini 3.1 Pro (model capacity)
Hand-collage representation (past/present/future frames, hand-cropped)
Stack disambiguation rule (targeted fix for dominant verb error)
High reasoning effort (Pro thinks before answering)
Mini-CoT (structured PREVIOUS/NEXT/CURRENT reasoning)

Notably absent: visual overlays, interleaved tokens, fancy sampling, sequential context. The winning combination is the simplest capable representation plus the most capable model plus one targeted rule.

Chart showing all ablation experiments ranked by rubric acceptable rate. Color encodes experiment category: model upgrades in green, positive techniques in blue, neutral or mixed results in yellow, and negative results in red.

Ablation Results
All experiments use the Flash hand-collage baseline unless otherwise stated.

What Worked

Technique	Score	Delta vs baseline
Upgrade to Pro 3.1	63.0%	+2.0pp
Pro + Stack rule	63.2%	+2.2pp
Pro + Stack + reasoning + mini-CoT	63.7%	+2.7pp
Hand-collage baseline (Flash)	61.0%	—
Flash + Stack rule	61.4%	+0.4pp
Flash + Stack + Scoop rules	61.3%	+0.3pp

What Didn’t Work (and Hurt)

Technique	Score	Delta vs baseline
Video-only input	28.7%	−32.3pp
CoT mini-v2 on Flash (hard subset)	36.7%	−24.3pp on hard subset
Frames without hand collage	49.2%	−11.8pp
DP-FPS sampling	49.4%	−11.6pp
Hand-collage + video added	58.5%	−2.52pp
Gemini object-label overlay (2-pass)	56.2%	−4.8pp
MediaPipe hand-motion overlay	57.3%	−3.7pp
Sequential context injection	57.8%	−3.2pp
MG-Feature sampling	58.9%	−2.1pp
Temporal interleaved tokens (Flash)	59.1%	−1.9pp
Flash + mini-CoT + visual	59.5%	−1.5pp
CoT v1 on Flash	59.8%	−1.2pp
DP sampling	60.0%	−1.0pp
Flash + Stack + Scoop rules	61.3%	−0.1pp vs stack-rule-alone

The Meta-Pattern

Looking across dozens of experiments, a single pattern explains most of the results:

“Complexity taxes smaller models; simpler prompts extract more of their real capability. Larger models can absorb and use complexity productively.”

Every technique that added reasoning overhead, intermediate steps, or additional signals hurt Flash and helped Pro. The practical implication for production:

If you have budget for Pro-class models: use mini-CoT, add temporal grounding, invest in reasoning effort
If you need Flash-class economics: use the cleanest possible representation (hand-collage), add targeted rules for known failure modes, avoid any reasoning overhead

Limitations

This study should be interpreted as a directional engineering evaluation rather than a rigorous statistical analysis.

Results are reported from single evaluation runs on a fixed 2.3k-pair benchmark without confidence intervals. Small differences (<1–2pp) should not be overinterpreted, and the leaderboard is better viewed as tiers than a strict ranking.
We do not report latency or inference cost, so higher scores should be considered alongside production tradeoffs.
Results are scoped to Gemini-family models on one internal manipulation-video benchmark and may not generalize to other distributions or VLM families.

What’s Next

This was Phase 1: a single model, no agentic tooling, no multi-step correction. We deliberately stayed in the regime of “one model does everything” to understand the fundamental limits of this approach.

Phase 1 limits are now clear. The 38% failure mass decomposes into three approximately equal thirds:

~37% wrong-verb failures, fixable with better prompt rules, context injection, or middle-clip trimming
~37% correct-verb-wrong-object failures, require better visual grounding (bounding boxes, higher-res crops, depth sensors), prompt engineering cannot touch these
~26% egregious failures, require a stronger model (RTH, directionally ambiguous actions)

The theoretical performance ceilings: ~75% with perfect prompt rules only; ~85% with prompt rules plus visual grounding; ~85–90% true ceiling accounting for structurally ambiguous clips (action direction not determinable from static RGB frames). The 25.9% always-fail floor is structural, confirmed unchanged across every experiment in Phase 1.

Phase 2 priorities follow directly: (1) visual grounding of the manipulated object in each frame (bounding box overlay from a reliable detector), (2) sub-goal context injection with a negative-framing prompt that tells the model what the previous action was and that it is now finished, (3) video input with middle-clip trimming for directional verbs (Pour, Turn, Retrieve).

The findings here generalize beyond manipulation captioning. For any VLM annotation task in production:

Representation beats sampling: how you show frames matters more than which frames you show
Rules beat reasoning: for specific, known failure modes, a targeted rule is more reliable than asking the model to reason its way out
Model capacity is the biggest lever: invest in the right model tier before optimizing prompts
Two-pass systems multiply errors: marker-based visual prompting, object detection pipelines, and autoregressive conditioning on prior predictions (sequential context) all cascade their errors

Acknowledgments

This work was conducted within Scale’s Physical AI team. We thank the annotation team, whose domain expertise shaped the evaluation rubric and error taxonomy, and the operators whose data made this possible.

Interested in collaboration or using Scale’s Physical AI data engine? Contact us at [email protected].

TL;DR:

Tested dozens of approaches for dense video captioning of hand-object manipulation.
How you represent the video matters more than fancy prompting.
Hand-cropped past/present/future collages gave the strongest Flash baseline; raw video performed worst.
Temporal tokens, visual overlays, smarter sampling, sequential context, and CoT on smaller models mostly added noise.
Best config (Pro + hand-collage + one targeted Stack rule + lightweight reasoning) reached 63.7% rubric acceptable rate.
Remaining failures are mostly visual grounding errors: the model understands the action but names the wrong object or destination.

The Task: Dense Manipulation Captioning

What We’re Annotating

Real-world manipulation is messy, diverse, and long-tailed. Our captioning pipeline turns these demonstrations into structured action-level supervision for robot learning.

Each demonstration is segmented into subgoals, then labeled with precise action captions for robot training.

Captions follow a strict schema:

Action verb from a fixed vocabulary of 30+ verbs including pick up, put, stack, insert, open, close, retrieve, scoop, place, and others.
Object (the primary manipulandum): full name including color/material identifiers
Destination or source (for transfer verbs): where the object ends up

Our Evaluation Methodology

Acceptable (score 1.0): correct verb, correct object, correct destination/source
Not acceptable (score 0.0): any of the above wrong
Egregious (score 0.0): hallucinated objects, completely wrong scene description

Phase 1: What Do We Feed the Model?

Our earliest experiments explored the most basic question: how do we show the model a manipulation subgoal? We tested three representations:

Raw video clip -> trimmed to the subgoal window, passed as a video file to Gemini’s video API
Keyframes only -> a grid of N frames sampled from the subgoal window, no video

Why Video Hurt

Four distinct failure mechanisms emerged:

Here, GT means ground truth: the human-written reference action label from our production annotation pipeline, used as the evaluation target.

Where Video Genuinely Helps

For directional actions where motion is the only discriminator, video solves cases that static frames cannot:

Verb	Collage	Video	Delta
pour	45%	73%	+27pp
turn on/off	48%	61%	+12pp
adjust	17%	23%	+6pp

The hand-cropped collage addresses this by:

Cropping to the hand region focuses the model’s attention on the manipulation area, where critical object interactions happen
Including temporal neighbors through past and future collages provide action context without the resolution penalty of video
Showing multiple frames from each window captures approach, grasp, and release phases

The collage approach was originally developed by our annotation engineers as a human-labeler aid. It turns out VLMs benefit from the same framing that helps humans.

This hand-collage baseline became our reference point for all subsequent ablations: 61.0% rubric acceptable rate on ~770 subgoals from a diverse set of manipulation tasks.

Temporal Fusion: The Interleaved Token Paradox

The technique helps on bigger models, but hurt on smaller ones. If you aim for cheap, cost-effective solutions, we recommend against using it.

What “Temporal Normalized Progression” Actually Means

To understand why this happens, consider what the model must do with a [t=0.33] token. It must:

Parse the numeric value
Relate it to the other frames’ time positions
Infer what phase of the action is depicted
Use that phase information to disambiguate the action verb

Why Temporal Progress Fails in Production Specifically

There is a second reason normalized temporal progress underperforms in our setting that is specific to production (as opposed to academic benchmarks): action boundary noise.

Temporal progress tokens assume the clip boundaries are meaningful. In an ideal clip, [t=0.0] shows the start of the interaction, and [t=1.0] shows the action’s endpoint.

Pre-contact frames at [t=0.0]: the hand is moving toward the object but has not touched it yet.
Post-release frames at [t=1.0]: the hand or arm is retracting after the object has already been released or settled.

Sampling Methods

The Premise We Tested

We tested six keyframe sampling strategies, all using 6 frames from the current subgoal window:

Uniform: evenly spaced, the naive baseline
DP (Dynamic Programming): longest-path selection maximizing visual diversity
DP-FPS: FPS-constrained DP variant
Middle: uniform over the middle 60% of the window, trimming the approach/release phases
Late-last: skip first 20%, pin the final frame to the exact last frame (to capture placement state)
MG-RGB: motion-guided sampling using per-frame L1 difference
MG-Feature: motion-guided sampling using learned feature distances

Why Better Sampling Doesn’t Help

There are three reasons smarter sampling fails to improve manipulation captioning:

Mark-Based Visual Prompting: When Visual Grounding Hurts

The Hypothesis

We implemented two mark-based prompting variants:

Hand-keypoint / trajectory overlay: We overlaid MediaPipe hand keypoints and connecting lines on each keyframe, optionally preserving motion traces across frames. This was intended to expose fingertip-to-object contact cues, hand identity, and wrist trajectory.
Object-label overlay: a two-pass Gemini Robotics-ER pipeline that detects objects, returns normalized (x, y) coordinates with labels, and renders those labels onto each keyframe before captioning.

Why the Overlays Backfired

The failure came from a cascade of reliability issues:

The Broader Lesson

Result: 57.8%, a −3.2pp drop from the baseline. The breakdown by subgoal index reveals the mechanism:

Sub-goal index	Context injected	Improvements	Regressions	Net
0	None (cold start)	+45	−44	+1
1	pred_0	+57	−101	−44
2	pred_0 + pred_1	+57	−103	−46

The Lesson

Model Capacity: The Clearest Signal

The Model Ladder

After several experiments on prompt engineering, sampling strategies, and visual augmentation, we ran the simplest possible ablation: swap the model while holding everything else constant.

We compared five models at equivalent configuration (hand-collage, uniform sampling, no CoT, and no visual overlays):

Model	Rubric Score	Notes
Gemini 3.1 Flash	61.0%	Production-scale baseline
Gemini Robotics ER 1.6 Preview	62.1%	Domain-specialized
Gemini 2.5 Pro Preview	63.0%	+2pp over Flash
Gemini 2.5 Pro + stack rule	63.2%	+0.2pp over Pro
Gemini 3.1 Pro + stack + CoT	63.7%	Best result

Model Capacity vs Score -- The Clearest Signal in Dozens of Experiments

A +2.7pp gain from upgrading the model dwarfs every single prompt-engineering technique we tried. The best single-technique improvement (stack rule: +0.4pp on Flash) is an order of magnitude smaller.

Targeted Rules: The Exception That Proves the Rule

How the Failure Mass Actually Decomposes

We analyzed the 890 failing pairs from the Flash + Stack + Scoop rules condition. The failures split almost exactly into thirds:

Failure type	Count	% of failures	Fixable by
Wrong verb (action recognition failure)	~326	37%	Prompt rules, context injection, sampling
Correct verb, wrong object/destination	332	37%	Better visual grounding, higher-res frames
Egregious / completely off	~232	26%	Stronger model, temporal context

This decomposition sets the theoretical ceiling for different intervention types:

Perfect prompt rules only: ~75% (eliminates wrong-verb third, grounding failures remain)
Prompt rules + better object grounding (bounding boxes, higher-res crops, depth): ~85%
True ceiling (best possible with 6 static RGB frames): ~85-90% (minus the ~10–15% structurally ambiguous clips where motion direction cannot be determined from static frames alone)

The Best Configuration

Our best result is a 63.7% rubric score, combined:

Gemini 3.1 Pro (model capacity)
Hand-collage representation (past/present/future frames, hand-cropped)
Stack disambiguation rule (targeted fix for dominant verb error)
High reasoning effort (Pro thinks before answering)
Mini-CoT (structured PREVIOUS/NEXT/CURRENT reasoning)

Ablation Results
All experiments use the Flash hand-collage baseline unless otherwise stated.

What Worked

Technique	Score	Delta vs baseline
Upgrade to Pro 3.1	63.0%	+2.0pp
Pro + Stack rule	63.2%	+2.2pp
Pro + Stack + reasoning + mini-CoT	63.7%	+2.7pp
Hand-collage baseline (Flash)	61.0%	—
Flash + Stack rule	61.4%	+0.4pp
Flash + Stack + Scoop rules	61.3%	+0.3pp

What Didn’t Work (and Hurt)

Technique	Score	Delta vs baseline
Video-only input	28.7%	−32.3pp
CoT mini-v2 on Flash (hard subset)	36.7%	−24.3pp on hard subset
Frames without hand collage	49.2%	−11.8pp
DP-FPS sampling	49.4%	−11.6pp
Hand-collage + video added	58.5%	−2.52pp
Gemini object-label overlay (2-pass)	56.2%	−4.8pp
MediaPipe hand-motion overlay	57.3%	−3.7pp
Sequential context injection	57.8%	−3.2pp
MG-Feature sampling	58.9%	−2.1pp
Temporal interleaved tokens (Flash)	59.1%	−1.9pp
Flash + mini-CoT + visual	59.5%	−1.5pp
CoT v1 on Flash	59.8%	−1.2pp
DP sampling	60.0%	−1.0pp
Flash + Stack + Scoop rules	61.3%	−0.1pp vs stack-rule-alone

The Meta-Pattern

Looking across dozens of experiments, a single pattern explains most of the results:

“Complexity taxes smaller models; simpler prompts extract more of their real capability. Larger models can absorb and use complexity productively.”

Every technique that added reasoning overhead, intermediate steps, or additional signals hurt Flash and helped Pro. The practical implication for production:

If you have budget for Pro-class models: use mini-CoT, add temporal grounding, invest in reasoning effort
If you need Flash-class economics: use the cleanest possible representation (hand-collage), add targeted rules for known failure modes, avoid any reasoning overhead

Limitations

This study should be interpreted as a directional engineering evaluation rather than a rigorous statistical analysis.

Results are reported from single evaluation runs on a fixed 2.3k-pair benchmark without confidence intervals. Small differences (<1–2pp) should not be overinterpreted, and the leaderboard is better viewed as tiers than a strict ranking.
We do not report latency or inference cost, so higher scores should be considered alongside production tradeoffs.
Results are scoped to Gemini-family models on one internal manipulation-video benchmark and may not generalize to other distributions or VLM families.

What’s Next

Phase 1 limits are now clear. The 38% failure mass decomposes into three approximately equal thirds:

~37% wrong-verb failures, fixable with better prompt rules, context injection, or middle-clip trimming
~37% correct-verb-wrong-object failures, require better visual grounding (bounding boxes, higher-res crops, depth sensors), prompt engineering cannot touch these
~26% egregious failures, require a stronger model (RTH, directionally ambiguous actions)

The findings here generalize beyond manipulation captioning. For any VLM annotation task in production:

Representation beats sampling: how you show frames matters more than which frames you show
Rules beat reasoning: for specific, known failure modes, a targeted rule is more reliable than asking the model to reason its way out
Model capacity is the biggest lever: invest in the right model tier before optimizing prompts
Two-pass systems multiply errors: marker-based visual prompting, object detection pipelines, and autoregressive conditioning on prior predictions (sequential context) all cascade their errors

Acknowledgments

Interested in collaboration or using Scale’s Physical AI data engine? Contact us at [email protected].

The Path to Large Scale Dense Video Captioning

The Task: Dense Manipulation Captioning

What We’re Annotating

Our Evaluation Methodology

Phase 1: What Do We Feed the Model?

Why Video Hurt

Where Video Genuinely Helps

Temporal Fusion: The Interleaved Token Paradox

What “Temporal Normalized Progression” Actually Means

Why Temporal Progress Fails in Production Specifically

Sampling Methods

The Premise We Tested

Why Better Sampling Doesn’t Help

Mark-Based Visual Prompting: When Visual Grounding Hurts

The Hypothesis

Why the Overlays Backfired

The Broader Lesson

Sequential Context Injection: A Related Failure

The Lesson

Model Capacity: The Clearest Signal

The Model Ladder

Targeted Rules: The Exception That Proves the Rule

How the Failure Mass Actually Decomposes

The Best Configuration

What Worked

What Didn’t Work (and Hurt)

The Meta-Pattern

Limitations

What’s Next

Acknowledgments

The Path to Large Scale Dense Video Captioning

The Task: Dense Manipulation Captioning

What We’re Annotating

Our Evaluation Methodology

Phase 1: What Do We Feed the Model?

Why Video Hurt

Where Video Genuinely Helps

Temporal Fusion: The Interleaved Token Paradox

What “Temporal Normalized Progression” Actually Means

Why Temporal Progress Fails in Production Specifically

Sampling Methods

The Premise We Tested

Why Better Sampling Doesn’t Help

Mark-Based Visual Prompting: When Visual Grounding Hurts

The Hypothesis

Why the Overlays Backfired

The Broader Lesson

Sequential Context Injection: A Related Failure

The Lesson

Model Capacity: The Clearest Signal

The Model Ladder

Targeted Rules: The Exception That Proves the Rule

How the Failure Mass Actually Decomposes

The Best Configuration

What Worked

What Didn’t Work (and Hurt)

The Meta-Pattern

Limitations

What’s Next

Acknowledgments