SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL

Philip Schroeder¹,², Thomas Weng², Karl Schmeckpeper², Eric Rosen², Stephen Hart², Ondrej Biza²

¹MIT
²RAI Institute

Overview Video

Timelapse of online RL using rewards from SOLE-R1 reasoning (top) vs GPT-5 reasoning (bottom)

Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today’s strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We will release all models, data, code, and demos at the links above.

Model & Training Overview

Model: Given a natural-language goal and a video stream of observations, SOLE-R1 produces (i) per-timestep chain-of-thought (CoT) describing what has changed since the last timestep and (ii) a dense scalar progress estimate used as a reward signal for online RL (Figure 1).

Figure 1: SOLE-R1 is a video-language reasoning model designed to guide online RL with per-timestep CoT reasoning and progress prediction. In large-scale experiments across 40 tasks, SOLE-R1 outperforms strong baseline models with zero-shot online RL.
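To make the input/output contract in the Model paragraph concrete, here is a minimal Python sketch. It is not the released SOLE-R1 API: `StepOutput`, `ProgressEstimator`, `DummyEstimator`, and `rewards_from_video` are hypothetical names, and the [0, 1] progress range is an assumption.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class StepOutput:
    """One per-timestep output: reasoning text plus a progress estimate."""
    cot: str         # description of what changed since the last timestep
    progress: float  # dense scalar progress estimate (range assumed to be [0, 1])


class ProgressEstimator(Protocol):
    """Hypothetical interface for a video-language rewarder."""
    def step(self, goal: str, frames: Sequence[object]) -> StepOutput: ...


class DummyEstimator:
    """Stand-in model: pretends progress grows with the number of frames seen."""
    def step(self, goal: str, frames: Sequence[object]) -> StepOutput:
        progress = min(1.0, len(frames) / 10.0)
        return StepOutput(cot=f"{len(frames)} frames seen for goal: {goal}",
                          progress=progress)


def rewards_from_video(model: ProgressEstimator, goal: str,
                       video: Sequence[object]) -> List[float]:
    """Query the model once per timestep and use each estimate as the reward."""
    rewards = []
    for t in range(1, len(video) + 1):
        out = model.step(goal, video[:t])  # only frames up to time t are visible
        rewards.append(out.progress)
    return rewards


rewards = rewards_from_video(DummyEstimator(), "pick up the cube", ["frame"] * 4)
print(rewards)
```

The sketch passes the progress estimate through unchanged, matching the statement that the estimates can be used directly as rewards.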

Training data: We build the training data (Figure 2) in two stages: (1) foundational reasoning over space (single-image + depth) and time (multi-image/video), and (2) robot-video spatiotemporal reasoning specialized for dense progress estimation. We first carefully curate a diverse collection of general spatial and multi-frame temporal reasoning data (e.g., from SSR-CoT, SpatialVLM, Spot-the-diff, Embodied CoT, RoboVQA, Robo2VLM-Reasoning) to serve as a foundational layer of our training mixture. We then generate over 1 million CoT reasoning examples from more than 40,000 real-world and simulated videos. Together, this training induces video-native reasoning that explicitly integrates both spatial and temporal structure.

Training procedure: To train SOLE-R1, we use a two-stage hybrid recipe: SFT teaches high-quality spatiotemporal CoT reasoning, while RLVR (GRPO) directly emphasizes accurate progress prediction, which is under-emphasized during SFT (since the final answer occupies only a small fraction of response tokens).
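The two points in the training procedure can be illustrated with a small sketch: the group-relative advantage normalization used by GRPO, and the share of a token-averaged SFT loss that lands on the final answer. The reward function `progress_reward` (one minus absolute error) and all numbers below are assumptions chosen for illustration, not the reward or values used to train SOLE-R1.

```python
from statistics import mean, pstdev
from typing import List


def progress_reward(predicted: float, target: float) -> float:
    """Assumed verifiable reward: 1 minus absolute error, floored at 0."""
    return max(0.0, 1.0 - abs(predicted - target))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt whose ground-truth progress is 0.5.
predictions = [0.5, 0.4, 0.9, 0.1]
rewards = [progress_reward(p, 0.5) for p in predictions]
advantages = group_relative_advantages(rewards)

# Share of a token-averaged SFT loss that falls on the final progress answer,
# for an illustrative response of 200 tokens with a 4-token answer.
answer_tokens, total_tokens = 4, 200
answer_share = answer_tokens / total_tokens

print(advantages, answer_share)
```

In this toy case the answer tokens carry 2% of the SFT loss, whereas the RL reward depends only on the predicted progress value.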

Figure 2: SOLE-R1 training data mixture. The dataset combines foundational spatial reasoning, multi-frame temporal reasoning, and our synthesized video trajectories with chain-of-thought explanations and dense progress supervision, jointly enabling reasoning over space and time for progress prediction.

Experiments

We evaluate whether SOLE-R1 can serve as the sole supervision signal for learning manipulation skills from scratch via online RL. We run experiments across four simulation benchmark suites (RoboSuite, ManiSkill, Meta-World, and LIBERO) and in a real-world tabletop manipulation setting with a Franka arm. Across all settings, we evaluate a total of 40 tasks, spanning pick-and-place, articulation, button/lever/knob interactions, and mobile manipulation.

The policy observes two RGB streams (a wrist camera and an external/shoulder camera) along with robot proprioception. Actions are end-effector delta motions and a gripper open/close command. We do not use any additional privileged state, depth, object poses, or task-specific sensors.
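The policy interface described above could be typed roughly as follows. This is a sketch under assumptions: the class names, the three-dimensional delta, the 7-value proprioception vector, and the clipping limit in `clip_action` are illustrative choices that the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


@dataclass
class Observation:
    wrist_rgb: Image             # wrist camera stream
    external_rgb: Image          # external/shoulder camera stream
    proprioception: List[float]  # robot state; no depth, object poses, or privileged state


@dataclass
class Action:
    ee_delta: Tuple[float, ...]  # end-effector delta motion (dimensionality assumed)
    gripper_open: bool           # gripper open/close command


def clip_action(action: Action, max_delta: float = 0.05) -> Action:
    """Bound each delta component; the limit is an assumed safety choice."""
    clipped = tuple(max(-max_delta, min(max_delta, d)) for d in action.ee_delta)
    return Action(ee_delta=clipped, gripper_open=action.gripper_open)


blank = [[(0, 0, 0)] * 4 for _ in range(4)]
obs = Observation(wrist_rgb=blank, external_rgb=blank, proprioception=[0.0] * 7)
safe = clip_action(Action(ee_delta=(0.2, -0.2, 0.01), gripper_open=True))
print(safe)
```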

Unlike prior work that (i) learns from ground-truth rewards and/or (ii) tunes reward models or policies on task demonstrations, we evaluate in a fully zero-shot online RL setting:

No ground-truth rewards. The policy never observes ground-truth/external rewards (dense or sparse) and receives no success labels during training.
No demonstrations or offline trajectories. The policy starts with random actions and learns only from on-policy interaction.
No task-specific tuning or calibration. Reward models are used as-is, with fixed prompting across tasks.
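The three constraints above amount to a data-collection loop in which the environment's own reward is never stored. The sketch below shows that shape on a toy problem; `ToyEnv`, `toy_reward_model`, and `collect_episode` are hypothetical stand-ins, not the experimental code.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float]  # (observation, action, predicted reward)


class ToyEnv:
    """1-D stand-in for a manipulation task: move from position 0 to position 5."""

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float]:
        self.pos += action
        true_reward = 1.0 if self.pos >= 5 else 0.0  # simulator-side signal
        return self.pos, true_reward


def toy_reward_model(observations: List[int]) -> float:
    """Placeholder for a learned rewarder: progress inferred from observations only."""
    return max(0.0, min(1.0, observations[-1] / 5))


def collect_episode(env: ToyEnv,
                    policy: Callable[[int], int],
                    reward_model: Callable[[List[int]], float],
                    horizon: int = 20) -> List[Transition]:
    obs = env.reset()
    observations = [obs]
    transitions: List[Transition] = []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, _true_reward = env.step(action)  # ground-truth reward is discarded
        observations.append(next_obs)
        predicted = reward_model(observations)     # the only reward the learner sees
        transitions.append((obs, action, predicted))
        obs = next_obs
    return transitions


rng = random.Random(0)
episode = collect_episode(ToyEnv(), lambda obs: rng.choice([-1, 1]), toy_reward_model)
print(len(episode))
```

The policy here starts as uniformly random actions and the same reward function is used unchanged, mirroring the no-demonstration and no-tuning conditions.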

Baselines: We test the current best general-purpose reasoning models, including GPT-5, Gemini-3-Pro, and Gemini Robotics-ER 1.5. We also include the strongest existing special-purpose reward models: LIV, ReWiND, VLAC, TOPReward, RoboReward, Robometer, and LRM.

Overview of tasks and environments we include in the evaluation of SOLE-R1 reasoning for guiding online RL.

1) SOLE-R1 enables zero-shot online RL from scratch

SOLE-R1 achieves at least 50% success on 24 tasks, substantially outperforming all baselines (Figure 3). The strongest baselines, GPT-5 and Gemini, reach 50% success on only 7 and 5 tasks, respectively. The non-reasoning models achieve near-zero success on all tasks except in Meta-World, where Robometer, RoboReward, and ReWiND achieve a success rate above 40% on 4 tasks.

SOLE-R1 generalizes to unseen tasks and environments. SOLE-R1 succeeds on tasks that differ significantly from the task types seen during training, such as sliding a puck into a net, opening and closing windows, and manipulating unseen levers and handles in novel ways based on the natural-language task specification. This suggests that SOLE-R1 does not merely memorize task templates, but instead learns reusable spatiotemporal progress primitives (e.g., establishing contact, aligning a grasp, changing articulation state, placing/settling objects) that transfer to unseen tasks.

SOLE-R1 generalizes to unseen embodiments and camera viewpoints. SOLE-R1 solves tasks with the Franka as well as with embodiments not seen during training, including the Sawyer robot in Meta-World, the WidowX AI and Fetch Mobile Manipulator in ManiSkill, and a modified Franka with different gripper fingers and wrist camera angle in the real-world setting. SOLE-R1 also solves tasks from camera views that were not used during training. This indicates that SOLE-R1 reward predictions are not narrowly tied to a particular kinematic chain or gripper appearance, but instead track goal-relevant object state changes across morphology and camera placement.

Figure 3: Zero-shot success rate of online RL across 40 tasks. We plot the mean and standard error across three random seeds (real-world experiments use a single seed, shown as a single value). In all experiments, the robot begins with a random policy and learns entirely through interaction with the task, guided only by the predicted rewards.
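The aggregation described in the Figure 3 caption can be sketched as follows. The per-seed success rates are made-up placeholders, not results from the paper, and the use of the sample standard deviation (n - 1) for the standard error is an assumption.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple


def mean_and_sem(rates: List[float]) -> Tuple[float, Optional[float]]:
    """Mean and standard error over seeds; the SEM is undefined for a single seed."""
    if len(rates) < 2:
        return rates[0], None
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Made-up per-seed success rates, for illustration only.
per_seed: Dict[str, List[float]] = {
    "sim_task_a": [0.6, 0.7, 0.8],  # three seeds in simulation
    "sim_task_b": [0.0, 0.1, 0.2],
    "real_task": [0.9],             # single seed, reported as a single value
}

summary = {task: mean_and_sem(rates) for task, rates in per_seed.items()}

# Tasks counted as solved under the "at least 50% success" criterion.
solved = [task for task, (avg, _) in summary.items() if avg >= 0.5]
print(summary, solved)
```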

2) SOLE-R1 is robust to the exploitation observed with existing vision-language reasoning models

We use the perceived-vs-true success plot (Figure 4) to separate failures into two types: reward hacking (high perceived, low true) and signal-limited (low perceived, low true). General-purpose VLM reasoning models (GPT-5 and Gemini) predominantly fail via reward hacking: online RL discovers behaviors that elicit inflated progress predictions without completing the task. Figure 4 shows an example of reward hacking on the cube pick-up task (an extended set of examples appears in Figure 7 of the paper). SOLE-R1 failures more often fall into the signal-limited type, suggesting that the model typically recognizes non-success but can still provide rewards that are too flat or noisy to bootstrap exploration within the episode budget.

Figure 4: Perceived vs true success in zero-shot RL. Perceived success is the average maximum predicted progress (RoboReward is excluded as it does not provide dense rewards). True success is the average maximum ground-truth reward achieved.
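The two quantities in the Figure 4 caption and the two failure types can be computed as in the sketch below. The 0.5 cut-off separating "high" from "low" and the episode data are assumptions for illustration; the text does not state a threshold.

```python
from statistics import mean
from typing import List


def average_max(episodes: List[List[float]]) -> float:
    """Average over episodes of the maximum per-timestep value."""
    return mean([max(episode) for episode in episodes])


def failure_type(perceived: float, true: float, threshold: float = 0.5) -> str:
    """Label an outcome using an assumed high/low threshold."""
    if true >= threshold:
        return "success"
    return "reward hacking" if perceived >= threshold else "signal-limited"


# Illustrative data: predicted progress and ground-truth reward per timestep.
hacked_pred = [[0.2, 0.9], [0.1, 0.95]]
flat_pred = [[0.1, 0.2], [0.05, 0.1]]
no_success = [[0.0, 0.0], [0.0, 0.0]]

hacked = failure_type(average_max(hacked_pred), average_max(no_success))
flat = failure_type(average_max(flat_pred), average_max(no_success))
print(hacked, flat)
```

High predicted progress paired with zero true reward is labeled reward hacking, while uniformly low predictions are labeled signal-limited.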

3) The SOLE-R1 model and training recipe follow a scaling law driven by diversity of training tasks

We find that our data synthesis and training recipe follow a scaling law driven by the diversity of training tasks (Figure 6). We train variants of SOLE-R1 with an increasing number of task types included in our training data synthesis (details in Appendix M of the paper). The figure below plots the number of downstream tasks that achieve different success thresholds as a function of training task diversity.

Figure 6: Tasks solved by SOLE-R1 vs training task diversity
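The quantity plotted in Figure 6 is a count of downstream tasks at or above each success threshold, computed per model variant. A minimal sketch follows; the thresholds, the numbers of training task types, and the success rates are invented for illustration and are not values from the paper.

```python
from typing import Dict, List, Sequence


def tasks_above(success_rates: Sequence[float],
                thresholds: Sequence[float]) -> Dict[float, int]:
    """Count how many tasks reach each success threshold."""
    return {th: sum(rate >= th for rate in success_rates) for th in thresholds}


# Keys: number of training task types used for a model variant (illustrative).
# Values: downstream success rate per task for that variant (illustrative).
by_diversity: Dict[int, List[float]] = {
    5: [0.1, 0.3, 0.6],
    20: [0.4, 0.6, 0.9],
}

thresholds = (0.25, 0.5, 0.75)
curve = {n: tasks_above(rates, thresholds) for n, rates in by_diversity.items()}
print(curve)
```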

BibTeX

@article{schroeder2026soler1,
  title         = {SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning},
  author        = {Schroeder, Philip and Weng, Thomas and Schmeckpeper, Karl and Rosen, Eric and Hart, Stephen and Biza, Ondrej},
  journal       = {arXiv preprint arXiv:2603.28730},
  year          = {2026},
  eprint        = {2603.28730},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  doi           = {10.48550/arXiv.2603.28730},
  url           = {https://arxiv.org/abs/2603.28730}
}
