CoRL 2025 Paper List
This repository compiles information about accepted papers for CoRL 2025.
Thanks to the original repo: https://github.com/shu1ong/CORL2025-Paper-List.git
Download link: https://pan.baidu.com/s/163ZqVv5WGBLHxNlO_l8CWQ (extraction code: enim)
Table of Contents
Accepted Papers
Below is a list of accepted papers for CoRL 2025, sorted alphabetically by title.
Accepted Papers (A-Z)
3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation
- Authors: Xiaoqi Li, Liang Heng, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu, Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, Shanghang Zhang, Hao Dong
- Abstract: Recently, 2D vision-language-action (VLA) models have made significant strides in multi-task manipulation. However, these models struggle to reason about 3D spatial relationships from 2D image inputs. Although an increasing number of 3D approaches explicitly integrate 3D information, they encounter challenges such as limited availability of large-scale 3D datasets and loss of spatial information during input processing. Meanwhile, existing policies typically focus on the perception-to-action learning paradigm, lacking an explicit understanding of the spatial and temporal relationships between the robot and its environment. To address this, we propose 3DS-VLA, which enhances pretrained 2D vision-language models (VLMs) with comprehensive 3D awareness, enabling the prediction of robust end-effector poses. Specifically, we enable a 2D vision encoder to encode both 2D images and 3D spatial observations by introducing a 2D-to-3D positional alignment mechanism. This allows 3DS-VLA to leverage the large-scale pre-trained knowledge of the VLM for effective reasoning in complex 3D robotic environments. Furthermore, to better understand the spatiotemporal relationship between 3D observations and robot behavior, we guide the model to learn the introduced sequential 3D spatial constraints, which define affordance-relevant 3D keypoints on objects, ensuring robust interactions. Experiments in simulated and real-world environments demonstrate that 3DS-VLA outperforms previous state-of-the-art policies and showcase its generalizable capabilities across multi-task, multi-embodiment, and diverse environmental settings.
- PDF: https://openreview.net/pdf?id=dT45OMevL5
- Forum: https://openreview.net/forum?id=dT45OMevL5
“Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch
- Authors: Yiqing Xu, Linfeng Li, Cunjun Yu, David Hsu
- Abstract: Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly—they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present StackItUp, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. StackItUp introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports—critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, StackItUp consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.
- PDF: https://openreview.net/pdf?id=pukgxvcOwL
- Forum: https://openreview.net/forum?id=pukgxvcOwL
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
- Authors: Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, Ury Zhilinsky
- Abstract: In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_0$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
- PDF: https://openreview.net/pdf?id=vlhoswksBO
- Forum: https://openreview.net/forum?id=vlhoswksBO
$\texttt{SPIN}$: distilling $\texttt{Skill-RRT}$ for long-horizon prehensile and non-prehensile manipulation
- Authors: Haewon Jung, Donguk Lee, Haecheol Park, Kim Jun Hyeop, Beomjoon Kim
- Abstract: Current robots struggle with long-horizon manipulation tasks requiring sequences of prehensile and non-prehensile skills, contact-rich interactions, and long-term reasoning. We present $\texttt{SPIN}$ ($\textbf{S}$kill $\textbf{P}$lanning to $\textbf{IN}$ference), a framework that distills a computationally intensive planning algorithm into a policy via imitation learning. We propose $\texttt{Skill-RRT}$, an extension of RRT that incorporates skill applicability checks and intermediate object pose sampling for solving such long-horizon problems. To chain independently trained skills, we introduce $\textit{connectors}$, goal-conditioned policies trained to minimize object disturbance during transitions. High-quality demonstrations are generated with $\texttt{Skill-RRT}$ and distilled through noise-based replay in order to reduce online computation time. The resulting policy, trained entirely in simulation, transfers zero-shot to the real world and achieves over 80% success across three challenging long-horizon manipulation tasks and outperforms state-of-the-art hierarchical RL and planning methods.
- PDF: https://openreview.net/pdf?id=udH3b2Lsx5
- Forum: https://openreview.net/forum?id=udH3b2Lsx5
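The abstract above describes extending RRT with per-skill applicability checks and intermediate object-pose sampling. The sketch below is a heavily simplified, hypothetical rendering of that idea: object states are scalars in [0, 10], and the two "skills" (a short-range push and a longer-range pick-and-place hop) are toy stand-ins for the paper's trained prehensile/non-prehensile policies, not its actual API.

```python
import random

def make_skill(name, applicable, apply_fn):
    # A skill bundles an applicability predicate with a transition function.
    return {"name": name, "applicable": applicable, "apply": apply_fn}

# Toy skills (illustrative only): push covers short moves, pick_place longer hops.
push = make_skill("push",
                  applicable=lambda s, t: abs(t - s) <= 1.0,
                  apply_fn=lambda s, t: t)
pick_place = make_skill("pick_place",
                        applicable=lambda s, t: 1.0 < abs(t - s) <= 3.0,
                        apply_fn=lambda s, t: t)

def skill_rrt(start, goal, skills, n_iters=3000, tol=0.2, seed=0):
    rng = random.Random(seed)
    parent = {start: None}  # state -> (parent state, skill name)
    for _ in range(n_iters):
        # Goal-biased sampling of an intermediate object state.
        target = goal if rng.random() < 0.3 else rng.uniform(0.0, 10.0)
        nearest = min(parent, key=lambda s: abs(s - target))
        for sk in skills:
            if not sk["applicable"](nearest, target):
                continue  # skill applicability check gates tree expansion
            new = sk["apply"](nearest, target)
            if new in parent:
                continue
            parent[new] = (nearest, sk["name"])
            if abs(new - goal) <= tol:
                # Backtrack to recover the skill sequence.
                seq, node = [], new
                while parent[node] is not None:
                    node, name = parent[node]
                    seq.append(name)
                return new, list(reversed(seq))
    return None

result = skill_rrt(0.0, 9.0, [push, pick_place])
```

In the real system, the demonstrations such a planner produces are then distilled into a reactive policy; this sketch only illustrates the search structure.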
Action-Free Reasoning for Policy Generalization
- Authors: Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, Suneel Belkhale
- Abstract: End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings—such as unseen scenes, tasks, and object instances—remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos—essential for guiding robot actions—to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we release a new dataset of 3,377 human-hand demonstrations compatible with the Bridge V2 benchmark. This dataset includes chain-of-thought reasoning annotations and hand-tracking data, and is aimed at facilitating future research on reasoning-driven robot learning. Our experiments demonstrate that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. 
See website with videos: https://rad-generalization-s.github.io/.
- PDF: https://openreview.net/pdf?id=DzRNBBCP4R
- Forum: https://openreview.net/forum?id=DzRNBBCP4R
ActLoc: Learning to Localize on the Move via Active Viewpoint Selection
- Authors: Jiajie Li, Boyang Sun, Luca Di Giammarino, Hermann Blum, Marc Pollefeys
- Abstract: Reliable localization is critical for robot navigation, yet many existing systems assume that all viewpoints along a trajectory are equally informative. In practice, localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. To address this, we present ActLoc, an active viewpoint-aware planning framework for enhancing localization accuracy for general robot navigation tasks. At the core of ActLoc is an attention-based model trained at scale for viewpoint selection. This model encodes a metric map of the scene, along with camera poses used during map construction, and estimates localization accuracy over camera pitch and yaw directions at arbitrary 3D waypoints in space. This per-point accuracy distribution is integrated into the path planning process, allowing the robot to actively choose camera orientations that maximize localization robustness while respecting task and motion constraints. ActLoc achieves state-of-the-art performance on the single-viewpoint selection task, and generalizes effectively to full-trajectory planning. It provides a modular enhancement to a wide range of navigation and inspection tasks in structured environments.
- PDF: https://openreview.net/pdf?id=vLtS0ZDL73
- Forum: https://openreview.net/forum?id=vLtS0ZDL73
Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
- Authors: Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg
- Abstract: Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle with observations outside of the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address those challenges, we propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms’ learning capacity while enabling zero-shot transfer to novel embodiments and camera poses. For more results, visit https://adapt3r-robot.github.io/.
- PDF: https://openreview.net/pdf?id=sUWOSP6SUJ
- Forum: https://openreview.net/forum?id=sUWOSP6SUJ
Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence
- Authors: Pranay Gupta, Henny Admoni, Andrea Bajcsy
- Abstract: End-to-end visuomotor policies trained using behavior cloning have shown a remarkable ability to generate complex, multi-modal low-level robot behaviors. However, at deployment time, these policies still struggle to act reliably when faced with out-of-distribution (OOD) visuals induced by objects, backgrounds, or environment changes. Prior works in interactive imitation learning solicit corrective expert demonstrations under the OOD conditions—but this can be costly and inefficient. We observe that task success under OOD conditions does not always warrant novel robot behaviors. In-distribution (ID) behaviors can directly be transferred to OOD conditions that share functional similarities with ID conditions. For example, behaviors trained to interact with ID pens can apply to interacting with a visually-OOD pencil. The key challenge lies in disambiguating which ID observations functionally correspond to the OOD observation for the task at hand. We propose that an expert can provide this OOD-to-ID functional correspondence. Thus, instead of collecting new demonstrations and re-training at every OOD encounter, our method: (1) detects the need for feedback by checking if current observations are OOD and the most similar training observations show divergent behaviors, (2) solicits functional correspondence feedback to disambiguate between those behaviors, and (3) intervenes on the OOD observations with the functionally corresponding ID observations to perform deployment-time generalization. We validate our method across diverse real-world robotic manipulation tasks with a Franka Panda robotic manipulator. Our results show that test-time functional correspondences can improve the generalization of a vision-based diffusion policy to OOD objects and environment conditions with low feedback.
- PDF: https://openreview.net/pdf?id=1TdRe3wPqK
- Forum: https://openreview.net/forum?id=1TdRe3wPqK
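Step (1) of the abstract above, triggering expert feedback only when an observation is both far from the training set and ambiguous among its neighbors, can be sketched with nearest-neighbor retrieval. Everything here (the embeddings, the distance thresholds, the scalar actions) is an illustrative stand-in, not the paper's actual representation.

```python
import numpy as np

def needs_feedback(obs, train_obs, train_actions, k=3,
                   ood_dist=1.0, action_spread=0.5):
    # Distance from the query observation to every training observation.
    dists = np.linalg.norm(train_obs - obs, axis=1)
    idx = np.argsort(dists)[:k]                 # k nearest training samples
    is_ood = dists[idx[0]] > ood_dist           # far from everything seen
    # Neighbors disagreeing on the action means retrieval alone is ambiguous.
    divergent = np.ptp(train_actions[idx]) > action_spread
    return bool(is_ood and divergent)

# Tiny fabricated dataset: two similar observations with similar actions,
# plus one distant observation with a very different action.
train_obs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
train_actions = np.array([1.0, 1.05, -1.0])

in_dist = needs_feedback(np.array([0.05, 0.0]), train_obs, train_actions)
far_obs = needs_feedback(np.array([3.0, 3.0]), train_obs, train_actions)
# in_dist is False (a close neighbor exists); far_obs is True (OOD + divergent)
```

Only when both conditions fire would the method ask the expert for an OOD-to-ID correspondence rather than a fresh demonstration.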
AgentWorld: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation
- Authors: Yizheng Zhang, Zhenjun Yu, JiaXin Lai, Cewu Lu, Lei Han
- Abstract: We introduce AgentWorld, an interactive simulation platform for developing household mobile manipulation capabilities. Our platform combines automated scene construction that encompasses layout generation, semantic asset placement, visual material configuration, and physics simulation, with a dual-mode teleoperation system supporting both wheeled bases and humanoid locomotion policies for data collection. The resulting AgentWorld Dataset captures diverse tasks ranging from primitive actions (pick-and-place, push-pull, etc.) to multistage activities (serve drinks, heat up food, etc.) across living rooms, bedrooms, and kitchens. Through extensive benchmarking of imitation learning methods including behavior cloning, action chunking transformers, diffusion policies, and vision-language-action models, we demonstrate the dataset’s effectiveness for sim-to-real transfer. The integrated system provides a comprehensive solution for scalable robotic skill acquisition in complex home environments, bridging the gap between simulation-based training and real-world deployment.
- PDF: https://openreview.net/pdf?id=XoRtWWjXuC
- Forum: https://openreview.net/forum?id=XoRtWWjXuC
AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies
- Authors: Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jianing Yang, Amir Zadeh, Chuan Li, Nima Fazeli, Joyce Chai
- Abstract: In this paper, we propose AimBot, a lightweight visual augmentation technique that provides explicit spatial cues to improve visuomotor policy learning in robotic manipulation. AimBot overlays shooting lines and scope reticles onto multi-view RGB images, offering auxiliary visual guidance that encodes the end-effector’s state. The overlays are computed from depth images, camera extrinsics, and the current end-effector pose, explicitly conveying spatial relationships between the gripper and objects in the scene. AimBot incurs minimal computational overhead (less than 1 ms) and requires no changes to model architectures, as it simply replaces original RGB images with augmented counterparts. Despite its simplicity, our results show that AimBot consistently improves the performance of various visuomotor policies in both simulation and real-world settings, highlighting the benefits of spatially grounded visual feedback. More videos can be found at https://aimbot-reticle.github.io/.
- PDF: https://openreview.net/pdf?id=brTSiML1nh
- Forum: https://openreview.net/forum?id=brTSiML1nh
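The overlay geometry the AimBot abstract describes (placing a reticle at the end-effector's image location from camera extrinsics and the current pose) reduces to a standard pinhole projection. The intrinsics and pose values below are made up for illustration; the paper's actual overlay also uses depth images and draws shooting lines, which are omitted here.

```python
import numpy as np

def project_point(p_world, K, T_world_to_cam):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    p_h = np.append(p_world, 1.0)          # homogeneous coordinates
    p_cam = T_world_to_cam @ p_h           # world frame -> camera frame
    assert p_cam[2] > 0, "point must be in front of the camera"
    uv = K @ (p_cam[:3] / p_cam[2])        # perspective divide, then intrinsics
    return uv[:2]                          # pixel (u, v) for the reticle center

# Illustrative intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)  # camera at the world origin, looking down +z

u, v = project_point(np.array([0.1, -0.05, 1.0]), K, T)
# u = 500*0.1 + 320 = 370.0, v = 500*(-0.05) + 240 = 215.0
```

A drawing routine would then render crosshairs at `(u, v)` on each camera view before the image is fed to the policy.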
AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons
- Authors: Hongjie Fang, Chenxi Wang, Yiming Wang, Jingjing Chen, Shangning Xia, Jun Lv, Zihao He, Xiyan Yi, Yunhan Guo, Xinyu Zhan, Lixin Yang, Weiming Wang, Cewu Lu, Hao-Shu Fang
- Abstract: Scaling up robotic imitation learning for real-world applications requires efficient and scalable demonstration collection methods. While teleoperation is effective, it depends on costly and inflexible robot platforms. In-the-wild demonstrations offer a promising alternative, but existing collection devices have key limitations: handheld setups offer limited observational coverage, and whole-body systems often require fine-tuning with robot data due to domain gaps. To address these challenges, we present AirExo-2, a low-cost exoskeleton system for large-scale in-the-wild data collection, along with visual adaptors that transform collected data into pseudo-robot demonstrations suitable for policy learning. We further introduce RISE-2, a generalizable imitation learning policy that fuses 3D spatial and 2D semantic perception for robust manipulations. Experiments show that RISE-2 outperforms prior state-of-the-art methods on both in-domain and generalization evaluations. Trained solely on adapted in-the-wild data produced by AirExo-2, RISE-2 achieves comparable performance to policies trained with teleoperated data, highlighting the effectiveness and potential of AirExo-2 for scalable and generalizable imitation learning.
- PDF: https://openreview.net/pdf?id=ksOrtEgIC0
- Forum: https://openreview.net/forum?id=ksOrtEgIC0
AnyPlace: Learning Generalizable Object Placement for Robot Manipulation
- Authors: Yuchi Zhao, Miroslav Bogdanovic, Chengyuan Luo, Steven Tohme, Kourosh Darvish, Alan Aspuru-Guzik, Florian Shkurti, Animesh Garg
- Abstract: Object placement in robotic tasks is inherently challenging due to the diversity of object geometries and placement configurations. We address this with AnyPlace, a two-stage method trained entirely on synthetic data, capable of predicting a wide range of feasible placement poses for real-world tasks. Our key insight is that by leveraging a Vision-Language Model (VLM) to identify approximate placement locations, we can focus only on the relevant regions for precise local placement, which enables us to train the low-level placement-pose-prediction model to capture multimodal placements efficiently. For training, we generate a fully synthetic dataset comprising 13 categories of randomly generated objects in 5370 different placement poses across three configurations (insertion, stacking, hanging) and train local placement-prediction models. We extensively evaluate our method in high-fidelity simulation and show that it consistently outperforms baseline approaches across all three tasks in terms of success rate, coverage of placement modes, and precision. In real-world experiments, our method achieves an average success and coverage rate of 76% across three tasks, where most baseline methods fail completely. We further validate the generalization of our approach on 16 real-world placement tasks, demonstrating that models trained purely on synthetic data can be directly transferred to the real world in a zero-shot setting. More at: https://anyplace-pnp.github.io/.
- PDF: https://openreview.net/pdf?id=H0zFqW6QM0
- Forum: https://openreview.net/forum?id=H0zFqW6QM0
ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly
- Authors: Jiankai Sun, Aidan Curtis, Yang You, Yan Xu, Michael Koehle, Qianzhong Chen, Suning Huang, Leonidas Guibas, Sachin Chitta, Mac Schwager, Hui Li
- Abstract: Generalizable long-horizon robotic assembly requires reasoning at multiple levels of abstraction. While end-to-end imitation learning (IL) is a promising approach, it typically requires large amounts of expert demonstration data and often struggles to achieve the high precision demanded by assembly tasks. Reinforcement learning (RL) approaches, on the other hand, have shown some success in high-precision assembly, but suffer from sample inefficiency, which limits their effectiveness in long-horizon tasks. To address these challenges, we propose a hierarchical modular approach, named Adaptive Robotic Compositional Hierarchy (ARCH), which enables long-horizon, high-precision robotic assembly in contact-rich settings. ARCH employs a hierarchical planning framework, including a low-level primitive library of parameterized skills and a high-level policy. The low-level primitive library includes essential skills for assembly tasks, such as grasping and inserting. These primitives consist of both RL and model-based controllers. The high-level policy, learned via IL from a handful of demonstrations, without the need for teleoperation, selects the appropriate primitive skills and instantiates them with input parameters. We extensively evaluate our approach in simulation and on a real robotic manipulation platform. We show that ARCH generalizes well to unseen objects and outperforms baseline methods in terms of success rate and data efficiency.
- PDF: https://openreview.net/pdf?id=a2RMXJbkJ8
- Forum: https://openreview.net/forum?id=a2RMXJbkJ8
Articulate AnyMesh: Open-vocabulary 3D Articulated Objects Modeling
- Authors: Xiaowen Qiu, Jincheng Yang, Yian Wang, Zhehuan Chen, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, Chuang Gan
- Abstract: 3D articulated objects modeling has long been a challenging problem, since it requires capturing both accurate surface geometries and semantically meaningful and spatially precise structures, parts, and joints. Existing methods heavily depend on training data from a limited set of handcrafted articulated object categories (e.g., cabinets and drawers), which restricts their ability to model a wide range of articulated objects in an open-vocabulary context. To address these limitations, we propose Articulate AnyMesh, an automated framework that is able to convert any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Given a 3D mesh, our framework utilizes advanced Vision-Language Models and visual prompting techniques to extract semantic information, allowing for both the segmentation of object parts and the construction of functional joints. Our experiments show that Articulate AnyMesh can generate large-scale, high-quality 3D articulated objects, including tools, toys, mechanical devices, and vehicles, significantly expanding the coverage of existing 3D articulated object datasets. Additionally, we show that these generated assets can facilitate the acquisition of new articulated object manipulation skills in simulation, which can then be transferred to a real robotic system.
- PDF: https://openreview.net/pdf?id=BNCh3SS1Yl
- Forum: https://openreview.net/forum?id=BNCh3SS1Yl
Articulated Object Estimation in the Wild
- Authors: Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, Abhinav Valada
- Abstract: Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulating objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at a scene level, accompanied by articulation labels and ground truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available at redacted-for-review.
- PDF: https://openreview.net/pdf?id=J1Ekhe08QU
- Forum: https://openreview.net/forum?id=J1Ekhe08QU
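One geometric intuition behind articulation-axis estimation from tracked points, consistent with the abstract above though far simpler than ArtiPoint's factor-graph formulation, is that a point on a revolute part traces an arc in a fixed plane, so the plane's normal estimates the axis direction. The sketch below fits that plane by SVD; the trajectory is fabricated.

```python
import numpy as np

def estimate_revolute_axis(traj):
    """traj: (n, 3) positions of one tracked point over time.
    Returns the unit normal of the best-fit plane (the axis direction,
    up to sign), via the smallest right singular vector."""
    centered = traj - traj.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]

# A synthetic point rotating about the z-axis (e.g., a cabinet-door hinge),
# observed over a 1.5-radian swing.
t = np.linspace(0.0, 1.5, 20)
traj = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
axis = estimate_revolute_axis(traj)
# |axis[2]| is ~1: the recovered direction aligns with z, up to sign
```

The full pipeline additionally has to handle camera motion and noisy tracks, which is what the factor-graph optimization over many points provides.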
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit
- Authors: Yang Li, Junfan Chen, Feng Xue, Jiabin Qiu, Wenbin Li, Qingrui Zhang, Ying Wen, Wei Pan
- Abstract: Adaptive teaming—the capability of agents to effectively collaborate with unfamiliar teammates without prior coordination—is widely explored in virtual video games but overlooked in real-world multi-robot contexts. Yet, such adaptive collaboration is crucial for real-world applications, including border surveillance, search-and-rescue, and counter-terrorism operations. To address this gap, we introduce AT-Drone, the first dedicated benchmark explicitly designed to facilitate comprehensive training and evaluation of adaptive teaming strategies in multi-drone pursuit scenarios. AT-Drone makes the following key contributions: (1) An adaptable simulation environment configurator that enables intuitive and rapid setup of adaptive teaming multi-drone pursuit tasks, including four predefined pursuit environments. (2) A streamlined real-world deployment pipeline that seamlessly translates simulation insights into practical drone evaluations using edge devices (such as Jetson Orin Nano) and Crazyflie drones. (3) A novel algorithm zoo integrated with a distributed training framework, featuring diverse algorithms explicitly tailored, for the first time, to multi-pursuer and multi-evader drone pursuit tasks. (4) Standardized evaluation protocols with newly designed unseen drone zoos, explicitly designed to rigorously assess the performance of adaptive teaming. Comprehensive experimental evaluations across four progressively challenging multi-drone pursuit scenarios confirm AT-Drone’s effectiveness in advancing adaptive teaming research. Real-world drone experiments further validate its practical feasibility and utility for realistic robotic operations. Videos, code and weights are available at \url{https://sites.google.com/view/at-drone}.
- PDF: https://openreview.net/pdf?id=xPryDEv2YH
- Forum: https://openreview.net/forum?id=xPryDEv2YH
ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning
- Authors: Yunchu Zhang, Shubham Mittal, Zhengyu Zhang, Liyiming Ke, Siddhartha Srinivasa, Abhishek Gupta
- Abstract: Learning visuomotor policies through imitation learning often suffers from perceptual challenges, where visual differences between training and evaluation environments degrade policy performance. Policies that rely on state estimates such as 6D pose require task-specific tracking and are difficult to scale, while raw sensor-based policies may lack robustness to small visual disturbances. In this work, we leverage 2D keypoints — spatially consistent features in the image frame — as a state representation for robust policy learning, and apply it to both sim-to-real transfer and real-world imitation learning. However, the choice of which keypoints to use can vary across objects and tasks. We propose a novel method, ATK, to automatically select keypoints in a task-driven manner, such that the chosen keypoints are predictive of optimal behavior for the given task. Our proposal optimizes for a minimal set of task-relevant keypoints that preserve policy performance and robustness. We distill expert data (either from an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively tracks keypoints and transfers policies to the real-world evaluation scenario, even under perceptual challenges such as transparent objects, fine-grained manipulation, and widely varying scene appearance. We validate our approach on various robotic tasks, demonstrating that these minimal keypoint representations improve robustness to visual disturbances and environmental variations.
- PDF: https://openreview.net/pdf?id=a5LFUOlkIj
- Forum: https://openreview.net/forum?id=a5LFUOlkIj
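The "select keypoints predictive of optimal behavior" idea above can be sketched as greedy subset selection under a simple least-squares probe: repeatedly add the keypoint whose coordinates most reduce action-prediction error. ATK itself optimizes the subset jointly with policy distillation; the linear probe, 2D keypoints, and scalar actions here are toy stand-ins.

```python
import numpy as np

def greedy_select_keypoints(keypoints, actions, budget=1):
    """keypoints: (n, n_kp, 2) tracked 2D keypoints; actions: (n,) scalars.
    Returns the indices of the `budget` keypoints chosen greedily."""
    n, n_kp, _ = keypoints.shape
    chosen = []
    for _ in range(budget):
        best_j, best_err = None, np.inf
        for j in range(n_kp):
            if j in chosen:
                continue
            # Probe: least-squares fit of actions from the candidate subset.
            X = keypoints[:, chosen + [j], :].reshape(n, -1)
            X = np.hstack([X, np.ones((n, 1))])          # bias column
            coef, *_ = np.linalg.lstsq(X, actions, rcond=None)
            err = float(np.mean((X @ coef - actions) ** 2))
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

# Fabricated data: the action depends only on keypoint 1's x-coordinate,
# so a task-driven selector should pick keypoint 1.
rng = np.random.default_rng(0)
kps = rng.normal(size=(50, 3, 2))
acts = 2.0 * kps[:, 1, 0]
picked = greedy_select_keypoints(kps, acts, budget=1)
# picked == [1]
```

The attraction of such minimal representations is that downstream tracking only needs to stay accurate for the selected points, not the whole scene.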
AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World
- Authors: Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, Sergey Levine
- Abstract: Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning: evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around the clock evaluations, and the evaluation results correspond closely to ground truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.
- PDF: https://openreview.net/pdf?id=isrcFrgwZp
- Forum: https://openreview.net/forum?id=isrcFrgwZp
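The abstract's analogy of submitting evaluation jobs the way software jobs go to a cluster scheduler can be illustrated with a minimal FIFO queue. This is a toy sketch of the workflow only; `EvalQueue`, `submit`, and `run_next` are hypothetical names, not part of the AutoEval system.

```python
from collections import deque

class EvalQueue:
    """Toy FIFO scheduler in the spirit of a cluster job queue."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, policy_name):
        # Users enqueue evaluation jobs; the system works through them in order.
        self._jobs.append(policy_name)
        return len(self._jobs)  # current queue depth

    def run_next(self, evaluate_fn):
        # Pop the oldest job and evaluate it (on the real system, success
        # detection and scene resets would happen inside evaluate_fn).
        if not self._jobs:
            return None
        policy = self._jobs.popleft()
        return policy, evaluate_fn(policy)
```

On the real system the per-job evaluation runs on physical robot scenes; here `evaluate_fn` stands in for that whole loop.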
BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
- Authors: Yunfan Jiang, Ruohan Zhang, Josiah Wong, Chen Wang, Yanjie Ze, Hang Yin, Cem Gokmen, Shuran Song, Jiajun Wu, Li Fei-Fei
- Abstract: Real-world household tasks present significant challenges for mobile manipulation robots. An analysis of existing robotics benchmarks reveals that successful task performance hinges on three key whole-body control capabilities: bimanual coordination, stable and precise navigation, and extensive end-effector reachability. Achieving these capabilities requires careful hardware design, but the resulting system complexity further complicates visuomotor policy learning. To address these challenges, we introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for whole-body manipulation in diverse household tasks. Built on a bimanual, wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body teleoperation interface for data collection and a novel algorithm for learning whole-body visuomotor policies. We evaluate BRS on five challenging household tasks that not only emphasize the three core capabilities but also introduce additional complexities, such as long-range navigation, interaction with articulated and deformable objects, and manipulation in confined spaces. We believe that BRS’s integrated robotic embodiment, data collection interface, and learning framework mark a significant step toward enabling real-world whole-body manipulation for everyday household tasks. BRS is open-sourced at https://behavior-robot-suite.github.io/.
- PDF: https://openreview.net/pdf?id=v2KevjWScT
- Forum: https://openreview.net/forum?id=v2KevjWScT
Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing
- Authors: Gokul Puthumanaillam, Aditya Penumarti, Manav Vora, Paulo Padrao, Jose Fuentes, Leonardo Bobadilla, Jane Shin, Melkior Ornik
- Abstract: Robots equipped with rich sensor suites can localize reliably in partially-observable environments—but powering every sensor continuously is wasteful and often infeasible. Belief-space planners address this by propagating pose-belief covariance through analytic models and switching sensors heuristically–a brittle, runtime-expensive approach. Data-driven approaches–including diffusion models–learn multi-modal trajectories from demonstrations, but presuppose an accurate, always-on state estimate. We address the largely open problem: for a given task in a mapped environment, which minimal sensor subset must be active at each location to maintain state uncertainty just low enough to complete the task? Our key insight is that when a diffusion planner is explicitly conditioned on a pose-belief raster and a sensor mask, the spread of its denoising trajectories yields a calibrated, differentiable proxy for the expected localization error. Building on this insight, we present Belief-Conditioned One-Step Diffusion (B-COD), the first planner that, in a 10 ms forward pass, returns a short-horizon trajectory, per-waypoint aleatoric variances, and a proxy for localization error–eliminating external covariance rollouts. We show that this single proxy suffices for a soft actor-critic to choose sensors online, optimising energy while bounding pose-covariance growth. We deploy B-COD in real-time marine trials on an unmanned surface vehicle and show that it reduces sensing energy consumption while matching the goal-reach performance of an always-on baseline.
- PDF: https://openreview.net/pdf?id=t2asRJv2SD
- Forum: https://openreview.net/forum?id=t2asRJv2SD
BEVCalib: LiDAR-Camera Calibration via Geometry-Guided Bird’s-Eye View Representation
- Authors: Weiduo Yuan, Jerry Li, Justin Yue, Divyank Shah, Konstantinos Karydis, Hang Qiu
- Abstract: Accurate LiDAR-camera calibration is the foundation of multimodal fusion for environmental perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird’s-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCalib. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometry information from the BEV feature, we introduce a novel feature selector to choose the most important feature in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on various datasets demonstrate that BEVCalib establishes a new state of the art, improving the best open-source baseline by two orders of magnitude on KITTI, nuScenes, and our dynamic extrinsic dataset, and outperforming the best baseline in the literature by 72% on the KITTI dataset and 69% on the nuScenes dataset. All source code and checkpoints will be released.
- PDF: https://openreview.net/pdf?id=9FpccnRarn
- Forum: https://openreview.net/forum?id=9FpccnRarn
Beyond Constant Parameters: Hyper Prediction Models and HyperMPC
- Authors: Jan Węgrzynowski, Piotr Kicki, Grzegorz Czechmanowski, Maciej Piotr Krupka, Krzysztof Walas
- Abstract: Model Predictive Control (MPC) is among the most widely adopted and reliable methods for robot control, relying critically on an accurate dynamics model. However, existing dynamics models used in gradient-based MPC are limited by computational complexity and state representation. To address this limitation, we propose the Hyper Prediction Model (HyperPM), a novel approach in which we project the unmodeled dynamics onto a time-dependent dynamics model. This time dependency is captured through time-varying model parameters, whose evolution over the MPC prediction horizon is learned using a neural network. Such a formulation preserves the computational efficiency and robustness of the base model while equipping it with the capacity to anticipate previously unmodeled phenomena. We evaluated the proposed approach on several challenging systems, including real-world F1TENTH autonomous racing, and demonstrated that it significantly reduces long-horizon prediction errors. Moreover, when integrated within the MPC framework (HyperMPC), our method consistently outperforms existing state-of-the-art techniques.
- PDF: https://openreview.net/pdf?id=8v0mlyKk5q
- Forum: https://openreview.net/forum?id=8v0mlyKk5q
Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations
- Authors: Chengtian Ma, Yunyue Wei, Chenhui Zuo, Chen Zhang, Yanan Sui
- Abstract: Balance control is important for human and bipedal robotic systems. While dynamic balance during locomotion has received considerable attention, quantitative understanding of static balance and falling remains limited. This work presents a hierarchical control pipeline for simulating human balance via a comprehensive whole-body musculoskeletal system. We identified spatiotemporal dynamics of balancing during stable standing, revealed the impact of muscle injury on balancing behavior, and generated fall contact patterns that aligned with clinical data. Furthermore, our simulated hip exoskeleton assistance demonstrated improvement in balance maintenance and reduced muscle effort under perturbation. This work offers unique muscle-level insights into human balance dynamics that are challenging to capture experimentally. It could provide a foundation for developing targeted interventions for individuals with balance impairments and support the advancement of humanoid robotic systems.
- PDF: https://openreview.net/pdf?id=AVDCwK1dek
- Forum: https://openreview.net/forum?id=AVDCwK1dek
BranchOut: Capturing Realistic Multimodality in Autonomous Driving Decisions
- Authors: Hee Jae Kim, Zekai Yin, Lei Lai, Jason Lee, Eshed Ohn-Bar
- Abstract: Modeling the nuanced, multimodal nature of human driving remains a core challenge for autonomous systems, as existing methods often fail to capture the diversity of plausible behaviors in complex real-world scenarios. In this work, we introduce a novel benchmark and end-to-end planner for modeling realistic multimodality in autonomous driving decisions. We propose a Gaussian Mixture Model (GMM)-based diffusion model designed to explicitly capture human-like, multimodal driving decisions in diverse contexts. Our model achieves state-of-the-art performance on current benchmarks, but reveals weaknesses in standard evaluation practices, which rely on single ground-truth trajectories or coarse closed-loop metrics while often penalizing diverse yet plausible alternatives. To address this limitation, we further develop a human-in-the-loop simulation benchmark that enables finer-grained evaluations and measures multimodal realism in challenging driving settings. Our code, models, and benchmark data will be released to promote more accurate and human-aware evaluation of autonomous driving models.
- PDF: https://openreview.net/pdf?id=jedBaI1fgU
- Forum: https://openreview.net/forum?id=jedBaI1fgU
CaRL: Learning Scalable Planning Policies with Simple Rewards
- Authors: Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, Andreas Geiger
- Abstract: We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.
- PDF: https://openreview.net/pdf?id=1otaE496Vm
- Forum: https://openreview.net/forum?id=1otaE496Vm
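The reward design the abstract describes, route completion as the single main term with infractions either terminating the episode or multiplicatively shrinking the reward, can be sketched in a few lines. Function and argument names here are hypothetical, not taken from the paper.

```python
def step_reward(progress_delta, infraction_multiplier=1.0, terminal_infraction=False):
    """Single-term reward: route-completion progress made this step.

    Soft infractions shrink the reward multiplicatively; hard infractions
    end the episode with no reward for the step. Returns (reward, done).
    """
    if terminal_infraction:
        return 0.0, True
    return progress_delta * infraction_multiplier, False
```

The point of the design is that the optimization target stays a single intuitive quantity, rather than a weighted sum of many shaped terms.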
CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation
- Authors: Joonkyung Kim, Joonyeol Sim, Woojun Kim, Katia P. Sycara, Changjoo Nam
- Abstract: We propose CARE (Collision Avoidance via Repulsive Estimation) for improving the robustness of learning-based visual navigation methods. Recently, visual navigation models, particularly foundation models, have demonstrated promising performance by generating viable trajectories using only RGB images. However, these policies can generalize poorly to environments containing out-of-distribution (OOD) scenes characterized by unseen objects or different camera setups (e.g., variations in field of view, camera pose, or focal length). Without fine-tuning, such models could produce trajectories that lead to collisions, necessitating substantial efforts in data collection and additional training. To address this limitation, we introduce CARE, an attachable module that enhances the safety of visual navigation without requiring additional range sensors or fine-tuning of pretrained models. CARE can be integrated seamlessly into any RGB-based navigation model that generates local robot trajectories. It dynamically adjusts trajectories produced by a pretrained model using repulsive force vectors computed from depth images estimated directly from RGB inputs. We evaluate CARE by integrating it with state-of-the-art visual navigation models across diverse robot platforms. Real-world experiments show that CARE significantly reduces collisions (up to 100%) without compromising navigation performance in goal-conditioned navigation, and further improves collision-free travel distance (up to 10.7×) in exploration tasks.
- PDF: https://openreview.net/pdf?id=JrhMGXZnja
- Forum: https://openreview.net/forum?id=JrhMGXZnja
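The repulsive-force adjustment CARE applies can be illustrated with a classic potential-field computation: obstacle points in the robot's frame (as might be recovered from depth estimated from RGB) within an influence radius each push the robot away. This is a generic potential-field sketch under assumed names, not the paper's exact formulation.

```python
import math

def repulsive_vector(obstacle_points, influence_radius=1.0, gain=0.5):
    """Sum classic potential-field repulsion over nearby 2D obstacle points.

    Each point within influence_radius pushes the robot (at the origin)
    directly away from it; farther points contribute nothing.
    """
    fx = fy = 0.0
    for x, y in obstacle_points:
        d = math.hypot(x, y)
        if 0.0 < d < influence_radius:
            mag = gain * (1.0 / d - 1.0 / influence_radius) / (d * d)
            fx -= mag * x / d  # unit vector pointing away from the obstacle
            fy -= mag * y / d
    return fx, fy
```

A trajectory waypoint can then be nudged by this vector before execution, which is the spirit of adjusting a pretrained model's output without retraining it.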
CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models
- Authors: Huihan Liu, Rutav Shah, Shuijing Liu, Jack Pittenger, Mingyo Seo, Yuchen Cui, Yonatan Bisk, Roberto Martín-Martín, Yuke Zhu
- Abstract: Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines. More information is available at https://casper-corl25.github.io/.
- PDF: https://openreview.net/pdf?id=bU15EK0oqk
- Forum: https://openreview.net/forum?id=bU15EK0oqk
Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination
- Authors: Kevin Fu, Shalin Jain, Pierce Howell, Harish Ravichandar
- Abstract: Recent advances have enabled heterogeneous multi-robot teams to learn complex and effective coordination skills. However, existing neural architectures that support heterogeneous teaming tend to force a trade-off between expressivity and efficiency. Shared-parameter designs prioritize sample efficiency by enabling a single network to be shared across all or a pre-specified subset of robots (via input augmentations), but tend to limit behavioral diversity. In contrast, recent designs employ a separate policy for each robot, enabling greater diversity and expressivity at the cost of efficiency and generalization. Our key insight is that such tradeoffs can be avoided by viewing these design choices as ends of a broad spectrum. Inspired by recent work in transfer and meta learning, and building on prior work in multi-robot task allocation, we propose Capability-Aware Shared Hypernetworks (CASH), a soft weight sharing architecture that uses hypernetworks to efficiently learn a flexible shared policy that dynamically adapts to each robot post-training. By explicitly encoding the impact of robot capabilities (e.g., speed and payload) on collective behavior, CASH enables zero-shot generalization to unseen robots or team compositions. Our experiments involve multiple heterogeneous tasks, three learning paradigms (imitation learning, value-based, and policy-gradient RL), and SOTA multi-robot simulation (JaxMARL) and hardware (Robotarium) platforms. Across all conditions, we find that CASH generates appropriately-diverse behaviors and consistently outperforms baseline architectures in terms of performance and sample efficiency during both training and zero-shot generalization, all with 60%-80% fewer learnable parameters.
- PDF: https://openreview.net/pdf?id=qoKo2caB9B
- Forum: https://openreview.net/forum?id=qoKo2caB9B
CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
- Authors: Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, Ruimao Zhang
- Abstract: Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. In practice, to further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
- PDF: https://openreview.net/pdf?id=FwLMCbs47K
- Forum: https://openreview.net/forum?id=FwLMCbs47K
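The caching mechanism the abstract mentions, storing attention key-value pairs from previous timesteps so autoregressive inference avoids redundant computation, reduces to a memoization pattern. A minimal sketch with hypothetical names (the paper's cache holds transformer attention tensors, not arbitrary values):

```python
class KVCache:
    """Memoize per-timestep results so each earlier step is computed once."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # counts how many times we actually computed

    def get_or_compute(self, t, compute_fn):
        # Reuse the stored result for timestep t when available.
        if t not in self._store:
            self.misses += 1
            self._store[t] = compute_fn(t)
        return self._store[t]
```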
CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks
- Authors: Ce Hao, Anxing Xiao, Zhiwei Xue, Harold Soh
- Abstract: Diffusion-based planners have shown strong performance in short-horizon tasks but often fail in complex, long-horizon settings. We trace the failure to loose coupling between high-level (HL) sub-goal selection and low-level (LL) trajectory generation, which leads to incoherent plans and degraded performance. We propose Coupled Hierarchical Diffusion (CHD), a framework that models HL sub-goals and LL trajectories jointly within a unified diffusion process. A shared classifier passes LL feedback upstream so that sub-goals self-correct while sampling proceeds. This tight HL–LL coupling improves trajectory coherence and enables scalable long-horizon diffusion planning. Experiments across maze navigation, tabletop manipulation, and household environments show that CHD consistently outperforms both flat and hierarchical diffusion baselines.
- PDF: https://openreview.net/pdf?id=tXY6VQlXfA
- Forum: https://openreview.net/forum?id=tXY6VQlXfA
CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation
- Authors: Sung-Wook Lee, Xuhui Kang, Brandon Y. Yang, Yen-Ling Kuo
- Abstract: Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shift with different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC’s tendency to overfit individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with similarity-weighted positive pairs. We evaluate CLASS on 5 simulation benchmarks and 3 real-world tasks to achieve competitive results using retrieval-based control with representations only. Most notably, for downstream policy learning under significant visual shifts, CLASS achieves an average success rate of 70% with Diffusion Policy, while all other baseline methods fail to perform competitively.
- PDF: https://openreview.net/pdf?id=9f3klkpa4y
- Forum: https://openreview.net/forum?id=9f3klkpa4y
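The weak supervision CLASS relies on, similarity between action sequences measured by Dynamic Time Warping, can be sketched with the textbook DTW recurrence over scalar action sequences plus an exponential similarity weight. The DTW part is standard; the weighting form (exponential of negative distance over a temperature) is an illustrative assumption, not necessarily the paper's exact choice.

```python
import math

def dtw_distance(a, b):
    """Textbook dynamic-time-warping distance between two scalar sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best alignment ending at (i, j): match, insertion, or deletion.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

def positive_pair_weight(seq_a, seq_b, temperature=1.0):
    """Map DTW distance to a (0, 1] similarity weight for a soft InfoNCE loss."""
    return math.exp(-dtw_distance(seq_a, seq_b) / temperature)
```

Demonstration pairs with more similar action sequences then receive larger weights as positives in the contrastive objective.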
CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception
- Authors: Pranav N. Thakkar, Shubhangi Sinha, Karan Baijal, Yuhan (Anjelica) Bian, Leah Lackey, Ben Dodson, Heisen Kong, Jueun Kwon, Amber Li, Yifei Hu, alexios rekoutis, Tom Silver, Tapomayukh Bhattacharjee
- Abstract: Robust robot manipulation in unstructured environments often requires understanding object properties that extend beyond geometry, such as material or compliance—properties that can be challenging to infer using vision alone. Multimodal haptic sensing provides a promising avenue for inferring such properties, yet progress has been constrained by the lack of large, diverse, and realistic haptic datasets. In this work, we introduce the CLAMP device, a low-cost (< $200) sensorized reacher-grabber designed to collect large-scale, in-the-wild multimodal haptic data from non-expert users in everyday settings. We deployed 16 CLAMP devices to 41 participants, resulting in the CLAMP dataset, the largest open-source multimodal haptic dataset to date, comprising 12.3 million datapoints across 5357 household objects. Using this dataset, we train a haptic encoder that can infer material and compliance object properties from multimodal haptic data. We leverage this encoder to create the CLAMP model, a visuo-haptic perception model for material recognition that generalizes to novel objects and three robot embodiments with minimal finetuning. We also demonstrate the effectiveness of our model in three real-world robot manipulation tasks: sorting recyclable and non-recyclable waste, retrieving objects from a cluttered bag, and distinguishing overripe from ripe bananas. Our results show that large-scale, in-the-wild haptic data collection can unlock new capabilities for generalizable robot manipulation.
- PDF: https://openreview.net/pdf?id=zgVaMD0QjZ
- Forum: https://openreview.net/forum?id=zgVaMD0QjZ
CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks
- Authors: Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, Siyuan Huang
- Abstract: Humanoid robot teleoperation plays a vital role in demonstrating and collecting data for complex interactions. Current methods suffer from two key limitations: (1) restricted controllability due to decoupled upper- and lower-body control, and (2) severe drift caused by open-loop execution. These issues prevent humanoid robots from performing coordinated whole-body motions required for long-horizon loco-manipulation tasks. We introduce CLONE, a whole-body teleoperation system that overcomes these challenges through three key contributions: (1) a Mixture-of-Experts (MoE) whole-body control policy that enables complex coordinated movements, such as “picking up an object from the ground” and “placing it in a distant bin”; (2) a closed-loop error correction mechanism using LiDAR odometry, reducing translational drift to 12cm over 8.9-meter trajectories; and (3) a systematic data augmentation strategy that ensures robust performance under diverse, previously unseen operator poses. In extensive experiments, CLONE demonstrates robust performance across diverse scenarios while maintaining stable whole-body control. These capabilities significantly advance humanoid robotics by enabling the collection of long-horizon interaction data and establishing a foundation for more sophisticated humanoid-environment interaction in both research and practical applications.
- PDF: https://openreview.net/pdf?id=Bw9NHYjDqR
- Forum: https://openreview.net/forum?id=Bw9NHYjDqR
ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes
- Authors: Zeyuan Chen, Qiyang Yan, Yuanpei Chen, Tianhao Wu, Jiyao Zhang, Zihan Ding, Jinzhou Li, Yaodong Yang, Hao Dong
- Abstract: Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization. We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation using clutter density curriculum learning, incorporating both a novel geometry- and spatially-embedded scene representation and a comprehensive safety curriculum, enabling general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher’s knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations. To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts.
- PDF: https://openreview.net/pdf?id=4XKKUifQ9c
- Forum: https://openreview.net/forum?id=4XKKUifQ9c
CogniPlan: Uncertainty-Guided Path Planning with Conditional Generative Layout Prediction
- Authors: Yizhuo Wang, Haodong He, Jingsong Liang, Yuhong Cao, Ritabrata Chakraborty, Guillaume Adrien Sartoretti
- Abstract: Path planning in unknown environments is a crucial yet inherently challenging capability for mobile robots, which primarily encompasses two coupled tasks: autonomous exploration and point-goal navigation. In both cases, the robot must perceive the environment, update its belief, and accurately estimate potential information gain on the fly to guide planning. In this work, we propose CogniPlan, a novel path planning framework that leverages multiple plausible layouts predicted by a conditional generative inpainting model, mirroring how humans rely on cognitive maps during navigation. These predictions, based on the partially observed map and a set of layout conditioning vectors, enable our planner to reason effectively under uncertainty. We demonstrate strong synergy between generative image-based layout prediction and graph-attention-based path planning, allowing CogniPlan to combine the scalability of graph representations with the fidelity and predictiveness of occupancy maps, yielding notable performance gains in both exploration and navigation. We extensively evaluate CogniPlan on two datasets (hundreds of maps and realistic floor plans), consistently outperforming state-of-the-art planners. We further deploy it in a high-fidelity simulator and on hardware, showcasing its high-quality path planning and real-world applicability.
- PDF: https://openreview.net/pdf?id=uA9GZEmGiT
- Forum: https://openreview.net/forum?id=uA9GZEmGiT
COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning
- Authors: Sateesh Kumar, Shivin Dass, Georgios Pavlakos, Roberto Martín-Martín
- Abstract: In this work, we study the problem of data retrieval for few-shot imitation learning: select data from a large dataset to train a performant policy for a specific task, given only a few target demonstrations. Prior methods retrieve data using a single-feature distance heuristic, assuming that the best demonstrations are those that most closely resemble the target examples in visual, semantic, or motion space. However, this approach captures only a subset of the relevant information and is prone to introducing detrimental demonstrations, e.g., retrieving data from unrelated tasks due to similar scene layouts, or selecting similar motions from tasks with divergent goals. We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning that uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. COLLAGE follows a simple, but flexible and efficient data aggregation recipe: it assigns weights to subsets of the dataset that are pre-selected using a single feature (e.g., appearance, shape, or language similarity), based on their task relevance, measured by how well a policy trained on each subset predicts actions in the few target demonstrations. These weights are then used during policy training to perform importance sampling over the aggregated dataset, sampling data more densely or sparsely, according to their estimated relevance. This weighted aggregation strategy is general and feature-agnostic, allowing COLLAGE to combine and leverage any number of subsets selected by any retrieval heuristic or method, and to identify which subset provides the most benefit for the target task. In extensive experiments, COLLAGE outperforms state-of-the-art retrieval and multi-task learning approaches, achieving a 5.1% improvement over the best baseline in simulation across 10 tasks, and a 16.6% improvement in the real world across 6 tasks. For our real-world experiments, we include data selection from the large-scale, real-world DROID dataset, significantly improving few-shot imitation policy training. More information at: https://collagecorl25.github.io/.
- PDF: https://openreview.net/pdf?id=B6knAJsB9P
- Forum: https://openreview.net/forum?id=B6knAJsB9P
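The subset-weighting step COLLAGE performs, scoring each pre-selected subset by how well a policy trained on it predicts actions in the target demonstrations and then sampling in proportion, can be sketched as a softmax over negative validation losses. The softmax-with-temperature form and the function names are illustrative assumptions, not the paper's stated formulation.

```python
import math

def subset_weights(action_prediction_losses, temperature=1.0):
    """Turn per-subset losses on the target demos into sampling weights.

    Lower loss (better action prediction on the target task) yields a
    higher weight; weights sum to 1 and can drive importance sampling
    over the aggregated dataset during policy training.
    """
    scores = [-loss / temperature for loss in action_prediction_losses]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the weights depend only on scalar relevance scores, the recipe is agnostic to which retrieval heuristic produced each subset.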
COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping
- Authors: Jun Yamada, Alexander Luis Mitchell, Jack Collins, Ingmar Posner
- Abstract: This paper addresses the challenge of occluded robot grasping, i.e. grasping in situations where the desired grasp poses are kinematically infeasible due to environmental constraints such as surface collisions. Existing RL methods struggle with task complexity, and collecting expert demonstrations is often impractical. Instead, inspired by human bimanual manipulation strategies, where two hands coordinate to stabilise and reorient objects, we focus on a bimanual robotic setup to tackle this challenge. In particular, we introduce Constraint-based Manipulation for Bimanual Occluded Grasping (COMBO-Grasp), an approach which leverages two coordinated policies: a constraint policy trained using self-supervised datasets to generate stabilising poses and a grasping policy trained using RL that reorients and grasps the target object. A key contribution lies in value function-guided policy coordination, where gradients from a jointly trained value function refine the constraint policy during RL training to improve bimanual coordination and task performance. Lastly, COMBO-Grasp employs teacher-student policy distillation to effectively deploy vision-based policies in real-world environments. Experiments show that COMBO-Grasp significantly outperforms baselines and generalises to unseen objects in both simulation and real environments.
- PDF: https://openreview.net/pdf?id=xpEjjGC82v
- Forum: https://openreview.net/forum?id=xpEjjGC82v
ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion
- Authors: Zichao Hu, Chen Tang, Michael Joseph Munje, Yifeng Zhu, Alex Liu, Shuijing Liu, Garrett Warnell, Peter Stone, Joydeep Biswas
- Abstract: This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “overtake the pedestrian while staying on the right side of the road” consists of two specifications: “overtake the pedestrian” and “walk on the right side of the road.” To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap composing baselines.
- PDF: https://openreview.net/pdf?id=FBsawSyYBM
- Forum: https://openreview.net/forum?id=FBsawSyYBM
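The parallel composition of separately learned primitives can be illustrated with a toy score-composition update. This is a noise-free, scalar stand-in under stated assumptions (each "primitive" is a Gaussian whose score pulls toward its mean; the function name is hypothetical), not ComposableNav's actual diffusion sampler:

```python
import numpy as np

def compose_and_sample(x0, primitive_scores, step_size=0.1, steps=200):
    """Gradient-style sampling with the summed scores of all primitives,
    driving the sample toward states that satisfy every specification at
    once (toy stand-in for composing diffusion models in parallel)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        total = sum(fn(x) for fn in primitive_scores)  # composed score
        x = x + step_size * total
    return x
```

With two Gaussian primitives centered at 1 and 3, the composed update settles at 2, the point that best satisfies both specifications jointly.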
Constraint-Aware Diffusion Guidance for Robotics: Real-Time Obstacle Avoidance for Autonomous Racing
- Authors: Hao Ma, Sabrina Bodmer, Andrea Carron, Melanie Zeilinger, Michael Muehlebach
- Abstract: Diffusion models hold great potential in robotics due to their ability to capture complex, high-dimensional data distributions. However, their lack of constraint-awareness limits their deployment in safety-critical applications. We propose Constraint-Aware Diffusion Guidance (CoDiG), a data-efficient and general-purpose framework that integrates barrier functions into the denoising process, guiding diffusion sampling toward constraint-satisfying outputs. CoDiG enables constraint satisfaction even with limited training data and generalizes across tasks. We evaluate our framework in the challenging setting of miniature autonomous racing, where real-time obstacle avoidance is essential. Real-world experiments show that CoDiG generates safe outputs efficiently under dynamic conditions, highlighting its potential for broader robotic applications.
- PDF: https://openreview.net/pdf?id=nryBWao01j
- Forum: https://openreview.net/forum?id=nryBWao01j
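The idea of steering denoising with a barrier function can be sketched as follows. This is a minimal 2D toy under assumptions of my own (a single circular obstacle, a `-log(distance - radius)` barrier, and hypothetical function names), not the CoDiG implementation:

```python
import numpy as np

def barrier_descent_dir(x, center, radius, eps=1e-6):
    """Descent direction of the log-barrier penalty -log(||x - c|| - r)
    for a circular obstacle: it points away from the obstacle and grows
    as the sample approaches the constraint boundary."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    dist = np.linalg.norm(d) + eps
    margin = max(dist - radius, eps)
    return d / (dist * margin)

def guided_denoise_step(x, denoise_fn, center, radius, guidance_scale=0.01):
    """One constraint-aware update: apply the learned denoiser, then
    nudge the sample along the barrier descent direction so sampling
    stays in the constraint-satisfying region."""
    x = denoise_fn(np.asarray(x, dtype=float))
    return x + guidance_scale * barrier_descent_dir(x, center, radius)
```

Because the barrier gradient blows up near the boundary, samples close to an obstacle receive a strong corrective push, while samples far away are left essentially untouched by the guidance term.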
Constraint-Preserving Data Generation for One-Shot Visuomotor Policy Generalization
- Authors: Kevin Lin, Varun Ragunath, Andrew McAlinden, Aaditya Prasad, Jimmy Wu, Yuke Zhu, Jeannette Bohg
- Abstract: Large-scale demonstration data has powered key breakthroughs in robot manipulation, but collecting that data remains costly and time-consuming. To this end, we present Constraint-Preserving Data Generation (CP-Gen), a method that uses a single expert trajectory to generate robot demonstrations containing novel object geometries and poses. These generated demonstrations are used to train closed-loop visuomotor policies that transfer zero-shot to the real world. Similar to prior data-generation work focused on pose variations, CP-Gen first decomposes expert demonstrations into free-space motions and robot skills. Unlike prior work, we achieve geometry-aware data generation by formulating robot skills as keypoint-trajectory constraints: keypoints on the robot or grasped object must track a reference trajectory defined relative to a task-relevant object. To generate a new demonstration, CP-Gen samples pose and geometry transforms for each task-relevant object, then applies these transforms to the object and its associated keypoints or keypoint trajectories. We optimize robot joint configurations so that the keypoints on the robot or grasped object track the transformed keypoint trajectory, and then motion plan a collision-free path to the first optimized joint configuration. Using demonstrations generated by CP-Gen, we train visuomotor policies that generalize across variations in object geometries and poses. Experiments on 16 simulation tasks and four real-world tasks, featuring multi-stage, non-prehensile and tight-tolerance manipulation, show that policies trained using our method achieve an average success rate of 77%, outperforming the best baseline which achieves an average success rate of 50%.
- PDF: https://openreview.net/pdf?id=KSKzA1mwKs
- Forum: https://openreview.net/forum?id=KSKzA1mwKs
Contrastive Forward Prediction Reinforcement Learning for Adaptive Fault-Tolerant Legged Robots
- Authors: Yangqing Fu, Yang Zhang, Qiyue Yang, Liyun Yan, Zhanxiang Cao, Yue Gao
- Abstract: In complex environments, adaptive and fault-tolerant capabilities are essential for legged robot locomotion. To address this challenge, this study proposes a reinforcement learning framework that integrates contrastive learning with forward prediction to achieve fault-tolerant locomotion for legged robots. This framework constructs a forward prediction model with contrastive learning, incorporating a comparator and a forward model. The forward model predicts the robot’s subsequent state, and the comparator compares these predictions with actual states to generate critical prediction errors. These errors are systematically integrated into the controller, facilitating the continuous adjustment and refinement of control signals. Experiments on quadruped robots across different terrains and various joint damage scenarios have verified the effectiveness of our method, especially the functions of the comparator and the forward model. Furthermore, robots can adapt to locked joints without prior training, demonstrating zero-shot transfer capability. Finally, the proposed method demonstrates universal applicability to both quadruped and hexapod robots, highlighting its potential for broader applications in legged robotics.
- PDF: https://openreview.net/pdf?id=P0uqo7CpL8
- Forum: https://openreview.net/forum?id=P0uqo7CpL8
ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
- Authors: Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
- Abstract: Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations — a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA’s extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
- PDF: https://openreview.net/pdf?id=kXhOmN3x18
- Forum: https://openreview.net/forum?id=kXhOmN3x18
CoRI: Communication of Robot Intent for Physical Human-Robot Interaction
- Authors: Junxiang Wang, Emek Barış Küçüktabak, Rana Soltani Zarrin, Zackory Erickson
- Abstract: Clear communication of robot intent fosters transparency and interpretability in physical human-robot interaction (pHRI), particularly during assistive tasks involving direct human-robot contact. We introduce CoRI, a pipeline that automatically generates natural language communication of a robot’s upcoming actions directly from its motion plan and visual perception. Our pipeline first processes the robot’s image view to identify human poses and key environmental features. It then encodes the planned 3D spatial trajectory (including velocity and force) onto this view, visually grounding the path and its dynamics. CoRI queries a vision-language model with this visual representation to interpret the planned action within the visual context before generating concise, user-directed statements, without relying on task-specific information. Results from a user study involving robot-assisted feeding, bathing, and shaving tasks across two different robots indicate that CoRI leads to statistically significant difference in communication clarity compared to a baseline communication strategy. Specifically, CoRI effectively conveys not only the robot’s high-level intentions but also crucial details about its motion and any collaborative user action needed.
- PDF: https://openreview.net/pdf?id=dBaSaa7qi4
- Forum: https://openreview.net/forum?id=dBaSaa7qi4
Cost-aware Discovery of Contextual Failures using Bayesian Active Learning
- Authors: Anjali Parashar, Joseph Zhang, Yingke Li, Chuchu Fan
- Abstract: Ensuring the robustness of robotic systems is crucial for their deployment in safety-critical domains. Failure discovery, or falsification, is a widely used approach for evaluating robustness, with recent advancements focusing on improving sample efficiency and generalization through probabilistic sampling techniques and learning-theoretic approaches. However, existing methods typically rely on explicitly defined analytical cost functions to characterize failures, often overlooking the underlying causes and diversity of discovered failure scenarios. In this work, we propose a novel failure discovery framework that integrates contextual reasoning in the falsification process, specifically tailored for high evaluation-cost applications. Our method incorporates expert-in-the-loop feedback to construct a probabilistic surrogate model of failures using Bayesian inference. This model is iteratively refined and leveraged to guide an active learning strategy that prioritizes the discovery of diverse failure cases. We empirically validate our approach across a range of tasks for high-cost contextual falsification in robotic manipulation and autonomous driving.
- PDF: https://openreview.net/pdf?id=f2Y549UzM5
- Forum: https://openreview.net/forum?id=f2Y549UzM5
Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
- Authors: Tyler Ga Wei Lum, Olivia Y. Lee, Karen Liu, Jeannette Bohg
- Abstract: Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels and human-robot embodiment differences. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the embodiment gap without relying on wearables, teleoperation, or large-scale data collection. From the video, we extract: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. These components enable effective policy learning without any task-specific reward tuning. In the single human demo regime, Human2Sim2Robot outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks. Website: https://human2sim2robot.github.io.
- PDF: https://openreview.net/pdf?id=CgGSFtjplI
- Forum: https://openreview.net/forum?id=CgGSFtjplI
Cross-Sensor Touch Generation
- Authors: Samanta Rodriguez, Yiming Dou, Miquel Oller, Andrew Owens, Nima Fazeli
- Abstract: Today’s visuo-tactile sensors come in many shapes and sizes, making it challenging to develop general-purpose tactile representations. This is because most models are tied to a specific sensor design. To address this challenge, we propose two approaches to cross-sensor image generation. The first is an end-to-end method that leverages paired data (Touch2Touch). The second method builds an intermediate depth representation and does not require paired data (T2D2: Touch-to-Depth-to-Touch). Both methods enable the use of sensor-specific models across multiple sensors via the cross-sensor touch generation process. Together, these models offer flexible solutions for sensor translation, depending on data availability and application needs. We demonstrate their effectiveness on downstream tasks such as cup stacking and tool insertion, where models originally designed for one sensor are successfully transferred to another using in-hand pose estimation.
- PDF: https://openreview.net/pdf?id=oGcC8nMOit
- Forum: https://openreview.net/forum?id=oGcC8nMOit
CUPID: Curating Data your Robot Loves with Influence Functions
- Authors: Christopher Agia, Rohan Sinha, Jingyun Yang, Rika Antonova, Marco Pavone, Haruki Nishimura, Masha Itkina, Jeannette Bohg
- Abstract: In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes—such as closed-loop task success or failure—remains a persistent challenge. Inspired by the theory of influence functions, we propose CUPID. Given a set of evaluation rollouts, CUPID estimates the influence of a training demonstration on the policy’s expected return. This enables ranking and selection of demonstrations according to their impact on the policy’s closed-loop performance. We use our estimator to curate data by 1) filtering out training demonstrations that harmed the policy’s performance and 2) subselecting newly collected trajectories that will most help improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can result in state-of-the-art diffusion policies on the simulated Robomimic benchmark, and we observe similar improvements in hardware experiments. Furthermore, our hardware experiments show that our influence-based estimator can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance post-training of generalist policies.
- PDF: https://openreview.net/pdf?id=TqevdDMqrK
- Forum: https://openreview.net/forum?id=TqevdDMqrK
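The curation step built on top of CUPID's influence estimates can be sketched as a simple ranking-and-filtering rule. The estimator itself is the paper's contribution; here the scores are taken as given, and the function name and thresholding choices are illustrative assumptions:

```python
import numpy as np

def curate_by_influence(demos, influence_scores, keep_fraction=None):
    """Rank demonstrations by estimated influence on expected return;
    drop those estimated to hurt performance (negative influence), or
    keep only the top fraction when a collection budget is given."""
    scores = np.asarray(influence_scores, dtype=float)
    order = np.argsort(-scores)  # most helpful first
    if keep_fraction is not None:
        k = max(1, int(len(demos) * keep_fraction))
        keep = order[:k]
    else:
        keep = order[scores[order] > 0]
    return [demos[i] for i in keep]
```

The same ranking serves both curation modes in the abstract: filtering existing demonstrations that harmed the policy, and subselecting the most helpful newly collected trajectories.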
Data Retrieval with Importance Weights for Few-Shot Imitation Learning
- Authors: Amber Xie, Rahul Chand, Dorsa Sadigh, Joey Hejna
- Abstract: While large-scale robot datasets have propelled recent progress in imitation learning, learning from smaller task specific datasets remains critical for deployment in new environments and unseen tasks. One such approach to few-shot imitation learning is retrieval-based imitation learning, which extracts relevant samples from large, widely available prior datasets to augment a limited demonstration dataset. To determine the relevant data from prior datasets, retrieval-based approaches most commonly calculate a prior data point’s minimum distance to a point in the target dataset in latent space. While retrieval-based methods have shown success using this metric for data selection, we demonstrate its equivalence to the limit of a Gaussian kernel density estimate (KDE) of the target data distribution. This reveals two shortcomings of the retrieval rule used in prior work. First, it relies on high-variance nearest neighbor estimates that are susceptible to noise. Second, it does not account for the distribution of prior data when retrieving data. To address these issues, we introduce Importance Weighted Retrieval (IWR), which estimates importance weights, or the ratio between the target and prior data distributions for retrieval, using Gaussian KDEs. By considering the probability ratio, IWR overcomes the bias of previous selection rules, and by using reasonable modeling parameters, IWR effectively smooths estimates using all data points. Across both simulation environments and real-world evaluations on the Bridge dataset, we find that our method, IWR, consistently improves performance of existing retrieval-based methods, despite only requiring minor modifications.
- PDF: https://openreview.net/pdf?id=wnWYoetLhC
- Forum: https://openreview.net/forum?id=wnWYoetLhC
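The importance-weighting rule at the core of IWR has a compact form: score each prior-dataset point x by the density ratio p_target(x) / p_prior(x), with both densities estimated via Gaussian KDEs. A minimal numpy sketch (function names, bandwidth, and the fixed-bandwidth KDE are assumptions of this illustration, not the paper's exact modeling choices):

```python
import numpy as np

def make_gaussian_kde(points, bandwidth):
    """Return a density estimate p(x) = mean_i N(x; points_i, h^2 I)."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    d = points.shape[1]
    norm = (2 * np.pi * bandwidth**2) ** (d / 2)
    def density(x):
        sq = np.sum((points - np.asarray(x, dtype=float)) ** 2, axis=1)
        return np.mean(np.exp(-sq / (2 * bandwidth**2))) / norm
    return density

def importance_weights(prior_feats, target_feats, bandwidth=0.5, eps=1e-12):
    """Score each prior-dataset point by p_target(x) / p_prior(x);
    high ratios mark data worth retrieving for the target task."""
    p_t = make_gaussian_kde(target_feats, bandwidth)
    p_p = make_gaussian_kde(prior_feats, bandwidth)
    return np.array([p_t(x) / (p_p(x) + eps)
                     for x in np.atleast_2d(np.asarray(prior_feats, dtype=float))])
```

Unlike a pure nearest-neighbor distance, the ratio in the denominator discounts prior data that is merely abundant: a point that lies near the target distribution but in a sparsely populated region of the prior gets the largest weight.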
D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation
- Authors: I-Chun Arthur Liu, Jason Chen, Gaurav S. Sukhatme, Daniel Seita
- Abstract: Learning bimanual manipulation is challenging due to its high dimensionality and tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 180 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our anonymous website is at: https://dcodaaug.github.io/D-CODA/.
- PDF: https://openreview.net/pdf?id=LRG1xvtiwL
- Forum: https://openreview.net/forum?id=LRG1xvtiwL
Decentralized Aerial Manipulation of a Cable-Suspended Load Using Multi-Agent Reinforcement Learning
- Authors: Jack Zeng, Andreu Matoses Gimenez, Eugene Vinitsky, Javier Alonso-Mora, Sihao Sun
- Abstract: This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, or neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://github.com/anonymousCoRL/MDCM_CoRL2025.
- PDF: https://openreview.net/pdf?id=IuiB5iaMxy
- Forum: https://openreview.net/forum?id=IuiB5iaMxy
Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments
- Authors: Jiahui Yang, Jason Jingzhou Liu, Yulong Li, Youssef Khaky, Kenneth Shaw, Deepak Pathak
- Abstract: Generating collision-free motion in dynamic, partially observable environments is a fundamental challenge for robotic manipulators. Classical motion planners can compute globally optimal trajectories but require full environment knowledge and are typically too slow for dynamic scenes. Neural motion policies offer a promising alternative by operating in closed-loop directly on raw sensory inputs but often struggle to generalize in complex or dynamic settings. We propose Deep Reactive Policy (DRP), a visuo-motor neural motion policy designed for reactive motion generation in diverse dynamic environments, operating directly on point cloud sensory input. At its core is IMPACT, a transformer-based neural motion policy pretrained on 10 million generated expert trajectories across diverse simulation scenarios. We further improve IMPACT’s static obstacle avoidance through iterative student-teacher finetuning. We additionally enhance the policy’s dynamic obstacle avoidance at inference time using DCP-RMP, a locally reactive goal-proposal module. We evaluate DRP on challenging tasks featuring cluttered scenes, dynamic moving obstacles, and goal obstructions. DRP achieves strong generalization, outperforming prior classical and neural methods in success rate across both simulated and real-world settings. We will release the dataset, simulation environments, and trained models upon acceptance. Refer to supplementary material for videos.
- PDF: https://openreview.net/pdf?id=4eSv0QeYlz
- Forum: https://openreview.net/forum?id=4eSv0QeYlz
DemoSpeedup: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration
- Authors: Lingxiao Guo, Zhengrong Xue, Zijing Xu, Huazhe Xu
- Abstract: Imitation learning has shown great promise in robotic manipulation, but the policy’s execution is often unsatisfactorily slow due to commonly tardy demonstrations collected by human operators. In this work, we present DemoSpeedup, a self-supervised method to accelerate visuomotor policy execution via entropy-guided demonstration acceleration. DemoSpeedup starts from training an arbitrary generative policy (e.g., ACT or Diffusion Policy) on normal-speed demonstrations, which serves as a per-frame action entropy estimator. The key insight is that frames with lower action entropy estimates call for more consistent policy behaviors, which often indicate the demands for higher-precision operations. In contrast, frames with higher entropy estimates correspond to more casual sections, and therefore can be more safely accelerated. Thus, we segment the original demonstrations according to the estimated entropy, and accelerate them by down-sampling at rates that increase with the entropy values. Trained with the speedup demonstrations, the resulting policies execute up to 3 times faster while maintaining the task completion performance. Interestingly, these policies could even achieve higher success rates than those trained with normal-speed demonstrations, due to the benefits of reduced decision-making horizons.
- PDF: https://openreview.net/pdf?id=Tl7girqoLi
- Forum: https://openreview.net/forum?id=Tl7girqoLi
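The entropy-guided acceleration step can be sketched as entropy-dependent downsampling of a demonstration. The threshold and rates below are illustrative assumptions (the paper segments by estimated per-frame action entropy; exact rates are not specified in the abstract):

```python
import numpy as np

def speedup_demo(frames, entropies, low_rate=1, high_rate=3):
    """Keep every frame in low-entropy (high-precision) segments and
    subsample high-entropy (casual) segments more aggressively, so the
    resulting demonstration plays back faster where precision matters
    least. Threshold and rates are illustrative, not the paper's values."""
    entropies = np.asarray(entropies, dtype=float)
    threshold = np.median(entropies)
    kept, i = [], 0
    while i < len(frames):
        rate = high_rate if entropies[i] > threshold else low_rate
        kept.append(frames[i])
        i += rate
    return kept
```

A demonstration whose second half is high-entropy loses roughly two of every three frames there while the precise first half is preserved intact, which is how the sped-up training data shortens the effective decision-making horizon.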
DEQ-MPC : Deep Equilibrium Model Predictive Control
- Authors: Swaminathan Gurumurthy, Khai Nguyen, Arun L Bishop, J Zico Kolter, Zachary Manchester
- Abstract: Incorporating task-specific priors within a policy or network architecture is crucial for enhancing safety and improving representation and generalization in robotic control problems. Differentiable Model Predictive Control (MPC) layers have proven effective for embedding these priors, such as constraints and cost functions, directly within the architecture, enabling end-to-end training. However, current methods often treat the solver and the neural network as separate, independent entities, leading to suboptimal integration. In this work, we propose a novel approach that co-develops the solver and architecture unifying the optimization solver and network inference problems. Specifically, we formulate this as a \textit{joint fixed-point problem} over the coupled network outputs and necessary conditions of the optimization problem. We solve this problem in an iterative manner where we alternate between network forward passes and optimization iterations. Through extensive ablations in various robotic control tasks, we demonstrate that our approach results in richer representations and more stable training, while naturally accommodating warm starting, a key requirement for MPC.
- PDF: https://openreview.net/pdf?id=zQXurgHUVX
- Forum: https://openreview.net/forum?id=zQXurgHUVX
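The alternating scheme in the abstract, interleaving network forward passes with optimization iterations until the coupled system converges, can be shown with a scalar toy. The function names and the simple contraction maps are assumptions of this sketch, not the DEQ-MPC solver:

```python
def deq_mpc_fixed_point(net_step, opt_step, z0, x0, iters=100, tol=1e-9):
    """Alternate a network forward pass with one optimization iteration
    until the coupled pair (network output z, solver iterate x) stops
    changing, i.e. reaches a joint fixed point (scalar toy sketch)."""
    z, x = z0, x0
    for _ in range(iters):
        z_new = net_step(x)         # network inference conditioned on solver state
        x_new = opt_step(z_new, x)  # one solver iteration given the network output
        if abs(z_new - z) + abs(x_new - x) < tol:
            return z_new, x_new
        z, x = z_new, x_new
    return z, x
```

Because each outer iteration refines both quantities, later network passes see increasingly converged solver state, which is also what makes warm starting natural: a previous solution can seed `x0` directly.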
Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference Scoped Exploration
- Authors: Sirui Xu, Yu-Wei Chao, Liuyu Bian, Arsalan Mousavian, Yu-Xiong Wang, Liangyan Gui, Wei Yang
- Abstract: Hand–object motion-capture (MoCap) repositories provide abundant, contact-rich human demonstrations for scaling dexterous manipulation on robots. Yet demonstration inaccuracy and embodiment gaps between human and robot hands challenge direct policy learning. Existing pipelines adopt a three-stage workflow: retargeting, tracking, and residual correction. This multi-step process may not fully utilize demonstrations and can introduce compound errors. We introduce Reference-Scoped Exploration (RSE), a unified, single-loop optimization that integrates retargeting and tracking to train a scalable robot control policy directly from MoCap. Instead of treating demonstrations as strict ground truth, we view them as soft guidance. From raw demonstrations, we construct adaptive spatial scopes (time-varying termination boundaries), and reinforcement learning encourages the policy to stay within these envelopes while minimizing control effort. This holistic approach preserves demonstration intent, lets robot-specific strategies emerge, boosts robustness to noise, and scales effortlessly with large-scale demonstrations. We distill the scaled tracking policy into a vision-based, skill-conditioned generative control policy. This distilled policy captures diverse manipulation skills within a rich latent representation, enabling generalization across various objects and real-world robotic manipulation.
- PDF: https://openreview.net/pdf?id=gyihSZwQbR
- Forum: https://openreview.net/forum?id=gyihSZwQbR
DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
- Authors: Suzannah Wistreich, Baiyu Shi, Stephen Tian, Samuel Clarke, Michael Nath, Chengyi Xu, Zhenan Bao, Jiajun Wu
- Abstract: Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin’s capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin’s suitability and practicality for learning real-world, contact-rich manipulation.
- PDF: https://openreview.net/pdf?id=CNPCSuwxJw
- Forum: https://openreview.net/forum?id=CNPCSuwxJw
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
- Authors: Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, Feifei Feng
- Abstract: Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA’s adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like OpenVLA and $\pi_{0}$.
- PDF: https://openreview.net/pdf?id=RFmezNsPWV
- Forum: https://openreview.net/forum?id=RFmezNsPWV
DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
- Authors: Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, Shuran Song
- Abstract: We present DexUMI - a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI incorporates hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap with a wearable hand exoskeleton. It allows direct haptic feedback in manipulation data collection and adapts human motion to feasible robot hand motion. Our software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI’s capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.
- PDF: https://openreview.net/pdf?id=XrgRvBklWu
- Forum: https://openreview.net/forum?id=XrgRvBklWu
Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation
- Authors: Tongxuan Tian, Haoyang Li, Bo Ai, Xiaodi Yuan, Zhiao Huang, Hao Su
- Abstract: Cloth manipulation is challenging due to its highly complex dynamics, near-infinite degrees of freedom, and frequent self-occlusions, which complicate both state estimation and dynamics modeling. Inspired by recent advances in generative models, we hypothesize that these expressive models can effectively capture intricate cloth configurations and deformation patterns from data. Therefore, we propose a diffusion-based generative approach for both perception and dynamics modeling. Specifically, we formulate state estimation as reconstructing full cloth states from partial observations and dynamics modeling as predicting future states given the current state and robot actions. Leveraging a transformer-based diffusion model, our method achieves accurate state reconstruction and reduces long-horizon dynamics prediction errors by an order of magnitude compared to prior approaches. We integrate our dynamics models with model-predictive control and show that our framework enables effective cloth folding on real robotic systems, demonstrating the potential of generative models for deformable object manipulation under partial observability and complex dynamics.
- PDF: https://openreview.net/pdf?id=oDUbsdc0Ru
- Forum: https://openreview.net/forum?id=oDUbsdc0Ru
Diffusion-Guided Multi-Arm Motion Planning
- Authors: Viraj Parimi, Brian C. Williams
- Abstract: Multi-arm motion planning is fundamental for enabling arms to complete collaborative tasks in shared spaces, but current methods struggle with scalability due to exponential state-space growth and reliance on large training datasets for learned models. Inspired by Multi-Agent Path Finding (MAPF), which decomposes planning into single-agent problems coupled with collision resolution, we propose a novel diffusion-guided multi-arm planner (DG-MAP) that enhances the scalability of learning-based models while reducing their reliance on massive multi-arm datasets. Recognizing that collisions are primarily pairwise, we train two conditional diffusion models: one to generate feasible single-arm trajectories, and a second to model the dual-arm dynamics required for effective pairwise collision resolution. By integrating these specialized generative models within a MAPF-inspired structured decomposition, our planner efficiently scales to a larger number of arms. Evaluations against alternative learning-based methods across various team sizes demonstrate our method’s effectiveness and practical applicability. Code and data will be made publicly available. View video demonstrations in our supplementary material.
- PDF: https://openreview.net/pdf?id=AO0BKxf3ss
- Forum: https://openreview.net/forum?id=AO0BKxf3ss
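The MAPF-style decomposition DG-MAP builds on (independent single-arm plans plus pairwise conflict resolution) can be pictured with a deliberately tiny sketch. Here 1D positions stand in for arm configurations, and a simple time delay replaces the paper's learned dual-arm diffusion model; everything below is illustrative, not the authors' implementation:

```python
def collides(p, q, radius=1.0):
    """Two 'arms' conflict if their (1D) positions are too close."""
    return abs(p - q) < radius

def first_conflict(t1, t2):
    """Return the first timestep at which two trajectories collide, else None."""
    for t in range(min(len(t1), len(t2))):
        if collides(t1[t], t2[t]):
            return t
    return None

def resolve_by_delay(t1, t2, max_delay=10):
    """MAPF-style pairwise resolution: keep arm 1's plan fixed and delay
    arm 2 at its start configuration until the conflict disappears."""
    for d in range(max_delay + 1):
        delayed = [t2[0]] * d + t2
        if first_conflict(t1, delayed) is None:
            return delayed
    return None

# Two independently planned single-"arm" trajectories that would collide mid-way.
arm1 = [0.0, 1.0, 2.0, 3.0, 4.0]
arm2 = [4.0, 3.0, 2.0, 1.0, 0.0]
print(first_conflict(arm1, arm2))        # they meet at t = 2
arm2_fixed = resolve_by_delay(arm1, arm2)
print(first_conflict(arm1, arm2_fixed))  # None after a one-step delay
```

The point of the decomposition is visible even at this scale: each arm is planned alone, and only the detected pairwise conflict triggers coordination.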
DinIVA: Learning In-Context Adaptability to Pre-Trained Vision-Language-Action Models
- Authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, Insup Lee
- Abstract: Multi-task "vision-language-action" (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics, achieving non-trivial performance out of the box on new tasks in new environments. However, for such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile and highly useful interface for easily teaching new tasks with no parameter finetuning. Unfortunately, VLAs pre-trained with imitation learning objectives do not naturally acquire ICL abilities. In this paper, we demonstrate that, with the right finetuning recipe and a small robot demonstration dataset, it is possible to inject in-context adaptability post hoc into such a VLA. After retraining for in-context learning (RICL), our system permits an end user to provide a small number (10-20) of demonstrations for a new task. RICL then fetches the most relevant portions of those demonstrations into the VLA context to exploit ICL, performing the new task and boosting task performance. We apply RICL to inject ICL into the $\pi_0$-FAST VLA, and show that it permits large in-context improvements for a variety of new manipulation tasks with only 20 demonstrations per task, without any parameter updates. When parameter updates on the target task demonstrations are possible, RICL finetuning further boosts performance. We release code and model weights for RICL-$\pi_0$-FAST alongside the paper to enable, for the first time, a simple in-context learning interface for new manipulation tasks.
- PDF: https://openreview.net/pdf?id=6AASPlloSt
- Forum: https://openreview.net/forum?id=6AASPlloSt
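The retrieval step described above (fetching the most relevant demonstration portions into the context) resembles plain nearest-neighbor retrieval. A minimal sketch, with hypothetical observation embeddings and action labels that are not from the paper:

```python
import math

def retrieve_relevant(query_obs, demos, k=2):
    """Fetch the k demonstration snippets whose observation embeddings are
    closest to the current observation; these would then be packed into the
    VLA's context window for in-context learning."""
    return sorted(demos, key=lambda d: math.dist(query_obs, d["obs"]))[:k]

# Hypothetical demo snippets: (observation embedding, action chunk label).
demos = [
    {"obs": (0.0, 0.0), "actions": "reach-left"},
    {"obs": (2.0, 0.0), "actions": "reach-right"},
    {"obs": (0.9, 0.1), "actions": "grasp"},
]
context = retrieve_relevant((1.0, 0.1), demos)
print([d["actions"] for d in context])
```

In the real system the "observations" would be learned embeddings of robot observations, and the retrieved portions are consumed by the VLA without any weight updates.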
Disentangled Multi-Context Meta-Learning: Unlocking Robust and Generalized Task Learning
- Authors: Seonsoo Kim, Jun-Gill Kang, Taehong Kim, Seongil Hong
- Abstract: In meta-learning and its downstream tasks, many methods use implicit adaptation to represent task-specific variations. However, implicit approaches hinder interpretability and make it difficult to understand which task factors drive performance. In this work, we introduce a disentangled multi-context meta-learning framework that explicitly learns separate context vectors for different aspects that define a task. By decoupling these factors, our approach improves both robustness, through deeper task understanding, and generalization, by enabling context vector sharing across tasks with the same context. We evaluate our approach in two domains. First, on a sinusoidal regression benchmark, our model outperforms baselines on out-of-distribution tasks and generalizes to unseen sine functions by sharing context vectors associated with shared amplitudes or phase shifts. Second, in a quadruped locomotion task, we disentangle the robot-specific properties and the characteristics of the terrain in the robot dynamics model. Using these context vectors in reinforcement learning, the learned policy demonstrates improved robustness under out-of-distribution conditions, compared to a model using a single unified context. Furthermore, by effectively sharing context, our model enables successful sim-to-real policy transfer to challenging terrains with out-of-distribution robot-specific properties using only real data from flat terrain, which is not achievable with single-task adaptation.
- PDF: https://openreview.net/pdf?id=0ViTEgiFiQ
- Forum: https://openreview.net/forum?id=0ViTEgiFiQ
Distributed Upload and Active Labeling for Resource-Constrained Fleet Learning
- Authors: Oguzhan Akcin, Harsh Goel, Ruihan Zhao, Sandeep P. Chinchali
- Abstract: In multi-robot systems, fleets are often deployed to collect data that improves the performance of machine learning models for downstream perception and planning. However, real-world robotic deployments generate vast amounts of data across diverse conditions, while only a small portion can be transmitted or labeled due to limited bandwidth, constrained onboard storage, and high annotation costs. To address these challenges, we propose Distributed Upload and Active Labeling (DUAL), a decentralized, two-stage data collection framework for resource-constrained robotic fleets. In the first stage, each robot independently selects a subset of its local observations to upload under storage and communication constraints. In the second stage, the cloud selects a subset of uploaded data to label, subject to a global annotation budget. We evaluate DUAL on classification tasks spanning multiple sensing modalities, as well as on RoadNet—a real-world dataset we collected from vehicle-mounted cameras for time and weather classification. We further validate our approach in a physical experiment using a Franka Emika Panda robot arm, where it learns to move a red cube to a green bowl. Finally, we test DUAL on trajectory prediction using the nuScenes autonomous driving dataset to assess generalization to complex prediction tasks. Across all settings, DUAL consistently outperforms state-of-the-art baselines, achieving up to 31.1% gain in classification accuracy and a 13% improvement in real-world robotics task completion rates.
- PDF: https://openreview.net/pdf?id=M1e2PEMLp2
- Forum: https://openreview.net/forum?id=M1e2PEMLp2
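One generic way to picture the two-stage budgeted selection (not DUAL's actual criterion) is greedy diversity selection applied first per robot under the upload budget, then at the cloud under the label budget. A sketch with made-up 2D feature points:

```python
import math
import random

def greedy_diverse_subset(points, budget):
    """Farthest-point greedy selection: repeatedly add the point farthest
    from the current subset until the budget is exhausted."""
    if not points or budget <= 0:
        return []
    chosen = [points[0]]
    while len(chosen) < min(budget, len(points)):
        best = max(
            (p for p in points if p not in chosen),
            key=lambda p: min(math.dist(p, c) for c in chosen),
        )
        chosen.append(best)
    return chosen

random.seed(0)
# Stage 1: each robot uploads a diverse subset under its local budget.
robots = [[(random.random(), random.random()) for _ in range(50)] for _ in range(3)]
uploaded = [s for r in robots for s in greedy_diverse_subset(r, budget=10)]

# Stage 2: the cloud labels a diverse subset of the uploads under a global budget.
labeled = greedy_diverse_subset(uploaded, budget=8)
print(len(uploaded), len(labeled))
```

The decentralized structure is the key point: stage 1 runs on each robot with only local data, and only the small uploaded pool ever reaches the cloud.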
Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors
- Authors: Rafael Cathomen, Mayank Mittal, Marin Vlastelica, Marco Hutter
- Abstract: Unsupervised Skill Discovery (USD) allows agents to autonomously learn diverse behaviors without task-specific rewards. While recent USD methods have shown promise, their application to real-world robotics remains underexplored. In this paper, we propose a modular USD framework to address the challenges in safety, interpretability, and deployability of the learned skills. Our approach factorizes the state space to learn disentangled skill representations and assigns different skill discovery algorithms to each factor based on the desired intrinsic reward function. To encourage structured morphology-aware skills, we introduce symmetry-based inductive biases tailored to individual factors. We also incorporate a style factor and regularization penalties to promote safe and robust behaviors. We evaluate our framework in simulation using a quadrupedal robot and demonstrate zero-shot transfer of the learned skills to real hardware. Our results show that factorization and symmetry lead to the discovery of structured, human-interpretable behaviors, while the style factor and penalties enhance safety and diversity. Additionally, we show that the learned skills can be used for downstream tasks and perform on par with oracle policies trained with hand-crafted rewards. To facilitate future research, we will release our code upon publication.
- PDF: https://openreview.net/pdf?id=Ddb8w8FVV9
- Forum: https://openreview.net/forum?id=Ddb8w8FVV9
Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving
- Authors: Mingyi Wang, Jingke Wang, Tengju Ye, Kaicheng Yu
- Abstract: Recent breakthroughs in large language models (LLMs) have not only advanced natural language processing but also inspired their application in domains with structurally similar problems—most notably, autonomous driving motion generation. Both domains involve autoregressive sequence modeling, token-based representations, and context-aware decision making, making the transfer of LLM components a natural and increasingly common practice. However, despite promising early attempts, a systematic understanding of which LLM modules are truly transferable remains lacking. In this paper, we present a comprehensive evaluation of five key LLM modules—tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation—within the context of motion generation for autonomous driving. Through extensive experiments on the Waymo Sim Agents benchmark, we demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios. We evaluate our method on the Sim Agents task and achieve competitive results.
- PDF: https://openreview.net/pdf?id=AP7kM1xk2a
- Forum: https://openreview.net/forum?id=AP7kM1xk2a
$Door(s)$: Junction State Estimation for Efficient Exploration in Reinforcement Learning
- Authors: Benjamin Fele, Jan Babic
- Abstract: Exploration is one of the key bottlenecks for efficient learning in reinforcement learning, especially in the presence of sparse rewards. One way to traverse the environment faster is by passing through junctions, or metaphorical doors, in the state space. We propose a novel heuristic, $Door(s)$, focused on such narrow passages that serve as pathways to a large number of other states. Our approach works by estimating the state occupancy distribution and allows computation of its entropy, which forms the basis for our measure. Its computation is more sample-efficient than that of other similar methods and works robustly over longer horizons. Our results highlight the detection of dead-end states, show increased exploration efficiency, and demonstrate that $Door(s)$ encodes specific behaviors useful for downstream learning of various robotic manipulation tasks.
- PDF: https://openreview.net/pdf?id=NtnPVwUCAH
- Forum: https://openreview.net/forum?id=NtnPVwUCAH
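The quantity the $Door(s)$ measure builds on, the entropy of an estimated state-occupancy distribution, reduces in the tabular case to Shannon entropy over visit counts. A minimal sketch (the paper's actual estimator and junction scoring are more involved):

```python
import math
from collections import Counter

def occupancy_entropy(visits):
    """Shannon entropy (in nats) of the empirical state-occupancy
    distribution implied by a dict of visit counts."""
    total = sum(visits.values())
    probs = [c / total for c in visits.values()]
    return -sum(p * math.log(p) for p in probs)

# Toy rollout: states visited by an agent in a small discrete environment.
trajectory = ["s0", "s1", "s1", "s2", "s1", "s3", "s0", "s1"]
visits = Counter(trajectory)
h = occupancy_entropy(visits)
print(round(h, 4))  # ~1.2130 nats for counts {s0: 2, s1: 4, s2: 1, s3: 1}
```

A state whose presence in the occupancy distribution strongly changes this entropy is a candidate "door": a narrow passage that opens up many otherwise-unreachable states.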
D-Cubed: Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation
- Authors: Jun Yamada, Shaohong Zhong, Jack Collins, Ingmar Posner
- Abstract: Mastering deformable object manipulation often necessitates the use of anthropomorphic, high-degree-of-freedom robot hands capable of precise, contact-rich control. However, current trajectory optimisation methods often struggle in these settings due to the large search space and the sparse task information available from shape-matching cost functions, particularly when contact is absent. In this work, we propose D-Cubed, a novel trajectory optimisation method using a latent diffusion model (LDM) trained on a task-agnostic play dataset to solve dexterous deformable object manipulation tasks. D-Cubed learns a skill-latent space that encodes short-horizon actions from a play dataset using a VAE and trains an LDM to compose the skill latents into a skill trajectory, representing a long-horizon action trajectory. To optimise a trajectory for a target task, we introduce a novel gradient-free guided sampling method that employs the Cross-Entropy method within the reverse diffusion process. In particular, D-Cubed samples a small number of noisy skill trajectories using the LDM for exploration and evaluates the trajectories in simulation. Then D-Cubed selects the trajectory with the lowest cost for the subsequent reverse process. This effectively explores promising solution areas and optimises the sampled trajectories towards a target task throughout the reverse diffusion process. Through empirical evaluation on a published benchmark of dexterous deformable object manipulation tasks, we demonstrate that D-Cubed outperforms traditional trajectory optimisation and competitive baseline approaches by a significant margin.
- PDF: https://openreview.net/pdf?id=5htQM8jqOe
- Forum: https://openreview.net/forum?id=5htQM8jqOe
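The gradient-free guided sampling idea (sample several noisy candidates, evaluate their cost in simulation, keep the best for the next reverse step) can be mimicked without a real diffusion model. This toy anneals noise like a reverse chain and carries the elite forward, in the spirit of the Cross-Entropy method; the cost function and all numbers are invented for illustration:

```python
import random

def cost(traj):
    """Toy task cost: distance of the trajectory endpoint from a goal."""
    return abs(traj[-1] - 5.0)

def guided_reverse_process(steps=10, num_samples=8, noise=2.0):
    """Gradient-free guidance: at each (mock) reverse step, sample noisy
    candidates around the current trajectory, evaluate them, and carry
    the lowest-cost one forward to the next step."""
    random.seed(0)
    current = [0.0] * 5  # a 5-step action trajectory, initialised at "pure noise"
    for t in range(steps):
        scale = noise * (steps - t) / steps  # noise shrinks like a reverse chain
        candidates = [current] + [
            [a + random.gauss(0.0, scale) for a in current]
            for _ in range(num_samples)
        ]
        current = min(candidates, key=cost)  # keep the elite
    return current

best = guided_reverse_process()
print(round(cost(best), 3))
```

In D-Cubed the candidates are skill-latent trajectories decoded by the LDM and the cost comes from a physics simulator, but the select-the-elite mechanics per reverse step are the same shape.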
Dynamics-Compliant Trajectory Diffusion for Super-Nominal Payload Manipulation
- Authors: Anuj Pasricha, Joewie J. Koh, Jay Vakil, Alessandro Roncone
- Abstract: Nominal payload ratings for articulated robots are typically derived from worst-case configurations, resulting in uniform payload constraints across the entire workspace. This conservative approach severely underutilizes the robot’s inherent capabilities—our analysis demonstrates that manipulators can safely handle payloads well above nominal capacity across broad regions of their workspace while staying within joint angle, velocity, acceleration, and torque limits. To address this gap between assumed and actual capability, we propose a novel trajectory generation approach using denoising diffusion models that explicitly incorporates payload constraints into the planning process. Unlike traditional sampling-based methods that rely on inefficient trial-and-error, optimization-based methods that are prohibitively slow, or kinodynamic planners that struggle with problem dimensionality, our approach generates dynamically feasible joint-space trajectories in constant time that can be directly executed on physical hardware without post-processing. Experimental validation on a 7 DoF Franka Emika Panda robot demonstrates that up to 67.6% of the workspace remains accessible even with payloads exceeding 3 times the nominal capacity. This expanded operational envelope highlights the importance of a more nuanced consideration of payload dynamics in motion planning algorithms.
- PDF: https://openreview.net/pdf?id=8RdxHk9hpr
- Forum: https://openreview.net/forum?id=8RdxHk9hpr
Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection
- Authors: Abrar Anwar, Rohan Gupta, Zain Merchant, Sayan Ghosh, Willie Neiswanger, Jesse Thomason
- Abstract: Evaluating learned robot control policies to determine their performance costs the experimenter time and effort. As robots become more capable in accomplishing diverse tasks, evaluating across all these tasks becomes more difficult as it is impractical to test every policy on every task multiple times. Rather than considering the average performance of a policy on a task, we consider the distribution of performance over time. In a multi-task policy evaluation setting, we actively model the distribution of robot performance across multiple tasks and policies as we sequentially execute experiments. We show that natural language is a useful prior in modeling relationships between tasks because they often share similarities that can reveal potential relationships in policy behavior. We leverage this formulation to reduce experimenter effort by using a cost-aware information gain heuristic to efficiently select informative trials. We conduct experiments on existing evaluation data from real robots and simulations and find a 50% reduction in estimates of the mean performance given a fixed cost budget. We encourage the use of our surrogate model as a scalable approach to track progress in evaluation.
- PDF: https://openreview.net/pdf?id=pJ5FONkM9N
- Forum: https://openreview.net/forum?id=pJ5FONkM9N
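A simple stand-in for cost-aware trial selection (not the paper's language-conditioned surrogate model) treats each policy/task cell as an independent Gaussian mean estimate and greedily runs the trial with the largest posterior-variance reduction per unit cost:

```python
def posterior_variance(prior_var, noise_var, n):
    """Posterior variance of a Gaussian mean after n noisy observations."""
    return 1.0 / (1.0 / prior_var + n / noise_var)

def select_trials(cells, budget):
    """Greedy cost-aware selection: repeatedly run the trial whose expected
    variance reduction per unit cost is largest, until the budget runs out."""
    counts = {name: 0 for name in cells}
    spent, order = 0.0, []
    while True:
        def gain_per_cost(name):
            prior, noise, c = cells[name]
            before = posterior_variance(prior, noise, counts[name])
            after = posterior_variance(prior, noise, counts[name] + 1)
            return (before - after) / c
        affordable = [n for n in cells if spent + cells[n][2] <= budget]
        if not affordable:
            return order
        pick = max(affordable, key=gain_per_cost)
        counts[pick] += 1
        spent += cells[pick][2]
        order.append(pick)

# Hypothetical cells: (prior variance, observation noise variance, cost per trial).
cells = {"policyA/taskX": (1.0, 0.5, 1.0), "policyA/taskY": (1.0, 0.5, 3.0)}
plan = select_trials(cells, budget=6.0)
print(plan)
```

Note how the expensive task still gets sampled once: its first trial removes more variance per unit cost than a third trial on the cheap task, which is the diminishing-returns behavior an information-gain heuristic is meant to capture.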
Elucidating the Design Space of Torque-aware Vision-Language-Action Models
- Authors: Zongzheng Zhang, Haobo Xu, Zhuo Yang, Chenghao Yue, Zehao Lin, Huan-ang Gao, Ziwei Wang, Hao Zhao
- Abstract: Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. This is because torque signals align more closely with the decoder’s input, and the decoder is more sensitive to variations in input. Second, torque history proves to be a critical signal. We find that the most effective way to incorporate it is by summarizing the entire history into a single token, as this preserves the original input pattern of the decoder. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings. Code, models, and datasets will be released.
- PDF: https://openreview.net/pdf?id=HAmi1X11BO
- Forum: https://openreview.net/forum?id=HAmi1X11BO
Embrace Contacts: humanoid shadowing with full body ground contacts
- Authors: Ziwen Zhuang, Hang Zhao
- Abstract: Previous humanoid robot research treats the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment with body parts other than the feet and hands brings significant challenges to both model-predictive control and reinforcement learning-based methods: an unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time, and the success of sim-to-real reinforcement learning for humanoids heavily depends on the acceleration of the rigid-body physics simulator and the simplification of collision detection. Moreover, the lack of humanoid data with extreme torso movement makes all other components non-trivial to design, such as the dataset distribution, motion commands, and task rewards. To address these challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot’s motor actions in real time. Using a GPU-accelerated simulator, we train a humanoid whole-body control policy that follows high-level motion commands in the real world in real time, even with stochastic contacts, extremely large robot base rotations, and not-so-feasible motion commands.
- PDF: https://openreview.net/pdf?id=JibqR9sEdW
- Forum: https://openreview.net/forum?id=JibqR9sEdW
EndoVLA: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy
- Authors: NG CHI KIT, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
- Abstract: In endoscopic procedures, autonomous tracking of abnormal regions and following of circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile—each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, resulting in poor generalization across variable scenes. Vision–Language–Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative that can semantically adapt to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the inherently complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To this end, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to predefined circular markers during circumferential cutting. To address the unique challenges posed by data scarcity and domain shifts, we propose a dual-phase strategy, with supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning using task-aware rewards. Our approach significantly enhances tracking performance in endoscopy and achieves zero-shot generalization of tracking in general scenes and on more challenging sequential tasks.
- PDF: https://openreview.net/pdf?id=7XyO9Y1hI1
- Forum: https://openreview.net/forum?id=7XyO9Y1hI1
Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration
- Authors: Lai Wei, Jiahua Ma, Yibo Hu, Ruimao Zhang
- Abstract: In unstructured environments, robotic manipulation tasks involving objects with constrained motion trajectories—such as door opening—often experience discrepancies between the robot’s vision-guided end-effector trajectory and the object’s constrained motion path. Such discrepancies generate unintended harmful forces, which, if exacerbated, may lead to task failure and potential damage to the manipulated objects or the robot itself. To address this issue, this paper introduces a novel diffusion framework, termed SafeDiff. Unlike conventional methods that sequentially fuse visual and tactile data to predict future robot states, our approach generates a prospective state sequence based on the current robot state and visual context observations, using real-time force feedback as a calibration signal. This implicitly adjusts the robot’s state within the state space, enhancing operational success rates and significantly reducing harmful forces during manipulation, thus ensuring manipulation force safety. Additionally, we develop a large-scale simulation dataset named SafeDoorManip50k, offering extensive multimodal data to train and evaluate the proposed method. Extensive experiments show that our visual-tactile model substantially mitigates the risk of harmful forces in the door opening task, across both simulated and real-world settings.
- PDF: https://openreview.net/pdf?id=yeuA6M8JIX
- Forum: https://openreview.net/forum?id=yeuA6M8JIX
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering
- Authors: Muhammad Fadhil Ginting, Dong-Ki Kim, Xiangyun Meng, Andrzej Marek Reinke, Bandi Jai Krishna, Navid Kayhani, Oriana Peltzer, David Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel Kochenderfer, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
- Abstract: As robots become increasingly capable of operating over extended periods—spanning days, weeks, and even months—they are expected to accumulate knowledge of their environments and leverage this experience to assist humans more effectively. This paper studies the problem of Long-term Active Embodied Question Answering (LA-EQA), a new task in which a robot must both recall past experiences and actively explore its environment to answer complex, temporally-grounded questions. Unlike traditional EQA settings, which typically focus either on understanding the present environment alone or on recalling a single past observation, LA-EQA challenges an agent to reason over past, present, and possible future states, deciding when to explore, when to consult its memory, and when to stop gathering observations and provide a final answer. Standard EQA approaches based on large models struggle in this setting due to limited context windows, absence of persistent memory, and an inability to combine memory recall with active exploration. To address this, we propose a structured memory system for robots, inspired by the mind palace method from cognitive science. Our method encodes episodic experiences as scene-graph-based world instances, forming a reasoning and planning algorithm that enables targeted memory retrieval and guided navigation. To balance the exploration-recall trade-off, we introduce a value-of-information-based stopping criterion that determines when the agent has gathered sufficient information. We evaluate our method on real-world experiments and introduce a new benchmark that spans popular simulation environments and actual industrial sites. Our approach significantly outperforms state-of-the-art baselines, yielding substantial gains in both answer accuracy and exploration efficiency.
- PDF: https://openreview.net/pdf?id=4eMWCoWUKR
- Forum: https://openreview.net/forum?id=4eMWCoWUKR
Estimating Value of Assistance for Online POMDP Robotic Agents
- Authors: Yuval Goshen, Sarah Keren
- Abstract: Robotic agents operating in dynamic, partially observable environments often benefit from teammate assistance. We address the challenge of determining when and how to assist in multi-robot systems where agents can modify the physical environment, such as moving obstacles that block perception or manipulation. For robots using online POMDP planning, evaluating assistance impacts requires computationally intensive policy evaluation, making real-time decisions difficult. We formulate Value of Assistance (VOA) for POMDP agents and develop efficient heuristics that approximate VOA without requiring complete policy evaluation. Our empirical evaluation on both a standard POMDP benchmark and a collaborative manipulation task demonstrates that our Full Information heuristic enables real-time assistance decisions while maintaining sufficient accuracy for effective helping action selection.
- PDF: https://openreview.net/pdf?id=xzR8rBRgPp
- Forum: https://openreview.net/forum?id=xzR8rBRgPp
exUMI: Extensible Robot Teaching System with Action-aware Task-agnostic Tactile Representation
- Authors: Yue Xu, Litao Wei, Pengyu An, Qingyu Zhang, Yong-Lu Li
- Abstract: Tactile-aware robot learning faces critical challenges in data collection and representation due to data scarcity and sparsity, and the absence of force feedback in existing systems. To address these limitations, we introduce a tactile robot learning system with both hardware and algorithm innovations. We present exUMI, an extensible data collection device that enhances the vanilla UMI with robust proprioception (via AR MoCap and a rotary encoder), modular visuo-tactile sensing, and automated calibration, achieving 100% data usability. Building on an efficient collection of over 1M tactile frames, we propose Tactile Prediction Pretraining (TPP), a representation learning framework built on action-aware temporal tactile prediction that captures contact dynamics and mitigates tactile sparsity. Real-world experiments show that TPP outperforms traditional tactile imitation learning. Our work bridges the gap between human tactile intuition and robot learning through co-designed hardware and algorithms, offering open-source resources to advance contact-rich manipulation research.
- PDF: https://openreview.net/pdf?id=b86nyIOJWq
- Forum: https://openreview.net/forum?id=b86nyIOJWq
Extracting Visual Plans from Unlabeled Videos via Symbolic Guidance
- Authors: Wenyan Yang, Ahmet Tikna, Yi Zhao, Yuying Zhang, Luigi Palopoli, Marco Roveri, Joni Pajarinen
- Abstract: Visual planning, by offering a sequence of intermediate visual subgoals to a goal-conditioned low-level policy, achieves promising performance on long-horizon manipulation tasks. To obtain the subgoals, existing methods typically resort to video generation models but suffer from model hallucination and computational cost. We present Vis2Plan, an efficient, explainable and white-box visual planning framework powered by symbolic guidance. From raw, unlabeled play data, Vis2Plan harnesses vision foundation models to automatically extract a compact set of task symbols, which allows building a high-level symbolic transition graph for multi-goal, multi-stage planning. At test time, given a desired task goal, our planner conducts planning at the symbolic level and assembles a sequence of physically consistent intermediate sub-goal images grounded by the underlying symbolic representation. Our Vis2Plan outperforms strong diffusion video generation-based visual planners by delivering 53% higher aggregate success rate while generating visual plans 35$\times$ faster. The results indicate that Vis2Plan is able to generate physically consistent image goals while offering fully inspectable reasoning steps.
- PDF: https://openreview.net/pdf?id=HMcBBIg1Th
- Forum: https://openreview.net/forum?id=HMcBBIg1Th
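Symbol-level planning over a transition graph, once the symbols have been extracted, is ordinary graph search. A sketch with a hypothetical drawer task (the symbols and transitions are made up; in Vis2Plan each symbol would be grounded by a stored subgoal image for the low-level policy):

```python
from collections import deque

def bfs_plan(graph, start, goal):
    """Shortest symbol-level plan over the transition graph via BFS."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from start

# Hypothetical task symbols extracted from play data, with observed transitions.
graph = {
    "drawer_closed": ["drawer_open"],
    "drawer_open": ["object_in_drawer", "drawer_closed"],
    "object_in_drawer": ["drawer_closed_with_object"],
}
plan = bfs_plan(graph, "drawer_closed", "drawer_closed_with_object")
print(plan)
```

Because the plan is an explicit path through a discrete graph, every intermediate step is inspectable, which is the "white-box" property the abstract emphasizes, in contrast to sampling subgoals from a video generation model.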
FACET: Force-Adaptive Control via Impedance Reference Tracking for Legged Robots
- Authors: Botian Xu, Haoyang Weng, Qingzhou Lu, Yang Gao, Huazhe Xu
- Abstract: Reinforcement learning (RL) has made significant strides in legged robot control, enabling locomotion across diverse terrains and complex loco-manipulation capabilities. However, the commonly used position or velocity tracking-based objectives are agnostic to forces experienced by the robot, leading to stiff and potentially dangerous behaviors and poor control during forceful interactions. To address this limitation, we present Force-Adaptive Control via Impedance Reference Tracking (FACET). Inspired by impedance control, we use RL to train a control policy to imitate a virtual mass-spring-damper system, allowing fine-grained control under external forces by manipulating the virtual spring. In simulation, we demonstrate that our quadruped robot achieves improved robustness to large impulses (up to 200 N·s) and exhibits controllable compliance, achieving an 80% reduction in collision impulse. The policy is deployed to a physical robot, demonstrating both compliant behavior, such as initiating or stopping movement with a fingertip, and the ability to pull payloads of up to 10 kg. Further extension to a legged loco-manipulator and a humanoid shows the applicability of our method to more complex settings, enabling whole-body compliance control.
- PDF: https://openreview.net/pdf?id=EcOGafgvuC
- Forum: https://openreview.net/forum?id=EcOGafgvuC
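The virtual mass-spring-damper system that FACET's policy is trained to imitate is a textbook impedance model; the 1D simulation below illustrates the compliant behavior it produces (all constants are illustrative, not taken from the paper):

```python
# Textbook 1D impedance (mass-spring-damper) simulation: an external push
# displaces the virtual mass, and the spring/damper return it compliantly
# to the reference. Constants are illustrative, not from the FACET paper.
m, k, d = 1.0, 20.0, 5.0          # virtual mass, stiffness, damping
x, v, x_ref = 0.0, 0.0, 0.0       # position, velocity, position reference
dt = 0.001
peak = 0.0
for step in range(5000):
    f_ext = 10.0 if step < 100 else 0.0           # brief external push
    a = (k * (x_ref - x) + d * (0.0 - v) + f_ext) / m
    v += a * dt
    x += v * dt
    peak = max(peak, abs(x))
```

Lowering the virtual stiffness `k` makes the system yield further under the same push, which is the "manipulating the virtual spring" knob the abstract refers to.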
Fail2Progress: Learning from Real-World Robot Failures with Stein Variational Inference
- Authors: Yixuan Huang, Novella Alvina, Mohanraj Devendran Shanthi, Tucker Hermans
- Abstract: Skill effect models for long-horizon manipulation tasks are prone to failures in conditions not covered by training data distributions. Therefore, enabling robots to reason about and learn from failures is necessary. We investigate the problem of efficiently generating a dataset targeted to observed failures. After fine-tuning a skill effect model on this dataset, we evaluate the extent to which the model can recover from failures and minimize future failures. We propose Fail2Progress, an approach that leverages Stein variational inference to generate multiple simulation environments in parallel, enabling efficient data sample generation similar to observed failures. Our method is capable of handling several challenging mobile manipulation tasks, including transporting multiple objects, organizing a constrained shelf, and tabletop organization. Through large-scale simulation and real-world experiments, we demonstrate that our approach excels at learning from failures across different numbers of objects. Furthermore, we show that Fail2Progress outperforms several baselines.
- PDF: https://openreview.net/pdf?id=06YyNxzwae
- Forum: https://openreview.net/forum?id=06YyNxzwae
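The Stein variational machinery the paper builds on can be sketched in its generic form. The toy below (our illustration, unrelated to the paper's simulation pipeline) runs Stein variational gradient descent, whose kernel term keeps the parallel samples diverse, here fitting a 1D standard normal:

```python
# Generic SVGD sketch: particles follow a kernel-smoothed log-density
# gradient plus a repulsive term that prevents them from collapsing.
import numpy as np

def svgd_step(x, grad_logp, h=0.5, lr=0.3):
    """One SVGD update on a 1D particle set x."""
    diff = x[:, None] - x[None, :]
    k = np.exp(-diff**2 / (2 * h**2))   # RBF kernel matrix k(x_i, x_j)
    grad_k = diff / h**2 * k            # sum over j of grad_{x_j} k(x_j, x_i)
    phi = (k @ grad_logp(x) + grad_k.sum(axis=1)) / len(x)
    return x + lr * phi

rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=3.0, size=20)   # start far from target
for _ in range(2000):
    # Target density N(0, 1), so grad log p(x) = -x.
    particles = svgd_step(particles, grad_logp=lambda x: -x)
```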
Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings
- Authors: Andreas Sochopoulos, Nikolay Malkin, Nikolaos Tsagkas, Joao Moura, Michael Gienger, Sethu Vijayakumar
- Abstract: Diffusion and flow matching policies have recently demonstrated remarkable performance in robotic applications by accurately capturing multimodal robot trajectory distributions. However, their computationally expensive inference, due to the numerical integration of an ODE or SDE, limits their applicability as real-time controllers for robots. We introduce a methodology that utilizes conditional Optimal Transport couplings between noise and samples to enforce straight solutions in the flow ODE for robot action generation tasks. We show that naively coupling noise and samples fails in conditional tasks and propose incorporating condition variables into the coupling process to improve few-step performance. The proposed few-step policy achieves a 4% higher success rate with a 10$\times$ speed-up compared to Diffusion Policy on a diverse set of simulation tasks. Moreover, it produces high-quality and diverse action trajectories within 1-2 steps on a set of real-world robot tasks. Our method also retains the same training complexity as Diffusion Policy and vanilla Flow Matching, in contrast to distillation-based approaches, which achieve inference speeds significantly faster than standard diffusion only at the cost of an extra distillation stage.
- PDF: https://openreview.net/pdf?id=Nt4LmgZ7v9
- Forum: https://openreview.net/forum?id=Nt4LmgZ7v9
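The core coupling idea, minus the conditioning, can be sketched as a minibatch optimal-transport assignment between noise and data (a generic illustration, not the paper's implementation; the paper additionally folds condition variables into the pairing cost):

```python
# Minibatch OT coupling for flow matching: reorder the noise batch so that
# each (noise, data) training pair minimizes total squared distance, which
# straightens the flow ODE's solution paths and enables few-step inference.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_couple(noise, data):
    """Return noise reordered so that (noise[i], data[i]) is an OT match."""
    # Pairwise squared Euclidean costs between every noise/data pair.
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)    # Hungarian assignment
    coupled = np.empty_like(noise)
    coupled[col] = noise[row]
    return coupled

rng = np.random.default_rng(0)
data = rng.normal(size=(16, 2))
noise = rng.normal(size=(16, 2))
coupled = ot_couple(noise, data)
```

By construction the matched pairing never costs more than the naive index-wise pairing, so the straight-line interpolations used as flow targets cross less, yielding straighter learned flows.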
FastUMI: A Scalable and Hardware-Independent Universal Manipulation Interface with Dataset
- Authors: Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan CHEN, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li
- Abstract: Real-world manipulation datasets for robotic arms remain scarce due to the high costs, rigid hardware dependencies, and complex setup procedures associated with existing data collection methods. We introduce FastUMI, a redesigned Universal Manipulation Interface (UMI) that addresses these challenges, enabling low-cost, scalable, and rapid deployment across heterogeneous platforms. FastUMI achieves this through: (i) hardware decoupling via extensive mechanical reengineering, which removes dependence on specialized robotic components while preserving a consistent visual perspective; (ii) replacement of complex visual–inertial odometry with a commercial off-the-shelf tracker, simplifying the software stack without compromising pose estimation accuracy; and (iii) the provision of an integrated ecosystem that streamlines data acquisition, automates quality control, and ensures compatibility with both standard and enhanced imitation-learning pipelines. To facilitate further research, we release an open-access dataset comprising over 15,000 real-world demonstrations spanning 24 tasks, constituting one of the most extensive UMI-like resources to date. Empirical evaluations show that FastUMI supports rapid deployment, reduces operational overhead, and delivers robust performance across diverse manipulation scenarios, advancing scalable data-driven robotic learning.
- PDF: https://openreview.net/pdf?id=RUSscFSEfD
- Forum: https://openreview.net/forum?id=RUSscFSEfD
FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real
- Authors: Weiheng Liu, Yuxuan Wan, Jilong Wang, Yuxuan Kuang, Xuesong Shi, Haoran Li, Dongbin Zhao, Zhizheng Zhang, He Wang
- Abstract: Generalizable object fetching in cluttered scenes remains a fundamental and application-critical challenge in embodied AI. Closely packed objects cause inevitable occlusions, making safe action generation particularly difficult. Under such partial observability, effective policies must not only generalize across diverse objects and layouts but also reason about occlusion to avoid collisions. However, collecting large-scale real-world data for this task remains prohibitively expensive, leaving this problem largely unsolved. In this paper, we introduce FetchBot, a sim-to-real framework for this challenge. We first curate a large-scale synthetic dataset featuring 1M diverse scenes and 500k representative demonstrations. Based on this dataset, FetchBot employs a depth-conditioned method for action generation, which leverages structural cues to enable robust obstacle-aware action planning. However, depth is perfect in simulation but noisy in real-world environments. To address this sim-to-real gap, FetchBot predicts depth from RGB inputs using a foundation model and integrates local occupancy prediction as a co-training task, providing a generalizable latent representation for sim-to-real transfer. Extensive experiments in simulation and real-world environments demonstrate FetchBot’s strong zero-shot sim-to-real transfer, effective clutter handling, and adaptability to novel scenarios. In cluttered environments, it achieves an average success rate of 89.95%, significantly outperforming prior methods. Moreover, FetchBot demonstrates excellent robustness in challenging cases, such as fetching transparent, reflective, and irregular objects, highlighting its practical value.
- PDF: https://openreview.net/pdf?id=5ySSVlJBOn
- Forum: https://openreview.net/forum?id=5ySSVlJBOn
Few-Shot Neuro-Symbolic Imitation Learning for Long-Horizon Planning and Acting
- Authors: Pierrick Lorang, Hong Lu, Johannes Huemer, Patrik Zips, Matthias Scheutz
- Abstract: Imitation learning enables intelligent systems to acquire complex behaviors with minimal supervision. However, existing methods often focus on short-horizon skills, require large datasets, and struggle to solve long-horizon tasks or generalize across task variations and distribution shifts. We propose a novel neuro-symbolic framework that jointly learns continuous control policies and symbolic domain abstractions from a few skill demonstrations. Our method abstracts high-level task structures into a graph, discovers symbolic rules via an Answer Set Programming solver, and trains low-level controllers using diffusion policy imitation learning. A high-level oracle filters task-relevant information to focus each controller on a minimal observation and action space. Our graph-based neuro-symbolic framework enables capturing complex state transitions, including non-spatial and temporal relations, that data-driven learning or clustering techniques often fail to discover in limited demonstration datasets. We validate our approach in six domains that involve four robotic arms, Stacking, Kitchen, Assembly, and Towers of Hanoi environments, and a distinct Automated Forklift domain with two environments. The results demonstrate high data efficiency with as few as five skill demonstrations, strong zero- and few-shot generalizations, and interpretable decision making. Our code is publicly available.
- PDF: https://openreview.net/pdf?id=bILubVwPoD
- Forum: https://openreview.net/forum?id=bILubVwPoD
First Order Model-Based RL through Decoupled Backpropagation
- Authors: Joseph Amigo, Rooholla Khorrambakht, Elliot Chane-Sane, Nicolas Mansard, Ludovic Righetti
- Abstract: There is growing interest in reinforcement learning (RL) methods that leverage the simulator’s derivatives to improve learning efficiency. While early gradient-based approaches have demonstrated superior performance compared to derivative-free methods, accessing simulator gradients is often impractical due to their implementation cost or unavailability. Model-based RL (MBRL) can approximate these gradients via learned dynamics models, but the solver efficiency suffers from compounding prediction errors during training rollouts, which can degrade policy performance. We propose an approach that decouples trajectory generation from gradient computation: trajectories are unrolled using a simulator, while gradients are computed via backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization, even when simulator gradients are unavailable, as well as learning a critic from simulation rollouts, which is more accurate. Our method achieves the sample efficiency and speed of specialized optimizers such as SHAC, while maintaining the generality of standard approaches like PPO and avoiding ill behaviors observed in other first-order MBRL methods. We empirically validate our algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot, across both quadrupedal and bipedal locomotion tasks.
- PDF: https://openreview.net/pdf?id=2dXMfk3qRU
- Forum: https://openreview.net/forum?id=2dXMfk3qRU
FLARE: Robot Learning with Implicit World Modeling
- Authors: Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loïc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, Linxi Fan
- Abstract: We introduce Future LAtent Representation AlignmEnt (FLARE), a novel framework that integrates predictive world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, FLARE requires only minimal architectural modifications—adding a few tokens to standard vision-language-action (VLA) models—yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations lacking action labels, significantly boosting policy generalization to a novel object with unseen geometry from as few as one robot demonstration. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
- PDF: https://openreview.net/pdf?id=HXJ6pUSn1L
- Forum: https://openreview.net/forum?id=HXJ6pUSn1L
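The alignment objective can be pictured as a simple cosine loss between one of the policy's extra predictive tokens and a frozen latent embedding of a future observation (a simplified sketch of our reading of the abstract; the token dimension and loss form are assumptions):

```python
# Simplified future-latent alignment loss: pull a predictive policy token
# toward the (frozen) embedding of the observation several steps ahead.
# Shapes and loss form are illustrative assumptions, not FLARE's code.
import numpy as np

def alignment_loss(policy_token, future_embed):
    """Cosine distance between a policy token and a future-observation embedding."""
    p = policy_token / np.linalg.norm(policy_token)
    f = future_embed / np.linalg.norm(future_embed)
    return 1.0 - float(p @ f)

rng = np.random.default_rng(0)
feat = rng.normal(size=64)
```

Because the loss only asks the policy's features to predict future latents, it also applies to action-free human videos, which is what enables the co-training described above.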
FlashBack: Consistency Model-Accelerated Shared Autonomy
- Authors: Luzhe Sun, Jingtian Ji, Xiangshan Tan, Matthew Walter
- Abstract: Shared autonomy is an enabling technology that provides users with control authority over robots that would otherwise be difficult if not impossible to directly control. Yet, standard methods make assumptions that limit their adoption in practice—for example, prior knowledge of the user’s goals or the objective (i.e., reward) function that they wish to optimize, knowledge of the user’s policy, or query-level access to the user during training. Diffusion-based approaches to shared autonomy do not make such assumptions and instead only require access to demonstrations of desired behaviors, while allowing the user to maintain control authority. However, these advantages have come at the expense of high computational complexity, which has made real-time shared autonomy all but impossible. To overcome this limitation, we propose Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a consistency model-based formulation of diffusion. Key to CSA is that it employs the distilled probability flow of ordinary differential equations (PF ODE) to generate high-fidelity samples in a single step. This results in inference speeds significantly faster than what is possible with previous diffusion-based approaches to shared autonomy, enabling real-time assistance in complex domains with only a single function evaluation. Further, by intervening on flawed actions at intermediate states of the PF ODE, CSA enables varying levels of assistance. We evaluate CSA on a variety of challenging simulated and real-world robot control problems, demonstrating significant improvements over state-of-the-art methods both in terms of task performance and computational efficiency.
- PDF: https://openreview.net/pdf?id=zk4fRmHF0Q
- Forum: https://openreview.net/forum?id=zk4fRmHF0Q
FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models
- Authors: Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov
- Abstract: Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950 M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.
- PDF: https://openreview.net/pdf?id=JeppaebLRD
- Forum: https://openreview.net/forum?id=JeppaebLRD
Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models
- Authors: Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia
- Abstract: Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language-Models (VLM) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few tokens without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite, and outperforms OpenVLA in diverse real-world pick-and-place tasks.
- PDF: https://openreview.net/pdf?id=Ict1OjU9gl
- Forum: https://openreview.net/forum?id=Ict1OjU9gl
FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
- Authors: Anqi Joyce Yang, James Tu, Nikita Dvornik, Enxu Li, Raquel Urtasun
- Abstract: In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multimodal fusion designs leads to large gains for long-tailed 3D detection.
- PDF: https://openreview.net/pdf?id=19LSN4QnV4
- Forum: https://openreview.net/forum?id=19LSN4QnV4
Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions
- Authors: Alexis Yihong Hao, Yufei Wang, Navin Sriram Ravie, Bharath Hegde, David Held, Zackory Erickson
- Abstract: Robot-assisted dressing has the potential to significantly improve the lives of individuals with mobility impairments. To ensure an effective and comfortable dressing experience, the robot must be able to handle challenging deformable garments, apply appropriate forces, and adapt to limb movements throughout the dressing process. Prior work often makes simplifying assumptions—such as static human limbs during dressing—which limits real-world applicability. In this work, we develop a robot-assisted dressing system capable of handling partial observations with visual occlusions, as well as robustly adapting to arm motions during the dressing process. Given a policy trained in simulation with partial observations, we propose a method to fine-tune it in the real world using a small amount of data and multi-modal feedback from vision and force sensing, to further improve the policy’s adaptability to arm motions and enhance safety. We evaluate our method in simulation with simplified articulated human meshes and in a real world human study with 12 participants across 264 dressing trials. Our policy successfully dresses two long-sleeve everyday garments onto the participants while being adaptive to various kinds of arm motions, and greatly outperforms prior baselines in terms of task completion and user feedback.
- PDF: https://openreview.net/pdf?id=1HW2UhshIT
- Forum: https://openreview.net/forum?id=1HW2UhshIT
From Real World to Logic and Back: Learning Generalizable Relational Concepts For Long Horizon Robot Planning
- Authors: Naman Shah, Jayesh Nagpal, Siddharth Srivastava
- Abstract: Humans efficiently generalize from limited demonstrations, but robots still struggle to transfer learned knowledge to complex, unseen tasks with longer horizons and increased complexity. We propose the first known method enabling robots to autonomously invent relational concepts directly from small sets of unannotated, unsegmented demonstrations. The learned symbolic concepts are grounded into logic-based world models, facilitating efficient zero-shot generalization to significantly more complex tasks. Empirical results demonstrate that our approach achieves performance comparable to hand-crafted models, successfully scaling execution horizons and handling up to 18 times more objects than seen in training, providing the first autonomous framework for learning transferable symbolic abstractions from raw robot trajectories.
- PDF: https://openreview.net/pdf?id=1cA6OYsfoJ
- Forum: https://openreview.net/forum?id=1cA6OYsfoJ
From Space to Time: Enabling Adaptive Safety with Learned Value Functions via Disturbance Recasting
- Authors: Sander Tonkens, Nikhil Uday Shinde, Azra Begzadić, Michael C. Yip, Jorge Cortes, Sylvia Lee Herbert
- Abstract: Safe operation is essential for autonomous systems in safety-critical environments such as urban air mobility. Value function-based safety filters provide formal guarantees on safety, wrapping learned or planning-based controllers with a layer of protection. Recent approaches leverage offline learned value functions to scale these safety filters to high-dimensional systems. Yet these methods assume detailed prior knowledge of all possible sources of model mismatch, in the form of disturbances, in the environment – information that is typically unavailable in real world settings. Even in well-mapped environments like urban canyons or industrial sites, drones encounter complex, spatially-varying disturbances arising from payload-drone interaction, turbulent airflow, and other environmental factors. We introduce Space2Time, which enables safe and adaptive deployment of offline-learned safety filters under unknown, spatially-varying disturbances. The key idea is to reparameterize spatial disturbances as a time-varying formulation, allowing the use of temporally varying precomputed value functions during online operation. We validate Space2Time through extensive simulations on diverse quadcopter models and real-world hardware experiments, demonstrating significantly improved safety performance over worst-case and naive baselines.
- PDF: https://openreview.net/pdf?id=GUgIpVTC1T
- Forum: https://openreview.net/forum?id=GUgIpVTC1T
GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
- Authors: Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu
- Abstract: In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationships mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot’s navigation path and final goal. To handle cases of no solution or multiple solutions, we construct the navigation tree and the backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
- PDF: https://openreview.net/pdf?id=mjYKNIRqpy
- Forum: https://openreview.net/forum?id=mjYKNIRqpy
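The flavor of solving spatial constraints for waypoint positions can be illustrated with a toy 2D optimization (entirely our own construction: the coordinates, penalty forms, and solver choice are assumptions, not the paper's constraint library or solver):

```python
# Toy spatial-constraint solving: place two waypoints so that waypoint 1 is
# left of a "chair" and waypoint 2 is about 0.5 m from a "table", with a
# small preference for short paths. Illustrative only.
import numpy as np
from scipy.optimize import minimize

chair, table = np.array([2.0, 0.0]), np.array([5.0, 3.0])

def cost(w):
    w1, w2 = w[:2], w[2:]
    c = max(0.0, w1[0] - (chair[0] - 0.5)) ** 2       # w1 left of the chair
    c += (np.linalg.norm(w2 - table) - 0.5) ** 2      # w2 ~0.5 m from table
    c += 0.001 * np.linalg.norm(w2 - w1) ** 2         # prefer short paths
    return c

res = minimize(cost, x0=np.zeros(4))
w1, w2 = res.x[:2], res.x[2:]
```

When the penalties cannot all be driven near zero, the instruction is effectively unsatisfiable at that location, which is the kind of case the paper's navigation tree and backtracking mechanism are built to handle.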
Generalist Robot Manipulation beyond Action Labeled Data
- Authors: Alexander Spiridonov, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel
- Abstract: Recent advances in generalist robot manipulation leverage pre-trained Vision–Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels—featuring humans and/or robots in action—enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
- PDF: https://openreview.net/pdf?id=ZqBXnR6ppz
- Forum: https://openreview.net/forum?id=ZqBXnR6ppz
Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation
- Authors: Chuye Zhang, Xiaoxiong Zhang, Linfang Zheng, Wei Pan, Wei Zhang
- Abstract: Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
- PDF: https://openreview.net/pdf?id=VmCkEvRULX
- Forum: https://openreview.net/forum?id=VmCkEvRULX
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
- Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
- Abstract: How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
- PDF: https://openreview.net/pdf?id=HprBJupvvM
- Forum: https://openreview.net/forum?id=HprBJupvvM
Geometric Red-Teaming for Robotic Manipulation
- Authors: Divyam Goel, Yufei Wang, Tiancheng Wu, Guixiu Qiao, Pavel Piliptchak, David Held, Zackory Erickson
- Abstract: Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation. We introduce a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes—structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies. The method integrates a Jacobian field–based deformation model with a gradient-free, simulator-in-the-loop optimization strategy. Across insertion, articulation, and grasping tasks, our approach consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks. By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general purpose framework for structured, object-centric robustness evaluation in robotic manipulation. We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement. Finally, we validate both red-teaming and blue-teaming results with a real robotic arm, observing that simulated CrashShapes reduce task success from 90% to as low as 22.5%, and that blue-teaming recovers performance to up to 90% on the corresponding real-world geometry—closely matching simulation outcomes.
- PDF: https://openreview.net/pdf?id=ux5EptB7xZ
- Forum: https://openreview.net/forum?id=ux5EptB7xZ
GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation
- Authors: Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, Junwei Liang
- Abstract: Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence—particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities. We will release our dataset, code and models.
- PDF: https://openreview.net/pdf?id=HT34hQcU91
- Forum: https://openreview.net/forum?id=HT34hQcU91
Granular loco-manipulation: Repositioning rocks through strategic sand avalanche
- Authors: Haodi Hu, Yue Wu, Daniel Seita, Feifei Qian
- Abstract: Legged robots have the potential to leverage obstacles to climb steep sand slopes. However, efficiently repositioning these obstacles to desired locations is challenging. Here we present DiffusiveGRAIN, a learning-based method that enables a multi-legged robot to strategically induce localized sand avalanches during locomotion and indirectly manipulate obstacles. We conducted 375 trials, systematically varying obstacle spacing, robot orientation, and leg actions in 75 of them. Results show that the movements of closely-spaced obstacles exhibit significant interference, requiring joint modeling. In addition, different multi-leg excavation actions could cause distinct robot state changes, necessitating integrated planning of manipulation and locomotion. To address these challenges, DiffusiveGRAIN includes a diffusion-based environment predictor to capture multi-obstacle movements under granular flow interferences and a robot state predictor to estimate changes in robot state from multi-leg action patterns. Deployment experiments (90 trials) demonstrate that by integrating the environment and robot state predictors, the robot can autonomously plan its movements based on loco-manipulation goals, successfully shifting closely located rocks to desired locations in over 65% of trials. Our study showcases the potential for a locomoting robot to strategically manipulate obstacles to achieve improved mobility on challenging terrains.
- PDF: https://openreview.net/pdf?id=BNdgT6GeC6
- Forum: https://openreview.net/forum?id=BNdgT6GeC6
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
- Authors: Abhay Deshpande, Yuquan Deng, Jordi Salvador, Arijit Ray, Winson Han, Jiafei Duan, Rose Hendrix, Yuke Zhu, Ranjay Krishna
- Abstract: We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from a large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.
- PDF: https://openreview.net/pdf?id=SebHZk78aS
- Forum: https://openreview.net/forum?id=SebHZk78aS
GraspQP: Differentiable Optimization of Force Closure for Diverse and Robust Dexterous Grasping
- Authors: René Zurbrügg, Andrei Cramariuc, Marco Hutter
- Abstract: Dexterous robotic hands enable versatile interactions through the flexibility and adaptability of a multi-finger setup, allowing for a wide range of task-specific grasp configurations in diverse environments. However, access to diverse and high-quality grasp data is essential to fully exploit the capabilities of dexterous hands, be it to train grasp prediction models from point clouds, train manipulation policies, or to support high-level task planning with a broader range of action options. Existing approaches for dataset generation rely on sampling-based algorithms or simplified force-closure analysis, which tend to converge to power grasps and often exhibit limited diversity. In this work, we propose a method to synthesize large-scale, diverse, and physically feasible grasps that additionally go beyond simple power grasps to more refined manipulation, such as pinches or tri-finger precision grasps. We introduce a rigorous differentiable energy formulation of force closure, implicitly defined through a Quadratic Program (QP). In addition, we present an adjusted optimization method (MALA*) that improves performance by dynamically rejecting gradient steps based on the global sample distribution. We extensively evaluate our approach and demonstrate significant improvements in both grasp diversity and the stability of final grasp predictions. Finally, we provide a new, large-scale grasp dataset for the 5,700 objects from DexGraspNet, consisting of five different grippers and three different grasp types.
- PDF: https://openreview.net/pdf?id=aZwWRycAXi
- Forum: https://openreview.net/forum?id=aZwWRycAXi
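GraspQP's MALA* modifies Metropolis-adjusted Langevin sampling with dynamic gradient-step rejection. That variant is not spelled out in the abstract, but the standard MALA step it builds on can be sketched as follows; the function names and the 1-D toy density are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def mala_step(x, grad_logp, logp, eps, rng=random):
    """One Metropolis-adjusted Langevin (MALA) step on a 1-D log-density.

    Proposal: x' = x + (eps^2 / 2) * grad_logp(x) + eps * N(0, 1),
    accepted or rejected with the usual Metropolis-Hastings correction.
    """
    x_new = x + 0.5 * eps ** 2 * grad_logp(x) + eps * rng.gauss(0.0, 1.0)

    def log_q(a, b):
        # Log-density (up to a constant) of proposing b when at a.
        mu = a + 0.5 * eps ** 2 * grad_logp(a)
        return -((b - mu) ** 2) / (2 * eps ** 2)

    log_alpha = logp(x_new) - logp(x) + log_q(x_new, x) - log_q(x, x_new)
    if math.log(rng.random()) < log_alpha:
        return x_new  # accept
    return x  # reject, stay put
```

Targeting a standard normal (logp(x) = -x²/2, gradient -x), the chain's samples should roughly reproduce its mean and variance.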
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
- Authors: Saumya Saxena, Blake Buchanan, Chris Paxton, Peiqi Liu, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
- Abstract: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps, and further demonstrate GraphEQA in two separate real world environments. Videos and code are available at https://grapheqa.github.io.
- PDF: https://openreview.net/pdf?id=Yy9EVIajH5
- Forum: https://openreview.net/forum?id=Yy9EVIajH5
HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation
- Authors: Gershom Seneviratne, Jianyu An, Sahire Ellahy, Kasun Weerakoon, Mohamed Bashir Elnoor, Jonathan Deepak Kannan, Amogha Thalihalla Sunil, Dinesh Manocha
- Abstract: In this paper, we introduce HALO, a novel Offline Reward Learning algorithm that quantifies human intuition in navigation into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are randomly sampled from the action space around the expert action and ranked using a Boltzmann probability distribution that combines their distance to the expert action with human preference scores derived from intuitive navigation queries based on the corresponding egocentric camera feed. These scores establish preference rankings, enabling the training of a novel reward model based on Plackett-Luce loss, which allows for preference-driven navigation. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline learned policy trained directly on the HALO-derived rewards, and (ii) a model-predictive-control (MPC) based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across multiple scenarios demonstrate that policies trained with HALO achieve improved performance over state-of-the-art methods in terms of success rate and normalized trajectory length while maintaining lower Fréchet distance with the human expert trajectories.
- PDF: https://openreview.net/pdf?id=PMKwnV6Azi
- Forum: https://openreview.net/forum?id=PMKwnV6Azi
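The Plackett-Luce loss that HALO trains its reward model with scores an entire preference ranking, and its negative log-likelihood has a simple closed form. A minimal sketch, with illustrative toy scores rather than HALO's actual model outputs:

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` is ordered from most- to least-preferred; the probability of
    this ranking is the product over positions i of
    exp(s_i) / sum_{j >= i} exp(s_j).
    """
    nll = 0.0
    for i in range(len(scores)):
        tail = scores[i:]
        m = max(tail)  # stabilized log-sum-exp over the remaining items
        lse = m + math.log(sum(math.exp(s - m) for s in tail))
        nll += lse - scores[i]
    return nll

# Scores that agree with the preference ranking incur lower loss.
aligned = plackett_luce_nll([3.0, 2.0, 1.0])
inverted = plackett_luce_nll([1.0, 2.0, 3.0])
```

Minimizing this loss pushes the reward model to assign higher scores to more-preferred actions, which is what enables the preference-driven navigation described above.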
Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching
- Authors: Sirui Chen, Yufei Ye, Zi-ang Cao, Pei Xu, Jennifer Lew, Karen Liu
- Abstract: We propose Hand-Eye Autonomous Delivery (HEAD), a framework that learns navigation, locomotion, and reaching skills for humanoids, directly from human motion and vision perception data. We take a modular approach where the high-level planner commands the target position and orientation of the hands and eyes of the humanoid, delivered by the low-level policy that controls the whole-body movements. Specifically, the low-level whole-body controller learns to track the three points (eyes, left hand, and right hand) from existing large-scale human motion capture data while the high-level policy learns from human data collected by Aria glasses. Our modular approach decouples the ego-centric vision perception from physical actions, promoting efficient learning and scalability to novel scenes. We evaluate our method both in simulation and in the real world, demonstrating the humanoid’s capabilities to navigate and reach in complex environments designed for humans.
- PDF: https://openreview.net/pdf?id=H0EgeP3feg
- Forum: https://openreview.net/forum?id=H0EgeP3feg
Hold My Beer: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control
- Authors: Yitang Li, Yuanhang Zhang, Wenli Xiao, Chaoyi Pan, Haoyang Weng, Guanqi He, Tairan He, Guanya Shi
- Abstract: Can your humanoid walk up and hand you a full cup of beer—without spilling a drop? While humanoids are increasingly featured in flashy demos—dancing, delivering packages, traversing rough terrain—fine-grained control during locomotion remains a significant challenge. In particular, stabilizing a filled end-effector (EE) while walking is far from solved, due to a fundamental mismatch in task dynamics: locomotion demands slow-timescale, robust control, whereas EE stabilization requires rapid, high-precision corrections. To address this, we propose SoFTA, a Slow-Fast Two-Agent framework that decouples upper-body and lower-body control into separate agents operating at different frequencies and with distinct rewards. This temporal and objective separation mitigates policy interference and objective conflict, and enables coordinated whole-body behavior. SoFTA executes upper-body actions at 100 Hz for precise EE control and lower-body actions at 50 Hz for robust gait. It reduces EE acceleration by 2–5x compared to baselines and performs 2–3x closer to human-level stability, enabling delicate tasks such as carrying nearly full cups, capturing steady video during locomotion, and disturbance rejection with EE stability.
- PDF: https://openreview.net/pdf?id=Bl2VfU9NhF
- Forum: https://openreview.net/forum?id=Bl2VfU9NhF
Human-like Navigation in a World Built for Humans
- Authors: Bhargav Chandaka, Gloria Xinyue Wang, Haozhe Chen, Henry Che, Albert J. Zhai, Shenlong Wang
- Abstract: When navigating in a man-made environment they haven’t visited before—like an office building—humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.
- PDF: https://openreview.net/pdf?id=nMiyWyNhQx
- Forum: https://openreview.net/forum?id=nMiyWyNhQx
Humanoid Policy ~ Human Policy
- Authors: Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, Xiaolong Wang
- Abstract: Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection, which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both the generalization and robustness of HAT with significantly better data collection efficiency.
- PDF: https://openreview.net/pdf?id=Tx54fkQ3Cq
- Forum: https://openreview.net/forum?id=Tx54fkQ3Cq
HuB: Learning Extreme Humanoid Balance
- Authors: Tong Zhang, Boyuan Zheng, Ruiqian Nai, Yingdong Hu, Yen-Jen Wang, Geng Chen, Fanqi Lin, Jiongye Li, Chuye Hong, Koushil Sreenath, Yang Gao
- Abstract: The human body demonstrates exceptional motor capabilities—such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters—both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging. In this work, we identify three key obstacles: instability from reference motion errors, learning difficulties due to morphological mismatch, and the sim-to-real gap caused by sensor noise and unmodeled dynamics. To address these challenges, we propose $\textbf{HuB}$ ($\textbf{Hu}$manoid $\textbf{B}$alance), a unified framework that integrates $\textit{reference motion refinement}$, $\textit{balance-aware policy learning}$, and $\textit{sim-to-real robustness training}$, with each component targeting a specific challenge. We validate our approach on the Unitree G1 humanoid robot across challenging quasi-static balance tasks, including extreme single-legged poses such as $\texttt{Swallow Balance}$ and $\texttt{Bruce Lee’s Kick}$. Our policy remains stable even under strong physical disturbances—such as a forceful soccer strike—while baseline methods consistently fail to complete these tasks.
- PDF: https://openreview.net/pdf?id=FCpYuGtN4j
- Forum: https://openreview.net/forum?id=FCpYuGtN4j
HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation
- Authors: Li Sun, Jiefeng Wu, Feng Chen, Ruizhe Liu, Yanchao Yang
- Abstract: Effective policy learning for robotic manipulation requires scene representations that selectively capture task-relevant environmental features. Current approaches typically employ task-agnostic representation extraction, failing to emulate the dynamic perceptual adaptation observed in human cognition. We present HyperTASR, a hypernetwork-driven framework that modulates scene representations based on both task objectives and the execution phase. Our architecture dynamically generates representation transformation parameters conditioned on task specifications and progression state, enabling representations to evolve contextually throughout task execution. This approach maintains architectural compatibility with existing policy learning frameworks while fundamentally reconfiguring how visual features are processed. Unlike methods that simply concatenate or fuse task embeddings with task-agnostic representations, HyperTASR establishes computational separation between task-contextual and state-dependent processing paths, enhancing learning efficiency and representational quality. Comprehensive evaluations in both simulation and real-world environments demonstrate substantial performance improvements across different representation paradigms. Most notably, HyperTASR elevates success rates by over 27% when applied to GNFactor and achieves unprecedented single-view performance exceeding 80% success with 3D Diffuser Actor. Through ablation studies and attention visualization, we confirm that our approach selectively prioritizes task-relevant scene information, closely mirroring human adaptive perception during manipulation tasks.
- PDF: https://openreview.net/pdf?id=BcJlmjF1vV
- Forum: https://openreview.net/forum?id=BcJlmjF1vV
Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models
- Authors: Seungjae Lee, Daniel Ekpo, Haowen Liu, Furong Huang, Abhinav Shrivastava, Jia-Bin Huang
- Abstract: Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment. Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. The collected data facilitates learning downstream tasks that closely match those of policies trained on human-collected demonstrations.
- PDF: https://openreview.net/pdf?id=9AHjtHLlIe
- Forum: https://openreview.net/forum?id=9AHjtHLlIe
Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics
- Authors: Ryoga Oishi, Sho Sakaino, Toshiaki Tsuji
- Abstract: In the field of robot learning, it is becoming possible to control robot actions through language instructions. However, adjusting actions based on human instructions remains difficult, because such instructions are often qualitative and there is not always a one-to-one correspondence between behaviors and instructions. In this paper, we propose a motion generation model that can adjust actions in response to qualitative human instructions during task execution. The core of the proposed method is a learning architecture that maps qualitative human instructions to actions. Specifically, the demonstration is divided into short action sequences, and labels reflecting human qualitative senses are assigned to these sequences to realize learning that links human qualitative instructions and robot actions. In evaluation experiments, we verified the effectiveness of the method in two tasks: a pick-and-place task and a wiping task. Experimental results showed that the proposed method is able to generate motions in response to human qualitative instructions during task execution, whereas the conventional method generates trajectories all at once, making it impossible to adjust motions during task execution.
- PDF: https://openreview.net/pdf?id=Af2RMaWRjm
- Forum: https://openreview.net/forum?id=Af2RMaWRjm
ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation
- Authors: Yangcen Liu, Woo Chul Shin, Yunhai Han, Zhenyang Chen, Harish Ravichandar, Danfei Xu
- Abstract: Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, domain gaps across visual, morphological, and physical aspects hinder direct imitation. To effectively bridge the domain gap, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small amount of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy to bridge the domain gap for robust robot manipulation. The project website can be found at https://sites.google.com/view/immimic.
- PDF: https://openreview.net/pdf?id=7iaYcss56y
- Forum: https://openreview.net/forum?id=7iaYcss56y
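ImMimic's pairing step relies on two standard building blocks: Dynamic Time Warping to match retargeted human poses to robot trajectories, and MixUp interpolation over the matched pairs. A minimal sketch of both, with illustrative names and distance functions (the paper's action- and visual-based mappings are not reproduced here):

```python
def dtw_path(a, b, dist):
    """Dynamic-time-warping alignment between sequences a and b.

    Returns the total alignment cost and the list of matched index
    pairs (i, j) from (0, 0) to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            cost[i][j] = c + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack along the cheapest predecessors to recover the pairing.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: cost[s[0]][s[1]])
    path.reverse()
    return cost[n][m], path

def mixup(human_frame, robot_frame, lam):
    """Convex MixUp blend of a DTW-paired human and robot frame."""
    return [lam * h + (1 - lam) * r for h, r in zip(human_frame, robot_frame)]
```

Sweeping `lam` from 1 to 0 over paired frames produces the intermediate domains that the abstract credits with smoothing domain adaptation during co-training.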
Improving Efficiency of Sampling-based Motion Planning via Message-Passing Monte Carlo
- Authors: Makram Chahine, T. Konstantin Rusch, Zach J Patterson, Daniela Rus
- Abstract: Sampling-based motion planning methods, while effective in high-dimensional spaces, often suffer from inefficiencies due to irregular sampling distributions, leading to suboptimal exploration of the configuration space. In this paper, we propose an approach that enhances the efficiency of these methods by utilizing low-discrepancy distributions generated through Message-Passing Monte Carlo (MPMC). MPMC leverages Graph Neural Networks (GNNs) to generate point sets that uniformly cover the space, with uniformity assessed using the $\mathcal{L}_p$-discrepancy measure, which quantifies the irregularity of sample distributions. By improving the uniformity of the point sets, our approach significantly reduces computational overhead and the number of samples required for solving motion planning problems. Experimental results demonstrate that our method outperforms traditional sampling techniques in terms of planning efficiency.
- PDF: https://openreview.net/pdf?id=bi8o9p6h2R
- Forum: https://openreview.net/forum?id=bi8o9p6h2R
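The $\mathcal{L}_p$-discrepancy that MPMC minimizes has a closed form for p = 2 (Warnock's formula), which makes the uniformity of a point set easy to check numerically. A minimal sketch of that check only; the GNN point-set generator itself is beyond the scope of this snippet:

```python
def l2_star_discrepancy(points):
    """L2 star discrepancy of a point set in [0, 1]^d (Warnock's formula).

    Lower values mean the empirical distribution of the points is closer
    to uniform over the unit cube.
    """
    n, d = len(points), len(points[0])
    term1 = (1.0 / 3.0) ** d
    term2 = 0.0
    for x in points:
        prod = 1.0
        for xk in x:
            prod *= (1.0 - xk * xk) / 2.0
        term2 += prod
    term2 *= 2.0 / n
    term3 = 0.0
    for x in points:
        for y in points:
            prod = 1.0
            for xk, yk in zip(x, y):
                prod *= 1.0 - max(xk, yk)
            term3 += prod
    term3 /= n * n
    return (term1 - term2 + term3) ** 0.5

# Evenly spread points score lower (better) than clustered ones.
spread = l2_star_discrepancy([(0.125,), (0.375,), (0.625,), (0.875,)])
clustered = l2_star_discrepancy([(0.1,), (0.11,), (0.12,), (0.13,)])
```

The same measure applied to configuration-space samples explains the paper's claim: more uniform point sets cover the space with fewer samples.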
In-Context Iterative Policy Improvement for Dynamic Manipulation
- Authors: Mark Van der Merwe, Devesh K. Jha
- Abstract: Attention-based architectures trained on internet-scale language data have demonstrated state-of-the-art reasoning ability for various language-based tasks, such as logic problems and textual reasoning. Additionally, these Large Language Models (LLMs) have exhibited the ability to perform few-shot prediction via in-context learning, in which input-output examples provided in the prompt are generalized to new inputs. This ability furthermore extends beyond standard language tasks, enabling few-shot learning for general patterns. In this work, we consider the application of in-context learning with pre-trained language models for dynamic manipulation. Dynamic manipulation introduces several crucial challenges, including increased dimensionality, complex dynamics, and partial observability. To address this, we take an iterative approach, and formulate our in-context learning problem to predict adjustments to a parametric policy based on previous interactions. We show across several tasks in simulation and on a physical robot that utilizing in-context learning outperforms alternative methods in the low data regime.
- PDF: https://openreview.net/pdf?id=IV35hjIZwz
- Forum: https://openreview.net/forum?id=IV35hjIZwz
IRIS: An Immersive Robot Interaction System
- Authors: Xinkai Jiang, Qihao Yuan, Enes Ulas Dincer, Hongyi Zhou, Ge Li, Xueyin Li, Xiaogang Jia, Timo Schnizer, Nicolas Schreiber, Weiran Liao, Julius Haag, Kailai Li, Gerhard Neumann, Rudolf Lioutikov
- Abstract: This paper introduces IRIS, an Immersive Robot Interaction System leveraging Extended Reality (XR). Existing XR-based systems enable efficient data collection but are often challenging to reproduce and reuse due to their specificity to particular robots, objects, simulators, and environments. IRIS addresses these issues by supporting immersive interaction and data collection across diverse simulators and real-world scenarios. It visualizes arbitrary rigid and deformable objects, robots from simulation, and integrates real-time sensor-generated point clouds for real-world applications. Additionally, IRIS enhances collaborative capabilities by enabling multiple users to simultaneously interact within the same virtual scene. Extensive experiments demonstrate that IRIS offers efficient and intuitive data collection in both simulated and real-world settings.
- PDF: https://openreview.net/pdf?id=b2mXmmGX8E
- Forum: https://openreview.net/forum?id=b2mXmmGX8E
Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop
- Authors: Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi, tyler bonnen, Ken Goldberg, Angjoo Kanazawa
- Abstract: Humans do not passively observe the visual world—we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by introducing a BC-RL loop trained using teleoperated demonstrations recorded with a 360 camera. The resulting video enables a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing reinforcement learning of gaze behavior. The hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct actions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. We evaluate EyeRobot on five large workspace manipulation tasks and compare performance to two common camera setups: wrist and external cameras. Our experiments suggest EyeRobot exhibits hand-eye coordination which effectively facilitates actions such as visual search or target switching, which enable manipulation across large workspaces.
- PDF: https://openreview.net/pdf?id=KKpdjQK6nT
- Forum: https://openreview.net/forum?id=KKpdjQK6nT
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes
- Authors: Shalin Jain, Jiazhen Liu, Siva Kailas, Harish Ravichandar
- Abstract: Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot reinforcement learning (MRRL) policies with realistic robot dynamics and safety constraints, supporting both parallelization and hardware acceleration. Our generalizable learning interface provides an easy-to-use integration with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a realistic robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation.
- PDF: https://openreview.net/pdf?id=jPHhft5tNo
- Forum: https://openreview.net/forum?id=jPHhft5tNo
Joint Model-based Model-free Diffusion for Planning with Constraints
- Authors: Wonsuhk Jung, Utkarsh Aashu Mishra, Nadun Ranawaka Arachchige, Yongxin Chen, Danfei Xu, Shreyas Kousik
- Abstract: Model-free diffusion planners have shown great promise for robot motion planning, but practical robotic systems often require combining them with model-based optimization modules to enforce constraints, such as safety. Naively integrating these modules presents compatibility challenges when diffusion’s multi-modal outputs behave adversarially to optimization-based modules. To address this, we introduce Joint Model-based Model-free Diffusion (JM2D), a novel generative modeling framework. JM2D formulates module integration as a joint sampling problem to maximize compatibility via an interaction potential, without additional training. Using importance sampling, JM2D guides module outputs based only on evaluations of the interaction potential, thus handling non-differentiable objectives commonly arising from non-convex optimization modules. We evaluate JM2D via application to aligning diffusion planners with safety modules on offline RL and robot manipulation. JM2D significantly improves task performance compared to conventional safety filters without sacrificing safety. Further, we show that conditional generation is a special case of JM2D and elucidate key design choices by comparing with SOTA gradient-based and projection-based diffusion planners. More details at: \url{https://sites.google.com/view/joint-mbmf-diffusion}.
- PDF: https://openreview.net/pdf?id=E9t1ekt6W9
- Forum: https://openreview.net/forum?id=E9t1ekt6W9
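The importance-sampling idea at the heart of JM2D — reweighting candidate plans by an interaction potential that needs only to be evaluated, never differentiated — can be sketched with plain multinomial resampling. Names and the toy potential are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def importance_resample(samples, potential, rng=random):
    """Resample candidate plans with weights proportional to exp(potential).

    High-potential (e.g. constraint-compatible) candidates survive with
    high probability; no gradients of `potential` are required, so it may
    come from a non-differentiable, non-convex optimization module.
    """
    weights = [math.exp(potential(s)) for s in samples]
    total = sum(weights)
    probs = [w / total for w in weights]
    resampled = []
    for _ in samples:  # multinomial resampling, one draw per slot
        r, acc = rng.random(), 0.0
        for s, p in zip(samples, probs):
            acc += p
            if r <= acc:
                resampled.append(s)
                break
        else:
            # Guard against floating-point shortfall in the cumulative sum.
            resampled.append(samples[-1])
    return resampled
```

With a potential that heavily penalizes unsafe plans, the resampled batch is dominated by safe ones while the batch size is preserved.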
KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection
- Authors: Andrea Rosasco, Federico Ceola, Giulia Pasquale, Lorenzo Natale
- Abstract: Learning robot policies that capture multimodality in the training data has been a long-standing open challenge for behavior cloning. Recent approaches tackle the problem by modeling the conditional action distribution with generative models. One of these approaches is Diffusion Policy, which relies on a diffusion model to denoise random points into robot action trajectories. While achieving state-of-the-art performance, it has two main drawbacks that may lead the robot out of the data distribution during policy execution. First, the stochasticity of the denoising process can strongly affect the quality of the generated action trajectory. Second, being a supervised learning approach, it can learn data outliers from the dataset used for training. Recent work focuses on mitigating these limitations by combining Diffusion Policy either with large-scale training or with classical behavior cloning algorithms. Instead, we propose KDPE, a Kernel Density Estimation-based strategy that filters out potentially harmful trajectories output by Diffusion Policy while keeping a low test-time computational overhead. For Kernel Density Estimation, we propose a manifold-aware kernel to model a probability density function for actions composed of end-effector Cartesian position, orientation, and gripper state. KDPE overall achieves better performance than Diffusion Policy on simulated single-arm RoboMimic and MimicGen tasks, and on three real robot experiments: PickPlush, a tabletop grasping task, CubeSort, a multimodal pick and place task, and CoffeeMaking, a task that requires long-horizon capabilities and precise execution. The code will be released upon acceptance and additional material is provided on our anonymized project page: https://kdpe-robotics.github.io.
- PDF: https://openreview.net/pdf?id=w0zDVjLscj
- Forum: https://openreview.net/forum?id=w0zDVjLscj
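KDPE's filtering idea — score each sampled trajectory by a kernel density estimate over the batch and keep high-density candidates — can be sketched with an isotropic Gaussian kernel. This is a simplification: the paper's manifold-aware kernel over positions, orientations, and gripper state is more involved, and the names below are illustrative.

```python
import math

def kde_score(x, samples, bandwidth):
    """Isotropic-Gaussian kernel density estimate of x over `samples`."""
    d = len(x)
    norm = (2 * math.pi * bandwidth ** 2) ** (-d / 2) / len(samples)
    total = 0.0
    for s in samples:
        sq = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq / (2 * bandwidth ** 2))
    return norm * total

def select_by_density(candidates, bandwidth=0.1):
    """Keep the candidate lying in the densest region of the batch,
    filtering out low-density (potentially harmful) outliers."""
    scores = [kde_score(c, candidates, bandwidth) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

On a batch of three clustered candidates plus one far outlier, the outlier's density comes almost entirely from its own kernel, so it is never selected.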
KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation
- Authors: Di Zhang, Chengbo Yuan, Chuan Wen, Hai Zhang, Junqiao Zhao, Yang Gao
- Abstract: Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisition of high-fidelity tactile data. To mitigate this issue, we propose KineDex, a hand-over-hand kinesthetic teaching paradigm in which the operator’s motion is directly transferred to the dexterous hand, enabling the collection of physically grounded demonstrations enriched with accurate tactile feedback. To resolve occlusions from the human hand, we apply an inpainting technique to preprocess the visual observations. Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. We evaluate KineDex on a suite of challenging contact-rich manipulation tasks, including particularly difficult scenarios such as squeezing toothpaste onto a toothbrush, which require precise multi-finger coordination and stable force regulation. Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. Comparative experiments with teleoperation and user studies further validate the advantages of KineDex in data collection efficiency and operability. Specifically, KineDex collects data over twice as fast as teleoperation across two tasks of varying difficulty, while maintaining a near-100% success rate, compared to under 50% for teleoperation.
- PDF: https://openreview.net/pdf?id=GKueYvjqSS
- Forum: https://openreview.net/forum?id=GKueYvjqSS
KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands
- Authors: Uksang Yoo, Jonathan Francis, Jean Oh, Jeffrey Ichnowski
- Abstract: Underactuated soft robot hands offer inherent safety and adaptability advantages over rigid systems, but developing dexterous manipulation skills remains challenging. While imitation learning shows promise for complex manipulation tasks, traditional approaches struggle with soft systems due to demonstration collection challenges and ineffective state representations. We present KineSoft, a framework enabling direct kinesthetic teaching of soft robotic hands by leveraging their natural compliance as a skill teaching advantage rather than only as a control challenge. KineSoft makes two key contributions: (1) an internal strain sensing array providing occlusion-free proprioceptive shape estimation, and (2) a shape-based imitation learning framework that uses proprioceptive feedback with a low-level shape-conditioned controller to ground diffusion-based policies. This enables human demonstrators to physically guide the robot while the system learns to associate proprioceptive patterns with successful manipulation strategies. We validate KineSoft through physical experiments, demonstrating superior shape estimation accuracy compared to baseline methods, precise shape-trajectory tracking, and higher task success rates compared to baseline imitation learning approaches. KineSoft’s results demonstrate that embracing the inherent properties of soft robots leads to intuitive and robust dexterous manipulation capabilities.
- PDF: https://openreview.net/pdf?id=PwKsCO6TAF
- Forum: https://openreview.net/forum?id=PwKsCO6TAF
KoopMotion: Learning Almost Divergence Free Koopman Flow Fields for Motion Planning
- Authors: Alice Kate Li, Thales C. Silva, Victoria Edwards, Vijay Kumar, M. Ani Hsieh
- Abstract: In this work, we propose a novel flow field-based motion planning method that drives a robot from any initial state to a desired reference trajectory such that it converges to the trajectory’s end point. Despite demonstrated efficacy in using Koopman operator theory for modeling dynamical systems, Koopman does not inherently enforce convergence to desired trajectories nor to specified goals, a requirement when learning from demonstrations (LfD). We present KoopMotion which represents motion flow fields as dynamical systems, parameterized by Koopman Operators, and leverages the divergence properties of the learnt flow fields to obtain smooth motion fields that converge to a desired reference trajectory when the robot is placed away from the desired trajectory, and tracks the trajectory until the end point. To demonstrate the effectiveness of our approach, we show evaluations of KoopMotion on the LASA human handwriting dataset, including spectral analysis. We also perform experiments on a physical robot, verifying KoopMotion on a miniature autonomous surface vehicle operating in a non-static fluid flow environment. Our approach is highly sample efficient in both space and time, requiring only 3% of the LASA dataset to generate dense motion plans. Additionally, KoopMotion provides a significant improvement over baselines when comparing metrics that measure spatial and temporal dynamics modeling efficacy.
- PDF: https://openreview.net/pdf?id=hg9YtHV8MJ
- Forum: https://openreview.net/forum?id=hg9YtHV8MJ
LaDi-WM: A Latent Diffusion-Based World Model for Predictive Manipulation
- Authors: Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, Kai Xu
- Abstract: Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance, by 27.9% on the LIBERO-LONG benchmark and by 20% in the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments. The full source code will be publicly available.
- PDF: https://openreview.net/pdf?id=o2w2iiMyEU
- Forum: https://openreview.net/forum?id=o2w2iiMyEU
Latent Adaptive Planner for Dynamic Manipulation
- Authors: Donghun Noh, Deqian Kong, Minglu Zhao, Andrew Lizarraga, Jianwen Xie, Ying Nian Wu, Dennis Hong
- Abstract: This paper presents Latent Adaptive Planner (LAP), a novel approach for dynamic nonprehensile manipulation tasks that formulates planning as latent space inference, effectively learned from human demonstration videos. Our method addresses key challenges in visuomotor policy learning through a principled variational replanning framework that maintains temporal consistency while efficiently adapting to environmental changes. LAP employs Bayesian updating in latent space to incrementally refine plans as new observations become available, striking an optimal balance between computational efficiency and real-time adaptability. We bridge the embodiment gap between humans and robots through model-based proportional mapping that regenerates accurate kinematic-dynamic joint states and object positions from human demonstrations. Experimental evaluations across multiple complex manipulation benchmarks demonstrate that LAP achieves state-of-the-art performance, outperforming existing approaches in success rate, trajectory smoothness, and energy efficiency, particularly in dynamic adaptation scenarios. Our approach enables robots to perform complex interactions with human-like adaptability while providing an expandable framework applicable to diverse robotic platforms using the same human demonstration videos.
- PDF: https://openreview.net/pdf?id=bmpDAqsJov
- Forum: https://openreview.net/forum?id=bmpDAqsJov
Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation
- Authors: Chengyang He, Gadiel Mark Sznaier Camps, Xu Liu, Mac Schwager, Guillaume Adrien Sartoretti
- Abstract: We present Latent Theory of Mind (LatentToM), a decentralized diffusion policy architecture for collaborative robot manipulation. Our policy allows multiple manipulators with their own perception and computation to collaborate with each other towards a common task goal with or without explicit communication. Our key innovation lies in allowing each agent to maintain two latent representations: an ego embedding specific to the robot, and a consensus embedding trained to be common to both robots, despite their different sensor streams and poses. We further let each robot train a decoder to infer the other robot’s ego embedding from their consensus embedding, akin to “theory of mind” in latent space. Training occurs centrally, with all the policies’ consensus encoders supervised by a loss inspired by sheaf theory, a mathematical theory for clustering data on a topological manifold. Specifically, we introduce a first-order cohomology loss to enforce sheaf-consistent alignment of the consensus embeddings. To preserve the expressiveness of the consensus embedding, we further propose structural constraints based on theory of mind and a directional consensus mechanism. Execution can be fully distributed, requiring no explicit communication between policies; in this case, information is exchanged implicitly through each robot’s sensor stream by observing the actions of the other robots and their effects on the scene. Alternatively, execution can leverage direct communication to share the robots’ consensus embeddings, where the embeddings are shared once during each inference step and are aligned using the sheaf Laplacian. While we tested our method using two manipulators, our approach can naturally be extended to an arbitrary number of agents. In our hardware experiments, LatentToM outperforms a naive decentralized diffusion baseline, and shows comparable performance with a state-of-the-art centralized diffusion policy for bi-manual manipulation. Additionally, we show that LatentToM is naturally robust to temporary robot failure or delays, while a centralized policy may fail.
- PDF: https://openreview.net/pdf?id=b24y5SENo5
- Forum: https://openreview.net/forum?id=b24y5SENo5
LaVA-Man: Learning Visual Action Representations for Robot Manipulation
- Authors: Chaoran Zhu, Hengyi Wang, Yik Lung Pang, Changjae Oh
- Abstract: Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model’s ability to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-action representations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the *Omni-Object Pick-and-Place* dataset, which consists of annotated robot tabletop manipulation episodes, including 180 object classes and 3,200 instances with corresponding textual instructions. This dataset enables the model to acquire diverse object priors and allows for a more comprehensive evaluation of its generalisation capability across object instances. Experimental results on five benchmarks, including both simulated and real-robot validations, demonstrate that our method outperforms prior art.
- PDF: https://openreview.net/pdf?id=1D6XYy6ofW
- Forum: https://openreview.net/forum?id=1D6XYy6ofW
Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online
- Authors: Yishu Li, Xinyi Mao, Ying Yuan, Kyutae Sim, Ben Eisner, David Held
- Abstract: We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines.
- PDF: https://openreview.net/pdf?id=iWMM4oxMBu
- Forum: https://openreview.net/forum?id=iWMM4oxMBu
Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation
- Authors: Peiyuan Zhi, Peiyang Li, Jianqin Yin, Baoxiong Jia, Siyuan Huang
- Abstract: Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on position or force control, overlooking their integration. In this work, we propose a unified policy for legged robots that jointly models force and position control, learned without reliance on force sensors. By simulating diverse combinations of active position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from the robot’s historical states and compensates for them through position and velocity adjustments. Such a policy enables a wide range of manipulation behaviors under varying combinations of force and position inputs, including position tracking, force application, force tracking, and compliant robot behaviors. Additionally, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal mobile manipulation platform and a humanoid validate the versatility and robustness of the proposed policy across diverse scenarios.
- PDF: https://openreview.net/pdf?id=MpJTyAqA0t
- Forum: https://openreview.net/forum?id=MpJTyAqA0t
Learning Deployable Locomotion Control via Differentiable Simulation
- Authors: Clemens Schwarke, Victor Klemm, Joshua Bagajo, Jean Pierre Sleiman, Ignat Georgiev, Jesus Tordesillas Torres, Marco Hutter
- Abstract: Differentiable simulators promise to improve sample efficiency in robot learning by providing analytic gradients of the system dynamics. Yet, their application to contact-rich tasks like locomotion is complicated by the inherently non-smooth nature of contact, impeding effective gradient-based optimization. Existing works thus often rely on soft contact models that provide smooth gradients but lack physical accuracy, constraining results to simulation. To address this limitation, we propose a differentiable contact model designed to provide informative gradients while maintaining high physical fidelity. We demonstrate the efficacy of our approach by training a quadrupedal locomotion policy within our differentiable simulator leveraging analytic gradients and successfully transferring the learned policy zero-shot to the real world. To the best of our knowledge, this represents the first successful sim-to-real transfer of a legged locomotion policy learned entirely within a differentiable simulator, establishing the feasibility of using differentiable simulation for real-world locomotion control.
- PDF: https://openreview.net/pdf?id=KHWIHwnYbn
- Forum: https://openreview.net/forum?id=KHWIHwnYbn
Learning from 10 Demos: Generalisable and Sample-Efficient Policy Learning with Oriented Affordance Frames
- Authors: Krishan Rana, Jad Abou-Chakra, Sourav Garg, Robert Lee, Ian Reid, Niko Suenderhauf
- Abstract: Imitation learning has unlocked the potential for robots to exhibit highly dexterous behaviours. However, it still struggles with long-horizon, multi-object tasks due to poor sample efficiency and limited generalisation. Existing methods require a substantial number of demonstrations to cover possible task variations, making them costly and often impractical for real-world deployment. We address this challenge by introducing *oriented affordance frames*, a structured representation for state and action spaces that improves spatial and intra-category generalisation and enables policies to be learned efficiently from only 10 demonstrations. More importantly, we show how this abstraction allows for compositional generalisation of independently trained sub-policies to solve long-horizon, multi-object tasks. To seamlessly transition between sub-policies, we introduce the notion of self-progress prediction, which we directly derive from the duration of the training demonstrations. We validate our method across three real-world tasks, each requiring multi-step, multi-object interactions. Despite the small dataset, our policies generalise robustly to unseen object appearances, geometries, and spatial arrangements, achieving high success rates without reliance on exhaustive training data. Video demonstration can be found on our anonymised project page: https://affordance-policy.github.io/.
- PDF: https://openreview.net/pdf?id=1K3kjo91Q1
- Forum: https://openreview.net/forum?id=1K3kjo91Q1
Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study
- Authors: Dongyun Kang, Gijeong Kim, JongHun Choe, Hajun Kim, Hae-Won Park
- Abstract: Dynamic rotational maneuvers, such as front flips, inherently involve large angular momentum generation and intense impact forces, presenting major challenges for reinforcement learning and sim-to-real transfer. In this work, we propose a general framework for learning and deploying impact-rich, rotation-intensive behaviors through centroidal velocity-based rewards and actuator-aware sim-to-real techniques. We identify that conventional link-level reward formulations fail to induce true whole-body rotation and introduce a centroidal angular velocity reward that accurately captures system-wide rotational dynamics. To bridge the sim-to-real gap under extreme conditions, we model motor operating regions (MOR) and apply transmission load regularization to ensure realistic torque commands and mechanical robustness. Using the one-leg hopper front flip as a representative case study, we demonstrate the first successful hardware realization of a full front flip. Our results highlight that incorporating centroidal dynamics and actuator constraints is critical for reliably executing highly dynamic motions.
- PDF: https://openreview.net/pdf?id=dmXFboqSnX
- Forum: https://openreview.net/forum?id=dmXFboqSnX
Learning Long-Context Diffusion Policies via Past-Token Prediction
- Authors: Marcel Torne Villasevil, Andy Tang, Yuejiang Liu, Chelsea Finn
- Abstract: Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these issues by truncating context length, discarding historical information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of past information. We first revisit the copycat problem in imitation learning and identify an opposite challenge in recent diffusion policies: rather than over-relying on prior actions, they often fail to capture essential dependencies between past and future actions. To address this, we introduce Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This regularization significantly improves temporal modeling in the policy head, with minimal reliance on visual representations. Building on this observation, we further introduce a multistage training strategy: pre-train the visual encoder with short contexts, and fine-tune the policy head using cached long-context embeddings. This strategy preserves the benefits of PTP while greatly reducing memory and computational overhead. Finally, we extend PTP into a self-verification mechanism at test time, enabling the policy to score and select candidates consistent with past actions during inference. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3× and accelerates policy training by more than 10×. Videos are available at https://ptp-robot.github.io.
- PDF: https://openreview.net/pdf?id=o0LBjJxUeS
- Forum: https://openreview.net/forum?id=o0LBjJxUeS
Learning Long-Horizon Robot Manipulation Skills via Privileged Action
- Authors: Xiaofeng Mao, Yucheng XU, Zhaole Sun, Elle Miller, Daniel Layeghi, Michael Mistry
- Abstract: Long-horizon contact-rich tasks are challenging to learn with reinforcement learning, due to ineffective exploration of high-dimensional state spaces with sparse rewards. The learning process often gets stuck in a local optimum and demands task-specific reward fine-tuning for complex scenarios. In this work, we propose a structured framework that leverages privileged actions with curriculum learning, enabling the policy to efficiently acquire long-horizon skills without relying on extensive reward engineering or reference trajectories. Specifically, we use privileged actions in simulation with a general training procedure that would be infeasible to implement in real-world scenarios. These privileges include relaxed constraints and virtual forces that enhance interaction and exploration with objects. Our results successfully achieve complex multi-stage long-horizon tasks that naturally combine non-prehensile manipulation with grasping to lift objects from non-graspable poses. We demonstrate generality by maintaining a parsimonious reward structure and showing convergence to diverse and robust behaviors across various environments. Our approach outperforms state-of-the-art methods in these tasks, converging to solutions where others fail.
- PDF: https://openreview.net/pdf?id=yOWUy97hmd
- Forum: https://openreview.net/forum?id=yOWUy97hmd
Learning Smooth State-Dependent Traversability from Dense Point Clouds
- Authors: Zihao Dong, Alan Papalia, Leonard Jung, Alenna Spiro, Philip R Osteen, Christa S. Robison, Michael Everett
- Abstract: A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle’s state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach-angle-conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts a risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.
- PDF: https://openreview.net/pdf?id=oYC10hiFua
- Forum: https://openreview.net/forum?id=oYC10hiFua
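The SPARTA abstract above relies on a standard construction: a truncated Fourier series over the 1-sphere, which is smooth and automatically periodic in the approach angle. A minimal sketch of that idea (the function name and coefficient values here are hypothetical, not from the paper):

```python
import math

def fourier_risk(theta, coeffs):
    """Evaluate a truncated Fourier series f(theta) = a0 + sum_k a_k cos(k*theta) + b_k sin(k*theta).

    Because every basis term is 2*pi-periodic, the predicted risk is smooth and
    well defined for any approach angle, and one coefficient set answers all queries.
    """
    a0, harmonics = coeffs
    out = a0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        out += a_k * math.cos(k * theta) + b_k * math.sin(k * theta)
    return out

# Hypothetical coefficients a network might output for one point-cloud query.
coeffs = (0.5, [(0.3, -0.1), (0.05, 0.2)])

# The same coefficients can be reused for any subsequent angle query,
# and the prediction is 2*pi-periodic by construction.
print(abs(fourier_risk(1.2, coeffs) - fourier_risk(1.2 + 2 * math.pi, coeffs)) < 1e-9)
```

This illustrates why predicting Fourier coefficients once per point cloud avoids the repeated model inference the abstract criticizes: subsequent angle queries are a few trigonometric evaluations, not network forward passes.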
Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation
- Authors: Rachel Luo, Heng Yang, Michael Watson, Apoorva Sharma, Sushant Veer, Edward Schmerling, Marco Pavone
- Abstract: Learning-based robotic systems demand rigorous validation to assure reliable performance, but extensive real‐world testing is often prohibitively expensive and, even when conducted, may still yield insufficient data for high-confidence guarantees. In this work, we introduce a general estimation framework that leverages paired data across test platforms, e.g., paired simulation and real‐world observations, to achieve better estimates of real-world metrics via the method of control variates. By incorporating cheap and abundant auxiliary measurements (for example, simulator outputs) as control variates for costly real‐world samples, our method provably reduces the variance of Monte Carlo estimates and thus requires significantly fewer real‐world samples to attain a specified confidence bound on the mean performance. We provide theoretical analysis characterizing the variance and sample-efficiency improvement, and demonstrate empirically in autonomous driving and quadruped robotics settings that our approach achieves high‐probability bounds with markedly reduced sample complexity. Our technique can lower the real‐world testing burden for validating the performance of the stack, thereby enabling more efficient and cost‐effective experimental evaluation of robotic systems.
- PDF: https://openreview.net/pdf?id=ejV8YHTpHR
- Forum: https://openreview.net/forum?id=ejV8YHTpHR
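The method of control variates referenced in the abstract above is a classical variance-reduction technique: subtract a correlated, cheap auxiliary metric (whose mean is known from abundant simulation) from the expensive real-world estimate. A minimal sketch with synthetic data (the function name and data-generating model are illustrative, not from the paper):

```python
import random
import statistics

def control_variate_estimate(real, sim, sim_mean):
    """Estimate mean(real) using a paired simulator metric as a control variate.

    sim_mean is the simulator metric's mean, assumed known (or cheaply
    estimated from abundant simulation-only rollouts).
    """
    n = len(real)
    ybar = sum(real) / n
    xbar = sum(sim) / n
    # Optimal coefficient c* = Cov(sim, real) / Var(sim), estimated from the pairs.
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(sim, real)) / (n - 1)
    c = cov / statistics.variance(sim)
    return ybar - c * (xbar - sim_mean)

random.seed(0)
naive, cv = [], []
for _ in range(2000):
    # Paired observations: the real metric correlates strongly with the sim metric.
    sim = [random.gauss(0.0, 1.0) for _ in range(30)]
    real = [1.0 + 0.9 * x + random.gauss(0.0, 0.3) for x in sim]
    naive.append(sum(real) / len(real))
    cv.append(control_variate_estimate(real, sim, sim_mean=0.0))

# With highly correlated pairs, the control-variate estimator has much
# lower variance than the naive sample mean over the same 30 real samples.
print(statistics.pvariance(cv) < statistics.pvariance(naive))
```

The stronger the sim-real correlation, the larger the variance reduction, which is why fewer real-world samples suffice for the same confidence bound.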
LLM-Guided Probabilistic Program Induction for POMDP Model Estimation
- Authors: Aidan Curtis, Hao Tang, Thiago Veloso, Kevin Ellis, Joshua B. Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling
- Abstract: Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.
- PDF: https://openreview.net/pdf?id=R5Y7Sr1DIe
- Forum: https://openreview.net/forum?id=R5Y7Sr1DIe
LocoFormer: Generalist Locomotion via Long-context Adaptation
- Authors: Min Liu, Deepak Pathak, Ananye Agarwal
- Abstract: Humans and animals exhibit flexible locomotion strategies, such as learning to walk within minutes, and efficient adaptation to changes in morphology. In contrast, modern locomotion controllers are manually tuned for specific embodiments. In this paper, we present LocoFormer, a generalist policy that can control previously unseen legged and wheeled robots, even without precise knowledge of their kinematics. LocoFormer is able to adapt to changes in morphology and dynamics at test time. We find that two key choices enable adaptation. First, we train massive-scale RL on procedurally generated robots with aggressive domain randomization. Second, in contrast to previous policies that are myopic with short context lengths, we extend context by orders of magnitude to span episode boundaries. We deploy the same LocoFormer to varied robots, and show robust control even with large disturbances such as weight and motor failures. In extreme scenarios, we see emergent adaptation across episodes: LocoFormer learns from falls in early episodes to improve control strategies in later ones. We believe this simple yet general recipe can be used to train foundation models for other robotic skills in the future. Videos at generalist-locomotion.github.io.
- PDF: https://openreview.net/pdf?id=VqmAvBkFhw
- Forum: https://openreview.net/forum?id=VqmAvBkFhw
LocoTouch: Learning Dynamic Quadrupedal Transport with Tactile Sensing
- Authors: Changyi Lin, Yuxin Ray Song, Boda Huo, Mingyang Yu, Yikai Wang, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Yiyue Luo, Ding Zhao
- Abstract: Quadrupedal robots have demonstrated remarkable agility and robustness in traversing complex terrains. However, they struggle with dynamic object interactions, where contact must be precisely sensed and controlled. To bridge this gap, we present LocoTouch, a system that equips quadrupedal robots with tactile sensing to address a particularly challenging task in this category: long-distance transport of unsecured cylindrical objects, which typically requires custom mounting or fastening mechanisms to maintain stability. For efficient large-area tactile sensing, we design a high-density distributed tactile sensor that covers the entire back of the robot. To effectively leverage tactile feedback for robot control, we develop a simulation environment with high-fidelity tactile signals, and train tactile-aware transport policies using a two-stage learning pipeline. Furthermore, we design a novel reward function to promote robust, symmetric, and frequency-adaptive locomotion gaits. After training in simulation, LocoTouch transfers zero-shot to the real world, reliably transporting a wide range of unsecured cylindrical objects with diverse sizes, weights, and surface properties. Moreover, it remains robust over long distances, on uneven terrain, and under severe perturbations.
- PDF: https://openreview.net/pdf?id=XaJkbK02Vm
- Forum: https://openreview.net/forum?id=XaJkbK02Vm
LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations
- Authors: Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su
- Abstract: Developing robotic systems capable of robustly executing long-horizon manipulation tasks with human-level dexterity is challenging, as such tasks require both physical dexterity and seamless sequencing of manipulation skills while robustly handling environment variations. While imitation learning offers a promising approach, acquiring comprehensive datasets is resource-intensive. In this work, we propose a learning framework and system LodeStar that automatically decomposes task demonstrations into semantically meaningful skills using off-the-shelf foundation models, and generates diverse synthetic demonstration datasets from a few human demos through reinforcement learning. These sim-augmented datasets enable robust skill training, with a Skill Routing Transformer (SRT) policy effectively chaining the learned skills together to execute complex long-horizon manipulation tasks. Experimental evaluations on three challenging real-world long-horizon dexterous manipulation tasks demonstrate that our approach significantly improves task performance and robustness compared to previous baselines. Videos are available at lodestar-robot.github.io.
- PDF: https://openreview.net/pdf?id=6yB6AX8aSU
- Forum: https://openreview.net/forum?id=6yB6AX8aSU
Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps
- Authors: Matt Schmittle, Rohan Baijal, Nathan Hatch, Rosario Scalise, Mateo Guaman Castro, Sidharth Talia, Khimya Khetarpal, Byron Boots, Siddhartha Srinivasa
- Abstract: A robot navigating an outdoor environment with no prior knowledge of the space must rely on its local sensing, which is in the form of a local metric map or local policy with some fixed horizon. A limited planning horizon can often result in myopic decisions leading the robot off course or worse, into very difficult terrain. In this work, we make a key observation that long range navigation only necessitates identifying good frontier directions for planning instead of full map knowledge. To address this, we introduce Long Range Navigator (LRN), which learns to predict ‘affordable’ frontier directions from high-dimensional camera images. LRN is trained entirely on unlabeled egocentric videos, making it scalable and adaptable. In off-road tests on Spot and a large vehicle, LRN reduces human interventions and improves decision speed when integrated into existing navigation stacks.
- PDF: https://openreview.net/pdf?id=QtVZUPCKrY
- Forum: https://openreview.net/forum?id=QtVZUPCKrY
Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
- Authors: Yiguo Fan, Shuanghao Bai, Xinyang Tong, Pengxiang Ding, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang
- Abstract: Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
- PDF: https://openreview.net/pdf?id=irh5o90Mj1
- Forum: https://openreview.net/forum?id=irh5o90Mj1
Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
- Authors: Yajvan Ravan, Adam Rashid, Alan Yu, Kai McClennen, Gio Huh, Kevin Yang, Zhutian Yang, Qinxi Yu, Xiaolong Wang, Phillip Isola, Ge Yang
- Abstract: We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking data to train real-world robot systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with on-device human-to-robot pose retargeting, which are further amplified by a physics-guided video generation pipeline commandable with natural language specifications. We demonstrate zero-shot sim-to-real transfer of robot visual policies, trained entirely on Lucid-XR’s synthetic data, across bimanual and dexterous manipulation tasks that involve flexible materials, adhesive interaction between particles, and rigid body contact.
- PDF: https://openreview.net/pdf?id=3p7rTnLJM8
- Forum: https://openreview.net/forum?id=3p7rTnLJM8
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
- Authors: Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, Daniel Seita
- Abstract: Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants to test different model sizes. The performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. It also shows that there remains a significant gap between these models and human-level understanding.
- PDF: https://openreview.net/pdf?id=MCC3MG2aRH
- Forum: https://openreview.net/forum?id=MCC3MG2aRH
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
- Authors: Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, Dieter Fox
- Abstract: Generative models based on flow matching offer significant potential for learning robot policies, particularly in generating high-dimensional, dexterous behaviors that are conditioned on diverse observations. In this work, we introduce ManiFlow, an advanced flow matching model specifically designed to support dexterous manipulation tasks. ManiFlow improves over flow matching both in the learning procedure and in the model architecture, resulting in better robustness and efficacy. It consistently exhibits strong generalization capabilities, outperforming existing state-of-the-art robot learning methods on a wide range of benchmarks. We also demonstrate the powerful capabilities of ManiFlow in solving complex bimanual dexterous manipulation challenges.
- PDF: https://openreview.net/pdf?id=etSYDtRO0Z
- Forum: https://openreview.net/forum?id=etSYDtRO0Z
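ManiFlow builds on flow matching; as background, the core regression objective of (rectified) flow matching can be sketched as follows. The linear interpolation scheme and the `model(x_t, t)` velocity-network interface are illustrative assumptions, not ManiFlow's actual training procedure or architecture.

```python
import numpy as np

def flow_matching_loss(model, x0, x1):
    """Minimal rectified-flow-matching loss: sample a time t, interpolate
    x_t = (1 - t) x0 + t x1, and regress the model's predicted velocity
    toward the constant target x1 - x0."""
    t = np.random.default_rng(0).uniform(size=(x0.shape[0], 1))
    x_t = (1 - t) * x0 + t * x1          # point on the straight path x0 -> x1
    v_target = x1 - x0                   # velocity of that path is constant
    v_pred = model(x_t, t)               # placeholder for any velocity network
    return float(np.mean((v_pred - v_target) ** 2))
```

A perfect model that always outputs the true path velocity drives this loss to zero, which is what makes the objective a simple supervised regression.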
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
- Authors: Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang
- Abstract: In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level controllers, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme.
- PDF: https://openreview.net/pdf?id=23FdMTxEh7
- Forum: https://openreview.net/forum?id=23FdMTxEh7
Mechanistic Interpretability for Steering Vision-Language-Action Models
- Authors: Bear Häon, Kaylene Caswell Stocking, Ian Chuang, Claire Tomlin
- Abstract: Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control—establishing a new paradigm for transparent and steerable foundation models in robotics.
- PDF: https://openreview.net/pdf?id=YvsUD8C9QS
- Forum: https://openreview.net/forum?id=YvsUD8C9QS
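The activation-steering idea above can be sketched in a few lines. The paper derives semantic directions by projecting feedforward activations onto the token-embedding basis; here `direction` is simply any given vector, and the layer choice and scaling are assumptions for illustration.

```python
import numpy as np

def steer(activations, direction, alpha=2.0):
    """Add a scaled semantic direction to a layer's activations at
    inference time (no fine-tuning, reward, or environment interaction)."""
    d = direction / np.linalg.norm(direction)  # unit semantic direction
    return activations + alpha * d             # shifted hidden state
```

In practice this would be applied via a forward hook on a chosen transformer layer, with `alpha` controlling the strength of the behavioral shift (e.g. faster or slower motion).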
Meta-Optimization and Program Search using Language Models for Task and Motion Planning
- Authors: Denis Shcherba, Eckart Cobo-Briesewitz, Cornelius V. Braun, Marc Toussaint
- Abstract: Intelligent interaction with the real world requires robotic agents to jointly reason over high-level plans and low-level controls. This requirement is formalized in the task and motion planning (TAMP) problem, in which symbolic planning and continuous trajectory generation must be solved in a coordinated manner. Recently, foundation model-based approaches to TAMP have presented impressive results, including fast planning times and the execution of natural language instructions. Yet, the optimal interface between high-level plan and low-level motion generation remains to be found: prior approaches are limited by either too much abstraction (e.g., chaining simplified skill primitives) or a lack thereof (e.g., direct joint angle prediction). Our method introduces a novel technique employing a form of meta-optimization to address these shortcomings by: (i) using program search over trajectory optimization problems as an interface between foundation model and robot controllers, and (ii) leveraging a zero-order method to optimize numerical values in the foundation model output. Results on challenging object manipulation and drawing tasks confirm that our proposed method improves over prior TAMP approaches.
- PDF: https://openreview.net/pdf?id=1n1Liq6So4
- Forum: https://openreview.net/forum?id=1n1Liq6So4
MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
- Authors: Yuxin Chen, Chen Tang, Jianglan Wei, Chenran Li, Thomas Tian, Xiang Zhang, Wei Zhan, Peter Stone, Masayoshi Tomizuka
- Abstract: Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy’s execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning, designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.
- PDF: https://openreview.net/pdf?id=iQQy1BKlGv
- Forum: https://openreview.net/forum?id=iQQy1BKlGv
MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
- Authors: Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang
- Abstract: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with 3D functional keypoints, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc’s one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects.
- PDF: https://openreview.net/pdf?id=CBYGhryESq
- Forum: https://openreview.net/forum?id=CBYGhryESq
MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs
- Authors: Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic
- Abstract: Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras, enabling impressive visuomotor performance. However, it remains constrained by the cost of collecting sufficiently diverse demonstrations, especially for generalizing across workspace variations. We propose MirrorDuo, a mirroring-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free." It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or just 5 demonstrations in the target arrangement.
- PDF: https://openreview.net/pdf?id=cUeY476ohd
- Forum: https://openreview.net/forum?id=cUeY476ohd
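The action side of the mirroring above can be sketched for 6-DoF end-effector poses. The mirror plane (here x = 0) and the rotation-matrix pose representation are assumptions; MirrorDuo's exact frame conventions and the image/proprioception mirroring are not shown.

```python
import numpy as np

# Reflection across the x = 0 plane (choice of plane is an assumption).
M = np.diag([-1.0, 1.0, 1.0])

def mirror_pose(R, t):
    """Mirror a pose (rotation matrix R, translation t): M R M is again a
    proper rotation since det(M)^2 = 1, and M t reflects the position."""
    return M @ R @ M, M @ t

def mirror_demo(demo):
    """Generate the mirrored counterpart of a trajectory of (R, t) actions."""
    return [mirror_pose(R, t) for R, t in demo]
```

The key consistency property: if the original pose maps a point p to R p + t, the mirrored pose maps the reflected point M p to the reflection of the result, so mirrored observations and mirrored actions stay geometrically aligned.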
Mobi-$\pi$: Mobilizing Your Robot Learning Policy
- Authors: Jingyun Yang, Isabella Huang, Brandon Vu, Max Bajracharya, Rika Antonova, Jeannette Bohg
- Abstract: Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the "policy mobilization" problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. With that, our formulation is still compatible with any approach that improves manipulation policy robustness. To study policy mobilization, we introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot’s base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes a 3D Gaussian Splatting model for novel viewpoint synthesis, a score function to evaluate pose suitability, as well as sampling-based optimization to identify optimal robot poses. We show that our approach on average outperforms the best baseline by 7.65$\times$ in simulation and 2.38$\times$ in the real world, demonstrating its effectiveness for policy mobilization.
- PDF: https://openreview.net/pdf?id=LnryWopsfJ
- Forum: https://openreview.net/forum?id=LnryWopsfJ
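The sampling-based pose optimization mentioned above can be sketched as follows. In the paper the score comes from evaluating 3D-Gaussian-Splatting-rendered viewpoints against the policy's training distribution; here `score_fn` is any callable mapping a pose to a scalar, and the (x, y, yaw) parameterization, bounds, and sample count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_base_pose(score_fn, n_samples=256,
                     bounds=((-2, 2), (-2, 2), (-np.pi, np.pi))):
    """Sample candidate (x, y, yaw) base poses uniformly within bounds and
    return the one with the highest suitability score."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    candidates = rng.uniform(lo, hi, size=(n_samples, 3))
    scores = np.array([score_fn(p) for p in candidates])
    return candidates[int(np.argmax(scores))]
```

More elaborate schemes (e.g. cross-entropy-method refinement around top candidates) drop in by replacing the single uniform sampling round.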
Motion Blender Gaussian Splatting for Dynamic Reconstruction
- Authors: Xinyu Zhang, Haonan Chang, Yuhan Liu, Abdeslam Boularias
- Abstract: Gaussian splatting has emerged as a powerful tool for high-fidelity reconstruction of dynamic scenes. However, existing methods primarily rely on implicit motion representations, such as encoding motions into neural networks or per-Gaussian parameters, which makes it difficult to further manipulate the reconstructed motions. This lack of explicit controllability limits existing methods to replaying recorded motions only, which hinders a wider application. To address this, we propose Motion Blender Gaussian Splatting (MB-GS), a novel framework that uses motion graph as an explicit and sparse motion representation. The motion of graph links is propagated to individual Gaussians via dual quaternion skinning, with learnable weight painting functions determining the influence of each link. The motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MB-GS achieves state-of-the-art performance on the iPhone dataset while being competitive on HyperNeRF. Additionally, we demonstrate the application potential of our method in animating novel object motions, synthesizing robot demonstrations through motion editing, and predicting robot actions through visual planning.
- PDF: https://openreview.net/pdf?id=4Po2mqLjrQ
- Forum: https://openreview.net/forum?id=4Po2mqLjrQ
Motion Priors Reimagined: Adapting Flat-Terrain Skills for Complex Quadruped Mobility
- Authors: Zewei Zhang, Chenhao Li, Takahiro Miki, Marco Hutter
- Abstract: Reinforcement learning (RL)-based legged locomotion controllers often require meticulous reward tuning to track velocities or goal positions while preserving smooth motion on various terrains. Motion imitation methods via RL using demonstration data reduce reward engineering but fail to generalize to novel environments. We address this by proposing a hierarchical RL framework in which a low-level policy is first pre-trained to imitate animal motions on flat ground, thereby establishing motion priors. A subsequent high-level, goal-conditioned policy then builds on these priors, learning residual corrections that enable perceptive locomotion, local obstacle avoidance, and goal-directed navigation across diverse and rugged terrains. Simulation experiments illustrate the effectiveness of learned residuals in adapting to progressively challenging uneven terrains while still preserving the locomotion characteristics provided by the motion priors. Furthermore, our results demonstrate improvements in motion regularization over baseline models trained without motion priors under similar reward setups. Real-world experiments with an ANYmal-D quadruped robot confirm our policy’s capability to generalize animal-like locomotion skills to complex terrains, demonstrating smooth and efficient locomotion and local navigation performance amidst challenging terrains with obstacles.
- PDF: https://openreview.net/pdf?id=JXBm4Xfrvj
- Forum: https://openreview.net/forum?id=JXBm4Xfrvj
MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation
- Authors: Zhenyu Wu, Angyuan Ma, Xiuwei Xu, Hang Yin, Yinan Liang, Ziwei Wang, Jiwen Lu, Haibin Yan
- Abstract: Mobile manipulation is the fundamental challenge for robotics in assisting humans with diverse tasks and environments in everyday life. Conventional mobile manipulation approaches often struggle to generalize across different tasks and environments due to the lack of large-scale training. However, recent advances in manipulation foundation models demonstrate impressive generalization capability on a wide range of fixed-base manipulation tasks, which are still limited to a fixed setting. Therefore, we devise a plug-in module named MoTo, which can be combined with any off-the-shelf manipulation foundation model to empower them with mobile manipulation ability. Specifically, we propose an interaction-aware navigation policy to generate agent docking points for generalized mobile manipulation. To enable zero-shot ability, we propose an interaction keypoints framework via vision-language models (VLM) under multi-view consistency for both target object and robotic arm following instructions, where fixed-base manipulation foundation models can be employed. We further propose motion planning objectives for the mobile base and robot arm, which minimize the distance between the two keypoints and maintain the physical feasibility of trajectories. In this way, MoTo guides the agent to move to the docking points where fixed-base manipulation can be successfully performed, and leverages VLM generation and trajectory optimization to achieve mobile manipulation in a zero-shot manner, without any requirement on mobile manipulation expert data. Extensive experimental results on OVMM and in real-world settings demonstrate that MoTo achieves success rates 2.68% and 16.67% higher than the state-of-the-art mobile manipulation methods, respectively, without requiring additional training data.
- PDF: https://openreview.net/pdf?id=Th1kFSnjUW
- Forum: https://openreview.net/forum?id=Th1kFSnjUW
Multi-critic Learning for Whole-body End-effector Twist Tracking
- Authors: Aravind Elanjimattathil Vijayan, Andrei Cramariuc, Mattia Risiglione, Christian Gehring, Marco Hutter
- Abstract: Learning whole-body control for locomotion and arm motions in a single policy has challenges, as the two tasks have conflicting goals. For instance, efficient locomotion typically favors a horizontal base orientation, while end-effector tracking may benefit from base tilting to extend reachability. Additionally, current Reinforcement Learning (RL) approaches using a pose-based task specification lack the ability to directly control the end-effector velocity, making smoothly executing trajectories very challenging. To address these limitations, we propose an RL-based framework that allows for dynamic, velocity-aware whole-body end-effector control. Our method introduces a multi-critic actor architecture that decouples the reward signals for locomotion and manipulation, simplifying reward tuning and allowing the policy to resolve task conflicts more effectively. Furthermore, we design a twist-based end-effector task formulation that can track both discrete poses and motion trajectories. We validate our approach through a set of simulation and hardware experiments using a quadruped robot equipped with a robotic arm. The resulting controller can simultaneously walk and move its end-effector and shows emergent whole-body behaviors, where the base assists the arm in extending the workspace, despite a lack of explicit formulations.
- PDF: https://openreview.net/pdf?id=sVWKm4UiTL
- Forum: https://openreview.net/forum?id=sVWKm4UiTL
Multi-Loco: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion
- Authors: Shunpeng Yang, Zhen Fu, Zhefeng Cao, Guo Junde, Patrick Wensing, Wei Zhang, Hua Chen
- Abstract: Generalizing locomotion policies across diverse legged robots with varying morphologies is a key challenge due to differences in observation/action dimensions and system dynamics. In this work, we propose *Multi-Loco*, a novel unified framework combining a morphology-agnostic generative diffusion model with a lightweight residual policy optimized via reinforcement learning (RL). The diffusion model captures morphology-invariant locomotion patterns from diverse cross-embodiment datasets, improving generalization and robustness. The residual policy is shared across all embodiments and refines the actions generated by the diffusion model, enhancing task-aware performance and robustness for real-world deployment. We evaluated our method with a rich library of four legged robots in both simulation and real-world experiments. Compared to a standard RL framework with PPO, our approach - replacing the Gaussian policy with a diffusion model and residual term - achieves a 10.35% average return improvement, with gains up to 13.57% in wheeled-biped locomotion tasks. These results highlight the benefits of cross-embodiment data and composite generative architectures in learning robust, generalized locomotion skills.
- PDF: https://openreview.net/pdf?id=ypDETG94BS
- Forum: https://openreview.net/forum?id=ypDETG94BS
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
- Authors: Jiaqi Cheng, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao, Guohua Wu, Guillaume Adrien Sartoretti
- Abstract: Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
- PDF: https://openreview.net/pdf?id=r29CIl3ePP
- Forum: https://openreview.net/forum?id=r29CIl3ePP
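The "coordinate-based image builder" idea above can be sketched minimally: node coordinates are binned into a spatial grid so that convolutional layers can see the problem's geometry. The single-channel layout, the encoding of cluster membership as pixel intensity, and the assumption that coordinates lie in [0, 1]^2 are all illustrative, not the paper's exact design.

```python
import numpy as np

def build_coord_image(coords, cluster_ids, size=64):
    """Rasterize GTSP node coordinates into a size x size image.
    Each occupied pixel stores (cluster_id + 1) so empty cells stay 0."""
    img = np.zeros((size, size), dtype=np.float32)
    coords = np.asarray(coords, dtype=np.float64)
    px = np.clip((coords * size).astype(int), 0, size - 1)  # bin to pixels
    for (x, y), c in zip(px, cluster_ids):
        img[y, x] = c + 1
    return img
```

An adaptive resolution scaling strategy, as the abstract describes, would vary `size` with the instance scale so that nodes remain distinguishable in dense instances.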
Neural Robot Dynamics
- Authors: Jie Xu, Eric Heiden, Iretiayo Akinola, Dieter Fox, Miles Macklin, Yashraj Narang
- Abstract: Accurate and efficient simulation of modern robots remains challenging due to their high degrees of freedom and intricate mechanisms. Neural simulators have emerged as a promising alternative to traditional analytical simulators, capable of efficiently predicting complex dynamics and adapting to real-world data; however, existing neural simulators typically require application-specific training and fail to generalize to novel tasks and/or environments, primarily due to inadequate representations of the global state. In this work, we address the problem of learning generalizable neural simulators for robots that are structured as articulated rigid bodies. We propose NeRD (Neural Robot Dynamics), learned robot-specific dynamics models for predicting future states for articulated rigid bodies under contact constraints. NeRD uniquely replaces the low-level dynamics and contact solvers in an analytical simulator and employs a robot-centric and spatially-invariant simulation state representation. We integrate the learned NeRD models as an interchangeable backend solver within a state-of-the-art robotics simulator. We conduct extensive experiments to show that the NeRD simulators are stable and accurate over a thousand simulation steps; generalize across tasks and environment configurations; enable policy learning exclusively in a neural engine; and, unlike most classical simulators, can be fine-tuned from real-world data to bridge the gap between simulation and reality.
- PDF: https://openreview.net/pdf?id=HqqyJ9A2fy
- Forum: https://openreview.net/forum?id=HqqyJ9A2fy
NeuralSVCD for Efficient Swept Volume Collision Detection
- Authors: Hojin Jung, Dongwon Son, Beomjoon Kim
- Abstract: Robot manipulation in unstructured environments requires efficient and reliable Swept Volume Collision Detection (SVCD) for safe motion planning. Traditional discrete methods check for collisions only at sampled configurations along a trajectory and can miss collisions between those samples, whereas SVCD continuously checks for collisions along the entire trajectory. Existing SVCD methods typically face a trade-off between efficiency and accuracy, limiting practical use. In this paper, we introduce NeuralSVCD, a novel neural encoder-decoder architecture tailored to overcome this trade-off. Our approach leverages shape locality and temporal locality through distributed geometric representations and temporal optimization. This enhances computational efficiency without sacrificing accuracy. Comprehensive experiments show that NeuralSVCD consistently outperforms existing state-of-the-art SVCD methods in terms of both collision detection accuracy and computational efficiency, demonstrating its robust applicability across diverse robotic manipulation scenarios. Code and videos are available at https://neuralsvcd.github.io/.
- PDF: https://openreview.net/pdf?id=2xvxn3Hm3n
- Forum: https://openreview.net/forum?id=2xvxn3Hm3n
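The discrete-sampling failure mode that motivates SVCD is easy to demonstrate. The sketch below is a baseline discrete checker, not NeuralSVCD's method; `in_collision` stands in for any per-configuration collision query, and straight-line interpolation in configuration space is an assumption.

```python
import numpy as np

def discrete_collision_check(config_a, config_b, in_collision, n_samples=10):
    """Check collisions only at interpolated waypoints between two
    configurations. A thin obstacle lying between two consecutive
    samples is silently missed, which is the gap SVCD closes."""
    for s in np.linspace(0.0, 1.0, n_samples):
        q = (1 - s) * config_a + s * config_b
        if in_collision(q):
            return True
    return False
```

Densifying `n_samples` shrinks the gaps but raises cost, which is exactly the efficiency/accuracy trade-off the abstract refers to.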
Non-conflicting Energy Minimization in Reinforcement Learning based Robot Control
- Authors: Skand Peri, Akhil Perincherry, Bikram Pandit, Stefan Lee
- Abstract: Efficient robot locomotion often requires balancing task performance with energy expenditure. A common approach in reinforcement learning (RL) is to penalize energy use directly in the reward function. This requires carefully weighting the reward terms to avoid undesirable trade-offs where energy minimization harms task success or vice versa. In this work, we propose a hyperparameter-free gradient optimization method to minimize energy without conflicting with task performance. Inspired by recent works in multitask learning, our method applies policy gradient projection between task and energy objectives to promote non-conflicting updates. We evaluate this technique on standard locomotion benchmarks of DM-Control and HumanoidBench and demonstrate a reduction of $64$% energy usage while maintaining comparable task performance. Further, we conduct experiments on a Unitree GO2 quadruped showcasing Sim2Real transfer of energy efficient policies. Our method is easy to implement in standard RL pipelines with minimal code changes, and offers a principled alternative to reward shaping for energy efficient control policies.
- PDF: https://openreview.net/pdf?id=kUA2ec94LI
- Forum: https://openreview.net/forum?id=kUA2ec94LI
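The non-conflicting update above can be sketched with a PCGrad-style projection rule; the exact projection used in the paper is an assumption here, as is reducing each objective to a single flat gradient vector.

```python
import numpy as np

def project_nonconflicting(g_task, g_energy):
    """Combine task and energy gradients without letting energy
    minimization harm the task: if the gradients conflict (negative
    inner product), strip from g_energy its component along g_task."""
    dot = g_task @ g_energy
    if dot < 0:
        g_energy = g_energy - (dot / (g_task @ g_task)) * g_task
    return g_task + g_energy
```

By construction the combined update never has a negative inner product with the task gradient, so the energy term can only push in task-neutral directions; no trade-off weight needs tuning.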
O$^3$Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation
- Authors: Tongxuan Tian, Xuhui Kang, Yen-Ling Kuo
- Abstract: Grounding object affordance is fundamental to robotic manipulation as it establishes the critical link between perception and action among interacting objects. However, prior works predominantly focus on predicting single-object affordance, overlooking the fact that most real-world interactions involve relationships between pairs of objects. In this work, we address the challenge of object-to-object affordance grounding under limited data. Inspired by recent advances in few-shot learning with 2D vision foundation models, we propose a novel one-shot 3D object-to-object affordance learning approach for robotic manipulation. Semantic features from vision foundation models combined with point cloud representation for geometric understanding enable our one-shot learning pipeline to generalize effectively to novel objects and categories. We further integrate our 3D affordance representation with large language models (LLMs) for optimization-based motion planning, significantly enhancing LLMs’ capability to comprehend and reason about object interactions when generating task-specific constraint functions. Our experiments on 3D object-to-object affordance grounding and robotic manipulation demonstrate that our O$^3$Afford significantly outperforms existing baselines in terms of both accuracy and generalization capability.
- PDF: https://openreview.net/pdf?id=rJRFFDVTnf
- Forum: https://openreview.net/forum?id=rJRFFDVTnf
ObjectReact: Learning Object-Relative Control for Visual Navigation
- Authors: Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid
- Abstract: Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an “image-relative” approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent’s pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning “object-relative” control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a “relative” 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed “ObjectReact”, conditioned directly on a high-level “WayObject Costmap” representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments.
- PDF: https://openreview.net/pdf?id=thVTNoJ4Lx
- Forum: https://openreview.net/forum?id=thVTNoJ4Lx
Omni-Perception: Omnidirectional Collision Avoidance of Legged Robots in Dynamic Environments
- Authors: Zifan Wang, Teli Ma, Yufei Jia, Xun Yang, Jiaming Zhou, Wenlong OUYANG, Qiang Zhang, Junwei Liang
- Abstract: Agile locomotion in complex 3D environments requires robust spatial awareness to safely avoid diverse obstacles such as aerial clutter, uneven terrain, and dynamic agents. Depth-based perception approaches often struggle with sensor noise, lighting variability, computational overhead from intermediate representations (e.g., elevation maps), and difficulties with non-planar obstacles, limiting performance in unstructured environments. In contrast, direct integration of LiDAR sensing into end-to-end learning for legged locomotion remains underexplored. We propose Omni-Perception, an end-to-end locomotion policy that achieves 3D spatial awareness and omnidirectional collision avoidance by directly processing raw LiDAR point clouds. At its core is PD-RiskNet (Proximal-Distal Risk-Aware Hierarchical Network), a novel perception module that interprets spatio-temporal LiDAR data for environmental risk assessment. To facilitate efficient policy learning, we develop a high-fidelity LiDAR simulation toolkit with realistic noise modeling and fast raycasting, compatible with platforms such as Isaac Gym, Genesis, and MuJoCo, enabling scalable training and effective sim-to-real transfer. Learning reactive control policies directly from raw LiDAR data enables the robot to navigate complex environments with static and dynamic obstacles more robustly than approaches relying on intermediate maps or limited sensing. We validate Omni-Perception through real-world experiments and extensive simulation, demonstrating strong omnidirectional avoidance capabilities and superior locomotion performance in highly dynamic environments. We will open-source our code and models.
- PDF: https://openreview.net/pdf?id=KUSYJIlKor
- Forum: https://openreview.net/forum?id=KUSYJIlKor
One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies
- Authors: Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song
- Abstract: Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot’s initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.
- PDF: https://openreview.net/pdf?id=Hu3NoPMAg4
- Forum: https://openreview.net/forum?id=Hu3NoPMAg4
One View, Many Worlds: Single-Image to 3D object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation
- Authors: Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, Hao Zhao
- Abstract: Estimating the 6D pose of arbitrary objects from a single reference image is a critical yet challenging task in robotics, especially considering the long-tail distribution of real-world instances. While category-level and model-based approaches have achieved notable progress, they remain limited in generalizing to unseen objects under one-shot settings. In this work, we propose a novel pipeline for fast and accurate one-shot 6D pose and scale estimation. Leveraging recent advances in single-view 3D generation, we first build high-fidelity textured meshes without requiring known object poses. To resolve scale ambiguity, we introduce a coarse-to-fine alignment module that estimates both object size and initial pose by matching 2D-3D features with depth information. We then generate a diversified set of plausible 3D models using text-guided generative augmentation and render them with Blender to synthesize large-scale, domain-randomized training data for pose estimation. This synthetic data bridges the domain gap and enables robust fine-tuning of pose estimators. Our method achieves state-of-the-art results on several 6D pose benchmarks, and we further validate its effectiveness on a newly collected in-the-wild dataset. Finally, we integrate our system with a dexterous hand, demonstrating its robustness in real-world robotic grasping tasks. All code, data, and models will be released to foster future research.
- PDF: https://openreview.net/pdf?id=kto4zVmo4w
- Forum: https://openreview.net/forum?id=kto4zVmo4w
OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion
- Authors: Shuhao Kang, Youqi Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers
- Abstract: LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components. First, a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning. Second, an adaptive radial fusion module that dynamically consolidates radial features into discriminative global descriptors. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate OPAL’s superiority, achieving 15.98% higher recall at the 1 m threshold for top-1 retrieved matches, along with 12x faster inference speed compared to the state-of-the-art approach. Code and datasets will be publicly available.
- PDF: https://openreview.net/pdf?id=DKXx17oaUf
- Forum: https://openreview.net/forum?id=DKXx17oaUf
ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation
- Authors: Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, Mac Schwager
- Abstract: 3D world models (i.e., learning-based 3D dynamics models) offer a promising approach to generalizable robotic manipulation by capturing the underlying physics of environment evolution conditioned on robot actions. However, existing 3D world models are primarily limited to single-material dynamics using a particle-based Graph Neural Network model, and often require time-consuming 3D scene reconstruction to obtain 3D particle tracks for training. In this work, we present ParticleFormer, a Transformer-based point cloud world model trained with a hybrid point cloud reconstruction loss, supervising both global and local dynamics features in multi-material, multi-object robot interactions. ParticleFormer captures fine-grained multi-object interactions between rigid, deformable, and flexible materials, trained directly from real-world robot perception data without an elaborate scene reconstruction. We demonstrate the model’s effectiveness both in 3D scene forecasting tasks, and in downstream manipulation tasks using a Model Predictive Control (MPC) policy. In addition, we extend existing dynamics learning benchmarks to include diverse multi-material, multi-object interaction scenarios. We validate our method on six simulation and three real-world experiments, where it consistently outperforms leading baselines by achieving superior dynamics prediction accuracy and less rollout error in downstream visuomotor tasks. Experimental videos are available at https://particleformer.github.io/.
- PDF: https://openreview.net/pdf?id=7wGYX11BJB
- Forum: https://openreview.net/forum?id=7wGYX11BJB
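Point cloud reconstruction losses of the kind mentioned in the ParticleFormer abstract are commonly built on a Chamfer distance between predicted and observed point sets; the sketch below is a generic NumPy illustration under that assumption, not the paper's exact hybrid loss:

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets of shape (N, 3) and
    (M, 3): mean nearest-neighbor distance computed in both directions."""
    # Pairwise distances via broadcasting: result has shape (N, M).
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

Identical point sets yield a distance of zero, and the loss grows as the predicted cloud drifts from the observation, which is what makes it usable as a training signal for a dynamics model.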