Fast-UMI

A Scalable and Hardware-Independent Universal Manipulation Interface

* Equal contribution, ‡ Project Leader

1 Shanghai AI Lab   2 Xi'an Jiaotong-Liverpool University
3 Northwestern Polytechnical University   4 Institute of AI, China Telecom Corp Ltd



Physical prototypes of the Fast-UMI system




Project Overview

Real-world manipulation trajectory data collected with robotic arms is essential for developing general-purpose action policies, yet such data remains scarce. Existing collection methods suffer from high cost, labor intensity, hardware dependence, and complex setups that rely on SLAM algorithms. In this work, we introduce Fast-UMI, an interface-mediated manipulation system comprising two key components: a handheld device operated by humans for data collection and a robot-mounted device used during policy inference. The decoupled design is compatible with a wide range of grippers while maintaining consistent observation perspectives, so models trained on handheld-collected data can be applied directly to real robots. By obtaining the end-effector pose directly from existing commercial hardware, we eliminate complex SLAM deployment and calibration and streamline data processing. Fast-UMI also provides supporting software tools for efficient robot-learning data collection and conversion, enabling rapid, plug-and-play use. The system offers an efficient and user-friendly tool for robotic learning data acquisition.
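As a rough illustration of what the data-collection side involves, the sketch below pairs T265 end-effector poses with GoPro frames and writes them to a single episode file. It is a minimal example, not the project's released tooling: the GoPro is assumed to be exposed as an ordinary video-capture device, and the HDF5 layout and the file name fastumi_demo.hdf5 are illustrative placeholders.

# Illustrative recording loop: pair T265 end-effector poses with GoPro frames.
# Assumptions (not part of Fast-UMI's released tooling): the GoPro appears as an
# ordinary video device, and episodes are stored in a simple HDF5 layout.
import time

import cv2
import h5py
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.pose)          # T265 6-DoF pose stream
pipeline.start(config)

camera = cv2.VideoCapture(0)                  # GoPro via HDMI capture / webcam mode

poses, timestamps, frames = [], [], []
try:
    start = time.time()
    while time.time() - start < 10.0:         # record a 10-second episode
        pose_frame = pipeline.wait_for_frames().get_pose_frame()
        ok, image = camera.read()
        if not pose_frame or not ok:
            continue
        d = pose_frame.get_pose_data()
        poses.append([d.translation.x, d.translation.y, d.translation.z,
                      d.rotation.w, d.rotation.x, d.rotation.y, d.rotation.z])
        timestamps.append(time.time())
        frames.append(cv2.resize(image, (320, 240)))
finally:
    pipeline.stop()
    camera.release()

with h5py.File("fastumi_demo.hdf5", "w") as f:   # hypothetical episode layout
    f.create_dataset("ee_pose", data=np.asarray(poses))
    f.create_dataset("timestamp", data=np.asarray(timestamps))
    f.create_dataset("rgb", data=np.asarray(frames), compression="gzip")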



Newly Designed Components

Components of the newly designed Fast-UMI prototype device

Visual Alignment and Consistency

To ensure visual consistency between the handheld and robot-mounted devices, we established a visual alignment guideline: the bottom of the GoPro's fisheye lens image aligns with the bottom of the gripper's fingertips.


The views captured by the GoPro cameras on the handheld device and the robot-mounted device, respectively, with the red dashed line indicating the ends of the fingertips.
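A quick way to check this guideline on a live stream is to overlay a horizontal reference line near the bottom of the GoPro image and confirm that the fingertip ends touch it. The snippet below is a minimal sketch under that assumption; the capture-device index and the line's offset from the image bottom are placeholders, not values from the Fast-UMI setup.

# Visual check of the alignment guideline: draw a reference line near the image
# bottom and confirm the gripper fingertips reach it in the live GoPro view.
import cv2

cap = cv2.VideoCapture(0)                # GoPro exposed as a capture device (assumed index)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    y = int(0.98 * h)                    # reference row just above the image bottom (placeholder offset)
    cv2.line(frame, (0, y), (w, y), (0, 0, 255), 2)
    cv2.imshow("fingertip alignment check", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()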

RealSense T265 trajectory accuracy

The T265 trajectory exhibits an average positional error of 0.0237 m relative to MoCap ground truth. While the sensor's pose estimates are accurate enough for many robotic manipulation tasks, they carry inherent bias and variance that should be accounted for in precision-critical applications.


Evaluation of the RealSense T265 trajectory accuracy compared to motion capture (MoCap) ground truth data.
(a) Spatial trajectories along three axes: T265 measurements (red lines) and MoCap ground truth (green lines). (b) Positional errors of the T265 sensor relative to MoCap along the three axes.
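For readers who want to reproduce this kind of comparison, the average positional error can be computed by resampling the MoCap ground truth onto the T265 timestamps and averaging the per-sample Euclidean distance. The sketch below assumes both trajectories are already expressed in a common frame and stored as simple [t, x, y, z] CSV files; the file names and layout are placeholders, not Fast-UMI's actual data format.

# Compare a T265 trajectory against MoCap ground truth.
import numpy as np

# Nx4 arrays of [t, x, y, z], assumed to be in a common reference frame.
t265 = np.loadtxt("t265_traj.csv", delimiter=",")
mocap = np.loadtxt("mocap_traj.csv", delimiter=",")

# Resample MoCap positions onto the T265 timestamps (per-axis linear interpolation).
mocap_on_t265 = np.stack(
    [np.interp(t265[:, 0], mocap[:, 0], mocap[:, k]) for k in (1, 2, 3)], axis=1
)

# Per-sample Euclidean error and summary statistics.
errors = np.linalg.norm(t265[:, 1:4] - mocap_on_t265, axis=1)
print(f"mean positional error: {errors.mean():.4f} m")
print("per-axis RMSE:", np.sqrt(((t265[:, 1:4] - mocap_on_t265) ** 2).mean(axis=0)))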

Demos

Demo 1: Grasp a cup and place it into a sink (using the ACT policy). Watch Video (TBA)



More demos will be released soon!




Contact

Dr. Yan Ding, Researcher at Shanghai AI Lab, yding25@binghamton.edu