In Im2Flow2Act, we use object flow to bridge the domain gap across embodiments (human vs. robot) and training environments (real world vs. simulation).
Our final system leverages a) action-less human videos for task-conditioned flow generation and b) task-less simulated robot data for flow-conditioned action generation, yielding c) a language-conditioned multi-task system for a wide variety of real-world manipulation tasks.
Our key idea is to use object flow, i.e., the motion of the manipulated object alone, excluding any background or embodiment movement, as a unifying interface that bridges embodiments (human and robot) and environments (real world and simulation), enabling one-shot generalization to new skills in the real world.
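To make the representation concrete, the sketch below shows one way object flow can be stored as per-point pixel trajectories of the manipulated object over time. The array shapes, variable names, and the rigid-translation example are illustrative assumptions for exposition, not the paper's exact data format.

```python
import numpy as np

# Hypothetical illustration: object flow as a set of 2D point tracks on the
# manipulated object, sampled over the course of an episode.
T, N = 32, 64          # number of timesteps, number of tracked object points
H, W = 256, 256        # image resolution the points are expressed in

# flow[t, i] = (u, v) pixel location of object point i at timestep t.
# Only points on the manipulated object are tracked, so background and
# embodiment (human hand / robot arm) motion is excluded by construction.
flow = np.zeros((T, N, 2), dtype=np.float32)
flow[0] = np.random.uniform([0, 0], [W, H], size=(N, 2))  # initial keypoints

# A rigid translation of the object would appear as the same offset applied
# to every tracked point at every step.
offset_per_step = np.array([1.5, -0.5], dtype=np.float32)
for t in range(1, T):
    flow[t] = flow[t - 1] + offset_per_step
```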
To use object flow as a unifying interface for learning from diverse data sources, we design our system around two components (a sketch of how they compose at test time follows the list):
- Flow generation network: The goal of the flow generation network is to learn high-level task planning from cross-embodiment videos, including videos of different robot types as well as human demonstrations. We build a language-conditioned flow generation network on top of the video generation model AnimateDiff. This high-level planning component generates the task flow from the initial visual observation and the task description.
- Flow-conditioned policy: The goal of the flow-conditioned imitation learning policy is to realize the flows produced by the flow generation network, focusing on low-level execution. The policy is trained entirely on simulated data to learn the mapping between flows and actions. Unlike most sim-to-real work, which requires building task-specific simulations, our policy learns from play data, which is easier to collect. Since flow represents motion, a concept shared by real-world and simulated environments, the policy can be deployed in real-world settings without modification.
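As a rough illustration of how the two components fit together at test time, the sketch below defines hypothetical interfaces for the planner and the policy. All class, method, and parameter names (FlowGenerator, FlowConditionedPolicy, run_episode, env) are assumptions made for exposition and do not reflect the released Im2Flow2Act API.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray        # (H, W, 3) current camera image
    proprio: np.ndarray    # robot joint / end-effector state


class FlowGenerator:
    """High-level planner: initial image + task description -> object flow."""

    def generate(self, initial_rgb: np.ndarray, task: str) -> np.ndarray:
        # In the paper this role is played by a language-conditioned model
        # built on AnimateDiff; here we simply return a dummy (T, N, 2) flow.
        T, N = 32, 64
        return np.zeros((T, N, 2), dtype=np.float32)


class FlowConditionedPolicy:
    """Low-level controller that tracks the generated flow (trained in simulation)."""

    def act(self, obs: Observation, flow: np.ndarray) -> np.ndarray:
        # Returns a single action; a real policy would run closed-loop and
        # consume the remaining portion of the flow at each step.
        return np.zeros(7, dtype=np.float32)  # e.g., 6-DoF delta pose + gripper


def run_episode(env, planner: FlowGenerator, policy: FlowConditionedPolicy, task: str):
    # env is a placeholder environment assumed to expose reset/step/max_steps.
    obs = env.reset()
    flow = planner.generate(obs.rgb, task)   # plan once from the first frame
    for _ in range(env.max_steps):
        action = policy.act(obs, flow)       # execute against the planned flow
        obs = env.step(action)
```

Because the policy is conditioned only on the flow and the current observation, the same trained weights can be driven by flows generated from human videos, robot videos, or simulation.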