
Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira

📌 International Conference on Robotics & Automation (ICRA) 2026

Paper arXiv Code Dataset BibTeX

Video

Abstract

Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.
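The exact loss formulations are given in the paper. As a rough, illustrative sketch only, the snippet below shows one plausible way a patch-level InfoNCE loss could be conditioned on simulator segmentation masks; the function name, tensor shapes, and masking rule here are assumptions for illustration, not the released implementation.

# Illustrative sketch only -- not the MANGO implementation.
# A patch-level InfoNCE loss whose negatives are restricted by segmentation labels.
import torch
import torch.nn.functional as F

def seg_conditioned_infonce(feat_sim, feat_translated, seg_labels, tau=0.07):
    # feat_sim:        (N, D) patch features from the simulated image
    # feat_translated: (N, D) patch features from the translated image (same patch locations)
    # seg_labels:      (N,)   segmentation class id of each patch, taken from the simulator
    feat_sim = F.normalize(feat_sim, dim=1)
    feat_translated = F.normalize(feat_translated, dim=1)

    # Similarity of every translated patch to every simulated patch.
    logits = feat_translated @ feat_sim.t() / tau
    targets = torch.arange(feat_sim.size(0), device=feat_sim.device)  # positive = same patch location

    # Assumption: patches sharing a segmentation class are excluded from the negatives,
    # so the loss never pushes apart features belonging to the same object.
    same_class = seg_labels.unsqueeze(0) == seg_labels.unsqueeze(1)
    same_class.fill_diagonal_(False)  # keep the true positive pair
    logits = logits.masked_fill(same_class, float('-inf'))

    return F.cross_entropy(logits, targets)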


Data Augmentation with MANGO

Multiview Augmentation with Novel Generated Observations (MANGO) augments fixed-camera demonstration data with novel viewpoints generated in simulation.

Fixed-Camera Real Demonstrations

Sim2Real Translated Diverse-Viewpoint Demonstrations

We generate diverse-viewpoint demonstrations in simulation and translate them into the real-world visual domain, yielding diverse-viewpoint demonstrations for augmentation. The image translation model is trained only on fixed-viewpoint real-world data.
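As a rough sketch of how this augmentation could be wired into a training pipeline, the snippet below mixes translated sim demonstrations with the fixed-camera real demonstrations; the loader names, checkpoint file, and data format are hypothetical and only illustrate the workflow.

# Illustrative sketch only -- loader names, file paths, and the data format are hypothetical.
# Translated diverse-viewpoint sim demos are mixed with fixed-camera real demos
# before imitation-learning training.
import torch
from torch.utils.data import ConcatDataset, DataLoader

real_demos = load_fixed_camera_demos("data/real_fixed_view")        # hypothetical loader
sim_demos  = load_sim_demos("data/sim_diverse_views")               # hypothetical loader
translator = load_translation_model("mango_checkpoint.pt").eval()   # hypothetical checkpoint

@torch.no_grad()
def translate_demo(demo):
    # Replace every simulated observation with its sim2real translation.
    demo["obs"] = torch.stack([translator(frame.unsqueeze(0))[0] for frame in demo["obs"]])
    return demo

translated_demos = [translate_demo(d) for d in sim_demos]

# The policy is trained on the union of real fixed-view and translated diverse-view data.
train_set = ConcatDataset([real_demos, translated_demos])
loader = DataLoader(train_set, batch_size=64, shuffle=True)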

Robot Experiments

In this section, we show rollouts of imitation-learning policies trained on MANGO-augmented data. Observations come from cameras perturbed away from the fixed training viewpoint. All videos are shown at 2x speed.

Stack Blocks

Close Laptop

Stack Cups

Pick Coke

Failures

BibTeX

@inproceedings{coholich2026Sim2real,
  title     = {Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets},
  author    = {Coholich, Jeremiah and Wit, Justin and Azarcon, Robert and Kira, Zsolt},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2601.09605}
}