Navigating to Objects in the Real World

Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding in the classical pipeline for spatial navigation, but learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation that compares representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from a 77% success rate in simulation to 23% in the real world due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks, (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes, and propose concrete steps forward.

We instantiate semantic navigation with the Object Goal navigation task, where a robot starts in a completely unseen environment and is asked to find an instance of an object category, for example a toilet. The robot has access only to a first-person RGB and depth camera and a pose sensor.

This task is challenging. The robot needs spatial scene understanding to distinguish free space from obstacles and semantic scene understanding to detect objects, and it must also learn semantic exploration priors. For example, to find a toilet in this scene, most of us would choose the hallway because it is most likely to lead to a toilet. Teaching this kind of common sense, or semantic prior, to an autonomous agent is challenging. While exploring the scene for the desired object, the robot also needs a long-term episodic memory to remember explored and unexplored areas.

So how do we train autonomous agents capable of efficient navigation while tackling all these challenges? A classical approach to this problem builds a geometric map using depth sensors, explores the environment with a heuristic, like frontier exploration, which explores the closest unexplored region, and uses an analytical planner to reach exploration goals and the goal object as soon as it is in sight. An end-to-end learning approach predicts actions directly from raw observations with a deep neural network consisting of visual encoders for image frames followed by a recurrent layer for memory. A modular learning approach builds a semantic map by projecting predicted semantic segmentation using depth, predicts an exploration goal with a goal-oriented semantic policy as a function of the semantic map and the goal object, and reaches it with a planner.
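The frontier exploration heuristic mentioned for the classical approach can be illustrated with a minimal sketch. This is not the implementation evaluated in the study: the grid encoding (`FREE`, `OBSTACLE`, `UNKNOWN`), the function names, and the use of 4-connected breadth-first search as the distance measure are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical cell encoding for a 2D occupancy grid (an assumption of this sketch).
FREE, OBSTACLE, UNKNOWN = 0, 1, -1
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity


def find_frontiers(grid):
    """Return the set of free cells that touch at least one unknown cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.add((r, c))
                    break
    return frontiers


def closest_frontier(grid, start):
    """Breadth-first search through free space from `start`.

    Returns (frontier_cell, number_of_grid_steps), or None if no frontier
    is reachable, which means there is nothing left to explore.
    """
    rows, cols = len(grid), len(grid[0])
    frontiers = find_frontiers(grid)
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) in frontiers:
            return (r, c), dist
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in visited and grid[nr][nc] == FREE):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


if __name__ == "__main__":
    example = [
        [FREE, FREE, FREE, UNKNOWN],
        [FREE, OBSTACLE, FREE, UNKNOWN],
        [FREE, OBSTACLE, FREE, FREE],
        [UNKNOWN, OBSTACLE, OBSTACLE, FREE],
    ]
    print(closest_frontier(example, (0, 0)))
```

Note that nothing in this heuristic depends on the goal object, which is the property the modular approach changes by replacing the heuristic with a learned, goal-oriented policy over a semantic map.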

While many approaches to navigate to objects have been proposed over the past few years, learned navigation policies have predominantly been evaluated in simulation, which opens the field to the risk of sim-only research that does not generalize to the real world. We address this issue through a large-scale empirical evaluation of representative classical, end-to-end learning, and modular learning approaches across 6 unseen homes and 6 goal object categories.

We compare approaches in terms of success rate within a limited budget of 200 robot actions and Success weighted by Path Length (SPL), a measure of path efficiency. In simulation, all approaches perform comparably, at around 80% success rate. But in the real world, the modular learning and classical approaches transfer well, going from 81% to 90% and from 78% to 80% success rate, respectively, while end-to-end learning fails to transfer, dropping from 77% to 23% success rate.
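For readers unfamiliar with the two metrics, the sketch below computes them using the standard definition of SPL: the per-episode success indicator times the ratio of the shortest path length to the longer of the agent's path and the shortest path, averaged over all episodes. The `Episode` record and its field names are hypothetical, and the success flag is assumed to already account for the 200-action budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    # Hypothetical record layout, introduced only for this illustration.
    success: bool           # goal reached within the action budget
    shortest_path: float    # length of the shortest path from start to goal
    agent_path: float       # length of the path the agent actually took


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    A failed episode contributes 0. A successful one contributes
    shortest_path / max(agent_path, shortest_path), which is 1.0 for an
    optimal path and shrinks as the agent takes detours.
    """
    total = 0.0
    for e in episodes:
        if not e.success:
            continue
        denominator = max(e.agent_path, e.shortest_path)
        # Guard against a degenerate episode whose start is already the goal.
        total += 1.0 if denominator == 0 else e.shortest_path / denominator
    return total / len(episodes)


if __name__ == "__main__":
    runs = [
        Episode(True, 5.0, 5.0),    # optimal path
        Episode(True, 5.0, 10.0),   # success, but twice the optimal length
        Episode(False, 5.0, 20.0),  # failure
    ]
    print(success_rate(runs), spl(runs))
```

Because SPL can never exceed the success rate, the two numbers are best read together: a high success rate with a low SPL indicates that goals are reached only after long detours.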

We illustrate these results qualitatively with one representative trajectory. All approaches start in a bedroom and are tasked with finding a couch. On the left, modular learning first successfully reaches the couch goal. In the middle, end-to-end learning fails after colliding too many times. On the right, the classical policy finally reaches the couch goal after a detour through the kitchen.

Modular learning improves real-world success rate by 10 percentage points over the classical approach (90% versus 80%). On the left, the goal-oriented semantic exploration policy heads directly towards the bedroom and finds the bed in 98 steps with an SPL of 0.90. On the right, because frontier exploration is agnostic to the bed goal, the policy makes detours through the kitchen and the entrance hallway before finally reaching the bed in 152 steps with an SPL of 0.52. With a limited time budget, inefficient exploration can lead to failure.

The semantic exploration policy of the modular learning approach takes a semantic map as input, while the end-to-end policy operates directly on the RGB-D frames. The semantic map space is invariant between sim and reality, while the image space exhibits a large domain gap. In this example, this gap leads a segmentation model trained on real-world images to predict a false-positive bed in the kitchen.
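To make the notion of a semantic map input more concrete, here is a minimal sketch of projecting per-pixel category labels into a top-down grid using depth and the robot pose. It is a simplification under stated assumptions, not the pipeline used in the study: it assumes a pinhole camera with a horizontal focal length `fx` and principal point `cx`, a camera looking along the robot heading, and it ignores pixel height, so it builds only the semantic channels and omits the obstacle and explored-area channels. All function and parameter names are hypothetical.

```python
import math


def project_to_semantic_map(depth, labels, pose, fx, cx,
                            cell_size, map_size, num_classes):
    """Project labeled depth pixels into a top-down semantic grid.

    depth:  2D list of forward distances in meters (0 marks an invalid reading)
    labels: 2D list of category indices per pixel (-1 marks background)
    pose:   (x, y, theta) robot position in meters and heading in radians
    Returns a nested list indexed as semantic_map[category][row][col].
    """
    x, y, theta = pose
    semantic_map = [[[0] * map_size for _ in range(map_size)]
                    for _ in range(num_classes)]
    for v, depth_row in enumerate(depth):
        for u, z in enumerate(depth_row):
            category = labels[v][u]
            if z <= 0 or category < 0:
                continue  # skip invalid depth and unlabeled pixels
            # Back-project the pixel column to a lateral offset (right is positive).
            lateral = (u - cx) * z / fx
            # Rotate and translate into the map frame:
            # forward axis is (cos, sin), right axis is (sin, -cos).
            world_x = x + z * math.cos(theta) + lateral * math.sin(theta)
            world_y = y + z * math.sin(theta) - lateral * math.cos(theta)
            col = math.floor(world_x / cell_size)
            row = math.floor(world_y / cell_size)
            if 0 <= row < map_size and 0 <= col < map_size:
                semantic_map[category][row][col] = 1
    return semantic_map


if __name__ == "__main__":
    # A 1x3 image: two valid pixels labeled as category 0, one invalid depth.
    grid = project_to_semantic_map(
        depth=[[2.0, 2.0, 0.0]], labels=[[0, 0, 0]],
        pose=(1.0, 1.0, 0.0), fx=1.0, cx=1.0,
        cell_size=0.5, map_size=8, num_classes=1)
    for map_row in grid[0]:
        print(map_row)
```

The resulting grid contains only category occupancy per cell, with no texture, lighting, or color, which is one way to see why a policy that consumes it is insulated from the appearance differences between simulated and real images.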

The semantic map domain invariance allows the modular learning approach to transfer well from sim to reality. In contrast, the image domain gap causes a large drop in performance when transferring a segmentation model trained in the real world to simulation and vice versa. If semantic segmentation transfers poorly from sim to reality, it is reasonable to expect an end-to-end semantic navigation policy trained on sim images to transfer poorly to real-world images.

Surprisingly, modular learning works even better in reality than in simulation. Detailed analysis reveals that many of the modular learning policy's failures in sim are due to reconstruction errors, which do not happen in reality. Of the 19% of episodes that fail in sim, visual reconstruction errors account for 10% and physical reconstruction errors for another 5%. In contrast, failures in the real world are predominantly due to depth sensor errors, while most semantic navigation benchmarks in simulation assume perfect depth sensing. Besides explaining the performance gap between sim and reality for modular learning, this gap in error modes is concerning because it limits the usefulness of simulation to diagnose bottlenecks and further improve policies. We show representative examples of each error mode and propose concrete steps forward to close this gap in the paper.

Theophile Gervet	Soumith Chintala	Dhruv Batra	Jitendra Malik	Devendra Singh Chaplot
CMU	FAIR	FAIR	FAIR	FAIR

Object Goal Navigation

Methods

Large-scale Real-world Empirical Evaluation

Results

Result 1: Modular Learning is Reliable

Result 2: Modular Learning Explores more Efficiently than Classical

Result 3: End-to-end Learning Fails to Transfer

Analysis

Insight 1: Why does Modular Transfer while End-to-end does not?

Insight 2: Sim vs Real Gap in Error Modes for Modular Learning

Takeaways
