Explore until Confident:
Efficient Exploration for Embodied Question Answering


Combine VLM semantic reasoning and rigorous uncertainty quantification to enable agents to efficiently explore relevant regions of unknown 3D environments, and stop to answer questions about them with calibrated confidence.


Simulated scenarios in Habitat-Sim


Real-world scenarios with a Fetch mobile robot

Abstract

We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene and planning how to explore over time, and their confidence can be miscalibrated, causing the robot to stop exploration prematurely or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploration, leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines that do not leverage the VLM for exploration or do not calibrate its confidence.


Embodied Question Answering (EQA)

In EQA tasks, the robot starts at a random location in a 3D scene, explores the space, and stops when it is confident about answering the question. This can be a challenging problem due to highly diverse scenes and the lack of an a priori map of the environment. Previous works rely on training dedicated exploration policies and question-answering modules from scratch; such approaches can be data-inefficient and typically handle only simple questions.

We are interested in using pre-trained large vision-language models (VLMs) for EQA without additional training. VLMs have achieved impressive performance in answering complex questions about static 2D images that sometimes require reasoning, and we find VLMs can also reason about semantically relevant regions to explore given the question and the current view. However, there are still two main challenges:
1) Limited Internal Memory of VLMs. EQA benefits from the robot tracking previously explored regions and also ones yet to be explored but relevant for answering the question. However, VLMs do not have an internal memory for mapping the scene and storing such semantic information;
2) Miscalibrated VLMs. VLMs are fine-tuned from pre-trained large language models (LLMs) that serve as the language decoder, and LLMs are often miscalibrated: they can be over-confident or under-confident about their output. This makes it difficult to determine when the robot is confident enough about the answer in EQA and should stop exploration.


How can we endow VLMs with the capability of efficient exploration for EQA?

Addressing the first challenge of limited internal memory, we propose building a map of the scene external to the VLM as the robot visits different locations. On top of this map, we embed the VLM's knowledge about promising exploration directions to guide the robot's exploration. Such semantic information is obtained by visual prompting: we annotate the free space in the current image view, prompt the VLM to choose among the unoccupied regions, and query its prediction; the resulting values are then stored in the semantic map.
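
As a concrete illustration, here is a minimal sketch of how one visual-prompting step could update the external map. Everything here is a placeholder rather than the paper's actual interface: `vlm_choice_probs`, `pixel_to_cell`, and `free_space_pixels` are hypothetical helpers, and the prompt wording is only indicative.

```python
def visual_prompt_and_update(annotated_rgb, depth, question, sem_map,
                             free_space_pixels, pixel_to_cell,
                             vlm_choice_probs, labels=("A", "B", "C")):
    """Sketch of one visual-prompting step (illustrative interfaces only).

    annotated_rgb:     current RGB view with letter labels already drawn at
                       the sampled free-space pixels
    depth:             (H, W) depth image used to project pixels into the map
    sem_map:           2D array of semantic values, one entry per map cell
    free_space_pixels: list of (u, v) unoccupied pixels, one per label
    pixel_to_cell:     callable (u, v, d) -> (row, col) index into sem_map
    vlm_choice_probs:  callable (image, prompt, choices) -> {label: probability}
    """
    prompt = (f"Question: {question}\n"
              f"Which labeled region ({', '.join(labels)}) in the image is most "
              f"worth exploring to help answer this question?")
    probs = vlm_choice_probs(annotated_rgb, prompt, list(labels))

    # Write the VLM's preference for each labeled region into the external map,
    # keeping the strongest evidence seen so far for each cell.
    for (u, v), label in zip(free_space_pixels, labels):
        row, col = pixel_to_cell(u, v, depth[v, u])
        sem_map[row, col] = max(sem_map[row, col], probs[label])
    return sem_map
```

Because the map lives outside the VLM, these scores persist across views and can later be queried when deciding where to explore next.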

We then leverage such semantic information with Frontier-Based Exploration (FBE). Frontiers are locations on the boundary between explored and unexplored regions, and we apply weighted sampling of the frontiers based on their semantic values on the map.
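
The following is a minimal sketch of the weighted sampling step, assuming the semantic values are stored in a 2D array `sem_map` indexed by map cells; the small uniform term `eps` and the proportional weighting are illustrative choices, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def sample_frontier(frontiers, sem_map, eps=0.05, rng=None):
    """Pick the next frontier to visit with probability proportional to its
    semantic value on the map, plus a small uniform term so that frontiers
    without VLM evidence can still be selected.

    frontiers: list of (row, col) map cells on the explored/unexplored boundary
    sem_map:   2D array of semantic values written during visual prompting
    """
    rng = rng or np.random.default_rng()
    values = np.array([sem_map[r, c] for r, c in frontiers], dtype=float)
    weights = values + eps
    probs = weights / weights.sum()
    return frontiers[rng.choice(len(frontiers), p=probs)]
```
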
Addressing the second challenge of miscalibration, we leverage multi-step conformal prediction (CP), which allows the robot to maintain a set of possible answers (the prediction set) over time and stop once the set reduces to a single answer. CP uses a moderately sized calibration set of scenarios (e.g., ~300) to carefully select a confidence threshold above which answers are included in the prediction set. This procedure achieves calibrated confidence: with a user-specified probability, the prediction set is guaranteed to contain the correct answer for a new scenario (under the assumption that calibration and test scenarios are drawn from the same unknown distribution). CP also minimizes the prediction set size, which helps the robot stop as early as possible while still satisfying calibrated confidence.
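
Below is a minimal split conformal prediction sketch of the calibration and stopping logic; the paper's multi-step formulation adds more structure, so treat this only as the core idea. It assumes each calibration scenario records the VLM's probability for the correct answer, and that at test time the VLM provides a probability for each candidate answer. With miscoverage level alpha = 0.1, the prediction set contains the correct answer for a new scenario with probability at least 90%, under the exchangeability assumption mentioned above.

```python
import numpy as np

def calibrate_threshold(cal_true_probs, alpha=0.1):
    """Split conformal calibration: cal_true_probs[i] is the VLM's probability
    of the correct answer in calibration scenario i (e.g., ~300 scenarios)."""
    n = len(cal_true_probs)
    scores = 1.0 - np.asarray(cal_true_probs, dtype=float)   # nonconformity scores
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    q_hat = np.quantile(scores, q_level, method="higher")
    return 1.0 - q_hat                                       # probability threshold

def prediction_set(answer_probs, threshold):
    """Keep every candidate answer whose VLM probability clears the threshold;
    the robot stops exploring once this set shrinks to a single answer."""
    return [answer for answer, p in answer_probs.items() if p >= threshold]
```

For example, `prediction_set({"A": 0.55, "B": 0.30, "C": 0.15}, threshold=0.4)` returns `["A"]`, so the robot would stop and answer A; if more than one answer cleared the threshold, it would keep exploring.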

HM-EQA Dataset

While prior work has primarily considered synthetic scenes and simple questions such as "what is the color of the coffee table?" involving basic attributes of relatively large pieces of furniture, we are interested in applying our VLM-based framework in more realistic and diverse scenarios, where the questions can be more open-ended and may require semantic reasoning. To this end, we propose HM-EQA, a new EQA dataset with 500 questions based on 267 scenes from the Habitat-Matterport 3D Research Dataset (HM3D). We consider five categories of questions.


Citation

[arXiv version]

Acknowledgements

We thank Donovon Jackson, Derick Seale, and Tony Nguyen for contributing to the HM-EQA dataset. The authors were partially supported by the Toyota Research Institute (TRI), the NSF CAREER Award [#2044149], and the Office of Naval Research [N00014-23-1-2148]. This article solely reflects the opinions and conclusions of its authors and not NSF, ONR, TRI, or any other Toyota entity. The website template is from KnowNo and Nerfies.