Explore until Confident: Efficient Exploration for Embodied Question Answering
- Allen Z. Ren
- Jaden Clark
- Anushri Dixit
- Masha Itkina
- Anirudha Majumdar
- Dorsa Sadigh
Combine VLM semantic reasoning and rigorous uncertainty quantification to enable agents to efficiently explore relevant regions of unknown 3D environments, and stop to answer questions about them with calibrated confidence.
Simulated scenarios in Habitat-Sim
Real-world scenarios with a Fetch mobile robot
Abstract
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene to be able to plan how to explore over time, and their confidence can be miscalibrated, causing the robot to prematurely stop exploration or to over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploration, leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines that do not leverage the VLM for exploration or do not calibrate its confidence.
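At a glance, the method alternates VLM-guided exploration with a calibrated stopping test. The sketch below is only a schematic of that loop under assumed interfaces; every helper name here (e.g., `update_semantic_map`, `vlm_answer_scores`, `confident_enough`) is a stand-in stub, not the released implementation.

```python
import random

# Schematic exploration loop; every helper below is a stand-in stub.
def get_observation(robot):                      # RGB-D observation at the current pose (stub)
    return {"rgb": None, "depth": None}

def update_semantic_map(smap, obs, question):    # fuse VLM-guided exploration values (stub)
    return smap

def pick_frontier(smap):                         # most promising unexplored region under the map (stub)
    return (random.randint(0, 99), random.randint(0, 99))

def vlm_answer_scores(obs, question, options):   # VLM confidence per answer option (stub)
    return {o: 1.0 / len(options) for o in options}

def confident_enough(scores, tau):               # calibrated stopping test with conformal threshold tau
    return sum(s >= tau for s in scores.values()) == 1

def embodied_qa(robot, question, options, tau, max_steps=50):
    smap = {}
    scores = {o: 0.0 for o in options}
    for _ in range(max_steps):
        obs = get_observation(robot)
        smap = update_semantic_map(smap, obs, question)
        scores = vlm_answer_scores(obs, question, options)
        if confident_enough(scores, tau):        # stop as soon as calibrated confidence suffices
            return max(scores, key=scores.get)
        goal = pick_frontier(smap)               # otherwise explore the next relevant region
        # robot.navigate_to(goal)                # motion execution omitted in this sketch
    return max(scores, key=scores.get)
```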
Embodied Question Answering (EQA)
In EQA tasks, the robot starts at a random location in a 3D scene, explores the space, and stops when it is confident about answering the question. This can be a challenging problem due to highly diverse scenes and the lack of an a priori map of the environment. Previous work relies on training dedicated exploration policies and question-answering modules from scratch, which can be data-inefficient and handles only simple questions.
1) Limited Internal Memory of VLMs. EQA benefits from the robot tracking previously explored regions as well as regions yet to be explored that are relevant for answering the question. However, VLMs do not have an internal memory for mapping the scene and storing such semantic information;
2) Miscalibrated VLMs. VLMs are fine-tuned from pre-trained large language models (LLMs) that serve as the language decoder, and LLMs are often miscalibrated: they can be over-confident or under-confident about their output. This makes it difficult to determine when the robot is confident enough in its answer and should stop exploring.
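As noted in the abstract, this second challenge is handled with conformal prediction over the VLM's answer confidence. The snippet below is a minimal, generic sketch of split conformal calibration for multiple-choice answers; the nonconformity score (one minus the VLM's normalized answer confidence), the coverage level, and the toy calibration data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def conformal_threshold(cal_scores_true, alpha=0.1):
    """Split conformal calibration.

    cal_scores_true: VLM confidence (e.g., normalized token likelihood) assigned
    to the *correct* answer on each held-out calibration question. Returns a
    confidence threshold such that, with coverage about 1 - alpha, the true
    answer's confidence exceeds it on new questions.
    """
    n = len(cal_scores_true)
    nonconformity = 1.0 - np.asarray(cal_scores_true)        # low confidence -> high nonconformity
    q_level = np.ceil((n + 1) * (1 - alpha)) / n              # finite-sample-corrected quantile level
    qhat = np.quantile(nonconformity, min(q_level, 1.0), method="higher")
    return 1.0 - qhat

def prediction_set(answer_scores, threshold):
    """Keep every answer option whose calibrated confidence clears the threshold."""
    return {a: s for a, s in answer_scores.items() if s >= threshold}

# Toy example (fabricated numbers, for illustration only).
cal_scores = np.random.default_rng(0).uniform(0.3, 1.0, size=200)
tau = conformal_threshold(cal_scores, alpha=0.1)

scores = {"A": 0.62, "B": 0.21, "C": 0.12, "D": 0.05}         # VLM scores for the four options
S = prediction_set(scores, tau)
done = len(S) == 1    # stop exploring once the prediction set is a single answer
```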
How can we endow VLMs with the capability of efficient exploration for EQA?
Addressing the first challenge of limited internal memory, we propose building a map of the scene external to the VLM as the robot visits different locations. On top of this map, we embed the VLM's knowledge about promising exploration directions to guide the robot's exploration. This semantic information is obtained by visual prompting: we annotate the free space in the current image view, prompt the VLM to choose among the unoccupied regions, and query its prediction. The resulting values are then stored in the semantic map.
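As a rough sketch of this idea (not the released code), the snippet below keeps a 2D grid of semantic values outside the VLM: labeled free-space regions in the current view are scored by the VLM for the given question, and those scores are fused into the grid cells each region covers. `query_vlm_region_scores`, the class name, and the dummy data are all placeholders introduced for illustration.

```python
import numpy as np

class SemanticValueMap:
    """2D grid storing how promising each location looks for answering the question."""

    def __init__(self, size=200, resolution=0.1):
        self.values = np.zeros((size, size))   # semantic value per cell
        self.counts = np.zeros((size, size))   # number of VLM observations per cell
        self.resolution = resolution           # meters per cell

    def update(self, region_cells, score):
        """Fuse a VLM score into all grid cells covered by one annotated region."""
        for (i, j) in region_cells:
            self.counts[i, j] += 1
            # Running average so repeated views refine, rather than overwrite, the value.
            self.values[i, j] += (score - self.values[i, j]) / self.counts[i, j]

    def best_frontier(self, frontier_cells):
        """Pick the unexplored frontier cell with the highest stored semantic value."""
        return max(frontier_cells, key=lambda c: self.values[c])


def query_vlm_region_scores(image, labeled_regions, question):
    """Placeholder: annotate free space with letter labels, ask the VLM which region
    to explore for the question, and return a normalized score per label."""
    k = len(labeled_regions)
    return {label: 1.0 / k for label in labeled_regions}   # dummy uniform scores


# Illustrative use with dummy data.
smap = SemanticValueMap()
labeled_regions = {"A": [(50, 60), (50, 61)], "B": [(80, 90)]}
scores = query_vlm_region_scores(image=None, labeled_regions=labeled_regions,
                                 question="Is the stove turned off?")
for label, cells in labeled_regions.items():
    smap.update(cells, scores[label])
goal = smap.best_frontier([(50, 60), (80, 90)])
```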
HM-EQA Dataset
While prior work has primarily considered synthetic scenes and simple questions such as "what is the color of the coffee table?" involving basic attributes of relatively large pieces of furniture, we are interested in applying our VLM-based framework in more realistic and diverse scenarios, where the questions can be more open-ended and possibly require semantic reasoning. To this end, we propose HM-EQA, a new EQA dataset with 500 questions based on 267 scenes from the Habitat-Matterport 3D Research Dataset (HM3D). We consider five categories of questions:
Acknowledgements
We thank Donovon Jackson, Derick Seale, and Tony Nguyen for contributing to the HM-EQA dataset. The authors were partially supported by the Toyota Research Institute (TRI), the NSF CAREER Award [#2044149], and the Office of Naval Research [N00014-23-1-2148]. This article solely reflects the opinions and conclusions of its authors and not those of NSF, ONR, TRI, or any other Toyota entity. The website template is from KnowNo and Nerfies.