Explore until Confident:
Efficient Exploration for Embodied Question Answering


Combine VLM semantic reasoning and rigorous uncertainty quantification to enable agents to efficiently explore relevant regions of unknown 3D environments, and stop to answer questions about them with calibrated confidence.


Simulated scenarios in Habitat-Sim


Real-world scenarios with a Fetch mobile robot

Abstract

We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene and planning how to explore over time, and their confidence can be miscalibrated, causing the robot to stop exploration prematurely or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploration, leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines that do not leverage the VLM for exploration or do not calibrate its confidence.


Embodied Question Answering (EQA)

In EQA tasks, the robot starts at a random location in a 3D scene, explores the space, and stops when it is confident about answering the question. This can be a challenging problem due to highly diverse scenes and the lack of an a priori map of the environment. Previous works rely on training dedicated exploration policies and question-answering modules from scratch; such approaches can be data-inefficient and typically handle only simple questions.

We are interested in using pre-trained large vision-language models (VLMs) for EQA without additional training. VLMs have achieved impressive performance in answering complex questions about static 2D images that sometimes require reasoning, and we find VLMs can also reason about semantically relevant regions to explore given the question and the current view. However, there are still two main challenges:
1) Limited Internal Memory of VLMs. EQA benefits from the robot tracking previously explored regions and also ones yet to be explored but relevant for answering the question. However, VLMs do not have an internal memory for mapping the scene and storing such semantic information;
2) Miscalibrated VLMs. VLMs are fine-tuned from pre-trained large language models (LLMs) that serve as the language decoder, and LLMs are often miscalibrated: they can be over-confident or under-confident about their output. This makes it difficult to determine when the robot is confident enough about the answer in EQA and should stop exploration.


How can we endow VLMs with the capability of efficient exploration for EQA?

Addressing the first challenge of limited internal memory, we propose building a map of the scene external to the VLM as the robot visits different locations. On top of this map, we embed the VLM's knowledge about promising exploration directions to guide the robot's exploration. Such semantic information is obtained by visual prompting: we annotate the free space in the current image view, prompt the VLM to choose among the unoccupied regions, and query its prediction; the resulting values are then stored in the semantic map.
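
As a concrete illustration, here is a minimal sketch of how one visual-prompting step could update the external map. Everything here is a placeholder rather than the paper's actual interface: `vlm_choice_probs`, `pixel_to_cell`, and `free_space_pixels` are hypothetical helpers, and the prompt wording is only indicative.

```python
def visual_prompt_and_update(annotated_rgb, depth, question, sem_map,
                             free_space_pixels, pixel_to_cell,
                             vlm_choice_probs, labels=("A", "B", "C")):
    """Sketch of one visual-prompting step (illustrative interfaces only).

    annotated_rgb:     current RGB view with letter labels already drawn at
                       the sampled free-space pixels
    depth:             (H, W) depth image used to project pixels into the map
    sem_map:           2D array of semantic values, one entry per map cell
    free_space_pixels: list of (u, v) unoccupied pixels, one per label
    pixel_to_cell:     callable (u, v, d) -> (row, col) index into sem_map
    vlm_choice_probs:  callable (image, prompt, choices) -> {label: probability}
    """
    prompt = (f"Question: {question}\n"
              f"Which labeled region ({', '.join(labels)}) in the image is most "
              f"worth exploring to help answer this question?")
    probs = vlm_choice_probs(annotated_rgb, prompt, list(labels))

    # Write the VLM's preference for each labeled region into the external map,
    # keeping the strongest evidence seen so far for each cell.
    for (u, v), label in zip(free_space_pixels, labels):
        row, col = pixel_to_cell(u, v, depth[v, u])
        sem_map[row, col] = max(sem_map[row, col], probs[label])
    return sem_map
```

Because the map lives outside the VLM, these scores persist across views and can later be queried when deciding where to explore next.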

We then leverage such semantic information with Frontier-Based Exploration (FBE). Frontiers are locations on the boundary between explored and unexplored regions, and we apply weighted sampling of the frontiers based on their semantic values on the map.
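
The following is a minimal sketch of the weighted sampling step, assuming the semantic values are stored in a 2D array `sem_map` indexed by map cells; the small uniform term `eps` and the proportional weighting are illustrative choices, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def sample_frontier(frontiers, sem_map, eps=0.05, rng=None):
    """Pick the next frontier to visit with probability proportional to its
    semantic value on the map, plus a small uniform term so that frontiers
    without VLM evidence can still be selected.

    frontiers: list of (row, col) map cells on the explored/unexplored boundary
    sem_map:   2D array of semantic values written during visual prompting
    """
    rng = rng or np.random.default_rng()
    values = np.array([sem_map[r, c] for r, c in frontiers], dtype=float)
    weights = values + eps
    probs = weights / weights.sum()
    return frontiers[rng.choice(len(frontiers), p=probs)]
```
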
Addressing the second challenge of miscalibration, we leverage multi-step conformal prediction (CP), which allows the robot to maintain a set of possible answers (the prediction set) over time and stop once the set reduces to a single answer. CP uses a moderately sized calibration set of scenarios (e.g., ~300) to carefully select a confidence threshold above which answers are included in the prediction set. This procedure achieves calibrated confidence: with a user-specified probability, the prediction set is guaranteed to contain the correct answer for a new scenario (under the assumption that calibration and test scenarios are drawn from the same unknown distribution). CP also minimizes the prediction set size, which helps the robot stop as early as possible while still satisfying calibrated confidence.
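
Below is a minimal split conformal prediction sketch of the calibration and stopping logic; the paper's multi-step formulation adds more structure, so treat this only as the core idea. It assumes each calibration scenario records the VLM's probability for the correct answer, and that at test time the VLM provides a probability for each candidate answer. With miscoverage level alpha = 0.1, the prediction set contains the correct answer for a new scenario with probability at least 90%, under the exchangeability assumption mentioned above.

```python
import numpy as np

def calibrate_threshold(cal_true_probs, alpha=0.1):
    """Split conformal calibration: cal_true_probs[i] is the VLM's probability
    of the correct answer in calibration scenario i (e.g., ~300 scenarios)."""
    n = len(cal_true_probs)
    scores = 1.0 - np.asarray(cal_true_probs, dtype=float)   # nonconformity scores
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    q_hat = np.quantile(scores, q_level, method="higher")
    return 1.0 - q_hat                                       # probability threshold

def prediction_set(answer_probs, threshold):
    """Keep every candidate answer whose VLM probability clears the threshold;
    the robot stops exploring once this set shrinks to a single answer."""
    return [answer for answer, p in answer_probs.items() if p >= threshold]
```

For example, `prediction_set({"A": 0.55, "B": 0.30, "C": 0.15}, threshold=0.4)` returns `["A"]`, so the robot would stop and answer A; if more than one answer cleared the threshold, it would keep exploring.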

HM-EQA Dataset

While prior work has primarily considered synthetic scenes and simple questions such as "what is the color of the coffee table?" involving basic attributes of relatively large pieces of furniture, we are interested in applying our VLM-based framework in more realistic and diverse scenarios, where the questions can be more open-ended and may require semantic reasoning. To this end, we propose HM-EQA, a new EQA dataset with 500 questions based on 267 scenes from the Habitat-Matterport 3D Research Dataset (HM3D). We consider five categories of questions.


Citation

[arXiv version]

Acknowledgements

We thank Donovon Jackson, Derick Seale, and Tony Nguyen for contributing to the HM-EQA dataset. The authors were partially supported by the Toyota Research Institute (TRI), the NSF CAREER Award [#2044149], and the Office of Naval Research [N00014-23-1-2148]. This article solely reflects the opinions and conclusions of its authors and not NSF, ONR, TRI, or any other Toyota entity. The website template is from KnowNo and Nerfies.