Researchers are exploring ways to create embodied AI agents that can understand and interact with their surroundings, much like humans do. To support this goal, Meta has introduced the Open-Vocabulary Embodied Question Answering (OpenEQA) framework, a new benchmark designed to measure an AI agent's understanding of its environment by asking open-vocabulary questions.
OpenEQA consists of two primary tasks: episodic memory EQA and active EQA. In episodic memory EQA, an embodied AI agent answers questions based on its recollection of past experiences, similar to how a human might remember where they left their office badge. Active EQA, on the other hand, requires the agent to take action within the environment to gather necessary information and answer questions, such as checking if there is any fruit left in the kitchen.
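To make the distinction between the two settings concrete, here is a minimal Python sketch of what an episodic-memory example and an active example might look like. The class names, fields, and the `agent` interface are illustrative assumptions for this article, not the actual OpenEQA data structures or API.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: these names are assumptions, not OpenEQA's real classes.

@dataclass
class EpisodicMemoryEQAExample:
    """Episodic-memory EQA: the agent answers from a recorded history of observations."""
    question: str                 # e.g. "Where did I leave my badge?"
    episode_history: List[str]    # frames the agent has already seen (e.g. paths to RGB images)
    ground_truth_answer: str      # free-form, open-vocabulary answer

@dataclass
class ActiveEQAExample:
    """Active EQA: the agent must act in the environment to gather evidence first."""
    question: str                 # e.g. "Is there any fruit left in the kitchen?"
    scene_id: str                 # simulator scene the agent is placed in
    ground_truth_answer: str

def answer_active_eqa(example: ActiveEQAExample, agent) -> str:
    """Minimal control loop: explore until the agent is confident, then answer."""
    observation = agent.reset(example.scene_id)
    while not agent.ready_to_answer(example.question, observation):
        action = agent.plan_next_action(example.question, observation)
        observation = agent.step(action)   # move, turn, look around, etc.
    return agent.answer(example.question, observation)
```

The key difference is that the episodic-memory setting hands the agent its past observations up front, while the active setting requires a perception-and-action loop before an answer can be produced.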
The OpenEQA benchmark includes over 1,600 non-templated question-and-answer pairs written by human annotators, representing real-world use cases. Each pair is then validated by a separate set of human annotators to confirm that it is accurate and answerable. Additionally, OpenEQA features LLM-Match, an automatic evaluation metric for scoring open-vocabulary answers.
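The sketch below shows the general shape of an LLM-Match-style scorer: a judge LLM rates each candidate answer against the human reference on a small integer scale, and the ratings are averaged into a percentage. The prompt wording, the 1-to-5 scale mapping, and the `judge_llm` callable are assumptions made for illustration rather than the benchmark's exact implementation.

```python
from typing import List

# Hypothetical judging prompt; the real LLM-Match prompt may differ.
JUDGE_PROMPT = (
    "You are grading answers to questions about a household environment.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Rate how well the candidate matches the reference on a scale of 1 (wrong) "
    "to 5 (fully correct). Reply with a single integer."
)

def llm_match_score(question: str, reference: str, candidate: str, judge_llm) -> int:
    """Ask a judge LLM (any text-in, text-out callable) to rate one open-vocabulary answer."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    return int(judge_llm(prompt).strip())

def aggregate(scores: List[int]) -> float:
    """Map per-question 1-5 ratings onto a 0-100% benchmark score."""
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)
```

Because answers are open vocabulary, this kind of LLM-based matching is what lets the benchmark score free-form responses instead of requiring exact string matches.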
Using OpenEQA, Meta researchers benchmarked several state-of-the-art vision+language foundation models (VLMs) and found a significant gap between the best-performing model (GPT-4V at 48.5%) and human performance (85.9%). On questions that require spatial understanding, even the most advanced VLMs performed no better than language-only baselines, meaning that access to visual content provided no significant improvement. This suggests that current models are not effectively leveraging visual information and are instead relying on textual priors to answer visual questions.
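One way to picture that comparison is as a simple ablation: score the same questions with and without the visual frames and see whether the numbers move. The sketch below reuses the hypothetical `llm_match_score` and `aggregate` helpers from the previous snippet; `model` stands in for any VLM exposed as a callable, which is again an assumption for illustration.

```python
def evaluate(model, dataset, judge_llm, use_frames: bool) -> float:
    """Score a model on episodic-memory EQA, optionally withholding the frames."""
    scores = []
    for ex in dataset:  # each ex has question, episode_history, ground_truth_answer
        frames = ex.episode_history if use_frames else []
        candidate = model(ex.question, frames)
        scores.append(llm_match_score(ex.question, ex.ground_truth_answer, candidate, judge_llm))
    return aggregate(scores)

# If the score with use_frames=True is close to the score with use_frames=False,
# the model is likely answering from textual priors rather than from what it "sees".
```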
The release of OpenEQA highlights the importance of developing AI agents that can understand and communicate about the world they perceive. By combining challenging open-vocabulary questions with the requirement to answer in natural language, OpenEQA offers a benchmark that is straightforward to understand yet demands a strong understanding of the environment, posing a considerable challenge to current foundation models.
Improving AI's perception and reasoning capabilities is crucial for the development of embodied AI agents, such as home robots or smart glasses, that can effectively assist people in everyday life.
By enhancing large language models with the ability to "see" the world and deploying them on users' devices, researchers can create AI systems that are grounded in an understanding of the physical world.