MIT engineers have unveiled Clio, a method that lets robots quickly map a scene and identify the objects relevant to a task, with applications ranging from search and rescue to domestic robotics.
Named after the Greek muse of history, Clio lets a robot identify and remember only the elements that matter for a given task. Given tasks specified in natural-language prompts, the method automatically segments a scene at the level of granularity those tasks require.
The research team, including members from MIT's SPARK Laboratory and Lincoln Laboratory, conducted real-world experiments to demonstrate Clio's capabilities. In one test, they applied Clio to images of a cluttered apartment, where it successfully identified relevant objects based on tasks like "move pile of clothes."
In a more complex demonstration, the team integrated Clio with Boston Dynamics' quadruped robot, Spot. As Spot explored an office building, Clio ran in real time on an onboard computer, identifying and mapping only the scene elements related to the robot's assigned tasks.
Clio combines state-of-the-art computer vision and large language models with mapping tools and the "information bottleneck" concept from classic information theory. With this approach, a robot compresses the image segments it sees and stores only those most semantically relevant to the task at hand.
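To make the idea of keeping only task-relevant segments concrete, here is a minimal sketch in Python. It is not Clio's actual algorithm: it swaps the information-bottleneck formulation for a plain cosine-similarity threshold, and the embeddings, the threshold value, and the names `cosine` and `select_relevant` are all hypothetical, chosen only for illustration. In a real system the vectors would come from a vision-language model rather than being written by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_relevant(segments, task_embeddings, threshold=0.8):
    """Keep each segment whose best-matching task scores at or above threshold.

    Simplified stand-in for task-driven compression: segments that match
    no task well enough are dropped from the map entirely.
    Returns a dict mapping kept segment name -> best-matching task.
    """
    kept = {}
    for name, emb in segments.items():
        best_task, best_score = None, -1.0
        for task, task_emb in task_embeddings.items():
            score = cosine(emb, task_emb)
            if score > best_score:
                best_task, best_score = task, score
        if best_score >= threshold:
            kept[name] = best_task
    return kept


# Hand-written toy embeddings (assumptions, not real model output).
segments = {
    "clothes_pile": [0.9, 0.1, 0.0],
    "bookshelf": [0.0, 0.2, 0.9],
    "deck_of_cards": [0.1, 0.95, 0.1],
}
tasks = {
    "move pile of clothes": [1.0, 0.0, 0.0],
    "find deck of cards": [0.0, 1.0, 0.0],
}

kept = select_relevant(segments, tasks)
for segment, task in kept.items():
    print(f"{segment} -> {task}")
```

In this toy setup the bookshelf matches neither task closely and is discarded, while the clothes pile and the deck of cards are each stored with the task they serve. Raising or lowering the threshold trades map size against the risk of dropping something relevant.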
Looking ahead, the team plans to adapt Clio to more complex, high-level tasks. As team member Dominic Maggio explains, "We're still giving Clio tasks that are somewhat specific, like 'find deck of cards.' For search and rescue, you need to give it more high-level tasks, like 'find survivors,' or 'get power back on.'"