According to MIT News, researchers have tackled one of robotics' biggest challenges: creating general-purpose robots that can adapt to various tasks without extensive retraining. The breakthrough comes from a new architecture called Heterogeneous Pretrained Transformers (HPT), which unifies diverse types of robotic data.
"In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you'd be able to train a robot with all of them put together," lead researcher Lirui Wang, an electrical engineering and computer science graduate student at MIT, told MIT News.
The system processes vision and proprioception inputs (the robot's sense of its own position and motion) with a transformer model similar to those that power large language models like GPT-4. The researchers developed a method to align data from both modalities into tokens the transformer can process, with every input represented by the same fixed number of tokens.
"Proprioception is key to enable a lot of dexterous motions. Because the number of tokens is in our architecture always the same, we place the same importance on proprioception and vision," Wang explained to MIT News.
The pretraining corpus is substantial: 52 datasets with more than 200,000 robot trajectories spanning four categories, including human demonstration videos and simulations. In testing, the approach improved performance by more than 20 percent on both simulated and real-world tasks compared with conventional training methods, even when the robots encountered tasks significantly different from their training data.
The research team includes Wang, fellow graduate student Jialiang Zhao, Meta research scientist Xinlei Chen, and associate professor Kaiming He. Their findings will be presented at the Conference on Neural Information Processing Systems.
"Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models," Wang told MIT News. The research was partially funded by the Amazon Greater Boston Tech Initiative and Toyota Research Institute.