OpenAI has introduced a research preview of Operator, powered by its Computer-Using Agent (CUA) technology, which combines GPT-4o's vision capabilities with advanced reasoning to interact with graphical user interfaces (GUIs) just as humans do.
CUA interacts with GUIs through the universal interface of screen, mouse, and keyboard. This approach eliminates the need for OS- or web-specific APIs and, according to OpenAI, marks the next step in AI development by enabling models to use the same tools humans rely on daily.
The system has established new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, 58.1% on WebArena, and 87% on WebVoyager for web-based tasks. These results demonstrate CUA's ability to navigate and operate across diverse environments using a single general action space.
The system operates through an iterative loop, sketched in code after the list below, that integrates:
- Perception: Screenshots from the computer are added to the model's context, providing a visual snapshot of the computer's current state
- Reasoning: CUA reasons through the next steps using chain-of-thought, taking into consideration current and past screenshots and actions
- Action: It performs actions—clicking, scrolling, or typing—until it decides that the task is completed or user input is needed
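The loop described above can be expressed as a minimal sketch. The function names here (capture_screenshot, choose_next_action, execute_action) are hypothetical placeholders standing in for the model and the computer environment, not OpenAI's actual CUA interfaces; the structure simply mirrors the perception-reasoning-action cycle.

```python
# Minimal sketch of a perception-reasoning-action loop as described above.
# All helper functions are hypothetical placeholders, not OpenAI's real APIs.
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                      # e.g. "click", "scroll", "type", "done", "ask_user"
    args: dict = field(default_factory=dict)


def capture_screenshot() -> bytes:
    """Hypothetical: return the current screen as an image (perception)."""
    return b""


def choose_next_action(history: list) -> Action:
    """Hypothetical: reason over past screenshots and actions, then pick the next step."""
    return Action(kind="done")


def execute_action(action: Action) -> None:
    """Hypothetical: perform a click, scroll, or keystroke on the computer."""
    pass


def run_task(max_steps: int = 50) -> str:
    history: list = []                        # model context: screenshots + actions
    for _ in range(max_steps):
        screenshot = capture_screenshot()     # Perception: snapshot of current state
        history.append(("screenshot", screenshot))
        action = choose_next_action(history)  # Reasoning: chain-of-thought over history
        if action.kind == "done":
            return "completed"
        if action.kind == "ask_user":
            return "needs_user_input"         # hand control back to the user
        execute_action(action)                # Action: click, scroll, or type
        history.append(("action", action))
    return "step_limit_reached"


if __name__ == "__main__":
    print(run_task())
```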
OpenAI has implemented safety measures across three major risk categories (a guardrail sketch follows the list):
- Misuse: refusals of harmful tasks, website blocklists, and real-time moderation
- Model mistakes: user confirmations, task limitations, and a watch mode for sensitive operations
- Frontier risks: evaluations against scenarios involving autonomous replication and biorisk tooling
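To make these mitigations concrete, the following is a hypothetical sketch of how a guardrail layer could combine a blocklist check with confirmation for sensitive operations. The names (BLOCKED_DOMAINS, SENSITIVE_ACTIONS, guarded_execute) are illustrative assumptions, not OpenAI's actual safety implementation.

```python
# Hypothetical guardrail sketch combining a website blocklist (misuse mitigation)
# with user confirmation for sensitive actions (model-mistake mitigation).
from typing import Optional
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"blocked.example"}                  # misuse: website blocklist
SENSITIVE_ACTIONS = {"submit_payment", "send_email"}   # model mistakes: confirm first


def is_navigation_allowed(url: str) -> bool:
    """Refuse navigation to blocklisted domains."""
    return urlparse(url).hostname not in BLOCKED_DOMAINS


def requires_confirmation(action_kind: str) -> bool:
    """Pause for explicit user confirmation before sensitive operations."""
    return action_kind in SENSITIVE_ACTIONS


def guarded_execute(action_kind: str, url: Optional[str] = None) -> str:
    if url is not None and not is_navigation_allowed(url):
        return "blocked"                       # real-time moderation could also veto here
    if requires_confirmation(action_kind):
        return "awaiting_user_confirmation"    # watch-mode style hand-off to the user
    return "executed"


if __name__ == "__main__":
    print(guarded_execute("click", url="https://example.com"))   # executed
    print(guarded_execute("submit_payment"))                     # awaiting_user_confirmation
```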
The initial research preview is available to Pro users in the U.S. through operator.chatgpt.com, with plans to make CUA available to developers through the OpenAI API.