OpenAI has introduced a new large language model, o1, trained to perform complex reasoning tasks. The model demonstrates significant improvements in math, coding, and scientific problem-solving, outperforming previous models on various benchmarks.
OpenAI claims that o1 represents a leap forward in machine reasoning capabilities. Trained using reinforcement learning, o1 employs an internal "chain of thought" process before responding to users, allowing it to tackle complex problems more effectively than its predecessors.
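For developers, that reasoning happens entirely server-side: a request to the model looks like an ordinary chat completion call, and only the final answer is returned. The sketch below is a minimal illustration, assuming the OpenAI Python SDK (v1.x) and the "o1-preview" model identifier; the prompt is hypothetical, and the internal chain of thought is not exposed in the response.

```python
# Minimal sketch: querying a reasoning model via the OpenAI Python SDK (v1.x).
# The model name "o1-preview" and the prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",  # assumed reasoning-model identifier
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves at 3:15 pm and arrives at 7:40 pm. "
                "How long is the journey?"
            ),
        }
    ],
)

# Only the final answer is returned; the model's internal reasoning
# tokens are generated before this text but are not visible here.
print(response.choices[0].message.content)
```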
In the 2024 International Olympiad in Informatics (IOI), a version of o1 fine-tuned for programming competitions scored 213 points, ranking in the 49th percentile.
On standardized tests, o1 showed marked improvement over GPT-4o, its predecessor. For instance, on the American Invitational Mathematics Examination (AIME), o1 solved an average of 11.1 out of 15 problems with a single attempt per problem, compared to GPT-4o's 1.8. With more extensive sampling, o1's score places it among the top 500 students nationally, above the cutoff for the USA Mathematical Olympiad.
The model also excelled in scientific problem-solving, surpassing human PhD-level accuracy on the GPQA Diamond benchmark, which tests expertise in chemistry, physics, and biology.
However, OpenAI acknowledges that o1 is not universally superior. In evaluations of human preference, o1 was less favored than GPT-4o for some natural language tasks, indicating that it may not be suitable for all applications.
The company reports that o1 demonstrated improved performance on safety evaluations, including better resistance to jailbreaking attempts and harmful prompts. OpenAI has chosen not to make the model's raw "chain of thought" visible to users, citing factors such as user experience and the potential for future monitoring of the model's reasoning process.