OpenAI has launched SimpleQA, a new benchmark designed to measure how accurately large language models answer factual questions. It addresses the persistent challenge of AI "hallucinations": the generation of false or unsubstantiated information.
SimpleQA evaluates language models' ability to provide accurate answers to straightforward, fact-based questions. It comprises 4,326 questions across diverse topics, from science and technology to entertainment and sports.
The research team, including Jason Wei, Karina Nguyen, Hyung Won Chung, and other collaborators, designed the benchmark to meet several key criteria: high correctness of its reference answers, topic diversity, and enough difficulty to challenge even the most advanced AI models. Current testing shows that GPT-4o scores below 40% on the benchmark, indicating significant room for improvement in AI factuality.
The benchmark focuses on short, fact-seeking queries to make measuring factuality more tractable. The development process involved multiple AI trainers independently verifying answers, with questions only included when trainers reached consensus.
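As a rough illustration of that consensus step, the sketch below keeps a candidate question only when two independently written trainer answers agree. The field names and normalization are hypothetical, not OpenAI's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate benchmark item with two independently written trainer answers."""
    question: str
    trainer_a_answer: str
    trainer_b_answer: str

def normalize(answer: str) -> str:
    # Very coarse normalization; real review would reconcile aliases, dates, units, etc.
    return answer.strip().lower()

def filter_by_consensus(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only questions whose two trainer answers match after normalization."""
    return [
        c for c in candidates
        if normalize(c.trainer_a_answer) == normalize(c.trainer_b_answer)
    ]

# Example: the second candidate is dropped because the trainers disagree.
pool = [
    Candidate("What year was the Hubble Space Telescope launched?", "1990", "1990"),
    Candidate("Who painted 'The Night Watch'?", "Rembrandt", "Vermeer"),
]
print(len(filter_by_consensus(pool)))  # -> 1
```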
The benchmark's quality control process revealed an estimated error rate of approximately 3%, based on a verification study involving 1,000 randomly sampled questions. This relatively low error rate suggests high reliability for evaluation purposes.
Initial testing with SimpleQA has revealed interesting patterns in model performance. Larger models typically demonstrate better factual accuracy, while reasoning-focused models such as o1-mini and o1-preview more often decline to answer (graded as "not attempted") rather than risk giving an incorrect response.
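To make that trade-off concrete, here is a minimal sketch of how graded results could be tallied, assuming each response has already been labeled correct, incorrect, or not attempted. The label names and metric definitions are illustrative rather than OpenAI's published evaluation code:

```python
from collections import Counter

def summarize(grades: list[str]) -> dict[str, float]:
    """Aggregate per-question grades into overall and attempted-only accuracy."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "overall_accuracy": counts["correct"] / total if total else 0.0,
        # Accuracy restricted to questions the model actually tried to answer;
        # a cautious model can score well here while attempting fewer questions.
        "accuracy_given_attempted": counts["correct"] / attempted if attempted else 0.0,
        "not_attempted_rate": counts["not attempted"] / total if total else 0.0,
    }

grades = ["correct", "incorrect", "not attempted", "correct", "not attempted"]
print(summarize(grades))  # overall 0.4, attempted-only ~0.67, not-attempted 0.4
```

Under a scheme like this, declining to answer avoids an outright error but still lowers overall accuracy, which is why abstention reads as caution rather than competence.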
The benchmark also provides insight into model calibration: a model's ability to accurately assess its own confidence. Results show that o1-preview is better calibrated than o1-mini, and GPT-4o is better calibrated than GPT-4o-mini, though models consistently overstate their confidence.
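One common way to measure this, offered here as an illustrative sketch rather than OpenAI's published methodology, is to have a model state a confidence alongside each answer, bin the answers by stated confidence, and compare each bin's average stated confidence with its actual accuracy:

```python
def calibration_report(confidences: list[float], correct: list[bool], n_bins: int = 10):
    """Compare stated confidence with empirical accuracy in equal-width bins.

    Returns (avg_confidence, accuracy, count) per populated bin; a well-calibrated
    model has avg_confidence close to accuracy in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp confidence of 1.0 into the last bin
        bins[idx].append((conf, ok))
    report = []
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        report.append((avg_conf, accuracy, len(bucket)))
    return report

# A model that claims 90% confidence but is right only 60% of the time is overconfident.
stated = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.3, 0.3]
was_correct = [True, True, True, False, False, True, False, True, False, False]
for avg_conf, acc, n in calibration_report(stated, was_correct):
    print(f"stated {avg_conf:.2f} -> actual {acc:.2f} (n={n})")
```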