Large Language Models
OpenAI's SimpleQA benchmark poses 4,326 factual questions and has an estimated error rate of 3%. GPT-4o scores under 40%; larger models score higher, while deeper-thinking ones more often decline to answer.
ChatGPT-4 scored 92 in clinical reasoning vs physicians' 74-76. AI-assisted doctors completed diagnoses more than a minute faster but showed no accuracy gains.
Ministral 3B and 8B models outperform larger peers, support 128k context, and enable on-device AI for robotics and local analytics. Pricing from $0.04/million tokens.
Stanford study: LLMs can assess and optimise educational materials, replicating learning effects and generating content preferred by human teachers.
NVIDIA optimised LLM inference, achieving a 3.5x latency improvement for Llama 70B in under a year. The Blackwell platform shows a 4x performance boost and the first use of FP4 precision in MLPerf.
Anthropic implements measures to prevent AI misuse in 2024 US elections, including policy updates, detection systems, and redirects to voting information.