Anthropic
"AI systems went from solving 2% of real coding problems to 49% in one year. Anthropic warns the window for safe regulation is closing fast."
Using Edit & Bash tools, newest Claude 3.5 Sonnet hits 49% on SWE-bench's GitHub issue tests, despite challenges like hidden tests & high token costs.
"Claude 3.5 Sonnet scored 93.7% on HumanEval, tops SWE-bench Verified, offers real-time debugging and test generation via Amazon Bedrock inference"
"New code sandbox allows Claude to run JavaScript, clean data, create visualisations, and perform real-time analysis on uploaded CSV files."
''Plug and play AI now moves cursors, clicks, and types through virtual keyboards, moving beyond traditional AI-specific interface requirements."
Anthropic updates Responsible Scaling Policy. New capability thresholds for AI safety. ASL Standards introduced. Kaplan as Responsible Scaling Officer.