The upgraded Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, a software engineering evaluation, surpassing the previous state-of-the-art of 45%.
SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks. Specifically, it tests how well models can resolve GitHub issues from popular open-source Python repositories.
The benchmark evaluates not just the AI model in isolation, but rather an entire "agent" system. In this context, an "agent" refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt.
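To make this concrete, here is a minimal sketch of such a loop. The helper functions (`call_model`, `parse_action`, `run_tool`) are hypothetical placeholders standing in for the model API and tool layer, not the actual implementation:

```python
from typing import Callable, Optional

def run_agent(
    task: str,
    call_model: Callable[[str], str],              # sends a prompt, returns the model's reply
    parse_action: Callable[[str], Optional[dict]], # extracts a tool call, or None if finished
    run_tool: Callable[[dict], str],               # executes the tool call, returns its output
    max_turns: int = 100,
) -> str:
    """Drive the scaffolding loop: build prompt -> call model -> parse -> act -> repeat."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        # Build the next prompt from everything that has happened so far.
        reply = call_model("\n".join(history))
        history.append(reply)

        # Parse the model's output; None means the model reports it is done.
        action = parse_action(reply)
        if action is None:
            return reply

        # Execute the requested tool and fold its output into the next prompt.
        history.append(f"Tool output: {run_tool(action)}")
    return history[-1]
```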
The implementation relied on two main tools (a sketch of possible definitions follows the list):
- A Bash Tool for executing bash commands
- An Edit Tool for viewing, creating, and editing files and directories
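For illustration, the two tools above might be declared in a JSON-schema style along the following lines. The tool names, parameter fields, and the `str_replace` action are assumptions made for this sketch, not the actual definitions used in the evaluation:

```python
# Illustrative tool declarations; names and parameters are assumptions, not the real definitions.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a bash command and return its stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The command to execute."},
            },
            "required": ["command"],
        },
    },
    {
        "name": "edit",
        "description": "View, create, or edit files and directories.",
        "input_schema": {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": ["view", "create", "str_replace"],
                    "description": "Which file operation to perform.",
                },
                "path": {"type": "string", "description": "Path to the file or directory."},
                "content": {"type": "string", "description": "New or replacement text, where applicable."},
            },
            "required": ["action", "path"],
        },
    },
]
```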
Performance comparison across models:
- Claude 3.5 Sonnet (new): 49%
- Previous state-of-the-art: 45%
- Claude 3.5 Sonnet (old): 33%
- Claude 3 Opus: 22%
The implementation faced several challenges:
- Duration and high token costs, with some successful runs taking hundreds of turns and >100k tokens (see the budget sketch after this list)
- Grading issues related to environment setup and install patches
- Hidden tests preventing the model from seeing what it's being graded against
- Multimodal limitations, since the agent had no way to view files (such as images) saved to the filesystem or referenced as URLs
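To illustrate the first challenge, a scaffold generally needs hard caps so that long runs terminate. Below is a minimal sketch of such a budget check, assuming a crude character-based token estimate rather than a real tokenizer:

```python
def within_budget(history: list[str], turn: int,
                  max_turns: int = 200, max_tokens: int = 100_000) -> bool:
    """Return True while the run is still under its turn and token caps."""
    def estimate_tokens(text: str) -> int:
        # Crude assumption: roughly 4 characters per token. A real scaffold
        # would use the model's tokenizer or the API's reported usage.
        return len(text) // 4

    used_tokens = sum(estimate_tokens(message) for message in history)
    return turn < max_turns and used_tokens < max_tokens
```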