The upgraded Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, a software engineering evaluation, surpassing the previous state-of-the-art of 45%.
SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks. Specifically, it tests how well models can resolve GitHub issues from popular open-source Python repositories.
The benchmark evaluates not just the AI model in isolation, but rather an entire "agent" system. In this context, an "agent" refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt.
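To make this concrete, here is a minimal sketch of such a loop. The helper functions (`call_model`, `parse_action`, `run_tool`) are hypothetical placeholders standing in for the model API and tool layer, not the actual implementation:

```python
from typing import Callable, Optional

def run_agent(
    task: str,
    call_model: Callable[[str], str],              # sends a prompt, returns the model's reply
    parse_action: Callable[[str], Optional[dict]], # extracts a tool call, or None if finished
    run_tool: Callable[[dict], str],               # executes the tool call, returns its output
    max_turns: int = 100,
) -> str:
    """Drive the scaffolding loop: build prompt -> call model -> parse -> act -> repeat."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        # Build the next prompt from everything that has happened so far.
        reply = call_model("\n".join(history))
        history.append(reply)

        # Parse the model's output; None means the model reports it is done.
        action = parse_action(reply)
        if action is None:
            return reply

        # Execute the requested tool and fold its output into the next prompt.
        history.append(f"Tool output: {run_tool(action)}")
    return history[-1]
```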
The implementation relied on two main tools (a sketch of possible definitions follows the list):
- A Bash Tool for executing bash commands
- An Edit Tool for viewing, creating, and editing files and directories
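For illustration, the two tools above might be declared in a JSON-schema style along the following lines. The tool names, parameter fields, and the `str_replace` action are assumptions made for this sketch, not the actual definitions used in the evaluation:

```python
# Illustrative tool declarations; names and parameters are assumptions, not the real definitions.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a bash command and return its stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The command to execute."},
            },
            "required": ["command"],
        },
    },
    {
        "name": "edit",
        "description": "View, create, or edit files and directories.",
        "input_schema": {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": ["view", "create", "str_replace"],
                    "description": "Which file operation to perform.",
                },
                "path": {"type": "string", "description": "Path to the file or directory."},
                "content": {"type": "string", "description": "New or replacement text, where applicable."},
            },
            "required": ["action", "path"],
        },
    },
]
```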
Performance comparison across models:
- Claude 3.5 Sonnet (new): 49%
- Previous state-of-the-art: 45%
- Claude 3.5 Sonnet (old): 33%
- Claude 3 Opus: 22%
The implementation faced several challenges:
- Duration and high token costs, with some successful runs taking hundreds of turns and >100k tokens (see the budget sketch after this list)
- Grading issues related to environment setup and install patches
- Hidden tests preventing the model from seeing what it's being graded against
- Multimodal limitations, since the agent had no way to view files (such as images) saved to the filesystem or referenced as URLs
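To illustrate the first challenge, a scaffold generally needs hard caps so that long runs terminate. Below is a minimal sketch of such a budget check, assuming a crude character-based token estimate rather than a real tokenizer:

```python
def within_budget(history: list[str], turn: int,
                  max_turns: int = 200, max_tokens: int = 100_000) -> bool:
    """Return True while the run is still under its turn and token caps."""
    def estimate_tokens(text: str) -> int:
        # Crude assumption: roughly 4 characters per token. A real scaffold
        # would use the model's tokenizer or the API's reported usage.
        return len(text) // 4

    used_tokens = sum(estimate_tokens(message) for message in history)
    return turn < max_turns and used_tokens < max_tokens
```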