GPT-5 on SWE-bench: Cost & performance deep-dive
This blog post covers the results of running mini-SWE-agent with GPT-5, GPT-5-mini, and GPT-5-nano. Results will be added to the SWE-bench (bash-only) leaderboard shortly.
- GPT-5 is as good as Sonnet 4, but quite a bit cheaper
- For sacrificing only a little bit of performance (5%pt), GPT-5-mini is incredibly cheap
- GPT-5-nano is even cheaper; I would say you pay half for half the performance
- You can reproduce our numbers for just $18 (with GPT-5-mini) using the command at the bottom!
SWE-bench scores
First of all, the mandatory bar chart:
Immediately we can see that Anthropic's Claude Opus 4 is still unbeaten, and GPT-5 is on par with Claude Sonnet 4.
However, we're still very excited about GPT-5, and that's because of the cost!
Note that we run all `GPT-5-*` models with the default settings (verbosity and reasoning effort set to medium). Sonnet 4 is run at zero temperature (there is no temperature setting for the `GPT-5-*` models).
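For reference, here is a rough sketch of what those settings boil down to at the API level. `mini` goes through litellm under the hood (the `drop_params` key in the config further down is a litellm option), but the exact call is wrapped inside its model class, so treat this as an illustration rather than our exact setup:

```python
# Rough sketch of the decoding settings, not mini's actual code path.
from litellm import completion

messages = [{"role": "user", "content": "Say hello"}]

# Sonnet 4: greedy decoding via temperature 0 (as in swebench.yaml below)
completion(model="claude-sonnet-4-20250514", messages=messages, temperature=0.0)

# GPT-5-mini: no temperature knob; verbosity and reasoning effort stay at their "medium" defaults,
# and drop_params makes litellm drop any parameter the model doesn't support.
completion(model="openai/gpt-5-mini", messages=messages, drop_params=True)
```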
Also note that this is a different evaluation than the one in the GPT-5 blog post, as they evaluate using Agentless. As the name implies, this is not really an agent, but rather a RAG-based system that proposes a lot of different "one-shot" edits, out of which the best one is chosen. This is a fantastic system, but it is also relatively complex (and all the RAG needs to be specifically engineered for each language that you're tackling).
In contrast, our `mini` agent is really just this class:
Agent class
"""Basic agent class. See https://mini-swe-agent.com/latest/advanced/control_flow/ for visual explanation."""
import os
import platform
import re
import subprocess
from collections.abc import Callable
from dataclasses import asdict, dataclass
from jinja2 import Template
from minisweagent import Environment, Model
@dataclass
class AgentConfig:
# The default settings are the bare minimum to run the agent. Take a look at the config files for improved settings.
system_template: str = "You are a helpful assistant that can do anything."
instance_template: str = (
"Your task: {{task}}. Please reply with a single shell command in triple backticks. "
"To finish, the first line of the output of the shell command must be 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'."
)
timeout_template: str = (
"The last command <command>{{action['action']}}</command> timed out and has been killed.\n"
"The output of the command was:\n <output>\n{{output}}\n</output>\n"
"Please try another command and make sure to avoid those requiring interactive input."
)
format_error_template: str = "Please always provide EXACTLY ONE action in triple backticks."
action_observation_template: str = "Observation: {{output}}"
step_limit: int = 0
cost_limit: float = 3.0
class NonTerminatingException(Exception):
"""Raised for conditions that can be handled by the agent."""
class FormatError(NonTerminatingException):
"""Raised when the LM's output is not in the expected format."""
class ExecutionTimeoutError(NonTerminatingException):
"""Raised when the action execution timed out."""
class TerminatingException(Exception):
"""Raised for conditions that terminate the agent."""
class Submitted(TerminatingException):
"""Raised when the LM declares that the agent has finished its task."""
class LimitsExceeded(TerminatingException):
"""Raised when the agent has reached its cost or step limit."""
class DefaultAgent:
def __init__(self, model: Model, env: Environment, *, config_class: Callable = AgentConfig, **kwargs):
self.config = config_class(**kwargs)
self.messages: list[dict] = []
self.model = model
self.env = env
def render_template(self, template: str, **kwargs) -> str:
cs = asdict(self.config) | asdict(self.env.config) | asdict(self.model.config) | platform.uname()._asdict()
return Template(template).render(**kwargs, **cs, **os.environ)
def add_message(self, role: str, content: str, **kwargs):
self.messages.append({"role": role, "content": content, **kwargs})
def run(self, task: str) -> tuple[str, str]:
"""Run step() until agent is finished. Return exit status & message"""
self.messages = []
self.add_message("system", self.render_template(self.config.system_template))
self.add_message("user", self.render_template(self.config.instance_template, task=task))
while True:
try:
self.step()
except NonTerminatingException as e:
self.add_message("user", str(e))
except TerminatingException as e:
self.add_message("user", str(e))
return type(e).__name__, str(e)
def step(self) -> dict:
"""Query the LM, execute the action, return the observation."""
return self.get_observation(self.query())
def query(self) -> dict:
"""Query the model and return the response."""
if 0 < self.config.step_limit <= self.model.n_calls or 0 < self.config.cost_limit <= self.model.cost:
raise LimitsExceeded()
response = self.model.query(self.messages)
self.add_message("assistant", **response)
return response
def get_observation(self, response: dict) -> dict:
"""Execute the action and return the observation."""
output = self.execute_action(self.parse_action(response))
observation = self.render_template(self.config.action_observation_template, output=output)
self.add_message("user", observation)
return output
def parse_action(self, response: dict) -> dict:
"""Parse the action from the message. Returns the action."""
actions = re.findall(r"```bash\n(.*?)\n```", response["content"], re.DOTALL)
if len(actions) == 1:
return {"action": actions[0].strip(), **response}
raise FormatError(self.render_template(self.config.format_error_template, actions=actions))
def execute_action(self, action: dict) -> dict:
try:
output = self.env.execute(action["action"])
except subprocess.TimeoutExpired as e:
output = e.output.decode("utf-8", errors="replace") if e.output else ""
raise ExecutionTimeoutError(
self.render_template(self.config.timeout_template, action=action, output=output)
)
except TimeoutError:
raise ExecutionTimeoutError(self.render_template(self.config.timeout_template, action=action, output=""))
self.has_finished(output)
return output
def has_finished(self, output: dict[str, str]):
"""Raises Submitted exception with final output if the agent has finished its task."""
lines = output.get("output", "").lstrip().splitlines()
if lines and lines[0].strip() in ["MINI_SWE_AGENT_FINAL_OUTPUT", "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"]:
raise Submitted("\n".join(lines[1:]))
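To get a feel for how this class is driven, here is a rough usage sketch. The concrete model/environment classes and their import paths are assumptions about the `minisweagent` package layout (check the docs for the real ones); only `DefaultAgent` itself comes from the listing above.

```python
# Usage sketch only; LitellmModel / LocalEnvironment and their import paths are assumptions.
from minisweagent.environments.local import LocalEnvironment  # assumed path
from minisweagent.models.litellm_model import LitellmModel    # assumed path

model = LitellmModel(model_name="openai/gpt-5-mini")  # assumed constructor signature
env = LocalEnvironment(cwd=".", timeout=60)           # mirrors the environment keys in the config below

agent = DefaultAgent(model, env, step_limit=50, cost_limit=1.0)
exit_status, message = agent.run("Report the current git status.")
print(exit_status)  # "Submitted" or "LimitsExceeded" (the exception class name returned by run())
```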
Agent control flow
Check out the control flow guide for a visual explanation of the agent's control flow following this picture:
SWE-bench config
````yaml
agent:
  system_template: |
    You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks.
    Your response must contain exactly ONE bash code block with ONE command (or commands connected with && or ||).
    Include a THOUGHT section before your command where you explain your reasoning process.
    Format your response as shown in <format_example>.
    <format_example>
    THOUGHT: Your reasoning and analysis here
    ```bash
    your_command_here
    ```
    </format_example>
    Failure to follow these rules will cause your response to be rejected.
  instance_template: |
    <pr_description>
    Consider the following PR description:
    {{task}}
    </pr_description>
    <instructions>
    # Task Instructions
    ## Overview
    You're a software engineer interacting continuously with a computer by submitting commands.
    You'll be helping implement necessary changes to meet requirements in the PR description.
    Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase.
    IMPORTANT: This is an interactive process where you will think and issue ONE command, see its result, then think and issue your next command.
    For each response:
    1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish
    2. Provide exactly ONE bash command to execute
    ## Important Boundaries
    - MODIFY: Regular source code files in {{working_dir}}
    - DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.)
    ## Recommended Workflow
    1. Analyze the codebase by finding and reading relevant files
    2. Create a script to reproduce the issue
    3. Edit the source code to resolve the issue
    4. Verify your fix works by running your script again
    5. Test edge cases to ensure your fix is robust
    ## Command Execution Rules
    You are operating in an environment where
    1. You write a single command
    2. The system executes that command in a subshell
    3. You see the result
    4. You write your next command
    Each response should include:
    1. A **THOUGHT** section where you explain your reasoning and plan
    2. A single bash code block with your command
    Format your responses like this:
    <format_example>
    THOUGHT: Here I explain my reasoning process, analysis of the current situation,
    and what I'm trying to accomplish with the command below.
    ```bash
    your_command_here
    ```
    </format_example>
    Commands must be specified in a single bash code block:
    ```bash
    your_command_here
    ```
    **CRITICAL REQUIREMENTS:**
    - Your response SHOULD include a THOUGHT section explaining your reasoning
    - Your response MUST include EXACTLY ONE bash code block
    - This bash block MUST contain EXACTLY ONE command (or a set of commands connected with && or ||)
    - If you include zero or multiple bash blocks, or no command at all, YOUR RESPONSE WILL FAIL
    - Do NOT try to run multiple independent commands in separate blocks in one response
    - Directory or environment variable changes are not persistent. Every action is executed in a new subshell.
    - However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files
    Example of a CORRECT response:
    <example_response>
    THOUGHT: I need to understand the structure of the repository first. Let me check what files are in the current directory to get a better understanding of the codebase.
    ```bash
    ls -la
    ```
    </example_response>
    Example of an INCORRECT response:
    <example_response>
    THOUGHT: I need to examine the codebase and then look at a specific file. I'll run multiple commands to do this.
    ```bash
    ls -la
    ```
    Now I'll read the file:
    ```bash
    cat file.txt
    ```
    </example_response>
    If you need to run multiple commands, either:
    1. Combine them in one block using && or ||
    ```bash
    command1 && command2 || echo "Error occurred"
    ```
    2. Wait for the first command to complete, see its output, then issue the next command in your following response.
    ## Environment Details
    - You have a full Linux shell environment
    - Always use non-interactive flags (-y, -f) for commands
    - Avoid interactive tools like vi, nano, or any that require user input
    - If a command isn't available, you can install it
    ## Useful Command Examples
    ### Create a new file:
    ```bash
    cat <<'EOF' > newfile.py
    import numpy as np
    hello = "world"
    print(hello)
    EOF
    ```
    ### Edit files with sed:
    ```bash
    # Replace all occurrences
    sed -i 's/old_string/new_string/g' filename.py
    # Replace only first occurrence
    sed -i 's/old_string/new_string/' filename.py
    # Replace first occurrence on line 1
    sed -i '1s/old_string/new_string/' filename.py
    # Replace all occurrences in lines 1-10
    sed -i '1,10s/old_string/new_string/g' filename.py
    ```
    ### View file content:
    ```bash
    # View specific lines with numbers
    nl -ba filename.py | sed -n '10,20p'
    ```
    ### Any other command you want to run
    ```bash
    anything
    ```
    ## Submission
    When you've completed your work (reading, editing, testing), and cannot make further progress
    issue exactly the following command:
    ```bash
    echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && git add -A && git diff --cached
    ```
    This command will submit your work.
    You cannot continue working (reading, editing, testing) in any way on this task after submitting.
    </instructions>
  action_observation_template: |
    <returncode>{{output.returncode}}</returncode>
    {% if output.output | length < 10000 -%}
    <output>
    {{ output.output -}}
    </output>
    {%- else -%}
    <warning>
    The output of your last command was too long.
    Please try a different command that produces less output.
    If you're looking at a file you can try use head, tail or sed to view a smaller number of lines selectively.
    If you're using grep or find and it produced too much output, you can use a more selective search pattern.
    If you really need to see something from the full command's output, you can redirect output to a file and then search in that file.
    </warning>
    {%- set elided_chars = output.output | length - 10000 -%}
    <output_head>
    {{ output.output[:5000] }}
    </output_head>
    <elided_chars>
    {{ elided_chars }} characters elided
    </elided_chars>
    <output_tail>
    {{ output.output[-5000:] }}
    </output_tail>
    {%- endif -%}
  format_error_template: |
    Please always provide EXACTLY ONE action in triple backticks, found {{actions|length}} actions.
    Please format your action in triple backticks as shown in <response_example>.
    <response_example>
    Here are some thoughts about why you want to perform the action.
    ```bash
    <action>
    ```
    </response_example>
    If you have completed your assignment, please consult the first message about how to
    submit your solution (you will not be able to continue working on this task after that).
  step_limit: 250
  cost_limit: 3.
environment:
  cwd: "/testbed"
  timeout: 60
  env:
    PAGER: cat
    MANPAGER: cat
    LESS: -R
    PIP_PROGRESS_BAR: 'off'
    TQDM_DISABLE: '1'
model:
  model_name: "claude-sonnet-4-20250514"
  model_kwargs:
    drop_params: true
    temperature: 0.0
````
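As a sanity check that the two listings fit together: every key under `agent:` in this file is a field of the `AgentConfig` dataclass shown earlier, so the whole section can be splatted into it. A rough sketch (assuming the file is saved as `swebench.yaml`; the import path is an assumption about the package layout, and `mini`'s real entry points do this wiring for you):

```python
# Sketch: load the YAML above and build an AgentConfig from its agent: section.
import yaml  # PyYAML

from minisweagent.agents.default import AgentConfig  # assumed import path

with open("swebench.yaml") as f:
    cfg = yaml.safe_load(f)

agent_config = AgentConfig(**cfg["agent"])  # system_template, instance_template, step_limit, ...
print(agent_config.step_limit)              # 250
print(agent_config.cost_limit)              # 3.0
print(cfg["model"]["model_name"])           # claude-sonnet-4-20250514
```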
Cost analysis
Cost is tricky to compare with agents, because agents succeed fast but fail slowly: if an agent doesn't succeed, it should just keep trying until it either succeeds or hits a runtime limit, and that's (almost) what happens.
For a fair comparison, all LMs benchmarked with `mini` on our SWE-bench (bash-only) leaderboard are run with a $3 budget and up to 250 steps.
However, most LMs succeed much earlier (usually well before 50 steps).
Here's what this looks like:
Right away we notice a few things:
- `GPT-5-*` shows strongly diminishing returns already after 30 steps
- Definitely don't run it for more than 50 steps
- Sonnet 4 takes more steps and only maxes out at around 100 steps
Note that for this plot, we assume that if the agent doesn't submit its solution by step i, it hasn't solved the problem yet (this is a slight simplification, because the agent might still do extended testing after all the edits).
What does this mean for the cost? If we look at agent performance & cost for different step limits, we get the following plot (here every point is the performance/cost at one specific step limit):
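To make that concrete, here is a rough sketch of how each such point can be computed. The per-instance record fields (`resolved`, `submit_step`, `cumulative_cost`) are made up for illustration and are not the leaderboard's actual data format:

```python
# Illustrative sketch only; the record layout is an assumption, not the real data format.
def score_at_step_limit(results: list[dict], step_limit: int) -> tuple[float, float]:
    """Resolve rate and mean cost per instance if every run were cut off at `step_limit` steps."""
    solved = 0
    total_cost = 0.0
    for r in results:
        # Under the simplification above, an instance only counts as solved
        # if the agent submitted within the step limit.
        if r["resolved"] and r["submit_step"] <= step_limit:
            solved += 1
        # Cost accrues until the agent submits or the step limit kicks in.
        steps_run = min(len(r["cumulative_cost"]), step_limit)
        if steps_run:
            total_cost += r["cumulative_cost"][steps_run - 1]
    return solved / len(results), total_cost / len(results)

# One (resolve rate, mean cost) point per step limit, as in the plot:
# points = [score_at_step_limit(results, limit) for limit in range(1, 251)]
```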
Conclusions:
- GPT-5 is cheaper than Sonnet 4 (how much depends on how much you care about every little bit of performance)
- But GPT-5-mini is the real winner here! It definitely maxes out at less than 1/5th of the cost of GPT-5, and you only sacrifice some 5%pt of performance!
- GPT-5-nano is even cheaper, maxing out at around 1.5ct/instance!
So what's the overall takeaway?
- GPT-5 is as good as Sonnet 4, but somewhat cheaper
- For sacrificing only a little bit of performance (5%pt), GPT-5-mini is incredibly cheap
- GPT-5-nano is even cheaper, but probably not worth it for most SWE use cases

So the real winner in my opinion is GPT-5-mini!
Want to run `mini` with GPT-5? Check out our notes on how to run `mini` with GPT-5.
You can reproduce our numbers in this blog by following the swebench tutorial, but the tl;dr is to run (remove the temperature setting from the `swebench.yaml` file first, because it's not supported by the `GPT-5-*` models):
```bash
mini-extra swebench --subset verified --split test --shuffle \
  --model openai/gpt-5-mini -o gpt-5-mini --workers 10
```
and to evaluate:

```bash
cd gpt-5-mini
sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id gpt-5-mini
```
GPT-5-mini ran in around 1.5h for $18.