SWE-bench

Overview

We provide two scripts to run on the SWE-bench benchmark.
mini-extra swebench runs on all task instances in batch mode.
mini-extra swebench-single runs on a single task instance with interactivity (useful for debugging).

Usage

Batch modeSingle instance (for debugging)

mini-extra swebench --help
# or
python src/minisweagent/run/extra/swebench.py --help
# Example:
mini-extra swebench \
    --model claude-sonnet-4-20250514 \
    --subset verified \
    --split test \
    --workers 4

mini-extra swebench-single --help
# or
python src/minisweagent/run/extra/swebench_single.py --help
# Example:
mini-extra swebench-single \
    --subset verified \
    --split test \
    --model claude-sonnet-4-20250514 \
    -i sympy__sympy-15599
# or
mini-extra swebench-single \
    --subset verified \
    --split test \
    -m claude-sonnet-4-20250514 \
    -i 0  # instance index

Evaluating on SWE-bench

You can use the sb-cli for extremely fast, cloud-based evaluations (and it's free!). After installing it and getting a token, simply run:

sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id some-id-for-your-run

Typically you will have results within 20 minutes (this is not limited by how many instances you run, but by the slowest-to-evaluate instance in SWE-bench).

FAQ

What happens to uncompleted tasks when I abort with KeyboardInterrupt?

Trajectories are only saved upon completion, so most likely, you can just rerun the script to complete the tasks next time. However, you should still check for KeyboardInterrupt in preds.json in case some tasks were aborted but saved.

Certain tasks are being stuck even though I deleted the trajectories.

The completed instances are inferred from preds.json. Remove the corresponding items from the file.

How can I run on a different dataset?

As long as it follows the SWE-bench format, you can use --subset /path/to/your/dataset to run on a custom dataset. The dataset needs to be loadable as datasets.load_dataset(path, split=split).

Some progress runners are stuck at 'initializing task' for a very long time

They might be pulling docker containers -- the run sshould start immediately the next time.

Implementation

Default config

Read on GitHub

agent:
  system_template: |
    You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks.
    Your response must contain exactly ONE bash code block with ONE command (or commands connected with && or ||).

    Include a THOUGHT section before your command where you explain your reasoning process.
    Format your response as shown in <format_example>.

    <format_example>
    THOUGHT: Your reasoning and analysis here

    ```bash
    your_command_here
    ```
    </format_example>

    Failure to follow these rules will cause your response to be rejected.
  instance_template: |
    <pr_description>
    Consider the following PR description:
    {{task}}
    </pr_description>

    <instructions>
    # Task Instructions

    ## Overview
    You're a software engineer interacting continuously with a computer by submitting commands.
    You'll be helping implement necessary changes to meet requirements in the PR description.
    Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase.

    IMPORTANT: This is an interactive process where you will think and issue ONE command, see its result, then think and issue your next command.

    For each response:
    1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish
    2. Provide exactly ONE bash command to execute

    ## Important Boundaries
    - MODIFY: Regular source code files in {{working_dir}}
    - DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.)

    ## Recommended Workflow
    1. Analyze the codebase by finding and reading relevant files
    2. Create a script to reproduce the issue
    3. Edit the source code to resolve the issue
    4. Verify your fix works by running your script again
    5. Test edge cases to ensure your fix is robust

    ## Command Execution Rules
    You are operating in an environment where
    1. You write a single command
    2. The system executes that command in a subshell
    3. You see the result
    4. You write your next command

    Each response should include:
    1. A **THOUGHT** section where you explain your reasoning and plan
    2. A single bash code block with your command

    Format your responses like this:

    <format_example>
    THOUGHT: Here I explain my reasoning process, analysis of the current situation,
    and what I'm trying to accomplish with the command below.

    ```bash
    your_command_here
    ```
    </format_example>

    Commands must be specified in a single bash code block:

    ```bash
    your_command_here
    ```

    **CRITICAL REQUIREMENTS:**
    - Your response SHOULD include a THOUGHT section explaining your reasoning
    - Your response MUST include EXACTLY ONE bash code block
    - This bash block MUST contain EXACTLY ONE command (or a set of commands connected with && or ||)
    - If you include zero or multiple bash blocks, or no command at all, YOUR RESPONSE WILL FAIL
    - Do NOT try to run multiple independent commands in separate blocks in one response
    - Directory or environment variable changes are not persistent. Every action is executed in a new subshell.
    - However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files

    Example of a CORRECT response:
    <example_response>
    THOUGHT: I need to understand the structure of the repository first. Let me check what files are in the current directory to get a better understanding of the codebase.

    ```bash
    ls -la
    ```
    </example_response>

    Example of an INCORRECT response:
    <example_response>
    THOUGHT: I need to examine the codebase and then look at a specific file. I'll run multiple commands to do this.

    ```bash
    ls -la
    ```

    Now I'll read the file:

    ```bash
    cat file.txt
    ```
    </example_response>

    If you need to run multiple commands, either:
    1. Combine them in one block using && or ||
    ```bash
    command1 && command2 || echo "Error occurred"
    ```

    2. Wait for the first command to complete, see its output, then issue the next command in your following response.

    ## Environment Details
    - You have a full Linux shell environment
    - Always use non-interactive flags (-y, -f) for commands
    - Avoid interactive tools like vi, nano, or any that require user input
    - If a command isn't available, you can install it

    ## Useful Command Examples

    ### Create a new file:
    ```bash
    cat <<'EOF' > newfile.py
    import numpy as np
    hello = "world"
    print(hello)
    EOF
    ```

    ### Edit files with sed:
    ```bash
    # Replace all occurrences
    sed -i 's/old_string/new_string/g' filename.py

    # Replace only first occurrence
    sed -i 's/old_string/new_string/' filename.py

    # Replace first occurrence on line 1
    sed -i '1s/old_string/new_string/' filename.py

    # Replace all occurrences in lines 1-10
    sed -i '1,10s/old_string/new_string/g' filename.py
    ```

    ### View file content:
    ```bash
    # View specific lines with numbers
    nl -ba filename.py | sed -n '10,20p'
    ```

    ### Any other command you want to run
    ```bash
    anything
    ```

    ## Submission
    When you've completed your changes or can't make further progress
    issue exactly the following command:

    ```bash
    echo MINI_SWE_AGENT_FINAL_OUTPUT && git add -A && git diff --cached
    ```

    This command will submit your changes.
    You cannot continue working on this task after submitting.
    </instructions>
  action_observation_template: |
    <returncode>{{output.returncode}}</returncode>
    {% if output.output | length < 10000 -%}
    <output>
    {{ output.output -}}
    </output>
    {%- else -%}
    <warning>
    The output of your last command was too long.
    Please try a different command that produces less output.
    If you're looking at a file you can try use head, tail or sed to view a smaller number of lines selectively.
    If you're using grep or find and it produced too much output, you can use a more selective search pattern.
    If you really need to see something from the full command's output, you can redirect output to a file and then search in that file.
    </warning>
    {%- set elided_chars = output.output | length - 10000 -%}
    <output_head>
    {{ output.output[:5000] }}
    </output_head>
    <elided_chars>
    {{ elided_chars }} characters elided
    </elided_chars>
    <output_tail>
    {{ output.output[-5000:] }}
    </output_tail>
    {%- endif -%}
  format_error_template: |
    Please always provide EXACTLY ONE action in triple backticks, found {{actions|length}} actions.

    Please format your action in triple backticks as shown in <response_example>.

    <response_example>
    Here are some thoughts about why you want to perform the action.

    ```bash
    <action>
    ```
    </response_example>

    If you have completed your assignment, please consult the first message about how to
    submit your solution (you will not be able to continue working on this task after that).
  step_limit: 250
  cost_limit: 3.

environment:
  cwd: "/testbed"
  timeout: 60
  env:
    PAGER: cat
    MANPAGER: cat
    LESS: -R
    PIP_PROGRESS_BAR: 'off'
    TQDM_DISABLE: '1'

model:
  model_name: "claude-sonnet-4-20250514"
  model_kwargs:
    temperature: 0.0
    drop_params: true

swebench.py run script

#!/usr/bin/env python3

"""Run mini-SWE-agent on SWEBench instances.

[not dim]
More information about the usage: [bold green]https://mini-swe-agent.com/latest/usage/swebench/[/bold green]
[/not dim]
"""

import concurrent.futures
import json
import random
import re
import threading
import time
import traceback
from pathlib import Path

import typer
import yaml
from datasets import load_dataset
from rich.live import Live

from minisweagent.agents.default import DefaultAgent
from minisweagent.config import builtin_config_dir, get_config_path
from minisweagent.environments.docker import DockerEnvironment
from minisweagent.models import get_model
from minisweagent.run.extra.utils.batch_progress import RunBatchProgressManager
from minisweagent.run.utils.save import save_traj

app = typer.Typer(rich_markup_mode="rich", add_completion=False)

DATASET_MAPPING = {
    "full": "princeton-nlp/SWE-Bench",
    "verified": "princeton-nlp/SWE-Bench_Verified",
    "lite": "princeton-nlp/SWE-Bench_Lite",
    "multimodal": "princeton-nlp/SWE-Bench_Multimodal",
    "multilingual": "swe-bench/SWE-Bench_Multilingual",
    "smith": "SWE-bench/SWE-smith",
    "_test": "klieret/swe-bench-dummy-test-dataset",
}


_OUTPUT_FILE_LOCK = threading.Lock()


class ProgressTrackingAgent(DefaultAgent):
    """Simple wrapper around DefaultAgent that provides progress updates."""

    def __init__(self, *args, progress_manager: RunBatchProgressManager, instance_id: str = "", **kwargs):
        super().__init__(*args, **kwargs)
        self.progress_manager: RunBatchProgressManager = progress_manager
        self.instance_id = instance_id

    def step(self) -> dict:
        """Override step to provide progress updates."""
        self.progress_manager.update_instance_status(
            self.instance_id, f"Step {self.model.n_calls + 1:3d} (${self.model.cost:.2f})"
        )
        return super().step()


def get_swebench_docker_image_name(instance: dict) -> str:
    """Get the image name for a SWEBench instance."""
    image_name = instance.get("image_name", None)
    if image_name is None:
        # Docker doesn't allow double underscore, so we replace them with a magic token
        iid = instance["instance_id"]
        id_docker_compatible = iid.replace("__", "_1776_")
        image_name = f"swebench/sweb.eval.x86_64.{id_docker_compatible}:latest".lower()
    return image_name


def update_preds_file(output_path: Path, instance_id: str, model_name: str, result: str):
    """Update the output JSON file with results from a single instance."""
    with _OUTPUT_FILE_LOCK:
        output_data = {}
        if output_path.exists():
            output_data = json.loads(output_path.read_text())
        output_data[instance_id] = {
            "model_name_or_path": model_name,
            "instance_id": instance_id,
            "model_patch": result,
        }
        output_path.write_text(json.dumps(output_data, indent=2))


def remove_from_preds_file(output_path: Path, instance_id: str):
    """Remove an instance from the predictions file."""
    if not output_path.exists():
        return
    with _OUTPUT_FILE_LOCK:
        output_data = json.loads(output_path.read_text())
        if instance_id in output_data:
            del output_data[instance_id]
            output_path.write_text(json.dumps(output_data, indent=2))


def process_instance(
    instance: dict,
    output_dir: Path,
    model_name: str | None,
    config_path: str | Path,
    progress_manager: RunBatchProgressManager,
) -> None:
    """Process a single SWEBench instance."""
    instance_id = instance["instance_id"]
    instance_dir = output_dir / instance_id
    # avoid inconsistent state if something here fails and there's leftover previous files
    remove_from_preds_file(output_dir / "preds.json", instance_id)
    (instance_dir / f"{instance_id}.traj.json").unlink(missing_ok=True)

    image_name = get_swebench_docker_image_name(instance)
    config = yaml.safe_load(get_config_path(config_path).read_text())
    model = get_model(model_name, config=config.get("model", {}))
    task = instance["problem_statement"]

    progress_manager.on_instance_start(instance_id)
    progress_manager.update_instance_status(instance_id, "Pulling/starting docker")

    agent = None
    extra_info = None

    try:
        env = DockerEnvironment(**(config.get("environment", {}) | {"image": image_name}))
        agent = ProgressTrackingAgent(
            model,
            env,
            progress_manager=progress_manager,
            instance_id=instance_id,
            **config.get("agent", {}),
        )
        exit_status, result = agent.run(task)
    except Exception as e:
        print(f"Error processing instance {instance_id}: {e}\n{traceback.format_exc()}")
        exit_status, result = type(e).__name__, str(e)
        extra_info = {"traceback": traceback.format_exc()}
    finally:
        save_traj(
            agent,
            instance_dir / f"{instance_id}.traj.json",
            exit_status=exit_status,
            result=result,
            extra_info=extra_info,
            instance_id=instance_id,
        )
        update_preds_file(output_dir / "preds.json", instance_id, model.config.model_name, result)
        progress_manager.on_instance_end(instance_id, exit_status)


def filter_instances(
    instances: list[dict], *, filter_spec: str, slice_spec: str = "", shuffle: bool = False
) -> list[dict]:
    """Filter and slice a list of SWEBench instances."""
    if shuffle:
        instances = sorted(instances.copy(), key=lambda x: x["instance_id"])
        random.seed(42)
        random.shuffle(instances)
    before_filter = len(instances)
    instances = [instance for instance in instances if re.match(filter_spec, instance["instance_id"])]
    if (after_filter := len(instances)) != before_filter:
        print(f"Instance filter: {before_filter} -> {after_filter} instances")
    if slice_spec:
        values = [int(x) if x else None for x in slice_spec.split(":")]
        instances = instances[slice(*values)]
        if (after_slice := len(instances)) != before_filter:
            print(f"Instance slice: {before_filter} -> {after_slice} instances")
    return instances


@app.command()
def main(
    subset: str = typer.Option("lite", "--subset", help="SWEBench subset to use or path to a dataset"),
    split: str = typer.Option("dev", "--split", help="Dataset split"),
    slice_spec: str = typer.Option("", "--slice", help="Slice specification (e.g., '0:5' for first 5 instances)"),
    filter_spec: str = typer.Option("", "--filter", help="Filter instance IDs by regex"),
    shuffle: bool = typer.Option(False, "--shuffle", help="Shuffle instances"),
    output: str = typer.Option("", "-o", "--output", help="Output directory"),
    workers: int = typer.Option(1, "-w", "--workers", help="Number of worker threads for parallel processing"),
    model: str | None = typer.Option(None, "-m", "--model", help="Model to use"),
    redo_existing: bool = typer.Option(False, "--redo-existing", help="Redo existing instances"),
    config: Path = typer.Option(
        builtin_config_dir / "extra" / "swebench.yaml", "-c", "--config", help="Path to a config file"
    ),
) -> None:
    """Run mini-SWE-agent on SWEBench instances"""
    dataset_path = DATASET_MAPPING.get(subset, subset)
    print(f"Loading dataset {dataset_path}, split {split}...")
    instances = list(load_dataset(dataset_path, split=split))

    instances = filter_instances(instances, filter_spec=filter_spec, slice_spec=slice_spec, shuffle=shuffle)
    output_path = Path(output)
    if not redo_existing and (output_path / "preds.json").exists():
        existing_instances = list(json.loads((output_path / "preds.json").read_text()).keys())
        print(f"Skipping {len(existing_instances)} existing instances")
        instances = [instance for instance in instances if instance["instance_id"] not in existing_instances]

    output_path.mkdir(parents=True, exist_ok=True)
    print(f"Running on {len(instances)} instances...")
    print(f"Results will be saved to {output_path}")

    progress_manager = RunBatchProgressManager(len(instances), output_path / f"exit_statuses_{time.time()}.yaml")

    def process_futures(futures: dict[concurrent.futures.Future, str]):
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except concurrent.futures.CancelledError:
                pass
            except Exception as e:
                instance_id = futures[future]
                print(f"Error in future for instance {instance_id}: {e}")
                traceback.print_exc()
                progress_manager.on_uncaught_exception(instance_id, e)

    with Live(progress_manager.render_group, refresh_per_second=4):
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            futures = {
                executor.submit(process_instance, instance, output_path, model, config, progress_manager): instance[
                    "instance_id"
                ]
                for instance in instances
            }
            try:
                process_futures(futures)
            except KeyboardInterrupt:
                print("Cancelling all pending jobs. Press ^C again to exit immediately.")
                for future in futures:
                    if not future.running() and not future.done():
                        future.cancel()
                process_futures(futures)


if __name__ == "__main__":
    app()

swebench_single.py run script

"""Run on a single SWE-Bench instance."""

from pathlib import Path

import typer
import yaml
from datasets import load_dataset

from minisweagent.agents.interactive import InteractiveAgent
from minisweagent.config import builtin_config_dir, get_config_path
from minisweagent.environments.docker import DockerEnvironment
from minisweagent.models import get_model
from minisweagent.run.extra.swebench import DATASET_MAPPING, get_swebench_docker_image_name

app = typer.Typer(add_completion=False)


@app.command()
def main(
    subset: str = typer.Option("lite", "--subset", help="SWEBench subset to use or path to a dataset"),
    split: str = typer.Option("dev", "--split", help="Dataset split"),
    instance_spec: str = typer.Option(None, "-i", "--instance", help="SWE-Bench instance ID"),
    model_name: str | None = typer.Option(None, "-m", "--model", help="Model to use"),
    config_path: Path = typer.Option(
        builtin_config_dir / "extra" / "swebench.yaml", "-c", "--config", help="Path to a config file"
    ),
) -> None:
    """Run on a single SWE-Bench instance."""
    try:
        dataset_path = DATASET_MAPPING[subset]
    except KeyError:
        dataset_path = subset
    print(f"Loading dataset {dataset_path}, split {split}...")
    instances = {
        inst["instance_id"]: inst  # type: ignore
        for inst in load_dataset(dataset_path, split=split)
    }
    if instance_spec.isnumeric():
        instance_spec = sorted(instances.keys())[int(instance_spec)]
    instance: dict = instances[instance_spec]  # type: ignore

    _config = yaml.safe_load(get_config_path(config_path).read_text())
    env = DockerEnvironment(**(_config.get("environment", {}) | {"image": get_swebench_docker_image_name(instance)}))
    agent = InteractiveAgent(
        get_model(model_name, _config.get("model", {})),
        env,
        **(_config.get("agent", {}) | {"mode": "yolo"}),
    )
    agent.run(instance["problem_statement"])


if __name__ == "__main__":
    app()

bug_report Something broken/unclear?

Open an issue on GitHub!

help Open-ended discussions

Join our Slack!

Our projects