SWE-bench
Overview
- We provide two scripts to run on the SWE-bench benchmark.
- `mini-extra swebench` runs on all task instances in batch mode.
- `mini-extra swebench-single` runs on a single task instance with interactivity (useful for debugging).
- You can also take a look at the run scripts to figure out how to build your own batch processing pipeline.
Usage
Docker container availability
The docker containers for Linux are built for the x86_64 architecture; you might not be able to run them on other architectures.
Quickstart
We provide two different scripts: `swebench` and `swebench-single`:
Batch mode runs on all task instances in parallel.
```bash
mini-extra swebench --help
# or
python src/minisweagent/run/benchmarks/swebench.py --help
# Example:
mini-extra swebench \
--model anthropic/claude-sonnet-4-5-20250929 \
--subset verified \
--split test \
--workers 4
```
Basic flags:
- `-o`, `--output` - Output directory
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)
Data selection flags:
- `--subset` - SWE-bench subset to use or path to a dataset (default: `lite`)
- `--split` - Dataset split (default: `dev`)
- `--slice` - Slice specification (e.g., `0:5` for the first 5 instances)
- `--filter` - Filter instance IDs by regex
- `--shuffle` - Shuffle instances (default: `False`)
- `--redo-existing` - Redo existing instances (default: `False`)
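For example, to run on the first 20 verified instances whose IDs match a regex (the pattern here is purely illustrative):

```bash
mini-extra swebench \
  -m anthropic/claude-sonnet-4-5-20250929 \
  --subset verified \
  --split test \
  --slice 0:20 \
  --filter 'django__.*' \
  -w 4
```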
Advanced flags:
- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
Single instance mode runs on a single task instance with interactivity. It is meant for debugging, so unlike the batch mode command above, it will not produce a `preds.json` file.
```bash
mini-extra swebench-single --help
# or
python src/minisweagent/run/benchmarks/swebench_single.py --help
# Example:
mini-extra swebench-single \
--subset verified \
--split test \
--model anthropic/claude-sonnet-4-5-20250929 \
-i sympy__sympy-15599
# or
mini-extra swebench-single \
--subset verified \
--split test \
-m anthropic/claude-sonnet-4-5-20250929 \
-i 0 # instance index
```
Note: If you want to run the script without prompting for confirmation at exit, add the `--exit-immediately` flag.
Basic flags:
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-o`, `--output` - Output trajectory file (default: saves to global config directory)
Data selection flags:
- `--subset` - SWE-bench subset to use or path to a dataset (default: `lite`)
- `--split` - Dataset split (default: `dev`)
- `-i`, `--instance` - SWE-bench instance ID or index (default: `0`)
Advanced flags:
- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
- `--exit-immediately` - Exit immediately when the agent wants to finish instead of prompting (default: `False`)
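For example, combining the flags above to debug one instance on a machine without Docker, skipping the exit confirmation:

```bash
mini-extra swebench-single \
  --subset verified \
  --split test \
  -m anthropic/claude-sonnet-4-5-20250929 \
  -i sympy__sympy-15599 \
  --environment-class singularity \
  --exit-immediately
```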
Evaluating on SWE-bench
You have two options to evaluate on SWE-bench: the free cloud-based evaluation via `sb-cli`, or a local installation of the SWE-bench harness.
You can use `sb-cli` for extremely fast, cloud-based evaluations (and it's free!). After installing it and getting a token, simply run:
```bash
sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id some-id-for-your-run
```
Typically you will have results within 20 minutes (this is not limited by how many instances you run, but by the slowest-to-evaluate instance in SWE-bench).
You can also use a local installation of SWE-bench for evaluation:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path preds.json \
--max_workers <num_workers> \
--run_id <run_id>
```
FAQ
Can I set global cost limits?
Yes, you can set global call and cost limits with the `MSWEA_GLOBAL_CALL_LIMIT` and `MSWEA_GLOBAL_COST_LIMIT` environment variables (or the corresponding global config entries).
See global configuration for more details.
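For example (a sketch; the limit values are illustrative):

```bash
export MSWEA_GLOBAL_COST_LIMIT=100    # stop once total spend reaches $100
export MSWEA_GLOBAL_CALL_LIMIT=5000   # stop after 5000 model calls
```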
What happens to uncompleted tasks when I abort with KeyboardInterrupt?
Trajectories are only saved upon completion, so most likely, you can just rerun the script to complete the tasks next time.
However, you should still check for KeyboardInterrupt in preds.json in case some tasks were aborted but saved.
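One way to spot affected instances is to look for entries with an empty `model_patch` (a sketch based on the preds.json format written by the run script below; adjust the path to your output directory):

```python
import json
from pathlib import Path

preds = json.loads(Path("outputs/preds.json").read_text())  # hypothetical output path
print([iid for iid, pred in preds.items() if not pred["model_patch"]])
```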
Certain tasks are stuck even though I deleted the trajectories.
The completed instances are inferred from `preds.json`. Remove the corresponding entries from that file.
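For example, a minimal sketch (the path and instance ID are illustrative):

```python
import json
from pathlib import Path

preds_path = Path("outputs/preds.json")  # your -o output directory
preds = json.loads(preds_path.read_text())
preds.pop("sympy__sympy-15599", None)  # the instance you want to rerun
preds_path.write_text(json.dumps(preds, indent=2))
```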
How can I run on a different dataset?
As long as it follows the SWE-bench format, you can use `--subset /path/to/your/dataset` to run on a custom dataset.
The dataset needs to be loadable as `datasets.load_dataset(path, split=split)`.
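A quick sanity check, assuming a local dataset path (illustrative): the run script below reads `instance_id` and `problem_statement` from each instance, plus `image_name`/`docker_image` if you want to override the default docker image.

```python
from datasets import load_dataset

ds = load_dataset("/path/to/your/dataset", split="test")  # illustrative path and split
missing = {"instance_id", "problem_statement"} - set(ds.column_names)
assert not missing, f"missing required fields: {missing}"
```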
Some runners are stuck at 'initializing task' for a very long time / time out
They might be pulling docker images -- the run should start immediately the next time.
If you see timeouts because of docker pull operations, you might want to increase `environment.pull_timeout`
from the default of 120 (seconds).
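For example, in your config file (the value is illustrative):

```yaml
environment:
  pull_timeout: 600
```

or equivalently on the command line: `-c swebench.yaml -c environment.pull_timeout=600`.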
I have some docker issues
Try running the docker command manually to see what's going on (it should be printed out in the console).
Confirm that the container is running with `docker ps`, and that you can get output with `docker exec -it <container-id> ls`.
Docker isn't available on my HPC cluster.
You can use the singularity/apptainer backend by setting `environment.environment_class` to `singularity`
in your agent config file, or by specifying `--environment-class singularity` on the command line.
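A minimal config sketch:

```yaml
environment:
  environment_class: singularity
```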
Can I run a startup command in the environment?
Yes, you can use the `run.env_startup_command` config option to run a command in the environment before the agent starts.
For example:
```yaml
run:
env_startup_command: "apt-get update && apt-get install -y python3-pip"
The command is rendered with Jinja2, with the instance's fields available as template variables.
For example, you could use
```yaml
run:
env_startup_command: "git clone {{ repo_url }} . --force"
which might be particularly useful when running with environments like bubblewrap.
What environment can I use for SWE-bench?
See this guide for more details.
Implementation
Default config
````yaml
agent:
system_template: |
You are a helpful assistant that can interact with a computer shell to solve programming tasks.
instance_template: |
<pr_description>
Consider the following PR description:
{{task}}
</pr_description>
<instructions>
# Task Instructions
## Overview
You're a software engineer interacting continuously with a computer by submitting commands.
You'll be helping implement necessary changes to meet requirements in the PR description.
Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase.
    <IMPORTANT>This is an interactive process where you will think and issue AT LEAST ONE command, see the result, then think and issue your next command(s).</IMPORTANT>
For each response:
1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish
2. Provide one or more bash tool calls to execute
## Important Boundaries
- MODIFY: Regular source code files in /testbed (this is the working directory for all your subsequent commands)
- DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.)
## Recommended Workflow
1. Analyze the codebase by finding and reading relevant files
2. Create a script to reproduce the issue
3. Edit the source code to resolve the issue
4. Verify your fix works by running your script again
5. Test edge cases to ensure your fix is robust
## Command Execution Rules
You are operating in an environment where
1. You issue at least one command
2. The system executes the command(s) in a subshell
3. You see the result(s)
4. You write your next command(s)
Each response should include:
1. **Reasoning text** where you explain your analysis and plan
2. At least one tool call with your command
**CRITICAL REQUIREMENTS:**
- Your response SHOULD include reasoning text explaining what you're doing
- Your response MUST include AT LEAST ONE bash tool call. You can make MULTIPLE tool calls in a single response when the commands are independent (e.g., searching multiple files, reading different parts of the codebase).
- Directory or environment variable changes are not persistent. Every action is executed in a new subshell.
- However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files
Example of a CORRECT response:
<example_response>
I need to understand the Builder-related code. Let me find relevant files and check the project structure.
[Makes multiple bash tool calls: {"command": "ls -la"}, {"command": "find src -name '*.java' | grep -i builder"}, {"command": "cat README.md | head -50"}]
</example_response>
## Environment Details
- You have a full Linux shell environment
- Always use non-interactive flags (-y, -f) for commands
- Avoid interactive tools like vi, nano, or any that require user input
- You can use bash commands or invoke any tool that is available in the environment
- You can also create new tools or scripts to help you with the task
- If a tool isn't available, you can also install it
## Submission
When you've completed your work, you MUST submit your changes as a git patch.
Follow these steps IN ORDER, with SEPARATE commands:
Step 1: Create the patch file
Run `git diff -- path/to/file1 path/to/file2 > patch.txt` listing only the source files you modified.
Do NOT commit your changes.
<IMPORTANT>
The patch must only contain changes to the specific source files you modified to fix the issue.
Do not submit file creations or changes to any of the following files:
- test and reproduction files
- helper scripts, tests, or tools that you created
- installation, build, packaging, configuration, or setup scripts unless they are directly part of the issue you were fixing (you can assume that the environment is already set up for your client)
- binary or compiled files
</IMPORTANT>
Step 2: Verify your patch
Inspect patch.txt to confirm it only contains your intended changes and headers show `--- a/` and `+++ b/` paths.
Step 3: Submit (EXACT command required)
You MUST use this EXACT command to submit:
```bash
echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt
```
If the command fails (nonzero exit status), it will not submit.
<CRITICAL>
- Creating/viewing the patch and submitting it MUST be separate commands (not combined with &&).
- If you modify patch.txt after verifying, you SHOULD verify again before submitting.
- You CANNOT continue working (reading, editing, testing) in any way on this task after submitting.
</CRITICAL>
</instructions>
step_limit: 250
cost_limit: 3.
environment:
cwd: "/testbed"
timeout: 60
interpreter: ["bash", "-c"]
env:
PAGER: cat
MANPAGER: cat
LESS: -R
PIP_PROGRESS_BAR: "off"
TQDM_DISABLE: "1"
environment_class: docker
model:
observation_template: |
{% if output.exception_info -%}
<exception>{{output.exception_info}}</exception>
{% endif -%}
<returncode>{{output.returncode}}</returncode>
{% if output.output | length < 10000 -%}
<output>
{{ output.output -}}
</output>
{%- else -%}
<warning>
The output of your last command was too long.
Please try a different command that produces less output.
    If you're looking at a file, you can try using head, tail or sed to view a smaller number of lines selectively.
If you're using grep or find and it produced too much output, you can use a more selective search pattern.
If you really need to see something from the full command's output, you can redirect output to a file and then search in that file.
</warning>
{%- set elided_chars = output.output | length - 10000 -%}
<output_head>
{{ output.output[:5000] }}
</output_head>
<elided_chars>
{{ elided_chars }} characters elided
</elided_chars>
<output_tail>
{{ output.output[-5000:] }}
</output_tail>
{%- endif -%}
format_error_template: |
Tool call error:
<error>
{{error}}
</error>
Here is general guidance on how to submit correct toolcalls:
Every response needs to use the 'bash' tool at least once to execute commands.
Call the bash tool with your command as the argument:
- Tool: bash
- Arguments: {"command": "your_command_here"}
If you have completed your assignment, please consult the first message about how to
submit your solution (you will not be able to continue working on this task after that).
model_name: "anthropic/claude-sonnet-4-5-20250929"
model_kwargs:
drop_params: true
temperature: 0.0
parallel_tool_calls: true
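````

To tweak individual values without editing this file, you can layer key-value overrides on top of it (as described in the `--config` help text; the override values here are illustrative):

```bash
mini-extra swebench -c swebench.yaml -c model.model_kwargs.temperature=0.5 -c agent.step_limit=100
```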
swebench.py run script
```python
#!/usr/bin/env python3
"""Run mini-SWE-agent on SWE-bench instances in batch mode."""
# Read this first: https://mini-swe-agent.com/latest/usage/swebench/ (usage docs)
import concurrent.futures
import json
import random
import re
import threading
import time
import traceback
from pathlib import Path
import typer
from jinja2 import StrictUndefined, Template
from rich.live import Live
from minisweagent import Environment
from minisweagent.agents.default import DefaultAgent
from minisweagent.config import builtin_config_dir, get_config_from_spec
from minisweagent.environments import get_environment
from minisweagent.models import get_model
from minisweagent.run.benchmarks.utils.batch_progress import RunBatchProgressManager
from minisweagent.utils.log import add_file_handler, logger
from minisweagent.utils.serialize import UNSET, recursive_merge
_HELP_TEXT = """Run mini-SWE-agent on SWEBench instances.
[not dim]
More information about the usage: [bold green]https://mini-swe-agent.com/latest/usage/swebench/[/bold green]
[/not dim]
"""
_CONFIG_SPEC_HELP_TEXT = """Path to config files, filenames, or key-value pairs.
[bold red]IMPORTANT:[/bold red] [red]If you set this option, the default config file will not be used.[/red]
So you need to explicitly set it e.g., with [bold green]-c swebench.yaml <other options>[/bold green]
Multiple configs will be recursively merged.
Examples:
[bold red]-c model.model_kwargs.temperature=0[/bold red] [red]You forgot to add the default config file! See above.[/red]
[bold green]-c swebench.yaml -c model.model_kwargs.temperature=0.5[/bold green]
[bold green]-c swebench.yaml -c agent.max_iterations=50[/bold green]
"""
DEFAULT_CONFIG_FILE = builtin_config_dir / "benchmarks" / "swebench.yaml"
DATASET_MAPPING = {
"full": "princeton-nlp/SWE-Bench",
"verified": "princeton-nlp/SWE-Bench_Verified",
"lite": "princeton-nlp/SWE-Bench_Lite",
"multimodal": "princeton-nlp/SWE-Bench_Multimodal",
"multilingual": "swe-bench/SWE-Bench_Multilingual",
"smith": "SWE-bench/SWE-smith",
"_test": "klieret/swe-bench-dummy-test-dataset",
"rebench": "nebius/SWE-rebench",
}
app = typer.Typer(rich_markup_mode="rich", add_completion=False)
_OUTPUT_FILE_LOCK = threading.Lock()
class ProgressTrackingAgent(DefaultAgent):
"""Simple wrapper around DefaultAgent that provides progress updates."""
def __init__(self, *args, progress_manager: RunBatchProgressManager, instance_id: str = "", **kwargs):
super().__init__(*args, **kwargs)
self.progress_manager: RunBatchProgressManager = progress_manager
self.instance_id = instance_id
def step(self) -> dict:
"""Override step to provide progress updates."""
self.progress_manager.update_instance_status(self.instance_id, f"Step {self.n_calls + 1:3d} (${self.cost:.2f})")
return super().step()
def get_swebench_docker_image_name(instance: dict) -> str:
"""Get the image name for a SWEBench instance."""
image_name = instance.get("image_name", None) or instance.get("docker_image", None)
if image_name is None:
# Docker doesn't allow double underscore, so we replace them with a magic token
iid = instance["instance_id"]
id_docker_compatible = iid.replace("__", "_1776_")
image_name = f"docker.io/swebench/sweb.eval.x86_64.{id_docker_compatible}:latest".lower()
return image_name
def get_sb_environment(config: dict, instance: dict) -> Environment:
env_config = config.setdefault("environment", {})
env_config["environment_class"] = env_config.get("environment_class", "docker")
image_name = get_swebench_docker_image_name(instance)
if env_config["environment_class"] in ["docker", "swerex_modal"]:
env_config["image"] = image_name
elif env_config["environment_class"] in ["singularity", "contree"]:
env_config["image"] = "docker://" + image_name
env = get_environment(env_config)
if startup_command := config.get("run", {}).get("env_startup_command"):
startup_command = Template(startup_command, undefined=StrictUndefined).render(**instance)
out = env.execute(startup_command)
if out["returncode"] != 0:
raise RuntimeError(f"Error executing startup command: {out}")
return env
def update_preds_file(output_path: Path, instance_id: str, model_name: str, result: str):
"""Update the output JSON file with results from a single instance."""
with _OUTPUT_FILE_LOCK:
output_data = {}
if output_path.exists():
output_data = json.loads(output_path.read_text())
output_data[instance_id] = {
"model_name_or_path": model_name,
"instance_id": instance_id,
"model_patch": result,
}
output_path.write_text(json.dumps(output_data, indent=2))
def remove_from_preds_file(output_path: Path, instance_id: str):
"""Remove an instance from the predictions file."""
if not output_path.exists():
return
with _OUTPUT_FILE_LOCK:
output_data = json.loads(output_path.read_text())
if instance_id in output_data:
del output_data[instance_id]
output_path.write_text(json.dumps(output_data, indent=2))
def process_instance(
instance: dict,
output_dir: Path,
config: dict,
progress_manager: RunBatchProgressManager,
) -> None:
"""Process a single SWEBench instance."""
instance_id = instance["instance_id"]
instance_dir = output_dir / instance_id
# avoid inconsistent state if something here fails and there's leftover previous files
remove_from_preds_file(output_dir / "preds.json", instance_id)
(instance_dir / f"{instance_id}.traj.json").unlink(missing_ok=True)
model = get_model(config=config.get("model", {}))
task = instance["problem_statement"]
progress_manager.on_instance_start(instance_id)
progress_manager.update_instance_status(instance_id, "Pulling/starting environment")
agent = None
exit_status = None
result = None
extra_info = {}
try:
env = get_sb_environment(config, instance)
agent = ProgressTrackingAgent(
model,
env,
progress_manager=progress_manager,
instance_id=instance_id,
**config.get("agent", {}),
)
info = agent.run(task)
exit_status = info.get("exit_status")
result = info.get("submission")
except Exception as e:
logger.error(f"Error processing instance {instance_id}: {e}", exc_info=True)
exit_status, result = type(e).__name__, ""
extra_info = {"traceback": traceback.format_exc(), "exception_str": str(e)}
finally:
if agent is not None:
traj_path = instance_dir / f"{instance_id}.traj.json"
agent.save(
traj_path,
{
"info": {
"exit_status": exit_status,
"submission": result,
**extra_info,
},
"instance_id": instance_id,
},
)
logger.info(f"Saved trajectory to '{traj_path}'")
update_preds_file(output_dir / "preds.json", instance_id, model.config.model_name, result)
progress_manager.on_instance_end(instance_id, exit_status)
def filter_instances(
instances: list[dict], *, filter_spec: str, slice_spec: str = "", shuffle: bool = False
) -> list[dict]:
"""Filter and slice a list of SWEBench instances."""
if shuffle:
instances = sorted(instances.copy(), key=lambda x: x["instance_id"])
random.seed(42)
random.shuffle(instances)
before_filter = len(instances)
instances = [instance for instance in instances if re.match(filter_spec, instance["instance_id"])]
if (after_filter := len(instances)) != before_filter:
logger.info(f"Instance filter: {before_filter} -> {after_filter} instances")
    if slice_spec:
        before_slice = len(instances)
        values = [int(x) if x else None for x in slice_spec.split(":")]
        instances = instances[slice(*values)]
        if (after_slice := len(instances)) != before_slice:
            logger.info(f"Instance slice: {before_slice} -> {after_slice} instances")
return instances
# fmt: off
@app.command(help=_HELP_TEXT)
def main(
subset: str = typer.Option("lite", "--subset", help="SWEBench subset to use or path to a dataset", rich_help_panel="Data selection"),
split: str = typer.Option("dev", "--split", help="Dataset split", rich_help_panel="Data selection"),
slice_spec: str = typer.Option("", "--slice", help="Slice specification (e.g., '0:5' for first 5 instances)", rich_help_panel="Data selection"),
filter_spec: str = typer.Option("", "--filter", help="Filter instance IDs by regex", rich_help_panel="Data selection"),
shuffle: bool = typer.Option(False, "--shuffle", help="Shuffle instances", rich_help_panel="Data selection"),
output: str = typer.Option("", "-o", "--output", help="Output directory", rich_help_panel="Basic"),
workers: int = typer.Option(1, "-w", "--workers", help="Number of worker threads for parallel processing", rich_help_panel="Basic"),
model: str | None = typer.Option(None, "-m", "--model", help="Model to use", rich_help_panel="Basic"),
model_class: str | None = typer.Option(None, "--model-class", help="Model class to use (e.g., 'anthropic' or 'minisweagent.models.anthropic.AnthropicModel')", rich_help_panel="Advanced"),
redo_existing: bool = typer.Option(False, "--redo-existing", help="Redo existing instances", rich_help_panel="Data selection"),
config_spec: list[str] = typer.Option([str(DEFAULT_CONFIG_FILE)], "-c", "--config", help=_CONFIG_SPEC_HELP_TEXT, rich_help_panel="Basic"),
environment_class: str | None = typer.Option(None, "--environment-class", help="Environment type to use. Recommended are docker or singularity", rich_help_panel="Advanced"),
) -> None:
# fmt: on
output_path = Path(output)
output_path.mkdir(parents=True, exist_ok=True)
logger.info(f"Results will be saved to {output_path}")
add_file_handler(output_path / "minisweagent.log")
from datasets import load_dataset
dataset_path = DATASET_MAPPING.get(subset, subset)
logger.info(f"Loading dataset {dataset_path}, split {split}...")
instances = list(load_dataset(dataset_path, split=split))
instances = filter_instances(instances, filter_spec=filter_spec, slice_spec=slice_spec, shuffle=shuffle)
if not redo_existing and (output_path / "preds.json").exists():
existing_instances = list(json.loads((output_path / "preds.json").read_text()).keys())
logger.info(f"Skipping {len(existing_instances)} existing instances")
instances = [instance for instance in instances if instance["instance_id"] not in existing_instances]
logger.info(f"Running on {len(instances)} instances...")
logger.info(f"Building agent config from specs: {config_spec}")
configs = [get_config_from_spec(spec) for spec in config_spec]
configs.append({
"environment": {"environment_class": environment_class or UNSET},
"model": {"model_name": model or UNSET, "model_class": model_class or UNSET},
})
config = recursive_merge(*configs)
progress_manager = RunBatchProgressManager(len(instances), output_path / f"exit_statuses_{time.time()}.yaml")
def process_futures(futures: dict[concurrent.futures.Future, str]):
for future in concurrent.futures.as_completed(futures):
try:
future.result()
except concurrent.futures.CancelledError:
pass
except Exception as e:
instance_id = futures[future]
logger.error(f"Error in future for instance {instance_id}: {e}", exc_info=True)
progress_manager.on_uncaught_exception(instance_id, e)
with Live(progress_manager.render_group, refresh_per_second=4):
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
futures = {
executor.submit(process_instance, instance, output_path, config, progress_manager): instance[
"instance_id"
]
for instance in instances
}
try:
process_futures(futures)
except KeyboardInterrupt:
logger.info("Cancelling all pending jobs. Press ^C again to exit immediately.")
for future in futures:
if not future.running() and not future.done():
future.cancel()
process_futures(futures)
if __name__ == "__main__":
app()
```
swebench_single.py run script
```python
"""Run on a single SWE-Bench instance."""
from pathlib import Path
import typer
from datasets import load_dataset
from minisweagent import global_config_dir
from minisweagent.agents import get_agent
from minisweagent.config import builtin_config_dir, get_config_from_spec
from minisweagent.models import get_model
from minisweagent.run.benchmarks.swebench import (
DATASET_MAPPING,
get_sb_environment,
)
from minisweagent.utils.log import logger
from minisweagent.utils.serialize import UNSET, recursive_merge
DEFAULT_OUTPUT_FILE = global_config_dir / "last_swebench_single_run.traj.json"
DEFAULT_CONFIG_FILE = builtin_config_dir / "benchmarks" / "swebench.yaml"
app = typer.Typer(rich_markup_mode="rich", add_completion=False)
_CONFIG_SPEC_HELP_TEXT = """Path to config files, filenames, or key-value pairs.
[bold red]IMPORTANT:[/bold red] [red]If you set this option, the default config file will not be used.[/red]
So you need to explicitly set it e.g., with [bold green]-c swebench.yaml <other options>[/bold green]
Multiple configs will be recursively merged.
Examples:
[bold red]-c model.model_kwargs.temperature=0[/bold red] [red]You forgot to add the default config file! See above.[/red]
[bold green]-c swebench.yaml -c model.model_kwargs.temperature=0.5[/bold green]
[bold green]-c swebench.yaml -c agent.mode=yolo[/bold green]
"""
# fmt: off
@app.command()
def main(
subset: str = typer.Option("lite", "--subset", help="SWEBench subset to use or path to a dataset", rich_help_panel="Data selection"),
split: str = typer.Option("dev", "--split", help="Dataset split", rich_help_panel="Data selection"),
instance_spec: str = typer.Option(0, "-i", "--instance", help="SWE-Bench instance ID or index", rich_help_panel="Data selection"),
model_name: str | None = typer.Option(None, "-m", "--model", help="Model to use", rich_help_panel="Basic"),
model_class: str | None = typer.Option(None, "--model-class", help="Model class to use (e.g., 'anthropic' or 'minisweagent.models.anthropic.AnthropicModel')", rich_help_panel="Advanced"),
agent_class: str | None = typer.Option(None, "--agent-class", help="Agent class to use (e.g., 'interactive' or 'minisweagent.agents.interactive.InteractiveAgent')", rich_help_panel="Advanced"),
environment_class: str | None = typer.Option(None, "--environment-class", help="Environment class to use (e.g., 'docker' or 'minisweagent.environments.docker.DockerEnvironment')", rich_help_panel="Advanced"),
yolo: bool = typer.Option(False, "-y", "--yolo", help="Run without confirmation"),
cost_limit: float | None = typer.Option(None, "-l", "--cost-limit", help="Cost limit. Set to 0 to disable."),
config_spec: list[str] = typer.Option([str(DEFAULT_CONFIG_FILE)], "-c", "--config", help=_CONFIG_SPEC_HELP_TEXT, rich_help_panel="Basic"),
exit_immediately: bool = typer.Option(False, "--exit-immediately", help="Exit immediately when the agent wants to finish instead of prompting.", rich_help_panel="Advanced"),
output: Path | None = typer.Option(DEFAULT_OUTPUT_FILE, "-o", "--output", help="Output trajectory file", rich_help_panel="Basic"),
) -> None:
# fmt: on
"""Run on a single SWE-Bench instance."""
dataset_path = DATASET_MAPPING.get(subset, subset)
logger.info(f"Loading dataset from {dataset_path}, split {split}...")
instances = {
inst["instance_id"]: inst # type: ignore
for inst in load_dataset(dataset_path, split=split)
}
if instance_spec.isnumeric():
instance_spec = sorted(instances.keys())[int(instance_spec)]
instance: dict = instances[instance_spec] # type: ignore
logger.info(f"Building agent config from specs: {config_spec}")
configs = [get_config_from_spec(spec) for spec in config_spec]
configs.append({
"agent": {
"agent_class": agent_class or UNSET,
"mode": "yolo" if yolo else UNSET,
"cost_limit": cost_limit or UNSET,
"confirm_exit": False if exit_immediately else UNSET,
"output_path": output or UNSET,
},
"model": {
"model_class": model_class or UNSET,
"model_name": model_name or UNSET,
},
"environment": {
"environment_class": environment_class or UNSET,
},
})
config = recursive_merge(*configs)
env = get_sb_environment(config, instance)
agent = get_agent(
get_model(config=config.get("model", {})),
env,
config.get("agent", {}),
default_type="interactive",
)
agent.run(instance["problem_statement"])
if __name__ == "__main__":
app()
```
