
# SWE-bench

CodeWhale's SWE-bench adapter writes the prediction file that the official
SWE-bench evaluation harness expects. It does not replace the harness; it only
generates prediction rows, each carrying a `model_patch`, from a local task
workspace.

## One Instance

Start from a workspace checked out at the SWE-bench instance base commit, with
the issue text saved locally:

```bash
codewhale swebench run \
  --instance-id django__django-12345 \
  --issue-file issue.md \
  --predictions-path all_preds.jsonl
```

`run` invokes the tool-backed non-interactive mode, equivalent to
`codewhale exec --auto`, with `stream-json` output by default. When the turn
finishes, CodeWhale exports the output of `git diff --binary --no-ext-diff` as
a single JSONL prediction row:

```json
{"instance_id":"django__django-12345","model_name_or_path":"codewhale/deepseek-v4-pro","model_patch":"diff --git ..."}
```
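Before handing a prediction file to the harness, it can help to check that every line parses and carries the three keys shown in the row above. The following is a minimal sketch, not part of CodeWhale; the function names `check_prediction_line` and `check_predictions_file` are hypothetical, and the check covers key presence only, not patch validity.

```python
import json

# Keys taken from the example prediction row above.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def check_prediction_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in row]


def check_predictions_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = check_prediction_line(line)
                if problems:
                    report[number] = problems
    return report
```

Calling `check_predictions_file("all_preds.jsonl")` returns an empty dict when every row has the expected shape.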

If you already ran CodeWhale, or edited the workspace manually, export the
current diff without another model turn:

```bash
codewhale swebench export \
  --instance-id django__django-12345 \
  --predictions-path all_preds.jsonl
```

Both commands replace any existing row with the same `instance_id` instead of
appending a duplicate. Untracked files are marked with `git add -N` before the
diff is exported, so newly created files appear in the patch.
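The replace-instead-of-append behavior can be pictured with a short sketch. This is an illustration of the semantics described above, not CodeWhale's actual implementation; `upsert_prediction` is a hypothetical name, and writing through a temporary file is a choice of this sketch.

```python
import json
from pathlib import Path


def upsert_prediction(path: Path, row: dict) -> None:
    """Replace the row with the same instance_id, or append if absent."""
    rows = []
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            rows = [json.loads(line) for line in handle if line.strip()]

    kept = []
    replaced = False
    for existing in rows:
        if existing.get("instance_id") == row["instance_id"]:
            # Keep the new row at the first matching position and
            # drop any later duplicates of the same instance.
            if not replaced:
                kept.append(row)
                replaced = True
        else:
            kept.append(existing)
    if not replaced:
        kept.append(row)

    # Write to a temporary file, then rename, so an interrupted run
    # does not leave a half-written predictions file behind.
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as handle:
        for item in kept:
            handle.write(json.dumps(item) + "\n")
    tmp.replace(path)
```

Because rows keep their original position, re-running one instance leaves the order of the other predictions unchanged.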

## Evaluate

Install SWE-bench and Docker using the official SWE-bench setup instructions,
then pass the prediction file to the official harness:

```bash
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path all_preds.jsonl \
  --max_workers 1 \
  --run_id codewhale-smoke
```

On Apple Silicon, the official SWE-bench docs recommend adding
`--namespace ''` so that evaluation images are built locally instead of being
pulled as prebuilt Linux images.

## Batch Driver Shape

For each instance, a simple batch runner should prepare the workspace, write
the issue body to `issue.md`, and run `codewhale swebench run`. After the last
instance, it calls the harness once on the accumulated `all_preds.jsonl`.
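A batch driver of this shape might look like the sketch below. Several things here are assumptions rather than documented behavior: `prepare_workspace` is a hypothetical callable that returns a checkout at the instance's base commit, `codewhale swebench run` is assumed to operate on the current working directory, and the issue text is read from the `problem_statement` field of a SWE-bench dataset row. The sketch writes `issue.md` outside the workspace so it cannot end up in the exported diff; whether CodeWhale already excludes the issue file is not stated here.

```python
import subprocess
from pathlib import Path


def build_run_command(instance_id: str, issue_file: Path,
                      predictions_path: Path) -> list[str]:
    """Assemble the `codewhale swebench run` invocation for one instance."""
    return [
        "codewhale", "swebench", "run",
        "--instance-id", instance_id,
        "--issue-file", str(issue_file),
        "--predictions-path", str(predictions_path),
    ]


def run_batch(instances, predictions_path: Path, issues_root: Path,
              prepare_workspace) -> list[str]:
    """Run CodeWhale once per instance; return the ids whose run failed."""
    # Absolute paths, because each run uses a different working directory.
    predictions_path = predictions_path.resolve()
    issues_root = issues_root.resolve()
    failed = []
    for instance in instances:
        instance_id = instance["instance_id"]
        # Hypothetical helper: returns a Path checked out at the base commit.
        workspace = prepare_workspace(instance)
        issue_dir = issues_root / instance_id
        issue_dir.mkdir(parents=True, exist_ok=True)
        issue_file = issue_dir / "issue.md"
        issue_file.write_text(instance["problem_statement"], encoding="utf-8")
        result = subprocess.run(
            build_run_command(instance_id, issue_file, predictions_path),
            cwd=workspace,
        )
        if result.returncode != 0:
            failed.append(instance_id)
    return failed
```

Collecting failed ids instead of aborting lets the batch finish, so the harness can still be called once on whatever predictions were written.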

For reproducible runs, pin or record:

- CodeWhale version and commit: `codewhale --version`
- Model label: `--model-name-or-path codewhale/deepseek-v4-pro`
- Dataset and split used by the harness
- Docker platform and worker count
- The `all_preds.jsonl` file and CodeWhale stream logs
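One way to keep these items together is a small manifest stored next to the predictions file. The sketch below is only a suggestion: the manifest layout and the names `sha256_of`, `codewhale_version`, and `build_manifest` are hypothetical, and nothing in CodeWhale or the harness reads such a file.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large prediction files stay cheap to read."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def codewhale_version() -> str:
    """Capture the output of `codewhale --version` as a single string."""
    result = subprocess.run(
        ["codewhale", "--version"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_manifest(predictions_path: Path, *, version: str,
                   model_name_or_path: str, dataset_name: str, split: str,
                   docker_platform: str, max_workers: int) -> dict:
    """Collect the pinned settings plus a checksum of the predictions file."""
    return {
        "codewhale_version": version,
        "model_name_or_path": model_name_or_path,
        "dataset_name": dataset_name,
        "split": split,
        "docker_platform": docker_platform,
        "max_workers": max_workers,
        "predictions_file": predictions_path.name,
        "predictions_sha256": sha256_of(predictions_path),
    }
```

The resulting dict can be written with `json.dump` alongside `all_preds.jsonl` and the stream logs, so a later run can be compared against the same settings.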

Official references:

- SWE-bench repository: https://github.com/SWE-bench/SWE-bench
- SWE-bench harness docs: https://www.swebench.com/SWE-bench/api/harness/