25ce4f5970
- SWE-bench: codewhale swebench run/export writes prediction JSONL from working-tree diff, with untracked-file inclusion via git add -N - CLI: --workspace / -C global flag forwards to TUI for file ops - CLI: codewhale exec --auto semantics clarified in help text - Markdown: table pipes inside inline code no longer create phantom columns (split_table_cells with backtick-awareness) - Receipt: floor_char_boundary prevents multibyte UTF-8 slice panic - Contributors: Ling (LING71671 #1839 #1911), Ben Younes (ousamabenyounes #1938), jeoor npm fix (#1860) credited across all 3 READMEs - ja-JP README: 19 contributors synced to parity with EN/zh-CN (80 each) - Docs: SWEBENCH.md, RECURSIVE_SELF_IMPROVEMENT.md, MODES.md exec clarification - Sub-agent footer: Alt+V hint now says 'details' not 'raw'
75 lines
2.4 KiB
Markdown
75 lines
2.4 KiB
Markdown
# SWE-bench
|
|
|
|
CodeWhale's SWE-bench adapter writes the prediction file that the official
|
|
SWE-bench evaluation harness expects. It does not replace the harness; it
|
|
generates `model_patch` rows from a local task workspace.
|
|
|
|
## One Instance
|
|
|
|
Start from a workspace checked out at the SWE-bench instance base commit, with
|
|
the issue text saved locally:
|
|
|
|
```bash
|
|
codewhale swebench run \
|
|
--instance-id django__django-12345 \
|
|
--issue-file issue.md \
|
|
--predictions-path all_preds.jsonl
|
|
```
|
|
|
|
`run` invokes tool-backed non-interactive mode, equivalent to
|
|
`codewhale exec --auto`, with `stream-json` output by default. When the turn
|
|
finishes, CodeWhale exports `git diff --binary --no-ext-diff` as one JSONL
|
|
prediction row:
|
|
|
|
```json
|
|
{"instance_id":"django__django-12345","model_name_or_path":"codewhale/deepseek-v4-pro","model_patch":"diff --git ..."}
|
|
```
|
|
|
|
If you already ran CodeWhale, or edited the workspace manually, export the
|
|
current diff without another model turn:
|
|
|
|
```bash
|
|
codewhale swebench export \
|
|
--instance-id django__django-12345 \
|
|
--predictions-path all_preds.jsonl
|
|
```
|
|
|
|
Both commands update the row for the same `instance_id` instead of appending a
|
|
duplicate row. Untracked files are marked with `git add -N` before diff export
|
|
so newly-created files appear in the patch.
|
|
|
|
## Evaluate
|
|
|
|
Install SWE-bench and Docker using the official SWE-bench setup instructions,
|
|
then pass the prediction file to the official harness:
|
|
|
|
```bash
|
|
python -m swebench.harness.run_evaluation \
|
|
--dataset_name princeton-nlp/SWE-bench_Lite \
|
|
--predictions_path all_preds.jsonl \
|
|
--max_workers 1 \
|
|
--run_id codewhale-smoke
|
|
```
|
|
|
|
On Apple Silicon, the official SWE-bench docs recommend adding
|
|
`--namespace ''` so images build locally instead of pulling Linux images.
|
|
|
|
## Batch Driver Shape
|
|
|
|
A simple batch runner should prepare each instance workspace, write the issue
|
|
body to `issue.md`, run `codewhale swebench run`, then call the harness once
|
|
on the accumulated `all_preds.jsonl`.
|
|
|
|
For reproducible runs, pin:
|
|
|
|
- CodeWhale version and commit: `codewhale --version`
|
|
- Model label: `--model-name-or-path codewhale/deepseek-v4-pro`
|
|
- Dataset and split used by the harness
|
|
- Docker platform and worker count
|
|
- The `all_preds.jsonl` file and CodeWhale stream logs
|
|
|
|
Official references:
|
|
|
|
- SWE-bench repository: https://github.com/SWE-bench/SWE-bench
|
|
- SWE-bench harness docs: https://www.swebench.com/SWE-bench/api/harness/
|