Files
codewhale/docs/BENCHMARKS.md
T
idling11 0cd8bcde1b feat(bench): add CLI comparison harness
Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs.

Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo.

Fixes #2952.
2026-06-12 01:15:00 -07:00

182 lines
5.0 KiB
Markdown

# Benchmarks
CodeWhale integrates with three external benchmarks to measure real-world
coding-agent performance. Each benchmark tests a different surface:
| Benchmark | What it tests | Harness | Output format |
|---|---|---|---|
| **SWE-bench** | Patch generation from GitHub issues | CodeWhale built-in (`codewhale swebench`) | `all_preds.jsonl` |
| **Terminal-Bench** | End-to-end terminal tasks (compile, deploy, configure) | Harbor framework adapter | Harbor result JSON |
| **PinchBench** | Real-world agent tasks (calendar, email, coding, research) | Standalone runner via OpenClaw-compatible adapter | PinchBench result JSON |
All three require Docker. SWE-bench and Terminal-Bench also need the official
evaluation harness installed separately.
## Prerequisites
```bash
# Docker (all benchmarks)
docker --version
# Python 3.10+ with uv (Terminal-Bench, PinchBench, SWE-bench eval)
python3 --version
uv --version
# CodeWhale v0.8.53+
codewhale --version
# API key
export DEEPSEEK_API_KEY="sk-..."
```
## SWE-bench
CodeWhale has built-in SWE-bench support via `codewhale swebench run` and
`codewhale swebench export`. See [docs/SWEBENCH.md](SWEBENCH.md) for the
single-instance workflow.
### Batch run
```bash
# Run all instances from a dataset split
./scripts/benchmarks/run-swebench.sh \
--dataset princeton-nlp/SWE-bench_Lite \
--split test \
--predictions-path ./results/swebench_preds.jsonl
# Run a single instance
./scripts/benchmarks/run-swebench.sh \
--instance-id django__django-12345 \
--issue-file ./issue.md \
--predictions-path ./results/swebench_preds.jsonl
```
### Evaluate
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path ./results/swebench_preds.jsonl \
--max_workers 1 \
--run_id codewhale-v0.8.53
```
## Terminal-Bench (via Harbor)
Terminal-Bench tests agents on real terminal tasks — compiling, deploying,
configuring servers, training models. The [Harbor framework](https://github.com/harbor-framework/harbor)
is the official harness.
CodeWhale plugs in via a Harbor adapter (`scripts/benchmarks/harbor/codewhale_agent.py`).
### Setup
```bash
pip install harbor
```
### Run
```bash
# Via the convenience script
./scripts/benchmarks/run-terminal-bench.sh \
--dataset terminal-bench@2.0 \
--model deepseek/deepseek-chat \
--n-concurrent 4
# Or directly with harbor
harbor run \
--dataset terminal-bench@2.0 \
--agent codewhale \
--model deepseek/deepseek-chat \
--n-concurrent 4
```
### Custom agent path
If the adapter is not installed system-wide, point Harbor at it:
```bash
harbor run \
--dataset terminal-bench@2.0 \
--agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \
--model deepseek/deepseek-chat
```
### Compare CodeWhale and Codex
Use the paired comparison harness when you need one normalized row per CLI for
the same task, model, timeout, and environment:
```bash
python scripts/benchmarks/cli-compare.py \
--task prove-plus-comm \
--model deepseek/deepseek-chat \
--runs 3
```
The harness writes raw Harbor logs plus `summary.json`, `summary.md`, and
`metadata.json` under `benchmark_results/cli-compare-*`. Missing metrics are
reported as JSON `null`, and generated run directories are intentionally ignored
by git; keep only curated summaries in docs or release notes.
## PinchBench
PinchBench measures agent performance on real-world tasks — scheduling, email
triage, code generation, research, file management. It uses OpenClaw as the
agent runtime.
### Setup
```bash
./scripts/benchmarks/run-pinchbench.sh --install
```
### Run (MiMo v2.5 Pro — default)
```bash
# MiMo v2.5 Pro via OpenRouter (default)
./scripts/benchmarks/run-pinchbench.sh
# MiMo v2.5 Pro via direct Xiaomi API
./scripts/benchmarks/run-pinchbench.sh --direct-mimo
# Specific tasks
./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock
```
### Run (other models)
```bash
./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro
```
### MiMo v2.5 notes
PinchBench routes through OpenRouter by default. MiMo models are available as
`openrouter/xiaomi/mimo-v2.5-pro` (Pro) and `openrouter/xiaomi/mimo-v2.5`
(Omni). For direct Xiaomi API access, use `--direct-mimo` with
`XIAOMI_MIMO_API_KEY` set.
See `scripts/benchmarks/run-pinchbench.sh --help` for full option reference.
## Reproducibility checklist
When publishing benchmark results, record:
- [ ] CodeWhale version: `codewhale --version`
- [ ] Git commit: `git rev-parse HEAD`
- [ ] Model and provider (e.g. `deepseek/deepseek-chat`)
- [ ] Benchmark dataset and version
- [ ] Docker platform (`linux/amd64` vs `linux/arm64`)
- [ ] Worker concurrency
- [ ] Timestamp (UTC)
- [ ] Full result file (`all_preds.jsonl`, Harbor result dir, or PinchBench results JSON)
## References
- SWE-bench: https://github.com/SWE-bench/SWE-bench
- Terminal-Bench: https://github.com/laude-institute/terminal-bench / https://www.tbench.ai
- Harbor: https://github.com/harbor-framework/harbor / https://harborframework.com
- PinchBench: https://github.com/pinchbench/skill / https://pinchbench.com