# Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

## Quick Start

```bash
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# CodeWhale vs Codex comparison rows
python scripts/benchmarks/cli-compare.py \
  --task prove-plus-comm \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
```
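
Each command above assumes `DEEPSEEK_API_KEY` is already exported. As a minimal sketch, a preflight check like the one below fails fast when the key is missing. `require_api_key` is a hypothetical helper for illustration and is not part of this repository; the runner scripts may already do their own validation.

```bash
# Hypothetical preflight helper (an assumption, not shipped with these scripts).
# Returns non-zero and prints a hint to stderr if DEEPSEEK_API_KEY is unset or empty.
require_api_key() {
  if [ -z "${DEEPSEEK_API_KEY:-}" ]; then
    echo "DEEPSEEK_API_KEY is not set; export it before running a benchmark." >&2
    return 1
  fi
  return 0
}
```

For example, `require_api_key && ./scripts/benchmarks/run-terminal-bench.sh --model deepseek/deepseek-chat` would skip the run when the key is absent.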

## Files

- `run-swebench.sh` — SWE-bench batch driver and evaluator
- `run-terminal-bench.sh` — Terminal-Bench runner via Harbor
- `run-pinchbench.sh` — PinchBench runner with auto-install
- `cli-compare.py` — CodeWhale/Codex Terminal-Bench comparison harness
- `harbor/__init__.py` — Harbor adapter for CodeWhale (Python)
- `harbor/codewhale_agent.py` — Adapter entry point
- `harbor/codex_agent.py` — Codex adapter for paired CLI comparisons

## Documentation

See [docs/BENCHMARKS.md](../../docs/BENCHMARKS.md) for full setup instructions,
reproducibility checklists, and references.