codewhale/scripts/benchmarks/README.md
Hunter B b329a532f5 feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration
Benchmark harness for evaluating CodeWhale against three external
benchmarks:

- SWE-bench: batch driver wrapping existing codewhale swebench commands
- Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval
- PinchBench: runner with auto-install for real-world agent tasks

Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and
reproducibility checklist. Scripts record version/commit/timestamp
metadata for each run.

Branch: codex/v0.8.53-benchmarks (based on v0.8.53)
2026-06-04 19:22:06 -07:00


# Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

## Quick Start

```bash
# Run from the repository root; the script paths below are relative to it.

# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
```
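
Before launching a long benchmark run, it can help to confirm that the API key is exported and the runner scripts are executable. The following is a minimal sketch of such a pre-flight check. The helper names `require_env`, `require_executable`, and `preflight` are hypothetical and are not part of the repository; only the variable name and script paths come from the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not part of the repository).

# Succeed only if the named variable is exported and non-empty.
# printenv sees exported variables only, which is what child scripts inherit.
require_env() {
  local name="$1"
  if [ -z "$(printenv "$name")" ]; then
    echo "error: $name is not exported or is empty" >&2
    return 1
  fi
}

# Succeed only if every given path exists and is executable.
require_executable() {
  local path
  for path in "$@"; do
    if [ ! -x "$path" ]; then
      echo "error: $path is missing or not executable" >&2
      return 1
    fi
  done
}

# Check the prerequisites used by the Quick Start commands.
preflight() {
  require_env DEEPSEEK_API_KEY &&
    require_executable \
      ./scripts/benchmarks/run-swebench.sh \
      ./scripts/benchmarks/run-terminal-bench.sh \
      ./scripts/benchmarks/run-pinchbench.sh
}
```

After sourcing these helpers from the repository root, a run could be guarded with `preflight && ./scripts/benchmarks/run-terminal-bench.sh --model deepseek/deepseek-chat`.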

## Files

- `run-swebench.sh` — SWE-bench batch driver and evaluator
- `run-terminal-bench.sh` — Terminal-Bench runner via Harbor
- `run-pinchbench.sh` — PinchBench runner with auto-install
- `harbor/__init__.py` — Harbor adapter for CodeWhale (Python)
- `harbor/codewhale_agent.py` — Adapter entry point

## Documentation

See [docs/BENCHMARKS.md](../../docs/BENCHMARKS.md) for full setup instructions,
reproducibility checklists, and references.
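
The commit description notes that the scripts record version, commit, and timestamp metadata for each run. The sketch below illustrates one way such a record could be written; it is not the scripts' actual implementation. The function name `write_run_metadata`, the `metadata.json` file name, the JSON field names, and the use of `git describe` as a version source are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch (not the repository's implementation): write a small
# JSON record describing a benchmark run into the run's output directory.
# Assumes the argument values contain no double quotes or backslashes.
write_run_metadata() {
  local out_dir="$1" benchmark="$2" model="$3"
  local version commit timestamp

  # Assumption: the nearest git tag stands in for the tool version.
  version="$(git describe --tags --always 2>/dev/null || echo unknown)"
  commit="$(git rev-parse HEAD 2>/dev/null || echo unknown)"
  # UTC timestamp in ISO 8601 form, e.g. 2026-06-05T02:22:06Z.
  timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  mkdir -p "$out_dir"
  printf '{\n  "benchmark": "%s",\n  "model": "%s",\n  "version": "%s",\n  "commit": "%s",\n  "timestamp": "%s"\n}\n' \
    "$benchmark" "$model" "$version" "$commit" "$timestamp" \
    > "$out_dir/metadata.json"
}
```

For example, `write_run_metadata ./results/run-001 terminal-bench deepseek/deepseek-chat` would create `./results/run-001/metadata.json`; the `./results/run-001` path is likewise illustrative.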