Benchmark harness for evaluating CodeWhale against three external benchmarks:

- SWE-bench: batch driver wrapping existing codewhale swebench commands
- Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval
- PinchBench: runner with auto-install for real-world agent tasks

Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and reproducibility checklist. Scripts record version/commit/timestamp metadata for each run.

Branch: codex/v0.8.53-benchmarks (based on v0.8.53)

# Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

## Quick Start

```bash
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
```
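
Each command above assumes `DEEPSEEK_API_KEY` is already exported. The following is a minimal pre-flight sketch for failing fast before a long run starts. `require_env` is a hypothetical helper introduced here for illustration; it is not part of the shipped scripts, which may perform their own checks.

```shell
# Hypothetical helper (not part of the shipped scripts):
# return non-zero when the named variable is unset or empty.
require_env() {
  # POSIX-compatible indirection: copy the value of the variable
  # whose name is in $1 into _value, defaulting to empty.
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}
```

A wrapper could call `require_env DEEPSEEK_API_KEY || exit 1` before invoking any of the runners. The variable name is passed through `eval`, so only pass trusted, literal names.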

## Files

- `run-swebench.sh`: SWE-bench batch driver and evaluator
- `run-terminal-bench.sh`: Terminal-Bench runner via Harbor
- `run-pinchbench.sh`: PinchBench runner with auto-install
- `harbor/__init__.py`: Harbor adapter for CodeWhale (Python)
- `harbor/codewhale_agent.py`: Adapter entry point

## Documentation

See [docs/BENCHMARKS.md](../../docs/BENCHMARKS.md) for full setup instructions,
reproducibility checklists, and references.
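
For reproducibility, it helps to store a small metadata record next to each run's results. The sketch below shows one way to capture a version string, the current git commit, and a UTC timestamp. The function name `write_run_metadata`, the `metadata.json` file name, and the JSON layout are assumptions made for illustration; the shipped scripts may record this information differently.

```shell
# Hypothetical helper (names and file layout are assumptions):
# write a metadata file into the given results directory.
# Arguments: $1 = output directory, $2 = version string supplied by the caller.
write_run_metadata() {
  out_dir="$1"
  version="$2"
  mkdir -p "$out_dir"
  # Fall back to "unknown" outside a git checkout or when git is missing.
  commit="$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)"
  # UTC timestamp in ISO 8601 form.
  timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # Assumes the values contain no double quotes or backslashes.
  printf '{"version": "%s", "commit": "%s", "timestamp": "%s"}\n' \
    "$version" "$commit" "$timestamp" > "$out_dir/metadata.json"
}
```

For example, `write_run_metadata ./results/run-01 v0.8.53` would create `./results/run-01/metadata.json`. Passing the version in explicitly avoids depending on any particular CLI flag.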