0cd8bcde1b
Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs. Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo. Fixes #2952.
Benchmark Scripts
Convenience runners for evaluating CodeWhale against external benchmarks.
Quick Start
# Set your API key
export DEEPSEEK_API_KEY="sk-..."
# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
--instance-id django__django-12345 \
--issue-file ./issue.md
# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
--model deepseek/deepseek-chat
# CodeWhale vs Codex comparison rows
python scripts/benchmarks/cli-compare.py \
--task prove-plus-comm \
--model deepseek/deepseek-chat
# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
--install \
--model deepseek/deepseek-chat
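The runners above rely on `DEEPSEEK_API_KEY` being exported first. A minimal sketch of a fail-fast guard follows, assuming you want to stop before a benchmark starts rather than partway through it. The `require_api_key` function is a hypothetical helper for illustration and is not part of this repository; the runner scripts may already perform their own checks.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository):
# returns non-zero when DEEPSEEK_API_KEY is unset or empty.
require_api_key() {
  if [ -z "${DEEPSEEK_API_KEY:-}" ]; then
    echo "DEEPSEEK_API_KEY is not set; export it before running a benchmark" >&2
    return 1
  fi
}

# Example: guard one of the Quick Start invocations.
# require_api_key && ./scripts/benchmarks/run-terminal-bench.sh \
#   --model deepseek/deepseek-chat
```

Keeping the check in a function that returns a status, rather than calling `exit`, lets it be sourced into an interactive shell without closing the session.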
Files
run-swebench.sh — SWE-bench batch driver and evaluator
run-terminal-bench.sh — Terminal-Bench runner via Harbor
run-pinchbench.sh — PinchBench runner with auto-install
cli-compare.py — CodeWhale/Codex Terminal-Bench comparison harness
harbor/__init__.py — Harbor adapter for CodeWhale (Python)
harbor/codewhale_agent.py — Adapter entry point
harbor/codex_agent.py — Codex adapter for paired CLI comparisons
Documentation
See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.