5d9b5f67cb
- Match actual Harbor CLI interface (no invented flags)
- Proper BaseInstalledAgent subclass for Codex
- Robust token extraction from stream JSONL + transcript parsing
- Heuristic answer_len extraction (## Final Answer markers)
- Metadata capture: versions, git commit, platform, timestamp
- --regenerate walks existing run directories
- All missing fields explicit null, never zero
- Support multiple runs per task with run_idx tracking
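The token and answer-length bullets above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `usage`, `input_tokens`, and `output_tokens` field names are assumptions about the stream schema, and `extract_tokens` and `extract_answer_len` are hypothetical helper names. A value that never appears stays `None` (JSON `null`) and is not defaulted to zero.

```python
import json
import re
from typing import Optional

# Assumed field names; the real stream JSONL schema may differ.
TOKEN_FIELDS = ("input_tokens", "output_tokens")

# A "## Final Answer" heading on its own line; the text after it is the answer.
FINAL_ANSWER_RE = re.compile(r"^##\s*Final Answer\s*$", re.MULTILINE)


def extract_tokens(jsonl_text: str) -> dict:
    """Sum token counts over JSONL events; a field never seen stays None."""
    totals: dict = {name: None for name in TOKEN_FIELDS}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines in the stream
        if not isinstance(event, dict):
            continue
        usage = event.get("usage")
        if not isinstance(usage, dict):
            continue
        for name in TOKEN_FIELDS:
            value = usage.get(name)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                totals[name] = (totals[name] or 0) + value
    return totals


def extract_answer_len(transcript: str) -> Optional[int]:
    """Length of the text after the last '## Final Answer' marker, or None."""
    matches = list(FINAL_ANSWER_RE.finditer(transcript))
    if not matches:
        return None
    return len(transcript[matches[-1].end():].strip())
```

Returning `None` when no marker or no usage event is found keeps "not measured" distinguishable from a genuine zero in the normalized output.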
The harness is designed to run:
harbor run --dataset terminal-bench@2.0:<task> --agent ... --model ...
for both codex and codewhale agents, then normalize the results.
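As a rough sketch of how such a run plan might be assembled, the snippet below builds one argv per agent and run index, using only the flags that appear in the command above. The function names (`build_harbor_command`, `capture_metadata`, `plan_runs`), the record layout, and the zero-based `run_idx` are illustrative assumptions. The `--agent` and `--model` values are supplied by the caller because the command above elides them.

```python
import platform
import subprocess
from datetime import datetime, timezone
from typing import Optional


def build_harbor_command(task: str, agent: str, model: str) -> list:
    """Argv for one run, using only the flags shown in the command above."""
    return [
        "harbor", "run",
        "--dataset", f"terminal-bench@2.0:{task}",
        "--agent", agent,
        "--model", model,
    ]


def git_commit() -> Optional[str]:
    """Current commit hash, or None when git or a repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def capture_metadata() -> dict:
    """Environment details recorded with each run; unknown values stay None."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "git_commit": git_commit(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def plan_runs(task: str, agent_models: list, runs_per_task: int) -> list:
    """One record per (agent, run_idx); agent_models holds (agent, model) pairs."""
    metadata = capture_metadata()
    plan = []
    for agent, model in agent_models:
        for run_idx in range(runs_per_task):
            plan.append({
                "agent": agent,
                "run_idx": run_idx,
                "argv": build_harbor_command(task, agent, model),
                "metadata": metadata,
            })
    return plan
```

Each record's `argv` can be passed to `subprocess.run` as a list, which avoids shell quoting issues in task or model names.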
Benchmark Scripts
Convenience runners for evaluating CodeWhale against external benchmarks.
Quick Start
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
Files
- run-swebench.sh: SWE-bench batch driver and evaluator
- run-terminal-bench.sh: Terminal-Bench runner via Harbor
- run-pinchbench.sh: PinchBench runner with auto-install
- harbor/__init__.py: Harbor adapter for CodeWhale (Python)
- harbor/codewhale_agent.py: Adapter entry point
Documentation
See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.