feat(bench): add CLI comparison harness

Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs.

Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo.

Fixes #2952.
idling11
2026-06-12 01:15:00 -07:00
committed by Hunter B
parent f99fff969a
commit 0cd8bcde1b
6 changed files with 754 additions and 0 deletions
@@ -103,6 +103,23 @@ harbor run \
--model deepseek/deepseek-chat
```
### Compare CodeWhale and Codex
Use the paired comparison harness when you need one normalized row per CLI for
the same task, model, timeout, and environment:
```bash
python scripts/benchmarks/cli-compare.py \
--task prove-plus-comm \
--model deepseek/deepseek-chat \
--runs 3
```
The harness writes raw Harbor logs plus `summary.json`, `summary.md`, and
`metadata.json` under `benchmark_results/cli-compare-*`. Missing metrics are
reported as JSON `null`, and generated run directories are intentionally ignored
by git; keep only curated summaries in docs or release notes.
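As a minimal sketch of consuming these outputs, the snippet below finds the most recently modified `cli-compare-*` directory and separates missing metrics (JSON `null`, which Python loads as `None`) from metrics that are legitimately zero. The helper names and the flat row shape with keys such as `cost_usd` are assumptions for illustration; the actual `summary.json` schema is not described here and may differ.

```python
import json
from pathlib import Path


def load_latest_summary(results_root="benchmark_results"):
    """Return (run_dir, parsed summary.json) for the newest run, or None.

    Only directories matching cli-compare-* that contain a summary.json
    are considered; "newest" means latest modification time.
    """
    run_dirs = sorted(
        (
            p
            for p in Path(results_root).glob("cli-compare-*")
            if (p / "summary.json").is_file()
        ),
        key=lambda p: p.stat().st_mtime,
    )
    if not run_dirs:
        return None
    latest = run_dirs[-1]
    with (latest / "summary.json").open(encoding="utf-8") as fh:
        return latest, json.load(fh)


def format_metric(value):
    """Render one metric cell. Only None (JSON null) counts as missing."""
    # Deliberately not `if not value`: that would also hide a real 0.
    if value is None:
        return "n/a"
    return str(value)


def missing_metrics(row):
    """List the keys of a flat row dict (hypothetical shape) that are null."""
    return [key for key, value in row.items() if value is None]
```

Checking `value is None` rather than truthiness matters here: a run that reports a metric of `0` should still show `0`, and only a `null` entry should be rendered as missing.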
## PinchBench
PinchBench measures agent performance on real-world tasks — scheduling, email