feat(bench): add CLI comparison harness

Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, git-ignore protection for generated results, and benchmark docs. Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo. Fixes #2952.
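
One plausible reading of "keep explicit zero-valued metrics" is that a metric reported as `0` must stay `0` in the summary, while a metric that was never reported becomes JSON `null`. The sketch below illustrates that distinction only. The metric names and the `build_row` helper are hypothetical and are not taken from `cli-compare.py`.

```python
import json

# Hypothetical metric names, chosen for illustration only.
METRIC_FIELDS = ("duration_sec", "input_tokens", "cost_usd")


def build_row(cli, raw):
    """Build one normalized summary row for a CLI run.

    dict.get() returns None for an absent key and leaves 0 / 0.0 untouched.
    A truthiness shortcut such as `raw.get(name) or None` would instead
    collapse an explicit zero into None, losing the reported value.
    """
    row = {"cli": cli}
    for name in METRIC_FIELDS:
        row[name] = raw.get(name)
    return row


rows = [
    # input_tokens was reported as 0; cost_usd was never reported.
    build_row("codewhale", {"duration_sec": 41.5, "input_tokens": 0}),
    # cost_usd was reported as 0.0; input_tokens was never reported.
    build_row("codex", {"duration_sec": 38.2, "cost_usd": 0.0}),
]

# json.dumps serializes None as null and keeps numeric zeros as-is.
summary_json = json.dumps(rows, indent=2)
print(summary_json)
```

Every row carries the same keys, so a consumer can compare the two CLIs field by field and still tell "measured as zero" apart from "not measured".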
2026-06-12 01:15:00 -07:00
parent f99fff969a
commit 0cd8bcde1b
6 changed files with 754 additions and 0 deletions
@@ -103,6 +103,23 @@ harbor run \
  --model deepseek/deepseek-chat
 ```

+### Compare CodeWhale and Codex
+
+Use the paired comparison harness when you need one normalized row per CLI for
+the same task, model, timeout, and environment:
+
+```bash
+python scripts/benchmarks/cli-compare.py \
+  --task prove-plus-comm \
+  --model deepseek/deepseek-chat \
+  --runs 3
+```
+
+The harness writes raw Harbor logs plus `summary.json`, `summary.md`, and
+`metadata.json` under `benchmark_results/cli-compare-*`. Missing metrics are
+reported as JSON `null`, and generated run directories are intentionally ignored
+by git; keep only curated summaries in docs or release notes.
+
 ## PinchBench

 PinchBench measures agent performance on real-world tasks — scheduling, email