feat(bench): add CLI comparison harness
Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs. Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo. Fixes #2952.
@@ -103,6 +103,23 @@ harbor run \
  --model deepseek/deepseek-chat
```

### Compare CodeWhale and Codex

Use the paired comparison harness when you need one normalized row per CLI for
the same task, model, timeout, and environment:

```bash
python scripts/benchmarks/cli-compare.py \
  --task prove-plus-comm \
  --model deepseek/deepseek-chat \
  --runs 3
```

The harness writes raw Harbor logs plus `summary.json`, `summary.md`, and
`metadata.json` under `benchmark_results/cli-compare-*`. Missing metrics are
reported as JSON `null`, and generated run directories are intentionally ignored
by git; keep only curated summaries in docs or release notes.
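
The difference between a missing metric and an explicit zero can be sketched as follows. This is a minimal illustration, not the harness's actual code: the metric names (`passed`, `duration_s`, `input_tokens`) and the helpers `normalize_row` and `format_cell` are hypothetical, since the real `summary.json` schema is not shown here.

```python
import json

# Hypothetical metric names; the real summary.json schema may differ.
METRIC_KEYS = ("passed", "duration_s", "input_tokens")


def normalize_row(cli, raw):
    """Build one row per CLI, keeping explicit zeros and marking gaps as None."""
    row = {"cli": cli}
    for key in METRIC_KEYS:
        # A membership test (rather than `raw.get(key) or None`) lets 0 survive.
        row[key] = raw[key] if key in raw else None
    return row


def format_cell(value):
    """Render None as "n/a" and every other value, including 0, unchanged."""
    return "n/a" if value is None else str(value)


rows = [
    normalize_row("codewhale", {"passed": 1, "duration_s": 0, "input_tokens": 1200}),
    normalize_row("codex", {"passed": 0, "duration_s": 41.5}),
]

# json.dumps serializes None as JSON null, so missing metrics stay distinguishable.
summary_json = json.dumps(rows, indent=2)
```

Checking `value is None` instead of relying on truthiness is what keeps a legitimate `0` from being reported as missing when rows are rendered or compared.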

## PinchBench

PinchBench measures agent performance on real-world tasks — scheduling, email