feat(bench): add CLI comparison harness

Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs.

Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo.
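
For illustration only, here is a minimal sketch of what "keep explicit zero-valued metrics" could mean in a summary writer. It is not taken from the harness: the helper name `format_metric` and the metric key `failures` are hypothetical. It assumes that a metric recorded as 0 is a real measurement and that only a missing or `None` value counts as absent.

```python
from typing import Mapping, Optional


def format_metric(metrics: Mapping[str, Optional[float]], key: str) -> str:
    """Render one metric for a summary, treating only None/missing as absent.

    Hypothetical helper; the real harness may structure this differently.
    """
    value = metrics.get(key)
    # A truthiness check (`if not value`) would also discard 0 and 0.0,
    # which hides a legitimately measured zero. Compare against None instead.
    if value is None:
        return "n/a"
    return f"{value:g}"


print(format_metric({"failures": 0}, "failures"))  # explicit zero is kept
print(format_metric({}, "failures"))               # missing metric is reported as absent
```

The design choice is to compare against `None` instead of relying on truthiness, so a run that reports zero failures or zero tokens is distinguishable from one that reported nothing.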

Fixes #2952.
idling11
2026-06-12 01:15:00 -07:00
committed by Hunter B
parent f99fff969a
commit 0cd8bcde1b
6 changed files with 754 additions and 0 deletions
@@ -123,4 +123,6 @@ scripts/run_deep_swe.py
# Benchmark artifacts and caches re-included by !scripts/**
results/
benchmark_results/*
!benchmark_results/.gitkeep
scripts/**/__pycache__/