dgf1988/codewhale

Hanmiao Li 5d9b5f67cb feat(bench): improve cli-compare harness with real Harbor integration (#3009)

- Match actual Harbor CLI interface (no invented flags)
- Proper BaseInstalledAgent subclass for Codex
- Robust token extraction from stream JSONL + transcript parsing
- Heuristic answer_len extraction (## Final Answer markers)
- Metadata capture: versions, git commit, platform, timestamp
- --regenerate walks existing run directories
- All missing fields explicit null, never zero
- Support multiple runs per task with run_idx tracking
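A minimal sketch of how the token extraction, the `## Final Answer` heuristic, and the null-not-zero rule from the list above could fit together. The `usage`, `input_tokens`, and `output_tokens` field names are assumptions for illustration, as are both function names; the real stream schema and the helpers in codex_agent.py may differ.

```python
import json
from typing import Dict, Optional

FINAL_MARKER = "## Final Answer"  # marker named in the commit message


def extract_answer_len(transcript: str) -> Optional[int]:
    """Length of the text after the last marker, or None if no marker exists."""
    idx = transcript.rfind(FINAL_MARKER)
    if idx == -1:
        return None  # a missing value stays null; it is never reported as 0
    return len(transcript[idx + len(FINAL_MARKER):].strip())


def extract_tokens(jsonl_text: str) -> Dict[str, Optional[int]]:
    """Sum usage counters over JSONL events; a counter never seen stays None."""
    # Field names below are hypothetical, not a documented stream schema.
    totals: Dict[str, Optional[int]] = {"input_tokens": None, "output_tokens": None}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines in the stream
        usage = event.get("usage") if isinstance(event, dict) else None
        if not isinstance(usage, dict):
            continue
        for key in totals:
            value = usage.get(key)
            if isinstance(value, int) and not isinstance(value, bool):
                totals[key] = (totals[key] or 0) + value
    return totals
```

Keeping `None` distinct from `0` lets a later aggregation step tell "the agent reported zero tokens" apart from "no usage data was found"; `json.dumps` then serializes the former as `0` and the latter as `null`.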

The harness is designed to run:
    harbor run --dataset terminal-bench@2.0:<task> --agent ... --model ...

for both codex and codewhale agents, then normalize the results.
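As a rough illustration of the invocation above, the sketch below builds one argument list per agent and run index. It uses only the flags shown in the commit message; `harbor_command`, `plan_runs`, and the shape of `agent_specs` are hypothetical, and the values passed to `--agent` and `--model` are supplied by the caller because the message elides them.

```python
from typing import Dict, List


def harbor_command(task: str, agent_spec: str, model: str) -> List[str]:
    """Build the argv for one run, using only flags shown in the commit message."""
    return [
        "harbor", "run",
        "--dataset", f"terminal-bench@2.0:{task}",
        "--agent", agent_spec,
        "--model", model,
    ]


def plan_runs(task: str, agent_specs: Dict[str, str], model: str, runs: int) -> List[dict]:
    """One planned run per (agent, run_idx) pair, so repeated runs stay distinguishable."""
    plan = []
    for label, spec in agent_specs.items():
        for run_idx in range(runs):
            plan.append({
                "agent": label,
                "run_idx": run_idx,
                "argv": harbor_command(task, spec, model),
            })
    return plan
```

Each `argv` list can be passed to `subprocess.run` without shell quoting; recording `run_idx` next to it keeps multiple runs of the same task separate when the results are normalized.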

2026-06-12 10:53:36 -07:00

__init__.py

feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing

2026-06-04 19:33:43 -07:00

codewhale_agent.py

feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration

2026-06-04 19:22:06 -07:00

codex_agent.py

feat(bench): improve cli-compare harness with real Harbor integration (#3009)

2026-06-12 10:53:36 -07:00