5d9b5f67cb
- Match actual Harbor CLI interface (no invented flags)
- Proper BaseInstalledAgent subclass for Codex
- Robust token extraction from stream JSONL + transcript parsing
- Heuristic answer_len extraction (## Final Answer markers)
- Metadata capture: versions, git commit, platform, timestamp
- --regenerate walks existing run directories
- All missing fields explicit null, never zero
- Support multiple runs per task with run_idx tracking
The harness is designed to run:
harbor run --dataset terminal-bench@2.0:<task> --agent ... --model ...
for both codex and codewhale agents, then normalize the results.
0 lines
0 B
Plaintext
0 lines
0 B
Plaintext
The file is empty.