3 Commits

Author SHA1 Message Date
Hunter B b7798ba0f6 feat(benchmarks): default PinchBench to direct MiMo routing, auto-read config
PinchBench runner now defaults to the direct Xiaomi API (no OpenRouter).
Reads the API key from ~/.codewhale/config.toml [providers.xiaomi_mimo]
when the XIAOMI_MIMO_API_KEY env var is not set. Pass --openrouter to
use the old OpenRouter path.
2026-06-04 19:38:46 -07:00
Hunter B a5f27aae3a feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing
PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead
of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through
Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type
detection and endpoint mismatch warnings.

Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
2026-06-04 19:33:43 -07:00
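
The tp-/sk- key type detection and endpoint mismatch warning mentioned in commit a5f27aae3a might look something like the sketch below. Several things here are assumptions: that a `tp-` prefix marks a Token Plan key and `sk-` a pay-as-you-go key (inferred from the known-issues list, not stated outright), the function names, and the `endpoint_kind` labels. No real endpoint URLs are used, because the commit message gives none.

```python
def key_type(api_key):
    """Classify a key by prefix. The prefix-to-plan mapping is an assumption."""
    if api_key.startswith("tp-"):
        return "token_plan"
    if api_key.startswith("sk-"):
        return "pay_as_you_go"
    return "unknown"


def mismatch_warning(api_key, endpoint_kind):
    """Return a warning string if key and endpoint disagree, else None.

    endpoint_kind is a hypothetical label: "token_plan" or "pay_as_you_go".
    """
    detected = key_type(api_key)
    if detected == "unknown":
        return "unrecognized key prefix; expected tp- or sk-"
    if detected != endpoint_kind:
        return f"{detected} key used with {endpoint_kind} endpoint"
    return None
```

Returning the warning as a string rather than printing it leaves the caller free to log it, surface it on the CLI, or abort the run.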
Hunter B b329a532f5 feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration
Benchmark harness for evaluating CodeWhale against three external
benchmarks:

- SWE-bench: batch driver wrapping existing codewhale swebench commands
- Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval
- PinchBench: runner with auto-install for real-world agent tasks

Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and
reproducibility checklist. Scripts record version/commit/timestamp
metadata for each run.

Branch: codex/v0.8.53-benchmarks (based on v0.8.53)
2026-06-04 19:22:06 -07:00
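
Commit b329a532f5 says the scripts record version/commit/timestamp metadata for each run. A minimal sketch of that idea is shown below. The function names, the dictionary layout, and the use of `git rev-parse` are assumptions made for illustration; the actual scripts may gather and store this information differently.

```python
import subprocess
from datetime import datetime, timezone


def current_commit():
    """Return the short hash of HEAD, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_metadata(version, commit=None):
    """Collect the three fields named in the commit message (hypothetical layout)."""
    return {
        "version": version,
        "commit": commit if commit is not None else current_commit(),
        # UTC ISO 8601 timestamp so runs from different machines sort consistently.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

A runner could serialize the returned dictionary with `json.dumps` and write it next to the benchmark results, so each result file can be traced back to the build that produced it.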