a5f27aae3a
PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type detection and endpoint mismatch warnings. Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
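The tp-/sk- key type detection and mismatch warning mentioned above could look roughly like the sketch below. It is not the runner's actual code: the function names `detect_key_type` and `warn_on_mismatch` are hypothetical, the mapping of `tp-` to Token Plan and `sk-` to pay-as-you-go is an assumption drawn from the known-issues list, and the endpoint is represented by a plain label rather than a real URL.

```shell
#!/bin/sh
# Hypothetical helper: classify an API key by its prefix.
# Assumption: tp- marks a Token Plan key, sk- a pay-as-you-go key.
detect_key_type() {
  case "$1" in
    tp-*) echo "token-plan" ;;
    sk-*) echo "pay-as-you-go" ;;
    *)    echo "unknown" ;;
  esac
}

# Hypothetical check: warn when the key type and the endpoint kind disagree.
# $1 is the API key; $2 is a caller-supplied label ("token-plan" or
# "pay-as-you-go") describing the endpoint, not an actual endpoint URL.
warn_on_mismatch() {
  key_type=$(detect_key_type "$1")
  endpoint_kind=$2
  if [ "$key_type" != "unknown" ] && [ "$key_type" != "$endpoint_kind" ]; then
    echo "warning: $key_type key used with $endpoint_kind endpoint" >&2
    return 1
  fi
  return 0
}
```

A caller might run `warn_on_mismatch "$API_KEY" token-plan` before starting a benchmark and decide whether to abort or continue on a non-zero status. Keys with an unrecognized prefix pass silently here, which is a design choice of this sketch rather than documented behavior.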
Benchmark Scripts
Convenience runners for evaluating CodeWhale against external benchmarks.
Quick Start
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
Files
- run-swebench.sh: SWE-bench batch driver and evaluator
- run-terminal-bench.sh: Terminal-Bench runner via Harbor
- run-pinchbench.sh: PinchBench runner with auto-install
- harbor/__init__.py: Harbor adapter for CodeWhale (Python)
- harbor/codewhale_agent.py: Adapter entry point
Documentation
See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.