dgf1988/codewhale

Files

T

Hunter B a5f27aae3a feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing

PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead
of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through
Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type
detection and endpoint mismatch warnings.

Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale

2026-06-04 19:33:43 -07:00

4.5 KiB

Raw Blame History

Benchmarks

CodeWhale integrates with three external benchmarks to measure real-world coding-agent performance. Each benchmark tests a different surface:

Benchmark	What it tests	Harness	Output format
SWE-bench	Patch generation from GitHub issues	CodeWhale built-in (`codewhale swebench`)	`all_preds.jsonl`
Terminal-Bench	End-to-end terminal tasks (compile, deploy, configure)	Harbor framework adapter	Harbor result JSON
PinchBench	Real-world agent tasks (calendar, email, coding, research)	Standalone runner via OpenClaw-compatible adapter	PinchBench result JSON

All three require Docker. SWE-bench and Terminal-Bench also need the official evaluation harness installed separately.

Prerequisites

# Docker (all benchmarks)
docker --version

# Python 3.10+ with uv (Terminal-Bench, PinchBench, SWE-bench eval)
python3 --version
uv --version

# CodeWhale v0.8.53+
codewhale --version

# API key
export DEEPSEEK_API_KEY="sk-..."

SWE-bench

CodeWhale has built-in SWE-bench support via codewhale swebench run and codewhale swebench export. See docs/SWEBENCH.md for the single-instance workflow.

Batch run

# Run all instances from a dataset split
./scripts/benchmarks/run-swebench.sh \
  --dataset princeton-nlp/SWE-bench_Lite \
  --split test \
  --predictions-path ./results/swebench_preds.jsonl

# Run a single instance
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md \
  --predictions-path ./results/swebench_preds.jsonl

Evaluate

python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path ./results/swebench_preds.jsonl \
  --max_workers 1 \
  --run_id codewhale-v0.8.53

Terminal-Bench (via Harbor)

Terminal-Bench tests agents on real terminal tasks — compiling, deploying, configuring servers, training models. The Harbor framework is the official harness.

CodeWhale plugs in via a Harbor adapter (scripts/benchmarks/harbor/codewhale_agent.py).

Setup

pip install harbor

Run

# Via the convenience script
./scripts/benchmarks/run-terminal-bench.sh \
  --dataset terminal-bench@2.0 \
  --model deepseek/deepseek-chat \
  --n-concurrent 4

# Or directly with harbor
harbor run \
  --dataset terminal-bench@2.0 \
  --agent codewhale \
  --model deepseek/deepseek-chat \
  --n-concurrent 4

Custom agent path

If the adapter is not installed system-wide, point Harbor at it:

harbor run \
  --dataset terminal-bench@2.0 \
  --agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \
  --model deepseek/deepseek-chat

PinchBench

PinchBench measures agent performance on real-world tasks — scheduling, email triage, code generation, research, file management. It uses OpenClaw as the agent runtime.

Setup

./scripts/benchmarks/run-pinchbench.sh --install

Run (MiMo v2.5 Pro — default)

# MiMo v2.5 Pro via OpenRouter (default)
./scripts/benchmarks/run-pinchbench.sh

# MiMo v2.5 Pro via direct Xiaomi API
./scripts/benchmarks/run-pinchbench.sh --direct-mimo

# Specific tasks
./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock

Run (other models)

./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro

MiMo v2.5 notes

PinchBench routes through OpenRouter by default. MiMo models are available as openrouter/xiaomi/mimo-v2.5-pro (Pro) and openrouter/xiaomi/mimo-v2.5 (Omni). For direct Xiaomi API access, use --direct-mimo with XIAOMI_MIMO_API_KEY set.

See scripts/benchmarks/run-pinchbench.sh --help for full option reference.

Reproducibility checklist

When publishing benchmark results, record:

CodeWhale version: codewhale --version
Git commit: git rev-parse HEAD
Model and provider (e.g. deepseek/deepseek-chat)
Benchmark dataset and version
Docker platform (linux/amd64 vs linux/arm64)
Worker concurrency
Timestamp (UTC)
Full result file (all_preds.jsonl, Harbor result dir, or PinchBench results JSON)

References

SWE-bench: https://github.com/SWE-bench/SWE-bench
Terminal-Bench: https://github.com/laude-institute/terminal-bench / https://www.tbench.ai
Harbor: https://github.com/harbor-framework/harbor / https://harborframework.com
PinchBench: https://github.com/pinchbench/skill / https://pinchbench.com

4.5 KiB Raw Blame History

Benchmarks

Prerequisites

SWE-bench

Batch run

Evaluate

Terminal-Bench (via Harbor)

Setup

Run

Custom agent path

PinchBench

Setup

Run (MiMo v2.5 Pro — default)

Run (other models)

MiMo v2.5 notes

Reproducibility checklist

References

4.5 KiB

Raw Blame History