feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing

PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead
of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through
Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type
detection and endpoint mismatch warnings.

Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
This commit is contained in:
Hunter B
2026-06-04 19:33:43 -07:00
parent b329a532f5
commit a5f27aae3a
4 changed files with 246 additions and 43 deletions
+24 -13
@@ -112,26 +112,37 @@ agent runtime.
### Setup
```bash
git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
cd /tmp/pinchbench
uv venv && source .venv/bin/activate
uv pip install -e .
./scripts/benchmarks/run-pinchbench.sh --install
```
### Run
### Run (MiMo v2.5 Pro — default)
```bash
# Via the convenience script
./scripts/benchmarks/run-pinchbench.sh \
--model deepseek/deepseek-chat \
--suite all
# MiMo v2.5 Pro via OpenRouter (default)
./scripts/benchmarks/run-pinchbench.sh
# Or directly
cd /tmp/pinchbench && ./scripts/run.sh \
--model deepseek/deepseek-chat \
--suite all
# MiMo v2.5 Pro via direct Xiaomi API
./scripts/benchmarks/run-pinchbench.sh --direct-mimo
# Specific tasks
./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock
```
### Run (other models)
```bash
./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro
```
### MiMo v2.5 notes
PinchBench routes through OpenRouter by default. MiMo models are available as
`openrouter/xiaomi/mimo-v2.5-pro` (Pro) and `openrouter/xiaomi/mimo-v2.5`
(Omni). For direct Xiaomi API access, use `--direct-mimo` with
`XIAOMI_MIMO_API_KEY` set.
See `scripts/benchmarks/run-pinchbench.sh --help` for full option reference.
## Reproducibility checklist
When publishing benchmark results, record:
+89
@@ -0,0 +1,89 @@
# MiMo v2.5 Benchmarking — Known Issues
Tracking doc for quirks and workarounds when benchmarking Xiaomi MiMo v2.5
through CodeWhale's harness integrations.
## PinchBench
### Issue 1: Model validation requires OpenRouter prefix
PinchBench validates models against OpenRouter's `/models` endpoint. If you
pass `mimo-v2.5-pro` without the `openrouter/xiaomi/` prefix, validation is
skipped entirely (it assumes it's a non-OpenRouter model). This means you
won't know if the model ID is wrong until the run fails.
**Workaround:** Always use `openrouter/xiaomi/mimo-v2.5-pro` for OpenRouter
routing, or use `--direct-mimo` for the Xiaomi API.
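
A pre-flight check along these lines could flag an unprefixed ID before a run starts. This is only a sketch: `classify_model_id` is a hypothetical helper that is not part of the runner or of PinchBench, and it inspects the string alone without querying OpenRouter.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a model ID carries the openrouter/ prefix.
# IDs without the prefix are the ones for which validation would be skipped.
classify_model_id() {
  local model="$1"
  case "$model" in
    openrouter/*) echo "openrouter" ;;   # would be checked against /models
    *)            echo "unvalidated" ;;  # validation would be skipped
  esac
}

classify_model_id "openrouter/xiaomi/mimo-v2.5-pro"   # prints: openrouter
classify_model_id "mimo-v2.5-pro"                     # prints: unvalidated
```

A wrapper script could refuse to continue on `unvalidated` unless `--direct-mimo` was also given.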
### Issue 2: PinchBench requires OPENROUTER_API_KEY
Even when using a direct provider, PinchBench's `lib_agent.py` checks for
`OPENROUTER_API_KEY` in some code paths. The `--direct-mimo` flag in our
runner works around this by setting up a custom OpenAI-compatible provider
entry in OpenClaw's `models.json` and exporting `OPENAI_API_KEY`/`OPENAI_BASE_URL`.
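
The export half of that workaround can be sketched as follows. The variable names and the default URL come from the runner. The function name `resolve_mimo_key` is hypothetical, and the sketch does not touch `models.json`.

```bash
#!/usr/bin/env bash
# Sketch: pick the first non-empty key among the accepted variable names.
# resolve_mimo_key is a hypothetical name used only for illustration.
resolve_mimo_key() {
  echo "${XIAOMI_MIMO_API_KEY:-${XIAOMI_API_KEY:-${MIMO_API_KEY:-}}}"
}

key="$(resolve_mimo_key)"
if [ -n "$key" ]; then
  # Expose the MiMo credentials under the OpenAI-compatible variable names.
  export OPENAI_API_KEY="$key"
  export OPENAI_BASE_URL="${XIAOMI_MIMO_BASE_URL:-https://token-plan-sgp.xiaomimimo.com/v1}"
fi
```

Because `:-` treats an empty variable as unset, an empty `XIAOMI_MIMO_API_KEY` falls through to the next alias.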
### Issue 3: Token Plan vs Pay-as-you-go key mismatch
Xiaomi MiMo has two API endpoints:
- **Token Plan** (`tp-` keys): `https://token-plan-sgp.xiaomimimo.com/v1`
- **Pay-as-you-go** (`sk-` keys): `https://api.xiaomimimo.com/v1`
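Pairing a key with the other plan's endpoint (for example, a `tp-` key with
`https://api.xiaomimimo.com/v1`) produces auth errors. The runner detects the
mismatch from the key prefix and base URL and prints a warning; it does not
abort the run.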
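
The pairing rule can be sketched as a standalone function. It uses the same prefix and substring tests as the runner, but `check_mimo_pairing` itself is a hypothetical name and is not defined in the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "mismatch" when a key prefix is paired with the
# other plan's endpoint, and "ok" otherwise (including unknown prefixes).
check_mimo_pairing() {
  local key="$1" url="$2"
  case "$key:$url" in
    tp-*:*api.xiaomimimo.com*) echo "mismatch" ;;  # Token Plan key, pay-as-you-go URL
    sk-*:*token-plan*)         echo "mismatch" ;;  # pay-as-you-go key, Token Plan URL
    *)                         echo "ok" ;;
  esac
}

check_mimo_pairing "tp-example" "https://api.xiaomimimo.com/v1"             # prints: mismatch
check_mimo_pairing "sk-example" "https://api.xiaomimimo.com/v1"             # prints: ok
```

The check is deliberately permissive: a custom base URL or an unrecognized key prefix is reported as `ok` rather than blocked.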
### Issue 4: OpenClaw is the runtime, not CodeWhale
PinchBench runs tasks through OpenClaw, not CodeWhale. This means the
benchmark measures MiMo v2.5's performance through OpenClaw's agent harness,
not through CodeWhale's tool system. For CodeWhale-native evaluation,
Terminal-Bench (via Harbor) is the better fit.
**Future:** Create a CodeWhale-native PinchBench adapter that loads tasks
from PinchBench's `tasks/` directory and runs them through `codewhale exec`.
## Terminal-Bench (Harbor)
### Issue 1: MiMo provider routing
Harbor passes models in `provider/model` format. For MiMo via OpenRouter,
use `openrouter/xiaomi/mimo-v2.5-pro`. For direct Xiaomi API, pass
`--provider xiaomi-mimo` as an extra agent flag.
### Issue 2: Container environment
The Harbor adapter installs codewhale via npm in the container. MiMo API
keys must be forwarded from the host environment. The adapter checks for
`XIAOMI_MIMO_API_KEY`, `OPENROUTER_API_KEY`, and `OPENAI_API_KEY`.
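
Before starting a container run, it can help to confirm which of these keys are actually exported on the host. The sketch below assumes nothing about Harbor itself: `list_forwardable_keys` is a hypothetical helper that prints only variable names, never values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list which provider keys are exported and non-empty.
# printenv only sees exported variables, which are the ones a child process
# (such as a container launcher) could inherit.
list_forwardable_keys() {
  local name
  for name in XIAOMI_MIMO_API_KEY OPENROUTER_API_KEY OPENAI_API_KEY; do
    if [ -n "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
  return 0
}

list_forwardable_keys
```

An empty result means none of the three keys would reach the container.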
## SWE-bench
### Issue 1: MiMo thinking mode
MiMo v2.5 Pro supports extended thinking. For SWE-bench patch generation,
set the thinking level explicitly; `--thinking high` is passed through
the CLI.
### Issue 2: Context window
MiMo v2.5 Pro has a 128K context window. Large SWE-bench instances (e.g.,
Django, sympy) may benefit from the full window. No special handling is
needed, but token usage is worth monitoring.
## Environment Variables Reference
```
# Xiaomi MiMo direct API
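# Set one key; its prefix must match the base URL
XIAOMI_MIMO_API_KEY=tp-... # Token Plan key
# XIAOMI_MIMO_API_KEY=sk-... # Pay-as-you-go key (use instead of the tp- key)
XIAOMI_MIMO_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
# XIAOMI_MIMO_BASE_URL=https://api.xiaomimimo.com/v1 # for sk- keys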
XIAOMI_MIMO_MODEL=mimo-v2.5-pro
# Aliases also accepted
XIAOMI_API_KEY=...
MIMO_API_KEY=...
MIMO_BASE_URL=...
MIMO_MODEL=...
# OpenRouter (for MiMo via OpenRouter)
OPENROUTER_API_KEY=...
```
+6
@@ -52,6 +52,12 @@ class CodeWhaleAgent(BaseInstalledAgent):
type="str",
default="high",
),
CliFlag(
"provider",
cli="--provider",
type="str",
default=None,
),
]
@staticmethod
+127 -30
@@ -1,27 +1,43 @@
#!/usr/bin/env bash
# run-pinchbench.sh — Run CodeWhale through PinchBench.
# run-pinchbench.sh — Run PinchBench benchmarks with CodeWhale model routing.
#
# PinchBench evaluates agent performance on real-world tasks. It normally
# targets OpenClaw, but this script adapts the workflow for CodeWhale by
# leveraging the OpenRouter-compatible model routing.
# PinchBench evaluates agent performance on real-world tasks (calendar, email,
# coding, research, file management). It uses OpenClaw as the agent runtime and
# routes models through OpenRouter by default.
#
# Known issues with Xiaomi MiMo v2.5:
# 1. PinchBench validates models against OpenRouter's /models endpoint.
# MiMo models MUST use the openrouter/ prefix or validation is skipped.
# 2. PinchBench requires OPENROUTER_API_KEY even when using a direct provider.
# The --direct-mimo flag sets up a custom OpenAI-compatible endpoint in
# OpenClaw's models.json to bypass this.
# 3. MiMo v2.5 Pro has a 128K context window but PinchBench tasks are small.
# No special handling needed, but worth noting for cost estimates.
# 4. The Xiaomi Token Plan endpoint (token-plan-sgp.xiaomimimo.com) uses
# tp- prefixed keys. Pay-as-you-go (api.xiaomimimo.com) uses sk- keys.
# Make sure XIAOMI_MIMO_API_KEY matches the endpoint you're using.
# 5. OpenRouter model ID for MiMo: xiaomi/mimo-v2.5-pro (Pro) or
# xiaomi/mimo-v2.5 (Omni). Pass them to PinchBench with the openrouter/
# prefix, e.g. openrouter/xiaomi/mimo-v2.5-pro.
#
# Usage:
# ./scripts/benchmarks/run-pinchbench.sh --help
# ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
# ./scripts/benchmarks/run-pinchbench.sh --model openrouter/xiaomi/mimo-v2.5-pro
# ./scripts/benchmarks/run-pinchbench.sh --direct-mimo --suite task_calendar
#
# Prerequisites:
# - PinchBench cloned (or install via this script)
# - PinchBench cloned (or use --install)
# - Python 3.10+ with uv
# - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
# - A running OpenClaw instance (PinchBench's default runtime)
# - OPENROUTER_API_KEY (for OpenRouter routing)
# - OR XIAOMI_MIMO_API_KEY + --direct-mimo (for direct Xiaomi API)
# - A running OpenClaw instance
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
# Defaults
MODEL="deepseek/deepseek-chat"
# Defaults — MiMo v2.5 Pro via OpenRouter
MODEL="openrouter/xiaomi/mimo-v2.5-pro"
SUITE="all"
PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
RESULTS_DIR="./results/pinchbench"
@@ -29,19 +45,28 @@ INSTALL_PINCHBENCH=false
RUNS=1
JUDGE_MODEL=""
NO_UPLOAD=true
DIRECT_MIMO=false
MIMO_BASE_URL=""
EXTRA_ARGS=()
usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]
Run PinchBench benchmarks with CodeWhale-compatible model routing.
Run PinchBench benchmarks. Defaults to Xiaomi MiMo v2.5 Pro via OpenRouter.
Options:
--model MODEL Model in provider/name format (default: deepseek/deepseek-chat)
--suite SUITE Task suite: all, automated-only, or comma-separated IDs (default: all)
--model MODEL Model ID (default: openrouter/xiaomi/mimo-v2.5-pro)
Common values:
openrouter/xiaomi/mimo-v2.5-pro — MiMo Pro via OpenRouter
openrouter/xiaomi/mimo-v2.5 — MiMo Omni via OpenRouter
openrouter/deepseek/deepseek-v4-pro — DeepSeek V4 Pro via OpenRouter
--suite SUITE Task suite: all, automated-only, or comma-separated IDs
--runs N Runs per task for averaging (default: 1)
--judge MODEL Judge model for LLM grading
--judge MODEL Judge model for LLM grading (default: uses OpenClaw agent)
--direct-mimo Route MiMo directly via Xiaomi API (bypasses OpenRouter)
Requires XIAOMI_MIMO_API_KEY. Sets model to mimo-v2.5-pro (overrides --model).
--mimo-base-url URL Override MiMo API base URL (default: Token Plan Singapore)
--pinchbench-dir DIR PinchBench install directory (default: /tmp/pinchbench)
--results-dir DIR Local results directory (default: ./results/pinchbench)
--install Install/clone PinchBench before running
@@ -49,15 +74,26 @@ Options:
-- [EXTRA_ARGS...] Additional arguments passed to PinchBench
-h, --help Show this help
Environment variables:
OPENROUTER_API_KEY Required for OpenRouter model routing
XIAOMI_MIMO_API_KEY Required for --direct-mimo (or XIAOMI_API_KEY / MIMO_API_KEY)
XIAOMI_MIMO_BASE_URL Override MiMo API endpoint
Examples:
# Basic run with DeepSeek
$(basename "$0") --model deepseek/deepseek-chat
# MiMo v2.5 Pro via OpenRouter (default)
$(basename "$0")
# Install and run
$(basename "$0") --install --model deepseek/deepseek-chat
# MiMo v2.5 Pro via direct Xiaomi API
$(basename "$0") --direct-mimo
# Specific tasks only
$(basename "$0") --suite task_calendar,task_stock --model deepseek/deepseek-chat
# Specific tasks with MiMo
$(basename "$0") --suite task_calendar,task_stock
# Install PinchBench and run
$(basename "$0") --install
# DeepSeek V4 Pro via OpenRouter
$(basename "$0") --model openrouter/deepseek/deepseek-v4-pro
EOF
}
@@ -67,6 +103,8 @@ while [[ $# -gt 0 ]]; do
--suite) SUITE="$2"; shift 2 ;;
--runs) RUNS="$2"; shift 2 ;;
--judge) JUDGE_MODEL="$2"; shift 2 ;;
--direct-mimo) DIRECT_MIMO=true; shift ;;
--mimo-base-url) MIMO_BASE_URL="$2"; shift 2 ;;
--pinchbench-dir) PINCHBENCH_DIR="$2"; shift 2 ;;
--results-dir) RESULTS_DIR="$2"; shift 2 ;;
--install) INSTALL_PINCHBENCH=true; shift ;;
@@ -77,7 +115,57 @@ while [[ $# -gt 0 ]]; do
esac
done
# Install PinchBench if requested
# ── Direct MiMo mode ────────────────────────────────────────────────────────
# When --direct-mimo is set, we configure PinchBench to use Xiaomi's API
# directly instead of routing through OpenRouter. This creates a custom
# OpenAI-compatible provider entry in OpenClaw's models.json.
if [[ "$DIRECT_MIMO" == true ]]; then
MODEL="mimo-v2.5-pro"
# Resolve API key from multiple env var names
MIMO_KEY="${XIAOMI_MIMO_API_KEY:-${XIAOMI_API_KEY:-${MIMO_API_KEY:-}}}"
if [[ -z "$MIMO_KEY" ]]; then
echo "Error: --direct-mimo requires XIAOMI_MIMO_API_KEY (or XIAOMI_API_KEY / MIMO_API_KEY)" >&2
echo " Token Plan keys (tp-...): https://token-plan-sgp.xiaomimimo.com/v1" >&2
echo " Pay-as-you-go keys (sk-...): https://api.xiaomimimo.com/v1" >&2
exit 1
fi
# Determine base URL: flag > env > default (Token Plan Singapore)
if [[ -z "$MIMO_BASE_URL" ]]; then
MIMO_BASE_URL="${XIAOMI_MIMO_BASE_URL:-https://token-plan-sgp.xiaomimimo.com/v1}"
fi
# Detect key type and warn if mismatched
if [[ "$MIMO_KEY" == tp-* && "$MIMO_BASE_URL" == *"api.xiaomimimo.com"* ]]; then
echo "Warning: tp- key used with pay-as-you-go endpoint. Token Plan keys work with:" >&2
echo " https://token-plan-sgp.xiaomimimo.com/v1" >&2
elif [[ "$MIMO_KEY" == sk-* && "$MIMO_BASE_URL" == *"token-plan"* ]]; then
echo "Warning: sk- key used with Token Plan endpoint. Pay-as-you-go keys work with:" >&2
echo " https://api.xiaomimimo.com/v1" >&2
fi
echo "Direct MiMo mode:"
echo " Model: $MODEL"
echo " Endpoint: $MIMO_BASE_URL"
echo " Key type: ${MIMO_KEY:0:3}..."
echo ""
# Export for PinchBench's lib_agent.py custom provider setup
export OPENAI_API_KEY="$MIMO_KEY"
export OPENAI_BASE_URL="$MIMO_BASE_URL"
fi
# ── Prereq checks ───────────────────────────────────────────────────────────
if [[ "$DIRECT_MIMO" != true ]]; then
# OpenRouter mode — need the key
if [[ -z "${OPENROUTER_API_KEY:-}" ]]; then
echo "Warning: OPENROUTER_API_KEY not set. PinchBench may fail model validation." >&2
echo " Either set OPENROUTER_API_KEY or use --direct-mimo with XIAOMI_MIMO_API_KEY." >&2
fi
fi
# ── Install PinchBench ──────────────────────────────────────────────────────
if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
echo "Installing PinchBench to $PINCHBENCH_DIR ..."
if [[ -d "$PINCHBENCH_DIR" ]]; then
@@ -91,7 +179,6 @@ if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
uv pip install -e .
fi
# Verify PinchBench is available
if [[ ! -d "$PINCHBENCH_DIR" ]]; then
echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
echo "Run with --install to clone it automatically." >&2
@@ -100,21 +187,21 @@ fi
cd "$PINCHBENCH_DIR"
# Activate venv if it exists
if [[ -f ".venv/bin/activate" ]]; then
source .venv/bin/activate
fi
mkdir -p "$RESULTS_DIR"
# Record metadata
# ── Record metadata ─────────────────────────────────────────────────────────
METADATA_FILE="$RESULTS_DIR/run_metadata.json"
cat > "$METADATA_FILE" <<META
{
"codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
"git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
"pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
"pinchbench_commit": "$(git -C "$PINCHBENCH_DIR" rev-parse HEAD 2>/dev/null || echo unknown)",
"model": "$MODEL",
"routing": "$(if [[ "$DIRECT_MIMO" == true ]]; then echo "direct-xiaomi"; else echo "openrouter"; fi)",
"suite": "$SUITE",
"runs": $RUNS,
"timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
@@ -123,7 +210,7 @@ cat > "$METADATA_FILE" <<META
META
echo "Run metadata: $METADATA_FILE"
# Build PinchBench command
# ── Build and run PinchBench ────────────────────────────────────────────────
PB_ARGS=("--model" "$MODEL" "--suite" "$SUITE" "--runs" "$RUNS" "--output-dir" "$RESULTS_DIR")
if [[ -n "$JUDGE_MODEL" ]]; then
@@ -134,13 +221,23 @@ if [[ "$NO_UPLOAD" == true ]]; then
PB_ARGS+=("--no-upload")
fi
# Pass the direct-mimo endpoint to PinchBench via --base-url
# (OPENAI_API_KEY and OPENAI_BASE_URL are already exported above)
if [[ "$DIRECT_MIMO" == true ]]; then
PB_ARGS+=("--base-url" "$MIMO_BASE_URL")
fi
# Guarded expansion: with set -u, bash < 4.4 treats an empty array as unbound
PB_ARGS+=(${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"})
echo "Running PinchBench..."
echo " Model: $MODEL"
echo " Suite: $SUITE"
echo " Runs: $RUNS"
echo " Output: $RESULTS_DIR"
echo " Model: $MODEL"
echo " Suite: $SUITE"
echo " Runs: $RUNS"
echo " Output: $RESULTS_DIR"
if [[ "$DIRECT_MIMO" == true ]]; then
echo " Routing: Direct Xiaomi API ($MIMO_BASE_URL)"
else
echo " Routing: OpenRouter"
fi
echo ""
./scripts/run.sh "${PB_ARGS[@]}"