feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing
PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type detection and endpoint mismatch warnings. Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
+24
-13
@@ -112,26 +112,37 @@ agent runtime.
 ### Setup
 
 ```bash
-git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
-cd /tmp/pinchbench
-uv venv && source .venv/bin/activate
-uv pip install -e .
+./scripts/benchmarks/run-pinchbench.sh --install
 ```
 
-### Run
+### Run (MiMo v2.5 Pro — default)
 
 ```bash
-# Via the convenience script
-./scripts/benchmarks/run-pinchbench.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via OpenRouter (default)
+./scripts/benchmarks/run-pinchbench.sh
 
-# Or directly
-cd /tmp/pinchbench && ./scripts/run.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via direct Xiaomi API
+./scripts/benchmarks/run-pinchbench.sh --direct-mimo
+
+# Specific tasks
+./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock
 ```
 
+### Run (other models)
+
+```bash
+./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro
+```
+
+### MiMo v2.5 notes
+
+PinchBench routes through OpenRouter by default. MiMo models are available as
+`openrouter/xiaomi/mimo-v2.5-pro` (Pro) and `openrouter/xiaomi/mimo-v2.5`
+(Omni). For direct Xiaomi API access, use `--direct-mimo` with
+`XIAOMI_MIMO_API_KEY` set.
+
+See `scripts/benchmarks/run-pinchbench.sh --help` for full option reference.
+
 ## Reproducibility checklist
 
 When publishing benchmark results, record:
@@ -0,0 +1,89 @@
+# MiMo v2.5 Benchmarking — Known Issues
+
+Tracking doc for quirks and workarounds when benchmarking Xiaomi MiMo v2.5
+through CodeWhale's harness integrations.
+
+## PinchBench
+
+### Issue 1: Model validation requires OpenRouter prefix
+
+PinchBench validates models against OpenRouter's `/models` endpoint. If you
+pass `mimo-v2.5-pro` without the `openrouter/xiaomi/` prefix, validation is
+skipped entirely (it assumes it's a non-OpenRouter model). This means you
+won't know if the model ID is wrong until the run fails.
+
+**Workaround:** Always use `openrouter/xiaomi/mimo-v2.5-pro` for OpenRouter
+routing, or use `--direct-mimo` for the Xiaomi API.
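Review note: the Issue 1 workaround could be enforced before a run starts, so a skipped validation is not silent. A minimal sketch, assuming a hypothetical helper named `require_openrouter_prefix`; it is not part of PinchBench or of `run-pinchbench.sh`, and it checks only the prefix, not whether the model exists on OpenRouter.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of PinchBench or the runner).
# Returns 0 when the model ID starts with "openrouter/", 1 otherwise,
# and prints a warning to stderr in the second case.
require_openrouter_prefix() {
  local model="$1"
  if [[ "$model" == openrouter/* ]]; then
    return 0
  fi
  echo "Warning: '$model' has no openrouter/ prefix; model validation will be skipped." >&2
  return 1
}

# The bare ID is flagged, the prefixed ID passes.
require_openrouter_prefix "mimo-v2.5-pro" || echo "bare ID flagged"
require_openrouter_prefix "openrouter/xiaomi/mimo-v2.5-pro" && echo "prefixed ID ok"
```

In direct mode the runner sets the bare `mimo-v2.5-pro` on purpose, so a check like this would only make sense on the OpenRouter path.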
+
+### Issue 2: PinchBench requires OPENROUTER_API_KEY
+
+Even when using a direct provider, PinchBench's `lib_agent.py` checks for
+`OPENROUTER_API_KEY` in some code paths. The `--direct-mimo` flag in our
+runner works around this by setting up a custom OpenAI-compatible provider
+entry in OpenClaw's `models.json` and exporting `OPENAI_API_KEY`/`OPENAI_BASE_URL`.
+
+### Issue 3: Token Plan vs Pay-as-you-go key mismatch
+
+Xiaomi MiMo has two API endpoints:
+- **Token Plan** (`tp-` keys): `https://token-plan-sgp.xiaomimimo.com/v1`
+- **Pay-as-you-go** (`sk-` keys): `https://api.xiaomimimo.com/v1`
+
+Using a key with the other plan's endpoint produces auth errors. The runner
+warns on a `tp-`/`sk-` prefix mismatch with the endpoint but does not abort.
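Review note: the runner defaults to the Token Plan endpoint and only warns on a mismatch. An alternative is to derive the endpoint from the key prefix. The sketch below assumes a hypothetical helper, `mimo_endpoint_for_key`, that the runner does not contain; the two URLs are the ones listed under Issue 3.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in run-pinchbench.sh): map a MiMo key prefix to the
# endpoint documented for that key type. Unknown prefixes are an error.
mimo_endpoint_for_key() {
  case "$1" in
    tp-*) echo "https://token-plan-sgp.xiaomimimo.com/v1" ;;
    sk-*) echo "https://api.xiaomimimo.com/v1" ;;
    *)    echo "Error: unrecognized MiMo key prefix" >&2; return 1 ;;
  esac
}

# Placeholder key, used only to show the mapping.
mimo_endpoint_for_key "tp-example"
```

The result could then be passed through the existing flag, for example `--direct-mimo --mimo-base-url "$(mimo_endpoint_for_key "$XIAOMI_MIMO_API_KEY")"`.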
+
+### Issue 4: OpenClaw is the runtime, not CodeWhale
+
+PinchBench runs tasks through OpenClaw, not CodeWhale. This means the
+benchmark measures MiMo v2.5's performance through OpenClaw's agent harness,
+not through CodeWhale's tool system. For CodeWhale-native evaluation,
+Terminal-Bench (via Harbor) is the better fit.
+
+**Future:** Create a CodeWhale-native PinchBench adapter that loads tasks
+from PinchBench's `tasks/` directory and runs them through `codewhale exec`.
+
+## Terminal-Bench (Harbor)
+
+### Issue 1: MiMo provider routing
+
+Harbor passes models in `provider/model` format. For MiMo via OpenRouter,
+use `openrouter/xiaomi/mimo-v2.5-pro`. For direct Xiaomi API, pass
+`--provider xiaomi-mimo` as an extra agent flag.
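Review note: an OpenRouter ID such as `openrouter/xiaomi/mimo-v2.5-pro` contains two slashes, so it matters where a `provider/model` string is split. The sketch below assumes the provider is everything before the first slash; that is an assumption for illustration, not documented Harbor behavior, and `split_model_id` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split "provider/model" at the FIRST slash.
# Assumption: the provider is the first path segment; the rest is the model.
split_model_id() {
  local id="$1"
  if [[ "$id" != */* ]]; then
    echo "Error: '$id' is not in provider/model format" >&2
    return 1
  fi
  echo "provider=${id%%/*}"   # strip the longest suffix starting at a slash
  echo "model=${id#*/}"       # strip the shortest prefix ending at a slash
}

split_model_id "openrouter/xiaomi/mimo-v2.5-pro"
```

Under this assumption the model part keeps its `xiaomi/` segment, which is the ID OpenRouter itself uses.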
+
+### Issue 2: Container environment
+
+The Harbor adapter installs codewhale via npm in the container. MiMo API
+keys must be forwarded from the host environment. The adapter checks for
+`XIAOMI_MIMO_API_KEY`, `OPENROUTER_API_KEY`, and `OPENAI_API_KEY`.
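Review note: before launching a container run, it can help to confirm which of the three keys are set on the host. A minimal sketch, assuming a hypothetical helper `list_forwardable_keys`; it prints variable names only, never values, and does not show how Harbor forwards them.

```bash
#!/usr/bin/env bash
# Hypothetical host-side check (not part of the Harbor adapter).
# Prints the NAME of each key variable that is set and non-empty.
list_forwardable_keys() {
  local name
  for name in XIAOMI_MIMO_API_KEY OPENROUTER_API_KEY OPENAI_API_KEY; do
    # ${!name:-} is indirect expansion: the value of the variable named by $name.
    if [[ -n "${!name:-}" ]]; then
      echo "$name"
    fi
  done
}

list_forwardable_keys
```

An empty result means none of the keys the adapter checks for are available to forward.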
+
+## SWE-bench
+
+### Issue 1: MiMo thinking mode
+
+MiMo v2.5 Pro supports extended thinking. For SWE-bench patch generation,
+ensure the thinking level is set appropriately. The `--thinking high` flag
+is passed through the CLI.
+
+### Issue 2: Context window
+
+MiMo v2.5 Pro has a 128K context window. Large SWE-bench instances (e.g.,
+Django, sympy) may benefit from the full window. No special handling needed,
+but worth monitoring token usage.
+
+## Environment Variables Reference
+
+```
+# Xiaomi MiMo direct API
+XIAOMI_MIMO_API_KEY=tp-...        # Token Plan key
+XIAOMI_MIMO_API_KEY=sk-...        # Pay-as-you-go key
+XIAOMI_MIMO_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
+XIAOMI_MIMO_MODEL=mimo-v2.5-pro
+
+# Aliases also accepted
+XIAOMI_API_KEY=...
+MIMO_API_KEY=...
+MIMO_BASE_URL=...
+MIMO_MODEL=...
+
+# OpenRouter (for MiMo via OpenRouter)
+OPENROUTER_API_KEY=...
+```
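Review note: the key aliases are resolved in a fixed order by `run-pinchbench.sh` in this commit: `XIAOMI_MIMO_API_KEY`, then `XIAOMI_API_KEY`, then `MIMO_API_KEY`. The sketch below isolates that chain in a function; the name `resolve_mimo_key` is hypothetical, while the expansion is the one the runner uses. The runner also initializes its own `MIMO_BASE_URL` variable to an empty string, so there only `--mimo-base-url` and `XIAOMI_MIMO_BASE_URL` select the endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the runner's fallback chain.
# The first variable that is set and non-empty wins; the result may be empty.
resolve_mimo_key() {
  echo "${XIAOMI_MIMO_API_KEY:-${XIAOMI_API_KEY:-${MIMO_API_KEY:-}}}"
}

# Report only whether a key was found, never the key itself.
if [[ -n "$(resolve_mimo_key)" ]]; then
  echo "MiMo key found"
else
  echo "no MiMo key set"
fi
```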
@@ -52,6 +52,12 @@ class CodeWhaleAgent(BaseInstalledAgent):
             type="str",
             default="high",
         ),
+        CliFlag(
+            "provider",
+            cli="--provider",
+            type="str",
+            default=None,
+        ),
     ]
 
     @staticmethod
@@ -1,27 +1,43 @@
 #!/usr/bin/env bash
-# run-pinchbench.sh — Run CodeWhale through PinchBench.
+# run-pinchbench.sh — Run PinchBench benchmarks with CodeWhale model routing.
 #
-# PinchBench evaluates agent performance on real-world tasks. It normally
-# targets OpenClaw, but this script adapts the workflow for CodeWhale by
-# leveraging the OpenRouter-compatible model routing.
+# PinchBench evaluates agent performance on real-world tasks (calendar, email,
+# coding, research, file management). It uses OpenClaw as the agent runtime and
+# routes models through OpenRouter by default.
+#
+# Known issues with Xiaomi MiMo v2.5:
+#   1. PinchBench validates models against OpenRouter's /models endpoint.
+#      MiMo models MUST use the openrouter/ prefix or validation is skipped.
+#   2. PinchBench requires OPENROUTER_API_KEY even when using a direct provider.
+#      The --direct-mimo flag sets up a custom OpenAI-compatible endpoint in
+#      OpenClaw's models.json to bypass this.
+#   3. MiMo v2.5 Pro has a 128K context window but PinchBench tasks are small.
+#      No special handling needed, but worth noting for cost estimates.
+#   4. The Xiaomi Token Plan endpoint (token-plan-sgp.xiaomimimo.com) uses
+#      tp- prefixed keys. Pay-as-you-go (api.xiaomimimo.com) uses sk- keys.
+#      Make sure XIAOMI_MIMO_API_KEY matches the endpoint you're using.
+#   5. OpenRouter model ID for MiMo: xiaomi/mimo-v2.5-pro (Pro) or
+#      xiaomi/mimo-v2.5 (Omni). PinchBench expects the full provider/model.
 #
 # Usage:
 #   ./scripts/benchmarks/run-pinchbench.sh --help
-#   ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
+#   ./scripts/benchmarks/run-pinchbench.sh --model openrouter/xiaomi/mimo-v2.5-pro
+#   ./scripts/benchmarks/run-pinchbench.sh --direct-mimo --suite task_calendar
 #
 # Prerequisites:
-#   - PinchBench cloned (or install via this script)
+#   - PinchBench cloned (or use --install)
 #   - Python 3.10+ with uv
-#   - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
-#   - A running OpenClaw instance (PinchBench's default runtime)
+#   - OPENROUTER_API_KEY (for OpenRouter routing)
+#   - OR XIAOMI_MIMO_API_KEY + --direct-mimo (for direct Xiaomi API)
+#   - A running OpenClaw instance
 
 set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
 
-# Defaults
-MODEL="deepseek/deepseek-chat"
+# Defaults — MiMo v2.5 Pro via OpenRouter
+MODEL="openrouter/xiaomi/mimo-v2.5-pro"
 SUITE="all"
 PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
 RESULTS_DIR="./results/pinchbench"
@@ -29,19 +45,28 @@ INSTALL_PINCHBENCH=false
 RUNS=1
 JUDGE_MODEL=""
 NO_UPLOAD=true
+DIRECT_MIMO=false
+MIMO_BASE_URL=""
 EXTRA_ARGS=()
 
 usage() {
   cat <<EOF
 Usage: $(basename "$0") [OPTIONS]
 
-Run PinchBench benchmarks with CodeWhale-compatible model routing.
+Run PinchBench benchmarks. Defaults to Xiaomi MiMo v2.5 Pro via OpenRouter.
 
 Options:
-  --model MODEL         Model in provider/name format (default: deepseek/deepseek-chat)
-  --suite SUITE         Task suite: all, automated-only, or comma-separated IDs (default: all)
+  --model MODEL         Model ID (default: openrouter/xiaomi/mimo-v2.5-pro)
+                        Common values:
+                          openrouter/xiaomi/mimo-v2.5-pro — MiMo Pro via OpenRouter
+                          openrouter/xiaomi/mimo-v2.5 — MiMo Omni via OpenRouter
+                          openrouter/deepseek/deepseek-v4-pro — DeepSeek V4 Pro via OpenRouter
+  --suite SUITE         Task suite: all, automated-only, or comma-separated IDs
   --runs N              Runs per task for averaging (default: 1)
-  --judge MODEL         Judge model for LLM grading
+  --judge MODEL         Judge model for LLM grading (default: uses OpenClaw agent)
+  --direct-mimo         Route MiMo directly via Xiaomi API (bypasses OpenRouter)
+                        Requires XIAOMI_MIMO_API_KEY. Sets model to mimo-v2.5-pro.
+  --mimo-base-url URL   Override MiMo API base URL (default: Token Plan Singapore)
   --pinchbench-dir DIR  PinchBench install directory (default: /tmp/pinchbench)
   --results-dir DIR     Local results directory (default: ./results/pinchbench)
   --install             Install/clone PinchBench before running
@@ -49,15 +74,26 @@ Options:
   -- [EXTRA_ARGS...]    Additional arguments passed to PinchBench
   -h, --help            Show this help
 
+Environment variables:
+  OPENROUTER_API_KEY    Required for OpenRouter model routing
+  XIAOMI_MIMO_API_KEY   Required for --direct-mimo (or XIAOMI_API_KEY / MIMO_API_KEY)
+  XIAOMI_MIMO_BASE_URL  Override MiMo API endpoint
+
 Examples:
-  # Basic run with DeepSeek
-  $(basename "$0") --model deepseek/deepseek-chat
+  # MiMo v2.5 Pro via OpenRouter (default)
+  $(basename "$0")
 
-  # Install and run
-  $(basename "$0") --install --model deepseek/deepseek-chat
+  # MiMo v2.5 Pro via direct Xiaomi API
+  $(basename "$0") --direct-mimo
 
-  # Specific tasks only
-  $(basename "$0") --suite task_calendar,task_stock --model deepseek/deepseek-chat
+  # Specific tasks with MiMo
+  $(basename "$0") --suite task_calendar,task_stock
+
+  # Install PinchBench and run
+  $(basename "$0") --install
+
+  # DeepSeek V4 Pro via OpenRouter
+  $(basename "$0") --model openrouter/deepseek/deepseek-v4-pro
 EOF
 }
 
@@ -67,6 +103,8 @@ while [[ $# -gt 0 ]]; do
     --suite) SUITE="$2"; shift 2 ;;
     --runs) RUNS="$2"; shift 2 ;;
     --judge) JUDGE_MODEL="$2"; shift 2 ;;
+    --direct-mimo) DIRECT_MIMO=true; shift ;;
+    --mimo-base-url) MIMO_BASE_URL="$2"; shift 2 ;;
     --pinchbench-dir) PINCHBENCH_DIR="$2"; shift 2 ;;
     --results-dir) RESULTS_DIR="$2"; shift 2 ;;
     --install) INSTALL_PINCHBENCH=true; shift ;;
@@ -77,7 +115,57 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
-# Install PinchBench if requested
+# ── Direct MiMo mode ────────────────────────────────────────────────────────
+# When --direct-mimo is set, we configure PinchBench to use Xiaomi's API
+# directly instead of routing through OpenRouter. This creates a custom
+# OpenAI-compatible provider entry in OpenClaw's models.json.
+if [[ "$DIRECT_MIMO" == true ]]; then
+  MODEL="mimo-v2.5-pro"
+
+  # Resolve API key from multiple env var names
+  MIMO_KEY="${XIAOMI_MIMO_API_KEY:-${XIAOMI_API_KEY:-${MIMO_API_KEY:-}}}"
+  if [[ -z "$MIMO_KEY" ]]; then
+    echo "Error: --direct-mimo requires XIAOMI_MIMO_API_KEY (or XIAOMI_API_KEY / MIMO_API_KEY)" >&2
+    echo "  Token Plan keys (tp-...): https://token-plan-sgp.xiaomimimo.com/v1" >&2
+    echo "  Pay-as-you-go keys (sk-...): https://api.xiaomimimo.com/v1" >&2
+    exit 1
+  fi
+
+  # Determine base URL: flag > env > default (Token Plan Singapore)
+  if [[ -z "$MIMO_BASE_URL" ]]; then
+    MIMO_BASE_URL="${XIAOMI_MIMO_BASE_URL:-https://token-plan-sgp.xiaomimimo.com/v1}"
+  fi
+
+  # Detect key type and warn if mismatched
+  if [[ "$MIMO_KEY" == tp-* && "$MIMO_BASE_URL" == *"api.xiaomimimo.com"* ]]; then
+    echo "Warning: tp- key used with pay-as-you-go endpoint. Token Plan keys work with:" >&2
+    echo "  https://token-plan-sgp.xiaomimimo.com/v1" >&2
+  elif [[ "$MIMO_KEY" == sk-* && "$MIMO_BASE_URL" == *"token-plan"* ]]; then
+    echo "Warning: sk- key used with Token Plan endpoint. Pay-as-you-go keys work with:" >&2
+    echo "  https://api.xiaomimimo.com/v1" >&2
+  fi
+
+  echo "Direct MiMo mode:"
+  echo "  Model: $MODEL"
+  echo "  Endpoint: $MIMO_BASE_URL"
+  echo "  Key type: ${MIMO_KEY:0:3}..."
+  echo ""
+
+  # Export for PinchBench's lib_agent.py custom provider setup
+  export OPENAI_API_KEY="$MIMO_KEY"
+  export OPENAI_BASE_URL="$MIMO_BASE_URL"
+fi
+
+# ── Prereq checks ───────────────────────────────────────────────────────────
+if [[ "$DIRECT_MIMO" != true ]]; then
+  # OpenRouter mode — need the key
+  if [[ -z "${OPENROUTER_API_KEY:-}" ]]; then
+    echo "Warning: OPENROUTER_API_KEY not set. PinchBench may fail model validation." >&2
+    echo "  Either set OPENROUTER_API_KEY or use --direct-mimo with XIAOMI_MIMO_API_KEY." >&2
+  fi
+fi
+
+# ── Install PinchBench ──────────────────────────────────────────────────────
 if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
   echo "Installing PinchBench to $PINCHBENCH_DIR ..."
   if [[ -d "$PINCHBENCH_DIR" ]]; then
@@ -91,7 +179,6 @@ if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
   uv pip install -e .
 fi
 
-# Verify PinchBench is available
 if [[ ! -d "$PINCHBENCH_DIR" ]]; then
   echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
   echo "Run with --install to clone it automatically." >&2
@@ -100,21 +187,21 @@ fi
 
 cd "$PINCHBENCH_DIR"
 
-# Activate venv if it exists
 if [[ -f ".venv/bin/activate" ]]; then
   source .venv/bin/activate
 fi
 
 mkdir -p "$RESULTS_DIR"
 
-# Record metadata
+# ── Record metadata ─────────────────────────────────────────────────────────
 METADATA_FILE="$RESULTS_DIR/run_metadata.json"
 cat > "$METADATA_FILE" <<META
 {
   "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
   "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
-  "pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
+  "pinchbench_commit": "$(git -C "$PINCHBENCH_DIR" rev-parse HEAD 2>/dev/null || echo unknown)",
   "model": "$MODEL",
+  "routing": "$(if [[ "$DIRECT_MIMO" == true ]]; then echo "direct-xiaomi"; else echo "openrouter"; fi)",
   "suite": "$SUITE",
   "runs": $RUNS,
   "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
@@ -123,7 +210,7 @@ cat > "$METADATA_FILE" <<META
 META
 echo "Run metadata: $METADATA_FILE"
 
-# Build PinchBench command
+# ── Build and run PinchBench ────────────────────────────────────────────────
 PB_ARGS=("--model" "$MODEL" "--suite" "$SUITE" "--runs" "$RUNS" "--output-dir" "$RESULTS_DIR")
 
 if [[ -n "$JUDGE_MODEL" ]]; then
@@ -134,13 +221,23 @@ if [[ "$NO_UPLOAD" == true ]]; then
   PB_ARGS+=("--no-upload")
 fi
 
+# Pass the direct-mimo endpoint as --base-url (it is also exported above as OPENAI_BASE_URL)
+if [[ "$DIRECT_MIMO" == true ]]; then
+  PB_ARGS+=("--base-url" "$MIMO_BASE_URL")
+fi
+
 PB_ARGS+=("${EXTRA_ARGS[@]}")
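Review note: the script runs under `set -u`, and on Bash releases before 4.4 (including the 3.2 that ships with macOS) expanding an empty array as `"${EXTRA_ARGS[@]}"` raises an unbound-variable error. A sketch of the portable idiom follows; `build_args` is a hypothetical function used only to demonstrate it, and this is a suggestion rather than what the script currently does.

```bash
#!/usr/bin/env bash
# Hypothetical demo of appending a possibly empty array under `set -u`.
# ${extra[@]+"${extra[@]}"} expands to nothing when the array has no elements
# and to the individually quoted elements otherwise.
build_args() {
  local -a pb_args=("--model" "$1")
  shift
  local -a extra=("$@")
  pb_args+=(${extra[@]+"${extra[@]}"})
  printf '%s\n' "${pb_args[@]}"   # one argument per line
}

build_args "openrouter/xiaomi/mimo-v2.5-pro"                  # no extra args
build_args "openrouter/xiaomi/mimo-v2.5-pro" --flag "a b"     # spaces preserved
```

Applied to the line above, the same pattern would read `PB_ARGS+=(${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"})`.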
 
 echo "Running PinchBench..."
-echo "  Model:  $MODEL"
-echo "  Suite:  $SUITE"
-echo "  Runs:   $RUNS"
-echo "  Output: $RESULTS_DIR"
+echo "  Model:   $MODEL"
+echo "  Suite:   $SUITE"
+echo "  Runs:    $RUNS"
+echo "  Output:  $RESULTS_DIR"
+if [[ "$DIRECT_MIMO" == true ]]; then
+  echo "  Routing: Direct Xiaomi API ($MIMO_BASE_URL)"
+else
+  echo "  Routing: OpenRouter"
+fi
 echo ""
 
 ./scripts/run.sh "${PB_ARGS[@]}"