From a5f27aae3a4b054b9503f39333d07980c120eb3b Mon Sep 17 00:00:00 2001
From: Hunter B
Date: Thu, 4 Jun 2026 19:33:43 -0700
Subject: [PATCH] feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing

PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead
of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through
Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type
detection and endpoint mismatch warnings.

Harbor adapter gains --provider CLI flag for MiMo provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
---
 docs/BENCHMARKS.md                    |  37 +++---
 docs/MIMO_BENCHMARK_ISSUES.md         |  89 +++++++++++++++
 scripts/benchmarks/harbor/__init__.py |   6 +
 scripts/benchmarks/run-pinchbench.sh  | 157 +++++++++++++++++++++-----
 4 files changed, 246 insertions(+), 43 deletions(-)
 create mode 100644 docs/MIMO_BENCHMARK_ISSUES.md

diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
index 0d7f0e5f..390e2a02 100644
--- a/docs/BENCHMARKS.md
+++ b/docs/BENCHMARKS.md
@@ -112,26 +112,37 @@ agent runtime.
 ### Setup
 
 ```bash
-git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
-cd /tmp/pinchbench
-uv venv && source .venv/bin/activate
-uv pip install -e .
+./scripts/benchmarks/run-pinchbench.sh --install
 ```
 
-### Run
+### Run (MiMo v2.5 Pro — default)
 
 ```bash
-# Via the convenience script
-./scripts/benchmarks/run-pinchbench.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via OpenRouter (default)
+./scripts/benchmarks/run-pinchbench.sh
 
-# Or directly
-cd /tmp/pinchbench && ./scripts/run.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via direct Xiaomi API
+./scripts/benchmarks/run-pinchbench.sh --direct-mimo
+
+# Specific tasks
+./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock
 ```
 
+### Run (other models)
+
+```bash
+./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro
+```
+
+### MiMo v2.5 notes
+
+PinchBench routes through OpenRouter by default. MiMo models are available as
+`openrouter/xiaomi/mimo-v2.5-pro` (Pro) and `openrouter/xiaomi/mimo-v2.5`
+(Omni). For direct Xiaomi API access, use `--direct-mimo` with
+`XIAOMI_MIMO_API_KEY` set.
+
+See `scripts/benchmarks/run-pinchbench.sh --help` for full option reference.
+
 ## Reproducibility checklist
 
 When publishing benchmark results, record:
diff --git a/docs/MIMO_BENCHMARK_ISSUES.md b/docs/MIMO_BENCHMARK_ISSUES.md
new file mode 100644
index 00000000..25e155a1
--- /dev/null
+++ b/docs/MIMO_BENCHMARK_ISSUES.md
@@ -0,0 +1,89 @@
+# MiMo v2.5 Benchmarking — Known Issues
+
+Tracking doc for quirks and workarounds when benchmarking Xiaomi MiMo v2.5
+through CodeWhale's harness integrations.
+
+## PinchBench
+
+### Issue 1: Model validation requires OpenRouter prefix
+
+PinchBench validates models against OpenRouter's `/models` endpoint. If you
+pass `mimo-v2.5-pro` without the `openrouter/xiaomi/` prefix, validation is
+skipped entirely (it assumes it's a non-OpenRouter model). This means you
+won't know if the model ID is wrong until the run fails.
+
+**Workaround:** Always use `openrouter/xiaomi/mimo-v2.5-pro` for OpenRouter
+routing, or use `--direct-mimo` for the Xiaomi API.
+
+### Issue 2: PinchBench requires OPENROUTER_API_KEY
+
+Even when using a direct provider, PinchBench's `lib_agent.py` checks for
+`OPENROUTER_API_KEY` in some code paths. The `--direct-mimo` flag in our
+runner works around this by setting up a custom OpenAI-compatible provider
+entry in OpenClaw's `models.json` and exporting `OPENAI_API_KEY`/`OPENAI_BASE_URL`.
+
+### Issue 3: Token Plan vs Pay-as-you-go key mismatch
+
+Xiaomi MiMo has two API endpoints:
+- **Token Plan** (`tp-` keys): `https://token-plan-sgp.xiaomimimo.com/v1`
+- **Pay-as-you-go** (`sk-` keys): `https://api.xiaomimimo.com/v1`
+
+Using the wrong key type with the wrong endpoint produces auth errors. The
+runner now detects this and warns.
+
+### Issue 4: OpenClaw is the runtime, not CodeWhale
+
+PinchBench runs tasks through OpenClaw, not CodeWhale. This means the
+benchmark measures MiMo v2.5's performance through OpenClaw's agent harness,
+not through CodeWhale's tool system. For CodeWhale-native evaluation,
+Terminal-Bench (via Harbor) is the better fit.
+
+**Future:** Create a CodeWhale-native PinchBench adapter that loads tasks
+from PinchBench's `tasks/` directory and runs them through `codewhale exec`.
+
+## Terminal-Bench (Harbor)
+
+### Issue 1: MiMo provider routing
+
+Harbor passes models as `provider/model` format. For MiMo via OpenRouter,
+use `openrouter/xiaomi/mimo-v2.5-pro`. For direct Xiaomi API, pass
+`--provider xiaomi-mimo` as an extra agent flag.
+
+### Issue 2: Container environment
+
+The Harbor adapter installs codewhale via npm in the container. MiMo API
+keys must be forwarded from the host environment. The adapter checks for
+`XIAOMI_MIMO_API_KEY`, `OPENROUTER_API_KEY`, and `OPENAI_API_KEY`.
+
+## SWE-bench
+
+### Issue 1: MiMo thinking mode
+
+MiMo v2.5 Pro supports extended thinking. For SWE-bench patch generation,
+ensure the thinking level is set appropriately. The `--thinking high` flag
+is passed through the CLI.
+
+### Issue 2: Context window
+
+MiMo v2.5 Pro has a 128K context window. Large SWE-bench instances (e.g.,
+Django, sympy) may benefit from the full window. No special handling needed,
+but worth monitoring token usage.
+
+## Environment Variables Reference
+
+```
+# Xiaomi MiMo direct API
+XIAOMI_MIMO_API_KEY=tp-...  # Token Plan key
+XIAOMI_MIMO_API_KEY=sk-...  # Pay-as-you-go key
+XIAOMI_MIMO_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1
+XIAOMI_MIMO_MODEL=mimo-v2.5-pro
+
+# Aliases also accepted
+XIAOMI_API_KEY=...
+MIMO_API_KEY=...
+MIMO_BASE_URL=...
+MIMO_MODEL=...
+
+# OpenRouter (for MiMo via OpenRouter)
+OPENROUTER_API_KEY=...
+```
diff --git a/scripts/benchmarks/harbor/__init__.py b/scripts/benchmarks/harbor/__init__.py
index 3bde0431..122cc03d 100644
--- a/scripts/benchmarks/harbor/__init__.py
+++ b/scripts/benchmarks/harbor/__init__.py
@@ -52,6 +52,12 @@ class CodeWhaleAgent(BaseInstalledAgent):
             type="str",
             default="high",
         ),
+        CliFlag(
+            "provider",
+            cli="--provider",
+            type="str",
+            default=None,
+        ),
     ]
 
     @staticmethod
diff --git a/scripts/benchmarks/run-pinchbench.sh b/scripts/benchmarks/run-pinchbench.sh
index e86740ad..788f4028 100755
--- a/scripts/benchmarks/run-pinchbench.sh
+++ b/scripts/benchmarks/run-pinchbench.sh
@@ -1,27 +1,43 @@
 #!/usr/bin/env bash
-# run-pinchbench.sh — Run CodeWhale through PinchBench.
+# run-pinchbench.sh — Run PinchBench benchmarks with CodeWhale model routing.
 #
-# PinchBench evaluates agent performance on real-world tasks. It normally
-# targets OpenClaw, but this script adapts the workflow for CodeWhale by
-# leveraging the OpenRouter-compatible model routing.
+# PinchBench evaluates agent performance on real-world tasks (calendar, email,
+# coding, research, file management). It uses OpenClaw as the agent runtime and
+# routes models through OpenRouter by default.
+#
+# Known issues with Xiaomi MiMo v2.5:
+#   1. PinchBench validates models against OpenRouter's /models endpoint.
+#      MiMo models MUST use the openrouter/ prefix or validation is skipped.
+#   2. PinchBench requires OPENROUTER_API_KEY even when using a direct provider.
+#      The --direct-mimo flag sets up a custom OpenAI-compatible endpoint in
+#      OpenClaw's models.json to bypass this.
+#   3. MiMo v2.5 Pro has a 128K context window but PinchBench tasks are small.
+#      No special handling needed, but worth noting for cost estimates.
+#   4. The Xiaomi Token Plan endpoint (token-plan-sgp.xiaomimimo.com) uses
+#      tp- prefixed keys. Pay-as-you-go (api.xiaomimimo.com) uses sk- keys.
+#      Make sure XIAOMI_MIMO_API_KEY matches the endpoint you're using.
+#   5. OpenRouter model ID for MiMo: xiaomi/mimo-v2.5-pro (Pro) or
+#      xiaomi/mimo-v2.5 (Omni). PinchBench expects the full provider/model.
 #
 # Usage:
 #   ./scripts/benchmarks/run-pinchbench.sh --help
-#   ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
+#   ./scripts/benchmarks/run-pinchbench.sh --model xiaomi/mimo-v2.5-pro
+#   ./scripts/benchmarks/run-pinchbench.sh --direct-mimo --suite task_calendar
 #
 # Prerequisites:
-#   - PinchBench cloned (or install via this script)
+#   - PinchBench cloned (or use --install)
 #   - Python 3.10+ with uv
-#   - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
-#   - A running OpenClaw instance (PinchBench's default runtime)
+#   - OPENROUTER_API_KEY (for OpenRouter routing)
+#   - OR XIAOMI_MIMO_API_KEY + --direct-mimo (for direct Xiaomi API)
+#   - A running OpenClaw instance
 
 set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
 
-# Defaults
-MODEL="deepseek/deepseek-chat"
+# Defaults — MiMo v2.5 Pro via OpenRouter
+MODEL="openrouter/xiaomi/mimo-v2.5-pro"
 SUITE="all"
 PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
 RESULTS_DIR="./results/pinchbench"
@@ -29,19 +45,28 @@ INSTALL_PINCHBENCH=false
 RUNS=1
 JUDGE_MODEL=""
 NO_UPLOAD=true
+DIRECT_MIMO=false
+MIMO_BASE_URL=""
 EXTRA_ARGS=()
 
 usage() {
   cat <&2
+    echo " Token Plan keys (tp-...): https://token-plan-sgp.xiaomimimo.com/v1" >&2
+    echo " Pay-as-you-go keys (sk-...): https://api.xiaomimimo.com/v1" >&2
+    exit 1
+  fi
+
+  # Determine base URL: flag > env > default (Token Plan Singapore)
+  if [[ -z "$MIMO_BASE_URL" ]]; then
+    MIMO_BASE_URL="${XIAOMI_MIMO_BASE_URL:-https://token-plan-sgp.xiaomimimo.com/v1}"
+  fi
+
+  # Detect key type and warn if mismatched
+  if [[ "$MIMO_KEY" == tp-* && "$MIMO_BASE_URL" == *"api.xiaomimimo.com"* ]]; then
+    echo "Warning: tp- key used with pay-as-you-go endpoint. Token Plan keys work with:" >&2
+    echo " https://token-plan-sgp.xiaomimimo.com/v1" >&2
+  elif [[ "$MIMO_KEY" == sk-* && "$MIMO_BASE_URL" == *"token-plan"* ]]; then
+    echo "Warning: sk- key used with Token Plan endpoint. Pay-as-you-go keys work with:" >&2
+    echo " https://api.xiaomimimo.com/v1" >&2
+  fi
+
+  echo "Direct MiMo mode:"
+  echo " Model: $MODEL"
+  echo " Endpoint: $MIMO_BASE_URL"
+  echo " Key type: ${MIMO_KEY:0:3}..."
+  echo ""
+
+  # Export for PinchBench's lib_agent.py custom provider setup
+  export OPENAI_API_KEY="$MIMO_KEY"
+  export OPENAI_BASE_URL="$MIMO_BASE_URL"
+fi
+
+# ── Prereq checks ───────────────────────────────────────────────────────────
+if [[ "$DIRECT_MIMO" != true ]]; then
+  # OpenRouter mode — need the key
+  if [[ -z "${OPENROUTER_API_KEY:-}" ]]; then
+    echo "Warning: OPENROUTER_API_KEY not set. PinchBench may fail model validation." >&2
+    echo " Either set OPENROUTER_API_KEY or use --direct-mimo with XIAOMI_MIMO_API_KEY." >&2
+  fi
+fi
+
+# ── Install PinchBench ──────────────────────────────────────────────────────
 if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
   echo "Installing PinchBench to $PINCHBENCH_DIR ..."
   if [[ -d "$PINCHBENCH_DIR" ]]; then
@@ -91,7 +179,6 @@ if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
   uv pip install -e .
 fi
 
-# Verify PinchBench is available
 if [[ ! -d "$PINCHBENCH_DIR" ]]; then
   echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
   echo "Run with --install to clone it automatically." >&2
@@ -100,21 +187,21 @@ fi
 
 cd "$PINCHBENCH_DIR"
 
-# Activate venv if it exists
 if [[ -f ".venv/bin/activate" ]]; then
   source .venv/bin/activate
 fi
 
 mkdir -p "$RESULTS_DIR"
 
-# Record metadata
+# ── Record metadata ─────────────────────────────────────────────────────────
 METADATA_FILE="$RESULTS_DIR/run_metadata.json"
 cat > "$METADATA_FILE" </dev/null || echo unknown)",
   "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
-  "pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
+  "pinchbench_commit": "$(git -C "$PINCHBENCH_DIR" rev-parse HEAD 2>/dev/null || echo unknown)",
   "model": "$MODEL",
+  "routing": "$(if [[ "$DIRECT_MIMO" == true ]]; then echo "direct-xiaomi"; else echo "openrouter"; fi)",
   "suite": "$SUITE",
   "runs": $RUNS,
   "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
@@ -123,7 +210,7 @@ cat > "$METADATA_FILE" <