feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing

PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead of
deepseek/deepseek-chat. Adds --direct-mimo flag for routing through Xiaomi's
API directly (bypasses OpenRouter), with tp-/sk- key type detection and
endpoint mismatch warnings. Harbor adapter gains --provider CLI flag for MiMo
provider routing.

Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md:
- PinchBench model validation requires OpenRouter prefix
- OPENROUTER_API_KEY needed even for some direct-provider paths
- Token Plan vs pay-as-you-go key/endpoint mismatch
- PinchBench runs through OpenClaw, not CodeWhale
2026-06-04 19:33:43 -07:00
parent b329a532f5
commit a5f27aae3a
4 changed files with 246 additions and 43 deletions
@@ -112,26 +112,37 @@ agent runtime.
 ### Setup

 ```bash
-git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
-cd /tmp/pinchbench
-uv venv && source .venv/bin/activate
-uv pip install -e .
+./scripts/benchmarks/run-pinchbench.sh --install
 ```

-### Run
+### Run (MiMo v2.5 Pro — default)

 ```bash
-# Via the convenience script
-./scripts/benchmarks/run-pinchbench.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via OpenRouter (default)
+./scripts/benchmarks/run-pinchbench.sh

-# Or directly
-cd /tmp/pinchbench && ./scripts/run.sh \
-  --model deepseek/deepseek-chat \
-  --suite all
+# MiMo v2.5 Pro via direct Xiaomi API
+./scripts/benchmarks/run-pinchbench.sh --direct-mimo
+
+# Specific tasks
+./scripts/benchmarks/run-pinchbench.sh --suite task_calendar,task_stock
 ```

+### Run (other models)
+
+```bash
+./scripts/benchmarks/run-pinchbench.sh --model openrouter/deepseek/deepseek-v4-pro
+```
+
+### MiMo v2.5 notes
+
+PinchBench routes through OpenRouter by default. MiMo models are available as
+`openrouter/xiaomi/mimo-v2.5-pro` (Pro) and `openrouter/xiaomi/mimo-v2.5`
+(Omni). For direct Xiaomi API access, use `--direct-mimo` with
+`XIAOMI_MIMO_API_KEY` set.
+
+See `scripts/benchmarks/run-pinchbench.sh --help` for full option reference.
+
 ## Reproducibility checklist

 When publishing benchmark results, record:
@@ -0,0 +1,89 @@
+# MiMo v2.5 Benchmarking — Known Issues
+
+Tracking doc for quirks and workarounds when benchmarking Xiaomi MiMo v2.5
+through CodeWhale's harness integrations.
+
+## PinchBench
+
+### Issue 1: Model validation requires OpenRouter prefix
+
+PinchBench validates models against OpenRouter's `/models` endpoint. If you
+pass `mimo-v2.5-pro` without the `openrouter/xiaomi/` prefix, PinchBench
+assumes a non-OpenRouter model and skips validation entirely, so a wrong
+model ID goes unnoticed until the run fails.
+
+**Workaround:** Always use `openrouter/xiaomi/mimo-v2.5-pro` for OpenRouter
+routing, or use `--direct-mimo` for the Xiaomi API.
+
+### Issue 2: PinchBench requires OPENROUTER_API_KEY
+
+Even when using a direct provider, PinchBench's `lib_agent.py` checks for
+`OPENROUTER_API_KEY` in some code paths. The `--direct-mimo` flag in our
+runner works around this by setting up a custom OpenAI-compatible provider
+entry in OpenClaw's `models.json` and exporting `OPENAI_API_KEY`/`OPENAI_BASE_URL`.
+
+### Issue 3: Token Plan vs Pay-as-you-go key mismatch
+
+Xiaomi MiMo has two API endpoints:
+- **Token Plan** (`tp-` keys): `https://token-plan-sgp.xiaomimimo.com/v1`
+- **Pay-as-you-go** (`sk-` keys): `https://api.xiaomimimo.com/v1`
+
+Using a key with the other plan's endpoint produces auth errors. The runner
+now detects this mismatch and warns.
+
+### Issue 4: OpenClaw is the runtime, not CodeWhale
+
+PinchBench runs tasks through OpenClaw, not CodeWhale. This means the
+benchmark measures MiMo v2.5's performance through OpenClaw's agent harness,
+not through CodeWhale's tool system. For CodeWhale-native evaluation,
+Terminal-Bench (via Harbor) is the better fit.
+
+**Future:** Create a CodeWhale-native PinchBench adapter that loads tasks
+from PinchBench's `tasks/` directory and runs them through `codewhale exec`.
+
+## Terminal-Bench (Harbor)
+
+### Issue 1: MiMo provider routing
+
+Harbor passes models in `provider/model` format. For MiMo via OpenRouter,
+use `openrouter/xiaomi/mimo-v2.5-pro`. For direct Xiaomi API, pass
+`--provider xiaomi-mimo` as an extra agent flag.
+
+### Issue 2: Container environment
+
+The Harbor adapter installs codewhale via npm in the container. MiMo API
+keys must be forwarded from the host environment. The adapter checks for
+`XIAOMI_MIMO_API_KEY`, `OPENROUTER_API_KEY`, and `OPENAI_API_KEY`.
+
+## SWE-bench
+
+### Issue 1: MiMo thinking mode
+
+MiMo v2.5 Pro supports extended thinking. For SWE-bench patch generation,
+ensure the thinking level is set appropriately. The `--thinking high` flag
+is passed through the CLI.
+
+### Issue 2: Context window
+
+MiMo v2.5 Pro has a 128K context window. Large SWE-bench instances (e.g.,
+Django, sympy) may benefit from the full window. No special handling needed,
+but worth monitoring token usage.
+
+## Environment Variables Reference
+
+```
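+# Xiaomi MiMo direct API (set one key; it must match the base URL)
+XIAOMI_MIMO_API_KEY=tp-...    # Token Plan key
+XIAOMI_MIMO_API_KEY=sk-...    # Pay-as-you-go key
+XIAOMI_MIMO_BASE_URL=https://token-plan-sgp.xiaomimimo.com/v1   # runner default; use https://api.xiaomimimo.com/v1 with sk- keys
+XIAOMI_MIMO_MODEL=mimo-v2.5-pro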
+
+# Aliases also accepted
+XIAOMI_API_KEY=...
+MIMO_API_KEY=...
+MIMO_BASE_URL=...
+MIMO_MODEL=...
+
+# OpenRouter (for MiMo via OpenRouter)
+OPENROUTER_API_KEY=...
+```
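The key/endpoint pairing described under Issue 3 can be sketched as a small helper. This is a minimal illustration rather than code from the commit: the function names `mimo_endpoint_for_key` and `mimo_key_matches_endpoint` are hypothetical, and the sketch assumes the two endpoints listed in the doc are the only ones in use.

```shell
# Hypothetical helper: map a MiMo key prefix to the endpoint named in the doc.
# Prints the endpoint, or returns 1 for an unrecognized key prefix.
mimo_endpoint_for_key() {
    case "$1" in
        tp-*) echo "https://token-plan-sgp.xiaomimimo.com/v1" ;;  # Token Plan
        sk-*) echo "https://api.xiaomimimo.com/v1" ;;             # Pay-as-you-go
        *)    return 1 ;;
    esac
}

# Hypothetical helper: succeed only when the key type and endpoint agree.
# Returns 2 when the key prefix is unknown, 1 on a mismatch.
mimo_key_matches_endpoint() {
    expected="$(mimo_endpoint_for_key "$1")" || return 2
    [ "$expected" = "$2" ]
}
```

Note that this sketch compares endpoints exactly, whereas the runner in this commit matches on substrings of the base URL and only warns, so a custom or regional endpoint would need a looser comparison.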
@@ -52,6 +52,12 @@ class CodeWhaleAgent(BaseInstalledAgent):
            type="str",
            default="high",
        ),
+        CliFlag(
+            "provider",
+            cli="--provider",
+            type="str",
+            default=None,
+        ),
    ]

    @staticmethod
@@ -1,27 +1,43 @@
 #!/usr/bin/env bash
-# run-pinchbench.sh — Run CodeWhale through PinchBench.
+# run-pinchbench.sh — Run PinchBench benchmarks with CodeWhale model routing.
 #
-# PinchBench evaluates agent performance on real-world tasks. It normally
-# targets OpenClaw, but this script adapts the workflow for CodeWhale by
-# leveraging the OpenRouter-compatible model routing.
+# PinchBench evaluates agent performance on real-world tasks (calendar, email,
+# coding, research, file management). It uses OpenClaw as the agent runtime and
+# routes models through OpenRouter by default.
+#
+# Known issues with Xiaomi MiMo v2.5:
+#   1. PinchBench validates models against OpenRouter's /models endpoint.
+#      MiMo models MUST use the openrouter/ prefix or validation is skipped.
+#   2. PinchBench requires OPENROUTER_API_KEY even when using a direct provider.
+#      The --direct-mimo flag sets up a custom OpenAI-compatible endpoint in
+#      OpenClaw's models.json to bypass this.
+#   3. MiMo v2.5 Pro has a 128K context window but PinchBench tasks are small.
+#      No special handling needed, but worth noting for cost estimates.
+#   4. The Xiaomi Token Plan endpoint (token-plan-sgp.xiaomimimo.com) uses
+#      tp- prefixed keys. Pay-as-you-go (api.xiaomimimo.com) uses sk- keys.
+#      Make sure XIAOMI_MIMO_API_KEY matches the endpoint you're using.
+#   5. OpenRouter model IDs for MiMo: xiaomi/mimo-v2.5-pro (Pro) and
+#      xiaomi/mimo-v2.5 (Omni). Pass them to PinchBench with the openrouter/ prefix.
 #
 # Usage:
 #   ./scripts/benchmarks/run-pinchbench.sh --help
-#   ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
+#   ./scripts/benchmarks/run-pinchbench.sh --model openrouter/xiaomi/mimo-v2.5-pro
+#   ./scripts/benchmarks/run-pinchbench.sh --direct-mimo --suite task_calendar
 #
 # Prerequisites:
-#   - PinchBench cloned (or install via this script)
+#   - PinchBench cloned (or use --install)
 #   - Python 3.10+ with uv
-#   - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
-#   - A running OpenClaw instance (PinchBench's default runtime)
+#   - OPENROUTER_API_KEY (for OpenRouter routing)
+#   - OR XIAOMI_MIMO_API_KEY + --direct-mimo (for direct Xiaomi API)
+#   - A running OpenClaw instance

 set -euo pipefail

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

-# Defaults
-MODEL="deepseek/deepseek-chat"
+# Defaults — MiMo v2.5 Pro via OpenRouter
+MODEL="openrouter/xiaomi/mimo-v2.5-pro"
 SUITE="all"
 PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
 RESULTS_DIR="./results/pinchbench"
@@ -29,19 +45,28 @@ INSTALL_PINCHBENCH=false
 RUNS=1
 JUDGE_MODEL=""
 NO_UPLOAD=true
+DIRECT_MIMO=false
+MIMO_BASE_URL=""
 EXTRA_ARGS=()

 usage() {
    cat <<EOF
 Usage: $(basename "$0") [OPTIONS]

-Run PinchBench benchmarks with CodeWhale-compatible model routing.
+Run PinchBench benchmarks. Defaults to Xiaomi MiMo v2.5 Pro via OpenRouter.

 Options:
-  --model MODEL           Model in provider/name format (default: deepseek/deepseek-chat)
-  --suite SUITE           Task suite: all, automated-only, or comma-separated IDs (default: all)
+  --model MODEL           Model ID (default: openrouter/xiaomi/mimo-v2.5-pro)
+                          Common values:
+                            openrouter/xiaomi/mimo-v2.5-pro  — MiMo Pro via OpenRouter
+                            openrouter/xiaomi/mimo-v2.5      — MiMo Omni via OpenRouter
+                            openrouter/deepseek/deepseek-v4-pro — DeepSeek V4 Pro via OpenRouter
+  --suite SUITE           Task suite: all, automated-only, or comma-separated IDs
  --runs N                Runs per task for averaging (default: 1)
-  --judge MODEL           Judge model for LLM grading
+  --judge MODEL           Judge model for LLM grading (default: uses OpenClaw agent)
+  --direct-mimo           Route MiMo directly via Xiaomi API (bypasses OpenRouter)
+                          Requires XIAOMI_MIMO_API_KEY. Sets model to mimo-v2.5-pro.
+  --mimo-base-url URL     Override MiMo API base URL (default: Token Plan Singapore)
  --pinchbench-dir DIR    PinchBench install directory (default: /tmp/pinchbench)
  --results-dir DIR       Local results directory (default: ./results/pinchbench)
  --install               Install/clone PinchBench before running
@@ -49,15 +74,26 @@ Options:
  -- [EXTRA_ARGS...]      Additional arguments passed to PinchBench
  -h, --help              Show this help

+Environment variables:
+  OPENROUTER_API_KEY      Required for OpenRouter model routing
+  XIAOMI_MIMO_API_KEY     Required for --direct-mimo (or XIAOMI_API_KEY / MIMO_API_KEY)
+  XIAOMI_MIMO_BASE_URL    Override MiMo API endpoint
+
 Examples:
-  # Basic run with DeepSeek
-  $(basename "$0") --model deepseek/deepseek-chat
+  # MiMo v2.5 Pro via OpenRouter (default)
+  $(basename "$0")

-  # Install and run
-  $(basename "$0") --install --model deepseek/deepseek-chat
+  # MiMo v2.5 Pro via direct Xiaomi API
+  $(basename "$0") --direct-mimo

-  # Specific tasks only
-  $(basename "$0") --suite task_calendar,task_stock --model deepseek/deepseek-chat
+  # Specific tasks with MiMo
+  $(basename "$0") --suite task_calendar,task_stock
+
+  # Install PinchBench and run
+  $(basename "$0") --install
+
+  # DeepSeek V4 Pro via OpenRouter
+  $(basename "$0") --model openrouter/deepseek/deepseek-v4-pro
 EOF
 }

@@ -67,6 +103,8 @@ while [[ $# -gt 0 ]]; do
        --suite) SUITE="$2"; shift 2 ;;
        --runs) RUNS="$2"; shift 2 ;;
        --judge) JUDGE_MODEL="$2"; shift 2 ;;
+        --direct-mimo) DIRECT_MIMO=true; shift ;;
+        --mimo-base-url) MIMO_BASE_URL="$2"; shift 2 ;;
        --pinchbench-dir) PINCHBENCH_DIR="$2"; shift 2 ;;
        --results-dir) RESULTS_DIR="$2"; shift 2 ;;
        --install) INSTALL_PINCHBENCH=true; shift ;;
@@ -77,7 +115,57 @@ while [[ $# -gt 0 ]]; do
    esac
 done

-# Install PinchBench if requested
+# ── Direct MiMo mode ────────────────────────────────────────────────────────
+# When --direct-mimo is set, we configure PinchBench to use Xiaomi's API
+# directly instead of routing through OpenRouter. This creates a custom
+# OpenAI-compatible provider entry in OpenClaw's models.json.
+if [[ "$DIRECT_MIMO" == true ]]; then
+    MODEL="mimo-v2.5-pro"
+
+    # Resolve API key from multiple env var names
+    MIMO_KEY="${XIAOMI_MIMO_API_KEY:-${XIAOMI_API_KEY:-${MIMO_API_KEY:-}}}"
+    if [[ -z "$MIMO_KEY" ]]; then
+        echo "Error: --direct-mimo requires XIAOMI_MIMO_API_KEY (or XIAOMI_API_KEY / MIMO_API_KEY)" >&2
+        echo "  Token Plan keys (tp-...): https://token-plan-sgp.xiaomimimo.com/v1" >&2
+        echo "  Pay-as-you-go keys (sk-...): https://api.xiaomimimo.com/v1" >&2
+        exit 1
+    fi
+
+    # Determine base URL: flag > env > default (Token Plan Singapore)
+    if [[ -z "$MIMO_BASE_URL" ]]; then
+        MIMO_BASE_URL="${XIAOMI_MIMO_BASE_URL:-https://token-plan-sgp.xiaomimimo.com/v1}"
+    fi
+
+    # Detect key type and warn if mismatched
+    if [[ "$MIMO_KEY" == tp-* && "$MIMO_BASE_URL" == *"api.xiaomimimo.com"* ]]; then
+        echo "Warning: tp- key used with pay-as-you-go endpoint. Token Plan keys work with:" >&2
+        echo "  https://token-plan-sgp.xiaomimimo.com/v1" >&2
+    elif [[ "$MIMO_KEY" == sk-* && "$MIMO_BASE_URL" == *"token-plan"* ]]; then
+        echo "Warning: sk- key used with Token Plan endpoint. Pay-as-you-go keys work with:" >&2
+        echo "  https://api.xiaomimimo.com/v1" >&2
+    fi
+
+    echo "Direct MiMo mode:"
+    echo "  Model:    $MODEL"
+    echo "  Endpoint: $MIMO_BASE_URL"
+    echo "  Key type: ${MIMO_KEY:0:3}..."
+    echo ""
+
+    # Export for PinchBench's lib_agent.py custom provider setup
+    export OPENAI_API_KEY="$MIMO_KEY"
+    export OPENAI_BASE_URL="$MIMO_BASE_URL"
+fi
+
+# ── Prereq checks ───────────────────────────────────────────────────────────
+if [[ "$DIRECT_MIMO" != true ]]; then
+    # OpenRouter mode — need the key
+    if [[ -z "${OPENROUTER_API_KEY:-}" ]]; then
+        echo "Warning: OPENROUTER_API_KEY not set. PinchBench may fail model validation." >&2
+        echo "  Either set OPENROUTER_API_KEY or use --direct-mimo with XIAOMI_MIMO_API_KEY." >&2
+    fi
+fi
+
+# ── Install PinchBench ──────────────────────────────────────────────────────
 if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
    echo "Installing PinchBench to $PINCHBENCH_DIR ..."
    if [[ -d "$PINCHBENCH_DIR" ]]; then
@@ -91,7 +179,6 @@ if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
    uv pip install -e .
 fi

-# Verify PinchBench is available
 if [[ ! -d "$PINCHBENCH_DIR" ]]; then
    echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
    echo "Run with --install to clone it automatically." >&2
@@ -100,21 +187,21 @@ fi

 cd "$PINCHBENCH_DIR"

-# Activate venv if it exists
 if [[ -f ".venv/bin/activate" ]]; then
    source .venv/bin/activate
 fi

 mkdir -p "$RESULTS_DIR"

-# Record metadata
+# ── Record metadata ─────────────────────────────────────────────────────────
 METADATA_FILE="$RESULTS_DIR/run_metadata.json"
 cat > "$METADATA_FILE" <<META
 {
    "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
    "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
-    "pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
+    "pinchbench_commit": "$(git -C "$PINCHBENCH_DIR" rev-parse HEAD 2>/dev/null || echo unknown)",
    "model": "$MODEL",
+    "routing": "$(if [[ "$DIRECT_MIMO" == true ]]; then echo "direct-xiaomi"; else echo "openrouter"; fi)",
    "suite": "$SUITE",
    "runs": $RUNS,
    "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
@@ -123,7 +210,7 @@ cat > "$METADATA_FILE" <<META
 META
 echo "Run metadata: $METADATA_FILE"

-# Build PinchBench command
+# ── Build and run PinchBench ────────────────────────────────────────────────
 PB_ARGS=("--model" "$MODEL" "--suite" "$SUITE" "--runs" "$RUNS" "--output-dir" "$RESULTS_DIR")

 if [[ -n "$JUDGE_MODEL" ]]; then
@@ -134,13 +221,23 @@ if [[ "$NO_UPLOAD" == true ]]; then
    PB_ARGS+=("--no-upload")
 fi

+# Also pass the direct-mimo endpoint on the command line (OPENAI_BASE_URL is exported above)
+if [[ "$DIRECT_MIMO" == true ]]; then
+    PB_ARGS+=("--base-url" "$MIMO_BASE_URL")
+fi
+
 PB_ARGS+=("${EXTRA_ARGS[@]}")

 echo "Running PinchBench..."
-echo "  Model:  $MODEL"
-echo "  Suite:  $SUITE"
-echo "  Runs:   $RUNS"
-echo "  Output: $RESULTS_DIR"
+echo "  Model:    $MODEL"
+echo "  Suite:    $SUITE"
+echo "  Runs:     $RUNS"
+echo "  Output:   $RESULTS_DIR"
+if [[ "$DIRECT_MIMO" == true ]]; then
+    echo "  Routing:  Direct Xiaomi API ($MIMO_BASE_URL)"
+else
+    echo "  Routing:  OpenRouter"
+fi
 echo ""

 ./scripts/run.sh "${PB_ARGS[@]}"
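Because the script header notes that validation is skipped for model IDs without the `openrouter/` prefix, a pre-flight classification could flag such IDs before a run starts. The following is a rough sketch under that assumption: `classify_model_id` is a hypothetical helper that is not part of this commit, and it only mirrors the prefix rule as described in the comments, not PinchBench's actual validation code.

```shell
# Hypothetical pre-flight helper (not in run-pinchbench.sh).
# Classifies a model ID by the prefix rule described in the header comment:
#   openrouter/<vendor>/<model>  -> checked against OpenRouter's /models
#   anything else                -> validation is skipped
classify_model_id() {
    case "$1" in
        openrouter/*/*) echo "openrouter" ;;
        *)              echo "unvalidated" ;;
    esac
}
```

A caller could print a warning when the result is `unvalidated` and `--direct-mimo` is not set, which would cover the case where `xiaomi/mimo-v2.5-pro` is passed without the prefix.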