feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration

Benchmark harness for evaluating CodeWhale against three external benchmarks:

- SWE-bench: batch driver wrapping existing codewhale swebench commands
- Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval
- PinchBench: runner with auto-install for real-world agent tasks

Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and
reproducibility checklist. Scripts record version/commit/timestamp metadata
for each run.

Branch: codex/v0.8.53-benchmarks (based on v0.8.53)
2026-06-04 19:21:23 -07:00
parent 8dff2f7525
commit b329a532f5
7 changed files with 792 additions and 0 deletions
@@ -0,0 +1,153 @@
+# Benchmarks
+
+CodeWhale integrates with three external benchmarks to measure real-world
+coding-agent performance. Each benchmark tests a different surface:
+
+| Benchmark | What it tests | Harness | Output format |
+|---|---|---|---|
+| **SWE-bench** | Patch generation from GitHub issues | CodeWhale built-in (`codewhale swebench`) | `all_preds.jsonl` |
+| **Terminal-Bench** | End-to-end terminal tasks (compile, deploy, configure) | Harbor framework adapter | Harbor result JSON |
+| **PinchBench** | Real-world agent tasks (calendar, email, coding, research) | Standalone runner via OpenClaw-compatible adapter | PinchBench result JSON |
+
+All three require Docker. SWE-bench and Terminal-Bench also need their official
+evaluation harnesses (the `swebench` package and Harbor) installed separately.
+
+## Prerequisites
+
+```bash
+# Docker (all benchmarks)
+docker --version
+
+# Python 3.10+ with uv (Terminal-Bench, PinchBench, SWE-bench eval)
+python3 --version
+uv --version
+
+# CodeWhale v0.8.53+
+codewhale --version
+
+# API key
+export DEEPSEEK_API_KEY="sk-..."
+```
+
+## SWE-bench
+
+CodeWhale has built-in SWE-bench support via `codewhale swebench run` and
+`codewhale swebench export`. See [docs/SWEBENCH.md](SWEBENCH.md) for the
+single-instance workflow.
+
+### Batch run
+
+```bash
+# Batch mode: currently prints the manual per-instance steps (automated
+# orchestration is planned for v0.9.0)
+./scripts/benchmarks/run-swebench.sh \
+  --dataset princeton-nlp/SWE-bench_Lite \
+  --split test \
+  --predictions-path ./results/swebench_preds.jsonl
+
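+# Run a single instance (expects a prepared workspace at
+# /tmp/swebench-workspaces/<instance-id>, or under --workspace-base)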
+./scripts/benchmarks/run-swebench.sh \
+  --instance-id django__django-12345 \
+  --issue-file ./issue.md \
+  --predictions-path ./results/swebench_preds.jsonl
+```
+
+### Evaluate
+
+```bash
+python -m swebench.harness.run_evaluation \
+  --dataset_name princeton-nlp/SWE-bench_Lite \
+  --predictions_path ./results/swebench_preds.jsonl \
+  --max_workers 1 \
+  --run_id codewhale-v0.8.53
+```
+
+## Terminal-Bench (via Harbor)
+
+Terminal-Bench tests agents on real terminal tasks — compiling, deploying,
+configuring servers, training models. The [Harbor framework](https://github.com/harbor-framework/harbor)
+is the official harness.
+
+CodeWhale plugs in via a Harbor adapter (`scripts/benchmarks/harbor/codewhale_agent.py`).
+
+### Setup
+
+```bash
+pip install harbor
+```
+
+### Run
+
+```bash
+# Via the convenience script
+./scripts/benchmarks/run-terminal-bench.sh \
+  --dataset terminal-bench@2.0 \
+  --model deepseek/deepseek-chat \
+  --n-concurrent 4
+
+# Or directly with harbor
+harbor run \
+  --dataset terminal-bench@2.0 \
+  --agent codewhale \
+  --model deepseek/deepseek-chat \
+  --n-concurrent 4
+```
+
+### Custom agent path
+
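+If the adapter is not registered in Harbor under the `codewhale` name, point
+Harbor at it by import path. Run this from the repository root so that
+`scripts.benchmarks.harbor` is importable: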
+
+```bash
+harbor run \
+  --dataset terminal-bench@2.0 \
+  --agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \
+  --model deepseek/deepseek-chat
+```
+
+## PinchBench
+
+PinchBench measures agent performance on real-world tasks — scheduling, email
+triage, code generation, research, file management. It uses OpenClaw as the
+agent runtime.
+
+### Setup
+
+```bash
+git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
+cd /tmp/pinchbench
+uv venv && source .venv/bin/activate
+uv pip install -e .
+```
+
+### Run
+
+```bash
+# Via the convenience script
+./scripts/benchmarks/run-pinchbench.sh \
+  --model deepseek/deepseek-chat \
+  --suite all
+
+# Or directly
+cd /tmp/pinchbench && ./scripts/run.sh \
+  --model deepseek/deepseek-chat \
+  --suite all
+```
+
+## Reproducibility checklist
+
+When publishing benchmark results, record:
+
+- [ ] CodeWhale version: `codewhale --version`
+- [ ] Git commit: `git rev-parse HEAD`
+- [ ] Model and provider (e.g. `deepseek/deepseek-chat`)
+- [ ] Benchmark dataset and version
+- [ ] Docker platform (`linux/amd64` vs `linux/arm64`)
+- [ ] Worker concurrency
+- [ ] Timestamp (UTC)
+- [ ] Full result file (`all_preds.jsonl`, Harbor result dir, or PinchBench results JSON)
+
+## References
+
+- SWE-bench: https://github.com/SWE-bench/SWE-bench
+- Terminal-Bench: https://github.com/laude-institute/terminal-bench / https://www.tbench.ai
+- Harbor: https://github.com/harbor-framework/harbor / https://harborframework.com
+- PinchBench: https://github.com/pinchbench/skill / https://pinchbench.com
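The checklist above asks for the full `all_preds.jsonl` file, so it can help to check that file before starting a long evaluation. The following is a minimal sketch, not part of this commit, that assumes the file follows the standard SWE-bench prediction format (one JSON object per line with `instance_id`, `model_name_or_path`, and `model_patch`). Whether `codewhale swebench export` emits exactly these keys is an assumption, and `validate_predictions` is a hypothetical helper name.

```python
import json

# Keys expected in each prediction line (assumed: standard SWE-bench format).
REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def validate_predictions(lines):
    """Yield (line_number, problem) for each malformed prediction line."""
    seen = set()
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            yield number, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "not a JSON object"
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            yield number, f"missing keys: {sorted(missing)}"
            continue
        if record["instance_id"] in seen:
            yield number, f"duplicate instance_id: {record['instance_id']}"
        seen.add(record["instance_id"])


if __name__ == "__main__":
    sample = [
        json.dumps({"instance_id": "a", "model_name_or_path": "m", "model_patch": ""}),
        json.dumps({"instance_id": "b"}),
    ]
    for number, problem in validate_predictions(sample):
        print(f"line {number}: {problem}")
```

In practice the `lines` argument would be an open file handle on the predictions file; an empty result means every line has the expected shape.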
@@ -0,0 +1,37 @@
+# Benchmark Scripts
+
+Convenience runners for evaluating CodeWhale against external benchmarks.
+
+## Quick Start
+
+```bash
+# Set your API key
+export DEEPSEEK_API_KEY="sk-..."
+
+# SWE-bench (single instance)
+./scripts/benchmarks/run-swebench.sh \
+  --instance-id django__django-12345 \
+  --issue-file ./issue.md
+
+# Terminal-Bench (via Harbor)
+./scripts/benchmarks/run-terminal-bench.sh \
+  --model deepseek/deepseek-chat
+
+# PinchBench (auto-install + run)
+./scripts/benchmarks/run-pinchbench.sh \
+  --install \
+  --model deepseek/deepseek-chat
+```
+
+## Files
+
+- `run-swebench.sh` — SWE-bench batch driver and evaluator
+- `run-terminal-bench.sh` — Terminal-Bench runner via Harbor
+- `run-pinchbench.sh` — PinchBench runner with auto-install
+- `harbor/__init__.py` — Harbor adapter for CodeWhale (Python)
+- `harbor/codewhale_agent.py` — Adapter entry point
+
+## Documentation
+
+See [docs/BENCHMARKS.md](../../docs/BENCHMARKS.md) for full setup instructions,
+reproducibility checklists, and references.
@@ -0,0 +1,175 @@
+"""
+Harbor adapter for CodeWhale.
+
+Lets Harbor evaluate CodeWhale as an agent on Terminal-Bench and other
+Harbor-compatible datasets.
+
+Usage (after pip install harbor):
+
+    harbor run \\
+      --dataset terminal-bench@2.0 \\
+      --agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \\
+      --model deepseek/deepseek-chat
+
+Or register the agent name in Harbor's AgentName enum for shorter invocations.
+"""
+
+import os
+import shlex
+
+from harbor.agents.installed.base import (
+    BaseInstalledAgent,
+    CliFlag,
+    with_prompt_template,
+)
+from harbor.environments.base import BaseEnvironment
+from harbor.models.agent.context import AgentContext
+
+
+class CodeWhaleAgent(BaseInstalledAgent):
+    """
+    CodeWhale agent adapter for Harbor.
+
+    Installs the ``codewhale`` CLI via npm into the task container and runs
+    tasks in non-interactive exec mode with full tool access.
+    """
+
+    _OUTPUT_FILENAME = "codewhale.txt"
+
+    CLI_FLAGS = [
+        CliFlag(
+            "max_subagents",
+            cli="--max-subagents",
+            type="int",
+            default=4,
+        ),
+        CliFlag(
+            "thinking",
+            cli="--thinking",
+            type="str",
+            default="high",
+        ),
+    ]
+
+    @staticmethod
+    def name() -> str:
+        return "codewhale"
+
+    def version(self) -> str | None:
+        return getattr(self, "_version", None)
+
+    def get_version_command(self) -> str | None:
+        return "codewhale --version 2>/dev/null || codewhale-tui --version 2>/dev/null"
+
+    def parse_version(self, stdout: str) -> str:
+        text = stdout.strip()
+        for line in text.splitlines():
+            line = line.strip()
+            if line:
+                # Strip any prefix like "codewhale " or "codewhale-cli "
+                for prefix in ("codewhale-tui ", "codewhale-cli ", "codewhale "):
+                    if line.lower().startswith(prefix):
+                        return line[len(prefix):]
+                return line
+        return text
+
+    async def install(self, environment: BaseEnvironment) -> None:
+        """Install CodeWhale via npm in the container."""
+        # Install system dependencies. Redirections use the POSIX form
+        # (>/dev/null 2>&1) so the checks also behave correctly under sh/dash.
+        await self.exec_as_root(
+            environment,
+            command=(
+                "if ldd --version 2>&1 | grep -qi musl || [ -f /etc/alpine-release ]; then"
+                "  apk add --no-cache curl bash nodejs npm git ripgrep;"
+                " elif command -v apt-get >/dev/null 2>&1; then"
+                "  apt-get update && apt-get install -y curl git ripgrep;"
+                " elif command -v yum >/dev/null 2>&1; then"
+                "  yum install -y curl git ripgrep;"
+                " fi"
+            ),
+            env={"DEBIAN_FRONTEND": "noninteractive"},
+        )
+
+        # Install Node.js if not present (some images lack it). The NodeSource
+        # setup script used here only supports apt-based images.
+        await self.exec_as_root(
+            environment,
+            command=(
+                "if ! command -v node >/dev/null 2>&1; then"
+                "  curl -fsSL https://deb.nodesource.com/setup_20.x | bash - &&"
+                "  apt-get install -y nodejs;"
+                " fi"
+            ),
+            env={"DEBIAN_FRONTEND": "noninteractive"},
+        )
+
+        # Install CodeWhale CLI via npm
+        await self.exec_as_agent(
+            environment,
+            command="npm install -g codewhale",
+        )
+
+    @with_prompt_template
+    async def run(
+        self,
+        instruction: str,
+        environment: BaseEnvironment,
+        context: AgentContext,
+    ) -> None:
+        """Run CodeWhale in non-interactive exec mode on the task."""
+        escaped_instruction = shlex.quote(instruction)
+
+        # Build CLI flags from agent config
+        cli_flags = self.build_cli_flags()
+        extra_flags = (cli_flags + " ") if cli_flags else ""
+
+        # Forward whichever provider API keys are set on the host:
+        # DeepSeek, OpenRouter (fallback), and generic OpenAI-compatible.
+        env: dict[str, str] = {}
+        for key in ("DEEPSEEK_API_KEY", "OPENROUTER_API_KEY", "OPENAI_API_KEY"):
+            value = os.environ.get(key, "")
+            if value:
+                env[key] = value
+
+        # Build model flag if model_name is provided
+        model_flag = ""
+        if self.model_name:
+            # Harbor passes model as "provider/model"; CodeWhale uses --model
+            model_flag = f"--model {shlex.quote(self.model_name)} "
+
+        output_path = f"/logs/agent/{self._OUTPUT_FILENAME}"
+
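+        # Run CodeWhale in non-interactive YOLO exec mode
+        # --yolo enables full tool access (auto-approved)
+        # --auto runs non-interactively and exits when done
+        # --stream-json gives us structured output for trajectory parsing
+        # Note: because the output is piped through tee, the pipeline's exit
+        # status is tee's rather than codewhale's unless pipefail is enabled.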
+        await self.exec_as_agent(
+            environment,
+            command=(
+                f"codewhale exec --yolo --auto --stream-json "
+                f"{model_flag}{extra_flags}"
+                f"--workspace /workspace "
+                f"{escaped_instruction} "
+                f"2>&1 | tee {shlex.quote(output_path)}"
+            ),
+            env=env if env else None,
+        )
+
+    def populate_context_post_run(self, context: AgentContext) -> None:
+        """Parse CodeWhale's output for any post-run metadata."""
+        # CodeWhale writes its results to the working tree as git diffs.
+        # Harbor's eval harness inspects the workspace directly, so no
+        # special trajectory parsing is needed for basic eval.
+        pass
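To make the quoting in `run` easier to reason about, here is a standalone sketch of how the command string is put together. `build_exec_command` is a hypothetical helper, not part of the adapter or of Harbor, and the `codewhale exec` flags are simply copied from the adapter code above rather than independently verified. The point it illustrates is that `shlex.quote` keeps an arbitrary task instruction as a single shell argument.

```python
import shlex
from typing import Optional


def build_exec_command(
    instruction: str,
    model_name: Optional[str] = None,
    extra_flags: str = "",
    output_path: str = "/logs/agent/codewhale.txt",
) -> str:
    """Assemble the shell command string, mirroring the adapter's run() method."""
    parts = ["codewhale exec --yolo --auto --stream-json"]
    if model_name:
        parts.append(f"--model {shlex.quote(model_name)}")
    if extra_flags:
        parts.append(extra_flags)  # assumed to be already-formatted CLI flags
    parts.append("--workspace /workspace")
    # Quote the instruction so quotes, `$`, `;` and `&&` are passed literally.
    parts.append(shlex.quote(instruction))
    parts.append(f"2>&1 | tee {shlex.quote(output_path)}")
    return " ".join(parts)


if __name__ == "__main__":
    tricky = "fix the bug; don't touch $HOME"
    command = build_exec_command(tricky, "deepseek/deepseek-chat")
    print(command)
    # Round trip: a POSIX shell would see the instruction as one argument.
    assert tricky in shlex.split(command)
```

Building the parts as a list and joining once avoids the trailing-space bookkeeping that string concatenation needs, at the cost of diverging slightly from the adapter's f-string layout.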
@@ -0,0 +1,4 @@
+"""Harbor adapter entry point for CodeWhale."""
+from scripts.benchmarks.harbor import CodeWhaleAgent  # noqa: F401
+
+__all__ = ["CodeWhaleAgent"]
@@ -0,0 +1,149 @@
+#!/usr/bin/env bash
+# run-pinchbench.sh — Run CodeWhale through PinchBench.
+#
+# PinchBench evaluates agent performance on real-world tasks. It normally
+# targets OpenClaw, but this script adapts the workflow for CodeWhale by
+# leveraging the OpenRouter-compatible model routing.
+#
+# Usage:
+#   ./scripts/benchmarks/run-pinchbench.sh --help
+#   ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
+#
+# Prerequisites:
+#   - PinchBench cloned (or install via this script)
+#   - Python 3.10+ with uv
+#   - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
+#   - A running OpenClaw instance (PinchBench's default runtime)
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+# Defaults
+MODEL="deepseek/deepseek-chat"
+SUITE="all"
+PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
+RESULTS_DIR="./results/pinchbench"
+INSTALL_PINCHBENCH=false
+RUNS=1
+JUDGE_MODEL=""
+NO_UPLOAD=true
+EXTRA_ARGS=()
+
+usage() {
+    cat <<EOF
+Usage: $(basename "$0") [OPTIONS]
+
+Run PinchBench benchmarks with CodeWhale-compatible model routing.
+
+Options:
+  --model MODEL           Model in provider/name format (default: deepseek/deepseek-chat)
+  --suite SUITE           Task suite: all, automated-only, or comma-separated IDs (default: all)
+  --runs N                Runs per task for averaging (default: 1)
+  --judge MODEL           Judge model for LLM grading
+  --pinchbench-dir DIR    PinchBench install directory (default: /tmp/pinchbench)
+  --results-dir DIR       Local results directory (default: ./results/pinchbench)
+  --install               Install/clone PinchBench before running
+  --upload                Upload results to pinchbench.com leaderboard
+  -- [EXTRA_ARGS...]      Additional arguments passed to PinchBench
+  -h, --help              Show this help
+
+Examples:
+  # Basic run with DeepSeek
+  $(basename "$0") --model deepseek/deepseek-chat
+
+  # Install and run
+  $(basename "$0") --install --model deepseek/deepseek-chat
+
+  # Specific tasks only
+  $(basename "$0") --suite task_calendar,task_stock --model deepseek/deepseek-chat
+EOF
+}
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --model) MODEL="$2"; shift 2 ;;
+        --suite) SUITE="$2"; shift 2 ;;
+        --runs) RUNS="$2"; shift 2 ;;
+        --judge) JUDGE_MODEL="$2"; shift 2 ;;
+        --pinchbench-dir) PINCHBENCH_DIR="$2"; shift 2 ;;
+        --results-dir) RESULTS_DIR="$2"; shift 2 ;;
+        --install) INSTALL_PINCHBENCH=true; shift ;;
+        --upload) NO_UPLOAD=false; shift ;;
+        --) shift; EXTRA_ARGS=("$@"); break ;;
+        -h|--help) usage; exit 0 ;;
+        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
+    esac
+done
+
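+# Resolve the results directory to an absolute path now: the script changes
+# into $PINCHBENCH_DIR below, and a relative path would otherwise resolve
+# inside the PinchBench checkout instead of the caller's directory.
+mkdir -p "$RESULTS_DIR"
+RESULTS_DIR="$(cd "$RESULTS_DIR" && pwd)"
+
+# Install PinchBench if requested, or if the install directory is missing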
+if [[ "$INSTALL_PINCHBENCH" == true || ! -d "$PINCHBENCH_DIR" ]]; then
+    echo "Installing PinchBench to $PINCHBENCH_DIR ..."
+    if [[ -d "$PINCHBENCH_DIR" ]]; then
+        cd "$PINCHBENCH_DIR" && git pull
+    else
+        git clone https://github.com/pinchbench/skill.git "$PINCHBENCH_DIR"
+    fi
+    cd "$PINCHBENCH_DIR"
+    uv venv .venv 2>/dev/null || true
+    source .venv/bin/activate
+    uv pip install -e .
+fi
+
+# Verify PinchBench is available
+if [[ ! -d "$PINCHBENCH_DIR" ]]; then
+    echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
+    echo "Run with --install to clone it automatically." >&2
+    exit 1
+fi
+
+cd "$PINCHBENCH_DIR"
+
+# Activate venv if it exists
+if [[ -f ".venv/bin/activate" ]]; then
+    source .venv/bin/activate
+fi
+
+mkdir -p "$RESULTS_DIR"
+
+# Record metadata
+METADATA_FILE="$RESULTS_DIR/run_metadata.json"
+cat > "$METADATA_FILE" <<META
+{
+    "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
+    "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
+    "pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
+    "model": "$MODEL",
+    "suite": "$SUITE",
+    "runs": $RUNS,
+    "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
+    "platform": "$(uname -s)/$(uname -m)"
+}
+META
+echo "Run metadata: $METADATA_FILE"
+
+# Build PinchBench command
+PB_ARGS=("--model" "$MODEL" "--suite" "$SUITE" "--runs" "$RUNS" "--output-dir" "$RESULTS_DIR")
+
+if [[ -n "$JUDGE_MODEL" ]]; then
+    PB_ARGS+=("--judge" "$JUDGE_MODEL")
+fi
+
+if [[ "$NO_UPLOAD" == true ]]; then
+    PB_ARGS+=("--no-upload")
+fi
+
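+# Expand EXTRA_ARGS only when non-empty: under `set -u`, bash older than 4.4
+# treats "${EXTRA_ARGS[@]}" on an empty array as an unbound variable.
+PB_ARGS+=(${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"})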
+
+echo "Running PinchBench..."
+echo "  Model:  $MODEL"
+echo "  Suite:  $SUITE"
+echo "  Runs:   $RUNS"
+echo "  Output: $RESULTS_DIR"
+echo ""
+
+./scripts/run.sh "${PB_ARGS[@]}"
+
+echo ""
+echo "Results written to $RESULTS_DIR"
@@ -0,0 +1,161 @@
+#!/usr/bin/env bash
+# run-swebench.sh — Batch driver for CodeWhale SWE-bench runs.
+#
+# Usage:
+#   ./scripts/benchmarks/run-swebench.sh --help
+#   ./scripts/benchmarks/run-swebench.sh --dataset princeton-nlp/SWE-bench_Lite --split test
+#   ./scripts/benchmarks/run-swebench.sh --instance-id django__django-12345 --issue-file issue.md
+#
+# Prerequisites:
+#   - codewhale installed and on PATH
+#   - DEEPSEEK_API_KEY set (or appropriate provider key)
+#   - swebench pip package installed (for evaluation step)
+#   - Docker running (for evaluation step)
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+# Defaults
+DATASET=""
+SPLIT="test"
+INSTANCE_ID=""
+ISSUE_FILE=""
+PREDICTIONS_PATH="./results/swebench_preds.jsonl"
+MODEL=""
+WORKSPACE_BASE="/tmp/swebench-workspaces"
+EVAL_ONLY=false
+MAX_WORKERS=1
+
+usage() {
+    cat <<EOF
+Usage: $(basename "$0") [OPTIONS]
+
+Run CodeWhale on SWE-bench instances and produce prediction JSONL.
+
+Options:
+  --dataset DATASET       HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite)
+  --split SPLIT           Dataset split (default: test)
+  --instance-id ID        Run a single instance by ID
+  --issue-file PATH       Issue text file for single-instance mode
+  --predictions-path PATH Output JSONL file (default: ./results/swebench_preds.jsonl)
+  --model MODEL           Model override for CodeWhale
+  --workspace-base DIR    Base dir for instance workspaces (default: /tmp/swebench-workspaces)
+  --eval-only             Skip runs; just evaluate existing predictions file
+  --max-workers N         Parallel workers for evaluation (default: 1)
+  -h, --help              Show this help
+
+Examples:
+  # Run all instances from SWE-bench Lite
+  $(basename "$0") --dataset princeton-nlp/SWE-bench_Lite --split test
+
+  # Run a single instance
+  $(basename "$0") --instance-id django__django-12345 --issue-file ./issue.md
+
+  # Evaluate existing predictions
+  $(basename "$0") --eval-only --predictions-path ./results/swebench_preds.jsonl
+EOF
+}
+
+# Parse args
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --dataset) DATASET="$2"; shift 2 ;;
+        --split) SPLIT="$2"; shift 2 ;;
+        --instance-id) INSTANCE_ID="$2"; shift 2 ;;
+        --issue-file) ISSUE_FILE="$2"; shift 2 ;;
+        --predictions-path) PREDICTIONS_PATH="$2"; shift 2 ;;
+        --model) MODEL="$2"; shift 2 ;;
+        --workspace-base) WORKSPACE_BASE="$2"; shift 2 ;;
+        --eval-only) EVAL_ONLY=true; shift ;;
+        --max-workers) MAX_WORKERS="$2"; shift 2 ;;
+        -h|--help) usage; exit 0 ;;
+        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
+    esac
+done
+
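+mkdir -p "$(dirname "$PREDICTIONS_PATH")" "$WORKSPACE_BASE"
+
+# Resolve caller-relative paths to absolute ones: run_single_instance changes
+# into the instance workspace, where relative paths would point elsewhere.
+PREDICTIONS_PATH="$(cd "$(dirname "$PREDICTIONS_PATH")" && pwd)/$(basename "$PREDICTIONS_PATH")"
+if [[ -n "$ISSUE_FILE" && -f "$ISSUE_FILE" ]]; then
+    ISSUE_FILE="$(cd "$(dirname "$ISSUE_FILE")" && pwd)/$(basename "$ISSUE_FILE")"
+fi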
+
+# Record run metadata
+METADATA_FILE="$(dirname "$PREDICTIONS_PATH")/run_metadata.json"
+cat > "$METADATA_FILE" <<META
+{
+    "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
+    "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
+    "model": "${MODEL:-default}",
+    "dataset": "${DATASET:-single-instance}",
+    "split": "${SPLIT}",
+    "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
+    "platform": "$(uname -s)/$(uname -m)"
+}
+META
+echo "Run metadata written to $METADATA_FILE"
+
+run_single_instance() {
+    local id="$1"
+    local workspace="$WORKSPACE_BASE/$id"
+
+    echo "=== Running instance: $id ==="
+
+    # The instance workspace must already exist; this script does not clone it
+    if [[ ! -d "$workspace" ]]; then
+        echo "  Workspace not found at $workspace"
+        echo "  For batch mode, pre-clone instance repos into $WORKSPACE_BASE/"
+        echo "  For single instance, use --issue-file with an existing workspace"
+        return 1
+    fi
+
+    cd "$workspace"
+
+    # Write issue file if provided
+    if [[ -n "$ISSUE_FILE" && -f "$ISSUE_FILE" ]]; then
+        cp "$ISSUE_FILE" "$workspace/issue.md"
+    fi
+
+    # Build the codewhale command
+    local cw_args=("swebench" "run"
+        "--instance-id" "$id"
+        "--predictions-path" "$PREDICTIONS_PATH"
+    )
+
+    if [[ -n "$MODEL" ]]; then
+        cw_args+=("--model" "$MODEL")
+    fi
+
+    codewhale "${cw_args[@]}"
+    echo "  Prediction written for $id"
+}
+
+if [[ "$EVAL_ONLY" == true ]]; then
+    echo "Evaluating existing predictions at $PREDICTIONS_PATH ..."
+    python -m swebench.harness.run_evaluation \
+        --dataset_name "${DATASET:-princeton-nlp/SWE-bench_Lite}" \
+        --predictions_path "$PREDICTIONS_PATH" \
+        --max_workers "$MAX_WORKERS" \
+        --run_id "codewhale-$(date -u +%Y%m%d-%H%M%S)"
+    exit 0
+fi
+
+if [[ -n "$INSTANCE_ID" ]]; then
+    # Single-instance mode
+    run_single_instance "$INSTANCE_ID"
+elif [[ -n "$DATASET" ]]; then
+    # Batch mode: requires a pre-prepared workspace directory structure
+    echo "Batch mode for dataset: $DATASET (split: $SPLIT)"
+    echo ""
+    echo "To run batch SWE-bench:"
+    echo "  1. Install swebench: pip install swebench"
+    echo "  2. Prepare instance workspaces in $WORKSPACE_BASE/"
+    echo "  3. For each instance, run:"
+    echo "     $0 --instance-id <ID> --predictions-path $PREDICTIONS_PATH"
+    echo "  4. Then evaluate:"
+    echo "     $0 --eval-only --dataset $DATASET --predictions-path $PREDICTIONS_PATH"
+    echo ""
+    echo "Automated batch orchestration is planned for v0.9.0."
+    echo "For now, use the SWE-bench docker harness to prepare workspaces."
+else
+    echo "Error: specify --dataset or --instance-id" >&2
+    usage >&2
+    exit 1
+fi
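Until automated batch orchestration lands, the per-instance loop that batch mode prints can be scripted by hand. The sketch below is one hypothetical way to do that and is not part of this commit: `build_instance_commands` is an illustrative name, and the flags are the ones defined by the script above. It only builds argv lists for instances whose workspace directory exists; running them (for example with `subprocess.run`) is left to the caller.

```python
import shlex
import tempfile
from pathlib import Path


def build_instance_commands(
    instance_ids,
    predictions_path,
    script="./scripts/benchmarks/run-swebench.sh",
    workspace_base="/tmp/swebench-workspaces",
):
    """Return one argv list per instance that has a prepared workspace."""
    commands = []
    for instance_id in instance_ids:
        if not (Path(workspace_base) / instance_id).is_dir():
            continue  # the script would fail for this instance anyway
        commands.append([
            script,
            "--instance-id", instance_id,
            "--predictions-path", predictions_path,
            "--workspace-base", workspace_base,
        ])
    return commands


if __name__ == "__main__":
    # Demo with a throwaway workspace tree containing a single instance.
    with tempfile.TemporaryDirectory() as base:
        (Path(base) / "django__django-12345").mkdir()
        for argv in build_instance_commands(
            ["django__django-12345", "not__prepared-1"],
            "./results/swebench_preds.jsonl",
            workspace_base=base,
        ):
            print(shlex.join(argv))
```

Returning argv lists instead of shell strings means the caller never has to re-quote instance IDs or paths.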
@@ -0,0 +1,113 @@
+#!/usr/bin/env bash
+# run-terminal-bench.sh — Run CodeWhale on Terminal-Bench via Harbor.
+#
+# Usage:
+#   ./scripts/benchmarks/run-terminal-bench.sh --help
+#   ./scripts/benchmarks/run-terminal-bench.sh --dataset terminal-bench@2.0 --model deepseek/deepseek-chat
+#
+# Prerequisites:
+#   - pip install harbor
+#   - Docker running
+#   - DEEPSEEK_API_KEY or OPENROUTER_API_KEY set
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+# Defaults
+DATASET="terminal-bench@2.0"
+MODEL="deepseek/deepseek-chat"
+N_CONCURRENT=4
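+# docs/BENCHMARKS.md and the adapter docstring pass the agent to Harbor as a
+# dotted "module.path:ClassName" import path, so use that form here as well
+# and make the repository root importable.
+AGENT_PATH="scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent"
+export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}"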
+RESULTS_DIR="./results/terminal-bench"
+EXTRA_ARGS=()
+
+usage() {
+    cat <<EOF
+Usage: $(basename "$0") [OPTIONS]
+
+Run CodeWhale on Terminal-Bench tasks via the Harbor framework.
+
+Options:
+  --dataset DATASET       Harbor dataset (default: terminal-bench@2.0)
+  --model MODEL           Model in provider/name format (default: deepseek/deepseek-chat)
+  --agent PATH            Harbor agent import path (default: local CodeWhale adapter)
+  --n-concurrent N        Parallel task workers (default: 4)
+  --results-dir DIR       Results output directory (default: ./results/terminal-bench)
+  -- [EXTRA_ARGS...]      Additional arguments passed to 'harbor run'
+  -h, --help              Show this help
+
+Examples:
+  # Default run
+  $(basename "$0")
+
+  # Custom model and concurrency
+  $(basename "$0") --model deepseek/deepseek-reasoner --n-concurrent 8
+
+  # Pass extra flags to harbor
+  $(basename "$0") -- --env daytona
+EOF
+}
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --dataset) DATASET="$2"; shift 2 ;;
+        --model) MODEL="$2"; shift 2 ;;
+        --agent) AGENT_PATH="$2"; shift 2 ;;
+        --n-concurrent) N_CONCURRENT="$2"; shift 2 ;;
+        --results-dir) RESULTS_DIR="$2"; shift 2 ;;
+        --) shift; EXTRA_ARGS=("$@"); break ;;
+        -h|--help) usage; exit 0 ;;
+        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
+    esac
+done
+
+# Check prerequisites
+if ! command -v harbor &>/dev/null; then
+    echo "Error: 'harbor' not found. Install with: pip install harbor" >&2
+    exit 1
+fi
+
+if ! command -v docker &>/dev/null; then
+    echo "Error: Docker not found. Harbor requires Docker." >&2
+    exit 1
+fi
+
+mkdir -p "$RESULTS_DIR"
+
+# Record metadata
+METADATA_FILE="$RESULTS_DIR/run_metadata.json"
+cat > "$METADATA_FILE" <<META
+{
+    "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
+    "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
+    "harbor_version": "$(harbor --version 2>/dev/null || echo unknown)",
+    "model": "$MODEL",
+    "dataset": "$DATASET",
+    "agent": "codewhale",
+    "n_concurrent": $N_CONCURRENT,
+    "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
+    "platform": "$(uname -s)/$(uname -m)"
+}
+META
+echo "Run metadata: $METADATA_FILE"
+
+# Run Harbor
+echo "Running Terminal-Bench via Harbor..."
+echo "  Dataset:   $DATASET"
+echo "  Model:     $MODEL"
+echo "  Agent:     $AGENT_PATH"
+echo "  Workers:   $N_CONCURRENT"
+echo ""
+
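+# ${EXTRA_ARGS[@]+...} expands to nothing when no extra arguments were given,
+# which avoids an "unbound variable" error under `set -u` on bash < 4.4.
+harbor run \
+    --dataset "$DATASET" \
+    --agent "$AGENT_PATH" \
+    --model "$MODEL" \
+    --n-concurrent "$N_CONCURRENT" \
+    --results-dir "$RESULTS_DIR" \
+    ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}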
+
+echo ""
+echo "Results written to $RESULTS_DIR"
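All three scripts build `run_metadata.json` with a shell heredoc, which does not escape quotes or newlines in interpolated values such as the `codewhale --version` output. As a hedged alternative, the sketch below collects the same fields in Python and serialises them with `json.dumps`. It is not part of this commit; `collect_run_metadata` and `run_or_unknown` are hypothetical names, and the field names mirror the ones the scripts write.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone


def run_or_unknown(cmd):
    """Return the first line of a command's stdout, or "unknown" on any failure."""
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        # Covers a missing binary, a non-zero exit status, and a timeout.
        return "unknown"
    return out.splitlines()[0] if out else "unknown"


def collect_run_metadata(model, dataset, concurrency):
    """Gather the reproducibility fields into one JSON-serialisable dict."""
    return {
        "codewhale_version": run_or_unknown(["codewhale", "--version"]),
        "git_commit": run_or_unknown(["git", "rev-parse", "HEAD"]),
        "model": model,
        "dataset": dataset,
        "n_concurrent": concurrency,
        "timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "platform": f"{platform.system()}/{platform.machine()}",
    }


if __name__ == "__main__":
    meta = collect_run_metadata("deepseek/deepseek-chat", "terminal-bench@2.0", 4)
    print(json.dumps(meta, indent=4))
```

Like the shell version, this falls back to `unknown` when `codewhale` or `git` is unavailable, so a run is never blocked by a metadata lookup.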