feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration

Benchmark harness for evaluating CodeWhale against three external benchmarks:

- SWE-bench: batch driver wrapping existing codewhale swebench commands
- Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval
- PinchBench: runner with auto-install for real-world agent tasks

Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and
reproducibility checklist. Scripts record version/commit/timestamp
metadata for each run.

Branch: codex/v0.8.53-benchmarks (based on v0.8.53)
@@ -0,0 +1,153 @@
# Benchmarks

CodeWhale integrates with three external benchmarks to measure real-world
coding-agent performance. Each benchmark tests a different surface:

| Benchmark | What it tests | Harness | Output format |
|---|---|---|---|
| **SWE-bench** | Patch generation from GitHub issues | CodeWhale built-in (`codewhale swebench`) | `all_preds.jsonl` |
| **Terminal-Bench** | End-to-end terminal tasks (compile, deploy, configure) | Harbor framework adapter | Harbor result JSON |
| **PinchBench** | Real-world agent tasks (calendar, email, coding, research) | Standalone runner via OpenClaw-compatible adapter | PinchBench result JSON |

All three require Docker. SWE-bench and Terminal-Bench also need the official
evaluation harness installed separately.

## Prerequisites

```bash
# Docker (all benchmarks)
docker --version

# Python 3.10+ with uv (Terminal-Bench, PinchBench, SWE-bench eval)
python3 --version
uv --version

# CodeWhale v0.8.53+
codewhale --version

# API key
export DEEPSEEK_API_KEY="sk-..."
```
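
The same checks can be scripted before a long run. The sketch below is not part of this commit: the helper name `missing_prerequisites` and its default tool and variable lists are assumptions for illustration, taken from the commands above. It only checks that each tool is on `PATH` and each variable is set; it does not compare versions.

```python
import os
import shutil


def missing_prerequisites(
    tools=("docker", "python3", "uv", "codewhale"),
    env_vars=("DEEPSEEK_API_KEY",),
    environ=None,
):
    """Return labels for tools and environment variables that look missing.

    Hypothetical helper: the defaults mirror the prerequisite list above.
    Presence only; no version checks are performed.
    """
    environ = os.environ if environ is None else environ
    # shutil.which returns None when the executable is not found on PATH.
    missing = [f"tool:{name}" for name in tools if shutil.which(name) is None]
    # An unset or empty variable counts as missing.
    missing += [f"env:{name}" for name in env_vars if not environ.get(name)]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites: " + ", ".join(problems))
    else:
        print("All prerequisites found.")
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real environment.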

## SWE-bench

CodeWhale has built-in SWE-bench support via `codewhale swebench run` and
`codewhale swebench export`. See [docs/SWEBENCH.md](SWEBENCH.md) for the
single-instance workflow.

### Batch run

The driver currently runs one instance at a time, against a workspace that
already exists at `<workspace-base>/<instance-id>` (default base:
`/tmp/swebench-workspaces`). Passing `--dataset` without `--instance-id` prints
the manual batch steps; automated batch orchestration is planned for v0.9.0.

```bash
# Print the manual batch steps for a dataset split
./scripts/benchmarks/run-swebench.sh \
  --dataset princeton-nlp/SWE-bench_Lite \
  --split test \
  --predictions-path ./results/swebench_preds.jsonl

# Run a single instance (its workspace must already be prepared)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md \
  --predictions-path ./results/swebench_preds.jsonl
```
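
Until batch orchestration lands, a small driver can loop over instance IDs. The following is a minimal sketch, not part of this commit: `build_instance_commands` is a hypothetical helper, and it assumes only the `--instance-id`, `--workspace-base`, and `--predictions-path` flags that `run-swebench.sh` documents. It builds the command lines without running them, and skips instances whose workspace directory is missing.

```python
from pathlib import Path


def build_instance_commands(
    instance_ids,
    predictions_path="./results/swebench_preds.jsonl",
    script="./scripts/benchmarks/run-swebench.sh",
    workspace_base="/tmp/swebench-workspaces",
):
    """Build one run-swebench.sh argv per instance that has a workspace.

    Hypothetical helper. Returns (commands, skipped), where skipped lists
    instance IDs with no directory under workspace_base.
    """
    # Use an absolute predictions path so every run appends to the same
    # file regardless of the working directory the script ends up in.
    predictions = str(Path(predictions_path).resolve())
    commands, skipped = [], []
    for instance_id in instance_ids:
        if not (Path(workspace_base) / instance_id).is_dir():
            skipped.append(instance_id)
            continue
        commands.append([
            script,
            "--instance-id", instance_id,
            "--workspace-base", workspace_base,
            "--predictions-path", predictions,
        ])
    return commands, skipped


if __name__ == "__main__":
    cmds, skipped = build_instance_commands(["django__django-12345"])
    for cmd in cmds:
        print(" ".join(cmd))
    for instance_id in skipped:
        print(f"skipped (no workspace): {instance_id}")
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the runs sequentially.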

### Evaluate

```bash
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path ./results/swebench_preds.jsonl \
  --max_workers 1 \
  --run_id codewhale-v0.8.53
```
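
Before spending Docker time on evaluation, it can help to sanity-check the predictions file. The sketch below assumes the file follows the SWE-bench prediction format, one JSON object per line with `instance_id`, `model_name_or_path`, and `model_patch`. That CodeWhale's export writes exactly these fields is an assumption here, and `check_predictions` is a hypothetical helper, not part of either tool.

```python
import json

# Fields the SWE-bench harness expects in each prediction record.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def check_predictions(lines):
    """Return (valid_count, problems) for an iterable of JSONL lines."""
    valid, problems = 0, []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
        else:
            valid += 1
    return valid, problems


if __name__ == "__main__":
    sample = [
        json.dumps({
            "instance_id": "django__django-12345",
            "model_name_or_path": "codewhale",
            "model_patch": "diff --git a/x b/x\n",
        }),
        json.dumps({"instance_id": "incomplete-record"}),
    ]
    count, issues = check_predictions(sample)
    print(f"{count} valid record(s)")
    for issue in issues:
        print(issue)
```

To check a real file, pass an open file handle, for example `check_predictions(open("./results/swebench_preds.jsonl"))`.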

## Terminal-Bench (via Harbor)

Terminal-Bench tests agents on real terminal tasks: compiling, deploying,
configuring servers, training models. The [Harbor framework](https://github.com/harbor-framework/harbor)
is the official harness.

CodeWhale plugs in via a Harbor adapter (`scripts/benchmarks/harbor/codewhale_agent.py`).

### Setup

```bash
pip install harbor
```

### Run

```bash
# Via the convenience script
./scripts/benchmarks/run-terminal-bench.sh \
  --dataset terminal-bench@2.0 \
  --model deepseek/deepseek-chat \
  --n-concurrent 4

# Or directly with harbor. The short name works only once `codewhale` is
# registered in Harbor's AgentName enum; otherwise use the import path
# described under "Custom agent path".
harbor run \
  --dataset terminal-bench@2.0 \
  --agent codewhale \
  --model deepseek/deepseek-chat \
  --n-concurrent 4
```

### Custom agent path

If the adapter is not installed system-wide, point Harbor at it by import path.
Run this from the repository root so that the `scripts.benchmarks.harbor`
package is importable:

```bash
harbor run \
  --dataset terminal-bench@2.0 \
  --agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \
  --model deepseek/deepseek-chat
```
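
An agent string of the form `package.module:ClassName` is a Python import path, not a file path. The sketch below shows one common way such a string is resolved. It is an illustration only: Harbor's actual loader may differ, and `resolve_import_path` is a hypothetical name. A standard-library class stands in for the adapter so the example runs without Harbor installed.

```python
import importlib


def resolve_import_path(spec):
    """Resolve a 'package.module:ClassName' string to the named object."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    # import_module needs the top-level package to be on sys.path, which is
    # why the command above is run from the repository root.
    module = importlib.import_module(module_name)
    return getattr(module, attr)


if __name__ == "__main__":
    # Stand-in target; with Harbor installed the spec would instead be
    # "scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent".
    print(resolve_import_path("collections:OrderedDict"))
```

Under this scheme a value such as `scripts/benchmarks/harbor/__init__.py` is rejected because it has no `:ClassName` part, and a slash-separated path would not be importable as a module name in any case.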

## PinchBench

PinchBench measures agent performance on real-world tasks: scheduling, email
triage, code generation, research, file management. It uses OpenClaw as the
agent runtime.

### Setup

```bash
git clone https://github.com/pinchbench/skill.git /tmp/pinchbench
cd /tmp/pinchbench
uv venv && source .venv/bin/activate
uv pip install -e .
```

### Run

```bash
# Via the convenience script
./scripts/benchmarks/run-pinchbench.sh \
  --model deepseek/deepseek-chat \
  --suite all

# Or directly
cd /tmp/pinchbench && ./scripts/run.sh \
  --model deepseek/deepseek-chat \
  --suite all
```

## Reproducibility checklist

When publishing benchmark results, record:

- [ ] CodeWhale version: `codewhale --version`
- [ ] Git commit: `git rev-parse HEAD`
- [ ] Model and provider (e.g. `deepseek/deepseek-chat`)
- [ ] Benchmark dataset and version
- [ ] Docker platform (`linux/amd64` vs `linux/arm64`)
- [ ] Worker concurrency
- [ ] Timestamp (UTC)
- [ ] Full result file (`all_preds.jsonl`, Harbor result dir, or PinchBench results JSON)

## References

- SWE-bench: https://github.com/SWE-bench/SWE-bench
- Terminal-Bench: https://github.com/laude-institute/terminal-bench / https://www.tbench.ai
- Harbor: https://github.com/harbor-framework/harbor / https://harborframework.com
- PinchBench: https://github.com/pinchbench/skill / https://pinchbench.com
@@ -0,0 +1,37 @@
# Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

## Quick Start

```bash
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
```

## Files

- `run-swebench.sh`: SWE-bench batch driver and evaluator
- `run-terminal-bench.sh`: Terminal-Bench runner via Harbor
- `run-pinchbench.sh`: PinchBench runner with auto-install
- `harbor/__init__.py`: Harbor adapter for CodeWhale (Python)
- `harbor/codewhale_agent.py`: Adapter entry point

## Documentation

See [docs/BENCHMARKS.md](../../docs/BENCHMARKS.md) for full setup instructions,
reproducibility checklists, and references.
@@ -0,0 +1,175 @@
"""
Harbor adapter for CodeWhale.

Lets Harbor evaluate CodeWhale as an agent on Terminal-Bench and other
Harbor-compatible datasets.

Usage (after pip install harbor):

    harbor run \\
        --dataset terminal-bench@2.0 \\
        --agent scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent \\
        --model deepseek/deepseek-chat

Or register the agent name in Harbor's AgentName enum for shorter invocations.
"""

import os
import shlex

from harbor.agents.installed.base import (
    BaseInstalledAgent,
    CliFlag,
    with_prompt_template,
)
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext


class CodeWhaleAgent(BaseInstalledAgent):
    """
    CodeWhale agent adapter for Harbor.

    Installs the ``codewhale`` CLI via npm into the task container and runs
    tasks in non-interactive exec mode with full tool access.
    """

    _OUTPUT_FILENAME = "codewhale.txt"

    CLI_FLAGS = [
        CliFlag(
            "max_subagents",
            cli="--max-subagents",
            type="int",
            default=4,
        ),
        CliFlag(
            "thinking",
            cli="--thinking",
            type="str",
            default="high",
        ),
    ]

    @staticmethod
    def name() -> str:
        return "codewhale"

    def version(self) -> str | None:
        return getattr(self, "_version", None)

    def get_version_command(self) -> str | None:
        return "codewhale --version 2>/dev/null || codewhale-tui --version 2>/dev/null"

    def parse_version(self, stdout: str) -> str:
        text = stdout.strip()
        for line in text.splitlines():
            line = line.strip()
            if line:
                # Strip any prefix like "codewhale " or "codewhale-cli "
                for prefix in ("codewhale-tui ", "codewhale-cli ", "codewhale "):
                    if line.lower().startswith(prefix):
                        return line[len(prefix):]
                return line
        return text

    async def install(self, environment: BaseEnvironment) -> None:
        """Install CodeWhale via npm in the container."""
        # Install system dependencies. Redirections use the POSIX form
        # (>/dev/null 2>&1) instead of bash's "&>": under plain sh, "&>"
        # backgrounds the command, so the test would always succeed.
        await self.exec_as_root(
            environment,
            command=(
                "if ldd --version 2>&1 | grep -qi musl || [ -f /etc/alpine-release ]; then"
                "  apk add --no-cache curl bash nodejs npm git ripgrep;"
                " elif command -v apt-get >/dev/null 2>&1; then"
                "  apt-get update && apt-get install -y curl git ripgrep;"
                " elif command -v yum >/dev/null 2>&1; then"
                "  yum install -y curl git ripgrep;"
                " fi"
            ),
            env={"DEBIAN_FRONTEND": "noninteractive"},
        )

        # Install Node.js if not present (some images lack it).
        # Note: the NodeSource setup script and apt-get are Debian/Ubuntu only.
        await self.exec_as_root(
            environment,
            command=(
                "if ! command -v node >/dev/null 2>&1; then"
                "  curl -fsSL https://deb.nodesource.com/setup_20.x | bash - &&"
                "  apt-get install -y nodejs;"
                " fi"
            ),
            env={"DEBIAN_FRONTEND": "noninteractive"},
        )

        # Install CodeWhale CLI via npm
        await self.exec_as_agent(
            environment,
            command="npm install -g codewhale",
        )

    @with_prompt_template
    async def run(
        self,
        instruction: str,
        environment: BaseEnvironment,
        context: AgentContext,
    ) -> None:
        """Run CodeWhale in non-interactive exec mode on the task."""
        escaped_instruction = shlex.quote(instruction)

        # Build CLI flags from agent config
        cli_flags = self.build_cli_flags()
        extra_flags = (cli_flags + " ") if cli_flags else ""

        # Forward whichever provider API keys are set on the host:
        # DeepSeek, OpenRouter (fallback), and generic OpenAI-compatible.
        env: dict[str, str] = {}
        for key in ("DEEPSEEK_API_KEY", "OPENROUTER_API_KEY", "OPENAI_API_KEY"):
            value = os.environ.get(key, "")
            if value:
                env[key] = value

        # Build model flag if model_name is provided
        model_flag = ""
        if self.model_name:
            # Harbor passes model as "provider/model"; CodeWhale uses --model
            model_flag = f"--model {shlex.quote(self.model_name)} "

        output_path = f"/logs/agent/{self._OUTPUT_FILENAME}"

        # Run CodeWhale in non-interactive YOLO exec mode
        #   --yolo enables full tool access (auto-approved)
        #   --auto runs non-interactively and exits when done
        #   --stream-json gives us structured output for trajectory parsing
        await self.exec_as_agent(
            environment,
            command=(
                "codewhale exec --yolo --auto --stream-json "
                f"{model_flag}{extra_flags}"
                "--workspace /workspace "
                f"{escaped_instruction} "
                f"2>&1 | tee {shlex.quote(output_path)}"
            ),
            env=env if env else None,
        )

    def populate_context_post_run(self, context: AgentContext) -> None:
        """Parse CodeWhale's output for any post-run metadata."""
        # CodeWhale writes its results to the working tree as git diffs.
        # Harbor's eval harness inspects the workspace directly, so no
        # special trajectory parsing is needed for basic eval.
        pass
@@ -0,0 +1,4 @@
"""Harbor adapter entry point for CodeWhale."""
from scripts.benchmarks.harbor import CodeWhaleAgent  # noqa: F401

__all__ = ["CodeWhaleAgent"]
@@ -0,0 +1,149 @@
#!/usr/bin/env bash
# run-pinchbench.sh - Run CodeWhale through PinchBench.
#
# PinchBench evaluates agent performance on real-world tasks. It normally
# targets OpenClaw, but this script adapts the workflow for CodeWhale by
# leveraging the OpenRouter-compatible model routing.
#
# Usage:
#   ./scripts/benchmarks/run-pinchbench.sh --help
#   ./scripts/benchmarks/run-pinchbench.sh --model deepseek/deepseek-chat
#
# Prerequisites:
#   - PinchBench cloned (or install via this script)
#   - Python 3.10+ with uv
#   - OPENROUTER_API_KEY or DEEPSEEK_API_KEY set
#   - A running OpenClaw instance (PinchBench's default runtime)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

# Defaults
MODEL="deepseek/deepseek-chat"
SUITE="all"
PINCHBENCH_DIR="${PINCHBENCH_DIR:-/tmp/pinchbench}"
RESULTS_DIR="./results/pinchbench"
INSTALL_PINCHBENCH=false
RUNS=1
JUDGE_MODEL=""
NO_UPLOAD=true
EXTRA_ARGS=()

usage() {
    cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Run PinchBench benchmarks with CodeWhale-compatible model routing.

Options:
  --model MODEL          Model in provider/name format (default: deepseek/deepseek-chat)
  --suite SUITE          Task suite: all, automated-only, or comma-separated IDs (default: all)
  --runs N               Runs per task for averaging (default: 1)
  --judge MODEL          Judge model for LLM grading
  --pinchbench-dir DIR   PinchBench install directory (default: /tmp/pinchbench)
  --results-dir DIR      Local results directory (default: ./results/pinchbench)
  --install              Install/clone PinchBench before running
  --upload               Upload results to pinchbench.com leaderboard
  -- [EXTRA_ARGS...]     Additional arguments passed to PinchBench
  -h, --help             Show this help

Examples:
  # Basic run with DeepSeek
  $(basename "$0") --model deepseek/deepseek-chat

  # Install and run
  $(basename "$0") --install --model deepseek/deepseek-chat

  # Specific tasks only
  $(basename "$0") --suite task_calendar,task_stock --model deepseek/deepseek-chat
EOF
}

while [[ $# -gt 0 ]]; do
    case "$1" in
        --model) MODEL="$2"; shift 2 ;;
        --suite) SUITE="$2"; shift 2 ;;
        --runs) RUNS="$2"; shift 2 ;;
        --judge) JUDGE_MODEL="$2"; shift 2 ;;
        --pinchbench-dir) PINCHBENCH_DIR="$2"; shift 2 ;;
        --results-dir) RESULTS_DIR="$2"; shift 2 ;;
        --install) INSTALL_PINCHBENCH=true; shift ;;
        --upload) NO_UPLOAD=false; shift ;;
        --) shift; EXTRA_ARGS=("$@"); break ;;
        -h|--help) usage; exit 0 ;;
        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
    esac
done

# --runs is written unquoted into the metadata JSON, so require an integer.
if ! [[ "$RUNS" =~ ^[1-9][0-9]*$ ]]; then
    echo "Error: --runs must be a positive integer (got: $RUNS)" >&2
    exit 1
fi

# Resolve the results directory to an absolute path now. The script changes
# into $PINCHBENCH_DIR below, which would otherwise relocate a relative path.
mkdir -p "$RESULTS_DIR"
RESULTS_DIR="$(cd "$RESULTS_DIR" && pwd)"

# Install PinchBench if requested
if [[ "$INSTALL_PINCHBENCH" == true ]]; then
    echo "Installing PinchBench to $PINCHBENCH_DIR ..."
    if [[ -d "$PINCHBENCH_DIR" ]]; then
        git -C "$PINCHBENCH_DIR" pull
    else
        git clone https://github.com/pinchbench/skill.git "$PINCHBENCH_DIR"
    fi
    # Use a subshell so the caller's working directory is unchanged and a
    # relative $PINCHBENCH_DIR still resolves in the checks below.
    (
        cd "$PINCHBENCH_DIR"
        uv venv .venv 2>/dev/null || true
        source .venv/bin/activate
        uv pip install -e .
    )
fi

# Verify PinchBench is available
if [[ ! -d "$PINCHBENCH_DIR" ]]; then
    echo "Error: PinchBench not found at $PINCHBENCH_DIR" >&2
    echo "Run with --install to clone it automatically." >&2
    exit 1
fi

cd "$PINCHBENCH_DIR"

# Activate venv if it exists
if [[ -f ".venv/bin/activate" ]]; then
    source .venv/bin/activate
fi

# Record metadata
METADATA_FILE="$RESULTS_DIR/run_metadata.json"
cat > "$METADATA_FILE" <<META
{
  "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
  "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
  "pinchbench_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
  "model": "$MODEL",
  "suite": "$SUITE",
  "runs": $RUNS,
  "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "platform": "$(uname -s)/$(uname -m)"
}
META
echo "Run metadata: $METADATA_FILE"

# Build PinchBench command
PB_ARGS=("--model" "$MODEL" "--suite" "$SUITE" "--runs" "$RUNS" "--output-dir" "$RESULTS_DIR")

if [[ -n "$JUDGE_MODEL" ]]; then
    PB_ARGS+=("--judge" "$JUDGE_MODEL")
fi

if [[ "$NO_UPLOAD" == true ]]; then
    PB_ARGS+=("--no-upload")
fi

# The ${arr[@]+...} form avoids an "unbound variable" error under `set -u`
# when EXTRA_ARGS is empty (bash older than 4.4).
PB_ARGS+=(${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"})

echo "Running PinchBench..."
echo "  Model:  $MODEL"
echo "  Suite:  $SUITE"
echo "  Runs:   $RUNS"
echo "  Output: $RESULTS_DIR"
echo ""

./scripts/run.sh "${PB_ARGS[@]}"

echo ""
echo "Results written to $RESULTS_DIR"
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
# run-swebench.sh - Batch driver for CodeWhale SWE-bench runs.
#
# Usage:
#   ./scripts/benchmarks/run-swebench.sh --help
#   ./scripts/benchmarks/run-swebench.sh --dataset princeton-nlp/SWE-bench_Lite --split test
#   ./scripts/benchmarks/run-swebench.sh --instance-id django__django-12345 --issue-file issue.md
#
# Prerequisites:
#   - codewhale installed and on PATH
#   - DEEPSEEK_API_KEY set (or appropriate provider key)
#   - swebench pip package installed (for evaluation step)
#   - Docker running (for evaluation step)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

# Defaults
DATASET=""
SPLIT="test"
INSTANCE_ID=""
ISSUE_FILE=""
PREDICTIONS_PATH="./results/swebench_preds.jsonl"
MODEL=""
WORKSPACE_BASE="/tmp/swebench-workspaces"
EVAL_ONLY=false
MAX_WORKERS=1

usage() {
    cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Run CodeWhale on SWE-bench instances and produce prediction JSONL.

Options:
  --dataset DATASET         HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite)
  --split SPLIT             Dataset split (default: test)
  --instance-id ID          Run a single instance by ID
  --issue-file PATH         Issue text file for single-instance mode
  --predictions-path PATH   Output JSONL file (default: ./results/swebench_preds.jsonl)
  --model MODEL             Model override for CodeWhale
  --workspace-base DIR      Base dir for instance workspaces (default: /tmp/swebench-workspaces)
  --eval-only               Skip runs; just evaluate existing predictions file
  --max-workers N           Parallel workers for evaluation (default: 1)
  -h, --help                Show this help

Examples:
  # Run all instances from SWE-bench Lite
  $(basename "$0") --dataset princeton-nlp/SWE-bench_Lite --split test

  # Run a single instance
  $(basename "$0") --instance-id django__django-12345 --issue-file ./issue.md

  # Evaluate existing predictions
  $(basename "$0") --eval-only --predictions-path ./results/swebench_preds.jsonl
EOF
}

# Parse args
while [[ $# -gt 0 ]]; do
    case "$1" in
        --dataset) DATASET="$2"; shift 2 ;;
        --split) SPLIT="$2"; shift 2 ;;
        --instance-id) INSTANCE_ID="$2"; shift 2 ;;
        --issue-file) ISSUE_FILE="$2"; shift 2 ;;
        --predictions-path) PREDICTIONS_PATH="$2"; shift 2 ;;
        --model) MODEL="$2"; shift 2 ;;
        --workspace-base) WORKSPACE_BASE="$2"; shift 2 ;;
        --eval-only) EVAL_ONLY=true; shift ;;
        --max-workers) MAX_WORKERS="$2"; shift 2 ;;
        -h|--help) usage; exit 0 ;;
        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
    esac
done

mkdir -p "$(dirname "$PREDICTIONS_PATH")" "$WORKSPACE_BASE"

# Make the predictions path absolute: run_single_instance changes into the
# workspace before invoking codewhale, which would relocate a relative path.
PREDICTIONS_PATH="$(cd "$(dirname "$PREDICTIONS_PATH")" && pwd)/$(basename "$PREDICTIONS_PATH")"

# Record run metadata
METADATA_FILE="$(dirname "$PREDICTIONS_PATH")/run_metadata.json"
cat > "$METADATA_FILE" <<META
{
  "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
  "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
  "model": "${MODEL:-default}",
  "dataset": "${DATASET:-single-instance}",
  "split": "${SPLIT}",
  "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "platform": "$(uname -s)/$(uname -m)"
}
META
echo "Run metadata written to $METADATA_FILE"

run_single_instance() {
    local id="$1"
    local workspace="$WORKSPACE_BASE/$id"

    echo "=== Running instance: $id ==="

    # The instance workspace must already exist; this script does not clone it.
    if [[ ! -d "$workspace" ]]; then
        echo "  Workspace not found at $workspace" >&2
        echo "  For batch mode, pre-clone instance repos into $WORKSPACE_BASE/" >&2
        echo "  For single instance, use --issue-file with an existing workspace" >&2
        return 1
    fi

    # Copy the issue file, if provided, before changing directory so that a
    # relative --issue-file path resolves against the caller's cwd.
    if [[ -n "$ISSUE_FILE" && -f "$ISSUE_FILE" ]]; then
        cp "$ISSUE_FILE" "$workspace/issue.md"
    fi

    cd "$workspace"

    # Build the codewhale command
    local cw_args=("swebench" "run"
        "--instance-id" "$id"
        "--predictions-path" "$PREDICTIONS_PATH"
    )

    if [[ -n "$MODEL" ]]; then
        cw_args+=("--model" "$MODEL")
    fi

    codewhale "${cw_args[@]}"
    echo "  Prediction written for $id"
}

if [[ "$EVAL_ONLY" == true ]]; then
    echo "Evaluating existing predictions at $PREDICTIONS_PATH ..."
    python -m swebench.harness.run_evaluation \
        --dataset_name "${DATASET:-princeton-nlp/SWE-bench_Lite}" \
        --predictions_path "$PREDICTIONS_PATH" \
        --max_workers "$MAX_WORKERS" \
        --run_id "codewhale-$(date -u +%Y%m%d-%H%M%S)"
    exit 0
fi

if [[ -n "$INSTANCE_ID" ]]; then
    # Single-instance mode
    run_single_instance "$INSTANCE_ID"
elif [[ -n "$DATASET" ]]; then
    # Batch mode: requires a pre-prepared workspace directory structure
    echo "Batch mode for dataset: $DATASET (split: $SPLIT)"
    echo ""
    echo "To run batch SWE-bench:"
    echo "  1. Install swebench: pip install swebench"
    echo "  2. Prepare instance workspaces in $WORKSPACE_BASE/"
    echo "  3. For each instance, run:"
    echo "     $0 --instance-id <ID> --predictions-path $PREDICTIONS_PATH"
    echo "  4. Then evaluate:"
    echo "     $0 --eval-only --dataset $DATASET --predictions-path $PREDICTIONS_PATH"
    echo ""
    echo "Automated batch orchestration is planned for v0.9.0."
    echo "For now, use the SWE-bench docker harness to prepare workspaces."
else
    echo "Error: specify --dataset or --instance-id" >&2
    usage >&2
    exit 1
fi
@@ -0,0 +1,113 @@
#!/usr/bin/env bash
# run-terminal-bench.sh - Run CodeWhale on Terminal-Bench via Harbor.
#
# Usage:
#   ./scripts/benchmarks/run-terminal-bench.sh --help
#   ./scripts/benchmarks/run-terminal-bench.sh --dataset terminal-bench@2.0 --model deepseek/deepseek-chat
#
# Prerequisites:
#   - pip install harbor
#   - Docker running
#   - DEEPSEEK_API_KEY or OPENROUTER_API_KEY set

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

# Defaults
DATASET="terminal-bench@2.0"
MODEL="deepseek/deepseek-chat"
N_CONCURRENT=4
# --agent takes a Python import path (module:Class), the same form used in
# docs/BENCHMARKS.md, rather than a file path. Put the repository root on
# PYTHONPATH so the module is importable from any working directory.
AGENT_PATH="scripts.benchmarks.harbor.codewhale_agent:CodeWhaleAgent"
export PYTHONPATH="$REPO_ROOT${PYTHONPATH:+:$PYTHONPATH}"
RESULTS_DIR="./results/terminal-bench"
EXTRA_ARGS=()

usage() {
    cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Run CodeWhale on Terminal-Bench tasks via the Harbor framework.

Options:
  --dataset DATASET     Harbor dataset (default: terminal-bench@2.0)
  --model MODEL         Model in provider/name format (default: deepseek/deepseek-chat)
  --agent PATH          Harbor agent import path (default: local CodeWhale adapter)
  --n-concurrent N      Parallel task workers (default: 4)
  --results-dir DIR     Results output directory (default: ./results/terminal-bench)
  -- [EXTRA_ARGS...]    Additional arguments passed to 'harbor run'
  -h, --help            Show this help

Examples:
  # Default run
  $(basename "$0")

  # Custom model and concurrency
  $(basename "$0") --model deepseek/deepseek-reasoner --n-concurrent 8

  # Pass extra flags to harbor
  $(basename "$0") -- --env daytona
EOF
}

while [[ $# -gt 0 ]]; do
    case "$1" in
        --dataset) DATASET="$2"; shift 2 ;;
        --model) MODEL="$2"; shift 2 ;;
        --agent) AGENT_PATH="$2"; shift 2 ;;
        --n-concurrent) N_CONCURRENT="$2"; shift 2 ;;
        --results-dir) RESULTS_DIR="$2"; shift 2 ;;
        --) shift; EXTRA_ARGS=("$@"); break ;;
        -h|--help) usage; exit 0 ;;
        *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
    esac
done

# Check prerequisites
if ! command -v harbor &>/dev/null; then
    echo "Error: 'harbor' not found. Install with: pip install harbor" >&2
    exit 1
fi

if ! command -v docker &>/dev/null; then
    echo "Error: Docker not found. Harbor requires Docker." >&2
    exit 1
fi

mkdir -p "$RESULTS_DIR"

# Record metadata
METADATA_FILE="$RESULTS_DIR/run_metadata.json"
cat > "$METADATA_FILE" <<META
{
  "codewhale_version": "$(codewhale --version 2>/dev/null || echo unknown)",
  "git_commit": "$(cd "$REPO_ROOT" && git rev-parse HEAD 2>/dev/null || echo unknown)",
  "harbor_version": "$(harbor --version 2>/dev/null || echo unknown)",
  "model": "$MODEL",
  "dataset": "$DATASET",
  "agent": "codewhale",
  "n_concurrent": $N_CONCURRENT,
  "timestamp_utc": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "platform": "$(uname -s)/$(uname -m)"
}
META
echo "Run metadata: $METADATA_FILE"

# Run Harbor
echo "Running Terminal-Bench via Harbor..."
echo "  Dataset: $DATASET"
echo "  Model:   $MODEL"
echo "  Agent:   $AGENT_PATH"
echo "  Workers: $N_CONCURRENT"
echo ""

# The ${arr[@]+...} form avoids an "unbound variable" error under `set -u`
# when EXTRA_ARGS is empty (bash older than 4.4).
harbor run \
    --dataset "$DATASET" \
    --agent "$AGENT_PATH" \
    --model "$MODEL" \
    --n-concurrent "$N_CONCURRENT" \
    --results-dir "$RESULTS_DIR" \
    ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"}

echo ""
echo "Results written to $RESULTS_DIR"