Hunter B ce46e29e38 fix(benchmarks): fix workspace file copying and add LLM judge grading
Two bugs from the initial run:
1. workspace_files format is [{source, dest}] not {path, content} —
   files live in PinchBench's assets/ directory, not tasks/. Now checks
   both tasks/ and assets/ directories.
2. LLM judge tasks (writing, research) scored 0% because the judge
   wasn't implemented. Now uses codewhale exec as the judge — sends
   the rubric + workspace contents and parses a JSON score response.
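
Item 1 above can be sketched in Python. This is a minimal illustration, not the script's actual code: the function name `copy_workspace_files`, the `bench_root` argument, and the tasks/-before-assets/ lookup order are assumptions. Only the [{source, dest}] entry shape and the two directory names come from the commit message.

```python
import shutil
from pathlib import Path


def copy_workspace_files(entries, bench_root, workspace):
    """Copy each {source, dest} entry into the workspace.

    Hypothetical helper: looks for the source under tasks/ first, then
    assets/ (assumed order). Returns the sources that were not found.
    """
    bench_root, workspace = Path(bench_root), Path(workspace)
    missing = []
    for entry in entries:
        # Candidate locations for this source file, in lookup order.
        candidates = [bench_root / d / entry["source"] for d in ("tasks", "assets")]
        src = next((p for p in candidates if p.is_file()), None)
        if src is None:
            missing.append(entry["source"])
            continue
        target = workspace / entry["dest"]
        # dest may include subdirectories that do not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)
    return missing
```

Returning the missing sources instead of raising lets a caller decide whether an absent fixture should fail the task or only be logged.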

Also strips ANSI escape codes and control characters from judge output
to prevent JSON parse failures.
2026-06-05 15:57:06 -07:00
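
The judge-output cleanup described in the commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark script itself: the function name `parse_judge_score` and the `score` field name are hypothetical, and the real judge response may use a different schema.

```python
import json
import re

# CSI sequences such as "\x1b[32m" (colors) or "\x1b[2K" (erase line).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
# Remaining C0 control characters (except tab, newline, carriage return) and DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def parse_judge_score(raw):
    """Return the judge's score as a float, or None if no usable JSON is found."""
    # Strip escape sequences first, then any leftover control characters.
    text = CONTROL_CHARS.sub("", ANSI_CSI.sub("", raw))
    # Take the outermost {...} span so surrounding chatter is ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    score = payload.get("score")  # "score" is an assumed field name
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score)
```

Returning `None` on any parse failure keeps a malformed judge reply distinguishable from a genuine score of 0.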

Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

Quick Start

# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat

Files

  • run-swebench.sh — SWE-bench batch driver and evaluator
  • run-terminal-bench.sh — Terminal-Bench runner via Harbor
  • run-pinchbench.sh — PinchBench runner with auto-install
  • harbor/__init__.py — Harbor adapter for CodeWhale (Python)
  • harbor/codewhale_agent.py — Adapter entry point

Documentation

See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.