ce46e29e38
Two bugs from the initial run:
1. workspace_files entries use the format [{source, dest}], not
{path, content}, and the files live in PinchBench's assets/ directory,
not tasks/. Now checks both the tasks/ and assets/ directories.
2. LLM judge tasks (writing, research) scored 0% because the judge
wasn't implemented. Now uses codewhale exec as the judge: it sends
the rubric and the workspace contents, then parses a JSON score response.
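As a rough illustration of the first fix, the sketch below stages workspace_files entries by looking each source up in tasks/ and then assets/. Only the source and dest keys and the two directory names come from the notes above. The function names, the lookup order, and the copy step are assumptions for illustration, not the runner's actual code.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_workspace_file(bench_root: Path, source: str) -> Optional[Path]:
    """Return the first existing file for `source`, or None if not found.

    Assumption: tasks/ is checked before assets/; the real order may differ.
    """
    for subdir in ("tasks", "assets"):
        candidate = bench_root / subdir / source
        if candidate.is_file():
            return candidate
    return None


def stage_workspace_files(bench_root: Path, workspace: Path, entries: list) -> list:
    """Copy each {source, dest} entry into the workspace.

    Returns the list of source names that could not be found in either directory.
    """
    missing = []
    for entry in entries:
        src = resolve_workspace_file(bench_root, entry["source"])
        if src is None:
            missing.append(entry["source"])
            continue
        target = workspace / entry["dest"]
        # Create nested destination directories before copying.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, target)
    return missing
```

Returning the missing sources instead of raising lets a caller decide whether an absent file should fail the task or only be logged.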
Also strips ANSI escape codes and control characters from judge output
to prevent JSON parse failures.
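A minimal sketch of the output cleanup described above, assuming the judge prints a JSON object somewhere in its output and that the object carries a numeric "score" field. The field name, the helper names, and the brace-scanning heuristic are hypothetical; only the ANSI and control-character stripping and the JSON parse are taken from the notes.

```python
import json
import re
from typing import Optional

# ANSI escape sequences: CSI sequences (colors, cursor moves) and two-byte ESC codes.
ANSI_RE = re.compile(r"\x1b(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")
# Control characters other than tab, newline, and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_judge_output(raw: str) -> str:
    """Remove ANSI escape codes first, then any remaining control characters."""
    return CONTROL_RE.sub("", ANSI_RE.sub("", raw))


def parse_judge_score(raw: str) -> Optional[float]:
    """Extract a numeric score from judge output, or None if none can be parsed.

    Assumption: the score lives under a "score" key in the outermost JSON object.
    """
    text = clean_judge_output(raw)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    score = payload.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, (int, float)) and not isinstance(score, bool):
        return float(score)
    return None
```

Stripping the escape sequences before the control characters matters: removing the ESC byte first would leave fragments such as "[32m" behind in the text.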
Benchmark Scripts
Convenience runners for evaluating CodeWhale against external benchmarks.
Quick Start
# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
Files
run-swebench.sh: SWE-bench batch driver and evaluator
run-terminal-bench.sh: Terminal-Bench runner via Harbor
run-pinchbench.sh: PinchBench runner with auto-install
harbor/__init__.py: Harbor adapter for CodeWhale (Python)
harbor/codewhale_agent.py: Adapter entry point
Documentation
See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.