dgf1988/codewhale

idling11 0cd8bcde1b feat(bench): add CLI comparison harness

Harvest #3009 for the v0.8.59 release lane. Adds a paired Terminal-Bench harness for CodeWhale and Codex, a Codex Harbor adapter, generated-result ignore protection, and benchmark docs.

Maintainer amendments keep explicit zero-valued metrics, regenerate parent task names, write refreshed summaries in regenerate mode, and allow transcript paths outside the repo.

Fixes #2952.

2026-06-12 01:15:00 -07:00

harbor | feat(bench): add CLI comparison harness | 2026-06-12 01:15:00 -07:00
cli-compare.py | feat(bench): add CLI comparison harness | 2026-06-12 01:15:00 -07:00
pinchbench_codewhale.py | fix(benchmarks): fix workspace file copying and add LLM judge grading | 2026-06-05 15:57:06 -07:00
README.md | feat(bench): add CLI comparison harness | 2026-06-12 01:15:00 -07:00
run-pinchbench.sh | feat(benchmarks): default PinchBench to direct MiMo routing, auto-read config | 2026-06-04 19:38:46 -07:00
run-swebench.sh | feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration | 2026-06-04 19:22:06 -07:00
run-terminal-bench.sh | feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration | 2026-06-04 19:22:06 -07:00

README.md

Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

Quick Start

# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# CodeWhale vs Codex comparison rows
python scripts/benchmarks/cli-compare.py \
  --task prove-plus-comm \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat
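
To collect comparison rows for more than one task, the cli-compare.py invocation above can be repeated once per task. The following is a minimal sketch, assuming only the `--task` and `--model` flags shown in the Quick Start. The helper names `build_compare_command` and `run_comparisons` are hypothetical, the check for `DEEPSEEK_API_KEY` is an assumption based on the export step above, and nothing is assumed about the script's output format.

```python
import os
import subprocess
import sys

# Script path as it appears in the Quick Start.
SCRIPT = "scripts/benchmarks/cli-compare.py"


def build_compare_command(task, model, python=sys.executable):
    """Return the argv list for one cli-compare.py run (hypothetical helper)."""
    return [python, SCRIPT, "--task", task, "--model", model]


def run_comparisons(tasks, model, dry_run=True):
    """Build one command per task; execute them only when dry_run is False."""
    # Assumption: a real run needs the API key exported in the Quick Start.
    if not dry_run and not os.environ.get("DEEPSEEK_API_KEY"):
        raise RuntimeError("DEEPSEEK_API_KEY is not set")
    commands = [build_compare_command(task, model) for task in tasks]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in run_comparisons(["prove-plus-comm"], "deepseek/deepseek-chat"):
        print(" ".join(cmd))
```

Running with `dry_run=True` first makes it easy to review the exact commands before spending benchmark time or API quota.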

Files

run-swebench.sh — SWE-bench batch driver and evaluator
run-terminal-bench.sh — Terminal-Bench runner via Harbor
run-pinchbench.sh — PinchBench runner with auto-install
cli-compare.py — CodeWhale/Codex Terminal-Bench comparison harness
harbor/__init__.py — Harbor adapter for CodeWhale (Python)
harbor/codewhale_agent.py — Adapter entry point
harbor/codex_agent.py — Codex adapter for paired CLI comparisons
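
Before a run, it can help to confirm that a checkout contains the files listed above. The following is a minimal sketch, assuming the files live under `scripts/benchmarks/` as the Quick Start paths suggest. The function name `missing_benchmark_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names copied from the Files list in this README.
EXPECTED_FILES = [
    "run-swebench.sh",
    "run-terminal-bench.sh",
    "run-pinchbench.sh",
    "cli-compare.py",
    "harbor/__init__.py",
    "harbor/codewhale_agent.py",
    "harbor/codex_agent.py",
]


def missing_benchmark_files(base_dir="scripts/benchmarks"):
    """Return the listed files that are absent under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_benchmark_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All listed benchmark files are present.")
```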

Documentation

See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.