dgf1988/codewhale

Hanmiao Li 5d9b5f67cb feat(bench): improve cli-compare harness with real Harbor integration (#3009)

- Match actual Harbor CLI interface (no invented flags)
- Proper BaseInstalledAgent subclass for Codex
- Robust token extraction from stream JSONL + transcript parsing
- Heuristic answer_len extraction (## Final Answer markers)
- Metadata capture: versions, git commit, platform, timestamp
- --regenerate walks existing run directories
- All missing fields explicit null, never zero
- Support multiple runs per task with run_idx tracking
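
A minimal sketch of how the token extraction, answer_len heuristic, and explicit-null rule above might fit together. This is not the code in cli-compare.py: the `usage.total_tokens` event shape, the exact marker regex, and the function names `sum_tokens`, `answer_len`, and `normalize` are all assumptions made for illustration.

```python
import json
import re
from typing import Optional

# Assumed marker shape; the commit message only says "## Final Answer markers".
FINAL_ANSWER_RE = re.compile(r"^## Final Answer[ \t]*$", re.MULTILINE)


def answer_len(transcript: str) -> Optional[int]:
    """Length of the text after the last marker, or None if no marker exists."""
    matches = list(FINAL_ANSWER_RE.finditer(transcript))
    if not matches:
        return None  # missing stays None (JSON null), never 0
    return len(transcript[matches[-1].end():].strip())


def sum_tokens(jsonl_text: str) -> Optional[int]:
    """Sum a hypothetical usage.total_tokens field across JSONL events."""
    total: Optional[int] = None
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines in the stream
        usage = event.get("usage") if isinstance(event, dict) else None
        if isinstance(usage, dict) and isinstance(usage.get("total_tokens"), int):
            total = (total or 0) + usage["total_tokens"]
    return total


def normalize(task: str, run_idx: int, jsonl_text: str, transcript: str) -> dict:
    """Build one result row; fields that could not be measured are None."""
    return {
        "task": task,
        "run_idx": run_idx,
        "tokens": sum_tokens(jsonl_text),
        "answer_len": answer_len(transcript),
    }
```

Keeping unmeasured fields as None means `json.dumps` writes them as `null`, so a run with no token data cannot be mistaken for a run that used zero tokens.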

The harness is designed to run:
    harbor run --dataset terminal-bench@2.0:<task> --agent ... --model ...

for both codex and codewhale agents, then normalize the results.
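
As a rough illustration, the invocation above could be assembled as an argument list. The sketch uses only the flags shown in the commit message (`--dataset`, `--agent`, `--model`). The helper name `harbor_command` and the task, agent, and model values in the usage are placeholders, since the commit message elides the real ones.

```python
import shlex
from typing import List


def harbor_command(task: str, agent: str, model: str) -> List[str]:
    """Build the argv for one Harbor run, using only flags named in the commit message."""
    return [
        "harbor", "run",
        "--dataset", f"terminal-bench@2.0:{task}",
        "--agent", agent,
        "--model", model,
    ]


if __name__ == "__main__":
    # Placeholder values; the real agent identifiers and model names are not given in the source.
    for agent in ("codex", "codewhale"):
        argv = harbor_command("example-task", agent, "example-model")
        print(shlex.join(argv))  # print instead of executing, so the sketch has no side effects
```

Building a list rather than a shell string avoids quoting problems if the list is later passed to `subprocess.run`.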

2026-06-12 10:53:36 -07:00

Name | Last commit | Date
harbor | feat(bench): improve cli-compare harness with real Harbor integration (#3009) | 2026-06-12 10:53:36 -07:00
cli-compare.py | feat(bench): improve cli-compare harness with real Harbor integration (#3009) | 2026-06-12 10:53:36 -07:00
pinchbench_codewhale.py | fix(benchmarks): fix workspace file copying and add LLM judge grading | 2026-06-05 15:57:06 -07:00
README.md | feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration | 2026-06-04 19:22:06 -07:00
run-pinchbench.sh | feat(benchmarks): default PinchBench to direct MiMo routing, auto-read config | 2026-06-04 19:38:46 -07:00
run-swebench.sh | feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration | 2026-06-04 19:22:06 -07:00
run-terminal-bench.sh | feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration | 2026-06-04 19:22:06 -07:00

README.md

Benchmark Scripts

Convenience runners for evaluating CodeWhale against external benchmarks.

Quick Start

# Set your API key
export DEEPSEEK_API_KEY="sk-..."

# SWE-bench (single instance)
./scripts/benchmarks/run-swebench.sh \
  --instance-id django__django-12345 \
  --issue-file ./issue.md

# Terminal-Bench (via Harbor)
./scripts/benchmarks/run-terminal-bench.sh \
  --model deepseek/deepseek-chat

# PinchBench (auto-install + run)
./scripts/benchmarks/run-pinchbench.sh \
  --install \
  --model deepseek/deepseek-chat

Files

run-swebench.sh — SWE-bench batch driver and evaluator
run-terminal-bench.sh — Terminal-Bench runner via Harbor
run-pinchbench.sh — PinchBench runner with auto-install
harbor/__init__.py — Harbor adapter for CodeWhale (Python)
harbor/codewhale_agent.py — Adapter entry point

Documentation

See docs/BENCHMARKS.md for full setup instructions, reproducibility checklists, and references.