codewhale

dgf1988/codewhale

Author	SHA1	Message	Date
Hanmiao Li	5d9b5f67cb	feat(bench): improve cli-compare harness with real Harbor integration (#3009). Changes: match actual Harbor CLI interface (no invented flags); proper BaseInstalledAgent subclass for Codex; robust token extraction from stream JSONL + transcript parsing; heuristic answer_len extraction (## Final Answer markers); metadata capture (versions, git commit, platform, timestamp); --regenerate walks existing run directories; all missing fields explicit null, never zero; support multiple runs per task with run_idx tracking. The harness is designed to run: harbor run --dataset terminal-bench@2.0:<task> --agent ... --model ... for both codex and codewhale agents, then normalize the results.	2026-06-12 10:53:36 -07:00
Hunter B	ce46e29e38	fix(benchmarks): fix workspace file copying and add LLM judge grading. Two bugs from the initial run: (1) workspace_files format is [{source, dest}], not {path, content}; files live in PinchBench's assets/ directory, not tasks/. Now checks both tasks/ and assets/ directories. (2) LLM judge tasks (writing, research) scored 0% because the judge wasn't implemented. Now uses codewhale exec as the judge: sends the rubric + workspace contents and parses a JSON score response. Also strips ANSI escape codes and control characters from judge output to prevent JSON parse failures.	2026-06-05 15:57:06 -07:00
Hunter B	c8fcef7f1e	feat(benchmarks): add CodeWhale-native PinchBench runner. Runs PinchBench tasks directly through codewhale exec --auto instead of going through OpenClaw. Loads task markdown, creates workspace, runs the prompt, and grades using PinchBench's embedded automated checks. No external agent framework dependency, just codewhale + pyyaml.	2026-06-04 20:26:05 -07:00
Hunter B	b7798ba0f6	feat(benchmarks): default PinchBench to direct MiMo routing, auto-read config. PinchBench runner now defaults to the direct Xiaomi API (no OpenRouter). Reads the API key from ~/.codewhale/config.toml [providers.xiaomi_mimo] when the XIAOMI_MIMO_API_KEY env var is not set. The --openrouter flag selects the old OpenRouter path.	2026-06-04 19:38:46 -07:00
Hunter B	a5f27aae3a	feat(benchmarks): default PinchBench to MiMo v2.5 Pro, add direct-mimo routing. PinchBench runner now defaults to openrouter/xiaomi/mimo-v2.5-pro instead of deepseek/deepseek-chat. Adds --direct-mimo flag for routing through Xiaomi's API directly (bypasses OpenRouter), with tp-/sk- key type detection and endpoint mismatch warnings. Harbor adapter gains --provider CLI flag for MiMo provider routing. Known issues documented in docs/MIMO_BENCHMARK_ISSUES.md: PinchBench model validation requires OpenRouter prefix; OPENROUTER_API_KEY needed even for some direct-provider paths; Token Plan vs pay-as-you-go key/endpoint mismatch; PinchBench runs through OpenClaw, not CodeWhale.	2026-06-04 19:33:43 -07:00
Hunter B	b329a532f5	feat(benchmarks): add SWE-bench, Terminal-Bench, and PinchBench integration. Benchmark harness for evaluating CodeWhale against three external benchmarks: SWE-bench (batch driver wrapping existing codewhale swebench commands); Terminal-Bench (Harbor adapter, BaseInstalledAgent, for container eval); PinchBench (runner with auto-install for real-world agent tasks). Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and reproducibility checklist. Scripts record version/commit/timestamp metadata for each run. Branch: codex/v0.8.53-benchmarks (based on v0.8.53)	2026-06-04 19:22:06 -07:00
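
Commit ce46e29e38 mentions stripping ANSI escape codes and control characters from judge output before parsing a JSON score. The repository's actual code is not shown here, so the following is only a minimal sketch of that idea. The function names and the "score" field are assumptions made for illustration.

```python
import json
import re

# CSI sequences (colors, cursor moves) plus simple two-character escapes.
ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|\x1b[@-Z\\-_]")
# Control characters other than tab, line feed, and carriage return.
CTRL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_judge_output(raw: str) -> str:
    """Remove ANSI escapes first, then any remaining control characters."""
    return CTRL_RE.sub("", ANSI_RE.sub("", raw))


def parse_judge_score(raw: str):
    """Return the numeric score, or None when none can be recovered.

    Assumes (hypothetically) that the judge replies with a JSON object
    containing a numeric "score" field, possibly surrounded by other text.
    """
    cleaned = clean_judge_output(raw)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(cleaned[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score)
```

Returning None instead of 0 for an unparseable reply keeps a failed judge call distinguishable from a genuinely failing grade.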

6 Commits