codewhale

dgf1988/codewhale

Commit Graph

Author

SHA1

Message

Date

Hunter B

ce46e29e38

fix(benchmarks): fix workspace file copying and add LLM judge grading

Two bugs from the initial run:
1. workspace_files format is [{source, dest}] not {path, content} —
   files live in PinchBench's assets/ directory, not tasks/. Now checks
   both tasks/ and assets/ directories.
2. LLM judge tasks (writing, research) scored 0% because the judge
   wasn't implemented. Now uses codewhale exec as the judge — sends
   the rubric + workspace contents and parses a JSON score response.
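
A minimal sketch of the lookup described in item 1, assuming each workspace_files entry is a mapping with `source` and `dest` keys and that sources are resolved relative to the benchmark checkout. The function name `copy_workspace_files`, the `bench_root` parameter, and the tasks/-before-assets/ search order are hypothetical; the commit does not show the actual code.

```python
import shutil
from pathlib import Path


def copy_workspace_files(entries, bench_root, workspace):
    """Copy each {source, dest} entry into the workspace.

    Searches tasks/ first, then assets/ (the order is an assumption).
    Returns the list of destination paths that were written.
    """
    bench_root = Path(bench_root)
    workspace = Path(workspace)
    copied = []
    for entry in entries:
        source, dest = entry["source"], entry["dest"]
        for subdir in ("tasks", "assets"):
            candidate = bench_root / subdir / source
            if candidate.is_file():
                target = workspace / dest
                # Create nested destination directories on demand.
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(candidate, target)
                copied.append(target)
                break
        else:
            # Neither directory contained the file: fail loudly rather
            # than run the task with an incomplete workspace.
            raise FileNotFoundError(
                f"{source} not found under tasks/ or assets/"
            )
    return copied
```

Raising on a missing source is a design choice in this sketch: a silently incomplete workspace would otherwise show up later as a confusing grading failure.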

Also strips ANSI escape codes and control characters from judge output
to prevent JSON parse failures.
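
The cleanup step above could look roughly like the following sketch. It assumes the judge prints a single JSON object somewhere in its output, possibly surrounded by coloured text. The name `parse_judge_output` and the `score` key used in the usage example are illustrative, not taken from the commit.

```python
import json
import re

# ANSI CSI sequences such as colour codes and cursor movement.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Remaining C0 control characters (keeping tab, LF, CR) plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def parse_judge_output(raw):
    """Strip terminal escapes and control characters, then parse JSON.

    Takes the span from the first '{' to the last '}' so that any
    banner or log text around the object is ignored.
    """
    cleaned = CONTROL_CHARS.sub("", ANSI_CSI.sub("", raw))
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    return json.loads(cleaned[start:end + 1])
```

For example, `parse_judge_output('\x1b[32mResult:\x1b[0m {"score": 0.75}\x07')` returns a dict whose `score` is `0.75`. Removing the escape sequences before the other control characters matters: stripping the ESC byte first would leave fragments like `[32m` in the text.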

2026-06-05 15:57:06 -07:00

Hunter B

c8fcef7f1e

feat(benchmarks): add CodeWhale-native PinchBench runner

Runs PinchBench tasks directly through codewhale exec --auto instead
of going through OpenClaw. Loads task markdown, creates workspace,
runs the prompt, and grades using PinchBench's embedded automated
checks.

No external agent framework dependency — just codewhale + pyyaml.

2026-06-04 20:26:05 -07:00

2 Commits