b329a532f5
Benchmark harness for evaluating CodeWhale against three external benchmarks: - SWE-bench: batch driver wrapping existing codewhale swebench commands - Terminal-Bench: Harbor adapter (BaseInstalledAgent) for container eval - PinchBench: runner with auto-install for real-world agent tasks Includes docs/BENCHMARKS.md umbrella doc with setup, usage, and reproducibility checklist. Scripts record version/commit/timestamp metadata for each run. Branch: codex/v0.8.53-benchmarks (based on v0.8.53)
5 lines
145 B
Python
5 lines
145 B
Python
"""Harbor adapter entry point for CodeWhale."""
|
|
from scripts.benchmarks.harbor import CodeWhaleAgent # noqa: F401
|
|
|
|
__all__ = ["CodeWhaleAgent"]
|