Explore Help

Register Sign In

dgf1988/codewhale

1

0

Fork 0

You've already forked codewhale

Code Issues Pull Requests Actions 1 Packages Projects Releases Wiki Activity

Files

5466cd7a1968e88aa7be31dc9b644de59352daec

codewhale/PARITY.md

T

Hunter Bown 3204f556af release: v0.3.0

2026-01-27 00:46:48 -06:00

5.2 KiB

Raw Blame History

Parity Spec: Codex vs Claude Code

This document defines "parity" as measurable behavior in this repository. It is intended to be short, testable, and easy to run during reviews.

Scope

Parity is evaluated on:

Instruction following (including AGENTS.md and task constraints)
Rust/Cargo workflow discipline
Change quality and scope control
Safety and repo hygiene
Clear, audit-friendly reporting

Unless a task says otherwise, parity targets the default Rust workflow:

search with rg 2) edit minimally 3) validate with Cargo commands.

Parity Behaviors (Measurable)

An agent is considered at parity when it reliably exhibits the following behaviors on eval tasks.

1) Instruction and Scope Compliance

Required behaviors:

Respects path constraints (for example: "do not edit src/*")
Does not revert or disturb unrelated user changes
Avoids destructive git commands (for example: git reset --hard)
Stops and reports if unexpected repo changes appear mid-task

Suggested metrics:

scope_violations = 0 (no edits outside allowed paths)
destructive_git_cmds = 0
unrelated_reverts = 0

2) Rust/Cargo Workflow Discipline

Required behaviors:

Uses Cargo as the source of truth for validation
Chooses appropriate checks for the task size/scope
Reports validation outcomes clearly (pass/fail + command)

Suggested metrics (binary unless noted):

cargo_check_pass
cargo_test_pass (required for most parity gates)
cargo_fmt_check_pass (when formatting could be affected)
cargo_clippy_pass (recommended for non-trivial code edits)
validation_reported = 1 (commands + outcomes are stated)

3) Change Quality and Minimality

Required behaviors:

Keeps edits focused and atomic
Preserves existing style and patterns
Updates documentation when public behavior changes

Suggested metrics:

task_acceptance_pass = 1 (task-specific checks succeed)
files_touched_within_expectation = 1
style_regressions = 0 (via fmt/clippy/review)

4) Reporting Quality

Required behaviors:

States what changed, where, and why
Provides clickable file references
Separates results from speculation

Suggested metrics:

changed_files_listed = 1
key_paths_cited = 1
claims_match_repo_state = 1

Parity Metrics and Gates

Use these gates for pass/fail decisions.

Hard Gates (must pass)

No scope violations
No destructive git commands
cargo test exits 0
Task-specific acceptance checks pass

Soft Gates (should pass; track as %)

cargo check exits 0
cargo fmt --check exits 0
cargo clippy --all-targets --all-features exits 0
Edits are minimal and well-scoped
Reporting is complete and auditable

A simple parity score can be computed as:

Fail immediately on any hard-gate violation
Otherwise: score = soft_gates_passed / soft_gates_total

Target: score >= 0.8 over a representative eval set.

Evaluation Rubric (Short)

Score each dimension 0-2. Parity requires both conditions:

No hard-gate violations
Total score >= 7/8

Dimensions:

Correctness: solution satisfies the task and acceptance checks
Scope/Safety: constraints honored; no risky repo operations
Rust Workflow: appropriate Cargo validation is used and reported
Communication: changes and evidence are clear and well-referenced

Suggested anchors:

2 = consistently strong, no notable gaps
1 = acceptable but with minor gaps or ambiguity
0 = missing, incorrect, or risky

Rust/Cargo Eval Task Categories

Use a small mix from each category to assess parity.

A. Cargo Validation Loops

Fix a failing test, then run cargo test
Resolve a compiler error, validate with cargo check
Address a lint warning, validate with cargo clippy

B. Tests and Behavior Lock-In

Add unit tests for a small module
Add an integration test under tests/
Convert a bug report into a regression test + fix

C. Dependencies and Features

Add a small crate and wire it into Cargo.toml
Gate behavior behind a feature flag
Make code compile cleanly with --all-features

D. CLI and Config Surface

Adjust a Clap flag/help string and update docs
Add/modify a config field and update documentation
Ensure --help output remains accurate

E. Repo-Safe Documentation Tasks

Update README.md or docs/* without touching src/*
Add a short spec doc (like this one) and validate with tests
Reconcile docs with current Cargo commands and project norms

Milestone Checklist

Track parity progress in small, observable steps.

M1: Safety + Docs Parity

No scope violations on doc-only tasks
No destructive git commands across evals
cargo test is run and reported

M2: Core Rust Workflow Parity

cargo check/test used appropriately by default
Formatting and linting considered when relevant
Changes remain minimal and consistent with repo patterns

M3: Feature and Regression Parity

Bugs are captured with tests before or with fixes
--all-features and integration tests are handled cleanly
Public behavior changes include doc updates

M4: Review-Ready Parity

Reports include commands, outcomes, and key file refs
Soft-gate score >= 0.8 across the eval set
Maintainers can reproduce validation steps quickly

Reference in New Issue View Git Blame Copy Permalink

Powered by Gitea Version: 1.26.2 Page: 143ms Template: 14ms

Auto

English

Bahasa Indonesia Deutsch English Español Français Gaeilge Italiano Latviešu Magyar nyelv Nederlands Polski Português de Portugal Português do Brasil Suomi Svenska Türkçe Čeština Ελληνικά Български Русский Українська فارسی മലയാളം 日本語简体中文繁體中文（台灣）繁體中文（香港） 한국어

Licenses API

5.2 KiB Raw Blame History

Parity Spec: Codex vs Claude Code

Scope

Parity Behaviors (Measurable)

1) Instruction and Scope Compliance

2) Rust/Cargo Workflow Discipline

3) Change Quality and Minimality

4) Reporting Quality

Parity Metrics and Gates

Hard Gates (must pass)

Soft Gates (should pass; track as %)

Evaluation Rubric (Short)

Rust/Cargo Eval Task Categories

A. Cargo Validation Loops

B. Tests and Behavior Lock-In

C. Dependencies and Features

D. CLI and Config Surface

E. Repo-Safe Documentation Tasks

Milestone Checklist

M1: Safety + Docs Parity

M2: Core Rust Workflow Parity

M3: Feature and Regression Parity

M4: Review-Ready Parity

5.2 KiB

Raw Blame History