feat(fleet): add manager runbook skill

2026-06-12 22:26:35 -07:00
parent 33f112143d
commit fb1a27b4f2
3 changed files with 175 additions and 2 deletions
@@ -0,0 +1,106 @@
 ---
 name: fleet-manager
 description: Use when managing, triaging, restarting, escalating, or summarizing CodeWhale Agent Fleet runs and workers.
 metadata:
  short-description: Triage CodeWhale Agent Fleet runs
 ---
 # Fleet Manager
 Use this skill when acting as a manager agent for CodeWhale Agent Fleet runs.
 Your job is to classify worker state, choose the narrowest safe typed action,
 and leave a ledgered receipt or a safe escalation draft.
 ## Authority Boundary
 - Prefer typed fleet surfaces over shell spelunking: `codewhale fleet status`,
  `inspect`, `logs`, `artifacts`, `interrupt`, `restart`, `stop`, and the
  Runtime API fleet endpoints.
 - Do not read `.codewhale/fleet.jsonl`, host logs, or remote files directly
  unless the typed command or API is missing required evidence.
 - Do not send Slack, webhook, PagerDuty, email, or chat messages unless the
  user or run config explicitly authorizes sending. Draft the message instead.
 - Never include secrets, tokens, webhook URLs, routing keys, full prompts, or
  oversized logs in a summary or escalation.
 ## Triage Loop
 1. Identify the run and worker from the user request, run receipt, or fleet
   status output. If no worker is named, start with `codewhale fleet status`.
 2. Inspect the worker with `codewhale fleet inspect <worker-id>` or the matching
   Runtime API worker endpoint.
 3. Review bounded evidence with `codewhale fleet logs <worker-id>` and
   `codewhale fleet artifacts <worker-id>`. Summarize artifact refs, not full
   payloads.
 4. Classify the state before acting:
   - `transient failure`: transport error, timeout, stale heartbeat, host
     unavailable, or retryable provider/network failure.
   - `task failure`: worker completed the task but the result is wrong,
     missing required artifacts, or reports a domain error.
   - `verifier failure`: scorer/verifier failed or disagrees with the worker
     result.
   - `needs-human`: missing authority, unsafe secret boundary, destructive
     action, repeated restart exhaustion, ambiguous product decision, or
     conflict between artifacts and verifier.
 5. Choose one typed action:
   - transient and retry budget remains: `codewhale fleet restart <worker-id>`.
   - transient but unsafe to retry: draft escalation and mark needs-human.
   - task failure: preserve artifacts, summarize the failure, and avoid restart
     unless the task spec says retrying can produce new evidence.
   - verifier failure: inspect scorer inputs and artifacts, then escalate if the
     verifier cannot be corrected through a typed action.
   - needs-human: do not restart automatically; draft a concise escalation.
 6. Record the result in the response: classification, action taken or drafted,
   evidence commands, artifact refs, and next owner.
 ## Restart vs Escalate
 Restart only when all of these are true:
 - the failure is likely transient,
 - the task is idempotent or the run policy allows retry,
 - retry budget remains,
 - no secret, permission, or destructive action boundary is involved, and
 - the previous attempt produced enough receipt data to explain the restart.
 Escalate when any of these are true:
 - restart budget is exhausted,
 - the worker requests secrets or new authority,
 - artifacts indicate data loss, corruption, or destructive side effects,
 - the verifier and task result conflict in a way you cannot resolve from typed
  evidence,
 - the same failure repeats after a restart, or
 - a human product or release decision is required.
 ## Safe Escalation Draft
 Use this shape for Slack/PagerDuty drafts. Keep logs to three short lines or an
 artifact ref.
 ```text
 CodeWhale fleet needs attention
 Run: <run-id>
 Worker: <worker-id>
 Task: <task-id or unknown>
 Classification: <transient failure | task failure | verifier failure | needs-human>
 Reason: <one sentence, no secrets>
 Latest typed evidence: codewhale fleet inspect <worker-id>; codewhale fleet artifacts <worker-id>
 Safe log excerpt: <3 lines max or "see artifact <ref>">
 Requested decision: <restart approval | verifier review | task owner review | permission decision>
 ```
 ## Post-Run Receipt
 End every fleet-manager response with a compact receipt:
 ```text
 Fleet receipt
 Run: <run-id>
 Workers checked: <count/list>
 Classification: <state>
 Action: <restart/interrupt/stop/escalation draft/no-op>
 Ledger expectation: <typed action should be recorded | draft only, no send>
 Artifacts reviewed: <refs>
 Follow-up owner: <manager | task owner | human>
 ```
@@ -4,13 +4,14 @@
 use std::fs;
 use std::path::Path;
-const BUNDLED_SKILL_VERSION: &str = "3";
+const BUNDLED_SKILL_VERSION: &str = "4";
 const SKILL_CREATOR_BODY: &str = include_str!("../../assets/skills/skill-creator/SKILL.md");
 const DELEGATE_BODY: &str = include_str!("../../assets/skills/delegate/SKILL.md");
 const V4_BEST_PRACTICES_BODY: &str = include_str!("../../assets/skills/v4-best-practices/SKILL.md");
 const PLUGIN_CREATOR_BODY: &str = include_str!("../../assets/skills/plugin-creator/SKILL.md");
 const SKILL_INSTALLER_BODY: &str = include_str!("../../assets/skills/skill-installer/SKILL.md");
 const MCP_BUILDER_BODY: &str = include_str!("../../assets/skills/mcp-builder/SKILL.md");
 const FLEET_MANAGER_BODY: &str = include_str!("../../assets/skills/fleet-manager/SKILL.md");
 const DOCUMENTS_BODY: &str = include_str!("../../assets/skills/documents/SKILL.md");
 const PRESENTATIONS_BODY: &str = include_str!("../../assets/skills/presentations/SKILL.md");
 const SPREADSHEETS_BODY: &str = include_str!("../../assets/skills/spreadsheets/SKILL.md");
@@ -54,6 +55,11 @@ const BUNDLED_SKILLS: &[BundledSkill] = &[
        body: MCP_BUILDER_BODY,
        introduced_in: 3,
    },
    BundledSkill {
        name: "fleet-manager",
        body: FLEET_MANAGER_BODY,
        introduced_in: 4,
    },
    BundledSkill {
        name: "documents",
        body: DOCUMENTS_BODY,
@@ -370,7 +376,7 @@ mod tests {
        let tmp = TempDir::new().unwrap();
        // Simulate v2 where older bundled skills had been deliberately removed
-        // before v3 introduced more system skills.
+        // before later versions introduced more system skills.
        fs::write(marker_file(&tmp), "2").unwrap();
        install_system_skills(tmp.path()).unwrap();
@@ -269,6 +269,67 @@ POST /v1/fleet/runs/{run_id}/stop
 Action endpoints call the same manager controls as the CLI and record their
 decisions in the fleet ledger.
 ## Manager-Agent Runbook
 Manager agents should treat Fleet operations as typed, ledgered control-plane
 work. Start with `codewhale fleet status`, then inspect one run or worker with
 `codewhale fleet inspect <worker-id>`, `logs`, and `artifacts`. Use direct
 reads of `.codewhale/fleet.jsonl`, host logs, or remote files only when the
 typed CLI/API surface cannot provide the required evidence.
 Classify the worker before taking action:
 - `transient failure`: stale heartbeat, host timeout, interrupted transport,
  retryable provider/network error, or an adapter status that can plausibly
  recover without changing the task.
 - `task failure`: the worker completed but produced an incorrect result,
  domain failure, missing required artifact, or explicit task-level error.
 - `verifier failure`: the worker result exists, but the scorer/verifier failed,
  timed out, or disagrees with the receipt.
 - `needs-human`: missing authority, secret request, destructive operation,
  repeated restart exhaustion, ambiguous product decision, or conflicting
  evidence that the manager cannot resolve from typed artifacts.
 Choose one typed action:
 - Restart a worker only when the failure is transient, retry budget remains,
  the task is idempotent or retry-safe, and no permission or secret boundary is
  involved: `codewhale fleet restart <worker-id>`.
 - Interrupt or stop only when the current task is unsafe to continue or the
  operator explicitly asks for cancellation: `codewhale fleet interrupt
  <worker-id>` or `codewhale fleet stop --all`.
 - Do not restart pure task failures by default; preserve artifacts and hand the
  receipt to the task owner unless the task spec says retrying can produce new
  evidence.
 - For verifier failures, inspect scorer inputs and artifact refs first. If the
  verifier cannot be corrected through typed fleet actions, escalate for human
  review.
 - For `needs-human`, draft an escalation instead of sending it unless alert
  config explicitly authorizes sending.
 Safe Slack or PagerDuty draft:
 ```text
 CodeWhale fleet needs attention
 Run: <run-id>
 Worker: <worker-id>
 Task: <task-id or unknown>
 Classification: <transient failure | task failure | verifier failure | needs-human>
 Reason: <one sentence, no secrets>
 Latest typed evidence: codewhale fleet inspect <worker-id>; codewhale fleet artifacts <worker-id>
 Safe log excerpt: <3 lines max or "see artifact <ref>">
 Requested decision: <restart approval | verifier review | task owner review | permission decision>
 ```
 Post-run summaries should include the run id, workers checked, classification,
 typed action taken or drafted, expected ledger effect, artifact refs reviewed,
 and next owner. Keep summaries bounded; link artifact refs instead of copying
 full logs or transcripts.
 The bundled `fleet-manager` skill mirrors this runbook for manager agents. It
 is a first-party system skill and should be discoverable through the normal
 skill registry after system skills are installed or refreshed.
 ## Host Adapters
 The host adapter boundary supports local child processes and explicit SSH