dgf1988/codewhale

Hunter Bown 8cb4f94f30 docs: v0.8.53 tool-surface-diet design + north-star direction

Design-only deliverables for the v0.8.53 "tool surface diet / canonical
surfaces" cutover (no catalog code in this cycle). Grounded in a verified
inventory of the actual tool registry.

- docs/TOOL_LIFECYCLE.md (#2681): the umbrella policy. Five lifecycle states
  (active / deferred / hidden-compatibility / deprecated / removed) modeled as
  const name-sets + an alias table in tool_catalog.rs (not a per-ToolSpec
  field), so registration stays untouched and old transcripts always replay.
  Includes the deprecation manifest (exec_wait/exec_interact/tts →
  hidden-compat; todo_* → checklist_* deprecated; 11 legacy subagent names are
  already non-visible dead code → cleanup + guardrail), per-mode/per-provider
  active-catalog budget (incl. Arcee's 8-tool first-turn set), prefix-cache
  safety rules, and the tool_agent decision: canonical but DeepSeek-V4-gated.
- docs/CODEBASE_SEARCH_DESIGN.md (#2680, v0.9.0): local-first FTS5/BM25 +
  symbol/path ranking + RRF hybrid; rusqlite storage; mtime/branch/vendor
  invalidation; an explainable tool contract returning reasons[]; and a real
  CodeWhale query eval set. Complements grep_files/file_search, never replaces.
- docs/SKILL_INVOCATION_DESIGN.md (0.9.0): the $<skill-name> inline invocation
  syntax (the token IS the skill name), namespaced resolution, ambiguity-
  suggests-not-guesses, visible activation line, and a smallest-viable slice.
- docs/VISION_NORTH_STAR.md (0.9.0+): intent router, hybrid codebase
  intelligence, WhaleFlow typed workflow IR, skills/rules runtime, the layered
  context-memory stack, tool repair/autoload, the evaluation loop, and the
  command-surface taxonomy (/memory small · /context dashboard · /rules ·
  /workflow · /overlay · $<skill> · codebase_search). Marked DIRECTION, not
  committed 0.8.53 work; also records the deferred-not-done diet items.

Targets codex/v0.8.53.

2026-06-03 11:47:29 -07:00

Tool-Surface Lifecycle Policy (v0.8.53)

Status: Design doc / policy. No catalog code lands in this cycle — the code work is deferred. This document is the umbrella policy for GitHub #2681, with #2682 and #2683 as concrete instances of the planned diet. It describes what will be done and the invariants any future diet PR must hold.

Scope of related open work (do not contradict):

PR #2684 — subagent role vocabulary, lifecycle signals, eval ergonomics. Legacy subagent-name cleanup + guardrail tests in this policy rebase on #2684.
PR #2685 — git-history active + RLM/field errors.

All file:line citations are against the verified tree at /Users/huntermbown/Desktop/whalebro/codewhale as of v0.8.52/0.8.53.

1. Purpose and the weaker-model problem

CodeWhale ships a large native tool surface. The first-turn active partition of that surface is what every model sees before it has run a single tool_search_* call; the deferred partition is what it can discover afterward. Today the model-visible surface (active plus deferred) contains several near-duplicate tools that map to the same implementation under different names:

exec_wait and exec_shell_wait are both ShellWaitTool (crates/tui/src/tools/registry.rs:526,529).
exec_interact and exec_shell_interact are both ShellInteractTool (registry.rs:527,530).
tts and speech are both SpeechTool (registry.rs:787-792, both deferred).
todo_write and checklist_write are the same TodoWriteTool constructed two ways (crates/tui/src/tools/todo.rs:184-196).

For a strong model, redundant names are harmless noise. For weaker / smaller models (the Arcee Trinity lane, deepseek-v4-flash child executors, and any non-thinking executor), every additional near-duplicate in the visible set is a real cost:

It widens the choice space with options that do nothing distinct, increasing wrong-tool selection and oscillation between synonyms.
It spends scarce first-turn catalog budget (Section 5) on zero-information entries.
It dilutes the "one name = one thing" contract that lets a small model reason about the surface at all.

The lifecycle policy exists to shrink and discipline the model-visible surface without ever breaking the ability to replay an old transcript that referenced a now-retired name.

2. The five lifecycle states

Every native tool name occupies exactly one lifecycle state.

State	Meaning	Visible on first turn?	In `tool_search_*`?	Executes if called?	When used
active	Canonical, in the first-turn catalog head	Yes	n/a (already active)	Yes	The tool a model should reach for by default
deferred	Registered + discoverable, hydrated on demand	No	Yes	Yes	Real, useful tools that don't earn a first-turn slot
hidden-compatibility	Registered + dispatchable, but removed from active and from search	No	No	Yes — identical behavior, silent	Old synonym kept only so old transcripts replay; no model should newly discover it
deprecated	Like hidden-compat, but execution appends a replacement notice to result metadata	No	No	Yes — works, plus a "use X instead" notice	A retired name we actively steer callers off of, still safe to replay
removed	Not registered at all	No	No	No — hard error	Only after `planned_removal_version`, once replay support is formally dropped

hidden-compatibility vs deprecated — be precise

Both states are invisible (not active, not in tool search) and both remain dispatchable (calling them still works). The only difference is the caller-facing signal:

hidden-compatibility: completely silent. The tool behaves byte-for-byte like its canonical twin. We use this when there is no behavioral or naming lesson to teach — the name was a pure alias and we simply don't want models re-learning it. (Example: exec_wait is literally exec_shell_wait.)
deprecated: behaves identically and succeeds, but the tool result's metadata carries an appended notice like "deprecated: use checklist_write instead". The notice goes only in the result metadata returned for that call — never in the cached tool catalog prefix (see Section 8). We use this when there is a canonical replacement we want the caller (and any human reading the transcript) nudged toward.

Neither state ever changes the behavior of the call. Replay always works.
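
As a reading aid for the state table, the four yes/no columns can be written as predicates over a state value. This is a minimal sketch only: the `Lifecycle` enum and its method names are hypothetical and are not part of the planned code, which stores lifecycle as name-sets rather than as a per-tool field.

```rust
/// The five lifecycle states from the table above.
/// Illustrative only: the policy keeps lifecycle in name-sets, not on each tool.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    /// Present in the first-turn catalog head.
    pub fn visible_on_first_turn(self) -> bool {
        matches!(self, Lifecycle::Active)
    }

    /// Discoverable through tool search (active tools need no discovery).
    pub fn in_tool_search(self) -> bool {
        matches!(self, Lifecycle::Deferred)
    }

    /// Still executes when called; only `Removed` is a hard error.
    pub fn dispatchable(self) -> bool {
        !matches!(self, Lifecycle::Removed)
    }

    /// Only `Deprecated` adds a replacement notice to result metadata.
    pub fn appends_notice(self) -> bool {
        matches!(self, Lifecycle::Deprecated)
    }
}
```

Read this way, hidden-compatibility and deprecated differ in exactly one predicate, `appends_notice`, which matches the distinction drawn above.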

3. Representation in code

The lifecycle is represented as const name-sets plus an alias/manifest table in crates/tui/src/core/engine/tool_catalog.rs, alongside the existing DEFAULT_ACTIVE_NATIVE_TOOLS (tool_catalog.rs:37-64) and ARCEE_FIRST_TURN_NATIVE_TOOLS (tool_catalog.rs:106-115).

3a. Name-sets and the manifest (sketch)

// crates/tui/src/core/engine/tool_catalog.rs  (planned)

/// Tools removed from the active set AND from tool-search, but still
/// registered and dispatchable with byte-identical behavior. Silent.
pub(super) const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &[
    "exec_wait",          // == exec_shell_wait  (ShellWaitTool)
    "exec_interact",      // == exec_shell_interact (ShellInteractTool)
    "tts",                // == speech (SpeechTool)
];

/// Deprecated aliases: invisible + dispatchable, with a replacement notice
/// appended to RESULT METADATA only (never the cached prefix).
pub(super) struct DeprecatedAlias {
    pub name: &'static str,
    pub replacement: &'static str,
    pub note: &'static str,
}

pub(super) const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
    DeprecatedAlias { name: "todo_write",  replacement: "checklist_write",
                      note: "use checklist_write instead" },
    DeprecatedAlias { name: "todo_add",    replacement: "checklist_add",
                      note: "use checklist_add instead" },
    DeprecatedAlias { name: "todo_update", replacement: "checklist_update",
                      note: "use checklist_update instead" },
    DeprecatedAlias { name: "todo_list",   replacement: "checklist_list",
                      note: "use checklist_list instead" },
];

#[inline]
pub(super) fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_ALIASES.iter().any(|d| d.name == name)
}

3b. The two filter points

Catalog / tool-search exclusion (tool_catalog.rs). Deferral is decided by should_default_defer_tool (tool_catalog.rs:66-82), and the active set is the head built by build_model_tool_catalog (tool_catalog.rs:178-196). Hidden-compat and deprecated tools must be forced out of the active head and out of the tool-search-discoverable pool. Concretely, the deferral predicate gains a short-circuit so these names are never active, and the tool-search index builder skips any name for which is_hidden_or_deprecated(name) is true. Arcee's narrowed first-turn path (apply_provider_tool_policy, tool_catalog.rs:134-149) already excludes them by construction since they aren't in ARCEE_FIRST_TURN_NATIVE_TOOLS.
Result-notice append (tool_routing.rs). Dispatch already routes by tool name in crates/tui/src/tui/tool_routing.rs (e.g. the wait/interact unification at tool_routing.rs:1139-1140). After a successful dispatch, if the called name is in DEPRECATED_ALIASES, the router appends the matching note to the result metadata only. Hidden-compat names append nothing.
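
The two filter points could look roughly like the sketch below. The name-sets are copied from the planned manifest; everything else is an assumption made for illustration, including the function names `searchable_names` and `append_deprecation_notice`, the `deprecation_notice` metadata key, and the use of a string map for result metadata. The real router and index builder may be shaped quite differently.

```rust
use std::collections::BTreeMap;

// Copied from the planned manifest in 3a.
const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED_NOTES: &[(&str, &str)] = &[
    ("todo_write", "use checklist_write instead"),
    ("todo_add", "use checklist_add instead"),
    ("todo_update", "use checklist_update instead"),
    ("todo_list", "use checklist_list instead"),
];

fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_NOTES.iter().any(|(alias, _)| *alias == name)
}

/// Filter point 1 (hypothetical helper): keep only names that may enter
/// the tool-search pool. Registration itself is untouched.
fn searchable_names<'a>(registered: &[&'a str]) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| !is_hidden_or_deprecated(name))
        .collect()
}

/// Filter point 2 (hypothetical helper): after a successful dispatch, attach
/// the replacement note to this call's result metadata only.
/// Hidden-compat names fall through and add nothing.
fn append_deprecation_notice(called: &str, metadata: &mut BTreeMap<String, String>) {
    if let Some((_, note)) = DEPRECATED_NOTES.iter().find(|(alias, _)| *alias == called) {
        // "deprecation_notice" is an assumed key name.
        metadata.insert("deprecation_notice".to_string(), format!("deprecated: {note}"));
    }
}
```

Neither helper touches the emitted tool JSON, which is the property Section 8 depends on.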

3c. Why name-sets, not a per-`ToolSpec` enum field

A per-ToolSpec lifecycle: Lifecycle field was rejected for three reasons:

Prefix-cache safety. The tool catalog array is part of DeepSeek's immutable KV prefix (tool_catalog.rs:169-177). A per-spec field invites serializing lifecycle state into each tool's schema, which is exactly the kind of head mutation that forces a full re-prefill. Name-sets live entirely in the catalog-build logic and never touch the emitted tool JSON.
Single source of truth + diffability. The diet for a release is one small, reviewable edit to two or three const arrays in one file, instead of scattered field flips across many tool modules.
Registration stays orthogonal. Tools remain registered exactly as today (e.g. with_shell_tools, registry.rs:523-531). Lifecycle is a catalog policy layered on top of registration, not a property baked into the tool.

4. Deprecation manifest (the #2681 acceptance-criteria table)

This is the authoritative manifest. Columns are the #2681 AC columns. No entry is "removed" in 0.8.53; replay is supported for everything listed.

Alias	Replacement (canonical)	Lifecycle state	first_deprecated_version	planned_removal_version	replay_supported
`exec_wait`	`exec_shell_wait`	hidden-compatibility	0.8.53	TBD (≥ 0.9.x)	Yes
`exec_interact`	`exec_shell_interact`	hidden-compatibility	0.8.53	TBD (≥ 0.9.x)	Yes
`tts`	`speech`	hidden-compatibility	0.8.53	TBD (≥ 0.9.x)	Yes
`todo_write`	`checklist_write`	deprecated	0.8.53	TBD (≥ 0.9.x)	Yes
`todo_add`	`checklist_add`	deprecated	0.8.53	TBD (≥ 0.9.x)	Yes
`todo_update`	`checklist_update`	deprecated	0.8.53	TBD (≥ 0.9.x)	Yes
`todo_list`	`checklist_list`	deprecated	0.8.53	TBD (≥ 0.9.x)	Yes

Legacy subagent names — already non-visible, no manifest entry needed. agent_spawn, spawn_agent, agent_result, agent_wait, agent_send_input, send_input, agent_assign, agent_list, agent_cancel, resume_agent, and delegate_to_agent exist only as #[allow(dead_code)] structs in crates/tui/src/tools/subagent/mod.rs and are never instantiated outside tests, so they are already not model-visible. Only agent_open, agent_eval, tool_agent, and agent_close are registered (registry.rs:1017-1029). The action for these legacy names is dead-code cleanup + a guardrail test (rebase on PR #2684), not a lifecycle transition.

Keep the live internal methods. send_input, cancel, and resume also exist as live SubAgentManager methods (subagent/mod.rs:1605,1495,1521) used internally by agent_eval / agent_close. These are not the dead-code tool structs and must be kept.

planned_removal_version is intentionally TBD: a name only moves to removed once we formally drop replay for transcripts old enough to contain it, which is a separate, deliberate decision per name.

5. Active-catalog budget (per mode, per provider)

The active set is the first-turn cost. Current default active set: DEFAULT_ACTIVE_NATIVE_TOOLS has 25 entries (tool_catalog.rs:37-64).

Per provider

Provider	First-turn active source	Current count	Target after diet
Default (DeepSeek et al.)	`DEFAULT_ACTIVE_NATIVE_TOOLS`	25	23 after dropping `exec_wait` and `exec_interact`; stable target ≤ 22 (`todo_*` already not active)
Arcee (Trinity)	`ARCEE_FIRST_TURN_NATIVE_TOOLS`	8 (read-only WAF workaround)	8 (unchanged)

The default diet removes exec_wait and exec_interact from the active head (they become hidden-compat; their canonical twins exec_shell_wait / exec_shell_interact stay). tts and todo_* are already not in the active set, so the wait/interact removal alone moves the active count from 25 to 23. The broader target is a stable budget of at most 22 canonical tools.

Per mode (Plan / Agent / YOLO)

The native active head is the same set across modes by design — mode does not add or remove native tools from DEFAULT_ACTIVE_NATIVE_TOOLS (should_default_defer_tool ignores _mode for native tools, tool_catalog.rs:66-68). Mode affects MCP deferral instead: apply_mcp_tool_deferral keeps MCP tools deferred unless mode == Yolo (tool_catalog.rs:162-167).

Mode	Native active budget	MCP tools active?
Plan	same native head (target ≤ 22)	No (deferred)
Agent	same native head	No (deferred)
YOLO	same native head	Yes (a known, intentional widening)

Budget rule: the native active head must stay byte-identical across Plan ↔ Agent ↔ YOLO (Section 8). Any growth of the head requires retiring something else or an explicit budget bump in this doc.
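
The per-mode rule can be stated in a few lines. This is a sketch under assumptions: the `Mode` enum, `native_head`, and `mcp_tools_active` are hypothetical stand-ins, and `NATIVE_HEAD_SAMPLE` is a three-name placeholder rather than the real 25-entry `DEFAULT_ACTIVE_NATIVE_TOOLS`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Plan,
    Agent,
    Yolo,
}

/// Placeholder for the native active head (NOT the real list).
const NATIVE_HEAD_SAMPLE: &[&str] = &["exec_shell_interact", "exec_shell_wait", "grep_files"];

/// The native head ignores the mode entirely, so it is the same
/// slice for Plan, Agent, and YOLO.
fn native_head(_mode: Mode) -> &'static [&'static str] {
    NATIVE_HEAD_SAMPLE
}

/// Mode only changes whether MCP tools leave the deferred pool.
fn mcp_tools_active(mode: Mode) -> bool {
    mode == Mode::Yolo
}
```

Because `native_head` never reads its argument, a mode switch cannot change the native head; only the MCP partition widens under YOLO.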

6. The canonical-surface rule

Every model-visible (active or deferred-discoverable) tool must have one clear niche. If a tool is superseded, it gets a named replacement and moves to hidden-compatibility or deprecated — it does not stay visible.

Canonical vs compatibility summary for the confusing clusters

Cluster	Canonical (keep visible)	Compatibility / retired	Notes
Shell wait	`exec_shell_wait`	`exec_wait` → hidden-compat	Same `ShellWaitTool` (`registry.rs:526,529`); router already unifies (`tool_routing.rs:1139`)
Shell interact	`exec_shell_interact`	`exec_interact` → hidden-compat	Same `ShellInteractTool` (`registry.rs:527,530`)
Checklist / todo	`checklist_write`	`todo_write/add/update/list` → deprecated	Same `TodoWriteTool`, `::new` vs `::checklist` (`todo.rs:184-196`)
Speech / tts	`speech`	`tts` → hidden-compat	Same `SpeechTool` (`registry.rs:787-792`)
Subagent lifecycle	`agent_open`, `agent_eval`, `agent_close`, `tool_agent` (gated, §7)	all 11 legacy names → already non-visible dead code	Cleanup + guardrail test, rebase on #2684
Edit family	`apply_patch`, `edit_file`, `write_file`, `fim_edit`	none — all distinct niches	NOT touched (per #2681 non-goals); doc-only canonical guidance
Search family	`grep_files` (content), `file_search` (filename), `project_map` (structure)	none — distinct niches	NOT touched; no FTS5/BM25/semantic index exists today

Non-goals (explicitly NOT diet targets in this cycle, per #2681): apply_patch / edit_file / write_file / fim_edit; grep_files / file_search / project_map; fetch_url / web.run / web_search; task_shell_*; handle_read / retrieve_tool_result. These have distinct niches and receive canonical guidance only — no lifecycle change.

The RLM surface (rlm_open / rlm_eval / rlm_configure / rlm_close / rlm_session_objects, crates/tui/src/tools/rlm.rs) is likewise out of scope; handle_read retrieves var handles, and finalize / FINAL is an in-kernel Python function, not a tool — so there is nothing to retire there.

7. `tool_agent` decision: canonical but DeepSeek-V4-gated

tool_agent stays as a canonical subagent tool (registry.rs:1024, ToolAgentTool). It is the fast, non-thinking "Fin" executor lane, built on deepseek-v4-flash (cf. DEFAULT_CHILD_MODEL = "deepseek-v4-flash", rlm.rs:26).

Decision: gate tool_agent to DeepSeek-V4 models only.

It is purpose-built around the V4-flash non-thinking executor profile. Exposing it to other providers (e.g. Arcee Trinity, which is already WAF-narrowed to 8 read-only tools, tool_catalog.rs:106-115) offers no working executor lane and only adds a confusing, mis-targeted option to weaker surfaces.
Gating is a provider/model policy, consistent with the existing provider-aware first-turn policy (apply_provider_tool_policy, tool_catalog.rs:134-149): on non-DeepSeek-V4 models, tool_agent is excluded from the active set and from tool-search discovery. It remains registered and dispatchable so transcripts created under a V4 model replay everywhere.

This is not a lifecycle transition — tool_agent is canonical. It is a visibility gate layered on the same machinery as the Arcee narrowing.
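
A visibility gate of this kind might be sketched as follows. The model check is an assumption: this document does not say how a DeepSeek-V4 model is recognized, so the `deepseek-v4` ID prefix, the function names, and the `arcee-trinity` model ID used in the usage note are all hypothetical.

```rust
/// ASSUMPTION: DeepSeek-V4 models are recognized by a "deepseek-v4" ID prefix
/// (for example "deepseek-v4-flash"). The real check may differ.
fn is_deepseek_v4(model_id: &str) -> bool {
    model_id.starts_with("deepseek-v4")
}

/// Hypothetical gate applied to the model-visible names (active or searchable).
/// It only filters visibility; registration and dispatch are not affected,
/// so a transcript that calls `tool_agent` still replays on any model.
fn gate_visible_tools<'a>(model_id: &str, visible: &[&'a str]) -> Vec<&'a str> {
    visible
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || is_deepseek_v4(model_id))
        .collect()
}
```

For example, `gate_visible_tools("deepseek-v4-flash", ...)` leaves the list unchanged, while a non-V4 ID such as `"arcee-trinity"` drops only `tool_agent`.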

8. Prefix-cache safety + replay guarantee

Prefix-cache rules every diet PR MUST follow

The tools array is part of DeepSeek's immutable KV prefix. The catalog-head byte-stability invariant (tool_catalog.rs:169-196) is binding:

Never mutate the active head non-deterministically. The first-turn active block must be byte-identical run-to-run and across Plan ↔ Agent ↔ YOLO.
A diet is a one-time deterministic edit. Removing a name from DEFAULT_ACTIVE_NATIVE_TOOLS shifts the head exactly once; after that it must be stable. Land such edits as their own focused change.
Notices live in result metadata, never the prefix. Deprecated replacement notes are appended at dispatch time in tool_routing.rs to the call result only. Nothing about hidden/deprecated state may be serialized into a tool schema, description, or the catalog array.
Preserve ordering and partitioning. build_model_tool_catalog sorts each partition by name and keeps built-ins as a contiguous prefix ahead of MCP tools (tool_catalog.rs:186-194). Diet edits must not break this.
Hidden/deprecated tools are excluded before the head is built, so their removal is the only head change — they do not appear in the prefix at all.
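
The ordering and exclusion rules above can be illustrated with a small deterministic build step. This is a sketch, not the real `build_model_tool_catalog`: the `ToolEntry` type, the `build_catalog` function, and the `mcp_x` tool name in the usage note are hypothetical.

```rust
/// Hypothetical catalog entry: a name plus whether it comes from MCP.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ToolEntry {
    name: String,
    is_mcp: bool,
}

/// Drop excluded names first, then order deterministically:
/// built-ins as a contiguous prefix, MCP tools after, each sorted by name.
fn build_catalog(mut tools: Vec<ToolEntry>, excluded: &[&str]) -> Vec<ToolEntry> {
    // Exclusion happens before the head is built, so retired names
    // never reach the serialized prefix.
    tools.retain(|t| !excluded.contains(&t.name.as_str()));
    // `false < true`, so built-ins (is_mcp == false) sort ahead of MCP tools.
    tools.sort_by(|a, b| (a.is_mcp, &a.name).cmp(&(b.is_mcp, &b.name)));
    tools
}
```

Given the same set of tools in any registration order, the output is identical, which is what a golden byte test would pin. For instance, `apply_patch`, `grep_files`, `exec_wait`, and an MCP tool `mcp_x` with `exec_wait` excluded always yield `apply_patch`, `grep_files`, `mcp_x`.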

Old-transcript replay guarantee

For every name in the deprecation manifest with replay_supported = Yes, the tool stays registered and dispatchable with identical behavior. Replaying an old transcript that calls exec_wait, exec_interact, tts, or any todo_* produces the same result it always did. Deprecated names additionally attach a result-metadata notice; hidden-compat names are silent. A name is only ever made non-dispatchable (removed) after a deliberate, per-name decision to drop replay support at planned_removal_version.

9. Required tests

Any diet PR (and the umbrella #2681 work) must add/keep:

Duplicate-active-alias guard. A test asserting that no name in HIDDEN_COMPATIBILITY_TOOLS or DEPRECATED_ALIASES appears in DEFAULT_ACTIVE_NATIVE_TOOLS or ARCEE_FIRST_TURN_NATIVE_TOOLS, and that no two active entries resolve to the same underlying tool implementation.
Tool-search exclusion test. Assert that hidden-compat and deprecated names are absent from the tool-search-discoverable pool while remaining present in the registry (dispatchable).
Replay / dispatch tests. For each manifest name, calling it still executes and returns the same result as its canonical twin. Deprecated names additionally assert the replacement note is present in result metadata and absent from the catalog/prefix. Hidden-compat names assert no added notice.
Golden active-block byte test. A snapshot test pinning the byte serialization of the first-turn active tool block, asserting it is identical across Plan / Agent / YOLO (native head) and stable run-to-run — enforcing the tool_catalog.rs:169-196 invariant. The golden updates only as a reviewed, deliberate one-time edit when the diet lands.
Subagent guardrail test (rebase on #2684). Assert only agent_open, agent_eval, tool_agent, agent_close are registered as model-visible subagent tools and that no legacy name from subagent/mod.rs is instantiated outside tests.
tool_agent gating test. Assert tool_agent is active/discoverable only under DeepSeek-V4 models and excluded (but still registered) elsewhere.
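
The duplicate-active-alias guard could be built on two small helpers like the ones below. This is a sketch under assumptions: the helper names are hypothetical, and active entries are modeled as `(tool name, implementation name)` pairs, which is not how the registry exposes them today.

```rust
use std::collections::BTreeMap;

/// Returns active names that also appear in a retired set
/// (hidden-compat or deprecated). The guard expects this to be empty.
fn retired_names_in_active<'a>(active: &[&'a str], retired: &[&str]) -> Vec<&'a str> {
    active
        .iter()
        .copied()
        .filter(|name| retired.contains(name))
        .collect()
}

/// Groups active names by implementation and returns every implementation
/// that is reachable under more than one active name.
fn shared_implementations<'a>(
    active: &[(&'a str, &'a str)],
) -> Vec<(&'a str, Vec<&'a str>)> {
    let mut by_impl: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (name, implementation) in active.iter().copied() {
        by_impl.entry(implementation).or_default().push(name);
    }
    by_impl
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect()
}
```

A guard test would assert that both helpers return an empty vector for `DEFAULT_ACTIVE_NATIVE_TOOLS` and `ARCEE_FIRST_TURN_NATIVE_TOOLS`. Against the pre-diet pairing of `exec_wait` and `exec_shell_wait` on `ShellWaitTool`, the second helper reports the duplicate.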
