docs: v0.8.53 tool-surface-diet design + north-star direction

Design-only deliverables for the v0.8.53 "tool surface diet / canonical surfaces" cutover (no catalog code in this cycle). Grounded in a verified inventory of the actual tool registry.

- docs/TOOL_LIFECYCLE.md (#2681): the umbrella policy. Five lifecycle states (active / deferred / hidden-compatibility / deprecated / removed) modeled as const name-sets + an alias table in tool_catalog.rs (not a per-ToolSpec field), so registration stays untouched and old transcripts always replay. Includes the deprecation manifest (exec_wait/exec_interact/tts → hidden-compat; todo_* → checklist_* deprecated; 11 legacy subagent names are already non-visible dead code → cleanup + guardrail), per-mode/per-provider active-catalog budget (incl. Arcee's 8-tool first-turn set), prefix-cache safety rules, and the tool_agent decision: canonical but DeepSeek-V4-gated.
- docs/CODEBASE_SEARCH_DESIGN.md (#2680, v0.9.0): local-first FTS5/BM25 + symbol/path ranking + RRF hybrid; rusqlite storage; mtime/branch/vendor invalidation; an explainable tool contract returning reasons[]; and a real CodeWhale query eval set. Complements grep_files/file_search, never replaces.
- docs/SKILL_INVOCATION_DESIGN.md (0.9.0): the $<skill-name> inline invocation syntax (the token IS the skill name), namespaced resolution, ambiguity-suggests-not-guesses, visible activation line, and a smallest-viable slice.
- docs/VISION_NORTH_STAR.md (0.9.0+): intent router, hybrid codebase intelligence, WhaleFlow typed workflow IR, skills/rules runtime, the layered context-memory stack, tool repair/autoload, the evaluation loop, and the command-surface taxonomy (/memory small · /context dashboard · /rules · /workflow · /overlay · $<skill> · codebase_search). Marked DIRECTION, not committed 0.8.53 work; also records the deferred-not-done diet items.

Targets codex/v0.8.53.
2026-06-03 11:47:29 -07:00
parent 03d1bba538
commit 8cb4f94f30
4 changed files with 1371 additions and 0 deletions
@@ -0,0 +1,300 @@
+# `codebase_search` — Local-First Semantic Code Retrieval
+
+> **Status:** Design note + eval scaffold. **Code is DEFERRED.**
+> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle).
+> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.
+
+This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.
+
+---
+
+## 1. Problem
+
+Today CodeWhale ships two complementary code-locating tools and one structure map:
+
+- `file_search` — **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`).
+- `grep_files` — **content** search (literal/regex token match).
+- `project_map` — a deferred **structure** map.
+
+None of these answer **concept-level** questions where the user does not know the exact token:
+
+- "Where is provider auth resolved?"
+- "What enforces the shell approval policy?"
+- "Where do mode prompts get assembled?"
+- "How does the subagent lifecycle close out a child?"
+
+`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.
+
+**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations.
+
+**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.)
+
+There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies.
+
+---
+
+## 2. Approach Comparison
+
+| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
+|---|---|---|---|---|---|
+| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes — SQLite built in via `rusqlite` | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** |
+| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** |
+| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
+| **Dense embeddings** | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
+| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |
+
+### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)
+
+Each enabled signal produces an independent ranked list; results are merged with RRF
+(`score(d) = Σ_signals 1/(k + rank_signal(d))`, conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
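
The fusion rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the planned implementation: the function name `rrf_fuse`, the use of string ids for chunks, and the tie-break by id are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each inner list is one signal's ranking, best first (index 0 is rank 1).
/// `k` is the damping constant; 60.0 is the conventional value.
/// (Hypothetical helper: name and signature are assumptions for this sketch.)
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (idx, doc) in list.iter().enumerate() {
            let rank = (idx + 1) as f64; // ranks are 1-based
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the sum, so a BM25 score and a cosine similarity never need to share a scale. A document missing from one signal's list simply contributes nothing for that signal.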
+
+**v0.9.0 Phase 1 signal set (all local, no model downloads):**
+
+1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer.
+2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms.
+3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`).
+4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
+5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries.
+
+**Optional later backends (feature-flagged, off by default):**
+
+- `--features sparse-splade` — adds a SPLADE signal list to the RRF.
+- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
+- `--features rerank` — cross-encoder reranks the fused top-K.
+
+Phase 1 deliberately omits all three ML backends so the tool ships with zero network/model dependency and is reproducible in CI.
+
+---
+
+## 3. Storage & Index
+
+### Location
+
+```
+~/.codewhale/index/<workspace-hash>.db
+```
+
+`<workspace-hash>` is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep).
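
As a rough sketch of how a workspace root could map to an index file: the note does not say which hash function backs `<workspace-hash>`, so the FNV-1a hash below is purely an assumption chosen because it is small and stable across builds. The helper names `fnv1a64` and `index_db_path` are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// FNV-1a 64-bit. ASSUMPTION: the design note does not name the real hash;
/// this one is used only because it is tiny and stable across builds.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Map a workspace root to `<home>/.codewhale/index/<hash>.db`.
/// The caller is expected to canonicalize `workspace_root` first
/// (e.g. with `std::fs::canonicalize`) so symlinked paths hash identically.
fn index_db_path(home: &Path, workspace_root: &Path) -> PathBuf {
    let hash = fnv1a64(workspace_root.to_string_lossy().as_bytes());
    home.join(".codewhale")
        .join("index")
        .join(format!("{hash:016x}.db"))
}
```

The standard library's `DefaultHasher` is avoided here on purpose: its algorithm is not guaranteed to stay the same between Rust releases, which would silently orphan existing index files.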
+
+> Migration note (ties to the `/memory doctor` taxonomy in Appendix B): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored (a future `doctor` may offer to migrate it, but it is never read automatically).
+
+### Schema sketch
+
+```sql
+CREATE TABLE files (
+  id            INTEGER PRIMARY KEY,
+  path          TEXT NOT NULL UNIQUE,   -- workspace-relative
+  mtime_ns      INTEGER NOT NULL,       -- invalidation
+  size_bytes    INTEGER NOT NULL,
+  content_hash  TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
+  lang          TEXT,                   -- detected language
+  branch        TEXT                    -- branch at last index (invalidation)
+);
+
+CREATE TABLE chunks (
+  id          INTEGER PRIMARY KEY,
+  file_id     INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
+  start_line  INTEGER NOT NULL,
+  end_line    INTEGER NOT NULL,
+  kind        TEXT,                     -- fn | struct | impl | const | doc | block
+  symbol      TEXT,                     -- primary symbol name if any
+  text        TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
+);
+
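+-- Lexical index. External-content FTS5 table, so chunk bodies are stored only once.
+-- (External-content tables are not synced automatically; writes to `chunks` must also update `chunks_fts`.)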
+CREATE VIRTUAL TABLE chunks_fts USING fts5(
+  text,
+  symbol,
+  content='chunks',
+  content_rowid='id',
+  tokenize = 'unicode61 remove_diacritics 2'   -- + identifier pre-split at index time
+);
+
+CREATE TABLE symbols (
+  id        INTEGER PRIMARY KEY,
+  file_id   INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
+  chunk_id  INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
+  name      TEXT NOT NULL,
+  kind      TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
+  line      INTEGER NOT NULL
+);
+CREATE INDEX symbols_name ON symbols(name);
+
+-- Session relevance: lightweight touch log, written by the session, decayed on read.
+CREATE TABLE session_touch (
+  path        TEXT PRIMARY KEY,
+  last_touch  INTEGER NOT NULL,         -- unix ns
+  touch_count INTEGER NOT NULL DEFAULT 1
+);
+```
+
+Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags.
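
The index-time identifier split might look roughly like the following. It is a sketch under stated assumptions: the function name `split_identifier` is hypothetical, and the rule used here (break on `_`, `-`, and a lowercase-or-digit to uppercase transition) is one simple choice that does not split acronym runs such as `HTTPServer`.

```rust
/// Split an identifier into lowercase words on `_`, `-`, and camelCase humps.
/// (Hypothetical helper; acronym runs like `HTTPServer` are left as one word.)
fn split_identifier(ident: &str) -> String {
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut prev_lower_or_digit = false;
    for ch in ident.chars() {
        if ch == '_' || ch == '-' {
            // Separators end the current word and are dropped.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            prev_lower_or_digit = false;
            continue;
        }
        // A lowercase letter or digit followed by an uppercase letter starts a new word.
        if ch.is_uppercase() && prev_lower_or_digit && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower_or_digit = ch.is_lowercase() || ch.is_ascii_digit();
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Because both spellings collapse to the same word sequence, a query for "api key" can match either `resolveApiKey` or `resolve_api_key` with a plain FTS5 `MATCH`.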
+
+### Chunking strategy (structure-aware)
+
+Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
+
+### Invalidation
+
+- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged.
+- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
+- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift.
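
One way to read the mtime and content-hash rules is as a three-way decision per file. The sketch below assumes the caller has already computed the current hash; the type and function names (`IndexedFile`, `Refresh`, `plan_refresh`) and the middle "update metadata only" outcome are illustrative assumptions, not part of the design.

```rust
/// What the `files` table remembers about a file from the last index run.
struct IndexedFile {
    mtime_ns: i64,
    content_hash: String,
}

/// Outcome of comparing a file on disk against its stored row.
#[derive(Debug, PartialEq)]
enum Refresh {
    Skip,           // mtime and hash both unchanged
    UpdateMetadata, // same content, new mtime: refresh the row, keep the chunks
    Rechunk,        // new file or changed content
}

/// Decide how much work a file needs. (Hypothetical helper for this sketch.)
fn plan_refresh(stored: Option<&IndexedFile>, mtime_ns: i64, content_hash: &str) -> Refresh {
    match stored {
        None => Refresh::Rechunk,
        Some(f) if f.content_hash != content_hash => Refresh::Rechunk,
        Some(f) if f.mtime_ns != mtime_ns => Refresh::UpdateMetadata,
        Some(_) => Refresh::Skip,
    }
}
```

A branch switch that rewrites a file with identical bytes lands in the middle case, which is why the branch re-check stays cheap.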
+
+### Privacy / trust
+
+- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine.
+- **No cloud by default.** Phase 1 has zero network dependency.
+- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.
+
+---
+
+## 4. Model-Visible Tool Contract
+
+```jsonc
+// codebase_search
+{
+  "name": "codebase_search",
+  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
+  "parameters": {
+    "query":      { "type": "string",  "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
+    "max_results":{ "type": "integer", "default": 10 },
+    "path_glob":  { "type": "string",  "description": "Optional path filter, e.g. 'crates/tui/**'." },
+    "lang":       { "type": "string",  "description": "Optional language filter." },
+    "kind":       { "type": "string",  "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
+  }
+}
+```
+
+**Result shape — ranked, explainable, auditable:**
+
+```jsonc
+{
+  "results": [
+    {
+      "path": "crates/tui/src/config/provider.rs",
+      "line": 142,
+      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
+      "score": 0.91,
+      "reasons": [
+        "symbol: resolve_api_key matches 'auth/resolve'",
+        "lexical: matched tokens [provider, api, key, resolve]",
+        "path: component 'provider' matches query",
+        "session: file read 2 turns ago"
+      ]
+    }
+  ],
+  "backend": "lexical+symbol+path+session",   // which signals were fused (RRF)
+  "fallback_grep_hits": 1                       // exact-match hits folded in
+}
+```
+
+`reasons[]` is **mandatory** and is the auditability contract: every result explains *why* it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The `backend` field records which signals were active so results are reproducible given the feature set.
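
The "mandatory `reasons[]`" rule could be enforced at construction time, so a result without an explanation cannot exist. The struct below mirrors the JSON result shape; the constructor name and the error type are assumptions for illustration only.

```rust
/// One ranked hit, mirroring the JSON result shape.
#[derive(Debug, Clone, PartialEq)]
struct SearchResult {
    path: String,
    line: u32,
    snippet: String,
    score: f64,
    reasons: Vec<String>,
}

impl SearchResult {
    /// Build a result, refusing one that cannot explain its own ranking.
    /// (Hypothetical constructor; the real contract may validate elsewhere.)
    fn new(
        path: &str,
        line: u32,
        snippet: &str,
        score: f64,
        reasons: Vec<String>,
    ) -> Result<Self, String> {
        if reasons.is_empty() {
            return Err(format!("result for {path}:{line} has no reasons[]"));
        }
        Ok(Self {
            path: path.to_string(),
            line,
            snippet: snippet.to_string(),
            score,
            reasons,
        })
    }
}
```

Rejecting the empty case at the type boundary keeps the auditability guarantee independent of which ranking signals are enabled.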
+
+---
+
+## 5. Benchmark / Eval Set
+
+A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line.
+
+| # | Query (concept, no exact token) | Expected file(s) | Anchor |
+|---|---|---|---|
+| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module |
+| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 |
+| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 |
+| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 |
+| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 |
+| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 |
+| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage |
+| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 |
+| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 |
+| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 |
+| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~421; skill struct ~382-388 |
+| 12 | Where is skill enable-state stored / checked? | `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 |
+| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 |
+| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 |
+| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` ≡ `tts` :787-792 |
+
+Each entry is a `(query, expected_paths[])` row in a fixture (e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). Phase 1 ships the harness that runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 ≥ target) gates future ranking changes.
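
For reference, the two metrics the harness reports can be computed as follows. This is a minimal sketch matching on file paths only, as the eval does; the function names and the in-memory row type are assumptions, and fixture loading is left out.

```rust
/// Fraction of expected paths that appear in the top `k` results.
fn recall_at_k(ranked: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &ranked[..ranked.len().min(k)];
    let hits = expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / expected.len() as f64
}

/// Reciprocal rank of the first expected path (0.0 if none is returned).
fn reciprocal_rank(ranked: &[&str], expected: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|p| expected.contains(p))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean reciprocal rank over all `(ranked, expected)` query rows.
fn mean_reciprocal_rank(rows: &[(Vec<&str>, Vec<&str>)]) -> f64 {
    if rows.is_empty() {
        return 0.0;
    }
    let total: f64 = rows.iter().map(|(r, e)| reciprocal_rank(r, e)).sum();
    total / rows.len() as f64
}
```

Recall@k rewards finding every expected file, while MRR rewards putting the first correct file near the top, so the two can move independently when ranking changes.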
+
+---
+
+## 6. Phasing, Feature Flags, and Non-Goals
+
+### Phasing
+
+- **Phase 0 (this cycle, v0.8.53):** this design note + eval fixture only. No catalog code.
+- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/<workspace-hash>.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
+- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
+- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
+- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K.
+
+### Feature flags
+
+```
+codebase-search-core    # Phase 1, default-on once it lands
+sparse-splade           # Phase 3, default-off
+dense-embed             # Phase 3, default-off (gated HF download)
+rerank                  # Phase 4, default-off
+```
+
+### Non-goals
+
+- **No cloud index.** Phase 1 never requires one; the core experience is fully local.
+- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback.
+- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.
+
+### Cross-link: WhaleFlow epic
+
+`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).
+
+---
+
+## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)
+
+These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.
+
+### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)
+
+| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
+|---|---|---|---|---|
+| **active** | yes | yes | yes | no |
+| **deferred** | no | yes | yes | no |
+| **hidden-compatibility** | no | no | yes | no |
+| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) |
+| **removed** | no | no | no | — |
+
+Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant).
+
+### Planned diet (documented, not yet coded)
+
+- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools:
+  - `exec_wait` ≡ `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`.
+  - `exec_interact` ≡ `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`).
+  - `tts` ≡ `speech` (same `SpeechTool`, `registry.rs:787-792`).
+  - Action: drop from active + search, keep registered, identical behavior, **no notice**.
+- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only).
+- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired.
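
The state table and the planned diet can be pictured as const name-sets plus an alias table, roughly as below. This is a sketch only: the constant and function names are hypothetical, `ACTIVE` holds a single illustrative entry rather than the real first-turn set, and treating any unlisted name as deferred is an assumption made to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    fn in_first_turn(self) -> bool { self == Lifecycle::Active }
    fn in_tool_search(self) -> bool { matches!(self, Lifecycle::Active | Lifecycle::Deferred) }
    fn dispatchable(self) -> bool { self != Lifecycle::Removed }
    fn adds_result_notice(self) -> bool { self == Lifecycle::Deprecated }
}

// Name-sets follow the planned diet; ACTIVE is a one-item illustrative subset.
const ACTIVE: &[&str] = &["checklist_write"];
const HIDDEN_COMPAT: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];
const REMOVED: &[&str] = &[];

// Alias table: duplicate name -> canonical tool name.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];

/// Look a tool name up in the name-sets.
/// ASSUMPTION: names in no set are treated as deferred in this sketch.
fn lifecycle_of(name: &str) -> Lifecycle {
    if ACTIVE.contains(&name) {
        Lifecycle::Active
    } else if HIDDEN_COMPAT.contains(&name) {
        Lifecycle::HiddenCompatibility
    } else if DEPRECATED.contains(&name) {
        Lifecycle::Deprecated
    } else if REMOVED.contains(&name) {
        Lifecycle::Removed
    } else {
        Lifecycle::Deferred
    }
}

/// Resolve a duplicate name to its canonical tool; other names pass through.
fn canonical_name(name: &str) -> &str {
    ALIASES
        .iter()
        .find(|(alias, _)| *alias == name)
        .map_or(name, |(_, canon)| *canon)
}
```

Keeping the state in name-sets rather than on each tool spec means registration code never changes, so a transcript that calls `exec_wait` still dispatches while the name stays out of the first-turn block and out of tool-search.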
+
+### Model-visible subagent surface (unchanged)
+
+Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`).
+
+- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.
+
+### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)
+
+`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is.
+
+---
+
+## Appendix B — Command-Surface Taxonomy (context)
+
+Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces:
+
+- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path).
+- `/context` — dashboard of all active layers.
+- `/rules` — repo guidance.
+- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic).
+- `/overlay` — promoted cached-main lessons.
+- `$<skill-name>` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`).
+- `codebase_search` — concept-level code retrieval (this document).
@@ -0,0 +1,233 @@
+# Skill Invocation Design — the `$<skill-name>` inline syntax
+
+Status: **DESIGN ONLY** (v0.8.53 cycle). No catalog/parser code ships in this
+cycle; the implementation target is **0.9.0**. This document describes what
+*will* be built and the contracts it must honor against today's code.
+
+Related design docs: `TOOL_LIFECYCLE.md` (tool lifecycle states + per-skill tool
+restriction), command-surface taxonomy notes for `/memory`, `/context`,
+`/rules`, `/workflow` (`/whaleflow`), `/overlay`. Open PRs on `codex/v0.8.53`:
+#2684 (subagent role vocab / lifecycle signals / eval ergonomics) and #2685
+(git history active + RLM/field errors). Nothing here contradicts those.
+
+---
+
+## 1. Problem
+
+Skill activation has no single, model-legible entry point, and the candidate
+surfaces all compete with each other:
+
+- A `/skill` slash command, a `load_skill`-style tool, plugin/namespace naming
+  (`superpowers:systematic-debugging`, `github:gh-fix-ci`), and the long-running
+  workflow commands (`/workflow` / `/whaleflow`) all *could* be "the way you
+  start a skill." None of them is canonical.
+- Slash commands are already overloaded. `/memory`, `/context`, `/rules`,
+  `/config`, `/provider`, `/workflow`, `/overlay` each map to one subsystem;
+  jamming skill invocation into `/`-space forces a weaker model to disambiguate
+  "is this a command or a skill?" on every keystroke.
+- Weaker / smaller models (the cheaper providers CodeWhale targets) do not
+  reliably pick the right mechanism. They will free-text "let me use systematic
+  debugging" instead of actually loading the skill body, so the guidance never
+  enters the context window.
+- Today there is **no parser that activates an inline skill mention on submit.**
+  `slash_menu.rs:86` (`partial_inline_skill_mention_at_cursor`) recognizes an
+  inline `/<skill>` token *under the cursor for popup purposes only*; the submit
+  path in `ui.rs:4721` (`build_queued_message`) does not resolve or activate any
+  inline mention. There is also no activation-mode concept (always-on / glob /
+  model-decision / manual) and skills cannot restrict tools yet.
+
+We need one prefix that means exactly "invoke this skill," is visually distinct
+from commands, and is cheap for a small model to emit correctly.
+
+---
+
+## 2. Proposal
+
+Adopt **`$` as the skill-invocation prefix**, where **the token *is* the skill
+name** — not a literal command called `$skill`.
+
+```
+$systematic-debugging figure out why MiMo auth fails
+$test-driven-development add coverage before fixing
+$github:gh-fix-ci inspect the failing checks
+$aleph search the planning doc
+```
+
+The leading `$` is the marker; everything from `$` up to the next whitespace is
+the **skill id**. The rest of the line is the user's request, passed through to
+the model with the skill body already loaded as active guidance.
+
+This is deliberately a *reference / macro* sigil, like a shell variable
+expansion or an `@mention`: `$skill-id` resolves to "the contents and tool
+policy of that skill," then the surrounding prose is the task.
+
+`$` works in three places (see §4): the user composer, the command-palette
+input, and **model-facing planning text** — so the model itself can write
+`$systematic-debugging` in its plan and have it resolve.
+
+---
+
+## 3. Resolution rules
+
+Given a token `$<id>` (id captured up to the next whitespace):
+
+1. **Exact name first.** Look the id up directly:
+   `discover_in_workspace(workspace).get(id)` — `skills/mod.rs:553` builds the
+   registry; `SkillRegistry::get` (`skills/mod.rs:421`) matches on `s.name == id`
+   exactly. Skill names come from frontmatter `name:` (or the first `# Heading`
+   fallback) parsed at `skills/mod.rs:382-417`. An exact hit wins unconditionally.
+
+2. **Namespaced `$ns:skill`.** If the id contains a `:`, treat the part before
+   the colon as a source/plugin namespace and the part after as the skill name:
+   `$github:gh-fix-ci`, `$superpowers:systematic-debugging`. Namespaced ids are
+   the disambiguation handle a user is told to type when a bare id is ambiguous.
+   (Glob/wildcard namespacing — `$github:*` — is explicitly deferred, see §6.)
+
+3. **Fuzzy match *suggests*, never silently chooses.** If there is no exact (or
+   namespaced-exact) hit, run a case-insensitive substring / prefix match over
+   `SkillRegistry::list()` (`skills/mod.rs:426`). If exactly one skill matches,
+   surface it as a suggestion ("did you mean `$systematic-debugging`?") but do
+   **not** auto-activate it. If more than one matches, list them and require the
+   user/model to re-issue with a disambiguated id (§7). Ambiguity never resolves
+   to a silent pick.
+
+4. **Respect enable-state.** A resolved skill is only activated if
+   `SkillStateStore::is_enabled(id)` is true (`skill_state.rs:73`:
+   `!self.disabled.contains(skill_name)`). A disabled skill that resolves by
+   name produces a clear "skill is disabled; enable it with `/skill enable <id>`"
+   message rather than silently activating or silently doing nothing.
+
+Resolution order is therefore: **exact → namespaced-exact → enabled-check →
+fuzzy-suggest (never auto-pick).**
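
The resolution order can be sketched as a single function over an in-memory registry. Everything here is illustrative: the `Skill` and `Resolution` types, the `resolve` and `skill` helpers, and the plain `HashSet` of disabled ids stand in for `SkillRegistry` and `SkillStateStore`, whose real signatures are not shown in this document.

```rust
use std::collections::HashSet;

/// Stand-in for a registry entry (hypothetical shape for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct Skill {
    namespace: Option<String>,
    name: String,
}

impl Skill {
    /// Full id as the user types it: `name` or `ns:name`.
    fn id(&self) -> String {
        match &self.namespace {
            Some(ns) => format!("{ns}:{}", self.name),
            None => self.name.clone(),
        }
    }
}

/// Convenience constructor used by the example.
fn skill(namespace: Option<&str>, name: &str) -> Skill {
    Skill { namespace: namespace.map(str::to_string), name: name.to_string() }
}

#[derive(Debug, PartialEq)]
enum Resolution {
    Activate(String),
    Disabled(String),
    Suggest(String),        // one fuzzy candidate: offer it, do not activate
    Ambiguous(Vec<String>), // several candidates: caller must disambiguate
    NotFound,
}

fn resolve(id: &str, skills: &[Skill], disabled: &HashSet<String>) -> Resolution {
    // Rules 1 and 2: exact bare name, or exact `ns:name`.
    let hit = match id.split_once(':') {
        Some((ns, name)) => skills
            .iter()
            .find(|s| s.namespace.as_deref() == Some(ns) && s.name == name),
        None => skills.iter().find(|s| s.namespace.is_none() && s.name == id),
    };
    // Rule 4: an exact hit is still gated on enable-state.
    if let Some(found) = hit {
        let full = found.id();
        return if disabled.contains(&full) {
            Resolution::Disabled(full)
        } else {
            Resolution::Activate(full)
        };
    }
    // Rule 3: case-insensitive substring match only ever suggests.
    let needle = id.to_lowercase();
    let mut candidates: Vec<String> = skills
        .iter()
        .map(Skill::id)
        .filter(|full| full.to_lowercase().contains(&needle))
        .collect();
    match candidates.len() {
        0 => Resolution::NotFound,
        1 => Resolution::Suggest(candidates.remove(0)),
        _ => Resolution::Ambiguous(candidates),
    }
}
```

Note that no branch of the fuzzy path returns `Activate`, which is the "suggests, never silently chooses" rule expressed in the type.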
+
+---
+
+## 4. Behavior
+
+When a `$<id>` mention resolves and is enabled:
+
+- **Visible activation line.** The transcript shows `Using skill: <name>` so the
+  user can see which skill body entered context. (Mirrors the existing skill UX
+  vocabulary; one line per activated skill.)
+- **Body loaded as active guidance.** The skill's `body`
+  (`skills/mod.rs` `Skill.body`) is injected into the turn as authoritative
+  guidance, the same content a `/skill`-style activation would load. The user's
+  trailing prose is the task the guidance applies to.
+- **Tool-surface narrowing (when declared).** If the skill declares a set of
+  allowed tools, the active tool surface narrows to that set for the duration of
+  the skill's influence. **Per-skill tool restriction is net-new** — skills
+  cannot restrict tools today; the mechanism, and how narrowing interacts with
+  the catalog-head byte-stability invariant (`tool_catalog.rs:169-196`), is
+  specified in `TOOL_LIFECYCLE.md`. Until that lands, a declared tool list is
+  parsed and shown but not enforced.
+- **Multiple `$mentions` compose explicitly, or prompt.** Until formal
+  composition rules exist, two or more `$mentions` in one message either compose
+  only when the rule is unambiguous (e.g. one guidance skill + one tool-scoping
+  skill) or return a **"choose one"** prompt listing the mentioned skills. We
+  never silently activate multiple complex skills at once (see §7 and Non-goals).
+- **Three input surfaces.** Resolution runs for: (a) user prompts in the
+  composer, (b) command-palette input, and (c) model-facing planning text, so a
+  model that writes `$test-driven-development` in its plan triggers the same
+  activation path a human would.
+- **Slash commands remain supported.** `/skill ...` and the rest of the slash
+  surface keep working unchanged. `$` is the *preferred* path for models because
+  it is one token and unambiguous, but it is additive, not a replacement (§7
+  Non-goals).
+
+---
+
+## 5. Why `$`
+
+- **Visually distinct from `/commands`.** A glance separates "run a subsystem
+  command" (`/memory`, `/context`, `/workflow`) from "load a skill" (`$aleph`).
+  Weaker models stop confusing the two surfaces.
+- **Reads like a reference / macro.** `$name` already means "expand this named
+  thing" to anyone who has touched a shell or a templating language. Skill
+  invocation *is* an expansion: `$skill-id` → that skill's guidance + tool policy.
+- **Avoids overloading the slash namespace.** `/workflow`, `/memory`, `/config`,
+  `/provider`, `/rules`, `/overlay`, `/context` each already own one meaning in
+  the command-surface taxonomy. Skills get their own sigil instead of a crowded
+  `/skill <name>` subcommand competing with all of them.
+- **Easy to type and remember.** Single leading character, then the literal
+  skill name. Nothing to memorize beyond the skill ids the user already sees in
+  `/skill list`.
+
+---
+
+## 6. Implementation plan (smallest viable 0.8.53-ready slice → 0.9.0)
+
+The 0.8.53 cycle is **docs only**. The plan below is the build order once code
+is unblocked; the first slice is intentionally the minimum that proves the path.
+
+**Slice 1 — token scanner at submit (the minimum viable feature).**
+- Add a `$<skill-id>` token scanner invoked on submit, **before**
+  `build_queued_message` runs (`ui.rs:4721`). The scanner finds leading-`$`
+  tokens, captures the id up to the next whitespace, and hands each id to the
+  resolver. The scanner must skip `$` occurrences inside code fences and inline
+  command strings (see Non-goals) so shell `$VAR` references are never treated as
+  skill mentions.
+- Resolve via `discover_in_workspace(workspace).get(id)` (`skills/mod.rs:553` /
+  `:421`), gate on `SkillStateStore::is_enabled` (`skill_state.rs:73`), and emit
+  the `Using skill: <name>` line plus the loaded body.
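
A minimal version of the Slice 1 scanner might look like this. It is a sketch under assumptions: the name `scan_skill_mentions` is hypothetical, code fences are detected as lines starting with three backticks, inline code is detected by splitting on single backticks, and requiring the id to start with an ASCII letter (so that `$5` is ignored) is an extra rule not stated in the design.

```rust
/// Collect `$<skill-id>` mentions, skipping fenced code blocks and inline code.
/// (Hypothetical helper; the id runs from `$` to the next whitespace.)
fn scan_skill_mentions(input: &str) -> Vec<String> {
    let fence = "`".repeat(3); // three backticks open or close a code fence
    let mut ids = Vec::new();
    let mut in_fence = false;
    for line in input.lines() {
        if line.trim_start().starts_with(fence.as_str()) {
            in_fence = !in_fence;
            continue;
        }
        if in_fence {
            continue;
        }
        // Splitting on backticks puts inline-code spans at odd indices.
        for (i, segment) in line.split('`').enumerate() {
            if i % 2 == 1 {
                continue;
            }
            for token in segment.split_whitespace() {
                if let Some(id) = token.strip_prefix('$') {
                    // ASSUMPTION: ids start with a letter, so `$5` is not a mention.
                    if id.chars().next().map_or(false, |c| c.is_ascii_alphabetic()) {
                        ids.push(id.to_string());
                    }
                }
            }
        }
    }
    ids
}
```

A bare `$HOME` in ordinary prose would still be captured and handed to the resolver, where it would fall through to the "no match" path; only code fences and inline code are exempt in this sketch.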
+
+**Slice 2 — inline-mention popup.**
+- Extend the inline-mention popup machinery in `slash_menu.rs:86`
+  (`partial_inline_skill_mention_at_cursor`) to recognize a `$`-prefixed token
+  under the cursor and offer skill-name completions from `SkillRegistry::list()`,
+  the same way the slash popup offers commands. This is a UX accelerator on top
+  of Slice 1, not a precondition for it.
+
+**Slice 3 — ambiguity diagnostics.**
+- When resolution is ambiguous, emit actionable diagnostics, e.g.
+  `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging,
+   superpowers:systematic-debugging — use $superpowers:systematic-debugging"`.
+  Diagnostics name the disambiguated id the user should type next.
+
+**Deferred to 0.9.0+ (explicitly out of the first slices):**
+- `$ns:skill` **globs / wildcards** (`$github:*`). Plain namespaced-exact
+  (`$github:gh-fix-ci`) ships in Slice 1; globbing does not.
+- **Per-skill tool restriction enforcement.** Parsing/display can land early;
+  enforcement and its catalog-head-stability handling are owned by
+  `TOOL_LIFECYCLE.md`.
+- **Multi-skill composition rules.** Until defined, fall back to the "choose one"
+  prompt (§4, §7).
+
+---
+
+## 7. Ambiguity / error UX, tests, and non-goals
+
+### Error / ambiguity UX examples
+
+| Input | Outcome |
+|---|---|
+| `$systematic-debugging fix the auth bug` | Exact hit. `Using skill: systematic-debugging`, body loaded, task = "fix the auth bug". |
+| `$github:gh-fix-ci inspect failing checks` | Namespaced-exact hit. `Using skill: github:gh-fix-ci`, body loaded. |
+| `$nope do a thing` | No match. `"No skill named 'nope'. Run /skill list to see available skills."` No activation; the line is sent as ordinary text. |
+| `$debugging ...` (3 candidates) | `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging, superpowers:systematic-debugging — use $superpowers:systematic-debugging."` No auto-pick. |
+| `$systematic-debug ...` (1 fuzzy candidate) | Suggest only: `"No exact skill 'systematic-debug'. Did you mean $systematic-debugging?"` No silent activation. |
+| `$aleph ...` but aleph disabled | `"Skill 'aleph' is disabled. Enable it with /skill enable aleph."` No activation. |
+| `$tdd $systematic-debugging ...` (2 mentions) | `"Choose one skill to lead this turn: $test-driven-development or $systematic-debugging."` (until composition rules exist). |
+| `echo $PATH` inside a code fence / command string | Not a mention. Scanner skips `$` inside code/command contexts. |
+
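The single-mention outcomes in the table above can be sketched as one resolution function. This is an illustration under stated assumptions: `Resolution` and `resolve` are hypothetical names, the `BTreeMap<String, bool>` (skill id to enabled flag) stands in for the real registry plus `SkillStateStore`, and plain substring matching stands in for whatever fuzzy matcher is eventually chosen.

```rust
use std::collections::BTreeMap;

/// Hypothetical outcome type mirroring the rows of the UX table.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Exact hit on an enabled skill: activate it.
    Activate(String),
    /// Exact hit, but the skill is disabled: report, do not activate.
    Disabled(String),
    /// Exactly one near match: suggest it, never activate silently.
    Suggest(String),
    /// Two or more near matches: list them, never auto-pick.
    Ambiguous(Vec<String>),
    /// Nothing matched: the line is sent as ordinary text.
    Missing,
}

/// `skills` maps skill id to its enabled flag (a stand-in for the registry).
pub fn resolve(query: &str, skills: &BTreeMap<String, bool>) -> Resolution {
    // Exact ids win outright, including namespaced-exact ids.
    if let Some(&enabled) = skills.get(query) {
        return if enabled {
            Resolution::Activate(query.to_string())
        } else {
            Resolution::Disabled(query.to_string())
        };
    }

    // Assumption: substring containment approximates the fuzzy matcher.
    let mut candidates: Vec<String> = skills
        .keys()
        .filter(|id| id.contains(query))
        .cloned()
        .collect();

    match candidates.len() {
        0 => Resolution::Missing,
        1 => Resolution::Suggest(candidates.remove(0)),
        _ => Resolution::Ambiguous(candidates),
    }
}
```

Only `Activate` loads a skill body; every other variant maps to a message, which keeps the "suggests, not guesses" rule enforceable in one place.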
+### Tests (planned)
+
+- **Exact:** `$systematic-debugging` resolves via `get(id)`, activates, loads body.
+- **Namespaced:** `$github:gh-fix-ci` resolves on the `ns:skill` form.
+- **Missing:** `$nope` → no-match message, no activation, line passed as text.
+- **Ambiguous:** `$debugging` (≥2 candidates) → "matched N skills … use $ns:skill",
+  asserts **no** auto-activation occurred.
+- **Disabled:** a skill with `is_enabled == false` → disabled message, no activation.
+- **Guardrail — `$` in code:** `$VAR` inside a fenced block or command string is
+  not treated as a mention.
+
+### Non-goals
+
+- **Do not remove slash commands.** `/skill` and the whole `/` surface stay; `$`
+  is preferred for models but additive.
+- **Do not auto-run arbitrary scripts.** A `$mention` loads guidance (and, later,
+  a declared tool policy) — it never executes shell or skill-attached scripts on
+  its own.
+- **Do not silently activate multiple complex skills.** Multi-mention falls back
+  to a "choose one" prompt until composition rules are specified.
+- **Do not let `$` collide with shell variables.** `$` inside code fences and
+  command strings is never parsed as a skill mention.
@@ -0,0 +1,366 @@
+# Tool-Surface Lifecycle Policy (v0.8.53)
+
+**Status:** Design doc / policy. No catalog code lands in this cycle — the code
+work is **deferred**. This document is the umbrella policy for GitHub **#2681**,
+with **#2682** and **#2683** as concrete instances of the planned diet. It
+describes *what will be done* and the invariants any future diet PR must hold.
+
+**Scope of related open work (do not contradict):**
+- PR **#2684** — subagent role vocabulary, lifecycle signals, eval ergonomics.
+  Legacy subagent-name cleanup + guardrail tests in this policy rebase on #2684.
+- PR **#2685** — git-history active + RLM/field errors.
+
+All file:line citations are against the verified tree at
+`/Users/huntermbown/Desktop/whalebro/codewhale` as of v0.8.52/0.8.53.
+
+---
+
+## 1. Purpose and the weaker-model problem
+
+CodeWhale ships a large native tool surface. The first-turn *active* partition
+of that surface is what every model sees before it has run a single
+`tool_search_*` call. Today that active set contains several **near-duplicate
+tools** that map to the *same* implementation under different names:
+
+- `exec_wait` and `exec_shell_wait` are both `ShellWaitTool`
+  (`crates/tui/src/tools/registry.rs:526,529`).
+- `exec_interact` and `exec_shell_interact` are both `ShellInteractTool`
+  (`registry.rs:527,530`).
+- `tts` and `speech` are both `SpeechTool`
+  (`registry.rs:787-792`, both deferred).
+- `todo_write` and `checklist_write` are the *same* `TodoWriteTool`
+  constructed two ways (`crates/tui/src/tools/todo.rs:184-196`).
+
+For a strong model, redundant names are harmless noise. For **weaker / smaller
+models** (the Arcee Trinity lane, `deepseek-v4-flash` child executors, and any
+non-thinking executor), every additional near-duplicate in the visible set is a
+real cost:
+
+- It widens the choice space with options that do *nothing distinct*, increasing
+  wrong-tool selection and oscillation between synonyms.
+- It spends scarce first-turn catalog budget (Section 5) on zero-information
+  entries.
+- It dilutes the "one name = one thing" contract that lets a small model reason
+  about the surface at all.
+
+The lifecycle policy exists to **shrink and discipline the model-visible
+surface** without ever breaking the ability to replay an old transcript that
+referenced a now-retired name.
+
+---
+
+## 2. The five lifecycle states
+
+Every native tool name occupies exactly one lifecycle state.
+
+| State | Meaning | Visible on first turn? | In `tool_search_*`? | Executes if called? | When used |
+|---|---|---|---|---|---|
+| **active** | Canonical, in the first-turn catalog head | **Yes** | n/a (already active) | Yes | The tool a model should reach for by default |
+| **deferred** | Registered + discoverable, hydrated on demand | No | **Yes** | Yes | Real, useful tools that don't earn a first-turn slot |
+| **hidden-compatibility** | Registered + dispatchable, but removed from active **and** from search | No | **No** | **Yes — identical behavior, silent** | Old synonym kept only so old transcripts replay; no model should newly discover it |
+| **deprecated** | Like hidden-compat, but execution **appends a replacement notice to result metadata** | No | **No** | **Yes — works, plus a "use X instead" notice** | A retired name we actively steer callers off of, still safe to replay |
+| **removed** | Not registered at all | No | No | **No — hard error** | Only after `planned_removal_version`, once replay support is formally dropped |
+
+### hidden-compatibility vs deprecated — be precise
+
+Both states are **invisible** (not active, not in tool search) and both remain
+**dispatchable** (calling them still works). The *only* difference is the
+caller-facing signal:
+
+- **hidden-compatibility:** completely silent. The tool behaves byte-for-byte
+  like its canonical twin. We use this when there is *no behavioral or naming
+  lesson to teach* — the name was a pure alias and we simply don't want models
+  re-learning it. (Example: `exec_wait` is literally `exec_shell_wait`.)
+- **deprecated:** behaves identically *and succeeds*, but the tool result's
+  **metadata** carries an appended notice like
+  `"deprecated: use checklist_write instead"`. The notice goes **only in the
+  result metadata returned for that call** — never in the cached tool catalog
+  prefix (see Section 8). We use this when there is a canonical replacement we
+  want the caller (and any human reading the transcript) nudged toward.
+
+Neither state ever changes the *behavior* of the call. Replay always works.
+
+---
+
+## 3. Representation in code
+
+The lifecycle is represented as **const name-sets plus an alias/manifest table**
+in `crates/tui/src/core/engine/tool_catalog.rs`, alongside the existing
+`DEFAULT_ACTIVE_NATIVE_TOOLS` (`tool_catalog.rs:37-64`) and
+`ARCEE_FIRST_TURN_NATIVE_TOOLS` (`tool_catalog.rs:106-115`).
+
+### 3a. Name-sets and the manifest (sketch)
+
+```rust
+// crates/tui/src/core/engine/tool_catalog.rs  (planned)
+
+/// Tools removed from the active set AND from tool-search, but still
+/// registered and dispatchable with byte-identical behavior. Silent.
+pub(super) const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &[
+    "exec_wait",          // == exec_shell_wait  (ShellWaitTool)
+    "exec_interact",      // == exec_shell_interact (ShellInteractTool)
+    "tts",                // == speech (SpeechTool)
+];
+
+/// Deprecated aliases: invisible + dispatchable, with a replacement notice
+/// appended to RESULT METADATA only (never the cached prefix).
+pub(super) struct DeprecatedAlias {
+    pub name: &'static str,
+    pub replacement: &'static str,
+    pub note: &'static str,
+}
+
+pub(super) const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
+    DeprecatedAlias { name: "todo_write",  replacement: "checklist_write",
+                      note: "use checklist_write instead" },
+    DeprecatedAlias { name: "todo_add",    replacement: "checklist_add",
+                      note: "use checklist_add instead" },
+    DeprecatedAlias { name: "todo_update", replacement: "checklist_update",
+                      note: "use checklist_update instead" },
+    DeprecatedAlias { name: "todo_list",   replacement: "checklist_list",
+                      note: "use checklist_list instead" },
+];
+
+#[inline]
+pub(super) fn is_hidden_or_deprecated(name: &str) -> bool {
+    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
+        || DEPRECATED_ALIASES.iter().any(|d| d.name == name)
+}
+```
+
+### 3b. The two filter points
+
+1. **Catalog / tool-search exclusion (tool_catalog.rs).**
+   Deferral is decided by `should_default_defer_tool` (`tool_catalog.rs:66-82`),
+   and the active set is the head built by `build_model_tool_catalog`
+   (`tool_catalog.rs:178-196`). Hidden-compat and deprecated tools must be
+   forced *out of the active head* and *out of the tool-search-discoverable
+   pool*. Concretely, the deferral predicate gains a short-circuit so these
+   names are never active, and the tool-search index builder skips any name for
+   which `is_hidden_or_deprecated(name)` is true. Arcee's narrowed first-turn
+   path (`apply_provider_tool_policy`, `tool_catalog.rs:134-149`) already
+   excludes them by construction since they aren't in
+   `ARCEE_FIRST_TURN_NATIVE_TOOLS`.
+
+2. **Result-notice append (tool_routing.rs).**
+   Dispatch already routes by tool name in
+   `crates/tui/src/tui/tool_routing.rs` (e.g. the wait/interact unification at
+   `tool_routing.rs:1139-1140`). After a successful dispatch, if the called name
+   is in `DEPRECATED_ALIASES`, the router appends the matching `note` to the
+   **result metadata only**. Hidden-compat names append nothing.
+
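The two filter points can be sketched together, assuming simplified stand-ins for the real types. `search_pool` and `append_deprecation_notice` are hypothetical names, the `(alias, replacement)` tuples abbreviate the `DeprecatedAlias` struct from §3a, and `Vec<String>` stands in for whatever shape the result metadata actually has.

```rust
const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &["exec_wait", "exec_interact", "tts"];

// (alias, replacement) pairs, abbreviated from the manifest in this doc.
const DEPRECATED_ALIASES: &[(&str, &str)] = &[
    ("todo_write", "checklist_write"),
    ("todo_add", "checklist_add"),
    ("todo_update", "checklist_update"),
    ("todo_list", "checklist_list"),
];

fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_ALIASES.iter().any(|(alias, _)| *alias == name)
}

/// Filter point 1: names eligible for the tool-search pool. Registration is
/// untouched; retired names are simply skipped when the pool is built.
fn search_pool<'a>(registered: &[&'a str]) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| !is_hidden_or_deprecated(name))
        .collect()
}

/// Filter point 2: after a successful dispatch, append the replacement notice
/// to the per-call metadata only. Hidden-compat names append nothing.
fn append_deprecation_notice(called_name: &str, metadata: &mut Vec<String>) {
    let hit = DEPRECATED_ALIASES
        .iter()
        .find(|(alias, _)| *alias == called_name);
    if let Some((_, replacement)) = hit {
        metadata.push(format!("deprecated: use {} instead", replacement));
    }
}
```

Neither function touches the emitted tool JSON, which is the property the name-set representation is meant to protect.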
+### 3c. Why name-sets, not a per-`ToolSpec` enum field
+
+A per-`ToolSpec` `lifecycle: Lifecycle` field was rejected for three reasons:
+
+- **Prefix-cache safety.** The tool catalog array is part of DeepSeek's
+  immutable KV prefix (`tool_catalog.rs:169-177`). A per-spec field invites
+  serializing lifecycle state *into* each tool's schema, which is exactly the
+  kind of head mutation that forces a full re-prefill. Name-sets live entirely
+  in the catalog-build logic and never touch the emitted tool JSON.
+- **Single source of truth + diffability.** The diet for a release is one small,
+  reviewable edit to two or three const arrays in one file, instead of scattered
+  field flips across many tool modules.
+- **Registration stays orthogonal.** Tools remain registered exactly as today
+  (e.g. `with_shell_tools`, `registry.rs:523-531`). Lifecycle is a *catalog
+  policy* layered on top of registration, not a property baked into the tool.
+
+---
+
+## 4. Deprecation manifest (the #2681 acceptance-criteria table)
+
+This is the authoritative manifest. Columns are the #2681 AC columns. No entry
+is "removed" in 0.8.53; replay is supported for everything listed.
+
+| Alias | Replacement (canonical) | Lifecycle state | first_deprecated_version | planned_removal_version | replay_supported |
+|---|---|---|---|---|---|
+| `exec_wait` | `exec_shell_wait` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `exec_interact` | `exec_shell_interact` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `tts` | `speech` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `todo_write` | `checklist_write` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `todo_add` | `checklist_add` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `todo_update` | `checklist_update` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+| `todo_list` | `checklist_list` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
+
+**Legacy subagent names — already non-visible, no manifest entry needed.**
+`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`,
+`send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, and
+`delegate_to_agent` exist only as `#[allow(dead_code)]` structs in
+`crates/tui/src/tools/subagent/mod.rs` and are **never instantiated** outside
+tests, so they are already not model-visible. Only `agent_open`, `agent_eval`,
+`tool_agent`, and `agent_close` are registered
+(`registry.rs:1017-1029`). The action for these legacy names is **dead-code
+cleanup + a guardrail test** (rebase on PR #2684), not a lifecycle transition.
+
+> **Keep the live internal methods.** `send_input`, `cancel`, and `resume` also
+> exist as live `SubAgentManager` methods
+> (`subagent/mod.rs:1605,1495,1521`) used internally by `agent_eval` /
+> `agent_close`. These are *not* the dead-code tool structs and must be kept.
+
+`planned_removal_version` is intentionally `TBD`: a name only moves to **removed**
+once we formally drop replay for transcripts old enough to contain it, which is a
+separate, deliberate decision per name.
+
+---
+
+## 5. Active-catalog budget (per mode, per provider)
+
+The active set is the first-turn cost. Current default active set:
+`DEFAULT_ACTIVE_NATIVE_TOOLS` has **25** entries (`tool_catalog.rs:37-64`).
+
+### Per provider
+
+| Provider | First-turn active source | Current count | Target after diet |
+|---|---|---|---|
+| Default (DeepSeek et al.) | `DEFAULT_ACTIVE_NATIVE_TOOLS` | 25 | 23 after dropping `exec_wait` and `exec_interact` (`tts` and `todo_*` are already not active); broader target ≤ 22 |
+| Arcee (Trinity) | `ARCEE_FIRST_TURN_NATIVE_TOOLS` | 8 (read-only WAF workaround) | 8 (unchanged) |
+
+The default diet removes `exec_wait` and `exec_interact` from the active head
+(they become hidden-compat; their canonical twins `exec_shell_wait` /
+`exec_shell_interact` stay). `tts` and `todo_*` are *already not* in the active
+set, so the active count moves **25 → 23** from the wait/interact removal alone;
+the broader target is a stable budget of roughly **≤ 22** canonical tools.
+
+### Per mode (Plan / Agent / YOLO)
+
+The native active head is the **same set across modes** by design — mode does not
+add or remove native tools from `DEFAULT_ACTIVE_NATIVE_TOOLS`
+(`should_default_defer_tool` ignores `_mode` for native tools,
+`tool_catalog.rs:66-68`). Mode affects **MCP** deferral instead:
+`apply_mcp_tool_deferral` keeps MCP tools deferred unless `mode == Yolo`
+(`tool_catalog.rs:162-167`).
+
+| Mode | Native active budget | MCP tools active? |
+|---|---|---|
+| Plan | same native head (target ≤ 22) | No (deferred) |
+| Agent | same native head | No (deferred) |
+| YOLO | same native head | Yes (a known, intentional widening) |
+
+**Budget rule:** the native active head must stay byte-identical across Plan ↔
+Agent ↔ YOLO (Section 8). Any growth of the head requires retiring something
+else or an explicit budget bump in this doc.
+
+---
+
+## 6. The canonical-surface rule
+
+> **Every model-visible (active or deferred-discoverable) tool must have one
+> clear niche. If a tool is superseded, it gets a named replacement and moves to
+> hidden-compatibility or deprecated — it does not stay visible.**
+
+### Canonical vs compatibility summary for the confusing clusters
+
+| Cluster | Canonical (keep visible) | Compatibility / retired | Notes |
+|---|---|---|---|
+| **Shell wait** | `exec_shell_wait` | `exec_wait` → hidden-compat | Same `ShellWaitTool` (`registry.rs:526,529`); router already unifies (`tool_routing.rs:1139`) |
+| **Shell interact** | `exec_shell_interact` | `exec_interact` → hidden-compat | Same `ShellInteractTool` (`registry.rs:527,530`) |
+| **Checklist / todo** | `checklist_write` | `todo_write/add/update/list` → deprecated | Same `TodoWriteTool`, `::new` vs `::checklist` (`todo.rs:184-196`) |
+| **Speech / tts** | `speech` | `tts` → hidden-compat | Same `SpeechTool` (`registry.rs:787-792`) |
+| **Subagent lifecycle** | `agent_open`, `agent_eval`, `agent_close`, `tool_agent` (gated, §7) | all 11 legacy names → already non-visible dead code | Cleanup + guardrail test, rebase on #2684 |
+| **Edit family** | `apply_patch`, `edit_file`, `write_file`, `fim_edit` | none — **all distinct niches** | NOT touched (per #2681 non-goals); doc-only canonical guidance |
+| **Search family** | `grep_files` (content), `file_search` (filename), `project_map` (structure) | none — **distinct niches** | NOT touched; no FTS5/BM25/semantic index exists today |
+
+**Non-goals (explicitly NOT diet targets in this cycle, per #2681):**
+`apply_patch` / `edit_file` / `write_file` / `fim_edit`;
+`grep_files` / `file_search` / `project_map`;
+`fetch_url` / `web.run` / `web_search`;
+`task_shell_*`; `handle_read` / `retrieve_tool_result`. These have distinct
+niches and receive **canonical guidance only** — no lifecycle change.
+
+The RLM surface (`rlm_open` / `rlm_eval` / `rlm_configure` / `rlm_close` /
+`rlm_session_objects`, `crates/tui/src/tools/rlm.rs`) is likewise out of scope;
+`handle_read` retrieves var handles, and `finalize` / `FINAL` is an in-kernel
+Python function, **not a tool** — so there is nothing to retire there.
+
+---
+
+## 7. `tool_agent` decision: canonical but DeepSeek-V4-gated
+
+`tool_agent` **stays** as a canonical subagent tool
+(`registry.rs:1024`, `ToolAgentTool`). It is the fast, **non-thinking "Fin"
+executor lane**, built on `deepseek-v4-flash` (cf. `DEFAULT_CHILD_MODEL =
+"deepseek-v4-flash"`, `rlm.rs:26`).
+
+**Decision: gate `tool_agent` to DeepSeek-V4 models only.**
+
+- It is purpose-built around the V4-flash non-thinking executor profile. Exposing
+  it to other providers (e.g. Arcee Trinity, which is already WAF-narrowed to 8
+  read-only tools, `tool_catalog.rs:106-115`) offers no working executor lane and
+  only adds a confusing, mis-targeted option to weaker surfaces.
+- Gating is a **provider/model policy**, consistent with the existing
+  provider-aware first-turn policy (`apply_provider_tool_policy`,
+  `tool_catalog.rs:134-149`): on non-DeepSeek-V4 models, `tool_agent` is excluded
+  from the active set and from tool-search discovery. It remains **registered and
+  dispatchable** so transcripts created under a V4 model replay everywhere.
+
+This is not a lifecycle transition — `tool_agent` is canonical. It is a
+*visibility gate* layered on the same machinery as the Arcee narrowing.
+
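A sketch of the visibility gate, under assumptions: `is_deepseek_v4` and `visible_for_model` are hypothetical names, and matching on a `deepseek-v4` id prefix is an assumed convention based on the `deepseek-v4-flash` id cited above, not a confirmed rule in the tree.

```rust
/// Assumption: DeepSeek-V4 family model ids start with "deepseek-v4".
fn is_deepseek_v4(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Drop `tool_agent` from the visible (active or discoverable) set unless the
/// current model is in the V4 family. Registration is not affected, so a
/// transcript that already calls `tool_agent` still dispatches.
fn visible_for_model<'a>(candidates: &[&'a str], model_id: &str) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || is_deepseek_v4(model_id))
        .collect()
}
```

Because the gate depends only on the model id, the visible set for a given model stays deterministic from run to run.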
+---
+
+## 8. Prefix-cache safety + replay guarantee
+
+### Prefix-cache rules every diet PR MUST follow
+
+The tools array is part of DeepSeek's immutable KV prefix. The catalog-head
+byte-stability invariant (`tool_catalog.rs:169-196`) is binding:
+
+1. **Never mutate the active head non-deterministically.** The first-turn active
+   block must be **byte-identical run-to-run** and across Plan ↔ Agent ↔ YOLO.
+2. **A diet is a one-time deterministic edit.** Removing a name from
+   `DEFAULT_ACTIVE_NATIVE_TOOLS` shifts the head exactly once; after that it must
+   be stable. Land such edits as their own focused change.
+3. **Notices live in result metadata, never the prefix.** Deprecated replacement
+   notes are appended at dispatch time in `tool_routing.rs` to the *call result*
+   only. **Nothing** about hidden/deprecated state may be serialized into a tool
+   schema, description, or the catalog array.
+4. **Preserve ordering and partitioning.** `build_model_tool_catalog` sorts each
+   partition by name and keeps built-ins as a contiguous prefix ahead of MCP
+   tools (`tool_catalog.rs:186-194`). Diet edits must not break this.
+5. **Hidden/deprecated tools are excluded *before* the head is built**, so their
+   removal is the only head change — they do not appear in the prefix at all.
+
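Rules 1, 4, and 5 can be illustrated with a small deterministic head builder. This is a sketch, not the real `build_model_tool_catalog`: `ToolEntry`, `build_head`, and `head_bytes` are hypothetical, and joining names with newlines stands in for the actual tool-schema serialization.

```rust
/// Hypothetical, simplified view of a registered tool.
struct ToolEntry {
    name: &'static str,
    is_mcp: bool,
}

/// Exclude retired names first, then sort each partition by name and keep
/// built-ins as a contiguous prefix ahead of MCP tools.
fn build_head(tools: &[ToolEntry], retired: &[&str]) -> Vec<&'static str> {
    let mut builtins: Vec<&'static str> = Vec::new();
    let mut mcp: Vec<&'static str> = Vec::new();
    for tool in tools {
        if retired.contains(&tool.name) {
            continue; // hidden/deprecated names never reach the prefix
        }
        if tool.is_mcp {
            mcp.push(tool.name);
        } else {
            builtins.push(tool.name);
        }
    }
    builtins.sort_unstable();
    mcp.sort_unstable();
    builtins.extend(mcp);
    builtins
}

/// Stand-in serialization: the same head always yields the same bytes.
fn head_bytes(head: &[&str]) -> Vec<u8> {
    head.join("\n").into_bytes()
}
```

Since the output depends only on the set of surviving names, registration order and mode cannot perturb the bytes, and a diet shows up as exactly one change to the head.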
+### Old-transcript replay guarantee
+
+> For every name in the deprecation manifest with `replay_supported = Yes`, the
+> tool stays **registered and dispatchable with identical behavior**. Replaying
+> an old transcript that calls `exec_wait`, `exec_interact`, `tts`, or any
+> `todo_*` produces the same result it always did. Deprecated names additionally
+> attach a result-metadata notice; hidden-compat names are silent. A name is only
+> ever made non-dispatchable (**removed**) after a deliberate, per-name decision
+> to drop replay support at `planned_removal_version`.
+
+---
+
+## 9. Required tests
+
+Any diet PR (and the umbrella #2681 work) must add/keep:
+
+1. **Duplicate-active-alias guard.** A test asserting that no name in
+   `HIDDEN_COMPATIBILITY_TOOLS` or `DEPRECATED_ALIASES` appears in
+   `DEFAULT_ACTIVE_NATIVE_TOOLS` or `ARCEE_FIRST_TURN_NATIVE_TOOLS`, and that no
+   two active entries resolve to the same underlying tool implementation.
+
+2. **Tool-search exclusion test.** Assert that hidden-compat and deprecated names
+   are absent from the tool-search-discoverable pool while remaining present in
+   the registry (dispatchable).
+
+3. **Replay / dispatch tests.** For each manifest name, calling it still
+   executes and returns the same result as its canonical twin. Deprecated names
+   additionally assert the replacement note is present **in result metadata** and
+   absent from the catalog/prefix. Hidden-compat names assert **no** added
+   notice.
+
+4. **Golden active-block byte test.** A snapshot test pinning the byte
+   serialization of the first-turn active tool block, asserting it is identical
+   across Plan / Agent / YOLO (native head) and stable run-to-run — enforcing the
+   `tool_catalog.rs:169-196` invariant. The golden updates **only** as a
+   reviewed, deliberate one-time edit when the diet lands.
+
+5. **Subagent guardrail test (rebase on #2684).** Assert only `agent_open`,
+   `agent_eval`, `tool_agent`, `agent_close` are registered as model-visible
+   subagent tools and that no legacy name from `subagent/mod.rs` is
+   instantiated outside tests.
+
+6. **`tool_agent` gating test.** Assert `tool_agent` is active/discoverable only
+   under DeepSeek-V4 models and excluded (but still registered) elsewhere.
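
The second half of test 1 (no two active entries sharing an implementation) could be built on a helper like the one below. It is a sketch under assumptions: `duplicate_active_impls` is a hypothetical name, and pairing each active tool name with an implementation label is an assumed test fixture, since the registry does not necessarily expose such a mapping today.

```rust
use std::collections::BTreeMap;

/// Given (tool name, implementation label) pairs for the active set, return
/// every (first name seen, later name) pair that shares an implementation.
fn duplicate_active_impls<'a>(active: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    let mut first_seen: BTreeMap<&'a str, &'a str> = BTreeMap::new();
    let mut duplicates = Vec::new();
    for &(name, implementation) in active {
        // Copy the prior owner out before mutating the map.
        let prior = first_seen.get(implementation).copied();
        match prior {
            Some(canonical) => duplicates.push((canonical, name)),
            None => {
                first_seen.insert(implementation, name);
            }
        }
    }
    duplicates
}
```

A guard test would assert that this returns an empty list for the post-diet active set, and the returned pairs make a failure message name the offending alias directly.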
@@ -0,0 +1,472 @@
+# CodeWhale North Star (0.9.0+)
+
+> **STATUS: DIRECTION, NOT COMMITTED WORK.**
+> Everything in this document is the maintainer's intended *direction* for
+> CodeWhale 0.9.0 and beyond. **None of it is committed 0.8.53 work.** The
+> 0.8.53 cycle ships **design docs only** for these areas — no tool-catalog code
+> lands this cycle except the small, already-scoped subagent/git/RLM fixes in
+> PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be
+> refined, not an API contract. Where this doc names tools that do not exist yet
+> (`codebase_search`, `read_file` as a canonical alias, `agent_run`, etc.) those
+> are **aspirational names** that will *map onto today's tools*; see each
+> section.
+
+## Why this document exists
+
+The vision is at risk of being lost between point releases. CodeWhale is
+accumulating capability (subagents, RLM, skills, workflows, an enormous tool
+catalog) faster than it is accumulating *shape*. This is the north star that the
+incremental 0.8.x stabilization work is steering toward, written down once so it
+survives the next dozen PRs.
+
+### The one principle
+
+**The harness handles memory, search, routing, state, and guardrails so a
+weaker model can just *think*.** Every design decision below is in service of
+moving cognitive load *out* of the model and *into* the harness. A
+`deepseek-v4-flash`-class model should not have to remember ~80 tool names, hold
+the codebase index in its head, track which layer of memory a fact lives in, or
+re-derive a recovery path after a malformed tool call. The harness does that.
+The model decides *what it wants*; the harness figures out *how*.
+
+---
+
+## Ground-truth anchor (today's reality)
+
+So the direction is honest about where it starts:
+
+- **Active first-turn tool set** is `DEFAULT_ACTIVE_NATIVE_TOOLS`
+  (`crates/tui/src/core/engine/tool_catalog.rs:37-64`) — 26 tools. Everything
+  else is **deferred** and hydrates via `tool_search_tool_regex` /
+  `tool_search_tool_bm25` (`tool_catalog.rs:26-35`).
+- **Catalog-head byte-stability is a hard invariant** for DeepSeek's KV
+  prefix cache (`tool_catalog.rs:169-196`). The active first-turn tool block
+  must stay byte-identical run-to-run; any change to it is a **one-time,
+  deterministic edit**, never a per-turn or per-mode mutation.
+- **Arcee** narrows the first turn to 8 read-only tools
+  (`ARCEE_FIRST_TURN_NATIVE_TOOLS`, `tool_catalog.rs:106-115`) as a Cloudflare
+  WAF workaround — proof the active partition is already provider-shaped.
+- **Subagent tools that are model-visible:** only `agent_open`, `agent_eval`,
+  `tool_agent`, `agent_close` (`crates/tui/src/tools/registry.rs:1017-1029`).
+  All legacy names (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`,
+  `agent_send_input`, `agent_assign`, `agent_list`, `agent_cancel`,
+  `resume_agent`, `delegate_to_agent`, …) are `#[allow(dead_code)]` structs in
+  `crates/tui/src/tools/subagent/mod.rs`, never instantiated outside tests →
+  **already not model-visible**. The live internal `send_input` / `cancel` /
+  `resume` methods on `SubAgentManager` (`mod.rs:1495,1521,1605`) back
+  `agent_eval` / `agent_close` and **stay**.
+- **`tool_agent` is "Fin"** — the experimental fast-lane executor: DeepSeek V4
+  Flash with thinking forced off (`mod.rs:5233`, `TOOL_AGENT_INTRO`;
+  `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`).
+- **Known duplicates today:** `exec_wait ≡ exec_shell_wait`,
+  `exec_interact ≡ exec_shell_interact` (same structs, all four in the active
+  set), `tts ≡ speech` (both deferred). `todo_*` are deferred twins of
+  `checklist_*` (same `TodoWriteTool`, `::new` vs `::checklist`,
+  `todo.rs:187,194`). The router already unifies `exec_wait`/`exec_shell_wait`
+  (`crates/tui/src/tui/tool_routing.rs:1139-1140`).
+
+This is the surface the north star refactors *toward simplicity*.
+
+---
+
+## 1. Intent Router
+
+**What it is.** A thin layer where the model declares an **intent** —
+*search / inspect / edit / test / delegate / ask-user / run-shell /
+run-workflow* — and the harness maps that intent to the correct low-level tool
+and arguments. The model picks from a tiny, stable verb vocabulary instead of
+recalling ~80 concrete tool names and their schemas.
+
+**Why it helps weaker models.** Tool-name recall is one of the largest sources
+of wasted turns for small models: choosing a deferred tool (double-invoke),
+choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary
+collapses that decision space to ~10 verbs. The model spends its budget on
+*reasoning about the task*, not on *remembering the API*.
+
+**Rough shape.** A small **canonical visible set** — aspirational names that
+route onto today's tools:
+
+| Intent verb (aspirational) | Routes onto today |
+|---|---|
+| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` |
+| `read_file` | `read_file` (already canonical) |
+| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) |
+| `run_tests` | `run_tests` / `run_verifiers` |
+| `git_status` | `git_status` |
+| `git_diff` | `git_diff` |
+| `work_update` | `update_plan` / `checklist_write` |
+| `ask_user` | `request_user_input` |
+| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` hidden — §10) |
+| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` |
+| `workflow_run` | WhaleFlow runner (§4) |
+
+The router is the *only* place the catalog's full complexity is allowed to live.
+It is also where **tool repair** (§7) hooks in: a mis-stated intent or a
+deferred/deprecated name is rewritten to the canonical route.
+
+**Dependencies.** The small canonical surface (§3), the lifecycle alias table
+(§3 / `docs/TOOL_LIFECYCLE.md`), and the hybrid index for `codebase_search`
+(§2). Must respect the **catalog-head byte-stability invariant**: the visible
+verb set is itself a one-time deterministic edit, not a dynamic per-turn list.
+
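As a rough illustration of the routing table above, the verb vocabulary can be modeled as a closed enum with a static mapping onto today's tool names. `Intent`, `parse_intent`, and `route` are hypothetical names for this sketch; the real router would also choose among the listed tools and build arguments, which is not shown.

```rust
/// Hypothetical closed vocabulary of intent verbs from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Intent {
    CodebaseSearch,
    ReadFile,
    ApplyPatch,
    RunTests,
    GitStatus,
    GitDiff,
    WorkUpdate,
    AskUser,
    ShellRun,
    AgentRun,
    WorkflowRun,
}

/// Parse a verb; anything outside the vocabulary is rejected, not guessed.
fn parse_intent(verb: &str) -> Option<Intent> {
    Some(match verb {
        "codebase_search" => Intent::CodebaseSearch,
        "read_file" => Intent::ReadFile,
        "apply_patch" => Intent::ApplyPatch,
        "run_tests" => Intent::RunTests,
        "git_status" => Intent::GitStatus,
        "git_diff" => Intent::GitDiff,
        "work_update" => Intent::WorkUpdate,
        "ask_user" => Intent::AskUser,
        "shell_run" => Intent::ShellRun,
        "agent_run" => Intent::AgentRun,
        "workflow_run" => Intent::WorkflowRun,
        _ => return None,
    })
}

/// Today's tools that each intent routes onto, per the table.
fn route(intent: Intent) -> &'static [&'static str] {
    match intent {
        Intent::CodebaseSearch => &["grep_files", "file_search", "project_map"],
        Intent::ReadFile => &["read_file"],
        Intent::ApplyPatch => &["apply_patch"],
        Intent::RunTests => &["run_tests", "run_verifiers"],
        Intent::GitStatus => &["git_status"],
        Intent::GitDiff => &["git_diff"],
        Intent::WorkUpdate => &["update_plan", "checklist_write"],
        Intent::AskUser => &["request_user_input"],
        Intent::ShellRun => &["exec_shell"],
        Intent::AgentRun => &["agent_open", "tool_agent", "agent_eval", "agent_close"],
        // The WhaleFlow runner does not exist yet, so nothing to route onto.
        Intent::WorkflowRun => &[],
    }
}
```

Because the mapping is a static match, the visible verb set stays a compile-time constant, which fits the requirement that it change only as a one-time deterministic edit.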
+---
+
+## 2. Default Hybrid Codebase Intelligence
+
+**What it is.** An always-on, local-first codebase index that ships with the
+harness — not an opt-in tool the model has to remember to build. It fuses:
+
+- plain **text** search,
+- **symbol** index (definitions/references),
+- **import / call graph**,
+- **FTS5 + BM25** lexical ranking (rusqlite is already a dependency —
+  `Cargo.toml`),
+- **sparse** retrieval,
+- optional **dense** (embedding) retrieval,
+- **PR / commit / issue history** as a first-class retrieval source,
+- a **codemap** (structural overview, the successor to today's deferred
+  `project_map`).
+
+**Why it helps weaker models.** Today the model must orchestrate `grep_files`
+(content), `file_search` (filename), and `project_map` (structure) by hand,
+reconcile their outputs, and re-run them as it narrows. There is **no FTS5/BM25
+or semantic index today** — every search is a cold walk (`file_search` uses the
+`ignore` crate's `WalkBuilder` for vendor exclusion, `file_search.rs:~210`). A
+weaker model burns turns stitching partial results. A single `codebase_search`
+intent backed by a hybrid index returns ranked, concept-level hits in one call,
+so the model reasons about *answers*, not *query mechanics*.
+
+**Rough shape.** A background indexer maintains a SQLite store (FTS5 + symbol +
+graph tables), refreshed on file change and on git events. `codebase_search`
+(§1) queries it; the codemap is regenerated incrementally. Vendor exclusion
+reuses the existing `ignore`/`WalkBuilder` path.
+
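One way the separate rankings could be fused is reciprocal rank fusion (RRF). The sketch below assumes RRF as the fusion step and `k = 60` as a conventional constant; both are assumptions for illustration, `rrf_fuse` is a hypothetical name, and the real ranking design belongs to `docs/CODEBASE_SEARCH_DESIGN.md`.

```rust
use std::collections::HashMap;

/// Fuse several best-first rankings of document ids. Each appearance of a
/// document at 0-based position `rank` contributes 1 / (k + rank + 1).
fn rrf_fuse<'a>(rankings: &[Vec<&'a str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&'a str, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, doc) in ranking.iter().enumerate() {
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }

    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Highest score first; ties broken by name so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

RRF needs only rank positions, not comparable scores, so lexical, symbol, and optional dense rankers can be combined without calibrating their score scales against each other.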
+**Dependencies.** rusqlite/FTS5; the Intent Router (§1) for the
+`codebase_search` verb; the trace store (§6/§8) for history retrieval. **Full
+design lives in `docs/CODEBASE_SEARCH_DESIGN.md`** (one of this cycle's design docs).
+
+---
+
+## 3. Small Canonical Tool Surface
+
+**What it is.** A deliberately tiny set of always-visible canonical tools;
+**everything else is hidden, deferred, or skill-scoped**. The catalog grows
+behind the scenes but the *visible* surface stays small and stable.
+
+**Why it helps weaker models.** Fewer choices, no aliases competing for the same
+job, no deferred double-invokes for common operations. The model sees the verbs
+it needs and nothing else.
+
+**Rough shape — tool lifecycle states.** Five states, represented as **const
+name-sets plus an alias table in `tool_catalog.rs`** (NOT a per-`ToolSpec`
+field, to preserve the byte-stable head):
+
+1. **active** — in the first-turn catalog head.
+2. **deferred** — registered, hydrated via tool-search.
+3. **hidden-compatibility** — registered + dispatchable, **dropped from both
+   active and search**, identical behavior, **no notice**. (For exact
+   duplicates that should simply disappear from discovery.)
+4. **deprecated** — registered + dispatchable, **dropped from search**, appends
+   a *replacement notice to RESULT METADATA only* — **never** to the cached
+   prefix.
+5. **removed** — final state; no longer registered.
+
+**Invariant:** deprecated and hidden-compatibility tools **stay registered and
+dispatchable forever** so old transcripts always replay deterministically.
+
+**Planned diet (documented this cycle, not yet coded):**
+
+- `exec_wait`, `exec_interact`, `tts` → **hidden-compatibility** (exact
+  duplicates of `exec_shell_wait`, `exec_shell_interact`, `speech`).
+- `todo_*` (`todo_write/add/update/list`) → **deprecated**, replaced by
+  `checklist_*` (drop from tool-search, keep registered, add result-metadata
+  notice).
+- Legacy subagent names → already hidden; remaining work is **cleanup +
+  guardrail tests**, rebased on PR #2684.
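+
+A minimal sketch of how the five states and the planned diet could look as
+const name-sets plus an alias table. Everything here is hypothetical: the
+helper names, the `example_active_tool` placeholder, the expansion of
+`todo_write/add/update/list` into four full names, and the rule that unlisted
+tools count as deferred are all assumptions for illustration, not the contents
+of `tool_catalog.rs`.
+
```rust
/// The five lifecycle states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Const name-sets. The hidden-compatibility and deprecated entries follow the
// planned diet; the ACTIVE entry is a made-up placeholder.
const ACTIVE: &[&str] = &["example_active_tool"];
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];
const REMOVED: &[&str] = &[];

// Alias table: old name -> canonical name (exact duplicates only).
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];

/// Classify a tool name. Unlisted names are treated as deferred (assumption).
pub fn lifecycle(name: &str) -> Lifecycle {
    if REMOVED.contains(&name) {
        Lifecycle::Removed
    } else if DEPRECATED.contains(&name) {
        Lifecycle::Deprecated
    } else if HIDDEN_COMPATIBILITY.contains(&name) {
        Lifecycle::HiddenCompatibility
    } else if ACTIVE.contains(&name) {
        Lifecycle::Active
    } else {
        Lifecycle::Deferred
    }
}

/// Only active and deferred tools can be found through discovery.
pub fn discoverable(name: &str) -> bool {
    matches!(lifecycle(name), Lifecycle::Active | Lifecycle::Deferred)
}

/// Everything except removed tools still dispatches, so old transcripts replay.
pub fn dispatchable(name: &str) -> bool {
    lifecycle(name) != Lifecycle::Removed
}

/// Resolve an alias to its canonical name; other names pass through.
pub fn canonical(name: &str) -> &str {
    ALIASES
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(_, new)| *new)
        .unwrap_or(name)
}

/// Replacement notice for result metadata only, never the cached prefix.
pub fn replacement_notice(name: &str) -> Option<String> {
    (lifecycle(name) == Lifecycle::Deprecated)
        .then(|| format!("`{}` is deprecated; use the `checklist_*` tools", name))
}
```
+
+Keeping the state in name-sets rather than on each tool definition means a
+lifecycle change is an edit to one list, while registration and dispatch code
+stay untouched.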
+
+**Explicitly NOT touched** (distinct niches, per #2681 non-goals) — doc-only
+canonical guidance, no diet: `apply_patch` / `edit_file` / `write_file` /
+`fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` /
+`web.run` / `web_search`; `task_shell_*`; `handle_read` /
+`retrieve_tool_result`.
+
+**`tool_agent` gating decision.** `tool_agent` ("Fin") **stays** as a canonical
+subagent tool, but is **gated to DeepSeek-V4 models only**. It is the fast,
+non-thinking executor lane built on `deepseek-v4-flash`; because the lane *is* a
+specific model, offering it to other providers/models would be meaningless and
+would only add one more tool name for the model to keep track of. The gate is
+provider/model-conditional, in the same spirit as the Arcee first-turn
+narrowing.
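+
+A model-conditional gate of this kind could be as small as the sketch below.
+The function names and the idea of matching on a `deepseek-v4` id prefix are
+assumptions for illustration; the real gate may key off provider metadata
+instead.
+
```rust
/// Hypothetical gate: expose `tool_agent` only for DeepSeek-V4 model ids.
/// Matching on an id prefix is an assumption, not the project's actual rule.
pub fn tool_agent_visible(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Filter a candidate catalog for the given model, dropping `tool_agent`
/// when the gate is closed and leaving every other tool untouched.
pub fn catalog_for_model<'a>(tools: &[&'a str], model_id: &str) -> Vec<&'a str> {
    tools
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || tool_agent_visible(model_id))
        .collect()
}
```
+
+For example, `catalog_for_model(&["agent_open", "tool_agent"], "deepseek-v4-flash")`
+keeps both names, while any non-matching model id drops `tool_agent`.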
+
+**Dependencies.** The alias table backs the Intent Router (§1) and Tool Repair
+(§7). **Full spec in `docs/TOOL_LIFECYCLE.md`** (to be written this cycle).
+
+---
+
+## 4. WhaleFlow / Workflow Mode
+
+**What it is.** A typed, multi-agent **workflow runner**. A workflow is a graph
+of typed nodes — **branches, leaves, reviewers, verifiers, test-runners,
+PR-creators**, with **trace-replay** and a **progress-monitor**. Authors write
+workflows in **Starlark or YAML**, which compile to a **typed Rust IR**; the
+**Rust executor** runs the IR. "Like Claude's workflow mode, but safer" — the
+safety comes from the typed IR and Rust execution boundary rather than free-form
+model-driven orchestration.
+
+**Why it helps weaker models.** Long-running, multi-step work (implement →
+review → verify → test → open PR) is exactly where weaker models drift, lose
+state, or skip verification. Encoding the *process* as a typed graph means the
+model only has to be competent at each *leaf*, while the harness guarantees the
+sequencing, the verification gates, and the evidence trail.
+
+**Rough shape.** Starlark/YAML → typed IR → Rust executor. Nodes map to
+subagent lanes (`agent_open` / `tool_agent` / `agent_eval` / `agent_close`,
+`registry.rs:1017-1029`). Reviewer/verifier/test-runner nodes are first-class
+node *types*, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via
+`/workflow` (alias `/whaleflow`) and the `workflow_run` intent (§1).
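+
+To make "typed IR" concrete, here is a minimal sketch of a node type and an
+executor-side ordering pass. All names (`NodeKind`, `Node`, `execution_order`)
+are hypothetical, and the sketch assumes node ids are unique and that a node
+does not list the same dependency twice. It uses Kahn's algorithm to derive a
+run order and to reject cyclic graphs before anything executes.
+
```rust
use std::collections::{BTreeMap, VecDeque};

/// Node types named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub id: &'static str,
    pub kind: NodeKind,
    /// Ids of nodes that must finish before this one starts.
    pub after: Vec<&'static str>,
}

/// Kahn's algorithm: return a valid execution order, or an error if the
/// graph references an unknown node or contains a cycle.
pub fn execution_order(nodes: &[Node]) -> Result<Vec<&'static str>, String> {
    // Number of unfinished dependencies per node.
    let mut pending: BTreeMap<&'static str, usize> = BTreeMap::new();
    for n in nodes {
        pending.insert(n.id, n.after.len());
    }
    for n in nodes {
        for dep in &n.after {
            if !pending.contains_key(dep) {
                return Err(format!("node `{}` depends on unknown node `{}`", n.id, dep));
            }
        }
    }
    // Start with every node that has no dependencies, in declaration order.
    let mut ready: VecDeque<&'static str> = nodes
        .iter()
        .filter(|n| n.after.is_empty())
        .map(|n| n.id)
        .collect();
    let mut order = Vec::with_capacity(nodes.len());
    while let Some(id) = ready.pop_front() {
        order.push(id);
        // Release every node that was waiting on the one just scheduled.
        for n in nodes.iter().filter(|n| n.after.contains(&id)) {
            let left = pending.get_mut(&n.id).expect("every node id was inserted above");
            *left -= 1;
            if *left == 0 {
                ready.push_back(n.id);
            }
        }
    }
    if order.len() == nodes.len() {
        Ok(order)
    } else {
        Err("workflow graph contains a cycle".to_string())
    }
}
```
+
+With a check like this in the executor, a workflow that would skip or loop past
+its verification gate fails at load time instead of drifting mid-run.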
+
+**Dependencies.** Subagent runtime; the evaluation loop (§8) for traces;
+Skills & Rules (§5) so a skill can *define* a workflow; the command taxonomy
+(§9).
+
+---
+
+## 5. Skills & Rules as First-Class Runtime
+
+**What it is.** Skills and rules become real runtime objects, not just prompt
+text. Skills gain **activation modes**:
+
+- **always-on** — injected every turn,
+- **glob** — activated when matching files are in scope,
+- **model-decision** — offered to the model to opt into,
+- **manual** — only via explicit `$<skill-name>` invocation (§9).
+
+Skills can **restrict the tool surface**, **define workflows** (§4), and
+**inject repo context**.
+
+**Why it helps weaker models.** A skill scoped to a task can shrink the tool
+surface to exactly what that task needs and pre-load the relevant rules and
+context — so the model operates inside a curated, smaller world instead of the
+full catalog.
+
+**Rough shape (vs. today).** Today: skills are discovered
+(`crates/tui/src/tools/skills/mod.rs`, `discover_in_workspace ~421`; struct
+parses name/description `~382-388`), enable-state is tracked
+(`skill_state.rs`, `SkillStateStore::is_enabled ~73`), and there's an
+inline-mention popup (`slash_menu.rs ~86`). **But:** no parser activates inline
+`$` mentions on submit (submit path: `ui.rs build_queued_message ~4721`), there
+is **no activation-mode concept**, and **skills cannot restrict tools**. The
+direction adds (a) a submit-time `$<skill-name>` activation parser, (b) the
+four activation modes in skill metadata, and (c) a tool-restriction field
+enforced by the registry/router.
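+
+The sketch below shows one way items (b) and (c) could fit together. It is not
+the planned schema: the type and function names are hypothetical, glob
+matching is simplified to a file-name suffix check, and intersecting the
+restrictions of all active skills is an assumed policy.
+
```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ActivationMode {
    AlwaysOn,
    /// Simplified to a file-name suffix such as ".rs"; real glob matching
    /// would need a glob library.
    Glob(String),
    ModelDecision,
    Manual,
}

#[derive(Debug, Clone)]
pub struct Skill {
    pub name: String,
    pub mode: ActivationMode,
    /// `None` means no restriction; `Some` lists the only tools allowed.
    pub allowed_tools: Option<Vec<String>>,
}

/// Decide whether a skill is active this turn. `invoked` holds skill names
/// activated via `$<skill-name>` or opted into by the model.
pub fn is_active(skill: &Skill, files_in_scope: &[&str], invoked: &[&str]) -> bool {
    match &skill.mode {
        ActivationMode::AlwaysOn => true,
        ActivationMode::Glob(suffix) => {
            files_in_scope.iter().any(|f| f.ends_with(suffix.as_str()))
        }
        ActivationMode::ModelDecision | ActivationMode::Manual => {
            invoked.contains(&skill.name.as_str())
        }
    }
}

/// Keep only the tools that every active skill's restriction allows.
pub fn visible_tools<'a>(catalog: &[&'a str], active: &[&Skill]) -> Vec<&'a str> {
    catalog
        .iter()
        .copied()
        .filter(|tool| {
            active.iter().all(|s| match &s.allowed_tools {
                None => true,
                Some(allowed) => allowed.iter().any(|a| a == tool),
            })
        })
        .collect()
}
```
+
+A skill with `allowed_tools: None` leaves the catalog unchanged, so existing
+skills would keep working without declaring a restriction.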
+
+**Dependencies.** Tool lifecycle/alias table (§3) for restriction; Intent Router
+(§1); WhaleFlow (§4); command taxonomy (§9). **Full design in
+`docs/SKILL_INVOCATION_DESIGN.md`** (to be written this cycle).
+
+---
+
+## 6. Context Memory Stack
+
+**What it is.** Memory modeled as **explicit, layered, inspectable** stores
+rather than one undifferentiated blob. Each layer is **visible, inspectable,
+clearable, and scoped**:
+
+1. **User memory** — small user prefs/facts (surfaced via `/memory`, §9).
+2. **Repo rules** — checked-in guidance (`/rules`).
+3. **Codemap-wiki** — derived structural/semantic knowledge of the repo (§2).
+4. **Trace store** — recorded workflow/turn evidence (§8).
+5. **ARMH–RLM memo** — the RLM kernel's in-session working memory
+   (`rlm_open`/`rlm_eval`/`rlm_configure`/`rlm_close`/`rlm_session_objects`,
+   `crates/tui/src/tools/rlm.rs`; `handle_read` retrieves var handles;
+   `finalize`/`FINAL` is an *in-kernel Python function*, not a tool).
+6. **Cached-main overlay** — promoted lessons from the cached main branch
+   (`/overlay`, §9).
+7. **External memory (Aleph)** — large local data via the `aleph` skill.
+
+**Why it helps weaker models.** The model never has to *guess* where a fact
+should live or *re-derive* context it already established. Each layer has a
+clear scope and a clear command to inspect/clear it, so stale context is
+visible and removable rather than silently poisoning the prefix.
+
+**Rough shape.** A `/context` dashboard (§9) renders all active layers and their
+sizes; `/memory` manages the small user layer; `/overlay` manages promoted
+lessons. The RLM layer already exists and is plumbed through `rlm.rs`.
+
+**Dependencies.** Command taxonomy (§9); codebase intelligence (§2); evaluation
+loop (§8) for promotion into the overlay.
+
+---
+
+## 7. Tool Repair & Autoload
+
+**What it is.** When the model emits a wrong, deferred, deprecated, or
+environment-blocked tool call, the harness **repairs** it instead of returning a
+bare error — and **autoloads** what's needed.
+
+**Why it helps weaker models.** Recovery from a malformed call is precisely
+where weak models loop or give up. Turning every failure into an actionable,
+schema-bearing correction keeps the model on-task.
+
+**Rough shape — representative repairs:**
+
+- **Wrong/legacy name** → *"you meant `agent_eval`; here's the schema"* (autoload
+  the deferred tool's schema in the same turn).
+- **Mode mismatch** → *"shell is unavailable in Plan mode — ask the user or
+  switch modes"*.
+- **Missing dependency** → *"this tool needs Node; Node is missing"*
+  (dependency probe via `ExternalTool`, already imported in `tool_catalog.rs`).
+- **Deprecated alias** → silently **routed to the canonical** tool, with the
+  replacement notice in **result metadata only** (§3) — never the cached prefix.
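+
+The repairs above can be read as a single decision function. The sketch below
+is a hedged illustration: `Mode`, `Repair`, and `repair` are hypothetical
+names, the `exec_shell` prefix test for "is this a shell tool" is an
+assumption, and the dependency probe result is passed in by the caller rather
+than obtained through `ExternalTool`.
+
```rust
/// Session modes. Only Plan mode is named in the text; `Other` stands in
/// for every remaining mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Plan,
    Other,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Repair {
    /// The call is fine as requested.
    Dispatch { tool: String },
    /// Old alias: run the canonical tool; the notice goes to result metadata only.
    Reroute { tool: String, metadata_notice: String },
    /// The tool cannot run in the current mode.
    ModeBlocked { message: String },
    /// A required external program is missing.
    MissingDependency { message: String },
}

// Alias table from the planned diet: old name -> canonical name.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];

/// Assumption: shell tools are recognisable by an `exec_shell` name prefix.
fn is_shell_tool(tool: &str) -> bool {
    tool.starts_with("exec_shell")
}

/// Turn a requested tool call into an actionable outcome.
/// `missing_dependency` is the caller's probe result, e.g. `Some("Node")`.
pub fn repair(requested: &str, mode: Mode, missing_dependency: Option<&str>) -> Repair {
    let canonical = ALIASES
        .iter()
        .find(|(old, _)| *old == requested)
        .map(|(_, new)| *new);
    let tool = canonical.unwrap_or(requested);

    if mode == Mode::Plan && is_shell_tool(tool) {
        return Repair::ModeBlocked {
            message: "shell is unavailable in Plan mode; ask the user or switch modes"
                .to_string(),
        };
    }
    if let Some(dep) = missing_dependency {
        return Repair::MissingDependency {
            message: format!("`{}` needs {}; {} is missing", tool, dep, dep),
        };
    }
    match canonical {
        Some(new) => Repair::Reroute {
            tool: new.to_string(),
            metadata_notice: format!("`{}` was routed to `{}`", requested, new),
        },
        None => Repair::Dispatch { tool: tool.to_string() },
    }
}
```
+
+Resolving the alias before the mode check means a legacy name such as
+`exec_wait` is blocked in Plan mode exactly like its canonical counterpart.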
+
+**Dependencies.** The alias table + lifecycle states (§3); the Intent Router
+(§1); dependency detection (`ExternalTool`). Builds on PR #2685's actionable
+RLM/field errors and PR #2684's lifecycle signals — **must not contradict
+either**.
+
+---
+
+## 8. Evaluation Loop
+
+**What it is.** Every workflow run **leaves evidence**: the tests it ran, the
+diffs it produced, the failures it hit, the searches it issued, the claims it
+verified, and the PR outcome. A **teacher/student replay** turns *good* traces
+into reusable **rules, skills, tests, and cached guidance**.
+
+**Why it helps weaker models.** The system gets better at *this repo* over time
+without the model getting smarter. Verified good traces become rules and skills
+that the weaker model can lean on next time, and they feed the cached-main
+overlay (§6).
+
+**Rough shape.** Workflow nodes (§4) emit structured evidence into the trace
+store (§6). A replay/distillation pass (teacher reviews student trace) promotes
+high-value traces into: repo rules (`/rules`), skills (§5), regression tests,
+and overlay guidance (`/overlay`). Verified-claim tracking ties into the
+adversarial-verification posture already used elsewhere.
+
+**Dependencies.** WhaleFlow (§4) for trace emission; trace store + overlay (§6);
+Skills & Rules (§5) as promotion targets.
+
+---
+
+## 9. Command-Surface Taxonomy
+
+**What it is.** One name = **one thing**. The command surface is split so each
+prefix has a single, memorable responsibility:
+
+| Surface | Responsibility |
+|---|---|
+| `/memory` | **Small** user prefs/facts only |
+| `/context` | **Dashboard** of all active memory layers (§6) |
+| `/rules` | Repo guidance |
+| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) |
+| `/overlay` | Promoted cached-main lessons (§6/§8) |
+| `$<skill-name>` | Skill invocation — **the token *is* the skill name** |
+| `codebase_search` | Concept-level code retrieval (§2) |
+
+**Why it helps weaker models (and users).** No overloaded command does five
+jobs; the model/user never has to disambiguate *which* `/memory` behavior they
+meant. `$systematic-debugging` self-documents what it invokes.
+
+**`/memory` subcommand sketch:**
+
+```
+/memory add "<fact>"        # store a small pref/fact
+/memory edit                # edit stored facts
+/memory search <query>      # find a stored fact
+/memory clear               # clear user memory
+/memory doctor              # health check; detects legacy ~/.deepseek path
+/memory promote <fact>      # (later) promote a fact to a higher layer
+```
+
+`doctor` specifically detects the **legacy `~/.deepseek`** path and guides
+migration.
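+
+The subcommand sketch maps naturally onto a small typed parser. The following
+is only an illustration of that mapping: `MemoryCommand`,
+`parse_memory_command`, and the error strings are hypothetical, and the sketch
+covers parsing only, not what `doctor` or `promote` actually do.
+
```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MemoryCommand {
    Add(String),
    Edit,
    Search(String),
    Clear,
    Doctor,
    Promote(String),
}

/// Require a non-empty argument, stripping optional surrounding quotes.
fn required_arg(arg: &str, what: &str) -> Result<String, String> {
    let arg = arg.trim().trim_matches('"');
    if arg.is_empty() {
        Err(format!("missing <{}>", what))
    } else {
        Ok(arg.to_string())
    }
}

/// Parse one `/memory ...` line into a typed command.
pub fn parse_memory_command(input: &str) -> Result<MemoryCommand, String> {
    let rest = input
        .trim()
        .strip_prefix("/memory")
        .ok_or_else(|| "not a /memory command".to_string())?
        .trim();
    // Split the subcommand from the rest of the line at the first whitespace.
    let (sub, arg) = match rest.split_once(char::is_whitespace) {
        Some((sub, arg)) => (sub, arg),
        None => (rest, ""),
    };
    match sub {
        "add" => required_arg(arg, "fact").map(MemoryCommand::Add),
        "edit" => Ok(MemoryCommand::Edit),
        "search" => required_arg(arg, "query").map(MemoryCommand::Search),
        "clear" => Ok(MemoryCommand::Clear),
        "doctor" => Ok(MemoryCommand::Doctor),
        "promote" => required_arg(arg, "fact").map(MemoryCommand::Promote),
        other => Err(format!("unknown /memory subcommand `{}`", other)),
    }
}
```
+
+Returning an error for unknown subcommands, rather than guessing, keeps
+`/memory` limited to the single responsibility listed in the table above.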
+
+**`$<skill-name>` invocation examples:**
+
+```
+$systematic-debugging       # local skill
+$github:gh-fix-ci           # namespaced skill
+```
+
+The submit-time parser (to be added; submit path `ui.rs ~4721`) recognizes the
+`$` token and activates the named skill (§5).
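+
+A minimal sketch of what such a submit-time parser might extract, covering the
+two forms shown above. The names `SkillMention` and `parse_skill_mentions`,
+the allowed character set, and the trailing-punctuation handling are all
+assumptions; resolving a mention to an installed skill (and handling
+ambiguity) would be a separate step.
+
```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SkillMention {
    pub namespace: Option<String>,
    pub name: String,
}

/// Assumed identifier rule: starts with an ASCII letter, then letters,
/// digits, `-`, or `_`. This keeps tokens like `$5` from being mentions.
fn is_skill_ident(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic())
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Extract `$<skill-name>` and `$<namespace>:<skill-name>` tokens from a
/// submitted message, in order of appearance.
pub fn parse_skill_mentions(message: &str) -> Vec<SkillMention> {
    message
        .split_whitespace()
        .filter_map(|token| {
            // Only `$`-prefixed tokens qualify; drop trailing punctuation.
            let body = token
                .strip_prefix('$')?
                .trim_end_matches(|c: char| matches!(c, ',' | '.' | ';' | '!' | '?' | ')'));
            let (namespace, name) = match body.split_once(':') {
                Some((ns, name)) => (Some(ns), name),
                None => (None, body),
            };
            if is_skill_ident(name) && namespace.map_or(true, is_skill_ident) {
                Some(SkillMention {
                    namespace: namespace.map(str::to_string),
                    name: name.to_string(),
                })
            } else {
                None
            }
        })
        .collect()
}
```
+
+Because the token is the skill name, the parser never has to map a friendly
+label back to an identifier; it only has to validate and split the token.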
+
+**`/context` layers dashboard (example render):**
+
+```
+/context
+  user-memory      ▸ 7 facts                  (12 KB)   [clear]
+  repo-rules       ▸ CLAUDE.md, AGENTS.md     (8 KB)    [view]
+  codemap-wiki     ▸ 412 symbols indexed      (auto)    [rebuild]
+  trace-store      ▸ 3 recent workflow runs   (—)       [open]
+  rlm-memo         ▸ 0 active sessions        (—)       [—]
+  cached-overlay   ▸ 5 promoted lessons       (3 KB)    [view]
+  aleph-external   ▸ not attached             (—)       [attach]
+```
+
+**Dependencies.** Memory stack (§6); skills (§5); codebase intelligence (§2);
+workflow runner (§4).
+
+---
+
+## 10. Deferred-Not-Done 0.8.53 Diet Items
+
+Recorded here so they are **not silently dropped** — these were considered for
+the 0.8.53 diet and deliberately **deferred** (design-only or out of scope this
+cycle):
+
+- **File-mutation overload** — `apply_patch` / `edit_file` / `write_file` /
+  `fim_edit` overlap in purpose. Per #2681 non-goals these stay distinct;
+  canonical *guidance* (prefer `apply_patch`) is doc-only, no consolidation
+  this cycle.
+- **`task_shell_*` ↔ `exec_*` redundancy** — `task_shell_start` /
+  `task_shell_wait` overlap conceptually with the `exec_*` family. Left intact
+  this cycle (distinct niche per #2681); revisit under §1/§3.
+- **`handle_read` / `retrieve_tool_result`** — result-handle plumbing kept as-is
+  (doc-only canonical guidance); folds naturally into the memory stack (§6) and
+  intent routing (§1) later.
+- **Search-cluster consolidation** — `grep_files` / `file_search` /
+  `project_map` remain three tools this cycle; consolidation is the *job of the
+  hybrid index* (§2) under `codebase_search`, not a catalog edit in 0.8.53.
+
+---
+
+## Phased Roadmap
+
+### 0.8.53 — design + small fixes only
+- **Code:** only the already-scoped, narrow fixes — PR #2684 (subagent role
+  vocab, lifecycle signals, eval ergonomics) and PR #2685 (read-only git history
+  active + actionable RLM/field errors). Subagent legacy-name cleanup +
+  guardrail tests rebased on #2684.
+- **Docs:** this north star, plus `docs/TOOL_LIFECYCLE.md`,
+  `docs/CODEBASE_SEARCH_DESIGN.md`, `docs/SKILL_INVOCATION_DESIGN.md`.
+- **No tool-catalog code:** the diet (§3), the Intent Router (§1), and the
+  hybrid index (§2) are **documented, not coded** this cycle.
+
+### 0.9.0 — first structural moves
+- Implement the **tool lifecycle** const name-sets + alias table in
+  `tool_catalog.rs` (§3) as a one-time deterministic head edit.
+- Land the **planned diet**: `exec_wait`/`exec_interact`/`tts` →
+  hidden-compatibility; `todo_*` → deprecated, replaced by `checklist_*`
+  (result-metadata notice only).
+- Gate **`tool_agent`** to DeepSeek-V4 models only (§3).
+- First version of the **default hybrid codebase index** (FTS5/BM25 + symbol +
+  codemap) behind `codebase_search` (§2).
+- First **Intent Router** verbs mapping onto today's tools (§1).
+- **Tool Repair** for deferred/deprecated/mode/dependency cases (§7).
+
+### Later (post-0.9.0)
+- **WhaleFlow** typed-IR workflow runner (§4) and the **evaluation loop** /
+  teacher-student replay (§8).
+- **Skills activation modes** + tool restriction + `$<skill-name>` submit-time
+  activation (§5).
+- Full **Context Memory Stack** with `/context` dashboard, `/overlay`
+  promotion, and Aleph external memory (§6).
+- Dense/semantic retrieval and PR/commit/issue history in the index (§2).
+- Search-cluster consolidation and the remaining §10 deferred items.
+
+---
+
+## North-star one-liner
+
+> **The harness handles memory, search, routing, state, and guardrails — so a
+> weaker model can just think.**