diff --git a/docs/CODEBASE_SEARCH_DESIGN.md b/docs/CODEBASE_SEARCH_DESIGN.md new file mode 100644 index 00000000..d7711d1a --- /dev/null +++ b/docs/CODEBASE_SEARCH_DESIGN.md @@ -0,0 +1,305 @@ +# `codebase_search` — Local-First Semantic Code Retrieval + +> **Status:** Design note + planned eval scaffold. **Code is DEFERRED.** +> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle). +> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either. + +This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change. + +--- + +## 1. Problem + +Today CodeWhale ships two complementary code-locating tools and one structure map: + +- `file_search` — **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`). +- `grep_files` — **content** search (literal/regex token match). +- `project_map` — a deferred **structure** map. + +None of these answer **concept-level** questions where the user does not know the exact token: + +- "Where is provider auth resolved?" +- "What enforces the shell approval policy?" +- "Where do mode prompts get assembled?" +- "How does the subagent lifecycle close out a child?" + +`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens. 
+ +**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations. + +**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.) + +There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies. + +--- + +## 2. Approach Comparison + +| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 | +|---|---|---|---|---|---| +| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes — SQLite built in via `rusqlite` | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** | +| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** | +| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged | +| **Dense embeddings** | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged | +| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged | + +### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF) + +Each enabled signal produces an independent ranked list; results are merged with RRF +(`score(d) = Σ_signals 1/(k + rank_signal(d))`, 
conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales. + +**v0.9.0 Phase 1 signal set (all local, no model downloads):** + +1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer. +2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms. +3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`). +4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at. +5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries. + +**Optional later backends (feature-flagged, off by default):** + +- `--features sparse-splade` — adds a SPLADE signal list to the RRF. +- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy). +- `--features rerank` — cross-encoder reranks the fused top-K. + +Phase 1 deliberately omits all three ML backends (SPLADE, dense embeddings, cross-encoder rerank) so the tool ships with zero network/model dependency and is reproducible in CI. + +--- + +## 3. Storage & Index + +### Location + +``` +~/.codewhale/index/.db +``` + +`` is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep). + +> Migration note (ties to the `/memory doctor` taxonomy in Appendix B): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored (a future `doctor` may offer to migrate, never auto-read). 
+ +### Schema sketch + +```sql +CREATE TABLE files ( + id INTEGER PRIMARY KEY, + path TEXT NOT NULL UNIQUE, -- workspace-relative + mtime_ns INTEGER NOT NULL, -- invalidation + size_bytes INTEGER NOT NULL, + content_hash TEXT NOT NULL, -- blake3; skip re-chunk if unchanged + lang TEXT, -- detected language + branch TEXT -- branch at last index (invalidation) +); + +CREATE TABLE chunks ( + id INTEGER PRIMARY KEY, + file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE, + start_line INTEGER NOT NULL, + end_line INTEGER NOT NULL, + kind TEXT, -- fn | struct | impl | const | doc | block + symbol TEXT, -- primary symbol name if any + text TEXT NOT NULL -- chunk body (identifier-split copy for FTS) +); + +-- Lexical index. external-content FTS so we don't duplicate bodies twice. +CREATE VIRTUAL TABLE chunks_fts USING fts5( + text, + symbol, + content='chunks', + content_rowid='id', + tokenize = 'unicode61 remove_diacritics 2' -- + identifier pre-split at index time +); + +CREATE TABLE symbols ( + id INTEGER PRIMARY KEY, + file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE, + chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE, + name TEXT NOT NULL, + kind TEXT NOT NULL, -- fn | struct | enum | trait | impl | const | macro + line INTEGER NOT NULL +); +CREATE INDEX symbols_name ON symbols(name); + +-- Session relevance: lightweight touch log, written by the session, decayed on read. +CREATE TABLE session_touch ( + path TEXT PRIMARY KEY, + last_touch INTEGER NOT NULL, -- unix ns + touch_count INTEGER NOT NULL DEFAULT 1 +); +``` + +Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags. 
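The identifier pre-split described above could look roughly like the sketch below. It is illustrative only: `split_identifier` is a hypothetical helper name, not an existing CodeWhale function, and the boundary rules (split on non-alphanumerics, on lower-to-upper transitions, and before the last capital of an acronym run) are assumptions about what "identifier-aware" should mean here.

```rust
/// Split an identifier into lowercase, space-separated words.
/// Hypothetical helper: handles snake_case, kebab-case, camelCase,
/// PascalCase, and acronym runs such as `HTTPServer`.
fn split_identifier(ident: &str) -> String {
    let chars: Vec<char> = ident.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();

    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Any separator (`_`, `-`, `:`, ...) closes the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            // `current` is non-empty, so chars[i - 1] exists and belongs to it.
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // camelCase boundary: lower/digit followed by upper.
            let camel_boundary = prev.is_lowercase() || prev.is_numeric();
            // Acronym boundary: "HTTPServer" splits before the final 'S'.
            let acronym_boundary = prev.is_uppercase() && next_is_lower;
            if camel_boundary || acronym_boundary {
                words.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Under these assumptions, `resolveApiKey` and `resolve_api_key` both index as `resolve api key`, so a plain FTS5 `MATCH` on `api key` can reach either spelling without any query-side rewriting.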
+ +### Chunking strategy (structure-aware) + +Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere. + +### Invalidation + +- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged. +- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash). +- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift. + +### Privacy / trust + +- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine. +- **No cloud by default.** Phase 1 has zero network dependency. +- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything. + +--- + +## 4. Model-Visible Tool Contract + +```jsonc +// codebase_search +{ + "name": "codebase_search", + "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. 
Complements grep_files (exact text) and file_search (filenames).", + "parameters": { + "query": { "type": "string", "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." }, + "max_results":{ "type": "integer", "default": 10 }, + "path_glob": { "type": "string", "description": "Optional path filter, e.g. 'crates/tui/**'." }, + "lang": { "type": "string", "description": "Optional language filter." }, + "kind": { "type": "string", "description": "Optional symbol-kind filter: fn|struct|impl|const|..." } + } +} +``` + +**Result shape — ranked, explainable, auditable:** + +```jsonc +{ + "results": [ + { + "path": "crates/tui/src/config/provider.rs", + "line": 142, + "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result { ... }", + "score": 0.91, + "reasons": [ + "symbol: resolve_api_key matches 'auth/resolve'", + "lexical: matched tokens [provider, api, key, resolve]", + "path: component 'provider' matches query", + "session: file read 2 turns ago" + ] + } + ], + "backend": "lexical+symbol+path+session", // which signals were fused (RRF) + "fallback_grep_hits": 1 // exact-match hits folded in +} +``` + +`reasons[]` is **mandatory** and is the auditability contract: every result explains *why* it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The `backend` field records which signals were active so results are reproducible given the feature set. + +--- + +## 5. Benchmark / Eval Set + +A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line. 
+ +| # | Query (concept, no exact token) | Expected file(s) | Anchor | +|---|---|---|---| +| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module | +| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 | +| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 | +| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 | +| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 | +| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 | +| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage | +| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 | +| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 | +| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 | +| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~421; skill struct ~382-388 | +| 12 | Where is skill enable-state stored / checked? 
| `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 | +| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 | +| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 | +| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` ≡ `tts` :787-792 | + +Each entry is intended to become a `(query, expected_paths[])` row in a fixture +(e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). This PR ships +the design table only; the fixture and harness are deferred to Phase 1. The +Phase 1 harness runs all queries against the live index and reports recall@k +and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking +changes. + +--- + +## 6. Phasing, Feature Flags, and Non-Goals + +### Phasing + +- **Phase 0 (this cycle, v0.8.53):** this design note + benchmark table only. No fixture, harness, or catalog code. +- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant. +- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language). +- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy). +- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K. 
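The RRF fusion that Phase 1 relies on (defined in §2) can be sketched in a few lines. This is a minimal illustration, not the planned implementation: `rrf_fuse` is a hypothetical name, document ids are plain strings standing in for chunk ids, and ties are broken alphabetically only to keep the output deterministic.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion:
/// score(d) = sum over signals of 1 / (k + rank_signal(d)), ranks 1-based.
/// Each inner Vec is one signal's ranking, best hit first.
/// Hypothetical helper; ids here stand in for chunk ids.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is stable.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document missing from one signal's list simply contributes nothing for that signal, which is why lexical, symbol, path, and grep-fallback lists of different lengths can be fused without normalizing their raw scores.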
+ +### Feature flags + +``` +codebase-search-core # Phase 1, default-on once it lands +sparse-splade # Phase 3, default-off +dense-embed # Phase 3, default-off (gated HF download) +rerank # Phase 4, default-off +``` + +### Non-goals + +- **No cloud index is required** for the core experience. Ever, for Phase 1. +- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback. +- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more. + +### Cross-link: WhaleFlow epic + +`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors). + +--- + +## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only) + +These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn. + +### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field) + +| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? 
| +|---|---|---|---|---| +| **active** | yes | yes | yes | no | +| **deferred** | no | yes | yes | no | +| **hidden-compatibility** | no | no | yes | no | +| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) | +| **removed** | no | no | no | — | + +Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant). + +### Planned diet (documented, not yet coded) + +- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools: + - `exec_wait` ≡ `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`. + - `exec_interact` ≡ `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`). + - `tts` ≡ `speech` (same `SpeechTool`, `registry.rs:787-792`). + - Action: drop from active + search, keep registered, identical behavior, **no notice**. +- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only). 
+- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired. + +### Model-visible subagent surface (unchanged) + +Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`). + +- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit. + +### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance) + +`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is. + +--- + +## Appendix B — Command-Surface Taxonomy (context) + +Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces: + +- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path). +- `/context` — dashboard of all active layers. +- `/rules` — repo guidance. 
+- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic). +- `/overlay` — promoted cached-main lessons. +- `$` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`). +- `codebase_search` — concept-level code retrieval (this document). diff --git a/docs/SKILL_INVOCATION_DESIGN.md b/docs/SKILL_INVOCATION_DESIGN.md new file mode 100644 index 00000000..127f17fa --- /dev/null +++ b/docs/SKILL_INVOCATION_DESIGN.md @@ -0,0 +1,233 @@ +# Skill Invocation Design — the `$` inline syntax + +Status: **DESIGN ONLY** (v0.8.53 cycle). No catalog/parser code ships in this +cycle; the implementation target is **0.9.0**. This document describes what +*will* be built and the contracts it must honor against today's code. + +Related design docs: `TOOL_LIFECYCLE.md` (tool lifecycle states + per-skill tool +restriction), command-surface taxonomy notes for `/memory`, `/context`, +`/rules`, `/workflow` (`/whaleflow`), `/overlay`. Open PRs on `codex/v0.8.53`: +#2684 (subagent role vocab / lifecycle signals / eval ergonomics) and #2685 +(git history active + RLM/field errors). Nothing here contradicts those. + +--- + +## 1. Problem + +Skill activation has no single, model-legible entry point, and the candidate +surfaces all compete with each other: + +- A `/skill` slash command, a `load_skill`-style tool, plugin/namespace naming + (`superpowers:systematic-debugging`, `github:gh-fix-ci`), and the long-running + workflow commands (`/workflow` / `/whaleflow`) all *could* be "the way you + start a skill." None of them is canonical. +- Slash commands are already overloaded. `/memory`, `/context`, `/rules`, + `/config`, `/provider`, `/workflow`, `/overlay` each map to one subsystem; + jamming skill invocation into `/`-space forces a weaker model to disambiguate + "is this a command or a skill?" on every keystroke. 
+- Weaker / smaller models (the cheaper providers CodeWhale targets) do not + reliably pick the right mechanism. They will free-text "let me use systematic + debugging" instead of actually loading the skill body, so the guidance never + enters the context window. +- Today there is **no parser that activates an inline skill mention on submit.** + `slash_menu.rs:86` (`partial_inline_skill_mention_at_cursor`) recognizes an + inline `/` token *under the cursor for popup purposes only*; the submit + path in `ui.rs:4721` (`build_queued_message`) does not resolve or activate any + inline mention. There is also no activation-mode concept (always-on / glob / + model-decision / manual) and skills cannot restrict tools yet. + +We need one prefix that means exactly "invoke this skill," is visually distinct +from commands, and is cheap for a small model to emit correctly. + +--- + +## 2. Proposal + +Adopt **`$` as the skill-invocation prefix**, where **the token *is* the skill +name** — not a literal command called `$skill`. + +``` +$systematic-debugging figure out why MiMo auth fails +$test-driven-development add coverage before fixing +$github:gh-fix-ci inspect the failing checks +$aleph search the planning doc +``` + +The leading `$` is the marker; everything from `$` up to the next whitespace is +the **skill id**. The rest of the line is the user's request, passed through to +the model with the skill body already loaded as active guidance. + +This is deliberately a *reference / macro* sigil, like a shell variable +expansion or an `@mention`: `$skill-id` resolves to "the contents and tool +policy of that skill," then the surrounding prose is the task. + +`$` works in three places (see §4): the user composer, the command-palette +input, and **model-facing planning text** — so the model itself can write +`$systematic-debugging` in its plan and have it resolve. + +--- + +## 3. Resolution rules + +Given a token `$` (id captured up to the next whitespace): + +1. 
**Exact name first.** Look the id up directly: + `discover_in_workspace(workspace).get(id)` — `skills/mod.rs:553` builds the + registry; `SkillRegistry::get` (`skills/mod.rs:421`) matches on `s.name == id` + exactly. Skill names come from frontmatter `name:` (or the first `# Heading` + fallback) parsed at `skills/mod.rs:382-417`. An exact hit wins unconditionally. + +2. **Namespaced `$ns:skill`.** If the id contains a `:`, treat the part before + the colon as a source/plugin namespace and the part after as the skill name: + `$github:gh-fix-ci`, `$superpowers:systematic-debugging`. Namespaced ids are + the disambiguation handle a user is told to type when a bare id is ambiguous. + (Glob/wildcard namespacing — `$github:*` — is explicitly deferred, see §6.) + +3. **Fuzzy match *suggests*, never silently chooses.** If there is no exact (or + namespaced-exact) hit, run a case-insensitive substring / prefix match over + `SkillRegistry::list()` (`skills/mod.rs:426`). If exactly one skill matches, + surface it as a suggestion ("did you mean `$systematic-debugging`?") but do + **not** auto-activate it. If more than one matches, list them and require the + user/model to re-issue with a disambiguated id (§7). Ambiguity never resolves + to a silent pick. + +4. **Respect enable-state.** A resolved skill is only activated if + `SkillStateStore::is_enabled(id)` is true (`skill_state.rs:73`: + `!self.disabled.contains(skill_name)`). A disabled skill that resolves by + name produces a clear "skill is disabled; enable it with `/skill enable `" + message rather than silently activating or silently doing nothing. + +Resolution order is therefore: **exact → namespaced-exact → enabled-check → +fuzzy-suggest (never auto-pick).** + +--- + +## 4. Behavior + +When a `$` mention resolves and is enabled: + +- **Visible activation line.** The transcript shows `Using skill: ` so the + user can see which skill body entered context. 
(Mirrors the existing skill UX + vocabulary; one line per activated skill.) +- **Body loaded as active guidance.** The skill's `body` + (`skills/mod.rs` `Skill.body`) is injected into the turn as authoritative + guidance, the same content a `/skill`-style activation would load. The user's + trailing prose is the task the guidance applies to. +- **Tool-surface narrowing (when declared).** If the skill declares a set of + allowed tools, the active tool surface narrows to that set for the duration of + the skill's influence. **Per-skill tool restriction is net-new** — skills + cannot restrict tools today; the mechanism, and how narrowing interacts with + the catalog-head byte-stability invariant (`tool_catalog.rs:169-196`), is + specified in `TOOL_LIFECYCLE.md`. Until that lands, a declared tool list is + parsed and shown but not enforced. +- **Multiple `$mentions` compose explicitly, or prompt.** Until formal + composition rules exist, two or more `$mentions` in one message either compose + only when the rule is unambiguous (e.g. one guidance skill + one tool-scoping + skill) or return a **"choose one"** prompt listing the mentioned skills. We + never silently activate multiple complex skills at once (see §7 and Non-goals). +- **Three input surfaces.** Resolution runs for: (a) user prompts in the + composer, (b) command-palette input, and (c) model-facing planning text, so a + model that writes `$test-driven-development` in its plan triggers the same + activation path a human would. +- **Slash commands remain supported.** `/skill ...` and the rest of the slash + surface keep working unchanged. `$` is the *preferred* path for models because + it is one token and unambiguous, but it is additive, not a replacement (§7 + Non-goals). + +--- + +## 5. Why `$` + +- **Visually distinct from `/commands`.** A glance separates "run a subsystem + command" (`/memory`, `/context`, `/workflow`) from "load a skill" (`$aleph`). + Weaker models stop confusing the two surfaces. 
+- **Reads like a reference / macro.** `$name` already means "expand this named + thing" to anyone who has touched a shell or a templating language. Skill + invocation *is* an expansion: `$skill-id` → that skill's guidance + tool policy. +- **Avoids overloading the slash namespace.** `/workflow`, `/memory`, `/config`, + `/provider`, `/rules`, `/overlay`, `/context` each already own one meaning in + the command-surface taxonomy. Skills get their own sigil instead of a crowded + `/skill ` subcommand competing with all of them. +- **Easy to type and remember.** Single leading character, then the literal + skill name. Nothing to memorize beyond the skill ids the user already sees in + `/skill list`. + +--- + +## 6. Implementation plan (smallest viable 0.8.53-ready slice → 0.9.0) + +The 0.8.53 cycle is **docs only**. The plan below is the build order once code +is unblocked; the first slice is intentionally the minimum that proves the path. + +**Slice 1 — token scanner at submit (the minimum viable feature).** +- Add a `$` token scanner invoked on submit, **before** + `build_queued_message` runs (`ui.rs:4721`). The scanner finds leading-`$` + tokens, captures the id up to the next whitespace, and hands each id to the + resolver. The scanner must skip `$` occurrences inside code fences and inline + command strings (see Non-goals) so shell `$VAR` references are never treated as + skill mentions. +- Resolve via `discover_in_workspace(workspace).get(id)` (`skills/mod.rs:553` / + `:421`), gate on `SkillStateStore::is_enabled` (`skill_state.rs:73`), and emit + the `Using skill: ` line plus the loaded body. + +**Slice 2 — inline-mention popup.** +- Extend the inline-mention popup machinery in `slash_menu.rs:86` + (`partial_inline_skill_mention_at_cursor`) to recognize a `$`-prefixed token + under the cursor and offer skill-name completions from `SkillRegistry::list()`, + the same way the slash popup offers commands. 
This is a UX accelerator on top + of Slice 1, not a precondition for it. + +**Slice 3 — ambiguity diagnostics.** +- When resolution is ambiguous, emit actionable diagnostics, e.g. + `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging, + superpowers:systematic-debugging — use $superpowers:systematic-debugging"`. + Diagnostics name the disambiguated id the user should type next. + +**Deferred to 0.9.0+ (explicitly out of the first slices):** +- `$ns:skill` **globs / wildcards** (`$github:*`). Plain namespaced-exact + (`$github:gh-fix-ci`) ships in Slice 1; globbing does not. +- **Per-skill tool restriction enforcement.** Parsing/display can land early; + enforcement and its catalog-head-stability handling are owned by + `TOOL_LIFECYCLE.md`. +- **Multi-skill composition rules.** Until defined, fall back to the "choose one" + prompt (§4, §7). + +--- + +## 7. Ambiguity / error UX, tests, and non-goals + +### Error / ambiguity UX examples + +| Input | Outcome | +|---|---| +| `$systematic-debugging fix the auth bug` | Exact hit. `Using skill: systematic-debugging`, body loaded, task = "fix the auth bug". | +| `$github:gh-fix-ci inspect failing checks` | Namespaced-exact hit. `Using skill: github:gh-fix-ci`, body loaded. | +| `$nope do a thing` | No match. `"No skill named 'nope'. Run /skill list to see available skills."` No activation; the line is sent as ordinary text. | +| `$debugging ...` (3 candidates) | `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging, superpowers:systematic-debugging — use $superpowers:systematic-debugging."` No auto-pick. | +| `$systematic-debug ...` (1 fuzzy candidate) | Suggest only: `"No exact skill 'systematic-debug'. Did you mean $systematic-debugging?"` No silent activation. | +| `$aleph ...` but aleph disabled | `"Skill 'aleph' is disabled. Enable it with /skill enable aleph."` No activation. 
| +| `$tdd $systematic-debugging ...` (2 mentions) | `"Choose one skill to lead this turn: $test-driven-development or $systematic-debugging."` (until composition rules exist). | +| `echo $PATH` inside a code fence / command string | Not a mention. Scanner skips `$` inside code/command contexts. | + +### Tests (planned) + +- **Exact:** `$systematic-debugging` resolves via `get(id)`, activates, loads body. +- **Namespaced:** `$github:gh-fix-ci` resolves on the `ns:skill` form. +- **Missing:** `$nope` → no-match message, no activation, line passed as text. +- **Ambiguous:** `$debugging` (≥2 candidates) → "matched N skills … use $ns:skill", + asserts **no** auto-activation occurred. +- **Disabled:** a skill with `is_enabled == false` → disabled message, no activation. +- **Guardrail — `$` in code:** `$VAR` inside a fenced block or command string is + not treated as a mention. + +### Non-goals + +- **Do not remove slash commands.** `/skill` and the whole `/` surface stay; `$` + is preferred for models but additive. +- **Do not auto-run arbitrary scripts.** A `$mention` loads guidance (and, later, + a declared tool policy) — it never executes shell or skill-attached scripts on + its own. +- **Do not silently activate multiple complex skills.** Multi-mention falls back + to a "choose one" prompt until composition rules are specified. +- **Do not let `$` collide with shell variables.** `$` inside code fences and + command strings is never parsed as a skill mention. diff --git a/docs/TOOL_LIFECYCLE.md b/docs/TOOL_LIFECYCLE.md new file mode 100644 index 00000000..07dc3407 --- /dev/null +++ b/docs/TOOL_LIFECYCLE.md @@ -0,0 +1,370 @@ +# Tool-Surface Lifecycle Policy (v0.8.53) + +**Status:** Design doc / policy. No catalog code lands in this cycle — the code +work is **deferred**. This document is the umbrella policy for GitHub **#2681**, +with **#2682** and **#2683** as concrete instances of the planned diet. 
It +describes *what will be done* and the invariants any future diet PR must hold. + +**Scope of related open work (do not contradict):** +- PR **#2684** — subagent role vocabulary, lifecycle signals, eval ergonomics. + Legacy subagent-name cleanup + guardrail tests in this policy rebase on #2684. +- PR **#2685** — git-history active + RLM/field errors. + +All file:line citations are against the verified tree at +`/Users/huntermbown/Desktop/whalebro/codewhale` as of v0.8.52/0.8.53. + +--- + +## 1. Purpose and the weaker-model problem + +CodeWhale ships a large native tool surface. The first-turn *active* partition +of that surface is what every model sees before it has run a single +`tool_search_*` call. Today that active set contains several **near-duplicate +tools** that map to the *same* implementation under different names: + +- `exec_wait` and `exec_shell_wait` are both `ShellWaitTool` + (`crates/tui/src/tools/registry.rs:526,529`). +- `exec_interact` and `exec_shell_interact` are both `ShellInteractTool` + (`registry.rs:527,530`). +- `tts` and `speech` are both `SpeechTool` + (`registry.rs:787-792`, both deferred). +- `todo_write` and `checklist_write` are the *same* `TodoWriteTool` + constructed two ways (`crates/tui/src/tools/todo.rs:184-196`). + +For a strong model, redundant names are harmless noise. For **weaker / smaller +models** (the Arcee Trinity lane, `deepseek-v4-flash` child executors, and any +non-thinking executor), every additional near-duplicate in the visible set is a +real cost: + +- It widens the choice space with options that do *nothing distinct*, increasing + wrong-tool selection and oscillation between synonyms. +- It spends scarce first-turn catalog budget (Section 5) on zero-information + entries. +- It dilutes the "one name = one thing" contract that lets a small model reason + about the surface at all. 
+ +The lifecycle policy exists to **shrink and discipline the model-visible +surface** without ever breaking the ability to replay an old transcript that +referenced a now-retired name. + +--- + +## 2. The five lifecycle states + +Every native tool name occupies exactly one lifecycle state. + +| State | Meaning | Visible on first turn? | In `tool_search_*`? | Executes if called? | When used | +|---|---|---|---|---|---| +| **active** | Canonical, in the first-turn catalog head | **Yes** | n/a (already active) | Yes | The tool a model should reach for by default | +| **deferred** | Registered + discoverable, hydrated on demand | No | **Yes** | Yes | Real, useful tools that don't earn a first-turn slot | +| **hidden-compatibility** | Registered + dispatchable, but removed from active **and** from search | No | **No** | **Yes — identical behavior, silent** | Old synonym kept only so old transcripts replay; no model should newly discover it | +| **deprecated** | Like hidden-compat, but execution **appends a replacement notice to result metadata** | No | **No** | **Yes — works, plus a "use X instead" notice** | A retired name we actively steer callers off of, still safe to replay | +| **removed** | Not registered at all | No | No | **No — hard error** | Only after `planned_removal_version`, once replay support is formally dropped | + +### hidden-compatibility vs deprecated — be precise + +Both states are **invisible** (not active, not in tool search) and both remain +**dispatchable** (calling them still works). The *only* difference is the +caller-facing signal: + +- **hidden-compatibility:** completely silent. The tool behaves byte-for-byte + like its canonical twin. We use this when there is *no behavioral or naming + lesson to teach* — the name was a pure alias and we simply don't want models + re-learning it. (Example: `exec_wait` is literally `exec_shell_wait`.) 
+- **deprecated:** behaves identically *and succeeds*, but the tool result's + **metadata** carries an appended notice like + `"deprecated: use checklist_write instead"`. The notice goes **only in the + result metadata returned for that call** — never in the cached tool catalog + prefix (see Section 8). We use this when there is a canonical replacement we + want the caller (and any human reading the transcript) nudged toward. + +Neither state ever changes the *behavior* of the call. Replay always works. + +--- + +## 3. Representation in code + +The lifecycle is represented as **const name-sets plus an alias/manifest table** +in `crates/tui/src/core/engine/tool_catalog.rs`, alongside the existing +`DEFAULT_ACTIVE_NATIVE_TOOLS` (`tool_catalog.rs:37-64`) and +`ARCEE_FIRST_TURN_NATIVE_TOOLS` (`tool_catalog.rs:106-115`). + +### 3a. Name-sets and the manifest (sketch) + +```rust +// crates/tui/src/core/engine/tool_catalog.rs (planned) + +/// Tools removed from the active set AND from tool-search, but still +/// registered and dispatchable with byte-identical behavior. Silent. +pub(super) const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &[ + "exec_wait", // == exec_shell_wait (ShellWaitTool) + "exec_interact", // == exec_shell_interact (ShellInteractTool) + "tts", // == speech (SpeechTool) +]; + +/// Deprecated aliases: invisible + dispatchable, with a replacement notice +/// appended to RESULT METADATA only (never the cached prefix). 
+pub(super) struct DeprecatedAlias { + pub name: &'static str, + pub replacement: &'static str, + pub note: &'static str, +} + +pub(super) const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[ + DeprecatedAlias { name: "todo_write", replacement: "checklist_write", + note: "use checklist_write instead" }, + DeprecatedAlias { name: "todo_add", replacement: "checklist_add", + note: "use checklist_add instead" }, + DeprecatedAlias { name: "todo_update", replacement: "checklist_update", + note: "use checklist_update instead" }, + DeprecatedAlias { name: "todo_list", replacement: "checklist_list", + note: "use checklist_list instead" }, +]; + +#[inline] +pub(super) fn is_hidden_or_deprecated(name: &str) -> bool { + HIDDEN_COMPATIBILITY_TOOLS.contains(&name) + || DEPRECATED_ALIASES.iter().any(|d| d.name == name) +} +``` + +### 3b. The two filter points + +1. **Catalog / tool-search exclusion (tool_catalog.rs).** + Deferral is decided by `should_default_defer_tool` (`tool_catalog.rs:66-82`), + and the active set is the head built by `build_model_tool_catalog` + (`tool_catalog.rs:178-196`). Hidden-compat and deprecated tools must be + forced *out of the active head* and *out of the tool-search-discoverable + pool*. Concretely, the deferral predicate gains a short-circuit so these + names are never active, and the tool-search index builder skips any name for + which `is_hidden_or_deprecated(name)` is true. Arcee's narrowed first-turn + path (`apply_provider_tool_policy`, `tool_catalog.rs:134-149`) already + excludes them by construction since they aren't in + `ARCEE_FIRST_TURN_NATIVE_TOOLS`. + +2. **Result-notice append (tool_routing.rs).** + Dispatch already routes by tool name in + `crates/tui/src/tui/tool_routing.rs` (e.g. the wait/interact unification at + `tool_routing.rs:1139-1140`). After a successful dispatch, if the called name + is in `DEPRECATED_ALIASES`, the router appends the matching `note` to the + **result metadata only**. Hidden-compat names append nothing. 
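
The two filter points above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the name-sets mirror the 3a sketch (with `DeprecatedAlias` trimmed to the two fields used here), while `ToolResult`, `searchable_pool`, and `append_deprecation_notice` are hypothetical names invented for this example and do not exist in the tree.

```rust
// Sketch only. The const name-sets mirror section 3a; `ToolResult`,
// `searchable_pool`, and `append_deprecation_notice` are hypothetical.

const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &["exec_wait", "exec_interact", "tts"];

/// Trimmed version of the 3a struct: only the fields this sketch reads.
struct DeprecatedAlias {
    name: &'static str,
    note: &'static str,
}

const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
    DeprecatedAlias { name: "todo_write", note: "use checklist_write instead" },
    DeprecatedAlias { name: "todo_add", note: "use checklist_add instead" },
    DeprecatedAlias { name: "todo_update", note: "use checklist_update instead" },
    DeprecatedAlias { name: "todo_list", note: "use checklist_list instead" },
];

fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_ALIASES.iter().any(|d| d.name == name)
}

/// Filter point 1 (hypothetical helper): names eligible for tool-search
/// discovery. Hidden-compat and deprecated names stay registered elsewhere;
/// they are only dropped from this discoverable pool.
fn searchable_pool<'a>(registered: &[&'a str]) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| !is_hidden_or_deprecated(name))
        .collect()
}

/// Hypothetical stand-in for whatever the router returns after dispatch.
#[derive(Debug, Default)]
struct ToolResult {
    output: String,
    metadata: Vec<String>,
}

/// Filter point 2 (hypothetical helper): after a successful dispatch, attach
/// the replacement note to result metadata only. Hidden-compat names match
/// nothing here, so they stay silent, and `output` is never modified.
fn append_deprecation_notice(called_name: &str, result: &mut ToolResult) {
    if let Some(alias) = DEPRECATED_ALIASES.iter().find(|d| d.name == called_name) {
        result.metadata.push(format!("deprecated: {}", alias.note));
    }
}
```

Keeping both helpers as pure functions over the const name-sets means neither one touches the emitted tool JSON, which is the property Section 8 depends on.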
+ +### 3c. Why name-sets, not a per-`ToolSpec` enum field + +A per-`ToolSpec` `lifecycle: Lifecycle` field was rejected for three reasons: + +- **Prefix-cache safety.** The tool catalog array is part of DeepSeek's + immutable KV prefix (`tool_catalog.rs:169-177`). A per-spec field invites + serializing lifecycle state *into* each tool's schema, which is exactly the + kind of head mutation that forces a full re-prefill. Name-sets live entirely + in the catalog-build logic and never touch the emitted tool JSON. +- **Single source of truth + diffability.** The diet for a release is one small, + reviewable edit to two or three const arrays in one file, instead of scattered + field flips across many tool modules. +- **Registration stays orthogonal.** Tools remain registered exactly as today + (e.g. `with_shell_tools`, `registry.rs:523-531`). Lifecycle is a *catalog + policy* layered on top of registration, not a property baked into the tool. + +--- + +## 4. Deprecation manifest (the #2681 acceptance-criteria table) + +This is the authoritative manifest. Columns are the #2681 AC columns. No entry +is "removed" in 0.8.53; replay is supported for everything listed. 
+ +| Alias | Replacement (canonical) | Lifecycle state | first_deprecated_version | planned_removal_version | replay_supported | +|---|---|---|---|---|---| +| `exec_wait` | `exec_shell_wait` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `exec_interact` | `exec_shell_interact` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `tts` | `speech` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `todo_write` | `checklist_write` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `todo_add` | `checklist_add` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `todo_update` | `checklist_update` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes | +| `todo_list` | `checklist_list` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes | + +**Legacy subagent names — already non-visible, no manifest entry needed.** +`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, +`send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, and +`delegate_to_agent` exist only as `#[allow(dead_code)]` structs in +`crates/tui/src/tools/subagent/mod.rs` and are **never instantiated** outside +tests, so they are already not model-visible. Only `agent_open`, `agent_eval`, +`tool_agent`, and `agent_close` are registered +(`registry.rs:1017-1029`). The action for these legacy names is **dead-code +cleanup + a guardrail test** (rebase on PR #2684), not a lifecycle transition. + +> **Keep the live internal methods.** `send_input`, `cancel`, and `resume` also +> exist as live `SubAgentManager` methods +> (`subagent/mod.rs:1605,1495,1521`) used internally by `agent_eval` / +> `agent_close`. These are *not* the dead-code tool structs and must be kept. + +`planned_removal_version` is intentionally `TBD`: a name only moves to **removed** +once we formally drop replay for transcripts old enough to contain it, which is a +separate, deliberate decision per name. + +--- + +## 5. 
Active-catalog budget (per mode, per provider) + +The active set is the first-turn cost. Do not duplicate the exact +`DEFAULT_ACTIVE_NATIVE_TOOLS` count here: adjacent PRs in the v0.8.53 batch may +add or remove active tools, and the source of truth is always +`tool_catalog.rs`. This document defines the diet policy and invariants, not a +second catalog snapshot. + +### Per provider + +| Provider | First-turn active source | Budget policy | +|---|---|---| +| Default (DeepSeek et al.) | `DEFAULT_ACTIVE_NATIVE_TOOLS` | Remove duplicate aliases from the active head when their canonical twins stay active; any net growth needs an explicit budget decision. | +| Arcee (Trinity) | `ARCEE_FIRST_TURN_NATIVE_TOOLS` | Provider-specific read-only WAF workaround; unchanged by the default diet unless explicitly reviewed. | + +The default diet removes `exec_wait` and `exec_interact` from the active head +(they become hidden-compat; their canonical twins `exec_shell_wait` / +`exec_shell_interact` stay). `tts` and `todo_*` are *already not* in the active +set, so they do not change the active budget in this diet. The net effect of +this specific diet is to remove two duplicate active aliases from whatever +default active head is current after the surrounding v0.8.53 PR batch. + +### Per mode (Plan / Agent / YOLO) + +The native active head is the **same set across modes** by design — mode does not +add or remove native tools from `DEFAULT_ACTIVE_NATIVE_TOOLS` +(`should_default_defer_tool` ignores `_mode` for native tools, +`tool_catalog.rs:66-68`). Mode affects **MCP** deferral instead: +`apply_mcp_tool_deferral` keeps MCP tools deferred unless `mode == Yolo` +(`tool_catalog.rs:162-167`). + +| Mode | Native active budget | MCP tools active? 
| +|---|---|---| +| Plan | same native head | No (deferred) | +| Agent | same native head | No (deferred) | +| YOLO | same native head | Yes (a known, intentional widening) | + +**Budget rule:** the native active head must stay byte-identical across Plan ↔ +Agent ↔ YOLO (Section 8). Any growth of the head requires retiring something +else or an explicit budget bump in this doc. + +--- + +## 6. The canonical-surface rule + +> **Every model-visible (active or deferred-discoverable) tool must have one +> clear niche. If a tool is superseded, it gets a named replacement and moves to +> hidden-compatibility or deprecated — it does not stay visible.** + +### Canonical vs compatibility summary for the confusing clusters + +| Cluster | Canonical (keep visible) | Compatibility / retired | Notes | +|---|---|---|---| +| **Shell wait** | `exec_shell_wait` | `exec_wait` → hidden-compat | Same `ShellWaitTool` (`registry.rs:526,529`); router already unifies (`tool_routing.rs:1139`) | +| **Shell interact** | `exec_shell_interact` | `exec_interact` → hidden-compat | Same `ShellInteractTool` (`registry.rs:527,530`) | +| **Checklist / todo** | `checklist_write` | `todo_write/add/update/list` → deprecated | Same `TodoWriteTool`, `::new` vs `::checklist` (`todo.rs:184-196`) | +| **Speech / tts** | `speech` | `tts` → hidden-compat | Same `SpeechTool` (`registry.rs:787-792`) | +| **Subagent lifecycle** | `agent_open`, `agent_eval`, `agent_close`, `tool_agent` (gated, §7) | all 11 legacy names → already non-visible dead code | Cleanup + guardrail test, rebase on #2684 | +| **Edit family** | `apply_patch`, `edit_file`, `write_file`, `fim_edit` | none — **all distinct niches** | NOT touched (per #2681 non-goals); doc-only canonical guidance | +| **Search family** | `grep_files` (content), `file_search` (filename), `project_map` (structure) | none — **distinct niches** | NOT touched; no FTS5/BM25/semantic index exists today | + +**Non-goals (explicitly NOT diet targets in this cycle, per 
#2681):** +`apply_patch` / `edit_file` / `write_file` / `fim_edit`; +`grep_files` / `file_search` / `project_map`; +`fetch_url` / `web.run` / `web_search`; +`task_shell_*`; `handle_read` / `retrieve_tool_result`. These have distinct +niches and receive **canonical guidance only** — no lifecycle change. + +The RLM surface (`rlm_open` / `rlm_eval` / `rlm_configure` / `rlm_close` / +`rlm_session_objects`, `crates/tui/src/tools/rlm.rs`) is likewise out of scope; +`handle_read` retrieves var handles, and `finalize` / `FINAL` is an in-kernel +Python function, **not a tool** — so there is nothing to retire there. + +--- + +## 7. `tool_agent` decision: canonical but DeepSeek-V4-gated + +`tool_agent` **stays** as a canonical subagent tool +(`registry.rs:1024`, `ToolAgentTool`). It is the fast, **non-thinking "Fin" +executor lane**, built on `deepseek-v4-flash` (cf. `DEFAULT_CHILD_MODEL = +"deepseek-v4-flash"`, `rlm.rs:26`). + +**Decision: gate `tool_agent` to DeepSeek-V4 models only.** + +- It is purpose-built around the V4-flash non-thinking executor profile. Exposing + it to other providers (e.g. Arcee Trinity, which is already WAF-narrowed to 8 + read-only tools, `tool_catalog.rs:106-115`) offers no working executor lane and + only adds a confusing, mis-targeted option to weaker surfaces. +- Gating is a **provider/model policy**, consistent with the existing + provider-aware first-turn policy (`apply_provider_tool_policy`, + `tool_catalog.rs:134-149`): on non-DeepSeek-V4 models, `tool_agent` is excluded + from the active set and from tool-search discovery. It remains **registered and + dispatchable** so transcripts created under a V4 model replay everywhere. + +This is not a lifecycle transition — `tool_agent` is canonical. It is a +*visibility gate* layered on the same machinery as the Arcee narrowing. + +--- + +## 8. Prefix-cache safety + replay guarantee + +### Prefix-cache rules every diet PR MUST follow + +The tools array is part of DeepSeek's immutable KV prefix. 
The catalog-head +byte-stability invariant (`tool_catalog.rs:169-196`) is binding: + +1. **Never mutate the active head non-deterministically.** The first-turn active + block must be **byte-identical run-to-run** and across Plan ↔ Agent ↔ YOLO. +2. **A diet is a one-time deterministic edit.** Removing a name from + `DEFAULT_ACTIVE_NATIVE_TOOLS` shifts the head exactly once; after that it must + be stable. Land such edits as their own focused change. +3. **Notices live in result metadata, never the prefix.** Deprecated replacement + notes are appended at dispatch time in `tool_routing.rs` to the *call result* + only. **Nothing** about hidden/deprecated state may be serialized into a tool + schema, description, or the catalog array. +4. **Preserve ordering and partitioning.** `build_model_tool_catalog` sorts each + partition by name and keeps built-ins as a contiguous prefix ahead of MCP + tools (`tool_catalog.rs:186-194`). Diet edits must not break this. +5. **Hidden/deprecated tools are excluded *before* the head is built**, so their + removal is the only head change — they do not appear in the prefix at all. + +### Old-transcript replay guarantee + +> For every name in the deprecation manifest with `replay_supported = Yes`, the +> tool stays **registered and dispatchable with identical behavior**. Replaying +> an old transcript that calls `exec_wait`, `exec_interact`, `tts`, or any +> `todo_*` produces the same result it always did. Deprecated names additionally +> attach a result-metadata notice; hidden-compat names are silent. A name is only +> ever made non-dispatchable (**removed**) after a deliberate, per-name decision +> to drop replay support at `planned_removal_version`. + +--- + +## 9. Required tests + +Any diet PR (and the umbrella #2681 work) must add/keep: + +1. 
**Duplicate-active-alias guard.** A test asserting that no name in + `HIDDEN_COMPATIBILITY_TOOLS` or `DEPRECATED_ALIASES` appears in + `DEFAULT_ACTIVE_NATIVE_TOOLS` or `ARCEE_FIRST_TURN_NATIVE_TOOLS`, and that no + two active entries resolve to the same underlying tool implementation. + +2. **Tool-search exclusion test.** Assert that hidden-compat and deprecated names + are absent from the tool-search-discoverable pool while remaining present in + the registry (dispatchable). + +3. **Replay / dispatch tests.** For each manifest name, calling it still + executes and returns the same result as its canonical twin. Deprecated names + additionally assert the replacement note is present **in result metadata** and + absent from the catalog/prefix. Hidden-compat names assert **no** added + notice. + +4. **Golden active-block byte test.** A snapshot test pinning the byte + serialization of the first-turn active tool block, asserting it is identical + across Plan / Agent / YOLO (native head) and stable run-to-run — enforcing the + `tool_catalog.rs:169-196` invariant. The golden updates **only** as a + reviewed, deliberate one-time edit when the diet lands. + +5. **Subagent guardrail test (rebase on #2684).** Assert only `agent_open`, + `agent_eval`, `tool_agent`, `agent_close` are registered as model-visible + subagent tools and that no legacy name from `subagent/mod.rs` is + instantiated outside tests. + +6. **`tool_agent` gating test.** Assert `tool_agent` is active/discoverable only + under DeepSeek-V4 models and excluded (but still registered) elsewhere. diff --git a/docs/VISION_NORTH_STAR.md b/docs/VISION_NORTH_STAR.md new file mode 100644 index 00000000..064d8885 --- /dev/null +++ b/docs/VISION_NORTH_STAR.md @@ -0,0 +1,490 @@ +# CodeWhale North Star (0.9.0+) + +> **STATUS: DIRECTION, NOT COMMITTED WORK.** +> Everything in this document is the maintainer's intended *direction* for +> CodeWhale 0.9.0 and beyond. 
**None of it is committed 0.8.53 work.** The +> 0.8.53 cycle ships **design docs only** for these areas — no tool-catalog code +> lands this cycle except the small, already-scoped subagent/git/RLM fixes in +> PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be +> refined, not an API contract. Where this doc names tools that do not exist yet +> (`codebase_search`, `read_file` as a canonical alias, `agent_run`, etc.) those +> are **aspirational names** that will *map onto today's tools*; see each +> section. + +## Why this document exists + +The vision is at risk of being lost between point releases. CodeWhale is +accumulating capability (subagents, RLM, skills, workflows, an enormous tool +catalog) faster than it is accumulating *shape*. This is the north star that the +incremental 0.8.x stabilization work is steering toward, written down once so it +survives the next dozen PRs. + +### The one principle + +**The harness handles memory, search, routing, state, and guardrails so a +weaker model can just *think*.** Every design decision below is in service of +moving cognitive load *out* of the model and *into* the harness. A +`deepseek-v4-flash`-class model should not have to remember ~80 tool names, hold +the codebase index in its head, track which layer of memory a fact lives in, or +re-derive a recovery path after a malformed tool call. The harness does that. +The model decides *what it wants*; the harness figures out *how*. + +--- + +## Ground-truth anchor (today's reality) + +So the direction is honest about where it starts: + +- **Active first-turn tool set** is `DEFAULT_ACTIVE_NATIVE_TOOLS` + (`crates/tui/src/core/engine/tool_catalog.rs:37-64`) — 26 tools. Everything + else is **deferred** and hydrates via `tool_search_tool_regex` / + `tool_search_tool_bm25` (`tool_catalog.rs:26-35`). +- **Catalog-head byte-stability is a hard invariant** for DeepSeek's KV + prefix cache (`tool_catalog.rs:169-196`). 
The active first-turn tool block + must stay byte-identical run-to-run; any change to it is a **one-time, + deterministic edit**, never a per-turn or per-mode mutation. +- **Arcee** narrows the first turn to 8 read-only tools + (`ARCEE_FIRST_TURN_NATIVE_TOOLS`, `tool_catalog.rs:106-115`) as a Cloudflare + WAF workaround — proof the active partition is already provider-shaped. +- **Subagent tools that are model-visible:** only `agent_open`, `agent_eval`, + `tool_agent`, `agent_close` (`crates/tui/src/tools/registry.rs:1017-1029`). + All legacy names (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, + `agent_send_input`, `agent_assign`, `agent_list`, `agent_cancel`, + `resume_agent`, `delegate_to_agent`, …) are `#[allow(dead_code)]` structs in + `crates/tui/src/tools/subagent/mod.rs`, never instantiated outside tests → + **already not model-visible**. The live internal `send_input` / `cancel` / + `resume` methods on `SubAgentManager` (`mod.rs:1495,1521,1605`) back + `agent_eval` / `agent_close` and **stay**. +- **`tool_agent` is "Fin"** — the experimental fast-lane executor: DeepSeek V4 + Flash with thinking forced off (`mod.rs:5233`, `TOOL_AGENT_INTRO`; + `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). +- **Known duplicates today:** `exec_wait ≡ exec_shell_wait`, + `exec_interact ≡ exec_shell_interact` (same structs, all four in the active + set), `tts ≡ speech` (both deferred). `todo_*` are deferred twins of + `checklist_*` (same `TodoWriteTool`, `::new` vs `::checklist`, + `todo.rs:187,194`). The router already unifies `exec_wait`/`exec_shell_wait` + (`crates/tui/src/tui/tool_routing.rs:1139-1140`). + +This is the surface the north star refactors *toward simplicity*. + +--- + +## 1. Intent Router + +**What it is.** A thin layer where the model declares an **intent** — +*search / inspect / edit / test / delegate / ask-user / run-shell / +run-workflow* — and the harness maps that intent to the correct low-level tool +and arguments. 
The model picks from a tiny, stable verb vocabulary instead of +recalling ~80 concrete tool names and their schemas. + +**Why it helps weaker models.** Tool-name recall is one of the largest sources +of wasted turns for small models: choosing a deferred tool (double-invoke), +choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary +collapses that decision space to ~10 verbs. The model spends its budget on +*reasoning about the task*, not on *remembering the API*. + +**Rough shape.** A small **canonical visible set** — aspirational names that +route onto today's tools: + +| Intent verb (aspirational) | Routes onto today | +|---|---| +| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` | +| `read_file` | `read_file` (already canonical) | +| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) | +| `run_tests` | `run_tests` / `run_verifiers` | +| `git_status` | `git_status` | +| `git_diff` | `git_diff` | +| `work_update` | `update_plan` / `checklist_write` | +| `ask_user` | `request_user_input` | +| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` hidden — §10) | +| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` | +| `workflow_run` | WhaleFlow runner (§4) | + +The router is the *only* place the catalog's full complexity is allowed to live. +It is also where **tool repair** (§7) hooks in: a mis-stated intent or a +deferred/deprecated name is rewritten to the canonical route. + +**Dependencies.** The small canonical surface (§3), the lifecycle alias table +(§3 / `docs/TOOL_LIFECYCLE.md`), and the hybrid index for `codebase_search` +(§2). Must respect the **catalog-head byte-stability invariant**: the visible +verb set is itself a one-time deterministic edit, not a dynamic per-turn list. + +--- + +## 2. 
Default Hybrid Codebase Intelligence + +**What it is.** An always-on, local-first codebase index that ships with the +harness — not an opt-in tool the model has to remember to build. It fuses: + +- plain **text** search, +- **symbol** index (definitions/references), +- **import / call graph**, +- **FTS5 + BM25** lexical ranking (rusqlite is already a dependency — + `Cargo.toml`), +- **sparse** retrieval, +- optional **dense** (embedding) retrieval, +- **PR / commit / issue history** as a first-class retrieval source, +- a **codemap** (structural overview, the successor to today's deferred + `project_map`). + +**Why it helps weaker models.** Today the model must orchestrate `grep_files` +(content), `file_search` (filename), and `project_map` (structure) by hand, +reconcile their outputs, and re-run them as it narrows. There is **no FTS5/BM25 +or semantic index today** — every search is a cold walk (`file_search` uses the +`ignore` crate's `WalkBuilder` for vendor exclusion, `file_search.rs:~210`). A +weaker model burns turns stitching partial results. A single `codebase_search` +intent backed by a hybrid index returns ranked, concept-level hits in one call, +so the model reasons about *answers*, not *query mechanics*. + +**Rough shape.** A background indexer maintains a SQLite store (FTS5 + symbol + +graph tables), refreshed on file change and on git events. `codebase_search` +(§1) queries it; the codemap is regenerated incrementally. Vendor exclusion +reuses the existing `ignore`/`WalkBuilder` path. + +**Dependencies.** rusqlite/FTS5; the Intent Router (§1) for the +`codebase_search` verb; the trace store (§6/§8) for history retrieval. **Full +design lives in `docs/CODEBASE_SEARCH_DESIGN.md`** (to be written this cycle). + +--- + +## 3. Small Canonical Tool Surface + +**What it is.** A deliberately tiny set of always-visible canonical tools; +**everything else is hidden, deferred, or skill-scoped**. 
The catalog grows +behind the scenes but the *visible* surface stays small and stable. + +**Why it helps weaker models.** Fewer choices, no aliases competing for the same +job, no deferred double-invokes for common operations. The model sees the verbs +it needs and nothing else. + +**Rough shape — tool lifecycle states.** Five states, represented as **const +name-sets plus an alias table in `tool_catalog.rs`** (NOT a per-`ToolSpec` +field, to preserve the byte-stable head): + +1. **active** — in the first-turn catalog head. +2. **deferred** — registered, hydrated via tool-search. +3. **hidden-compatibility** — registered + dispatchable, **dropped from both + active and search**, identical behavior, **no notice**. (For exact + duplicates that should simply disappear from discovery.) +4. **deprecated** — registered + dispatchable, **dropped from search**, appends + a *replacement notice to RESULT METADATA only* — **never** to the cached + prefix. +5. **removed** — final state; no longer registered. + +**Invariant:** deprecated and hidden-compatibility tools **stay registered and +dispatchable forever** so old transcripts always replay deterministically. + +**Planned diet (documented this cycle, not yet coded):** + +- `exec_wait`, `exec_interact`, `tts` → **hidden-compatibility** (exact + duplicates of `exec_shell_wait`, `exec_shell_interact`, `speech`). +- `todo_*` (`todo_write/add/update/list`) → **deprecated → checklist_*** (drop + from tool-search, keep registered, add result-metadata notice). +- Legacy subagent names → already hidden; remaining work is **cleanup + + guardrail tests**, rebased on PR #2684. + +**Explicitly NOT touched** (distinct niches, per #2681 non-goals) — doc-only +canonical guidance, no diet: `apply_patch` / `edit_file` / `write_file` / +`fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / +`web.run` / `web_search`; `task_shell_*`; `handle_read` / +`retrieve_tool_result`. 
+ +**`tool_agent` gating decision.** `tool_agent` ("Fin") **stays** as a canonical +subagent tool, but is **gated to DeepSeek-V4 models only**. It is the fast, +non-thinking executor lane built on `deepseek-v4-flash`; offering it to other +providers/models is meaningless (the lane *is* a specific model) and would just +add a name to recall. The gate is provider/model-conditional in the same spirit +as the Arcee first-turn narrowing. + +**Dependencies.** The alias table backs the Intent Router (§1) and Tool Repair +(§7). **Full spec in `docs/TOOL_LIFECYCLE.md`** (to be written this cycle). + +--- + +## 4. WhaleFlow / Workflow Mode + +**What it is.** A typed, multi-agent **workflow runner**. A workflow is a graph +of typed nodes — **branches, leaves, reviewers, verifiers, test-runners, +PR-creators**, with **trace-replay** and a **progress-monitor**. Authors write +workflows in **Starlark or YAML**, which compile to a **typed Rust IR**; the +**Rust executor** runs the IR. "Like Claude's workflow mode, but safer" — the +safety comes from the typed IR and Rust execution boundary rather than free-form +model-driven orchestration. + +**Why it helps weaker models.** Long-running, multi-step work (implement → +review → verify → test → open PR) is exactly where weaker models drift, lose +state, or skip verification. Encoding the *process* as a typed graph means the +model only has to be competent at each *leaf*, while the harness guarantees the +sequencing, the verification gates, and the evidence trail. + +**Rough shape.** Starlark/YAML → typed IR → Rust executor. Nodes map to +subagent lanes (`agent_open` / `tool_agent` / `agent_eval` / `agent_close`, +`registry.rs:1017-1029`). Reviewer/verifier/test-runner nodes are first-class +node *types*, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via +`/workflow` (alias `/whaleflow`) and the `workflow_run` intent (§1). 
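
To make the "typed IR" idea concrete, here is one possible shape for it, offered purely as a sketch. Everything in it is an assumption for illustration: `NodeKind`, `Node`, and `execution_order` are hypothetical names, and the document does not commit to this layout or to any particular scheduling algorithm. The node kinds are taken from the list above; the ordering uses a standard topological sort (Kahn's algorithm) so that a malformed graph is rejected before any node runs.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical node kinds, taken from the node types named in this section.
#[allow(dead_code)] // not every kind is constructed in this sketch
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

/// Hypothetical IR node: an id, a kind, and the ids it must run after.
#[derive(Debug, Clone)]
struct Node {
    id: &'static str,
    kind: NodeKind,
    depends_on: Vec<&'static str>,
}

/// Returns (id, kind) pairs in an order where every node follows its
/// dependencies. Rejects duplicate ids, unknown dependencies, and cycles.
fn execution_order(nodes: &[Node]) -> Result<Vec<(&'static str, NodeKind)>, String> {
    let mut indegree: HashMap<&'static str, usize> = HashMap::new();
    let mut kinds: HashMap<&'static str, NodeKind> = HashMap::new();
    for node in nodes {
        if indegree.insert(node.id, node.depends_on.len()).is_some() {
            return Err(format!("duplicate node id '{}'", node.id));
        }
        kinds.insert(node.id, node.kind);
    }

    // Map each dependency to the nodes waiting on it.
    let mut dependents: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    for node in nodes {
        for dep in &node.depends_on {
            if !indegree.contains_key(dep) {
                return Err(format!("node '{}' depends on unknown node '{}'", node.id, dep));
            }
            dependents.entry(*dep).or_default().push(node.id);
        }
    }

    // Seed with dependency-free nodes in declaration order, so the result is
    // deterministic for a given input.
    let mut queue: VecDeque<&'static str> = nodes
        .iter()
        .filter(|n| n.depends_on.is_empty())
        .map(|n| n.id)
        .collect();

    let mut order = Vec::with_capacity(nodes.len());
    while let Some(id) = queue.pop_front() {
        order.push((id, kinds[&id]));
        if let Some(children) = dependents.get(&id) {
            for child in children {
                let remaining = indegree.get_mut(child).expect("child was registered above");
                *remaining -= 1;
                if *remaining == 0 {
                    queue.push_back(*child);
                }
            }
        }
    }

    if order.len() == nodes.len() {
        Ok(order)
    } else {
        Err("workflow graph contains a cycle".to_string())
    }
}
```

Under this sketch, an implement → review → verify/test → open-PR workflow is just five `Node` values, and the PR-creator node cannot be scheduled until both gate nodes it depends on have been ordered ahead of it.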
+ +**Dependencies.** Subagent runtime; the evaluation loop (§8) for traces; +Skills & Rules (§5) so a skill can *define* a workflow; the command taxonomy +(§9). + +--- + +## 5. Skills & Rules as First-Class Runtime + +**What it is.** Skills and rules become real runtime objects, not just prompt +text. Skills gain **activation modes**: + +- **always-on** — injected every turn, +- **glob** — activated when matching files are in scope, +- **model-decision** — offered to the model to opt into, +- **manual** — only via explicit `$` invocation (§9). + +Skills can **restrict the tool surface**, **define workflows** (§4), and +**inject repo context**. + +**Why it helps weaker models.** A skill scoped to a task can shrink the tool +surface to exactly what that task needs and pre-load the relevant rules and +context — so the model operates inside a curated, smaller world instead of the +full catalog. + +**Rough shape (vs. today).** Today: skills are discovered +(`crates/tui/src/tools/skills/mod.rs`, `discover_in_workspace ~421`; struct +parses name/description `~382-388`), enable-state is tracked +(`skill_state.rs`, `SkillStateStore::is_enabled ~73`), and there's an +inline-mention popup (`slash_menu.rs ~86`). **But:** no parser activates inline +`$` mentions on submit (submit path: `ui.rs build_queued_message ~4721`), there +is **no activation-mode concept**, and **skills cannot restrict tools**. The +direction adds (a) a submit-time `$` activation parser, (b) the +four activation modes in skill metadata, and (c) a tool-restriction field +enforced by the registry/router. + +**Dependencies.** Tool lifecycle/alias table (§3) for restriction; Intent Router +(§1); WhaleFlow (§4); command taxonomy (§9). **Full design in +`docs/SKILL_INVOCATION_DESIGN.md`** (to be written this cycle). + +--- + +## 6. Context Memory Stack + +**What it is.** Memory modeled as **explicit, layered, inspectable** stores +rather than one undifferentiated blob. 
Each layer is **visible, inspectable, +clearable, and scoped**: + +1. **User memory** — small user prefs/facts (surfaced via `/memory`, §9). +2. **Repo rules** — checked-in guidance (`/rules`). +3. **Codemap-wiki** — derived structural/semantic knowledge of the repo (§2). +4. **Trace store** — recorded workflow/turn evidence (§8). +5. **ARMH–RLM memo** — the RLM kernel's in-session working memory + (`rlm_open`/`rlm_eval`/`rlm_configure`/`rlm_close`/`rlm_session_objects`, + `crates/tui/src/tools/rlm.rs`; `handle_read` retrieves var handles; + `finalize`/`FINAL` is an *in-kernel Python function*, not a tool). +6. **Cached-main overlay** — promoted lessons from the cached main branch + (`/overlay`, §9). +7. **External memory (Aleph)** — large local data via the `aleph` skill. + +**Why it helps weaker models.** The model never has to *guess* where a fact +should live or *re-derive* context it already established. Each layer has a +clear scope and a clear command to inspect/clear it, so stale context is +visible and removable rather than silently poisoning the prefix. + +**Rough shape.** A `/context` dashboard (§9) renders all active layers and their +sizes; `/memory` manages the small user layer; `/overlay` manages promoted +lessons. The RLM layer already exists and is plumbed through `rlm.rs`. + +**Dependencies.** Command taxonomy (§9); codebase intelligence (§2); evaluation +loop (§8) for promotion into the overlay. + +--- + +## 7. Tool Repair & Autoload + +**What it is.** When the model emits a wrong, deferred, deprecated, or +environment-blocked tool call, the harness **repairs** it instead of returning a +bare error — and **autoloads** what's needed. + +**Why it helps weaker models.** Recovery from a malformed call is precisely +where weak models loop or give up. Turning every failure into an actionable, +schema-bearing correction keeps the model on-task. 
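
**Sketch (illustrative only).** One way to picture "repair instead of a bare error" is a pure function from a classified problem to a repair plan. The four problem classes follow the wrong / deferred / deprecated / environment-blocked cases named above; the `Problem` and `Repair` types, their fields, and the message wording are all assumptions made for this example, not the harness API.

```rust
/// Hypothetical classification of a failed tool call.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Problem {
    WrongName { called: String, canonical: String },
    Deferred { tool: String },
    Deprecated { called: String, canonical: String },
    Blocked { tool: String, reason: String },
}

/// Hypothetical repair plan the harness would act on.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct Repair {
    /// Tool to dispatch to without involving the model, if any.
    pub dispatch_to: Option<String>,
    /// Tool whose schema should be autoloaded in the same turn, if any.
    pub autoload_schema_for: Option<String>,
    /// Actionable correction shown to the model, if any.
    pub message_to_model: Option<String>,
    /// Notice attached to result metadata only, never to the cached prefix.
    pub result_metadata_notice: Option<String>,
}

pub fn repair(problem: &Problem) -> Repair {
    match problem {
        // Unknown or legacy name: point at the canonical tool and load its schema.
        Problem::WrongName { called, canonical } => Repair {
            autoload_schema_for: Some(canonical.clone()),
            message_to_model: Some(format!(
                "`{called}` is not available; you meant `{canonical}`. Its schema is attached."
            )),
            ..Repair::default()
        },
        // Deferred tool: hydrate its schema so the retry can succeed.
        Problem::Deferred { tool } => Repair {
            autoload_schema_for: Some(tool.clone()),
            message_to_model: Some(format!("`{tool}` was deferred; its schema is now loaded.")),
            ..Repair::default()
        },
        // Deprecated alias: route silently, note the replacement in metadata.
        Problem::Deprecated { called, canonical } => Repair {
            dispatch_to: Some(canonical.clone()),
            result_metadata_notice: Some(format!("`{called}` is deprecated; use `{canonical}`.")),
            ..Repair::default()
        },
        // Mode or environment block: explain why instead of failing bare.
        Problem::Blocked { tool, reason } => Repair {
            message_to_model: Some(format!("`{tool}` is unavailable: {reason}")),
            ..Repair::default()
        },
    }
}
```

Keeping the function pure makes each repair easy to cover with guardrail tests, and keeps the deprecated-alias notice in a separate field so it cannot leak into the model-visible prefix by accident.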
+ +**Rough shape — representative repairs:** + +- **Wrong/legacy name** → *"you meant `agent_eval`; here's the schema"* (autoload + the deferred tool's schema in the same turn). +- **Mode mismatch** → *"shell is unavailable in Plan mode — ask the user or + switch modes"*. +- **Missing dependency** → *"this tool needs Node; Node is missing"* + (dependency probe via `ExternalTool`, already imported in `tool_catalog.rs`). +- **Deprecated alias** → silently **routed to the canonical** tool, with the + replacement notice in **result metadata only** (§3) — never the cached prefix. + +**Dependencies.** The alias table + lifecycle states (§3); the Intent Router +(§1); dependency detection (`ExternalTool`). Builds on PR #2685's actionable +RLM/field errors and PR #2684's lifecycle signals — **must not contradict +either**. + +--- + +## 8. Evaluation Loop + +**What it is.** Every workflow run **leaves evidence**: the tests it ran, the +diffs it produced, the failures it hit, the searches it issued, the claims it +verified, and the PR outcome. A **teacher/student replay** turns *good* traces +into reusable **rules, skills, tests, and cached guidance**. + +**Why it helps weaker models.** The system gets better at *this repo* over time +without the model getting smarter. Verified good traces become rules/skills the +weaker model can lean on next time, and become the source of the cached-main +overlay (§6). + +**Rough shape.** Workflow nodes (§4) emit structured evidence into the trace +store (§6). A replay/distillation pass (teacher reviews student trace) promotes +high-value traces into: repo rules (`/rules`), skills (§5), regression tests, +and overlay guidance (`/overlay`). Verified-claim tracking ties into the +adversarial-verification posture already used elsewhere. + +**Dependencies.** WhaleFlow (§4) for trace emission; trace store + overlay (§6); +Skills & Rules (§5) as promotion targets. + +--- + +## 9. 
Command-Surface Taxonomy + +**What it is.** One name = **one thing**. The command surface is split so each +prefix has a single, memorable responsibility: + +| Surface | Responsibility | +|---|---| +| `/memory` | **Small** user prefs/facts only | +| `/context` | **Dashboard** of all active memory layers (§6) | +| `/rules` | Repo guidance | +| `.codewhale/constitution.json` | Repo constitution: checked-in **local law** | +| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) | +| `/overlay` | Promoted cached-main lessons (§6/§8) | +| `$` | Skill invocation — **the token *is* the skill name** | +| `codebase_search` | Concept-level code retrieval (§2) | + +The repo constitution is not another memory bucket. It is the local-law layer in +a layered authority model: + +``` +base myth / global Constitution + -> repo constitution (.codewhale/constitution.json) + -> task packet + -> runtime policy +``` + +At conflict time, the **current user request for the task remains above the repo +constitution**; the repo constitution supplies durable defaults and local law +only when the active task packet and runtime policy leave room. Runtime policy is +the compiled enforcement surface for the run, not a separate place for the model +to invent new rules. + +**Why it helps weaker models (and users).** No overloaded command does five +jobs; the model/user never has to disambiguate *which* `/memory` behavior they +meant. `$systematic-debugging` self-documents what it invokes. + +**`/memory` subcommand sketch:** + +``` +/memory add "" # store a small pref/fact +/memory edit # edit stored facts +/memory search # find a stored fact +/memory clear # clear user memory +/memory doctor # health check; detects legacy ~/.deepseek path +/memory promote # (later) promote a fact to a higher layer +``` + +`doctor` specifically detects the **legacy `~/.deepseek`** path and guides +migration. 
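
**Sketch (illustrative only).** The legacy-path check in `/memory doctor` could be as small as the fragment below. It takes the home directory as a parameter so the check is testable without touching the real environment; `DoctorFinding`, `legacy_dir`, and `doctor` are hypothetical names, and the migration guidance itself is out of scope here.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical result of the health check.
#[derive(Debug, PartialEq, Eq)]
pub enum DoctorFinding {
    Healthy,
    /// The legacy directory still exists and should be migrated.
    LegacyPathPresent(PathBuf),
}

/// Location of the legacy directory relative to a given home directory.
pub fn legacy_dir(home: &Path) -> PathBuf {
    home.join(".deepseek")
}

/// Reports whether the legacy `~/.deepseek` directory is still present.
pub fn doctor(home: &Path) -> DoctorFinding {
    let legacy = legacy_dir(home);
    if legacy.is_dir() {
        DoctorFinding::LegacyPathPresent(legacy)
    } else {
        DoctorFinding::Healthy
    }
}
```

A caller would pass the resolved home directory and, on `LegacyPathPresent`, print the path together with whatever migration steps the command ends up offering.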
+ +**`$` invocation examples:** + +``` +$systematic-debugging # local skill +$github:gh-fix-ci # namespaced skill +``` + +The submit-time parser (to be added; submit path `ui.rs ~4721`) recognizes the +`$` token and activates the named skill (§5). + +**`/context` layers dashboard (example render):** + +``` +/context + user-memory ▸ 7 facts (12 KB) [clear] + repo-constitution ▸ .codewhale/constitution.json (4 KB) [view] + repo-rules ▸ CLAUDE.md, AGENTS.md (8 KB) [view] + codemap-wiki ▸ 412 symbols indexed (auto) [rebuild] + trace-store ▸ 3 recent workflow runs (—) [open] + rlm-memo ▸ 0 active sessions (—) [—] + cached-overlay ▸ 5 promoted lessons (3 KB) [view] + aleph-external ▸ not attached (—) [attach] +``` + +**Dependencies.** Memory stack (§6); skills (§5); codebase intelligence (§2); +workflow runner (§4). + +--- + +## 10. Deferred-Not-Done 0.8.53 Diet Items + +Recorded here so they are **not silently dropped** — these were considered for +the 0.8.53 diet and deliberately **deferred** (design-only or out of scope this +cycle): + +- **File-mutation overload** — `apply_patch` / `edit_file` / `write_file` / + `fim_edit` overlap in purpose. Per #2681 non-goals these stay distinct; + canonical *guidance* (prefer `apply_patch`) is doc-only, no consolidation + this cycle. +- **`task_shell_*` ↔ `exec_*` redundancy** — `task_shell_start` / + `task_shell_wait` overlap conceptually with the `exec_*` family. Left intact + this cycle (distinct niche per #2681); revisit under §1/§3. +- **`handle_read` / `retrieve_tool_result`** — result-handle plumbing kept as-is + (doc-only canonical guidance); folds naturally into the memory stack (§6) and + intent routing (§1) later. +- **Search-cluster consolidation** — `grep_files` / `file_search` / + `project_map` remain three tools this cycle; consolidation is the *job of the + hybrid index* (§2) under `codebase_search`, not a catalog edit in 0.8.53. 
+ +--- + +## Phased Roadmap + +### 0.8.53 — design + small fixes only +- **Code:** only the already-scoped, narrow fixes — PR #2684 (subagent role + vocab, lifecycle signals, eval ergonomics) and PR #2685 (read-only git history + active + actionable RLM/field errors). Subagent legacy-name cleanup + + guardrail tests rebased on #2684. +- **Docs:** this north star, plus `docs/TOOL_LIFECYCLE.md`, + `docs/CODEBASE_SEARCH_DESIGN.md`, `docs/SKILL_INVOCATION_DESIGN.md`. +- **No tool-catalog code:** the diet (§3), the Intent Router (§1), and the + hybrid index (§2) are **documented, not coded** this cycle. + +### 0.9.0 — first structural moves +- Implement the **tool lifecycle** const name-sets + alias table in + `tool_catalog.rs` (§3) as a one-time deterministic head edit. +- Land the **planned diet**: `exec_wait`/`exec_interact`/`tts` → + hidden-compatibility; `todo_*` → deprecated→`checklist_*` (result-metadata + notice only). +- Gate **`tool_agent`** to DeepSeek-V4 models only (§3). +- First version of the **default hybrid codebase index** (FTS5/BM25 + symbol + + codemap) behind `codebase_search` (§2). +- First **Intent Router** verbs mapping onto today's tools (§1). +- **Tool Repair** for deferred/deprecated/mode/dependency cases (§7). + +### Later (post-0.9.0) +- **WhaleFlow** typed-IR workflow runner (§4) and the **evaluation loop** / + teacher-student replay (§8). +- **Skills activation modes** + tool restriction + `$` submit-time + activation (§5). +- Full **Context Memory Stack** with `/context` dashboard, `/overlay` + promotion, and Aleph external memory (§6). +- Dense/semantic retrieval and PR/commit/issue history in the index (§2). +- Search-cluster consolidation and the remaining §10 deferred items. + +--- + +## North-star one-liner + +> **The harness handles memory, search, routing, state, and guardrails — so a +> weaker model can just think.**