Merge branch 'codex/v0.8.53-toolsurface-design-docs' into codex/v0.8.53

Hunter Bown
2026-06-03 12:37:57 -07:00
4 changed files with 1398 additions and 0 deletions
@@ -0,0 +1,305 @@
# `codebase_search` — Local-First Semantic Code Retrieval
> **Status:** Design note + planned eval scaffold. **Code is DEFERRED.**
> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle).
> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.
This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.
---
## 1. Problem
Today CodeWhale ships two complementary code-locating tools and one structure map:
- `file_search`: **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`).
- `grep_files`: **content** search (literal/regex token match).
- `project_map`: a deferred **structure** map.
None of these answer **concept-level** questions where the user does not know the exact token:
- "Where is provider auth resolved?"
- "What enforces the shell approval policy?"
- "Where do mode prompts get assembled?"
- "How does the subagent lifecycle close out a child?"
`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.
**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations.
**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.)
There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies.
---
## 2. Approach Comparison
| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes — SQLite built in via `rusqlite` | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** |
| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** |
| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| **Dense embeddings** | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |
### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)
Each enabled signal produces an independent ranked list; results are merged with RRF
(`score(d) = Σ_signals 1/(k + rank_signal(d))`, conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
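The fusion formula above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: the function name `rrf_fuse` and the use of plain path strings as document ids are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each inner list is one signal's ranking, best result first.
/// `k` is the damping constant (conventionally about 60).
/// NOTE: `rrf_fuse` and string document ids are hypothetical, for illustration.
pub fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (idx, doc) in list.iter().enumerate() {
            let rank = (idx + 1) as f64; // ranks are 1-based
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by path so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the sum, a document that appears near the top of several lists outranks one that tops a single list, regardless of how each signal scales its raw scores.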
**v0.9.0 Phase 1 signal set (all local, no model downloads):**
1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer.
2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms.
3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`).
4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries.
**Optional later backends (feature-flagged, off by default):**
- `--features sparse-splade` — adds a SPLADE signal list to the RRF.
- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
- `--features rerank` — cross-encoder reranks the fused top-K.
Phase 1 deliberately omits all three ML backends so the tool ships with zero network/model dependency and is reproducible in CI.
---
## 3. Storage & Index
### Location
```
~/.codewhale/index/<workspace-hash>.db
```
`<workspace-hash>` is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep).
> Migration note (ties to the `/memory doctor` taxonomy in Appendix B): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored (a future `doctor` may offer to migrate it, but it is never read automatically).
### Schema sketch
```sql
CREATE TABLE files (
id INTEGER PRIMARY KEY,
path TEXT NOT NULL UNIQUE, -- workspace-relative
mtime_ns INTEGER NOT NULL, -- invalidation
size_bytes INTEGER NOT NULL,
content_hash TEXT NOT NULL, -- blake3; skip re-chunk if unchanged
lang TEXT, -- detected language
branch TEXT -- branch at last index (invalidation)
);
CREATE TABLE chunks (
id INTEGER PRIMARY KEY,
file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
start_line INTEGER NOT NULL,
end_line INTEGER NOT NULL,
kind TEXT, -- fn | struct | impl | const | doc | block
symbol TEXT, -- primary symbol name if any
text TEXT NOT NULL -- chunk body (identifier-split copy for FTS)
);
-- Lexical index. External-content FTS5 table, so chunk bodies are not stored twice.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
text,
symbol,
content='chunks',
content_rowid='id',
tokenize = 'unicode61 remove_diacritics 2' -- + identifier pre-split at index time
);
CREATE TABLE symbols (
id INTEGER PRIMARY KEY,
file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
name TEXT NOT NULL,
kind TEXT NOT NULL, -- fn | struct | enum | trait | impl | const | macro
line INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);
-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
path TEXT PRIMARY KEY,
last_touch INTEGER NOT NULL, -- unix ns
touch_count INTEGER NOT NULL DEFAULT 1
);
```
Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags.
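One possible shape for the index-time identifier split is sketched below. The function name `split_identifier` and the acronym-handling rule are assumptions for illustration; the design note only fixes the camelCase/snake_case behavior.

```rust
/// Split camelCase / snake_case / kebab-case identifiers into lowercase words.
/// NOTE: `split_identifier` is a hypothetical helper, not an existing API.
pub fn split_identifier(ident: &str) -> String {
    let chars: Vec<char> = ident.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separators such as '_' or '-' end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            let prev = chars[i - 1];
            let prev_lower_or_digit = prev.is_lowercase() || prev.is_numeric();
            let next_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Break at lower->Upper ("resolveApi") and where an acronym ends
            // and a new word starts ("HTTPServer" becomes "http server").
            if prev_lower_or_digit || next_lower {
                words.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Both spellings of an identifier collapse to the same token stream, so a query for "api key" can match either form with an ordinary FTS5 MATCH.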
### Chunking strategy (structure-aware)
Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
### Invalidation
- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged.
- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift.
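The mtime and content-hash check above could be expressed as a small decision function. This is a sketch under stated assumptions: the names `FileStamp`, `RefreshAction`, and `plan_refresh` are hypothetical, and treating an unchanged mtime as sufficient to skip hashing is an optimization assumed here, not something the note specifies.

```rust
/// What the index recorded for a file at the last refresh.
/// NOTE: all names in this block are hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileStamp {
    pub mtime_ns: i64,
    pub content_hash: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RefreshAction {
    /// Nothing changed; leave the file's chunks alone.
    Skip,
    /// Content is identical but mtime moved; record the new mtime only.
    UpdateStamp,
    /// New or changed file; re-chunk and re-index it.
    Reindex,
}

/// Decide what to do with one file. `hash_file` is only called when the
/// mtime differs, so unchanged files are never re-read (an assumed shortcut).
pub fn plan_refresh(
    indexed: Option<&FileStamp>,
    disk_mtime_ns: i64,
    hash_file: impl FnOnce() -> String,
) -> RefreshAction {
    match indexed {
        None => RefreshAction::Reindex,
        Some(stamp) if stamp.mtime_ns == disk_mtime_ns => RefreshAction::Skip,
        Some(stamp) => {
            if stamp.content_hash == hash_file() {
                RefreshAction::UpdateStamp
            } else {
                RefreshAction::Reindex
            }
        }
    }
}
```

A branch switch that rewrites mtimes without changing bytes would land on `UpdateStamp`, which matches the note's point that re-checking is cheap when the content hash still matches.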
### Privacy / trust
- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine.
- **No cloud by default.** Phase 1 has zero network dependency.
- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.
---
## 4. Model-Visible Tool Contract
```jsonc
// codebase_search
{
"name": "codebase_search",
"description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
"parameters": {
"query": { "type": "string", "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
"max_results":{ "type": "integer", "default": 10 },
"path_glob": { "type": "string", "description": "Optional path filter, e.g. 'crates/tui/**'." },
"lang": { "type": "string", "description": "Optional language filter." },
"kind": { "type": "string", "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
}
}
```
**Result shape — ranked, explainable, auditable:**
```jsonc
{
"results": [
{
"path": "crates/tui/src/config/provider.rs",
"line": 142,
"snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
"score": 0.91,
"reasons": [
"symbol: resolve_api_key matches 'auth/resolve'",
"lexical: matched tokens [provider, api, key, resolve]",
"path: component 'provider' matches query",
"session: file read 2 turns ago"
]
}
],
"backend": "lexical+symbol+path+session", // which signals were fused (RRF)
"fallback_grep_hits": 1 // exact-match hits folded in
}
```
`reasons[]` is **mandatory** and is the auditability contract: every result explains *why* it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The `backend` field records which signals were active so results are reproducible given the feature set.
---
## 5. Benchmark / Eval Set
A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line.
| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 |
| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 |
| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~553; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` / `tts` :787-792 |
Each entry is intended to become a `(query, expected_paths[])` row in a fixture
(e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). This PR ships
the design table only; the fixture and harness are deferred to Phase 1. The
Phase 1 harness runs all queries against the live index and reports recall@k
and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking
changes.
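The two metrics named above can be computed as follows. This is a sketch, not the deferred harness: `EvalCase` is a hypothetical stand-in for one `(query, expected_paths[])` fixture row, and recall@k is taken here to mean the fraction of expected files found in the top `k`, averaged over queries, which is one common definition the note does not pin down.

```rust
/// One evaluated query: the ranked result paths and the expected paths.
/// NOTE: `EvalCase` and both functions are hypothetical illustration names.
pub struct EvalCase<'a> {
    pub ranked: Vec<&'a str>,
    pub expected: Vec<&'a str>,
}

/// Mean over queries of (expected files found in the top `k`) / (expected files).
pub fn recall_at_k(cases: &[EvalCase<'_>], k: usize) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    let total: f64 = cases
        .iter()
        .map(|case| {
            if case.expected.is_empty() {
                return 0.0;
            }
            let top_k = &case.ranked[..case.ranked.len().min(k)];
            let found = case.expected.iter().filter(|e| top_k.contains(*e)).count();
            found as f64 / case.expected.len() as f64
        })
        .sum();
    total / cases.len() as f64
}

/// Mean over queries of 1 / (rank of the first expected file), 0 if none appears.
pub fn mean_reciprocal_rank(cases: &[EvalCase<'_>]) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    let total: f64 = cases
        .iter()
        .map(|case| {
            case.ranked
                .iter()
                .position(|p| case.expected.contains(p))
                .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
        })
        .sum();
    total / cases.len() as f64
}
```

Matching is on file path only, in line with the rule that the eval compares files rather than line anchors.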
---
## 6. Phasing, Feature Flags, and Non-Goals
### Phasing
- **Phase 0 (this cycle, v0.8.53):** this design note + benchmark table only. No fixture, harness, or catalog code.
- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/<workspace-hash>.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K.
### Feature flags
```
codebase-search-core # Phase 1, default-on once it lands
sparse-splade # Phase 3, default-off
dense-embed # Phase 3, default-off (gated HF download)
rerank # Phase 4, default-off
```
### Non-goals
- **No cloud index is required** for the core experience; Phase 1 never uses one.
- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback.
- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.
### Cross-link: WhaleFlow epic
`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).
---
## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)
These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.
### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)
| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| **active** | yes | yes | yes | no |
| **deferred** | no | yes | yes | no |
| **hidden-compatibility** | no | no | yes | no |
| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) |
| **removed** | no | no | no | — |
Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant).
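The state table can be read as four predicates over const name-sets. The sketch below is illustrative only: the set contents are a small sample taken from names mentioned in this appendix, the identifiers (`Lifecycle`, `lifecycle_of`, `replacement_notice`) are hypothetical, and treating an unknown name as `Removed` is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Illustrative samples only; the real name-sets would live in tool_catalog.rs.
const ACTIVE: &[&str] = &["checklist_write"];
const DEFERRED: &[&str] = &["speech", "project_map"];
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
// (deprecated name, replacement name) pairs: the alias table.
const DEPRECATED: &[(&str, &str)] = &[("todo_write", "checklist_write")];

/// Look a tool name up in the const sets. Unknown names count as removed (assumed).
pub fn lifecycle_of(name: &str) -> Lifecycle {
    if ACTIVE.contains(&name) {
        Lifecycle::Active
    } else if DEFERRED.contains(&name) {
        Lifecycle::Deferred
    } else if HIDDEN_COMPATIBILITY.contains(&name) {
        Lifecycle::HiddenCompatibility
    } else if DEPRECATED.iter().any(|(old, _)| *old == name) {
        Lifecycle::Deprecated
    } else {
        Lifecycle::Removed
    }
}

impl Lifecycle {
    pub fn active_first_turn(self) -> bool {
        matches!(self, Lifecycle::Active)
    }
    pub fn in_tool_search(self) -> bool {
        matches!(self, Lifecycle::Active | Lifecycle::Deferred)
    }
    pub fn dispatchable(self) -> bool {
        !matches!(self, Lifecycle::Removed)
    }
    pub fn metadata_notice(self) -> bool {
        matches!(self, Lifecycle::Deprecated)
    }
}

/// Notice text for result metadata only; it must never enter the cached prefix.
pub fn replacement_notice(name: &str) -> Option<String> {
    DEPRECATED
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(_, new)| format!("`{name}` is deprecated; use `{new}`."))
}
```

Keeping the state in name-sets rather than on each tool spec means a lifecycle change touches one const, which fits the single-deterministic-edit requirement.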
### Planned diet (documented, not yet coded)
- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools:
  - `exec_wait` duplicates `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); the router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`.
  - `exec_interact` duplicates `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`).
  - `tts` duplicates `speech` (same `SpeechTool`, `registry.rs:787-792`).
  - Action: drop from active + search, keep registered, identical behavior, **no notice**.
- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only).
- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired.
### Model-visible subagent surface (unchanged)
Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`).
- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.
### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)
`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is.
---
## Appendix B — Command-Surface Taxonomy (context)
Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces:
- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path).
- `/context` — dashboard of all active layers.
- `/rules` — repo guidance.
- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic).
- `/overlay` — promoted cached-main lessons.
- `$<skill-name>` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`).
- `codebase_search` — concept-level code retrieval (this document).
@@ -0,0 +1,233 @@
# Skill Invocation Design — the `$<skill-name>` inline syntax
Status: **DESIGN ONLY** (v0.8.53 cycle). No catalog/parser code ships in this
cycle; the implementation target is **0.9.0**. This document describes what
*will* be built and the contracts it must honor against today's code.
Related design docs: `TOOL_LIFECYCLE.md` (tool lifecycle states + per-skill tool
restriction), command-surface taxonomy notes for `/memory`, `/context`,
`/rules`, `/workflow` (`/whaleflow`), `/overlay`. Open PRs on `codex/v0.8.53`:
#2684 (subagent role vocab / lifecycle signals / eval ergonomics) and #2685
(git history active + RLM/field errors). Nothing here contradicts those.
---
## 1. Problem
Skill activation has no single, model-legible entry point, and the candidate
surfaces all compete with each other:
- A `/skill` slash command, a `load_skill`-style tool, plugin/namespace naming
(`superpowers:systematic-debugging`, `github:gh-fix-ci`), and the long-running
workflow commands (`/workflow` / `/whaleflow`) all *could* be "the way you
start a skill." None of them is canonical.
- Slash commands are already overloaded. `/memory`, `/context`, `/rules`,
`/config`, `/provider`, `/workflow`, `/overlay` each map to one subsystem;
jamming skill invocation into `/`-space forces a weaker model to disambiguate
"is this a command or a skill?" on every keystroke.
- Weaker / smaller models (the cheaper providers CodeWhale targets) do not
reliably pick the right mechanism. They will free-text "let me use systematic
debugging" instead of actually loading the skill body, so the guidance never
enters the context window.
- Today there is **no parser that activates an inline skill mention on submit.**
`slash_menu.rs:86` (`partial_inline_skill_mention_at_cursor`) recognizes an
inline `/<skill>` token *under the cursor for popup purposes only*; the submit
path in `ui.rs:4721` (`build_queued_message`) does not resolve or activate any
inline mention. There is also no activation-mode concept (always-on / glob /
model-decision / manual) and skills cannot restrict tools yet.
We need one prefix that means exactly "invoke this skill," is visually distinct
from commands, and is cheap for a small model to emit correctly.
---
## 2. Proposal
Adopt **`$` as the skill-invocation prefix**, where **the token *is* the skill
name** — not a literal command called `$skill`.
```
$systematic-debugging figure out why MiMo auth fails
$test-driven-development add coverage before fixing
$github:gh-fix-ci inspect the failing checks
$aleph search the planning doc
```
The leading `$` is the marker; everything from `$` up to the next whitespace is
the **skill id**. The rest of the line is the user's request, passed through to
the model with the skill body already loaded as active guidance.
This is deliberately a *reference / macro* sigil, like a shell variable
expansion or an `@mention`: `$skill-id` resolves to "the contents and tool
policy of that skill," then the surrounding prose is the task.
`$` works in three places (see §4): the user composer, the command-palette
input, and **model-facing planning text** — so the model itself can write
`$systematic-debugging` in its plan and have it resolve.
---
## 3. Resolution rules
Given a token `$<id>` (id captured up to the next whitespace):
1. **Exact name first.** Look the id up directly:
`discover_in_workspace(workspace).get(id)`. `skills/mod.rs:553` builds the
registry; `SkillRegistry::get` (`skills/mod.rs:421`) matches on `s.name == id`
exactly. Skill names come from frontmatter `name:` (or the first `# Heading`
fallback) parsed at `skills/mod.rs:382-417`. An exact hit wins unconditionally.
2. **Namespaced `$ns:skill`.** If the id contains a `:`, treat the part before
the colon as a source/plugin namespace and the part after as the skill name:
`$github:gh-fix-ci`, `$superpowers:systematic-debugging`. Namespaced ids are
the disambiguation handle a user is told to type when a bare id is ambiguous.
(Glob/wildcard namespacing — `$github:*` — is explicitly deferred, see §6.)
3. **Fuzzy match *suggests*, never silently chooses.** If there is no exact (or
namespaced-exact) hit, run a case-insensitive substring / prefix match over
`SkillRegistry::list()` (`skills/mod.rs:426`). If exactly one skill matches,
surface it as a suggestion ("did you mean `$systematic-debugging`?") but do
**not** auto-activate it. If more than one matches, list them and require the
user/model to re-issue with a disambiguated id (§7). Ambiguity never resolves
to a silent pick.
4. **Respect enable-state.** A resolved skill is only activated if
`SkillStateStore::is_enabled(id)` is true (`skill_state.rs:73`:
`!self.disabled.contains(skill_name)`). A disabled skill that resolves by
name produces a clear "skill is disabled; enable it with `/skill enable <id>`"
message rather than silently activating or silently doing nothing.
Resolution order is therefore: **exact → namespaced-exact → enabled-check on a
hit; with no hit, fuzzy-suggest (never auto-pick).**
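The four rules can be sketched as a single pure function. This is a minimal illustration under assumptions: `Resolution` and `resolve_skill` are hypothetical names, the registry is reduced to a slice of full skill ids (namespaced ids such as `github:gh-fix-ci` are stored and compared whole), and the disabled set stands in for `SkillStateStore`.

```rust
/// Outcome of resolving one `$<id>` mention.
/// NOTE: `Resolution` and `resolve_skill` are hypothetical, for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Exact (or namespaced-exact) hit on an enabled skill.
    Activate(String),
    /// Exact hit, but the skill is disabled.
    Disabled(String),
    /// Exactly one fuzzy candidate: suggest it, never activate it.
    Suggest(String),
    /// Several fuzzy candidates: list them and ask for a precise id.
    Ambiguous(Vec<String>),
    NotFound,
}

pub fn resolve_skill(id: &str, skills: &[&str], disabled: &[&str]) -> Resolution {
    // Rules 1-2: exact match on the full id, including any `ns:` prefix.
    if skills.contains(&id) {
        // Rule 4: the enable-state gate applies only to a resolved skill.
        return if disabled.contains(&id) {
            Resolution::Disabled(id.to_string())
        } else {
            Resolution::Activate(id.to_string())
        };
    }
    // Rule 3: case-insensitive substring match, used for suggestions only.
    let needle = id.to_lowercase();
    let mut candidates: Vec<String> = skills
        .iter()
        .filter(|s| s.to_lowercase().contains(&needle))
        .map(|s| (*s).to_string())
        .collect();
    match candidates.len() {
        0 => Resolution::NotFound,
        1 => Resolution::Suggest(candidates.remove(0)),
        _ => Resolution::Ambiguous(candidates),
    }
}
```

Only the `Activate` variant ever loads a skill body; every other outcome maps to a message, which keeps the "never silently choose" rule visible in the type.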
---
## 4. Behavior
When a `$<id>` mention resolves and is enabled:
- **Visible activation line.** The transcript shows `Using skill: <name>` so the
user can see which skill body entered context. (Mirrors the existing skill UX
vocabulary; one line per activated skill.)
- **Body loaded as active guidance.** The skill's `body`
(`skills/mod.rs` `Skill.body`) is injected into the turn as authoritative
guidance, the same content a `/skill`-style activation would load. The user's
trailing prose is the task the guidance applies to.
- **Tool-surface narrowing (when declared).** If the skill declares a set of
allowed tools, the active tool surface narrows to that set for the duration of
the skill's influence. **Per-skill tool restriction is net-new** — skills
cannot restrict tools today; the mechanism, and how narrowing interacts with
the catalog-head byte-stability invariant (`tool_catalog.rs:169-196`), is
specified in `TOOL_LIFECYCLE.md`. Until that lands, a declared tool list is
parsed and shown but not enforced.
- **Multiple `$mentions` compose explicitly, or prompt.** Until formal
composition rules exist, two or more `$mentions` in one message either compose
only when the rule is unambiguous (e.g. one guidance skill + one tool-scoping
skill) or return a **"choose one"** prompt listing the mentioned skills. We
never silently activate multiple complex skills at once (see §7 and Non-goals).
- **Three input surfaces.** Resolution runs for: (a) user prompts in the
composer, (b) command-palette input, and (c) model-facing planning text, so a
model that writes `$test-driven-development` in its plan triggers the same
activation path a human would.
- **Slash commands remain supported.** `/skill ...` and the rest of the slash
surface keep working unchanged. `$` is the *preferred* path for models because
it is one token and unambiguous, but it is additive, not a replacement (§7
Non-goals).
---
## 5. Why `$`
- **Visually distinct from `/commands`.** A glance separates "run a subsystem
command" (`/memory`, `/context`, `/workflow`) from "load a skill" (`$aleph`).
Weaker models stop confusing the two surfaces.
- **Reads like a reference / macro.** `$name` already means "expand this named
thing" to anyone who has touched a shell or a templating language. Skill
invocation *is* an expansion: `$skill-id` → that skill's guidance + tool policy.
- **Avoids overloading the slash namespace.** `/workflow`, `/memory`, `/config`,
`/provider`, `/rules`, `/overlay`, `/context` each already own one meaning in
the command-surface taxonomy. Skills get their own sigil instead of a crowded
`/skill <name>` subcommand competing with all of them.
- **Easy to type and remember.** Single leading character, then the literal
skill name. Nothing to memorize beyond the skill ids the user already sees in
`/skill list`.
---
## 6. Implementation plan (smallest viable 0.8.53-ready slice → 0.9.0)
The 0.8.53 cycle is **docs only**. The plan below is the build order once code
is unblocked; the first slice is intentionally the minimum that proves the path.
**Slice 1 — token scanner at submit (the minimum viable feature).**
- Add a `$<skill-id>` token scanner invoked on submit, **before**
`build_queued_message` runs (`ui.rs:4721`). The scanner finds leading-`$`
tokens, captures the id up to the next whitespace, and hands each id to the
resolver. The scanner must skip `$` occurrences inside code fences and inline
command strings (see Non-goals) so shell `$VAR` references are never treated as
skill mentions.
- Resolve via `discover_in_workspace(workspace).get(id)` (`skills/mod.rs:553` /
`:421`), gate on `SkillStateStore::is_enabled` (`skill_state.rs:73`), and emit
the `Using skill: <name>` line plus the loaded body.
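A Slice 1 scanner might look roughly like the sketch below. The function name `scan_skill_mentions` is hypothetical, and "command strings" is interpreted here as inline backtick spans, which is an assumption; a bare `$PATH` in plain prose would still be captured and left for the resolver to reject as a non-match.

```rust
/// Collect `$<skill-id>` mentions from a message, skipping fenced code blocks
/// and inline backtick spans so shell variables inside code are ignored.
/// NOTE: `scan_skill_mentions` is a hypothetical name, for illustration only.
pub fn scan_skill_mentions(message: &str) -> Vec<String> {
    // Three backticks, written with escapes so this sample nests in Markdown.
    const FENCE: &str = "\u{60}\u{60}\u{60}";
    let mut mentions = Vec::new();
    let mut in_fence = false;
    for line in message.lines() {
        if line.trim_start().starts_with(FENCE) {
            in_fence = !in_fence; // a fence line opens or closes a code block
            continue;
        }
        if in_fence {
            continue;
        }
        let mut in_inline_code = false;
        let mut at_token_start = true;
        let mut chars = line.chars().peekable();
        while let Some(c) = chars.next() {
            if c == '\u{60}' {
                in_inline_code = !in_inline_code;
                at_token_start = false;
                continue;
            }
            if c.is_whitespace() {
                at_token_start = true;
                continue;
            }
            if c == '$' && at_token_start && !in_inline_code {
                // Capture the id up to the next whitespace (or backtick).
                let mut id = String::new();
                while let Some(&next) = chars.peek() {
                    if next.is_whitespace() || next == '\u{60}' {
                        break;
                    }
                    id.push(next);
                    chars.next();
                }
                if !id.is_empty() {
                    mentions.push(id);
                }
            }
            at_token_start = false;
        }
    }
    mentions
}
```

The scanner returns ids in order of appearance, so the caller can hand each one to the resolver and apply the multi-mention policy when more than one comes back.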
**Slice 2 — inline-mention popup.**
- Extend the inline-mention popup machinery in `slash_menu.rs:86`
(`partial_inline_skill_mention_at_cursor`) to recognize a `$`-prefixed token
under the cursor and offer skill-name completions from `SkillRegistry::list()`,
the same way the slash popup offers commands. This is a UX accelerator on top
of Slice 1, not a precondition for it.
**Slice 3 — ambiguity diagnostics.**
- When resolution is ambiguous, emit actionable diagnostics, e.g.
`"$debugging matched 3 skills: systematic-debugging, root-cause-debugging,
superpowers:systematic-debugging — use $superpowers:systematic-debugging"`.
Diagnostics name the disambiguated id the user should type next.
**Deferred to 0.9.0+ (explicitly out of the first slices):**
- `$ns:skill` **globs / wildcards** (`$github:*`). Plain namespaced-exact
(`$github:gh-fix-ci`) ships in Slice 1; globbing does not.
- **Per-skill tool restriction enforcement.** Parsing/display can land early;
enforcement and its catalog-head-stability handling are owned by
`TOOL_LIFECYCLE.md`.
- **Multi-skill composition rules.** Until defined, fall back to the "choose one"
prompt (§4, §7).
---
## 7. Ambiguity / error UX, tests, and non-goals
### Error / ambiguity UX examples
| Input | Outcome |
|---|---|
| `$systematic-debugging fix the auth bug` | Exact hit. `Using skill: systematic-debugging`, body loaded, task = "fix the auth bug". |
| `$github:gh-fix-ci inspect failing checks` | Namespaced-exact hit. `Using skill: github:gh-fix-ci`, body loaded. |
| `$nope do a thing` | No match. `"No skill named 'nope'. Run /skill list to see available skills."` No activation; the line is sent as ordinary text. |
| `$debugging ...` (3 candidates) | `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging, superpowers:systematic-debugging — use $superpowers:systematic-debugging."` No auto-pick. |
| `$systematic-debug ...` (1 fuzzy candidate) | Suggest only: `"No exact skill 'systematic-debug'. Did you mean $systematic-debugging?"` No silent activation. |
| `$aleph ...` but aleph disabled | `"Skill 'aleph' is disabled. Enable it with /skill enable aleph."` No activation. |
| `$test-driven-development $systematic-debugging ...` (2 mentions) | `"Choose one skill to lead this turn: $test-driven-development or $systematic-debugging."` (until composition rules exist). |
| `echo $PATH` inside a code fence / command string | Not a mention. Scanner skips `$` inside code/command contexts. |
### Tests (planned)
- **Exact:** `$systematic-debugging` resolves via `get(id)`, activates, loads body.
- **Namespaced:** `$github:gh-fix-ci` resolves on the `ns:skill` form.
- **Missing:** `$nope` → no-match message, no activation, line passed as text.
- **Ambiguous:** `$debugging` (≥2 candidates) → "matched N skills … use $ns:skill",
asserts **no** auto-activation occurred.
- **Disabled:** a skill with `is_enabled == false` → disabled message, no activation.
- **Guardrail — `$` in code:** `$VAR` inside a fenced block or command string is
not treated as a mention.
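The `$`-in-code guardrail could be exercised against a scanner along these lines. This is a rough sketch under stated assumptions: `scan_mentions` is a hypothetical name, only fenced blocks and inline backtick spans are treated as code contexts (the "command string" case is not modeled), and a mention is assumed to start at a word boundary with an ASCII letter.

```rust
/// Collects `$name` / `$ns:name` mentions that sit outside code contexts.
pub fn scan_mentions(input: &str) -> Vec<String> {
    let fence = "`".repeat(3); // fence marker, built at runtime
    let mut mentions = Vec::new();
    let mut in_fence = false;
    for line in input.lines() {
        // A line that begins with the fence marker toggles fenced-code state.
        if line.trim_start().starts_with(fence.as_str()) {
            in_fence = !in_fence;
            continue;
        }
        if in_fence {
            continue;
        }
        let mut in_inline_code = false;
        for (i, c) in line.char_indices() {
            match c {
                '`' => in_inline_code = !in_inline_code,
                '$' if !in_inline_code => {
                    // Assumption: a mention starts the line or follows whitespace.
                    let at_word_start = line[..i]
                        .chars()
                        .next_back()
                        .map_or(true, |p| p.is_whitespace());
                    let rest = &line[i + 1..];
                    let len = rest
                        .find(|ch: char| {
                            !(ch.is_ascii_alphanumeric() || matches!(ch, '-' | '_' | ':'))
                        })
                        .unwrap_or(rest.len());
                    let starts_alpha = rest
                        .chars()
                        .next()
                        .map_or(false, |ch| ch.is_ascii_alphabetic());
                    if at_word_start && starts_alpha {
                        mentions.push(rest[..len].to_string());
                    }
                }
                _ => {}
            }
        }
    }
    mentions
}
```

The scanner only extracts candidate names; deciding whether a candidate is exact, ambiguous, missing, or disabled stays a separate resolution step.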
### Non-goals
- **Do not remove slash commands.** `/skill` and the whole `/` surface stay; `$`
is the preferred form for models, but it is additive, not a replacement.
- **Do not auto-run arbitrary scripts.** A `$mention` loads guidance (and, later,
a declared tool policy) — it never executes shell or skill-attached scripts on
its own.
- **Do not silently activate multiple complex skills.** Multi-mention falls back
to a "choose one" prompt until composition rules are specified.
- **Do not let `$` collide with shell variables.** `$` inside code fences and
command strings is never parsed as a skill mention.
+370
@@ -0,0 +1,370 @@
# Tool-Surface Lifecycle Policy (v0.8.53)
**Status:** Design doc / policy. No catalog code lands in this cycle — the code
work is **deferred**. This document is the umbrella policy for GitHub **#2681**,
with **#2682** and **#2683** as concrete instances of the planned diet. It
describes *what will be done* and the invariants any future diet PR must hold.
**Scope of related open work (do not contradict):**
- PR **#2684** — subagent role vocabulary, lifecycle signals, eval ergonomics.
Legacy subagent-name cleanup + guardrail tests in this policy rebase on #2684.
- PR **#2685** — git-history active + RLM/field errors.
All file:line citations are against the verified tree at
`/Users/huntermbown/Desktop/whalebro/codewhale` as of v0.8.52/0.8.53.
---
## 1. Purpose and the weaker-model problem
CodeWhale ships a large native tool surface. The first-turn *active* partition
of that surface is what every model sees before it has run a single
`tool_search_*` call. Today the registered surface contains several
**near-duplicate tools** that map to the *same* implementation under different
names; the shell aliases sit in the active set itself, while the `tts` and
`todo_*` aliases are deferred but still discoverable:
- `exec_wait` and `exec_shell_wait` are both `ShellWaitTool`
(`crates/tui/src/tools/registry.rs:526,529`).
- `exec_interact` and `exec_shell_interact` are both `ShellInteractTool`
(`registry.rs:527,530`).
- `tts` and `speech` are both `SpeechTool`
(`registry.rs:787-792`, both deferred).
- `todo_write` and `checklist_write` are the *same* `TodoWriteTool`
constructed two ways (`crates/tui/src/tools/todo.rs:184-196`).
For a strong model, redundant names are harmless noise. For **weaker / smaller
models** (the Arcee Trinity lane, `deepseek-v4-flash` child executors, and any
non-thinking executor), every additional near-duplicate in the visible set is a
real cost:
- It widens the choice space with options that do *nothing distinct*, increasing
wrong-tool selection and oscillation between synonyms.
- It spends scarce first-turn catalog budget (Section 5) on zero-information
entries.
- It dilutes the "one name = one thing" contract that lets a small model reason
about the surface at all.
The lifecycle policy exists to **shrink and discipline the model-visible
surface** without ever breaking the ability to replay an old transcript that
referenced a now-retired name.
---
## 2. The five lifecycle states
Every native tool name occupies exactly one lifecycle state.
| State | Meaning | Visible on first turn? | In `tool_search_*`? | Executes if called? | When used |
|---|---|---|---|---|---|
| **active** | Canonical, in the first-turn catalog head | **Yes** | n/a (already active) | Yes | The tool a model should reach for by default |
| **deferred** | Registered + discoverable, hydrated on demand | No | **Yes** | Yes | Real, useful tools that don't earn a first-turn slot |
| **hidden-compatibility** | Registered + dispatchable, but removed from active **and** from search | No | **No** | **Yes — identical behavior, silent** | Old synonym kept only so old transcripts replay; no model should newly discover it |
| **deprecated** | Like hidden-compat, but execution **appends a replacement notice to result metadata** | No | **No** | **Yes — works, plus a "use X instead" notice** | A retired name we actively steer callers off of, still safe to replay |
| **removed** | Not registered at all | No | No | **No — hard error** | Only after `planned_removal_version`, once replay support is formally dropped |
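The columns of the table above can be read as four predicates over the state. The sketch below is for illustration only: the `Lifecycle` enum and its method names are hypothetical, and this is not a proposal to store lifecycle on each tool (Section 3c explains why name-sets are used instead).

```rust
/// The five lifecycle states from the table (illustrative type, not planned code).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    /// "Visible on first turn?" column.
    pub fn first_turn_visible(self) -> bool {
        self == Lifecycle::Active
    }

    /// "In tool_search_*?" column. Active tools are "n/a" there because they
    /// are already visible, so only deferred tools need discovery.
    pub fn tool_search_discoverable(self) -> bool {
        self == Lifecycle::Deferred
    }

    /// "Executes if called?" column: everything except removed still dispatches.
    pub fn dispatchable(self) -> bool {
        self != Lifecycle::Removed
    }

    /// Only deprecated names attach a replacement notice to result metadata.
    pub fn appends_notice(self) -> bool {
        self == Lifecycle::Deprecated
    }
}
```

Read this way, hidden-compatibility and deprecated agree on every predicate except `appends_notice`.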
### hidden-compatibility vs deprecated — be precise
Both states are **invisible** (not active, not in tool search) and both remain
**dispatchable** (calling them still works). The *only* difference is the
caller-facing signal:
- **hidden-compatibility:** completely silent. The tool behaves byte-for-byte
like its canonical twin. We use this when there is *no behavioral or naming
lesson to teach* — the name was a pure alias and we simply don't want models
re-learning it. (Example: `exec_wait` is literally `exec_shell_wait`.)
- **deprecated:** behaves identically *and succeeds*, but the tool result's
**metadata** carries an appended notice like
`"deprecated: use checklist_write instead"`. The notice goes **only in the
result metadata returned for that call** — never in the cached tool catalog
prefix (see Section 8). We use this when there is a canonical replacement we
want the caller (and any human reading the transcript) nudged toward.
Neither state ever changes the *behavior* of the call. Replay always works.
---
## 3. Representation in code
The lifecycle is represented as **const name-sets plus an alias/manifest table**
in `crates/tui/src/core/engine/tool_catalog.rs`, alongside the existing
`DEFAULT_ACTIVE_NATIVE_TOOLS` (`tool_catalog.rs:37-64`) and
`ARCEE_FIRST_TURN_NATIVE_TOOLS` (`tool_catalog.rs:106-115`).
### 3a. Name-sets and the manifest (sketch)
```rust
// crates/tui/src/core/engine/tool_catalog.rs (planned)
/// Tools removed from the active set AND from tool-search, but still
/// registered and dispatchable with byte-identical behavior. Silent.
pub(super) const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &[
"exec_wait", // == exec_shell_wait (ShellWaitTool)
"exec_interact", // == exec_shell_interact (ShellInteractTool)
"tts", // == speech (SpeechTool)
];
/// Deprecated aliases: invisible + dispatchable, with a replacement notice
/// appended to RESULT METADATA only (never the cached prefix).
pub(super) struct DeprecatedAlias {
pub name: &'static str,
pub replacement: &'static str,
pub note: &'static str,
}
pub(super) const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
DeprecatedAlias { name: "todo_write", replacement: "checklist_write",
note: "use checklist_write instead" },
DeprecatedAlias { name: "todo_add", replacement: "checklist_add",
note: "use checklist_add instead" },
DeprecatedAlias { name: "todo_update", replacement: "checklist_update",
note: "use checklist_update instead" },
DeprecatedAlias { name: "todo_list", replacement: "checklist_list",
note: "use checklist_list instead" },
];
#[inline]
pub(super) fn is_hidden_or_deprecated(name: &str) -> bool {
HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
|| DEPRECATED_ALIASES.iter().any(|d| d.name == name)
}
```
### 3b. The two filter points
1. **Catalog / tool-search exclusion (tool_catalog.rs).**
Deferral is decided by `should_default_defer_tool` (`tool_catalog.rs:66-82`),
and the active set is the head built by `build_model_tool_catalog`
(`tool_catalog.rs:178-196`). Hidden-compat and deprecated tools must be
forced *out of the active head* and *out of the tool-search-discoverable
pool*. Concretely, the deferral predicate gains a short-circuit so these
names are never active, and the tool-search index builder skips any name for
which `is_hidden_or_deprecated(name)` is true. Arcee's narrowed first-turn
path (`apply_provider_tool_policy`, `tool_catalog.rs:134-149`) already
excludes them by construction since they aren't in
`ARCEE_FIRST_TURN_NATIVE_TOOLS`.
2. **Result-notice append (tool_routing.rs).**
Dispatch already routes by tool name in
`crates/tui/src/tui/tool_routing.rs` (e.g. the wait/interact unification at
`tool_routing.rs:1139-1140`). After a successful dispatch, if the called name
is in `DEPRECATED_ALIASES`, the router appends the matching `note` to the
**result metadata only**. Hidden-compat names append nothing.
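The two filter points might look roughly like the following when taken out of the engine. Everything here is a simplified, hypothetical stand-in: `discoverable_pool`, `annotate_result`, the tuple form of the alias table, and the `deprecation_notice` metadata key are assumptions for illustration, and the real router's result type is not shown in this document.

```rust
use std::collections::BTreeMap;

const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &["exec_wait", "exec_interact", "tts"];

/// (alias, note) pairs; a flattened stand-in for the `DeprecatedAlias` table.
const DEPRECATED_ALIASES: &[(&str, &str)] = &[
    ("todo_write", "use checklist_write instead"),
    ("todo_add", "use checklist_add instead"),
    ("todo_update", "use checklist_update instead"),
    ("todo_list", "use checklist_list instead"),
];

fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_ALIASES.iter().any(|(alias, _)| *alias == name)
}

/// Filter point 1: names that tool search may surface. The registry itself is
/// left untouched, so excluded names stay dispatchable.
pub fn discoverable_pool<'a>(registered: &[&'a str]) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| !is_hidden_or_deprecated(name))
        .collect()
}

/// Filter point 2: after a successful dispatch, attach the replacement note to
/// the per-call metadata only. Hidden-compat names fall through untouched.
pub fn annotate_result(called: &str, metadata: &mut BTreeMap<String, String>) {
    if let Some((_, note)) = DEPRECATED_ALIASES.iter().find(|(alias, _)| *alias == called) {
        // "deprecation_notice" is a hypothetical key name.
        metadata.insert("deprecation_notice".to_string(), (*note).to_string());
    }
}
```

Neither function touches tool schemas or descriptions, which is what keeps the emitted catalog JSON free of lifecycle state.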
### 3c. Why name-sets, not a per-`ToolSpec` enum field
A per-`ToolSpec` `lifecycle: Lifecycle` field was rejected for three reasons:
- **Prefix-cache safety.** The tool catalog array is part of DeepSeek's
immutable KV prefix (`tool_catalog.rs:169-177`). A per-spec field invites
serializing lifecycle state *into* each tool's schema, which is exactly the
kind of head mutation that forces a full re-prefill. Name-sets live entirely
in the catalog-build logic and never touch the emitted tool JSON.
- **Single source of truth + diffability.** The diet for a release is one small,
reviewable edit to two or three const arrays in one file, instead of scattered
field flips across many tool modules.
- **Registration stays orthogonal.** Tools remain registered exactly as today
(e.g. `with_shell_tools`, `registry.rs:523-531`). Lifecycle is a *catalog
policy* layered on top of registration, not a property baked into the tool.
---
## 4. Deprecation manifest (the #2681 acceptance-criteria table)
This is the authoritative manifest. Columns are the #2681 AC columns. No entry
is "removed" in 0.8.53; replay is supported for everything listed.
| Alias | Replacement (canonical) | Lifecycle state | first_deprecated_version | planned_removal_version | replay_supported |
|---|---|---|---|---|---|
| `exec_wait` | `exec_shell_wait` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `exec_interact` | `exec_shell_interact` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `tts` | `speech` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_write` | `checklist_write` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_add` | `checklist_add` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_update` | `checklist_update` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_list` | `checklist_list` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
**Legacy subagent names — already non-visible, no manifest entry needed.**
`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`,
`send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, and
`delegate_to_agent` exist only as `#[allow(dead_code)]` structs in
`crates/tui/src/tools/subagent/mod.rs` and are **never instantiated** outside
tests, so they are already not model-visible. Only `agent_open`, `agent_eval`,
`tool_agent`, and `agent_close` are registered
(`registry.rs:1017-1029`). The action for these legacy names is **dead-code
cleanup + a guardrail test** (rebase on PR #2684), not a lifecycle transition.
> **Keep the live internal methods.** `send_input`, `cancel`, and `resume` also
> exist as live `SubAgentManager` methods
> (`subagent/mod.rs:1605,1495,1521`) used internally by `agent_eval` /
> `agent_close`. These are *not* the dead-code tool structs and must be kept.
`planned_removal_version` is intentionally `TBD`: a name only moves to **removed**
once we formally drop replay for transcripts old enough to contain it, which is a
separate, deliberate decision per name.
---
## 5. Active-catalog budget (per mode, per provider)
The active set is the first-turn cost. Do not duplicate the exact
`DEFAULT_ACTIVE_NATIVE_TOOLS` count here: adjacent PRs in the v0.8.53 batch may
add or remove active tools, and the source of truth is always
`tool_catalog.rs`. This document defines the diet policy and invariants, not a
second catalog snapshot.
### Per provider
| Provider | First-turn active source | Budget policy |
|---|---|---|
| Default (DeepSeek et al.) | `DEFAULT_ACTIVE_NATIVE_TOOLS` | Remove duplicate aliases from the active head when their canonical twins stay active; any net growth needs an explicit budget decision. |
| Arcee (Trinity) | `ARCEE_FIRST_TURN_NATIVE_TOOLS` | Provider-specific read-only WAF workaround; unchanged by the default diet unless explicitly reviewed. |
The default diet removes `exec_wait` and `exec_interact` from the active head
(they become hidden-compat; their canonical twins `exec_shell_wait` /
`exec_shell_interact` stay). `tts` and `todo_*` are *already not* in the active
set, so they do not change the active budget in this diet. The net effect of
this specific diet is to remove two duplicate active aliases from whatever
default active head is current after the surrounding v0.8.53 PR batch.
### Per mode (Plan / Agent / YOLO)
The native active head is the **same set across modes** by design — mode does not
add or remove native tools from `DEFAULT_ACTIVE_NATIVE_TOOLS`
(`should_default_defer_tool` ignores `_mode` for native tools,
`tool_catalog.rs:66-68`). Mode affects **MCP** deferral instead:
`apply_mcp_tool_deferral` keeps MCP tools deferred unless `mode == Yolo`
(`tool_catalog.rs:162-167`).
| Mode | Native active budget | MCP tools active? |
|---|---|---|
| Plan | same native head | No (deferred) |
| Agent | same native head | No (deferred) |
| YOLO | same native head | Yes (a known, intentional widening) |
**Budget rule:** the native active head must stay byte-identical across Plan ↔
Agent ↔ YOLO (Section 8). Any growth of the head requires retiring something
else or an explicit budget bump in this doc.
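As a compact restatement of the per-mode rule, the following sketch separates the two decisions. The names `Mode`, `native_head`, and `mcp_tools_active` are hypothetical, and the three tool names in the head are placeholders rather than the real contents of `DEFAULT_ACTIVE_NATIVE_TOOLS`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Plan,
    Agent,
    Yolo,
}

/// The native head ignores the mode entirely, which is what keeps it
/// byte-identical across modes. Entries are placeholders for illustration.
pub fn native_head(_mode: Mode) -> &'static [&'static str] {
    &["apply_patch", "exec_shell", "grep_files"]
}

/// MCP tools stay deferred unless the mode is YOLO.
pub fn mcp_tools_active(mode: Mode) -> bool {
    mode == Mode::Yolo
}
```

Because `native_head` never reads its argument, any future change that makes it mode-sensitive would show up as a signature-level change in review.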
---
## 6. The canonical-surface rule
> **Every model-visible (active or deferred-discoverable) tool must have one
> clear niche. If a tool is superseded, it gets a named replacement and moves to
> hidden-compatibility or deprecated — it does not stay visible.**
### Canonical vs compatibility summary for the confusing clusters
| Cluster | Canonical (keep visible) | Compatibility / retired | Notes |
|---|---|---|---|
| **Shell wait** | `exec_shell_wait` | `exec_wait` → hidden-compat | Same `ShellWaitTool` (`registry.rs:526,529`); router already unifies (`tool_routing.rs:1139`) |
| **Shell interact** | `exec_shell_interact` | `exec_interact` → hidden-compat | Same `ShellInteractTool` (`registry.rs:527,530`) |
| **Checklist / todo** | `checklist_write` | `todo_write/add/update/list` → deprecated | Same `TodoWriteTool`, `::new` vs `::checklist` (`todo.rs:184-196`) |
| **Speech / tts** | `speech` | `tts` → hidden-compat | Same `SpeechTool` (`registry.rs:787-792`) |
| **Subagent lifecycle** | `agent_open`, `agent_eval`, `agent_close`, `tool_agent` (gated, §7) | all 11 legacy names → already non-visible dead code | Cleanup + guardrail test, rebase on #2684 |
| **Edit family** | `apply_patch`, `edit_file`, `write_file`, `fim_edit` | none — **all distinct niches** | NOT touched (per #2681 non-goals); doc-only canonical guidance |
| **Search family** | `grep_files` (content), `file_search` (filename), `project_map` (structure) | none — **distinct niches** | NOT touched; no FTS5/BM25/semantic index exists today |
**Non-goals (explicitly NOT diet targets in this cycle, per #2681):**
`apply_patch` / `edit_file` / `write_file` / `fim_edit`;
`grep_files` / `file_search` / `project_map`;
`fetch_url` / `web.run` / `web_search`;
`task_shell_*`; `handle_read` / `retrieve_tool_result`. These have distinct
niches and receive **canonical guidance only** — no lifecycle change.
The RLM surface (`rlm_open` / `rlm_eval` / `rlm_configure` / `rlm_close` /
`rlm_session_objects`, `crates/tui/src/tools/rlm.rs`) is likewise out of scope;
`handle_read` retrieves var handles, and `finalize` / `FINAL` is an in-kernel
Python function, **not a tool** — so there is nothing to retire there.
---
## 7. `tool_agent` decision: canonical but DeepSeek-V4-gated
`tool_agent` **stays** as a canonical subagent tool
(`registry.rs:1024`, `ToolAgentTool`). It is the fast, **non-thinking "Fin"
executor lane**, built on `deepseek-v4-flash` (cf. `DEFAULT_CHILD_MODEL =
"deepseek-v4-flash"`, `rlm.rs:26`).
**Decision: gate `tool_agent` to DeepSeek-V4 models only.**
- It is purpose-built around the V4-flash non-thinking executor profile. Exposing
it to other providers (e.g. Arcee Trinity, which is already WAF-narrowed to 8
read-only tools, `tool_catalog.rs:106-115`) offers no working executor lane and
only adds a confusing, mis-targeted option to weaker surfaces.
- Gating is a **provider/model policy**, consistent with the existing
provider-aware first-turn policy (`apply_provider_tool_policy`,
`tool_catalog.rs:134-149`): on non-DeepSeek-V4 models, `tool_agent` is excluded
from the active set and from tool-search discovery. It remains **registered and
dispatchable** so transcripts created under a V4 model replay everywhere.
This is not a lifecycle transition — `tool_agent` is canonical. It is a
*visibility gate* layered on the same machinery as the Arcee narrowing.
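One way the gate could be expressed, assuming for illustration that DeepSeek-V4 model ids share a `deepseek-v4` prefix (the document only names `deepseek-v4-flash`; the prefix rule and the function names below are hypothetical):

```rust
/// Assumption: any model id beginning with "deepseek-v4" counts as DeepSeek-V4.
pub fn tool_agent_visible(model_id: &str) -> bool {
    model_id.starts_with("deepseek-v4")
}

/// Applies the visibility gate to a list of candidate tool names.
/// Registration and dispatch are deliberately not consulted here.
pub fn visible_tools<'a>(model_id: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || tool_agent_visible(model_id))
        .collect()
}
```

The gate filters visibility only, so a transcript recorded under a V4 model can still dispatch `tool_agent` when replayed under a different model.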
---
## 8. Prefix-cache safety + replay guarantee
### Prefix-cache rules every diet PR MUST follow
The tools array is part of DeepSeek's immutable KV prefix. The catalog-head
byte-stability invariant (`tool_catalog.rs:169-196`) is binding:
1. **Never mutate the active head non-deterministically.** The first-turn active
block must be **byte-identical run-to-run** and across Plan ↔ Agent ↔ YOLO.
2. **A diet is a one-time deterministic edit.** Removing a name from
`DEFAULT_ACTIVE_NATIVE_TOOLS` shifts the head exactly once; after that it must
be stable. Land such edits as their own focused change.
3. **Notices live in result metadata, never the prefix.** Deprecated replacement
notes are appended at dispatch time in `tool_routing.rs` to the *call result*
only. **Nothing** about hidden/deprecated state may be serialized into a tool
schema, description, or the catalog array.
4. **Preserve ordering and partitioning.** `build_model_tool_catalog` sorts each
partition by name and keeps built-ins as a contiguous prefix ahead of MCP
tools (`tool_catalog.rs:186-194`). Diet edits must not break this.
5. **Hidden/deprecated tools are excluded *before* the head is built**, so their
removal is the only head change — they do not appear in the prefix at all.
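Rules 4 and 5 can be illustrated with a small, self-contained builder. This is a sketch and not `build_model_tool_catalog` itself: `build_catalog` and its parameters are hypothetical, and tools are reduced to bare name strings.

```rust
/// Builds a deterministic catalog order: excluded names are dropped first,
/// each partition is sorted by name, and native tools form a contiguous
/// prefix ahead of MCP tools.
pub fn build_catalog(
    mut native: Vec<String>,
    mut mcp: Vec<String>,
    excluded: &[&str],
) -> Vec<String> {
    // Rule 5: hidden/deprecated names never reach the head at all.
    native.retain(|name| !excluded.contains(&name.as_str()));
    // Rule 4: sort each partition so input order cannot leak into the bytes.
    native.sort();
    mcp.sort();
    native.extend(mcp);
    native
}
```

Sorting inside the builder means registration order, hash-map iteration order, and similar incidental inputs cannot change the serialized head.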
### Old-transcript replay guarantee
> For every name in the deprecation manifest with `replay_supported = Yes`, the
> tool stays **registered and dispatchable with identical behavior**. Replaying
> an old transcript that calls `exec_wait`, `exec_interact`, `tts`, or any
> `todo_*` produces the same result it always did. Deprecated names additionally
> attach a result-metadata notice; hidden-compat names are silent. A name is only
> ever made non-dispatchable (**removed**) after a deliberate, per-name decision
> to drop replay support at `planned_removal_version`.
---
## 9. Required tests
Any diet PR (and the umbrella #2681 work) must add/keep:
1. **Duplicate-active-alias guard.** A test asserting that no name in
`HIDDEN_COMPATIBILITY_TOOLS` or `DEPRECATED_ALIASES` appears in
`DEFAULT_ACTIVE_NATIVE_TOOLS` or `ARCEE_FIRST_TURN_NATIVE_TOOLS`, and that no
two active entries resolve to the same underlying tool implementation.
2. **Tool-search exclusion test.** Assert that hidden-compat and deprecated names
are absent from the tool-search-discoverable pool while remaining present in
the registry (dispatchable).
3. **Replay / dispatch tests.** For each manifest name, calling it still
executes and returns the same result as its canonical twin. Deprecated names
additionally assert the replacement note is present **in result metadata** and
absent from the catalog/prefix. Hidden-compat names assert **no** added
notice.
4. **Golden active-block byte test.** A snapshot test pinning the byte
serialization of the first-turn active tool block, asserting it is identical
across Plan / Agent / YOLO (native head) and stable run-to-run — enforcing the
`tool_catalog.rs:169-196` invariant. The golden updates **only** as a
reviewed, deliberate one-time edit when the diet lands.
5. **Subagent guardrail test (rebase on #2684).** Assert only `agent_open`,
`agent_eval`, `tool_agent`, `agent_close` are registered as model-visible
subagent tools and that no legacy name from `subagent/mod.rs` is
instantiated outside tests.
6. **`tool_agent` gating test.** Assert `tool_agent` is active/discoverable only
under DeepSeek-V4 models and excluded (but still registered) elsewhere.
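A possible shape for the check behind test 1, written as a plain function so a unit test can assert that it returns no violations. The function name, the `retired` slice (hidden-compat plus deprecated names), and the name-to-implementation map are all hypothetical; how the real test obtains implementation identity is left open.

```rust
use std::collections::HashMap;

/// Returns human-readable violations of the duplicate-active-alias guard.
pub fn active_set_violations(
    active: &[&str],
    retired: &[&str],
    implementation_of: &HashMap<&str, &str>,
) -> Vec<String> {
    let mut violations = Vec::new();
    // Maps an implementation to the first active name that resolved to it.
    let mut seen: HashMap<&str, &str> = HashMap::new();
    for name in active {
        if retired.contains(name) {
            violations.push(format!("{name} is retired but still active"));
        }
        if let Some(implementation) = implementation_of.get(*name) {
            if let Some(first) = seen.insert(*implementation, *name) {
                violations.push(format!("{first} and {name} share {implementation}"));
            }
        }
    }
    violations
}
```

Returning messages instead of panicking lets one failing run list every offending name at once.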
+490
@@ -0,0 +1,490 @@
# CodeWhale North Star (0.9.0+)
> **STATUS: DIRECTION, NOT COMMITTED WORK.**
> Everything in this document is the maintainer's intended *direction* for
> CodeWhale 0.9.0 and beyond. **None of it is committed 0.8.53 work.** The
> 0.8.53 cycle ships **design docs only** for these areas — no tool-catalog code
> lands this cycle except the small, already-scoped subagent/git/RLM fixes in
> PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be
> refined, not an API contract. Where this doc names tools that do not exist yet
> (`codebase_search`, `read_file` as a canonical alias, `agent_run`, etc.) those
> are **aspirational names** that will *map onto today's tools*; see each
> section.
## Why this document exists
The vision is at risk of being lost between point releases. CodeWhale is
accumulating capability (subagents, RLM, skills, workflows, an enormous tool
catalog) faster than it is accumulating *shape*. This is the north star that the
incremental 0.8.x stabilization work is steering toward, written down once so it
survives the next dozen PRs.
### The one principle
**The harness handles memory, search, routing, state, and guardrails so a
weaker model can just *think*.** Every design decision below is in service of
moving cognitive load *out* of the model and *into* the harness. A
`deepseek-v4-flash`-class model should not have to remember ~80 tool names, hold
the codebase index in its head, track which layer of memory a fact lives in, or
re-derive a recovery path after a malformed tool call. The harness does that.
The model decides *what it wants*; the harness figures out *how*.
---
## Ground-truth anchor (today's reality)
So the direction is honest about where it starts:
- **Active first-turn tool set** is `DEFAULT_ACTIVE_NATIVE_TOOLS`
(`crates/tui/src/core/engine/tool_catalog.rs:37-64`) — 26 tools. Everything
else is **deferred** and hydrates via `tool_search_tool_regex` /
`tool_search_tool_bm25` (`tool_catalog.rs:26-35`).
- **Catalog-head byte-stability is a hard invariant** for DeepSeek's KV
prefix cache (`tool_catalog.rs:169-196`). The active first-turn tool block
must stay byte-identical run-to-run; any change to it is a **one-time,
deterministic edit**, never a per-turn or per-mode mutation.
- **Arcee** narrows the first turn to 8 read-only tools
(`ARCEE_FIRST_TURN_NATIVE_TOOLS`, `tool_catalog.rs:106-115`) as a Cloudflare
WAF workaround — proof the active partition is already provider-shaped.
- **Subagent tools that are model-visible:** only `agent_open`, `agent_eval`,
`tool_agent`, `agent_close` (`crates/tui/src/tools/registry.rs:1017-1029`).
All legacy names (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`,
`agent_send_input`, `agent_assign`, `agent_list`, `agent_cancel`,
`resume_agent`, `delegate_to_agent`, …) are `#[allow(dead_code)]` structs in
`crates/tui/src/tools/subagent/mod.rs`, never instantiated outside tests →
**already not model-visible**. The live internal `send_input` / `cancel` /
`resume` methods on `SubAgentManager` (`mod.rs:1495,1521,1605`) back
`agent_eval` / `agent_close` and **stay**.
- **`tool_agent` is "Fin"** — the experimental fast-lane executor: DeepSeek V4
Flash with thinking forced off (`mod.rs:5233`, `TOOL_AGENT_INTRO`;
`DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`).
- **Known duplicates today:** `exec_wait ≡ exec_shell_wait`,
`exec_interact ≡ exec_shell_interact` (same structs, all four in the active
set), `tts ≡ speech` (both deferred). `todo_*` are deferred twins of
`checklist_*` (same `TodoWriteTool`, `::new` vs `::checklist`,
`todo.rs:187,194`). The router already unifies `exec_wait`/`exec_shell_wait`
(`crates/tui/src/tui/tool_routing.rs:1139-1140`).
This is the surface the north star refactors *toward simplicity*.
---
## 1. Intent Router
**What it is.** A thin layer where the model declares an **intent**
(*search / inspect / edit / test / delegate / ask-user / run-shell /
run-workflow*) and the harness maps that intent to the correct low-level tool
and arguments. The model picks from a tiny, stable verb vocabulary instead of
recalling ~80 concrete tool names and their schemas.
**Why it helps weaker models.** Tool-name recall is one of the largest sources
of wasted turns for small models: choosing a deferred tool (double-invoke),
choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary
collapses that decision space to ~10 verbs. The model spends its budget on
*reasoning about the task*, not on *remembering the API*.
**Rough shape.** A small **canonical visible set** — aspirational names that
route onto today's tools:
| Intent verb (aspirational) | Routes onto today |
|---|---|
| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` |
| `read_file` | `read_file` (already canonical) |
| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) |
| `run_tests` | `run_tests` / `run_verifiers` |
| `git_status` | `git_status` |
| `git_diff` | `git_diff` |
| `work_update` | `update_plan` / `checklist_write` |
| `ask_user` | `request_user_input` |
| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` hidden — §10) |
| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` |
| `workflow_run` | WhaleFlow runner (§4) |
The router is the *only* place the catalog's full complexity is allowed to live.
It is also where **tool repair** (§7) hooks in: a mis-stated intent or a
deferred/deprecated name is rewritten to the canonical route.
**Dependencies.** The small canonical surface (§3), the lifecycle alias table
(§3 / `docs/TOOL_LIFECYCLE.md`), and the hybrid index for `codebase_search`
(§2). Must respect the **catalog-head byte-stability invariant**: the visible
verb set is itself a one-time deterministic edit, not a dynamic per-turn list.
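The routing table above could be held as a static lookup, sketched below. This is a direction-level illustration only: `route` is a hypothetical function, it returns candidate tool names rather than choosing among them or building arguments, and `workflow_run` is omitted because it targets the WhaleFlow runner rather than a named tool.

```rust
/// Maps an aspirational intent verb onto the tools it routes to today.
/// Returns `None` for verbs outside the fixed vocabulary.
pub fn route(intent: &str) -> Option<&'static [&'static str]> {
    let tools: &'static [&'static str] = match intent {
        "codebase_search" => &["grep_files", "file_search", "project_map"],
        "read_file" => &["read_file"],
        "apply_patch" => &["apply_patch"],
        "run_tests" => &["run_tests", "run_verifiers"],
        "git_status" => &["git_status"],
        "git_diff" => &["git_diff"],
        "work_update" => &["update_plan", "checklist_write"],
        "ask_user" => &["request_user_input"],
        "shell_run" => &["exec_shell"],
        "agent_run" => &["agent_open", "tool_agent", "agent_eval", "agent_close"],
        _ => return None,
    };
    Some(tools)
}
```

A `match` over string literals keeps the verb set a compile-time constant, in line with the requirement that the visible vocabulary is a one-time deterministic edit.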
---
## 2. Default Hybrid Codebase Intelligence
**What it is.** An always-on, local-first codebase index that ships with the
harness — not an opt-in tool the model has to remember to build. It fuses:
- plain **text** search,
- **symbol** index (definitions/references),
- **import / call graph**,
- **FTS5 + BM25** lexical ranking (rusqlite is already a dependency —
`Cargo.toml`),
- **sparse** retrieval,
- optional **dense** (embedding) retrieval,
- **PR / commit / issue history** as a first-class retrieval source,
- a **codemap** (structural overview, the successor to today's deferred
`project_map`).
**Why it helps weaker models.** Today the model must orchestrate `grep_files`
(content), `file_search` (filename), and `project_map` (structure) by hand,
reconcile their outputs, and re-run them as it narrows. There is **no FTS5/BM25
or semantic index today** — every search is a cold walk (`file_search` uses the
`ignore` crate's `WalkBuilder` for vendor exclusion, `file_search.rs:~210`). A
weaker model burns turns stitching partial results. A single `codebase_search`
intent backed by a hybrid index returns ranked, concept-level hits in one call,
so the model reasons about *answers*, not *query mechanics*.
**Rough shape.** A background indexer maintains a SQLite store (FTS5 + symbol +
graph tables), refreshed on file change and on git events. `codebase_search`
(§1) queries it; the codemap is regenerated incrementally. Vendor exclusion
reuses the existing `ignore`/`WalkBuilder` path.
**Dependencies.** rusqlite/FTS5; the Intent Router (§1) for the
`codebase_search` verb; the trace store (§6/§8) for history retrieval. **Full
design lives in `docs/CODEBASE_SEARCH_DESIGN.md`** (to be written this cycle).
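This document does not say how results from the different retrieval sources would be combined. One commonly used option is reciprocal rank fusion, sketched here purely as an illustration of "fusing" ranked lists; the choice of technique, the `fuse` function, and the constant `k` are assumptions, not part of the design.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion: each source contributes 1 / (k + rank) per item,
/// with ranks starting at 1. Ties are broken by path for determinism.
pub fn fuse(rankings: &[Vec<&str>], k: f64) -> Vec<String> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (index, path) in ranking.iter().enumerate() {
            *scores.entry(*path).or_insert(0.0) += 1.0 / (k + index as f64 + 1.0);
        }
    }
    let mut fused: Vec<(&str, f64)> = scores.into_iter().collect();
    // Highest score first; equal scores fall back to lexical path order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    fused.into_iter().map(|(path, _)| path.to_string()).collect()
}
```

Rank-based fusion needs no score calibration between sources, which is convenient when lexical, symbol, and optional dense retrieval produce scores on unrelated scales.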
---
## 3. Small Canonical Tool Surface
**What it is.** A deliberately tiny set of always-visible canonical tools;
**everything else is hidden, deferred, or skill-scoped**. The catalog grows
behind the scenes but the *visible* surface stays small and stable.
**Why it helps weaker models.** Fewer choices, no aliases competing for the same
job, no deferred double-invokes for common operations. The model sees the verbs
it needs and nothing else.
**Rough shape — tool lifecycle states.** Five states, represented as **const
name-sets plus an alias table in `tool_catalog.rs`** (NOT a per-`ToolSpec`
field, to preserve the byte-stable head):
1. **active** — in the first-turn catalog head.
2. **deferred** — registered, hydrated via tool-search.
3. **hidden-compatibility** — registered + dispatchable, **dropped from both
active and search**, identical behavior, **no notice**. (For exact
duplicates that should simply disappear from discovery.)
4. **deprecated** — registered + dispatchable, **dropped from search**, appends
a *replacement notice to RESULT METADATA only*, **never** to the cached
prefix.
5. **removed** — final state; no longer registered.
**Invariant:** deprecated and hidden-compatibility tools **stay registered and
dispatchable** for as long as they hold those states, so old transcripts replay
deterministically. Only a deliberate, per-name move to **removed** ends replay.
**Planned diet (documented this cycle, not yet coded):**
- `exec_wait`, `exec_interact`, `tts` → **hidden-compatibility** (exact
duplicates of `exec_shell_wait`, `exec_shell_interact`, `speech`).
- `todo_*` (`todo_write/add/update/list`) → **deprecated → checklist_*** (drop
from tool-search, keep registered, add result-metadata notice).
- Legacy subagent names → already hidden; remaining work is **cleanup +
guardrail tests**, rebased on PR #2684.
**Explicitly NOT touched** (distinct niches, per #2681 non-goals) — doc-only
canonical guidance, no diet: `apply_patch` / `edit_file` / `write_file` /
`fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` /
`web.run` / `web_search`; `task_shell_*`; `handle_read` /
`retrieve_tool_result`.
**`tool_agent` gating decision.** `tool_agent` ("Fin") **stays** as a canonical
subagent tool, but is **gated to DeepSeek-V4 models only**. It is the fast,
non-thinking executor lane built on `deepseek-v4-flash`; offering it to other
providers/models is meaningless (the lane *is* a specific model) and would just
add a name to recall. The gate is provider/model-conditional in the same spirit
as the Arcee first-turn narrowing.
**Dependencies.** The alias table backs the Intent Router (§1) and Tool Repair
(§7). **Full spec in `docs/TOOL_LIFECYCLE.md`** (to be written this cycle).
---
## 4. WhaleFlow / Workflow Mode
**What it is.** A typed, multi-agent **workflow runner**. A workflow is a graph
of typed nodes — **branches, leaves, reviewers, verifiers, test-runners,
PR-creators**, with **trace-replay** and a **progress-monitor**. Authors write
workflows in **Starlark or YAML**, which compile to a **typed Rust IR**; the
**Rust executor** runs the IR. "Like Claude's workflow mode, but safer" — the
safety comes from the typed IR and Rust execution boundary rather than free-form
model-driven orchestration.
**Why it helps weaker models.** Long-running, multi-step work (implement →
review → verify → test → open PR) is exactly where weaker models drift, lose
state, or skip verification. Encoding the *process* as a typed graph means the
model only has to be competent at each *leaf*, while the harness guarantees the
sequencing, the verification gates, and the evidence trail.
**Rough shape.** Starlark/YAML → typed IR → Rust executor. Nodes map to
subagent lanes (`agent_open` / `tool_agent` / `agent_eval` / `agent_close`,
`registry.rs:1017-1029`). Reviewer/verifier/test-runner nodes are first-class
node *types*, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via
`/workflow` (alias `/whaleflow`) and the `workflow_run` intent (§1).
**Dependencies.** Subagent runtime; the evaluation loop (§8) for traces;
Skills & Rules (§5) so a skill can *define* a workflow; the command taxonomy
(§9).
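To make "typed IR" slightly more concrete, here is a very small sketch of what a node graph and a dependency-ordered schedule might look like. None of these types exist today: `NodeKind`, `Node`, the `after` field, and `execution_order` are hypothetical, and a real executor would add far more (typed inputs and outputs, traces, progress monitoring).

```rust
/// Node types named in the description above (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

pub struct Node {
    pub id: &'static str,
    pub kind: NodeKind,
    /// Ids of nodes that must finish before this one may start.
    pub after: &'static [&'static str],
}

/// Returns node ids in an order that respects every `after` edge, or `None`
/// if a dependency is unknown or the graph contains a cycle.
pub fn execution_order(nodes: &[Node]) -> Option<Vec<&'static str>> {
    let mut done: Vec<&'static str> = Vec::new();
    while done.len() < nodes.len() {
        let before = done.len();
        for node in nodes {
            let ready = node.after.iter().all(|dep| done.contains(dep));
            if ready && !done.contains(&node.id) {
                done.push(node.id);
            }
        }
        // No progress in a full pass means the remaining nodes are stuck.
        if done.len() == before {
            return None;
        }
    }
    Some(done)
}
```

Because the sequencing is computed from declared edges, a verifier or test-runner node cannot be skipped by the model; it can only be absent from the workflow definition.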
---
## 5. Skills & Rules as First-Class Runtime
**What it is.** Skills and rules become real runtime objects, not just prompt
text. Skills gain **activation modes**:
- **always-on** — injected every turn,
- **glob** — activated when matching files are in scope,
- **model-decision** — offered to the model to opt into,
- **manual** — only via explicit `$<skill-name>` invocation (§9).
Skills can **restrict the tool surface**, **define workflows** (§4), and
**inject repo context**.
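
The four activation modes and the tool restriction could be modeled roughly as follows. This is a sketch under stated assumptions: `ActivationMode`, `Skill`, `TurnContext`, `is_active`, and `permits` are hypothetical names, and the glob mode is reduced to a plain suffix match to keep the example dependency-free.

```rust
/// Hypothetical skill metadata; variant and field names are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ActivationMode {
    /// Injected every turn.
    AlwaysOn,
    /// Simplified "glob": active when an in-scope path ends with a suffix.
    /// A real implementation would likely use proper glob matching.
    Glob(Vec<String>),
    /// Active only once the model has opted in.
    ModelDecision,
    /// Active only after an explicit `$<skill-name>` mention.
    Manual,
}

pub struct Skill {
    pub name: String,
    pub mode: ActivationMode,
    /// `None` means unrestricted; `Some` lists the only tools allowed.
    pub allowed_tools: Option<Vec<String>>,
}

/// What the harness knows about the current turn (assumed shape).
pub struct TurnContext<'a> {
    pub files_in_scope: &'a [&'a str],
    pub model_opted_in: &'a [&'a str],
    pub manual_mentions: &'a [&'a str],
}

impl Skill {
    /// Decides whether the skill is injected this turn.
    pub fn is_active(&self, ctx: &TurnContext<'_>) -> bool {
        match &self.mode {
            ActivationMode::AlwaysOn => true,
            ActivationMode::Glob(suffixes) => ctx
                .files_in_scope
                .iter()
                .any(|path| suffixes.iter().any(|s| path.ends_with(s.as_str()))),
            ActivationMode::ModelDecision => {
                ctx.model_opted_in.contains(&self.name.as_str())
            }
            ActivationMode::Manual => {
                ctx.manual_mentions.contains(&self.name.as_str())
            }
        }
    }

    /// Tool restriction: an unrestricted skill permits every tool.
    pub fn permits(&self, tool: &str) -> bool {
        match &self.allowed_tools {
            None => true,
            Some(allowed) => allowed.iter().any(|t| t.as_str() == tool),
        }
    }
}
```

Keeping the restriction as an allow-list means a skill can only shrink the tool surface; it cannot add tools the registry does not already expose.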
**Why it helps weaker models.** A skill scoped to a task can shrink the tool
surface to exactly what that task needs and pre-load the relevant rules and
context — so the model operates inside a curated, smaller world instead of the
full catalog.
**Rough shape (vs. today).** Today: skills are discovered
(`crates/tui/src/tools/skills/mod.rs`, `discover_in_workspace ~421`; struct
parses name/description `~382-388`), enable-state is tracked
(`skill_state.rs`, `SkillStateStore::is_enabled ~73`), and there's an
inline-mention popup (`slash_menu.rs ~86`). **But:** no parser activates inline
`$` mentions on submit (submit path: `ui.rs build_queued_message ~4721`), there
is **no activation-mode concept**, and **skills cannot restrict tools**. The
direction adds (a) a submit-time `$<skill-name>` activation parser, (b) the
four activation modes in skill metadata, and (c) a tool-restriction field
enforced by the registry/router.
**Dependencies.** Tool lifecycle/alias table (§3) for restriction; Intent Router
(§1); WhaleFlow (§4); command taxonomy (§9). **Full design in
`docs/SKILL_INVOCATION_DESIGN.md`** (to be written this cycle).
---
## 6. Context Memory Stack
**What it is.** Memory modeled as **explicit, layered, inspectable** stores
rather than one undifferentiated blob. Each layer is **visible, inspectable,
clearable, and scoped**:
1. **User memory** — small user prefs/facts (surfaced via `/memory`, §9).
2. **Repo rules** — checked-in guidance (`/rules`).
3. **Codemap-wiki** — derived structural/semantic knowledge of the repo (§2).
4. **Trace store** — recorded workflow/turn evidence (§8).
5. **ARMHRLM memo** — the RLM kernel's in-session working memory
(`rlm_open`/`rlm_eval`/`rlm_configure`/`rlm_close`/`rlm_session_objects`,
`crates/tui/src/tools/rlm.rs`; `handle_read` retrieves var handles;
`finalize`/`FINAL` is an *in-kernel Python function*, not a tool).
6. **Cached-main overlay** — promoted lessons from the cached main branch
(`/overlay`, §9).
7. **External memory (Aleph)** — large local data via the `aleph` skill.
**Why it helps weaker models.** The model never has to *guess* where a fact
should live or *re-derive* context it already established. Each layer has a
clear scope and a clear command to inspect/clear it, so stale context is
visible and removable rather than silently poisoning the prefix.
**Rough shape.** A `/context` dashboard (§9) renders all active layers and their
sizes; `/memory` manages the small user layer; `/overlay` manages promoted
lessons. The RLM layer already exists and is plumbed through `rlm.rs`.
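
One way to make every layer "visible, inspectable, clearable, and scoped" is a shared interface that the dashboard iterates over. The sketch below assumes a hypothetical `MemoryLayer` trait and a `UserMemory` example layer; neither exists in the codebase today, and the rendered format is only illustrative.

```rust
/// Hypothetical interface; the trait and its methods are illustrative.
pub trait MemoryLayer {
    /// Stable label shown in the dashboard.
    fn name(&self) -> &'static str;
    /// Approximate footprint, or `None` when the layer cannot report one.
    fn size_bytes(&self) -> Option<usize>;
    /// Drops everything the layer holds.
    fn clear(&mut self);
}

/// Example layer: the small user prefs/facts store.
pub struct UserMemory {
    pub facts: Vec<String>,
}

impl MemoryLayer for UserMemory {
    fn name(&self) -> &'static str {
        "user-memory"
    }

    fn size_bytes(&self) -> Option<usize> {
        Some(self.facts.iter().map(|f| f.len()).sum())
    }

    fn clear(&mut self) {
        self.facts.clear();
    }
}

/// One line per layer: padded name plus size, or a dash when unknown.
pub fn render_dashboard(layers: &[Box<dyn MemoryLayer>]) -> String {
    layers
        .iter()
        .map(|layer| {
            let size = match layer.size_bytes() {
                Some(bytes) => format!("{} B", bytes),
                None => "-".to_string(),
            };
            format!("{:<16} {}", layer.name(), size)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

With a shared trait, adding a layer does not require touching the dashboard, and `clear` gives each layer a uniform removal path.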
**Dependencies.** Command taxonomy (§9); codebase intelligence (§2); evaluation
loop (§8) for promotion into the overlay.
---
## 7. Tool Repair & Autoload
**What it is.** When the model emits a wrong, deferred, deprecated, or
environment-blocked tool call, the harness **repairs** it instead of returning a
bare error — and **autoloads** what's needed.
**Why it helps weaker models.** Recovery from a malformed call is precisely
where weak models loop or give up. Turning every failure into an actionable,
schema-bearing correction keeps the model on-task.
**Rough shape — representative repairs:**
- **Wrong/legacy name** → *"you meant `agent_eval`; here's the schema"* (autoload
the deferred tool's schema in the same turn).
- **Mode mismatch** → *"shell is unavailable in Plan mode — ask the user or
switch modes"*.
- **Missing dependency** → *"this tool needs Node; Node is missing"*
(dependency probe via `ExternalTool`, already imported in `tool_catalog.rs`).
- **Deprecated alias** → silently **routed to the canonical** tool, with the
replacement notice in **result metadata only** (§3) — never the cached prefix.
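
The repairs above can be sketched as a single lookup against the alias table. Everything named here (`Lifecycle`, `ToolEntry`, `Repair`, `repair`) is hypothetical, the lifecycle states are simplified, and the missing-dependency probe is left out to keep the example short.

```rust
use std::collections::HashMap;

/// Lifecycle states, simplified for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    Deprecated,
}

/// Hypothetical alias-table row, keyed by the name the model emitted.
#[derive(Debug, Clone)]
pub struct ToolEntry {
    pub canonical: String,
    pub lifecycle: Lifecycle,
    pub allowed_in_plan_mode: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Repair {
    /// Run the canonical tool; `notice` belongs in result metadata only.
    Proceed { tool: String, notice: Option<String> },
    /// Load the deferred tool's schema in the same turn, then retry.
    LoadSchema { tool: String },
    /// The call cannot run in the current mode.
    Blocked { reason: String },
    /// No table entry; fall back to a plain error.
    Unknown,
}

pub fn repair(
    called: &str,
    plan_mode: bool,
    table: &HashMap<String, ToolEntry>,
) -> Repair {
    let entry = match table.get(called) {
        Some(entry) => entry,
        None => return Repair::Unknown,
    };

    // Mode mismatch is checked first so a blocked call is never routed.
    if plan_mode && !entry.allowed_in_plan_mode {
        return Repair::Blocked {
            reason: format!(
                "`{}` is unavailable in Plan mode; ask the user or switch modes",
                entry.canonical
            ),
        };
    }

    match entry.lifecycle {
        Lifecycle::Active => Repair::Proceed {
            tool: entry.canonical.clone(),
            notice: None,
        },
        Lifecycle::Deferred => Repair::LoadSchema {
            tool: entry.canonical.clone(),
        },
        Lifecycle::Deprecated => Repair::Proceed {
            tool: entry.canonical.clone(),
            notice: Some(format!(
                "`{}` is deprecated; use `{}`",
                called, entry.canonical
            )),
        },
    }
}
```

Returning the notice as data rather than printing it leaves the caller free to attach it to result metadata and keep it out of the cached prefix.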
**Dependencies.** The alias table + lifecycle states (§3); the Intent Router
(§1); dependency detection (`ExternalTool`). Builds on PR #2685's actionable
RLM/field errors and PR #2684's lifecycle signals — **must not contradict
either**.
---
## 8. Evaluation Loop
**What it is.** Every workflow run **leaves evidence**: the tests it ran, the
diffs it produced, the failures it hit, the searches it issued, the claims it
verified, and the PR outcome. A **teacher/student replay** turns *good* traces
into reusable **rules, skills, tests, and cached guidance**.
**Why it helps weaker models.** The system gets better at *this repo* over time
without the model getting smarter. Verified good traces become rules/skills the
weaker model can lean on next time, and become the source of the cached-main
overlay (§6).
**Rough shape.** Workflow nodes (§4) emit structured evidence into the trace
store (§6). A replay/distillation pass (teacher reviews student trace) promotes
high-value traces into: repo rules (`/rules`), skills (§5), regression tests,
and overlay guidance (`/overlay`). Verified-claim tracking ties into the
adversarial-verification posture already used elsewhere.
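
As an illustration of the evidence a run might leave, the sketch below records a few of the fields listed above and filters for traces worth replaying. The `Trace` struct, `PrOutcome` enum, and the promotion bar in `is_promotable` are all assumptions; this note does not define what counts as a good trace.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrOutcome {
    Merged,
    ClosedUnmerged,
    StillOpen,
}

/// Hypothetical evidence record for one workflow run.
#[derive(Debug, Clone)]
pub struct Trace {
    pub tests_run: u32,
    pub tests_failed: u32,
    pub claims_made: u32,
    pub claims_verified: u32,
    pub searches: Vec<String>,
    pub pr_outcome: PrOutcome,
}

impl Trace {
    /// Assumed bar for a "good" trace: tests ran and passed, every claim
    /// was verified, and the PR merged.
    pub fn is_promotable(&self) -> bool {
        self.tests_run > 0
            && self.tests_failed == 0
            && self.claims_verified == self.claims_made
            && self.pr_outcome == PrOutcome::Merged
    }
}

/// Keeps only the traces worth handing to the teacher pass.
pub fn select_for_replay(traces: &[Trace]) -> Vec<&Trace> {
    traces.iter().filter(|t| t.is_promotable()).collect()
}
```

Filtering on recorded evidence keeps promotion a mechanical decision instead of a judgment the weaker model has to make.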
**Dependencies.** WhaleFlow (§4) for trace emission; trace store + overlay (§6);
Skills & Rules (§5) as promotion targets.
---
## 9. Command-Surface Taxonomy
**What it is.** One name = **one thing**. The command surface is split so each
prefix has a single, memorable responsibility:
| Surface | Responsibility |
|---|---|
| `/memory` | **Small** user prefs/facts only |
| `/context` | **Dashboard** of all active memory layers (§6) |
| `/rules` | Repo guidance |
| `.codewhale/constitution.json` | Repo constitution: checked-in **local law** |
| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) |
| `/overlay` | Promoted cached-main lessons (§6/§8) |
| `$<skill-name>` | Skill invocation — **the token *is* the skill name** |
| `codebase_search` | Concept-level code retrieval (§2) |
The repo constitution is not another memory bucket. It is the local-law layer in
a layered authority model:
```
base myth / global Constitution
-> repo constitution (.codewhale/constitution.json)
-> task packet
-> runtime policy
```
At conflict time, the **current user request for the task remains above the repo
constitution**; the repo constitution supplies durable defaults and local law
only when the active task packet and runtime policy leave room. Runtime policy is
the compiled enforcement surface for the run, not a separate place for the model
to invent new rules.
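
A rough sketch of how runtime policy could be compiled from the layers above it. It assumes rules are simple key/value pairs and that the more specific layer wins on conflict; `Rules` and `compile_runtime_policy` are hypothetical names, and whether any layer may override the global Constitution at all is not settled by this note.

```rust
use std::collections::BTreeMap;

/// Assumed representation: one setting name mapped to one value.
pub type Rules = BTreeMap<String, String>;

/// Merges layers from least to most specific; later layers win on conflict.
/// The result is the only surface the executor consults at run time.
pub fn compile_runtime_policy(global: &Rules, repo: &Rules, task: &Rules) -> Rules {
    let mut policy = Rules::new();
    for layer in [global, repo, task] {
        for (key, value) in layer {
            policy.insert(key.clone(), value.clone());
        }
    }
    policy
}
```

Under this reading, a repo constitution entry survives only when the task packet is silent on the same key, which matches the "leave room" wording above.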
**Why it helps weaker models (and users).** No overloaded command does five
jobs; the model/user never has to disambiguate *which* `/memory` behavior they
meant. `$systematic-debugging` self-documents what it invokes.
**`/memory` subcommand sketch:**
```
/memory add "<fact>" # store a small pref/fact
/memory edit # edit stored facts
/memory search <query> # find a stored fact
/memory clear # clear user memory
/memory doctor # health check; detects legacy ~/.deepseek path
/memory promote <fact> # (later) promote a fact to a higher layer
```
`doctor` specifically detects the **legacy `~/.deepseek`** path and guides
migration.
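
The subcommand sketch maps naturally onto a small parsed enum. The following is a minimal sketch, assuming hypothetical names (`MemoryCommand`, `parse_memory_command`) and assuming that surrounding double quotes on an argument are optional and stripped.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MemoryCommand {
    Add(String),
    Edit,
    Search(String),
    Clear,
    Doctor,
    Promote(String),
}

/// Parses one `/memory ...` line into a typed command.
pub fn parse_memory_command(line: &str) -> Result<MemoryCommand, String> {
    let rest = line
        .trim()
        .strip_prefix("/memory")
        .ok_or_else(|| "not a /memory command".to_string())?
        .trim();

    // Split the verb from the remainder at the first whitespace.
    let (verb, arg) = match rest.split_once(char::is_whitespace) {
        Some((verb, arg)) => (verb, arg.trim()),
        None => (rest, ""),
    };
    // Assumed convention: quotes around the argument are optional.
    let arg = arg.trim_matches('"');

    match (verb, arg.is_empty()) {
        ("add", false) => Ok(MemoryCommand::Add(arg.to_string())),
        ("search", false) => Ok(MemoryCommand::Search(arg.to_string())),
        ("promote", false) => Ok(MemoryCommand::Promote(arg.to_string())),
        ("edit", true) => Ok(MemoryCommand::Edit),
        ("clear", true) => Ok(MemoryCommand::Clear),
        ("doctor", true) => Ok(MemoryCommand::Doctor),
        ("add" | "search" | "promote", true) => {
            Err(format!("`{}` needs an argument", verb))
        }
        ("edit" | "clear" | "doctor", false) => {
            Err(format!("`{}` takes no argument", verb))
        }
        _ => Err(format!("unknown subcommand `{}`", verb)),
    }
}
```

Rejecting stray arguments on `edit`, `clear`, and `doctor` keeps each subcommand to one meaning, in line with the one-name-one-thing rule.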
**`$<skill-name>` invocation examples:**
```
$systematic-debugging # local skill
$github:gh-fix-ci # namespaced skill
```
The submit-time parser (to be added; submit path `ui.rs ~4721`) recognizes the
`$` token and activates the named skill (§5).
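
A minimal sketch of what the submit-time parser might do, assuming a token grammar of `$` followed by ASCII letters, digits, `-`, `_`, or `:` and starting with a letter. The function name `parse_skill_mentions` and the grammar itself are assumptions; the real parser is still to be designed.

```rust
/// Extracts `$<skill-name>` tokens from a submitted message, in order,
/// without duplicates. Only tokens at the start of a word are recognized.
pub fn parse_skill_mentions(input: &str) -> Vec<String> {
    let mut mentions: Vec<String> = Vec::new();
    for word in input.split_whitespace() {
        if let Some(rest) = word.strip_prefix('$') {
            // Take the longest run of allowed characters after the `$`.
            let raw: String = rest
                .chars()
                .take_while(|c| {
                    c.is_ascii_alphanumeric() || matches!(*c, '-' | '_' | ':')
                })
                .collect();
            // A trailing `:` is punctuation, not part of a namespace.
            let name = raw.trim_end_matches(':').to_string();
            // Skip a bare `$` and amounts such as `$5`.
            let starts_with_letter = name
                .chars()
                .next()
                .map_or(false, |c| c.is_ascii_alphabetic());
            if starts_with_letter && !mentions.contains(&name) {
                mentions.push(name);
            }
        }
    }
    mentions
}
```

Requiring a leading letter avoids treating prices or shell-style numeric variables in ordinary prose as skill invocations.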
**`/context` layers dashboard (example render):**
```
/context
user-memory ▸ 7 facts (12 KB) [clear]
repo-constitution ▸ .codewhale/constitution.json (4 KB) [view]
repo-rules ▸ CLAUDE.md, AGENTS.md (8 KB) [view]
codemap-wiki ▸ 412 symbols indexed (auto) [rebuild]
trace-store ▸ 3 recent workflow runs (—) [open]
rlm-memo ▸ 0 active sessions (—) [—]
cached-overlay ▸ 5 promoted lessons (3 KB) [view]
aleph-external ▸ not attached (—) [attach]
```
**Dependencies.** Memory stack (§6); skills (§5); codebase intelligence (§2);
workflow runner (§4).
---
## 10. Deferred-Not-Done 0.8.53 Diet Items
Recorded here so they are **not silently dropped** — these were considered for
the 0.8.53 diet and deliberately **deferred** (design-only or out of scope this
cycle):
- **File-mutation overload** — `apply_patch` / `edit_file` / `write_file` /
`fim_edit` overlap in purpose. Per #2681 non-goals these stay distinct;
canonical *guidance* (prefer `apply_patch`) is doc-only, no consolidation
this cycle.
- **`task_shell_*` / `exec_*` redundancy** — `task_shell_start` /
`task_shell_wait` overlap conceptually with the `exec_*` family. Left intact
this cycle (distinct niche per #2681); revisit under §1/§3.
- **`handle_read` / `retrieve_tool_result`** — result-handle plumbing kept as-is
(doc-only canonical guidance); folds naturally into the memory stack (§6) and
intent routing (§1) later.
- **Search-cluster consolidation** — `grep_files` / `file_search` /
`project_map` remain three tools this cycle; consolidation is the *job of the
hybrid index* (§2) under `codebase_search`, not a catalog edit in 0.8.53.
---
## Phased Roadmap
### 0.8.53 — design + small fixes only
- **Code:** only the already-scoped, narrow fixes — PR #2684 (subagent role
vocab, lifecycle signals, eval ergonomics) and PR #2685 (read-only git history
active + actionable RLM/field errors). Subagent legacy-name cleanup +
guardrail tests rebased on #2684.
- **Docs:** this north star, plus `docs/TOOL_LIFECYCLE.md`,
`docs/CODEBASE_SEARCH_DESIGN.md`, `docs/SKILL_INVOCATION_DESIGN.md`.
- **No tool-catalog code:** the diet (§3), the Intent Router (§1), and the
hybrid index (§2) are **documented, not coded** this cycle.
### 0.9.0 — first structural moves
- Implement the **tool lifecycle** const name-sets + alias table in
`tool_catalog.rs` (§3) as a one-time deterministic head edit.
- Land the **planned diet**: `exec_wait`/`exec_interact`/`tts` move to
hidden-compatibility; `todo_*` is deprecated in favor of `checklist_*`
(result-metadata notice only).
- Gate **`tool_agent`** to DeepSeek-V4 models only (§3).
- First version of the **default hybrid codebase index** (FTS5/BM25 + symbol +
codemap) behind `codebase_search` (§2).
- First **Intent Router** verbs mapping onto today's tools (§1).
- **Tool Repair** for deferred/deprecated/mode/dependency cases (§7).
### Later (post-0.9.0)
- **WhaleFlow** typed-IR workflow runner (§4) and the **evaluation loop** /
teacher-student replay (§8).
- **Skills activation modes** + tool restriction + `$<skill-name>` submit-time
activation (§5).
- Full **Context Memory Stack** with `/context` dashboard, `/overlay`
promotion, and Aleph external memory (§6).
- Dense/semantic retrieval and PR/commit/issue history in the index (§2).
- Search-cluster consolidation and the remaining §10 deferred items.
---
## North-star one-liner
> **The harness handles memory, search, routing, state, and guardrails — so a
> weaker model can just think.**