docs: v0.8.53 tool-surface-diet design + north-star direction
Design-only deliverables for the v0.8.53 "tool surface diet / canonical surfaces" cutover (no catalog code in this cycle). Grounded in a verified inventory of the actual tool registry.

- docs/TOOL_LIFECYCLE.md (#2681): the umbrella policy. Five lifecycle states (active / deferred / hidden-compatibility / deprecated / removed) modeled as const name-sets + an alias table in tool_catalog.rs (not a per-ToolSpec field), so registration stays untouched and old transcripts always replay. Includes the deprecation manifest (exec_wait/exec_interact/tts → hidden-compat; todo_* → checklist_* deprecated; 11 legacy subagent names are already non-visible dead code → cleanup + guardrail), per-mode/per-provider active-catalog budget (incl. Arcee's 8-tool first-turn set), prefix-cache safety rules, and the tool_agent decision: canonical but DeepSeek-V4-gated.
- docs/CODEBASE_SEARCH_DESIGN.md (#2680, v0.9.0): local-first FTS5/BM25 + symbol/path ranking + RRF hybrid; rusqlite storage; mtime/branch/vendor invalidation; an explainable tool contract returning reasons[]; and a real CodeWhale query eval set. Complements grep_files/file_search, never replaces them.
- docs/SKILL_INVOCATION_DESIGN.md (0.9.0): the $<skill-name> inline invocation syntax (the token IS the skill name), namespaced resolution, ambiguity-suggests-not-guesses, visible activation line, and a smallest-viable slice.
- docs/VISION_NORTH_STAR.md (0.9.0+): intent router, hybrid codebase intelligence, WhaleFlow typed workflow IR, skills/rules runtime, the layered context-memory stack, tool repair/autoload, the evaluation loop, and the command-surface taxonomy (/memory small · /context dashboard · /rules · /workflow · /overlay · $<skill> · codebase_search). Marked DIRECTION, not committed 0.8.53 work; also records the deferred-not-done diet items.

Targets codex/v0.8.53.
@@ -0,0 +1,300 @@
|
||||
# `codebase_search` — Local-First Semantic Code Retrieval
|
||||
|
||||
> **Status:** Design note + eval scaffold. **Code is DEFERRED.**
|
||||
> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle).
|
||||
> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.
|
||||
|
||||
This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem
|
||||
|
||||
Today CodeWhale ships two complementary code-locating tools and one structure map:
|
||||
|
||||
- `file_search` — **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`).
|
||||
- `grep_files` — **content** search (literal/regex token match).
|
||||
- `project_map` — a deferred **structure** map.
|
||||
|
||||
None of these answer **concept-level** questions where the user does not know the exact token:
|
||||
|
||||
- "Where is provider auth resolved?"
|
||||
- "What enforces the shell approval policy?"
|
||||
- "Where do mode prompts get assembled?"
|
||||
- "How does the subagent lifecycle close out a child?"
|
||||
|
||||
`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.
|
||||
|
||||
**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations.
|
||||
|
||||
**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.)
|
||||
|
||||
There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies.
|
||||
|
||||
---
|
||||
|
||||
## 2. Approach Comparison
|
||||
|
||||
| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes (SQLite built in via `rusqlite`) | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** |
| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** |
| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| **Dense embeddings** | vector of chunk semantics | Optional (embedding model needed) | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |
|
||||
|
||||
### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)
|
||||
|
||||
Each enabled signal produces an independent ranked list; results are merged with RRF (`score(d) = Σ_signals 1/(k + rank_signal(d))`, conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
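The fusion step can be sketched as follows. This is a minimal illustration under assumptions, not the planned implementation: the function name `rrf_fuse`, the use of workspace-relative paths as document keys, and the alphabetical tie-break are all hypothetical choices made for the example.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each inner Vec is one signal's ranking, best first; ranks are 1-based.
/// (Hypothetical helper; names and key type are assumptions.)
fn rrf_fuse(signals: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in signals {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            // score(d) = sum over signals of 1 / (k + rank_signal(d))
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by key so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document that a signal does not rank simply contributes nothing for that signal, which is why lists of different lengths and score scales can be fused without normalization.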
|
||||
|
||||
**v0.9.0 Phase 1 signal set (all local, no model downloads):**
|
||||
|
||||
1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer.
|
||||
2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms.
|
||||
3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`).
|
||||
4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
|
||||
5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries.
|
||||
|
||||
**Optional later backends (feature-flagged, off by default):**
|
||||
|
||||
- `--features sparse-splade` — adds a SPLADE signal list to the RRF.
|
||||
- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
|
||||
- `--features rerank` — cross-encoder reranks the fused top-K.
|
||||
|
||||
Phase 1 deliberately omits all three ML backends so the tool ships with zero network/model dependency and is reproducible in CI.
|
||||
|
||||
---
|
||||
|
||||
## 3. Storage & Index
|
||||
|
||||
### Location
|
||||
|
||||
```
~/.codewhale/index/<workspace-hash>.db
```
|
||||
|
||||
`<workspace-hash>` is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep).
|
||||
|
||||
> Migration note (ties to the `/memory doctor` taxonomy in Appendix B): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored (a future `doctor` may offer to migrate it, but it is never auto-read).
|
||||
|
||||
### Schema sketch
|
||||
|
||||
```sql
CREATE TABLE files (
  id           INTEGER PRIMARY KEY,
  path         TEXT NOT NULL UNIQUE,   -- workspace-relative
  mtime_ns     INTEGER NOT NULL,       -- invalidation
  size_bytes   INTEGER NOT NULL,
  content_hash TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
  lang         TEXT,                   -- detected language
  branch       TEXT                    -- branch at last index (invalidation)
);

CREATE TABLE chunks (
  id         INTEGER PRIMARY KEY,
  file_id    INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  start_line INTEGER NOT NULL,
  end_line   INTEGER NOT NULL,
  kind       TEXT,                     -- fn | struct | impl | const | doc | block
  symbol     TEXT,                     -- primary symbol name if any
  text       TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
);

-- Lexical index. External-content FTS so chunk bodies are not stored twice.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  text,
  symbol,
  content='chunks',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 2'  -- + identifier pre-split at index time
);

CREATE TABLE symbols (
  id       INTEGER PRIMARY KEY,
  file_id  INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
  name     TEXT NOT NULL,
  kind     TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
  line     INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);

-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
  path        TEXT PRIMARY KEY,
  last_touch  INTEGER NOT NULL,        -- unix ns
  touch_count INTEGER NOT NULL DEFAULT 1
);
```
|
||||
|
||||
Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags.
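A minimal sketch of the index-time identifier split, using only the standard library. The function name `split_identifier` and the exact boundary rules (a lowercase-to-uppercase transition, an acronym followed by a capitalized word, any non-alphanumeric separator) are assumptions for illustration; the real tokenizer may choose different rules.

```rust
/// Split a camelCase / snake_case identifier into lowercase, space-separated words.
/// (Hypothetical helper; boundary rules are assumptions.)
fn split_identifier(ident: &str) -> String {
    let chars: Vec<char> = ident.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separators such as '_' or '-' end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        let prev_lower = i > 0 && (chars[i - 1].is_lowercase() || chars[i - 1].is_ascii_digit());
        let prev_upper = i > 0 && chars[i - 1].is_uppercase();
        let next_lower = i + 1 < chars.len() && chars[i + 1].is_lowercase();
        // Break before an uppercase letter that starts a new word:
        // "resolveApi" -> resolve | Api, "HTTPServer" -> HTTP | Server.
        if c.is_uppercase() && (prev_lower || (prev_upper && next_lower)) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Because the split text is what gets written into the FTS `text` column, a query for `api key` can match both `resolveApiKey` and `resolve_api_key` with an ordinary MATCH.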
|
||||
|
||||
### Chunking strategy (structure-aware)
|
||||
|
||||
Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
|
||||
|
||||
### Invalidation
|
||||
|
||||
- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged.
|
||||
- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
|
||||
- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift.
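The mtime/hash/branch rules above could be expressed as a small decision function. This is a sketch under assumptions: `FileRecord`, `Refresh`, and `plan_refresh` are hypothetical names, and treating an unchanged mtime on the same branch as a cheap "skip without hashing" shortcut is one possible reading of the rules, not a confirmed design.

```rust
/// Stored index row for one file (mirrors the `files` columns used for invalidation).
struct FileRecord {
    mtime_ns: i64,
    content_hash: String,
    branch: String,
}

/// What the refresh pass should do with a file. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum Refresh {
    Skip,           // nothing changed
    UpdateMetadata, // same bytes; only record the new mtime/branch
    Rechunk,        // content changed or file is new
}

/// `hash_now` is lazy so the file is only hashed when the cheap checks fail.
fn plan_refresh(
    stored: Option<&FileRecord>,
    mtime_ns: i64,
    branch: &str,
    hash_now: impl FnOnce() -> String,
) -> Refresh {
    let rec = match stored {
        Some(rec) => rec,
        None => return Refresh::Rechunk, // never indexed
    };
    // Assumption: same mtime on the same branch is trusted without hashing.
    if rec.mtime_ns == mtime_ns && rec.branch == branch {
        return Refresh::Skip;
    }
    // mtime or branch moved: the content hash decides whether to re-chunk.
    if rec.content_hash == hash_now() {
        Refresh::UpdateMetadata
    } else {
        Refresh::Rechunk
    }
}
```

This is why a branch switch stays cheap: most files hash to the same value and only their metadata row is touched.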
|
||||
|
||||
### Privacy / trust
|
||||
|
||||
- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine.
|
||||
- **No cloud by default.** Phase 1 has zero network dependency.
|
||||
- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.
|
||||
|
||||
---
|
||||
|
||||
## 4. Model-Visible Tool Contract
|
||||
|
||||
```jsonc
// codebase_search
{
  "name": "codebase_search",
  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
  "parameters": {
    "query":       { "type": "string", "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
    "max_results": { "type": "integer", "default": 10 },
    "path_glob":   { "type": "string", "description": "Optional path filter, e.g. 'crates/tui/**'." },
    "lang":        { "type": "string", "description": "Optional language filter." },
    "kind":        { "type": "string", "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
  }
}
```
|
||||
|
||||
**Result shape — ranked, explainable, auditable:**
|
||||
|
||||
```jsonc
{
  "results": [
    {
      "path": "crates/tui/src/config/provider.rs",
      "line": 142,
      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
      "score": 0.91,
      "reasons": [
        "symbol: resolve_api_key matches 'auth/resolve'",
        "lexical: matched tokens [provider, api, key, resolve]",
        "path: component 'provider' matches query",
        "session: file read 2 turns ago"
      ]
    }
  ],
  "backend": "lexical+symbol+path+session",  // which signals were fused (RRF)
  "fallback_grep_hits": 1                    // exact-match hits folded in
}
```
|
||||
|
||||
`reasons[]` is **mandatory** and is the auditability contract: every result explains *why* it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The `backend` field records which signals were active so results are reproducible given the feature set.
|
||||
|
||||
---
|
||||
|
||||
## 5. Benchmark / Eval Set
|
||||
|
||||
A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line.
|
||||
|
||||
| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 |
| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 |
| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~421; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` ≡ `tts` :787-792 |
|
||||
|
||||
Each entry is a `(query, expected_paths[])` row in a fixture (e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). Phase 1 ships the harness that runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 ≥ target) gates future ranking changes.
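The two metrics the harness reports can be sketched as below. This assumes file-level matching as stated above; the function names and the in-memory slice inputs are hypothetical (the real harness would read the JSONL fixture and query the live index).

```rust
/// recall@k for one query: fraction of expected paths found in the top k results.
fn recall_at_k(ranked: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &ranked[..ranked.len().min(k)];
    let hits = expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / expected.len() as f64
}

/// Reciprocal rank for one query: 1 / (1-based rank of the first expected path),
/// or 0 if no expected path appears at all.
fn reciprocal_rank(ranked: &[&str], expected: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|p| expected.contains(p))
        .map_or(0.0, |i| 1.0 / (i + 1) as f64)
}

/// Mean reciprocal rank over (ranked results, expected paths) pairs.
fn mrr(runs: &[(Vec<&str>, Vec<&str>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs.iter().map(|(r, e)| reciprocal_rank(r, e)).sum();
    total / runs.len() as f64
}
```

A regression gate would then compare the mean recall@10 across all fixture rows against the chosen target before accepting a ranking change.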
|
||||
|
||||
---
|
||||
|
||||
## 6. Phasing, Feature Flags, and Non-Goals
|
||||
|
||||
### Phasing
|
||||
|
||||
- **Phase 0 (this cycle, v0.8.53):** this design note + eval fixture only. No catalog code.
|
||||
- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/<workspace-hash>.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
|
||||
- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
|
||||
- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
|
||||
- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K.
|
||||
|
||||
### Feature flags
|
||||
|
||||
```
codebase-search-core   # Phase 1, default-on once it lands
sparse-splade          # Phase 3, default-off
dense-embed            # Phase 3, default-off (gated HF download)
rerank                 # Phase 4, default-off
```
|
||||
|
||||
### Non-goals
|
||||
|
||||
- **No cloud index is required** for the core experience; Phase 1 never uses one.
|
||||
- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback.
|
||||
- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.
|
||||
|
||||
### Cross-link: WhaleFlow epic
|
||||
|
||||
`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).
|
||||
|
||||
---
|
||||
|
||||
## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)
|
||||
|
||||
These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.
|
||||
|
||||
### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)
|
||||
|
||||
| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| **active** | yes | yes | yes | no |
| **deferred** | no | yes | yes | no |
| **hidden-compatibility** | no | no | yes | no |
| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) |
| **removed** | no | no | no | n/a |
|
||||
|
||||
Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant).
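One way to picture "const name-sets + an alias table" is the sketch below. Everything here is illustrative: the constant names, the set contents beyond the tools named in this appendix, and the `checklist_add`/`checklist_update`/`checklist_list` replacement names (inferred from the `todo_*` → `checklist_*` mapping) are assumptions, not the actual contents of `tool_catalog.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Lifecycle { Active, Deferred, HiddenCompat, Deprecated, Removed }

// Hypothetical name-sets; the real lists would live in tool_catalog.rs.
const ACTIVE: &[&str] = &["checklist_write"];
const DEFERRED: &[&str] = &["project_map"];
const HIDDEN_COMPAT: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];

// Alias table: old name -> canonical name (checklist_* targets are assumed).
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
    ("todo_write", "checklist_write"),
    ("todo_add", "checklist_add"),
    ("todo_update", "checklist_update"),
    ("todo_list", "checklist_list"),
];

fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    if ACTIVE.contains(&name) { Some(Lifecycle::Active) }
    else if DEFERRED.contains(&name) { Some(Lifecycle::Deferred) }
    else if HIDDEN_COMPAT.contains(&name) { Some(Lifecycle::HiddenCompat) }
    else if DEPRECATED.contains(&name) { Some(Lifecycle::Deprecated) }
    else { None } // not listed in any set in this sketch
}

fn canonical_name(name: &str) -> &str {
    ALIASES.iter().find(|(old, _)| *old == name).map_or(name, |(_, new)| *new)
}

fn in_tool_search(state: Lifecycle) -> bool {
    matches!(state, Lifecycle::Active | Lifecycle::Deferred)
}

fn is_dispatchable(state: Lifecycle) -> bool {
    state != Lifecycle::Removed
}

/// Replacement notice for result metadata only; never part of the cached prefix.
fn replacement_notice(name: &str) -> Option<String> {
    match lifecycle_of(name) {
        Some(Lifecycle::Deprecated) => Some(format!("deprecated: use {}", canonical_name(name))),
        _ => None,
    }
}
```

Because the state lives in lookup sets rather than on each `ToolSpec`, registration and dispatch are untouched, which is what lets old transcripts keep replaying.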
|
||||
|
||||
### Planned diet (documented, not yet coded)
|
||||
|
||||
- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools:
  - `exec_wait` ≡ `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`.
  - `exec_interact` ≡ `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`).
  - `tts` ≡ `speech` (same `SpeechTool`, `registry.rs:787-792`).
  - Action: drop from active + search, keep registered, identical behavior, **no notice**.
|
||||
- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only).
|
||||
- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired.
|
||||
|
||||
### Model-visible subagent surface (unchanged)
|
||||
|
||||
Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`).
|
||||
|
||||
- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.
|
||||
|
||||
### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)
|
||||
|
||||
`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is.
|
||||
|
||||
---
|
||||
|
||||
## Appendix B — Command-Surface Taxonomy (context)
|
||||
|
||||
Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces:
|
||||
|
||||
- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path).
|
||||
- `/context` — dashboard of all active layers.
|
||||
- `/rules` — repo guidance.
|
||||
- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic).
|
||||
- `/overlay` — promoted cached-main lessons.
|
||||
- `$<skill-name>` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`).
|
||||
- `codebase_search` — concept-level code retrieval (this document).
|
||||
@@ -0,0 +1,233 @@
|
||||
# Skill Invocation Design — the `$<skill-name>` inline syntax
|
||||
|
||||
Status: **DESIGN ONLY** (v0.8.53 cycle). No catalog/parser code ships in this
|
||||
cycle; the implementation target is **0.9.0**. This document describes what
|
||||
*will* be built and the contracts it must honor against today's code.
|
||||
|
||||
Related design docs: `TOOL_LIFECYCLE.md` (tool lifecycle states + per-skill tool
|
||||
restriction), command-surface taxonomy notes for `/memory`, `/context`,
|
||||
`/rules`, `/workflow` (`/whaleflow`), `/overlay`. Open PRs on `codex/v0.8.53`:
|
||||
#2684 (subagent role vocab / lifecycle signals / eval ergonomics) and #2685
|
||||
(git history active + RLM/field errors). Nothing here contradicts those.
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem
|
||||
|
||||
Skill activation has no single, model-legible entry point, and the candidate
|
||||
surfaces all compete with each other:
|
||||
|
||||
- A `/skill` slash command, a `load_skill`-style tool, plugin/namespace naming
|
||||
(`superpowers:systematic-debugging`, `github:gh-fix-ci`), and the long-running
|
||||
workflow commands (`/workflow` / `/whaleflow`) all *could* be "the way you
|
||||
start a skill." None of them is canonical.
|
||||
- Slash commands are already overloaded. `/memory`, `/context`, `/rules`,
|
||||
`/config`, `/provider`, `/workflow`, `/overlay` each map to one subsystem;
|
||||
jamming skill invocation into `/`-space forces a weaker model to disambiguate
|
||||
"is this a command or a skill?" on every keystroke.
|
||||
- Weaker / smaller models (the cheaper providers CodeWhale targets) do not
|
||||
reliably pick the right mechanism. They will free-text "let me use systematic
|
||||
debugging" instead of actually loading the skill body, so the guidance never
|
||||
enters the context window.
|
||||
- Today there is **no parser that activates an inline skill mention on submit.**
|
||||
`slash_menu.rs:86` (`partial_inline_skill_mention_at_cursor`) recognizes an
|
||||
inline `/<skill>` token *under the cursor for popup purposes only*; the submit
|
||||
path in `ui.rs:4721` (`build_queued_message`) does not resolve or activate any
|
||||
inline mention. There is also no activation-mode concept (always-on / glob /
|
||||
model-decision / manual) and skills cannot restrict tools yet.
|
||||
|
||||
We need one prefix that means exactly "invoke this skill," is visually distinct
|
||||
from commands, and is cheap for a small model to emit correctly.
|
||||
|
||||
---
|
||||
|
||||
## 2. Proposal
|
||||
|
||||
Adopt **`$` as the skill-invocation prefix**, where **the token *is* the skill
|
||||
name** — not a literal command called `$skill`.
|
||||
|
||||
```
$systematic-debugging figure out why MiMo auth fails
$test-driven-development add coverage before fixing
$github:gh-fix-ci inspect the failing checks
$aleph search the planning doc
```
|
||||
|
||||
The leading `$` is the marker; everything from `$` up to the next whitespace is
|
||||
the **skill id**. The rest of the line is the user's request, passed through to
|
||||
the model with the skill body already loaded as active guidance.
|
||||
|
||||
This is deliberately a *reference / macro* sigil, like a shell variable
|
||||
expansion or an `@mention`: `$skill-id` resolves to "the contents and tool
|
||||
policy of that skill," then the surrounding prose is the task.
|
||||
|
||||
`$` works in three places (see §4): the user composer, the command-palette
|
||||
input, and **model-facing planning text** — so the model itself can write
|
||||
`$systematic-debugging` in its plan and have it resolve.
|
||||
|
||||
---
|
||||
|
||||
## 3. Resolution rules
|
||||
|
||||
Given a token `$<id>` (id captured up to the next whitespace):
|
||||
|
||||
1. **Exact name first.** Look the id up directly:
|
||||
`discover_in_workspace(workspace).get(id)` — `skills/mod.rs:553` builds the
|
||||
registry; `SkillRegistry::get` (`skills/mod.rs:421`) matches on `s.name == id`
|
||||
exactly. Skill names come from frontmatter `name:` (or the first `# Heading`
|
||||
fallback) parsed at `skills/mod.rs:382-417`. An exact hit wins unconditionally.
|
||||
|
||||
2. **Namespaced `$ns:skill`.** If the id contains a `:`, treat the part before
|
||||
the colon as a source/plugin namespace and the part after as the skill name:
|
||||
`$github:gh-fix-ci`, `$superpowers:systematic-debugging`. Namespaced ids are
|
||||
the disambiguation handle a user is told to type when a bare id is ambiguous.
|
||||
(Glob/wildcard namespacing — `$github:*` — is explicitly deferred, see §6.)
|
||||
|
||||
3. **Fuzzy match *suggests*, never silently chooses.** If there is no exact (or
|
||||
namespaced-exact) hit, run a case-insensitive substring / prefix match over
|
||||
`SkillRegistry::list()` (`skills/mod.rs:426`). If exactly one skill matches,
|
||||
surface it as a suggestion ("did you mean `$systematic-debugging`?") but do
|
||||
**not** auto-activate it. If more than one matches, list them and require the
|
||||
user/model to re-issue with a disambiguated id (§7). Ambiguity never resolves
|
||||
to a silent pick.
|
||||
|
||||
4. **Respect enable-state.** A resolved skill is only activated if
|
||||
`SkillStateStore::is_enabled(id)` is true (`skill_state.rs:73`:
|
||||
`!self.disabled.contains(skill_name)`). A disabled skill that resolves by
|
||||
name produces a clear "skill is disabled; enable it with `/skill enable <id>`"
|
||||
message rather than silently activating or silently doing nothing.
|
||||
|
||||
Resolution order is therefore: **exact → namespaced-exact → enabled-check →
|
||||
fuzzy-suggest (never auto-pick).**
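The resolution order can be sketched as a pure function. This is an illustration under assumptions: `Resolution` and `resolve_skill` are hypothetical names, the registry is modeled as a plain slice of skill names, and namespaced ids such as `github:gh-fix-ci` are assumed to be stored verbatim so that namespaced-exact lookup reduces to exact lookup.

```rust
#[derive(Debug, PartialEq)]
enum Resolution {
    Activate(String),
    Disabled(String),
    Suggest(Vec<String>), // one or more fuzzy candidates; never auto-activated
    NotFound,
}

/// `skills` stands in for the registry's names, `disabled` for the state store.
fn resolve_skill(id: &str, skills: &[&str], disabled: &[&str]) -> Resolution {
    // Exact (or namespaced-exact) hit wins, subject to the enabled check.
    if skills.contains(&id) {
        return if disabled.contains(&id) {
            Resolution::Disabled(id.to_string())
        } else {
            Resolution::Activate(id.to_string())
        };
    }
    // No exact hit: case-insensitive substring match, surfaced as suggestions only.
    let needle = id.to_lowercase();
    let mut candidates: Vec<String> = skills
        .iter()
        .filter(|s| s.to_lowercase().contains(&needle))
        .map(|s| s.to_string())
        .collect();
    candidates.sort();
    if candidates.is_empty() {
        Resolution::NotFound
    } else {
        Resolution::Suggest(candidates)
    }
}
```

Note that a single fuzzy candidate still comes back as `Suggest`, never `Activate`, which is the "suggests, never silently chooses" rule expressed in the return type.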
|
||||
|
||||
---
|
||||
|
||||
## 4. Behavior
|
||||
|
||||
When a `$<id>` mention resolves and is enabled:
|
||||
|
||||
- **Visible activation line.** The transcript shows `Using skill: <name>` so the
|
||||
user can see which skill body entered context. (Mirrors the existing skill UX
|
||||
vocabulary; one line per activated skill.)
|
||||
- **Body loaded as active guidance.** The skill's `body`
|
||||
(`skills/mod.rs` `Skill.body`) is injected into the turn as authoritative
|
||||
guidance, the same content a `/skill`-style activation would load. The user's
|
||||
trailing prose is the task the guidance applies to.
|
||||
- **Tool-surface narrowing (when declared).** If the skill declares a set of
|
||||
allowed tools, the active tool surface narrows to that set for the duration of
|
||||
the skill's influence. **Per-skill tool restriction is net-new** — skills
|
||||
cannot restrict tools today; the mechanism, and how narrowing interacts with
|
||||
the catalog-head byte-stability invariant (`tool_catalog.rs:169-196`), is
|
||||
specified in `TOOL_LIFECYCLE.md`. Until that lands, a declared tool list is
|
||||
parsed and shown but not enforced.
|
||||
- **Multiple `$mentions` compose explicitly, or prompt.** Until formal
|
||||
composition rules exist, two or more `$mentions` in one message either compose
|
||||
only when the rule is unambiguous (e.g. one guidance skill + one tool-scoping
|
||||
skill) or return a **"choose one"** prompt listing the mentioned skills. We
|
||||
never silently activate multiple complex skills at once (see §7 and Non-goals).
|
||||
- **Three input surfaces.** Resolution runs for: (a) user prompts in the
|
||||
composer, (b) command-palette input, and (c) model-facing planning text, so a
|
||||
model that writes `$test-driven-development` in its plan triggers the same
|
||||
activation path a human would.
|
||||
- **Slash commands remain supported.** `/skill ...` and the rest of the slash
|
||||
surface keep working unchanged. `$` is the *preferred* path for models because
|
||||
it is one token and unambiguous, but it is additive, not a replacement (§7
|
||||
Non-goals).
|
||||
|
||||
---
|
||||
|
||||
## 5. Why `$`

- **Visually distinct from `/commands`.** A glance separates "run a subsystem
  command" (`/memory`, `/context`, `/workflow`) from "load a skill" (`$aleph`).
  Weaker models stop confusing the two surfaces.
- **Reads like a reference / macro.** `$name` already means "expand this named
  thing" to anyone who has touched a shell or a templating language. Skill
  invocation *is* an expansion: `$skill-id` → that skill's guidance + tool policy.
- **Avoids overloading the slash namespace.** `/workflow`, `/memory`, `/config`,
  `/provider`, `/rules`, `/overlay`, `/context` each already own one meaning in
  the command-surface taxonomy. Skills get their own sigil instead of a crowded
  `/skill <name>` subcommand competing with all of them.
- **Easy to type and remember.** Single leading character, then the literal
  skill name. Nothing to memorize beyond the skill ids the user already sees in
  `/skill list`.

---

## 6. Implementation plan (smallest viable 0.8.53-ready slice → 0.9.0)

The 0.8.53 cycle is **docs only**. The plan below is the build order once code
is unblocked; the first slice is intentionally the minimum that proves the path.

**Slice 1: token scanner at submit (the minimum viable feature).**
- Add a `$<skill-id>` token scanner invoked on submit, **before**
  `build_queued_message` runs (`ui.rs:4721`). The scanner finds leading-`$`
  tokens, captures the id up to the next whitespace, and hands each id to the
  resolver. The scanner must skip `$` occurrences inside code fences and inline
  command strings (see Non-goals) so shell `$VAR` references are never treated as
  skill mentions.
- Resolve via `discover_in_workspace(workspace).get(id)` (`skills/mod.rs:553` /
  `:421`), gate on `SkillStateStore::is_enabled` (`skill_state.rs:73`), and emit
  the `Using skill: <name>` line plus the loaded body.

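
The Slice 1 scanner could look roughly like the sketch below. It is a minimal illustration, not the planned implementation: `scan_skill_mentions` and `is_plausible_skill_id` are hypothetical names, the accepted id alphabet (ASCII letters, digits, `-`, `_`, `:`) is an assumption, and "command strings" are approximated here as inline backtick spans.

```rust
/// Three backticks, written with escapes so this sketch can sit inside a
/// Markdown fence without closing it.
const FENCE: &str = "\x60\x60\x60";

/// Collect `$<skill-id>` ids from a composer message, skipping fenced code
/// blocks and inline backtick spans so shell variables are never picked up.
/// (Hypothetical helper; the real scanner is not written yet.)
pub fn scan_skill_mentions(input: &str) -> Vec<String> {
    let mut ids = Vec::new();
    let mut in_fence = false;
    for line in input.lines() {
        // A line that opens or closes a fence toggles the fence state.
        if line.trim_start().starts_with(FENCE) {
            in_fence = !in_fence;
            continue;
        }
        if in_fence {
            continue;
        }
        // Splitting on backticks: odd-numbered segments are inline code.
        for (index, segment) in line.split('`').enumerate() {
            if index % 2 == 1 {
                continue;
            }
            for token in segment.split_whitespace() {
                // Only tokens that *start* with `$` count; the id runs to
                // the next whitespace.
                if let Some(id) = token.strip_prefix('$') {
                    if is_plausible_skill_id(id) {
                        ids.push(id.to_string());
                    }
                }
            }
        }
    }
    ids
}

/// Assumed id shape: starts with a letter, then letters, digits, `-`, `_`,
/// or `:` (for the `ns:skill` form).
fn is_plausible_skill_id(id: &str) -> bool {
    let mut chars = id.chars();
    match chars.next() {
        Some(first) if first.is_ascii_alphabetic() => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | ':'))
}
```
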
**Slice 2: inline-mention popup.**
- Extend the inline-mention popup machinery in `slash_menu.rs:86`
  (`partial_inline_skill_mention_at_cursor`) to recognize a `$`-prefixed token
  under the cursor and offer skill-name completions from `SkillRegistry::list()`,
  the same way the slash popup offers commands. This is a UX accelerator on top
  of Slice 1, not a precondition for it.

**Slice 3: ambiguity diagnostics.**
- When resolution is ambiguous, emit actionable diagnostics, e.g.
  `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging,
  superpowers:systematic-debugging; use $superpowers:systematic-debugging"`.
  Diagnostics name the disambiguated id the user should type next.

**Deferred to 0.9.0+ (explicitly out of the first slices):**
- `$ns:skill` **globs / wildcards** (`$github:*`). Plain namespaced-exact
  (`$github:gh-fix-ci`) ships in Slice 1; globbing does not.
- **Per-skill tool restriction enforcement.** Parsing/display can land early;
  enforcement and its catalog-head-stability handling are owned by
  `TOOL_LIFECYCLE.md`.
- **Multi-skill composition rules.** Until defined, fall back to the "choose one"
  prompt (§4, §7).

---

## 7. Ambiguity / error UX, tests, and non-goals

### Error / ambiguity UX examples

| Input | Outcome |
|---|---|
| `$systematic-debugging fix the auth bug` | Exact hit. `Using skill: systematic-debugging`, body loaded, task = "fix the auth bug". |
| `$github:gh-fix-ci inspect failing checks` | Namespaced-exact hit. `Using skill: github:gh-fix-ci`, body loaded. |
| `$nope do a thing` | No match. `"No skill named 'nope'. Run /skill list to see available skills."` No activation; the line is sent as ordinary text. |
| `$debugging ...` (3 candidates) | `"$debugging matched 3 skills: systematic-debugging, root-cause-debugging, superpowers:systematic-debugging; use $superpowers:systematic-debugging."` No auto-pick. |
| `$systematic-debug ...` (1 fuzzy candidate) | Suggest only: `"No exact skill 'systematic-debug'. Did you mean $systematic-debugging?"` No silent activation. |
| `$aleph ...` but aleph disabled | `"Skill 'aleph' is disabled. Enable it with /skill enable aleph."` No activation. |
| `$tdd $systematic-debugging ...` (2 mentions) | `"Choose one skill to lead this turn: $test-driven-development or $systematic-debugging."` (until composition rules exist). |
| `echo $PATH` inside a code fence / command string | Not a mention. Scanner skips `$` inside code/command contexts. |

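
The outcomes in the table above can be modelled as a small resolver. This is a sketch under stated assumptions: `SkillEntry`, `Resolution`, and `resolve` are hypothetical names, and "fuzzy candidate" is approximated by a plain substring match, which the real resolver may well replace with something smarter.

```rust
/// Hypothetical view of one discovered skill: its id and enabled flag.
pub struct SkillEntry {
    pub id: &'static str,
    pub enabled: bool,
}

/// One variant per row family in the UX table.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Exact (or namespaced-exact) hit on an enabled skill.
    Activate(String),
    /// Exact hit, but the skill is disabled: report, do not activate.
    Disabled(String),
    /// Exactly one near match: suggest it, never activate silently.
    Suggest(String),
    /// Two or more near matches: list them, never auto-pick.
    Ambiguous(Vec<String>),
    /// Nothing matched: the line is sent as ordinary text.
    Missing,
}

pub fn resolve(query: &str, skills: &[SkillEntry]) -> Resolution {
    // Exact ids win outright; only the enabled flag can block activation.
    if let Some(skill) = skills.iter().find(|s| s.id == query) {
        return if skill.enabled {
            Resolution::Activate(skill.id.to_string())
        } else {
            Resolution::Disabled(skill.id.to_string())
        };
    }
    // Assumption: a "candidate" is any id containing the query text.
    let mut candidates: Vec<String> = skills
        .iter()
        .filter(|s| s.id.contains(query))
        .map(|s| s.id.to_string())
        .collect();
    match candidates.len() {
        0 => Resolution::Missing,
        1 => Resolution::Suggest(candidates.remove(0)),
        _ => Resolution::Ambiguous(candidates),
    }
}
```
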
### Tests (planned)

- **Exact:** `$systematic-debugging` resolves via `get(id)`, activates, loads body.
- **Namespaced:** `$github:gh-fix-ci` resolves on the `ns:skill` form.
- **Missing:** `$nope` → no-match message, no activation, line passed as text.
- **Ambiguous:** `$debugging` (≥2 candidates) → "matched N skills … use $ns:skill",
  asserts **no** auto-activation occurred.
- **Disabled:** a skill with `is_enabled == false` → disabled message, no activation.
- **Guardrail, `$` in code:** `$VAR` inside a fenced block or command string is
  not treated as a mention.

### Non-goals

- **Do not remove slash commands.** `/skill` and the whole `/` surface stay; `$`
  is preferred for models but additive.
- **Do not auto-run arbitrary scripts.** A `$mention` loads guidance (and, later,
  a declared tool policy); it never executes shell or skill-attached scripts on
  its own.
- **Do not silently activate multiple complex skills.** Multi-mention falls back
  to a "choose one" prompt until composition rules are specified.
- **Do not let `$` collide with shell variables.** `$` inside code fences and
  command strings is never parsed as a skill mention.

@@ -0,0 +1,366 @@
# Tool-Surface Lifecycle Policy (v0.8.53)

**Status:** Design doc / policy. No catalog code lands in this cycle; the code
work is **deferred**. This document is the umbrella policy for GitHub **#2681**,
with **#2682** and **#2683** as concrete instances of the planned diet. It
describes *what will be done* and the invariants any future diet PR must hold.

**Scope of related open work (do not contradict):**
- PR **#2684**: subagent role vocabulary, lifecycle signals, eval ergonomics.
  Legacy subagent-name cleanup + guardrail tests in this policy rebase on #2684.
- PR **#2685**: git-history active + RLM/field errors.

All file:line citations are against the verified tree at
`/Users/huntermbown/Desktop/whalebro/codewhale` as of v0.8.52/0.8.53.

---

## 1. Purpose and the weaker-model problem

CodeWhale ships a large native tool surface. The first-turn *active* partition
of that surface is what every model sees before it has run a single
`tool_search_*` call. Today that active set contains several **near-duplicate
tools** that map to the *same* implementation under different names:

- `exec_wait` and `exec_shell_wait` are both `ShellWaitTool`
  (`crates/tui/src/tools/registry.rs:526,529`).
- `exec_interact` and `exec_shell_interact` are both `ShellInteractTool`
  (`registry.rs:527,530`).
- `tts` and `speech` are both `SpeechTool`
  (`registry.rs:787-792`, both deferred).
- `todo_write` and `checklist_write` are the *same* `TodoWriteTool`
  constructed two ways (`crates/tui/src/tools/todo.rs:184-196`).

For a strong model, redundant names are harmless noise. For **weaker / smaller
models** (the Arcee Trinity lane, `deepseek-v4-flash` child executors, and any
non-thinking executor), every additional near-duplicate in the visible set is a
real cost:

- It widens the choice space with options that do *nothing distinct*, increasing
  wrong-tool selection and oscillation between synonyms.
- It spends scarce first-turn catalog budget (Section 5) on zero-information
  entries.
- It dilutes the "one name = one thing" contract that lets a small model reason
  about the surface at all.

The lifecycle policy exists to **shrink and discipline the model-visible
surface** without ever breaking the ability to replay an old transcript that
referenced a now-retired name.

---

## 2. The five lifecycle states

Every native tool name occupies exactly one lifecycle state.

| State | Meaning | Visible on first turn? | In `tool_search_*`? | Executes if called? | When used |
|---|---|---|---|---|---|
| **active** | Canonical, in the first-turn catalog head | **Yes** | n/a (already active) | Yes | The tool a model should reach for by default |
| **deferred** | Registered + discoverable, hydrated on demand | No | **Yes** | Yes | Real, useful tools that don't earn a first-turn slot |
| **hidden-compatibility** | Registered + dispatchable, but removed from active **and** from search | No | **No** | **Yes (identical behavior, silent)** | Old synonym kept only so old transcripts replay; no model should newly discover it |
| **deprecated** | Like hidden-compat, but execution **appends a replacement notice to result metadata** | No | **No** | **Yes (works, plus a "use X instead" notice)** | A retired name we actively steer callers off of, still safe to replay |
| **removed** | Not registered at all | No | No | **No (hard error)** | Only after `planned_removal_version`, once replay support is formally dropped |

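
As a reading aid, the table can be restated as an enum with one predicate per column. This is illustrative only: the `Lifecycle` type and its method names are assumptions, and this policy deliberately does not store such a value on each tool spec, so treat it as a way to check one's understanding of the states (or to phrase test assertions), not as a proposed field.

```rust
/// The five lifecycle states from the table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    /// "Visible on first turn?" column: only active tools.
    pub fn visible_first_turn(self) -> bool {
        matches!(self, Lifecycle::Active)
    }

    /// "In tool_search_*?" column: only deferred tools are discovered this
    /// way (active tools are already visible, so the table marks them n/a).
    pub fn discoverable_via_search(self) -> bool {
        matches!(self, Lifecycle::Deferred)
    }

    /// "Executes if called?" column: everything except removed.
    pub fn dispatchable(self) -> bool {
        !matches!(self, Lifecycle::Removed)
    }

    /// Only deprecated names attach a replacement notice to result metadata.
    pub fn appends_replacement_notice(self) -> bool {
        matches!(self, Lifecycle::Deprecated)
    }
}
```
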
### hidden-compatibility vs deprecated: be precise

Both states are **invisible** (not active, not in tool search) and both remain
**dispatchable** (calling them still works). The *only* difference is the
caller-facing signal:

- **hidden-compatibility:** completely silent. The tool behaves byte-for-byte
  like its canonical twin. We use this when there is *no behavioral or naming
  lesson to teach*: the name was a pure alias and we simply don't want models
  re-learning it. (Example: `exec_wait` is literally `exec_shell_wait`.)
- **deprecated:** behaves identically *and succeeds*, but the tool result's
  **metadata** carries an appended notice like
  `"deprecated: use checklist_write instead"`. The notice goes **only in the
  result metadata returned for that call**, never in the cached tool catalog
  prefix (see Section 8). We use this when there is a canonical replacement we
  want the caller (and any human reading the transcript) nudged toward.

Neither state ever changes the *behavior* of the call. Replay always works.

---

## 3. Representation in code

The lifecycle is represented as **const name-sets plus an alias/manifest table**
in `crates/tui/src/core/engine/tool_catalog.rs`, alongside the existing
`DEFAULT_ACTIVE_NATIVE_TOOLS` (`tool_catalog.rs:37-64`) and
`ARCEE_FIRST_TURN_NATIVE_TOOLS` (`tool_catalog.rs:106-115`).

### 3a. Name-sets and the manifest (sketch)

```rust
// crates/tui/src/core/engine/tool_catalog.rs (planned)

/// Tools removed from the active set AND from tool-search, but still
/// registered and dispatchable with byte-identical behavior. Silent.
pub(super) const HIDDEN_COMPATIBILITY_TOOLS: &[&str] = &[
    "exec_wait",     // == exec_shell_wait (ShellWaitTool)
    "exec_interact", // == exec_shell_interact (ShellInteractTool)
    "tts",           // == speech (SpeechTool)
];

/// Deprecated aliases: invisible + dispatchable, with a replacement notice
/// appended to RESULT METADATA only (never the cached prefix).
pub(super) struct DeprecatedAlias {
    pub name: &'static str,
    pub replacement: &'static str,
    pub note: &'static str,
}

pub(super) const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
    DeprecatedAlias { name: "todo_write", replacement: "checklist_write",
                      note: "use checklist_write instead" },
    DeprecatedAlias { name: "todo_add", replacement: "checklist_add",
                      note: "use checklist_add instead" },
    DeprecatedAlias { name: "todo_update", replacement: "checklist_update",
                      note: "use checklist_update instead" },
    DeprecatedAlias { name: "todo_list", replacement: "checklist_list",
                      note: "use checklist_list instead" },
];

#[inline]
pub(super) fn is_hidden_or_deprecated(name: &str) -> bool {
    HIDDEN_COMPATIBILITY_TOOLS.contains(&name)
        || DEPRECATED_ALIASES.iter().any(|d| d.name == name)
}
```

### 3b. The two filter points

1. **Catalog / tool-search exclusion (tool_catalog.rs).**
   Deferral is decided by `should_default_defer_tool` (`tool_catalog.rs:66-82`),
   and the active set is the head built by `build_model_tool_catalog`
   (`tool_catalog.rs:178-196`). Hidden-compat and deprecated tools must be
   forced *out of the active head* and *out of the tool-search-discoverable
   pool*. Concretely, the deferral predicate gains a short-circuit so these
   names are never active, and the tool-search index builder skips any name for
   which `is_hidden_or_deprecated(name)` is true. Arcee's narrowed first-turn
   path (`apply_provider_tool_policy`, `tool_catalog.rs:134-149`) already
   excludes them by construction since they aren't in
   `ARCEE_FIRST_TURN_NATIVE_TOOLS`.

2. **Result-notice append (tool_routing.rs).**
   Dispatch already routes by tool name in
   `crates/tui/src/tui/tool_routing.rs` (e.g. the wait/interact unification at
   `tool_routing.rs:1139-1140`). After a successful dispatch, if the called name
   is in `DEPRECATED_ALIASES`, the router appends the matching `note` to the
   **result metadata only**. Hidden-compat names append nothing.

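
A minimal sketch of the second filter point (the result-notice append), under loudly stated assumptions: the `ToolResult` shape and the metadata keys `deprecation_notice` and `replacement` are hypothetical, and the alias table is abbreviated to two of the manifest entries so the block stays self-contained. Only the rule itself (deprecated names get a note in result metadata, hidden-compat names get nothing) comes from this policy.

```rust
use std::collections::BTreeMap;

/// Local copy of the manifest row shape, abbreviated for the sketch.
struct DeprecatedAlias {
    name: &'static str,
    replacement: &'static str,
    note: &'static str,
}

const DEPRECATED_ALIASES: &[DeprecatedAlias] = &[
    DeprecatedAlias { name: "todo_write", replacement: "checklist_write",
                      note: "use checklist_write instead" },
    DeprecatedAlias { name: "todo_add", replacement: "checklist_add",
                      note: "use checklist_add instead" },
];

/// Hypothetical result shape: payload plus per-call metadata.
pub struct ToolResult {
    pub content: String,
    pub metadata: BTreeMap<String, String>,
}

/// Called after a successful dispatch. Touches metadata only: the payload,
/// the tool schema, and the catalog are left untouched.
pub fn append_deprecation_notice(called_name: &str, result: &mut ToolResult) {
    if let Some(alias) = DEPRECATED_ALIASES.iter().find(|a| a.name == called_name) {
        result.metadata.insert(
            "deprecation_notice".to_string(),
            format!("deprecated: {}", alias.note),
        );
        result
            .metadata
            .insert("replacement".to_string(), alias.replacement.to_string());
    }
    // Hidden-compat and canonical names fall through: nothing is appended.
}
```
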
### 3c. Why name-sets, not a per-`ToolSpec` enum field

A per-`ToolSpec` `lifecycle: Lifecycle` field was rejected for three reasons:

- **Prefix-cache safety.** The tool catalog array is part of DeepSeek's
  immutable KV prefix (`tool_catalog.rs:169-177`). A per-spec field invites
  serializing lifecycle state *into* each tool's schema, which is exactly the
  kind of head mutation that forces a full re-prefill. Name-sets live entirely
  in the catalog-build logic and never touch the emitted tool JSON.
- **Single source of truth + diffability.** The diet for a release is one small,
  reviewable edit to two or three const arrays in one file, instead of scattered
  field flips across many tool modules.
- **Registration stays orthogonal.** Tools remain registered exactly as today
  (e.g. `with_shell_tools`, `registry.rs:523-531`). Lifecycle is a *catalog
  policy* layered on top of registration, not a property baked into the tool.

---

## 4. Deprecation manifest (the #2681 acceptance-criteria table)

This is the authoritative manifest. Columns are the #2681 AC columns. No entry
is "removed" in 0.8.53; replay is supported for everything listed.

| Alias | Replacement (canonical) | Lifecycle state | first_deprecated_version | planned_removal_version | replay_supported |
|---|---|---|---|---|---|
| `exec_wait` | `exec_shell_wait` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `exec_interact` | `exec_shell_interact` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `tts` | `speech` | hidden-compatibility | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_write` | `checklist_write` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_add` | `checklist_add` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_update` | `checklist_update` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |
| `todo_list` | `checklist_list` | deprecated | 0.8.53 | TBD (≥ 0.9.x) | Yes |

**Legacy subagent names: already non-visible, no manifest entry needed.**
`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`,
`send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, and
`delegate_to_agent` exist only as `#[allow(dead_code)]` structs in
`crates/tui/src/tools/subagent/mod.rs` and are **never instantiated** outside
tests, so they are already not model-visible. Only `agent_open`, `agent_eval`,
`tool_agent`, and `agent_close` are registered
(`registry.rs:1017-1029`). The action for these legacy names is **dead-code
cleanup + a guardrail test** (rebase on PR #2684), not a lifecycle transition.

> **Keep the live internal methods.** `send_input`, `cancel`, and `resume` also
> exist as live `SubAgentManager` methods
> (`subagent/mod.rs:1605,1495,1521`) used internally by `agent_eval` /
> `agent_close`. These are *not* the dead-code tool structs and must be kept.

`planned_removal_version` is intentionally `TBD`: a name only moves to **removed**
once we formally drop replay for transcripts old enough to contain it, which is a
separate, deliberate decision per name.

---

## 5. Active-catalog budget (per mode, per provider)

The active set is the first-turn cost. Current default active set:
`DEFAULT_ACTIVE_NATIVE_TOOLS` has **25** entries (`tool_catalog.rs:37-64`).

### Per provider

| Provider | First-turn active source | Current count | Target after diet |
|---|---|---|---|
| Default (DeepSeek et al.) | `DEFAULT_ACTIVE_NATIVE_TOOLS` | 25 | 23 after dropping `exec_wait` and `exec_interact`; budget target ≤ 22 (`todo_*` already not active) |
| Arcee (Trinity) | `ARCEE_FIRST_TURN_NATIVE_TOOLS` | 8 (read-only WAF workaround) | 8 (unchanged) |

The default diet removes `exec_wait` and `exec_interact` from the active head
(they become hidden-compat; their canonical twins `exec_shell_wait` /
`exec_shell_interact` stay). `tts` and `todo_*` are *already not* in the active
set, so the active count moves **25 → 23** from the wait/interact removal alone;
the broader target is a stable budget of roughly **≤ 22** canonical tools, which
requires retiring at least one more active name in a later diet.

### Per mode (Plan / Agent / YOLO)

The native active head is the **same set across modes** by design: mode does not
add or remove native tools from `DEFAULT_ACTIVE_NATIVE_TOOLS`
(`should_default_defer_tool` ignores `_mode` for native tools,
`tool_catalog.rs:66-68`). Mode affects **MCP** deferral instead:
`apply_mcp_tool_deferral` keeps MCP tools deferred unless `mode == Yolo`
(`tool_catalog.rs:162-167`).

| Mode | Native active budget | MCP tools active? |
|---|---|---|
| Plan | same native head (target ≤ 22) | No (deferred) |
| Agent | same native head | No (deferred) |
| YOLO | same native head | Yes (a known, intentional widening) |

**Budget rule:** the native active head must stay byte-identical across Plan ↔
Agent ↔ YOLO (Section 8). Any growth of the head requires retiring something
else or an explicit budget bump in this doc.

---

## 6. The canonical-surface rule

> **Every model-visible (active or deferred-discoverable) tool must have one
> clear niche. If a tool is superseded, it gets a named replacement and moves to
> hidden-compatibility or deprecated; it does not stay visible.**

### Canonical vs compatibility summary for the confusing clusters

| Cluster | Canonical (keep visible) | Compatibility / retired | Notes |
|---|---|---|---|
| **Shell wait** | `exec_shell_wait` | `exec_wait` → hidden-compat | Same `ShellWaitTool` (`registry.rs:526,529`); router already unifies (`tool_routing.rs:1139`) |
| **Shell interact** | `exec_shell_interact` | `exec_interact` → hidden-compat | Same `ShellInteractTool` (`registry.rs:527,530`) |
| **Checklist / todo** | `checklist_write` | `todo_write/add/update/list` → deprecated | Same `TodoWriteTool`, `::new` vs `::checklist` (`todo.rs:184-196`) |
| **Speech / tts** | `speech` | `tts` → hidden-compat | Same `SpeechTool` (`registry.rs:787-792`) |
| **Subagent lifecycle** | `agent_open`, `agent_eval`, `agent_close`, `tool_agent` (gated, §7) | all 11 legacy names → already non-visible dead code | Cleanup + guardrail test, rebase on #2684 |
| **Edit family** | `apply_patch`, `edit_file`, `write_file`, `fim_edit` | none (**all distinct niches**) | NOT touched (per #2681 non-goals); doc-only canonical guidance |
| **Search family** | `grep_files` (content), `file_search` (filename), `project_map` (structure) | none (**distinct niches**) | NOT touched; no FTS5/BM25/semantic index exists today |

**Non-goals (explicitly NOT diet targets in this cycle, per #2681):**
`apply_patch` / `edit_file` / `write_file` / `fim_edit`;
`grep_files` / `file_search` / `project_map`;
`fetch_url` / `web.run` / `web_search`;
`task_shell_*`; `handle_read` / `retrieve_tool_result`. These have distinct
niches and receive **canonical guidance only**, with no lifecycle change.

The RLM surface (`rlm_open` / `rlm_eval` / `rlm_configure` / `rlm_close` /
`rlm_session_objects`, `crates/tui/src/tools/rlm.rs`) is likewise out of scope;
`handle_read` retrieves var handles, and `finalize` / `FINAL` is an in-kernel
Python function, **not a tool**, so there is nothing to retire there.

---

## 7. `tool_agent` decision: canonical but DeepSeek-V4-gated

`tool_agent` **stays** as a canonical subagent tool
(`registry.rs:1024`, `ToolAgentTool`). It is the fast, **non-thinking "Fin"
executor lane**, built on `deepseek-v4-flash` (cf. `DEFAULT_CHILD_MODEL =
"deepseek-v4-flash"`, `rlm.rs:26`).

**Decision: gate `tool_agent` to DeepSeek-V4 models only.**

- It is purpose-built around the V4-flash non-thinking executor profile. Exposing
  it to other providers (e.g. Arcee Trinity, which is already WAF-narrowed to 8
  read-only tools, `tool_catalog.rs:106-115`) offers no working executor lane and
  only adds a confusing, mis-targeted option to weaker surfaces.
- Gating is a **provider/model policy**, consistent with the existing
  provider-aware first-turn policy (`apply_provider_tool_policy`,
  `tool_catalog.rs:134-149`): on non-DeepSeek-V4 models, `tool_agent` is excluded
  from the active set and from tool-search discovery. It remains **registered and
  dispatchable** so transcripts created under a V4 model replay everywhere.

This is not a lifecycle transition: `tool_agent` is canonical. It is a
*visibility gate* layered on the same machinery as the Arcee narrowing.

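
The gate described above could be expressed as a pure visibility filter. The sketch below rests on assumptions that are not in the source: the helper names are hypothetical, and "is a DeepSeek-V4 model" is approximated by a `deepseek-v4` prefix check on the model id, which the real policy may define differently. Registration and dispatch are deliberately not involved; only visibility is filtered.

```rust
/// Assumption: V4-family model ids start with "deepseek-v4"
/// (e.g. "deepseek-v4-flash"). The real check may differ.
pub fn is_deepseek_v4(model: &str) -> bool {
    model.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Filter the model-visible candidates (active set or tool-search pool).
/// `tool_agent` survives only under a V4 model; every other name passes.
pub fn visible_tools<'a>(model: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || is_deepseek_v4(model))
        .collect()
}
```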

---

## 8. Prefix-cache safety + replay guarantee

### Prefix-cache rules every diet PR MUST follow

The tools array is part of DeepSeek's immutable KV prefix. The catalog-head
byte-stability invariant (`tool_catalog.rs:169-196`) is binding:

1. **Never mutate the active head non-deterministically.** The first-turn active
   block must be **byte-identical run-to-run** and across Plan ↔ Agent ↔ YOLO.
2. **A diet is a one-time deterministic edit.** Removing a name from
   `DEFAULT_ACTIVE_NATIVE_TOOLS` shifts the head exactly once; after that it must
   be stable. Land such edits as their own focused change.
3. **Notices live in result metadata, never the prefix.** Deprecated replacement
   notes are appended at dispatch time in `tool_routing.rs` to the *call result*
   only. **Nothing** about hidden/deprecated state may be serialized into a tool
   schema, description, or the catalog array.
4. **Preserve ordering and partitioning.** `build_model_tool_catalog` sorts each
   partition by name and keeps built-ins as a contiguous prefix ahead of MCP
   tools (`tool_catalog.rs:186-194`). Diet edits must not break this.
5. **Hidden/deprecated tools are excluded *before* the head is built**, so their
   removal is the only head change; they do not appear in the prefix at all.

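
Rules 4 and 5 can be illustrated with a toy head builder. This is not `build_model_tool_catalog`; it is a hypothetical `build_active_head` that works on plain names, and the MCP tool names used with it are invented. What it shows is the shape of the invariant: exclude retired names first, sort each partition by name, and keep built-ins as a contiguous prefix, so the output does not depend on input order.

```rust
/// Build a deterministic name list: filtered built-ins (sorted) first,
/// then MCP tools (sorted). Hypothetical helper for illustration only.
pub fn build_active_head(builtins: &[&str], mcp: &[&str], excluded: &[&str]) -> Vec<String> {
    // Rule 5: drop hidden/deprecated names before anything is ordered.
    let mut native: Vec<String> = builtins
        .iter()
        .filter(|name| !excluded.contains(name))
        .map(|name| name.to_string())
        .collect();
    // Rule 4: each partition is sorted by name...
    native.sort();
    native.dedup();
    let mut external: Vec<String> = mcp.iter().map(|name| name.to_string()).collect();
    external.sort();
    // ...and built-ins stay a contiguous prefix ahead of MCP tools.
    native.extend(external);
    native
}
```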
### Old-transcript replay guarantee

> For every name in the deprecation manifest with `replay_supported = Yes`, the
> tool stays **registered and dispatchable with identical behavior**. Replaying
> an old transcript that calls `exec_wait`, `exec_interact`, `tts`, or any
> `todo_*` produces the same result it always did. Deprecated names additionally
> attach a result-metadata notice; hidden-compat names are silent. A name is only
> ever made non-dispatchable (**removed**) after a deliberate, per-name decision
> to drop replay support at `planned_removal_version`.

---

## 9. Required tests

Any diet PR (and the umbrella #2681 work) must add/keep:

1. **Duplicate-active-alias guard.** A test asserting that no name in
   `HIDDEN_COMPATIBILITY_TOOLS` or `DEPRECATED_ALIASES` appears in
   `DEFAULT_ACTIVE_NATIVE_TOOLS` or `ARCEE_FIRST_TURN_NATIVE_TOOLS`, and that no
   two active entries resolve to the same underlying tool implementation.

2. **Tool-search exclusion test.** Assert that hidden-compat and deprecated names
   are absent from the tool-search-discoverable pool while remaining present in
   the registry (dispatchable).

3. **Replay / dispatch tests.** For each manifest name, calling it still
   executes and returns the same result as its canonical twin. Deprecated names
   additionally assert the replacement note is present **in result metadata** and
   absent from the catalog/prefix. Hidden-compat names assert **no** added
   notice.

4. **Golden active-block byte test.** A snapshot test pinning the byte
   serialization of the first-turn active tool block, asserting it is identical
   across Plan / Agent / YOLO (native head) and stable run-to-run, enforcing the
   `tool_catalog.rs:169-196` invariant. The golden updates **only** as a
   reviewed, deliberate one-time edit when the diet lands.

5. **Subagent guardrail test (rebase on #2684).** Assert only `agent_open`,
   `agent_eval`, `tool_agent`, `agent_close` are registered as model-visible
   subagent tools and that no legacy name from `subagent/mod.rs` is
   instantiated outside tests.

6. **`tool_agent` gating test.** Assert `tool_agent` is active/discoverable only
   under DeepSeek-V4 models and excluded (but still registered) elsewhere.

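
The first required test (the duplicate-active-alias guard) might be built on two small helpers like the ones sketched here. Both function names are hypothetical, and the inputs are passed in as plain slices because the real contents of `DEFAULT_ACTIVE_NATIVE_TOOLS` are not reproduced in this document; a real test would feed the actual const arrays and an actual name-to-implementation mapping.

```rust
use std::collections::BTreeMap;

/// Names that are retired (hidden-compat or deprecated) yet still listed
/// in an active set. A passing guard test expects this to be empty.
pub fn retired_names_in_active<'a>(active: &[&'a str], retired: &[&str]) -> Vec<&'a str> {
    active
        .iter()
        .copied()
        .filter(|name| retired.contains(name))
        .collect()
}

/// Group active `(tool name, implementation id)` pairs by implementation
/// and keep only implementations reachable under more than one name.
pub fn shared_implementations<'a>(
    active: &[(&'a str, &'a str)],
) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut by_impl: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for (name, implementation) in active {
        by_impl.entry(*implementation).or_default().push(*name);
    }
    by_impl.retain(|_, names| names.len() > 1);
    by_impl
}
```

Run against today's surface, the second helper would flag `ShellWaitTool` as shared by `exec_wait` and `exec_shell_wait`; after the diet it should return an empty map.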
@@ -0,0 +1,472 @@
# CodeWhale North Star (0.9.0+)

> **STATUS: DIRECTION, NOT COMMITTED WORK.**
> Everything in this document is the maintainer's intended *direction* for
> CodeWhale 0.9.0 and beyond. **None of it is committed 0.8.53 work.** The
> 0.8.53 cycle ships **design docs only** for these areas; no tool-catalog code
> lands this cycle except the small, already-scoped subagent/git/RLM fixes in
> PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be
> refined, not an API contract. Where this doc names tools that do not exist yet
> (`codebase_search`, `read_file` as a canonical alias, `agent_run`, etc.) those
> are **aspirational names** that will *map onto today's tools*; see each
> section.

## Why this document exists

The vision is at risk of being lost between point releases. CodeWhale is
accumulating capability (subagents, RLM, skills, workflows, an enormous tool
catalog) faster than it is accumulating *shape*. This is the north star that the
incremental 0.8.x stabilization work is steering toward, written down once so it
survives the next dozen PRs.

### The one principle

**The harness handles memory, search, routing, state, and guardrails so a
weaker model can just *think*.** Every design decision below is in service of
moving cognitive load *out* of the model and *into* the harness. A
`deepseek-v4-flash`-class model should not have to remember ~80 tool names, hold
the codebase index in its head, track which layer of memory a fact lives in, or
re-derive a recovery path after a malformed tool call. The harness does that.
The model decides *what it wants*; the harness figures out *how*.

---

## Ground-truth anchor (today's reality)

So the direction is honest about where it starts:

- **Active first-turn tool set** is `DEFAULT_ACTIVE_NATIVE_TOOLS`
  (`crates/tui/src/core/engine/tool_catalog.rs:37-64`), 26 tools. Everything
  else is **deferred** and hydrates via `tool_search_tool_regex` /
  `tool_search_tool_bm25` (`tool_catalog.rs:26-35`).
- **Catalog-head byte-stability is a hard invariant** for DeepSeek's KV
  prefix cache (`tool_catalog.rs:169-196`). The active first-turn tool block
  must stay byte-identical run-to-run; any change to it is a **one-time,
  deterministic edit**, never a per-turn or per-mode mutation.
- **Arcee** narrows the first turn to 8 read-only tools
  (`ARCEE_FIRST_TURN_NATIVE_TOOLS`, `tool_catalog.rs:106-115`) as a Cloudflare
  WAF workaround, which is proof that the active partition is already
  provider-shaped.
- **Subagent tools that are model-visible:** only `agent_open`, `agent_eval`,
  `tool_agent`, `agent_close` (`crates/tui/src/tools/registry.rs:1017-1029`).
  All legacy names (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`,
  `agent_send_input`, `agent_assign`, `agent_list`, `agent_cancel`,
  `resume_agent`, `delegate_to_agent`, …) are `#[allow(dead_code)]` structs in
  `crates/tui/src/tools/subagent/mod.rs`, never instantiated outside tests →
  **already not model-visible**. The live internal `send_input` / `cancel` /
  `resume` methods on `SubAgentManager` (`mod.rs:1495,1521,1605`) back
  `agent_eval` / `agent_close` and **stay**.
- **`tool_agent` is "Fin"**: the experimental fast-lane executor, DeepSeek V4
  Flash with thinking forced off (`mod.rs:5233`, `TOOL_AGENT_INTRO`;
  `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`).
- **Known duplicates today:** `exec_wait ≡ exec_shell_wait`,
  `exec_interact ≡ exec_shell_interact` (same structs, all four in the active
  set), `tts ≡ speech` (both deferred). `todo_*` are deferred twins of
  `checklist_*` (same `TodoWriteTool`, `::new` vs `::checklist`,
  `todo.rs:187,194`). The router already unifies `exec_wait`/`exec_shell_wait`
  (`crates/tui/src/tui/tool_routing.rs:1139-1140`).

This is the surface the north star refactors *toward simplicity*.

---

## 1. Intent Router

**What it is.** A thin layer where the model declares an **intent** —
*search / inspect / edit / test / delegate / ask-user / run-shell /
run-workflow* — and the harness maps that intent to the correct low-level tool
and arguments. The model picks from a tiny, stable verb vocabulary instead of
recalling ~80 concrete tool names and their schemas.

**Why it helps weaker models.** Tool-name recall is one of the largest sources
of wasted turns for small models: choosing a deferred tool (double-invoke),
choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary
collapses that decision space to ~10 verbs. The model spends its budget on
*reasoning about the task*, not on *remembering the API*.

**Rough shape.** A small **canonical visible set** — aspirational names that
route onto today's tools:

| Intent verb (aspirational) | Routes onto today |
|---|---|
| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` |
| `read_file` | `read_file` (already canonical) |
| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) |
| `run_tests` | `run_tests` / `run_verifiers` |
| `git_status` | `git_status` |
| `git_diff` | `git_diff` |
| `work_update` | `update_plan` / `checklist_write` |
| `ask_user` | `request_user_input` |
| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` hidden — §3) |
| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` |
| `workflow_run` | WhaleFlow runner (§4) |

The router is the *only* place the catalog's full complexity is allowed to live.
It is also where **tool repair** (§7) hooks in: a mis-stated intent or a
deferred/deprecated name is rewritten to the canonical route.

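The routing table above could be expressed as a closed enum plus a total `match`, so that adding a verb without a route fails to compile. This is only a sketch: `Intent`, `parse_intent`, and `route` are hypothetical names, not existing items in the codebase, and the real router would also have to map arguments.

```rust
/// Hypothetical intent vocabulary; verb names follow the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    CodebaseSearch,
    ReadFile,
    ApplyPatch,
    RunTests,
    GitStatus,
    GitDiff,
    WorkUpdate,
    AskUser,
    ShellRun,
    AgentRun,
    WorkflowRun,
}

/// Parse a model-declared verb. Unknown verbs return `None`, which is the
/// point where tool repair (§7) would take over.
pub fn parse_intent(verb: &str) -> Option<Intent> {
    Some(match verb {
        "codebase_search" => Intent::CodebaseSearch,
        "read_file" => Intent::ReadFile,
        "apply_patch" => Intent::ApplyPatch,
        "run_tests" => Intent::RunTests,
        "git_status" => Intent::GitStatus,
        "git_diff" => Intent::GitDiff,
        "work_update" => Intent::WorkUpdate,
        "ask_user" => Intent::AskUser,
        "shell_run" => Intent::ShellRun,
        "agent_run" => Intent::AgentRun,
        "workflow_run" => Intent::WorkflowRun,
        _ => return None,
    })
}

/// Today's tools that each intent routes onto, as listed in the table.
pub fn route(intent: Intent) -> &'static [&'static str] {
    match intent {
        Intent::CodebaseSearch => &["grep_files", "file_search", "project_map"],
        Intent::ReadFile => &["read_file"],
        Intent::ApplyPatch => &["apply_patch"],
        Intent::RunTests => &["run_tests", "run_verifiers"],
        Intent::GitStatus => &["git_status"],
        Intent::GitDiff => &["git_diff"],
        Intent::WorkUpdate => &["update_plan", "checklist_write"],
        Intent::AskUser => &["request_user_input"],
        Intent::ShellRun => &["exec_shell"],
        Intent::AgentRun => &["agent_open", "tool_agent", "agent_eval", "agent_close"],
        // The WhaleFlow runner (§4) does not exist yet, so nothing to route to.
        Intent::WorkflowRun => &[],
    }
}
```

Because the verb set is a compile-time enum, it changes only through a deliberate code edit, which fits the byte-stability requirement on the visible catalog head.
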
**Dependencies.** The small canonical surface (§3), the lifecycle alias table
(§3 / `docs/TOOL_LIFECYCLE.md`), and the hybrid index for `codebase_search`
(§2). Must respect the **catalog-head byte-stability invariant**: the visible
verb set is itself a one-time deterministic edit, not a dynamic per-turn list.

---

## 2. Default Hybrid Codebase Intelligence

**What it is.** An always-on, local-first codebase index that ships with the
harness — not an opt-in tool the model has to remember to build. It fuses:

- plain **text** search,
- **symbol** index (definitions/references),
- **import / call graph**,
- **FTS5 + BM25** lexical ranking (rusqlite is already a dependency —
  `Cargo.toml`),
- **sparse** retrieval,
- optional **dense** (embedding) retrieval,
- **PR / commit / issue history** as a first-class retrieval source,
- a **codemap** (structural overview, the successor to today's deferred
  `project_map`).

**Why it helps weaker models.** Today the model must orchestrate `grep_files`
(content), `file_search` (filename), and `project_map` (structure) by hand,
reconcile their outputs, and re-run them as it narrows. There is **no FTS5/BM25
or semantic index today** — every search is a cold walk (`file_search` uses the
`ignore` crate's `WalkBuilder` for vendor exclusion, `file_search.rs:~210`). A
weaker model burns turns stitching partial results. A single `codebase_search`
intent backed by a hybrid index returns ranked, concept-level hits in one call,
so the model reasons about *answers*, not *query mechanics*.

**Rough shape.** A background indexer maintains a SQLite store (FTS5 + symbol +
graph tables), refreshed on file change and on git events. `codebase_search`
(§1) queries it; the codemap is regenerated incrementally. Vendor exclusion
reuses the existing `ignore`/`WalkBuilder` path.

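This section does not say how the per-source rankings are combined. One common option is reciprocal rank fusion (RRF), where each hit scores `1 / (k + rank)` in every list it appears in and the scores are summed. The sketch below assumes that choice; `rrf_fuse` is a hypothetical helper, and `k = 60` is the conventional default rather than a tuned value.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of file paths with reciprocal rank fusion.
/// Each input list is ordered best-first; ranks are 1-based.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Highest score first; ties are broken by path so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

RRF only needs ranks, not comparable scores, so a BM25 list and a symbol-match list can be merged without normalising their scoring scales.
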
**Dependencies.** rusqlite/FTS5; the Intent Router (§1) for the
`codebase_search` verb; the trace store (§6/§8) for history retrieval. **Full
design lives in `docs/CODEBASE_SEARCH_DESIGN.md`** (to be written this cycle).

---

## 3. Small Canonical Tool Surface

**What it is.** A deliberately tiny set of always-visible canonical tools;
**everything else is hidden, deferred, or skill-scoped**. The catalog grows
behind the scenes but the *visible* surface stays small and stable.

**Why it helps weaker models.** Fewer choices, no aliases competing for the same
job, no deferred double-invokes for common operations. The model sees the verbs
it needs and nothing else.

**Rough shape — tool lifecycle states.** Five states, represented as **const
name-sets plus an alias table in `tool_catalog.rs`** (NOT a per-`ToolSpec`
field, to preserve the byte-stable head):

1. **active** — in the first-turn catalog head.
2. **deferred** — registered, hydrated via tool-search.
3. **hidden-compatibility** — registered + dispatchable, **dropped from both
   active and search**, identical behavior, **no notice**. (For exact
   duplicates that should simply disappear from discovery.)
4. **deprecated** — registered + dispatchable, **dropped from search**, appends
   a *replacement notice to RESULT METADATA only* — **never** to the cached
   prefix.
5. **removed** — final state; no longer registered.

**Invariant:** deprecated and hidden-compatibility tools **stay registered and
dispatchable forever** so old transcripts always replay deterministically.

**Planned diet (documented this cycle, not yet coded):**

- `exec_wait`, `exec_interact`, `tts` → **hidden-compatibility** (exact
  duplicates of `exec_shell_wait`, `exec_shell_interact`, `speech`).
- `todo_*` (`todo_write/add/update/list`) → **deprecated** → `checklist_*` (drop
  from tool-search, keep registered, add result-metadata notice).
- Legacy subagent names → already hidden; remaining work is **cleanup +
  guardrail tests**, rebased on PR #2684.

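A minimal sketch of what "const name-sets plus an alias table" might look like for the planned diet. All item names here (`Lifecycle`, `HIDDEN_COMPATIBILITY`, `DEPRECATED`, `ALIASES`, and the helper functions) are hypothetical, and the `checklist_add` / `checklist_update` / `checklist_list` targets are inferred from the `checklist_*` pattern rather than confirmed tool names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Name-sets for the planned diet only; real sets would cover the whole catalog.
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];

// Alias table: old name -> canonical replacement.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
    ("todo_write", "checklist_write"),
    ("todo_add", "checklist_add"), // assumed name
    ("todo_update", "checklist_update"), // assumed name
    ("todo_list", "checklist_list"), // assumed name
];

/// Lifecycle state for names covered by this sketch; `None` means unclassified here.
pub fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    if HIDDEN_COMPATIBILITY.contains(&name) {
        Some(Lifecycle::HiddenCompatibility)
    } else if DEPRECATED.contains(&name) {
        Some(Lifecycle::Deprecated)
    } else {
        None
    }
}

/// Resolve an alias to its canonical name; unknown names pass through unchanged.
pub fn canonical_name(name: &str) -> &str {
    ALIASES
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(_, new)| *new)
        .unwrap_or(name)
}

/// Only active and deferred tools are discoverable through tool-search.
pub fn is_searchable(state: Lifecycle) -> bool {
    matches!(state, Lifecycle::Active | Lifecycle::Deferred)
}

/// Everything except `Removed` stays dispatchable so old transcripts replay.
pub fn is_dispatchable(state: Lifecycle) -> bool {
    !matches!(state, Lifecycle::Removed)
}

/// Notice for result metadata only; hidden-compatibility tools get none.
pub fn replacement_notice(name: &str) -> Option<String> {
    match lifecycle_of(name) {
        Some(Lifecycle::Deprecated) => Some(format!(
            "`{}` is deprecated; use `{}`",
            name,
            canonical_name(name)
        )),
        _ => None,
    }
}
```

Keeping the state in name-sets rather than on each `ToolSpec` means registration code and serialized tool schemas do not change when a tool moves between states.
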
**Explicitly NOT touched** (distinct niches, per #2681 non-goals) — doc-only
canonical guidance, no diet: `apply_patch` / `edit_file` / `write_file` /
`fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` /
`web.run` / `web_search`; `task_shell_*`; `handle_read` /
`retrieve_tool_result`.

**`tool_agent` gating decision.** `tool_agent` ("Fin") **stays** as a canonical
subagent tool, but is **gated to DeepSeek-V4 models only**. It is the fast,
non-thinking executor lane built on `deepseek-v4-flash`; offering it to other
providers/models is meaningless (the lane *is* a specific model) and would just
add a name to recall. The gate is provider/model-conditional in the same spirit
as the Arcee first-turn narrowing.

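The gate itself could be as small as a model-id check applied when the visible subagent set is built. The sketch below assumes that DeepSeek-V4 model ids share a `deepseek-v4` prefix (as `deepseek-v4-flash` does); both function names are hypothetical.

```rust
/// Assumption: every DeepSeek-V4 model id starts with "deepseek-v4".
pub fn tool_agent_visible(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Model-visible subagent tools for a given model, in registry order.
pub fn visible_subagent_tools(model_id: &str) -> Vec<&'static str> {
    let mut tools = vec!["agent_open", "agent_eval", "agent_close"];
    if tool_agent_visible(model_id) {
        // Keep `tool_agent` in its existing position, before `agent_close`.
        tools.insert(2, "tool_agent");
    }
    tools
}
```

The result depends only on the model id, so it is fixed for a session and does not vary turn to turn.
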
**Dependencies.** The alias table backs the Intent Router (§1) and Tool Repair
(§7). **Full spec in `docs/TOOL_LIFECYCLE.md`** (to be written this cycle).

---

## 4. WhaleFlow / Workflow Mode

**What it is.** A typed, multi-agent **workflow runner**. A workflow is a graph
of typed nodes — **branches, leaves, reviewers, verifiers, test-runners,
PR-creators**, with **trace-replay** and a **progress-monitor**. Authors write
workflows in **Starlark or YAML**, which compile to a **typed Rust IR**; the
**Rust executor** runs the IR. "Like Claude's workflow mode, but safer" — the
safety comes from the typed IR and Rust execution boundary rather than free-form
model-driven orchestration.

**Why it helps weaker models.** Long-running, multi-step work (implement →
review → verify → test → open PR) is exactly where weaker models drift, lose
state, or skip verification. Encoding the *process* as a typed graph means the
model only has to be competent at each *leaf*, while the harness guarantees the
sequencing, the verification gates, and the evidence trail.

**Rough shape.** Starlark/YAML → typed IR → Rust executor. Nodes map to
subagent lanes (`agent_open` / `tool_agent` / `agent_eval` / `agent_close`,
`registry.rs:1017-1029`). Reviewer/verifier/test-runner nodes are first-class
node *types*, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via
`/workflow` (alias `/whaleflow`) and the `workflow_run` intent (§1).

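To illustrate what a typed IR buys over free-form orchestration, the sketch below models the node types named above and validates a graph before anything runs. The types and the specific rule (a PR-creator must depend directly on a verifier or test-runner) are hypothetical examples of a verification gate, not the committed IR.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub id: String,
    pub kind: NodeKind,
    pub depends_on: Vec<String>,
}

pub struct Workflow {
    pub nodes: Vec<Node>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum IrError {
    UnknownDependency { node: String, dep: String },
    UngatedPr { node: String },
}

/// Convenience constructor used when building workflows by hand.
pub fn node(id: &str, kind: NodeKind, deps: &[&str]) -> Node {
    Node {
        id: id.to_string(),
        kind,
        depends_on: deps.iter().map(|d| d.to_string()).collect(),
    }
}

/// Reject graphs with dangling edges or a PR-creator that no gate precedes.
pub fn validate(wf: &Workflow) -> Result<(), IrError> {
    for n in &wf.nodes {
        for dep in &n.depends_on {
            if !wf.nodes.iter().any(|other| &other.id == dep) {
                return Err(IrError::UnknownDependency {
                    node: n.id.clone(),
                    dep: dep.clone(),
                });
            }
        }
        if n.kind == NodeKind::PrCreator {
            let gated = n.depends_on.iter().any(|dep| {
                wf.nodes.iter().any(|other| {
                    &other.id == dep
                        && matches!(other.kind, NodeKind::Verifier | NodeKind::TestRunner)
                })
            });
            if !gated {
                return Err(IrError::UngatedPr { node: n.id.clone() });
            }
        }
    }
    Ok(())
}
```

A workflow that tries to open a PR straight after an implementation leaf is rejected at validation time, so skipping verification becomes a compile-stage error instead of a model mistake.
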
**Dependencies.** Subagent runtime; the evaluation loop (§8) for traces;
Skills & Rules (§5) so a skill can *define* a workflow; the command taxonomy
(§9).

---

## 5. Skills & Rules as First-Class Runtime

**What it is.** Skills and rules become real runtime objects, not just prompt
text. Skills gain **activation modes**:

- **always-on** — injected every turn,
- **glob** — activated when matching files are in scope,
- **model-decision** — offered to the model to opt into,
- **manual** — only via explicit `$<skill-name>` invocation (§9).

Skills can **restrict the tool surface**, **define workflows** (§4), and
**inject repo context**.

**Why it helps weaker models.** A skill scoped to a task can shrink the tool
surface to exactly what that task needs and pre-load the relevant rules and
context — so the model operates inside a curated, smaller world instead of the
full catalog.

**Rough shape (vs. today).** Today: skills are discovered
(`crates/tui/src/tools/skills/mod.rs`, `discover_in_workspace ~421`; struct
parses name/description `~382-388`), enable-state is tracked
(`skill_state.rs`, `SkillStateStore::is_enabled ~73`), and there's an
inline-mention popup (`slash_menu.rs ~86`). **But:** no parser activates inline
`$` mentions on submit (submit path: `ui.rs build_queued_message ~4721`), there
is **no activation-mode concept**, and **skills cannot restrict tools**. The
direction adds (a) a submit-time `$<skill-name>` activation parser, (b) the
four activation modes in skill metadata, and (c) a tool-restriction field
enforced by the registry/router.

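As a rough idea of item (a), a submit-time parser could extract `$<skill-name>` candidates from the message before resolution. The sketch assumes skill names and namespaces are lowercase ASCII letters, digits, and hyphens starting with a letter, which keeps tokens like `$5` or `$HOME` from being treated as skills. `SkillMention` and `parse_skill_mentions` are hypothetical names, and resolving a mention against installed skills is out of scope here.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SkillMention {
    pub namespace: Option<String>,
    pub name: String,
}

// Assumed character set for skill names and namespaces.
fn is_name_char(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-'
}

fn is_valid_ident(s: &str) -> bool {
    s.chars().next().map_or(false, |c| c.is_ascii_lowercase()) && s.chars().all(is_name_char)
}

/// Extract `$name` and `$namespace:name` candidates from submitted text.
pub fn parse_skill_mentions(input: &str) -> Vec<SkillMention> {
    let mut mentions = Vec::new();
    for word in input.split_whitespace() {
        if let Some(rest) = word.strip_prefix('$') {
            // Drop trailing punctuation such as "," or "." after the token.
            let token = rest.trim_end_matches(|c: char| !is_name_char(c));
            let (namespace, name) = match token.split_once(':') {
                Some((ns, name)) => (Some(ns), name),
                None => (None, token),
            };
            let namespace_ok = namespace.map_or(true, is_valid_ident);
            if namespace_ok && is_valid_ident(name) {
                mentions.push(SkillMention {
                    namespace: namespace.map(str::to_string),
                    name: name.to_string(),
                });
            }
        }
    }
    mentions
}
```

The parser only proposes candidates; an unknown or ambiguous name would still need to surface a suggestion to the user rather than activate anything.
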
**Dependencies.** Tool lifecycle/alias table (§3) for restriction; Intent Router
(§1); WhaleFlow (§4); command taxonomy (§9). **Full design in
`docs/SKILL_INVOCATION_DESIGN.md`** (to be written this cycle).

---

## 6. Context Memory Stack

**What it is.** Memory modeled as **explicit, layered, inspectable** stores
rather than one undifferentiated blob. Each layer is **visible, inspectable,
clearable, and scoped**:

1. **User memory** — small user prefs/facts (surfaced via `/memory`, §9).
2. **Repo rules** — checked-in guidance (`/rules`).
3. **Codemap-wiki** — derived structural/semantic knowledge of the repo (§2).
4. **Trace store** — recorded workflow/turn evidence (§8).
5. **ARMH–RLM memo** — the RLM kernel's in-session working memory
   (`rlm_open`/`rlm_eval`/`rlm_configure`/`rlm_close`/`rlm_session_objects`,
   `crates/tui/src/tools/rlm.rs`; `handle_read` retrieves var handles;
   `finalize`/`FINAL` is an *in-kernel Python function*, not a tool).
6. **Cached-main overlay** — promoted lessons from the cached main branch
   (`/overlay`, §9).
7. **External memory (Aleph)** — large local data via the `aleph` skill.

**Why it helps weaker models.** The model never has to *guess* where a fact
should live or *re-derive* context it already established. Each layer has a
clear scope and a clear command to inspect/clear it, so stale context is
visible and removable rather than silently poisoning the prefix.

**Rough shape.** A `/context` dashboard (§9) renders all active layers and their
sizes; `/memory` manages the small user layer; `/overlay` manages promoted
lessons. The RLM layer already exists and is plumbed through `rlm.rs`.

**Dependencies.** Command taxonomy (§9); codebase intelligence (§2); evaluation
loop (§8) for promotion into the overlay.

---

## 7. Tool Repair & Autoload

**What it is.** When the model emits a wrong, deferred, deprecated, or
environment-blocked tool call, the harness **repairs** it instead of returning a
bare error — and **autoloads** what's needed.

**Why it helps weaker models.** Recovery from a malformed call is precisely
where weak models loop or give up. Turning every failure into an actionable,
schema-bearing correction keeps the model on-task.

**Rough shape — representative repairs:**

- **Wrong/legacy name** → *"you meant `agent_eval`; here's the schema"* (autoload
  the deferred tool's schema in the same turn).
- **Mode mismatch** → *"shell is unavailable in Plan mode — ask the user or
  switch modes"*.
- **Missing dependency** → *"this tool needs Node; Node is missing"*
  (dependency probe via `ExternalTool`, already imported in `tool_catalog.rs`).
- **Deprecated alias** → silently **routed to the canonical** tool, with the
  replacement notice in **result metadata only** (§3) — never the cached prefix.

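The four repairs above can be read as one decision function that returns a typed outcome instead of a bare error. In this sketch `Repair`, `CallContext`, and `repair` are hypothetical; the legacy-name suggestions (`agent_send_input` → `agent_eval`, `agent_cancel` → `agent_close`) are illustrative guesses based on which live methods back which tools, and the dependency probe is reduced to a precomputed field.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Repair {
    /// The call is fine as written.
    Dispatch,
    /// Deprecated alias: run the canonical tool, put the notice in result metadata.
    Reroute { canonical: &'static str, metadata_notice: String },
    /// Wrong or legacy name: return the canonical name (and, in practice, its schema).
    Suggest { canonical: &'static str },
    /// The call cannot run in the current mode or environment.
    Blocked { reason: String },
}

pub struct CallContext<'a> {
    pub plan_mode: bool,
    /// External program the tool needs that a probe could not find, if any.
    pub missing_dependency: Option<&'a str>,
}

pub fn repair(tool: &str, ctx: &CallContext<'_>) -> Repair {
    match tool {
        "todo_write" => Repair::Reroute {
            canonical: "checklist_write",
            metadata_notice: "`todo_write` is deprecated; use `checklist_write`".to_string(),
        },
        // Illustrative legacy-name mappings, not a confirmed table.
        "agent_send_input" => Repair::Suggest { canonical: "agent_eval" },
        "agent_cancel" => Repair::Suggest { canonical: "agent_close" },
        "exec_shell" if ctx.plan_mode => Repair::Blocked {
            reason: "shell is unavailable in Plan mode; ask the user or switch modes".to_string(),
        },
        _ => match ctx.missing_dependency {
            Some(dep) => Repair::Blocked {
                reason: format!("this tool needs {}; {} is missing", dep, dep),
            },
            None => Repair::Dispatch,
        },
    }
}
```

Note that `metadata_notice` travels with the result, so nothing in this path touches the cached prefix.
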
**Dependencies.** The alias table + lifecycle states (§3); the Intent Router
(§1); dependency detection (`ExternalTool`). Builds on PR #2685's actionable
RLM/field errors and PR #2684's lifecycle signals — **must not contradict
either**.

---

## 8. Evaluation Loop

**What it is.** Every workflow run **leaves evidence**: the tests it ran, the
diffs it produced, the failures it hit, the searches it issued, the claims it
verified, and the PR outcome. A **teacher/student replay** turns *good* traces
into reusable **rules, skills, tests, and cached guidance**.

**Why it helps weaker models.** The system gets better at *this repo* over time
without the model getting smarter. Verified good traces become rules/skills the
weaker model can lean on next time, and become the source of the cached-main
overlay (§6).

**Rough shape.** Workflow nodes (§4) emit structured evidence into the trace
store (§6). A replay/distillation pass (teacher reviews student trace) promotes
high-value traces into: repo rules (`/rules`), skills (§5), regression tests,
and overlay guidance (`/overlay`). Verified-claim tracking ties into the
adversarial-verification posture already used elsewhere.

**Dependencies.** WhaleFlow (§4) for trace emission; trace store + overlay (§6);
Skills & Rules (§5) as promotion targets.

---

## 9. Command-Surface Taxonomy

**What it is.** One name = **one thing**. The command surface is split so each
prefix has a single, memorable responsibility:

| Surface | Responsibility |
|---|---|
| `/memory` | **Small** user prefs/facts only |
| `/context` | **Dashboard** of all active memory layers (§6) |
| `/rules` | Repo guidance |
| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) |
| `/overlay` | Promoted cached-main lessons (§6/§8) |
| `$<skill-name>` | Skill invocation — **the token *is* the skill name** |
| `codebase_search` | Concept-level code retrieval (§2) |

**Why it helps weaker models (and users).** No overloaded command does five
jobs; the model/user never has to disambiguate *which* `/memory` behavior they
meant. `$systematic-debugging` self-documents what it invokes.

**`/memory` subcommand sketch:**

```
/memory add "<fact>"       # store a small pref/fact
/memory edit               # edit stored facts
/memory search <query>     # find a stored fact
/memory clear              # clear user memory
/memory doctor             # health check; detects legacy ~/.deepseek path
/memory promote <fact>     # (later) promote a fact to a higher layer
```

`doctor` specifically detects the **legacy `~/.deepseek`** path and guides
migration.

**`$<skill-name>` invocation examples:**

```
$systematic-debugging      # local skill
$github:gh-fix-ci          # namespaced skill
```

The submit-time parser (to be added; submit path `ui.rs ~4721`) recognizes the
`$` token and activates the named skill (§5).

**`/context` layers dashboard (example render):**

```
/context
  user-memory     ▸ 7 facts                  (12 KB)  [clear]
  repo-rules      ▸ CLAUDE.md, AGENTS.md     (8 KB)   [view]
  codemap-wiki    ▸ 412 symbols indexed      (auto)   [rebuild]
  trace-store     ▸ 3 recent workflow runs   (—)      [open]
  rlm-memo        ▸ 0 active sessions        (—)      [—]
  cached-overlay  ▸ 5 promoted lessons       (3 KB)   [view]
  aleph-external  ▸ not attached             (—)      [attach]
```

**Dependencies.** Memory stack (§6); skills (§5); codebase intelligence (§2);
workflow runner (§4).

---

## 10. Deferred-Not-Done 0.8.53 Diet Items

Recorded here so they are **not silently dropped** — these were considered for
the 0.8.53 diet and deliberately **deferred** (design-only or out of scope this
cycle):

- **File-mutation overload** — `apply_patch` / `edit_file` / `write_file` /
  `fim_edit` overlap in purpose. Per #2681 non-goals these stay distinct;
  canonical *guidance* (prefer `apply_patch`) is doc-only, no consolidation
  this cycle.
- **`task_shell_*` ↔ `exec_*` redundancy** — `task_shell_start` /
  `task_shell_wait` overlap conceptually with the `exec_*` family. Left intact
  this cycle (distinct niche per #2681); revisit under §1/§3.
- **`handle_read` / `retrieve_tool_result`** — result-handle plumbing kept as-is
  (doc-only canonical guidance); folds naturally into the memory stack (§6) and
  intent routing (§1) later.
- **Search-cluster consolidation** — `grep_files` / `file_search` /
  `project_map` remain three tools this cycle; consolidation is the *job of the
  hybrid index* (§2) under `codebase_search`, not a catalog edit in 0.8.53.

---

## Phased Roadmap

### 0.8.53 — design + small fixes only
- **Code:** only the already-scoped, narrow fixes — PR #2684 (subagent role
  vocab, lifecycle signals, eval ergonomics) and PR #2685 (read-only git history
  active + actionable RLM/field errors). Subagent legacy-name cleanup +
  guardrail tests rebased on #2684.
- **Docs:** this north star, plus `docs/TOOL_LIFECYCLE.md`,
  `docs/CODEBASE_SEARCH_DESIGN.md`, `docs/SKILL_INVOCATION_DESIGN.md`.
- **No tool-catalog code:** the diet (§3), the Intent Router (§1), and the
  hybrid index (§2) are **documented, not coded** this cycle.

### 0.9.0 — first structural moves
- Implement the **tool lifecycle** const name-sets + alias table in
  `tool_catalog.rs` (§3) as a one-time deterministic head edit.
- Land the **planned diet**: `exec_wait`/`exec_interact`/`tts` →
  hidden-compatibility; `todo_*` → deprecated→`checklist_*` (result-metadata
  notice only).
- Gate **`tool_agent`** to DeepSeek-V4 models only (§3).
- First version of the **default hybrid codebase index** (FTS5/BM25 + symbol +
  codemap) behind `codebase_search` (§2).
- First **Intent Router** verbs mapping onto today's tools (§1).
- **Tool Repair** for deferred/deprecated/mode/dependency cases (§7).

### Later (post-0.9.0)
- **WhaleFlow** typed-IR workflow runner (§4) and the **evaluation loop** /
  teacher-student replay (§8).
- **Skills activation modes** + tool restriction + `$<skill-name>` submit-time
  activation (§5).
- Full **Context Memory Stack** with `/context` dashboard, `/overlay`
  promotion, and Aleph external memory (§6).
- Dense/semantic retrieval and PR/commit/issue history in the index (§2).
- Search-cluster consolidation and the remaining §10 deferred items.

---

## North-star one-liner

> **The harness handles memory, search, routing, state, and guardrails — so a
> weaker model can just think.**