Design-only deliverables for the v0.8.53 "tool surface diet / canonical surfaces" cutover (no catalog code in this cycle). Grounded in a verified inventory of the actual tool registry.
- docs/TOOL_LIFECYCLE.md (#2681): the umbrella policy. Five lifecycle states (active / deferred / hidden-compatibility / deprecated / removed) modeled as const name-sets + an alias table in tool_catalog.rs (not a per-ToolSpec field), so registration stays untouched and old transcripts always replay. Includes the deprecation manifest (exec_wait/exec_interact/tts → hidden-compat; todo_* → checklist_* deprecated; 11 legacy subagent names are already non-visible dead code → cleanup + guardrail), per-mode/per-provider active-catalog budget (incl. Arcee's 8-tool first-turn set), prefix-cache safety rules, and the tool_agent decision: canonical but DeepSeek-V4-gated.
- docs/CODEBASE_SEARCH_DESIGN.md (#2680, v0.9.0): local-first FTS5/BM25 + symbol/path ranking + RRF hybrid; rusqlite storage; mtime/branch/vendor invalidation; an explainable tool contract returning reasons[]; and a real CodeWhale query eval set. Complements grep_files/file_search, never replaces.
- docs/SKILL_INVOCATION_DESIGN.md (0.9.0): the $<skill-name> inline invocation syntax (the token IS the skill name), namespaced resolution, ambiguity-suggests-not-guesses, visible activation line, and a smallest-viable slice.
- docs/VISION_NORTH_STAR.md (0.9.0+): intent router, hybrid codebase intelligence, WhaleFlow typed workflow IR, skills/rules runtime, the layered context-memory stack, tool repair/autoload, the evaluation loop, and the command-surface taxonomy (/memory small · /context dashboard · /rules · /workflow · /overlay · $<skill> · codebase_search). Marked DIRECTION, not committed 0.8.53 work; also records the deferred-not-done diet items.
Targets codex/v0.8.53.
codebase_search — Local-First Semantic Code Retrieval
Status: Design note + eval scaffold. Code is DEFERRED. GitHub #2680 · Milestone v0.9.0 · This DOC ships in v0.8.53 (doc-only; no catalog code in this cycle). Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.
This document specifies a model-visible codebase_search tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding tool lifecycle decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.
1. Problem
Today CodeWhale ships two complementary code-locating tools and one structure map:
- file_search: filename search (uses the ignore crate's WalkBuilder for vendor exclusion; default excludes at crates/tui/src/tools/file_search.rs:210-219).
- grep_files: content search (literal/regex token match).
- project_map: a deferred structure map.
None of these answer concept-level questions where the user does not know the exact token:
- "Where is provider auth resolved?"
- "What enforces the shell approval policy?"
- "Where do mode prompts get assembled?"
- "How does the subagent lifecycle close out a child?"
grep_files requires you to already know the literal string (resolve_api_key, ApprovalRequirement, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.
Goal. Add a retrieval tool keyed on intent, not on exact lexemes, that returns ranked, explainable code locations.
Non-goal / explicit complement. codebase_search does not replace grep_files or file_search. Exact-token and filename lookups remain the right tool when you know what you're looking for. codebase_search is the "I don't know the token yet" entry point and always falls back to exact grep so it is never worse than grep for a literal query. (See §2 fallback, §6 non-goals.)
There is currently no FTS5/BM25, sparse, or dense index in the tree. rusqlite is already a workspace dependency (crates/tui/Cargo.toml), so the lexical core can be built with no new heavy dependencies.
2. Approach Comparison
| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| Lexical FTS5 + bm25() | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes (SQLite built in via rusqlite) | Medium (with tokenizer help) | Near-zero (existing dep) | Phase 1 core |
| Symbol / path ranking | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | Phase 1 core |
| Sparse encoders (SPLADE) | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| Dense embeddings | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| Cross-encoder reranker | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |
Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)
Each enabled signal produces an independent ranked list; results are merged with RRF (score(d) = Σ_signals 1/(k + rank_signal(d)), conventional k≈60). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
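The fusion formula above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the planned implementation: the function name rrf_fuse and the use of string ids for results are assumptions, and each input list is assumed to contain a given id at most once.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each list is ordered best-first; ranks are 1-based.
/// `k` is the damping constant (conventionally around 60).
/// ASSUMPTION: `rrf_fuse` and string ids are illustrative names only.
pub fn rrf_fuse(signals: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in signals {
        for (idx, doc) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            // score(d) += 1 / (k + rank_signal(d))
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the score, so a BM25 list and a symbol-rank list can be passed side by side without rescaling either one. A result missing from a list simply contributes nothing for that signal.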
v0.9.0 Phase 1 signal set (all local, no model downloads):
- Lexical (FTS5 bm25()) over chunk text with an identifier-aware tokenizer.
- Symbol rank: boost chunks whose extracted symbol name fuzzy-matches query terms.
- Path rank: boost chunks whose path components match (e.g. query "auth" → …/auth/…, …/provider…).
- Session-relevance boost: recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
- Exact grep fallback: the query is also run as a literal grep_files-equivalent pass; any exact hit is fused in and tagged, guaranteeing codebase_search ⊇ grep for literal queries.
Optional later backends (feature-flagged, off by default):
- --features sparse-splade: adds a SPLADE signal list to the RRF.
- --features dense-embed: adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
- --features rerank: cross-encoder reranks the fused top-K.
Phase 1 deliberately omits all three ML backends (SPLADE, dense embeddings, cross-encoder rerank) so the tool ships with zero network/model dependency and is reproducible in CI.
3. Storage & Index
Location
~/.codewhale/index/<workspace-hash>.db
<workspace-hash> is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by rusqlite (existing dep).
Migration note (ties to the /memory doctor taxonomy in Appendix B): older builds used ~/.deepseek. The index path is ~/.codewhale only; if a legacy ~/.deepseek/index exists it is ignored (a future doctor may offer to migrate it, but it is never read automatically).
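The per-workspace index location described above could be derived roughly as follows. This is a sketch under stated assumptions: the document does not name the hash function, so a hand-rolled 64-bit FNV-1a stands in for it, and index_db_path is a hypothetical helper that expects the caller to pass an already canonicalized workspace root.

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a. ASSUMPTION: stands in for whatever stable hash is chosen.
/// std's DefaultHasher is avoided because its output is not guaranteed
/// to stay the same across Rust releases.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Build `<home>/.codewhale/index/<workspace-hash>.db`.
/// `canonical_root` is expected to be canonicalized by the caller, so two
/// spellings of the same checkout map to the same index file.
pub fn index_db_path(home: &Path, canonical_root: &Path) -> PathBuf {
    let hash = fnv1a64(canonical_root.to_string_lossy().as_bytes());
    home.join(".codewhale")
        .join("index")
        .join(format!("{hash:016x}.db"))
}
```

Because the hash depends only on the root path, the same checkout always resolves to the same file, and a second worktree at a different path gets a separate index.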
Schema sketch
CREATE TABLE files (
id INTEGER PRIMARY KEY,
path TEXT NOT NULL UNIQUE, -- workspace-relative
mtime_ns INTEGER NOT NULL, -- invalidation
size_bytes INTEGER NOT NULL,
content_hash TEXT NOT NULL, -- blake3; skip re-chunk if unchanged
lang TEXT, -- detected language
branch TEXT -- branch at last index (invalidation)
);
CREATE TABLE chunks (
id INTEGER PRIMARY KEY,
file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
start_line INTEGER NOT NULL,
end_line INTEGER NOT NULL,
kind TEXT, -- fn | struct | impl | const | doc | block
symbol TEXT, -- primary symbol name if any
text TEXT NOT NULL -- chunk body (identifier-split copy for FTS)
);
-- Lexical index. External-content FTS so chunk bodies are not stored twice.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
text,
symbol,
content='chunks',
content_rowid='id',
tokenize = 'unicode61 remove_diacritics 2' -- + identifier pre-split at index time
);
CREATE TABLE symbols (
id INTEGER PRIMARY KEY,
file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
name TEXT NOT NULL,
kind TEXT NOT NULL, -- fn | struct | enum | trait | impl | const | macro
line INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);
-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
path TEXT PRIMARY KEY,
last_touch INTEGER NOT NULL, -- unix ns
touch_count INTEGER NOT NULL DEFAULT 1
);
Identifier-aware tokenization (splitting resolveApiKey / resolve_api_key → resolve api key) is applied at index time into the FTS text column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (chunks_sparse, chunks_vec) behind their feature flags.
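A minimal sketch of the index-time identifier split described above, in std-only Rust. The function names split_identifier and fts_text are hypothetical, and the acronym handling (keeping a run of capitals together) is an assumption beyond what the note specifies.

```rust
/// Split an identifier into lowercase word parts.
/// Handles snake_case, kebab-case, camelCase and PascalCase, and keeps
/// acronym runs together ("HTTPServer" -> ["http", "server"]).
/// ASSUMPTION: illustrative helper, not the planned tokenizer.
pub fn split_identifier(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut parts: Vec<String> = Vec::new();
    let mut cur = String::new();
    for i in 0..chars.len() {
        let c = chars[i];
        if !c.is_alphanumeric() {
            // Separators such as '_' or '-' end the current word.
            if !cur.is_empty() {
                parts.push(std::mem::take(&mut cur));
            }
            continue;
        }
        if c.is_uppercase() && !cur.is_empty() {
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Boundary at lower->Upper ("resolveApi") or at the last capital
            // of an acronym run that starts a new word ("HTTPServer").
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                parts.push(std::mem::take(&mut cur));
            }
        }
        cur.extend(c.to_lowercase());
    }
    if !cur.is_empty() {
        parts.push(cur);
    }
    parts
}

/// Space-joined form, suitable for writing into the FTS text column.
pub fn fts_text(ident: &str) -> String {
    split_identifier(ident).join(" ")
}
```

Both resolveApiKey and resolve_api_key come out as the same three words, so a query for "resolve api key" can match either spelling with a plain FTS5 MATCH.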
Chunking strategy (structure-aware)
Chunk on syntactic boundaries, not fixed windows: one chunk per top-level item (fn, struct, impl block, const, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
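The line-window fallback for unparseable files mentioned above might look like the sketch below. The function name window_chunks and the window and overlap parameters are assumptions; the note does not fix any sizes. Line ranges are 1-based and inclusive, matching start_line and end_line in the schema.

```rust
/// Line ranges (1-based, inclusive) for a sliding window over a file
/// with `total_lines` lines. Consecutive windows share `overlap` lines.
/// ASSUMPTION: window and overlap sizes are illustrative parameters.
pub fn window_chunks(total_lines: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && overlap < window, "overlap must be smaller than window");
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 1;
    while start <= total_lines {
        // Clamp the last window to the end of the file.
        let end = (start + window - 1).min(total_lines);
        chunks.push((start, end));
        if end == total_lines {
            break;
        }
        start += step;
    }
    chunks
}
```

The overlap keeps a statement that straddles a window edge fully visible in at least one chunk, at the cost of indexing a few lines twice.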
Invalidation
- mtime + content_hash: on index/refresh, skip files whose mtime_ns and content_hash are unchanged.
- Branch switch: files.branch is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
- Generated / vendor exclusion: reuse the same ignore-crate WalkBuilder exclusion behavior as file_search (mirror the defaults at crates/tui/src/tools/file_search.rs:210-219: target/**, node_modules/**, .git/**, DerivedData/**, dist/**, build/**, *.lock, *.plist, plus .gitignore). One exclusion source of truth shared with file_search avoids index drift.
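One way to read the mtime + content_hash rule above is as a three-way decision per file. The sketch below is an interpretation, not the specified behavior: the Refresh enum, IndexedFile, and plan_refresh are hypothetical names, the hash is passed as a closure so unchanged files are never re-read, and an unchanged mtime is trusted without re-hashing.

```rust
/// What the indexer should do with one file on refresh.
#[derive(Debug, PartialEq, Eq)]
pub enum Refresh {
    /// Nothing changed: keep existing chunks.
    Skip,
    /// Only the timestamp moved: update the files row, keep chunks.
    TouchOnly,
    /// New or changed content: re-chunk and re-index.
    Rechunk,
}

/// The subset of a `files` row needed for the decision.
pub struct IndexedFile {
    pub mtime_ns: i64,
    pub content_hash: String,
}

/// ASSUMPTION: `hash_now` computes the current content hash on demand;
/// it is only called when the mtime differs from the indexed one.
pub fn plan_refresh<F: FnOnce() -> String>(
    indexed: Option<&IndexedFile>,
    mtime_ns: i64,
    hash_now: F,
) -> Refresh {
    let row = match indexed {
        Some(row) => row,
        None => return Refresh::Rechunk, // never indexed before
    };
    if row.mtime_ns == mtime_ns {
        return Refresh::Skip;
    }
    if row.content_hash == hash_now() {
        Refresh::TouchOnly
    } else {
        Refresh::Rechunk
    }
}
```

A branch switch typically changes many mtimes but few contents, which is where the TouchOnly path would keep the re-check cheap.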
Privacy / trust
- Workspace-scoped, local-only. The index lives under ~/.codewhale/index/ and never leaves the machine.
- No cloud by default. Phase 1 has zero network dependency.
- Embeddings / Hugging Face downloads are gated. Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag and an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.
4. Model-Visible Tool Contract
// codebase_search
{
"name": "codebase_search",
"description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
"parameters": {
"query": { "type": "string", "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
"max_results":{ "type": "integer", "default": 10 },
"path_glob": { "type": "string", "description": "Optional path filter, e.g. 'crates/tui/**'." },
"lang": { "type": "string", "description": "Optional language filter." },
"kind": { "type": "string", "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
}
}
Result shape — ranked, explainable, auditable:
{
"results": [
{
"path": "crates/tui/src/config/provider.rs",
"line": 142,
"snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
"score": 0.91,
"reasons": [
"symbol: resolve_api_key matches 'auth/resolve'",
"lexical: matched tokens [provider, api, key, resolve]",
"path: component 'provider' matches query",
"session: file read 2 turns ago"
]
}
],
"backend": "lexical+symbol+path+session", // which signals were fused (RRF)
"fallback_grep_hits": 1 // exact-match hits folded in
}
reasons[] is mandatory and is the auditability contract: every result explains why it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The backend field records which signals were active so results are reproducible given the feature set.
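The mandatory reasons[] rule can be enforced with a small check before results are returned. The following is a sketch only: SearchHit mirrors the fields of the result shape above, while check_auditable and its error text are hypothetical.

```rust
/// One ranked hit, mirroring the result shape above.
#[derive(Debug, Clone)]
pub struct SearchHit {
    pub path: String,
    pub line: u32,
    pub snippet: String,
    pub score: f64,
    pub reasons: Vec<String>,
}

/// Reject any hit that arrives without a usable explanation.
/// A hit fails when `reasons` is empty or contains only blank strings.
/// ASSUMPTION: illustrative guard, not part of the specified contract.
pub fn check_auditable(hits: &[SearchHit]) -> Result<(), String> {
    for hit in hits {
        if hit.reasons.iter().all(|r| r.trim().is_empty()) {
            return Err(format!("{}:{} has no reasons[]", hit.path, hit.line));
        }
    }
    Ok(())
}
```

Running a guard like this in the eval harness would catch a ranking signal that contributes to the score without recording why.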
5. Benchmark / Eval Set
A fixed set of real CodeWhale concept queries, each with the expected file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on file, not line.
| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | crates/tui/src/config/ provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | crates/tui/src/core/engine/tool_catalog.rs | DEFAULT_ACTIVE_NATIVE_TOOLS :37-64 |
| 3 | How are deferred tools hydrated / searched? | crates/tui/src/core/engine/tool_catalog.rs | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | crates/tui/src/core/engine/tool_catalog.rs | ARCEE_FIRST_TURN_NATIVE_TOOLS :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | crates/tui/src/core/engine/tool_catalog.rs | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | crates/tui/src/tools/shell.rs + tools/spec.rs (ApprovalRequirement) | shell tools, ShellWaitTool/ShellInteractTool registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / AppMode assembly in crates/tui/src/tui/ | AppMode usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | crates/tui/src/tools/subagent/mod.rs; registry registration | registry.rs:1017-1029; send_input/cancel/resume mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | crates/tui/src/tools/rlm.rs | DEFAULT_CHILD_MODEL = "deepseek-v4-flash" :26 |
| 10 | Where is RLM eval / var_handle retrieval (handle_read)? | crates/tui/src/tools/rlm.rs, tools/handle.rs | VarHandle import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | crates/tui/src/tools/skills/mod.rs | discover_in_workspace ~421; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | crates/tui/src/tools/skills/skill_state.rs | SkillStateStore::is_enabled ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | crates/tui/src/tools/file_search.rs | ignore WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | crates/tui/src/tui/ui.rs | build_queued_message ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | crates/tui/src/tools/registry.rs | speech ≡ tts :787-792 |
Each entry is a (query, expected_paths[]) row in a fixture (e.g. crates/tui/tests/fixtures/codebase_search_eval.jsonl). Phase 1 ships the harness that runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 ≥ target) gates future ranking changes.
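The two metrics named above can be computed as in this std-only sketch. It assumes file-level matching (as the eval does) and defines recall@k as the fraction of expected files found in the top k; EvalCase and the function names are hypothetical, and fixture loading is left out.

```rust
use std::collections::HashSet;

/// One eval row: the ranked result paths plus the expected file(s).
/// ASSUMPTION: hypothetical in-memory form of a fixture row.
pub struct EvalCase<'a> {
    pub ranked_paths: Vec<&'a str>,
    pub expected_paths: Vec<&'a str>,
}

/// Fraction of expected files that appear in the top `k` results.
pub fn recall_at_k(case: &EvalCase, k: usize) -> f64 {
    if case.expected_paths.is_empty() {
        return 0.0;
    }
    let top: HashSet<&str> = case.ranked_paths.iter().take(k).copied().collect();
    let hits = case.expected_paths.iter().filter(|p| top.contains(*p)).count();
    hits as f64 / case.expected_paths.len() as f64
}

/// Reciprocal rank of the first expected file, or 0.0 when none is found.
pub fn reciprocal_rank(case: &EvalCase) -> f64 {
    case.ranked_paths
        .iter()
        .position(|p| case.expected_paths.contains(p))
        .map_or(0.0, |idx| 1.0 / (idx + 1) as f64)
}

/// Mean recall@k over all cases (0.0 for an empty set).
pub fn mean_recall_at_k(cases: &[EvalCase], k: usize) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    cases.iter().map(|c| recall_at_k(c, k)).sum::<f64>() / cases.len() as f64
}

/// Mean reciprocal rank over all cases (0.0 for an empty set).
pub fn mrr(cases: &[EvalCase]) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    cases.iter().map(reciprocal_rank).sum::<f64>() / cases.len() as f64
}
```

A regression gate would then compare mean_recall_at_k(&cases, 10) against the chosen target and fail CI when a ranking change drops below it.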
6. Phasing, Feature Flags, and Non-Goals
Phasing
- Phase 0 (this cycle, v0.8.53): this design note + eval fixture only. No catalog code.
- Phase 1 (v0.9.0): local lexical core (FTS5 bm25() + symbol + path + session-relevance + exact grep fallback), fused via RRF. SQLite index at ~/.codewhale/index/<workspace-hash>.db. Eval harness wired into CI. No network, no model downloads. Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
- Phase 2: incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
- Phase 3 (feature-flagged, off by default): sparse-splade and dense-embed RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
- Phase 4 (feature-flagged): rerank cross-encoder over the fused top-K.
Feature flags
codebase-search-core # Phase 1, default-on once it lands
sparse-splade # Phase 3, default-off
dense-embed # Phase 3, default-off (gated HF download)
rerank # Phase 4, default-off
Non-goals
- No cloud index is required for the core experience; Phase 1 never needs one.
- Not a grep replacement. Exact-token (grep_files) and filename (file_search) search stay first-class; codebase_search complements them and folds exact hits in as a fallback.
- Not a code-rewrite or navigation/LSP tool: it returns ranked locations, nothing more.
Cross-link: WhaleFlow epic
codebase_search is a building block for the long-running multi-agent WhaleFlow (/workflow / /whaleflow) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable reasons[] feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).
Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)
These are design decisions for the eventual one-time catalog edit; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (tool_catalog.rs:169-196) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.
Lifecycle states (represented as const name-sets + an alias table in tool_catalog.rs, NOT a per-ToolSpec field)
| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| active | yes | yes | yes | no |
| deferred | no | yes | yes | no |
| hidden-compatibility | no | no | yes | no |
| deprecated | no | no | yes | yes (replacement notice, metadata only) |
| removed | no | no | no | — |
Deprecated/hidden tools stay registered and dispatchable so old transcripts always replay. A deprecated tool appends a replacement notice to RESULT METADATA only — never to the cached prefix (which would break the invariant).
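The state table above, expressed as const name-sets plus an alias table, might be sketched as follows. This is illustrative only: the set contents reflect the planned diet rather than current code, the checklist_add/update/list replacement names are assumed from the checklist_* pattern, and none of the identifiers below are confirmed to exist in tool_catalog.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle { Active, Deferred, HiddenCompat, Deprecated, Removed }

impl Lifecycle {
    /// Column "Active first turn?".
    pub fn in_first_turn(self) -> bool { self == Lifecycle::Active }
    /// Column "In tool-search?".
    pub fn in_tool_search(self) -> bool { matches!(self, Lifecycle::Active | Lifecycle::Deferred) }
    /// Column "Registered/dispatchable?".
    pub fn dispatchable(self) -> bool { self != Lifecycle::Removed }
    /// Column "Result-metadata notice?".
    pub fn appends_notice(self) -> bool { self == Lifecycle::Deprecated }
}

// ASSUMPTION: small illustrative name-sets, not the full catalog.
const ACTIVE: &[&str] = &["checklist_write"];
const DEFERRED: &[&str] = &["project_map"];
const HIDDEN_COMPAT: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];
const REMOVED: &[&str] = &[];

// Old name -> canonical replacement.
// ASSUMPTION: checklist_add/update/list are inferred names.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
    ("todo_write", "checklist_write"),
    ("todo_add", "checklist_add"),
    ("todo_update", "checklist_update"),
    ("todo_list", "checklist_list"),
];

/// Look up a tool's state; names in no set are unknown (None).
pub fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    let sets = [
        (ACTIVE, Lifecycle::Active),
        (DEFERRED, Lifecycle::Deferred),
        (HIDDEN_COMPAT, Lifecycle::HiddenCompat),
        (DEPRECATED, Lifecycle::Deprecated),
        (REMOVED, Lifecycle::Removed),
    ];
    sets.iter()
        .find(|(names, _)| names.iter().any(|n| *n == name))
        .map(|(_, state)| *state)
}

/// Canonical replacement for an old name, if one is recorded.
pub fn replacement_for(name: &str) -> Option<&'static str> {
    ALIASES.iter().find(|(old, _)| *old == name).map(|(_, new)| *new)
}
```

Keeping the state in name-sets means a lifecycle change touches only these constants, so tool registration code and old transcripts are unaffected.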
Planned diet (documented, not yet coded)
- exec_wait, exec_interact, tts → hidden-compatibility. These are exact duplicates of canonical tools:
  - exec_wait ≡ exec_shell_wait (same ShellWaitTool, registry.rs:526,529); router already unifies them at crates/tui/src/tui/tool_routing.rs:1139-1140.
  - exec_interact ≡ exec_shell_interact (same ShellInteractTool, registry.rs:527,530).
  - tts ≡ speech (same SpeechTool, registry.rs:787-792).
  - Action: drop from active + search, keep registered, identical behavior, no notice.
- todo_* (todo_write/add/update/list) → deprecated → checklist_*. They are deferred twins of checklist_* (same TodoWriteTool::new vs ::checklist, todo.rs:187,194); checklist_write is active, and todo_* are not in the active set. Action: drop from tool-search, keep registered, add replacement notice (metadata only).
- Legacy subagent names (agent_spawn, spawn_agent, agent_result, agent_wait, agent_send_input, send_input, agent_assign, agent_list, agent_cancel, resume_agent, delegate_to_agent) are already #[allow(dead_code)] structs never instantiated outside tests (crates/tui/src/tools/subagent/mod.rs) → already not model-visible. Action: cleanup + guardrail tests, rebased on PR #2684. Note the live internal SubAgentManager methods send_input/cancel/resume (mod.rs:1495,1521,1605) are used by agent_eval/agent_close and must be kept; only the model-visible tool names are retired.
Model-visible subagent surface (unchanged)
Only agent_open, agent_eval, tool_agent, agent_close are registered (registry.rs:1017-1029).
- tool_agent: KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY. It is the fast non-thinking "Fin" executor lane built on deepseek-v4-flash (cf. RLM DEFAULT_CHILD_MODEL = "deepseek-v4-flash", rlm.rs:26). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.
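The gating rule above could be applied when the subagent surface is assembled, as in this sketch. The prefix test on the model id is an assumption made for illustration (the only id given in this note is deepseek-v4-flash); the real gate may rely on provider metadata instead, and both function names are hypothetical.

```rust
/// Model-visible subagent tools, as listed above.
const SUBAGENT_TOOLS: [&str; 4] = ["agent_open", "agent_eval", "tool_agent", "agent_close"];

/// ASSUMPTION: DeepSeek-V4 model ids share the "deepseek-v4" prefix
/// (as in "deepseek-v4-flash"); the real check may differ.
pub fn is_deepseek_v4(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Subagent tools to offer for a given model: `tool_agent` is dropped
/// unless the model passes the DeepSeek-V4 gate; order is preserved.
pub fn subagent_surface(model_id: &str) -> Vec<&'static str> {
    SUBAGENT_TOOLS
        .iter()
        .copied()
        .filter(|name| *name != "tool_agent" || is_deepseek_v4(model_id))
        .collect()
}
```

Because the result depends only on the model id and preserves a fixed order, the offered set stays identical from run to run for any one model.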
Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)
apply_patch / edit_file / write_file / fim_edit; grep_files / file_search / project_map; fetch_url / web.run / web_search; task_shell_*; handle_read / retrieve_tool_result. These serve distinct purposes and stay as-is.
Appendix B — Command-Surface Taxonomy (context)
Each name maps to exactly one thing; codebase_search slots in as concept-level code retrieval alongside these surfaces:
- /memory: small user prefs/facts only (subcommands add/edit/search/clear/doctor, plus later promote; doctor detects the legacy ~/.deepseek path).
- /context: dashboard of all active layers.
- /rules: repo guidance.
- /workflow (/whaleflow): long-running multi-agent (the WhaleFlow epic).
- /overlay: promoted cached-main lessons.
- $<skill-name>: skill invocation prefix; the token is the skill name (e.g. $systematic-debugging, $github:gh-fix-ci).
- codebase_search: concept-level code retrieval (this document).