# `codebase_search` — Local-First Semantic Code Retrieval

> **Status:** Design note + planned eval scaffold. **Code is DEFERRED.**
> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle).
> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.

This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.

---

## 1. Problem

Today CodeWhale ships two complementary code-locating tools and one structure map:

- `file_search` — **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`).
- `grep_files` — **content** search (literal/regex token match).
- `project_map` — a deferred **structure** map.

None of these answer **concept-level** questions where the user does not know the exact token:

- "Where is provider auth resolved?"
- "What enforces the shell approval policy?"
- "Where do mode prompts get assembled?"
- "How does the subagent lifecycle close out a child?"

`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.

**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations.

**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.)

There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies.

---

## 2. Approach Comparison

| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes — SQLite built in via `rusqlite` | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** |
| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** |
| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| **Dense embeddings** | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |

### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)

Each enabled signal produces an independent ranked list; results are merged with RRF (`score(d) = Σ_signals 1/(k + rank_signal(d))`, conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.

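As a rough illustration of the fusion step described above, here is a minimal RRF sketch in plain Rust. The function name `rrf_fuse`, the constant `RRF_K`, and the use of string ids for chunks are assumptions made for this example; they are not taken from the CodeWhale tree.

```rust
use std::collections::HashMap;

/// Conventional RRF damping constant (the `k≈60` from the formula above).
const RRF_K: f64 = 60.0;

/// Hypothetical helper: fuse several ranked lists of chunk ids with
/// Reciprocal-Rank Fusion. Each inner Vec is one signal's ranking,
/// best hit first. Returns (id, fused_score), highest score first.
fn rrf_fuse(rankings: &[Vec<&str>]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64; // ranks are 1-based
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by descending score; break ties by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the sum, so a BM25 score and a cosine similarity never need to share a scale. A chunk that appears in several lists accumulates one term per list, which is how an exact grep hit can lift a result that the lexical signal ranked lower.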
**v0.9.0 Phase 1 signal set (all local, no model downloads):**

1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer.
2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms.
3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`).
4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries.

**Optional later backends (feature-flagged, off by default):**

- `--features sparse-splade` — adds a SPLADE signal list to the RRF.
- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
- `--features rerank` — cross-encoder reranks the fused top-K.

Phase 1 deliberately omits all three ML backends so the tool ships with zero network/model dependency and is reproducible in CI.

---

## 3. Storage & Index

### Location

```
~/.codewhale/index/.db
```

The database file name (the component before `.db`) is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep).

> Migration note (ties to the `/memory doctor` taxonomy in §7): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored (a future `doctor` may offer to migrate, never auto-read).

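One way the per-workspace index path could be derived is sketched below. This note does not name the hash function for the workspace key, so the sketch uses a hand-written 64-bit FNV-1a purely as a dependency-free stand-in. The helper names `fnv1a_64` and `index_db_path`, and the 16-hex-digit file name, are assumptions for illustration only.

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a. Assumption: stands in for whichever stable hash the
/// implementation eventually picks; chosen here because it needs no crate.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: map an already-canonicalized workspace root to
/// its index database under `<home>/.codewhale/index/`.
fn index_db_path(home: &Path, canonical_root: &Path) -> PathBuf {
    let key = fnv1a_64(canonical_root.to_string_lossy().as_bytes());
    home.join(".codewhale")
        .join("index")
        .join(format!("{key:016x}.db"))
}
```

The caller is expected to canonicalize the root first (for example with `std::fs::canonicalize`), so that two spellings of the same checkout resolve to one index while separate worktrees get separate files.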
### Schema sketch

```sql
CREATE TABLE files (
  id            INTEGER PRIMARY KEY,
  path          TEXT NOT NULL UNIQUE,   -- workspace-relative
  mtime_ns      INTEGER NOT NULL,       -- invalidation
  size_bytes    INTEGER NOT NULL,
  content_hash  TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
  lang          TEXT,                   -- detected language
  branch        TEXT                    -- branch at last index (invalidation)
);

CREATE TABLE chunks (
  id          INTEGER PRIMARY KEY,
  file_id     INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  start_line  INTEGER NOT NULL,
  end_line    INTEGER NOT NULL,
  kind        TEXT,                     -- fn | struct | impl | const | doc | block
  symbol      TEXT,                     -- primary symbol name if any
  text        TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
);

-- Lexical index. external-content FTS so we don't duplicate bodies twice.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  text,
  symbol,
  content='chunks',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 2'  -- + identifier pre-split at index time
);

CREATE TABLE symbols (
  id        INTEGER PRIMARY KEY,
  file_id   INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  chunk_id  INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
  name      TEXT NOT NULL,
  kind      TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
  line      INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);

-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
  path         TEXT PRIMARY KEY,
  last_touch   INTEGER NOT NULL,        -- unix ns
  touch_count  INTEGER NOT NULL DEFAULT 1
);
```

Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags.

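The index-time identifier split described above could look roughly like the following. This is a sketch under stated assumptions: the function name `split_identifiers` is hypothetical, and the boundary rules (non-alphanumeric separators, lower-to-upper camelCase breaks, and a break before the last capital of an acronym run) are one plausible choice rather than the planned tokenizer.

```rust
/// Hypothetical pre-tokenizer: lowercases text and splits identifiers so
/// `resolveApiKey` and `resolve_api_key` both become "resolve api key".
fn split_identifiers(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // `_`, `-`, `::`, whitespace and punctuation end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        let prev = if i > 0 { Some(chars[i - 1]) } else { None };
        let next = chars.get(i + 1).copied();
        // camelCase boundary: lowercase letter or digit followed by uppercase.
        let lower_to_upper = c.is_uppercase()
            && prev.map_or(false, |p| p.is_lowercase() || p.is_ascii_digit());
        // Acronym boundary: in "HTTPServer" the split falls before "Server".
        let acronym_end = c.is_uppercase()
            && prev.map_or(false, |p| p.is_uppercase())
            && next.map_or(false, |n| n.is_lowercase());
        if (lower_to_upper || acronym_end) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Running the same function over the query string before building the FTS5 `MATCH` expression would keep both sides of the match in the same normalized vocabulary.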
### Chunking strategy (structure-aware)

Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.

### Invalidation

- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged.
- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift.

### Privacy / trust

- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine.
- **No cloud by default.** Phase 1 has zero network dependency.
- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.

---

## 4. Model-Visible Tool Contract

```jsonc
// codebase_search
{
  "name": "codebase_search",
  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
  "parameters": {
    "query":       { "type": "string", "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
    "max_results": { "type": "integer", "default": 10 },
    "path_glob":   { "type": "string", "description": "Optional path filter, e.g. 'crates/tui/**'." },
    "lang":        { "type": "string", "description": "Optional language filter." },
    "kind":        { "type": "string", "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
  }
}
```

**Result shape — ranked, explainable, auditable:**

```jsonc
{
  "results": [
    {
      "path": "crates/tui/src/config/provider.rs",
      "line": 142,
      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result { ... }",
      "score": 0.91,
      "reasons": [
        "symbol: resolve_api_key matches 'auth/resolve'",
        "lexical: matched tokens [provider, api, key, resolve]",
        "path: component 'provider' matches query",
        "session: file read 2 turns ago"
      ]
    }
  ],
  "backend": "lexical+symbol+path+session", // which signals were fused (RRF)
  "fallback_grep_hits": 1                   // exact-match hits folded in
}
```

`reasons[]` is **mandatory** and is the auditability contract: every result explains *why* it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The `backend` field records which signals were active so results are reproducible given the feature set.

---

## 5. Benchmark / Eval Set

A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line.

| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 |
| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 |
| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~421; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` ≡ `tts` :787-792 |

Each entry is intended to become a `(query, expected_paths[])` row in a fixture (e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). This PR ships the design table only; the fixture and harness are deferred to Phase 1. The Phase 1 harness runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking changes.

---

## 6. Phasing, Feature Flags, and Non-Goals

### Phasing

- **Phase 0 (this cycle, v0.8.53):** this design note + benchmark table only. No fixture, harness, or catalog code.
- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K.

### Feature flags

```
codebase-search-core   # Phase 1, default-on once it lands
sparse-splade          # Phase 3, default-off
dense-embed            # Phase 3, default-off (gated HF download)
rerank                 # Phase 4, default-off
```

### Non-goals

- **No cloud index is required** for the core experience. Ever, for Phase 1.
- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback.
- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.

### Cross-link: WhaleFlow epic

`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).

---

## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)

These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.

### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)

| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| **active** | yes | yes | yes | no |
| **deferred** | no | yes | yes | no |
| **hidden-compatibility** | no | no | yes | no |
| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) |
| **removed** | no | no | no | — |

Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant).

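The state table above can be read as four boolean predicates over set membership. The sketch below encodes it that way, keeping the "const name-sets, not a per-`ToolSpec` field" shape. All identifiers (`Lifecycle`, `lifecycle_of`, the four const sets) are hypothetical, and the set contents only reflect tool names and states mentioned in this note, including the planned (not yet coded) diet.

```rust
/// The five lifecycle states from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    fn active_first_turn(self) -> bool {
        matches!(self, Lifecycle::Active)
    }
    fn in_tool_search(self) -> bool {
        matches!(self, Lifecycle::Active | Lifecycle::Deferred)
    }
    fn dispatchable(self) -> bool {
        !matches!(self, Lifecycle::Removed)
    }
    /// Only deprecated tools attach a replacement notice, and only to
    /// result metadata, never to the cached prefix.
    fn result_metadata_notice(self) -> bool {
        matches!(self, Lifecycle::Deprecated)
    }
}

// Hypothetical name-sets; a name in none of them is treated as removed.
const ACTIVE: &[&str] = &["checklist_write"];
const DEFERRED: &[&str] = &["project_map"];
const HIDDEN_COMPAT: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];

/// State is derived from set membership rather than stored per tool.
fn lifecycle_of(name: &str) -> Lifecycle {
    if ACTIVE.contains(&name) {
        Lifecycle::Active
    } else if DEFERRED.contains(&name) {
        Lifecycle::Deferred
    } else if HIDDEN_COMPAT.contains(&name) {
        Lifecycle::HiddenCompatibility
    } else if DEPRECATED.contains(&name) {
        Lifecycle::Deprecated
    } else {
        Lifecycle::Removed
    }
}
```

Because the sets are `const` slices, their contents and order are fixed at compile time, which fits the requirement that the active first-turn block stay byte-identical between runs.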
### Planned diet (documented, not yet coded)

- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools:
  - `exec_wait` ≡ `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`.
  - `exec_interact` ≡ `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`).
  - `tts` ≡ `speech` (same `SpeechTool`, `registry.rs:787-792`).
  - Action: drop from active + search, keep registered, identical behavior, **no notice**.
- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only).
- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired.

### Model-visible subagent surface (unchanged)

Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`).

- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.

### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)

`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is.

---

## Appendix B — Command-Surface Taxonomy (context)

Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces:

- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path).
- `/context` — dashboard of all active layers.
- `/rules` — repo guidance.
- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic).
- `/overlay` — promoted cached-main lessons.
- `$` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`).
- `codebase_search` — concept-level code retrieval (this document).