codewhale/docs/rfcs/CODEBASE_SEARCH_DESIGN.md

# `codebase_search` — Local-First Semantic Code Retrieval

> **Status:** Design note + planned eval scaffold. **Code is DEFERRED.**
> GitHub #2680 · Milestone **v0.9.0** · This DOC ships in **v0.8.53** (doc-only; no catalog code in this cycle).
> Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.

This document specifies a model-visible `codebase_search` tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding **tool lifecycle** decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.

---

## 1. Problem

Today CodeWhale ships two complementary code-locating tools and one structure map:

- `file_search` — **filename** search (uses the `ignore` crate's `WalkBuilder` for vendor exclusion; default excludes at `crates/tui/src/tools/file_search.rs:210-219`).
- `grep_files` — **content** search (literal/regex token match).
- `project_map` — a deferred **structure** map.

None of these answer **concept-level** questions where the user does not know the exact token:

- "Where is provider auth resolved?"
- "What enforces the shell approval policy?"
- "Where do mode prompts get assembled?"
- "How does the subagent lifecycle close out a child?"

`grep_files` requires you to already know the literal string (`resolve_api_key`, `ApprovalRequirement`, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.

**Goal.** Add a retrieval tool keyed on *intent*, not on exact lexemes, that returns ranked, **explainable** code locations.

**Non-goal / explicit complement.** `codebase_search` does **not** replace `grep_files` or `file_search`. Exact-token and filename lookups remain the right tool when you know what you're looking for. `codebase_search` is the "I don't know the token yet" entry point and always falls back to exact grep so it is never *worse* than grep for a literal query. (See §2 fallback, §6 non-goals.)

There is currently **no** FTS5/BM25, sparse, or dense index in the tree. `rusqlite` is already a workspace dependency (`crates/tui/Cargo.toml`), so the lexical core can be built with no new heavy dependencies.

---

## 2. Approach Comparison

| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| **Lexical FTS5 + `bm25()`** | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes — SQLite built in via `rusqlite` | Medium (with tokenizer help) | Near-zero (existing dep) | **Phase 1 core** |
| **Symbol / path ranking** | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | **Phase 1 core** |
| **Sparse encoders (SPLADE)** | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| **Dense embeddings** | vector of chunk semantics | Optional — embedding model needed | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| **Cross-encoder reranker** | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |

### Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)

Each enabled signal produces an independent ranked list; results are merged with RRF
(`score(d) = Σ_signals 1/(k + rank_signal(d))`, conventional `k≈60`). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
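The fusion step above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: the function name `rrf_fuse`, the use of string ids for documents, and the alphabetical tie-break are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each inner Vec is one signal's ranking, best first; ranks are 1-based.
/// `k` is the damping constant (conventionally around 60).
/// NOTE: hypothetical helper; ids are plain strings for illustration only.
pub fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (idx, doc) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            // A document absent from a list simply contributes nothing for it.
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the sum, which is why raw BM25 scores and cosine similarities never need to be put on a common scale.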

**v0.9.0 Phase 1 signal set (all local, no model downloads):**

1. **Lexical (FTS5 `bm25()`)** over chunk text with an identifier-aware tokenizer.
2. **Symbol rank** — boost chunks whose extracted symbol name fuzzy-matches query terms.
3. **Path rank** — boost chunks whose path components match (e.g. query "auth" → `…/auth/…`, `…/provider…`).
4. **Session-relevance boost** — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
5. **Exact grep fallback** — the query is *also* run as a literal `grep_files`-equivalent pass; any exact hit is fused in and tagged, guaranteeing `codebase_search` ⊇ grep for literal queries.

**Optional later backends (feature-flagged, off by default):**

- `--features sparse-splade` — adds a SPLADE signal list to the RRF.
- `--features dense-embed` — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
- `--features rerank` — cross-encoder reranks the fused top-K.

Phase 1 deliberately omits all three ML backends (sparse, dense, rerank) so the tool ships with zero network/model dependency and is reproducible in CI.

---

## 3. Storage & Index

### Location

```
~/.codewhale/index/<workspace-hash>.db
```

`<workspace-hash>` is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by `rusqlite` (existing dep).
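As a rough sketch of how the path could be derived: this note does not fix the hash algorithm, so the example below assumes 64-bit FNV-1a purely because it is easy to write with the standard library and its output does not vary between Rust releases. The helper names `fnv1a64` and `index_db_path` are hypothetical, and the caller is assumed to have canonicalized the workspace root already.

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a over raw bytes (assumed algorithm, not specified by this note).
/// Unlike std's `DefaultHasher`, the result is fixed across Rust versions.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: maps an already-canonicalized workspace root
/// to `<home>/.codewhale/index/<workspace-hash>.db`.
pub fn index_db_path(home: &Path, canonical_root: &Path) -> PathBuf {
    let digest = fnv1a64(canonical_root.to_string_lossy().as_bytes());
    home.join(".codewhale")
        .join("index")
        .join(format!("{digest:016x}.db"))
}
```

Two worktrees of the same repository have different canonical roots, so they hash to different index files.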

> Migration note (ties to the `/memory doctor` taxonomy in Appendix B): older builds used `~/.deepseek`. The index path is `~/.codewhale` only; if a legacy `~/.deepseek/index` exists it is ignored. A future `doctor` may offer to migrate it, but it is never read automatically.

### Schema sketch

```sql
CREATE TABLE files (
  id            INTEGER PRIMARY KEY,
  path          TEXT NOT NULL UNIQUE,   -- workspace-relative
  mtime_ns      INTEGER NOT NULL,       -- invalidation
  size_bytes    INTEGER NOT NULL,
  content_hash  TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
  lang          TEXT,                   -- detected language
  branch        TEXT                    -- branch at last index (invalidation)
);

CREATE TABLE chunks (
  id          INTEGER PRIMARY KEY,
  file_id     INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  start_line  INTEGER NOT NULL,
  end_line    INTEGER NOT NULL,
  kind        TEXT,                     -- fn | struct | impl | const | doc | block
  symbol      TEXT,                     -- primary symbol name if any
  text        TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
);

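-- Lexical index. External-content FTS5 table, so chunk bodies are stored only once.
-- Note: FTS5 does not keep external-content tables in sync automatically; the
-- indexer (or triggers on `chunks`) must mirror every insert/update/delete.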
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  text,
  symbol,
  content='chunks',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 2'   -- + identifier pre-split at index time
);

CREATE TABLE symbols (
  id        INTEGER PRIMARY KEY,
  file_id   INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  chunk_id  INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
  name      TEXT NOT NULL,
  kind      TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
  line      INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);

-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
  path        TEXT PRIMARY KEY,
  last_touch  INTEGER NOT NULL,         -- unix ns
  touch_count INTEGER NOT NULL DEFAULT 1
);
```

Identifier-aware tokenization (splitting `resolveApiKey` / `resolve_api_key` → `resolve api key`) is applied **at index time** into the FTS `text` column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (`chunks_sparse`, `chunks_vec`) behind their feature flags.
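The index-time split described above might look like the following. It is a sketch under assumptions: the function name `split_identifiers` is hypothetical, and the acronym rule (`HTTPServer` becomes `http server`) is one reasonable choice that this note does not mandate.

```rust
/// Split text on non-alphanumerics and camelCase humps, lowercasing each part.
/// NOTE: hypothetical helper illustrating the index-time pre-split.
pub fn split_identifiers(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separator (`_`, `-`, whitespace, punctuation): flush the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        let prev = if i > 0 { Some(chars[i - 1]) } else { None };
        let next = chars.get(i + 1).copied();
        // Hump boundary: lowercase or digit followed by uppercase ("resolveApi").
        let lower_to_upper =
            c.is_uppercase() && prev.map_or(false, |p| p.is_lowercase() || p.is_ascii_digit());
        // Acronym boundary: last capital of a run, followed by lowercase ("HTTPServer").
        let acronym_end = c.is_uppercase()
            && prev.map_or(false, |p| p.is_uppercase())
            && next.map_or(false, |n| n.is_lowercase());
        if (lower_to_upper || acronym_end) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Because both `resolveApiKey` and `resolve_api_key` normalize to the same three tokens, a query containing "api key" can match either spelling through a plain FTS5 `MATCH`.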

### Chunking strategy (structure-aware)

Chunk on **syntactic boundaries**, not fixed windows: one chunk per top-level item (`fn`, `struct`, `impl` block, `const`, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
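For the line-window fallback, a sketch of the range computation is shown below. The function name `line_windows` and the idea of overlapping consecutive windows are assumptions for illustration; this note does not specify window size or overlap.

```rust
/// Inclusive, 1-based line ranges covering a file of `total_lines` lines.
/// Each window spans up to `window` lines and shares `overlap` lines with the
/// previous one. NOTE: hypothetical helper; sizes are left to the caller.
pub fn line_windows(total_lines: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && overlap < window, "window must be larger than overlap");
    let step = window - overlap;
    let mut ranges = Vec::new();
    let mut start = 1;
    while start <= total_lines {
        let end = (start + window - 1).min(total_lines);
        ranges.push((start, end));
        if end == total_lines {
            break; // the final window already reaches the end of the file
        }
        start += step;
    }
    ranges
}
```

The resulting pairs map directly onto `chunks.start_line` and `chunks.end_line`, with `kind` set to `block`.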

### Invalidation

- **mtime + content_hash:** on index/refresh, skip files whose `mtime_ns` and `content_hash` are unchanged.
- **Branch switch:** `files.branch` is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
- **Generated / vendor exclusion:** reuse the **same** `ignore`-crate `WalkBuilder` exclusion behavior as `file_search` (mirror the defaults at `crates/tui/src/tools/file_search.rs:210-219`: `target/**`, `node_modules/**`, `.git/**`, `DerivedData/**`, `dist/**`, `build/**`, `*.lock`, `*.plist`, plus `.gitignore`). One exclusion source of truth shared with `file_search` avoids index drift.
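One possible reading of the mtime and content-hash rules is sketched here. The type and function names (`FileStamp`, `Refresh`, `plan_refresh`) are hypothetical, and treating an unchanged `mtime_ns` plus `size_bytes` as a fast path that avoids hashing is an assumption of this example, not a decision recorded in this note.

```rust
/// What the index remembers about a file (mirrors columns of `files`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileStamp {
    pub mtime_ns: i64,
    pub size_bytes: u64,
    pub content_hash: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Refresh {
    /// Metadata unchanged: leave the row and its chunks alone.
    Skip,
    /// Metadata changed but content is identical (e.g. after a branch switch):
    /// update the stored stamp, keep the existing chunks.
    TouchOnly,
    /// New file or changed content: re-chunk and re-index.
    Rechunk,
}

/// Decide what to do with one file. `hash_file` is only invoked when the cheap
/// metadata check fails, so unchanged files are never read.
pub fn plan_refresh(
    stored: Option<&FileStamp>,
    mtime_ns: i64,
    size_bytes: u64,
    hash_file: impl FnOnce() -> String,
) -> Refresh {
    let Some(old) = stored else {
        return Refresh::Rechunk; // never indexed before
    };
    if old.mtime_ns == mtime_ns && old.size_bytes == size_bytes {
        return Refresh::Skip;
    }
    if old.content_hash == hash_file() {
        Refresh::TouchOnly
    } else {
        Refresh::Rechunk
    }
}
```

The `TouchOnly` case is what keeps a branch switch cheap: files whose timestamps moved but whose bytes did not are hashed once and then left as they are.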

### Privacy / trust

- **Workspace-scoped, local-only.** The index lives under `~/.codewhale/index/` and never leaves the machine.
- **No cloud by default.** Phase 1 has zero network dependency.
- **Embeddings / Hugging Face downloads are gated.** Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag *and* an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.

---

## 4. Model-Visible Tool Contract

```jsonc
// codebase_search
{
  "name": "codebase_search",
  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
  "parameters": {
    "query":      { "type": "string",  "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
    "max_results":{ "type": "integer", "default": 10 },
    "path_glob":  { "type": "string",  "description": "Optional path filter, e.g. 'crates/tui/**'." },
    "lang":       { "type": "string",  "description": "Optional language filter." },
    "kind":       { "type": "string",  "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
  }
}
```

**Result shape — ranked, explainable, auditable:**

```jsonc
{
  "results": [
    {
      "path": "crates/tui/src/config/provider.rs",
      "line": 142,
      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
      "score": 0.91,
      "reasons": [
        "symbol: resolve_api_key matches 'auth/resolve'",
        "lexical: matched tokens [provider, api, key, resolve]",
        "path: component 'provider' matches query",
        "session: file read 2 turns ago"
      ]
    }
  ],
  "backend": "lexical+symbol+path+session",   // which signals were fused (RRF)
  "fallback_grep_hits": 1                       // exact-match hits folded in
}
```

`reasons[]` is **mandatory** and is the auditability contract: every result states *why* it ranked, that is, which tokens, symbols, or path components matched and whether session recency contributed. This makes retrieval debuggable and lets the model (and a human reviewing a transcript) judge how much to trust each hit. The `backend` field records which signals were active, so results are reproducible given the feature set.

---

## 5. Benchmark / Eval Set

A fixed set of real CodeWhale concept queries, each with the **expected** file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on **file**, not line.

| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | `crates/tui/src/config/` provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | `crates/tui/src/core/engine/tool_catalog.rs` | `DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64 |
| 3 | How are deferred tools hydrated / searched? | `crates/tui/src/core/engine/tool_catalog.rs` | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | `crates/tui/src/core/engine/tool_catalog.rs` | `ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | `crates/tui/src/core/engine/tool_catalog.rs` | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | `crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`) | shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / `AppMode` assembly in `crates/tui/src/tui/` | `AppMode` usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | `crates/tui/src/tools/subagent/mod.rs`; registry registration | registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | `crates/tui/src/tools/rlm.rs` | `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26 |
| 10 | Where is RLM eval / var_handle retrieval (`handle_read`)? | `crates/tui/src/tools/rlm.rs`, `tools/handle.rs` | `VarHandle` import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | `crates/tui/src/tools/skills/mod.rs` | `discover_in_workspace` ~421; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | `crates/tui/src/tools/skills/skill_state.rs` | `SkillStateStore::is_enabled` ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | `crates/tui/src/tools/file_search.rs` | `ignore` WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | `crates/tui/src/tui/ui.rs` | `build_queued_message` ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | `crates/tui/src/tools/registry.rs` | `speech` ≡ `tts` :787-792 |

Each entry is intended to become a `(query, expected_paths[])` row in a fixture
(e.g. `crates/tui/tests/fixtures/codebase_search_eval.jsonl`). This PR ships
the design table only; the fixture and harness are deferred to Phase 1. The
Phase 1 harness runs all queries against the live index and reports recall@k
and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking
changes.
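The two metrics can be computed as in the sketch below. The `EvalRun` struct and function names are hypothetical, since the harness is deferred. The example assumes recall@k is the fraction of a query's expected files found in its top `k` results, averaged over queries, and that MRR uses the rank of the first expected file (0 when none is returned).

```rust
/// One evaluated query: the paths the tool returned (best first) and the
/// paths the fixture expects. NOTE: hypothetical shape for illustration.
pub struct EvalRun<'a> {
    pub ranked: Vec<&'a str>,
    pub expected: Vec<&'a str>,
}

/// Mean over queries of |expected ∩ top-k| / |expected|.
pub fn recall_at_k(runs: &[EvalRun<'_>], k: usize) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|run| {
            if run.expected.is_empty() {
                return 0.0;
            }
            let top_k = &run.ranked[..run.ranked.len().min(k)];
            let found = run.expected.iter().filter(|e| top_k.contains(*e)).count();
            found as f64 / run.expected.len() as f64
        })
        .sum();
    total / runs.len() as f64
}

/// Mean over queries of 1 / (1-based rank of the first expected path).
pub fn mean_reciprocal_rank(runs: &[EvalRun<'_>]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|run| {
            run.ranked
                .iter()
                .position(|p| run.expected.contains(p))
                .map_or(0.0, |idx| 1.0 / (idx + 1) as f64)
        })
        .sum();
    total / runs.len() as f64
}
```

Matching is by path string only, in line with the rule that the eval matches on file, not line.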

---

## 6. Phasing, Feature Flags, and Non-Goals

### Phasing

- **Phase 0 (this cycle, v0.8.53):** this design note + benchmark table only. No fixture, harness, or catalog code.
- **Phase 1 (v0.9.0):** local lexical core — FTS5 `bm25()` + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at `~/.codewhale/index/<workspace-hash>.db`. Eval harness wired into CI. **No network, no model downloads.** Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
- **Phase 2:** incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
- **Phase 3 (feature-flagged, off by default):** `sparse-splade` and `dense-embed` RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
- **Phase 4 (feature-flagged):** `rerank` cross-encoder over the fused top-K.

### Feature flags

```
codebase-search-core    # Phase 1, default-on once it lands
sparse-splade           # Phase 3, default-off
dense-embed             # Phase 3, default-off (gated HF download)
rerank                  # Phase 4, default-off
```

### Non-goals

- **No cloud index.** The core experience never requires one, and Phase 1 makes no network calls at all.
- **Not a grep replacement.** Exact-token (`grep_files`) and filename (`file_search`) search stay first-class; `codebase_search` complements them and folds exact hits in as a fallback.
- Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.

### Cross-link: WhaleFlow epic

`codebase_search` is a building block for the long-running multi-agent **WhaleFlow** (`/workflow` / `/whaleflow`) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable `reasons[]` feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).

---

## Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)

These are **design decisions for the eventual one-time catalog edit**; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (`tool_catalog.rs:169-196`) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.

### Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)

| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| **active** | yes | yes | yes | no |
| **deferred** | no | yes | yes | no |
| **hidden-compatibility** | no | no | yes | no |
| **deprecated** | no | no | yes | yes (replacement notice, **metadata only**) |
| **removed** | no | no | no | — |

Deprecated/hidden tools stay **registered and dispatchable** so old transcripts always replay. A deprecated tool appends a replacement notice to **RESULT METADATA only** — never to the cached prefix (which would break the invariant).
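The state table can be expressed as const name-sets plus lookup functions, roughly as below. This is a sketch only: the set contents are small illustrative subsets built from names mentioned in this note, and every identifier (`Lifecycle`, `lifecycle_of`, `replacement_for`, the const names) is hypothetical rather than the actual layout of `tool_catalog.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Illustrative subsets only; the real name-sets are not reproduced here.
const ACTIVE: &[&str] = &["checklist_write"];
const DEFERRED: &[&str] = &["project_map"];
// (alias, canonical tool): registered, but neither offered nor searchable.
const HIDDEN_ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];
// (deprecated name, replacement named in the result-metadata notice).
const DEPRECATED: &[(&str, &str)] = &[("todo_write", "checklist_write")];

/// Look a tool name up in the name-sets; unknown names count as removed.
pub fn lifecycle_of(name: &str) -> Lifecycle {
    if ACTIVE.iter().any(|n| *n == name) {
        Lifecycle::Active
    } else if DEFERRED.iter().any(|n| *n == name) {
        Lifecycle::Deferred
    } else if HIDDEN_ALIASES.iter().any(|(alias, _)| *alias == name) {
        Lifecycle::HiddenCompatibility
    } else if DEPRECATED.iter().any(|(old, _)| *old == name) {
        Lifecycle::Deprecated
    } else {
        Lifecycle::Removed
    }
}

impl Lifecycle {
    pub fn in_first_turn(self) -> bool {
        self == Lifecycle::Active
    }
    pub fn in_tool_search(self) -> bool {
        matches!(self, Lifecycle::Active | Lifecycle::Deferred)
    }
    pub fn dispatchable(self) -> bool {
        self != Lifecycle::Removed
    }
}

/// Replacement to mention in result metadata; `None` means no notice.
pub fn replacement_for(name: &str) -> Option<&'static str> {
    DEPRECATED
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(_, new)| *new)
}
```

Keeping the state in name-sets rather than on each `ToolSpec` means a lifecycle change edits one table and leaves the serialized first-turn block untouched unless `ACTIVE` itself changes.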

### Planned diet (documented, not yet coded)

- **`exec_wait`, `exec_interact`, `tts` → hidden-compatibility.** These are exact duplicates of canonical tools:
  - `exec_wait` ≡ `exec_shell_wait` (same `ShellWaitTool`, `registry.rs:526,529`); router already unifies them at `crates/tui/src/tui/tool_routing.rs:1139-1140`.
  - `exec_interact` ≡ `exec_shell_interact` (same `ShellInteractTool`, `registry.rs:527,530`).
  - `tts` ≡ `speech` (same `SpeechTool`, `registry.rs:787-792`).
  - Action: drop from active + search, keep registered, identical behavior, **no notice**.
- **`todo_*` (`todo_write/add/update/list`) → deprecated → `checklist_*`.** They are deferred twins of `checklist_*` (same `TodoWriteTool::new` vs `::checklist`, `todo.rs:187,194`); `checklist_write` is active, and `todo_*` are **not** in the active set. Action: drop from tool-search, keep registered, **add replacement notice** (metadata only).
- **Legacy subagent names** (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`, `agent_send_input`, `send_input`, `agent_assign`, `agent_list`, `agent_cancel`, `resume_agent`, `delegate_to_agent`) are already `#[allow(dead_code)]` structs never instantiated outside tests (`crates/tui/src/tools/subagent/mod.rs`) → **already not model-visible.** Action: cleanup + guardrail tests, **rebased on PR #2684.** Note the live internal `SubAgentManager` methods `send_input`/`cancel`/`resume` (`mod.rs:1495,1521,1605`) are used by `agent_eval`/`agent_close` and **must be kept** — only the model-visible *tool* names are retired.

### Model-visible subagent surface (unchanged)

Only `agent_open`, `agent_eval`, `tool_agent`, `agent_close` are registered (`registry.rs:1017-1029`).

- **`tool_agent` — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY.** It is the fast non-thinking "Fin" executor lane built on `deepseek-v4-flash` (cf. RLM `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.

### Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)

`apply_patch` / `edit_file` / `write_file` / `fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` / `web.run` / `web_search`; `task_shell_*`; `handle_read` / `retrieve_tool_result`. These serve distinct purposes and stay as-is.

---

## Appendix B — Command-Surface Taxonomy (context)

Each name maps to exactly one thing; `codebase_search` slots in as concept-level code retrieval alongside these surfaces:

- `/memory` — small user prefs/facts only (subcommands `add`/`edit`/`search`/`clear`/`doctor`, plus later `promote`; `doctor` detects the legacy `~/.deepseek` path).
- `/context` — dashboard of all active layers.
- `/rules` — repo guidance.
- `/workflow` (`/whaleflow`) — long-running multi-agent (the WhaleFlow epic).
- `/overlay` — promoted cached-main lessons.
- `$<skill-name>` — skill invocation prefix; the token *is* the skill name (e.g. `$systematic-debugging`, `$github:gh-fix-ci`).
- `codebase_search` — concept-level code retrieval (this document).