codewhale/docs/rfcs/CODEBASE_SEARCH_DESIGN.md
2026-06-09 23:23:25 -07:00

codebase_search — Local-First Semantic Code Retrieval

Status: Design note + planned eval scaffold. Code is DEFERRED. GitHub #2680 · Milestone v0.9.0 · This DOC ships in v0.8.53 (doc-only; no catalog code in this cycle). Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.

This document specifies a model-visible codebase_search tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding tool lifecycle decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.


1. Problem

Today CodeWhale ships two complementary code-locating tools and one structure map:

  • file_search: filename search (uses the ignore crate's WalkBuilder for vendor exclusion; default excludes at crates/tui/src/tools/file_search.rs:210-219).
  • grep_files: content search (literal/regex token match).
  • project_map: a deferred structure map.

None of these answer concept-level questions where the user does not know the exact token:

  • "Where is provider auth resolved?"
  • "What enforces the shell approval policy?"
  • "Where do mode prompts get assembled?"
  • "How does the subagent lifecycle close out a child?"

grep_files requires you to already know the literal string (resolve_api_key, ApprovalRequirement, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.

Goal. Add a retrieval tool keyed on intent, not on exact lexemes, that returns ranked, explainable code locations.

Non-goal / explicit complement. codebase_search does not replace grep_files or file_search. Exact-token and filename lookups remain the right tool when you know what you're looking for. codebase_search is the "I don't know the token yet" entry point and always falls back to exact grep so it is never worse than grep for a literal query. (See §2 fallback, §6 non-goals.)

There is currently no FTS5/BM25, sparse, or dense index in the tree. rusqlite is already a workspace dependency (crates/tui/Cargo.toml), so the lexical core can be built with no new heavy dependencies.


2. Approach Comparison

| Approach | What it indexes | Local-first? | Recall on paraphrase | Cost / deps | Verdict for v0.9.0 |
|---|---|---|---|---|---|
| Lexical FTS5 + bm25() | tokenized code/comments/identifiers (camelCase/snake_case split) | Yes (SQLite built in via rusqlite) | Medium (with tokenizer help) | Near-zero (existing dep) | Phase 1 core |
| Symbol / path ranking | extracted symbols (fn/struct/impl/const), path components | Yes | Medium-high for "where is X defined" | Low (regex/tree-sitter optional) | Phase 1 core |
| Sparse encoders (SPLADE) | learned term-expansion weights | Yes (model is local but heavy) | High | Model download + inference | Phase 3, feature-flagged |
| Dense embeddings | vector of chunk semantics | Optional (embedding model needed) | Highest on paraphrase | Model + vector store; HF download | Phase 3, feature-flagged |
| Cross-encoder reranker | re-scores top-K candidates | Heavy | Boosts precision@k | Inference cost | Phase 4, feature-flagged |

Each enabled signal produces an independent ranked list; results are merged with RRF (score(d) = Σ_signals 1/(k + rank_signal(d)), conventional k≈60). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
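The fusion step above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the planned implementation: the function name rrf_fuse and the use of path strings as document ids are assumptions, and each input list is assumed to contain a given id at most once.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over signals of 1 / (k + rank_signal(d)).
/// Each inner Vec is one signal's ranked list, best first; ranks are 1-based.
/// Hypothetical helper: ids are plain path strings for illustration only.
pub fn rrf_fuse(signal_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in signal_lists {
        for (idx, doc) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the score, so a BM25 list and a cosine-similarity list can be fused without rescaling either one. A document missing from a signal's list simply contributes nothing for that signal.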

v0.9.0 Phase 1 signal set (all local, no model downloads):

  1. Lexical (FTS5 bm25()) over chunk text with an identifier-aware tokenizer.
  2. Symbol rank — boost chunks whose extracted symbol name fuzzy-matches query terms.
  3. Path rank — boost chunks whose path components match (e.g. query "auth" → …/auth/…, …/provider…).
  4. Session-relevance boost — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
  5. Exact grep fallback — the query is also run as a literal grep_files-equivalent pass; any exact hit is fused in and tagged, guaranteeing codebase_search ⊇ grep for literal queries.

Optional later backends (feature-flagged, off by default):

  • --features sparse-splade — adds a SPLADE signal list to the RRF.
  • --features dense-embed — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
  • --features rerank — cross-encoder reranks the fused top-K.

Phase 1 deliberately omits all three ML backends (sparse, dense, rerank) so the tool ships with zero network/model dependency and is reproducible in CI.


3. Storage & Index

Location

~/.codewhale/index/<workspace-hash>.db

<workspace-hash> is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by rusqlite (existing dep).

Migration note (ties to the /memory doctor taxonomy in Appendix B): older builds used ~/.deepseek. The index path is ~/.codewhale only; if a legacy ~/.deepseek/index exists it is ignored (a future doctor may offer to migrate it, but it is never auto-read).
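As a rough sketch of the per-workspace path derivation: the design does not name the hash behind <workspace-hash>, so the FNV-1a function below is only a dependency-free stand-in, and index_db_path is a hypothetical helper. The root is assumed to be canonicalized by the caller (for example with std::fs::canonicalize).

```rust
use std::path::{Path, PathBuf};

/// FNV-1a 64-bit. Stand-in only: the design does not specify which stable
/// hash produces <workspace-hash>.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Builds <home>/.codewhale/index/<workspace-hash>.db.
/// `canonical_root` is assumed to be already canonical, so symlinked or
/// relative spellings of the same checkout map to one index file.
pub fn index_db_path(home: &Path, canonical_root: &Path) -> PathBuf {
    let digest = fnv1a64(canonical_root.to_string_lossy().as_bytes());
    home.join(".codewhale")
        .join("index")
        .join(format!("{digest:016x}.db"))
}
```

The hash must not change between builds, which is why a fixed algorithm is used here rather than std's DefaultHasher, whose output is not guaranteed stable across Rust releases.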

Schema sketch

CREATE TABLE files (
  id            INTEGER PRIMARY KEY,
  path          TEXT NOT NULL UNIQUE,   -- workspace-relative
  mtime_ns      INTEGER NOT NULL,       -- invalidation
  size_bytes    INTEGER NOT NULL,
  content_hash  TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
  lang          TEXT,                   -- detected language
  branch        TEXT                    -- branch at last index (invalidation)
);

CREATE TABLE chunks (
  id          INTEGER PRIMARY KEY,
  file_id     INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  start_line  INTEGER NOT NULL,
  end_line    INTEGER NOT NULL,
  kind        TEXT,                     -- fn | struct | impl | const | doc | block
  symbol      TEXT,                     -- primary symbol name if any
  text        TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
);

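-- Lexical index. External-content FTS so chunk bodies are not stored twice.
-- Note: FTS5 external-content tables are not synced automatically; every write
-- to chunks must also update chunks_fts (e.g. via triggers).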
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  text,
  symbol,
  content='chunks',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 2'   -- + identifier pre-split at index time
);

CREATE TABLE symbols (
  id        INTEGER PRIMARY KEY,
  file_id   INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  chunk_id  INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
  name      TEXT NOT NULL,
  kind      TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
  line      INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);

-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
  path        TEXT PRIMARY KEY,
  last_touch  INTEGER NOT NULL,         -- unix ns
  touch_count INTEGER NOT NULL DEFAULT 1
);

Identifier-aware tokenization (splitting resolveApiKey / resolve_api_key → resolve api key) is applied at index time into the FTS text column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (chunks_sparse, chunks_vec) behind their feature flags.
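A minimal sketch of the index-time pre-split, assuming a hypothetical split_identifiers helper. The acronym rule (HTTPServer → http server) and the digit boundary are illustrative choices that the design does not spell out.

```rust
/// Splits identifiers into lowercase words so that `resolveApiKey` and
/// `resolve_api_key` both index as "resolve api key".
/// Hypothetical helper; the boundary rules are assumptions.
pub fn split_identifiers(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separators (`_`, `::`, whitespace, punctuation) end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            // `current` is non-empty, so the previous char exists and is alphanumeric.
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Boundary at lower->Upper (apiKey), digit->Upper (sha256Hash),
            // and at the end of an acronym run (HTTPServer).
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                words.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words.join(" ")
}
```

Because the split text is what lands in the FTS column, a query term such as "api" can match either spelling through an ordinary MATCH, with no custom SQLite tokenizer.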

Chunking strategy (structure-aware)

Chunk on syntactic boundaries, not fixed windows: one chunk per top-level item (fn, struct, impl block, const, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.
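The line-window fallback for unparseable files might look like the sketch below. The function name line_windows and the window and overlap parameters are assumptions; the design does not fix either size.

```rust
/// Returns 1-based inclusive (start_line, end_line) windows covering a file
/// of `total_lines` lines. Consecutive windows share `overlap` lines so a
/// construct that straddles a boundary still appears whole in one chunk.
/// Hypothetical fallback chunker; sizes are caller-chosen assumptions.
pub fn line_windows(total_lines: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && overlap < window, "overlap must be smaller than window");
    let step = window - overlap;
    let mut out = Vec::new();
    let mut start = 1;
    while start <= total_lines {
        let end = (start + window - 1).min(total_lines);
        out.push((start, end));
        if end == total_lines {
            break; // the last window already reaches the end of the file
        }
        start += step;
    }
    out
}
```

The returned pairs map directly onto the start_line and end_line columns of the chunks table, with kind set to block.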

Invalidation

  • mtime + content_hash: on index/refresh, skip files whose mtime_ns and content_hash are unchanged.
  • Branch switch: files.branch is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
  • Generated / vendor exclusion: reuse the same ignore-crate WalkBuilder exclusion behavior as file_search (mirror the defaults at crates/tui/src/tools/file_search.rs:210-219: target/**, node_modules/**, .git/**, DerivedData/**, dist/**, build/**, *.lock, *.plist, plus .gitignore). One exclusion source of truth shared with file_search avoids index drift.
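The invalidation rules above can be read as a small decision function. The sketch below is one possible reading, not the planned code: FileStamp, Refresh, and plan_refresh are hypothetical names, and the UpdateMetadata outcome (refresh the files row without re-chunking) is an assumption about how a branch switch with unchanged content would be handled.

```rust
/// Subset of a `files` row used for invalidation (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct FileStamp {
    pub mtime_ns: i64,
    pub content_hash: String,
    pub branch: Option<String>,
}

/// What the indexer should do with one file on refresh (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Refresh {
    Skip,           // mtime, hash, and branch all unchanged
    UpdateMetadata, // same bytes, but mtime or branch moved: no re-chunk
    Rechunk,        // new file or changed content
}

pub fn plan_refresh(stored: Option<&FileStamp>, current: &FileStamp) -> Refresh {
    match stored {
        None => Refresh::Rechunk,
        Some(s) if s.content_hash != current.content_hash => Refresh::Rechunk,
        Some(s) if s.mtime_ns != current.mtime_ns || s.branch != current.branch => {
            Refresh::UpdateMetadata
        }
        Some(_) => Refresh::Skip,
    }
}
```

A branch switch that leaves a file's bytes untouched costs one hash comparison and a row update, which is the sense in which the re-check is cheap.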

Privacy / trust

  • Workspace-scoped, local-only. The index lives under ~/.codewhale/index/ and never leaves the machine.
  • No cloud by default. Phase 1 has zero network dependency.
  • Embeddings / Hugging Face downloads are gated. Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag and an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.

4. Model-Visible Tool Contract

// codebase_search
{
  "name": "codebase_search",
  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
  "parameters": {
    "query":      { "type": "string",  "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
    "max_results":{ "type": "integer", "default": 10 },
    "path_glob":  { "type": "string",  "description": "Optional path filter, e.g. 'crates/tui/**'." },
    "lang":       { "type": "string",  "description": "Optional language filter." },
    "kind":       { "type": "string",  "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
  }
}

Result shape — ranked, explainable, auditable:

{
  "results": [
    {
      "path": "crates/tui/src/config/provider.rs",
      "line": 142,
      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
      "score": 0.91,
      "reasons": [
        "symbol: resolve_api_key matches 'auth/resolve'",
        "lexical: matched tokens [provider, api, key, resolve]",
        "path: component 'provider' matches query",
        "session: file read 2 turns ago"
      ]
    }
  ],
  "backend": "lexical+symbol+path+session",   // which signals were fused (RRF)
  "fallback_grep_hits": 1                       // exact-match hits folded in
}

reasons[] is mandatory and is the auditability contract: every result explains why it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The backend field records which signals were active so results are reproducible given the feature set.
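One way to make the reasons[] requirement hard to violate is to enforce it at construction time. The sketch below assumes hypothetical SearchResult and backend_label definitions that mirror the JSON fields above; it is not the tool's actual type.

```rust
/// Mirrors one entry of `results` in the JSON shape (hypothetical type).
#[derive(Debug)]
pub struct SearchResult {
    pub path: String,
    pub line: u32,
    pub snippet: String,
    pub score: f64,
    pub reasons: Vec<String>,
}

impl SearchResult {
    /// Rejects a result that carries no reasons, since reasons[] is mandatory.
    pub fn new(
        path: &str,
        line: u32,
        snippet: &str,
        score: f64,
        reasons: Vec<String>,
    ) -> Result<Self, String> {
        if reasons.is_empty() {
            return Err(format!("{path}:{line} has no reasons"));
        }
        Ok(Self {
            path: path.to_string(),
            line,
            snippet: snippet.to_string(),
            score,
            reasons,
        })
    }
}

/// Joins the active signal names into the `backend` field.
pub fn backend_label(active_signals: &[&str]) -> String {
    active_signals.join("+")
}
```

With this shape, a ranking signal that cannot explain its contribution cannot produce a result at all, which keeps the auditability contract out of reviewer discipline and in the type.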


5. Benchmark / Eval Set

A fixed set of real CodeWhale concept queries, each with the expected file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on file, not line.

| # | Query (concept, no exact token) | Expected file(s) | Anchor |
|---|---|---|---|
| 1 | Where is provider auth / API key resolved? | crates/tui/src/config/ provider auth path | provider/config module |
| 2 | What is the first-turn active tool set? | crates/tui/src/core/engine/tool_catalog.rs | DEFAULT_ACTIVE_NATIVE_TOOLS :37-64 |
| 3 | How are deferred tools hydrated / searched? | crates/tui/src/core/engine/tool_catalog.rs | tool_search regex/bm25 :26-35 |
| 4 | Why does Arcee get a reduced tool set? (WAF workaround) | crates/tui/src/core/engine/tool_catalog.rs | ARCEE_FIRST_TURN_NATIVE_TOOLS :106-115 |
| 5 | What keeps the tool catalog byte-stable for the KV prefix cache? | crates/tui/src/core/engine/tool_catalog.rs | catalog-head invariant :169-196 |
| 6 | Where is the shell approval / cancel policy? | crates/tui/src/tools/shell.rs + tools/spec.rs (ApprovalRequirement) | shell tools, ShellWaitTool/ShellInteractTool registry.rs:524-531 |
| 7 | Where are mode prompts (Plan/Agent/YOLO) assembled? | mode prompt / AppMode assembly in crates/tui/src/tui/ | AppMode usage |
| 8 | How does the subagent lifecycle open/eval/close a child? | crates/tui/src/tools/subagent/mod.rs; registry registration | registry.rs:1017-1029; send_input/cancel/resume mod.rs:1495,1521,1605 |
| 9 | What is the RLM session surface and its default child model? | crates/tui/src/tools/rlm.rs | DEFAULT_CHILD_MODEL = "deepseek-v4-flash" :26 |
| 10 | Where is RLM eval / var_handle retrieval (handle_read)? | crates/tui/src/tools/rlm.rs, tools/handle.rs | VarHandle import rlm.rs:21 |
| 11 | Where are skills discovered and parsed in the workspace? | crates/tui/src/tools/skills/mod.rs | discover_in_workspace ~421; skill struct ~382-388 |
| 12 | Where is skill enable-state stored / checked? | crates/tui/src/tools/skills/skill_state.rs | SkillStateStore::is_enabled ~73 |
| 13 | How does vendor/generated exclusion work for file walking? | crates/tui/src/tools/file_search.rs | ignore WalkBuilder excludes :210-219 |
| 14 | Where is the queued user message built on submit? | crates/tui/src/tui/ui.rs | build_queued_message ~4721 |
| 15 | Where are speech / TTS tools registered? (duplicate names) | crates/tui/src/tools/registry.rs | speech / tts :787-792 |

Each entry is intended to become a (query, expected_paths[]) row in a fixture (e.g. crates/tui/tests/fixtures/codebase_search_eval.jsonl). This PR ships the design table only; the fixture and harness are deferred to Phase 1. The Phase 1 harness runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking changes.
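The two metrics the harness would report can be sketched as follows. This is a std-only illustration with assumed names (first_hit_rank, recall_at_k, mrr); each run pairs a ranked result list with the expected_paths of one fixture row. Recall@k is taken here as the fraction of expected files found in the top k, averaged over queries, which is one common definition and an assumption about what the harness will use.

```rust
/// 1-based rank of the first expected path in `ranked`, if any.
fn first_hit_rank(ranked: &[&str], expected: &[&str]) -> Option<usize> {
    ranked.iter().position(|p| expected.contains(p)).map(|i| i + 1)
}

/// Mean over queries of |expected ∩ top-k| / |expected|.
/// Each run is (ranked_paths, expected_paths).
pub fn recall_at_k(runs: &[(Vec<&str>, Vec<&str>)], k: usize) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, expected)| {
            if expected.is_empty() {
                return 0.0;
            }
            let top = &ranked[..ranked.len().min(k)];
            let found = expected.iter().filter(|e| top.contains(e)).count();
            found as f64 / expected.len() as f64
        })
        .sum();
    total / runs.len() as f64
}

/// Mean reciprocal rank: average of 1 / rank of the first expected hit,
/// counting 0 for queries with no hit.
pub fn mrr(runs: &[(Vec<&str>, Vec<&str>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, expected)| first_hit_rank(ranked, expected).map_or(0.0, |r| 1.0 / r as f64))
        .sum();
    total / runs.len() as f64
}
```

Matching is on file path only, consistent with the note that line numbers in the table are indicative anchors rather than eval targets.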


6. Phasing, Feature Flags, and Non-Goals

Phasing

  • Phase 0 (this cycle, v0.8.53): this design note + benchmark table only. No fixture, harness, or catalog code.
  • Phase 1 (v0.9.0): local lexical core — FTS5 bm25() + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at ~/.codewhale/index/<workspace-hash>.db. Eval harness wired into CI. No network, no model downloads. Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
  • Phase 2: incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
  • Phase 3 (feature-flagged, off by default): sparse-splade and dense-embed RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
  • Phase 4 (feature-flagged): rerank cross-encoder over the fused top-K.

Feature flags

codebase-search-core    # Phase 1, default-on once it lands
sparse-splade           # Phase 3, default-off
dense-embed             # Phase 3, default-off (gated HF download)
rerank                  # Phase 4, default-off

Non-goals

  • No cloud index. Phase 1 never requires one for the core experience.
  • Not a grep replacement. Exact-token (grep_files) and filename (file_search) search stay first-class; codebase_search complements them and folds exact hits in as a fallback.
  • Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.

codebase_search is a building block for the long-running multi-agent WhaleFlow (/workflow / /whaleflow) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable reasons[] feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).


Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)

These are design decisions for the eventual one-time catalog edit; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (tool_catalog.rs:169-196) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.

Lifecycle states (represented as const name-sets + an alias table in tool_catalog.rs, NOT a per-ToolSpec field)

| State | Active first turn? | In tool-search? | Registered/dispatchable? | Result-metadata notice? |
|---|---|---|---|---|
| active | yes | yes | yes | no |
| deferred | no | yes | yes | no |
| hidden-compatibility | no | no | yes | no |
| deprecated | no | no | yes | yes (replacement notice, metadata only) |
| removed | no | no | no | |

Deprecated/hidden tools stay registered and dispatchable so old transcripts always replay. A deprecated tool appends a replacement notice to RESULT METADATA only — never to the cached prefix (which would break the invariant).

Planned diet (documented, not yet coded)

  • exec_wait, exec_interact, tts → hidden-compatibility. These are exact duplicates of canonical tools:
    • exec_wait ≡ exec_shell_wait (same ShellWaitTool, registry.rs:526,529); router already unifies them at crates/tui/src/tui/tool_routing.rs:1139-1140.
    • exec_interact ≡ exec_shell_interact (same ShellInteractTool, registry.rs:527,530).
    • tts ≡ speech (same SpeechTool, registry.rs:787-792).
    • Action: drop from active + search, keep registered, identical behavior, no notice.
  • todo_* (todo_write/add/update/list) → deprecated → checklist_*. They are deferred twins of checklist_* (same TodoWriteTool::new vs ::checklist, todo.rs:187,194); checklist_write is active, and todo_* are not in the active set. Action: drop from tool-search, keep registered, add replacement notice (metadata only).
  • Legacy subagent names (agent_spawn, spawn_agent, agent_result, agent_wait, agent_send_input, send_input, agent_assign, agent_list, agent_cancel, resume_agent, delegate_to_agent) are already #[allow(dead_code)] structs never instantiated outside tests (crates/tui/src/tools/subagent/mod.rs) → already not model-visible. Action: cleanup + guardrail tests, rebased on PR #2684. Note the live internal SubAgentManager methods send_input/cancel/resume (mod.rs:1495,1521,1605) are used by agent_eval/agent_close and must be kept — only the model-visible tool names are retired.
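To illustrate the "const name-sets + alias table" representation, here is a sketch under stated assumptions. Lifecycle, the constant names, lifecycle_of, and canonical_name are all hypothetical; only the tool names and the state semantics come from this appendix. The todo_* set is listed without aliases because the exact checklist_* counterparts are not enumerated above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    pub fn active_first_turn(self) -> bool {
        self == Self::Active
    }
    pub fn in_tool_search(self) -> bool {
        matches!(self, Self::Active | Self::Deferred)
    }
    pub fn dispatchable(self) -> bool {
        self != Self::Removed
    }
    /// Replacement notice goes to result metadata only, never the cached prefix.
    pub fn emits_notice(self) -> bool {
        self == Self::Deprecated
    }
}

// Name-sets from the planned diet (hypothetical constant names).
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write", "todo_add", "todo_update", "todo_list"];

// Alias table: duplicate name -> canonical name.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];

/// Returns the planned state for names covered by this sketch, else None.
pub fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    if HIDDEN_COMPATIBILITY.contains(&name) {
        Some(Lifecycle::HiddenCompatibility)
    } else if DEPRECATED.contains(&name) {
        Some(Lifecycle::Deprecated)
    } else {
        None
    }
}

/// Maps a duplicate name to its canonical tool; other names pass through.
pub fn canonical_name(name: &str) -> &str {
    ALIASES.iter().find(|pair| pair.0 == name).map_or(name, |&(_, new)| new)
}
```

Keeping the states in const sets rather than in a per-tool field means the active first-turn block is assembled from fixed data, which fits the requirement that it stay byte-identical from run to run.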

Model-visible subagent surface (unchanged)

Only agent_open, agent_eval, tool_agent, agent_close are registered (registry.rs:1017-1029).

  • tool_agent — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY. It is the fast non-thinking "Fin" executor lane built on deepseek-v4-flash (cf. RLM DEFAULT_CHILD_MODEL = "deepseek-v4-flash", rlm.rs:26). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.

Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)

apply_patch / edit_file / write_file / fim_edit; grep_files / file_search / project_map; fetch_url / web.run / web_search; task_shell_*; handle_read / retrieve_tool_result. These serve distinct purposes and stay as-is.


Appendix B — Command-Surface Taxonomy (context)

Each name maps to exactly one thing; codebase_search slots in as concept-level code retrieval alongside these surfaces:

  • /memory — small user prefs/facts only (subcommands add/edit/search/clear/doctor, plus later promote; doctor detects the legacy ~/.deepseek path).
  • /context — dashboard of all active layers.
  • /rules — repo guidance.
  • /workflow (/whaleflow) — long-running multi-agent (the WhaleFlow epic).
  • /overlay — promoted cached-main lessons.
  • $<skill-name> — skill invocation prefix; the token is the skill name (e.g. $systematic-debugging, $github:gh-fix-ci).
  • codebase_search — concept-level code retrieval (this document).