dgf1988/codewhale

Hunter B 6551106e79 docs: move internal design docs into docs/rfcs/

2026-06-09 23:23:25 -07:00

`codebase_search` — Local-First Semantic Code Retrieval

Status: Design note + planned eval scaffold. Code is DEFERRED. GitHub #2680 · Milestone v0.9.0 · This DOC ships in v0.8.53 (doc-only; no catalog code in this cycle). Related in-flight: PR #2684 (subagent role vocab / lifecycle signals / eval ergonomics), PR #2685 (git history active + RLM/field errors). This note must not contradict either.

This document specifies a model-visible codebase_search tool for concept-level code retrieval, the storage/index that backs it, a verifiable benchmark set, and a phased feature-flag plan. It also records the surrounding tool lifecycle decisions for v0.8.53 so the eventual catalog edit is a single deterministic change.

1. Problem

Today CodeWhale ships two complementary code-locating tools and one structure map:

file_search — filename search (uses the ignore crate's WalkBuilder for vendor exclusion; default excludes at crates/tui/src/tools/file_search.rs:210-219).
grep_files — content search (literal/regex token match).
project_map — a deferred structure map.

None of these answer concept-level questions where the user does not know the exact token:

"Where is provider auth resolved?"
"What enforces the shell approval policy?"
"Where do mode prompts get assembled?"
"How does the subagent lifecycle close out a child?"

grep_files requires you to already know the literal string (resolve_api_key, ApprovalRequirement, …). When the concept and the identifier diverge — which is the normal case for an unfamiliar area of the tree — grep returns nothing useful and the agent burns turns guessing tokens.

Goal. Add a retrieval tool keyed on intent, not on exact lexemes, that returns ranked, explainable code locations.

Non-goal / explicit complement. codebase_search does not replace grep_files or file_search. Exact-token and filename lookups remain the right tool when you know what you're looking for. codebase_search is the "I don't know the token yet" entry point and always falls back to exact grep so it is never worse than grep for a literal query. (See §2 fallback, §6 non-goals.)

There is currently no FTS5/BM25, sparse, or dense index in the tree. rusqlite is already a workspace dependency (crates/tui/Cargo.toml), so the lexical core can be built with no new heavy dependencies.

2. Approach Comparison

Approach	What it indexes	Local-first?	Recall on paraphrase	Cost / deps	Verdict for v0.9.0
Lexical FTS5 + `bm25()`	tokenized code/comments/identifiers (camelCase/snake_case split)	Yes — SQLite built in via `rusqlite`	Medium (with tokenizer help)	Near-zero (existing dep)	Phase 1 core
Symbol / path ranking	extracted symbols (fn/struct/impl/const), path components	Yes	Medium-high for "where is X defined"	Low (regex/tree-sitter optional)	Phase 1 core
Sparse encoders (SPLADE)	learned term-expansion weights	Yes (model is local but heavy)	High	Model download + inference	Phase 3, feature-flagged
Dense embeddings	vector of chunk semantics	Optional — embedding model needed	Highest on paraphrase	Model + vector store; HF download	Phase 3, feature-flagged
Cross-encoder reranker	re-scores top-K candidates	Heavy	Boosts precision@k	Inference cost	Phase 4, feature-flagged

Recommended architecture: Hybrid via Reciprocal-Rank Fusion (RRF)

Each enabled signal produces an independent ranked list; results are merged with RRF (score(d) = Σ_signals 1/(k + rank_signal(d)), conventional k≈60). RRF is chosen because it fuses heterogeneous scorers (BM25 scores, integer symbol ranks, path-depth ranks, cosine similarities) without needing score normalization across incomparable scales.
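As an illustration of the fusion step, here is a minimal std-only sketch. The function name `rrf_fuse`, the use of string IDs for chunks, and the 1-based rank convention are assumptions made for this example; the design note does not fix them.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal-Rank Fusion.
/// Each inner Vec is one signal's ranking, best first.
/// Ranks are treated as 1-based (an assumption of this sketch).
fn rrf_fuse(signals: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in signals {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = idx as f64 + 1.0;
            // A document missing from a signal simply contributes nothing for it.
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Highest fused score first; ties broken by ID so the output is deterministic.
    fused.sort_by(|a, b| {
        b.1.partial_cmp(&a.1)
            .unwrap()
            .then_with(|| a.0.cmp(&b.0))
    });
    fused
}
```

Only ranks enter the sum, never raw scores, which is why BM25 values and cosine similarities can be mixed without normalization.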

v0.9.0 Phase 1 signal set (all local, no model downloads):

Lexical (FTS5 bm25()) over chunk text with an identifier-aware tokenizer.
Symbol rank — boost chunks whose extracted symbol name fuzzy-matches query terms.
Path rank — boost chunks whose path components match (e.g. query "auth" → …/auth/…, …/provider…).
Session-relevance boost — recently read/edited files in the current session rank higher (mtime + session touch log). This mirrors how a human grounds "where is X" against what they were just looking at.
Exact grep fallback — the query is also run as a literal grep_files-equivalent pass; any exact hit is fused in and tagged, guaranteeing codebase_search ⊇ grep for literal queries.

Optional later backends (feature-flagged, off by default):

--features sparse-splade — adds a SPLADE signal list to the RRF.
--features dense-embed — adds a dense vector signal list (embedding model gated behind the same workset/feature flag as any HF download; see §3 Privacy).
--features rerank — cross-encoder reranks the fused top-K.

Phase 1 deliberately omits all three ML backends (sparse, dense, reranker) so the tool ships with zero network/model dependency and is reproducible in CI.

3. Storage & Index

Location

~/.codewhale/index/<workspace-hash>.db

<workspace-hash> is a stable hash of the canonical workspace root, so each checkout/worktree gets its own index and nothing is shared across unrelated projects. Backed by rusqlite (existing dep).

Migration note (ties to the /memory doctor taxonomy in Appendix B): older builds used ~/.deepseek. The index path is ~/.codewhale only; if a legacy ~/.deepseek/index exists it is ignored (a future doctor may offer to migrate it, but it is never read automatically).
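The index location described above could be derived roughly as follows. This is a sketch only: the note does not say which hash produces the workspace hash, so 64-bit FNV-1a is used here purely as a stand-in, and the helper names `workspace_hash` and `index_db_path` are hypothetical. The caller is assumed to pass an already-canonicalized workspace root.

```rust
use std::path::{Path, PathBuf};

/// Stable hex digest of the canonical workspace root.
/// FNV-1a (64-bit) is an assumption of this sketch, not the project's choice.
fn workspace_hash(canonical_root: &Path) -> String {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let text = canonical_root.to_string_lossy();
    let mut hash = FNV_OFFSET;
    for byte in text.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    format!("{hash:016x}")
}

/// Builds `<home>/.codewhale/index/<workspace-hash>.db`.
fn index_db_path(home: &Path, canonical_root: &Path) -> PathBuf {
    home.join(".codewhale")
        .join("index")
        .join(format!("{}.db", workspace_hash(canonical_root)))
}
```

Because the digest depends only on the root path, two worktrees of the same repository get separate index files, while reopening the same checkout always resolves to the same file.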

Schema sketch

CREATE TABLE files (
  id            INTEGER PRIMARY KEY,
  path          TEXT NOT NULL UNIQUE,   -- workspace-relative
  mtime_ns      INTEGER NOT NULL,       -- invalidation
  size_bytes    INTEGER NOT NULL,
  content_hash  TEXT NOT NULL,          -- blake3; skip re-chunk if unchanged
  lang          TEXT,                   -- detected language
  branch        TEXT                    -- branch at last index (invalidation)
);

CREATE TABLE chunks (
  id          INTEGER PRIMARY KEY,
  file_id     INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  start_line  INTEGER NOT NULL,
  end_line    INTEGER NOT NULL,
  kind        TEXT,                     -- fn | struct | impl | const | doc | block
  symbol      TEXT,                     -- primary symbol name if any
  text        TEXT NOT NULL             -- chunk body (identifier-split copy for FTS)
);

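-- Lexical index. External-content FTS5 table, so chunk bodies are not stored twice.
-- External-content tables are not kept in sync automatically: the indexer must
-- update chunks_fts whenever chunks changes (for example via triggers).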
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  text,
  symbol,
  content='chunks',
  content_rowid='id',
  tokenize = 'unicode61 remove_diacritics 2'   -- + identifier pre-split at index time
);

CREATE TABLE symbols (
  id        INTEGER PRIMARY KEY,
  file_id   INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
  chunk_id  INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
  name      TEXT NOT NULL,
  kind      TEXT NOT NULL,              -- fn | struct | enum | trait | impl | const | macro
  line      INTEGER NOT NULL
);
CREATE INDEX symbols_name ON symbols(name);

-- Session relevance: lightweight touch log, written by the session, decayed on read.
CREATE TABLE session_touch (
  path        TEXT PRIMARY KEY,
  last_touch  INTEGER NOT NULL,         -- unix ns
  touch_count INTEGER NOT NULL DEFAULT 1
);

Identifier-aware tokenization (splitting resolveApiKey / resolve_api_key → resolve api key) is applied at index time into the FTS text column so the query side stays a plain FTS5 MATCH. SPLADE/dense backends, when enabled, add their own sidecar tables (chunks_sparse, chunks_vec) behind their feature flags.
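The index-time identifier split can be sketched as below. The helper names `split_identifiers` and `fts_text` are hypothetical, and the handling of acronym runs (`HTTPServer` → `http server`) is an assumption of this example rather than something the note specifies.

```rust
/// Split text into lowercase words, breaking on non-alphanumeric characters,
/// on lower-to-upper case changes, and at the end of an acronym run.
fn split_identifiers(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Underscores, punctuation, and whitespace all end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // "resolveApi" splits before 'A'; "HTTPServer" splits before 'S'.
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                words.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Text that would be stored in the FTS column for a chunk.
fn fts_text(chunk: &str) -> String {
    split_identifiers(chunk).join(" ")
}
```

With this applied at index time, a query such as "api key" matches both `resolveApiKey` and `resolve_api_key` through an ordinary FTS5 MATCH.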

Chunking strategy (structure-aware)

Chunk on syntactic boundaries, not fixed windows: one chunk per top-level item (fn, struct, impl block, const, doc-comment block), falling back to a sliding window for unparseable files. Structure-aware chunks keep a function and its doc comment together, so a paraphrase query lands on a coherent unit rather than a mid-function slice. A tree-sitter grammar per language is the long-term plan; Phase 1 may start with a brace/indent + regex heuristic for Rust/TS and a line-window fallback elsewhere.

Invalidation

mtime + content_hash: on index/refresh, skip files whose mtime_ns and content_hash are unchanged.
Branch switch: files.branch is recorded; on a branch change the affected files are re-checked (cheap because of content_hash).
Generated / vendor exclusion: reuse the same ignore-crate WalkBuilder exclusion behavior as file_search (mirror the defaults at crates/tui/src/tools/file_search.rs:210-219: target/**, node_modules/**, .git/**, DerivedData/**, dist/**, build/**, *.lock, *.plist, plus .gitignore). One exclusion source of truth shared with file_search avoids index drift.
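The mtime + content_hash rule above can be expressed as a small decision function. This is a sketch under assumptions: the `Refresh` enum, the `StoredFile` struct, and the `TouchOnly` outcome (refresh the stored mtime without re-chunking when only the timestamp moved) are illustrative names and choices, not taken from the codebase.

```rust
/// What the indexer should do with one file during a refresh.
#[derive(Debug, PartialEq)]
enum Refresh {
    /// mtime and content hash both unchanged: nothing to do.
    Skip,
    /// Content identical but mtime moved (e.g. a branch switch): update the row only.
    TouchOnly,
    /// New file or changed content: re-chunk and re-index.
    Rechunk,
}

/// The subset of a `files` row needed for the decision.
struct StoredFile {
    mtime_ns: i64,
    content_hash: String,
}

fn plan_refresh(stored: Option<&StoredFile>, mtime_ns: i64, content_hash: &str) -> Refresh {
    match stored {
        // Never indexed before.
        None => Refresh::Rechunk,
        Some(row) if row.content_hash != content_hash => Refresh::Rechunk,
        Some(row) if row.mtime_ns != mtime_ns => Refresh::TouchOnly,
        Some(_) => Refresh::Skip,
    }
}
```

The `TouchOnly` case is what keeps a branch switch cheap: files whose bytes did not change are re-checked but never re-chunked.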

Privacy / trust

Workspace-scoped, local-only. The index lives under ~/.codewhale/index/ and never leaves the machine.
No cloud by default. Phase 1 has zero network dependency.
Embeddings / Hugging Face downloads are gated. Any SPLADE/dense backend (which may pull a model from HF) is behind a feature flag and an explicit workset/opt-in, consistent with how the rest of CodeWhale treats network model access. The core tool never downloads anything.

4. Model-Visible Tool Contract

// codebase_search
{
  "name": "codebase_search",
  "description": "Concept-level code retrieval. Find code by what it does, even without exact tokens. Complements grep_files (exact text) and file_search (filenames).",
  "parameters": {
    "query":      { "type": "string",  "description": "Natural-language or concept query, e.g. 'where is provider auth resolved'." },
    "max_results":{ "type": "integer", "default": 10 },
    "path_glob":  { "type": "string",  "description": "Optional path filter, e.g. 'crates/tui/**'." },
    "lang":       { "type": "string",  "description": "Optional language filter." },
    "kind":       { "type": "string",  "description": "Optional symbol-kind filter: fn|struct|impl|const|..." }
  }
}

Result shape — ranked, explainable, auditable:

{
  "results": [
    {
      "path": "crates/tui/src/config/provider.rs",
      "line": 142,
      "snippet": "fn resolve_api_key(provider: ApiProvider, env: &Env) -> Result<Secret> { ... }",
      "score": 0.91,
      "reasons": [
        "symbol: resolve_api_key matches 'auth/resolve'",
        "lexical: matched tokens [provider, api, key, resolve]",
        "path: component 'provider' matches query",
        "session: file read 2 turns ago"
      ]
    }
  ],
  "backend": "lexical+symbol+path+session",   // which signals were fused (RRF)
  "fallback_grep_hits": 1                       // exact-match hits folded in
}

reasons[] is mandatory and is the auditability contract: every result explains why it ranked — which tokens/symbols/path components matched and whether session-recency contributed. This makes retrieval debuggable and lets the model (and the human reviewing a transcript) judge trust. The backend field records which signals were active so results are reproducible given the feature set.

5. Benchmark / Eval Set

A fixed set of real CodeWhale concept queries, each with the expected file(s) verified against the current tree, so retrieval quality is measurable (recall@k / MRR). Line numbers are indicative anchors at time of writing; the eval matches on file, not line.

#	Query (concept, no exact token)	Expected file(s)	Anchor
1	Where is provider auth / API key resolved?	`crates/tui/src/config/` provider auth path	provider/config module
2	What is the first-turn active tool set?	`crates/tui/src/core/engine/tool_catalog.rs`	`DEFAULT_ACTIVE_NATIVE_TOOLS` :37-64
3	How are deferred tools hydrated / searched?	`crates/tui/src/core/engine/tool_catalog.rs`	tool_search regex/bm25 :26-35
4	Why does Arcee get a reduced tool set? (WAF workaround)	`crates/tui/src/core/engine/tool_catalog.rs`	`ARCEE_FIRST_TURN_NATIVE_TOOLS` :106-115
5	What keeps the tool catalog byte-stable for the KV prefix cache?	`crates/tui/src/core/engine/tool_catalog.rs`	catalog-head invariant :169-196
6	Where is the shell approval / cancel policy?	`crates/tui/src/tools/shell.rs` + `tools/spec.rs` (`ApprovalRequirement`)	shell tools, `ShellWaitTool`/`ShellInteractTool` registry.rs:524-531
7	Where are mode prompts (Plan/Agent/YOLO) assembled?	mode prompt / `AppMode` assembly in `crates/tui/src/tui/`	`AppMode` usage
8	How does the subagent lifecycle open/eval/close a child?	`crates/tui/src/tools/subagent/mod.rs`; registry registration	registry.rs:1017-1029; `send_input`/`cancel`/`resume` mod.rs:1495,1521,1605
9	What is the RLM session surface and its default child model?	`crates/tui/src/tools/rlm.rs`	`DEFAULT_CHILD_MODEL = "deepseek-v4-flash"` :26
10	Where is RLM eval / var_handle retrieval (`handle_read`)?	`crates/tui/src/tools/rlm.rs`, `tools/handle.rs`	`VarHandle` import rlm.rs:21
11	Where are skills discovered and parsed in the workspace?	`crates/tui/src/tools/skills/mod.rs`	`discover_in_workspace` ~421; skill struct ~382-388
12	Where is skill enable-state stored / checked?	`crates/tui/src/tools/skills/skill_state.rs`	`SkillStateStore::is_enabled` ~73
13	How does vendor/generated exclusion work for file walking?	`crates/tui/src/tools/file_search.rs`	`ignore` WalkBuilder excludes :210-219
14	Where is the queued user message built on submit?	`crates/tui/src/tui/ui.rs`	`build_queued_message` ~4721
15	Where are speech / TTS tools registered? (duplicate names)	`crates/tui/src/tools/registry.rs`	`speech` ≡ `tts` :787-792

Each entry is intended to become a (query, expected_paths[]) row in a fixture (e.g. crates/tui/tests/fixtures/codebase_search_eval.jsonl). This PR ships the design table only; the fixture and harness are deferred to Phase 1. The Phase 1 harness runs all queries against the live index and reports recall@k and MRR; a regression bar (e.g. recall@10 >= target) gates future ranking changes.
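For reference, the two metrics the harness is meant to report can be sketched as follows. Function names and the path-string representation are assumptions for this example; matching is on file path only, in line with the eval rule stated above.

```rust
/// Fraction of the expected paths that appear among the top-k ranked paths.
fn recall_at_k(ranked: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &ranked[..k.min(ranked.len())];
    let hits = expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / expected.len() as f64
}

/// 1 / rank of the first expected path in the ranking (1-based), or 0 if none appears.
fn reciprocal_rank(ranked: &[&str], expected: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|path| expected.contains(path))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// Mean reciprocal rank over (ranked, expected) pairs, one pair per query.
fn mean_reciprocal_rank(runs: &[(Vec<&str>, Vec<&str>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, expected)| reciprocal_rank(ranked, expected))
        .sum();
    total / runs.len() as f64
}
```

A regression gate would then compare the mean of `recall_at_k(.., 10)` across all fixture rows against the chosen target.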

6. Phasing, Feature Flags, and Non-Goals

Phasing

Phase 0 (this cycle, v0.8.53): this design note + benchmark table only. No fixture, harness, or catalog code.
Phase 1 (v0.9.0): local lexical core — FTS5 bm25() + symbol + path + session-relevance + exact grep fallback, fused via RRF. SQLite index at ~/.codewhale/index/<workspace-hash>.db. Eval harness wired into CI. No network, no model downloads. Tool registered as deferred (hydrated via tool-search) initially; promotion to the active first-turn set is a separate, deliberate decision (see lifecycle below) because of the prefix-cache invariant.
Phase 2: incremental/background reindex, branch-aware invalidation hardening, richer chunkers (tree-sitter per language).
Phase 3 (feature-flagged, off by default): sparse-splade and dense-embed RRF signals. Embedding/HF downloads behind the flag + workset opt-in (§3 Privacy).
Phase 4 (feature-flagged): rerank cross-encoder over the fused top-K.

Feature flags

codebase-search-core    # Phase 1, default-on once it lands
sparse-splade           # Phase 3, default-off
dense-embed             # Phase 3, default-off (gated HF download)
rerank                  # Phase 4, default-off

Non-goals

No cloud index. The core (Phase 1) experience never requires one.
Not a grep replacement. Exact-token (grep_files) and filename (file_search) search stay first-class; codebase_search complements them and folds exact hits in as a fallback.
Not a code-rewrite or navigation/LSP tool — it returns ranked locations, nothing more.

Cross-link: WhaleFlow epic

codebase_search is a building block for the long-running multi-agent WhaleFlow (/workflow / /whaleflow) epic: a planning or executor lane can ground itself ("find where X is handled") without spending shell/grep turns, and the explainable reasons[] feed audit trails. Sequencing here must not regress PR #2684 (subagent lifecycle/eval ergonomics) or PR #2685 (git history active + RLM/field errors).

Appendix A — Tool Lifecycle Decisions (v0.8.53, doc-only)

These are design decisions for the eventual one-time catalog edit; no catalog code changes this cycle. The active first-turn tool block is a DeepSeek KV prefix-cache invariant (tool_catalog.rs:169-196) — it must stay byte-identical run-to-run, so any change is a single deterministic edit, never incremental churn.

Lifecycle states (represented as const name-sets + an alias table in `tool_catalog.rs`, NOT a per-`ToolSpec` field)

State	Active first turn?	In tool-search?	Registered/dispatchable?	Result-metadata notice?
active	yes	yes	yes	no
deferred	no	yes	yes	no
hidden-compatibility	no	no	yes	no
deprecated	no	no	yes	yes (replacement notice, metadata only)
removed	no	no	no	—

Deprecated/hidden tools stay registered and dispatchable so old transcripts always replay. A deprecated tool appends a replacement notice to RESULT METADATA only — never to the cached prefix (which would break the invariant).
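The table and the "const name-sets + alias table" representation might look roughly like this. Everything here is a sketch: the type and constant names are hypothetical, only the aliases and the one replacement named in this appendix are listed, and the notice wording is invented for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

impl Lifecycle {
    // Each method mirrors one column of the lifecycle table.
    fn active_first_turn(self) -> bool { self == Lifecycle::Active }
    fn in_tool_search(self) -> bool { matches!(self, Lifecycle::Active | Lifecycle::Deferred) }
    fn dispatchable(self) -> bool { self != Lifecycle::Removed }
    fn result_notice(self) -> bool { self == Lifecycle::Deprecated }
}

/// (alias, canonical tool) pairs for hidden-compatibility duplicates.
const HIDDEN_COMPAT_ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
];

/// (deprecated tool, replacement) pairs; the other todo_* names would follow the same pattern.
const DEPRECATED_REPLACEMENTS: &[(&str, &str)] = &[("todo_write", "checklist_write")];

/// Lifecycle for the names this sketch knows about; None for everything else.
fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    if HIDDEN_COMPAT_ALIASES.iter().any(|(alias, _)| *alias == name) {
        Some(Lifecycle::HiddenCompatibility)
    } else if DEPRECATED_REPLACEMENTS.iter().any(|(old, _)| *old == name) {
        Some(Lifecycle::Deprecated)
    } else {
        None
    }
}

/// Resolve a hidden-compatibility alias to its canonical tool name.
fn canonical_name(name: &str) -> &str {
    HIDDEN_COMPAT_ALIASES
        .iter()
        .find(|(alias, _)| *alias == name)
        .map_or(name, |(_, canonical)| *canonical)
}

/// Notice text for result metadata only (hypothetical wording); never part of the cached prefix.
fn replacement_notice(name: &str) -> Option<String> {
    DEPRECATED_REPLACEMENTS
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(old, new)| format!("{old} is deprecated; use {new} instead"))
}
```

Keeping the state in const sets rather than on each tool spec means a lifecycle change touches one table, which fits the single-deterministic-edit requirement for the cached tool block.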

Planned diet (documented, not yet coded)

exec_wait, exec_interact, tts → hidden-compatibility. These are exact duplicates of canonical tools:
- exec_wait ≡ exec_shell_wait (same ShellWaitTool, registry.rs:526,529); router already unifies them at crates/tui/src/tui/tool_routing.rs:1139-1140.
- exec_interact ≡ exec_shell_interact (same ShellInteractTool, registry.rs:527,530).
- tts ≡ speech (same SpeechTool, registry.rs:787-792).
- Action: drop from active + search, keep registered, identical behavior, no notice.
todo_* (todo_write/add/update/list) → deprecated → checklist_*. They are deferred twins of checklist_* (same TodoWriteTool::new vs ::checklist, todo.rs:187,194); checklist_write is active, and todo_* are not in the active set. Action: drop from tool-search, keep registered, add replacement notice (metadata only).
Legacy subagent names (agent_spawn, spawn_agent, agent_result, agent_wait, agent_send_input, send_input, agent_assign, agent_list, agent_cancel, resume_agent, delegate_to_agent) are already #[allow(dead_code)] structs never instantiated outside tests (crates/tui/src/tools/subagent/mod.rs) → already not model-visible. Action: cleanup + guardrail tests, rebased on PR #2684. Note the live internal SubAgentManager methods send_input/cancel/resume (mod.rs:1495,1521,1605) are used by agent_eval/agent_close and must be kept — only the model-visible tool names are retired.

Model-visible subagent surface (unchanged)

Only agent_open, agent_eval, tool_agent, agent_close are registered (registry.rs:1017-1029).

tool_agent — KEEP as a canonical subagent tool, GATED to DeepSeek-V4 models ONLY. It is the fast non-thinking "Fin" executor lane built on deepseek-v4-flash (cf. RLM DEFAULT_CHILD_MODEL = "deepseek-v4-flash", rlm.rs:26). On non-DeepSeek-V4 providers it must not be offered. This is a model/provider-gating decision recorded here for the eventual catalog edit.

Explicitly NOT touched (distinct niches, per #2681 non-goals — doc-only canonical guidance)

apply_patch / edit_file / write_file / fim_edit; grep_files / file_search / project_map; fetch_url / web.run / web_search; task_shell_*; handle_read / retrieve_tool_result. These serve distinct purposes and stay as-is.

Appendix B — Command-Surface Taxonomy (context)

Each name maps to exactly one thing; codebase_search slots in as concept-level code retrieval alongside these surfaces:

/memory — small user prefs/facts only (subcommands add/edit/search/clear/doctor, plus later promote; doctor detects the legacy ~/.deepseek path).
/context — dashboard of all active layers.
/rules — repo guidance.
/workflow (/whaleflow) — long-running multi-agent (the WhaleFlow epic).
/overlay — promoted cached-main lessons.
$<skill-name> — skill invocation prefix; the token is the skill name (e.g. $systematic-debugging, $github:gh-fix-ci).
codebase_search — concept-level code retrieval (this document).
