codewhale/docs/VISION_NORTH_STAR.md

# CodeWhale North Star (0.9.0+)

> **STATUS: DIRECTION, NOT COMMITTED WORK.**
> Everything in this document is the maintainer's intended *direction* for
> CodeWhale 0.9.0 and beyond. **None of it is committed 0.8.53 work.** The
> 0.8.53 cycle ships **design docs only** for these areas — no tool-catalog code
> lands this cycle except the small, already-scoped subagent/git/RLM fixes in
> PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be
> refined, not an API contract. Where this doc names tools that do not exist yet
> (`codebase_search`, `read_file` as a canonical alias, `agent_run`, etc.) those
> are **aspirational names** that will *map onto today's tools*; see each
> section.

## Why this document exists

The vision is at risk of being lost between point releases. CodeWhale is
accumulating capability (subagents, RLM, skills, workflows, an enormous tool
catalog) faster than it is accumulating *shape*. This is the north star that the
incremental 0.8.x stabilization work is steering toward, written down once so it
survives the next dozen PRs.

### The one principle

**The harness handles memory, search, routing, state, and guardrails so a
weaker model can just *think*.** Every design decision below is in service of
moving cognitive load *out* of the model and *into* the harness. A
`deepseek-v4-flash`-class model should not have to remember ~80 tool names, hold
the codebase index in its head, track which layer of memory a fact lives in, or
re-derive a recovery path after a malformed tool call. The harness does that.
The model decides *what it wants*; the harness figures out *how*.

---

## Ground-truth anchor (today's reality)

To keep the direction honest about where it starts, this is the state of the
code today:

- **Active first-turn tool set** is `DEFAULT_ACTIVE_NATIVE_TOOLS`
  (`crates/tui/src/core/engine/tool_catalog.rs:37-64`) — 26 tools. Everything
  else is **deferred** and hydrates via `tool_search_tool_regex` /
  `tool_search_tool_bm25` (`tool_catalog.rs:26-35`).
- **Catalog-head byte-stability is a hard invariant** for DeepSeek's KV
  prefix cache (`tool_catalog.rs:169-196`). The active first-turn tool block
  must stay byte-identical run-to-run; any change to it is a **one-time,
  deterministic edit**, never a per-turn or per-mode mutation.
- **Arcee** narrows the first turn to 8 read-only tools
  (`ARCEE_FIRST_TURN_NATIVE_TOOLS`, `tool_catalog.rs:106-115`) as a Cloudflare
  WAF workaround — proof the active partition is already provider-shaped.
- **Subagent tools that are model-visible:** only `agent_open`, `agent_eval`,
  `tool_agent`, `agent_close` (`crates/tui/src/tools/registry.rs:1017-1029`).
  All legacy names (`agent_spawn`, `spawn_agent`, `agent_result`, `agent_wait`,
  `agent_send_input`, `agent_assign`, `agent_list`, `agent_cancel`,
  `resume_agent`, `delegate_to_agent`, …) are `#[allow(dead_code)]` structs in
  `crates/tui/src/tools/subagent/mod.rs`, never instantiated outside tests →
  **already not model-visible**. The live internal `send_input` / `cancel` /
  `resume` methods on `SubAgentManager` (`mod.rs:1495,1521,1605`) back
  `agent_eval` / `agent_close` and **stay**.
- **`tool_agent` is "Fin"** — the experimental fast-lane executor: DeepSeek V4
  Flash with thinking forced off (`mod.rs:5233`, `TOOL_AGENT_INTRO`;
  `DEFAULT_CHILD_MODEL = "deepseek-v4-flash"`, `rlm.rs:26`).
- **Known duplicates today:** `exec_wait ≡ exec_shell_wait`,
  `exec_interact ≡ exec_shell_interact` (same structs, all four in the active
  set), `tts ≡ speech` (both deferred). `todo_*` are deferred twins of
  `checklist_*` (same `TodoWriteTool`, `::new` vs `::checklist`,
  `todo.rs:187,194`). The router already unifies `exec_wait`/`exec_shell_wait`
  (`crates/tui/src/tui/tool_routing.rs:1139-1140`).

This is the surface the north star refactors *toward simplicity*.
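
A guardrail test for the byte-stability invariant could take roughly the shape below. This is only a sketch: `ACTIVE_HEAD`, `render_head`, `fnv1a64`, and `head_fingerprint` are hypothetical names, the four tool names stand in for the real 26-entry list, and the newline-joined rendering is an assumption about how the head is serialized.

```rust
/// Hypothetical stand-in for the active first-turn set. The real list is
/// `DEFAULT_ACTIVE_NATIVE_TOOLS`; these entries are illustrative only.
const ACTIVE_HEAD: &[&str] = &["read_file", "apply_patch", "exec_shell", "git_status"];

/// Render the catalog head in declaration order. The output depends only on
/// the const slice: no sorting by hash, no timestamps, no per-mode branches.
fn render_head(tools: &[&str]) -> String {
    let mut out = String::new();
    for name in tools {
        out.push_str(name);
        out.push('\n');
    }
    out
}

/// FNV-1a (64-bit), used here only as a cheap fingerprint of the rendered head.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Fingerprint of the head. A golden test could pin this value so that any
/// edit to the head has to be a deliberate, reviewed change.
fn head_fingerprint() -> u64 {
    fnv1a64(render_head(ACTIVE_HEAD).as_bytes())
}
```

Pinning the fingerprint in a test would turn an accidental reorder or rename of the head into a failing build rather than a silent cache miss.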

---

## 1. Intent Router

**What it is.** A thin layer where the model declares an **intent** —
*search / inspect / edit / test / delegate / ask-user / run-shell /
run-workflow* — and the harness maps that intent to the correct low-level tool
and arguments. The model picks from a tiny, stable verb vocabulary instead of
recalling ~80 concrete tool names and their schemas.

**Why it helps weaker models.** Tool-name recall is one of the largest sources
of wasted turns for small models: choosing a deferred tool (double-invoke),
choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary
collapses that decision space to ~10 verbs. The model spends its budget on
*reasoning about the task*, not on *remembering the API*.

**Rough shape.** A small **canonical visible set** — aspirational names that
route onto today's tools:

| Intent verb (aspirational) | Routes onto today |
|---|---|
| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` |
| `read_file` | `read_file` (already canonical) |
| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) |
| `run_tests` | `run_tests` / `run_verifiers` |
| `git_status` | `git_status` |
| `git_diff` | `git_diff` |
| `work_update` | `update_plan` / `checklist_write` |
| `ask_user` | `request_user_input` |
| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` become hidden-compatibility, §3) |
| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` |
| `workflow_run` | WhaleFlow runner (§4) |

The router is the *only* place the catalog's full complexity is allowed to live.
It is also where **tool repair** (§7) hooks in: a mis-stated intent or a
deferred/deprecated name is rewritten to the canonical route.
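
The routing table above could be expressed as a closed enum plus two pure functions. This is a minimal sketch, not an API contract: `Intent`, `parse_intent`, and `routes` are hypothetical names, and the preference order inside each route is an assumption.

```rust
/// Intent verbs from the table above. Names are aspirational, not shipped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Intent {
    CodebaseSearch,
    ReadFile,
    ApplyPatch,
    RunTests,
    GitStatus,
    GitDiff,
    WorkUpdate,
    AskUser,
    ShellRun,
    AgentRun,
    WorkflowRun,
}

/// Parse a model-supplied verb. Unknown verbs return `None` so the caller can
/// hand them to tool repair instead of failing with a bare error.
fn parse_intent(verb: &str) -> Option<Intent> {
    match verb {
        "codebase_search" => Some(Intent::CodebaseSearch),
        "read_file" => Some(Intent::ReadFile),
        "apply_patch" => Some(Intent::ApplyPatch),
        "run_tests" => Some(Intent::RunTests),
        "git_status" => Some(Intent::GitStatus),
        "git_diff" => Some(Intent::GitDiff),
        "work_update" => Some(Intent::WorkUpdate),
        "ask_user" => Some(Intent::AskUser),
        "shell_run" => Some(Intent::ShellRun),
        "agent_run" => Some(Intent::AgentRun),
        "workflow_run" => Some(Intent::WorkflowRun),
        _ => None,
    }
}

/// Today's tools that an intent could route onto.
/// `workflow_run` maps to no tool today because the WhaleFlow runner is future work.
fn routes(intent: Intent) -> &'static [&'static str] {
    match intent {
        Intent::CodebaseSearch => &["grep_files", "file_search", "project_map"],
        Intent::ReadFile => &["read_file"],
        Intent::ApplyPatch => &["apply_patch"],
        Intent::RunTests => &["run_tests", "run_verifiers"],
        Intent::GitStatus => &["git_status"],
        Intent::GitDiff => &["git_diff"],
        Intent::WorkUpdate => &["update_plan", "checklist_write"],
        Intent::AskUser => &["request_user_input"],
        Intent::ShellRun => &["exec_shell"],
        Intent::AgentRun => &["agent_open", "tool_agent", "agent_eval", "agent_close"],
        Intent::WorkflowRun => &[],
    }
}
```

Because the enum is closed and the match is exhaustive, adding a verb forces every route to be revisited at compile time, which fits the "one-time deterministic edit" constraint.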

**Dependencies.** The small canonical surface (§3), the lifecycle alias table
(§3 / `docs/TOOL_LIFECYCLE.md`), and the hybrid index for `codebase_search`
(§2). Must respect the **catalog-head byte-stability invariant**: the visible
verb set is itself a one-time deterministic edit, not a dynamic per-turn list.

---

## 2. Default Hybrid Codebase Intelligence

**What it is.** An always-on, local-first codebase index that ships with the
harness — not an opt-in tool the model has to remember to build. It fuses:

- plain **text** search,
- **symbol** index (definitions/references),
- **import / call graph**,
- **FTS5 + BM25** lexical ranking (rusqlite is already a dependency —
  `Cargo.toml`),
- **sparse** retrieval,
- optional **dense** (embedding) retrieval,
- **PR / commit / issue history** as a first-class retrieval source,
- a **codemap** (structural overview, the successor to today's deferred
  `project_map`).

**Why it helps weaker models.** Today the model must orchestrate `grep_files`
(content), `file_search` (filename), and `project_map` (structure) by hand,
reconcile their outputs, and re-run them as it narrows. There is **no FTS5/BM25
or semantic index today** — every search is a cold walk (`file_search` uses the
`ignore` crate's `WalkBuilder` for vendor exclusion, `file_search.rs:~210`). A
weaker model burns turns stitching partial results. A single `codebase_search`
intent backed by a hybrid index returns ranked, concept-level hits in one call,
so the model reasons about *answers*, not *query mechanics*.

**Rough shape.** A background indexer maintains a SQLite store (FTS5 + symbol +
graph tables), refreshed on file change and on git events. `codebase_search`
(§1) queries it; the codemap is regenerated incrementally. Vendor exclusion
reuses the existing `ignore`/`WalkBuilder` path.
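
To make the lexical-ranking layer concrete, here is a small pure-Rust illustration of Okapi BM25 over an in-memory corpus. It is not the proposed implementation (the design would lean on FTS5 inside SQLite); `tokenize` and `bm25_rank` are hypothetical helpers, and `K1 = 1.2`, `B = 0.75` are conventional defaults assumed here.

```rust
use std::collections::HashMap;

/// Split text into lowercase tokens. Underscores are kept so identifiers such
/// as `read_file` stay whole.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against `query` with Okapi BM25 and return
/// `(doc_index, score)` pairs, best first. Zero-score documents are dropped.
fn bm25_rank(docs: &[&str], query: &str) -> Vec<(usize, f64)> {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    let avg_len = if tokenized.is_empty() { 0.0 } else { total_len as f64 / n };

    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let mut seen: Vec<&str> = doc.iter().map(|s| s.as_str()).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    let mut scored: Vec<(usize, f64)> = Vec::new();
    for (i, doc) in tokenized.iter().enumerate() {
        let len = doc.len() as f64;
        let mut score = 0.0;
        for term in &query_terms {
            let tf = doc.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let n_t = *df.get(term.as_str()).unwrap_or(&0) as f64;
            // The "+ 1.0" inside the logarithm keeps the IDF non-negative.
            let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
            let norm = tf + K1 * (1.0 - B + B * len / avg_len);
            score += idf * tf * (K1 + 1.0) / norm;
        }
        if score > 0.0 {
            scored.push((i, score));
        }
    }
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

In the hybrid design this score would be one signal among several (symbol, graph, history); how those signals are fused is left to the full design doc.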

**Dependencies.** rusqlite/FTS5; the Intent Router (§1) for the
`codebase_search` verb; the trace store (§6/§8) for history retrieval. **Full
design lives in `docs/CODEBASE_SEARCH_DESIGN.md`** (to be written this cycle).

---

## 3. Small Canonical Tool Surface

**What it is.** A deliberately tiny set of always-visible canonical tools;
**everything else is hidden, deferred, or skill-scoped**. The catalog grows
behind the scenes but the *visible* surface stays small and stable.

**Why it helps weaker models.** Fewer choices, no aliases competing for the same
job, no deferred double-invokes for common operations. The model sees the verbs
it needs and nothing else.

**Rough shape — tool lifecycle states.** Five states, represented as **const
name-sets plus an alias table in `tool_catalog.rs`** (NOT a per-`ToolSpec`
field, to preserve the byte-stable head):

1. **active** — in the first-turn catalog head.
2. **deferred** — registered, hydrated via tool-search.
3. **hidden-compatibility** — registered + dispatchable, **dropped from both
   active and search**, identical behavior, **no notice**. (For exact
   duplicates that should simply disappear from discovery.)
4. **deprecated** — registered + dispatchable, **dropped from search**, appends
   a *replacement notice to RESULT METADATA only* — **never** to the cached
   prefix.
5. **removed** — final state; no longer registered.

**Invariant:** deprecated and hidden-compatibility tools **stay registered and
dispatchable forever** so old transcripts always replay deterministically.
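
One possible rendering of the "const name-sets plus an alias table" idea, using the duplicate pairs noted in the ground-truth section. Everything here is a sketch: the set contents are illustrative (in particular the placement of `checklist_write`), and `lifecycle`, `is_dispatchable`, `is_discoverable`, and `canonical` are hypothetical helper names.

```rust
/// The five lifecycle states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Hypothetical const name-sets. The real sets would live in `tool_catalog.rs`.
const ACTIVE: &[&str] = &["read_file", "exec_shell_wait", "exec_shell_interact", "checklist_write"];
const DEFERRED: &[&str] = &["speech"];
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
const DEPRECATED: &[&str] = &["todo_write"];
const REMOVED: &[&str] = &[];

/// Alias table: old name -> canonical replacement.
const ALIASES: &[(&str, &str)] = &[
    ("exec_wait", "exec_shell_wait"),
    ("exec_interact", "exec_shell_interact"),
    ("tts", "speech"),
    ("todo_write", "checklist_write"),
];

/// Look up a tool's state; `None` means the name was never registered.
fn lifecycle(name: &str) -> Option<Lifecycle> {
    let sets = [
        (ACTIVE, Lifecycle::Active),
        (DEFERRED, Lifecycle::Deferred),
        (HIDDEN_COMPATIBILITY, Lifecycle::HiddenCompatibility),
        (DEPRECATED, Lifecycle::Deprecated),
        (REMOVED, Lifecycle::Removed),
    ];
    sets.iter()
        .find(|(names, _)| names.contains(&name))
        .map(|(_, state)| *state)
}

/// Everything except `Removed` can still be dispatched, so old transcripts replay.
fn is_dispatchable(state: Lifecycle) -> bool {
    state != Lifecycle::Removed
}

/// Only active and deferred tools are visible to the model or to tool-search.
fn is_discoverable(state: Lifecycle) -> bool {
    matches!(state, Lifecycle::Active | Lifecycle::Deferred)
}

/// Resolve an alias to its canonical name; unknown names pass through unchanged.
fn canonical(name: &str) -> &str {
    ALIASES
        .iter()
        .find(|(old, _)| *old == name)
        .map_or(name, |(_, new)| *new)
}
```

Keeping the state in name-sets rather than on each `ToolSpec` means a lifecycle change never alters the serialized spec of a tool that stays in the head.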

**Planned diet (documented this cycle, not yet coded):**

- `exec_wait`, `exec_interact`, `tts` → **hidden-compatibility** (exact
  duplicates of `exec_shell_wait`, `exec_shell_interact`, `speech`).
- `todo_*` (`todo_write/add/update/list`) → **deprecated**, replaced by
  `checklist_*` (drop from tool-search, keep registered, add result-metadata
  notice).
- Legacy subagent names → already hidden; remaining work is **cleanup +
  guardrail tests**, rebased on PR #2684.

**Explicitly NOT touched** (distinct niches, per #2681 non-goals) — doc-only
canonical guidance, no diet: `apply_patch` / `edit_file` / `write_file` /
`fim_edit`; `grep_files` / `file_search` / `project_map`; `fetch_url` /
`web.run` / `web_search`; `task_shell_*`; `handle_read` /
`retrieve_tool_result`.

**`tool_agent` gating decision.** `tool_agent` ("Fin") **stays** as a canonical
subagent tool, but is **gated to DeepSeek-V4 models only**. It is the fast,
non-thinking executor lane built on `deepseek-v4-flash`; offering it to other
providers/models is meaningless (the lane *is* a specific model) and would just
add a name to recall. The gate is provider/model-conditional in the same spirit
as the Arcee first-turn narrowing.
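
The gate itself could be as small as the sketch below. Both function names are hypothetical, and the `deepseek-v4` prefix check is an assumption about how model ids are spelled; the real condition may key off provider metadata instead.

```rust
/// Hypothetical gate: offer `tool_agent` only for DeepSeek-V4 family models.
/// The prefix match is an assumption about model-id naming.
fn tool_agent_visible(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Model-visible subagent tools for a session, in a fixed order.
/// The decision is made once per session model, never per turn, so the
/// resulting list stays byte-stable for the lifetime of that session.
fn visible_subagent_tools(model_id: &str) -> Vec<&'static str> {
    let mut tools = vec!["agent_open", "agent_eval"];
    if tool_agent_visible(model_id) {
        tools.push("tool_agent");
    }
    tools.push("agent_close");
    tools
}
```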

**Dependencies.** The alias table backs the Intent Router (§1) and Tool Repair
(§7). **Full spec in `docs/TOOL_LIFECYCLE.md`** (to be written this cycle).

---

## 4. WhaleFlow / Workflow Mode

**What it is.** A typed, multi-agent **workflow runner**. A workflow is a graph
of typed nodes — **branches, leaves, reviewers, verifiers, test-runners,
PR-creators**, with **trace-replay** and a **progress-monitor**. Authors write
workflows in **Starlark or YAML**, which compile to a **typed Rust IR**; the
**Rust executor** runs the IR. "Like Claude's workflow mode, but safer" — the
safety comes from the typed IR and Rust execution boundary rather than free-form
model-driven orchestration.

**Why it helps weaker models.** Long-running, multi-step work (implement →
review → verify → test → open PR) is exactly where weaker models drift, lose
state, or skip verification. Encoding the *process* as a typed graph means the
model only has to be competent at each *leaf*, while the harness guarantees the
sequencing, the verification gates, and the evidence trail.

**Rough shape.** Starlark/YAML → typed IR → Rust executor. Nodes map to
subagent lanes (`agent_open` / `tool_agent` / `agent_eval` / `agent_close`,
`registry.rs:1017-1029`). Reviewer/verifier/test-runner nodes are first-class
node *types*, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via
`/workflow` (alias `/whaleflow`) and the `workflow_run` intent (§1).
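
A very small slice of what the typed IR might look like: node kinds as an enum, and a planner that validates the graph before anything runs. All names (`NodeKind`, `Node`, `IrError`, `plan`) are hypothetical, and the rule that a PR-creator must depend directly on a verifier or test-runner is an assumed example of a safety gate, not a decided one.

```rust
/// Node types from the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

#[derive(Debug, Clone)]
struct Node {
    id: &'static str,
    kind: NodeKind,
    deps: Vec<&'static str>,
}

#[derive(Debug, PartialEq, Eq)]
enum IrError {
    UnknownDep { node: &'static str, dep: &'static str },
    /// No runnable node is left. Duplicate ids also surface here in this sketch.
    Cycle,
    /// A PR-creator that is not gated by a verifier or test-runner.
    UngatedPr(&'static str),
}

/// Validate the graph and return a deterministic topological execution order.
fn plan(nodes: &[Node]) -> Result<Vec<&'static str>, IrError> {
    let kind_of = |id: &str| nodes.iter().find(|n| n.id == id).map(|n| n.kind);

    for node in nodes {
        for &dep in &node.deps {
            if kind_of(dep).is_none() {
                return Err(IrError::UnknownDep { node: node.id, dep });
            }
        }
        if node.kind == NodeKind::PrCreator {
            let gated = node.deps.iter().any(|&d| {
                matches!(kind_of(d), Some(NodeKind::Verifier | NodeKind::TestRunner))
            });
            if !gated {
                return Err(IrError::UngatedPr(node.id));
            }
        }
    }

    // Repeatedly pick the first not-yet-scheduled node whose deps are all scheduled.
    let mut order: Vec<&'static str> = Vec::with_capacity(nodes.len());
    while order.len() < nodes.len() {
        let next = nodes
            .iter()
            .find(|n| !order.contains(&n.id) && n.deps.iter().all(|d| order.contains(d)));
        match next {
            Some(n) => order.push(n.id),
            None => return Err(IrError::Cycle),
        }
    }
    Ok(order)
}
```

The point of the sketch is that sequencing and gating errors are rejected at plan time, before any subagent lane is opened.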

**Dependencies.** Subagent runtime; the evaluation loop (§8) for traces;
Skills & Rules (§5) so a skill can *define* a workflow; the command taxonomy
(§9).

---

## 5. Skills & Rules as First-Class Runtime

**What it is.** Skills and rules become real runtime objects, not just prompt
text. Skills gain **activation modes**:

- **always-on** — injected every turn,
- **glob** — activated when matching files are in scope,
- **model-decision** — offered to the model to opt into,
- **manual** — only via explicit `$<skill-name>` invocation (§9).

Skills can **restrict the tool surface**, **define workflows** (§4), and
**inject repo context**.

**Why it helps weaker models.** A skill scoped to a task can shrink the tool
surface to exactly what that task needs and pre-load the relevant rules and
context — so the model operates inside a curated, smaller world instead of the
full catalog.

**Rough shape (vs. today).** Today: skills are discovered
(`crates/tui/src/tools/skills/mod.rs`, `discover_in_workspace ~421`; struct
parses name/description `~382-388`), enable-state is tracked
(`skill_state.rs`, `SkillStateStore::is_enabled ~73`), and there's an
inline-mention popup (`slash_menu.rs ~86`). **But:** no parser activates inline
`$` mentions on submit (submit path: `ui.rs build_queued_message ~4721`), there
is **no activation-mode concept**, and **skills cannot restrict tools**. The
direction adds (a) a submit-time `$<skill-name>` activation parser, (b) the
four activation modes in skill metadata, and (c) a tool-restriction field
enforced by the registry/router.
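
Items (b) and (c) could be modeled along these lines. This is a sketch under stated assumptions: all type and function names are hypothetical, the glob support is deliberately limited to `*suffix` patterns to stay dependency-free, and treating an explicit invocation as overriding every mode is a guess at the intended semantics.

```rust
/// The four activation modes described above.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Activation {
    AlwaysOn,
    /// Only `*suffix` patterns (e.g. `*.rs`) are understood in this sketch.
    Glob(&'static str),
    ModelDecision,
    Manual,
}

#[derive(Debug, Clone)]
struct Skill {
    name: &'static str,
    activation: Activation,
    /// `None` means no restriction; `Some` lists the only tools allowed.
    allowed_tools: Option<Vec<&'static str>>,
}

/// What the harness does with a skill on a given turn.
#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    Inject,
    Offer,
    Skip,
}

fn glob_matches(pattern: &str, path: &str) -> bool {
    match pattern.strip_prefix('*') {
        Some(suffix) => path.ends_with(suffix),
        None => pattern == path,
    }
}

/// Decide a skill's disposition. `invoked` holds skill names the user named
/// explicitly this turn; explicit invocation wins over every mode.
fn disposition(skill: &Skill, files_in_scope: &[&str], invoked: &[&str]) -> Disposition {
    if invoked.contains(&skill.name) {
        return Disposition::Inject;
    }
    match &skill.activation {
        Activation::AlwaysOn => Disposition::Inject,
        Activation::Glob(pattern) => {
            if files_in_scope.iter().any(|f| glob_matches(pattern, f)) {
                Disposition::Inject
            } else {
                Disposition::Skip
            }
        }
        Activation::ModelDecision => Disposition::Offer,
        Activation::Manual => Disposition::Skip,
    }
}

/// Shrink the catalog to tools permitted by every active skill that restricts.
fn restrict(catalog: &[&'static str], active: &[&Skill]) -> Vec<&'static str> {
    catalog
        .iter()
        .copied()
        .filter(|tool| {
            active.iter().all(|skill| match &skill.allowed_tools {
                Some(allowed) => allowed.contains(tool),
                None => true,
            })
        })
        .collect()
}
```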

**Dependencies.** Tool lifecycle/alias table (§3) for restriction; Intent Router
(§1); WhaleFlow (§4); command taxonomy (§9). **Full design in
`docs/SKILL_INVOCATION_DESIGN.md`** (to be written this cycle).

---

## 6. Context Memory Stack

**What it is.** Memory modeled as **explicit, layered, inspectable** stores
rather than one undifferentiated blob. Each layer is **visible, inspectable,
clearable, and scoped**:

1. **User memory** — small user prefs/facts (surfaced via `/memory`, §9).
2. **Repo rules** — checked-in guidance (`/rules`).
3. **Codemap-wiki** — derived structural/semantic knowledge of the repo (§2).
4. **Trace store** — recorded workflow/turn evidence (§8).
5. **ARMH–RLM memo** — the RLM kernel's in-session working memory
   (`rlm_open`/`rlm_eval`/`rlm_configure`/`rlm_close`/`rlm_session_objects`,
   `crates/tui/src/tools/rlm.rs`; `handle_read` retrieves var handles;
   `finalize`/`FINAL` is an *in-kernel Python function*, not a tool).
6. **Cached-main overlay** — promoted lessons from the cached main branch
   (`/overlay`, §9).
7. **External memory (Aleph)** — large local data via the `aleph` skill;
   see `docs/WHALEFLOW_EXTERNAL_MEMORY.md` for the v0.9.0 cutline that keeps
   this optional, explicit, inspectable, and out of the default path.

**Why it helps weaker models.** The model never has to *guess* where a fact
should live or *re-derive* context it already established. Each layer has a
clear scope and a clear command to inspect/clear it, so stale context is
visible and removable rather than silently poisoning the prefix.

**Rough shape.** A `/context` dashboard (§9) renders all active layers and their
sizes; `/memory` manages the small user layer; `/overlay` manages promoted
lessons. The RLM layer already exists and is plumbed through `rlm.rs`.
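
As a rough illustration, the seven layers could be a closed enum so that the dashboard and the commands share one source of truth. The type and method names are hypothetical; the labels follow the names used in the `/context` example, and the command mapping covers only the layers for which this document names a command.

```rust
/// The seven memory layers listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    UserMemory,
    RepoRules,
    CodemapWiki,
    TraceStore,
    RlmMemo,
    CachedOverlay,
    AlephExternal,
}

impl Layer {
    const ALL: [Layer; 7] = [
        Layer::UserMemory,
        Layer::RepoRules,
        Layer::CodemapWiki,
        Layer::TraceStore,
        Layer::RlmMemo,
        Layer::CachedOverlay,
        Layer::AlephExternal,
    ];

    fn label(self) -> &'static str {
        match self {
            Layer::UserMemory => "user-memory",
            Layer::RepoRules => "repo-rules",
            Layer::CodemapWiki => "codemap-wiki",
            Layer::TraceStore => "trace-store",
            Layer::RlmMemo => "rlm-memo",
            Layer::CachedOverlay => "cached-overlay",
            Layer::AlephExternal => "aleph-external",
        }
    }

    /// Slash command that manages this layer, where this document names one.
    fn command(self) -> Option<&'static str> {
        match self {
            Layer::UserMemory => Some("/memory"),
            Layer::RepoRules => Some("/rules"),
            Layer::CachedOverlay => Some("/overlay"),
            _ => None,
        }
    }

    /// Aleph is explicitly kept out of the default path.
    fn in_default_path(self) -> bool {
        self != Layer::AlephExternal
    }
}

/// One dashboard line per layer, in stack order.
fn context_lines(attached: &[Layer]) -> Vec<String> {
    Layer::ALL
        .iter()
        .map(|layer| {
            let state = if attached.contains(layer) { "attached" } else { "not attached" };
            format!("{:<15} {}", layer.label(), state)
        })
        .collect()
}
```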

**Dependencies.** Command taxonomy (§9); codebase intelligence (§2); evaluation
loop (§8) for promotion into the overlay.

---

## 7. Tool Repair & Autoload

**What it is.** When the model emits a wrong, deferred, deprecated, or
environment-blocked tool call, the harness **repairs** it instead of returning a
bare error — and **autoloads** what's needed.

**Why it helps weaker models.** Recovery from a malformed call is precisely
where weak models loop or give up. Turning every failure into an actionable,
schema-bearing correction keeps the model on-task.

**Rough shape — representative repairs:**

- **Wrong/legacy name** → *"you meant `agent_eval`; here's the schema"* (autoload
  the deferred tool's schema in the same turn).
- **Mode mismatch** → *"shell is unavailable in Plan mode — ask the user or
  switch modes"*.
- **Missing dependency** → *"this tool needs Node; Node is missing"*
  (dependency probe via `ExternalTool`, already imported in `tool_catalog.rs`).
- **Deprecated alias** → silently **routed to the canonical** tool, with the
  replacement notice in **result metadata only** (§3) — never the cached prefix.
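
The four repairs above could be modeled as one decision function that returns a structured outcome instead of an error string. This is a sketch only: `Repair`, `Env`, and `repair` are hypothetical, the legacy-to-canonical pairs are illustrative guesses, and `js_repl` is an invented tool name standing in for "a tool that needs Node".

```rust
#[derive(Debug, PartialEq, Eq)]
enum Repair {
    /// The call is fine as stated.
    PassThrough,
    /// Wrong or legacy name: name the replacement and autoload its schema.
    Suggest { canonical: &'static str },
    /// Deprecated alias: dispatch the canonical tool; the notice goes into
    /// result metadata only, never into the cached prefix.
    Reroute { canonical: &'static str, metadata_notice: String },
    /// Blocked by the current mode.
    Blocked { reason: &'static str },
    /// A required external dependency is not installed.
    MissingDependency { reason: &'static str },
}

/// Minimal view of the environment that the repair step consults.
struct Env {
    plan_mode: bool,
    node_installed: bool,
}

// Illustrative tables; the real mappings would come from the alias table (§3).
const LEGACY: &[(&str, &str)] = &[("agent_send_input", "agent_eval"), ("agent_cancel", "agent_close")];
const DEPRECATED_ALIASES: &[(&str, &str)] = &[("todo_write", "checklist_write")];
const NEEDS_NODE: &[&str] = &["js_repl"]; // hypothetical tool name

fn lookup(table: &'static [(&'static str, &'static str)], name: &str) -> Option<&'static str> {
    table.iter().find(|(old, _)| *old == name).map(|(_, new)| *new)
}

fn repair(tool: &str, env: &Env) -> Repair {
    if let Some(canonical) = lookup(DEPRECATED_ALIASES, tool) {
        return Repair::Reroute {
            canonical,
            metadata_notice: format!("`{tool}` is deprecated; use `{canonical}`"),
        };
    }
    if let Some(canonical) = lookup(LEGACY, tool) {
        return Repair::Suggest { canonical };
    }
    if env.plan_mode && tool == "exec_shell" {
        return Repair::Blocked {
            reason: "shell is unavailable in Plan mode; ask the user or switch modes",
        };
    }
    if NEEDS_NODE.contains(&tool) && !env.node_installed {
        return Repair::MissingDependency {
            reason: "this tool needs Node; Node is missing",
        };
    }
    Repair::PassThrough
}
```

Returning an enum rather than a string keeps the "notice goes to result metadata" rule enforceable in one place: only the `Reroute` variant carries a notice.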

**Dependencies.** The alias table + lifecycle states (§3); the Intent Router
(§1); dependency detection (`ExternalTool`). Builds on PR #2685's actionable
RLM/field errors and PR #2684's lifecycle signals — **must not contradict
either**.

---

## 8. Evaluation Loop

**What it is.** Every workflow run **leaves evidence**: the tests it ran, the
diffs it produced, the failures it hit, the searches it issued, the claims it
verified, and the PR outcome. A **teacher/student replay** turns *good* traces
into reusable **rules, skills, tests, and cached guidance**.

**Why it helps weaker models.** The system gets better at *this repo* over time
without the model getting smarter. Verified good traces become rules/skills the
weaker model can lean on next time, and become the source of the cached-main
overlay (§6).

**Rough shape.** Workflow nodes (§4) emit structured evidence into the trace
store (§6). A replay/distillation pass (teacher reviews student trace) promotes
high-value traces into: repo rules (`/rules`), skills (§5), regression tests,
and overlay guidance (`/overlay`). Verified-claim tracking ties into the
adversarial-verification posture already used elsewhere.
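
A sketch of the evidence record and a promotion filter follows. The field set mirrors the evidence listed above, but the names (`Evidence`, `is_promotable`, `promotable`) are hypothetical and the promotion criteria are an assumed placeholder; the real teacher/student pass would presumably be richer than a boolean check.

```rust
/// Structured evidence one workflow run leaves behind.
#[derive(Debug, Clone)]
struct Evidence {
    tests_run: u32,
    tests_failed: u32,
    claims_made: u32,
    claims_verified: u32,
    pr_merged: bool,
}

/// Assumed criteria for a "good" trace: tests ran and all passed, every claim
/// was verified, and the PR was merged.
fn is_promotable(e: &Evidence) -> bool {
    e.tests_run > 0
        && e.tests_failed == 0
        && e.claims_verified == e.claims_made
        && e.pr_merged
}

/// Ids of the traces worth handing to the replay/distillation pass.
fn promotable<'a>(traces: &'a [(&'a str, Evidence)]) -> Vec<&'a str> {
    traces
        .iter()
        .filter(|(_, evidence)| is_promotable(evidence))
        .map(|(id, _)| *id)
        .collect()
}
```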

**Dependencies.** WhaleFlow (§4) for trace emission; trace store + overlay (§6);
Skills & Rules (§5) as promotion targets.

---

## 9. Command-Surface Taxonomy

**What it is.** One name = **one thing**. The command surface is split so each
prefix has a single, memorable responsibility:

| Surface | Responsibility |
|---|---|
| `/memory` | **Small** user prefs/facts only |
| `/context` | **Dashboard** of all active memory layers (§6) |
| `/rules` | Repo guidance |
| `.codewhale/constitution.json` | Repo constitution: checked-in **local law** |
| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) |
| `/overlay` | Promoted cached-main lessons (§6/§8) |
| `$<skill-name>` | Skill invocation — **the token *is* the skill name** |
| `codebase_search` | Concept-level code retrieval (§2) |

The repo constitution is not another memory bucket. It is the local-law layer in
a layered authority model:

```
base myth / global Constitution
  -> repo constitution (.codewhale/constitution.json)
  -> task packet
  -> runtime policy
```

At conflict time, the **current user request for the task remains above the repo
constitution**; the repo constitution supplies durable defaults and local law
only when the active task packet and runtime policy leave room. Runtime policy is
the compiled enforcement surface for the run, not a separate place for the model
to invent new rules.

**Why it helps weaker models (and users).** No overloaded command does five
jobs; the model/user never has to disambiguate *which* `/memory` behavior they
meant. `$systematic-debugging` self-documents what it invokes.

**`/memory` subcommand sketch:**

```
/memory add "<fact>"        # store a small pref/fact
/memory edit                # edit stored facts
/memory search <query>      # find a stored fact
/memory clear               # clear user memory
/memory doctor              # health check; detects legacy ~/.deepseek path
/memory promote <fact>      # (later) promote a fact to a higher layer
```

`doctor` specifically detects the **legacy `~/.deepseek`** path and guides
migration.
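
Parsing the subcommand sketch into a typed command might look like this. `MemoryCmd`, `split_word`, and `parse_memory` are hypothetical names, and the quoting rule (strip surrounding double quotes, nothing more) is a simplifying assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
enum MemoryCmd {
    Add(String),
    Edit,
    Search(String),
    Clear,
    Doctor,
    Promote(String),
}

/// Split off the first whitespace-delimited word; the remainder is trimmed.
fn split_word(s: &str) -> (&str, &str) {
    match s.trim().split_once(char::is_whitespace) {
        Some((head, rest)) => (head, rest.trim()),
        None => (s.trim(), ""),
    }
}

/// Parse a `/memory ...` line into a typed command.
fn parse_memory(line: &str) -> Result<MemoryCmd, String> {
    let (head, rest) = split_word(line);
    if head != "/memory" {
        return Err(format!("not a /memory command: {head}"));
    }
    let (sub, arg) = split_word(rest);
    let arg = arg.trim_matches('"').to_string();

    // Subcommands that take an argument reject an empty one.
    let require = |what: &str| -> Result<String, String> {
        if arg.is_empty() {
            Err(format!("/memory {sub} needs {what}"))
        } else {
            Ok(arg.clone())
        }
    };

    match sub {
        "add" => Ok(MemoryCmd::Add(require("a fact")?)),
        "edit" => Ok(MemoryCmd::Edit),
        "search" => Ok(MemoryCmd::Search(require("a query")?)),
        "clear" => Ok(MemoryCmd::Clear),
        "doctor" => Ok(MemoryCmd::Doctor),
        "promote" => Ok(MemoryCmd::Promote(require("a fact")?)),
        other => Err(format!("unknown /memory subcommand: {other:?}")),
    }
}
```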

**`$<skill-name>` invocation examples:**

```
$systematic-debugging       # local skill
$github:gh-fix-ci           # namespaced skill
```

The submit-time parser (to be added; submit path `ui.rs ~4721`) recognizes the
`$` token and activates the named skill (§5).
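
A minimal sketch of such a parser is below. `SkillRef`, `is_name`, and `skill_mentions` are hypothetical, and the token grammar (lowercase ASCII start, then lowercase letters, digits, or hyphens, with an optional `namespace:` prefix) is an assumption inferred from the two examples above; it is chosen so that strings like `$5` or `$HOME` are not mistaken for skills.

```rust
#[derive(Debug, PartialEq, Eq)]
struct SkillRef {
    namespace: Option<String>,
    name: String,
}

/// Assumed name grammar: starts with a lowercase ASCII letter, then lowercase
/// letters, digits, or hyphens.
fn is_name(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_lowercase())
        && chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

/// Extract every `$<skill-name>` or `$<namespace>:<skill-name>` token from a
/// submitted message, in order of appearance.
fn skill_mentions(message: &str) -> Vec<SkillRef> {
    message
        .split_whitespace()
        .filter_map(|word| {
            let token = word.strip_prefix('$')?;
            // Drop sentence punctuation that trails the token.
            let token = token
                .trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ')' | '!' | '?'));
            let (namespace, name) = match token.split_once(':') {
                Some((ns, name)) => (Some(ns), name),
                None => (None, token),
            };
            if !is_name(name) || !namespace.map_or(true, is_name) {
                return None;
            }
            Some(SkillRef {
                namespace: namespace.map(str::to_string),
                name: name.to_string(),
            })
        })
        .collect()
}
```

Whether an unknown skill name should be an error or pass through as plain text is left open here; the sketch only extracts candidates.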

**`/context` layers dashboard (example render):**

```
/context
  user-memory      ▸ 7 facts                 (12 KB)   [clear]
  repo-constitution ▸ .codewhale/constitution.json (4 KB) [view]
  repo-rules       ▸ CLAUDE.md, AGENTS.md     (8 KB)   [view]
  codemap-wiki     ▸ 412 symbols indexed     (auto)    [rebuild]
  trace-store      ▸ 3 recent workflow runs  (—)       [open]
  rlm-memo         ▸ 0 active sessions        (—)       [—]
  cached-overlay   ▸ 5 promoted lessons       (3 KB)   [view]
  aleph-external   ▸ not attached             (—)       [attach]
```

**Dependencies.** Memory stack (§6); skills (§5); codebase intelligence (§2);
workflow runner (§4).

---

## 10. Deferred-Not-Done 0.8.53 Diet Items

Recorded here so they are **not silently dropped** — these were considered for
the 0.8.53 diet and deliberately **deferred** (design-only or out of scope this
cycle):

- **File-mutation overload** — `apply_patch` / `edit_file` / `write_file` /
  `fim_edit` overlap in purpose. Per #2681 non-goals these stay distinct;
  canonical *guidance* (prefer `apply_patch`) is doc-only, no consolidation
  this cycle.
- **`task_shell_*` ↔ `exec_*` redundancy** — `task_shell_start` /
  `task_shell_wait` overlap conceptually with the `exec_*` family. Left intact
  this cycle (distinct niche per #2681); revisit under §1/§3.
- **`handle_read` / `retrieve_tool_result`** — result-handle plumbing kept as-is
  (doc-only canonical guidance); folds naturally into the memory stack (§6) and
  intent routing (§1) later.
- **Search-cluster consolidation** — `grep_files` / `file_search` /
  `project_map` remain three tools this cycle; consolidation is the *job of the
  hybrid index* (§2) under `codebase_search`, not a catalog edit in 0.8.53.

---

## Phased Roadmap

### 0.8.53 — design + small fixes only
- **Code:** only the already-scoped, narrow fixes: PR #2684 (subagent role
  vocabulary, lifecycle signals, eval ergonomics) and PR #2685 (read-only git
  history made active, plus actionable RLM/field errors). Subagent legacy-name
  cleanup and guardrail tests are rebased on #2684.
- **Docs:** this north star, plus `docs/TOOL_LIFECYCLE.md`,
  `docs/CODEBASE_SEARCH_DESIGN.md`, `docs/SKILL_INVOCATION_DESIGN.md`.
- **No tool-catalog code:** the diet (§3), the Intent Router (§1), and the
  hybrid index (§2) are **documented, not coded** this cycle.

### 0.9.0 — first structural moves
- Implement the **tool lifecycle** const name-sets + alias table in
  `tool_catalog.rs` (§3) as a one-time deterministic head edit.
- Land the **planned diet**: `exec_wait`/`exec_interact`/`tts` →
  hidden-compatibility; `todo_*` → deprecated→`checklist_*` (result-metadata
  notice only).
- Gate **`tool_agent`** to DeepSeek-V4 models only (§3).
- First version of the **default hybrid codebase index** (FTS5/BM25 + symbol +
  codemap) behind `codebase_search` (§2).
- First **Intent Router** verbs mapping onto today's tools (§1).
- **Tool Repair** for deferred/deprecated/mode/dependency cases (§7).

### Later (post-0.9.0)
- **WhaleFlow** typed-IR workflow runner (§4) and the **evaluation loop** /
  teacher-student replay (§8).
- **Skills activation modes** + tool restriction + `$<skill-name>` submit-time
  activation (§5).
- Full **Context Memory Stack** with `/context` dashboard, `/overlay`
  promotion, and Aleph external memory (§6).
- Dense/semantic retrieval and PR/commit/issue history in the index (§2).
- Search-cluster consolidation and the remaining §10 deferred items.

---

## North-star one-liner

> **The harness handles memory, search, routing, state, and guardrails — so a
> weaker model can just think.**