dgf1988/codewhale

Hunter Bown aa4c734602 docs: align v0.8.53 tool surface notes

2026-06-03 12:37:39 -07:00

CodeWhale North Star (0.9.0+)

STATUS: DIRECTION, NOT COMMITTED WORK. Everything in this document is the maintainer's intended direction for CodeWhale 0.9.0 and beyond. None of it is committed 0.8.53 work. The 0.8.53 cycle ships design docs only for these areas; no tool-catalog code lands this cycle except the small, already-scoped subagent/git/RLM fixes in PR #2684 and PR #2685. Treat every "rough shape" below as a sketch to be refined, not an API contract. Where this doc names tools that do not exist yet (codebase_search, read_file as a canonical alias, agent_run, etc.), those are aspirational names that will map onto today's tools; see each section.

Why this document exists

The vision is at risk of being lost between point releases. CodeWhale is accumulating capability (subagents, RLM, skills, workflows, an enormous tool catalog) faster than it is accumulating shape. This is the north star that the incremental 0.8.x stabilization work is steering toward, written down once so it survives the next dozen PRs.

The one principle

The harness handles memory, search, routing, state, and guardrails so a weaker model can just think. Every design decision below is in service of moving cognitive load out of the model and into the harness. A deepseek-v4-flash-class model should not have to remember ~80 tool names, hold the codebase index in its head, track which layer of memory a fact lives in, or re-derive a recovery path after a malformed tool call. The harness does that. The model decides what it wants; the harness figures out how.

Ground-truth anchor (today's reality)

So the direction is honest about where it starts:

Active first-turn tool set is DEFAULT_ACTIVE_NATIVE_TOOLS (crates/tui/src/core/engine/tool_catalog.rs:37-64) — 26 tools. Everything else is deferred and hydrates via tool_search_tool_regex / tool_search_tool_bm25 (tool_catalog.rs:26-35).
Catalog-head byte-stability is a hard invariant for DeepSeek's KV prefix cache (tool_catalog.rs:169-196). The active first-turn tool block must stay byte-identical run-to-run; any change to it is a one-time, deterministic edit, never a per-turn or per-mode mutation.
Arcee narrows the first turn to 8 read-only tools (ARCEE_FIRST_TURN_NATIVE_TOOLS, tool_catalog.rs:106-115) as a Cloudflare WAF workaround — proof the active partition is already provider-shaped.
Subagent tools that are model-visible: only agent_open, agent_eval, tool_agent, agent_close (crates/tui/src/tools/registry.rs:1017-1029). All legacy names (agent_spawn, spawn_agent, agent_result, agent_wait, agent_send_input, agent_assign, agent_list, agent_cancel, resume_agent, delegate_to_agent, …) are #[allow(dead_code)] structs in crates/tui/src/tools/subagent/mod.rs, never instantiated outside tests → already not model-visible. The live internal send_input / cancel / resume methods on SubAgentManager (mod.rs:1495,1521,1605) back agent_eval / agent_close and stay.
tool_agent is "Fin" — the experimental fast-lane executor: DeepSeek V4 Flash with thinking forced off (mod.rs:5233, TOOL_AGENT_INTRO; DEFAULT_CHILD_MODEL = "deepseek-v4-flash", rlm.rs:26).
Known duplicates today: exec_wait ≡ exec_shell_wait, exec_interact ≡ exec_shell_interact (same structs, all four in the active set), tts ≡ speech (both deferred). todo_* are deferred twins of checklist_* (same TodoWriteTool, ::new vs ::checklist, todo.rs:187,194). The router already unifies exec_wait/exec_shell_wait (crates/tui/src/tui/tool_routing.rs:1139-1140).

This is the surface the north star refactors toward simplicity.

1. Intent Router

What it is. A thin layer where the model declares an intent — search / inspect / edit / test / delegate / ask-user / run-shell / run-workflow — and the harness maps that intent to the correct low-level tool and arguments. The model picks from a tiny, stable verb vocabulary instead of recalling ~80 concrete tool names and their schemas.

Why it helps weaker models. Tool-name recall is one of the largest sources of wasted turns for small models: choosing a deferred tool (double-invoke), choosing a deprecated alias, or hallucinating a name. A fixed intent vocabulary collapses that decision space to ~10 verbs. The model spends its budget on reasoning about the task, not on remembering the API.

Rough shape. A small canonical visible set — aspirational names that route onto today's tools:

| Intent verb (aspirational) | Routes onto today |
| --- | --- |
| `codebase_search` | concept-level retrieval over the hybrid index (§2); today: `grep_files` + `file_search` + `project_map` |
| `read_file` | `read_file` (already canonical) |
| `apply_patch` | `apply_patch` (canonical; `edit_file`/`write_file`/`fim_edit` remain as distinct lower-level tools) |
| `run_tests` | `run_tests` / `run_verifiers` |
| `git_status` | `git_status` |
| `git_diff` | `git_diff` |
| `work_update` | `update_plan` / `checklist_write` |
| `ask_user` | `request_user_input` |
| `shell_run` | `exec_shell` (canonical; `exec_wait`/`exec_interact` hidden, see §10) |
| `agent_run` | `agent_open` / `tool_agent` (gated, §3) / `agent_eval` / `agent_close` |
| `workflow_run` | WhaleFlow runner (§4) |

The router is the only place the catalog's full complexity is allowed to live. It is also where tool repair (§7) hooks in: a mis-stated intent or a deferred/deprecated name is rewritten to the canonical route.
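As a minimal sketch of the routing table above (not an API contract), the verb vocabulary could be a closed enum whose variants resolve to today's tool names. The `Intent` type and its `parse` and `routes` methods are hypothetical names introduced here for illustration; only the verb strings and tool names come from the table.

```rust
/// Hypothetical intent vocabulary; variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    CodebaseSearch,
    ReadFile,
    ApplyPatch,
    RunTests,
    GitStatus,
    GitDiff,
    WorkUpdate,
    AskUser,
    ShellRun,
    AgentRun,
    WorkflowRun,
}

impl Intent {
    /// Parse the verb a model emitted. Unknown verbs return `None` so the
    /// caller can hand them to a repair step instead of failing hard.
    pub fn parse(verb: &str) -> Option<Intent> {
        Some(match verb {
            "codebase_search" => Intent::CodebaseSearch,
            "read_file" => Intent::ReadFile,
            "apply_patch" => Intent::ApplyPatch,
            "run_tests" => Intent::RunTests,
            "git_status" => Intent::GitStatus,
            "git_diff" => Intent::GitDiff,
            "work_update" => Intent::WorkUpdate,
            "ask_user" => Intent::AskUser,
            "shell_run" => Intent::ShellRun,
            "agent_run" => Intent::AgentRun,
            "workflow_run" => Intent::WorkflowRun,
            _ => return None,
        })
    }

    /// Today's tools that each verb would route onto, per the table.
    pub fn routes(self) -> &'static [&'static str] {
        match self {
            Intent::CodebaseSearch => &["grep_files", "file_search", "project_map"],
            Intent::ReadFile => &["read_file"],
            Intent::ApplyPatch => &["apply_patch"],
            Intent::RunTests => &["run_tests", "run_verifiers"],
            Intent::GitStatus => &["git_status"],
            Intent::GitDiff => &["git_diff"],
            Intent::WorkUpdate => &["update_plan", "checklist_write"],
            Intent::AskUser => &["request_user_input"],
            Intent::ShellRun => &["exec_shell"],
            Intent::AgentRun => &["agent_open", "tool_agent", "agent_eval", "agent_close"],
            // The WhaleFlow runner does not exist yet, so there is no route today.
            Intent::WorkflowRun => &[],
        }
    }
}
```

Because the enum is closed and the route lists are `'static`, the visible verb set stays a compile-time constant, which fits the one-time-deterministic-edit requirement for the catalog head.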

Dependencies. The small canonical surface (§3), the lifecycle alias table (§3 / docs/TOOL_LIFECYCLE.md), and the hybrid index for codebase_search (§2). Must respect the catalog-head byte-stability invariant: the visible verb set is itself a one-time deterministic edit, not a dynamic per-turn list.

2. Default Hybrid Codebase Intelligence

What it is. An always-on, local-first codebase index that ships with the harness — not an opt-in tool the model has to remember to build. It fuses:

plain text search,
symbol index (definitions/references),
import / call graph,
FTS5 + BM25 lexical ranking (rusqlite is already a dependency — Cargo.toml),
sparse retrieval,
optional dense (embedding) retrieval,
PR / commit / issue history as a first-class retrieval source,
a codemap (structural overview, the successor to today's deferred project_map).

Why it helps weaker models. Today the model must orchestrate grep_files (content), file_search (filename), and project_map (structure) by hand, reconcile their outputs, and re-run them as it narrows. There is no FTS5/BM25 or semantic index today — every search is a cold walk (file_search uses the ignore crate's WalkBuilder for vendor exclusion, file_search.rs:~210). A weaker model burns turns stitching partial results. A single codebase_search intent backed by a hybrid index returns ranked, concept-level hits in one call, so the model reasons about answers, not query mechanics.

Rough shape. A background indexer maintains a SQLite store (FTS5 + symbol + graph tables), refreshed on file change and on git events. codebase_search (§1) queries it; the codemap is regenerated incrementally. Vendor exclusion reuses the existing ignore/WalkBuilder path.
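The document does not say how the lexical, symbol, and graph results would be merged into one ranking. One common candidate is reciprocal rank fusion (RRF), sketched below purely as an assumption: `rrf_fuse` is a hypothetical helper, the constant `k` is a tuning parameter, and nothing here reflects a committed design.

```rust
use std::collections::HashMap;

/// Fuse several ranked result lists with reciprocal rank fusion (RRF).
/// Each input list is ordered best-first. `k` dampens the weight of the
/// top ranks; 60.0 is a commonly used default.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            // Rank is 1-based, so the top hit contributes 1 / (k + 1).
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by path so output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A path that appears in both a lexical ranking and a symbol ranking accumulates score from each, so it tends to outrank a path that only one source found.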

Dependencies. rusqlite/FTS5; the Intent Router (§1) for the codebase_search verb; the trace store (§6/§8) for history retrieval. Full design lives in docs/CODEBASE_SEARCH_DESIGN.md (to be written this cycle).

3. Small Canonical Tool Surface

What it is. A deliberately tiny set of always-visible canonical tools; everything else is hidden, deferred, or skill-scoped. The catalog grows behind the scenes but the visible surface stays small and stable.

Why it helps weaker models. Fewer choices, no aliases competing for the same job, no deferred double-invokes for common operations. The model sees the verbs it needs and nothing else.

Rough shape — tool lifecycle states. Five states, represented as const name-sets plus an alias table in tool_catalog.rs (NOT a per-ToolSpec field, to preserve the byte-stable head):

active — in the first-turn catalog head.
deferred — registered, hydrated via tool-search.
hidden-compatibility — registered + dispatchable, dropped from both active and search, identical behavior, no notice. (For exact duplicates that should simply disappear from discovery.)
deprecated — registered + dispatchable, dropped from search, appends a replacement notice to RESULT METADATA only — never to the cached prefix.
removed — final state; no longer registered.

Invariant: deprecated and hidden-compatibility tools stay registered and dispatchable forever so old transcripts always replay deterministically.
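A rough sketch of how const name-sets plus an alias table could encode these states, assuming the helper names below (`lifecycle_of`, `is_dispatchable`, `is_searchable`, `replacement_notice`), which are hypothetical. The set contents are small samples drawn from names mentioned in this document, not the real catalog.

```rust
/// The five lifecycle states from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Active,
    Deferred,
    HiddenCompatibility,
    Deprecated,
    Removed,
}

// Illustrative samples only; the real name-sets would live in tool_catalog.rs.
const ACTIVE: &[&str] = &["exec_shell_wait", "exec_shell_interact"];
const DEFERRED: &[&str] = &["speech", "project_map"];
const HIDDEN_COMPATIBILITY: &[&str] = &["exec_wait", "exec_interact", "tts"];
// Alias table: (deprecated name, canonical replacement).
const DEPRECATED: &[(&str, &str)] = &[("todo_write", "checklist_write")];

/// Look up a tool's state; `None` means the name is not in any set here.
pub fn lifecycle_of(name: &str) -> Option<Lifecycle> {
    if ACTIVE.iter().any(|t| *t == name) {
        Some(Lifecycle::Active)
    } else if DEFERRED.iter().any(|t| *t == name) {
        Some(Lifecycle::Deferred)
    } else if HIDDEN_COMPATIBILITY.iter().any(|t| *t == name) {
        Some(Lifecycle::HiddenCompatibility)
    } else if DEPRECATED.iter().any(|(old, _)| *old == name) {
        Some(Lifecycle::Deprecated)
    } else {
        None
    }
}

impl Lifecycle {
    /// Everything except `Removed` stays registered and dispatchable.
    pub fn is_dispatchable(self) -> bool {
        self != Lifecycle::Removed
    }

    /// Only deferred tools are surfaced by tool-search.
    pub fn is_searchable(self) -> bool {
        self == Lifecycle::Deferred
    }
}

/// Replacement notice intended for result metadata only, never the cached prefix.
pub fn replacement_notice(name: &str) -> Option<String> {
    DEPRECATED
        .iter()
        .find(|(old, _)| *old == name)
        .map(|(old, new)| format!("{old} is deprecated; use {new}"))
}
```

Keeping the state in name-sets rather than on each tool spec means a lifecycle change edits one table and leaves the serialized tool definitions untouched.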

Planned diet (documented this cycle, not yet coded):

exec_wait, exec_interact, tts → hidden-compatibility (exact duplicates of exec_shell_wait, exec_shell_interact, speech).
todo_* (todo_write/add/update/list) → deprecated → checklist_* (drop from tool-search, keep registered, add result-metadata notice).
Legacy subagent names → already hidden; remaining work is cleanup + guardrail tests, rebased on PR #2684.

Explicitly NOT touched (distinct niches, per #2681 non-goals) — doc-only canonical guidance, no diet: apply_patch / edit_file / write_file / fim_edit; grep_files / file_search / project_map; fetch_url / web.run / web_search; task_shell_*; handle_read / retrieve_tool_result.

tool_agent gating decision. tool_agent ("Fin") stays as a canonical subagent tool, but is gated to DeepSeek-V4 models only. It is the fast, non-thinking executor lane built on deepseek-v4-flash; offering it to other providers/models is meaningless (the lane is a specific model) and would just add a name to recall. The gate is provider/model-conditional in the same spirit as the Arcee first-turn narrowing.
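The gate could be as small as a pure function of the model identifier. The sketch below assumes model ids are strings and that DeepSeek-V4 ids share a `deepseek-v4` prefix (as `deepseek-v4-flash` does); both function names are hypothetical.

```rust
/// Hypothetical gate: true when `tool_agent` should be offered to this model.
/// Assumes DeepSeek-V4 model ids start with "deepseek-v4".
pub fn tool_agent_visible(model_id: &str) -> bool {
    model_id.to_ascii_lowercase().starts_with("deepseek-v4")
}

/// Model-visible subagent tools for a given model, in a fixed order.
pub fn subagent_tools_for(model_id: &str) -> Vec<&'static str> {
    let mut tools = vec!["agent_open", "agent_eval", "agent_close"];
    if tool_agent_visible(model_id) {
        // Keep the same ordering the registry uses today.
        tools.insert(2, "tool_agent");
    }
    tools
}
```

Since the result depends only on the model id, the list is identical on every turn for a given model, so the gate does not become a per-turn mutation of the catalog head.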

Dependencies. The alias table backs the Intent Router (§1) and Tool Repair (§7). Full spec in docs/TOOL_LIFECYCLE.md (to be written this cycle).

4. WhaleFlow / Workflow Mode

What it is. A typed, multi-agent workflow runner. A workflow is a graph of typed nodes (branches, leaves, reviewers, verifiers, test-runners, PR-creators) with trace replay and a progress monitor. Authors write workflows in Starlark or YAML, which compile to a typed Rust IR; the Rust executor runs the IR. "Like Claude's workflow mode, but safer": the safety comes from the typed IR and the Rust execution boundary rather than from free-form, model-driven orchestration.

Why it helps weaker models. Long-running, multi-step work (implement → review → verify → test → open PR) is exactly where weaker models drift, lose state, or skip verification. Encoding the process as a typed graph means the model only has to be competent at each leaf, while the harness guarantees the sequencing, the verification gates, and the evidence trail.

Rough shape. Starlark/YAML → typed IR → Rust executor. Nodes map to subagent lanes (agent_open / tool_agent / agent_eval / agent_close, registry.rs:1017-1029). Reviewer/verifier/test-runner nodes are first-class node types, not ad-hoc prompts. Every run emits a trace (→ §8). Surfaced via /workflow (alias /whaleflow) and the workflow_run intent (§1).
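To make the "typed IR" idea concrete, here is a minimal sketch of what a compiled workflow might look like and how an executor could validate it before running anything. The `Node`, `NodeKind`, `IrError`, and `execution_order` names are hypothetical; the ordering uses Kahn's algorithm, which is one reasonable choice rather than a stated design.

```rust
use std::collections::HashMap;

/// Node types named in this section; the enum itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    Branch,
    Leaf,
    Reviewer,
    Verifier,
    TestRunner,
    PrCreator,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub id: String,
    pub kind: NodeKind,
    pub depends_on: Vec<String>,
}

impl Node {
    pub fn new(id: &str, kind: NodeKind, depends_on: &[&str]) -> Node {
        Node {
            id: id.to_string(),
            kind,
            depends_on: depends_on.iter().map(|d| d.to_string()).collect(),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum IrError {
    UnknownDependency(String),
    Cycle,
}

/// Validate the graph and return a dependency-respecting execution order.
pub fn execution_order(nodes: &[Node]) -> Result<Vec<String>, IrError> {
    let index: HashMap<&str, usize> =
        nodes.iter().enumerate().map(|(i, n)| (n.id.as_str(), i)).collect();
    let mut pending = vec![0usize; nodes.len()]; // unmet dependency counts
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); nodes.len()];
    for (i, node) in nodes.iter().enumerate() {
        for dep in &node.depends_on {
            let j = *index
                .get(dep.as_str())
                .ok_or_else(|| IrError::UnknownDependency(dep.clone()))?;
            pending[i] += 1;
            dependents[j].push(i);
        }
    }
    let mut ready: Vec<usize> = (0..nodes.len()).filter(|&i| pending[i] == 0).collect();
    let mut order = Vec::with_capacity(nodes.len());
    while let Some(i) = ready.pop() {
        order.push(nodes[i].id.clone());
        for &d in &dependents[i] {
            pending[d] -= 1;
            if pending[d] == 0 {
                ready.push(d);
            }
        }
    }
    // Any node left unscheduled is part of a dependency cycle.
    if order.len() == nodes.len() { Ok(order) } else { Err(IrError::Cycle) }
}
```

Rejecting unknown dependencies and cycles at compile-to-IR time is the kind of guarantee a free-form, model-driven plan cannot offer: a malformed workflow fails before any agent is opened.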

Dependencies. Subagent runtime; the evaluation loop (§8) for traces; Skills & Rules (§5) so a skill can define a workflow; the command taxonomy (§9).

5. Skills & Rules as First-Class Runtime

What it is. Skills and rules become real runtime objects, not just prompt text. Skills gain activation modes:

always-on — injected every turn,
glob — activated when matching files are in scope,
model-decision — offered to the model to opt into,
manual — only via explicit $<skill-name> invocation (§9).

Skills can restrict the tool surface, define workflows (§4), and inject repo context.

Why it helps weaker models. A skill scoped to a task can shrink the tool surface to exactly what that task needs and pre-load the relevant rules and context — so the model operates inside a curated, smaller world instead of the full catalog.

Rough shape (vs. today). Today: skills are discovered (crates/tui/src/tools/skills/mod.rs, discover_in_workspace ~421; struct parses name/description ~382-388), enable-state is tracked (skill_state.rs, SkillStateStore::is_enabled ~73), and there's an inline-mention popup (slash_menu.rs ~86). But: no parser activates inline $ mentions on submit (submit path: ui.rs build_queued_message ~4721), there is no activation-mode concept, and skills cannot restrict tools. The direction adds (a) a submit-time $<skill-name> activation parser, (b) the four activation modes in skill metadata, and (c) a tool-restriction field enforced by the registry/router.
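The activation modes and the tool-restriction field might look roughly like the following. Everything here is a sketch under assumptions: the `Skill`, `Activation`, and `TurnContext` types are hypothetical, glob matching is reduced to a `*.ext` suffix check as a stand-in for a real glob library, and combining restrictions by intersection is one possible policy the document does not specify.

```rust
/// The four activation modes described above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Activation {
    AlwaysOn,
    /// Simplified: only "*.ext"-style suffix patterns are handled here.
    Glob(String),
    ModelDecision,
    Manual,
}

pub struct Skill {
    pub name: String,
    pub activation: Activation,
    /// `None` means the skill does not restrict the tool surface.
    pub allowed_tools: Option<Vec<String>>,
}

/// Hypothetical per-turn inputs to the activation decision.
pub struct TurnContext<'a> {
    pub files_in_scope: &'a [&'a str],
    pub manually_invoked: &'a [&'a str],
    pub model_opted_in: &'a [&'a str],
}

pub fn is_active(skill: &Skill, ctx: &TurnContext<'_>) -> bool {
    match &skill.activation {
        Activation::AlwaysOn => true,
        Activation::Glob(pattern) => {
            let suffix = pattern.strip_prefix('*').unwrap_or(pattern.as_str());
            ctx.files_in_scope.iter().any(|f| f.ends_with(suffix))
        }
        Activation::ModelDecision => {
            ctx.model_opted_in.iter().any(|n| *n == skill.name.as_str())
        }
        Activation::Manual => {
            ctx.manually_invoked.iter().any(|n| *n == skill.name.as_str())
        }
    }
}

/// A tool is allowed only if every restricting active skill lists it.
pub fn tool_allowed(active_skills: &[&Skill], tool: &str) -> bool {
    active_skills
        .iter()
        .filter_map(|s| s.allowed_tools.as_ref())
        .all(|allowed| allowed.iter().any(|t| t.as_str() == tool))
}
```

With no restricting skill active, `tool_allowed` returns true for every tool, so the default surface is unchanged until a skill opts into narrowing it.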

Dependencies. Tool lifecycle/alias table (§3) for restriction; Intent Router (§1); WhaleFlow (§4); command taxonomy (§9). Full design in docs/SKILL_INVOCATION_DESIGN.md (to be written this cycle).

6. Context Memory Stack

What it is. Memory modeled as explicit, layered, inspectable stores rather than one undifferentiated blob. Each layer is visible, inspectable, clearable, and scoped:

User memory — small user prefs/facts (surfaced via /memory, §9).
Repo rules — checked-in guidance (/rules).
Codemap-wiki — derived structural/semantic knowledge of the repo (§2).
Trace store — recorded workflow/turn evidence (§8).
ARMH–RLM memo — the RLM kernel's in-session working memory (rlm_open/rlm_eval/rlm_configure/rlm_close/rlm_session_objects, crates/tui/src/tools/rlm.rs; handle_read retrieves var handles; finalize/FINAL is an in-kernel Python function, not a tool).
Cached-main overlay — promoted lessons from the cached main branch (/overlay, §9).
External memory (Aleph) — large local data via the aleph skill.

Why it helps weaker models. The model never has to guess where a fact should live or re-derive context it already established. Each layer has a clear scope and a clear command to inspect/clear it, so stale context is visible and removable rather than silently poisoning the prefix.

Rough shape. A /context dashboard (§9) renders all active layers and their sizes; /memory manages the small user layer; /overlay manages promoted lessons. The RLM layer already exists and is plumbed through rlm.rs.

Dependencies. Command taxonomy (§9); codebase intelligence (§2); evaluation loop (§8) for promotion into the overlay.

7. Tool Repair & Autoload

What it is. When the model emits a wrong, deferred, deprecated, or environment-blocked tool call, the harness repairs it instead of returning a bare error — and autoloads what's needed.

Why it helps weaker models. Recovery from a malformed call is precisely where weak models loop or give up. Turning every failure into an actionable, schema-bearing correction keeps the model on-task.

Rough shape — representative repairs:

Wrong/legacy name → "you meant agent_eval; here's the schema" (autoload the deferred tool's schema in the same turn).
Mode mismatch → "shell is unavailable in Plan mode — ask the user or switch modes".
Missing dependency → "this tool needs Node; Node is missing" (dependency probe via ExternalTool, already imported in tool_catalog.rs).
Deprecated alias → silently routed to the canonical tool, with the replacement notice in result metadata only (§3) — never the cached prefix.
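The four representative repairs can be sketched as a single pre-dispatch check. This is an illustration only: the `Repair` and `Env` types, the `repair` function, and every table entry are hypothetical (in particular, the legacy-to-canonical mapping and `example_node_tool` are invented samples, not real catalog data).

```rust
/// Outcome of inspecting a tool call before dispatch.
#[derive(Debug, PartialEq, Eq)]
pub enum Repair {
    /// The call is fine as-is.
    Dispatch(String),
    /// Route silently to the canonical tool; the notice goes in result metadata.
    Reroute { canonical: String, metadata_notice: String },
    /// Return an actionable correction to the model.
    Correct(String),
}

pub struct Env<'a> {
    pub plan_mode: bool,
    pub missing_binaries: &'a [&'a str],
}

// All tables below are hypothetical samples.
const LEGACY: &[(&str, &str)] = &[("agent_send_input", "agent_eval")];
const DEPRECATED: &[(&str, &str)] = &[("todo_write", "checklist_write")];
const SHELL_TOOLS: &[&str] = &["exec_shell"];
const NEEDS_BINARY: &[(&str, &str)] = &[("example_node_tool", "node")];

fn lookup(table: &[(&'static str, &'static str)], key: &str) -> Option<&'static str> {
    table.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

pub fn repair(tool: &str, env: &Env<'_>) -> Repair {
    if let Some(canonical) = lookup(LEGACY, tool) {
        return Repair::Correct(format!("you meant {canonical}; its schema is attached"));
    }
    if let Some(canonical) = lookup(DEPRECATED, tool) {
        return Repair::Reroute {
            canonical: canonical.to_string(),
            metadata_notice: format!("{tool} is deprecated; use {canonical}"),
        };
    }
    if env.plan_mode && SHELL_TOOLS.iter().any(|t| *t == tool) {
        return Repair::Correct(
            "shell is unavailable in Plan mode; ask the user or switch modes".to_string(),
        );
    }
    if let Some(binary) = lookup(NEEDS_BINARY, tool) {
        if env.missing_binaries.iter().any(|b| *b == binary) {
            return Repair::Correct(format!("{tool} needs {binary}; {binary} is missing"));
        }
    }
    Repair::Dispatch(tool.to_string())
}
```

Modeling the outcome as an enum keeps the "silent reroute" and "tell the model" cases distinct, so a deprecated alias never leaks a notice into model-visible text by accident.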

Dependencies. The alias table + lifecycle states (§3); the Intent Router (§1); dependency detection (ExternalTool). Builds on PR #2685's actionable RLM/field errors and PR #2684's lifecycle signals — must not contradict either.

8. Evaluation Loop

What it is. Every workflow run leaves evidence: the tests it ran, the diffs it produced, the failures it hit, the searches it issued, the claims it verified, and the PR outcome. A teacher/student replay turns good traces into reusable rules, skills, tests, and cached guidance.

Why it helps weaker models. The system gets better at this repo over time without the model getting smarter. Verified good traces become rules/skills the weaker model can lean on next time, and become the source of the cached-main overlay (§6).

Rough shape. Workflow nodes (§4) emit structured evidence into the trace store (§6). A replay/distillation pass (teacher reviews student trace) promotes high-value traces into: repo rules (/rules), skills (§5), regression tests, and overlay guidance (/overlay). Verified-claim tracking ties into the adversarial-verification posture already used elsewhere.

Dependencies. WhaleFlow (§4) for trace emission; trace store + overlay (§6); Skills & Rules (§5) as promotion targets.

9. Command-Surface Taxonomy

What it is. One name = one thing. The command surface is split so each prefix has a single, memorable responsibility:

| Surface | Responsibility |
| --- | --- |
| `/memory` | Small user prefs/facts only |
| `/context` | Dashboard of all active memory layers (§6) |
| `/rules` | Repo guidance |
| `.codewhale/constitution.json` | Repo constitution: checked-in local law |
| `/workflow` (`/whaleflow`) | Long-running multi-agent runs (§4) |
| `/overlay` | Promoted cached-main lessons (§6/§8) |
| `$<skill-name>` | Skill invocation: the token is the skill name |
| `codebase_search` | Concept-level code retrieval (§2) |

The repo constitution is not another memory bucket. It is the local-law layer in a layered authority model:

base myth / global Constitution
  -> repo constitution (.codewhale/constitution.json)
  -> task packet
  -> runtime policy

At conflict time, the current user request for the task remains above the repo constitution; the repo constitution supplies durable defaults and local law only when the active task packet and runtime policy leave room. Runtime policy is the compiled enforcement surface for the run, not a separate place for the model to invent new rules.

Why it helps weaker models (and users). No overloaded command does five jobs; the model/user never has to disambiguate which /memory behavior they meant. $systematic-debugging self-documents what it invokes.

/memory subcommand sketch:

/memory add "<fact>"        # store a small pref/fact
/memory edit                # edit stored facts
/memory search <query>      # find a stored fact
/memory clear               # clear user memory
/memory doctor              # health check; detects legacy ~/.deepseek path
/memory promote <fact>      # (later) promote a fact to a higher layer

doctor specifically detects the legacy ~/.deepseek path and guides migration.
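A minimal parser for the subcommand sketch above might look like this. The `MemoryCmd` enum and `parse_memory_cmd` function are hypothetical, and the quoting rule (strip one pair of surrounding double quotes) is an assumption about how `"<fact>"` would be written.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MemoryCmd {
    Add(String),
    Edit,
    Search(String),
    Clear,
    Doctor,
    Promote(String),
}

/// Parse one `/memory ...` line into a typed command.
pub fn parse_memory_cmd(line: &str) -> Result<MemoryCmd, String> {
    let rest = line
        .trim()
        .strip_prefix("/memory")
        .ok_or("not a /memory command")?
        .trim();
    let (sub, arg) = match rest.split_once(char::is_whitespace) {
        Some((s, a)) => (s, a.trim()),
        None => (rest, ""),
    };
    // Strip one pair of surrounding double quotes, if present.
    let arg = arg
        .strip_prefix('"')
        .and_then(|a| a.strip_suffix('"'))
        .unwrap_or(arg);
    // Subcommands that take an argument reject an empty one.
    let need = |what: &str| -> Result<String, String> {
        if arg.is_empty() {
            Err(format!("/memory {sub} needs {what}"))
        } else {
            Ok(arg.to_string())
        }
    };
    match sub {
        "add" => need("a fact").map(MemoryCmd::Add),
        "search" => need("a query").map(MemoryCmd::Search),
        "promote" => need("a fact").map(MemoryCmd::Promote),
        "edit" => Ok(MemoryCmd::Edit),
        "clear" => Ok(MemoryCmd::Clear),
        "doctor" => Ok(MemoryCmd::Doctor),
        other => Err(format!("unknown /memory subcommand: {other}")),
    }
}
```

Returning a typed command rather than acting on raw strings keeps `/memory` to the single responsibility the taxonomy assigns it: anything outside the six subcommands is an error, not a fallback behavior.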

$<skill-name> invocation examples:

$systematic-debugging       # local skill
$github:gh-fix-ci           # namespaced skill

The submit-time parser (to be added; submit path ui.rs ~4721) recognizes the $ token and activates the named skill (§5).

/context layers dashboard (example render):

/context
  user-memory        ▸ 7 facts                        (12 KB)  [clear]
  repo-constitution  ▸ .codewhale/constitution.json   (4 KB)   [view]
  repo-rules         ▸ CLAUDE.md, AGENTS.md           (8 KB)   [view]
  codemap-wiki       ▸ 412 symbols indexed            (auto)   [rebuild]
  trace-store        ▸ 3 recent workflow runs         (—)      [open]
  rlm-memo           ▸ 0 active sessions              (—)      [—]
  cached-overlay     ▸ 5 promoted lessons             (3 KB)   [view]
  aleph-external     ▸ not attached                   (—)      [attach]

Dependencies. Memory stack (§6); skills (§5); codebase intelligence (§2); workflow runner (§4).

10. Deferred-Not-Done 0.8.53 Diet Items

Recorded here so they are not silently dropped — these were considered for the 0.8.53 diet and deliberately deferred (design-only or out of scope this cycle):

File-mutation overload — apply_patch / edit_file / write_file / fim_edit overlap in purpose. Per #2681 non-goals these stay distinct; canonical guidance (prefer apply_patch) is doc-only, no consolidation this cycle.
task_shell_* ↔ exec_* redundancy — task_shell_start / task_shell_wait overlap conceptually with the exec_* family. Left intact this cycle (distinct niche per #2681); revisit under §1/§3.
handle_read / retrieve_tool_result — result-handle plumbing kept as-is (doc-only canonical guidance); folds naturally into the memory stack (§6) and intent routing (§1) later.
Search-cluster consolidation — grep_files / file_search / project_map remain three tools this cycle; consolidation is the job of the hybrid index (§2) under codebase_search, not a catalog edit in 0.8.53.

Phased Roadmap

0.8.53 — design + small fixes only

Code: only the already-scoped, narrow fixes — PR #2684 (subagent role vocab, lifecycle signals, eval ergonomics) and PR #2685 (read-only git history active + actionable RLM/field errors). Subagent legacy-name cleanup + guardrail tests rebased on #2684.
Docs: this north star, plus docs/TOOL_LIFECYCLE.md, docs/CODEBASE_SEARCH_DESIGN.md, docs/SKILL_INVOCATION_DESIGN.md.
No tool-catalog code: the diet (§3), the Intent Router (§1), and the hybrid index (§2) are documented, not coded this cycle.

0.9.0 — first structural moves

Implement the tool lifecycle const name-sets + alias table in tool_catalog.rs (§3) as a one-time deterministic head edit.
Land the planned diet: exec_wait/exec_interact/tts → hidden-compatibility; todo_* → deprecated → checklist_* (result-metadata notice only).
Gate tool_agent to DeepSeek-V4 models only (§3).
First version of the default hybrid codebase index (FTS5/BM25 + symbol + codemap) behind codebase_search (§2).
First Intent Router verbs mapping onto today's tools (§1).
Tool Repair for deferred/deprecated/mode/dependency cases (§7).

Later (post-0.9.0)

WhaleFlow typed-IR workflow runner (§4) and the evaluation loop / teacher-student replay (§8).
Skills activation modes + tool restriction + $<skill-name> submit-time activation (§5).
Full Context Memory Stack with /context dashboard, /overlay promotion, and Aleph external memory (§6).
Dense/semantic retrieval and PR/commit/issue history in the index (§2).
Search-cluster consolidation and the remaining §10 deferred items.

North-star one-liner

The harness handles memory, search, routing, state, and guardrails — so a weaker model can just think.
