# System Prompt Analysis: "Mismanaged Genius" Hypothesis

## Methodology

Read every prompt layer (`base.md`, mode overlays, personality, approval policies), traced the assembly logic in `prompts.rs`, and compared what DeepSeek V4 can actually do against what the prompt currently encourages.

---

## Summary: The Prompt Is Cautious, Not Strategic

The current prompt has excellent safety rails: clear "when NOT to use" guidance, anti-hallucination instructions, and decomposition philosophy. But it treats the model's most powerful capabilities (RLM, sub-agents, parallel tool execution) as **specialty escape hatches** rather than **default strategic tools**. The result: a capable model that hesitates to parallelize, underuses its fan-out abilities, and serializes work that could be done concurrently.

The prompt was written when the model was less reliable and needed guardrails. V4 models can handle more autonomy; the prompt should reflect that.

---

## Gap-by-Gap Analysis

### Gap 1: RLM Is Framed as a Last Resort, Not a Strategic Tool

**Current text** (`base.md`, "RLM Is a Specialty Tool"):

> `rlm` is for one specific shape of work: a long input that genuinely does not fit
> in your context. Reach for it ONLY when direct reasoning over the input is impossible
> because of its size.

**Problem**: RLM is actually three tools in one:

1. Chunk-and-process for long inputs (the only case the prompt acknowledges)
2. Parallel `llm_query_batched` for multi-angle analysis (e.g., "classify these 20 items")
3. `rlm_query` for recursive decomposition of problems that benefit from sub-LLM critique

The prompt actively discourages cases 2 and 3. A model that could classify 20 files in parallel instead reads them one at a time. A model that could get a "second opinion" on its reasoning from a sub-LLM instead trusts its first pass.
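To make the cost of the serial pattern concrete, here is a minimal sketch comparing one-at-a-time classification with a concurrent fan-out. It does not call the real RLM helpers: `fake_llm_query` is a hypothetical stand-in that sleeps to simulate latency, and `classify_batched` only imitates the general shape of what `llm_query_batched` is described as doing (its actual signature is not shown in this analysis).

```python
import asyncio
import time

# Hypothetical stand-in for a sub-LLM call; the real helper is not shown here.
async def fake_llm_query(item: str, latency: float = 0.02) -> str:
    await asyncio.sleep(latency)  # simulate network + inference latency
    return "rust" if item.endswith(".rs") else "other"

async def classify_sequential(items: list[str]) -> list[str]:
    # One call at a time: total time grows linearly with len(items).
    return [await fake_llm_query(item) for item in items]

async def classify_batched(items: list[str]) -> list[str]:
    # Fan out every call at once; gather returns results in input order.
    return list(await asyncio.gather(*(fake_llm_query(i) for i in items)))

def timed(coro_fn, items):
    """Run an async classifier and return (labels, elapsed_seconds)."""
    start = time.perf_counter()
    labels = asyncio.run(coro_fn(items))
    return labels, time.perf_counter() - start

# 20 illustrative file names, alternating between two extensions.
ITEMS = [f"file_{n}.{'rs' if n % 2 == 0 else 'md'}" for n in range(20)]

if __name__ == "__main__":
    seq_labels, seq_s = timed(classify_sequential, ITEMS)
    par_labels, par_s = timed(classify_batched, ITEMS)
    print(f"same labels: {seq_labels == par_labels}")
    print(f"sequential: {seq_s:.2f}s, batched: {par_s:.2f}s")
```

Both paths produce the same labels; only the wall-clock time differs, which is the gap the restrictive framing leaves on the table.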
**Suggested rewrite**: replace the restrictive framing with a capability guide:

```
## RLM: When to Use It

RLM loads input into a Python REPL where you write code that calls sub-LLM
helpers (`llm_query`, `llm_query_batched`, `rlm_query`). Three patterns, not one:

**CHUNK**: A single input that genuinely doesn't fit in your context window
(a whole file > 50K tokens, a long transcript, a multi-document corpus).
Split it, process each chunk, synthesize.

**BATCH**: Many independent items that each need LLM attention (classify 20
entries, extract fields from 30 documents, score 15 candidates). Use
`llm_query_batched` for parallel execution; it fans out to the same DeepSeek
client and finishes in one turn what would take 15 sequential reads.

**RECURSE**: A problem that benefits from decomposition + critique. Use
`rlm_query` to have a sub-LLM review your reasoning, identify gaps, or explore
alternative approaches. The sub-LLM returns a synthesized answer you verify
against live tool output.

**When NOT to use RLM**: a single short file you can read directly; a simple
classification on 3 items; interactive iterative exploration (RLM is one-shot
batch). For those, `read_file`, `grep_files`, or `agent_spawn` are faster and
cheaper.
```

### Gap 2: Sub-Agents Are "Implementation, Not Exploration"

**Current text** (`base.md`, "When NOT to use `agent_spawn`"):

> You haven't first laid out a plan with `checklist_write`. Sub-agents are
> implementation, not exploration.

**Problem**: This directly contradicts the Plan mode prompt, which correctly says "Spawn read-only sub-agents for parallel investigation." But the Agent mode prompt gets the restrictive version. The result: in Agent mode (where most work happens), the model treats sub-agents as a last step ("now implement the plan") rather than a discovery tool ("investigate these 4 things in parallel to understand the problem").

**Reality**: Sub-agents are the BEST tool for parallel exploration.
A single `agent_spawn` call that fans out to 3 read-only children investigating different modules is faster AND more thorough than reading them sequentially.

**Suggested rewrite**: move sub-agent guidance from "when NOT to use" to a positive section:

```
## Sub-Agent Strategy

Sub-agents are cheap: DeepSeek V4 Flash costs $0.14/M input tokens. Use them
liberally for parallel work:

- **Parallel investigation**: When you need to understand 3+ independent files
  or modules, spawn one read-only sub-agent per target. They run concurrently
  and return structured findings you synthesize.
- **Parallel implementation**: After a plan is laid out (`checklist_write` +
  `update_plan`), spawn one sub-agent per independent leaf task. Each does one
  thing well; you integrate results.
- **Solo tasks**: For a single read, a single search, or a focused question,
  do these yourself. Spawning has overhead; one-turn reads are faster direct.
- **Sequential work**: If step B depends on step A's output, run A yourself,
  then decide whether to spawn B based on what A found.
```

### Gap 3: No "Batch Everything" Instinct

**Current text** (`base.md`, "Your V4 Characteristics"):

> **Parallel execution.** Batch independent reads, searches, and greps into a single
> turn. Never serialize operations that can run concurrently — parallel tool calls
> share the same turn and finish faster.

**Problem**: This instruction is correct but buried in a V4 Characteristics section the model may not internalize as a behavioral rule. The model often fires one tool, waits for the result, then fires another, even when both are independent.

**Suggested addition**: add a concrete heuristic at the top of the toolbox section:

```
## Parallel-First Heuristic

Before you fire any tool, scan your plan: is there another tool you could run
concurrently? If two operations don't depend on each other, batch them.

Examples:
- Reading 3 files → 3 `read_file` calls in one turn
- Searching for 2 patterns → 2 `grep_files` calls in one turn
- Checking git status AND reading a config → `git_status` + `read_file` in one turn

The dispatcher runs parallel tool calls simultaneously. Serializing independent
operations wastes the user's time and your context budget.
```

### Gap 4: Thinking Budget Too Conservative for V4

**Current text** (`base.md`, "Thinking Budget"):

| Task type | Thinking depth | Rationale |
|-----------|---------------|-----------|
| Simple factual lookup | Skip | Answer is immediate |
| Code generation (single function) | Light | Pattern-matching |

**Problem**: V4 models have 1M context and produce thinking tokens that improve output quality even for "simple" tasks. Skipping thinking on a factual lookup is correct. But "Light" for code generation understates the value of thinking: a 30-second think before writing a function catches edge cases, checks against project conventions, and prevents rework.

**Suggested rewrite**: bump the defaults up one tier:

| Task type | Thinking depth | Rationale |
|-----------|---------------|-----------|
| Simple factual lookup (read, search) | Skip | Answer is immediate |
| Tool output interpretation | Light | Verify result matches intent |
| Code generation (single function) | Medium | Conventions, edge cases, context fit |
| Multi-file refactor | Medium | Cross-file dependencies |
| Debugging (error to root cause) | Deep | Hypothesis generation |
| Architecture design | Deep | Trade-offs, constraints |
| Security review | Deep | Adversarial reasoning |

### Gap 5: No "Verify Before Claiming" Pattern

**Current state**: The subagent output format (`subagent_output_format.md`) has an EVIDENCE section that requires concrete artifact citations. This is excellent. But the main prompt (`base.md`) doesn't establish this as a general habit.

**Problem**: The model sometimes reads a file, then writes a patch based on its memory of the file rather than re-reading the specific lines it's changing. Or it claims a shell command succeeded based on exit code 0 without checking the output.

**Suggested addition**: add to the "Decomposition Philosophy" section:

```
## Verification Principle

After every tool call that produces a result you'll act on, verify before proceeding:
- File reads: confirm the lines you're about to patch contain what you think they do
- Shell commands: check stdout, not just exit code
- Search results: confirm the match is what you expected
- Sub-agent results: cross-check one finding against a direct `read_file`

Don't claim a change worked until you've observed evidence. Don't trust memory
over live tool output.
```

### Gap 6: No Composition Heuristic for Complex Work

**Current state**: The prompt says "For complex initiatives, layer `update_plan` above `checklist_write`." This is correct but vague. The model sometimes creates a plan, creates a checklist, and then works through the checklist without re-evaluating the plan.

**Suggested addition**:

```
## Composition Pattern for Multi-Step Work

For any task estimated to take 5+ steps:
1. `update_plan`: 3-6 high-level phases (status: pending)
2. `checklist_write`: concrete leaf tasks under the first phase (mark first `in_progress`)
3. Execute phase 1, updating checklist as you go
4. After each phase completes, re-read your plan: does the next phase still
   make sense? Update the plan if new information changes the approach.
5. When a phase reveals sub-problems, add them to the checklist or spawn
   investigation sub-agents; don't guess.
```

### Gap 7: Approval Mode Contradiction

**Current state**: The Agent mode approval policy says "Any write, patch, shell execution, sub-agent spawn, or CSV batch operation will ask for approval first." But the "Key principle" says "make your work visible" and encourages `checklist_write` to populate the sidebar.

**Problem**: In Agent mode, the model often waits for approval on EACH step individually. A batch of 3 `edit_file` calls requires 3 separate approval rounds. The prompt should encourage batching approvals: present the full plan, get approval once, then execute all writes in parallel.

**Suggested addition**: add to the Agent mode overlay:

```
## Efficient Approvals

When your plan includes multiple writes, present them together:
1. Show `checklist_write` with all write steps listed
2. Request approval for the batch ("I need to make 3 edits across 2 files...")
3. Once approved, execute all writes in one turn (parallel `edit_file` / `apply_patch` calls)

Don't sequence approvals one at a time. The user wants context, not interruption.
```

---

## Concrete Prompt Changes

### 1. `base.md`: Replace "RLM Is a Specialty Tool" section

Remove the current restrictive "RLM Is a Specialty Tool" section entirely. Replace it with the rewritten RLM section from Gap 1 above.

### 2. `base.md`: Replace "When NOT to use `agent_spawn`"

Remove the bullet about sub-agents from the "When NOT to use" section. Replace it with a new positive "Sub-Agent Strategy" section (Gap 2 above) placed immediately after the "Decomposition Philosophy" section.

### 3. `base.md`: Add "Parallel-First Heuristic"

Insert after the toolbox reference section, before "When NOT to use." (Gap 3 above.)

### 4. `base.md`: Bump thinking budget defaults

Change the "Code generation (single function)" row from Light → Medium. (Gap 4 above.) Single-line change.

### 5. `base.md`: Add "Verification Principle"

Insert as a sub-heading under "Decomposition Philosophy." (Gap 5 above.)

### 6. `base.md`: Add "Composition Pattern"

Insert as a sub-heading under "Decomposition Philosophy," after "Verification Principle." (Gap 6 above.)

### 7. `modes/agent.md`: Add "Efficient Approvals"

Insert at the end of the Agent mode overlay. (Gap 7 above.)
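The batching behaviour that change 7 asks for can be sketched as follows. This is a hypothetical illustration, not the tool's approval code: `PendingWrite`, `summarize`, `run_batch`, and the two callbacks are invented names, and the approval callback is a plain function standing in for whatever prompt the real client shows.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class PendingWrite:
    path: str          # file the edit targets
    description: str   # one-line summary shown to the user

def summarize(writes: list[PendingWrite]) -> str:
    """Build a single approval message covering every pending write."""
    files = {w.path for w in writes}
    header = f"I need to make {len(writes)} edits across {len(files)} files:"
    return "\n".join([header] + [f"- {w.path}: {w.description}" for w in writes])

def run_batch(
    writes: list[PendingWrite],
    request_approval: Callable[[str], bool],
    apply_write: Callable[[PendingWrite], str],
) -> list[str]:
    """Ask once for the whole batch, then apply every write."""
    if not writes or not request_approval(summarize(writes)):
        return []  # nothing to do, or the user declined the batch
    # Applied in a loop here; a real client could dispatch these concurrently.
    return [apply_write(w) for w in writes]

if __name__ == "__main__":
    batch = [
        PendingWrite("src/a.rs", "rename helper"),
        PendingWrite("src/a.rs", "update call site"),
        PendingWrite("src/b.rs", "fix import"),
    ]
    prompts_shown: list[str] = []

    def approve(message: str) -> bool:
        prompts_shown.append(message)  # record each approval round
        return True

    applied = run_batch(batch, approve, lambda w: f"wrote {w.path}")
    print(prompts_shown[0])
    print(f"{len(prompts_shown)} approval round, {len(applied)} writes applied")
```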
***

## What NOT to Change

- **"When NOT to use `exec_shell`"**: this guidance is correct and important. Typed tools beat shell-outs for reliability.
- **"When NOT to use `edit_file` / `apply_patch`"**: tool selection rules are good and prevent blind patching.
- **Preamble rhythm**: the tone guidance is well-calibrated.
- **Output formatting**: terminal constraints are real; the guidance is correct.
- **Context management**: the ~80% compaction suggestion is practical.
- **Sub-agent sentinel protocol**: the integration pattern is well-defined.

---

## Risk Assessment

**Risk: Over-parallelization**. A model told to "batch everything" might spawn sub-agents for trivial reads. Mitigation: the "Solo tasks" bullet in the new sub-agent strategy section explicitly says "do these yourself."

**Risk: Over-thinking**. Bumping the thinking budget might waste tokens on simple code generation. Mitigation: "Medium" for single-function generation is still conservative; the model can self-regulate with the existing guidance "skip for lookups."

**Risk: RLM over-use**. Framing RLM as a strategic tool might cause inappropriate use for tasks better served by `agent_spawn`. Mitigation: the new "When NOT to use RLM" paragraph covers the common failure modes.

**Risk: Cache busting**. Adding text to the system prompt changes its byte representation, which busts the prefix cache for the first turn after the change. Mitigation: this is a one-time cost; subsequent turns hit the cache at the new prompt boundary.
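The cache-busting risk is easier to reason about with a small sketch. It assumes a prefix cache that can reuse work only for the longest prefix shared between the old and new prompt, and it compares characters rather than tokens for simplicity. `shared_prefix_len`, `reusable_fraction`, and the sample prompts are illustrative and not taken from `prompts.rs`.

```python
def shared_prefix_len(old: str, new: str) -> int:
    """Length of the longest common prefix of two prompts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

def reusable_fraction(old: str, new: str) -> float:
    """Share of the new prompt that a prefix cache could still serve."""
    return shared_prefix_len(old, new) / len(new) if new else 0.0

# Toy prompts: each line stands in for a whole section.
OLD = "BASE RULES\nTOOLBOX\nWHEN NOT TO USE\n"
# Section inserted in the middle: everything after the insert point misses.
MID = "BASE RULES\nPARALLEL-FIRST\nTOOLBOX\nWHEN NOT TO USE\n"
# Section appended at the end: the whole old prompt is still a shared prefix.
END = OLD + "EFFICIENT APPROVALS\n"

if __name__ == "__main__":
    for name, new in (("mid-insert", MID), ("append", END), ("unchanged", OLD)):
        print(f"{name}: {reusable_fraction(OLD, new):.0%} of the new prompt reusable")
```

Under this assumption, the first request after the change recomputes everything past the shared prefix, and later requests share the full new prompt, which is the one-time cost described above.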