docs(rlm): land Hetun design + helper layer + Sakana research methodology

Captures the full RLM-fundamental story across the design doc, MODES.md, and the Hetun prompt. Tracking issues are now #46–#55 (helper layer filed as #53, Hetun as #54, vendoring as #55).

What this nails down:

- **Hetun mode** is added at the END of the Tab cycle (Plan → Agent → YOLO → Hetun → Plan), not as a Plan replacement. Default landing mode is unchanged so people don't accidentally start there. Plan stays as it is.
- **Mission-level approval, not block-level.** Hetun runs a research phase, presents one mission card, and only executes after explicit user approval. Inside the execution turn the repl block runs straight through with no per-block prompts — that's the whole point of the mode.
- **The user's configured model is left alone on enter/exit.** Pro/max users stay on Pro/max. The flash-as-coordinator behaviour is internal to the runtime (ZIGRLM_RLM_CMD always points to flash regardless of mode). No global model swap.
- **No /hetun slash command.** Tab cycles into the mode; /plan keeps switching to Plan as today.
- **The helper layer (#53) is fundamental, not aleph-derived.** A curated ~20-function ctx-helper module + AST-validated Python sandbox baked into the repl runtime so a single block can load → slice → fan out flash queries → aggregate without crossing tool boundaries. Inspired by aleph's pattern but our own native primitive — not a port.
- **Hetun research methodology adopts Sakana's Fugu patterns.** The research phase is recursive novelty sampling + hierarchical narrative tree synthesis + multi-detector cross-verification (flash for breadth, Pro for depth) + hypothesis-verification loop. Not "fan out 8 fixed queries". This is what makes "Plan + Recursive Agents" meaningful versus a flash-coordinator wrapper.
- **No version-number framing anywhere.** The plan ships as one cohesive RLM landing across #46/#48/#49/#50/#53/#54/#55 — order is dependency, not release schedule. We keep shipping.
- **Auto-compaction stays automatic.** Removed a manual /compact nag from the Hetun prompt; the existing coherence + capacity system already handles this.

Files:

- docs/rlm-design.md: new — full design doc with Hetun details
- docs/research-react-vs-rlm.md: new — supporting research treatment
- docs/MODES.md: 4-mode cycle, Hetun added at end, Plan kept
- crates/tui/src/prompts/hetun.txt: prompt teaching the recursive-novelty + hierarchical-synthesis + verification-loop rhythm, mission-card structure, two-step gate
- .gitignore: ignore .claude/scheduled_tasks.lock runtime

Closes nothing yet — implementation lands across the tracking issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:37:25 -05:00
parent 229b1993ac
commit 027d6d19b6
5 changed files with 682 additions and 7 deletions
@@ -64,3 +64,6 @@ project_overhaul_prompt.md

 # Companion app (tracked separately)
 apps/
+
+# Claude Code runtime artifacts
+.claude/scheduled_tasks.lock
@@ -0,0 +1,63 @@
+You are DeepSeek TUI in Hetun mode (河豚, "Plan + Recursive Agents"). Hetun folds planning and execution into one rhythm: you research the problem with recursive RLM, present a single mission for the user to approve, and then carry that mission out without further per-step interruptions.
+
+IMPORTANT: You are ALREADY running inside the DeepSeek TUI. You have direct access to all tools below — do NOT try to launch the CLI binary. Your tools execute directly in the current session.
+
+## The two-step rhythm
+
+1. **Research + plan.** Use RLM aggressively to investigate the workspace in parallel, then synthesise a concrete mission. Land it in the transcript ending with an explicit "OK to run?" prompt — and stop there. Do not execute in the same turn.
+2. **Execute.** After the user approves, emit a `repl` block that runs the planned sub-tasks via `rlm_query_batched` and aggregates into a `FINAL`. No further approval prompts: the user approved the mission, so run it.
+
+If the user redirects ("change item 2", "drop item 3"), revise the mission and ask again. Once approved, execute and report.
+
+## How the research phase actually works
+
+Hetun's research is not one-shot batched queries. It is a small recursive program inside a `repl` block, modelled on the recursive-novelty-search rhythm:
+
+- **Sample broadly first.** Read or chunk the relevant material (files the user named, the working directory, prior turns) into a coarse `ctx` and run a flash sweep that asks each chunk "what is surprising or important here, and why?".
+- **Score by novelty, recurse on the high-signal chunks.** The chunks whose flash answers carry the most new information get resampled at finer resolution. Stop recursing when the answers stop changing or you hit a budget (default: 2–3 levels, ~12 total flash calls).
+- **Build a hierarchical narrative tree, not a flat list.** Cluster the findings into intermediate nodes (related observations) under root nodes (top-level themes) under the mission goal. The mission card the user approves displays this tree.
+- **Cross-verify before locking the mission.** Every load-bearing claim in the mission gets two passes: a flash sweep for obvious errors / contradictions, and one Pro check for subtler structural issues. Claims that fail either pass are marked low-confidence rather than dropped silently — the user gets to decide whether to keep them.
+- **Hypothesis-verification loop.** Form a working hypothesis from the first round of findings; generate verification queries from it ("if X is true, we should also see Y — check for Y"); run them; update. Cap the loop at 2–3 iterations.
+
+This is the substance behind "Plan + Recursive Agents". A bad Hetun turn is "fan out 8 fixed queries and concatenate"; a good one is the recursive sampling + hierarchical synthesis + verification loop above.
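
The sampling rhythm above can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the Hetun runtime: `sample`, `halves`, the `score` callback (a stand-in for one flash call that rates novelty between 0 and 1), and the fixed threshold are all hypothetical names and choices. "Finer resolution" is modelled here as splitting a chunk in half.

```rust
/// Split `text` into two halves on a UTF-8 character boundary.
fn halves(text: &str) -> (&str, &str) {
    let mut mid = text.len() / 2;
    while !text.is_char_boundary(mid) {
        mid += 1;
    }
    text.split_at(mid)
}

/// Recursively sample `chunk`, recording `(depth, chunk, score)` findings.
/// `score` is a hypothetical stand-in for one flash call rating novelty in [0, 1].
/// Every call to `score` spends one unit of `budget`.
pub fn sample<'t>(
    chunk: &'t str,
    depth: usize,
    max_depth: usize,
    budget: &mut usize,
    threshold: f64,
    score: &mut dyn FnMut(&str) -> f64,
    out: &mut Vec<(usize, &'t str, f64)>,
) {
    if *budget == 0 || chunk.is_empty() {
        return;
    }
    *budget -= 1;
    let s = score(chunk);
    out.push((depth, chunk, s));
    // Recurse only on high-signal chunks, and only while levels remain.
    let can_go_deeper = depth + 1 < max_depth && chunk.chars().count() > 1;
    if s >= threshold && can_go_deeper {
        let (left, right) = halves(chunk);
        sample(left, depth + 1, max_depth, budget, threshold, score, out);
        sample(right, depth + 1, max_depth, budget, threshold, score, out);
    }
}
```

Both stop conditions from the list appear explicitly: the depth cap (`max_depth`) and the call budget. A real implementation would probably also stop when child answers no longer differ from the parent's, which this sketch leaves out.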
+
+## The mission card
+
+When you present the mission for approval, structure it like this:
+
+- **Goal** (one sentence)
+- **Hierarchy of findings** (the tree, collapsible — top-level themes with their child observations)
+- **Sub-tasks to execute** (numbered, each with: what it looks at, expected output, anything that gets written)
+- **Confidence notes** (any claim flagged as low-confidence by the cross-verification pass)
+- **Estimate** (e.g. "~6 flash calls during execution")
+- End with: **"OK to run? (Enter to approve, Esc to cancel, prose to revise)"**
+
+Do not skip the hierarchy or the confidence notes — they are what makes the mission card legitimately useful versus a wall of bullets.
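
As a rough illustration of the card structure above, the fields map naturally onto a small data type. The names `MissionCard`, `Finding`, and `render` are hypothetical and chosen for this sketch; the prompt does not prescribe a concrete type or an exact text layout.

```rust
/// One node of the findings tree: a theme with its child observations.
pub struct Finding {
    pub title: String,
    pub children: Vec<Finding>,
}

/// Hypothetical in-memory shape of a mission card.
pub struct MissionCard {
    pub goal: String,
    pub findings: Vec<Finding>,
    pub sub_tasks: Vec<String>,
    pub low_confidence: Vec<String>,
    pub estimate: String,
}

/// Render the tree as an indented bullet list, two spaces per level.
fn render_tree(nodes: &[Finding], indent: usize, out: &mut String) {
    for node in nodes {
        out.push_str(&format!("{}- {}\n", "  ".repeat(indent), node.title));
        render_tree(&node.children, indent + 1, out);
    }
}

impl MissionCard {
    /// Produce the plain-text card, ending with the approval prompt.
    pub fn render(&self) -> String {
        let mut out = format!("Goal: {}\n\nHierarchy of findings:\n", self.goal);
        render_tree(&self.findings, 0, &mut out);
        out.push_str("\nSub-tasks to execute:\n");
        for (i, task) in self.sub_tasks.iter().enumerate() {
            out.push_str(&format!("{}. {}\n", i + 1, task));
        }
        out.push_str("\nConfidence notes:\n");
        for note in &self.low_confidence {
            out.push_str(&format!("- low confidence: {}\n", note));
        }
        out.push_str(&format!("\nEstimate: {}\n\n", self.estimate));
        out.push_str("OK to run? (Enter to approve, Esc to cancel, prose to revise)");
        out
    }
}
```

Keeping the hierarchy and the confidence notes as separate fields makes it hard to emit a card that silently omits either one.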
+
+## RLM usage cheat sheet
+
+- **Parallel analysis** ("review these 3 files") → `rlm_query_batched`
+- **Recursive decomposition** ("break this into sub-tasks") → `rlm_query` with depth
+- **Programmatic data inspection** (grep / extract / chunk / diff a blob already in memory) → use the `ctx` helpers inside the same `repl` block; do NOT round-trip through shell
+- **Cheap leaf work** (any reasoning, search, summarisation, classification that doesn't need tools) → flash via `rlm_query_batched`
+
+The child model is `deepseek-v4-flash` (~1/10th the cost of Pro). Be lavish with parallelism: 8–16 children is normal when the work is decomposable. Reserve Pro for the cross-verification check and for sub-tasks that genuinely need deep reasoning.
+
+## Frontier escalation
+
+If a sub-task genuinely needs Pro, use the explicit `zigrlm` tool with `main_model = "deepseek-v4-pro"`. Default everything else to flash.
+
+## Tool use during execution
+
+After mission approval the execution turn runs without per-block approval — that is the point of the mode. But:
+
+- Avoid unnecessary destructive or irreversible actions inside `repl` blocks.
+- Prefer `repl` blocks over `agent_swarm` for parallel work; `agent_swarm` is for multi-step autonomous workflows that need tools at each step.
+- Use `grep_files` + `list_dir` for quick lookups that don't need parallelism.
+
+## What Hetun is not
+
+- Not auto-execute. The mission gate is real and required.
+- Not Plan mode. Plan stays unchanged for design-first investigation that hands off to a human.
+- Not a model swap. Your conversational model is unchanged when entering Hetun.
+- Not a `/hetun` slash command. Tab cycles into the mode like any other.
@@ -2,17 +2,18 @@

 DeepSeek TUI has two related concepts:

- **TUI mode**: what kind of visible interaction you’re in (Plan/Agent/YOLO).
+- **TUI mode**: what kind of visible interaction you're in (Plan/Agent/YOLO/Hetun).
 - **Approval mode**: how aggressively the UI asks before executing tools.

 ## TUI Modes

-Press `Tab` to cycle through the visible modes: **Plan → Agent → YOLO → Plan**.
-Press `Shift+Tab` to cycle in reverse.
+Press `Tab` to cycle through the visible modes: **Plan → Agent → YOLO → Hetun → Plan**.
+Press `Shift+Tab` to cycle in reverse. Hetun sits at the end of the cycle so a fresh session doesn't land on it accidentally — the default landing mode is unchanged.

- **Plan**: design-first prompting. Read-only investigation tools stay available, but shell and patch execution stay off.
- **Agent**: multi-step tool use. Approvals for shell and paid tools (file writes are allowed without a prompt).
- **YOLO**: enables shell + trust mode and auto-approves all tools. Use only in trusted repos.
+- **Plan**: design-first prompting. Read-only investigation tools stay available; shell and patch execution stay off. Use this when you want to think out loud and produce a plan to hand to a human (yourself later, or a reviewer).
+- **Agent**: multi-step tool use. Approvals for shell and paid tools (file writes are allowed without a prompt). RLM is available — the model reaches for `repl` blocks when the work is decomposable.
+- **YOLO**: enables shell + trust mode and auto-approves all tools. RLM is available and auto-executes like everything else. Use only in trusted repos.
+- **Hetun** (河豚, "Plan + Recursive Agents"): the most opinionated mode the TUI offers. The model uses RLM aggressively to research and decompose tasks in parallel via cheap `deepseek-v4-flash` child calls, then presents a consolidated **mission** for your approval. Once approved, the RLM tree auto-executes without per-tool interruption — you approve the mission, not each individual bullet. Plan + execution folded into one rhythm.
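
The four-mode cycle described above could be modelled as follows. This is a minimal sketch: the `Mode` enum and its `next` / `prev` methods are hypothetical names for illustration, not the TUI's actual types.

```rust
/// Visible TUI modes, in Tab order.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Plan,
    Agent,
    Yolo,
    Hetun,
}

/// Hetun sits last, so a fresh session starting at the front never lands on it.
const CYCLE: [Mode; 4] = [Mode::Plan, Mode::Agent, Mode::Yolo, Mode::Hetun];

impl Mode {
    fn index(self) -> usize {
        CYCLE.iter().position(|m| *m == self).unwrap()
    }

    /// `Tab`: advance one step, wrapping from Hetun back to Plan.
    pub fn next(self) -> Mode {
        CYCLE[(self.index() + 1) % CYCLE.len()]
    }

    /// `Shift+Tab`: step backwards, wrapping from Plan to Hetun.
    pub fn prev(self) -> Mode {
        CYCLE[(self.index() + CYCLE.len() - 1) % CYCLE.len()]
    }
}
```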

 ## Compatibility Notes

@@ -42,7 +43,17 @@ Legacy note: `/set approval_mode ...` was retired in favor of `/config`.

 - `suggest` (default): uses the per-mode rules above.
 - `auto`: auto-approves all tools (similar to YOLO approval behavior, but without forcing YOLO mode).
- `never`: blocks any tool that isn’t considered safe/read-only.
+- `never`: blocks any tool that isn't considered safe/read-only.
+
+### Task-level approval (Hetun mode)
+
+Hetun mode introduces a higher-level approval concept. Before executing an RLM tree, the engine presents a **mission card** showing what will be done, estimated flash calls, and expected outcomes. You can:
+
+- **Approve** — the RLM tree runs without further prompts.
+- **Reject** — the engine returns to planning.
+- **Modify** — edit the mission description and re-submit.
+
+This is independent of the base `approval_mode` setting. If you set `approval_mode = auto` while in Hetun, you still see mission cards (task-level approval is part of the mode, not the approval policy).

 ## Small-Screen Status Behavior

@@ -76,6 +87,7 @@ Run `deepseek --help` for the canonical list. Common flags:
 - `--model <MODEL>`: when using the `deepseek` facade, forward a DeepSeek model override to the TUI
 - `--workspace <DIR>`: workspace root for file tools
 - `--yolo`: start in YOLO mode
+- `--hetun`: start in Hetun mode
 - `-r, --resume <ID|PREFIX|latest>`: resume a saved session
 - `-c, --continue`: resume the most recent session
 - `--max-subagents <N>`: clamp to `1..=20`
@@ -0,0 +1,217 @@
+# ReAct vs. Recursive Language Models (RLM): A Design Document Comparison
+
+> **Purpose:** Provide the deepseek-tui team with a grounded, citation-rich comparison of the ReAct agent paradigm and the emerging Recursive Language Model (RLM) paradigm so that integration choices (e.g. `zigrlm`, `agent_swarm`, inline tool use) can be made deliberately.
+
+---
+
+## 1. ReAct: Reasoning + Acting
+
+### 1.1 Origins and Definition
+
+**ReAct** (Reason + Act) is a prompting and inference paradigm introduced by Yao et al. (Google Research / Princeton) and published at ICLR 2023. It unifies *reasoning traces* (chain-of-thought-style internal monologue) with *task-specific actions* (tool calls, API requests, environment commands) in a single autoregressive loop.  
+**Citation:** Shunyu Yao et al., *"ReAct: Synergizing Reasoning and Acting in Language Models"*, ICLR 2023.
+
+The core insight is that reasoning without acting suffers from fact hallucination and stale knowledge, while acting without reasoning lacks planning, error recovery, and interpretability. ReAct interleaves the two explicitly.
+
+### 1.2 The Thought → Action → Observation Loop
+
+At each timestep \(t\) the agent maintains a context \(c_t\) containing the original query and all prior tuples. The loop is:
+
+1. **Thought** — The LLM generates a reasoning trace: plan decomposition, progress tracking, or exception handling.  
+   \(c_t \rightarrow \text{Thought}_{t+1}\)
+2. **Action** — Conditioned on the context and the new thought, the LLM emits a structured action (e.g. `Search[entity]`, `Calculator[expr]`, `Finish[answer]`).  
+   \(c_t \parallel \text{Thought}_{t+1} \rightarrow \text{Action}_{t+1}\)
+3. **Observation** — The action is executed in the external environment and the result is appended, giving the next context.  
+   \(c_{t+1} := c_t \parallel \text{Thought}_{t+1} \parallel \text{Action}_{t+1} \parallel \text{Obs}_{t+1}\)
+
+The process halts when a special finish action is produced or a hard iteration limit is reached. Probabilistically this is:
+
+\[
+P(\tau \mid q) = \prod_{t=1}^{T} P(v_t \mid q, v_{<t})
+\]
+
+where \(v_t\) spans both Thought and Action tokens and \(\tau\) is the trajectory.
+
+**Key traits:**
+- **Linear / sequential** — Each observation must return before the next thought is generated.
+- **Scratchpad-based** — The entire history of thoughts, actions, and observations is appended to the prompt; there is no external variable store.
+- **Bounded by context window** — As the loop iterates, the prompt grows monotonically (until compaction heuristics truncate it).
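
To make the loop concrete, here is a minimal Rust sketch. The `policy` callback stands in for the LLM (it returns a thought plus the next step) and `env` stands in for the tool environment; both, along with `Step` and `react_loop`, are hypothetical names for this illustration and not part of any framework listed in this document.

```rust
/// What the policy decides after thinking: act on the environment, or finish.
pub enum Step {
    Act(String),
    Finish(String),
}

/// Run Thought -> Action -> Observation until a finish action or the iteration cap.
/// The scratchpad `context` grows monotonically, as in canonical ReAct.
pub fn react_loop(
    query: &str,
    max_iters: usize,
    policy: &mut dyn FnMut(&str) -> (String, Step),
    env: &mut dyn FnMut(&str) -> String,
) -> Option<String> {
    let mut context = format!("Query: {}\n", query);
    for _ in 0..max_iters {
        let (thought, step) = policy(&context);
        context.push_str(&format!("Thought: {}\n", thought));
        match step {
            Step::Finish(answer) => return Some(answer),
            Step::Act(action) => {
                // Each observation must return before the next thought is generated.
                let obs = env(&action);
                context.push_str(&format!("Action: {}\nObservation: {}\n", action, obs));
            }
        }
    }
    None // hard iteration limit reached without a finish action
}
```

The single growing `context` string is the point of the sketch: there is no variable store, so every intermediate value is re-read by the policy on each turn.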
+
+### 1.3 Implementations in the Wild
+
+| Framework | ReAct flavour |
+|-----------|---------------|
+| **OpenAI Function Calling** (and compatible APIs) | The model emits JSON `function_call` objects as Actions; tool results are fed back as `tool` role messages as Observations. The "Thought" is often implicit or rendered as a visible `<thinking>` block. |
+| **LangChain / LangGraph** | Pre-built `ReAct` agent chain with a stop-and-observe parser. LangGraph generalises the loop into a state machine with nodes (Thought, Action, Observation) and conditional edges. |
+| **LlamaIndex, BeeAI, etc.** | Provide pre-configured ReAct modules that wrap an LLM with a tool registry and a loop driver. |
+
+A later refinement called **Focused ReAct** restates the original query at each step to prevent drift, reportedly improving accuracy by >5× and reducing runtime by ~34%.
+
+---
+
+## 2. Recursive Language Models (RLM)
+
+### 2.1 Origins and Definition
+
+**RLM** is a general *inference-time scaling* paradigm proposed by Zhang, Kraska, and Khattab (MIT CSAIL) in late 2025. Rather than viewing the user prompt as static input tokens, RLMs treat the prompt as part of an **external environment** that the model can programmatically examine, decompose, and recursively query.  
+**Citation:** Alex L. Zhang, Tim Kraska, and Omar Khattab, *"Recursive Language Models"*, arXiv:2512.24601 [cs.AI], 2025 (v2 Jan 2026).
+
+A second formalisation, **λ-RLM**, refines the open-ended code generation of the original paper into a deterministic λ-calculus combinator runtime (SPLIT, PEEK, MAP, FILTER, REDUCE, CONCAT) to eliminate brittle free-form generation.  
+**Citation:** *"Solving Long-Context Rot with λ-Calculus"*, arXiv:2603.20105, 2026.
+
+### 2.2 The REPL Environment and Recursive Call Model
+
+The canonical RLM implementation wraps the root LM in a read-eval-print loop (REPL) — usually Python, though Clojure (`loop-infer`) and bash (`claude-rlm`) adaptations exist. The full context is stored as a variable (e.g. `context`) in the REPL, **not** in the model's prompt window.
+
+At each root iteration:
+1. The LM receives only *metadata* about the REPL state (short stdout prefix, variable names).
+2. The LM emits **code** (or fenced `repl` directives) that manipulate the variable, run regex/grep, or spawn recursive sub-calls.
+3. The code executes; stdout and updated variables are captured.
+4. The loop repeats until the LM sets a special `Final` variable (or emits `FINAL(...)` / `FINAL_VAR(...)`), at which point the run returns.
+
+Because the full text never enters the root LM context window, RLMs can scale to **10M+ tokens** (two orders of magnitude beyond the base model's window) without retraining.
+
+### 2.3 The `repl` Grammar and Tree Structure
+
+In the `zigrlm` runtime (and the reference Python implementation), the root LM writes fenced blocks tagged `repl`. The grammar includes:
+
+| Directive | Semantics |
+|-----------|-----------|
+| `let name = "..."` | Bind a string variable |
+| `js name = "...FINAL(...)"` | Execute deterministic JS in a sandbox |
+| `llm_query name = expr` | Call the *same* model (same depth) |
+| `rlm_query name = expr` | Spawn a **child** RLM (depth + 1) |
+| `llm_query_batched name = a \| b \| c` | Parallel same-depth calls |
+| `rlm_query_batched name = a \| b \| c` | Parallel child RLMs |
+| `FINAL(expression)` | Terminate and return this string |
+| `FINAL_VAR(name)` | Terminate and return the named variable |
+
+These recursive calls form a **tree of reasoning**, not a single chain. Each child processes a snippet of the external context and stores its partial result back into a parent REPL variable. Aggregation is performed programmatically (lists, tallies, tables) rather than autoregressively.
+
+**Key traits:**
+- **Context-centric decomposition** — The model decides how to slice the *input context*, not just how to sequence actions.
+- **Variable store** — Intermediate results live in the REPL, keeping the LM context window constant-size.
+- **Unbounded output** — Because `Final` can be assembled from variables, RLMs can produce answers longer than the model's output token limit.
+
+### 2.4 Implementations and Ecosystem
+
+| Project | Notes |
+|---------|-------|
+| **alexzhang13/rlm** | Official research repo (Python). Includes reference REPL, natively fine-tuned `RLM-Qwen3-8B`, and OOLONG / BrowseComp-Plus benchmarks. |
+| **alexzhang13/rlm-minimal** | Stripped-down Python version for hacking. |
+| **zigrlm** | Zig-native runtime with JS sandbox, batched parallel fan-out, and JSONL tracing. Used by deepseek-tui for cheap `deepseek-v4-flash` child dispatch. |
+| **claude-rlm** | Depth-N recursion using Claude Code instances as sub-agents; bash-as-REPL; `mkdir`-based concurrency limiter. |
+| **loop-infer** | Clojure REPL implementation. |
+| **minrlm** | Independent minimal RLM that reduces token usage by up to 4× vs. flat inference. |
+| **rlm-mcp** | MCP server wrapper exposing RLM through the Model Context Protocol. |
+
+---
+
+## 3. Key Differences
+
+### 3.1 Parallelism
+
+| Dimension | ReAct | RLM |
+|-----------|-------|-----|
+| **Structure** | Linear chain. Each Action depends on the prior Observation. | Tree. A parent can fan out N children in parallel. |
+| **Batched execution** | Not native. Some frameworks (LangGraph) add parallel branches, but the canonical ReAct loop is sequential. | Native via `*_batched` directives. `zigrlm` dispatches children across OS threads capped by `max_concurrent_subcalls`. |
+| **Synchronisation** | Implicit: the loop blocks on the environment. | Children write to named variables; parent continues only after aggregation code runs. |
+
+The RLM paper explicitly notes that their reference implementation used *blocking* sequential sub-calls and left async fan-out as "low-hanging fruit" for systems builders. `zigrlm` realises that fruit.
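
A capped parallel fan-out of this kind can be sketched with scoped threads from the Rust standard library. This is not `zigrlm`'s implementation (which is written in Zig); `query_batched` and the `child` callback (a stand-in for one child model call) are hypothetical, and the sketch runs prompts in fixed waves of at most `max_concurrent` threads rather than refilling slots as they free up.

```rust
use std::thread;

/// Run `child` once per prompt, at most `max_concurrent` at a time,
/// and return the results in the same order as the prompts.
pub fn query_batched<F>(prompts: &[String], max_concurrent: usize, child: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    let mut results = Vec::with_capacity(prompts.len());
    // A cap of 0 is treated as 1 so the loop always makes progress.
    for wave in prompts.chunks(max_concurrent.max(1)) {
        let wave_results: Vec<String> = thread::scope(|s| {
            let handles: Vec<_> = wave
                .iter()
                .map(|p| {
                    let child = &child;
                    s.spawn(move || child(p.as_str()))
                })
                .collect();
            // Joining in spawn order keeps results aligned with prompts.
            handles
                .into_iter()
                .map(|h| h.join().expect("child call panicked"))
                .collect()
        });
        results.extend(wave_results);
    }
    results
}
```

The parent resumes only after every wave has joined, which mirrors the synchronisation row in the table above: children write results, and aggregation code runs afterwards.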
+
+### 3.2 Reasoning Representation
+
+| Dimension | ReAct | RLM |
+|-----------|-------|-----|
+| **Form** | Natural-language "Thought" traces appended to a scratchpad. | Code / DSL inside fenced `repl` blocks, plus natural-language plan text outside the fence. |
+| **State management** | Monolithic prompt history. Intermediate values are re-tokenised every turn. | External REPL variables. The LM sees only constant-size metadata. |
+| **Aggregation** | The model must autoregressively synthesise the final answer from the scratchpad. | Programmatic: `FINAL_VAR(tally)` or `FINAL("\n".join(results))`. |
+| **Length limits** | Bounded by context window for both input and output. | Input: theoretically unbounded (10M+ tested). Output: bounded only by REPL variable memory. |
+
+### 3.3 Tool Use
+
+| Dimension | ReAct | RLM |
+|-----------|-------|-----|
+| **Interface** | Structured JSON schemas (OpenAI function calling) or text parsing (LangChain). | Code / DSL in fenced `repl` blocks embedded in the model's text. The "tool" is the REPL itself. |
+| **Tool set** | Fixed registry of functions known at build time. | Open-ended: the LM can write arbitrary regex, loops, or JS to manipulate data. |
+| **Child agents** | Spawning a sub-agent is a heavyweight Action (new thread/process, full tool registry, event channels). | Spawning a child is a lightweight `rlm_query` inside the same runtime; the child uses a cheaper model by default. |
+
+### 3.4 Cost Model
+
+| Dimension | ReAct | RLM |
+|-----------|-------|-----|
+| **Primary model** | Usually one expensive frontier model (e.g. GPT-5, Claude Opus, deepseek-v4-pro) for every turn. | A **root** model (frontier) for control + cheap **child** models (`deepseek-v4-flash`, GPT-5-mini) for sub-tasks. |
+| **Cost scaling** | Grows with iteration count × full prompt length. Compaction heuristics trade quality for cost. | Grows with *task complexity*, not input length. Selective inspection means most tokens are never fed to the LM. |
+| **Empirical results** | N/A (baseline). | On OOLONG 128K, `RLM(GPT-5-mini)` outperformed flat `GPT-5` by >2× and was cheaper on average. On BrowseComp-Plus (1K docs, 6–11M tokens), RLM(GPT-5) averaged **$0.99** vs. $1.50–$2.75 for the base model ingesting everything. |
+| **Variance** | Predictable per-turn cost. | High variance: some trajectories are cheaper than a flat call, outliers can be more expensive. |
+
+### 3.5 Observability
+
+| Dimension | ReAct | RLM |
+|-----------|-------|-----|
+| **Trace shape** | Linear log of (Thought, Action, Observation) tuples. | Tree log: each node is a REPL turn that may branch into child RLM nodes. |
+| **Depth** | Flat iteration count. | Explicit recursion depth (`max_depth`). |
+| **Tooling** | LangSmith, OpenTelemetry spans, simple print logging. | JSONL trace files (`--trace`) capturing every code cell, stdout snapshot, and sub-call with usage metadata. |
+| **Human readability** | Easy: read the scratchpad top-to-bottom. | Harder: requires tree traversal, but the `FINAL` node summarises the aggregate. |
+
+---
+
+## 4. When Is Each Appropriate? Trade-offs
+
+### Use ReAct when …
+- The task is **interactive and stateful** (e.g. browsing, CLI commands, file editing) where each observation is dynamic and the next action cannot be predicted ahead of time.
+- The tool surface is **fixed and schema-driven** (e.g. a known set of REST APIs, file-system operations, database queries).
+- You need **deterministic latency bounds** per turn (e.g. a chat UI that must stream a Thought before the next Action).
+- The context fits comfortably within the model's window and does not suffer from context rot.
+- Human inspectability of a single linear reasoning chain is a priority.
+
+### Use RLM when …
+- The input is **very long** (100K–10M+ tokens) and you want to avoid summarisation or compaction loss.
+- The work is **embarrassingly parallel** (e.g. classify 1,000 rows, evaluate 50 files, score 20 answers). `rlm_query_batched` maps naturally.
+- The task is **recursively decomposable** (e.g. divide-and-conquer summarisation, map-reduce aggregation, multi-hop retrieval over a corpus).
+- Cost is a constraint: you can offload leaf work to a **cheap child model** while reserving the frontier model for control decisions.
+- You need **deterministic local compute** interleaved with model calls (JS / Python in the REPL).
+
+### Hybrids
+There is no forced binary choice. A pragmatic system (like deepseek-tui) can use:
+- **ReAct / OpenAI-style function calling** for interactive tool use and user-facing chat turns.
+- **RLM `repl` blocks** for internal parallel decomposition, batched generation, or long-context analysis.
+- **Agent swarm** (multi-step ReAct sub-agents) only when autonomous, stateful, multi-tool workflows are required.
+
+The RLM paper itself positions RLMs as the next milestone *after* CoT-style reasoning and ReAct-style agents, not as a replacement for them.
+
+---
+
+## 5. Bibliography
+
+1. **Yao, S. et al.** *ReAct: Synergizing Reasoning and Acting in Language Models.* ICLR 2023.  
+   - Blog explainer: https://www.promptingguide.ai/techniques/react  
+   - IBM overview: https://www.ibm.com/think/topics/react-agent
+
+2. **Zhang, A. L., Kraska, T., and Khattab, O.** *Recursive Language Models.* arXiv:2512.24601 [cs.AI], 2025 (v2 Jan 2026).  
+   - Paper: https://arxiv.org/abs/2512.24601  
+   - Blog: https://alexzhang13.github.io/blog/2025/rlm/  
+   - Code: https://github.com/alexzhang13/rlm  
+   - Minimal code: https://github.com/alexzhang13/rlm-minimal
+
+3. **λ-RLM authors.** *Solving Long-Context Rot with λ-Calculus.* arXiv:2603.20105, 2026.  
+   - Formalises RLM control into typed combinators (SPLIT, MAP, FILTER, REDUCE) to replace free-form code generation.
+
+4. **zigrlm** (Zig RLM runtime). Local build: `/Volumes/VIXinSSD/zigrlm/zig-out/bin/zigrlm`.  
+   - Supports `cli`, `cli-claude`, `cli-codex`, `cli-openai`, `zai`, `openai-proxy`, etc.  
+   - Grammar: fenced `repl` blocks with `rlm_query`, `rlm_query_batched`, `FINAL`, `FINAL_VAR`.
+
+5. **Community implementations and extensions**  
+   - `claude-rlm` (depth-N recursion via Claude Code + bash): https://github.com/Tenobrus/claude-rlm  
+   - `minrlm` (token-reduction focus): https://github.com/avilum/minrlm  
+   - `loop-infer` (Clojure REPL): https://github.com/unravel-team/loop-infer  
+   - `rlm-mcp` (MCP server): https://github.com/eesb99/rlm-mcp  
+   - `rlm_repl` (Python PoC): https://github.com/fullstackwebdev/rlm_repl
+
+6. **Benchmarks referenced**  
+   - **OOLONG** (long-context aggregation): Bertsch et al., 2025.  
+   - **BrowseComp-Plus** (multi-hop QA over document corpora): Chen et al., 2025.
+
+---
+
+*Document generated for deepseek-tui design review. Corresponds to repo state: main @ 229b1993.*
@@ -0,0 +1,380 @@
+# RLM as a Fundamental Agent Primitive
+
+## Thesis
+
+We will make Recursive Language Models a first-class primitive in `deepseek-tui` by teaching the flat agent loop to detect fenced ```` ```repl ```` blocks in assistant text and hand them directly to the external `zigrlm` binary. `zigrlm` orchestrates cheap parallel `deepseek-v4-flash` child calls, runs a JS sandbox, and returns a single `FINAL` result that becomes the assistant's response for that turn. This replaces the heavy `agent_swarm` tokio-task-per-child model with a lightweight subprocess tree where N flash calls cost less than one Pro call, inverting the usual sub-agent economics.
+
+## Where We Are Today
+
+The agent loop is in `crates/tui/src/core/engine.rs` (`Engine::handle_deepseek_turn()`, ~line 2330). It streams `ContentBlock`s from the model into `session.messages`; if any block is `ToolUse`, it builds a `ToolExecutionPlan`, executes via `execute_tool_with_lock()` (~line 2209), and loops back for another model turn. Parallel work today goes through `AgentSwarmTool` in `crates/tui/src/tools/swarm.rs` (`run_swarm()`, ~line 582), which spawns full background tokio tasks via `SubAgentManager::spawn_background_with_assignment()` in `crates/tui/src/tools/subagent.rs` (~line 584). Each child runs its own agent loop, tool registry, and event channel. That is correct for autonomous multi-step work, but wasteful for simple parallel Q&A or recursive decomposition.
+
+`zigrlm` (already built at `/Volumes/VIXinSSD/zigrlm/zig-out/bin/zigrlm` or cloneable from GitHub) solves this externally. Its `cli` command reads a prompt, drives a root model turn, parses any ```` ```repl ```` blocks with its Zig-native parser (`src/parser.zig`), fans out batched child calls across OS threads capped by `max_concurrent_subcalls` (default 8), and returns the `FINAL` string on stdout. The integration work is wiring this into the engine so the model naturally emits repl blocks instead of JSON tool calls.
+
+## Key Design Questions
+
+### 1. How is `zigrlm` auto-configured with DeepSeek credentials? (#48)
+
+**New file:** `crates/tui/src/zigrlm_config.rs`
+
+A `ZigrlmRuntimeConfig` struct is built from the session's existing `ResolvedRuntimeOptions` (`api_key`, `base_url`, `model`). It constructs the two environment variables `zigrlm` expects:
+
+- `ZIGRLM_MAIN_CMD`: `zigrlm openai-proxy --model <pro> --base-url <url>`
+- `ZIGRLM_RLM_CMD`: `zigrlm openai-proxy --model deepseek-v4-flash --base-url <url>`
+
+The API key is passed as `OPENAI_API_KEY`. Binary discovery (in priority order): `config.toml` field `zigrlm.bin_path` → env `ZIGRLM_BIN` → known local build `/Volumes/VIXinSSD/zigrlm/zig-out/bin/zigrlm` → `PATH` via `which zigrlm`. If not found, RLM features degrade gracefully with a logged warning.
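
The discovery order above can be expressed as a small pure function. This is a sketch under assumptions: `resolve_zigrlm_bin` and its parameters are hypothetical, the filesystem check is injected as `exists` so the logic can be tested without touching disk, and it assumes a configured path that does not exist falls through to the next candidate (the design text does not say whether that case should be an error instead).

```rust
use std::path::{Path, PathBuf};

/// Resolve the zigrlm binary in priority order:
/// config field, then env override, then known local build, then PATH directories.
/// Returns `None` when nothing is found, so the caller can degrade gracefully.
pub fn resolve_zigrlm_bin(
    config_bin_path: Option<&str>,
    env_zigrlm_bin: Option<&str>,
    known_local_build: &str,
    path_dirs: &[PathBuf],
    exists: &dyn Fn(&Path) -> bool,
) -> Option<PathBuf> {
    let explicit = [config_bin_path, env_zigrlm_bin, Some(known_local_build)];
    for candidate in explicit.iter().flatten() {
        let path = PathBuf::from(*candidate);
        if exists(path.as_path()) {
            return Some(path);
        }
    }
    // Last resort: look for an executable named `zigrlm` in each PATH directory.
    path_dirs
        .iter()
        .map(|dir| dir.join("zigrlm"))
        .find(|p| exists(p.as_path()))
}
```

In real use the caller would presumably pass `std::env::var("ZIGRLM_BIN").ok()`, the directories from `std::env::split_paths`, and `&|p| p.is_file()` as the existence check.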
+
+**Config additions:** `crates/config/src/lib.rs` gets a `ZigrlmConfigToml` struct with optional overrides for `bin_path`, `rlm_model`, `max_depth` (default 2), `max_iterations` (default 20), and `timeout_ms` (default 600000). These are exposed under a new `[zigrlm]` table in `config.toml`.
+
+### 2. How does the engine detect and execute repl blocks? (#49)
+
+**Modified file:** `crates/tui/src/core/engine.rs`
+
+After the streaming loop in `handle_deepseek_turn()` persists the assistant message to `session.messages`, we insert a new branch before tool execution:
+
+```rust
+if message.has_tool_calls() {
+    // existing tool-execution path
+} else if has_repl_block(&message.content) {
+    let result = zigrlm_runtime.run_inline(&message.content).await?;
+    // Replace the Text block with the aggregated result
+    message.replace_repl_with_result(&result.response);
+    // Append usage metadata as a system note or hidden block
+    if let Some(usage) = result.usage {
+        session.add_system_note(format!(
+            "[RLM: {} calls, {} tokens]", usage.calls, usage.total_tokens
+        ));
+    }
+    // Turn completes; no extra model round-trip
+}
+```
+
+`has_repl_block()` checks `ContentBlock::Text` for the exact substring "\`\`\`repl" using the same fence logic as `zigrlm/src/parser.zig`. `run_inline()` lives in the new `crates/tui/src/zigrlm_runtime.rs` and shells out to:
+
+```bash
+zigrlm cli \
+  --max-depth 2 \
+  --max-iterations 20 \
+  --timeout-ms 600000 \
+  "<assistant_text>"
+```
+
+with `ZIGRLM_MAIN_CMD`, `ZIGRLM_RLM_CMD`, and `OPENAI_API_KEY` injected into the child environment. The full assistant text is the prompt because the model's natural-language plan preceding the fence is part of the root context `zigrlm` expects.
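
For illustration, the fence check could look like the sketch below. It operates on a plain `&str` rather than on `ContentBlock` values, and it assumes a fence only counts when it opens a line and carries exactly the `repl` info string; whether `zigrlm/src/parser.zig` applies the same rule is an assumption here. The fence string is built at runtime so the snippet contains no literal fence of its own.

```rust
/// Return true if `text` contains a line that opens a fenced `repl` block.
pub fn has_repl_block(text: &str) -> bool {
    // Three backticks followed by the `repl` info string.
    let fence = format!("{}repl", "`".repeat(3));
    text.lines().any(|line| {
        match line.trim_start().strip_prefix(fence.as_str()) {
            // Reject longer info strings that merely start with `repl`.
            Some(rest) => rest.trim().is_empty(),
            None => false,
        }
    })
}
```

Anchoring the check to the start of a line avoids false positives when the model merely mentions the fence syntax inside a sentence.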
+
+**UX:** While `zigrlm` runs, the engine emits `Event::RlmStarted` and the TUI shows a spinner: "Running RLM tree…". On completion, `Event::RlmComplete` carries usage so the transcript can render a collapsible "[RLM: 3 calls, 2.1K tokens, 1.2s]" line. `Ctrl-C` during this phase forwards `SIGTERM` to the child process.
+
+### 3. How does the result re-enter the conversation?
+
+The raw assistant message is mutated in place in `session.messages` (`crates/tui/src/core/session.rs`). Its `ContentBlock::Text` block containing the repl fence is replaced by the `FINAL` string from `zigrlm` stdout. The original repl block is preserved as a `ContentBlock::Thinking` block (or a new internal metadata field) so the model can see its own plan on subsequent turns, but the primary visible response is the aggregated result. This keeps the conversation history clean: the next turn's context contains the unified answer, not raw DSL.
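+
+As a sketch of that mutation, assuming a much-simplified `ContentBlock` (the real enum in the crate has more variants and richer payloads) and a hypothetical `reclassify_repl` helper:
+
+```rust
+/// Simplified stand-in for the session's content block type.
+#[derive(Debug, Clone, PartialEq)]
+pub enum ContentBlock {
+    Text(String),
+    Thinking(String),
+}
+
+/// Fence marker spelled with escapes so the sketch can sit inside Markdown.
+const REPL_FENCE: &str = "\u{60}\u{60}\u{60}repl";
+
+/// Replaces the first repl-bearing Text block with `final_text` and keeps
+/// the original plan as a Thinking block directly before it.
+/// Returns false (and changes nothing) when no repl block is present.
+pub fn reclassify_repl(blocks: &mut Vec<ContentBlock>, final_text: &str) -> bool {
+    let Some(i) = blocks
+        .iter()
+        .position(|b| matches!(b, ContentBlock::Text(t) if t.contains(REPL_FENCE)))
+    else {
+        return false;
+    };
+    // Swap in the visible result and take ownership of the old block.
+    let old = std::mem::replace(&mut blocks[i], ContentBlock::Text(final_text.to_string()));
+    if let ContentBlock::Text(plan) = old {
+        // Preserve the plan so later turns can still see it.
+        blocks.insert(i, ContentBlock::Thinking(plan));
+    }
+    true
+}
+```
+
+Placing the `Thinking` block before the result mirrors how reasoning precedes an answer elsewhere in the transcript; whether the plan ends up in a `Thinking` block or a metadata field is still open per the paragraph above.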
+
+### 4. What happens to the explicit `zigrlm` tool / bridge? (#46)
+
+It remains as an **escape hatch** in `crates/tui/src/tools/zigrlm.rs` (new file), registered via `ToolRegistryBuilder::with_zigrlm_tool()` in `crates/tui/src/tools/registry.rs`. The tool accepts explicit parameters (`prompt`, `max_depth`, `trace_path`, etc.) and is useful for:
+
+- DSPy-style signatures via `dszig`
+- Docker-backed Python sandboxes
+- Custom traces for benchmarking
+- User-explicit RLM experiments
+
+The inline primitive and the explicit tool share `ZigrlmRuntimeConfig` but serve different purposes. The model prompt (see below) teaches when to use each.
+
+### 5. How do we teach the model to use this? (#50)
+
+**Modified files:** `crates/tui/src/prompts/agent.txt`, `crates/tui/src/prompts/yolo.txt`
+
+A new section, gated by config flag `rlm.prompt_enabled` (default `true`), is appended to the agent and YOLO system prompts:
+
+```text
+## Recursive Language Model (RLM) primitive
+
+When you need parallel analysis, recursive decomposition, or batched generation,
+prefer a fenced `repl` block over spawning subagents or doing sequential inline work.
+
+- `rlm_query_batched name = "prompt" | "prompt" | ...` for parallel work
+- `rlm_query name = "prompt"` for recursive child tasks
+- End with `FINAL(expression)` or `FINAL_VAR(name)`
+
+The child model is deepseek-v4-flash (very fast and cheap).
+
+Do NOT use RLM when the task requires file-system modification, interactive user
+input, or is trivial enough for a single sentence.
+```
+
+A comparison table in the prompt clarifies the trade-offs:
+
+| Primitive | Use when | Cost | Speed |
+|---|---|---|---|
+| Inline reasoning | Simple Q&A, one-step tasks | Low | Fast |
+| `repl` block | Parallel / recursive / batched work | Very low (flash) | Fast |
+| `agent_swarm` | Multi-step autonomous work with tools | Higher | Slower (polling) |
+
+This lets us A/B test by toggling `rlm.prompt_enabled` and measuring turns-per-task and token usage.
+
+### 6. How does the model do non-trivial work *inside* a `repl` block? (#53)
+
+Parallel fan-out alone isn't enough. A `repl` block that just splits N prompts and concatenates results is barely better than `agent_swarm`. The unlock is giving the model **cheap programmatic access to data that's already in process memory** — so it can grep, extract, slice, diff, and search a 50K-token blob in one repl block without burning context tokens on the raw bytes or paying for a tool round-trip per query.
+
+This is what makes RLM actually usable, not just clever. We bake a curated helper layer + a sandboxed Python REPL into the runtime as a first-class capability.
+
+**The shape:**
+
+A `repl` block doesn't only fan out to flash children. It can also run Python in a sandboxed namespace where:
+
+- A `ctx` variable holds preloaded data the agent wants to interrogate (a file it just read, a tool result, a stream of search hits).
+- A small curated helper module is in scope — about 15–25 functions chosen because they meaningfully beat shell when the data is already in memory: `peek` / `lines` / `head` / `tail` / `chunk` / `between`, `grep` / `count_matches` / `find_all` / `semantic_search`, `extract_json_objects` / `extract_urls` / `extract_paths` / `extract_dates`, `replace_all` / `split_by` / `diff` / `similarity`, `dedupe` / `group_by` / `partition` / `frequency`.
+- Sandbox is AST-validated + restricted builtins + import allowlist + execution timeout — best-effort, same posture as the JS sandbox the runtime already exposes.
+- State persists across `repl` blocks within a turn. The model can run `chunks = chunk(ctx, 4000)` once and reuse `chunks` in subsequent fan-outs.
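+
+To pin down what three of the listed helpers might mean, here is a sketch of their semantics. The helpers themselves are meant to be called from the Python sandbox; Rust is used here only because it is this document's implementation language, and the signatures, the character-based chunk size, and the 1-based line numbers are all assumptions.
+
+```rust
+/// Splits `text` into pieces of at most `size` characters (not bytes).
+pub fn chunk(text: &str, size: usize) -> Vec<String> {
+    assert!(size > 0, "chunk size must be positive");
+    let chars: Vec<char> = text.chars().collect();
+    chars.chunks(size).map(|piece| piece.iter().collect()).collect()
+}
+
+/// Returns (1-based line number, line) pairs for lines containing `needle`.
+pub fn grep<'a>(text: &'a str, needle: &str) -> Vec<(usize, &'a str)> {
+    text.lines()
+        .enumerate()
+        .filter(|(_, line)| line.contains(needle))
+        .map(|(i, line)| (i + 1, line))
+        .collect()
+}
+
+/// Returns the text strictly between the first `start` and the next `end`.
+pub fn between<'a>(text: &'a str, start: &str, end: &str) -> Option<&'a str> {
+    let from = text.find(start)? + start.len();
+    let len = text[from..].find(end)?;
+    Some(&text[from..from + len])
+}
+```
+
+Each helper returns structured values (lists, pairs, optional slices) rather than printed text, which is the property that lets a later fan-out consume the slices directly.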
+
+**Why this lives at the runtime level, not as a separate tool:**
+
+If we shipped a `python_repl` tool alongside RLM, the model would have to choose between "fan out to flash children" (repl block) and "inspect data in Python" (tool call) every turn. They're the same workflow — load → slice → fan out flash queries on the slices → aggregate. Splitting them across two interfaces forces the wrong choice. Putting the helper layer *inside* the repl runtime means a single block can do all four steps with shared state.
+
+**Why these specific helpers and not a giant library:**
+
+The model already has shell + grep + read_file. It doesn't need 124 helpers. It needs ~20 that are obviously the right move when working with in-memory data — the ones where shell would force an unnecessary round-trip or lose structure. Keep the menu small and obvious. Anything not on the menu, the model can write inline (Python is in the sandbox; helpers are conveniences, not a closed world).
+
+## Spike Target
+
+The smallest end-to-end proof is a hardcoded path in `crates/tui/src/core/engine.rs` that, when an assistant message contains a test repl block, shells out to a pre-built `zigrlm` binary with a hardcoded `ZIGRLM_RLM_CMD` pointing at `deepseek-v4-flash`, and injects the stdout result back into `session.messages`.
+
+**Estimated surface area:**
+- `engine.rs`: ~30 lines (detection branch + subprocess call)
+- `zigrlm_runtime.rs` (new, spike version): ~80 lines (Command builder + stdout capture)
+- No config plumbing, no TUI spinner, no usage parsing, no prompt changes.
+
+**Success criteria:** A local test where the Pro model emits:
+
+````
+```repl
+rlm_query_batched answers = "What is 2+2?" | "What is 3+3?"
+FINAL_VAR(answers)
+```
+````
+
+… and the engine returns a single assistant message containing the aggregated `[0]\n4\n[1]\n6` result, with no tool call JSON emitted and no extra model round-trip.
+
+## Hetun Mode — "Plan + Recursive Agents" (added, doesn't replace Plan)
+
+**Tracking issue:** #54
+
+**Hetun** (河豚, Mandarin for *pufferfish*) is added as a fourth mode positioned at the **end** of the Tab cycle so people don't accidentally land on it from a fresh session. The cycle becomes `Plan → Agent → YOLO → Hetun → Plan`. Default landing mode is unchanged. Plan stays exactly as it is — read-only investigation, hand the plan to the human. Hetun is the next step further up the orchestration ladder: planning *and* execution folded together, gated on a single mission-level approval.
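+
+The cycle order can be sketched as a small enum. `AppMode::Hetun` is named in the file list for this change; the `next` method name and the other variant spellings are assumptions about `crates/tui/src/tui/app.rs`.
+
+```rust
+/// The four modes in Tab order, with Hetun placed last.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum AppMode {
+    Plan,
+    Agent,
+    Yolo,
+    Hetun,
+}
+
+impl AppMode {
+    /// Advances one step through Plan, Agent, YOLO, Hetun, then wraps to Plan.
+    pub fn next(self) -> Self {
+        match self {
+            AppMode::Plan => AppMode::Agent,
+            AppMode::Agent => AppMode::Yolo,
+            AppMode::Yolo => AppMode::Hetun,
+            AppMode::Hetun => AppMode::Plan,
+        }
+    }
+}
+```
+
+An exhaustive `match` (rather than index arithmetic) means the compiler flags the cycle if a fifth mode is ever added.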
+
+The mode badge surfaces this as **"Hetun (Plan + Recursive Agents)"** so users immediately understand the relationship to Plan. Sakana already named the flash-coordinator architecture *Fugu* (the Japanese reading of 河豚). Since DeepSeek is Chinese, the Mandarin reading *hetun* is the better cultural fit.
+
+### What Hetun does
+
+It's the most opinionated mode the TUI offers. The model both **plans the work and runs it**, but the user gates the transition with one explicit mission approval:
+
+1. **Research + plan** — Hetun uses RLM aggressively to investigate the workspace in parallel (multiple `rlm_query_batched` reads of relevant files / patterns / prior turns), synthesises the findings into a concrete mission (sub-tasks, what each looks at, expected outputs, anything that gets written), and lands it in the transcript ending with an explicit "OK to run?" prompt.
+2. **Execute** — once approved, Hetun emits a `repl` block that fans the planned sub-tasks out via `rlm_query_batched` and aggregates into a `FINAL`. No further per-block approvals — you approved the **mission**, the runtime carries it out.
+
+This is meaningfully different from Plan (read-only investigate, hand back to human, human implements) and from YOLO (auto-execute everything turn-by-turn). Hetun keeps the human in the loop at the only point that matters — the gate between "we know what to do" and "do it" — and removes them from every per-step approval after that.
+
+### Behaviour and configuration
+
+- **The user's configured model is left alone.** Entering Hetun does *not* swap the conversational model or reasoning effort. If you were on `deepseek-v4-pro` / `max`, you stay there. The flash-as-coordinator behaviour is internal to the runtime (`ZIGRLM_RLM_CMD` always points to flash regardless of mode), not a global model swap. On exit nothing has to be restored because nothing was changed.
+- **No `/hetun` slash command.** Tab cycles into the mode like any other; `/plan` keeps switching to Plan as it does today.
+- **Mission-level approval, not block-level.** Hetun introduces one approval gate per turn (the mission), then runs the execution `repl` block straight through. Inside Plan, Agent, and YOLO the existing approval policies are unchanged.
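+
+One way to picture the mission-level gate is as a small state machine. Everything here is hypothetical naming, and the choice to send a rejected mission back to research is an assumption; the text above only specifies that execution must not start before approval.
+
+```rust
+/// Phases of a Hetun mission (hypothetical names).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum HetunPhase {
+    Research,
+    AwaitingApproval,
+    Executing,
+    Done,
+}
+
+/// Inputs that can move a mission between phases (hypothetical names).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum HetunInput {
+    MissionReady,
+    Approved,
+    Rejected,
+    ExecutionFinished,
+}
+
+/// Pure transition function; any input that is invalid for the current
+/// phase leaves the phase unchanged.
+pub fn step(phase: HetunPhase, input: HetunInput) -> HetunPhase {
+    use HetunInput::*;
+    use HetunPhase::*;
+    match (phase, input) {
+        (Research, MissionReady) => AwaitingApproval,
+        (AwaitingApproval, Approved) => Executing,
+        // Assumption: a rejected mission returns to research.
+        (AwaitingApproval, Rejected) => Research,
+        (Executing, ExecutionFinished) => Done,
+        (unchanged, _) => unchanged,
+    }
+}
+```
+
+The only edge into `Executing` is `Approved`, and no input prompts again once execution has begun, which matches the "approve the mission, not each block" contract.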
+
+RLM is not Hetun-only. Agent and YOLO modes keep using `repl` blocks where the model judges them appropriate (#49 wires the inline primitive globally). Hetun is just the mode that *expects* RLM-first behaviour, and the prompt is tuned for it.
+
+### What "Plan + Recursive Agents" actually means inside Hetun
+
+Sakana's writeup of the Fugu / "intelligence" architecture (the system they shipped to MIC's misinformation programme) describes more than a flash-coordinator wrapper. The mode adopts those technical patterns and translates them into our primitives. A Hetun research phase is not one batched fan-out — it is a small recursive program that runs inside a `repl` block:
+
+- **Novelty search via recursive sampling.** Instead of pre-deciding N fixed queries and firing them in parallel, Hetun draws an initial broad sample of the workspace (`ctx` chunks of relevant files / patterns / prior turns), runs a flash sweep over the sample asking "what here is surprising or important?", scores responses by novelty, and recursively zooms into the highest-novelty chunks at finer resolution. The recursion stops when novelty plateaus or a budget is hit. This gives much better coverage than a flat 8-way fan-out for any task where the interesting bits are non-uniformly distributed.
+- **Hierarchical narrative tree synthesis.** Findings from the recursion don't get concatenated into a flat list. They get organised into a tree: leaves are individual observations, intermediate nodes are clusters of related findings, the root is the mission goal. The mission card the user approves displays this tree (collapsible, navigable) rather than a wall of bullets — same intuition Sakana uses to make SNS narrative spaces legible at a glance.
+- **Multi-detector cross-verification.** Every claim Hetun puts into the mission goes through at least two passes: a flash sweep for obvious errors / contradictions, and a frontier (Pro) check for subtler structural issues. Sakana's framing is "frontier model handles macro structure, specialised models handle fine structure, blind spots cancel". For us that maps to flash for breadth and Pro for depth, with claims that fail either pass marked as low-confidence rather than hidden — the user can choose to verify them manually or drop them from the mission.
+- **Hypothesis-verification loop.** The research phase isn't a single round. Hetun forms a working hypothesis from the initial findings, generates verification queries from the hypothesis (e.g. "if X is true, we should also see Y; check for Y"), runs them, and updates the hypothesis. The loop continues until the hypothesis is stable across iterations or the iteration budget (capped low — typically 2–3) is exhausted. This is the same hypothesis-driven investigation rhythm Sakana models on human fact-checkers.
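+
+The novelty-sampling bullet above can be illustrated with a deliberately small recursion. This is a sketch under strong assumptions: it follows only the single best half at each level (the real prompt may zoom into several chunks), `todo_density` is a toy scorer standing in for a flash sweep, and `zoom`, `min_gain`, and the two-calls-per-split budget accounting are invented for illustration.
+
+```rust
+use std::ops::Range;
+
+/// Toy scorer standing in for a flash sweep: share of chunks mentioning "TODO".
+pub fn todo_density(chunks: &[&str]) -> f64 {
+    if chunks.is_empty() {
+        return 0.0;
+    }
+    let hits = chunks.iter().filter(|c| c.contains("TODO")).count();
+    hits as f64 / chunks.len() as f64
+}
+
+/// Recursively narrows `range` toward the higher-scoring half.
+/// Stops when the range is a single chunk, the budget cannot pay for
+/// another split, or the best half fails to beat the parent by `min_gain`.
+pub fn zoom<F>(
+    chunks: &[&str],
+    range: Range<usize>,
+    parent_score: f64,
+    score: &F,
+    budget: &mut u32,
+    min_gain: f64,
+) -> Range<usize>
+where
+    F: Fn(&[&str]) -> f64,
+{
+    if range.len() <= 1 || *budget < 2 {
+        return range;
+    }
+    let mid = range.start + range.len() / 2;
+    let left = range.start..mid;
+    let right = mid..range.end;
+    let left_score = score(&chunks[left.clone()]);
+    let right_score = score(&chunks[right.clone()]);
+    *budget -= 2; // one scoring call per half
+    let (best, best_score) = if left_score >= right_score {
+        (left, left_score)
+    } else {
+        (right, right_score)
+    };
+    if best_score < parent_score + min_gain {
+        return range; // novelty has plateaued
+    }
+    zoom(chunks, best, best_score, score, budget, min_gain)
+}
+```
+
+With eight chunks and one interesting chunk, the recursion spends six scoring calls to isolate it, compared with eight for a flat sweep at the finest resolution; the gap widens as the sample grows.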
+
+These patterns are not separate features — they are how the Hetun prompt teaches the model to use `repl` blocks. The runtime primitives (`rlm_query_batched`, `ctx` helpers from #53, flash/Pro tiering from #48) are already in place once the rest of the RLM stack ships; Hetun is the prompt + approval layer that wires them together into the recursive-research-and-mission rhythm above.
+
+We do **not** import Sakana's full system: the ABM persona-simulation framework (Shachi) and the misinformation-specific image/video detectors are out of scope. What we adopt is the **research methodology** — recursive novelty sampling, hierarchical synthesis, multi-detector verification, hypothesis loops — applied to the agent-coding domain instead of the misinformation domain.
+
+**Files:** `crates/tui/src/tui/app.rs` (add `AppMode::Hetun`, place it last in the Tab cycle), `crates/tui/src/tui/palette.rs` (add `MODE_HETUN` colour, e.g. purple to distinguish from YOLO red and Plan orange), `crates/tui/src/tui/prompts.rs` (add `HETUN_PROMPT`), `crates/tui/src/prompts/hetun.txt` (the prompt body — must teach the recursive-novelty + hierarchical-synthesis + verification-loop pattern, not just "fan out queries"), `crates/tui/src/core/engine.rs` (mission-level approval hook before RLM execution in Hetun), `crates/tui/src/tui/widgets/header.rs` (mode-badge text reads "Hetun (Plan + Recursive Agents)").
+
+## Vendoring zigrlm
+
+**Tracking issue:** #55
+
+Rather than treating `zigrlm` as an external binary, we will vendor it as a git submodule at `vendor/zigrlm` and build it alongside the Rust project. This lets us:
+
+1. Guarantee the binary exists for contributors and CI
+2. Patch zigrlm for deepseek-tui-specific features (three-tier model routing, custom JS builtins, DeepSeek trace format)
+3. Eventually link it as a C library instead of shelling out
+
+**Build integration:** A `build.rs` in `crates/tui/` invokes `zig build` in `vendor/zigrlm` when `zig` is on PATH. If `zig` is missing, the TUI falls back to existing binary-discovery logic. `ZigrlmRuntimeConfig` prefers the vendored path.
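+
+A rough sketch of the pieces such a `build.rs` and the config resolver might share. The function names are hypothetical, and where the vendored binary ends up after `zig build` depends on zigrlm's own build script, so the caller supplies that path.
+
+```rust
+use std::ffi::OsStr;
+use std::io;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+/// Looks for `bin` in each directory of a PATH-style value.
+pub fn find_on_path(bin: &str, path_var: &OsStr) -> Option<PathBuf> {
+    std::env::split_paths(path_var)
+        .map(|dir| dir.join(bin))
+        .find(|candidate| candidate.is_file())
+}
+
+/// Runs `zig build` inside the vendored checkout; Ok(true) on success.
+pub fn build_vendored(zig: &Path, vendor_dir: &Path) -> io::Result<bool> {
+    let status = Command::new(zig)
+        .arg("build")
+        .current_dir(vendor_dir)
+        .status()?;
+    Ok(status.success())
+}
+
+/// Prefers the vendored binary when it exists, else the discovered one.
+pub fn pick_zigrlm(vendored: &Path, discovered: Option<PathBuf>) -> Option<PathBuf> {
+    if vendored.is_file() {
+        Some(vendored.to_path_buf())
+    } else {
+        discovered
+    }
+}
+```
+
+In a build script, `find_on_path("zig", &std::env::var_os("PATH").unwrap_or_default())` returning `None` would be the signal to skip the vendored build and leave binary discovery to the runtime.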
+
+**Files:** `.gitmodules`, `crates/tui/build.rs`, `crates/tui/src/zigrlm_config.rs`, `README.md`, `AGENTS.md`.
+
+## Plan
+
+These ship together as one cohesive RLM landing — Hetun is the flagship that gives the rest a reason to exist on day one, the helper layer is what gives RLM something to do besides toy parallelism, and the auto-config + vendoring make it work without any user setup. The order below is the implementation dependency order, not a staggered release schedule. We just keep shipping.
+
+| Issue | Scope | Files |
+|---|---|---|
+| #48 | **Auto-config.** Build `ZigrlmRuntimeConfig`, binary discovery, config schema. | `crates/tui/src/zigrlm_config.rs` (new), `crates/config/src/lib.rs` |
+| #49 | **Inline primitive.** Detect repl blocks in `handle_deepseek_turn()`, shell out, replace message content, emit RLM events. | `crates/tui/src/core/engine.rs`, `crates/tui/src/zigrlm_runtime.rs` (new), `crates/tui/src/core/events.rs` |
+| #53 | **Helper layer + Python sandbox.** Curated `ctx` helpers + AST-validated Python sandbox baked into the repl runtime. | New helper module (location TBD between zigrlm upstream and `crates/tui/src/zigrlm_runtime/`) |
+| #55 | **Vendor zigrlm.** Add git submodule, build script, prefer vendored path. | `.gitmodules`, `crates/tui/build.rs`, `crates/tui/src/zigrlm_config.rs` |
+| #50 | **Prompt engineering.** Add RLM section to agent/yolo prompts, config toggle, examples that exercise the helper layer. | `crates/tui/src/prompts/agent.txt`, `crates/tui/src/prompts/yolo.txt`, `crates/config/src/lib.rs` |
+| #54 | **Hetun mode.** Add a 4th mode at the end of the cycle, with mission-level approval gate. Plan stays unchanged. | `crates/tui/src/tui/app.rs`, `crates/tui/src/tui/palette.rs`, `crates/tui/src/prompts/hetun.txt`, `crates/tui/src/core/engine.rs`, `crates/tui/src/tui/widgets/header.rs` |
+| #46 | **Explicit bridge.** Implement `ZigrlmTool` spec, register in registry, add to sub-agent allowed lists. | `crates/tui/src/tools/zigrlm.rs` (new), `crates/tui/src/tools/registry.rs`, `crates/tui/src/tools/subagent.rs` |
+
+We diverge from the old #40 plan (building a native Rust repl parser) because `zigrlm` already owns parsing, sandboxing, and trace emission. Reimplementing that in Rust is waste — #53 (the helper layer) is where we add the value that makes the runtime actually usable.
+
+## Non-Goals (deferred)
+
+- **Native repl parser in Rust** (#41–#45, all closed). zigrlm's Zig parser is sufficient.
+- **Real-time streaming of child progress** into the TUI transcript. Spinner + final summary is enough.
+- **Process pool / pre-warming** of zigrlm subprocesses. One fork per repl block is acceptable given flash latency.
+- **Replacing `agent_swarm` entirely.** Swarm remains for multi-step autonomous work that requires tools.
+- **Automatic migration** of existing swarm task graphs to repl blocks.
+- **Windows-specific binary discovery quirks.** macOS / Linux are the priority surfaces.
+- **JS sandbox hardening.** Trusted-local-compute model, same posture as the Python sandbox in #53.
+- **Three-tier model routing** (frontier escalation inside repl). Requires zigrlm patches; revisit once vendoring (#55) lands.
+- **Native C-library linkage** of zigrlm. Worth doing only after subprocess overhead is shown to be a real bottleneck.
+
+---
+
+## Appendix A: ReAct vs. RLM — Why Both?
+
+> A deeper treatment lives in `docs/research-react-vs-rlm.md`. This appendix extracts the decisions that matter for our integration.
+
+**ReAct** (Yao et al., ICLR 2023) is the incumbent paradigm: a linear chain of *Thought → Action (JSON tool call) → Observation → repeat*. The model's entire history of reasoning and tool results is appended to the prompt every turn. It is simple, inspectable, and works well for interactive, stateful tasks (editing files, running shell commands, browsing).
+
+**RLM** (Zhang et al., arXiv:2512.24601) is a tree-structured inference paradigm. The model writes fenced `repl` blocks that manipulate an external REPL variable store and spawn recursive child calls. Because the full context lives in variables, the LM sees only constant-size metadata. This enables:
+
+- **Native parallelism** via `rlm_query_batched`
+- **10M+ token scale** (two orders of magnitude beyond the base window)
+- **Cheap child models** for leaf work while a frontier model handles control
+
+### Comparison Table
+
+| Dimension | ReAct (today) | RLM (proposed) |
+|---|---|---|
+| **Structure** | Linear chain | Tree of recursive calls |
+| **Parallelism** | Sequential (or heavy swarm tasks) | Native batched fan-out (up to 8 concurrent) |
+| **State** | Monolithic prompt scratchpad | External REPL variables |
+| **Tool interface** | JSON schema (`ToolUse`) | Fenced `repl` DSL blocks |
+| **Child cost** | Full agent loop per child | Cheap `deepseek-v4-flash` subprocess calls |
+| **Observability** | Linear transcript | JSONL tree trace (`--trace`) |
+| **Best for** | Interactive, stateful, tool-driven work | Parallel analysis, long context, batch generation |
+
+### The Hybrid Stance
+
+We do not replace ReAct with RLM; we make RLM a first-class *primitive inside* the ReAct loop. The agent still reasons in natural language and calls tools via JSON when it needs interactive side effects. But when it wants to fan out parallel analysis, decompose a large context, or batch-generate, it writes a `repl` block instead of spawning `agent_swarm`. The engine detects the block, runs `zigrlm`, and feeds the aggregated `FINAL` result back as the assistant's answer for that turn.
+
+This maps to the paper's own framing: RLM is the next milestone *after* CoT and ReAct, not a replacement for them.
+
+---
+
+## Appendix B: UI/UX Design
+
+### The Core Tension
+
+The Pro model streams its response naturally. The user sees "I'll break this into parallel searches…" and then watches a `\`\`\`repl` fence appear character-by-character. This is **good** — it shows intent and builds trust. But once the fence closes, the engine must pause, fork `zigrlm`, wait for N parallel flash calls to finish, and then present a single coherent answer. The UI must bridge that gap without feeling broken.
+
+### The Solution: Progressive Disclosure via "Thinking Reclassification"
+
+The existing TUI already has the perfect visual language for this: `HistoryCell::Thinking` (`crates/tui/src/tui/history.rs`, lines 1129–1198). Thinking blocks render with a left border (`▏`), a header showing a spinner and duration, collapsible by default, and markdown body. We reuse that pattern exactly.
+
+**The flow:**
+
+1. **Streaming phase** — The Pro model streams its response. The TUI shows it live in `HistoryCell::Assistant { streaming: true }`, exactly as today.
+2. **Detection phase** — After `MessageStop`, the engine detects the repl block and emits `Event::RlmStarted { message_index }`.
+3. **Reclassification** — The engine mutates `session.messages`:
+   - The `ContentBlock::Text` containing the repl fence is moved to a new `ContentBlock::Thinking` block.
+   - A transient `ContentBlock::Text` placeholder is inserted: "*Running RLM tree…*"
+4. **Execution phase** — `zigrlm` runs. The TUI footer shows `recursing ⌀` (sky blue, same family as `working`). The transcript shows the thinking block collapsed with a live spinner: `◦ recursing live`.
+5. **Completion phase** — `zigrlm` returns. The engine replaces the placeholder text with the `FINAL` result and emits `Event::RlmComplete { message_index, usage, duration_ms }`.
+6. **Final render** — The TUI updates:
+   - `ContentBlock::Text` now shows the aggregated result (normal markdown).
+   - `ContentBlock::Thinking` shows the original repl plan, now collapsed and labeled `recursing done · 1.2s`.
+   - A one-line metadata footer is appended: `▸ 3 flash calls · 2.1K tokens · ~$0.003`.
+
+This gives the user **one thoughtful response**: the result is the message, and the repl block is the reasoning behind it — exactly how `Thinking` blocks work today.
+
+### Visual States
+
+| State | Transcript | Footer | Thinking Block |
+|---|---|---|---|
+| **Streaming** | Assistant cell, `streaming: true` | `thinking ⌀` | Not visible yet (model hasn't emitted fence) |
+| **Executing** | Assistant cell, spinner suffix on placeholder | `recursing ⌀` | Collapsed, header reads `◦ recursing live` |
+| **Complete** | Assistant cell, result text | Idle | Collapsed, header reads `◦ recursing done · 1.2s` |
+| **Expanded** | Same | Idle | Expanded, shows full repl DSL with syntax highlighting |
+
+### Why Not a Tool Card?
+
+It is tempting to model RLM execution as a new `ToolCell` variant. We explicitly reject this because RLM is **not a tool call** — it is an inline primitive. Rendering it as a tool card would:
+- Break the "single thoughtful response" metaphor
+- Train the user to think of RLM as an external action rather than assistant reasoning
+- Add visual noise (tool headers, argument summaries, result boxes) for what is essentially accelerated thinking
+
+The `Thinking` block is the right container because the repl block *is* the model's reasoning about how to parallelize. The result is simply the output of that reasoning.
+
+### Keyboard & Detail Views
+
+- **`v` on the assistant message** — Opens the `PagerView` (`crates/tui/src/tui/pager.rs`) showing the full message: the original repl block at the top (with `zigrlm` trace path if available), the FINAL result below, and the JSONL tree if the user wants to inspect child calls.
+- **`v` on the thinking block** — Toggles collapse/expand inline, same as existing thinking behavior.
+- **`Alt+4` (sidebar)** — Future: an RLM panel showing recent RLM executions with call counts, depth, and trace file paths. Deferred.
+
+### Footer & Status Indicators
+
+**New footer state** in `crates/tui/src/tui/ui.rs` (`footer_state_label()`, ~line 4022):
+
+```rust
+else if app.active_rlm.is_some() {
+    ("recursing ⌀", Style::default().fg(Color::Sky))
+}
+```
+
+The word *recursing* is playful, accurate, and short enough for the footer. Alternatives considered: `recursive thinking ⌀` (too long), `RLM ⌀` (too opaque). `recursing` wins because it describes what is actually happening — the model is recursively fanning out child calls — and it fits the existing informal voice of the TUI (`thinking ⌀`, `working`, `compacting ⌀`).
+
+**Motion refresh:** The existing `UI_STATUS_ANIMATION_MS` (360 ms) timer already bumps the transcript cache when `history_has_live_motion` is true. We add `app.active_rlm.is_some()` to that check so the spinner animates while `zigrlm` runs.
+
+### Events to Add
+
+**File:** `crates/tui/src/core/events.rs`
+
+```rust
+pub enum Event {
+    // ... existing variants
+    RlmStarted {
+        message_index: usize,
+        estimated_calls: Option<usize>,
+    },
+    RlmComplete {
+        message_index: usize,
+        usage: Usage,
+        duration_ms: u64,
+    },
+    RlmFailed {
+        message_index: usize,
+        error: String,
+    },
+}
+```
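+
+A sketch of how the TUI side might fold these events into the `active_rlm` state that the footer check reads. The `Usage` fields, the `App` shape, and the `apply` and `footer_label` names are simplified assumptions, not the crate's actual types.
+
+```rust
+/// Simplified usage payload (assumed fields).
+#[derive(Debug, Clone, PartialEq)]
+pub struct Usage {
+    pub calls: u32,
+    pub total_tokens: u64,
+}
+
+/// Only the three RLM variants from the enum above.
+#[derive(Debug, Clone, PartialEq)]
+pub enum Event {
+    RlmStarted { message_index: usize, estimated_calls: Option<usize> },
+    RlmComplete { message_index: usize, usage: Usage, duration_ms: u64 },
+    RlmFailed { message_index: usize, error: String },
+}
+
+/// Minimal slice of app state relevant to RLM runs.
+#[derive(Debug, Default)]
+pub struct App {
+    pub active_rlm: Option<usize>,
+    pub last_rlm_error: Option<String>,
+}
+
+impl App {
+    /// Updates state for one event; completions for a message that is
+    /// not the active one are ignored.
+    pub fn apply(&mut self, event: &Event) {
+        match event {
+            Event::RlmStarted { message_index, .. } => {
+                self.active_rlm = Some(*message_index);
+                self.last_rlm_error = None;
+            }
+            Event::RlmComplete { message_index, .. } => {
+                if self.active_rlm == Some(*message_index) {
+                    self.active_rlm = None;
+                }
+            }
+            Event::RlmFailed { message_index, error } => {
+                if self.active_rlm == Some(*message_index) {
+                    self.active_rlm = None;
+                    self.last_rlm_error = Some(error.clone());
+                }
+            }
+        }
+    }
+
+    /// Footer text while an RLM run is active, otherwise None.
+    pub fn footer_label(&self) -> Option<&'static str> {
+        self.active_rlm.map(|_| "recursing ⌀")
+    }
+}
+```
+
+Keying the state on `message_index` keeps a late `RlmComplete` from clearing the spinner of a newer run.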
+
+### Future: Tree Visualization
+
+`zigrlm` can emit `--trace /path/to/run.jsonl`. A later iteration can parse that JSONL and render a tree widget showing:
+
+- Root prompt (depth 0)
+- Each `rlm_query` / `rlm_query_batched` child (depth 1..N)
+- Per-node usage (calls, tokens, cost)
+- Duration bars
+
+This would live in the `PagerView` or a dedicated sidebar panel, not in the main transcript. It is a debugging/observability feature, not part of the default conversation flow.
+
+### Cost Accounting
+
+`zigrlm` returns usage metadata per run (`calls`, `input_tokens`, `output_tokens`, `cost_micros`). The engine must fold this into the session's aggregate `Usage` in `crates/tui/src/core/session.rs`. However, there is a subtlety: the **root Pro call** that emitted the `repl` block is already counted as part of the normal assistant-message usage. `zigrlm` then performs its *own* root call (also Pro, because `ZIGRLM_MAIN_CMD` points at the session model) plus N child calls (flash). In practice this means the Pro prompt is billed twice — once by our client for the streaming turn, once by `zigrlm` for the root RLM call. This double-counting is acceptable for the spike, but Phase 2 should explore passing the *already-received* assistant text directly into `zigrlm` without re-billing the root call, or subtracting the overlap from displayed totals.
+
+**Display policy:** Show raw numbers only (`3 flash calls · 2.1K tokens · ~$0.003`). Do **not** attempt to show "savings vs. ReAct" because that is a counterfactual — we cannot know how many Pro turns `agent_swarm` would have needed for the same task. The user can infer the value themselves: one Pro call + eight flash calls is visibly cheaper than five Pro calls.
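+
+A sketch of a formatter that produces the raw-numbers line. It assumes `cost_micros` means millionths of a US dollar and that the displayed token count is input plus output; neither is confirmed above, and the function names are hypothetical.
+
+```rust
+/// Formats a token count as a plain number below 1000, else with one decimal and a K suffix.
+pub fn format_tokens(tokens: u64) -> String {
+    if tokens < 1_000 {
+        tokens.to_string()
+    } else {
+        format!("{:.1}K", tokens as f64 / 1_000.0)
+    }
+}
+
+/// Formats micro-dollars as an approximate dollar amount with three decimals.
+pub fn format_cost(cost_micros: u64) -> String {
+    format!("~${:.3}", cost_micros as f64 / 1_000_000.0)
+}
+
+/// Builds the one-line metadata footer from a run's usage fields.
+pub fn rlm_summary(calls: u32, input_tokens: u64, output_tokens: u64, cost_micros: u64) -> String {
+    format!(
+        "▸ {} flash calls · {} tokens · {}",
+        calls,
+        format_tokens(input_tokens + output_tokens),
+        format_cost(cost_micros)
+    )
+}
+```
+
+The formatter takes only measured values as input, so there is no parameter through which a counterfactual "savings" figure could enter the display.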
+
+### Anti-Patterns
+
+- **Do not** stream `zigrlm` child progress into the transcript in real time. Flash calls complete in 1–3 seconds; the noise is not worth the signal.
+- **Do not** show a modal or full-screen overlay during RLM execution. The user should be able to scroll, read history, and type the next query while `zigrlm` works.
+- **Do not** render the raw `[0]\n…\n[1]\n…` batched response format directly. If `FINAL` was missing and we fall back to raw output, strip the indexed prefixes before displaying.
+- **Do not** show fake "you saved $X.XX" badges. The comparison baseline is undefined and the math is misleading.
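+
+For the raw-output fallback mentioned in the list above, a sketch of stripping the indexed prefixes. It assumes each batched answer is introduced by a line consisting solely of `[n]`, as in the spike's `[0]\n4\n[1]\n6` example; the function names are hypothetical, and an answer line that itself looks like `[3]` would be misread as a marker.
+
+```rust
+/// True for lines that are exactly a bracketed decimal index such as "[12]".
+fn is_index_marker(line: &str) -> bool {
+    line.strip_prefix('[')
+        .and_then(|rest| rest.strip_suffix(']'))
+        .map_or(false, |digits| {
+            !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
+        })
+}
+
+/// Splits a raw batched response into per-answer strings without markers.
+/// Text that appears before the first marker is kept as its own entry.
+pub fn strip_batch_indices(raw: &str) -> Vec<String> {
+    let mut answers: Vec<Vec<&str>> = Vec::new();
+    for line in raw.lines() {
+        if is_index_marker(line) {
+            answers.push(Vec::new());
+        } else if let Some(last) = answers.last_mut() {
+            last.push(line);
+        } else {
+            answers.push(vec![line]);
+        }
+    }
+    answers.into_iter().map(|lines| lines.join("\n")).collect()
+}
+```
+
+Returning a list rather than a joined string leaves the renderer free to show the answers as bullets, paragraphs, or a table.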