feat(tools): add image_ocr tool — extract text from images via tesseract

Lets the model OCR a screenshot, scanned receipt, whiteboard photo,
or image-only PDF the user drops into the workspace, without
bouncing through `exec_shell` (which would mean an approval prompt
plus the model having to remember tesseract's CLI surface). The
tool spawns `tesseract <image> -` and returns the recognised text
inline — no file is written. Capability is ReadOnly + parallel
since OCR is a side-effect-free read.
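
A minimal, std-only sketch of the spawn-and-capture flow described above. The helper names (`build_ocr_command`, `clean_ocr_output`, `run_ocr`) are hypothetical and errors are plain strings here; the tool itself reports failures through `ToolError`.

```rust
use std::path::Path;
use std::process::{Command, Stdio};

/// Build the `tesseract <image> -` invocation.
/// Hypothetical helper; the binary name is passed in so it can be swapped.
fn build_ocr_command(tesseract: &str, image: &Path) -> Command {
    let mut cmd = Command::new(tesseract);
    cmd.arg(image)
        .arg("-") // output base `-` sends the recognised text to stdout
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped());
    cmd
}

/// Decode stdout lossily and drop trailing whitespace,
/// which also removes a trailing form feed if one is present.
fn clean_ocr_output(stdout: &[u8]) -> String {
    String::from_utf8_lossy(stdout).trim_end().to_string()
}

/// Run the command and return the recognised text, or a readable error.
fn run_ocr(tesseract: &str, image: &Path) -> Result<String, String> {
    let output = build_ocr_command(tesseract, image)
        .output()
        .map_err(|e| format!("failed to launch tesseract: {e}"))?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr).trim().to_string();
        return Err(format!(
            "tesseract failed (exit {:?}): {stderr}",
            output.status.code()
        ));
    }
    Ok(clean_ocr_output(&output.stdout))
}
```

Nothing is written to disk in this sketch: the text comes back only through the captured stdout pipe.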

Registration is gated on `crate::dependencies::resolve_tesseract()`
via the new `ToolRegistryBuilder::with_image_ocr_tools()` builder,
hooked into `with_agent_tools` alongside `pandoc_convert`. When
tesseract is missing the tool isn't advertised: the same
probe-then-decide pattern v0.8.31 introduced for Python. The
execute path also calls `resolve_tesseract()`, so a tool invoked
without a resolved binary returns the install-tesseract hint
rather than the raw spawn failure. The probe result is cached once
per process, so an uninstall after the first probe still surfaces
as a spawn failure.
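
The probe-then-decide gating can be sketched as follows. This is an illustration under assumptions, not the crate's code: `resolve_cached`, `Catalog`, and `with_optional_tool` are hypothetical names, and the `probe` closure stands in for `probe_executable`, whose body is not part of this commit.

```rust
use std::sync::OnceLock;

/// Resolve a binary at most once per cache; later calls reuse the stored answer.
/// `probe` is a stand-in for the real executable probe (an assumption here).
fn resolve_cached(
    cache: &OnceLock<Option<String>>,
    name: &str,
    probe: impl FnOnce(&str) -> bool,
) -> Option<String> {
    cache
        .get_or_init(|| if probe(name) { Some(name.to_string()) } else { None })
        .clone()
}

/// Hypothetical stand-in for the tool catalog the model sees.
#[derive(Default)]
struct Catalog {
    tools: Vec<&'static str>,
}

impl Catalog {
    fn with_tool(mut self, name: &'static str) -> Self {
        self.tools.push(name);
        self
    }

    /// Probe-then-decide: advertise the tool only when its binary resolved.
    fn with_optional_tool(self, name: &'static str, binary: Option<String>) -> Self {
        if binary.is_some() {
            self.with_tool(name)
        } else {
            self
        }
    }
}
```

In this sketch a second call to `resolve_cached` returns the stored answer without probing again, which is the usual trade-off of a once-per-process cache.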

`deepseek doctor`'s "Tool Dependencies" section reports tesseract
status next to pandoc / node / python with platform-aware install
hints. For non-default language packs or PSM modes the user can
still drop into `exec_shell` with the full tesseract CLI surface.
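
The platform-aware hint boils down to a match on the OS identifier. A minimal sketch, assuming a hypothetical `tesseract_install_hint` helper that returns the hint instead of printing it:

```rust
/// Map an OS identifier (the values of `std::env::consts::OS`) to an install hint.
/// `tesseract_install_hint` is a hypothetical name used for illustration.
fn tesseract_install_hint(os: &str) -> String {
    match os {
        "macos" => "brew install tesseract".to_string(),
        "linux" => "sudo apt install tesseract-ocr".to_string(),
        "windows" => "winget install UB-Mannheim.TesseractOCR".to_string(),
        // Unknown platforms fall back to a pointer at the project site.
        other => format!("install tesseract for {other} from tesseract-ocr.github.io"),
    }
}
```

Calling it with `std::env::consts::OS` gives the hint for the current host; returning a `String` rather than printing keeps the mapping easy to test.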

Tests check the metadata (ReadOnly + parallel, not WritesFiles),
the missing-path rejection, and the happy-path OCR round-trip
against `crates/tui/tests/fixtures/ocr_hello.png` — a 2 KB
300×100 grayscale PNG generated with ImageMagick rendering
"HELLO OCR" in Helvetica. The happy-path test skips silently on
hosts without tesseract (matching the catalog-build behaviour) and
on hosts where the fixture isn't checked out (sparse / shallow
clones).
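
The skip conditions and the tolerant assertion used by the happy-path test can be sketched as two small helpers. Both names (`can_run_ocr_test`, `contains_all_words`) are hypothetical and are not functions from this commit.

```rust
use std::path::Path;

/// Decide whether the happy-path OCR test can run on this host:
/// it needs both the tesseract binary and the checked-out fixture.
fn can_run_ocr_test(tesseract_present: bool, fixture: &Path) -> bool {
    tesseract_present && fixture.exists()
}

/// Case-insensitive check that every expected word appears in the OCR output,
/// regardless of the spacing or line breaks tesseract puts between them.
fn contains_all_words(ocr_text: &str, expected: &[&str]) -> bool {
    let upper = ocr_text.to_uppercase();
    expected.iter().all(|w| upper.contains(&w.to_uppercase()))
}
```

A test that returns early when `can_run_ocr_test` is false is reported as passed, which is what "skips silently" means in practice.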

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hunter Bown
2026-05-12 00:58:48 -05:00
parent aed7dbefaa
commit bd603a271c
7 changed files with 276 additions and 1 deletions
+15
@@ -39,6 +39,21 @@ real world uses."
### Added
- **`image_ocr` tool — extract text from images via local
tesseract.** Lets the model OCR a screenshot, scanned receipt,
whiteboard photo, or rasterised page of an image-only PDF the user
drops into the workspace, without bouncing through `exec_shell`.
Spawns `tesseract <image> -` and returns the recognised text inline;
no file is written. PNG / JPEG / TIFF inputs supported (tesseract
does not read PDF input directly).
Registration is gated on `dependencies::resolve_tesseract()`;
when tesseract is missing the tool isn't advertised, so the
model never tries to call an OCR engine the host can't run.
`deepseek doctor` reports tesseract status alongside the other
external-binary dependencies with platform-aware install hints
(`brew install tesseract` / `apt install tesseract-ocr` /
`winget install UB-Mannheim.TesseractOCR`). For non-default
language packs or PSM modes, users can still drop into
`exec_shell` with the full tesseract CLI surface.
- **`pandoc_convert` tool — convert documents between formats via
the local pandoc binary.** Pandoc is the Swiss Army knife the
real world uses for moving prose around — Markdown to HTML,
+26
@@ -117,6 +117,32 @@ pub fn resolve_pdftotext() -> Option<String> {
.clone()
}
/// Resolve `tesseract` (OCR engine) once per process. Used by
/// the `image_ocr` tool to decide whether to register itself with
/// the model. Tesseract is the de-facto open-source OCR engine and
/// ships as a single binary on every platform we support, so the
/// candidate list is just `tesseract`.
pub fn resolve_tesseract() -> Option<String> {
static CACHE: OnceLock<Option<String>> = OnceLock::new();
CACHE
.get_or_init(|| {
if probe_executable("tesseract") {
tracing::info!(
target: "tool_dependencies",
"Resolved tesseract binary for image_ocr",
);
Some("tesseract".to_string())
} else {
tracing::warn!(
target: "tool_dependencies",
"tesseract binary not found; image_ocr tool will not be registered",
);
None
}
})
.clone()
}
/// Resolve `pandoc` (universal document converter) once per
/// process. Used by the `pandoc_convert` tool to decide whether
/// to register itself with the model. Pandoc is a single-binary
+23
@@ -2180,6 +2180,29 @@ async fn run_doctor(config: &Config, workspace: &Path, config_path_override: Opt
}
}
match crate::dependencies::resolve_tesseract() {
Some(_) => println!(
" {} tesseract: present → image_ocr tool registered",
"".truecolor(aqua_r, aqua_g, aqua_b),
),
None => {
println!(" {} tesseract: not found (optional)", "·".dimmed(),);
println!(
" image_ocr tool is NOT advertised to the model. Install tesseract to enable:"
);
match std::env::consts::OS {
"macos" => println!(" brew install tesseract"),
"linux" => println!(
" sudo apt install tesseract-ocr (Debian/Ubuntu) — or your distro's equivalent"
),
"windows" => println!(" winget install UB-Mannheim.TesseractOCR"),
other => {
println!(" install tesseract for {other} from tesseract-ocr.github.io")
}
}
}
}
// PDF reader: pure-Rust `pdf-extract` is the v0.8.32 default, so
// `pdftotext` is no longer required for `read_file` to handle PDFs.
// We still surface its presence (a) so users with column-heavy PDFs
+194
@@ -0,0 +1,194 @@
//! `image_ocr` tool — extract text from an image via the local
//! `tesseract` OCR engine.
//!
//! Tesseract is the open-source workhorse for "convert this image
//! to text" — covers screenshots, scanned PDFs that arrived as
//! image-only blobs, handwriting-free documents in 100+ languages,
//! receipts, whiteboard photos, etc. Surfacing it as a
//! model-callable tool means the model can OCR an asset the user
//! drops into the workspace without bouncing through `exec_shell`.
//!
//! Registration is gated by [`crate::dependencies::resolve_tesseract`]
//! (see [`crate::tools::registry::ToolRegistryBuilder::with_image_ocr_tools`]).
//! When tesseract isn't installed the tool simply doesn't appear in
//! the catalog, so the model never sees a binary it can't actually
//! use.
use std::process::{Command, Stdio};
use async_trait::async_trait;
use serde_json::{Value, json};
use super::spec::{ToolCapability, ToolContext, ToolError, ToolResult, ToolSpec, required_str};
/// Tool implementing `image_ocr`. Spawns `tesseract <image> -` and
/// returns the extracted text on success.
pub struct ImageOcrTool;
#[async_trait]
impl ToolSpec for ImageOcrTool {
fn name(&self) -> &'static str {
"image_ocr"
}
fn description(&self) -> &'static str {
"Extract text from an image (PNG, JPEG, or TIFF) via local tesseract OCR. Use this for screenshots, scanned receipts/whiteboards, image-only PDFs, or any visual that contains text the model needs to read. Returns the extracted text inline; no file is written. Use `exec_shell` only when you need a non-default OCR language pack or PSM mode."
}
fn input_schema(&self) -> Value {
json!({
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Path to the image file (relative to workspace or absolute). PNG / JPEG / TIFF supported."
}
},
"required": ["path"]
})
}
fn capabilities(&self) -> Vec<ToolCapability> {
vec![ToolCapability::ReadOnly, ToolCapability::Sandboxable]
}
fn supports_parallel(&self) -> bool {
true
}
async fn execute(&self, input: Value, context: &ToolContext) -> Result<ToolResult, ToolError> {
let path_str = required_str(&input, "path")?;
let image_path = context.resolve_path(path_str)?;
if !image_path.exists() {
return Err(ToolError::execution_failed(format!(
"image_ocr: source path does not exist: {}",
image_path.display()
)));
}
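// Late-resolve tesseract too. Registration is gated on
// resolve_tesseract(), but the tool can also be constructed
// directly (as the tests do), so a missing binary should surface
// a clear install hint rather than the raw spawn failure. The
// probe result is cached per process, so this does not detect an
// uninstall that happens after the first probe.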
let tesseract = crate::dependencies::resolve_tesseract().ok_or_else(|| {
ToolError::execution_failed(
"image_ocr: tesseract binary not found on PATH. \
Install tesseract (macOS: `brew install tesseract`; \
Debian/Ubuntu: `apt install tesseract-ocr`; \
Windows: `winget install UB-Mannheim.TesseractOCR`) \
and restart deepseek-tui.",
)
})?;
// `tesseract <image> -` writes the recognised text to stdout:
// passing `-` as the output base selects stdout, and the default
// output format is plain text, so no `.txt` file lands on disk.
let mut cmd = Command::new(&tesseract);
cmd.arg(&image_path);
cmd.arg("-");
cmd.stdin(Stdio::null())
.stdout(Stdio::piped())
.stderr(Stdio::piped());
let output = cmd
.output()
.map_err(|e| ToolError::execution_failed(format!("failed to launch tesseract: {e}")))?;
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr).trim().to_string();
return Err(ToolError::execution_failed(format!(
"tesseract failed (exit {:?}): {stderr}",
output.status.code()
)));
}
// Tesseract appends a trailing form-feed on some platforms;
// trim trailing whitespace so the result reads cleanly inline.
let text = String::from_utf8_lossy(&output.stdout)
.trim_end()
.to_string();
Ok(ToolResult::success(text))
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::fs;
use tempfile::tempdir;
/// Tesseract availability — happy-path tests skip when missing so
/// CI environments without OCR still pass the suite.
fn tesseract_present() -> bool {
crate::dependencies::resolve_tesseract().is_some()
}
/// Resolve the checked-in OCR fixture path. The image lives at
/// `crates/tui/tests/fixtures/ocr_hello.png` (300x100 grayscale,
/// "HELLO OCR" rendered in Helvetica) and is committed for the
/// happy-path round-trip below.
fn ocr_fixture_path() -> std::path::PathBuf {
std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/ocr_hello.png")
}
#[test]
fn tool_metadata_marks_image_ocr_read_only_and_parallel() {
let tool = ImageOcrTool;
assert_eq!(tool.name(), "image_ocr");
assert!(tool.supports_parallel());
let caps = tool.capabilities();
assert!(caps.contains(&ToolCapability::ReadOnly));
assert!(!caps.contains(&ToolCapability::WritesFiles));
}
#[tokio::test]
async fn image_ocr_rejects_missing_path() {
let tmp = tempdir().expect("tempdir");
let ctx = ToolContext::new(tmp.path().to_path_buf());
let err = ImageOcrTool
.execute(json!({"path": "definitely-not-here.png"}), &ctx)
.await
.expect_err("nonexistent path must reject before tesseract spawn");
let msg = err.to_string();
assert!(
msg.contains("does not exist"),
"error must call out missing path; got {msg}"
);
}
#[tokio::test]
async fn image_ocr_recovers_hello_from_fixture_image() {
if !tesseract_present() {
// Tool wouldn't be registered without tesseract — mirror
// that here so the suite stays green on CI images that
// intentionally omit OCR tooling.
return;
}
let fixture = ocr_fixture_path();
if !fixture.exists() {
// Fixture not committed (sparse / shallow checkout). Skip
// silently rather than failing the suite.
return;
}
let tmp = tempdir().expect("tempdir");
// Stage the fixture under the workspace so the path resolver
// accepts the relative input — keeps the test independent of
// the workspace boundary check inside `resolve_path`.
let staged = tmp.path().join("ocr_hello.png");
fs::copy(&fixture, &staged).unwrap();
let ctx = ToolContext::new(tmp.path().to_path_buf());
let result = ImageOcrTool
.execute(json!({"path": "ocr_hello.png"}), &ctx)
.await
.expect("execute");
assert!(result.success);
// Tesseract reliably recovers "HELLO OCR" from the rendered
// PNG; allow either spacing variant.
let normalised = result.content.to_uppercase();
assert!(
normalised.contains("HELLO") && normalised.contains("OCR"),
"expected OCR to recover HELLO OCR; got {:?}",
result.content
);
}
}
+1
@@ -23,6 +23,7 @@ pub mod fim;
pub mod git;
pub mod git_history;
pub mod github;
pub mod image_ocr;
pub mod js_execution;
pub mod large_output_router;
pub mod notify;
+17 -1
@@ -490,6 +490,21 @@ impl ToolRegistryBuilder {
}
}
/// Include the `image_ocr` tool only when the `tesseract`
/// binary is present on this host. Probe-then-decide mirroring
/// `with_pandoc_tools` — when tesseract is missing the tool
/// stays out of the catalog, so the model never tries to call
/// an OCR engine the host can't actually run.
#[must_use]
pub fn with_image_ocr_tools(self) -> Self {
if crate::dependencies::resolve_tesseract().is_some() {
use super::image_ocr::ImageOcrTool;
self.with_tool(Arc::new(ImageOcrTool))
} else {
self
}
}
/// Include the `load_skill` tool (#434) so the model can pull a
/// SKILL.md body + companion file list into context with one
/// call instead of `read_file` + `list_dir` against the path
@@ -748,7 +763,8 @@ impl ToolRegistryBuilder {
.with_tool_result_retrieval_tool()
.with_runtime_task_tools()
.with_revert_turn_tool()
- .with_pandoc_tools();
+ .with_pandoc_tools()
+ .with_image_ocr_tools();
if allow_shell {
builder.with_shell_tools()
Binary file not shown (size: 2.0 KiB).