feat(tools): add image_ocr tool — extract text from images via tesseract
Lets the model OCR a screenshot, scanned receipt, whiteboard photo, or image-only PDF the user drops into the workspace, without bouncing through `exec_shell` (which would mean an approval prompt plus the model having to remember tesseract's CLI surface). The tool spawns `tesseract <image> -` and returns the recognised text inline; no file is written. Capability is ReadOnly + parallel since OCR is a side-effect-free read.

Registration is gated on `crate::dependencies::resolve_tesseract()` via the new `ToolRegistryBuilder::with_image_ocr_tools()` builder, hooked into `with_agent_tools` alongside `pandoc_convert`. When tesseract is missing the tool isn't advertised, the same probe-then-decide pattern that v0.8.31 introduced for Python. The execute path also late-resolves so a concurrent uninstall surfaces the install-tesseract hint rather than the raw spawn failure.

`deepseek doctor`'s "Tool Dependencies" section reports tesseract status next to pandoc / node / python with platform-aware install hints.

For non-default language packs or PSM modes the user can still drop into `exec_shell` with the full tesseract CLI surface.

Tests check the metadata (ReadOnly + parallel, not WritesFiles), the missing-path rejection, and the happy-path OCR round-trip against `crates/tui/tests/fixtures/ocr_hello.png`, a 2 KB 300×100 grayscale PNG generated with ImageMagick rendering "HELLO OCR" in Helvetica. The happy-path test skips silently on hosts without tesseract (matching the catalog-build behaviour) and on hosts where the fixture isn't checked out (sparse / shallow clones).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
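The probe-then-decide registration described above relies on a helper, `probe_executable`, whose body is not part of this diff. As a rough sketch only, assuming the probe amounts to a PATH scan cached once per process, it could look like the following. `find_on_path` and `resolve_tool_cached` are hypothetical names, and the sketch ignores Windows `.exe` suffixes and executable-bit checks.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;
use std::sync::OnceLock;

/// Hypothetical helper (not from the commit): look for a file called
/// `name` in each directory of a PATH-style variable.
fn find_on_path(name: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

/// Hypothetical resolver: probe once per process and cache the answer,
/// in the same spirit as the `OnceLock` cache in `resolve_tesseract`.
fn resolve_tool_cached() -> Option<PathBuf> {
    static CACHE: OnceLock<Option<PathBuf>> = OnceLock::new();
    CACHE
        .get_or_init(|| {
            // No PATH at all means nothing can be resolved.
            let path_var = env::var_os("PATH")?;
            find_on_path("tesseract", &path_var)
        })
        .clone()
}
```

With a `OnceLock` cache like this, later calls return the first probe's answer for the lifetime of the process, so a binary installed or removed afterwards is only noticed after a restart.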
@@ -39,6 +39,21 @@ real world uses."
 
 ### Added
 
+- **`image_ocr` tool — extract text from images via local
+  tesseract.** Lets the model OCR a screenshot, scanned receipt,
+  whiteboard photo, or image-only PDF the user drops into the
+  workspace, without bouncing through `exec_shell`. Spawns
+  `tesseract <image> -` and returns the recognised text inline;
+  no file is written. PNG / JPEG / TIFF inputs supported.
+  Registration is gated on `dependencies::resolve_tesseract()`;
+  when tesseract is missing the tool isn't advertised, so the
+  model never tries to call an OCR engine the host can't run.
+  `deepseek doctor` reports tesseract status alongside the other
+  external-binary dependencies with platform-aware install hints
+  (`brew install tesseract` / `apt install tesseract-ocr` /
+  `winget install UB-Mannheim.TesseractOCR`). For non-default
+  language packs or PSM modes, users can still drop into
+  `exec_shell` with the full tesseract CLI surface.
 - **`pandoc_convert` tool — convert documents between formats via
   the local pandoc binary.** Pandoc is the Swiss Army knife the
   real world uses for moving prose around — Markdown to HTML,
@@ -117,6 +117,32 @@ pub fn resolve_pdftotext() -> Option<String> {
         .clone()
 }
 
+/// Resolve `tesseract` (OCR engine) once per process. Used by
+/// the `image_ocr` tool to decide whether to register itself with
+/// the model. Tesseract is the de-facto open-source OCR engine and
+/// ships as a single binary on every platform we support, so the
+/// candidate list is just `tesseract`.
+pub fn resolve_tesseract() -> Option<String> {
+    static CACHE: OnceLock<Option<String>> = OnceLock::new();
+    CACHE
+        .get_or_init(|| {
+            if probe_executable("tesseract") {
+                tracing::info!(
+                    target: "tool_dependencies",
+                    "Resolved tesseract binary for image_ocr",
+                );
+                Some("tesseract".to_string())
+            } else {
+                tracing::warn!(
+                    target: "tool_dependencies",
+                    "tesseract binary not found; image_ocr tool will not be registered",
+                );
+                None
+            }
+        })
+        .clone()
+}
+
 /// Resolve `pandoc` (universal document converter) once per
 /// process. Used by the `pandoc_convert` tool to decide whether
 /// to register itself with the model. Pandoc is a single-binary
@@ -2180,6 +2180,29 @@ async fn run_doctor(config: &Config, workspace: &Path, config_path_override: Opt
         }
     }
 
+    match crate::dependencies::resolve_tesseract() {
+        Some(_) => println!(
+            " {} tesseract: present → image_ocr tool registered",
+            "✓".truecolor(aqua_r, aqua_g, aqua_b),
+        ),
+        None => {
+            println!(" {} tesseract: not found (optional)", "·".dimmed(),);
+            println!(
+                " image_ocr tool is NOT advertised to the model. Install tesseract to enable:"
+            );
+            match std::env::consts::OS {
+                "macos" => println!(" brew install tesseract"),
+                "linux" => println!(
+                    " sudo apt install tesseract-ocr (Debian/Ubuntu) — or your distro's equivalent"
+                ),
+                "windows" => println!(" winget install UB-Mannheim.TesseractOCR"),
+                other => {
+                    println!(" install tesseract for {other} from tesseract-ocr.github.io")
+                }
+            }
+        }
+    }
+
     // PDF reader: pure-Rust `pdf-extract` is the v0.8.32 default, so
     // `pdftotext` is no longer required for `read_file` to handle PDFs.
     // We still surface its presence (a) so users with column-heavy PDFs
@@ -0,0 +1,194 @@
+//! `image_ocr` tool — extract text from an image via the local
+//! `tesseract` OCR engine.
+//!
+//! Tesseract is the open-source workhorse for "convert this image
+//! to text" — covers screenshots, scanned PDFs that arrived as
+//! image-only blobs, handwriting-free documents in 100+ languages,
+//! receipts, whiteboard photos, etc. Surfacing it as a
+//! model-callable tool means the model can OCR an asset the user
+//! drops into the workspace without bouncing through `exec_shell`.
+//!
+//! Registration is gated by [`crate::dependencies::resolve_tesseract`]
+//! (see [`crate::tools::registry::ToolRegistryBuilder::with_image_ocr_tools`]).
+//! When tesseract isn't installed the tool simply doesn't appear in
+//! the catalog, so the model never sees a binary it can't actually
+//! use.
+
+use std::process::{Command, Stdio};
+
+use async_trait::async_trait;
+use serde_json::{Value, json};
+
+use super::spec::{ToolCapability, ToolContext, ToolError, ToolResult, ToolSpec, required_str};
+
+/// Tool implementing `image_ocr`. Spawns `tesseract <image> -` and
+/// returns the extracted text on success.
+pub struct ImageOcrTool;
+
+#[async_trait]
+impl ToolSpec for ImageOcrTool {
+    fn name(&self) -> &'static str {
+        "image_ocr"
+    }
+
+    fn description(&self) -> &'static str {
+        "Extract text from an image (PNG, JPEG, or TIFF) via local tesseract OCR. Use this for screenshots, scanned receipts/whiteboards, image-only PDFs, or any visual that contains text the model needs to read. Returns the extracted text inline; no file is written. Use `exec_shell` only when you need a non-default OCR language pack or PSM mode."
+    }
+
+    fn input_schema(&self) -> Value {
+        json!({
+            "type": "object",
+            "properties": {
+                "path": {
+                    "type": "string",
+                    "description": "Path to the image file (relative to workspace or absolute). PNG / JPEG / TIFF supported."
+                }
+            },
+            "required": ["path"]
+        })
+    }
+
+    fn capabilities(&self) -> Vec<ToolCapability> {
+        vec![ToolCapability::ReadOnly, ToolCapability::Sandboxable]
+    }
+
+    fn supports_parallel(&self) -> bool {
+        true
+    }
+
+    async fn execute(&self, input: Value, context: &ToolContext) -> Result<ToolResult, ToolError> {
+        let path_str = required_str(&input, "path")?;
+        let image_path = context.resolve_path(path_str)?;
+        if !image_path.exists() {
+            return Err(ToolError::execution_failed(format!(
+                "image_ocr: source path does not exist: {}",
+                image_path.display()
+            )));
+        }
+
+        // Late-resolve tesseract too. Registration gated on
+        // resolve_tesseract(), but a concurrent uninstall between
+        // catalog build and the model's call should surface a clear
+        // error rather than the raw spawn failure.
+        let tesseract = crate::dependencies::resolve_tesseract().ok_or_else(|| {
+            ToolError::execution_failed(
+                "image_ocr: tesseract binary not found on PATH. \
+                 Install tesseract (macOS: `brew install tesseract`; \
+                 Debian/Ubuntu: `apt install tesseract-ocr`; \
+                 Windows: `winget install UB-Mannheim.TesseractOCR`) \
+                 and restart deepseek-tui.",
+            )
+        })?;
+
+        // `tesseract <image> -` writes the recognised text to stdout.
+        // The trailing `-` is documented and produces text mode by
+        // default (no `.txt` file written to disk).
+        let mut cmd = Command::new(&tesseract);
+        cmd.arg(&image_path);
+        cmd.arg("-");
+        cmd.stdin(Stdio::null())
+            .stdout(Stdio::piped())
+            .stderr(Stdio::piped());
+
+        let output = cmd
+            .output()
+            .map_err(|e| ToolError::execution_failed(format!("failed to launch tesseract: {e}")))?;
+
+        if !output.status.success() {
+            let stderr = String::from_utf8_lossy(&output.stderr).trim().to_string();
+            return Err(ToolError::execution_failed(format!(
+                "tesseract failed (exit {:?}): {stderr}",
+                output.status.code()
+            )));
+        }
+
+        // Tesseract appends a trailing form-feed on some platforms;
+        // trim trailing whitespace so the result reads cleanly inline.
+        let text = String::from_utf8_lossy(&output.stdout)
+            .trim_end()
+            .to_string();
+        Ok(ToolResult::success(text))
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::fs;
+    use tempfile::tempdir;
+
+    /// Tesseract availability — happy-path tests skip when missing so
+    /// CI environments without OCR still pass the suite.
+    fn tesseract_present() -> bool {
+        crate::dependencies::resolve_tesseract().is_some()
+    }
+
+    /// Resolve the checked-in OCR fixture path. The image lives at
+    /// `crates/tui/tests/fixtures/ocr_hello.png` (300x100 grayscale,
+    /// "HELLO OCR" rendered in Helvetica) and is committed for the
+    /// happy-path round-trip below.
+    fn ocr_fixture_path() -> std::path::PathBuf {
+        std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("tests/fixtures/ocr_hello.png")
+    }
+
+    #[test]
+    fn tool_metadata_marks_image_ocr_read_only_and_parallel() {
+        let tool = ImageOcrTool;
+        assert_eq!(tool.name(), "image_ocr");
+        assert!(tool.supports_parallel());
+        let caps = tool.capabilities();
+        assert!(caps.contains(&ToolCapability::ReadOnly));
+        assert!(!caps.contains(&ToolCapability::WritesFiles));
+    }
+
+    #[tokio::test]
+    async fn image_ocr_rejects_missing_path() {
+        let tmp = tempdir().expect("tempdir");
+        let ctx = ToolContext::new(tmp.path().to_path_buf());
+        let err = ImageOcrTool
+            .execute(json!({"path": "definitely-not-here.png"}), &ctx)
+            .await
+            .expect_err("nonexistent path must reject before tesseract spawn");
+        let msg = err.to_string();
+        assert!(
+            msg.contains("does not exist"),
+            "error must call out missing path; got {msg}"
+        );
+    }
+
+    #[tokio::test]
+    async fn image_ocr_recovers_hello_from_fixture_image() {
+        if !tesseract_present() {
+            // Tool wouldn't be registered without tesseract — mirror
+            // that here so the suite stays green on CI images that
+            // intentionally omit OCR tooling.
+            return;
+        }
+        let fixture = ocr_fixture_path();
+        if !fixture.exists() {
+            // Fixture not committed (sparse / shallow checkout). Skip
+            // silently rather than failing the suite.
+            return;
+        }
+        let tmp = tempdir().expect("tempdir");
+        // Stage the fixture under the workspace so the path resolver
+        // accepts the relative input — keeps the test independent of
+        // the workspace boundary check inside `resolve_path`.
+        let staged = tmp.path().join("ocr_hello.png");
+        fs::copy(&fixture, &staged).unwrap();
+        let ctx = ToolContext::new(tmp.path().to_path_buf());
+        let result = ImageOcrTool
+            .execute(json!({"path": "ocr_hello.png"}), &ctx)
+            .await
+            .expect("execute");
+        assert!(result.success);
+        // Tesseract reliably recovers "HELLO OCR" from the rendered
+        // PNG; allow either spacing variant.
+        let normalised = result.content.to_uppercase();
+        assert!(
+            normalised.contains("HELLO") && normalised.contains("OCR"),
+            "expected OCR to recover HELLO OCR; got {:?}",
+            result.content
+        );
+    }
+}
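The post-processing at the end of `execute` can be exercised in isolation. The sketch below pulls the lossy UTF-8 decoding and trailing-whitespace trim into a hypothetical helper, `clean_ocr_output`, which does not exist in the commit. Rust's `trim_end` treats form feed (U+000C) as whitespace, so the trailing form feed mentioned in the comment is removed along with newlines.

```rust
/// Hypothetical helper (not in the commit): decode tesseract's stdout
/// bytes and drop trailing whitespace, including a trailing form feed.
fn clean_ocr_output(raw: &[u8]) -> String {
    // Invalid UTF-8 sequences are replaced with U+FFFD instead of failing.
    String::from_utf8_lossy(raw).trim_end().to_string()
}
```

Leading whitespace is left untouched, so indentation at the start of the recognised text is preserved.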
@@ -23,6 +23,7 @@ pub mod fim;
 pub mod git;
 pub mod git_history;
 pub mod github;
+pub mod image_ocr;
 pub mod js_execution;
 pub mod large_output_router;
 pub mod notify;
@@ -490,6 +490,21 @@ impl ToolRegistryBuilder {
         }
     }
 
+    /// Include the `image_ocr` tool only when the `tesseract`
+    /// binary is present on this host. Probe-then-decide mirroring
+    /// `with_pandoc_tools` — when tesseract is missing the tool
+    /// stays out of the catalog, so the model never tries to call
+    /// an OCR engine the host can't actually run.
+    #[must_use]
+    pub fn with_image_ocr_tools(self) -> Self {
+        if crate::dependencies::resolve_tesseract().is_some() {
+            use super::image_ocr::ImageOcrTool;
+            self.with_tool(Arc::new(ImageOcrTool))
+        } else {
+            self
+        }
+    }
+
     /// Include the `load_skill` tool (#434) so the model can pull a
     /// SKILL.md body + companion file list into context with one
     /// call instead of `read_file` + `list_dir` against the path
@@ -748,7 +763,8 @@ impl ToolRegistryBuilder {
             .with_tool_result_retrieval_tool()
             .with_runtime_task_tools()
             .with_revert_turn_tool()
-            .with_pandoc_tools();
+            .with_pandoc_tools()
+            .with_image_ocr_tools();
 
         if allow_shell {
             builder.with_shell_tools()
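The conditional registration in `with_image_ocr_tools` can be illustrated with a toy builder. Everything here (`Tool`, `OcrTool`, `RegistryBuilder`, the `binary_present` flag) is a simplified stand-in rather than the project's real registry API. The flag replaces the call to `resolve_tesseract()` so the behaviour can be checked without tesseract installed.

```rust
use std::sync::Arc;

/// Simplified stand-in for the project's tool trait (assumption).
trait Tool {
    fn name(&self) -> &'static str;
}

struct OcrTool;

impl Tool for OcrTool {
    fn name(&self) -> &'static str {
        "image_ocr"
    }
}

/// Toy builder that collects tools; not the real `ToolRegistryBuilder`.
#[derive(Default)]
struct RegistryBuilder {
    tools: Vec<Arc<dyn Tool>>,
}

impl RegistryBuilder {
    fn with_tool(mut self, tool: Arc<dyn Tool>) -> Self {
        self.tools.push(tool);
        self
    }

    /// Register the OCR tool only when the dependency probe succeeded;
    /// otherwise return the builder unchanged so the tool is never listed.
    fn with_ocr_tools(self, binary_present: bool) -> Self {
        if binary_present {
            self.with_tool(Arc::new(OcrTool))
        } else {
            self
        }
    }

    fn names(&self) -> Vec<&'static str> {
        self.tools.iter().map(|tool| tool.name()).collect()
    }
}
```

Because the builder method consumes and returns `self`, a gated step like this chains with the unconditional ones without any branching at the call site.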
Binary file not shown. Size after this commit: 2.0 KiB.