GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models (VLMs) designed to serve as a “grounded OCR front-end”. Unlike traditional OCR pipelines (which are brittle) or modern “OCR-free” VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.

Abstract

Traditional OCR pipelines are often brittle, while modern “OCR-free” Vision-Language Models (VLMs) frequently lack precise token-to-pixel alignment. To address this, we introduce GutenOCR, a family of VLMs designed specifically as a “grounded OCR front-end.” By fine-tuning Qwen2.5-VL on a curriculum of synthetic and real-world documents, GutenOCR provides both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface. This approach allows downstream systems to request exactly the data format they need, from plain text to complex JSON structures.

Key Contributions & Results

Unified Interface: Transforms Qwen2.5-VL models into specialized OCR systems supporting full-page reading, detection, localized reading, and conditional detection via prompting.
In-Domain Improvements: GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, with large gains in localized reading and detection.
Fox Benchmark: GutenOCR significantly outperforms baselines on region-level and line-level OCR. GutenOCR-3B achieves a region-level Character Error Rate (CER, lower is better) of 0.053, surpassing even the dedicated Fox model.
Curriculum Learning: Training uses a three-stage curriculum across synthetic data, real-world business documents, and long-context scientific articles to progressively build layout and grounding competency.
Trade-offs: GutenOCR reads content accurately (high Page F1), but it orders text by 2D layout columns, so its page-level linearization can differ from the reading order a benchmark expects. It also shows partial catastrophic forgetting of color-based prompts and a slight degradation in math formula recognition.
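For readers unfamiliar with the CER figure quoted above, here is a minimal sketch of the standard definition: Levenshtein edit distance between the predicted and reference strings, divided by the reference length. The function names are illustrative, and the exact text normalization used by the Fox benchmark (whitespace, casing) is not specified here, so treat this as the generic metric, not the benchmark's own scorer.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character in a 13-character reference: 1 / 13, roughly 0.077.
print(round(cer("invoice total", "invoice tota1"), 3))
```

On this scale, a region-level CER of 0.053 corresponds to roughly one character error per 19 reference characters.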

Methodology

Data: The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., “Grounded LaTeX” and “SynthDoG Grounding”).
Curriculum Learning: Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).
Unified Interface: The model treats “pipeline” stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).

Models

We release 3B and 7B parameter models on HuggingFace: GutenOCR-3B and GutenOCR-7B.

You can try GutenOCR directly at ocr.roots.ai, where you can upload a document image and see the model’s parsed text output alongside bounding-box highlights on the original image.

Figure: The live demo at ocr.roots.ai. Hovering over any parsed token highlights its bounding box on the original document.

Why This Matters

GutenOCR is proposed as a foundational layer for systems where every extracted answer must be explicitly linked to supporting pixels. Because its outputs are stable and grounded, it enables human-in-the-loop workflows in which reviewers can catch hallucinated or missing text by checking the predicted bounding boxes against the page. This work pairs closely with our release of PubMed-OCR, which provides the large-scale, high-density annotations necessary to train such layout-aware models.
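As a rough illustration of linking an extracted answer back to supporting pixels, the sketch below looks up which grounded lines contain a downstream answer and flags answers with no support for human review. The `GroundedLine` type, the substring matching rule, and the helper names are assumptions made for this example; a production system would likely need fuzzier matching and answers that span multiple lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedLine:
    text: str
    box: tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial OCR spacing does not matter."""
    return " ".join(s.lower().split())


def find_support(answer: str, lines: list[GroundedLine]) -> list[GroundedLine]:
    """Return every grounded line whose text contains the answer."""
    needle = normalize(answer)
    if not needle:
        return []
    return [ln for ln in lines if needle in normalize(ln.text)]


def needs_review(answer: str, lines: list[GroundedLine]) -> bool:
    """An answer with no supporting box is a candidate hallucination."""
    return not find_support(answer, lines)


page = [
    GroundedLine("Invoice No. 1042", (40, 30, 300, 60)),
    GroundedLine("Total Due:  $1,250.00", (40, 400, 320, 430)),
]

# Supported answer: the reviewer can be shown the box to verify it.
print([ln.box for ln in find_support("$1,250.00", page)])
# Unsupported answer: route to a human.
print(needs_review("$9,999.00", page))
```

The returned boxes are what a review interface would highlight, so a reviewer checks a small image region instead of rereading the whole page.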

Resources

Live Demo: Try GutenOCR on your own documents.
Paper (arXiv): Full technical report.
Code (GitHub): Training code and model release.
GutenOCR-3B (HuggingFace): 3B parameter model weights.
GutenOCR-7B (HuggingFace): 7B parameter model weights.

Citation

@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
      title={GutenOCR: A Grounded Vision-Language Front-End for Documents},
      author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
      year={2026},
      eprint={2601.14490},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14490},
}

Related Work

PubMed-OCR: The large-scale annotation dataset used to train GutenOCR’s layout-aware grounding capabilities.
LLMs for Page Stream Segmentation: Complementary work on document understanding at the page-stream level.
The Evolution of Page Stream Segmentation: Rules to LLMs: Background on the history and evolution of document processing pipelines.
The Reliability Trap: When 99% Accuracy Isn’t Enough: Explores calibration challenges in deployed page stream segmentation (PSS) systems, which is directly relevant to GutenOCR’s deployment context as an OCR front-end.
