DeepSeek has just open-sourced DeepSeek-OCR 2, its specialized model for OCR scenarios, with the technical report released at the same time. The model is an upgrade over last year's DeepSeek-OCR, featuring a novel encoder that lets the model view images and read documents in a more human-like order rather than following a mechanical scanner pattern.
Simply put, where previous models swept images exhaustively from top-left to bottom-right, DeepSeek-OCR 2 can comprehend document structure and read it step by step according to its logical organization. This new approach to visual understanding allows DeepSeek-OCR 2 to better interpret complex layouts, mathematical formulas, and tabular data.
On the document understanding benchmark OmniDocBench v1.5, DeepSeek-OCR 2 scored 91.09%, a 3.73% improvement over DeepSeek-OCR despite using identical training data and encoder configurations. Among end-to-end OCR models this is state-of-the-art performance, though it slightly trails Baidu's PaddleOCR-VL pipeline (92.86%).
At the same time, under similar visual-token budgets, DeepSeek-OCR 2 achieves a lower edit distance (a measure of how much correction the output text needs) in document parsing than Gemini-3 Pro, showing that it maintains a high visual-token compression rate while still delivering superior performance.
DeepSeek-OCR 2 offers dual value: it serves both as an exploratory platform for novel VLM (Vision Language Model) architectures and as a practical tool for generating high-quality pre-training data for large language model training.
From an architectural perspective, DeepSeek-OCR 2 inherits the overall framework of DeepSeek-OCR, which consists of encoder and decoder components. The encoder discretizes images into visual tokens, while the decoder generates outputs based on these visual tokens and text prompts.
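As a rough mental model, the two-component flow can be sketched as below. This is an illustrative sketch only; the class and attribute names are assumptions, not DeepSeek's released code.

```python
import torch
import torch.nn as nn

class EncoderDecoderOCR(nn.Module):
    """Illustrative wiring of the encoder-decoder flow described above."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # image -> compressed visual tokens
        self.decoder = decoder  # visual tokens + prompt embeddings -> output token logits

    def forward(self, image: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.encoder(image)                         # [B, N_vis, D]
        decoder_inputs = torch.cat([vision_tokens, prompt_embeds], dim=1)
        return self.decoder(decoder_inputs)                         # autoregressive decoding over this sequence
```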
The crucial difference lies in the encoder: DeepSeek upgraded the previous DeepEncoder to DeepEncoder V2, which retains all original capabilities but replaces the CLIP-based encoder with an LLM-based architecture while introducing causal reasoning through novel design elements.
DeepEncoder V2 addresses a core issue: when a two-dimensional structure is flattened into a one-dimensional sequence with a fixed linear order, the model's handling of spatial relationships is inevitably biased by that order.
The problem is especially acute for OCR, tables, forms, and other complex layouts, where the linear order often diverges sharply from the actual semantic organization, limiting the model's ability to represent visual structure.
How does DeepEncoder V2 mitigate this issue? It first employs a visual tokenizer for efficient image representation: windowed attention yields roughly 16x token compression, preserving sufficient local and medium-scale visual information while sharply reducing the compute and memory overhead of global attention.
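To make the roughly 16x figure concrete, the sketch below shows one simple way such a reduction could be implemented, by downsampling the patch grid 4x along each spatial axis. This only illustrates the token-budget arithmetic; the actual DeepEncoder V2 tokenizer (windowed attention included) is more involved, and its exact design is not spelled out here.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Hypothetical 16x token reducer: 4x downsampling along each spatial axis."""
    def __init__(self, dim: int):
        super().__init__()
        self.down = nn.Conv2d(dim, dim, kernel_size=4, stride=4)   # H x W -> H/4 x W/4

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: [B, D, H, W] patch features -> [B, (H/4)*(W/4), D] visual tokens
        x = self.down(feat)
        return x.flatten(2).transpose(1, 2)

feat = torch.randn(1, 768, 64, 64)           # 64*64 = 4096 patch positions
tokens = TokenCompressor(768)(feat)
print(tokens.shape)                          # torch.Size([1, 256, 768]): 16x fewer tokens
```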
Rather than relying on position encodings to dictate the semantic order of visual tokens, it introduces causal queries that reorder and distill the visual tokens through a content-aware mechanism. The resulting sequence is not determined by a spatial unfolding rule but is generated progressively by the model after it has observed the global visual context, avoiding a strong dependency on any fixed one-dimensional ordering.
Each causal query can attend to all visual tokens and previous queries, enabling semantic reordering and information distillation of visual features while maintaining constant token count. Ultimately, only the outputs from causal queries are fed into the downstream LLM decoder.
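The attention pattern described here (each query sees every visual token plus all earlier queries) can be sketched with a standard attention mask. The layer sizes, single-layer setup, and initialization below are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CausalQueryBlock(nn.Module):
    """Sketch: learnable queries attend to all visual tokens and to previous queries."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        B, N, _ = vis_tokens.shape
        M = self.queries.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)            # [B, M, D]
        kv = torch.cat([vis_tokens, q], dim=1)                     # keys/values: visual tokens + queries
        # Boolean mask, True = blocked: visual tokens are always visible,
        # while query i may only see queries 0..i (causal part on the right block).
        mask = torch.zeros(M, N + M, dtype=torch.bool, device=vis_tokens.device)
        mask[:, N:] = torch.triu(
            torch.ones(M, M, dtype=torch.bool, device=vis_tokens.device), diagonal=1
        )
        out, _ = self.attn(q, kv, kv, attn_mask=mask)
        return out                                                 # only these query outputs reach the decoder
```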
This design essentially creates a two-stage cascaded causal reasoning process: first, the encoder performs semantic ordering of unordered visual tokens through causal queries, followed by the LLM decoder executing autoregressive reasoning on this ordered sequence.
Compared to approaches that forcibly impose spatial ordering through position encoding, the sequence induced by causal queries better aligns with visual semantics itself, matching normal human reading patterns.
Since DeepSeek-OCR 2 focuses primarily on encoder improvements, the decoder was left untouched: DeepSeek retained DeepSeek-OCR's decoder, a 3B-parameter MoE model with roughly 500 million active parameters.
To validate the design, the research team trained DeepSeek-OCR 2 in three stages: encoder pre-training, query enhancement, and decoder specialization.
The first stage enabled the visual tokenizer and LLM-style encoder to acquire basic capabilities for feature extraction, token compression, and token reordering. The second stage further enhanced the encoder's token reordering ability while improving visual knowledge compression. The third stage froze encoder parameters and optimized only the decoder, achieving higher data throughput under identical FLOPs.
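A minimal sketch of the stage-3 setup (freeze the encoder, optimize only the decoder) might look like the following; the attribute names `model.encoder` / `model.decoder` and the learning rate are assumptions, since the report's training details are not reproduced here.

```python
import torch
import torch.nn as nn

def make_stage3_optimizer(model: nn.Module, lr: float = 1e-5) -> torch.optim.AdamW:
    """Freeze the encoder and return an optimizer over decoder parameters only."""
    for p in model.encoder.parameters():
        p.requires_grad_(False)            # encoder stays fixed in stage 3
    model.encoder.eval()                   # also disable dropout/BatchNorm updates there
    trainable = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```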
For model evaluation, DeepSeek selected OmniDocBench v1.5 as the primary benchmark. This benchmark contains 1,355 document pages covering 9 major categories in Chinese and English (including magazines, academic papers, research reports, etc.).
DeepSeek-OCR 2 reached 91.09% using only the smallest visual-token cap (V-token max). Compared with the DeepSeek-OCR baseline trained on similar data sources, that is a 3.73% improvement, validating the effectiveness of the new architecture.
Beyond overall improvement, the edit distance for reading order significantly decreased (from 0.085 to 0.057), indicating that the new DeepEncoder V2 can effectively select and arrange initial visual tokens based on image information.
Under a similar visual-token budget (1120), DeepSeek-OCR 2 (0.100) achieved a lower document-parsing edit distance than Gemini-3 Pro (0.115), further showing that the new model keeps a high visual-token compression rate without sacrificing performance.
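For reference, the edit distances quoted above are of the length-normalized Levenshtein type: the number of character insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the text length, so lower is better. The snippet below illustrates the idea; the benchmark's exact normalization and matching rules may differ.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, n)

print(round(normalized_edit_distance("DeepSeek-OCR 2", "DeepSeek-OCR2"), 3))  # 0.071
```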
However, DeepSeek-OCR 2 is not a cure-all. On ultra-high-density newspaper text, its recognition accuracy lags behind what it achieves on other text types. The issue could potentially be addressed by increasing the number of local crops or by providing more training samples.
DeepEncoder V2 provides preliminary validation of the feasibility of LLM-style encoders for visual tasks. More importantly, DeepSeek's research team believes the architecture has the potential to evolve into a unified full-modal encoder capable of compressing text, extracting speech features, and reorganizing visual content within the same parameter space.
DeepSeek describes DeepSeek-OCR's optical compression as an initial step toward native multimodality; future research will continue to explore integrating additional modalities through this shared encoder framework, potentially marking the start of a new line of VLM architectures.