DeepSeek-OCR: Reducing the Number of Tokens with Visual Context Compression

🧠 Stage 1 – Analysis and Understanding

Main Technical Topic: Optical context compression, which lowers the computational cost of processing long documents in large language models and visual-language models.

Solved Problem: As documents get longer, the growing token count drives up memory consumption, processing time, and cost. DeepSeek-OCR reduces the token count by representing pages as visual tokens instead of text tokens.

Processing Steps:

Converting document pages into visual tokens with DeepEncoder.
Reconstructing the text from these tokens with the DeepSeek-3B-MoE decoder.
Testing performance on benchmarks like OmniDocBench and Fox.

Brief Technical Summary: DeepSeek-OCR compresses page images into a small number of visual tokens, using 7–20 times fewer tokens than a text-based representation. OCR accuracy is ≈ 97% at compression ratios below 10× and drops to ≈ 60% at 20×.
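The compression ratio used throughout this guide is the number of text tokens a page would need divided by the number of visual tokens actually used. A minimal sketch of that arithmetic follows; the function name and the example token counts are illustrative and not taken from the DeepSeek-OCR codebase.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each visual token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 900 text tokens encoded as 100 visual tokens.
ratio = compression_ratio(900, 100)
print(f"{ratio:.1f}x compression")  # 9.0x compression
```

A ratio of 9× sits just inside the range where the guide reports ≈ 97% accuracy.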

💡 What You Will Learn in This Guide

In this article, you will learn the architecture of DeepSeek-OCR, how it is trained, and in which scenarios it works more efficiently than traditional OCR systems.

⚙️ Architectural Overview

🔸 DeepEncoder – Visual Tokenization

It encodes high-resolution page images with low memory usage.

Local Attention (SAM – Segment Anything Model, 80 M parameters): Captures page layout and fine details.
Global Attention (CLIP – Contrastive Language–Image Pretraining, 300 M parameters): Extracts semantic features from visual tokens.

🔸 DeepSeek-3B-MoE-A570M – Decoder

It uses a mixture-of-experts (MoE) architecture with 3 billion parameters, of which only ≈ 570 M are active during inference. As a result, it runs at close to the speed of a small model while keeping accuracy closer to that of a larger one.
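To illustrate why only a fraction of the parameters is active per token, here is a toy sketch of top-k expert routing. The expert count, gate scores, and function names are invented for the example, and details of the real decoder (such as shared experts or load balancing) are not reproduced.

```python
import math


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts, gate_logits: list[float], k: int) -> float:
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy setup: four "experts" that each scale the input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 2.0]  # hypothetical router scores for one token

out = moe_forward(1.0, experts, gate_logits, k=2)
print(out)  # 3.0 (experts 1 and 3 each get weight 0.5)

# Active share of parameters implied by the figures in this guide.
print(f"{570 / 3000:.0%} of parameters active")  # 19% of parameters active
```

Only two of the four experts are evaluated for this token, which is the same idea that lets a 3B-parameter model run with ≈ 570 M active parameters.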

🧩 Training Data

More than 30 million PDF pages in 100+ languages.
OCR 2.0 data containing 10M charts, 5M chemical formulas, and 1M plane geometry figures. This data lets the model also interpret non-text elements (tables, formulas, diagrams) correctly.

📊 Performance and Benchmarks

Compression Ratio	OCR Accuracy	Suitable Use
< 10×	≈ 97%	Training data creation and production processes
20×	≈ 60%	Archiving and secondary use

On OmniDocBench, DeepSeek-OCR beats GOT-OCR 2.0 (256 tokens/page) using only 100 visual tokens per page. With fewer than 800 visual tokens per page, it also outperforms MinerU 2.0, which uses more than 6000 tokens per page on average.

🧠 Application Areas

Large-scale document digitization (archives, law, libraries)
Creating LLM training data (labeled image-text pairs)
Structured data extraction (tabular or scientific documents)
Multilingual document processing (100+ languages supported)

💻 Quick Setup Example

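import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code=True is required: the model class and its infer() helper
# ship with the checkpoint rather than with the transformers library.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",  # needs the flash-attn package
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\nFree OCR. "
image_file = "document.png"
output_path = "ocr_output"

# A plain AutoTokenizer does not accept images, so inference goes through the
# infer() method defined in the checkpoint's remote code (see the model card).
# base_size=640, image_size=640, crop_mode=False corresponds to the Small mode.
model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=640,
    image_size=640,
    crop_mode=False,
    save_results=True,  # write the recognized text under output_path
)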

This example loads the model and runs OCR on a single document image, recovering the text from its compressed visual tokens.
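For large-scale digitization, the single-image example extends naturally to a folder of scans. The sketch below is model-agnostic: `ocr_directory` and its `ocr_fn` parameter are hypothetical names, and `ocr_fn` stands for whatever function you write to wrap the model call for one image path.

```python
from pathlib import Path
from typing import Callable, Dict

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_directory(image_dir: str, ocr_fn: Callable[[str], str]) -> Dict[str, str]:
    """Run ocr_fn on every image in image_dir and return {file name: text}."""
    results: Dict[str, str] = {}
    # Sort for a deterministic processing order; skip non-image files.
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr_fn(str(path))
    return results
```

Passing the OCR step in as a function keeps the loop testable without a GPU, since a stub can replace the model during development.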

⚙️ Resolution Modes

Mode	Resolution	Visual Tokens	Typical Use
Tiny	512×512	64	Quick preview
Small	640×640	100	Standard document
Base	1024×1024	256	High resolution
Large	1280×1280	400	Complex page layout
Gundam	Dynamic (n tiles of 640×640 plus one 1024×1024 global view)	n×100 + 256 (under 800 on average on OmniDocBench)	Multi-column dense pages
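The token counts of the four fixed-resolution modes follow from simple arithmetic if one assumes 16×16-pixel patches followed by a 16× token compressor inside DeepEncoder, as described in the DeepSeek-OCR paper. Neither figure is stated in this guide, so treat both constants as assumptions; the sketch only checks that they reproduce the table.

```python
PATCH_SIZE = 16   # assumed patch edge length in pixels
COMPRESSION = 16  # assumed token reduction factor inside DeepEncoder


def vision_tokens(side: int) -> int:
    """Visual tokens for a square image with the given edge length in pixels."""
    patches = (side // PATCH_SIZE) ** 2
    return patches // COMPRESSION


for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(f"{name}: {side}x{side} -> {vision_tokens(side)} tokens")
# Tiny: 512x512 -> 64 tokens
# Small: 640x640 -> 100 tokens
# Base: 1024x1024 -> 256 tokens
# Large: 1280x1280 -> 400 tokens
```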

⚠️ Limitations and Cautions

Accuracy drops at compression ratios above 10×, reaching ≈ 60% at 20×.
Manual review may be required for complex newspaper layouts.
An NVIDIA GPU with CUDA support is required.
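One way to act on the 10× limit is to pick the smallest resolution mode that keeps the compression ratio below it. The sketch below uses the token counts from the Resolution Modes table. The function name is hypothetical, the caller must supply an estimate of the page's text token count, and falling back to Gundam is a choice of this sketch rather than documented behaviour.

```python
# (mode name, visual tokens) for the fixed-resolution modes, smallest first.
MODES = [("Tiny", 64), ("Small", 100), ("Base", 256), ("Large", 400)]
MAX_RATIO = 10.0  # accuracy is reported to drop above this compression ratio


def pick_mode(text_tokens: int, max_ratio: float = MAX_RATIO) -> str:
    """Return the smallest mode whose compression ratio stays below max_ratio."""
    for name, tokens in MODES:
        if text_tokens / tokens < max_ratio:
            return name
    return "Gundam"  # no fixed mode is large enough; use dynamic tiling


print(pick_mode(500))   # Tiny   (500 / 64  ≈ 7.8)
print(pick_mode(900))   # Small  (900 / 100 = 9.0)
print(pick_mode(3000))  # Large  (3000 / 400 = 7.5)
print(pick_mode(5000))  # Gundam (5000 / 400 = 12.5)
```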

❓ Frequently Asked Questions

What is DeepSeek-OCR?

It is an open-source OCR system built on a visual-language model. It represents pages with 7–20 times fewer tokens than their text would need, which lowers computational cost.

How does it provide high efficiency?

Instead of splitting a page's text into text tokens, it encodes the page image into a small number of visual tokens.

Which languages does the training data cover?

It was trained on documents in more than 100 languages (mainly Chinese and English) and on a variety of document types.

What are its areas of use?

Large-scale digitization, LLM training data creation, and financial and scientific document analysis.

🏁 Result

DeepSeek-OCR approaches document processing through optical context compression. It uses 7–20 times fewer tokens than text-based representations and keeps OCR accuracy at ≈ 97% when the compression ratio stays below 10×, which makes it well suited to AI training data generation and archiving. 💡 You can try the model yourself on GenixNode GPU servers!
