DeepSeek-OCR: Reducing the Number of Tokens with Visual Context Compression
🧠 Overview
Main Technical Topic: Optical context compression, which reduces the high computational cost of processing long documents in large language models and vision-language models.
Problem Solved: As documents get longer, the number of text tokens grows, and with it memory consumption, processing time, and cost. DeepSeek-OCR reduces this by representing pages as a small number of visual tokens.
Pipeline Steps:
- Converting document pages into visual tokens with DeepEncoder.
- Reconstructing the text from these tokens with the DeepSeek-3B-MoE decoder.
- Testing performance on benchmarks like OmniDocBench and Fox.
Brief Technical Summary: DeepSeek-OCR compresses page images into a small number of visual tokens, using 7–20 times fewer tokens than the equivalent text representation. OCR accuracy is about 97% at compression ratios below 10× and falls to about 60% at 20×.
💡 What You Will Learn in This Guide
In this article, you will learn the architecture of DeepSeek-OCR, how it is trained, and in which scenarios it works more efficiently than traditional OCR systems.
⚙️ Architectural Overview
🔸 DeepEncoder – Visual Tokenization
It processes high-resolution page images while keeping memory usage low, using two attention stages:
- Local Attention (SAM – Segment Anything Model, 80 M parameters): Captures page layout and fine details.
- Global Attention (CLIP – Contrastive Language–Image Pretraining, 300 M parameters): Extracts semantic features from visual tokens.
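To see why this two-stage design keeps memory low, the sketch below counts tokens before and after compression. It assumes a 16×16 patch size and a 16× token compressor between the local and global stages, as described in the DeepSeek-OCR paper; neither figure appears in this guide, and the function names are illustrative only.

```python
def patch_tokens(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens seen by the local-attention stage (assumed 16x16 patches)."""
    side = resolution // patch_size
    return side * side


def compressed_tokens(resolution: int, patch_size: int = 16, compression: int = 16) -> int:
    """Number of visual tokens left for the global-attention stage (assumed 16x reduction)."""
    return patch_tokens(resolution, patch_size) // compression


def attention_pairs(num_tokens: int) -> int:
    """Global self-attention compares every token with every other token."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    res = 1024
    before = patch_tokens(res)
    after = compressed_tokens(res)
    print(f"{res}x{res}: {before} patch tokens -> {after} visual tokens")
    print(f"global attention pairs: {attention_pairs(before)} vs {attention_pairs(after)}")
```

Under these assumptions, the expensive global attention runs only on the compressed tokens, so its quadratic cost applies to hundreds of tokens rather than thousands.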
🔸 DeepSeek-3B-MoE-A570M – Decoder
It uses a mixture-of-experts (MoE) architecture with 3 billion parameters, of which only about 570M are active during inference. This lets it generate output much faster than a dense model of the same total size, with accuracy similar to larger models.
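The idea of activating only part of the network can be illustrated with a generic top-k routing sketch. This is not DeepSeek's actual router: the expert count, `k`, and the function name `top_k_route` are hypothetical and chosen only for illustration.

```python
import math
import random


def top_k_route(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_scores)
    exps = [math.exp(s - peak) for s in router_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the others are never evaluated.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


if __name__ == "__main__":
    random.seed(0)
    num_experts, k = 8, 2  # hypothetical sizes, not the model's real configuration
    scores = [random.gauss(0.0, 1.0) for _ in range(num_experts)]
    print(top_k_route(scores, k))
    # Share of parameters active per token, using the figures quoted in the text.
    print(f"active share: {570e6 / 3e9:.0%}")
```

Because each token only passes through the selected experts, the compute per token scales with the active parameters (about 570M) rather than the full 3 billion.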
🧩 Training Data
- 30+ million PDF pages, 100+ languages.
- OCR 2.0 data containing 10M charts, 5M chemical formulas, and 1M geometric figures. This lets the model also interpret non-text elements (tables, formulas, diagrams) correctly.
📊 Performance and Benchmarks
| Compression Ratio | OCR Accuracy | Recommended Use |
|---|---|---|
| < 10× | ≈ 97% | Training data generation and production workloads |
| 20× | ≈ 60% | Archiving and secondary use |
On OmniDocBench, DeepSeek-OCR beats GOT-OCR 2.0 while using only 100 visual tokens per page. With fewer than 800 tokens per page, it also outperforms MinerU 2.0, which uses more than 6,000 tokens per page.
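The compression ratio in the table above is simply the number of text tokens a page would need divided by the number of visual tokens used to encode it. The following sketch makes that arithmetic explicit. The helper names are hypothetical, and the description of the range between 10× and 20× is an assumption, since the table only reports the two endpoints.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens the page would need, divided by visual tokens actually used."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def expected_regime(ratio: float) -> str:
    """Map a ratio onto the two operating points reported in the table."""
    if ratio < 10:
        return "about 97% accuracy"
    if ratio <= 20:
        # Assumption: accuracy degrades between the two reported points.
        return "accuracy falling toward about 60%"
    return "beyond the reported range"


if __name__ == "__main__":
    for text_tokens in (900, 2000):
        ratio = compression_ratio(text_tokens, vision_tokens=100)
        print(f"{text_tokens} text tokens / 100 visual tokens = {ratio:.1f}x: {expected_regime(ratio)}")
```

In practice this means the token budget should be chosen per document: a dense page may need a higher-resolution mode to stay below the 10× threshold.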
🧠 Application Areas
- Large-scale document digitization (archives, law, libraries)
- Creating LLM training data (labeled image-text pairs)
- Structured data extraction (tabular or scientific documents)
- Multilingual document processing (100+ languages supported)
💻 Quick Setup Example
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\nFree OCR."
# The tokenizer alone does not accept images. The model's remote code
# provides infer(), which loads and preprocesses the image itself
# (usage as shown on the model card; verify against your installed version).
model.infer(
    tokenizer,
    prompt=prompt,
    image_file="document.png",
    output_path="output",  # recognized text is written here
    base_size=1024,
    image_size=1024,
    crop_mode=False,
    save_results=True,
)
This example runs OCR on a document image: the page is encoded as compressed visual tokens, and the text is then decoded from them.
⚙️ Resolution Modes
| Mode | Resolution | Visual Tokens | Typical Use |
|---|---|---|---|
| Tiny | 512×512 | 64 | Quick preview |
| Small | 640×640 | 100 | Standard document |
| Base | 1024×1024 | 256 | High resolution |
| Large | 1280×1280 | 400 | Complex page layout |
| Gundam | Dynamic | 795+ | Multi-column dense pages |
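When calling the model, each mode in the table corresponds to a set of resolution arguments. The mapping below follows the settings listed on the model card for the model's `infer()` method (`base_size`, `image_size`, `crop_mode`); treat it as a sketch and check the model card for the version you install. `MODE_PRESETS` and `infer_kwargs` are hypothetical helpers, not part of the library.

```python
MODE_PRESETS = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # Gundam tiles the page into 640x640 crops plus one 1024x1024 global view.
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def infer_kwargs(mode: str) -> dict:
    """Return a copy of the resolution arguments for a named mode."""
    key = mode.lower()
    if key not in MODE_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_PRESETS)}")
    return dict(MODE_PRESETS[key])


if __name__ == "__main__":
    for name in MODE_PRESETS:
        print(name, infer_kwargs(name))
```

The returned dictionary can be unpacked into the call, for example `model.infer(tokenizer, prompt=prompt, image_file="document.png", output_path="output", **infer_kwargs("small"))`.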
⚠️ Limitations and Cautions
- Accuracy decreases at compression ratios above 10×.
- Complex newspaper layouts may require manual review.
- An NVIDIA GPU with CUDA support is required.
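Given the GPU requirement, it can help to fail early with a clear message before downloading the model weights. The check below uses PyTorch's standard `torch.cuda.is_available()`; the wrapper name `has_cuda_gpu` is illustrative.

```python
def has_cuda_gpu() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch the model cannot be loaded at all.
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    if has_cuda_gpu():
        print("CUDA GPU detected; the model can be loaded.")
    else:
        print("No CUDA GPU detected; DeepSeek-OCR inference as shown here will not run.")
```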
❓ Frequently Asked Questions
What is DeepSeek-OCR?
It is an open-source OCR system built on a vision-language model. Through visual tokenization, it reduces the number of tokens needed to represent a document by 7–20 times.
How does it provide high efficiency?
Instead of representing the whole page as text tokens, it compresses the page image into a small number of visual tokens.
Which languages does the training data cover?
It is trained on documents in more than 100 languages (mainly Chinese and English) and on a variety of document types.
What are its areas of use?
Large-scale digitization, LLM training data generation, and financial and scientific document analysis.
🏁 Conclusion
DeepSeek-OCR approaches document processing through optical context compression: it represents pages with 7–20 times fewer tokens while keeping accuracy near 97% at compression ratios below 10×. This makes it well suited to AI training data generation and archiving. 💡 You can try high-performance OCR yourself by testing the model on GenixNode GPU servers.

