Insight·10 min read·26 May 2026

Technology · Finance

Why Vision Transformers Beat OCR for Document Understanding

A head-to-head benchmark on real annual-report pages — comparing EasyOCR and docTR against Gemini 3.1 Flash Lite for text recall, structure preservation, and visual reasoning.

TL;DR

OCR was designed to convert pixels of letters into strings of letters. That's it. A modern vision transformer (ViT) does not have that limitation — it ingests a page the way a human does: as a 2-D arrangement of words, tables, charts, icons, and pictures, and reasons over all of it jointly.

On four representative annual-report pages, the gap is large enough to settle the question — and the multimodal model on the ViT side is Google's smallest, cheapest current option (gemini-3.1-flash-lite), not a flagship.

Numeric recall on a financial table: EasyOCR captured 0 of 12 ground-truth values. Gemini 3.1 Flash Lite captured 12 of 12, preserved as a properly aligned markdown table.
Structure preservation: Both OCR engines flatten multi-column tables into noisy single-column blobs. The ViT reconstructs row labels, columns, and values in their original layout.
Visual elements: Neither OCR engine emits a single character about pie charts, world maps, infographics, or photographs. The ViT describes what they show and the values they encode (e.g. "75% Domestic Sales / 25% Export Sales").
Latency is competitive: Flash Lite finishes a page in about six seconds via API — faster than EasyOCR on three of four test pages on a CPU box.

For a RAG pipeline that retrieves pages from a noisy mix of investor presentations, prospectuses and annual reports, the question is no longer "should we OCR everything?" — it is "why are we still OCR'ing anything that contains a chart?"

Capability heat-map across five document-understanding tasks. OCR is a one-trick pony; the ViT covers the full surface.

What OCR was built to do

Traditional OCR pipelines (Tesseract, ABBYY) and modern deep-learning OCR (EasyOCR, PaddleOCR, docTR) share the same task formulation:

Detection — find bounding boxes around regions that look like text.
Recognition — for each box, decode the pixels into Unicode characters.
Aggregation — concatenate the boxes in approximate reading order.

Crucially, an OCR engine has no opinion about whether a region is a header, a footnote, a table cell, a chart axis label, or a country name floating on a map. There is no semantic output channel. Anything that isn't a glyph is silently dropped on the floor.
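A minimal sketch of the aggregation stage makes the point concrete. The `Box` type, the coordinates, and the `aggregate_reading_order` helper are illustrative assumptions, not the internals of any engine named above. What matters is the return type: a plain string, with nowhere to record what kind of region each token came from.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One recognised text region: top-left corner plus decoded text."""
    x: int
    y: int
    text: str


def aggregate_reading_order(boxes):
    """Concatenate boxes top-to-bottom, then left-to-right.

    The result is a plain string: there is no field that could say
    "header", "table cell" or "axis label".
    """
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return "\n".join(b.text for b in ordered)


# Hypothetical coordinates for two rows of a four-column table.
boxes = [
    Box(450, 100, "1,535.52"), Box(50, 100, "Not Due"),
    Box(600, 100, "No"), Box(300, 100, "0%"),
    Box(50, 130, "Less than 6 months"), Box(300, 130, "3%-16%"),
    Box(450, 130, "1,130.74"), Box(600, 130, "No"),
]

flattened = aggregate_reading_order(boxes)
print(flattened)
```

The printed stream keeps the tokens in row-major order but carries no column boundaries, which is the same shape as the docTR output shown in Example 1.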

For documents that look like a Word page printed to PDF, that's fine. For documents that look like an investor presentation or an annual-report design spread, that's catastrophic.

The benchmark

Four pages from real annual reports were rendered at 144 DPI and run through three engines on the identical input bytes:

EasyOCR 1.7.2 — PyTorch-based, popular open-source OCR.
docTR 1.0.1 — Mindee's deep-learning OCR (FAST text detector + CRNN recogniser).
Gemini 3.1 Flash Lite — Google's smallest current multimodal model, called once with the page image and an extraction prompt (temperature=0.1).

PaddleOCR 3.5 was also installed and attempted but failed on this Windows machine with a PIR/oneDNN runtime error — a small reminder that "popular OCR library" is not the same as "stable production dependency".

The four pages were chosen to stress different document-understanding skills:

Page	What's on it	What it stresses
Chart visual	World map of export markets, pie chart, certification thumbnails	Non-textual content
Financial table	Two stacked credit-risk tables with merged headers	Tabular structure
Dense text	Plain accounting policy commentary on PP&E	Fairness check — pure text
Image captions	Eight project photos in a 4×2 grid with captions	Spatial caption-image binding

Example 1 — A financial table

Page 118 of Imagine Marketing's standalone financial statements contains two credit-risk tables. The columns are weighted average loss rate, gross carrying amount, and credit impaired. The rows are ageing buckets.

Financial table page — Input page (Imagine Marketing / boAt — DRHP, page 118).

Here is the table region as each engine produced it.

EasyOCR output

53551
1Su,+
67u-Su u
J-fcr;
2-Ycan:
Mote thun  YWc
15437
393.85
83.97
79.13
14.22
2.99

docTR output

NotI Due
0%
1,535.52 No
Lesst than6months
3%-16%
1,130.74 No
6Monthstol Year
6%-50%
33.50 No
1-2Years
100%
74.84 No
2-3Years
100%
1.20 No
Moret than3 3Years
100%
14.21 Yes

Gemini 3.1 Flash Lite output

| Not Due            | 0%     | 1,535.52 | No  |
| Less than 6 months | 3%-16% | 1,130.74 | No  |
| 6 Months to 1 Year | 6%-50% |    33.50 | No  |
| 1 - 2 Years        | 100%   |    74.84 | No  |
| 2 -3 Years         | 100%   |     1.20 | No  |
| More than 3 Years  | 100%   |    14.21 | Yes |

(plus a second table for 31 March 2024 with all
12 ground-truth values aligned to the correct row
and column — and the prose that follows the table,
verbatim.)

Three things to notice:

EasyOCR mangled the digits. "1,535.52" became "53551": the comma and decimal point were dropped and digits were misread, producing a string that's worse than useless in a downstream financial pipeline. Result: 0 / 12 numbers recovered.
docTR captured every digit correctly but emitted them as a vertical stream of tokens, so the column boundaries are gone. A downstream regex parser can recover the values, but the mapping from each value to its (row, column) cell is now ambiguous. Result: 12 / 12 numbers, structure lost.
The ViT reconstructed the table. Row labels, column headers, and values are aligned, and the Markdown can be parsed straight into a DataFrame. Result: 12 / 12 numbers, structure preserved.

Numeric recall comparison — Exact-match numeric recall on the 12 ground-truth values in the credit-risk table.
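For readers who want to reproduce the metric, here is one way exact-match numeric recall could be scored. This is a sketch under assumptions: the regex, the comma normalisation, and the `numeric_recall` helper are illustrative choices, and only the six values visible in the excerpts above are used as ground truth (the other six, from the second table, are not reproduced in this article).

```python
import re

# Matches decimal amounts such as "1,535.52" or "33.50".
NUMBER = re.compile(r"\d[\d,]*\.\d+")


def numeric_recall(engine_text, ground_truth):
    """Return (hits, total): how many ground-truth values appear verbatim."""
    found = {m.replace(",", "") for m in NUMBER.findall(engine_text)}
    hits = sum(1 for value in ground_truth if value.replace(",", "") in found)
    return hits, len(ground_truth)


# The six amounts shown in the first table of the excerpt.
ground_truth = ["1,535.52", "1,130.74", "33.50", "74.84", "1.20", "14.21"]

easyocr_text = """53551
1Su,+
67u-Su u
J-fcr;
2-Ycan:
Mote thun  YWc
15437
393.85
83.97
79.13
14.22
2.99"""

doctr_text = """NotI Due
0%
1,535.52 No
Lesst than6months
3%-16%
1,130.74 No
6Monthstol Year
6%-50%
33.50 No
1-2Years
100%
74.84 No
2-3Years
100%
1.20 No
Moret than3 3Years
100%
14.21 Yes"""

easyocr_score = numeric_recall(easyocr_text, ground_truth)
doctr_score = numeric_recall(doctr_text, ground_truth)
print("EasyOCR:", easyocr_score, "docTR:", doctr_score)
```

Note that this metric is blind to structure by design: docTR scores full marks on it even though the row and column assignment of each value is lost.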

The deeper point: even when an OCR engine returns "all the numbers", it returns them without the spatial scaffolding that gives them meaning. 1,535.52 is just a number. "Not Due / 0% / 1,535.52 / No" is a fact. OCR cannot make that leap because reading order is fundamentally a 1-D model imposed on a 2-D artefact.
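To make the "number versus fact" distinction concrete, the sketch below turns the Markdown rows from the Gemini excerpt into records. The column names are assumptions inferred from the table description earlier in this example (the excerpt does not show the header row), and `parse_markdown_rows` is a hypothetical helper written in plain Python rather than a library call.

```python
def parse_markdown_rows(markdown, columns):
    """Turn pipe-delimited Markdown rows into a list of dicts."""
    records = []
    for line in markdown.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != len(columns):
            continue  # not a data row for this table
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # Markdown separator row such as |---|---|
        records.append(dict(zip(columns, cells)))
    return records


# Assumed column names; the rows are copied from the excerpt above.
columns = ["ageing_bucket", "loss_rate", "gross_carrying_amount", "credit_impaired"]

table_md = """
| Not Due            | 0%     | 1,535.52 | No  |
| Less than 6 months | 3%-16% | 1,130.74 | No  |
| 6 Months to 1 Year | 6%-50% |    33.50 | No  |
| 1 - 2 Years        | 100%   |     1.20 | No  |
| 2 -3 Years         | 100%   |     1.20 | No  |
| More than 3 Years  | 100%   |    14.21 | Yes |
""".replace("|     1.20 | No  |\n| 2 -3", "|    74.84 | No  |\n| 2 -3", 1)

records = parse_markdown_rows(table_md, columns)
amounts = [float(r["gross_carrying_amount"].replace(",", "")) for r in records]

# Each record binds a value to its row label and column name.
print(records[0])
```

Each record is a fact in the sense used above: the amount arrives already attached to its ageing bucket, loss rate, and impairment flag, so no geometric guesswork is needed downstream.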

Example 2 — Graph and chart understanding

This is where the gap stops being "noticeable" and becomes "categorical".

Annual report visual page — Goodluck India Annual Report 2024-25 — global-presence spread.

The page contains:

A world map with 27 numbered export-market locations.
A zoomed India inset showing warehouses (green) and manufacturing plants (blue).
A pie-chart-style infographic — 75% Domestic / 25% Export — encoded as icons with percentages.
Three scanned certification documents with their own small text.
A bulleted "Core Competencies" section and three certification callouts.

Ask any OCR engine "what is on this page?" and the answer is a bag of disconnected text tokens — country names floating without a map, "75%" and "25%" floating without a chart, "ISO 9001:2015" floating without a certificate.

EasyOCR output (excerpt)

Core
Competencies
Domestic &
industry knowledge & experience
Focus on high-margin; value-added products &
Export Sales
...
75%
Domestic Sales
2
25%
Export Sales
Canada
USA
Mexico
...
sis3]
484
Antted
Russia
@
FoRHAt
GOODLUCK ENGINEERING Co
Terulalin
53
OCCDLLCX INC A Limitedi
Export Market
IDoreleomant Daparimt
UAE
ECDWCX
REL
Irtt Edt
Tilteeuit
Acen
India
Warehouse
Central Boilars
JaaTor

Gemini 3.1 Flash Lite output (excerpt)

Domestic & Export Sales:
  Domestic Sales: 75% (icon of a hand holding a coin).
  Export Sales:   25% (icon of an airplane).

World map highlights 27 export-market locations.
A zoomed-in inset map of India shows
manufacturing plants and warehouses.

Export Market Locations:
1. Canada       10. Belgium     19. Ethiopia
2. USA          11. Germany     20. Kenya
3. Mexico       12. UK          21. Tanzania
4. Bolivia      ...             ...

India Map Legend:
  Manufacturing Plants (blue dots): Sikandrabad, Kutch
  Warehouses (green dots): Ludhiana, Faridabad,
                           Rudrapur, Nashik &
                           Aurangabad, Chennai

Certifications:
  - Bureau Veritas ISO 9001:2015
  - Engineers India Limited (Procurement Dev. Dept.)
  - Certificate of Approval for Well Known Forge

EasyOCR's output for the certification thumbnails is genuinely worth quoting in full: sis3] 484 Antted, FoRHAt, Terulalin 53, OCCDLLCX INC A Limitedi. The OCR engine is trying to do its job — read the pixels — but the pixels in question are a tiny, low-contrast scanned certificate; the optimal answer is not "transcribe the noise letter by letter", it is "this is a certification thumbnail; here's roughly what it certifies". OCR has no escape hatch to that answer. A ViT does.

This is the qualitative difference. Once a model can see the image of the page and not just the letters on it, the entire problem reformulates from "transcribe glyphs" to "describe what is being communicated". And the moment that reformulation happens, charts, diagrams, maps, infographics, and even decorative icons all become first-class queryable objects.

The pie-chart test. Ask the ViT "what is the ratio of domestic to export sales on this page?" — it answers 75 : 25. Ask the same question to EasyOCR's output and you'd have to write a regex to fish out the percentages, then guess (from word proximity) which one is domestic. The reasoning that a human does in 200 ms — "this big slice next to the icon of money means domestic sales" — is not a transformation an OCR pipeline can perform.
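The "regex plus proximity guess" workaround can be sketched in a few lines. The helper name and the neighbour-offset heuristic are assumptions for illustration; the input lines are the five contiguous lines around the percentages in the EasyOCR excerpt above.

```python
import re

PERCENT = re.compile(r"\d{1,3}%")


def pair_percentages(lines, offset):
    """Pair each percentage with the line `offset` positions away.

    offset=+1 guesses the label follows the value; offset=-1 guesses
    the label precedes it. Nothing in the text says which is right.
    """
    pairs = {}
    for i, line in enumerate(lines):
        if PERCENT.fullmatch(line):
            j = i + offset
            if 0 <= j < len(lines):
                pairs[line] = lines[j]
    return pairs


# Contiguous lines from the EasyOCR excerpt.
ocr_lines = ["75%", "Domestic Sales", "2", "25%", "Export Sales"]

label_after = pair_percentages(ocr_lines, offset=+1)
label_before = pair_percentages(ocr_lines, offset=-1)
print(label_after)
print(label_before)
```

The two guesses disagree: one pairs the percentages with the sales labels, the other pairs 25% with a stray "2". Which guess is correct depends on the page layout, and the layout is exactly the information the OCR output no longer contains.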

An honest caveat

Flash Lite is the smallest model in this family, and it does fumble one detail on this page. Its first sentence about the India inset map says "shows 5 manufacturing plants and 1 warehouse location". The legend two paragraphs down, which the model itself then reads correctly, says otherwise (2 manufacturing plants and 5 warehouses). The model contradicted itself inside the same response. For a downstream RAG model reading the full extraction, the correct values are still present and dominate. For a system that reads only the first paragraph, this would matter. If you feed the extraction directly into structured records, validate row counts and value ranges first.
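One cheap guard against this kind of self-contradiction is to compare any counts stated in prose against the itemised lists in the same response. The sketch below assumes the extraction has already been split into a summary sentence and a legend dictionary; the regex and helper names are hypothetical, and the legend entries are copied from the excerpt above.

```python
import re


def stated_counts(sentence):
    """Pull the plant and warehouse counts out of a summary sentence."""
    match = re.search(r"(\d+) manufacturing plants? and (\d+) warehouse", sentence)
    if match is None:
        return {}
    return {"plants": int(match.group(1)), "warehouses": int(match.group(2))}


def find_conflicts(stated, legend):
    """Return {key: (stated, listed)} wherever the two counts disagree."""
    listed = {key: len(items) for key, items in legend.items()}
    return {
        key: (stated[key], listed[key])
        for key in stated
        if key in listed and stated[key] != listed[key]
    }


summary = "shows 5 manufacturing plants and 1 warehouse location"
legend = {
    "plants": ["Sikandrabad", "Kutch"],
    "warehouses": ["Ludhiana", "Faridabad", "Rudrapur",
                   "Nashik & Aurangabad", "Chennai"],
}

conflicts = find_conflicts(stated_counts(summary), legend)
print(conflicts)
```

A non-empty result is a signal to trust the itemised list over the summary, or to re-run the extraction, before anything is written to a structured store.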

Example 3 — Project photos with captions

The second Goodluck spread shows eight engineering projects as photographs with one-line captions.

Project photos page — Engineering Excellence in Action — eight project photographs in a 4×2 grid.

Both OCR engines do a respectable job of pulling the captions (70-80% recall on caption text). But neither says a single word about what is in any of the eight photographs. The ViT, asked to describe the page, produces things like:

"Rapid Rail Corridor between Ashoknagar and Barapullah — a night-time view of a lit-up arched bridge structure with a crane working underneath."

"Mumbai Ahmedabad High Speed Rail — a night-time view of a large steel truss bridge structure illuminated by floodlights."

"Chennai Metro Rail Corporation elevated Viaduct — an elevated metro rail construction site with steel spans and green safety netting."

For a RAG pipeline answering questions like "show me Goodluck's bridge projects", the ViT output is directly indexable. The OCR output requires the user to already know which captions mention bridges — which defeats the point of retrieval. Note that the caption "Chennai Metro Rail Corporation elevated Viaduct — Steel Spans" doesn't itself contain the word "bridge", so an OCR-only index couldn't retrieve it for that query at all.
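A toy keyword index shows the effect. The records below use the three captions and descriptions quoted above; treating the text before the dash as the caption is an assumption, and `keyword_hits` is a deliberately naive substring search rather than a real retriever.

```python
projects = [
    {
        "caption": "Rapid Rail Corridor between Ashoknagar and Barapullah",
        "description": "a night-time view of a lit-up arched bridge structure "
                       "with a crane working underneath",
    },
    {
        "caption": "Mumbai Ahmedabad High Speed Rail",
        "description": "a night-time view of a large steel truss bridge "
                       "structure illuminated by floodlights",
    },
    {
        "caption": "Chennai Metro Rail Corporation elevated Viaduct - Steel Spans",
        "description": "an elevated metro rail construction site with steel "
                       "spans and green safety netting",
    },
]


def keyword_hits(query, records, fields):
    """Return captions of records whose chosen fields contain the query."""
    needle = query.lower()
    return [
        record["caption"]
        for record in records
        if any(needle in record[field].lower() for field in fields)
    ]


ocr_only = keyword_hits("bridge", projects, fields=["caption"])
with_descriptions = keyword_hits("bridge", projects, fields=["caption", "description"])
print(ocr_only)
print(with_descriptions)
```

With captions alone the query returns nothing; with the descriptions it returns the two rail projects. The Chennai viaduct still does not match a literal keyword search, since neither its caption nor the quoted description contains the word, so retrieving it would rely on embedding similarity rather than string matching.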

The numbers — across all four pages

Structural fidelity per page — Structural fidelity (0-100) by page type. The dense-text page is the only one where OCR is competitive — because there is no structure to lose.

The dense-text page is the fair-fight case: paragraph text, no tables, no charts. There OCR and ViT are roughly tied. That is the niche OCR was actually designed for — and it is also the niche that contains the smallest fraction of pages in a real financial document corpus.

Latency per page — Per-page latency, CPU-only for the OCR engines, API latency for the ViT.

On latency, Flash Lite is competitive with the OCR engines on a CPU box and faster than EasyOCR on three of four pages. That removes the most common deployment objection to multimodal-LLM document extraction — "we can't afford the latency budget". The cost side has the same shape: Lite-tier pricing is materially below the mid-tier multimodal models, so the per-page extraction cost is small enough to be a rounding error on most pipelines.
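For anyone repeating the latency comparison, a warm-cache measurement could be structured roughly as follows. This is a sketch, not the harness used for the figures here: the number of repeats, the use of the median, and the `fake_engine` stand-in are all assumptions, and a real run would pass a callable that wraps the OCR engine or the API call.

```python
import time
from statistics import median


def time_engine(extract, page_bytes, repeats=3):
    """Median wall-clock seconds per page after one untimed warm-up call."""
    extract(page_bytes)  # warm-up: model load, weight download, caches
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(page_bytes)
        samples.append(time.perf_counter() - start)
    return median(samples)


calls = []


def fake_engine(page_bytes):
    """Hypothetical stand-in for an OCR engine or API client."""
    calls.append(len(page_bytes))
    return page_bytes.decode("ascii")


latency = time_engine(fake_engine, b"page", repeats=3)
print(f"{latency:.6f} s per page over {len(calls) - 1} timed runs")
```

The warm-up call matters most for the local engines, where the first run includes loading weights; for an API-backed model it mainly absorbs connection setup.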

Why does a ViT do this when OCR doesn't?

There is no magic. There is a very different architectural choice.

OCR is a chain of narrow models

An OCR pipeline is a hand-engineered cascade: text detection → crop → recognition → reading-order heuristic → concatenate. Every stage produces lossy intermediate representations. The chart-axis annotation is dropped at detection (it's a number floating in whitespace). The pie-chart labels are kept but disconnected from the visual element they describe. The structure of a table is recovered, if at all, by post-processing bounding-box geometry — and only for table-like layouts the heuristic was tuned on.
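The geometric post-processing mentioned above usually amounts to something like the following: cluster boxes into rows by vertical position, then sort each row left to right. The coordinates, the `y_tol` threshold, and the helper are illustrative assumptions rather than any particular library's algorithm.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int
    y: int
    text: str


def group_rows(boxes, y_tol):
    """Cluster boxes into rows: a box joins the current row if its y is
    within y_tol of the row's first box, otherwise it starts a new row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and box.y - rows[-1][0].y <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


# Hypothetical boxes with a few pixels of vertical jitter per row.
boxes = [
    Box(50, 100, "Not Due"), Box(300, 102, "0%"),
    Box(450, 99, "1,535.52"), Box(600, 101, "No"),
    Box(50, 130, "Less than 6 months"), Box(300, 131, "3%-16%"),
    Box(450, 129, "1,130.74"), Box(600, 130, "No"),
]

good = group_rows(boxes, y_tol=5)    # tolerance suits this layout
brittle = group_rows(boxes, y_tol=1)  # same code, tighter tolerance
print(good)
print(len(brittle), "rows with y_tol=1")
```

With a tolerance that suits the layout the two rows come back intact; tighten it slightly and the same rows fragment. That sensitivity to a hand-tuned threshold is the brittleness the paragraph above describes.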

ViTs treat the page as one signal

A vision transformer divides the page into a uniform grid of patches (typically 14×14 or 16×16 pixels) and embeds each patch into a token. From there, self-attention lets every patch attend to every other patch. A patch containing the figure 75% and a patch containing the icon of a hand holding a coin can interact directly. The model is free to learn the association "this icon means domestic sales" because patches do not have a designated semantic type, only spatial coordinates and embeddings.
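A little arithmetic shows what the patch grid looks like at the benchmark's rendering resolution. The page size is an assumption (A4 at 144 DPI is roughly 1191 × 1684 pixels), and production multimodal models typically resize or tile the image before patching, so the real token count for any given model will differ.

```python
import math


def patch_grid(width_px, height_px, patch_px):
    """Number of (columns, rows) of square patches covering the image."""
    return math.ceil(width_px / patch_px), math.ceil(height_px / patch_px)


def patch_index(x, y, patch_px, cols):
    """Row-major index of the patch that contains pixel (x, y)."""
    return (y // patch_px) * cols + (x // patch_px)


# Assumed page size: A4 rendered at 144 DPI.
WIDTH, HEIGHT, PATCH = 1191, 1684, 16

cols, rows = patch_grid(WIDTH, HEIGHT, PATCH)
n_patches = cols * rows

# Full self-attention scores every ordered pair of patches.
attention_pairs = n_patches * n_patches

print(cols, rows, n_patches, attention_pairs)
```

Every one of those patch pairs is a potential link between a number and the visual element beside it, which is the connection a detect-then-recognise cascade never gets the chance to make.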

Combined with a language decoder (in Gemini's case, an autoregressive transformer that consumes those visual tokens alongside a text prompt), this lets the model emit any natural-language description of the page — extraction, summary, structured JSON, or answer to a question — without a separate post-processor for each.

That is the architectural story behind the empirical gap. And because the gap is architectural, it does not depend on picking the largest available multimodal model — a Lite-tier one preserves most of the win, as the numbers above show.

What this means for document-extraction pipelines

A typical financial-document RAG pipeline does text retrieval over pre-chunked embeddings (cheap, indexable), then ships the retrieved PDF pages as multimodal input to a vision-language model at generation time. The hybrid is the right shape: text retrieval keeps index size and latency under control, while generating answers from the page images rescues the parts of the page that text extraction permanently lost.
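The shape of that hybrid can be sketched without any model at all. Everything below is a stand-in: token overlap replaces embedding similarity, the chunk texts, the second page number, and the file names are invented for illustration, and the returned dictionary is a hypothetical payload rather than a real API request.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_pages(query, chunks, top_k=2):
    """Rank pages by best token overlap between the query and any chunk."""
    query_tokens = tokenize(query)
    scores = {}
    for page, text in chunks:
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            scores[page] = max(scores.get(page, 0), overlap)
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


def build_generation_request(query, pages, page_images):
    """Attach the retrieved page images, not the extracted text."""
    return {"question": query, "images": [page_images[p] for p in pages]}


# Hypothetical text chunks keyed by page number.
chunks = [
    (118, "credit risk ageing of trade receivables weighted average loss rate"),
    (42, "property plant and equipment accounting policy depreciation"),
]
page_images = {118: "page_118.png", 42: "page_042.png"}

query = "loss rate for receivables not due"
pages = retrieve_pages(query, chunks)
request = build_generation_request(query, pages, page_images)
print(request)
```

The design choice to notice is the hand-off: retrieval runs on text, but what reaches the generator is the page image, so the table layout survives even though the index never saw it.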

Two practical things follow from the benchmark above:

The generation-side model can be a Lite-tier multimodal model. Six seconds per page is a budget that fits inside interactive question-answering pipelines, and the structural fidelity on tables and visual fidelity on charts both hold.
The retrieval side is the next frontier. Because today's text-only retrieval cannot fire on chart-only or infographic-only pages — they produce little or no meaningful text chunk — those pages are effectively invisible to the index. ColPali-style late-interaction embeddings or multivector image embeddings (e.g. Jina v4 with input_type="image" and return_multivector=true) close that loop.
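Late interaction, as it is commonly described for ColBERT-style and ColPali-style retrievers, scores a page by letting each query token pick its best-matching page vector and summing those maxima (often called MaxSim). The sketch below implements that scoring rule in plain Python; the two-dimensional vectors are made up for illustration, and real multivector embeddings would come from an embedding model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, page_vecs):
    """Sum, over query tokens, of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy embeddings: two query tokens, two patch vectors per page.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "chart_page": [[0.9, 0.1], [0.2, 0.8]],
    "text_page": [[0.5, 0.5], [0.4, 0.3]],
}

scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in pages.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Because the page vectors are computed from image patches rather than extracted text, a chart-only page gets a score like any other page instead of being absent from the index.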

Bottom line. OCR is a tool for transcribing typed letters from images. Vision transformers are a tool for understanding documents. Those are not the same problem, and they have not been the same problem for several years. The interesting design question is no longer "ViT vs OCR" (settled) and no longer "which size of ViT" (Lite is sufficient for ingestion) — it is how aggressively to push retrieval itself toward image-native embeddings, so that the chart-only and infographic-only pages a text index currently cannot retrieve become first-class results.

Benchmark notes — single-page latency measured on a CPU-only Windows 11 laptop (Python 3.12). EasyOCR weights were downloaded on first run; the figures above are warm-cache numbers. Gemini calls go through gemini-3.1-flash-lite with temperature=0.1.
