CalQuity
Insight·10 min read·29 June 2026
Technology · Finance

Modern OCR vs VLMs for Document Understanding

A benchmark on annual-report pages comparing LiteParse 2.2.0 with Gemini 3 Flash Lite across tables, dense text, maps, charts, and image-heavy layouts.

[Figure 34 (OCR-VLM): the same annual-report page sent down two extraction paths, LiteParse (PDF → markdown) and Gemini (image → visual facts). Scorecard, parser vs. VLM: values 14/14 vs. 14/14; headers 0/8 vs. 8/8; visual facts 0/6 vs. 6/6. Readout: numeric tie; headers separate; visual facts require the page view.]

TL;DR

This benchmark compares two concrete systems: LiteParse 2.2.0 as the document-parser baseline, and Gemini 3 Flash Lite as the page-level vision-language model.

The comparison is not meant to rank every OCR or document-AI tool. It asks a narrower question: when the same annual-report pages are parsed by LiteParse or shown as rendered images to Gemini, which facts are recovered, and where does visual understanding add measurable value?

  • LiteParse is strong on native PDFs. It extracted the boAt financial table with all 12 numeric values and usable markdown tables in 25 ms.
  • The VLM edge is not basic OCR. Gemini 3 Flash Lite still wins where the page meaning lives in charts, maps, project photos, certificate thumbnails, and visual relationships.
  • The comparison is balanced. LiteParse is a real contender on text and tables; Gemini's advantage is clearer on pages where visual context changes the answer.

Capability matrix
Measured scorecard across the four benchmark pages. Fractions are exact rubric hits, not directional scores.

Scoring basis. Text-item recall is exact string recall over 57 pre-declared ground-truth text items across the four pages. Numeric-value recall is exact string recall over 14 declared numeric values: 12 table values plus the 75% / 25% sales split. Table-header preservation counts whether the date and the three column labels of each of the two credit-risk tables appear inside the markdown table header, for 8 possible header cells (4 per table). Visual-semantic facts count six pre-declared visual facts that require interpreting images rather than only extracting nearby text: domestic/export icon binding, export-map meaning, India-map meaning, certificate identities, project-photo scene descriptions, and caption-photo binding.
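
As a rough illustration of the exact-string recall described above, the sketch below counts how many expected strings appear verbatim in an extracted output. The function name `exact_recall` and the six-value subset are illustrative assumptions, not the benchmark's actual harness; plain substring matching is also looser than the rubric, which requires the string to appear in the right context.

```python
def exact_recall(expected_items, extracted_text):
    """Return (hits, total); a hit is an expected string found verbatim in the output."""
    hits = sum(1 for item in expected_items if item in extracted_text)
    return hits, len(expected_items)


# Illustrative subset: the six gross carrying amounts from the first credit-risk table.
expected_values = ["1,535.52", "1,130.74", "33.50", "74.84", "1.20", "14.21"]

# LiteParse output for the same table, as shown in Example 1.
parser_output = """
| Not Due | 0% | 1,535.52 | No |
|---|---|---|---|
| Less than6 months | 3%-16% | 1,130.74 | No |
| 6 Months to 1 Year | 6%-50% | 33.50 | No |
| 1 - 2 Years | 100% | 74.84 | No |
| 2 -3Years | 100% | 1.20 | No |
| More than3 Years | 100% | 14.21 | Yes |
"""

hits, total = exact_recall(expected_values, parser_output)
print(f"{hits} / {total}")  # 6 / 6
```

Because the match is a plain substring test, a short value such as `1.20` would also match inside a longer number like `11.20`; a stricter scorer would tokenize cells first.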

The two-system baseline

The benchmark uses LiteParse as the parser path and Gemini as the vision-language path. LiteParse receives the source PDFs and emits markdown, tables, extracted images, and layout-oriented output. Gemini receives rendered page images and is prompted to extract both text and visual facts from the full page.

| System | Input used in this benchmark | What it tests |
|---|---|---|
| LiteParse 2.2.0 | Original PDF pages | Fast local parsing, markdown tables, text recall, numeric recall, and whether table headers remain attached to the table body. |
| Gemini 3 Flash Lite | Rendered page images | Page-level extraction across text, tables, maps, charts, photos, icons, and relationships between visual elements. |

The benchmark

The test set contains four annual-report pages: a visual global-presence spread, a financial table, a dense text page, and a project-photo grid. The input PDFs were parsed with LiteParse, while rendered page images were sent to Gemini 3 Flash Lite with an extraction prompt.

| Page | What's on it | What it stresses |
|---|---|---|
| Chart visual | World map of export markets, domestic/export sales infographic, certificate thumbnails | Non-textual visual content |
| Financial table | Two stacked credit-risk tables with merged headers | Tabular structure and numeric recall |
| Dense text | Plain accounting policy commentary on PP&E | Fairness check for native text |
| Image captions | Eight project photos in a 4x2 grid with captions | Caption-to-image binding and photo semantics |

Example 1: financial table

Page 118 of Imagine Marketing's standalone financial statements contains two credit-risk tables. LiteParse handles the values well, while Gemini does better on the table header semantics.

Financial table page
Input page: Imagine Marketing / boAt, credit-risk tables.
LiteParse 2.2.0 output
| Not Due | 0% | 1,535.52 | No |
|---|---|---|---|
| Less than6 months | 3%-16% | 1,130.74 | No |
| 6 Months to 1 Year | 6%-50% | 33.50 | No |
| 1 - 2 Years | 100% | 74.84 | No |
| 2 -3Years | 100% | 1.20 | No |
| More than3 Years | 100% | 14.21 | Yes |

| Not Due | 0% | 1,154.37 | No |
|---|---|---|---|
| Less than6 months | 3%-6% | 393.85 | No |
| 6 Months to 1 Year | 6%-56% | 83.97 | No |
| 1 - 2 Years | 100% | 79.13 | No |
| 2 -3Years | 100% | 14.22 | Yes |
| More than3 Years | 100% | 2.99 | No |

Gemini 3 Flash Lite output
| 31 March 2025 | Weighted average loss rate | Gross carrying amount | Credit impaired |
| :--- | :--- | :--- | :--- |
| Not Due | 0% | 1,535.52 | No |
| Less than 6 months | 3%-16% | 1,130.74 | No |
| 6 Months to 1 Year | 6%-50% | 33.50 | No |
| 1 - 2 Years | 100% | 74.84 | No |
| 2 -3 Years | 100% | 1.20 | No |
| More than 3 Years | 100% | 14.21 | Yes |

| 31 March 2024 | Weighted average loss rate | Gross carrying amount | Credit impaired |
| :--- | :--- | :--- | :--- |
| Not Due | 0% | 1,154.37 | No |
| Less than 6 months | 3%-6% | 393.85 | No |
| 6 Months to 1 Year | 6%-56% | 83.97 | No |
| 1 - 2 Years | 100% | 79.13 | No |
| 2 -3 Years | 100% | 14.22 | Yes |
| More than 3 Years | 100% | 2.99 | No |

This is not a basic OCR failure story. LiteParse recovers every target value and emits the rows in a usable markdown table. The important caveat is the header: LiteParse leaves the date and column labels outside the table, so the first data row ("Not Due") ends up in the header position, while Gemini places each date and column label inside the table header. The numeric benchmark is therefore tied at 12 / 12 values for both systems, but the header benchmark is not: Gemini preserves 8 / 8 table-header cells and LiteParse preserves 0 / 8 inside the table.
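
The header check can be sketched as follows: treat the first row of each pipe table as its header and count how many expected header cells appear there. The helper names `header_cells` and `header_hits` are hypothetical, and the benchmark's own scoring may differ in detail; the inputs are the first three lines of each system's output for the 31 March 2025 table.

```python
def header_cells(markdown_table):
    """Return the cells of the first row of a pipe table (the markdown header row)."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines() if ln.strip()]
    return [cell.strip() for cell in lines[0].strip("|").split("|")]


def header_hits(markdown_table, expected_headers):
    """Count the expected header cells that appear in the table's header row."""
    cells = header_cells(markdown_table)
    return sum(1 for header in expected_headers if header in cells)


expected = [
    "31 March 2025",
    "Weighted average loss rate",
    "Gross carrying amount",
    "Credit impaired",
]

liteparse_table = (
    "| Not Due | 0% | 1,535.52 | No |\n"
    "|---|---|---|---|\n"
    "| Less than6 months | 3%-16% | 1,130.74 | No |"
)
gemini_table = (
    "| 31 March 2025 | Weighted average loss rate | Gross carrying amount | Credit impaired |\n"
    "| :--- | :--- | :--- | :--- |\n"
    "| Not Due | 0% | 1,535.52 | No |"
)

print(header_hits(liteparse_table, expected))  # 0
print(header_hits(gemini_table, expected))     # 4
```

Four cells per table across the two tables gives the 8 possible header cells in the rubric.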

Numeric recall comparison
Exact-match value capture and table-header preservation on the two credit-risk tables.

Example 2: maps, charts, and certificates

The Goodluck India global-presence spread is where text parsing and document understanding separate again.

Annual report visual page
Goodluck India Annual Report 2024-25: map, sales infographic, and certification thumbnails.
LiteParse 2.2.0 output excerpt
75%

### Domestic Sales

**1 14 25%**

### 11 Export Sales

**1** Canada **2 8 9** **2** USA **16**
**3 15**
**3** Mexico **17 18**
...
@ Export Market
@ Warehouse
@ Manufacturing Plant

| 23 South Africa 24 Singapore 25 Malaysia | 1 Ludhiana 2 Faridabad 3 Rudrapur |
|---|---|
| 26 New Zealand | 4 Nashik & Aurangabad |
| 27 Australia | 5 Chennai |

Gemini 3 Flash Lite output excerpt
Domestic & Export Sales:
  Domestic Sales: 75% (icon of a hand holding a coin).
  Export Sales:   25% (icon of an airplane).

World map highlights 27 export-market locations.
A zoomed-in inset map of India shows manufacturing
plants and warehouses.

Certifications:
  - Bureau Veritas ISO 9001:2015
  - Engineers India Limited procurement document
  - Central Boilers Board Certificate of Approval
    for Well Known Forge

LiteParse preserves many country labels, percentages, image placeholders, and some table-like grouping. But it does not say what the image means. Gemini binds the 75% to domestic sales, the 25% to export sales, the dots to map locations, and the thumbnail images to certificate categories. That is the VLM advantage on this page.

The VLM win is not over basic table extraction on native PDFs. It shows up when the page is communicating through layout, icons, maps, charts, or photographs.

Example 3: project photos with captions

The second Goodluck spread shows eight engineering projects as photographs with one-line captions.

Project photos page
Engineering Excellence in Action: eight project photographs in a 4x2 grid.

LiteParse extracts the captions and image objects:

Gangapath : Patna Supply of Steel
Arcellor Mittal Nippon Steel Expansion
Rapid Rail Corridor between
Mokama Bridge Composite Girders
...
![](image_p15_1.png)
![](image_p15_5.png)

Gemini goes further and describes the visual content of each photo: fabricated steel girders on a workshop floor, an illuminated bridge structure, a metro construction site, and an aerial view of a cable-stayed bridge. That is a different kind of output from caption extraction: it turns the photos themselves into searchable evidence, not just the text below them.

The numbers

Structural fidelity per page
Structural fidelity by page type. The gap is modest on native text and tables, but large on visual pages.

Latency comparison
LiteParse is a local parser and is extremely fast on native-text PDFs. The Goodluck visual spread triggers OCR fallback, so latency rises but remains competitive.

Latency is also part of the story. On the two boAt native-text pages, LiteParse finished in 25 ms and 8 ms. Gemini took roughly six seconds per rendered page. That makes the comparison less about which tool is universally "better" and more about what kind of information is actually present on the page.
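
To put those figures on one scale, the short calculation below divides Gemini's per-page latency by each LiteParse timing. It treats "roughly six seconds" as a round 6,000 ms, which is an approximation for illustration rather than a measured value.

```python
GEMINI_MS = 6000          # "roughly six seconds" per rendered page, rounded (assumption)
LITEPARSE_MS = [25, 8]    # the two boAt native-text pages

# How many times longer the VLM call takes than the local parse on each page.
ratios = [GEMINI_MS / ms for ms in LITEPARSE_MS]
print(ratios)  # [240.0, 750.0]
```

On these two pages the local parser is therefore faster by a factor in the hundreds, which matters only when the parser's output already contains the facts being asked for.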

Bottom line. LiteParse is already very good at native PDF text and tables. Gemini adds value when the document's meaning depends on visual context: charts, maps, photographs, icons, thumbnails, and the relationships between them.

Benchmark notes: LiteParse 2.2.0 was run locally on the source PDFs. Gemini 3 Flash Lite was run on rendered page images. The opening scorecard uses a fixed, pre-declared rubric: 57 exact-match text items, 14 exact-match numeric values, 8 table-header cells, and 6 visual-semantic facts. A hit means the extracted output contains the expected string or visual fact in the right context; model confidence scores are not used. Tool references: LiteParse docs and LiteParse announcement.