2026-06-05 · 6 min read
Why single-pass OCR silently drops rows from bank statements — and how we fixed it
If you have ever tried to turn a bank-statement PDF into a spreadsheet, you have met the core problem: the PDF is designed to be read by a human, not parsed by a machine. Columns are visual, not structural. Multi-line descriptions wrap unpredictably. Pages break in the middle of a running balance. And the worst failure mode is the silent one — a transaction is dropped and nothing tells you it is missing.
We build bankpdf, a tool that converts bank-statement PDFs, scans, and phone photos into clean Excel or CSV files. This is a write-up of the extraction approach, because the naive version fails in ways that matter when the output is someone's accounting.
Why a single OCR pass is not enough
Run a statement through one OCR engine and you get one interpretation of an ambiguous layout. On a clean, single-column statement that is usually fine. But real statements are messy: two-column debit/credit layouts, amounts that share a line with an authorization date, descriptions that wrap onto a second visual line, and page footers that repeat the running balance. Each of these is a place where a single pass either merges two rows into one, splits one row into two, or skips a row entirely.
The failure is not random noise you can eyeball. It is a structurally plausible — but wrong — table. A merged row still looks like a row. A dropped row leaves no gap. So you cannot trust the count, and you cannot trust the totals.
Triple extraction, merged by max transaction count
Instead of trusting one interpretation, we extract the same PDF three different ways and reconcile them. The three passes fail on different layouts, so where one collapses two rows, another usually keeps them separate.
- A Markdown pass — good at linear, single-column statements.
- An HTML-table pass — good at recovering genuine column structure.
- A vision-model pass (Mistral Large) — good at the messy, scanned, or photo cases where the text layer is unreliable or absent.
We then merge by maximum transaction count. When the passes disagree on how many rows a section contains, we keep the interpretation that preserves the most distinct transactions. That is almost always the correct one, because the dominant error mode is merging or dropping rows, not inventing them. It is a deliberately simple heuristic, and in our experience it beats any single pass on the messy long tail.
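A minimal sketch of the max-count merge, to make the rule concrete. It is not bankpdf's actual code: the `Transaction` type, the idea of keying results by a section label such as a page, and the function name `merge_by_max_count` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """Hypothetical row type; the real schema is not described in this post."""
    date: str          # ISO date string, e.g. "2026-05-01"
    description: str
    amount: Decimal    # positive = credit, negative = debit


def merge_by_max_count(passes):
    """Pick, per section, the pass output with the most rows.

    `passes` maps a pass name to {section label: list of Transaction}.
    Ties go to the pass listed first, so insertion order acts as a preference.
    """
    # Collect section labels in first-seen order across all passes.
    sections = []
    for result in passes.values():
        for section in result:
            if section not in sections:
                sections.append(section)

    merged = {}
    for section in sections:
        # A pass that missed the section entirely contributes an empty list.
        candidates = [result.get(section, []) for result in passes.values()]
        merged[section] = max(candidates, key=len)
    return merged
```

If one pass collapses two rows into one while the other two keep them apart, the two-row version wins for that section. Because the choice is made per section and not per document, a pass that fails on one page can still contribute its output for another.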
The part that makes it verifiable: balance reconciliation
Extraction that is probably right is not good enough when the output feeds accounting. So after extraction we do something the statement itself makes possible: we reconcile the extracted transactions against the statement's own declared opening and closing balance.
Opening balance, plus the sum of credits, minus the sum of debits, must equal the closing balance the bank printed. If it does not, a row is missing, duplicated, or misread — and we can say so, instead of handing over a clean-looking file that is quietly wrong. This single check is what turns 'we OCR'd your statement' into 'we OCR'd your statement and verified it adds up'.
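The arithmetic is small enough to sketch. The function below is an illustration under assumptions, not our production check: the name `reconcile` and the choice to pass credits and debits as separate lists of positive amounts are hypothetical. It uses `Decimal` so that cents are compared exactly and not through binary floating point.

```python
from decimal import Decimal


def reconcile(opening, closing, credits, debits):
    """Return closing minus (opening + credits - debits).

    A result of zero means the extracted rows add up to the printed
    closing balance. Any other value is the size of the discrepancy.
    """
    expected_closing = (
        opening
        + sum(credits, Decimal("0"))
        - sum(debits, Decimal("0"))
    )
    return closing - expected_closing


# Worked example with made-up figures.
opening = Decimal("1000.00")
closing = Decimal("1199.50")
credits = [Decimal("250.00")]
debits = [Decimal("40.00"), Decimal("10.50")]

complete = reconcile(opening, closing, credits, debits)
# Simulate a dropped row: the 10.50 debit never made it out of extraction.
with_dropped_row = reconcile(opening, closing, credits, debits[:1])
```

With the full set of rows the discrepancy is zero. With the 10.50 debit dropped it is -10.50, which both signals the problem and hints at the amount of the missing row.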
The unsexy moat: per-bank adapters
The dream is one universal algorithm. We gave up on it. Every bank formats dates, currencies, and description prefixes differently — Chase repeats CHECKCARD/PURCHASE prefixes, Wells Fargo embeds an authorization date in the label, Barclays prints a per-page closing balance. So we fingerprint each bank and apply a dedicated adapter (80+ banks, including non-US/UK ones) that normalizes its quirks before the data ever reaches the spreadsheet.
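One way to structure this is a registry of small normalizer functions selected by a fingerprint. The sketch below is hypothetical throughout: the marker strings, the regular expressions, and the sample description formats are assumptions made for illustration, not the banks' actual layouts or bankpdf's adapters.

```python
import re
from typing import Callable, Dict, Optional

# Registry of bank key -> description normalizer (all names hypothetical).
ADAPTERS: Dict[str, Callable[[str], str]] = {}


def adapter(bank: str):
    """Decorator that registers a normalizer under a bank key."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        ADAPTERS[bank] = fn
        return fn
    return register


@adapter("chase")
def normalize_chase(description: str) -> str:
    # Assumed format: one or more leading CHECKCARD / PURCHASE tokens.
    return re.sub(r"^(?:(?:CHECKCARD|PURCHASE)\s+)+", "", description).strip()


@adapter("wells_fargo")
def normalize_wells_fargo(description: str) -> str:
    # Assumed format: an embedded "AUTHORIZED ON MM/DD" fragment.
    return re.sub(r"\bAUTHORIZED ON \d{2}/\d{2}\s*", "", description).strip()


# Assumed fingerprint: a marker string found in the first page's text.
FINGERPRINTS = {"chase": "jpmorgan chase", "wells_fargo": "wells fargo"}


def fingerprint(first_page_text: str) -> Optional[str]:
    lowered = first_page_text.lower()
    for bank, marker in FINGERPRINTS.items():
        if marker in lowered:
            return bank
    return None


def normalize(first_page_text: str, description: str) -> str:
    """Apply the matching adapter, or only trim whitespace if none matches."""
    fn = ADAPTERS.get(fingerprint(first_page_text))
    return fn(description) if fn else description.strip()
```

Keeping each adapter as a plain function makes the per-bank quirks easy to test in isolation, and an unrecognized bank falls through unchanged instead of being forced through the wrong rules.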
It is grindy, manual work. That is precisely why it is a moat: a generic converter cannot match per-bank precision, and the work does not compress into a clever one-liner. The same logic shows up in the literature on the original bootstrapped player in this niche — the boring, bank-specific path is the one that actually ships accurate output.
Honest limitations
Triple extraction costs roughly three times the compute of a single pass. The max-count merge can, in rare cases, keep a spurious split if two passes make the same split error. And reconciliation only works when the statement prints opening and closing balances — some don't, and there the check degrades to a softer heuristic. We would rather tell you the check could not run than pretend it passed.
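In code, "could not run" is easiest to keep honest as its own state, separate from pass and fail. This is a minimal sketch with hypothetical names (`CheckStatus`, `check_status`); it does not model the softer fallback heuristic, which is not specified here.

```python
from decimal import Decimal
from enum import Enum
from typing import Optional


class CheckStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    NOT_RUN = "not_run"   # statement printed no opening or closing balance


def check_status(
    opening: Optional[Decimal],
    closing: Optional[Decimal],
    net_movement: Decimal,
) -> CheckStatus:
    """net_movement is the sum of credits minus the sum of debits."""
    if opening is None or closing is None:
        # Never report a pass for a check that had nothing to compare against.
        return CheckStatus.NOT_RUN
    if opening + net_movement == closing:
        return CheckStatus.PASSED
    return CheckStatus.FAILED
```

A boolean would force the missing-balance case into either "passed" or "failed". The third state lets the export say plainly that verification was not possible.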
If you want to try it, bankpdf is free for the first few statements per month, hosted in the EU, and supports direct exports for accounting software (Pennylane, Sage, FEC, QuickBooks, Xero, OFX). But the approach above is the part worth stealing whether you use us or build your own: extract more than once, merge toward the higher count, and reconcile to the printed balance.