
Not All OCR is Equal


Many attorneys, paralegals, and litigation support professionals rely on large databases of OCR text day after day, night after night, without ever inquiring into the quality of the OCR. Not only the accuracy of the OCR (whether or not it correctly detects the spelling of a word) but also the formatting of the OCR can have a great impact on your ability to retrieve information. Whether you are running Boolean searches in a database or attempting to process data using a grep utility, the quality of the OCR can make or break your project.

OCR software can generate text with a lot of different irregularities when it attempts to deal with forms. However, these are just the kind of documents you may want to attempt to 'auto code': collect the data points from all of the forms and put them in a spreadsheet or database for analysis. Below I show the differences in the OCR generated by three widely used programs from a 1040 tax form, in this case that of Barack and Michelle Obama.

I'm going to concentrate on just the top portion of the form, where the First Family's names and home address are listed. Say for the sake of argument that you were assigned to process thousands of tax forms and put the data for all of them in a single Excel spreadsheet. You would need to run searches using regular expressions or another technique, based on the labels in each box of the form. It's important that the OCR place the label, "Your first name and initial", in front of where each tax filer's name appears.
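A minimal sketch of that label-based approach in Python, assuming clean OCR text in which each label comes directly before its value and the values are in capital letters. The function name `extract_filer`, the pattern, and the sample string are hypothetical, made up for illustration only.

```python
import re

# Hypothetical pattern: assumes each label appears immediately before
# its value, in reading order, and that values are upper-case.
FILER_PATTERN = re.compile(
    r"Your first name and initial\s+(?P<first>[A-Z][A-Z .]*?)\s+"
    r"Last name\s+(?P<last>[A-Z][A-Z'-]*)"
)


def extract_filer(ocr_text):
    """Return (first name and initial, last name), or None if the labels are not found."""
    match = FILER_PATTERN.search(ocr_text)
    if match is None:
        return None
    return match.group("first"), match.group("last")


# Invented sample text, not taken from any real return.
sample = "Your first name and initial JANE Q. Last name PUBLIC Your social security number"
print(extract_filer(sample))  # ('JANE Q.', 'PUBLIC')
```

Each tuple returned this way could become one row of the spreadsheet. The pattern only works when the OCR spells the labels correctly and keeps them next to the values they describe.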

In the first example, Adobe does a good job of putting the labels in the right place, before the President and First Lady's names and before the boxes for their address. However, the OCR could not correctly detect the fine print used for the words 'Your first name and initial' (it shows "Your t1rst name ana mit1a1"), and couldn't interpret the no longer so uncommon last name 'OBAMA', instead giving you 'bBAMA'.

Nuance does a better job at getting the words in the label correct, but for some reason puts the names of the First Couple below those of their children, which appear in lower sections of the actual form. Nuance also gives us 'DRAMA' for the President's last name, even though it gets his wife's and children's surname correct.

Abbyy FineReader does the best, spelling 'Obama' correctly and providing a format close to the original. However, it misinterprets the side of the box in which the last name is entered as a bracket or open parenthesis and places it at the beginning of the surname. The proximity of the 'L' in the 'Last name' label to the side of the box causes it to scramble this as "|_ast name".
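To see how these differences affect a search, here is a rough Python sketch that runs one strict, label-based pattern against the name-box portion of each program's output. The excerpts are copied from the three OCR results reproduced in this post; the pattern and the `extract_names` helper are assumptions for illustration, not a recommended production approach.

```python
import re

# Strict pattern: both labels must be spelled exactly as printed on the form.
STRICT = re.compile(
    r"Your first name and initial\s+(?P<first>.+?)\s+Last name\s+(?P<last>\S+)"
)

# Name-box excerpts copied from the three OCR outputs in this post.
EXCERPTS = {
    "Adobe Acrobat": "Your t1rst name ana mit1a1 BARACK H. Last name bBAMA Your social security number",
    "Nuance Power PDF Advanced": "Your first name and initial BARACK H. Last name DRAMA Your social security number",
    "Abbyy FineReader": "Your first name and initial I |_ast name BARACK H. [OBAMA Your social security number",
}


def extract_names(excerpts):
    """Map each program to the (first, last) pair the pattern finds, or None."""
    results = {}
    for program, text in excerpts.items():
        match = STRICT.search(text)
        results[program] = (match.group("first"), match.group("last")) if match else None
    return results


for program, names in extract_names(EXCERPTS).items():
    print(f"{program}: {names}")
```

With this particular pattern only the Nuance excerpt matches, and it returns the misread surname 'DRAMA'. The Adobe and Abbyy excerpts return nothing because a label is garbled. That is not a ranking of the programs, only a sign of how brittle an exact label match is against real OCR.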

Adobe Acrobat

(99) E 1040 U.S. Individual Income Tax Return 120131 OMB No. 1545-00741 IRS Use Only-Do not write or staple in this space. LL. For the year Jan. 1-Dec. 31, 2013, or other tax year beginning ,2013, ending 20 See seoarate instructions. Your t1rst name ana mit1a1 BARACK H. Last name bBAMA Your social security number If ajoint return, spouse's first name and initial MICHELLE L. Last name PBAMA Spouse's social security number Home address (number and street). If you have aP.O. box, see mstructions. I Apt no. 1600 PENNSYLVANIA AVENUE, NW A Make sure the SSN(s) above and on line Ge are correct. City, town or post office, state, and ZIP code. If you have a foreign address, also complete spaces below. WASHINGTON, DC 20500 nes1aent1a1 l::tection vampatgn Check here if you, or your spouse if filing join11y, want $3 to go to this fund. Checking a box below will not change your tax or refund. OOvou 00 Spouse Foreign country name IForei.gn province/state/county IForeign postal code

Nuance Power PDF Advanced

Filing Status 2 X Married filing jointly (even if only one had income) 1 I I Single Dependents on 6c not entered above Add numbers on lines a 4 394,796. 6,575. Boxes checked on 6a and 6b No. of children on 6c who: pg lived with you d go did not live with you due to divorce or separation (see instructions) 2 Dependents: 2 (1) First name Last name (2) Dependent's social security number (3) Dependent's relationship to you (4)n' I ilrift rW II liulifyingtochil Wmft MALIA A OBAMA DAUGHTER X If more than four NATASHA M OBAMA DAUGHTER X dependents, see instructions and check here I I Combine the amounts in the far right column for lines 7 through 21. This is your total income 22 27 1,404 28 20,681. 24 25 26 33 34 35 36 37 Eri 1040 U.S. individual Income Tax Return199) 20 1 31 0MB No. 1545-0074 IRS Use Only - Do not write or staple in this space. For the year Jan. 1-Dec. 31, 2013, or other tax year beginning 2013, ending 20 See separate instructions. Your first name and initial BARACK H. Last name DRAMA Your social security number If a joint return, spouse's first name and initial MICHELLE L. Last name OBAMA. Spouses social security number Home address (number and street). If you have a P.O. box, see instructions. 1600 PENNSYLVANIA AVENUE, NW Apt no. A Make sure the SSN(s) above and on line 6c are correct. City, town or post office, state, and ZIP code. If you have a foreign address, also complete spaces below. WASHINGTON, DC 20500 Presidential Election Campaign Check here if you, or your spouse if filing jointly, want $3 to go to this fund. Checking a box below will not change your tax or refund.

Abbyy FineReader

I 1040 U.S. Individual income Tax Return 2013 OMB No. 1545-0074 IRS Use Only - Do not write or staple in this space. For the year Jan. 1-Dec. 31, 2013, or other tax year beginning , 2013, ending 20 See separate instructions. Your first name and initial I |_ast name BARACK H. [OBAMA Your social security number If a joint return, spouse's first name and initial Last name MICHELLE L. (OBAMA Spouse's social security number Home address (number and street). If you have a P.O. box, see instructions. 160 0 PENNSYLVANIA AVENUE, NW Apt no. . Make sure the SSN(s) above ^ and on line 6c are correct City, town or post office, state, and ZIP code. If you have a foreign address, also complete spaces below. WASHINGTON, DC 20500 Presidential Election Campaign Check here if you, or your spouse if filing jointly, want $3 to go to this fund. Checking a box below will not change your tax or refund. I X J You I X I Spouse Foreign country name Foreign province/state/county Foreign postal code Filin Status 1 _____' Single 2 I X I Married filing jointly (even if only one had income) Check only____________I I Married filing separately. Enter spouse's SSN above

