Tonight I posted a video to my YouTube channel which demonstrates how to get OCR text aligned in straight rows.
When you're reviewing a document like an invoice with columns of data separated by wide empty spaces, the OCR text may be arranged in virtual text boxes that follow the columns rather than running across the page from left to right. So if a date is listed on the left side of the page, a billed amount listed on the right will not appear on the same row.
If you need to parse out data into different columns and input it into a spreadsheet, this formatting of the OCR text can prevent you from lining up the entries from different columns on single rows.
As you can see in the example shown in the video, the quantity and item of the first entry on the invoice aren't on the same line as the unit cost and price.
The same problem occurs on each of the randomly selected invoices from separate sources.
We can see the problem more clearly when we save the OCR to a text file. The current arrangement of the text will make it impossible to parse out the data in an Excel spreadsheet.
Abbyy FineReader provides us with a solution for this problem. By default, Abbyy FineReader will OCR the text in the same misaligned fashion as Adobe Acrobat, but you can change how it lays out the text areas on each page.
Begin by selecting all of the text boxes on an image page by pressing Ctrl + A. Then delete them. Click on the Text box icon on the toolbar above the image and draw a large box over the entire page.
With your large text box still selected, go to the Area menu and save it as an area template.
Now go to the thumbnail view, and select all of the pages. Go back to the Area menu, and pick Load Area Template. The new template will delete any existing text boxes.
Next, go back and select all of the pages in the thumbnail view, then right-click and choose 'Read Selected Pages'. You should now see that the text for each entry on the invoice is lined up.
Go to File . . . Save Documents As . . . Text Document. The text will now be easier to parse out in Excel by adding in delimiters and using the Text to Columns tool.
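If you'd rather script the delimiter step than do it by hand, here is a minimal Python sketch of the same idea. It assumes the saved text file keeps the wide gaps between columns as runs of two or more spaces (or as tabs); that is an assumption about the export, so check your own file first. The function names and the sample rows are made up for illustration and do not come from the video.

```python
import csv
import io
import re

# Assumption: a column gap is two or more whitespace characters, or a tab.
# A single space is treated as part of a cell (e.g. "Widget, large").
COLUMN_GAP = re.compile(r"\s{2,}|\t")


def split_row(line):
    """Split one row of OCR text into cells wherever a wide gap appears."""
    return [cell.strip() for cell in COLUMN_GAP.split(line.strip()) if cell.strip()]


def text_to_csv(text):
    """Convert row-aligned OCR text into CSV text, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        cells = split_row(line)
        if cells:
            writer.writerow(cells)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical invoice rows: quantity, item, unit cost, price.
    sample = (
        "2   Widget, large      4.50    9.00\n"
        "1   Gasket             1.25    1.25\n"
    )
    print(text_to_csv(sample))
```

The resulting CSV opens directly in Excel with each entry on its own row. If an item description happens to contain a double space, it will be split into two cells, so spot-check the output against the invoice.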