
Tonight's tip comes from the blog of Catherine Wilhelmsen by way of the Litigation Support Guru, Amy Bowser Rollins. Notepad++ allows you to edit large amounts of text far more quickly than Excel. However, you may have found it frustrating, as I have, that it lacks a text-to-columns tool. Catherine's blog post demonstrates how you can reproduce the function of Excel's Text to Columns tool in Notepad++.

Before beginning to edit a text file, go to the Plugins menu and select Plugin Manager . . . Show Plugin Manager. Then install TextFX Characters. When this plugin has been installed (you should need to restart Notepad++), you'll see a new menu heading named 'TextFX'.

As Catherine explains, you simply select the character(s) you want to align the file into columns by, copy them, select all of the text in the file, and then go to TextFX . . . TextFX Edit . . . Line up multiple characters by lines (Clipboard Character). The text will be arranged in columns. You can then select an individual column or multiple columns by putting the cursor at the beginning of a column and holding down CTRL + SHIFT + END, and then pressing ALT + SHIFT and using the arrow keys. It is then possible to rearrange the columns by cutting and pasting them.
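If you want to see what the alignment step is doing, here is a minimal Python sketch of the same idea: pad every field so that the delimiter characters line up vertically. This is only an illustration and not how TextFX itself is implemented; the function name `align_columns` and the sample lines are my own assumptions.

```python
def align_columns(lines, delimiter=","):
    """Pad fields with spaces so each delimiter lines up down the file.

    Hypothetical helper for illustration; it is not part of TextFX.
    """
    rows = [line.split(delimiter) for line in lines]

    # Find the widest entry in each column (the last field on a line
    # is left alone so no trailing spaces are added).
    widths = {}
    for row in rows:
        for i, cell in enumerate(row[:-1]):
            widths[i] = max(widths.get(i, 0), len(cell))

    aligned = []
    for row in rows:
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row[:-1])]
        aligned.append(delimiter.join(padded + [row[-1]]))
    return aligned


# Made-up sample data
for line in align_columns(["a,bbb,c", "dddd,e,f"]):
    print(line)
```

Once the delimiters sit at the same position on every line, a column (block) selection picks up exactly one field per row, which is why the cut-and-paste rearrangement works.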

You can also sort the text by columns by selecting text in the same fashion and going to TextFX . . . TextFX Tools . . . Sort lines case insensitive (at column). Selecting individual columns won't sort them in isolation from the rest of the text.
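The sorting behavior can be sketched in Python as well. The assumption here is that the lines have already been aligned, so a column starts at the same character position on every line; whole lines are reordered, keyed on the text from that position onward. The name `sort_lines_at_column` and the sample data are hypothetical.

```python
def sort_lines_at_column(lines, start, end=None):
    """Sort whole lines, ignoring case, by the text at positions start..end.

    Hypothetical helper; entire lines move together, so the other
    columns stay attached to the values being sorted.
    """
    return sorted(lines, key=lambda line: line[start:end].lower())


# Made-up, already aligned sample data; the second column starts at position 5
sample = ["b   ,Zed", "a   ,apple", "c   ,Mango"]
for line in sort_lines_at_column(sample, 5):
    print(line)
```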

See the demonstration of this method on my YouTube channel.


 
 

Sometimes a ¶ character may be used in a delimited file to separate different columns. You enter this character, ¶, by pressing ALT + 20 (using the numbers on the numeric keypad). The pilcrow, or paragraph mark, ¶, may also be created in Excel by entering the formula =CHAR(20). See this example, where the pilcrow is entered in column B with the alt code ALT + 20 and in column D with that formula.

When the file is then opened in a text editor like Notepad++, the pilcrow generated by the =CHAR(20) formula will appear as 'DC4'.

This is because when the sheet is exported as a CSV file, the character is written out as the ASCII control code with the value 20. 'DC4' stands for Device Control 4, a control code used for pausing a device.
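A short Python sketch shows why the two look different in a text editor: the character with code 20 is a control character, while the printable pilcrow is a separate character with its own code (U+00B6). The helper `split_fields` is a hypothetical name, and treating both characters as the same delimiter is my assumption about how you might want to handle such a file.

```python
import re
import unicodedata

dc4 = chr(20)        # what =CHAR(20) writes out: control code 0x14 (DC4)
pilcrow = "\u00b6"   # the printable pilcrow sign

print(hex(ord(dc4)), unicodedata.category(dc4))   # category 'Cc' means control character
print(hex(ord(pilcrow)), unicodedata.name(pilcrow))


def split_fields(line):
    """Split a line on either the DC4 control code or a printable pilcrow.

    Hypothetical helper for files that mix the two delimiters.
    """
    return re.split("[\x14\u00b6]", line)


# Made-up sample line containing both kinds of delimiter
print(split_fields("Smith" + dc4 + "2015" + pilcrow + "Invoice"))
```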


 
 

Tonight I posted a video to my YouTube channel which demonstrates how to get OCR text aligned in straight rows.

When you're reviewing a document like an invoice with columns of data separated by wide empty spaces, the OCR text may be arranged in virtual text boxes that align along the columns rather than running across from left to right. So if a date is listed on the left side of the page, a billed amount listed on the right will not appear on the same row.

If you need to parse out data into different columns and input it into a spreadsheet, this formatting of the OCR text can prevent you from lining up the entries from different columns on single rows.

As you can see in the example shown in the video, the quantity and item of the first entry on the invoice aren't on the same line as the unit cost and price.

The same problem occurs on each of the randomly selected invoices from separate sources.

We can see the problem more clearly when we save the OCR to a text file. The current arrangement of the text will make it impossible to parse out the data in an Excel spreadsheet.

ABBYY FineReader provides us with a solution for this problem. By default, it will OCR the text in the same misaligned fashion as Adobe Acrobat, but its area templates let you correct this.

Begin by selecting all of the text boxes on an image page by pressing CTRL + A. Then delete them. Click on the Text box icon on the toolbar above the image and draw a large box over the entire page.

With your large text box still selected, go to the Area menu and save it as a template.

Now go to the thumbnail view, and select all of the pages. Go back to the Area menu, and pick Load Area Template. The new template will delete any existing text boxes.

Next, go back and select all of the pages in the thumbnail view again, then right-click and choose 'Read Selected Pages'. You should now see that the text for each entry on the invoice is lined up.

Go to File . . . Save Documents As . . . Text Document. The text will now be easier to parse out in Excel by adding in delimiters and using the Text to Columns tool.
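If you'd rather add the delimiters with a script than by hand, here is a minimal Python sketch. It assumes that the columns in the saved text file are separated by runs of two or more spaces, which may not hold for every invoice; the function name `add_delimiters`, the pipe delimiter, and the sample line are all hypothetical.

```python
import re


def add_delimiters(line, delimiter="|", min_gap=2):
    """Replace each run of min_gap or more spaces with a single delimiter.

    Hypothetical helper; single spaces inside an entry (for example in
    an item description) are left alone.
    """
    pattern = r" {%d,}" % min_gap
    return re.sub(pattern, delimiter, line.strip())


# Made-up invoice line: quantity, item, unit cost, price
print(add_delimiters("2   Blue Widget      4.50    9.00"))
```

The resulting file can be opened in Excel and split with Text to Columns using the same delimiter.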


 
 

Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.


© 2015 by Sean O'Shea.
