Create a Redaction List with PowerGrep
As noted on this blog in the past, Adobe Acrobat's search and redaction tool can be used to redact multiple terms in multiple PDFs. It can search for almost 1000 terms in 1000 pages of PDFs in less than 15 minutes. When you are creating a list of terms to redact, it may be helpful to use a grep utility such as PowerGrep, which can run regular expression searches across multiple electronic files very quickly. Tonight I had to redact hundreds of addresses and personal names from about 1000 pages of PDFs. The street addresses always ended with the state abbreviation. I decided to run a regular expression search which would collect a certain number of words before and after the terms I was searching for. I found what I was looking for here:
The basic structure of the search is this:
(\S+\s+){0,5}\S*\bvisit\b\S*(\s+\S+){0,5}
'Visit' is the search term, and the '5's set the maximum number of words before and after it that should be collected along with the search term.
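To see how the pattern behaves outside PowerGrep, here is a minimal Python sketch of the same structure. The helper name context_pattern and the sample sentence are made up for illustration, and Python's re module stands in for PowerGrep's regex engine, which may differ in small details.

```python
import re

def context_pattern(term, before=5, after=5):
    """Build a regex that collects up to `before` words ahead of `term`
    and up to `after` words following it (hypothetical helper)."""
    return re.compile(
        r"(?:\S+\s+){0,%d}\S*\b%s\b\S*(?:\s+\S+){0,%d}"
        % (before, re.escape(term), after)
    )

# Made-up sample sentence, with two words of context on each side.
text = "Please plan to visit the office on Monday"
match = context_pattern("visit", before=2, after=2).search(text)
print(match.group(0))  # plan to visit the office
```

Raising or lowering the two numbers widens or narrows the window of words returned around each hit.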
In PowerGrep, follow these steps:
1. Select the PDFs you want to search in the File Listing on the left.
2. Set the Action Type to 'Collect Data'.
3. Set Filter Files to 'Do not filter files'.
4. Set File Sectioning to 'Do Not Section Files'.
5. Set the Search Type to 'Regular Expression'.
6. In the Collect box enter: '\r\n\0'
7. Put a path to a CSV file in 'Target File Location'.
8. Finally in the search box enter the RegEx search:
(\S+\s+){0,12}\S*\b(, AL|, AK|, AZ|, AR|, CA|, CO|, CT|, DE|, FL|, GA|, HI|, ID|, IL|, IN|, IA|, KS|, KY|, LA|, ME|, MD|, MA|, MI|, MN|, MS|, MO|, MT|, NE|, NV|, NH|, NJ|, NM|, NY|, NC|, ND|, OH|, OK|, OR|, PA|, RI|, SC|, SD|, TN|, TX|, UT|, VT|, VA|, WA|, WV|, WI|, WY)\b\S*(\s+\S+){0,2}
In this example the search terms are separated with a pipe, |, and the opening and closing parentheses set the list of search terms apart from the rest of the regular expression.
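Typing fifty alternatives by hand is error-prone, so the long search string can also be generated. The sketch below builds the same pipe-separated group from a list of abbreviations and tries it on a made-up sentence. It assumes Python's re module behaves like PowerGrep's engine for this pattern; the names STATES, alternation, and sample are illustrative only.

```python
import re

# The 50 state abbreviations, in the same order as the search string above.
STATES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]

# Each term is ", XX"; the terms are joined with pipes inside one group.
alternation = "|".join(", " + state for state in STATES)
regex_text = r"(\S+\s+){0,12}\S*\b(" + alternation + r")\b\S*(\s+\S+){0,2}"
pattern = re.compile(regex_text)

# Fictional sample text, not taken from any real document.
sample = "Contact Jane Doe at 12 Elm Street, Springfield, IL 62701 before Friday"
match = pattern.search(sample)
print(regex_text)       # paste this into the PowerGrep search box
print(match.group(0))   # Contact Jane Doe at 12 Elm Street, Springfield, IL 62701 before
```

The search is case-sensitive, so abbreviations that double as ordinary words (IN, OR, ME, OK, HI) only match when they are capitalized and follow a comma and a space.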
Click Collect, and in a few minutes PowerGrep will extract the data to the comma-separated values file, with up to 12 words before the state abbreviations and up to two words after. (Usually the documents I was searching did not include zip codes, but I wanted to be sure.) It's then just a matter of parsing out the data in Excel. You can use the Text to Columns tool to add a delimiter in each instance in which there is a space before a digit. If there is more than one number used in the address, just use a simple formula like =C2&" "&D2&" "&E2 to combine the data that was split between columns when you need it together in one column as the complete address.
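If you would rather not do the cleanup in Excel, the same idea can be sketched in Python. This is an alternative to the Excel steps, not the method used above. It assumes each collected line holds one hit and that the street address begins at the first number on the line; both function names and the sample line are hypothetical.

```python
import re

def split_before_digits(line):
    """Split a collected line at each run of whitespace followed by a digit."""
    return re.split(r"\s+(?=\d)", line.strip())

def address_from(line):
    """Rejoin every piece from the first one that starts with a digit,
    on the assumption that this digit is the street number."""
    parts = split_before_digits(line)
    for i, part in enumerate(parts):
        if part[:1].isdigit():
            return " ".join(parts[i:])
    return ""  # no number on the line, so nothing that looks like an address

# Fictional collected line containing two numbers.
line = "Contact Jane Doe at 12 Elm Street Apt 4, Springfield, IL"
print(split_before_digits(line))
# ['Contact Jane Doe at', '12 Elm Street Apt', '4, Springfield, IL']
print(address_from(line))
# 12 Elm Street Apt 4, Springfield, IL
```

The rejoin step plays the same role as the =C2&" "&D2&" "&E2 formula: it puts back together an address that was split because it contained more than one number.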