
Finding which PDFs are not text searchable

If you have a very large set of PDFs and you're uncertain which files have searchable text, you can set up an action in Adobe Acrobat that uses the Preflight tool to find which files contain no, or very few, text objects.


In the Actions Wizard, add the Preflight option from the Document Processing menu.



Click on 'Specify Settings' for the Preflight action, and in the dialog box select the option for 'Acrobat Pro DC 2015 Profiles'.


Then, in the long menu to the right, select the option to 'List page objects, grouped by type of object'.



Choose the option to create a report for either successes or errors, set a folder for these reports, and check the box to display a summary PDF.


Also choose the option in the Save & Export menu to save each file processed by Preflight. Click the icon to the right to set a specific local folder for the saved reports.



When it runs, the action gives you the option to add multiple files.



The action will generate a PDF portfolio with multiple PDFs for each original PDF. Select all of the PDFs in the portfolio, then right-click and select the option to extract each PDF from the portfolio.



Then combine the reports into a single PDF file, and save the text of the combined report to a text file.



Open the text file in a text editor, and run a find and replace to make sure that the captions 'File name:', 'Path:', 'Text Objects', and 'Vector Objects' each appear at the beginning of a new line.
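
If you'd rather script the find and replace than do it by hand, here is a minimal Python sketch of the same idea. It assumes the four captions appear in the exported text exactly as spelled above, and that they never occur inside a file name or path. The function name break_before_captions is made up for this illustration. Check the output against a few reports before relying on it.

```python
import re

# Captions that should each start a new line (spelled as in the report text).
CAPTIONS = ["File name:", "Path:", "Text Objects", "Vector Objects"]

def break_before_captions(text):
    """Put a single newline before every caption, replacing any
    whitespace that preceded it."""
    pattern = "|".join(re.escape(caption) for caption in CAPTIONS)
    broken = re.sub(r"\s*(" + pattern + ")", r"\n\1", text)
    # The first caption may now have a stray leading newline; drop it.
    return broken.lstrip("\n")

sample = "Text Objects 12 Vector Objects 3 File name: a.pdf Path: C:\\docs"
print(break_before_captions(sample))
```

The sample string is invented. The real report lines may carry the counts in a different layout.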



Then paste the text into column A of an Excel spreadsheet. In column B enter this formula:


=IF(LEFT(A2,4)="Path","",IF(LEFT(A2,12)="Text Objects",A2,B1))


Start in cell B2, and then fill the formula down using CTRL + D. In cell C2, do the same with this formula:


=IF(LEFT(A2,9)="File name",A2,"")


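Note that in the reports the text object count for a file is listed before the file name. That is why the column B formula carries the most recent 'Text Objects' line down until it reaches a 'Path' line.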


When you filter column C for non-blank entries, column B shows how many text objects are in each file.
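
The same pairing can be sketched outside Excel. The Python below mirrors the two formulas: it carries the latest 'Text Objects' line forward, clears it at each 'Path' line, and records it whenever a 'File name' line appears. The function names are hypothetical, and first_number assumes the count is the first run of digits on the 'Text Objects' line, which may not match the actual report layout.

```python
import re

def pair_counts_with_files(lines):
    """Return (file name line, carried 'Text Objects' line) pairs,
    following the logic of the column B and column C formulas."""
    pairs = []
    carried = ""  # plays the role of column B
    for raw in lines:
        line = raw.strip()
        if line.startswith("Path"):
            carried = ""          # column B is blank on a Path row
        elif line.startswith("Text Objects"):
            carried = line        # column B takes the Text Objects line
        if line.startswith("File name"):
            pairs.append((line, carried))  # rows kept by the column C filter
    return pairs

def first_number(line):
    """Assumption: the count is the first run of digits on the line."""
    match = re.search(r"\d+", line)
    return int(match.group()) if match else None

report = [
    "Text Objects 12", "Vector Objects 3", "File name: a.pdf", "Path: C:\\docs",
    "Vector Objects 9", "File name: b.pdf", "Path: C:\\docs",
]
for name, objects in pair_counts_with_files(report):
    print(name, "|", first_number(objects))
```

A file whose report has no 'Text Objects' line comes back with an empty string, just as column B would be blank for it in the spreadsheet.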



Keep in mind that a file with a lot of text that still needs to be OCR'd may have a few text objects from an exhibit slip sheet, headers, footers, and so forth. Review any file that has a small number of text objects relative to its overall page count.
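
As a rough illustration of that last check, the sketch below flags a file when its text objects per page fall under a cutoff. The Preflight report described here does not supply page counts, so page_count is assumed to come from somewhere else. The default cutoff of 5 objects per page is an arbitrary placeholder, not a recommended value, and the name needs_review is hypothetical.

```python
def needs_review(text_objects, page_count, min_per_page=5.0):
    """Flag a file whose text objects per page fall below a cutoff.

    min_per_page is a placeholder assumption; tune it to your documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return text_objects / page_count < min_per_page

print(needs_review(4, 200))    # a few slip-sheet objects across 200 pages
print(needs_review(3000, 10))  # plenty of text objects per page
```

Files that get flagged are candidates for a manual look, not confirmed image-only PDFs.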






