iMacros Code for Extracting Text from URLs
This is a follow-up to last night's tip, which showed how to run code in the iMacros add-on for Firefox to perform form-based searches. You'll note that the search results pages contained links to pages with detailed information on the companies that came up in the search of the New York Department of State's Corporation and Business Entity database. It's the data on those pages that we want to collect. The URLs for the pages are listed in the text of the search results pages. They look like this:
https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1354954&p_corpid=1242845&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
These can be parsed out in Excel easily enough [replace href=" with ~, split with Text to Columns using ~ as the delimiter, then replace "> with ~ and run Text to Columns again]. Then you also need to take out the amp; references, which are not part of the actual address. They show up because the & character is written as &amp; in the HTML source of the saved page.
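For anyone who would rather script this step than do it in Excel, the same parsing can be sketched in Python. This is only an illustration under a few assumptions: the function name extract_entity_urls is made up for this example, the saved search page is assumed to contain ordinary href="..." attributes, and only links containing CORPSEARCH.ENTITY_INFORMATION are kept.

```python
import re

# Captures the value of each href="..." attribute in the saved page text.
HREF_PATTERN = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def extract_entity_urls(page_text):
    """Return the entity-information links found in page_text.

    Hypothetical helper: the name and the filter string are assumptions
    made for this sketch, not part of iMacros or the Department of State site.
    """
    urls = []
    for raw_link in HREF_PATTERN.findall(page_text):
        # Skip navigation links and anything else that is not a detail page.
        if "CORPSEARCH.ENTITY_INFORMATION" not in raw_link:
            continue
        # Take out the amp; references so that &amp; becomes a plain &.
        urls.append(raw_link.replace("&amp;", "&"))
    return urls
```

Calling extract_entity_urls on the text of a saved search page returns a list of cleaned addresses, one per company, in the order they appear on the page.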
The Sobol Soft program mentioned in the Tip of the Night for December 19th cannot handle paths as long as those used on the NYS Department of State's site, so we want to use iMacros to save the pages. This is the basic code to save pages as text files:
VERSION BUILD=7500718 RECORDER=FX
TAB T=1
URL GOTO=http://demo.imacros.net/Automate/SaveAs
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2884554&p_corpid=2863112&p_entity_name=%54%61%63%6F%20%42%65%6C%6C&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2582820&p_corpid=2549739&p_entity_name=%54%61%63%6F%20%42%65%6C%6C&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
To build the full macro in Excel, insert four blank rows after each URL, put the code in an adjacent column, take out the existing hard-coded URL, and copy the rest of the code down. Copy everything to a text editor and then remove the tabs. Now you'll have a macro that can save all of your URLs as text files. It should look like the one below. Activate iMacros in Firefox, go to the Manage tab, click Edit Macro, and enter the code that you created. Run it and you'll have the text files that you need, which you can combine for analysis at the Windows command prompt with the command for %f in (*.txt) do type "%f" >> output.txt
VERSION BUILD=7500718 RECORDER=FX
TAB T=1
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1354954&p_corpid=1242845&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1158933&p_corpid=1054472&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=126226&p_corpid=104702&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2454016&p_corpid=2413025&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1163854&p_corpid=1059184&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2862412&p_corpid=2840358&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2297994&p_corpid=2250857&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=371815&p_corpid=318045&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2859080&p_corpid=2836922&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1647820&p_corpid=1567459&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1647823&p_corpid=1567462&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=1647827&p_corpid=1567466&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=316438&p_corpid=269281&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
URL GOTO=https://appext20.dos.ny.gov/corp_public/CORPSEARCH.ENTITY_INFORMATION?p_nameid=2041618&p_corpid=1984317&p_entity_name=%42%75%72%67%65%72%20%4B%69%6E%67&p_name_type=%25&p_search_type=%43%4F%4E%54%41%49%4E%53&p_srch_results_page=0
'Save the page as a text file
SAVEAS TYPE=TXT FOLDER=C:\fastfood FILE=+_{{!NOW:yyyymmdd_hhnnss}}
'Wait a few seconds
WAIT SECONDS=3
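If the list of URLs is long, the Excel copy-down step and the final combining step could also be scripted. The Python below is a rough sketch rather than anything iMacros itself provides: build_macro and combine_text_files are hypothetical helper names, the header lines and the C:\fastfood folder are simply copied from the macro above, and the saved files are assumed to be readable as UTF-8 text.

```python
from pathlib import Path


def build_macro(urls, folder=r"C:\fastfood", wait_seconds=3):
    """Return macro text that visits each URL and saves it as a text file.

    Hypothetical helper: it only repeats the same four lines per URL
    that the macro above uses.
    """
    lines = ["VERSION BUILD=7500718 RECORDER=FX", "TAB T=1"]
    for url in urls:
        lines.append("URL GOTO=" + url)
        lines.append("'Save the page as a text file")
        lines.append(
            "SAVEAS TYPE=TXT FOLDER=" + folder
            + " FILE=+_{{!NOW:yyyymmdd_hhnnss}}"
        )
        lines.append("WAIT SECONDS=" + str(wait_seconds))
    return "\n".join(lines) + "\n"


def combine_text_files(folder, output_name="output.txt"):
    """Append every .txt file in folder to one output file, in name order.

    Plays the same role as the for %f in (*.txt) command, except that the
    output file itself is skipped so it is never appended to itself.
    """
    folder = Path(folder)
    output_path = folder / output_name
    # Build the list before the output file is created or rewritten.
    sources = sorted(p for p in folder.glob("*.txt") if p.name != output_name)
    with output_path.open("w", encoding="utf-8") as combined:
        for source in sources:
            # Assumption: the saved pages decode as UTF-8; bad bytes are replaced.
            combined.write(source.read_text(encoding="utf-8", errors="replace"))
    return output_path
```

The text returned by build_macro can be pasted into the Edit Macro window in place of the hand-built code, and combine_text_files can be pointed at the save folder once the macro has finished running.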