Here's a continuation of my outline of the 2016 edition of Craig Ball's Electronic Discovery Workbook which I last posted about on September 28, 2018.
The chapter entitled "The Step-by-Step of Smart Search" provides a 10-step approach for effective keyword searching.
A. Statements By Judges on Keyword Searching
1. Judge Facciola - lawyers doing keyword searching without expert guidance are going "where angels fear to tread".
2. Judge Grimm - search methods must be tested for quality assurance.
3. Judge Peck - "wake-up call to the Bar" for their inexpert search terms.
4. Jason R. Baron of NARA - leading figure in e-discovery search.
B. 10-Step Approach
1. Start with the Request for Production
a. ESI search should really begin when litigation is anticipated.
b. Use terms of art from the RFPs, and also rephrase the demands in ordinary English.
c. Push back against overbroad requests.
d. If requests are vague, tell the other side how you will interpret them and put them in the position of having to object.
2. Seek Input from Key Players
a. Custodians are SMEs for their own data.
b. TREC Legal Track challenge showed a correlation between questioning key players and better precision and recall.
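For readers who want the two metrics spelled out: precision is the share of retrieved documents that are actually responsive, and recall is the share of responsive documents that were retrieved. A minimal sketch follows; the function name and document IDs are hypothetical and do not come from the Workbook.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant  # documents that are both retrieved and responsive
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents hit, three are truly responsive overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```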
3. Look at What You've Got and the Tools You'll Use
a. TIFF images require a different search technique than emails or Word documents.
b. Test search tools against actual data.
c. Search tools must be able to search through container files, nested content, and email attachments.
d. Search tools must identify encrypted files or non-standard types that can't be searched.
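To illustrate points c and d, here is a rough sketch of how a tool might recurse into nested ZIP containers and flag encrypted members it cannot search. It uses only Python's standard zipfile module and handles ZIP alone, not PST, RAR, or other containers. The function name and the status labels are my own assumptions.

```python
import io
import zipfile


def walk_container(data: bytes, path: str = ""):
    """Yield (member_path, status) for each file in a ZIP, recursing into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            member = f"{path}/{info.filename}" if path else info.filename
            if info.flag_bits & 0x1:  # bit 0 of the ZIP flags marks an encrypted member
                yield member, "encrypted"
                continue
            payload = zf.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested container: descend so its contents are not skipped.
                yield from walk_container(payload, member)
            else:
                yield member, "searchable"
```

Note that formats built on ZIP (such as .docx) will also be treated as containers by this sketch, so a real tool would need type-specific handling.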
4. Communicate and Collaborate
a. Tell the other side the tools and terms you are using.
b. Ask for targeted suggestions and run them on sample data. The suggestions may highlight terms that you overlooked.
c. Let the other side have two rounds of keyword search and review on your data.
5. Incorporate Misspellings, Variants and Synonyms
a. Common variants are more effective than fuzzy searching, which gets too many false hits.
b. Consult Dumb Dictionary and Wikipedia lists of common misspellings.
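One way to search for a list of known variants rather than relying on fuzzy matching is to fold them into a single whole-word pattern. This is only a sketch; the function name and the sample word list are hypothetical.

```python
import re


def variant_pattern(variants):
    """Compile one case-insensitive, whole-word regex matching any listed variant."""
    alternation = "|".join(re.escape(v) for v in sorted(set(variants)))
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


# Hypothetical keyword with its common misspellings.
pattern = variant_pattern(["receive", "recieve", "received", "recieved"])
```

Because the pattern is anchored on word boundaries, it matches only the listed spellings and not longer words that merely contain them.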
6. Filter and Deduplicate First
a. Filter out music and image files, whose content is stored as alphanumeric characters and so generates false hits.
b. De-NIST by known hash values.
c. Deduplicate before indexing.
d. Be able to repopulate suppressed iterations.
e. Use keywords to exclude irrelevant ESI. e.g., "baby shower"
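The de-NIST, deduplication and repopulation points above can be sketched with file hashes. This is a simplified illustration, not how any particular review platform works. The function name, the in-memory file dictionary, and the choice of SHA-1 are all assumptions; use whatever hash your known-file list provides.

```python
import hashlib
from collections import defaultdict


def filter_and_dedupe(files, known_hashes):
    """files: dict of path -> bytes. known_hashes: set of SHA-1 hex digests to drop.

    Returns (to_index, suppressed): one path per unique file to index, plus the
    duplicate paths held back, keyed by hash so they can be repopulated later.
    """
    by_hash = defaultdict(list)
    for path, data in sorted(files.items()):
        digest = hashlib.sha1(data).hexdigest()
        if digest in known_hashes:
            continue  # de-NIST: skip files whose hash is on the known-file list
        by_hash[digest].append(path)
    to_index = {paths[0]: digest for digest, paths in by_hash.items()}
    suppressed = {d: paths[1:] for d, paths in by_hash.items() if len(paths) > 1}
    return to_index, suppressed
```

Keeping the suppressed paths keyed by hash is what makes it possible to restore every copy of a document once one copy is found responsive.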
7. Test, test, test!
a. Test on data representative of custodian data with responsive evidence.
b. Can a large number of hits be found in system files, business units not subject of litigation, or other irrelevant ESI?
8. Review the Hits
a. Create a spreadsheet showing hits in context - 20-30 words on each side.
b. Review responsive documents for additional keywords.
c. Search is an iterative process.
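A hits-in-context spreadsheet like the one described in point a could be produced along these lines. This is a bare-bones sketch for single-word keywords; the function names, column headers, and the default window of 25 words are my own choices.

```python
import csv
import io
import re


def hits_in_context(doc_id, text, keyword, window=25):
    """Return (doc_id, before, hit, after) rows with `window` words on each side."""
    words = text.split()
    needle = keyword.lower()
    rows = []
    for i, word in enumerate(words):
        # Strip punctuation so "merger," still counts as a hit on "merger".
        if re.sub(r"\W+", "", word).lower() == needle:
            before = " ".join(words[max(0, i - window):i])
            after = " ".join(words[i + 1:i + 1 + window])
            rows.append((doc_id, before, word, after))
    return rows


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet program can open."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["doc_id", "before", "hit", "after"])
    writer.writerows(rows)
    return buf.getvalue()
```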
9. Tweak the Queries and Retest
a. Do keywords cluster in pairs? If so, use a Boolean AND or a proximity connector to reduce noise hits.
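The difference between a Boolean AND and a proximity connector can be shown in a few lines. Real search tools have their own syntax for this (often something like w/10), so treat the function names and the word-distance logic below as an illustrative assumption only.

```python
import re


def boolean_and(text, term_a, term_b):
    """True if both terms appear anywhere in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return term_a.lower() in tokens and term_b.lower() in tokens


def within(text, term_a, term_b, distance=10):
    """True if the two terms appear within `distance` words of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)
```

A document that mentions both terms far apart satisfies the AND but not the proximity test, which is why the proximity connector tends to cut more noise.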
10. Check the Discards
a. Sampling method must be a rational compromise between quality assurance and cost.
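As a simple illustration of checking the discards, one could draw a reproducible random sample of the documents that did not hit on any keyword, have a reviewer code them, and see what fraction turned out to be responsive. The function names, the fixed seed, and the sample size below are assumptions; the right sample size depends on the cost and quality trade-off noted above.

```python
import random


def sample_discards(discard_ids, sample_size, seed=0):
    """Draw a reproducible random sample of discarded document IDs."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-created later
    ids = sorted(discard_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def responsive_rate(reviewed):
    """reviewed: dict of doc_id -> True if the reviewer found it responsive."""
    if not reviewed:
        return 0.0
    return sum(reviewed.values()) / len(reviewed)
```

A high responsive rate in the sample suggests the keywords are missing too much and the queries need another round of tweaking.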