Searching 2/5

Globbing

May 21, 2019

Globbing refers to running wildcard searches for file names, using an asterisk to match any number of characters or a question mark to match a single character. Globbing can be used in the Windows command prompt, PowerShell, and Python to find file names.

For example, simple wildcard searches like these in the command prompt are globbing: dir *.png or dir java*

The Tip of the Night for February 16, 2019 showed how to use the glob module in Python to get a list of files. The glob function accepts shell-style wildcards rather than full regular expressions, but it does support character ranges in square brackets for more complicated file name searches. For example: glob.glob('java[0-3].*')
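To make the wildcard behavior concrete, here is a minimal Python sketch. The folder and file names are invented for illustration, and find_matches is a hypothetical helper, not something from the earlier tip.

```python
import glob
import os
import tempfile

def find_matches(folder, pattern):
    """Return the sorted base names in `folder` that match a glob pattern."""
    # glob.escape protects any special characters in the folder path itself.
    paths = glob.glob(os.path.join(glob.escape(folder), pattern))
    return sorted(os.path.basename(p) for p in paths)

# Build a throwaway folder with made-up file names so the example is self-contained.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["java1.txt", "java3.log", "java7.txt", "logo.png", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()

    png_files = find_matches(demo_folder, "*.png")      # asterisk: any run of characters
    java_files = find_matches(demo_folder, "java*")     # every name starting with 'java'
    one_char = find_matches(demo_folder, "java?.txt")   # question mark: exactly one character
    ranged = find_matches(demo_folder, "java[0-3].*")   # brackets: one digit from 0 to 3

    print(png_files)
    print(java_files)
    print(one_char)
    print(ranged)
```

Note that java7.txt matches java?.txt but not java[0-3].*, because 7 falls outside the bracketed range.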

Use Grep Utility to Collect Full Lines on Which Search Terms Appear

May 18, 2019

The Tip of the Night for August 4, 2018 discussed how to run Regex searches for multiple strings and collect the complete line on which each appears. The tip showed how to do this using the grep utility PowerGrep. Here's a slightly different approach that uses a list of separate search terms which are not written in regular expression syntax.

1. In PowerGrep, select the folder which contains the files you want to search through in the directory tree at the left.

2. Set the action type to 'Collect Data'.

3. Set file sectioning to 'line by line'.

4. Check off the box for 'Collect/replace whole sections'.

5. Set the search type to 'List of literal text'.

6. Enter a string to search for in the search box, and then press the green plus icon to add additional lines.

7. For each search term, enter \0 in the collect box to return the term searched for (in each case with the complete line). You can also add %PATH% %FILENAME% to collect the file path and file name of each file you are searching.

8. Set target file creation to 'Save results into a single file' and, in Target file location, enter a .csv file to which the search results will be exported.

9. Click Collect. PowerGrep will collect the full line on which each search term appears.
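For readers without PowerGrep, the same idea can be roughly approximated in Python. This is only a sketch under stated assumptions: it does not use or reproduce PowerGrep itself, the function name collect_lines and the CSV column names are hypothetical, matching is case-sensitive, and the sample file and search terms are invented.

```python
import csv
import os
import tempfile

def collect_lines(folder, terms, csv_path):
    """Write every line containing any literal term to a CSV file; return the row count."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            # errors="replace" keeps the scan going if a file is not valid UTF-8.
            with open(full_path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    text = line.rstrip("\r\n")
                    for term in terms:
                        if term in text:  # plain substring test, no regex syntax
                            rows.append([dirpath, filename, term, text])
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "filename", "term", "line"])
        writer.writerows(rows)
    return len(rows)

# Demo with an invented file; the results go to a separate folder so the
# output file is never scanned as input.
with tempfile.TemporaryDirectory() as source_folder, tempfile.TemporaryDirectory() as output_folder:
    with open(os.path.join(source_folder, "memo.txt"), "w", encoding="utf-8") as f:
        f.write("The merger closed in May.\nNothing relevant here.\nSee the indemnity clause.\n")
    results_csv = os.path.join(output_folder, "results.csv")
    hit_count = collect_lines(source_folder, ["merger", "indemnity"], results_csv)
    with open(results_csv, newline="", encoding="utf-8") as f:
        collected = list(csv.reader(f))
    print(hit_count)
    print(collected)
```

Each output row pairs the folder path and file name with the term that hit and the complete line, which mirrors the path, file name and full-line output described in the steps above. A line that contains two of the terms produces two rows, one per term.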

Lemmatization

Apr 28, 2019

When considering document review platforms and their conceptual searching capabilities, ask whether they can account for the lemmatization of words. The lemma of a word is its dictionary form. So the word ‘go’ is the lemma for ‘going’, ‘went’, and ‘gone’, the various inflected forms of the headword ‘go’. The multiple inflections are collectively known as the lexeme of the word. Lemmatization differs from stemming in that it considers the context in which a word is used. Stemming will not find ‘better’, which is part of the lexeme of the lemma ‘good’. Generally, stemming improves the recall of a search: the percentage of the available responsive hits in a review set that are returned. Search algorithms that account for lemmatization improve the precision of searches: the percentage of hits that are true hits as opposed to false positives. A stemming search algorithm may reduce the word ‘crazy’ to the stem ‘crazi’ so that it also accounts for ‘craziness’.
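The difference between the two approaches can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the three suffix rules are invented and are not the algorithm of any real stemmer or review platform, and the small lookup table stands in for a real lemma dictionary. It also does not model the context sensitivity of real lemmatization.

```python
# Toy suffix rules, tried in order; the first rule that matches wins.
SUFFIX_RULES = [("iness", "i"), ("ing", ""), ("y", "i")]

# Hand-written lookup standing in for a real lemma dictionary.
LEMMA_TABLE = {"going": "go", "went": "go", "gone": "go", "better": "good"}

def toy_stem(word):
    """Strip or rewrite a known suffix; purely mechanical, no dictionary."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemma(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for sample in ["crazy", "craziness", "going", "went", "better"]:
    print(sample, "stem:", toy_stem(sample), "lemma:", toy_lemma(sample))
```

Under these toy rules, ‘crazy’ and ‘craziness’ share the stem ‘crazi’, while ‘went’ and ‘better’ pass through the stemmer unchanged and are only connected to ‘go’ and ‘good’ by the dictionary lookup.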

LITIGATION SUPPORT TIP OF THE NIGHT

New tips for paralegals and litigation support professionals are posted to this site each week. Click on the blog headings for more detail.

