Using Trigrams on Documents with Poor OCR

An Analytics profile in Relativity can apply a filter that uses trigrams to vet the OCR text of a set of scanned documents. Trigrams are three-letter sequences that appear frequently in English. Relativity recommends against applying its OCR filter unless the majority of documents have poor OCR. Studies have shown that the use of trigrams may not greatly improve the quality of optical character recognition text. A study published in 2017, Trigram-based algorithms for OCR result correction, found that the use of n-grams improved the quality of OCR by only six per cent. See Bulatov, K., Trigram-based algorithms for OCR result correction, Proceedings Volume 10341, Ninth International Conference on Machine Vision, 103410O (2017); doi: 10.1117/12.2268559.
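To make the idea of vetting text with trigrams concrete, here is a minimal Python sketch. It is not Relativity's implementation, which is not described here. The trigram list, the scoring rule, and the threshold are all assumptions chosen for illustration.

```python
# Tiny illustrative set of frequent English trigrams. This is an assumption
# for the sketch; it is not the list Relativity uses.
COMMON_TRIGRAMS = {
    "the", "and", "ing", "ion", "tio", "ent",
    "her", "for", "tha", "ter", "ati", "ere",
}


def char_trigrams(token):
    """Return every three-character window in a token, lowercased."""
    letters = token.lower()
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


def trigram_score(token, known=COMMON_TRIGRAMS):
    """Return the fraction of a token's trigrams found in the known set."""
    grams = char_trigrams(token)
    if not grams:
        # Tokens shorter than three characters have no trigrams to check.
        return 0.0
    return sum(g in known for g in grams) / len(grams)


def vet_tokens(text, threshold=0.2):
    """Keep whitespace-separated tokens scoring at or above the threshold.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return [t for t in text.split() if trigram_score(t) >= threshold]


if __name__ == "__main__":
    print(vet_tokens("There xq#zv mentioning"))
```

In this sketch a clean word such as "There" is kept because all of its trigrams are in the list, while a garbled token such as "xq#zv" is dropped. Words shorter than three letters score zero, so a real filter would need to handle them separately.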

You can get an idea of how n-grams work by experimenting with the Online N-Gram Analyzer. It is a word-based, rather than character-based, n-gram generator, but it will give you an idea of how text can be segmented into n-grams.
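The difference between word-based and character-based n-grams can be shown in a few lines of Python. This is a rough sketch with hypothetical function names; it does not reproduce how the Online N-Gram Analyzer or Relativity segments text.

```python
def word_ngrams(text, n):
    """Return each run of n consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return each run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Word bigrams slide a two-word window across the sentence.
    print(word_ngrams("the quick brown fox", 2))
    # Character trigrams slide a three-character window across one word.
    print(char_ngrams("scan", 3))
```

A word-based generator works on whole words, so it cannot see damage inside a word. A character-based trigram check looks at the letters themselves, which is why it suits OCR errors.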

Relativity's OCR Filter can be set on an Analytics Profile. Select the Extended view.

Note that if the option to remove email footers and signatures is selected when creating a search index, the OCR filter will automatically be disabled.
