The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer. All content provided on this blog is for informational purposes only. The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of this information. This policy is subject to change at any time. The owner is not an attorney, and nothing posted on this site should be construed as legal advice. Litigation Support Tip of the Night does not provide confirmation that any e-discovery technique or conduct is compliant with legal, regulatory, contractual or ethical requirements.
New tips for paralegals and litigation support professionals are posted to this site each night. Click on the blog headings for better detail.
Nov 5, 2020
Unstructured Information Management Architecture (UIMA)
The Unstructured Information Management Architecture (UIMA) supports the analysis of unstructured content, such as email messages and documents, that is not stored in a relational database. UIMA was developed by IBM, and the source code is now freely available through the Apache Software Foundation. See: https://uima.apache.org/ . UIMA helps to find terms and topics in documents, and allows search engines to search for concepts rather than just keywords. Boundaries are marked around words and phrases to identify the names of organizations, products, places, or persons. UIMA also helps to extract information from unstructured data in order to create structured data.
UIMA tokenizes and annotates documents with the following methods:
1. Segmenting strings by whitespace.
2. Regular expression searching to find email addresses, phone numbers, and other common entities.
3. Dictionary word lists.
4. A word stemming algorithm named Snowball.
5. The Tika toolkit to extract metadata and structured text from documents so that the text can be annotated.
6. The ConceptMapper annotator to map dictionary entries to text, including matching multi-term entries against non-contiguous text.
7. The AlchemyAPI annotator to assign text to categories and to perform language identification.
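The first three methods in the list can be illustrated without UIMA itself. What follows is a minimal Python sketch, not UIMA's API: the `Annotation` tuple, the function names, the simplified regular expressions, and the sample word list are all assumptions made for illustration. Each match is recorded as a typed span with begin and end offsets, which is similar in spirit to how an annotator marks up text.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for an annotation: a typed span of text."""
    kind: str
    begin: int
    end: int
    text: str


def whitespace_tokens(text):
    """Method 1: segment the string on whitespace, keeping offsets."""
    return [Annotation("Token", m.start(), m.end(), m.group())
            for m in re.finditer(r"\S+", text)]


# Simplified patterns for illustration only; real email addresses and
# phone numbers come in many more formats than these expressions allow.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def regex_entities(text):
    """Method 2: find common entities with regular expressions."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Annotation(kind, m.start(), m.end(), m.group()))
    # Return the matches in document order.
    return sorted(found, key=lambda a: a.begin)


def dictionary_matches(text, words):
    """Method 3: flag words that appear in a word list (case-insensitive)."""
    lookup = {w.lower() for w in words}
    return [Annotation("DictionaryEntry", m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+", text)
            if m.group().lower() in lookup]


if __name__ == "__main__":
    sample = "Email jane.doe@example.com or call 212-555-0147 about the subpoena."
    for ann in regex_entities(sample) + dictionary_matches(sample, ["subpoena"]):
        print(ann)
```

Because every annotation carries its offsets, the results can be written out as rows of structured data (type, position, matched text) while the original document is left unchanged.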
Tokenization separates text into words, characters, or n-grams (contiguous sequences of N items, such as characters, syllables, or words). UIMA also provides a way to analyze unstructured content that is multimodal, meaning that it contains images, video, and other nontextual data alongside text.
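To make the idea of an n-gram concrete, here is a short Python sketch. It is a generic illustration, not code from UIMA, and the function name `ngrams` is an assumption chosen for this example. It slides a window of N items across a sequence, so the same function works for word n-grams and character n-grams.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    items = list(items)
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


if __name__ == "__main__":
    words = "the quick brown fox".split()
    print(ngrams(words, 2))   # word bigrams
    print(ngrams("text", 3))  # character trigrams
```

With the four-word sample, the word bigrams are ("the", "quick"), ("quick", "brown"), and ("brown", "fox"); the character trigrams of "text" are ("t", "e", "x") and ("e", "x", "t").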