The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer. All content provided on this blog is for informational purposes only. The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of this information. This policy is subject to change at any time. The owner is not an attorney, and nothing posted on this site should be construed as legal advice. Litigation Support Tip of the Night does not provide confirmation that any e-discovery technique or conduct is compliant with legal, regulatory, contractual or ethical requirements.
New tips for paralegals and litigation support professionals are posted to this site each night. Click on the blog headings for more detail.
Nov 7, 2020
Stanford versus Apache Tokenization
Stanford CoreNLP and Apache OpenNLP are two of the most widely used natural language processing toolkits, and both are commonly used for tokenization. What are the differences between the two?
1. In addition to tokenization (the division of text into separate words), both perform sentence segmentation, named entity recognition (NER), and co-reference resolution. NER is the identification of entities such as places, dollar values, personal names, and organizations in unstructured text. Co-reference resolution involves finding every expression in a source document that refers to the same entity, such as a pronoun that points back to a named person. Unlike Apache, Stanford also accounts for lemmatization (reducing the various inflections of a word to its base form - see the Tip of the Night for April 28, 2019).
2. Apache generally runs faster than Stanford, and can handle larger data sets than Stanford can.
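To make the terms in the first point more concrete, here is a minimal Python sketch of sentence segmentation, tokenization, and lemmatization. It does not call Stanford CoreNLP or Apache OpenNLP, and it is not how either toolkit works internally. The regular expressions, the function names, and the tiny `LEMMAS` dictionary are all assumptions made up for illustration; real toolkits rely on trained models and far larger resources.

```python
import re

# Toy rules for illustration only (assumptions, not either toolkit's logic).
# A token is a dollar amount/number, a word (optionally with an apostrophe),
# or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\$?\d+(?:[.,]\d+)*|\w+(?:'\w+)?|[^\w\s]")

# A sentence ends at ., ! or ? followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical mini lemma dictionary mapping inflected forms to base forms.
LEMMAS = {"paid": "pay", "were": "be", "filed": "file",
          "motions": "motion", "fees": "fee"}


def split_sentences(text):
    """Sentence segmentation: break the text at the toy boundary rule."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Tokenization: divide a sentence into words, numbers, and punctuation."""
    return TOKEN_PATTERN.findall(sentence)


def lemmatize(tokens):
    """Lemmatization: look up each lowercased token's base form, if known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


sample = "The defendant paid $1,500 in fees. Two motions were filed on Monday."
sentences = split_sentences(sample)
for sentence in sentences:
    tokens = tokenize(sentence)
    print(tokens)
    print(lemmatize(tokens))
```

Note that the dollar amount stays together as one token, which is the kind of decision on which real tokenizers can differ. A rule this simple would also break on abbreviations such as "Dr. Smith", which is one reason the toolkits use trained models instead.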