Stanford versus Apache Tokenization

Stanford CoreNLP and Apache OpenNLP are two of the most widely used natural language processing toolkits, and both include tokenizers. What are the differences between the two?


1. In addition to tokenization (the division of text into separate tokens, such as words and punctuation marks), both perform sentence segmentation, named entity recognition (NER), and co-reference resolution. NER is the identification of entities such as places, monetary amounts, personal names, and organizations in unstructured text. Co-reference resolution involves finding every expression in a source document that refers to the same entity. Stanford also handles lemmatization (reducing the various inflections of a word to its base form; see the Tip of the Night for April 28, 2019) out of the box, whereas Apache's lemmatizer needs a separately supplied dictionary or trained model.


2. Apache generally runs faster than Stanford, and can handle larger data sets.


3. Stanford typically requires fewer lines of code to accomplish the same task.
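
To make the terms in point 1 a little more concrete, here is a minimal sketch of sentence segmentation and tokenization in plain Python. It uses neither Stanford CoreNLP nor Apache OpenNLP; the function names and regular expressions are illustrative assumptions, and both toolkits use far more capable rules or trained models.

```python
import re

# A token is a word-like run (internal apostrophes or hyphens allowed)
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

# Naive rule: a sentence ends at ., ! or ? when the next
# non-space character is an uppercase letter.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s for s in SENTENCE_BOUNDARY_RE.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


if __name__ == "__main__":
    sample = "Stanford CoreNLP is a toolkit. Apache OpenNLP is another one!"
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

A rule this simple breaks on abbreviations such as "Dr. Smith", where it would wrongly start a new sentence after "Dr."; handling cases like that is a large part of what dedicated toolkits offer.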



Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.

© 2015 by Sean O'Shea.