Tokenization

- Nov 7, 2020

Stanford versus Apache Tokenization

Stanford CoreNLP and Apache OpenNLP are two of the most widely used natural language processing toolkits, and both include tokenizers. What are the differences between the two?

1. In addition to tokenization (the division of text into separate words), both perform sentence segmentation, named entity recognition (NER), and co-reference resolution. NER is the identification of entities such as places, dollar values, personal names, and organizations in unstructured text. Co-reference resolution involves finding every reference to an entity in a source document. Unlike Apache, Stanford also accounts for lemmatization (reducing the various inflections of a word to its base form; see the Tip of the Night for April 28, 2019).

2. Apache works faster than Stanford, and will work with larger data sets than Stanford can.

3. Stanford requires fewer lines of code.

- Nov 6, 2020

Tokenization

Here's a further explanation of tokenization, the process of identifying character sequences in unstructured text. Identifying tokens on the basis of whitespace and/or all non-alphanumeric characters will not always work well. For example, in this sentence:

In New York, Sean O'Shea can't get enough sleep.

There are two words for which the tokenization could vary:

shea

oshea

o'shea

o' shea

o shea

can't

cant

can t

. . . and we would not want to use whitespace to separate 'New York', which should be a single token.
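Here's a minimal Python sketch of the two naive approaches applied to the sample sentence. The variable names are only for illustration; it uses nothing beyond the standard `re` module.

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Naive approach 1: split on whitespace only. Punctuation stays attached
# to the neighboring word, and 'New York' becomes two tokens.
whitespace_tokens = sentence.split()

# Naive approach 2: split on every run of non-alphanumeric characters.
# The apostrophes break O'Shea and can't into separate pieces.
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", sentence) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

The first list keeps 'York,' and 'sleep.' with their punctuation, and the second turns O'Shea into 'O' and 'Shea' and can't into 'can' and 't'. Neither treats 'New York' as one token.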

Some terms, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token, and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it's unclear whether a company name like 'Mercedes-Benz' should be one token or two.

Lexeme is the term used to identify a sequence of characters from source data that matches a token. Tokenization often works by using regular expressions to find lexemes in a stream of text, which are then categorized as tokens.
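As a rough sketch of the regular expression approach, the Python below matches lexemes in a string and labels each one with a token category. The category names (NUMBER, WORD, PUNCT) and their patterns are my own assumptions for illustration, not a standard.

```python
import re

# Hypothetical token categories; names and patterns are illustrative only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),               # 45 or 45.50
    ("WORD", r"[A-Za-z]+(?:['-][A-Za-z]+)*"),   # can't, O'Shea, e-discovery
    ("PUNCT", r"[^\w\s]"),                      # any single punctuation mark
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Return (category, lexeme) pairs for each match; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

tokens = lex("Sean O'Shea can't pay $45.50.")
for category, lexeme in tokens:
    print(category, lexeme)
```

With these patterns O'Shea and can't each come out as a single WORD token, because the WORD pattern allows an internal apostrophe or hyphen. A different pattern would produce a different tokenization, which is the point made above.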

- May 20, 2019

Decompounding and Tokenization

Decompounding is an important facet of tokenization in languages that use a lot of compound words, such as German and Finnish. Tokenization involves using a lexer or scanner program to convert characters into tokens with specific meaning, and tokens aren't necessarily listed as words in a dictionary. When words use hyphens, it may or may not make sense to treat the characters on either side of the hyphen as separate tokens. So, 'forty-five' should be treated as one token, but the term 'Manhattan-based' should be two tokens.
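One simple way to express the hyphen rule is sketched below in Python. The rule itself (keep a hyphenated term together only when every part is a number word) and the `NUMBER_WORDS` list are assumptions made for this example; a real tokenizer would need a much fuller set of rules.

```python
# Hypothetical word list used only for this illustration.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
}

def split_hyphenated(term):
    """Keep spelled-out numbers whole; split other hyphenated terms at the hyphen."""
    parts = term.split("-")
    if all(part.lower() in NUMBER_WORDS for part in parts):
        return [term]
    return parts

print(split_hyphenated("forty-five"))
print(split_hyphenated("Manhattan-based"))
```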

Compound words like Schreibtischcomputer (which means desktop computer) must be decompounded so that Schreibtisch and computer are separate tokens. Decompounding should be performed before keyword frequency counts are generated. An article published by academics with the Ubiquitous Knowledge Processing Lab of Technische Universität Darmstadt discusses how decompounding affects keyphrase extraction. See Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch, and Iryna Gurevych, Counting What Counts: Decompounding for Keyphrase Extraction (2015), in Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 10–17, available at https://www.aclweb.org/anthology/W15-3603.

The authors describe five common algorithms for decompounding. A Left-to-Right algorithm will review a word from the left and generate a split when a dictionary word is found. JWord Splitter will go from left to right and only generate a split if the remainder of the word is also a dictionary word. Banana Splitter searches words from right to left and will only generate a split for the longest possible dictionary word. The Data Driven algorithm will review every position in a word, and make a split at the position where the prefix count in a dictionary is greatest, that is, where the prefix is used by the highest number of compound words. The ASV Toolbox method uses a radix tree to recursively search for splits.
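To make the first two descriptions concrete, here's a simplified Python sketch. It follows the summaries above rather than the actual tools, so it should not be read as how JWord Splitter is really implemented. The tiny `DICTIONARY` and the function names are assumptions for the example.

```python
# Toy dictionary, assumed for illustration only.
DICTIONARY = {"schreib", "tisch", "schreibtisch", "computer"}

def left_to_right(word, dictionary):
    """Scan from the left and split as soon as the current piece is a dictionary word."""
    word = word.lower()
    parts, start = [], 0
    for end in range(1, len(word) + 1):
        if word[start:end] in dictionary:
            parts.append(word[start:end])
            start = end
    if start < len(word):
        parts.append(word[start:])  # leftover characters that matched nothing
    return parts

def split_if_remainder_known(word, dictionary):
    """Split at the first position where the prefix AND the remainder are dictionary words."""
    word = word.lower()
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in dictionary and tail in dictionary:
            return [head, tail]
    return [word]

print(left_to_right("Schreibtischcomputer", DICTIONARY))
print(split_if_remainder_known("Schreibtischcomputer", DICTIONARY))
```

With this toy dictionary the greedy left-to-right scan over-splits into schreib, tisch, and computer, while the remainder check waits until it can produce Schreibtisch and computer.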

A different approach is to only use base words of 4 or more characters, and only use compound parts which appear more frequently in a collection than the compound words.
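This frequency-based filter can be sketched in Python as follows. The counts are invented toy numbers, and `accept_split` is a hypothetical helper name; the sketch only checks a proposed split against the two conditions described above.

```python
from collections import Counter

def accept_split(compound, parts, counts, min_len=4):
    """Accept a split only if every part has at least min_len characters
    and appears more often in the collection than the compound itself."""
    return all(len(p) >= min_len and counts[p] > counts[compound] for p in parts)

# Invented frequency counts for a pretend document collection.
counts = Counter({"schreibtisch": 12, "computer": 40, "schreibtischcomputer": 3})

good = accept_split("schreibtischcomputer", ["schreibtisch", "computer"], counts)
bad = accept_split("schreibtischcomputer", ["schreib", "tisch", "computer"], counts)
print(good, bad)
```

The second split is rejected because 'schreib' never appears in the toy counts, so it is not more frequent than the compound.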

LITIGATION SUPPORT TIP OF THE NIGHT

