Tokenization
- Sean O'Shea
- Nov 6, 2020
- 1 min read
Here's a further explanation of tokenization, the process of identifying the character sequences (tokens) that make up unstructured text. Identifying tokens on the basis of whitespace and/or all non-alphanumeric characters will not always work well. For example, consider this sentence:
In New York, Sean O'Shea can't get enough sleep.
There are two words whose tokenization could vary:
For O'Shea:
- shea
- oshea
- o'shea
- o' shea
- o shea
For can't:
- can't
- cant
- can t
Note also that we would not want to use whitespace to separate 'New York', which should be a single token.
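As a rough illustration of why the naive approaches fall short, here's a minimal Python sketch (my own example, not taken from any particular toolkit) comparing a whitespace-only split with a split on non-alphanumeric characters:

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Naive approach 1: split on whitespace only.
# Punctuation stays glued to words: "York,", "sleep."
whitespace_tokens = sentence.split()

# Naive approach 2: keep only runs of alphanumeric characters.
# Apostrophes are destroyed: "O'Shea" -> "O", "Shea";
# "can't" -> "can", "t". "New York" is still two tokens either way.
alnum_tokens = re.findall(r"[A-Za-z0-9]+", sentence)

print(whitespace_tokens)
# ['In', 'New', 'York,', 'Sean', "O'Shea", "can't", 'get', 'enough', 'sleep.']
print(alnum_tokens)
# ['In', 'New', 'York', 'Sean', 'O', 'Shea', 'can', 't', 'get', 'enough', 'sleep']
```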
Some words, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it's unclear whether a company name like 'Mercedes-Benz' should be one token or two.
'Lexeme' is the term for a sequence of characters in the source data that matches the pattern for a token. Tokenization often works by using regular expressions to find lexemes in a stream of text, which are then categorized as tokens.
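As a rough sketch of that idea, the Python example below uses a regular expression to find lexemes in a stream of text and label each with a token category. The category names and patterns here are my own illustrative choices, not a standard:

```python
import re

# Each named alternative is a token category; the matched text is the lexeme.
TOKEN_SPEC = re.compile(r"""
    (?P<WORD>[A-Za-z]+(?:['\-][A-Za-z]+)*)   # words, keeping internal ' and -
  | (?P<NUMBER>\d+(?:\.\d+)?)                # integers and decimals
  | (?P<PUNCT>[^\w\s])                       # any single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Yield (category, lexeme) pairs for each match in the stream."""
    for match in TOKEN_SPEC.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("Sean O'Shea can't get the Mercedes-Benz.")))
# [('WORD', 'Sean'), ('WORD', "O'Shea"), ('WORD', "can't"), ('WORD', 'get'),
#  ('WORD', 'the'), ('WORD', 'Mercedes-Benz'), ('PUNCT', '.')]
```

Note that the WORD pattern bakes in one answer to the questions above: it keeps O'Shea, can't, and Mercedes-Benz each as a single token. A different regular expression would make different choices.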