

Here's a further explanation of tokenization, the process of identifying meaningful character sequences (tokens) in unstructured text. Splitting on whitespace, on non-alphanumeric characters, or on both will not always work well. For example, in this sentence:

In New York, Sean O'Shea can't get enough sleep.

There are two words for which the tokenization could vary:

O'Shea:

o' shea

o shea

can't:

can t

. . . and we would not want to split on the whitespace within 'New York', which should be a single token.
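To make the problem concrete, here is a minimal Python sketch of the two naive strategies applied to the example sentence. The variable names are illustrative only, and the alphanumeric pattern assumes plain ASCII text.

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Strategy 1: split on runs of whitespace only.
# Punctuation stays glued to the neighboring word ("York,", "sleep.").
whitespace_tokens = sentence.split()

# Strategy 2: split on every run of non-alphanumeric characters.
# Apostrophes now break words apart ("O" / "Shea", "can" / "t").
# Assumption: ASCII letters and digits only.
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", sentence) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

Neither result is satisfactory: the first keeps "York," and "sleep." with their punctuation attached, the second reduces "can't" to "can" and "t", and both treat 'New York' as two separate tokens.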

Some terms, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it's unclear whether a company name like 'Mercedes-Benz' should be one token or two.
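One possible way to handle hyphens is to split them by default and keep a list of exceptions. The sketch below assumes such a list exists; `KEEP_WHOLE` and `split_hyphens` are hypothetical names, and whether 'Mercedes-Benz' belongs in the list is exactly the kind of policy decision described above.

```python
# Hypothetical exception list: hyphenated terms to keep as a single token.
KEEP_WHOLE = {"e-discovery"}

def split_hyphens(words):
    """Split hyphenated words into parts unless they are listed in KEEP_WHOLE."""
    tokens = []
    for word in words:
        if "-" in word and word.lower() not in KEEP_WHOLE:
            # Drop empty strings produced by leading or trailing hyphens.
            tokens.extend(part for part in word.split("-") if part)
        else:
            tokens.append(word)
    return tokens

print(split_hyphens("poorly-thought-out strategy".split()))
print(split_hyphens(["e-discovery", "Mercedes-Benz"]))
```

With this list, 'poorly-thought-out strategy' yields four tokens and 'e-discovery' stays whole, while 'Mercedes-Benz' is split in two only because it was not added to the exceptions.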

A lexeme is a sequence of characters in the source data that matches the pattern for a token. Tokenization often works by using regular expressions to find lexemes in a stream of text and then categorizing them as tokens.
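The regular-expression approach can be sketched in Python as follows. The category names (`WORD`, `NUMBER`, `PUNCT`) and their patterns are assumptions chosen for illustration, not a standard; a real tokenizer would tune them to its data.

```python
import re

# Hypothetical token categories. Each pattern describes the lexemes it accepts.
TOKEN_SPEC = [
    # Letters, optionally joined by internal apostrophes or hyphens.
    ("WORD", r"[A-Za-z]+(?:['-][A-Za-z]+)*"),
    ("NUMBER", r"\d+"),
    # Any single character that is neither a word character nor whitespace.
    ("PUNCT", r"[^\w\s]"),
]

# Combine the patterns into one regex with a named group per category.
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (category, lexeme) pairs for each lexeme found in the text."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

tokens = list(tokenize("In New York, Sean O'Shea can't get enough sleep."))
for category, lexeme in tokens:
    print(category, lexeme)
```

Under these patterns "O'Shea" and "can't" each come out as a single WORD token, and the comma and period become PUNCT tokens. The sketch still emits 'New' and 'York' separately, and it keeps every hyphenated word whole, so multiword names and hyphen policy would need an additional step such as a dictionary lookup.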

