Decompounding and Tokenization

May 19, 2019

Decompounding is an important part of tokenization for languages that make heavy use of compound words, such as German and Finnish. Tokenization uses a lexer or scanner program to convert characters into tokens with specific meaning, and those tokens do not necessarily correspond to words listed in a dictionary. When a word contains a hyphen, it may or may not make sense to treat the characters on either side of the hyphen as separate tokens. For example, 'forty-five' should be treated as one token, but the term 'Manhattan-based' should be two tokens.
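The hyphen rule can be sketched in a few lines of Python. This is only an illustration, not how any particular lexer works: the `NUMBER_WORDS` set and the `tokenize` function are hypothetical names, and the rule "keep a hyphenated term whole only when every part is a number word" is an assumption chosen to reproduce the two examples above.

```python
import re

# Hypothetical, deliberately small list of number words (an assumption for
# this sketch; a real tokenizer would rely on a fuller lexicon).
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
}

def tokenize(text):
    """Split text into tokens, keeping hyphenated number words whole."""
    tokens = []
    # Match runs of letters, optionally joined by single hyphens.
    for raw in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        parts = raw.split("-")
        if len(parts) > 1 and all(p.lower() in NUMBER_WORDS for p in parts):
            tokens.append(raw)    # e.g. forty-five stays one token
        else:
            tokens.extend(parts)  # e.g. Manhattan-based becomes two tokens
    return tokens

print(tokenize("forty-five Manhattan-based firms"))
```

Any term not covered by the number-word list is split at its hyphens, so the list decides which hyphenated terms survive as a single token.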

Compound words like Schreibtischcomputer (which means desktop computer) must be decompounded so that Schreibtisch and Computer are separate tokens. Decompounding should be performed before keyword frequency counts are generated. An article published by researchers at the Ubiquitous Knowledge Processing Lab of the Technical University of Darmstadt discusses how decompounding affects keyphrase extraction. See Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch, and Iryna Gurevych, Counting What Counts: Decompounding for Keyphrase Extraction (2015), in Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 10–17, available at https://www.aclweb.org/anthology/W15-3603.

The authors describe five common algorithms for decompounding. A Left-to-Right algorithm scans a word from the left and generates a split whenever a dictionary word is found. JWord Splitter also goes from left to right, but only generates a split if the remainder of the word is also a dictionary word. Banana Splitter searches words from right to left and only generates a split for the longest possible dictionary word. The Data Driven algorithm reviews every position in a word and splits at the position where the prefix count in a dictionary is greatest, that is, where the prefix is used by the highest number of compound words. The ASV Toolbox method uses a radix tree to recursively search for splits.
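To make the difference between the first two strategies concrete, here is a minimal Python sketch. It follows only the one-sentence descriptions above and is not the implementation used in the paper or in the JWord Splitter tool. The function names and the four-word `DICTIONARY` are hypothetical.

```python
# Hypothetical toy dictionary (an assumption for this sketch).
DICTIONARY = {"schreib", "tisch", "schreibtisch", "computer"}

def split_left_to_right(word, dictionary):
    """Scan from the left and split as soon as a dictionary word is found."""
    word = word.lower()
    parts, start = [], 0
    for end in range(1, len(word) + 1):
        if word[start:end] in dictionary:
            parts.append(word[start:end])
            start = end
    if start < len(word):
        parts.append(word[start:])  # keep any unmatched remainder
    return parts

def split_if_remainder_known(word, dictionary):
    """Scan from the left, but split only if the remainder is also a dictionary word."""
    word = word.lower()
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in dictionary and tail in dictionary:
            return [head, tail]
    return [word]

print(split_left_to_right("Schreibtischcomputer", DICTIONARY))
print(split_if_remainder_known("Schreibtischcomputer", DICTIONARY))
```

With this toy dictionary the first function splits greedily into schreib, tisch and computer, while the second rejects the split after schreib (because tischcomputer is not a dictionary word) and returns schreibtisch and computer.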

A different approach is to use only base words of four or more characters, and to use only compound parts which appear more frequently in a collection than the compound word itself.
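The frequency-based approach might look like the following Python sketch. The function name, the restriction to a single two-way split, and the sample counts are all assumptions made for illustration. Only the four-character minimum and the "parts more frequent than the compound" test come from the description above.

```python
from collections import Counter

def frequency_filtered_split(word, counts, min_len=4):
    """Split a word in two only if both parts have at least min_len
    characters and each occurs more often in the collection than the
    whole word does."""
    word = word.lower()
    whole = counts.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if counts.get(head, 0) > whole and counts.get(tail, 0) > whole:
            return [head, tail]
    return [word]

# Hypothetical term counts from a document collection.
counts = Counter({
    "schreibtischcomputer": 1,
    "schreibtisch": 5,
    "computer": 9,
    "schreib": 3,
})

print(frequency_filtered_split("Schreibtischcomputer", counts))
print(frequency_filtered_split("Computer", counts))
```

Here the compound is split because both parts are more frequent than the compound itself, while the word computer is left whole because none of its candidate parts appears in the counts.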
