
Decompounding and Tokenization


Decompounding is an important facet of tokenization in languages that make heavy use of compound words, such as German and Finnish. Tokenization uses a lexer or scanner program to convert a stream of characters into tokens with specific meaning; a token is not necessarily a word listed in a dictionary. When words contain hyphens, it may or may not make sense to treat the text on either side of the hyphen as separate tokens: 'forty-five' should be treated as one token, but 'Manhattan-based' should be two.
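As a rough illustration (not any particular tool's implementation), a tokenizer can keep a short exception list of hyphenated forms that should stay whole and split everything else at the hyphen. The exception list, regular expression, and function name below are made up for this example.

import re

# Made-up exception list: hyphenated forms that should remain single tokens.
KEEP_WHOLE = {"forty-five", "twenty-one", "self-esteem"}

def tokenize(text):
    tokens = []
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        if "-" in word and word.lower() not in KEEP_WHOLE:
            tokens.extend(word.split("-"))   # 'Manhattan-based' -> 'Manhattan', 'based'
        else:
            tokens.append(word)              # 'forty-five' stays one token
    return tokens

print(tokenize("a forty-five page brief from a Manhattan-based firm"))
# ['a', 'forty-five', 'page', 'brief', 'from', 'a', 'Manhattan', 'based', 'firm']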

Compound words like Schreibtischcomputer (which means desktop computer) must be decompounded so that Schreibtisch and Computer become separate tokens. Decompounding should be performed before keyword frequency counts are generated. An article published by researchers with the Ubiquitous Knowledge Processing Lab at Technische Universität Darmstadt discusses how decompounding affects keyphrase extraction. See Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch, and Iryna Gurevych, Counting What Counts: Decompounding for Keyphrase Extraction (2015), in Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 10–17, available at https://www.aclweb.org/anthology/W15-3603.

The authors describe five common algorithms for decompounding. A Left-to-Right algorithm reviews a word from the left and generates a split as soon as a dictionary word is found. JWord Splitter also works from left to right, but only generates a split if the remainder of the word is also a dictionary word. Banana Splitter searches from right to left and only generates a split for the longest possible dictionary word. The Data Driven algorithm reviews every position in a word and splits where the prefix count in the dictionary is greatest, that is, where the highest number of compound words share the prefix. The ASV Toolbox method uses a radix tree to recursively search for splits.
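Two of these strategies can be sketched in a few lines of Python. This is only a toy illustration: the dictionary below has just four made-up entries, there is no handling of German linking elements, and the tools evaluated in the paper are more sophisticated.

# Toy dictionary, for illustration only.
DICTIONARY = {"schreib", "tisch", "schreibtisch", "computer"}

def left_to_right(word, dictionary=DICTIONARY):
    # Split as soon as a dictionary word is found at the start of the word.
    word = word.lower()
    for i in range(1, len(word)):
        if word[:i] in dictionary:
            return [word[:i]] + left_to_right(word[i:], dictionary)
    return [word]

def jword_style(word, dictionary=DICTIONARY):
    # Split only when the remainder of the word is also a dictionary word.
    word = word.lower()
    for i in range(1, len(word)):
        if word[:i] in dictionary and word[i:] in dictionary:
            return [word[:i], word[i:]]
    return [word]

print(left_to_right("Schreibtischcomputer"))  # ['schreib', 'tisch', 'computer'] - splits too eagerly
print(jword_style("Schreibtischcomputer"))    # ['schreibtisch', 'computer']

The other methods differ mainly in the direction of the search and in how candidate splits are ranked.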

A different approach is to consider only base words of four or more characters, and to use only compound parts that appear more frequently in the document collection than the compound word itself.
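A rough sketch of that kind of frequency filter, again with made-up counts and function names, might look like this:

from collections import Counter

def filter_split(compound, parts, term_counts, min_len=4):
    # Keep a split only if every part is at least min_len characters long
    # and occurs more often in the collection than the compound itself;
    # otherwise fall back to the unsplit compound.
    if all(len(p) >= min_len and term_counts[p] > term_counts[compound] for p in parts):
        return parts
    return [compound]

# Made-up collection frequencies.
counts = Counter({"schreibtisch": 12, "computer": 40, "schreibtischcomputer": 3})

print(filter_split("schreibtischcomputer", ["schreibtisch", "computer"], counts))
# ['schreibtisch', 'computer']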


Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.

© 2015 by Sean O'Shea.
