Litigation Support Tip of the Night

May 20, 2019

Decompounding is an important facet of tokenization in languages that use a lot of compound words, such as German and Finnish.  Tokenization involves using a lexer or scanner program to convert characters into tokens with specific meaning, and tokens aren't necessarily words listed in a dictionary.   When words use hyphens, it may or may not make sense to treat the characters on either side of the hyphen as separate tokens.  For example, 'forty-five' should be treated as one token, but the term 'Manhattan-based' should be treated as two tokens.
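The hyphen rule can be sketched in a few lines of Python.  This is only an illustration, not how any particular lexer works: the tokenize function and the short NUMBER_WORDS list are made up for this example, and the rule (keep a hyphenated term together only when every part is a number word) is an assumption chosen to fit the two examples above.

```python
import re

# Hypothetical, deliberately tiny list of number words (an assumption for this sketch).
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
}

def tokenize(text):
    """Split text into tokens, keeping spelled-out numbers like 'forty-five' whole."""
    tokens = []
    # Match runs of letters, optionally joined by internal hyphens.
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        parts = word.split("-")
        if len(parts) > 1 and all(p.lower() in NUMBER_WORDS for p in parts):
            tokens.append(word)    # spelled-out number: keep as one token
        else:
            tokens.extend(parts)   # otherwise split at the hyphens
    return tokens

print(tokenize("forty-five Manhattan-based firms"))
# ['forty-five', 'Manhattan', 'based', 'firms']
```

A real tokenizer would need a much fuller set of rules, but the sketch shows why a hyphen alone can't decide where a token ends.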

Compound words like Schreibtischcomputer (which means desktop computer) must be decompounded so that Schreibtisch and Computer are separate tokens.   Decompounding should be performed before keyword frequency counts are generated; otherwise a word that appears only inside compounds is never counted on its own.    An article published by academics with the Ubiquitous Knowledge Processing Lab of the University of Darmstadt discusses how decompounding affects keyphrase extraction.   See Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch, and Iryna Gurevych, Counting What Counts: Decompounding for Keyphrase Extraction (2015), in Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 10–17, available at https://www.aclweb.org/anthology/W15-3603.

The authors describe five common algorithms for decompounding.   A Left-to-Right algorithm reviews a word from the left and generates a split as soon as a dictionary word is found.  JWord Splitter also goes from left to right, but only generates a split if the remainder of the word is also a dictionary word.  Banana Splitter searches words from right to left and only generates a split for the longest possible dictionary word.  The Data Driven algorithm reviews every position in a word and makes a split at the position where the prefix count in a dictionary is greatest, that is, where the prefix is used by the highest number of compound words.   The ASV Toolbox method uses a radix tree to recursively search for splits.
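The difference between the first two approaches is easier to see in code.  The Python below is a rough sketch based only on the descriptions above, not the actual implementation of the Left-to-Right algorithm or of JWord Splitter.  The function names, the min_len parameter, and the toy dictionary are all assumptions made for illustration.

```python
# Toy dictionary (hypothetical); a real splitter would load a full word list.
TOY_DICTIONARY = {"schreib", "tisch", "schreibtisch", "computer"}

def split_left_to_right(word, dictionary, min_len=3):
    """Greedy sketch: split as soon as the current segment is a dictionary word."""
    word = word.lower()
    parts, start = [], 0
    for end in range(1, len(word) + 1):
        segment = word[start:end]
        if len(segment) >= min_len and segment in dictionary:
            parts.append(segment)
            start = end
    if start < len(word):
        parts.append(word[start:])  # leftover characters that matched nothing
    return parts

def split_with_remainder_check(word, dictionary, min_len=3):
    """Sketch of the stricter rule: split only if the remainder is also a dictionary word."""
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            return [left, right]
    return [word]  # no acceptable split found

print(split_left_to_right("Schreibtischcomputer", TOY_DICTIONARY))
# ['schreib', 'tisch', 'computer']
print(split_with_remainder_check("Schreibtischcomputer", TOY_DICTIONARY))
# ['schreibtisch', 'computer']
```

With this toy dictionary the greedy version splits too early at 'schreib', while the remainder check skips that position because 'tischcomputer' is not a dictionary word.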

A different approach is to use only base words of four or more characters, and to accept only compound parts that appear more frequently in a collection than the compound word itself.
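As a rough sketch, that filter could look like the Python below.  The accept_split function and the frequency counts are invented for illustration; in practice the counts would come from the document collection being searched.

```python
from collections import Counter

def accept_split(compound, parts, counts, min_len=4):
    """Accept a split only if every part is long enough and more frequent than the compound."""
    return all(
        len(part) >= min_len and counts[part] > counts[compound]
        for part in parts
    )

# Hypothetical frequency counts (made-up numbers); a Counter returns 0 for unseen words.
counts = Counter({
    "schreibtischcomputer": 2,
    "schreibtisch": 15,
    "computer": 40,
    "bahnhof": 30,
    "bahn": 50,
    "hof": 45,
})

print(accept_split("schreibtischcomputer", ["schreibtisch", "computer"], counts))  # True
print(accept_split("bahnhof", ["bahn", "hof"], counts))  # False: 'hof' has only 3 characters
```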

Sean O'Shea has more than 15 years of experience in the litigation support field with major law firms in New York and San Francisco.   He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.
