Term frequency / inverse document frequency
Term frequency / inverse document frequency (tf-idf) is a way of measuring how relevant a word is to a document, by weighing how often the word appears in that document against how many documents in the full set contain it. Words that appear very often in many documents are ranked lower than words that appear a lot in just one document.
The tf-idf score is calculated by multiplying two quantities: the term frequency (the number of times a word appears in a document divided by the document's total word count) and the inverse document frequency (the base-10 logarithm of the total document count divided by the number of documents that contain the word). As we recall from math class in high school, the logarithm of a number is the exponent to which the base must be raised to produce that number. So the logarithm of 1000 with a base of 10 is 3, because 10 x 10 x 10 = 1000.
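The calculation described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the three-document corpus, the function name tf_idf, and the choice to split words on whitespace are all assumptions made for the example.

```python
import math

# Hypothetical toy corpus, made up purely for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

def tf_idf(term, document, corpus):
    """Return the tf-idf of `term` in `document`, using a base-10 logarithm."""
    words = document.split()
    # Term frequency: occurrences of the term divided by the document's word count.
    tf = words.count(term) / len(words)
    # Number of documents in the corpus that contain the term at least once.
    docs_with_term = sum(1 for doc in corpus if term in doc.split())
    if docs_with_term == 0:
        return 0.0  # the term appears nowhere, so there is nothing to score
    # Inverse document frequency: log10(total documents / documents with the term).
    idf = math.log10(len(corpus) / docs_with_term)
    return tf * idf

print(round(tf_idf("the", documents[0], documents), 3))  # 0.0
print(round(tf_idf("mat", documents[0], documents), 3))  # 0.08
```

In this toy corpus "the" appears in every document, so its inverse document frequency is log10(1) = 0 and its score is zero, while "mat" appears in only one document and scores higher.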
So if a document of 1,000 words contains a term 20 times, and the document is part of a set of 10,000 documents, 50 of which contain the term, the tf-idf of the term in that document will be 0.046 (0.02 x 2.301). If 500 documents in the full set contain the term, the tf-idf will be 0.026. If the 1,000-word document instead contains the term 200 times, and 50 of the 10,000 documents contain the term, the tf-idf will be 0.460.
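The three figures above can be checked with a few lines of Python. This is a small sketch working directly from the counts; the function name tf_idf_from_counts and its parameter names are made up for the example.

```python
import math

def tf_idf_from_counts(term_count, total_words, total_docs, docs_with_term):
    """tf-idf computed directly from the four counts, with a base-10 logarithm."""
    tf = term_count / total_words                  # term frequency
    idf = math.log10(total_docs / docs_with_term)  # inverse document frequency
    return tf * idf

# 20 occurrences in 1,000 words; 50 of 10,000 documents contain the term.
print(f"{tf_idf_from_counts(20, 1000, 10_000, 50):.3f}")   # 0.046
# Same document, but 500 of 10,000 documents contain the term.
print(f"{tf_idf_from_counts(20, 1000, 10_000, 500):.3f}")  # 0.026
# 200 occurrences in 1,000 words; 50 of 10,000 documents contain the term.
print(f"{tf_idf_from_counts(200, 1000, 10_000, 50):.3f}")  # 0.460
```

Comparing the first two results shows the effect of the term being more common across the set: the score falls. Comparing the first and third shows the effect of the term being more frequent within the document: the score rises in proportion.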
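The formula can be written like this:
tf-idf = (number of times the term appears in the document / total words in the document) x log10(total documents in the set / number of documents containing the term)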
To get the logarithm, first divide the total number of documents in the set by the number of documents that contain the key term, press equals, and then press the key labeled log10 on a scientific calculator.
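The same two calculator steps can be reproduced in Python, where math.log10 from the standard library plays the role of the log10 key. The variable names and the figures (10,000 documents, 50 containing the term) are chosen only for illustration.

```python
import math

total_docs = 10_000     # documents in the full set
docs_with_term = 50     # documents that contain the key term

# Step 1: divide the total document count by the count containing the term.
ratio = total_docs / docs_with_term

# Step 2: take the base-10 logarithm of the result (the "log10" key).
idf = math.log10(ratio)

print(ratio)           # 200.0
print(round(idf, 3))   # 2.301
```

Python also offers math.log(x, 10), but its documentation notes that math.log10 is usually more accurate for base-10 logarithms, so it is the safer choice here.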