What is Term Frequency?

Term Frequency (TF)

Term frequency (TF), often used in text mining, NLP, and information retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since documents differ in length, a term is likely to appear more often in a longer document than in a shorter one. Thus, term frequency is often divided by the total number of terms in the document as a form of normalization.
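The length-normalized variant described above can be sketched as follows. This is a minimal illustration, assuming a naive lowercase whitespace tokenizer; real pipelines would use a proper tokenizer:

```python
from collections import Counter

def term_frequencies(document):
    """Length-normalized term frequencies: count of each term
    divided by the total number of tokens in the document."""
    tokens = document.lower().split()  # naive tokenization for illustration
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("the dow fell as the market slid")
# 'the' occurs 2 times out of 7 tokens, so tf['the'] == 2/7
```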

There are other ways to normalize term frequencies, including dividing by the maximum term frequency in the document or by the average term frequency.
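The two alternatives just mentioned can be sketched like this (again assuming a naive whitespace tokenizer, for illustration only):

```python
from collections import Counter

def tf_max_normalized(document):
    """Each term's count divided by the count of the document's
    most frequent term, so the top term gets weight 1.0."""
    counts = Counter(document.lower().split())
    max_count = max(counts.values())
    return {term: c / max_count for term, c in counts.items()}

def tf_avg_normalized(document):
    """Each term's count divided by the average count per
    distinct term in the document."""
    counts = Counter(document.lower().split())
    avg = sum(counts.values()) / len(counts)
    return {term: c / avg for term, c in counts.items()}
```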

Term Frequency in Practice

Term frequencies are often used to characterize documents. In theory, the more frequently a term appears in a document, the more that term characterizes the document. However, there are limitations to this assumption. Let’s take the following news article about the Dow.


The top occurring terms are the ones that appear in large fonts below. Notice that common words such as ‘the’ and ‘and’ with low information tend to dominate the counts. This is inevitable, since in every spoken language, you will inherently have determiners, connectors and conjunctions to make sentences flow.

There are two ways you can improve the ranking of these words so that topic words appear more prominently. The first approach is to eliminate all stop words (common words) such as ‘the’, ‘is’, ‘are’ and so on before computing the term frequencies. Here is an example with some of the stop words removed, where the larger fonts indicate higher term frequencies:
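The stop-word approach amounts to filtering tokens against a known list before counting. A minimal sketch, using a tiny illustrative stop-word set (real systems use much larger curated lists):

```python
from collections import Counter

# Illustrative subset; production stop-word lists are far longer.
STOP_WORDS = {"the", "is", "are", "and", "a", "an", "of", "to", "in"}

def term_frequencies_no_stopwords(document, stop_words=STOP_WORDS):
    """Length-normalized term frequencies computed after
    discarding stop words."""
    tokens = [t for t in document.lower().split() if t not in stop_words]
    total = len(tokens)
    return {term: c / total for term, c in Counter(tokens).items()}

tf = term_frequencies_no_stopwords("the dow is in a recession and the market slid")
# only 'dow', 'recession', 'market', 'slid' remain after filtering
```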

Notice that now it becomes much clearer that the document in question actually talks about economic recession.

Another way to suppress common words and surface topic words is to multiply the term frequencies by what are called Inverse Document Frequencies (IDF). IDF is a weight indicating how widely a word is used: the more frequent its usage across documents, the lower its score. For example, the word ‘the’ would appear in almost all English texts and thus would have a very low inverse document frequency.

Multiplying term frequencies by IDFs dampens the weight of highly frequent words and boosts the prominence of important topic words. This is the basis of the widely used TF-IDF weighting.
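Putting the two pieces together, a basic TF-IDF computation over a small corpus can be sketched as below. This uses the common log(N / df) form of IDF (other variants add smoothing) and the same naive tokenization as before:

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF weights for a list of documents (plain strings).
    Uses length-normalized TF and idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights

docs = ["the dow fell", "the market slid", "recession fears grew"]
w = tf_idf(docs)
# 'the' appears in two of the three documents, so its weight in the
# first document is lower than that of 'dow', which appears in only one
```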
