Term frequency (TF) is the number of times a term occurs in a document. In natural language, terms usually correspond to words or phrases, but a term can be any token in the text; it all depends on how you define it. Term frequency is commonly used in text mining, machine learning, and information retrieval tasks.
Because documents differ in length, a term is likely to occur more often in a longer document than in a shorter one. Raw counts therefore make a term look more important to the longer document, even when it is not. To reduce this effect, term frequency is often normalized by dividing it by the total number of terms in the document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
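The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name term_frequencies and the simple regex tokenizer (lowercase runs of letters, digits, and apostrophes) are assumptions made for the example, and a real pipeline would likely use a proper tokenizer.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Return {term: count / total number of terms} for one document."""
    # Assumption: a term is a lowercase run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("The Dow fell and the Nasdaq fell")
print(tf["the"])  # "the" occurs 2 times out of 7 terms
```

With this normalization, the frequencies of all terms in a document sum to 1, which makes documents of different lengths comparable.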
There are other ways to normalize term frequencies, such as dividing by the maximum term frequency in the document or by the average term frequency. Which technique works best depends on the task, so expect to do some experimentation.
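As a rough sketch of these two alternatives, the functions below divide raw counts by the largest count and by the mean count in the document. The function names and the toy counts are made up for illustration, and the exact formulas used in practice vary (some variants add smoothing constants), so treat this as one plausible reading rather than the definitive definition.

```python
from collections import Counter

def normalize_by_max(counts):
    """Divide each raw count by the highest count in the document."""
    peak = max(counts.values())
    return {term: count / peak for term, count in counts.items()}

def normalize_by_average(counts):
    """Divide each raw count by the mean count over distinct terms."""
    mean = sum(counts.values()) / len(counts)
    return {term: count / mean for term, count in counts.items()}

# Hypothetical raw counts for a tiny document.
counts = Counter(["yield", "yield", "yield", "dow", "dow", "bond"])
by_max = normalize_by_max(counts)      # most frequent term gets 1.0
by_avg = normalize_by_average(counts)  # a term with the mean count gets 1.0
print(by_max)
print(by_avg)
```

Both functions assume a non-empty set of counts; an empty document would need to be handled separately.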
Term Frequency in Practice
Term frequencies are often used to characterize documents. In theory, the more frequently a term appears in a document, the more that term characterizes the document. However, there is a limitation to this assumption.
Let’s take the following news article about the Dow.
New York (CNN Business)The Dow fell 460 points Friday after a US recession indicator blinked red and a report on German manufacturing raised concerns about Europe's most important economy. The index shed 1.8%, while the S&P 500 closed down 1.9%. The Nasdaq plunged 2.5%. It was the worst performance for all three major indexes since January 3. The yield on 3-month Treasuries rose above the rate on 10-year Treasuries for the first time since 2007 — a shift that scared Wall Street. Investors have piled back into stocks after a sell-off in late 2018. The flattening yield curve, or the difference between short- and long-term rates, has worried investors for months. A narrowing spread is typically seen as sign that long-term economic confidence is dwindling. For decades, an inversion has been a reliable predictor of a future recession. Friday's flip added to pressure on the Dow that was building before US markets opened. The index stumbled at the bell on poor manufacturing data from Germany, which also spelled trouble for the country's bond market. The yield on Germany's benchmark 10-year government bond fell below zero for the first time since October 2016. That news out of Europe fueled Wall Street's ongoing concerns about slowing global growth. Investors remain jittery about Brexit and the lasting effects of the trade fight between the United States and China, even as Washington and Beijing move toward a deal. And they are unsure how to interpret the Federal Reserve's signal that it won't hike interest rates this year. On one hand, maintaining rates could ensure that credit keeps flowing and the 10-year bull market continues. But it also speaks to concern about the country's economic health, which could stifle investment. For the week, the Dow, S&P 500 and Nasdaq finished modestly lower. But bank stocks, which are particularly sensitive to interest rates and economic worries, took a beating. The KBW Bank index (BKX) dropped more than 8% in the past week. 
White House economic adviser Larry Kudlow told CNBC last year that the spread between 3-month and 10-year Treasury yields was important to watch. "It's actually not 10s to 2s; it's 10s to 3-month Treasury bills," Kudlow said last May. He was referring to the spread between 2-year and 10-year Treasury yields, which is also closely monitored. Michael Darda, chief economist and market strategist at MKM Partners, said in a note that investors should wait for weekly and monthly averages to show an inversion before they read it as a "powerful recession signal." And he noted that on average, recessions occur 12 months after an inversion, not immediately.
The most frequent terms are the ones that appear in the largest fonts below. Notice that common, low-information words such as ‘the’ and ‘and’ tend to dominate the counts. This is inevitable because natural language relies on determiners, connectors, and conjunctions to make sentences flow.
There are two ways you can improve the ranking of these words such that topic words appear more prominently. The first approach is to eliminate all stop words (common words) such as ‘the’, ‘is’, ‘are’, and so on before computing the term frequencies. Here’s an example with some of the stop words removed where the larger fonts indicate higher term frequencies:
Now it is much clearer that the document is about an economic recession.
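The effect of stop word removal can be reproduced on two sentences from the article. The sketch below is illustrative only: the STOP_WORDS set is a tiny hand-picked list chosen for this example (real stop lists are much longer), and top_terms is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop list, for illustration only.
STOP_WORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "on", "at", "in", "to",
    "is", "are", "for", "from", "which", "also", "since",
})

def top_terms(text, n=3, stop_words=frozenset()):
    """Return the n most frequent terms, skipping any term in stop_words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

snippet = (
    "The index stumbled at the bell on poor manufacturing data from Germany, "
    "which also spelled trouble for the country's bond market. The yield on "
    "Germany's benchmark 10-year government bond fell below zero for the "
    "first time since October 2016."
)

with_stop_words = top_terms(snippet, n=1)
without_stop_words = top_terms(snippet, n=1, stop_words=STOP_WORDS)
print(with_stop_words)     # dominated by 'the'
print(without_stop_words)  # a topic word, 'bond', rises to the top
```

Even on this short snippet, the top term shifts from a determiner to a word that says something about the content.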
Another way to suppress common words and surface topic words is to multiply the term frequencies by what’s called the inverse document frequency (IDF). IDF is a weight indicating how widely a word is used. The more frequent its usage across documents, the lower its score. For example, the word ‘the’ would appear in almost all English texts and thus would have a very low inverse document frequency.
Multiplying term frequencies by their IDFs dampens the weight of words that occur across most documents and raises the prominence of important topic words. This is the basis of the widely used TF-IDF weighting.
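To make the weighting concrete, here is a minimal sketch that assumes one common formulation, IDF(t) = log(N / number of documents containing t), where N is the number of documents. The text above does not fix a formula, and libraries often use smoothed variants, so the function tf_idf and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed formulation: idf = log(N / df), with no smoothing.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "the dow fell on recession fears".split(),
    "the yield on the bond fell".split(),
    "the bank index dropped".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])        # 0.0, because 'the' is in every document
print(weights[0]["recession"])  # highest weight tier in the first document
```

Under this formulation a word that appears in every document gets a weight of exactly zero, while a word unique to one document gets the largest boost.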