Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. Stop word lists are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.
For example, in the context of a search system, if your search query is “what is a stop word?” you want the search system to focus on surfacing documents that talk about “stop word” over documents that talk about “what is a.”
This can be done by maintaining a list of stop words (which can be manually or automatically curated) and preventing all words from your stop word list from being analyzed. In this example, the words “what is a” could be eliminated, leaving only the words “stop word.” This ensures that topically relevant documents rank highly in your search results.
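To make the filtering step concrete, here is a minimal Python sketch. The tiny STOP_WORDS set and the remove_stop_words helper are assumptions made up for this illustration; a real system would plug in a published or curated list and its own tokenizer.

```python
import re

# Illustrative, hand-picked stop word list (an assumption, not a published list).
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "of", "to"}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Lowercase the query, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("what is a stop word?"))  # ['stop', 'word']
```

Only the surviving tokens would then be passed on to indexing or ranking.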
Stop words for context
While stop word lists are generally used to remove low-information words, stop words can also be powerful for adding context. For example, in the query “what is a stop word?”, if you know that this is, in fact, a “what is” question and not a “how to” question, you can further refine the results shown to users. One way of knowing this is to look at just the non-topic words. Once you have this context information, instead of ranking “How to use stop words?” at the top of the search results, you can teach your algorithms to rank documents related to “What are stop words?” much higher.
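As a rough sketch of this idea, the non-topic words at the start of a query can be mapped to a question type before they are discarded. The QUESTION_PATTERNS table, its labels, and the question_type helper below are hypothetical names invented for this example; a production system would likely use a richer classifier.

```python
import re

# Hypothetical mapping from leading non-topic words to a question type.
QUESTION_PATTERNS = [
    ("how to", "how_to"),
    ("what is", "definition"),
    ("what are", "definition"),
]


def question_type(query):
    """Guess the question type from the words that open the query."""
    normalized = " ".join(re.findall(r"[a-z']+", query.lower()))
    for prefix, label in QUESTION_PATTERNS:
        if normalized == prefix or normalized.startswith(prefix + " "):
            return label
    return "unknown"


print(question_type("what is a stop word?"))    # definition
print(question_type("How to use stop words?"))  # how_to
```

A ranker could then boost documents whose own question type matches the label detected for the query.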
Where to find a stop word list?
There are established stop word lists that you could easily plug in and use. Some of the stop word lists come out of NLP research work, and some are just manually curated by different people. Here are a few for you to try in different languages:
Domain specific stop word lists
While it is fairly easy to use a published set of stop words, in many cases such stop words are insufficient. For example, in clinical texts, terms like “mcg,” “dr.,” and “patient” occur in almost every document that you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval.
Similarly, for tweets, terms like “#,” “RT,” and “@username” can potentially be regarded as stop words. Unfortunately, language-specific stop word lists do not cover domain-specific terms. The good news is that you can easily construct your own domain-specific stop word list. This article talks about a few ideas on how you can go about constructing such lists.
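One common heuristic for building such a list, offered here as an assumption rather than as the method of any particular article, is to flag terms that appear in a large fraction of the documents in your domain. The sketch below uses a hypothetical domain_stop_words helper, an arbitrary 0.8 threshold, and three invented clinical-style notes as sample data.

```python
import re
from collections import Counter


def domain_stop_words(documents, min_doc_fraction=0.8):
    """Return terms that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency, not raw count).
        doc_freq.update(set(re.findall(r"[a-z]+", doc.lower())))
    threshold = min_doc_fraction * len(documents)
    return {term for term, count in doc_freq.items() if count >= threshold}


# Invented sample notes, for illustration only.
notes = [
    "Patient given 50 mcg of fentanyl by Dr. Lee",
    "Dr. Shah reviewed the patient chart; 25 mcg dose",
    "Patient reports nausea after 100 mcg, Dr. Kim notified",
]
print(sorted(domain_stop_words(notes)))  # ['dr', 'mcg', 'patient']
```

The threshold is a tuning knob: lowering it catches more domain terms but risks discarding words that still help separate documents, so the candidates are worth reviewing by hand before use.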