Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. Stop words are commonly removed in text mining and Natural Language Processing (NLP) because they occur so frequently that they carry very little useful information.
For example, in the context of a search system, if your search query is “what is a stop word?”, you want the search system to focus on surfacing documents that talk about “stop word” over documents that talk about “what is a”.
This can be done by maintaining a list of stop words (which can be manually or automatically curated) and preventing every word on that list from being analyzed. In this example, the words “what”, “is”, and “a” could be eliminated, leaving only the words “stop” and “word”. This ensures that topically relevant documents rank high in your search results.
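As a rough illustration of the filtering step described above, here is a minimal Python sketch. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example; a real system would use a much longer, curated list and its own tokenizer.

```python
import re

# A tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "what"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words("What is a stop word?"))  # ['stop', 'word']
```

Only the remaining tokens would then be passed on to indexing or ranking.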
Stop words for context
While stop words are generally removed as low-information words, they can also be powerful for adding context. For example, in the query “what is a stop word?”, if you know that this is a “what is” question and not a “how to” question, you can further refine the results shown to users. One way of knowing this is to look at just the non-topic words. Once you have this context, instead of ranking “How to use stop words?” at the very top of the search results, you can teach your algorithms to rank documents related to “What are stop words?” much higher.
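One simple way to extract this kind of context is to inspect the leading non-topic words of the query. The sketch below assumes a small, hand-written `QUESTION_PATTERNS` table and a hypothetical `question_type` helper; the labels are illustrative, and a production system would likely need many more patterns or a trained classifier.

```python
import re

# Hypothetical mapping from leading stop-word phrases to a coarse question type.
QUESTION_PATTERNS = [
    ("how to", "how_to"),
    ("what is", "definition"),
    ("what are", "definition"),
]

def question_type(query):
    """Return a coarse question type from the query's leading words, or None."""
    normalized = " ".join(re.findall(r"[a-z0-9']+", query.lower()))
    for prefix, label in QUESTION_PATTERNS:
        if normalized == prefix or normalized.startswith(prefix + " "):
            return label
    return None

print(question_type("what is a stop word?"))    # definition
print(question_type("How to use stop words?"))  # how_to
```

A ranker could then use the returned label as an extra signal, for instance to prefer documents whose titles have the same question type as the query.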
Where to find a stop word list?
There are established stop word lists that you can easily plug in and use. Some come out of NLP research work, and some are manually curated by different people. Here are a few for you to try in different languages:
Domain specific stop word lists
While it is fairly easy to use a published set of stop words, in many cases such stop words are insufficient. For example, in clinical texts, terms like “mcg”, “dr.”, and “patient” occur in almost every document that you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval.
Similarly, for tweets, terms like “#”, “RT”, and “@username” can potentially be regarded as stop words. Unfortunately, language-specific stop word lists do not cover domain-specific terms. The good news is that you can easily construct your own domain-specific stop word list. This article talks about a few ideas on how you can go about constructing such lists.
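One common idea, which may or may not match the ideas in the article referenced above, is to treat terms that appear in nearly every document of your corpus as candidate stop words. The sketch below is a minimal version of that document-frequency approach; the `domain_stop_words` function, the `min_doc_fraction` threshold, and the sample notes are all assumptions made for illustration.

```python
import re
from collections import Counter

def domain_stop_words(documents, min_doc_fraction=0.8):
    """Return terms that appear in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency, not term frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", doc.lower())))
    threshold = min_doc_fraction * len(documents)
    return {term for term, count in doc_freq.items() if count >= threshold}

# Made-up clinical-style notes, used only to exercise the function.
notes = [
    "Patient reports mild headache",
    "Patient given 50 mcg dose",
    "Patient discharged after observation",
    "Follow up with patient next week",
]
print(domain_stop_words(notes))  # {'patient'}
```

The candidates this produces are worth reviewing by hand before use, since a term that is frequent in a domain can still matter for some queries.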