How do we automatically organize large amounts of text data with topics?

The Problem

Making sense of large volumes of text data in surveys, legal documents, websites, customer support tickets and discussion threads can be daunting. This is why organizations are turning to tags, labels and topics to help organize their data. Unfortunately, not every organization can afford the time to manually label each document it handles.

One highly effective way of addressing this problem is to automatically generate labels, tags and topics using Natural Language Processing (NLP) and Machine Learning techniques. While there are several off-the-shelf tools for topic extraction, they are often overly general and do not perform well for highly noisy or specialized domains (e.g. social media comments, customer reviews, clinical notes). Some of our customers have tapped into our expertise to build out custom topic extraction algorithms.

Our Approach

We have used different NLP and Machine Learning approaches to extract topics for different types of problems. Below, we outline two domains we have worked in and the approach that we used at a very high level.

Hotel Reviews

In the hotel review domain, we were asked to find topics consisting of common facets and opinion mentions in large amounts of user reviews.

Examples of facets: room, staff, bathroom

Examples of opinion mentions: dirty rooms, clean rooms, rude staff, insufficient parking

The goal of making these facets and opinion mentions available is to support exploration and detailed analysis of customer feedback. Because this involved analyzing large amounts of text, we used an unsupervised word graph-based approach to extract salient candidate topics. We then filtered the candidates down to relevant topics through phrase scoring, which looked at the part-of-speech composition, readability and length of each phrase.
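To make the idea concrete, here is a minimal sketch of graph-based phrase extraction followed by phrase scoring. It is an illustration under simplifying assumptions, not our production pipeline: the graph only links adjacent words, candidates are limited to two-word phrases, the small ADJECTIVES and NOUNS sets are hypothetical stand-ins for a real part-of-speech tagger, and the readability check is left out.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for a real part-of-speech tagger
# (an assumption for this sketch; a library such as spaCy would be used in practice).
ADJECTIVES = {"dirty", "clean", "rude", "friendly", "insufficient", "small"}
NOUNS = {"room", "rooms", "staff", "bathroom", "parking"}


def build_word_graph(reviews):
    """Count directed edges between adjacent words across all reviews."""
    edges = Counter()
    for review in reviews:
        words = re.findall(r"[a-z]+", review.lower())
        for left, right in zip(words, words[1:]):
            edges[(left, right)] += 1
    return edges


def score_phrase(phrase, support):
    """Combine edge support with a part-of-speech check and a length check."""
    words = phrase.split()
    # Part-of-speech composition: favor adjective + noun patterns.
    pos_ok = words[0] in ADJECTIVES and words[-1] in NOUNS
    # Length: always satisfied for two-word candidates, kept to show
    # where a length penalty would go for longer phrases.
    length_ok = 2 <= len(words) <= 4
    return support * (1.0 if pos_ok else 0.1) * (1.0 if length_ok else 0.5)


def top_opinion_mentions(reviews, k=3):
    """Return the k best-scoring two-word phrases found in the word graph."""
    edges = build_word_graph(reviews)
    scored = {f"{a} {b}": score_phrase(f"{a} {b}", n) for (a, b), n in edges.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]


reviews = [
    "The staff was rude and we got dirty rooms.",
    "Dirty rooms, rude staff, insufficient parking.",
    "Rude staff at the front desk, but clean rooms.",
]
print(top_opinion_mentions(reviews))
# ['dirty rooms', 'rude staff', 'clean rooms']
```

Phrases that recur across reviews and match the adjective + noun pattern rise to the top, while incidental word pairs such as "the staff" are pushed down by the part-of-speech check.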

In the end, for each hotel, we had a list of top-scoring opinion mentions and facets that were easy to analyze. We were also able to aggregate these topics by segment, such as by county or by hotel class.

Support Tickets

In another example, customers needed topics to organize their support tickets. Each ticket was sparsely populated, and instead of having humans manually enter labels, our customers wanted topics to be suggested from a set of possible topics. For this problem, we used large volumes of manually labeled data to train a high-accuracy text classifier that predicts a set of topics for a given support ticket. In addition to specific topics, we also recommended related topics using a word embedding model trained on large volumes of category data. This significantly improved efficiency in our customers’ workflow.
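As a rough illustration of this setup, the sketch below trains a small multi-label classifier with scikit-learn and looks up related topics by cosine similarity between topic vectors. Everything in it is assumed for the example: the tickets, the topic names, the TF-IDF plus logistic regression model, and the hand-written TOPIC_VECTORS that stand in for embeddings a trained model (for example, one built with Gensim) would provide.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labeled tickets (hypothetical); real training data would be far larger.
tickets = [
    "charged twice on my invoice",
    "refund for duplicate invoice charge",
    "invoice shows wrong amount charged",
    "cannot log in after password reset",
    "password reset email never arrives",
    "login page rejects my password",
    "package arrived late and damaged",
    "tracking number shows package lost",
]
labels = [
    ["billing"], ["billing", "refund"], ["billing"],
    ["login"], ["login"], ["login"],
    ["shipping"], ["shipping"],
]

# Turn the label lists into a binary indicator matrix, one column per topic.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One binary logistic regression per topic on top of TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(tickets, y)


def suggest_topics(text, k=2):
    """Return the k topics with the highest predicted probability."""
    probs = model.predict_proba([text])[0]
    ranked = sorted(zip(binarizer.classes_, probs), key=lambda pair: -pair[1])
    return [topic for topic, _ in ranked[:k]]


# Hand-written vectors standing in for learned topic embeddings (an assumption).
TOPIC_VECTORS = {
    "billing": [0.9, 0.1, 0.0],
    "refund": [0.8, 0.3, 0.1],
    "login": [0.0, 0.9, 0.2],
    "shipping": [0.1, 0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def related_topics(topic, k=1):
    """Return the k topics whose vectors are closest to the given topic."""
    others = [t for t in TOPIC_VECTORS if t != topic]
    others.sort(key=lambda t: -cosine(TOPIC_VECTORS[topic], TOPIC_VECTORS[t]))
    return others[:k]


print(suggest_topics("I was charged twice on my invoice"))
print(related_topics("billing"))
```

Ranking topics by probability and showing the top few, rather than applying a hard threshold, fits the suggestion workflow: a person still makes the final choice from a short list.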

The topics further allowed our customers to perform detailed analytics on their unstructured data. For example, they were able to identify the most used and least used topics, and use that to decide where to allocate more resources. A sudden spike in certain types of topics also became an indicator of an unusual event, which triggered additional investigation.
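Both kinds of analysis can be sketched with a few lines of standard-library Python. This is only an illustration: the function names are hypothetical, and the spike rule (latest day more than three standard deviations above the mean of earlier days) is one simple choice among many, not necessarily the rule our customers used.

```python
from collections import Counter
from statistics import mean, stdev


def topic_usage(ticket_topics):
    """Return (most used topic, least used topic) across all tickets."""
    counts = Counter(topic for topics in ticket_topics for topic in topics)
    ranked = counts.most_common()
    return ranked[0][0], ranked[-1][0]


def is_spike(daily_counts, threshold=3.0):
    """Flag the latest day if it is far above the earlier days.

    Needs at least two earlier days so a standard deviation can be computed.
    """
    *history, latest = daily_counts
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Flat history: any increase counts as a spike.
        return latest > mu
    return (latest - mu) / sigma > threshold


ticket_topics = [["billing"], ["billing", "refund"], ["login"], ["billing"], ["login"]]
print(topic_usage(ticket_topics))           # ('billing', 'refund')
print(is_spike([4, 5, 6, 5, 4, 5, 21]))     # True
print(is_spike([4, 5, 6, 5, 4, 5, 6]))      # False
```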



These are some of the technologies that we have used for our topic extraction work:

  • Gensim – word embeddings
  • scikit-learn – machine learning
  • spaCy – natural language processing
  • In-house phrase extractors
  • In-house text cleaning tools



With our highly customized topic extraction models, our clients have been able to:

  • Reduce the amount of time spent on manually creating topics for individual documents
  • Gain insights into their business and customers to fix nagging problems and improve resource allocation
  • Build more intelligent product features on top of topics


Need a customized topic extraction model? Get in touch with us and we will be happy to help.