How We Use Automatic Categorization to Make Sense of Documents

Customer Problem

Enterprises are overwhelmed by the volume of text they have to deal with every day. You have emails, chats, web pages, social media, support tickets, survey responses, clinical notes, incident reports and a whole lot more, all of it unstructured. While text data can be an extremely rich source of information, manually extracting insights from large volumes of it is labor-intensive.

One effective way to address this problem is to automatically categorize documents and pieces of textual content using Machine Learning and Natural Language Processing. For example, predicting which agent a support ticket should be routed to can ease resourcing, because different agents are trained to handle different sets of issues.
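To make the ticket-routing idea concrete, here is a minimal sketch of a text classifier written with only the Python standard library. It uses multinomial Naive Bayes, a common baseline, which is not necessarily what our production models use. The class name `NaiveBayesRouter`, the queue names and the sample tickets are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesRouter:
    """Tiny multinomial Naive Bayes classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of tickets
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tickets and queue names, for illustration only.
tickets = [
    ("I was charged twice for my subscription", "billing"),
    ("Please refund the last invoice", "billing"),
    ("The app crashes when I open settings", "technical"),
    ("Login page shows an error after the update", "technical"),
]
router = NaiveBayesRouter().fit(*zip(*tickets))
print(router.predict("I need a refund for a duplicate charge"))
```

In practice a router like this would be trained on thousands of labeled tickets, and the predicted queue would typically be paired with a confidence threshold so that uncertain tickets still go to a human for triage.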

Application Areas

We have helped multiple businesses with document and text categorization in very different application areas. These are some of the interesting areas where we have applied text classification models successfully:


1. Language Detection

The language used in online content can vary greatly: English, Arabic, Mandarin, you name it. It is imperative to detect the language of a specific piece of content in order to serve users appropriately. We have developed high-accuracy language classifiers that identify which of many languages a piece of content is written in.
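One widely used approach to language detection compares the character n-grams of a text against a profile built for each language. The sketch below illustrates that idea with standard-library Python only. It is a toy stand-in rather than our production classifier: the one-sentence samples in `SAMPLES`, the three language names and the function names are assumptions made for the example, and real profiles would be built from far larger corpora.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical one-sentence samples; far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "spanish": "el zorro salta sobre el perro perezoso y el gato duerme en la casa",
    "german": "der schnelle braune fuchs springt auf den faulen hund und die katze liegt im haus",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(sorted(PROFILES), key=lambda lang: cosine(grams, PROFILES[lang]))


print(detect_language("the dog sleeps in the house"))
```

Character n-grams work reasonably well for this task because they capture spelling patterns without needing a tokenizer, which matters for languages that do not separate words with spaces.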

2. Clinical Text Segmentation and Labeling

Clinical notes that physicians narrate and type into Electronic Medical Record (EMR) systems are highly unstructured. In order to bill insurance companies correctly, these notes have to be analyzed so that only the appropriate diagnoses are billed. Not all content in the clinical notes is relevant to this analysis. We have helped clinics with algorithms that create structure from their unstructured clinical notes by segmenting them and labeling paragraphs with appropriate labels such as “past medical history”, “social history”, “assessment”, etc., so that they only have to analyze the relevant sections for the purpose of billing.
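As a rough illustration of what segmenting and labeling means here, the sketch below splits a note into paragraphs and labels each one by matching a section header. It is a simple rule-based baseline that assumes notes contain explicit headers, which real notes often do not; it is not the algorithm we deployed. The patterns in `SECTION_PATTERNS`, the `BILLING_SECTIONS` set and the sample note are hypothetical.

```python
import re

# Hypothetical header patterns; real notes vary widely between clinics.
SECTION_PATTERNS = {
    "past medical history": re.compile(r"^\s*(past medical history|pmh)\b", re.I),
    "social history": re.compile(r"^\s*social history\b", re.I),
    "assessment": re.compile(r"^\s*(assessment|impression)\b", re.I),
}

# Assumption for this example: only the assessment matters for billing.
BILLING_SECTIONS = {"assessment"}


def label_paragraphs(note):
    """Split a note on blank lines and label each paragraph by its header.

    A paragraph without a recognized header inherits the previous label.
    """
    labeled = []
    current = "unlabeled"
    for paragraph in re.split(r"\n\s*\n", note.strip()):
        for label, pattern in SECTION_PATTERNS.items():
            if pattern.match(paragraph):
                current = label
                break
        labeled.append((current, paragraph.strip()))
    return labeled


# Made-up note, for illustration only.
note = """Past Medical History: Type 2 diabetes, hypertension.

Social History: Non-smoker, occasional alcohol.

Assessment: Poorly controlled blood sugar.

Continue metformin and recheck in three months."""

segments = label_paragraphs(note)
billing_text = [text for label, text in segments if label in BILLING_SECTIONS]
print(billing_text)
```

A learned classifier becomes necessary once headers are missing, misspelled or inconsistent, since the label then has to be inferred from the content of the paragraph itself.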

3. Trademark Class Prediction

The U.S. Patent and Trademark Office (“USPTO”), the federal agency charged with overseeing the registration of trademarks, divides marks into 45 different categories. Within these 45 top-level categories, there are 40,000 sub-categories. These categories are typically entered manually by attorneys, which is an extremely slow process. We have helped legal firms automate this process by developing a flexible classifier that predicts appropriate trademark classes at both the top-level category and the sub-category level when a user applies for a trademark, reducing the time attorneys need to spend reviewing each application.

4. Sentiment Polarity Prediction

Much of the data on the Web is unstructured and contains opinions. Opinions are spread across tweets, user reviews, news articles, comments on articles and videos, and more. There is a lot you can do with these opinions, including understanding what customers are unhappy about, learning what is good or bad about a product before a purchase decision, and predicting stock market sentiment. We have built several high-accuracy, domain-specific sentiment classifiers that detect whether a piece of text is a complaint or praise, or whether it is positive, negative or neutral, at different granularities (sentence, phrase and paragraph level).

From these actual use cases, it is clear that the range of applications for text classification is extremely broad. It can be useful in almost any industry, if applied to the right problems!


Technology We Use

These are some of the technologies that we use for our classifiers:

  • NumPy – scientific computing
  • Keras – deep learning models
  • Gensim – word embeddings
  • scikit-learn – machine learning
  • spaCy – natural language processing
  • In-house phrase extractors
  • In-house text cleaning tools

When it comes to tools and technologies, we focus on approaches that give us the best accuracy, scale over time and generalize to real-world use cases.

Benefits to Customers

With our highly customized, high-accuracy models, our clients have been able to:

  • Reduce the amount of time spent on manual and highly tedious tasks
  • Reduce human error in the analysis of large numbers of documents
  • Build more intelligent product features via the predictions generated by our models


Do you need a high-accuracy document categorization pipeline? We can help build it or show you how to. Get in touch with us!