The 4 Must-Dos in Building a Text Classifier for the Real World

Most text classification examples that you see on the Web or in books focus on demonstrating techniques. They will help you build a semi-usable prototype.

If you want to take your classifier to the next level and use it within a product or service workflow, then there are things you need to do from day one to make this a reality.

I’ve seen classifiers fail miserably and get replaced with off-the-shelf solutions because they didn’t work in practice. Not only is money wasted on developing solutions that don’t go anywhere, but the problem could also have been avoided if enough thought had been put into the process before development began.

In this article, I will highlight some of the best practices in building text classifiers that actually work for real-world scenarios.

Some of these tips come from my personal experience in developing text classification solutions for different product problems. Others come from literature that I’ve read and applied in practice.

Before we dive in, a quick recap: text classification, also known as document categorization or text categorization, is the process of predicting a set of labels given a piece of text. The labels can be sentiment classes, where we predict positive, negative or neutral given the content. They can also be Stack Overflow-style tags, where we predict a set of topics given the content, as shown in Figure 1. The possibilities are endless.

Figure 1: Predicting Stack Overflow tags

While text classifiers can be developed heuristically, in this article we will focus on supervised approaches that leverage machine learning models. Now, let’s look at the different steps you can take to increase the likelihood of your classifier succeeding in practice.

#1. Don’t overlook your evaluation metric

In my experience, one of the most important things to do when developing a text classifier is to figure out how you will evaluate its quality.

We often see accuracy being used as a common metric in text classification tasks. This is the proportion of correctly predicted labels over all predictions. While this provides a rough estimate of how well your classifier is doing, it’s often insufficient.

Understand what you are trying to optimize

To ensure that you are evaluating the right thing, you need to look at your goals and what you are trying to optimize from an application perspective.

In a customer experience improvement task, for example, you may be interested in detecting all the negative customer comments and may not be too worried about other comments that are neutral or positive.

In this case, you certainly want to ensure that classification of the negative sentiment is the best it can be. Accuracy is a lousy measure for this purpose, as it does not tell you how well you are capturing negative sentiment, nor does it tell you what types of classification issues are happening behind the scenes. Is negative content consistently being classified as neutral? You don’t have that insight.

A better measure in this example would be per-class precision and recall, as shown in Figure 2. This gives you a breakdown of how each class is performing. With this, as you work towards improving your classifier by adding more data or tweaking the features, you can ensure that the precision and recall for the negative class are at a satisfactory level.


Figure 2: per class precision recall example
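To make the contrast with accuracy concrete, here is a minimal sketch in plain Python that computes precision and recall per class. The labels and predictions are made up for illustration, and the function name per_class_precision_recall is my own, not something from a particular library.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    true_positives, predicted, actual = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[guess] += 1
        if truth == guess:
            true_positives[truth] += 1
    report = {}
    for label in sorted(set(actual) | set(predicted)):
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positives[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report

# Made-up labels: most negatives are being predicted as neutral.
y_true = ["neg", "neg", "neg", "neu", "pos", "pos", "pos", "pos", "pos", "pos"]
y_pred = ["neu", "neu", "neg", "neu", "pos", "pos", "pos", "pos", "pos", "pos"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the overall accuracy looks healthy at 0.80, yet only one in three negative comments is caught. That gap is exactly what a single accuracy number hides.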

In a supervised keyword extraction task, where the task was to detect all valid keywords, my goal was to capture as many valid keywords as possible. At the same time, I did not mind having keywords that were marginally relevant as long as they weren’t grossly irrelevant.

With this in mind, I focused on the hit rate (also known as recall: the fraction of all actual positives that are correctly identified, i.e. true positives / (true positives + false negatives)) while maintaining a decent level of precision. As a result, keywords that were highly irrelevant were eliminated, while relevant keywords and some marginally relevant false positives were kept. This is another example of choosing a metric to optimize for the task at hand.
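One way to act on a "high recall, decent precision" goal is to pick the decision threshold accordingly. The sketch below is an illustration of that idea, not the procedure used in the project described above. It assumes a classifier that outputs a confidence score per candidate keyword, and the scores, labels and the 0.75 precision floor are all made up.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every item scoring >= threshold is kept."""
    kept = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(kept)
    total_positives = sum(labels)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_precision):
    """Lowest threshold (and therefore highest recall) that still meets the precision floor."""
    best = None
    # Walk thresholds from strict to lenient; recall can only grow as we go.
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision >= min_precision:
            best = (threshold, precision, recall)
    return best

# Hypothetical classifier scores and gold labels (1 = valid keyword).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

best = pick_threshold(scores, labels, min_precision=0.75)
print(best)
```

Here the chosen threshold trades a little precision for a lot more recall compared with only keeping the most confident candidates. In practice you would choose the threshold on a held-out validation set rather than on the test data.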

Always think about what you are trying to optimize for and choose appropriate metrics that will reflect that goal.

Use different angles for evaluation

When evaluating a classifier, you can look at it from different angles. In the sentiment classification example, suppose you see that the precision and recall for the negative class are low, perhaps slightly above chance.

To understand the reasons for this, you can also look into the confusion matrix to see what types of misclassification issues you might be encountering. Perhaps most of the negative comments are being classified as positive (see example in Figure 3).

In such a case, you can check for issues with class imbalance, the quality of the training data and the volume of training data to get clues about how you can iteratively improve your classification.


Figure 3: Confusion matrix for a sentiment classification task
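A confusion matrix is simple to build yourself. The following is a minimal sketch with made-up predictions in which most negative comments are misread as positive. Rows are true labels and columns are predicted labels.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(truth, guess)] for guess in labels] for truth in labels]

labels = ["negative", "neutral", "positive"]

# Made-up data: 3 of the 5 negative comments are predicted as positive.
y_true = ["negative"] * 5 + ["neutral"] * 3 + ["positive"] * 4
y_pred = (["positive"] * 3 + ["negative"] * 2
          + ["neutral"] * 3
          + ["positive"] * 4)

matrix = confusion_matrix(y_true, y_pred, labels)

print(f"{'true / predicted':<18}" + "".join(f"{name:>10}" for name in labels))
for name, row in zip(labels, matrix):
    print(f"{name:<18}" + "".join(f"{count:>10}" for count in row))
```

Reading across the "negative" row shows where the negative comments went, which points you to the specific pair of classes that needs attention.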

You can use any analysis that makes sense to dissect and diagnose your classifier with the goal of understanding and improving the quality of classification. Try to go beyond the default examples that you see in online or book tutorials.

Be creative in handling evaluation obstacles

In the news classification task we looked at in one of my previous articles, the goal was to predict the category of news articles.

Given the limitations of the HuffPost data set that was used, there is only one correct category per article, even though in reality one article can fit into multiple categories.

For example, an education-related article can be categorized as EDUCATION and COLLEGE. Assuming the “correct” category is COLLEGE, but the classifier predicts EDUCATION as its first guess, that doesn’t mean it’s doing a poor job. COLLEGE might just be the second or third guess.

To work around this limitation, instead of just looking at the first predicted category, we used the top N predicted categories. This is to say that if any of the top N predicted categories contain the “correct” category, then it’s considered a hit.

With this approximation, we can then compute measures such as accuracy and mean reciprocal rank (MRR), which also accounts for the position of the “hit”: it averages 1/rank of the correct category, counting a miss as 0. The goal of MRR was to see if the correct category also moves up the ranks.
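Both measures take only a few lines to compute. This is a minimal sketch, assuming each prediction is a list of categories ordered from most to least likely. The ranked lists below are invented for illustration and are not output from the actual news classifier.

```python
def top_n_accuracy(ranked_predictions, correct_labels, n):
    """Share of items whose correct label appears in the top n guesses."""
    hits = sum(correct in ranked[:n]
               for ranked, correct in zip(ranked_predictions, correct_labels))
    return hits / len(correct_labels)

def mean_reciprocal_rank(ranked_predictions, correct_labels, n):
    """Average of 1/rank of the correct label within the top n (0 on a miss)."""
    total = 0.0
    for ranked, correct in zip(ranked_predictions, correct_labels):
        top = ranked[:n]
        if correct in top:
            total += 1.0 / (top.index(correct) + 1)
    return total / len(correct_labels)

# Invented ranked guesses for four articles.
ranked_predictions = [
    ["EDUCATION", "COLLEGE", "PARENTS"],
    ["POLITICS", "BUSINESS", "WORLD NEWS"],
    ["TRAVEL", "FOOD", "STYLE"],
    ["SPORTS", "COMEDY", "ENTERTAINMENT"],
]
correct_labels = ["COLLEGE", "POLITICS", "STYLE", "TECH"]

top1 = top_n_accuracy(ranked_predictions, correct_labels, n=1)
top3 = top_n_accuracy(ranked_predictions, correct_labels, n=3)
mrr = mean_reciprocal_rank(ranked_predictions, correct_labels, n=3)
print(f"top-1 accuracy: {top1:.2f}, top-3 accuracy: {top3:.2f}, MRR@3: {mrr:.3f}")
```

Top-1 accuracy penalizes the EDUCATION versus COLLEGE case as a plain miss, while top-3 accuracy counts it as a hit and MRR gives it partial credit for landing in second place.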

There will be many such obstacles when trying to develop a classifier for a real world problem. The good news is, you can always come up with a good workaround. Just put some thought into it. You can also get ideas from peer-reviewed papers.

#2. Use quality training data

Training data is the fuel for learning patterns in order to make accurate predictions. Without training data, no matter what engine (model) you use, nothing will work as expected.

It feels a bit clichéd to say that you should use quality training data. But, what does it mean?

To me, good quality training data has 3 properties as shown in Figure 4:


Figure 4: Properties of good quality training data

Let’s look at what each of these means.

Data that is compatible with the task at hand

Let’s say you are trying to predict the sentiment of tweets. However, the only training data at your disposal is labeled user reviews. Training on user reviews and performing prediction on tweets may give you suboptimal results.

This is because the classifier learns the properties of reviews that put them in different sentiment categories. If you think about it, reviews are much meatier in content than tweets, which are full of abbreviations, hashtags, emoticons and so on.

This makes the vocabulary of Tweets quite a bit different from reviews. So, the data here does not fit the task. It’s an approximation, but it’s not ideal. It’s especially risky if you have NEVER tested it with the data that it would be used on.

I’ve seen many instances where companies train their classifier on data that just doesn’t fit the task. Don’t do this if you want a solution that works in a production setting. Figure 5 shows a few examples that would be a no-no from my perspective:


Figure 5: Examples of incompatible training data and target application

A reasonable approximation is acceptable. You just don’t want to use a dataset that has little to no resemblance to the data that you’ll be using in practice. Note that these differences can present themselves in the form of vocabulary, content volume and domain relevance.

If an approximation is used, I always suggest adding a dataset that represents the “real stuff” to ensure compatibility, even if it’s limited. This limited dataset can be used to tune and test your model to ensure that you are optimizing for the actual task.
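A quick sanity check for vocabulary mismatch is to measure how many tokens in the target data never appear in the training data. The sketch below is one rough way to do that. The tokenizer is deliberately naive, and the review and tweet texts are invented examples.

```python
import re

def tokenize(text):
    """Lowercase the text and keep word tokens, including hashtags and mentions."""
    return re.findall(r"[#@]?\w+", text.lower())

def out_of_vocabulary_rate(training_texts, target_texts):
    """Fraction of target tokens that never occur in the training texts."""
    vocabulary = {token for text in training_texts for token in tokenize(text)}
    target_tokens = [token for text in target_texts for token in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(token not in vocabulary for token in target_tokens)
    return unseen / len(target_tokens)

# Invented examples of training data (reviews) and target data (tweets).
reviews = [
    "The battery life is great and the screen is sharp",
    "Terrible customer service and the product broke",
]
tweets = ["omg battery died again #fail", "luv the screen tbh"]

oov_rate = out_of_vocabulary_rate(reviews, tweets)
print(f"{oov_rate:.0%} of tweet tokens were never seen in the reviews")
```

A high rate does not prove the training data is unusable, but it is a cheap early warning that the two sources differ and that you should test on the real data before trusting the model.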

Data that is fairly balanced between classes

In the same sentiment classification example from above, let’s say you have 80% positive training examples, 10% negative and the remaining 10% neutral. How well do you think your classifier is going to perform on negative comments? It’s probably going to say that almost every comment is positive, as it has limited information about the negative and neutral classes.

Data imbalance is a very common problem in real-world classification tasks. Although there are ways to address data imbalance, nothing beats having sufficient amounts of labeled data for each class and then sampling down to make them equal or close to equal.
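Sampling down is straightforward once you have enough labeled examples per class. Here is a minimal sketch that downsamples every class to the size of the smallest one, using the 80/10/10 split from the example above. The function name and the fixed random seed are my own choices.

```python
import random
from collections import Counter, defaultdict

def downsample_to_smallest_class(examples, seed=0):
    """examples: list of (text, label). Returns a shuffled, class-balanced subset."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target_size = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target_size))
    rng.shuffle(balanced)
    return balanced

# Synthetic 80/10/10 dataset mirroring the example in the text.
examples = ([(f"positive comment {i}", "positive") for i in range(80)]
            + [(f"negative comment {i}", "negative") for i in range(10)]
            + [(f"neutral comment {i}", "neutral") for i in range(10)])

balanced = downsample_to_smallest_class(examples)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

Note that this throws away 70 positive examples, which is why it only works well when the smallest class is itself large enough to learn from.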

The way to generate sufficient amounts of labeled data can vary. For example, if your classifier is for a highly specialized domain (e.g. healthcare), you can hire domain experts from within your company to annotate data. This data would eventually become valid training examples.

Recently, I used a platform called LightTag to help generate a specialized dataset for one of my clients. Since the task required medical knowledge, medical coders were recruited to perform the annotation.

Find creative ways to bootstrap a good, balanced dataset. If hiring human labelers is not an option, you can start with a heuristic approach. If over time you find that your heuristic approach actually works reasonably well, you can use it to generate a dataset for training a supervised classifier.

I’ve done this several times with reasonable success. I say reasonable because it’s not straightforward. You will have to sample correctly, understand the potential bias in your heuristic approach and have a baseline benchmark of how it is performing.
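As a rough illustration of what a heuristic labeler and its baseline benchmark might look like, consider the sketch below. The cue words, the abstain rule and the tiny hand-labeled sample are all hypothetical, and a real heuristic would be built from knowledge of your own data.

```python
NEGATIVE_CUES = {"refund", "broken", "terrible", "worst"}  # hypothetical cue words
POSITIVE_CUES = {"love", "great", "excellent", "perfect"}  # hypothetical cue words

def heuristic_label(text):
    """Return 'negative', 'positive', or None when the rules do not clearly fire."""
    words = set(text.lower().split())
    has_negative = bool(words & NEGATIVE_CUES)
    has_positive = bool(words & POSITIVE_CUES)
    if has_negative and not has_positive:
        return "negative"
    if has_positive and not has_negative:
        return "positive"
    return None  # abstain rather than guess

def benchmark(hand_labeled):
    """Coverage and accuracy of the heuristic on a small hand-labeled sample."""
    guesses = [(heuristic_label(text), gold) for text, gold in hand_labeled]
    fired = [(guess, gold) for guess, gold in guesses if guess is not None]
    coverage = len(fired) / len(hand_labeled)
    accuracy = sum(guess == gold for guess, gold in fired) / len(fired) if fired else 0.0
    return coverage, accuracy

# Tiny invented benchmark; a real one should be sampled from production data.
hand_labeled = [
    ("i want a refund", "negative"),
    ("great phone love it", "positive"),
    ("not great at all", "negative"),
    ("arrived on tuesday", "neutral"),
]

coverage, accuracy = benchmark(hand_labeled)
print(f"coverage: {coverage:.2f}, accuracy when it fires: {accuracy:.2f}")
```

The "not great at all" example is labeled positive by these rules, which is the kind of systematic bias (here, ignoring negation) that you need to know about before training a classifier on heuristic labels.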

Data that is representative

Let’s say your task is to predict the spoken language of a webpage (e.g. Mandarin, Hindi, etc.). If you use training data that only partially represents a language, for instance a region-specific dialect, then you may not be able to correctly predict another dialect of the same language. You may end up with grossly misclassified languages.

Either you need each regional version as a separate class OR your training data for a given language needs to represent all variations, dialects and idiosyncrasies of that language to make it representative.

Without this, your classifier will be highly biased. This problem is called within-class imbalance. As you will see in this article, while not related to text, bias can become a real problem. Without knowing it, you may inadvertently introduce bias due to your data selection process.

In one of my previous projects on clinical text segmentation, consciously forcing variety in the training examples actually improved the results, as the classifier could better generalize across different organizations.

Ensure that your dataset is representative of reality. This starts with understanding the dataset that you will be using. How was it created? Does the dataset identify any subpopulations (e.g. by geographical location)? Who created this dataset and why?

Asking questions about your dataset can reveal potential biases encoded in it. Once you know the potential biases, you can start planning ways to mitigate them, for example by gathering additional data to offset the bias, improving preprocessing or introducing a post-processing layer where humans validate certain predictions. You will find more ideas in this survey paper.
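If your dataset does identify subpopulations, one simple check is to report each subgroup’s share of the data alongside its accuracy. The following sketch assumes every record carries a subgroup tag. The dialect names and predictions are made up.

```python
from collections import Counter

def subgroup_report(records):
    """records: (subgroup, true_label, predicted_label) tuples.

    Returns {subgroup: (share_of_data, accuracy)}.
    """
    totals, correct = Counter(), Counter()
    for subgroup, truth, predicted in records:
        totals[subgroup] += 1
        correct[subgroup] += int(truth == predicted)
    n = len(records)
    return {group: (totals[group] / n, correct[group] / totals[group])
            for group in totals}

# Made-up records: one dialect dominates the data and is also predicted far better.
records = ([("dialect_a", "mandarin", "mandarin")] * 7
           + [("dialect_a", "mandarin", "other")]
           + [("dialect_b", "mandarin", "other")] * 2)

report = subgroup_report(records)
for group, (share, accuracy) in report.items():
    print(f"{group}: share={share:.0%} accuracy={accuracy:.1%}")
```

An overall accuracy of 70% on this toy data would hide the fact that the under-represented dialect is never predicted correctly.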


#3. Focus on the problem first, techniques next

When it comes to A.I. and NLP, most practitioners, leaders included, tend to focus on the techniques. This may be a computer science mentality, where we tend to overemphasize techniques.

With all this attention on techniques, you may be tempted to follow the trend and rush to use all the sophisticated embeddings in your models for a task as simple as spam detection. After all, there are so many tutorials that show you how to utilize these embeddings, so why shouldn’t you use them?

Here’s a secret. They don’t always work significantly better than a simpler or more understandable approach. The gains you get in accuracy may be lost in the time your model takes to produce a prediction or in the difficulty of operationalizing your model.

I’ve seen this time and time again, where adding layers of sophistication just made models unnecessarily sluggish and hard to deploy in practice. In some cases, the models performed worse than simpler, more well-thought-out approaches.
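One habit that helps here is to measure accuracy and prediction latency side by side before committing to a heavier model. The sketch below shows the shape of such a comparison. Both "models" are toy stand-ins: simple_model is a hypothetical keyword rule, and heavy_model merely adds an artificial delay to simulate slow inference, so the numbers say nothing about any real technique.

```python
import time

def evaluate(predict, texts, labels):
    """Return (accuracy, average seconds per prediction) for a predict(text) callable."""
    start = time.perf_counter()
    predictions = [predict(text) for text in texts]
    elapsed = time.perf_counter() - start
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return accuracy, elapsed / len(texts)

SPAM_WORDS = {"free", "winner", "prize"}  # hypothetical keyword list

def simple_model(text):
    """Toy baseline: flag a message as spam if it contains a cue word."""
    return "spam" if set(text.lower().split()) & SPAM_WORDS else "ham"

def heavy_model(text):
    """Stand-in for a slower model: same answers, plus an artificial delay."""
    time.sleep(0.002)
    return simple_model(text)

texts = ["free prize inside", "meeting at noon", "you are a winner", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

simple_accuracy, simple_latency = evaluate(simple_model, texts, labels)
heavy_accuracy, heavy_latency = evaluate(heavy_model, texts, labels)
print(f"simple: accuracy={simple_accuracy:.2f}, {simple_latency * 1000:.3f} ms/prediction")
print(f"heavy:  accuracy={heavy_accuracy:.2f}, {heavy_latency * 1000:.3f} ms/prediction")
```

With your own models plugged into evaluate, a table like this makes it easy to see whether an accuracy gain is worth the added latency.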

More complexity does not mean more meaningful results

If you want meaningful results, this often starts with a good grasp of the problem that you are attempting to solve. Part of this involves finding answers to the following questions:

  • What exactly are you trying to predict? 
  • Why is the automation necessary?
    Are you trying to reduce costs or time or both? Are you trying to reduce human error?
  • How much do you expect to gain in terms of reduction in costs, time, human error or others with the automation?
  • What are the ramifications in getting predictions wrong?
    Will it entail someone not getting a loan or a job because of it? Will it prevent someone from getting treatment for a deadly disease?
  • How is the problem currently being solved?
    What is the manual process? Are results from this manual process being collected somewhere?
  • How will the automated solution be used?
    Will it be reviewed by humans before release or would the predictions directly affect users?
  • What are the potential data sources for this specific problem?
  • Do you have the budget and time to be able to acquire labeled data if needed?

This knowledge will first of all help you determine if supervised machine learning even makes sense as a solution. I’ve told several of my clients that I would not recommend machine learning for some of their problems. In one case, they were better off using a lookup table. In another, they did not have the data to develop a supervised classifier, so that had to be put in place first.

These questions will also act as a guiding force in acquiring a good dataset, setting up appropriate evaluation metrics, setting tolerances for false positive and false negative rates, selecting the right set of tools and so on. So always lead with the problem and plan the rest around it, not the other way around.

#4. Leverage domain knowledge in feature extraction

The beauty of text classification is that we have different options in terms of how we can represent features. You can use the unigrams of the entire raw text as is. You can leverage filtered-down unigrams, bigrams and other n-grams. You can use the occurrence of specific terms. You can use sentence, paragraph and word-level embeddings, and more.

While all of these are options, the more text you use, the larger your feature space becomes. The first problem with this is that only a small number of features are actually useful. The second is that the values of many features may be correlated.

One effective approach that I’ve repeatedly used is to extract features based on domain or prior knowledge. For example, for a programming language classification task, we know that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style and other minor ways. This is the domain knowledge.

Using this domain knowledge, you can extract relevant features. For example, the top N special characters in a source code file can highlight the differences in structure (e.g. Java uses curly braces, Python uses colons, tabs and spaces).

The top K tokens with capitalization preserved (the non-special characters) can highlight the differences in vocabulary. With this, you would not have to rely on the raw text, which can make the feature space explode. For this project, my analysis also showed that this approach was much more effective than using the raw text as is. It also kept the model relatively small, easy to manage and easy to grow over time as new languages are added.
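The two feature families described above can be sketched in a few lines. This is an illustrative reconstruction, not the original project’s code: the function names, the regular expression for tokens and the default values of N and K are all assumptions.

```python
import re
from collections import Counter

def top_special_characters(source, n):
    """The n most frequent characters that are neither alphanumeric, whitespace nor underscore."""
    counts = Counter(ch for ch in source
                     if not ch.isalnum() and not ch.isspace() and ch != "_")
    return [ch for ch, _ in counts.most_common(n)]

def top_tokens(source, k):
    """The k most frequent identifier-like tokens, with capitalization preserved."""
    counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
    return [token for token, _ in counts.most_common(k)]

def extract_features(source, n=5, k=5):
    """Binary features: presence among the top special characters or top tokens."""
    features = {f"char={ch}": 1 for ch in top_special_characters(source, n)}
    features.update({f"token={token}": 1 for token in top_tokens(source, k)})
    return features

java_source = ('public class Hello { public static void main(String[] args) '
               '{ System.out.println("hi"); } }')
python_source = 'def main():\n    print("hi")\n\nif __name__ == "__main__":\n    main()\n'

print(extract_features(java_source))
print(extract_features(python_source))
```

Each file is reduced to at most N + K features regardless of its length, which is what keeps the feature space small compared with using the raw text.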

While you can use the fanciest of approaches to feature extraction, nothing beats making the most out of your domain knowledge of the problem.


In summary, developing robust classifiers for the real world comes down to the fundamentals: the quality of data, the evaluation metric, understanding the problem, maximizing the use of your domain knowledge and, finally, the techniques. If you get each of the points highlighted above in shape, your chances of developing a solution that you can operationalize will improve significantly. It’s always better to plan before you start any form of implementation.
