AI is transforming nearly every industry, and text analysis is a key area of interest. That’s because there’s been an explosion in unstructured text data (nearly 80% of data at most organizations), which is quickly becoming impractical for humans to analyze alone.
We’ve already talked about some best practices for building a text classifier, but how can a tool like this help your business? Let’s take a closer look at document classification and some real-world examples.
What Is Document Classification?
Organizations need to classify documents so that their text data is easier to manage and utilize. For example, companies may need to classify incoming customer support tickets so they get sent to the right customer support agents.
With a manual approach, staff would need to sort through each text and assign a label or category to it individually. The problem is that manual classification can be time-consuming, error-prone, and cost-prohibitive.
That’s why many organizations are turning to machine learning (ML) and natural language processing (NLP) to automatically assign each text to one of several predefined categories. Whether the texts are very short (e.g., tweets) or entire documents (e.g., news articles), the ability to quickly categorize this data brings efficiency to the organization and frees up staff to work on higher-level tasks.
In the ML and NLP world, document classification is also known as:
- text classification
- text categorization or
- document categorization
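To make the idea of sorting texts into predefined categories concrete, here is a minimal sketch of a support-ticket classifier. It assumes a multinomial Naive Bayes model written in plain Python. The ticket texts, the two category names, and the class and function names are all invented for illustration, and a production system would typically rely on an established ML library and far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tickets, invented for this example.
TRAINING_TICKETS = [
    ("I was charged twice on my invoice", "billing"),
    ("Please refund the payment on my card", "billing"),
    ("My invoice shows the wrong amount", "billing"),
    ("The app crashes when I log in", "technical"),
    ("I get an error message after the update", "technical"),
    ("The page will not load on my phone", "technical"),
]

texts, labels = zip(*TRAINING_TICKETS)
classifier = NaiveBayesClassifier().fit(texts, labels)

print(classifier.predict("refund my payment"))  # billing
print(classifier.predict("the app crashes"))    # technical
```

The model simply counts how often each word appears under each label, so new tickets are routed to the category whose vocabulary they most resemble.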
5 Practical Text Classification Examples
With the value of text classification clear, here are five practical use cases business leaders should know about.
1. Gmail Spam Classifier
Spam has always been annoying for email users, and these unwanted messages can cost office workers a considerable amount of time to deal with manually. Most email services filter spam emails based on a number of rules or factors, such as the sender’s email address, malicious hyperlinks, suspicious phrases, and more. But there’s no single definition of spam, and some unwanted emails can still reach users.
That’s why Google recently decided to upgrade its Gmail filters using the company’s own machine learning platform called TensorFlow. Google was able to train new ML algorithms to block an additional 100 million spam messages every day. Moreover, these new email classification algorithms are able to identify patterns over time based on what individual Gmail users consider spam themselves.
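For contrast with the ML approach, the rule-based filtering mentioned earlier (sender address, suspicious phrases, malicious links) might look roughly like the sketch below. This is not how Gmail or any particular email service works. The domains, phrases, and function names are hypothetical and exist only to show the shape of such rules.

```python
import re

# Hypothetical rule lists, invented for illustration only.
BLOCKED_SENDER_DOMAINS = {"spam-offers.example"}
SUSPICIOUS_PHRASES = ("act now", "you have won", "risk-free")
SUSPICIOUS_LINK_HOSTS = {"malware.example"}

# Captures the host part of any http(s) link in the message body.
LINK_PATTERN = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)


def spam_reasons(sender, body):
    """Return a description of every rule the message triggers."""
    reasons = []
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLOCKED_SENDER_DOMAINS:
        reasons.append("blocked sender domain")
    lowered = body.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            reasons.append(f"suspicious phrase: {phrase}")
    for host in LINK_PATTERN.findall(body):
        if host.lower() in SUSPICIOUS_LINK_HOSTS:
            reasons.append(f"suspicious link: {host.lower()}")
    return reasons


def is_spam(sender, body):
    """Flag a message as spam if at least one rule fires."""
    return bool(spam_reasons(sender, body))
```

A message that rephrases its pitch, such as "Congratulations, a prize awaits you" from an unlisted sender, triggers none of these rules. That gap is what a classifier trained on examples of spam is meant to close.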
2. Great Wolf Lodge’s Sentiment Classifier
Great Wolf Lodge (GWL), a chain of resorts and indoor water parks, has expanded its broad digital strategy by using AI to classify customer comments based on sentiment. They developed what they call the Great Wolf Lodge’s Artificial Intelligence Lexicographer (GAIL).
GWL capitalizes on the concept of net promoter score (NPS) to gauge the experience of individual customers. Instead of using an NPS score to determine customer satisfaction, GAIL determines if customers are a net promoter, detractor, or neutral party based on the free-text responses posted in monthly customer surveys. This is analogous to predicting whether the customer sentiment is positive, negative, or neutral. GAIL essentially “reads” the comments and generates an opinion.
Through this effort, the company hopes to better understand its guests and improve the customer experience. For example, by analyzing comments from detractors, Great Wolf Lodge can identify areas of its service that need improvement.
GAIL was trained using over 67,000 reviews and has an accuracy of 95 percent. Analyzing this unstructured data manually would take far too long for humans, but GAIL can parse this data in seconds and determine whether the author is a net promoter, detractor, or neutral party.
3. Facebook’s Hate Speech Detection
Facebook—with nearly 1.7 billion daily active users—naturally has content posted on the platform that violates its rules. Among this negative content is hate speech. Defining and detecting hate speech is one of the biggest political and technical challenges for Facebook and similar platforms.
Facebook addresses this problem by having human experts review posts detected automatically using an AI text classifier. The AI-flagged posts are reviewed in the same way as posts reported by users. In fact, the platform removed 9.6 million pieces of content flagged as hate speech in the first quarter of 2020 alone.
Detecting which content contains hate speech, however, is much harder than detecting violent or explicit content. AI algorithms must understand the subtle meaning of the text using NLP, analyze the cultural context and nuance being expressed, and then determine whether it’s offensive without incorrectly penalizing innocent content.
To increase how much AI can help humans in the loop, Facebook has created a collection of more than 10,000 hate speech memes that combine images and text to spur new research.
4. Bipartisan Press’s Political Bias Detector
The Bipartisan Press is a news outlet that aims to promote transparent journalism by attempting to label the bias of every article it publishes. More recently, however, the publication has turned to AI and NLP to systematically predict political bias.
The publication experimented with multiple ML algorithms, datasets, and configurations and found that the best political bias predictor was a model built on Google’s BERT transformer architecture. They also found that the dataset that produced the best bias predictions was based on Ad Fontes Media’s list of articles, each of which was prelabeled on a bias scale of -42 to 42. The news website now uses the tool to classify and score its own articles as left- or right-leaning, with a bias level ranging from minimal to extreme.
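As a rough illustration of how a score on the -42 to 42 scale could be turned into a direction and a bias level, consider the sketch below. The cut-off values, the intermediate level names, and the convention that negative scores lean left are assumptions made for this example, not the publication's actual settings.

```python
MAX_BIAS = 42  # the scale described above runs from -42 to 42

# Hypothetical cut-offs on the absolute score, invented for illustration.
BIAS_LEVELS = [
    (6, "minimal"),
    (18, "moderate"),
    (30, "strong"),
    (MAX_BIAS, "extreme"),
]


def describe_bias(score):
    """Map a numeric bias score to a (direction, level) pair."""
    if not -MAX_BIAS <= score <= MAX_BIAS:
        raise ValueError(f"score must be between -{MAX_BIAS} and {MAX_BIAS}")
    magnitude = abs(score)
    # Pick the first level whose upper limit covers the magnitude.
    level = next(name for limit, name in BIAS_LEVELS if magnitude <= limit)
    if score == 0:
        direction = "center"
    elif score < 0:
        direction = "left"   # assumption: negative scores lean left
    else:
        direction = "right"  # assumption: positive scores lean right
    return direction, level


print(describe_bias(-35))  # ('left', 'extreme')
print(describe_bias(4))    # ('right', 'minimal')
```

In a setup like this, the model predicts the number, and a thin mapping layer produces the human-readable label.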
5. LinkedIn’s Inappropriate Profile Flagging
LinkedIn has more than 590 million professionals in over 200 countries. To keep the platform safe and professional, LinkedIn puts a lot of effort into detecting and remediating behavior that violates its Terms of Service, such as spam, scams, harassment, or misinformation. One such effort is to detect and remove profiles with inappropriate content. Inappropriate content can range from profanity to advertisements for illegal services.
At first, the platform manually flagged profiles that contained inappropriate words or phrases. This process wasn’t scalable and limited the total number of inappropriate profiles that LinkedIn could surface. Over time, it also became much harder to manage the growing list of offending words and phrases.
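The word-and-phrase approach described above can be sketched as a simple blocklist match. This is only an illustration of the general technique, not LinkedIn's implementation. The blocked terms and function names are hypothetical.

```python
import re

# Hypothetical blocklist, invented for illustration only.
BLOCKED_TERMS = ["free money", "scam", "illegal"]


def build_pattern(terms):
    """Compile one case-insensitive regex matching any term as a whole word or phrase."""
    # Longer terms go first so that phrases win over shorter overlapping words.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


BLOCKLIST_PATTERN = build_pattern(BLOCKED_TERMS)


def flagged_terms(profile_text):
    """Return the sorted, de-duplicated blocked terms found in the text."""
    matches = BLOCKLIST_PATTERN.finditer(profile_text)
    return sorted({match.group(0).lower() for match in matches})


print(flagged_terms("Earn FREE MONEY fast"))                    # ['free money']
print(flagged_terms("I investigate scam networks for a bank"))  # ['scam']
```

The second call shows the weakness of this approach: a legitimate fraud investigator is flagged because a blocklist cannot read context, and every new offending phrase has to be added to the list by hand.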
Now the social media platform flags profiles that contain inappropriate content using a machine learning model. This document classification model was trained using a dataset of public profile content labeled as “appropriate” or “inappropriate”, which was carefully curated to limit false positives. LinkedIn continues to refine its ML algorithm and training set while looking into Microsoft translation services to leverage ML in all of the platform’s supported languages.
Consider Document Classification For Your Business
As you can see, text classification has a wide range of use cases for business. Unstructured data continues to grow at an enormous pace, and the most innovative companies are using ML and AI to harness this information to achieve greater business results.