Machine learning algorithms demand data. That’s how the algorithms learn—they use examples in data to learn the patterns that lead to the “correct answers.”
Let’s take a spam classification problem. To train the computer to learn which emails are spam and which ones are not, you’d need (1) the spam emails and (2) the corresponding correct classifications (i.e., the labels). The table below illustrates this example:
|Email|Label (correct answers)|
|---|---|
|You owe me 10,000 USD…|Spam|
|Dear, this FREE weight loss program…|Spam|
|Hi James, How about tonight at 7 pm?|Not Spam|
|Mom, could I drop by today when dad is back?|Not Spam|
The computer and its underlying algorithms then use this email data to learn how to produce the correct classification the next time they see similar emails.
Essentially, labeled data is the combination of data points and corresponding labels (the correct answers), where a subset becomes your training data. You can train a computer to automatically predict the correct answers when it sees hundreds or thousands of these labeled examples.
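To make the idea concrete, here is a minimal sketch of how the four emails from the spam example could be stored as labeled data and used to train a classifier. The article does not name a specific algorithm, so the tiny naive Bayes word counter below is an assumption chosen for illustration, and the names `labeled_emails`, `tokenize`, `train`, and `predict` are hypothetical. Four examples are far too few for a real model; in practice you would want the hundreds or thousands mentioned above.

```python
import math
import re
from collections import Counter, defaultdict

# Labeled data: each example pairs a data point (email text) with its label.
labeled_emails = [
    ("You owe me 10,000 USD", "Spam"),
    ("Dear, this FREE weight loss program", "Spam"),
    ("Hi James, How about tonight at 7 pm?", "Not Spam"),
    ("Mom, could I drop by today when dad is back?", "Not Spam"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score.

    Illustrative assumption: add-one smoothing over the training vocabulary.
    """
    vocab = {word for counts in word_counts.values() for word in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(labeled_emails)
prediction = predict("FREE weight loss program for you", word_counts, label_counts)
print(prediction)  # Spam
```

The model never sees a rule such as "FREE means spam". It only sees which words appeared alongside which labels, which is why the labels are essential.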
However, not all machine learning algorithms need labeled data. Some algorithms just need large amounts of data without labels. The algorithms that do need labels are called supervised machine learning algorithms.
Many industry problems lend themselves to supervised machine learning algorithms. Examples include predicting the sentiment of customer comments, assigning labels to documents, and labeling objects in images.
Where Does Labeled Data Come From?
All this is great, but you may wonder how you would ever get thousands of labeled examples. Isn’t this unrealistic?
At some companies, these labels already exist as a result of their manual processes. Say your company currently routes customer support tickets manually, but you now want to automate this process. To automate this with machine learning, you’ll need (1) the original support tickets and (2) the categories (labels) created during the manual routing process. You might find this data stored away in some relational database. And all this data can be leveraged to build a machine learning model to automate ticket routing.
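As a rough sketch of what pulling that data out of a relational database could look like, the example below joins tickets to the categories assigned during manual routing. Everything about the schema is an assumption: the `tickets` and `routing` tables, their columns, and the sample rows are made up for illustration, and an in-memory SQLite database stands in for whatever system your company actually uses.

```python
import sqlite3

# Hypothetical schema and sample rows; a real ticketing database will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE routing (
        ticket_id INTEGER REFERENCES tickets(id),
        category  TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO tickets (id, body) VALUES (?, ?)",
    [
        (1, "I was charged twice this month"),
        (2, "The app crashes when I log in"),
        (3, "How do I change my email address?"),
    ],
)
# Ticket 3 was never routed, so it has no label.
conn.executemany(
    "INSERT INTO routing (ticket_id, category) VALUES (?, ?)",
    [(1, "Billing"), (2, "Technical")],
)

# The inner join keeps only tickets that have a category,
# yielding (data point, label) pairs ready for supervised training.
labeled_tickets = conn.execute("""
    SELECT t.body, r.category
    FROM tickets AS t
    JOIN routing AS r ON r.ticket_id = t.id
    ORDER BY t.id
""").fetchall()
conn.close()

for body, category in labeled_tickets:
    print(f"{category}: {body}")
```

Tickets that were never routed drop out of the result, which is one simple way incomplete labels show up in practice.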
But things aren’t always so rosy. Companies that want to solve a problem with supervised machine learning often find that they don’t have the necessary labels or that their data is somehow incomplete. But don’t despair: you can work around such issues by synthetically or organically generating the training data.