You often hear the term “training data” thrown around in your company. But what is training data, and what do you do with it?
In the business world, training data loosely refers to any data that will be used to develop and evaluate AI solutions. Typically, training data is fed into a machine learning (ML) algorithm. The algorithm then learns the underlying patterns in the data so that it can make decisions on similar patterns.
For example, in order to build a machine-learned spam classifier similar to that in Gmail, ML algorithms need a source of data to learn from. This data would contain examples of spam and legitimate emails.
|Email|Label (correct answers)|
|You owe me 10,000 USD…|Spam|
|Dear, this FREE weight loss program…|Spam|
|Hi James, How about tonight at 7 pm?|Not Spam|
|Mom, could I drop by today when dad is back?|Not Spam|
Using this data, an ML algorithm can then learn what makes an email “spammy” and what makes it “legitimate”. Are there specific words that indicate spam? Email addresses? What about IP addresses? Through this learning process, the next time the “learned” machine learning model sees a brand new spam email, it will know how to categorize it.
Different types of machine learning algorithms demand different types of training data. One class of algorithms, known as supervised learning algorithms, requires the correct answers (also known as labels). The spam detection problem that you just saw requires labels.
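To make the supervised case concrete, here is a minimal sketch of learning from labeled examples. It uses a small Naive Bayes word-count classifier written with only the Python standard library. That is one simple technique chosen for illustration, not necessarily what a production spam filter such as Gmail's uses. The training rows come from the table above; the function names (`tokenize`, `train`, `predict`) and the two test emails are assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Toy labeled training data, taken from the table above.
TRAINING_DATA = [
    ("You owe me 10,000 USD", "Spam"),
    ("Dear, this FREE weight loss program", "Spam"),
    ("Hi James, How about tonight at 7 pm?", "Not Spam"),
    ("Mom, could I drop by today when dad is back?", "Not Spam"),
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of examples
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(text, model):
    """Return the most likely label, using add-one (Laplace) smoothing."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior, then add the log-likelihood of each word.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_DATA)
# Two hypothetical, previously unseen emails.
print(predict("FREE program, you owe nothing", model))  # Spam
print(predict("Hi mom, how about today?", model))       # Not Spam
```

With only four training examples the model simply latches onto words it has seen under each label, which is exactly why real systems need far more labeled data.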
Other algorithms, which fall under unsupervised learning, don’t need those answers. They just need large amounts of data in which to discover patterns automatically. An example is automatically grouping 5,000 customer support tickets into logical subgroups.
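As a rough illustration of grouping without labels, the sketch below clusters a handful of support tickets by word overlap. It uses a simple greedy grouping based on Jaccard similarity, swapped in here for a real clustering algorithm such as k-means because it fits in a few lines. The tickets, the stopword list, the `threshold` value, and all function names are hypothetical.

```python
import re

# Hypothetical support tickets. Note that there are no labels.
TICKETS = [
    "Cannot reset my password",
    "Password reset link not working",
    "Refund for duplicate charge",
    "Charged twice, need a refund",
    "Forgot password, reset email never arrived",
]

# Common words that carry little meaning for grouping (an assumed list).
STOPWORDS = {"a", "for", "my", "not", "need", "the", "never"}

def tokens(text):
    """Return the set of meaningful lowercase words in a ticket."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_tickets(tickets, threshold=0.15):
    """Greedily assign each ticket to the most similar existing group."""
    groups = []  # each group: {"words": set of words, "tickets": list of texts}
    for ticket in tickets:
        words = tokens(ticket)
        best = max(groups, key=lambda g: jaccard(words, g["words"]), default=None)
        if best is not None and jaccard(words, best["words"]) >= threshold:
            best["tickets"].append(ticket)
            best["words"] |= words
        else:
            # Nothing similar enough yet, so start a new group.
            groups.append({"words": set(words), "tickets": [ticket]})
    return [g["tickets"] for g in groups]

for i, group in enumerate(group_tickets(TICKETS), start=1):
    print(f"Group {i}: {group}")
```

On this toy input the password tickets end up in one group and the refund tickets in another, even though nobody told the program which categories exist.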
A big challenge that companies face is getting the right type of training data for AI development. They often find that the data they need was never stored, or that the data they did store lacks the necessary detail. This is why, for enterprise adoption of AI, it’s always good to start with a data strategy.
How much training data do I need?
For machine learning problems, the amount of training data you’ll need depends on two key factors:
- The complexity of the problem
- The algorithm in consideration
Let’s say you’re solving a complex fraud detection problem, where you’d have to look at a hundred different variables to decide whether a transaction is fraudulent. This is a complex problem that can be quite challenging even for a human. That is a good indicator that you’d need lots of examples (training data) to cover all the different patterns that can be considered fraudulent or non-fraudulent.
Contrast this with a problem where the task is to assign any given news article to one of two categories: sports or other. Other here means anything outside of sports. This is a straightforward problem for a human and can be straightforward for a machine. So at this complexity, you can expect to require much less data.
When it comes to algorithms, if you employ deep learning algorithms with very complex architectures instead of classical machine learning models, you could need many thousands, or even hundreds of thousands, of examples to teach the network what to do. Classical machine learning algorithms typically demand much less data.
Do I need data if I use rules-based AI?
If you’re looking to use a rules-based AI that doesn’t depend on learning algorithms, you still need data, even if the rules themselves are created automatically or semi-automatically. You may not need as much as supervised machine learning algorithms demand, though. You’d need data to:
- Analyze and learn more about the problem you’re looking to solve
- Use some of the data as “seeds” to trigger the automatic creation of rules
- Rigorously evaluate the quality of the rules-based system
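The last point, evaluating a rules-based system, can be sketched in a few lines. The example below applies one hand-written keyword rule to labeled emails and reports precision and recall. The first four emails come from the spam table earlier in this article. The fifth email, the keyword list, and the function names are hypothetical and exist only to show how evaluation data exposes a rule's mistakes.

```python
import re

# Hypothetical hand-written rule: an email is spam if it contains any of these words.
SPAM_KEYWORDS = {"free", "usd", "winner"}

# Labeled evaluation data. The last row is a made-up legitimate email.
LABELED_EMAILS = [
    ("You owe me 10,000 USD", "Spam"),
    ("Dear, this FREE weight loss program", "Spam"),
    ("Hi James, How about tonight at 7 pm?", "Not Spam"),
    ("Mom, could I drop by today when dad is back?", "Not Spam"),
    ("Are you free for lunch on Friday?", "Not Spam"),
]

def rule_says_spam(text):
    """Apply the keyword rule to a single email."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & SPAM_KEYWORDS)

def evaluate(examples):
    """Compare the rule's output with the labels; return (precision, recall)."""
    tp = fp = fn = 0
    for text, label in examples:
        predicted_spam = rule_says_spam(text)
        actual_spam = label == "Spam"
        if predicted_spam and actual_spam:
            tp += 1   # correctly flagged spam
        elif predicted_spam:
            fp += 1   # legitimate email wrongly flagged
        elif actual_spam:
            fn += 1   # spam the rule missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = evaluate(LABELED_EMAILS)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=1.00
```

Here the rule catches both spam emails but also flags the lunch invitation because it contains the word “free”. Without labeled data you would have no way of noticing that kind of error.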