How to Generate Quality Training Data For Your Machine Learning Projects (even if you’re data starved)


Have you run into issues acquiring the right type of data for your machine learning (ML) projects?

You’re not alone. Many teams do. And data is one of the key sticking points in starting AI initiatives at companies. In fact, according to IBM’s CEO, Arvind Krishna, data-related challenges are the top reason IBM clients have halted or canceled AI projects. 

Often what happens in practice is that the relevant ML training data is either not collected at all, or it is collected but lacks the labels required to train a model. It could also be that the existing volume of data is insufficient for ML model development.

As I’ve discussed in one of my previous data articles, such issues result in delays, project cancellation, biased predictions, and an overall lack of trust in AI initiatives. Bottom line: having the right data, in the right volume, is critical for any ML project.

But, what if your company does not have a solid big data strategy, or you’re just getting started with data collection? How can you safely start machine learning projects for your automation tasks?

In this article, we’ll explore five strategies for obtaining high-quality machine learning training data for your projects, even if you’re new to AI or your data strategy is still in the works. This is a long article, so take the time to explore each strategy carefully.

5 Strategies for Generating Machine Learning Training Data

#1: Start Manually with Domain Experts

If you have zero data for an automation problem or your data is limited, you can put together a team of experts who’ll complete the tasks manually and, in doing so, start generating high-quality data.

How data collection works using a team of experts

Say you’re looking to develop an AI tool that detects fraudulent website logins. If you’ve never tracked fraudulent login attempts, you’ll have little to no data to train a model. But you can begin the process manually, with a team of security experts reviewing logins and generating high-quality data as they work. This data can later be used to train a machine to detect fraud just like its human counterpart. All the data from the manual process can be tracked, collected, and stored for model training.
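To make the "track, collect, and store" step concrete, here is a minimal sketch of how each expert decision could be logged as one labeled record. The field names, the JSON Lines format, and the `record_decision` and `load_training_rows` helpers are all assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from io import StringIO


@dataclass
class ReviewedLogin:
    # All field names here are illustrative assumptions.
    login_id: str
    ip_address: str
    failed_attempts: int
    is_fraud: bool       # the expert's decision becomes the training label
    reviewer: str
    reviewed_at: str


def record_decision(sink, login_id, ip_address, failed_attempts, is_fraud, reviewer):
    """Append one expert decision to `sink` as a JSON line and return the record."""
    record = ReviewedLogin(
        login_id=login_id,
        ip_address=ip_address,
        failed_attempts=failed_attempts,
        is_fraud=is_fraud,
        reviewer=reviewer,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )
    sink.write(json.dumps(asdict(record)) + "\n")
    return record


def load_training_rows(source):
    """Read the JSON lines back as (features, label) pairs for later model training."""
    rows = []
    for line in source:
        item = json.loads(line)
        features = {"failed_attempts": item["failed_attempts"]}
        rows.append((features, item["is_fraud"]))
    return rows


# In practice `sink` would be a file or database; StringIO keeps the sketch self-contained.
log = StringIO()
record_decision(log, "login-001", "203.0.113.7", 9, True, "analyst_a")
record_decision(log, "login-002", "198.51.100.4", 0, False, "analyst_b")
log.seek(0)
training_rows = load_training_rows(log)
```

Storing the reviewer and timestamp alongside the label is a design choice that makes it easier to audit or re-check decisions later.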

When is a manual strategy with domain experts suitable?

Because a manual approach can be slow, it’s best reserved for problems that require deep domain expertise or where people’s health and safety are at stake. For example, in a tumor detection task, you can’t just train models on image data produced by a layperson or even a general practitioner. You need data from experienced radiologists, because it takes deep domain expertise to detect and label tumors correctly. Otherwise, there’s a higher risk of problematic predictions, putting people’s health at risk.

Pros and cons of a manual approach with domain experts:

There is no better data than data that has been manually generated and vetted. Manually generated data, especially data produced by subject matter experts (SMEs), is usually accurate. Plus, there is an added advantage: doing a task manually to begin with gives you insights into the idiosyncrasies of the task, which can help teams better handle edge cases as automation is introduced.

But the downside, of course, is volume. If you have a small team working on the tasks manually, it can take months to generate sufficient data.

#2: Start Manually with Customers

When you’re trying to develop AI-driven productivity enhancement tools for customers, who better to tell you what’s expected than customers themselves? Instead of starting with automation right off the bat, why not augment existing software so that customers can complete the task manually first?

I once worked with a company that wanted to introduce a machine learning model to detect duplicate discussion threads on their software platform. But since the relevant data was non-existent, we first went manual. We introduced a mechanism to allow users to manually specify if a discussion thread was a duplicate of another. In this case, the customers were the “domain experts,” and they decided which threads were duplicates. All the customer-generated data was later leveraged to build an AI-driven duplicate detection solution. You can use a similar approach for many other automation tasks, such as automatically tagging documents with topics and tagging objects in images. With this approach, you are leveraging the power of volume and the expertise of your very own customers. 

When is a manual strategy with customers suitable?

A manual strategy with customers makes the most sense when you have a large customer base and the task benefits the customer. For example, the duplicate detection task we discussed earlier helped customers explore related discussion threads. So, they were willing to complete the task manually. If completing a task doesn’t benefit the customer, you could still generate data, but the volume could be much lower.

Pros and cons of a manual strategy with customers:

When you have motivated customers willing to complete the manual tasks, this strategy ensures that you’re getting volume and variety in your data in a relatively short time. I’ve personally found that providing sufficient variety in data helps build robust models. On the flip side, this approach may end up generating poor quality data if customers “game” the system for their benefit. Additionally, if customers couldn’t care less about the task, the data could end up being sparse.

#3: Pair Humans With Software Rules (i.e., Semi-Automatic)

Another approach to collecting good quality training data is to pair rules encoded in software with humans in the loop. Essentially, you encode rules a human would use to perform a task as a set of software rules. At the same time, you still have a few humans in the loop to act as a quality control layer. If the corrected and vetted data is stored, you can use it for model training in months to come.

How machine learning training data generation works with the use of simple software automation and humans in the loop.

If we reimagine this approach for the fraudulent login example, the software will flag all suspicious login attempts. Then a human goes in and fixes any problematic classifications. Or, in the case of the duplicate thread detection problem, the software may suggest potential duplicates to customers. But in the end, customers decide if two discussion threads are, in fact, duplicates. This reduces the amount of manual work that customers or domain experts would have to put in, which speeds up data generation.

When is a semi-automatic approach suitable?

A semi-automatic approach is suitable for problems that can be encoded fairly quickly using software rules. For example, you could potentially express conditions that lead to fraudulent website logins using a set of rules. Of course, these rules may not be 100 or even 90 percent accurate. But they can be good enough to get started, allowing the humans-in-the-loop to be the final decision-makers.
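As a rough illustration of pairing rules with a human in the loop, the sketch below flags login attempts with two made-up rules and lets a reviewer override the suggestion. The `LoginAttempt` fields, the rule names, and the threshold of five failed attempts are assumptions chosen for the example, not rules from a real fraud system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoginAttempt:
    # Hypothetical fields; a real system would use whatever signals it logs.
    login_id: str
    failed_attempts: int
    new_device: bool
    country: str
    usual_country: str


def rule_flags(attempt: LoginAttempt) -> list:
    """Return the names of the rules that fired for this attempt."""
    fired = []
    if attempt.failed_attempts >= 5:  # threshold is an assumption
        fired.append("many_failed_attempts")
    if attempt.new_device and attempt.country != attempt.usual_country:
        fired.append("new_device_in_unusual_country")
    return fired


def review(attempt: LoginAttempt, human_label: Optional[bool] = None) -> dict:
    """Combine the rule suggestion with an optional human correction."""
    fired = rule_flags(attempt)
    suggested = bool(fired)
    # None means the reviewer accepted the suggestion as-is.
    final = suggested if human_label is None else human_label
    return {
        "login_id": attempt.login_id,
        "rules_fired": fired,
        "suggested_fraud": suggested,
        "final_fraud": final,  # the vetted label to store for training
        "human_overrode": final != suggested,
    }


traveler = LoginAttempt("login-101", 0, True, "FR", "US")
attacker = LoginAttempt("login-102", 8, False, "US", "US")
rows = [
    review(traveler, human_label=False),  # reviewer clears a false alarm
    review(attacker),                     # reviewer accepts the suggestion
]
```

Keeping both the rule suggestion and the final human label in each row also shows how often the rules are wrong, which helps decide when they need refining.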

Pros and cons of a semi-automatic approach:

The benefit of using a semi-automatic approach is an improvement in speed over a manual method. That’s because you can drastically reduce manual work for a human decision-maker with rules-based software in the mix. This in turn speeds up task completion and makes it possible to generate a higher volume of data in a shorter time.

The downside of this approach is that it may be hard to form a reliable set of rules for certain problems. Plus, it can take additional time and monetary investments to develop the rules-based software automation.

#4: Crowdsource Internally

Crowdsourcing internally means asking a group of people you know and trust (e.g., within the company or your circle of friends) to complete a certain labeling task.

How internal crowdsourcing works to generate labeled data for machine learning

Say you’re looking to build a sentiment classification tool. You could ask colleagues to label phrases such as “prompt customer service” and “flawed design” as containing a positive or negative sentiment, specifically to generate labeled data to develop the ML tool. This is different from a manual or semi-automatic approach as you’re taking the task out of its natural context and presenting it to different people to generate labels.

Internal crowdsourcing requires some extra setup work to acquire labels. I’ve personally used online platforms such as LightTag to collect labels from SMEs, but there are many others out there.

Example labeling of people and places with LightTag to generate training data for ML

The trick is to find a tool that fits the task. Some companies end up building their own systems to collect labels as the off-the-shelf tools are either not flexible enough for their needs or they’re concerned about data privacy. 

When is crowdsourcing internally suitable?

Data generation tasks that can be easily taken out of context or don’t require domain expertise make great internal crowdsourcing candidates. For example, these can be tasks that only require “common sense” or knowledge of a particular language. People in your trusted circle are unlikely to be spammers and are inclined to complete tasks properly to support your cause. Internal crowdsourcing can also work with complex tasks—but you must use the right SMEs. I’ve personally had several successes with crowdsourcing using a team of SMEs within the healthcare domain. You can do the same. 

Pros and cons of internal crowdsourcing:

The benefit of crowdsourcing internally is that you can generate a good volume of high-quality labeled data for many problems. But you need to be extra careful with tasks that require domain expertise. For example, if you’re asking a group of radiologists from different hospitals to study a set of digital images and label tumor location, you may need labels from different SMEs for a single task to ensure accuracy. This can slow things down, but at least you’re sure you’re generating quality data. Also, there may not be an appropriate off-the-shelf tool to help you obtain the necessary labels, and you may have to build the tool first. Not to forget, you’d also need to train your labelers to complete tasks adequately.  

#5: Crowdsource Externally

Crowdsourcing externally is about paying unknown human workers to generate the necessary data for your ML projects. These workers could very well be in different countries.

How external crowdsourcing works to generate training data for ML

Amazon Mechanical Turk, for example, is an online platform that allows you to outsource labeling tasks to workers around the world, to generate data for machine learning projects quickly. There are also data labeling companies that hire workers specifically to generate data for AI projects. 

Crowdsourcing externally is a speedy way of generating large volumes of data. For example, I’ve received sentiment annotations for several thousand sentences from Mechanical Turk in a matter of minutes. But as with anything this easy, there’s always a catch. The labeling may not always be accurate. This means that you need to vet the quality of workers, or pay higher costs per labeling task to get more “trustworthy” workers on your task. You may also need multiple workers to complete the same task to ensure that the labels are accurate. 
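Having multiple workers label the same item is often combined with a simple majority vote. The sketch below is one possible way to do that: it accepts a label only when enough workers agree and sends the rest to manual review. The `aggregate_labels` helper, the 0.6 agreement threshold, and the third example sentence are assumptions for illustration, and the code assumes every item has at least one vote.

```python
from collections import Counter


def aggregate_labels(votes_by_item, min_agreement=0.6):
    """Majority-vote each item; items below `min_agreement` go to manual review."""
    accepted, needs_review = {}, []
    for item, votes in votes_by_item.items():
        # Most frequent label and how many workers chose it.
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            accepted[item] = label
        else:
            needs_review.append(item)
    return accepted, needs_review


votes = {
    "prompt customer service": ["positive", "positive", "positive"],
    "flawed design": ["negative", "negative", "positive"],
    "it works, I guess": ["positive", "negative", "neutral"],  # made-up example
}
accepted, needs_review = aggregate_labels(votes)
```

Raising `min_agreement` trades volume for accuracy: fewer items are accepted automatically, but the ones that are accepted have stronger consensus behind them.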

When is crowdsourcing externally suitable?

External crowdsourcing is very effective for simple labeling tasks that don’t require special domain knowledge and can be taken out of their natural context. For example:

  • Tagging people and objects within images
  • Casting a sentiment opinion on pieces of text
  • Tagging people, places, and products in text

Pros and cons of external crowdsourcing:

Crowdsourcing externally is the fastest way to generate huge amounts of data in a short period of time. But this comes with several downsides. First, the labels may not be as accurate as you’d like them to be. So, this approach should be reserved for tasks that can tolerate some “noise” in the predictions. You’d also need to think about how to account for spam labels by workers randomly completing tasks, and also how best to improve the accuracy of labels produced by your unknown workers.
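One common way to account for spam labels is to mix in a few items whose correct answer you already know and score each worker against them. The sketch below assumes that setup. The helper names, the item IDs, and the 0.8 accuracy cutoff are hypothetical and would need tuning for a real task.

```python
def worker_accuracy(answers, gold):
    """Fraction of known-answer (gold) items each worker labeled correctly."""
    scores = {}
    for worker, labels in answers.items():
        checked = [item for item in labels if item in gold]
        if not checked:
            continue  # no gold items seen, so this worker cannot be scored
        correct = sum(labels[item] == gold[item] for item in checked)
        scores[worker] = correct / len(checked)
    return scores


def trusted_workers(answers, gold, min_accuracy=0.8):
    """Workers whose accuracy on gold items meets the cutoff."""
    scores = worker_accuracy(answers, gold)
    return {worker for worker, score in scores.items() if score >= min_accuracy}


# Items g1..g3 have known answers; s1 is a regular, unlabeled item.
gold = {"g1": "positive", "g2": "negative", "g3": "positive"}
answers = {
    "worker_a": {"g1": "positive", "g2": "negative", "g3": "positive", "s1": "negative"},
    "worker_b": {"g1": "positive", "g2": "positive", "g3": "positive", "s1": "positive"},
}
scores = worker_accuracy(answers, gold)
trusted = trusted_workers(answers, gold)
```

Labels from workers outside the trusted set can then be dropped or sent for re-labeling before any training happens.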

Key differences between a manual or semi-automatic approach and a crowdsourcing approach in generating data for machine learning

Comparison of ALL Training Data Generation Strategies

Here’s a quick comparison of the different machine learning training data generation strategies discussed in this article. I hope this guides you towards the best approach for your machine learning projects.

Comparison of the different strategies for generating ML training data

Final Word

We’ve seen that there are many ways to get started with AI initiatives without an elaborate data strategy. If you’re looking to automate a new task with AI but don’t have the data, you can generate it by starting with a manual process. Alternatively, you can augment your manual process with a semi-automatic approach to increase data generation speed.

Further, if a task can be taken out of its natural context and you don’t need deep domain expertise, you can think about crowdsourcing internally within a trusted circle or externally, with unknown human workers. But external crowdsourcing requires extra care to counter quality issues such as spam and unreliable labels.

So which approach should you consider? This is entirely task-dependent. If you’re dealing with tasks where accuracy is of utmost importance, a manual or semi-automatic strategy and internal crowdsourcing with SMEs can work well. If you need labeled data for a task that anyone can easily complete, you could consider internal or external crowdsourcing. If you’re having data struggles for your ML projects, would you consider any of these approaches? Leave a comment below to share your thoughts!

2 thoughts on “How to Generate Quality Training Data For Your Machine Learning Projects (even if you’re data starved)”

  1. Excellent post, Kavita. Along with your recent article on LinkedIn (“What is the top reason why companies halt or cancel AI projects?”), this article highlights the need to match the right data collection strategy to the AI project at hand to maximize chances of success.

    In my experience, I’m also seeing companies (and specific divisions of larger companies) struggling to source the right data even from internal sources that are ostensibly mature (with the people, processes, funding and governance mechanisms in place). In the current phase of the hype cycle for AI/ML, teams that plan their data collection strategy for AI/ML projects as one-off, ad-hoc projects are likely to fall short of realizing the full value of the initiative when the right data collection processes are not anticipated and adopted. This seems to arise out of the mentality AI/ML applications are somehow so distinct and unique they require their own plumbing.

    I wonder, then, if in a year or two, the focus on AI/ML as a stand-alone technology will fade, and organizations will realize that such projects and initiatives will have to be supported by robust and mature data stacks all the way from transactional systems to data warehouses to analytical infrastructure (including reporting, BI, dashboards, etc.). AI/ML would then become a key capability that sits atop a functional pyramid of data, not a brittle stand-alone silo that is unsustainable and of self-limiting value to the organization.

    1. What great insights, Raj. I do believe that ML today needs to sit on top of that data pyramid…until of course ML no longer relies on data for learning. But that’s unlikely to happen in the near future. Until then, companies with robust data pipelines and those that make data more accessible to the organization at large will see more long-term use of AI.

      You mention that some companies struggle to source the right data internally even when the data is present. Could you provide some insights as to why that’s happening?
