Peter Gao
May 5, 2021
13 min

How To Set Up An ML Data Labeling System

ML models need labeled data for a variety of tasks and modalities. Left: video labeling. Right: named entity recognition for text with tagtog.

Most production machine learning applications today are based on supervised learning. In this setup, a machine learning model trains on a set of labeled training data in order to learn how to do a certain task. For example, an ML model may train on images that are labeled with the type of pet in that image so that the trained model can tell you what pets are present in a set of unlabeled imagery.

To get labeled training data, ML teams often rely on human annotators who label examples for the model to train on. At the beginning, the labeling “team” may consist of a single ML engineer drawing boxes on images for hours at a time. As systems become more mature, teams want to produce higher volumes of labeled data by leveraging automation and relying on large operational workforces.

Conceptually, a labeling system is simple: a user should be able to send data to a labeling system and get back labels for that data. However, there are many ways to set up a labeled data pipeline depending on your problem setup and requirements. To help you choose what’s right for you, here are some lessons we’ve learned on how to set up a great ML data labeling system.

What Do You Want From Labeling?

Before we compare different labeling setups, let’s go over the factors to evaluate when looking at a labeling system.

  • Accuracy: A labeling system should label data accurately and consistently, otherwise your model will learn to do the wrong thing! For example, if a datapoint is labeled as fraud, you want to be sure that it’s actually fraud and not a mistake by the labeler, otherwise your trained model may make similar mistakes when it runs in production. While labeling systems will inevitably have some error rate, there’s nothing worse than not knowing if you can trust your data and your metrics due to poor label accuracy!
  • Speed: Data should be labeled quickly and in large quantities. Some applications use labeling as QA of live model predictions and require close to real-time response times (within an hour). However, labeling for offline training workflows can sacrifice latency for throughput. Quantities of labels can range from hundreds of examples to hundreds of thousands of examples a week.
  • Cost: Labels should be produced at low cost in time and money. Labeling costs time because it usually requires engineering time to set up data pipelines and operational time to do the actual labeling. This time costs money, whether it’s from directly employing people or by contracting with an external provider.

A lot of decisions in your labeling system will trade off between these factors. For example, if you have stringent accuracy requirements, you can implement a QA process that improves label quality but adds additional time and cost.

However, there are often techniques that can improve all of these factors at once. Better labeling tooling, automation, and domain-specific optimizations can all produce clear wins in labeling efficiency.

Labeling Options

ML teams can choose to build or buy two critical pieces of the labeling system:

  1. The tooling and software needed for users to inspect, edit, and store data and labels.
  2. The operations workforce of labelers who use tools to do labeling.

Some vendors offer both components as a fully managed service while some offer one or the other. An ML team can also decide to build both. Generally, the more you build yourself, the more control you have over the labeling process, but at a higher cost in time and money.

Fully Managed Labeling Services

Fully managed services allow you to buy tooling and an operations force for doing labeling, typically charging per-label. This requires the least time investment from an ML team, though they will pay somewhat of a premium as a result.

Fully managed services can be ideal for standard tasks that don’t require specialized tooling or expert knowledge. Fully managed services tend to optimize their technology toolchain for the most popular ML tasks (for example, image classification and detection) and employ operational forces in countries with low labor costs, which significantly reduces the cost per label. In addition, some fully managed service companies may offer additional functionality like labeling automation features or out-of-the-box model training on top of your labeled data.

However, fully managed services often will invest less effort into smaller contracts. Some larger providers will mandate minimum labeling volume and won’t take on contracts below $100,000 per year! This is because fully managed providers must train and maintain their own sizable operations forces. To get any reasonable labeling quality, it can often take significant effort to train labelers for a certain task or to move them between tasks for different customers, requiring a minimum contract size to make it worth this investment. Other providers offer pay-as-you-go pricing which supports small / spiky labeling workloads at the cost of higher price and reduced label accuracy on complex tasks.

Examples of these types of providers include Scale AI, Playment, Appen, Amazon SageMaker Ground Truth, and Hive AI.

Buying Labeling Tooling

When you have more money than time, fully managed services are often the best option to go with. As your organization scales up and thinks about cost optimization, you may want to switch to a model where you decouple the labeling operations from the labeling tooling. Labeling tooling is the software that operations teams use to do their labeling work. It consists of UIs for labeling data, a storage system that keeps track of labels, and APIs for programmatically moving this data around.

When looking at labeling tooling independently, companies tend to look to open source tools first. Open source tooling tends to work well for small volume labeling tasks, but does not have as much support or functionality as software that is built by companies. Open source tooling is typically not hosted (thereby requiring some initial setup) and generally does not support more advanced features like labeler monitoring, task distribution, and QA flows that are crucial to scaling labeling to large workforces. That’s the point where teams decide to buy dedicated labeling software.

Buying labeling tooling is a lot like buying any other SaaS software. Most tooling is hosted SaaS to make it easier for offshore operations teams to access without a lot of IT setup. Some companies offer on-prem tooling for sensitive data. Some tooling providers will also include the option to partner with an operations workforce they have worked with in the past. However, all provide the ability for an ML team to create accounts for their own operations personnel.

ML teams often consist of engineers who are more familiar with machine learning tools like Jupyter notebooks and TensorFlow than they are with web development tools like React and Node. Buying labeling tools allows teams to use a higher quality end product compared to setting up ad-hoc Python GUIs, and can be cheaper and faster than hiring a team of tooling engineers to develop a polished web UI.

Additionally, there’s a lot of choice in who you can buy labeling tools from. Some companies specialize in certain types of labeling — NLP vs audio vs imagery vs video — allowing ML teams to select the best tool for the job. And if there’s nothing good enough out there, you can always hire some tooling engineers to roll it yourself.

Examples of tooling-only providers include Labelbox, Dataloop, SuperAnnotate, Supervisely, and Datasaur.

Buying Labeling Operations Workforces

The other half of the equation is the labeling workforce. If you want to scale your labeling throughput with a reasonable cost, you probably don’t want to make your ML engineers label data for hours a day. You may not even want to hire full-time employees if you can get off-shore contractors to do the same job for a fraction of the cost.

There’s a large spectrum of options for buying operations workforces. If you’re looking for flexible one-off work, it’s relatively easy to hire a contractor from TaskRabbit. This is good for labeling jobs that have flexible demand because it’s relatively easy to train a labeler for a day and keep them on as long as you need them. On the other hand, it is more expensive than hiring offshore workforces and therefore more difficult to scale to higher quantities.

As the quantity of labeling demand increases, you may want to contract with operations providers that can provide larger labeling workforces. These labelers are cheaper because they’re offshore in countries with lower wages, and their workforces are used to doing standard ML tasks for many different companies and domains.

However, the reliance on off-shore workers means there may be language or cultural barriers that require additional training to achieve the desired label accuracy. An example of this manifested at one of my previous jobs: we had asked our labeling team to draw bounding boxes around pictures of school buses, but our labeling team was based in a country that did not have school buses. We had to compile a training guide that contained many example pictures of school buses and also contained counterexamples of objects that looked like school buses, like pickup trucks with the same shade of yellow, to train the labelers to produce accurate labels.

These providers typically ask for longer-term commitments with minimum contract sizes based on reserving a number of labeling hours up-front, making it necessary for ML teams to have some constant labeling demand to keep the labelers busy or risk wasting time that has already been paid for. Some companies may provide English-speaking operations managers to make it easier to train and manage offshore teams while others simply provide hiring services and leave coordination up to the ML team.

Examples of outsourced operations providers include Cloudfactory, Samasource, TaskUs, and iMerit.

Roll Your Own

You can also choose to build everything yourself! This gives you maximum control over the labeling process, but can take a lot of time and money. I don’t recommend this for most situations, but there are a few reasons you may want to keep everything in-house:

  • Your task is so unique that external offerings are extremely poor and you can build a higher quality labeling tool in-house without much effort.
  • You have domain expertise / understanding that allows you to significantly speed up or automate the labeling process in a way that external offerings would not consider.
  • Your labeling task is so difficult that it’s better to hire and train labelers in-house to ensure better quality control and retention. This is important in domains like machine learning for medicine, where the only people who can generate acceptable quality labels are expensive physicians!
  • You are operating at a sufficiently large scale (i.e., Google / Facebook size) where it’s cheaper to build an in-house system from scratch than buy a service from a third party.

Building labeling tooling can be a somewhat daunting task and can balloon in complexity depending on the difficulty of your task and the scale of your operations team. The components you need to build include:

  • A user interface for viewing data and modifying labels.
  • A workflow engine to distribute labeling jobs between different people, QA labels, and reconcile conflicting labels.
  • Labeler quality + throughput statistics to monitor and manage their productivity.
  • Label storage and versioning.
  • Automation / pre-labeling features to reduce human labeler workload.
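
To give a feel for the "reconcile conflicting labels" piece of the workflow engine, here is a minimal sketch in Python. It assumes each datapoint is labeled by several people and uses a simple majority vote, which is only one possible policy. The function name `reconcile` and the `min_agreement` threshold are hypothetical, not taken from any particular tool.

```python
from collections import Counter
from typing import Optional, Sequence


def reconcile(votes: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the majority label, or None if agreement is too low.

    In this sketch, None means "send this datapoint to a human reviewer".
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Accept the top label only if its share of votes exceeds the threshold.
    if count / len(votes) > min_agreement:
        return label
    return None


print(reconcile(["cat", "cat", "dog"]))  # cat
print(reconcile(["cat", "dog"]))         # None (a tie, so escalate)
```

A real workflow engine would likely also weight votes by labeler track record and log which items were escalated, but the shape of the problem is the same.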

Building and managing an in-house operations workforce is a lot like managing normal employees, though at significantly higher scale than engineers / PMs. You’ll need:

  • Human resources infrastructure for labelers. This includes recruiting, interviewing, and hiring processes for labelers. Some teams will hire operations personnel as full-time employees, but most US teams will rely on 1099 contractors to keep costs down.
  • Operations managers to oversee the labeling workforce. Their job includes communicating with the ML teams about labeling guidelines, training labelers to follow these guidelines, and then monitoring labeling throughput and accuracy as they work.

Best Practices

Regardless of what labeling option you go with, there are some best practices that help ML teams get the best results.

Write Good Labeling Instructions

Regardless of what labeling setup you have, you will inevitably need to train a labeling force to annotate data according to a set of instructions defined by the ML team. As labelers encounter interesting edge cases or as requirements change, the instructions must also evolve to accurately reflect the ML team’s desires. Even as you switch labeling workforces or tools, you can use the same labeling instructions to quickly get a new labeling system up and running.

Without a set of solid labeling instructions, labelers will produce incorrect (or worse, inconsistent) results that waste time and money. More importantly, instead of being spread uniformly across the entire dataset, labeling errors tend to concentrate in certain subsets of the dataset, like ambiguous situations or rare scenarios that the labelers did not know how to handle. Here are some actionable recommendations to make sure things are done right:

  • For creating and editing instructions, use a tool like Notion that allows multiple team members to edit labeling instructions, comment on questionable changes, and share final versions with labelers as a reference.
  • Include examples (with pictures) of difficult or ambiguous cases with instructions on how they should be handled. Written instructions are often difficult to interpret, but pictures help clearly communicate what should be done.
  • Version your labeling documents. This helps for keeping track of which labelers have been retrained and when. This becomes important for tracking changes in labeling instructions across different segments of your dataset, especially when you need to change previous labels!
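
One lightweight way to make instruction versioning useful is to store the version next to every label. The sketch below assumes an integer version number for the labeling guide. The `LabelRecord` fields and the `needs_relabel` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    item_id: str
    label: str
    labeler: str
    instructions_version: int  # assumed: integer version of the labeling guide


def needs_relabel(records, changed_in_version):
    """Return ids of items labeled before a given instructions change."""
    return [r.item_id for r in records if r.instructions_version < changed_in_version]


records = [
    LabelRecord("img_001", "school_bus", "alice", 1),
    LabelRecord("img_002", "truck", "bob", 2),
    LabelRecord("img_003", "school_bus", "alice", 3),
]
# Suppose version 3 changed how school buses should be labeled.
print(needs_relabel(records, changed_in_version=3))  # ['img_001', 'img_002']
```

With this in place, a change to the instructions turns into a concrete list of datapoints to revisit rather than a vague worry about stale labels.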

For examples of great labeling instructions, Landing AI provides a great set of labeling instructions for industrial defect detection and some examples of instructions for difficult / ambiguous scenarios. The Waymo Open Dataset also contains instructions of when to label objects and how they should be labeled.

Trust But Verify

Some teams will receive labels and immediately train a model on that data. However, the performance metrics of their model will be very poor and they’ll have no idea why. After weeks of hand-wringing and hyperparameter optimization, they’ll inspect the labels and realize that many of them were incorrect, meaning that their model was getting confused at training time and their performance metrics were not reliable in the first place.

It’s very important to do some basic quality checks on the new labels you receive. Although your labeling provider may have a QA step in their process and may have written promises on label quality into your contract, the only way to verify that label quality meets your expectations is to take a look yourself. Indeed, many label errors are not straightforward mistakes like missing labels, but are often the result of miscommunication and misinterpretation between the ML team and the labeling team.

An easy way to check label quality is to simply visualize a subset of the new labels being produced by the labeling system. This can be done in a few hours by an ML engineer or product manager, and it can be a good gut check that reveals places where labeling instructions are ambiguous or misinterpreted.
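
Picking the subset to look at can be as simple as a seeded random sample, so that the same items can be pulled up again later. This is a minimal sketch. The function name, the default of 200 items, and the id format are assumptions.

```python
import random


def sample_for_review(label_ids, k=200, seed=0):
    """Pick up to k label ids uniformly at random for a manual spot check."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    ids = list(label_ids)
    return rng.sample(ids, min(k, len(ids)))


batch = [f"label_{i}" for i in range(10_000)]
to_review = sample_for_review(batch, k=200)
print(len(to_review))  # 200
```

A uniform sample gives an unbiased gut check of overall quality. It will not specifically surface rare edge cases, which is where more targeted checks come in.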

Ensuring label quality stays high is an ongoing effort. Mistakes slip through, are eventually caught, and then fixed. When labeling quantities become large, teams should focus their QA time on places where their models and their labels disagree, which efficiently surfaces datapoints with label errors. It may even prove useful to have higher skilled in-house operations personnel who exclusively check the quality of labels from external providers!
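
As a rough sketch of what "focus on disagreements" can look like for a classification task: rank labeled examples by how little probability the model assigns to the human label, then review from the top. The record layout (`id`, `label`, `probs`) is an assumption for illustration, not a format from any specific tool.

```python
def rank_by_disagreement(examples):
    """Sort examples so the strongest model/label disagreements come first.

    Each example is a dict with 'id', 'label' (the human label), and
    'probs' (the model's predicted probability per class).
    """
    def model_confidence_in_label(example):
        return example["probs"].get(example["label"], 0.0)

    return sorted(examples, key=model_confidence_in_label)


examples = [
    {"id": "a", "label": "fraud", "probs": {"fraud": 0.95, "legit": 0.05}},
    {"id": "b", "label": "fraud", "probs": {"fraud": 0.10, "legit": 0.90}},
    {"id": "c", "label": "legit", "probs": {"fraud": 0.40, "legit": 0.60}},
]
print([e["id"] for e in rank_by_disagreement(examples)])  # ['b', 'c', 'a']
```

Items at the top of this list are not guaranteed to be label errors, since the model may simply be wrong, but they are a much denser source of errors than a random sample.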

Do Bake Offs Between Labeling Systems

There are a lot of options for setting up labeling systems! Not only is there a decision of how much to build vs buy, there are also a lot of vendors who offer similar tooling to accomplish certain labeling tasks, and it can be difficult to narrow down what option / combination of buy and build is best for you. It’s usually possible to use a set of very rough requirements (support for a labeling task, pricing, available throughput, etc.) to pare down a large list of options to approximately 5 or so final candidates, but you still need a process to choose which system to go with.

Here, it’s important to establish what set of evaluation criteria you are trying to optimize for. What’s the right combination of labeling accuracy, speed, and cost for your use case? Once you have a set of evaluation criteria nailed down and a shortlist of options for labeling, you can run a bake off between competing labeling systems. The purpose of this bake off is to do an apples-to-apples comparison of different labeling systems on a small set of data, evaluate which option does the best on your evaluation criteria, and then use that information to choose which option to commit to.

To construct this bake off, we recommend you manually label a small “golden” set of data in-house using an expert labeler. This expert labeler often ends up being a member of the ML team that sets requirements for the ML system, such as an ML engineer or product manager. Since the labeler is also the person who knows what they want from the ML system, you can pretty reasonably assume this set of data is as close to perfectly labeled as you can get. This expert labeler should also construct a set of basic labeling instructions that can be used to produce more labels on new sets of data.

After that, you can send the same golden set of data (without labels) and identical labeling instructions to your candidate labeling systems. You can measure speed and cost fairly easily. To measure label accuracy, you can compute metrics that compare the labels from each provider to the golden set labels from your expert labeler.
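
Which metric to use depends on the task. As a hedged sketch, here are two common choices: an agreement rate for classification labels, and intersection-over-union (IoU) for bounding boxes. The function names and the `(x_min, y_min, x_max, y_max)` box format are assumptions.

```python
def agreement_rate(golden, candidate):
    """Fraction of golden items whose candidate label matches exactly.

    Both arguments map item id -> label. Items the provider skipped
    count as mismatches.
    """
    if not golden:
        raise ValueError("golden set is empty")
    matches = sum(
        1 for item_id, label in golden.items() if candidate.get(item_id) == label
    )
    return matches / len(golden)


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


golden = {"a": "cat", "b": "dog", "c": "cat", "d": "dog"}
provider = {"a": "cat", "b": "dog", "c": "dog"}  # "d" was never labeled
print(agreement_rate(golden, provider))  # 0.5
```

For detection tasks, a typical approach is to match each provider box to a golden box and count it as correct when the IoU clears a threshold you choose.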

Now you can choose the option that scores best on your evaluation based on the combination of accuracy, speed, and cost! This is a nice quantitative method to compare between options, though you’ll need to gauge what tradeoffs you’d like to make between those three factors.
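
One way to make that tradeoff explicit is a weighted score. The sketch below normalizes each criterion against the best value in the shortlist (higher accuracy is better, lower turnaround and cost are better) and combines them with weights that you pick. The vendor names, numbers, and weights are all made up for illustration.

```python
def score_vendors(results, weights):
    """Return a weighted score in (0, 1] per vendor; higher is better.

    results maps vendor -> dict with 'accuracy' (higher is better),
    'hours' and 'cost_per_label' (lower is better, must be positive).
    """
    best_accuracy = max(r["accuracy"] for r in results.values())
    best_hours = min(r["hours"] for r in results.values())
    best_cost = min(r["cost_per_label"] for r in results.values())
    return {
        vendor: (
            weights["accuracy"] * r["accuracy"] / best_accuracy
            + weights["speed"] * best_hours / r["hours"]
            + weights["cost"] * best_cost / r["cost_per_label"]
        )
        for vendor, r in results.items()
    }


results = {  # hypothetical bake off results
    "vendor_a": {"accuracy": 0.95, "hours": 48, "cost_per_label": 0.10},
    "vendor_b": {"accuracy": 0.90, "hours": 24, "cost_per_label": 0.05},
}
weights = {"accuracy": 0.6, "speed": 0.2, "cost": 0.2}  # should sum to 1
scores = score_vendors(results, weights)
print(max(scores, key=scores.get))  # vendor_b
```

The weights encode your priorities, so a team with stringent accuracy requirements would weight accuracy more heavily and might reach a different answer from the same bake off results.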

It’s also useful to repeat this exercise every 3–6 months to see if other providers can achieve better results. Labeling is a fairly competitive field that’s relatively commoditized, making it relatively low friction to switch from one provider to another as long as they score well on your bake offs. Our company, Aquarium, makes it easy for ML teams to do bake offs between labeling systems. Aquarium allows teams to send data to labeling vendors to be labeled and then evaluate accuracy of the results compared to a golden dataset.


It’s sometimes daunting to figure out how to scale an ML system, and having a good flow of labeled data is a vital part of shipping great models. In this post, we’ve laid out a variety of options for setting up ML labeling systems that trade between cost, speed, and flexibility. We’ve also laid out a process that allows you to fairly evaluate different labeling options and choose what is best for your use case.
