Peter Gao
February 9, 2023

What Is ML Data Operations?

An image generated by Stable Diffusion from the prompt, “an aesthetic image of a robot reading a book.” It’s important to learn from data!

ML Data Operations is the set of practices around understanding, handling, and improving the data used in machine learning systems.

Improving data is a key lever to improve model performance, and ML Data Operations aims to streamline the process around iterating on and improving that data, both from a technical infrastructure perspective and a human process perspective.

Background

Traditional software engineering revolves around writing code. Software teams gather requirements, modify code, test code, deploy code, monitor code in production, and repeat.

Machine learning introduces a new ingredient. Machine learning systems are a combination of code and data. At a high level, the workflow for machine learning development seems similar. Machine learning teams gather requirements, modify code and/or data, retrain models, test models, deploy models, monitor them in production, and repeat.

MLOps encompasses many of the tools and best practices around this development cycle: distributed training infrastructure, pipeline orchestration, model versioning, model monitoring, etc. However, MLOps tends to focus on carrying code development practices over to their analogs in ML, with an emphasis on automation, data infrastructure, and bookkeeping for technical production operations.

As shown in a talk by Andrew Ng on data-centric AI, working on training data leads to more model improvement than working on model code.

However, most of the improvement to production ML systems comes from improvements to the data they train on. This is the “ML data” part of ML data operations. In fact, learning from data is what differentiates machine learning from traditional software development.

AI researcher Andrej Karpathy describes this concept elegantly in his article on Software 2.0:

In particular, we’ve built up a vast amount of tooling that assists humans in writing 1.0 code, such as powerful IDEs with features like syntax highlighting, debuggers, profilers, go to def, git integration, etc. In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets. For example, when the network fails in some hard or rare cases, we do not fix those predictions by writing code, but by including more labeled examples of those cases.

While the process of developing code also changes with machine learning systems, the process of developing data is very different and has a large operations component. This is the “operations” part of ML data operations.

Most ML systems today are based on supervised learning, where they are trained on a set of “ground truth” labels that are assumed to define the correct behavior of the model. And a majority of supervised labels in production applications come from humans manually annotating data. Imagine drawing bounding boxes on objects in an image to train an object detector for cars and pedestrians, or categorizing the text of a customer support ticket in order to train a model to do the same.

The first two stages of training ChatGPT rely heavily on supervised learning.

For example, ChatGPT relies on manually generated labels in a number of ways. Its creators fine-tuned a language model on human-written responses to human-written sample prompts, and then they trained a reward model on human-labeled rankings of language model outputs. So the “operations” portion of ML data operations requires a lot of human interaction.

Human involvement in building datasets is necessary to define the desired behavior of the ML system (and therefore the product) because the training dataset is labeled by humans. As a result, product requirements for ML systems tend to be around automation, i.e., getting models to reach human-level performance on tasks that can be done by humans. Think of self-driving cars, recycling sorting robots, or self-checkout stores. The human operations component of machine learning is therefore important and quite difficult to replace.

On the plus side, data is generally much easier to understand for a human. It’s generally easy for a human to look at a datapoint and say, “this dog is correctly labeled as a dog” or “this model does badly on cats” without needing to understand how to read or write code. As a result, data is also an accessible human interface for developing machine learning systems.

ML Data Operations Today

The ML Data Operations Workflow (image by author)

The lifecycle of improving ML data consists of steps to:

  • Define requirements via labeling guidelines
  • Label data, often with large operations workforces
  • Conduct quality assurance on labels and iterate on labeling quality
  • Determine what data to add to the dataset next. Does the dataset distribution represent the production environment it is operating in? Does the dataset need more examples of a scenario that it struggles on?
  • Collect data from production and send it to labeling
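The question of whether the dataset distribution represents the production environment can be made concrete with a rough check. The sketch below is only one possible approach, and it assumes that class labels (human labels or model predictions) are available for a sample of production data. The function names and the sample labels are hypothetical and made up for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples that carry each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented_classes(train_labels, production_labels):
    """Rank classes by how much rarer they are in training than in production.

    Returns (label, gap) pairs sorted by gap, largest first, where
    gap = production share - training share. A positive gap means the
    class appears more often in production than in the training set.
    """
    train = class_shares(train_labels)
    prod = class_shares(production_labels)
    all_labels = set(train) | set(prod)
    gaps = {
        label: prod.get(label, 0.0) - train.get(label, 0.0)
        for label in all_labels
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical label samples, invented for this example.
train_labels = ["car"] * 80 + ["pedestrian"] * 20
production_labels = ["car"] * 50 + ["pedestrian"] * 40 + ["cyclist"] * 10

gaps = underrepresented_classes(train_labels, production_labels)
for label, gap in gaps:
    print(f"{label}: {gap:+.2f}")
```

Classes at the top of the ranking are candidates for the next round of data collection and labeling. A class that never appears in the training set, like the cyclist here, shows up with a positive gap as well.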

Unfortunately, machine learning development practices are still in a nascent state. It’s often unclear to ML teams how to set up their technical infrastructure and team structures to efficiently improve and ship new models. What should a team invest effort in to get the most improvement at any given time?

  • Labeling data with offshore labelers through an external provider vs in-house vs a semi-automated method?
  • Doing a QA pass of their labeled data with an in-house team or with an external labeling provider?
  • Examining their model failures in their labeled training/validation sets or determining if the model is failing on unlabeled data in production?
  • Collecting data to label through random sampling or with a more sophisticated active learning approach?
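To make the last choice concrete, here is a minimal sketch that contrasts random sampling with uncertainty sampling, one simple form of active learning. It assumes the model outputs class probabilities for each unlabeled datapoint. The ids, probabilities, and function names are hypothetical, and entropy is only one of several reasonable uncertainty measures.

```python
import math
import random


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_random(predictions, budget, seed=0):
    """Baseline: pick `budget` datapoint ids uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(predictions), budget)


def select_most_uncertain(predictions, budget):
    """Uncertainty sampling: pick the ids with the highest-entropy predictions."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs on unlabeled data: id -> class probabilities.
predictions = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.90, 0.10],
    "img_004": [0.50, 0.50],
}

to_label = select_most_uncertain(predictions, budget=2)
print(to_label)
```

With these made-up numbers, the uncertainty-based selection picks the two datapoints the model is least sure about, while the random baseline ignores the model's predictions entirely. Which strategy pays off in practice depends on the dataset and the labeling budget.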

Additionally, many of the current tools around this process are either labor intensive or not well suited to the ML domain. Consider the example of doing targeted data collection. ML teams will commonly want to collect more examples of a specific scenario that the model struggles on. With current tools, they have an unenviable choice between:

  1. Assigning an ML engineer to write one-off code queries in a Jupyter notebook to query their dataset. This not only takes an ML engineer out of commission for a relatively low value task, but it’s often not possible to write the correct code to find that data. Many queries are difficult to describe in code or SQL: for example, how does one easily write a query to collect more examples of rockfish in a dataset full of tuna?
  2. Assigning a large group of operations personnel to trawl through individual datapoints in a spreadsheet to find a specific example. While this is often very effective, it’s extremely time consuming and expensive to scale to large datasets because of the dependence on a large operations force.
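One alternative to both options, which this post does not describe and is offered here only as an illustration, is to search by example: start from one known rockfish and rank the rest of the dataset by similarity to it. The sketch below assumes that every datapoint already has an embedding vector from some model. The ids, vectors, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(query_embedding, embeddings, top_k):
    """Return the top_k (id, similarity) pairs closest to the query embedding."""
    scored = [
        (item_id, cosine_similarity(query_embedding, vector))
        for item_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings for unlabeled images, invented for this example.
embeddings = {
    "fish_01": [0.9, 0.1, 0.0],
    "fish_02": [0.1, 0.9, 0.2],
    "fish_03": [0.2, 0.8, 0.1],
    "fish_04": [0.8, 0.2, 0.1],
}

# Embedding of a single known rockfish example (also made up).
query = [0.1, 0.9, 0.1]

matches = find_similar(query, embeddings, top_k=2)
for item_id, score in matches:
    print(item_id, round(score, 3))
```

The person running the query only has to point at an example, not describe it in code or SQL. The results are only as good as the embeddings, so a human should still review the matches before sending them to labeling.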

Moreover, ML development has historically focused on ML engineers when it should focus on ML data operations personnel. Data operations team members are often domain experts in the ML application’s product requirements and are responsible for the maintenance and improvement of the datasets that the models train on. In essence, they are the “instructors” that machine learning models “learn” from, using training data as an interface. They often train larger operations workforces of labelers and can judge whether a model is producing correct or incorrect output for their application.

Having data operations team members work on data frees up ML engineers to work on tasks like building ML infrastructure or experimenting with new model code. In many domains, the data operations personnel have specialized knowledge that ML engineers do not. An ML engineer can’t look at an X-ray and judge whether the model correctly detected if a bone is fractured, whereas a radiologist can make that determination and then decide if the model needs more training data of that specific scenario.

What’s Next?

Because of the poor state of best practices and tooling, machine learning projects have extremely unpredictable return on investment. ML leaders are unsure whether some amount of time invested into improving the ML pipeline will actually lead to a lift in performance metrics or unintentionally result in regressions that can take months to diagnose and fix.

Much of the emphasis in ML development in the MLOps movement has been on data infrastructure. This helps teams iterate faster — it may be easier to rerun a retraining cycle in your ML pipeline, but there’s no guarantee that the new model will be any better. This can waste a lot of money by spending expensive GPU compute-hours to quickly produce models with similar or worse performance to the current model in production.

ML Data Operations is about helping teams iterate smarter — so that every time you run a retraining cycle, you know that the resulting model will be predictably better. More model improvement, less time and cost.

Andrew Ng already led the charge with his emphasis on data-centric AI. However, it will take more effort to figure out the best practices on the operations and tooling side of the actual human actions of improving the data. This is not purely a technical problem; it is also a human-computer interaction problem. The goal at the end of the day is to make it easier for a human to understand the performance of a model and to provide corrective feedback to improve its performance, using data as an interface that is comprehensible to a nontechnical domain expert.

Data operations makes an outsize contribution to the ML system as a whole, but it is currently underserved: ML teams lack access to the specialized tools they need to perform common ML data workflows. The DevOps movement recognized that development of code could be done far more efficiently by reducing separation between feature development and production operations. ML data operations aims to do the same thing for machine learning development by reducing separation between model development and human interaction with data.

We are at a point where the technical capabilities of machine learning can solve a number of economically valuable problems, and the blocker to wider adoption of AI is the difficulty in applying current technology to those use cases. In particular, we can do better as an industry to improve the development of AI/ML applications. Not just making flashy prototypes or hype-filled demos, but actual production systems that can solve real problems for people. ML Data Operations is a key part of making AI a ubiquitous and useful part of our everyday lives.

Contact Us

If you found this post interesting or want to talk about ML data operations, please reach out or check out more of our content on improving production ML systems. We also have a community Slack if you want to chat live!
