Don’t Sleep On Data Operations
When most people think about deep learning practitioners, they think of data scientists who whisper to machine learning models using special powers they learned during their PhDs.
While that may be true for some organizations, the reality of most practical deep learning applications is more banal. The biggest determinant of model performance is now the data, not the model code. And when data is supreme, data operations becomes the most important part of your ML team.
An Intro To Data Operations
Fundamentally, data operations teams are responsible for the maintenance and improvement of the datasets that models train on. Some of their responsibilities include:
- Ensuring the data and labels are clean and consistent. Bad data in the training set means that models will be confused at train time and learn the wrong thing. Bad data in the test set means you can’t trust your model performance metrics to be accurate.
- Tracing errors in the ML system back to the datapoints (or lack of datapoints) that caused those errors. Good understanding of error cases makes it easier to fix them.
- Sourcing, labeling, and adding data to the dataset based on current priorities: fixing critical customer problems, addressing deficiencies in the model performance, or expanding model functionality to new tasks / domains.
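The first responsibility above, clean and consistent labels, is often where lightweight tooling pays off first. As a minimal sketch (the taxonomy and record format here are hypothetical), an audit might flag labels outside an agreed taxonomy and datapoints where annotators disagree:

```python
from collections import defaultdict

# Assumed taxonomy for a recycling-style labeling task.
VALID_LABELS = {"plastic", "glass", "metal", "paper"}

def audit_labels(records):
    """Flag labels outside the taxonomy and datapoints with conflicting labels.

    records: list of (datapoint_id, label) pairs, possibly from multiple annotators.
    """
    out_of_taxonomy = [(dp, lab) for dp, lab in records if lab not in VALID_LABELS]
    by_datapoint = defaultdict(set)
    for dp, lab in records:
        by_datapoint[dp].add(lab)
    conflicts = {dp: labs for dp, labs in by_datapoint.items() if len(labs) > 1}
    return out_of_taxonomy, conflicts

records = [("img_1", "plastic"), ("img_1", "glass"),   # two annotators disagree
           ("img_2", "metal"), ("img_3", "cardboard")]  # label not in taxonomy
bad, conflicts = audit_labels(records)
```

Checks like these are cheap to run on every labeling batch, and they surface exactly the kind of inconsistency that would otherwise silently corrupt train and test sets.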
A data operations team member is often an expert in their domain. Think about a recycling specialist who can distinguish between plastic and glass containers on sight, or a translator who can convert Chinese to Portuguese, or a radiologist who can navigate an MRI and tell you whether a patient has cancer or not.
Data operations personnel can also come from consulting or business backgrounds. It helps to be organized and methodical when working on any operations task, but especially with data. Knowledge of the business goals and the technology’s capabilities can also inform how best to prioritize data curation in order to improve the ML system.
Within data operations teams, team members can be assigned based on the data / model type that they are responsible for (for example, in a self driving application, different teammates owning the radar, lidar, and image detection systems) or based on the customer / geography that they serve (for example, one team member handling North American deployments and another handling APAC).
Data operations team members will often work with offshore labeling teams to scale data labeling throughput. The offshore team handles tasks that are simpler but take more manual effort, like adjusting bounding box labels to fit exactly around a variety of objects or labeling pictures of apples vs oranges. In contrast, in-house data operations teammates act as experts who define labeling instructions, inspect the offshore team’s work, and decide how to handle difficult or ambiguous scenarios. In-house teams are best suited for jobs that require a smaller quantity of high quality work with relatively low turnaround time. Offshore teams are suited for large volumes of simpler jobs, tasks where quantity matters more than quality, or situations where labeling throughput is more important than latency.
In the late 90s and early 2000s, machine learning found applications at many web companies. Models could perform tasks like recommendation, ranking, and forecasting much better than hand-tuned algorithms.
However, these ML models were accurate and scalable because they were able to learn from vast amounts of data that were essentially generated for free. Recommendation and ranking models could train on logged data of what users actually clicked on while browsing. Forecasting models could use past data to predict the future and then verify the accuracy of those predictions as time progressed.
In other words, the data was essentially infinite and automatically generated. As long as that data was clean and kept flowing, you could just work on the model. Most of the effort to improve these model pipelines consisted of scaling data pipelines and ensuring data consistency. Improvements to model accuracy came primarily through feature engineering and hyperparameter tuning. All things that can be done by data and ML engineers!
The advent of deep learning introduced powerful models that perform well on tasks involving unstructured data like imagery, audio, and text. While some large companies can still get massive amounts of data for free in these domains, everyone else needs humans to manually label data before models can be trained on it, which is slow and expensive.
As a result, organizations collect and log more data than they can possibly annotate. It takes a lot of effort to determine what data to label, label large quantities of data quickly, and then ensure that the data and labels are clean and accurate. Rather than being a prediction / forecasting problem like previous models, deep learning applications look more like automation problems.
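One common way to decide what to label, sketched here as one approach among several, is uncertainty sampling: send the unlabeled examples the current model is least confident about to annotators first. The prediction scores below are hypothetical stand-ins for a real inference run:

```python
def select_for_labeling(predictions, budget):
    """Pick the `budget` unlabeled examples with the lowest top-class confidence.

    predictions: dict mapping example id -> max softmax probability (assumed to
    come from running the current model over the unlabeled pool).
    """
    ranked = sorted(predictions.items(), key=lambda kv: kv[1])  # least confident first
    return [example_id for example_id, _ in ranked[:budget]]

scores = {"a": 0.99, "b": 0.51, "c": 0.87, "d": 0.62}
queue = select_for_labeling(scores, budget=2)  # -> ["b", "d"]
```

The appeal of this design is that the labeling budget is spent where the model is most likely to be wrong, rather than on examples it already handles confidently.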
There’s still effort needed to scale your pipeline to ever larger models and datasets. On the modeling side, however, it’s pretty easy to pull a state-of-the-art pretrained model off the shelf, fine-tune it on your own dataset, and get good performance. Model performance improvements no longer tend to come from feature engineering or hyperparameter tuning. Most gains come from fixing badly labeled data or from collecting more examples of challenging scenarios the model struggles with. This is work that data operations teams can do very well! Andrew Ng describes this paradigm as data-centric AI, as opposed to model-centric AI. So instead of data and ML engineers being the center of attention, data operations managers now have the most leverage to improve their ML pipelines through proper curation of the data.
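To make the fine-tuning step concrete: in the simplest case it amounts to training a small new head on top of frozen pretrained features. The sketch below uses toy random stand-in features and a hand-rolled logistic-regression head, rather than a real pretrained backbone, purely to show the shape of the workflow:

```python
import math

def train_head(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression head on fixed (frozen) feature vectors.

    In a real pipeline, `features` would be embeddings produced by an
    off-the-shelf pretrained model; here they are toy stand-ins.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy "frozen features": separable along the first dimension.
feats = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
labs = [1, 1, 0, 0]
w, b = train_head(feats, labs)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in feats]
```

Because the backbone stays frozen, the whole loop is cheap, and nearly all the leverage lies in the quality of the labeled examples fed into it, which is exactly the data-centric point.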
Setting Up Data Operations For Success
Collaboration, Not Replacement
Modern deep learning applications are used to automate existing workflows. A team will want to use an ML model to do part of the job that’s already being done by a human operations workforce. However, in nuanced or high-stakes tasks (like reading MRI scans to detect cancer), it’s extremely difficult to build an ML model that you can trust to be as accurate as a human expert.
As a result, many successful deep learning applications combine data operations teams working collaboratively with the ML model. To explain this, let’s think about what you do when you bring on a new coworker. When a new human coworker starts on a data operations team, you give them some basic instructions on how to do their job, let them do their job on some easy scenarios, and then check their work and give them feedback on how to do their job better. As they get better at their job, they can be trusted to work on harder tasks or given more responsibility over new projects and eventually training new coworkers.
Working with an ML model is like bringing on an “AI coworker.” When your new AI coworker starts on your team, you assemble a training dataset to give the model (basic instructions), deploy an initial model in a constrained scope (starting off on some easy scenarios), then analyze the model’s failure cases and add more data to its training set in order to improve its performance (corrective feedback leading to improvement).
The advantage of this collaborative flow is that the ML model can quickly handle easy cases much faster than a human operations team while surfacing difficult cases to human experts for review. The human expert can double check the model’s output to gain trust in its performance and work on correcting the model’s failure cases. As the model gets better at certain tasks, the human expert can reduce their involvement in those tasks and work on expanding the model’s functionality to handle more complex situations.
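The easy-case/hard-case split above is often implemented with a simple confidence threshold: predictions the model is sure about go straight through, and everything else is queued for human review. A minimal sketch, where the threshold value is an arbitrary assumption that would be tuned per task:

```python
REVIEW_THRESHOLD = 0.9  # assumed value; tuned per task and risk tolerance

def route(prediction, confidence):
    """Auto-accept confident predictions; route the rest to a human expert."""
    if confidence >= REVIEW_THRESHOLD:
        return ("auto", prediction)
    return ("human_review", prediction)

decision = route("cat", 0.97)   # confident: handled automatically
escalated = route("dog", 0.55)  # uncertain: surfaced to the data ops expert
```

Lowering the threshold hands more work to the model; raising it hands more work to the humans. Tuning that dial as trust in the model grows is precisely the gradual hand-off described above.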
If teams make this setup work, they now have an AI coworker that works much faster than human operations teams, doesn’t quit or get tired, and most importantly, can accumulate more knowledge (through massive and diverse training datasets) than any single human operations team member could ever hold in their head. All with significantly less effort required from the data operations team!
The Tooling Gap
To communicate with human coworkers, data operations teams can use training materials like documents, slideshows, or direct conversation through a shared human language. In contrast, it’s much harder to communicate with an ML model. Humans can’t talk to a model in natural language! As a result, data operations teams often don’t have a lot of visibility into what the model is doing, where it’s doing well / badly, and how to improve its performance. Imagine if you didn’t have Slack and you were trying to communicate to a coworker using smoke signals!
Much of what we know as “MLOps” is around tooling for setting up data pipelines, versioning models, and automatically orchestrating model retrains. These work very well for moving, crunching, and tracking data programmatically. They also work well for training models quickly or for serving model inferences efficiently at scale.
But the “MLOps” category doesn’t cater to data operations teams. It caters to data engineers! Data operations teams have neither the time nor the engineering background to write custom code to get their job done. Without good tooling, they must either rely on engineers to accomplish basic data curation tasks or fall back on general purpose tooling that is ill-suited to their use case and painfully inefficient. I’ve seen teams curate their datasets using spreadsheets and Mac Preview!
Data operations teams need no-code tools that allow them to visualize and manipulate ML datasets. This introspection tooling would help them find patterns of model failures, elucidate why the model is making those mistakes, and then help them edit their datasets in order to fix those mistakes. In this way, they can operate efficiently without needing daily help from the engineering team, allowing them to iterate more quickly and ship better models faster.
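Under the hood, the introspection described here largely boils down to slicing evaluation results by metadata and looking for clusters of failures. A minimal sketch (the record fields and tags are hypothetical) of what such tooling computes:

```python
from collections import Counter

# Hypothetical evaluation records: each has a predicted label, a true label,
# and metadata tags describing the scenario (e.g. lighting, region, sensor).
results = [
    {"pred": "glass",   "true": "plastic", "tags": ["low_light"]},
    {"pred": "plastic", "true": "plastic", "tags": ["daylight"]},
    {"pred": "glass",   "true": "plastic", "tags": ["low_light"]},
    {"pred": "metal",   "true": "metal",   "tags": ["low_light"]},
]

def failure_slices(results):
    """Count model errors per metadata tag to surface failure patterns."""
    errors = Counter()
    for r in results:
        if r["pred"] != r["true"]:
            errors.update(r["tags"])
    return errors.most_common()

slices = failure_slices(results)  # low_light errors dominate this toy set
```

A no-code tool wraps exactly this kind of aggregation in a visual interface, so a domain expert can go from “the model fails in low light” to “collect and label more low-light examples” without writing a line of code.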
Data Operations In The Organization
Although data operations teams are incredibly important and have enormous leverage over system performance, they are often treated as second-class citizens within the ML team. They are seen as grunt labor, a necessary evil for producing enough data to train models on, a workforce that will eventually be replaced once the models work well enough. They’re given quotas on how much data to produce without much visibility into why the data is needed or what impact it has on system performance. And when they talk to customers, they’re held accountable for model performance, even though they depend on other teams for the basic infrastructure needed to improve the model.
This situation arises because the first members of an ML team are usually engineers hacking together models on open-source datasets. They then bring on data operations teammates to “help out” by reducing the need for engineers to label data or manage labeling teams. This casts data operations in a supporting role from the beginning.
It’s often more effective to start the ML team around the data operations team and have engineers support their workflows rather than the other way around! Data operations teams are often most aware of the problem domain, customer desires, and business considerations. The engineering team’s job should be to automate the drudgery in the data operations workflow without sacrificing quality for the end customer. Thus, machine learning is not as much a replacement for the data operations team as much as it is a tool to make their lives easier!
The mass deployment of AI will not lead to the complete removal of human intervention. Instead, it will allow humans and machines to work together to produce results that would not have been possible with only one or the other.
Data operations teams are ultimately the ones who are doing day-to-day work with the models and the data. They are the bridge between customer / user needs and the machine learning system, and play a pivotal role in the daily improvement of models through the constant curation of training data.
However, data operations teams are often not set up for success. Due to poor tooling, they are reliant on engineers to accomplish basic tasks in their workflow. Because they are not ML experts, they are treated as second-class citizens of the ML team.
But when armed with the right workflow, the right tools, and the right organizational setup, data operations teams have incredible power to improve model performance. They are the make-or-break factor in whether an ML deployment is successful or not. I think it’s fair to say that data operations teams are the unsung heroes of ML!