So you’re doing an ML project! Maybe you want to build an object detection system for a robotics application or you want to add a recommender system to your webapp.
You’ll need a team to build and improve this ML system. In the beginning, this can be a single (very stressed) engineer hacking together an MVP, but it can evolve into an entire department with highly specialized teams and hundreds of people. At each stage of developing a model pipeline, you will encounter different problems that require different team structures to overcome.
Leaders (Engineering Managers, Tech Leads, Product Managers, etc.) have to think about how to create the right team structure to make ML projects successful. Here are some lessons drawn from our own experience of building and scaling ML teams!
Building The MVP (0–2 People)
A cabal of managers has decided that this “machine learning” thing is quite interesting and merits some investigation. They are now investing time and human resources into starting an ML project. The goal is to get an MVP up and running and then evaluate whether to invest more resources into this domain.
In the beginning, one or two people will be assigned to get everything up and running, and they will have a few days to a few weeks to do it. Sometimes these people are seasoned ML engineers with a PhD or years of industry experience. More likely, they’re fresh grads from a master’s program or normal software engineers with hobbyist-level ML knowledge.
Keep It Simple!
Some engineers will want to build their own models from scratch at this stage, because that’s what machine learning is all about, right? No. This is the time to do the simplest thing possible, see how it does, and then adjust from there.
It’s almost always better to leverage what has already been built. Pull a pretrained model from tensorflow/models and run it on your data. If you have more time, fine-tune an off-the-shelf model on open source data similar to your domain. Even better, use a service like Google AutoML that handles the model setup and training for you.
Don’t try to roll something from scratch unless you absolutely need to! The collective field of ML researchers, open source developers, and tooling companies has likely built better products than anything you could throw together under time pressure.
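As a concrete illustration of “leverage what’s already built,” here’s a minimal sketch of running a pretrained classifier on your own image instead of training anything. It assumes torchvision is installed; the model choice, the `example.jpg` path, and the `top_k` helper are all illustrative, not a recommendation of any particular stack.

```python
# Sketch: run a pretrained classifier on a single image rather than
# training from scratch. torchvision and the file "example.jpg" are
# assumptions for illustration.

def top_k(scores, k=5):
    """Return indices of the k highest scores (pure Python helper)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def classify_image(path="example.jpg"):
    import torch
    from PIL import Image
    from torchvision import models

    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights)
    model.eval()
    preprocess = weights.transforms()  # the matching input preprocessing
    batch = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)[0].tolist()
    return top_k(logits)  # indices of the most likely ImageNet classes

# Call classify_image() after installing torch/torchvision; it downloads
# ResNet-50 weights on first run.
```

A few lines like this get you a working baseline in an afternoon, which is exactly the kind of result an MVP stage needs.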
Do basic debugging checks so you don’t waste time on glaring bugs. It always helps to visualize your data, labels, and model inferences. Plotting your model’s loss and inspecting accuracy metrics is useful as well!
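The sanity checks above don’t need any ML tooling at all. Here’s a pure-Python sketch of two of the simplest ones, label distribution and accuracy, using hypothetical `labels` and `preds` lists standing in for your real data:

```python
# Basic sanity checks before trusting a model: is the label
# distribution what you expect, and how often do predictions match?
from collections import Counter

def label_distribution(labels):
    """Spot classes that are missing or wildly imbalanced."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def accuracy(labels, preds):
    """Fraction of predictions that match ground-truth labels."""
    assert len(labels) == len(preds), "label/prediction count mismatch"
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

labels = ["cat", "dog", "cat", "cat"]
preds = ["cat", "dog", "dog", "cat"]
print(label_distribution(labels))  # {'cat': 0.75, 'dog': 0.25}
print(accuracy(labels, preds))     # 0.75
```

Checks this cheap catch embarrassing bugs (shuffled labels, off-by-one class indices) before you spend a week tuning hyperparameters.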
Building The Pipeline (2–10 People)
The MVP has shown some promise: a cool demo video, metrics showing it’s better than the status quo, and so on. The bosses have decided to invest more resources into this ML project to bring it to production.
Now it’s a race to build out missing infrastructure, integrate it into upstream and downstream systems, and show some tangible impact to the product and the business. There’s a lot of stuff that’s required to make this work!
- Labeling tooling for dispatching data to operations teams to visualize, label, and QA (if your task requires data labeling).
- Dataset management systems for data and label storage, visualization, and version control.
- Data pipelines to preprocess data, extract features, etc. across large datasets.
- Training infrastructure to train models quickly on large datasets.
- Testing and validation infrastructure to measure model performance.
- Deployment optimization to run model inferences quickly in production.
- Monitoring to make sure the model is functioning correctly in production.
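To make the shape of this pipeline concrete, here’s a toy sketch of how the stages might hang together. Every function is a hypothetical stub standing in for a real component (data pipeline, training, validation, deployment, monitoring); the “model” is deliberately trivial.

```python
# Toy end-to-end pipeline skeleton. Each stub stands in for a real
# component; none of these names come from an actual framework.

def preprocess(raw):          # data pipeline: normalize raw inputs
    return [x.lower() for x in raw]

def train(examples):          # training infra: simplest possible "model"
    return max(set(examples), key=examples.count)  # most frequent item

def validate(model, holdout): # testing/validation: score on held-out data
    return sum(1 for x in holdout if x == model) / len(holdout)

def deploy(model):            # deployment: ship the model to production
    print(f"deployed model: {model!r}")

def run_pipeline(raw, holdout, min_score=0.5):
    examples = preprocess(raw)
    model = train(examples)
    score = validate(model, preprocess(holdout))
    if score >= min_score:    # gate: only ship models that pass validation
        deploy(model)
    return model, score

model, score = run_pipeline(["Cat", "cat", "dog"], ["cat", "cat"])
```

The real versions of these boxes are each weeks-to-months of engineering, which is why the team needs to grow at this stage.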
The headcount at this stage is still small enough to satisfy the Amazon “two pizza rule.” Usually the entire team can be managed by a single manager/lead without too much trouble. However, there are still some best practices for building the right team composition.
Scale The Team
At this stage, there’s more work to do than people around to do it. This is where having additional people on the team can be really useful.
The constraining resource in building out the pipeline is invariably engineering bandwidth. This is the time when adding teammates, both engineers and non-technical members who can take work off engineers’ plates, adds a lot of value in shipping better models faster.
Additional generalist engineers build out different parts of the pipeline in parallel. At this team size, engineers have a lot of exposure to different parts of the stack, fixing bugs with the deployment environment one day and making tweaks to the model another, depending on the needs of the team.
In domains that require labeling, ML engineers start off directly supervising labelers in order to build the datasets they train their models on. However, as the size and complexity of labeling operations grow, this quickly becomes a full-time job that can be offloaded to non-engineering personnel. An ML Operations Manager takes over the day-to-day work of managing and improving ML datasets: building QA processes for labeler quality, documenting edge cases for labeler training, managing labeling contractors and providers, and so on, freeing your ML engineers to focus on setting up other parts of the ML pipeline.
As a model is deployed into production, an ML team must figure out the best things to work on and communicate them to stakeholders. An ML Product Manager offers a lot of value by making sure the team is always solving the right problems. They spend their time analyzing the model’s failure cases, collecting feedback from customers, and prioritizing the next pieces of work that will deliver impact to the business.
ML is also pretty confusing, and Product Managers play a crucial role in communicating with other stakeholders. They work with teammates in sales, marketing, and customer success to explain why the model behaves the way it does and what the team is working on improving, and they translate requirements from other teams into work items for engineering.
Keep Deploying New Models
Most teams tend to get into the mindset of, “let’s build something that works alright and then move on to the next biggest problem.” This means a team will train a model, deploy it, and leave it. Six months later, a fire comes up in production that requires the model to be retrained, only for the team to discover that the training pipeline code is now broken and needs additional engineering effort to bring back to life. This is known as “model rot.”
Ideally an ML team can redeploy improved models every week — or if you’re really good, every day! The “proper” way to achieve this cadence is to try and automate as much of the training process as possible so that your model is constantly retraining on new data and redeploying without needing human interaction. However, this often takes a lot of engineering time to set up.
In the meantime, it’s extremely valuable to utilize ML Product and Operations managers as much as possible. They can focus on constantly curating the dataset by analyzing model error cases and organizing targeted data labeling campaigns. Meanwhile, ML engineers can simply retrain their existing model code on new data every week to produce a new model that is better than the previous one. This should be as easy as running a script, eyeball-validating the resulting trained model, and redeploying.
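That “run a script, validate, redeploy” loop can be sketched in a few lines. Everything here is a hypothetical stand-in for your own training and evaluation code; the one idea worth copying is the safety gate that refuses to ship a model that regresses on a fixed validation set.

```python
# Sketch of a weekly retrain-and-redeploy script with a regression
# gate. train_fn and eval_fn are stand-ins for your own code.

def should_deploy(new_score, prev_score, tolerance=0.0):
    """Gate: deploy only if the new model doesn't regress."""
    return new_score >= prev_score - tolerance

def weekly_retrain(train_fn, eval_fn, prev_score):
    model = train_fn()          # retrain existing code on the latest data
    new_score = eval_fn(model)  # automated validation on a fixed set
    if should_deploy(new_score, prev_score):
        print(f"deploying: {new_score:.3f} >= {prev_score:.3f}")
        return model, new_score
    print(f"holding back: {new_score:.3f} < {prev_score:.3f}")
    return None, prev_score

# Hypothetical usage: a new model that beats last week's 0.88 ships.
model, score = weekly_retrain(lambda: "model-v2", lambda m: 0.91, prev_score=0.88)
```

With a gate like this, the weekly retrain really can be “run the script, eyeball the results, redeploy,” and the quality ratchet only moves forward.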
This workflow is described really well in Andrej Karpathy’s talks about the Operation Vacation concept, and it’s extremely valuable because it allows a team to scale model improvement with operations and machine time rather than engineering time!
Use Great Tooling
In theory, engineers can always build ad-hoc tooling and infrastructure to unblock themselves, but it won’t be pretty. It will consume engineering time that is already in short supply, and the final product will likely have a lot of bugs and issues. Most importantly, it will be hard to use, which locks out the ML Product Managers and Operations Managers who depend on that tooling.
Again, always try to utilize great open source or paid tooling rather than building something hacky yourself. When we built the ML stack at Cruise in 2015, deep learning was still relatively new, so we had to build almost all of our tooling ourselves. Nowadays, there’s an incredible ecosystem of paid and open source tooling that makes it significantly faster and easier to get an ML project off the ground. While there are many end-to-end platforms for machine learning these days, they are often mediocre at everything. In our opinion, it’s better to use tools that are really good at what they focus on and then integrate them together.
A few examples:
- Labeling companies like Scale, Labelbox, Dataloop, Hive, and Cloudfactory allow you to buy great labeling infrastructure with a managed operations team.
- Frameworks like Determined AI allow you to do distributed model training without having to roll your own framework or deal with low-level TensorFlow / PyTorch APIs.
- Open source offerings like Apache Beam and Apache Spark (with hosted runtimes on Google Dataflow and Databricks) make it easy to scale data processing across a cluster of machines. Offerings like Roboflow help with more domain-specific ETL operations (like in computer vision).
[Image: Using Aquarium to query for interesting subsets of your datasets without needing to write any code.]
Our company, Aquarium, makes tools that help technical and non-technical ML team members understand and improve their datasets.
In the beginning of a project, Aquarium helps ML engineers visualize their data, labels, and model inferences. This helps catch basic bugs like data quality issues or bugs in their preprocessing code.
As a team gets bigger, ML Product Managers and Operations Managers need great tooling to be effective. Because they can’t code to unblock themselves, they often use general purpose tools like Excel spreadsheets to dig through datasets and analyze model performance, which can lead to exceptionally painful workflows.
Aquarium’s tooling provides a smooth, no-code way to explore datasets and analyze model performance. It makes it easy to analyze failure cases, track them as issues, and resolve them with just a few clicks. Users can also share model results with other team members and collaborate in our web UI!
Thanks for sticking with us so far! Part 2 examines how to go from 10 people to 100 people and scaling a single ML team into an ML org of many teams.