Peter Gao
December 10, 2020
8 min

Scaling An ML Team (10–100+ People)

This is the second part of our series on scaling ML teams! Read on if you’re a leader in an ML team that is starting to grow beyond 10 people and scale from one team to a collection of teams working on the ML pipeline. If you’re a smaller team or you haven’t seen our first post on scaling from 0–10 people, you may want to read that first.

Building An Org

If your ML teams are misaligned, your org can start to feel a lot like Westeros.

It’s been a frantic few years. The ML pipelines have become more mature, more models have been deployed in new areas, and total ML-related headcount has grown quite a lot. What used to be a single team has now split into multiple specialized teams. Now you have to think about how to structure multiple teams into an effective ML org to achieve the right outcomes for your business.

The number of models running in production has increased, necessitating additional team members to help maintain and improve these new pipelines. In super successful ML orgs, there is additional demand to apply ML to new domains or problems, necessitating R&D work to get new pipelines running.

For mature pipelines, all the low hanging fruit in model improvement has been picked — the days of simple changes yielding 20% improvements are long gone. At this stage, an ML model needs to be generating significant business value, since it takes a lot of effort to achieve smaller gains in performance. Think about pipelines like Facebook's News Feed ranking model, where 1% improvements in model performance can lead to tens of millions of dollars in additional revenue. In these situations, one can justify hiring teams of engineers, scientists, and researchers to squeeze out diminishing improvements to model performance.

To satisfy this increase in demand for ML products and services, you must figure out how to effectively organize and scale your growing org.

Enter The Matrix (Org Structure)

An example of what a matrix ML organization can look like.

Having more models in production means that there tends to be common infrastructure shared across pipelines, while certain parts of the pipeline require specialized skillsets that transfer across different models. As you split individual generalist teams into more specialized teams, it's natural to organize the specialized teams around these two factors, though due to Conway's law, it's sometimes hard to tell which way the causality goes.

Team organization should follow a "matrix" structure as a team scales. This allows specialized teams to focus on deeper improvements to the systems under their ownership, while allowing Product Managers to have ownership and accountability over individual projects / model pipelines.

Teams form the “vertical” columns of the matrix organization. Team setups can include tens to hundreds of people depending on the maturity of the business, and tend to include the following:

  • Data Operations teams to manage the labeling and QA process for ML pipelines that require labeling. This team also monitors and reviews model performance in production to provide a good customer experience in scenarios where the model has poor performance. This team consists of operations managers who oversee contractor operations teams, both in-country and offshore. They spend time working on training materials and coordinating with modeling teams on providing instructions to labelers on how to handle edge cases.
  • Tooling teams to build full-stack webapp tools for dataset visualization, model analysis, and data labeling. This team contains full-stack engineers and works closely with Data Operations managers who manage the labeling workforces that use their tooling, and with the ML modelers and Product Managers who introspect into their models' performance.
  • Infrastructure teams to manage ML backend infrastructure. This can include ETL pipelines, feature stores, automated and distributed model training infrastructure to train models quickly, and testing / validation steps to make sure nobody ships regressed models. This team is composed of data engineers with distributed computing experience.
  • Modeling teams to train models for new tasks, investigate changes to improve model performance with deep changes to model code, and sometimes publish their findings to conferences. This team usually hires PhDs or other so-called “model whisperers” who are experts in machine learning theory and techniques.
  • Deployment teams focused on running models in production. They make sure models run quickly and efficiently in production environments and they build monitoring to make sure the model is functioning correctly. Models deployed on edge hardware require careful embedded optimization to run efficiently, so teams can contain performance engineers with low-level C++ expertise, CUDA / GPU knowledge, and experience working with real time systems. In web deployment domains, this team may hire engineers familiar with A/B testing infrastructure and production webapp scaling.
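The testing / validation step mentioned above can be as simple as a gate that compares a candidate model's evaluation metrics against the model currently in production before allowing a release. Here's a minimal sketch of what such a gate might look like; the metric names and the 1% tolerance are illustrative assumptions, not a prescription.

```python
def passes_regression_gate(candidate_metrics, production_metrics,
                           max_relative_drop=0.01):
    """Return True if the candidate model is safe to ship.

    The candidate is rejected if any metric tracked for the production
    model drops by more than `max_relative_drop` (1% by default)
    relative to its production value, or is missing entirely.
    """
    for name, prod_value in production_metrics.items():
        cand_value = candidate_metrics.get(name)
        if cand_value is None:
            return False  # candidate is missing a tracked metric
        if prod_value > 0:
            relative_drop = (prod_value - cand_value) / prod_value
            if relative_drop > max_relative_drop:
                return False  # regression beyond tolerance
    return True


if __name__ == "__main__":
    prod = {"precision": 0.91, "recall": 0.88}
    better = {"precision": 0.92, "recall": 0.88}
    worse = {"precision": 0.85, "recall": 0.88}
    print(passes_regression_gate(better, prod))  # True
    print(passes_regression_gate(worse, prod))   # False
```

In practice this check would run in CI against a held-out evaluation set, and a real gate would likely also track per-slice metrics so that an aggregate improvement can't hide a regression on an important subpopulation.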

In my previous experience, the Data Operations and Modeling teams are the first teams to emerge since they own the data and the model code, the two essential ingredients for making an ML model. Over time, as the amount of data becomes too large to process on a single machine, the Infrastructure groups emerge to optimize the speed and scale of existing processes. The Tooling and Deployment teams emerge at different times depending on the organization and the task. For example, modelers can build ersatz workflows with open source tools that can delay the necessity of building a tooling team, but certain datatypes (like 3D pointclouds) may not be supported out-of-the-box with open source tooling. Deployment optimization may be more critical in embedded environments, but less of an issue in web environments until cost cutting becomes more of a priority.

As teams focus on individual parts of the modeling pipeline, an individual team may not actually have ownership / accountability for specific models, nor may an individual team know what the most impactful work items are for improving a model. The number of components it takes for a model to work is too high, and individual team members tend to focus on the maintenance of their team's components rather than the bigger picture. While this can be remediated with proper communication and training, there are just too many moving pieces in an ML pipeline for everyone to be perfectly in sync.

Tech Leads emerge as engineering teams get larger and require more directed technical leadership. A Principal Engineer can report directly to a Director or VP and provide a long term technical roadmap for an entire org, while Staff Engineers report to team managers and guide more junior engineers in executing more immediate projects. Tech Leads coordinate with other team members to come up with the right metrics and recommend the best techniques to solve the problems at hand.

ML Product Managers rise to prominence by being responsible for the “horizontal” rows in the matrix structure. They can take over the work of analyzing failure cases, thinking about product + business strategy, and pulling resources from individual teams as needed to improve specific models. They can communicate needs and explain impact to critical stakeholders outside of the ML org.

While each ML team builds generalized infrastructure, ML Product Managers work across teams to get things done, taking on the role of systems integrators in setting up and improving models that lead to business impact. With proper tooling, PMs can even manage model pipelines without needing engineering support, which allows the PM to ship better models faster and with fewer resources.

Teams and PMs can own either single models or multiple models at this stage, depending on the relative value of each model to the business and the difficulty in improving each model. In larger orgs, teams can split into even more specialized subteams that focus on certain pieces of infrastructure or certain model types.

Alignment Through Proper Incentives

An ML org must have all of its critical dependencies under the same management to be successful.

While this sounds like an obvious idea, it is extremely common for orgs to violate it by grouping teams based on their skillsets rather than the common ML product that they work on. Due to their vastly different skillsets, it's a common pattern to put ML Infrastructure teams into a broader Infrastructure org that also serves non-ML teams, and then put ML modelers under a separate Research org.

However, ML infrastructure has very different requirements than non-ML infrastructure, and projects that serve the needs of the ML team will often be deprioritized. This is a recipe for failure!

Why does this happen? As specialized teams are built around parts of the ML pipeline, individual teams may not be incentivized to actually improve the quality of the ML models running in production. For example, labeling teams may be incentivized to produce large quantities of labels that aren't super useful, since that team's KPI is measured on the quantity of labeled data produced rather than on the impact of that labeled data on the final model performance. ML research teams may pursue intellectually interesting or publication-worthy work even if it's not necessarily the most impactful, since it's easier for a scientist to write a performance review packet about a new model that they've developed than about un-sexy but impactful optimizations to an existing pipeline.

In these situations, it's important to hold teams accountable to performance measures that are correlated with success for the entire organization, not just the interests of the component teams. When different component teams are placed under separate leadership, misaligned KPIs / incentives are difficult to resolve and quickly bubble into political disputes among VPs when models fail to show significant improvement. It helps tremendously to have all of the ML-focused teams in the same org under the same leadership.

Your Problem Probably Isn’t That Special

As your team’s headcount grows, there is an increasing temptation to build everything from scratch because your requirements are special. Resist this temptation. Before starting a project to research cutting-edge models or build specialized ETL infrastructure, first evaluate options that you can pull from open source or buy from external suppliers! It’s unlikely that your team of 3 modelers can produce a better model than Facebook AI Research, nor is it likely that your requirements are so special that you can’t use Apache Spark.

The cost of employee time tends to be higher than the cost of buying or modifying external options, and orgs should only resort to building their own infrastructure when nothing exists that satisfies their unique requirements, or when alternatives are prohibitively expensive.

It turns out that many in-house projects that may appear easy at first run over schedule and over budget to produce a product that is inferior to an external alternative built by a specialized company / open source group focused on improving that module. A "not invented here" mindset wastes time and money and causes your customer to suffer from a poor user experience as a result!

Aquarium’s tooling makes it really easy to find problems in your dataset and fix these problems so the next time you retrain your model, it just gets better. If you’d like to try Aquarium out for your dataset, let us know! We also host an open Slack community of ML practitioners here!
