Introducing Collection Campaigns

Continual data collection is a core part of a production machine learning system. After a while, randomly adding more data stops helping much, and you begin focusing on specific edge case scenarios that your model struggles with.

Some cases are straightforward. If your ML model struggles with snowy scenes, you can go and collect data from places where it snows, or search for collected data where the timestamp matches up with snowfall. Awesome.

Some cases are… trickier. Say you’re working on dog breed classification, and your model sometimes thinks that floofy, white Shibas are actually Samoyeds. Floofiness isn’t exactly a field you can query in your image database, and you don’t even have fur color tagged. You could sift through images manually. Then you try it, and after several soul-sucking hours in Mac Preview, you find a whopping two examples of sufficiently floofy Shibas. Not awesome.

Left: A floofy, white Shiba Inu. Right: A Samoyed. Source

Aquarium’s Collection Campaigns

These needle-in-a-haystack problems are common struggles when building a machine learning product. We’re excited to announce our Collection Campaign feature, which helps you quickly collect the data you need — without someone manually reviewing everything.

Create, Track, and Resolve

The process begins by identifying and tracking the different edge cases of data that your model struggles with. Aquarium offers many ways to sort through and visualize your data, and all of them allow you to organize instances together into a single issue.

An issue tracking white Shiba Inus that the model struggles with.

Once you’ve created an issue with examples of the data you want to collect more of, you can kick off a Collection Campaign for it. If Aquarium has been fully configured, then you’re all set! Wait a bit and you’ll start seeing similar samples show up next to your issue. Once you have enough data, you can turn off the Collection Campaign and export the collected samples. Label them, retrain, and get a better model.

Say What You Want, Not How to Find It

The earlier example of floofy dogs was tricky because floofiness is hard to describe. You get it, but there isn’t a table with floofiness ratings for each photo you have access to. Aquarium handles these hard-to-describe searches by treating them as similarity-search problems. Instead of constructing a structured query, we treat the search more like reverse image search or song recommendations. You gave us some good examples in the issue, so now we just need to find similar things in the wild.

Similarity search for another scenario: beach scenes.

The core techniques are based on neural network embedding clustering and search, which we cover in more detail here. Because you can provide domain-specific embeddings from your trained ML models, these searches can identify those nuanced distinctions that might only be apparent to domain experts. You can even think of it like training a mini-model for a really specific class. This same approach is used by teams like Waymo’s self-driving perception team, except you can use it even if you aren’t part of Google.
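To make the idea concrete, here is a minimal sketch of embedding similarity search in plain Python. It is not Aquarium's implementation: the function names, the toy three-dimensional embeddings, and the choice of cosine similarity against the closest issue example are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(issue_embeddings, candidates, top_k=3):
    """Rank unlabeled candidates by their best similarity to any issue example."""
    scored = []
    for name, embedding in candidates.items():
        best = max(cosine_similarity(embedding, ex) for ex in issue_embeddings)
        scored.append((name, best))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); real ones would come from your trained model.
issue_examples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
unlabeled = {
    "img_001": [0.85, 0.15, 0.05],
    "img_002": [0.0, 0.1, 0.9],
    "img_003": [0.1, 0.9, 0.0],
}

ranked = rank_by_similarity(issue_examples, unlabeled, top_k=2)
for name, similarity in ranked:
    print(f"{name}: {similarity:.3f}")
```

In this toy data, `img_001` points in nearly the same direction as the issue examples, so it ranks first. At real dataset sizes you would likely swap the linear scan for an approximate nearest-neighbor index.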

Set Up Once, Repeat Forever

The world changes (and your dataset likely isn’t perfect today), so you’ll continue to find scenarios you struggle with. Each new edge case should be treated as a part of regular operations, not as a distraction to everyone in the org.

Our Collection Campaigns are designed to be set up once by the engineering team, then driven by anyone from our web application. On the engineering side, we provide a client library that you can run in your environment, against already stored (but unlabeled) data or even live on an edge device. It synchronizes state with what users requested in the web app, scores whether a data entry is relevant to an existing collection campaign, and provides utilities to send those entries back for review in the application.
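The shape of that client-side loop can be sketched roughly as follows. This is not the Aquarium client library's API: `Campaign`, `score`, `process_entry`, the 0.9 threshold, and the in-memory `matches` list are hypothetical stand-ins for the sync, score, and send-back steps described above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """Hypothetical local view of a campaign that a user enabled in the web app."""
    issue_id: str
    example_embeddings: list
    threshold: float = 0.9   # assumed cutoff; tune per campaign
    active: bool = True      # would be kept in sync with the web app
    matches: list = field(default_factory=list)  # stand-in for "send back for review"


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(campaign, embedding):
    """Relevance of one entry: best cosine similarity to any issue example."""
    return max(_cosine(embedding, ex) for ex in campaign.example_embeddings)


def process_entry(campaigns, entry_id, embedding):
    """Score one entry against every active campaign; return the issue ids it matched."""
    matched = []
    for campaign in campaigns:
        if not campaign.active:
            continue  # campaign was turned off, so skip it
        if score(campaign, embedding) >= campaign.threshold:
            campaign.matches.append(entry_id)
            matched.append(campaign.issue_id)
    return matched


campaigns = [
    Campaign("white-shibas", [[1.0, 0.0]]),
    Campaign("beach-scenes", [[0.0, 1.0]], active=False),
]

first = process_entry(campaigns, "frame_1", [0.99, 0.05])
second = process_entry(campaigns, "frame_2", [0.0, 1.0])
print(first, second)
```

The same function can run over a batch of stored data or be called per frame on an edge device, since it only needs the entry's embedding and the current list of campaigns.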

The Data Engine

Growing your dataset is often the best way to improve your model, but it’s not always enough to just add random data. It’s important to add the right data. Collection Campaigns make it easy to continually collect and label the most valuable data for your model.

If you’d like to try out Collection Campaigns for yourself, let us know and we’ll get you set up!