AMP Robotics, a pioneer in artificial intelligence and robotics, specializes in developing and deploying advanced technology for the recycling industry. Founded in 2014, the company's mission is to enable a world without waste.
AMP combines robotics with cutting-edge computer vision and machine learning to recognize and sort recycling materials. This technology achieves rates of speed and precision previously unknown to the industry, with the aim of “decoupling the world’s potential from environmental harm.”
AMP’s Machine Learning Data team has access to huge amounts of image data collected directly from their systems deployed in production. One challenge of managing datasets at such a large scale is prioritizing which data to add to their model training and evaluation systems to create the most impact for the business.
The ideal data sampling strategy should allow the team to maximize model performance gains, minimize labeling costs, and stay ahead of emerging demands on the model as the business expands into new domains.
However, finding rare or specific objects in their unlabeled datasets was challenging and time-consuming. The team spent large amounts of time manually searching through existing data for examples, sourcing material to record, training a seed net to search through larger portions of data, and then quality checking the results. Even after all that effort, they sometimes still lacked the volume of objects needed to reach their performance goals.
In practice, AMP needed a system that could:
- Translate model evaluation metrics and customer feedback on deployed models into representative examples in its datasets
- Automatically curate unlabeled data, prioritizing clusters of related imagery to solve targeted model performance problems
- Seamlessly integrate into its other MLOps workflows and handle its massive data scale
AMP chose to partner with Aquarium to solve these data curation challenges, starting with the goal of improving subclass differentiation for a high-impact set of material types in the company’s single-stream recycling datasets. Success here would allow AMP to improve recovery rates for its customers and thereby reduce waste.
Aquarium offers advanced data curation capabilities based on neural net embeddings. These capabilities include:
- Fine-tuned embedding models on any data domain
- Fully-managed infrastructure for high-scale similarity search and indexing
- Web-app based curation workflows for data operations teams, with no code required
To get started, AMP ingested both its training and unlabeled data into Aquarium. Aquarium fine-tuned an embedding model for AMP’s domain and produced image and object level embeddings. Aquarium makes these domain-specific embeddings available both in the app for visual exploration and as an index for high-scale similarity search.
Aquarium’s Collection Campaigns feature allowed AMP to take any image or object in its dataset and curate a set of similar examples from its unlabeled data. One of the first tasks AMP addressed using Collection Campaigns was differentiating clear and lightly colored plastic for certain material types, something that can be difficult even for human labelers.
AMP used Aquarium to search tens of millions of datapoints for examples of each target material, collect the relevant examples, and push that data into its labeling and training pipelines.
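To make the idea of embedding-based collection concrete, here is a minimal sketch of a similarity search. It is not Aquarium's API or implementation: the function names, the item IDs, and the three-dimensional toy embeddings are all made up for illustration, and the brute-force scan below would be replaced by an approximate nearest-neighbor index at the scale of tens of millions of datapoints.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def collect_similar(
    query: Sequence[float],
    unlabeled: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k unlabeled items most similar to the query, best first.

    Hypothetical helper: a brute-force scan over every embedding.
    """
    scored = [
        (item_id, cosine_similarity(query, embedding))
        for item_id, embedding in unlabeled.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made-up values); real embeddings would come from a
# fine-tuned model and have hundreds of dimensions.
unlabeled_embeddings = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.0, 1.0, 0.0],
    "img_003": [0.8, 0.2, 0.1],
    "img_004": [0.0, 0.0, 1.0],
}

# Embedding of one known example of the target material (also made up).
query_embedding = [1.0, 0.0, 0.0]

candidates = collect_similar(query_embedding, unlabeled_embeddings, k=2)
candidate_ids = [item_id for item_id, _ in candidates]
```

In a workflow like the one described here, the returned candidate IDs would then be reviewed and sent on to labeling rather than used directly as training data.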
After re-training its models with the newly collected data, AMP achieved significant performance gains across the target classes while investing less than 20% of the time required by other available curation methods.
“Collection Campaigns have allowed us to much more efficiently search through our data to find the examples we need. One project was 7x faster than previous iterations because using Aquarium saved so much time.”
Claire Parchem, Data Operations Manager
AMP’s Data team collects millions of new images every month, and with Aquarium’s data curation capability, it can now index and search the entire set. Because Aquarium is integrated into AMP’s MLOps process, the AMP Data and ML teams can efficiently and repeatedly generate targeted model performance improvements.
Going forward, AMP plans to expand its use of Aquarium’s fine-tuned embeddings and curation tools to include both single-stream recycling and construction and demolition verticals. This will involve a number of experimental and production initiatives, including subclass discovery and targeted data sampling across both domains.
“Aquarium has allowed us to quickly and effectively ramp up projects that would have otherwise been incredibly difficult and time consuming. Given the success of the feature on the past few projects, we’ve made the tool a part of our regular processes for new labels, along with rare or specific projects. We are excited to have it in our toolbox and look forward to continuing to use it to extract value from our unlabeled datasets.”
Joe Castagneri, Head of AI