Introducing Fine-Tuned Embedding Generation In Aquarium

TLDR

Aquarium now generates high-quality embeddings for customers. Upload a labeled dataset and we will handle the infrastructure for generating, indexing, and querying embeddings that are fine-tuned for your specific data domain.

We are rolling out this functionality to our customers and are onboarding beta testers. If you want to try it, let us know!

The Problem

The best way to improve a machine learning model is to improve the data it trains on. There are a number of common workflows that ML teams can run through to improve their data: for example, figuring out which labels are erroneous and correcting them, or collecting and labeling the data that leads to the most model performance improvement.

Neural network embeddings are an incredibly useful tool for improving the quality of your machine learning datasets. They help with things like finding patterns of similarity and dissimilarity in your dataset and finding rare examples in large pools of unlabeled data. When embeddings work well, you can get far more model performance improvement with much less human effort. For a longer-form explanation, we’ve written extensively on embeddings in this article.

However, we’ve seen that it’s challenging for many teams to generate these embeddings themselves. Generating embeddings requires running a model on large sets of data to extract the embeddings, then indexing these embeddings for visualization and querying. When done incorrectly, this can be very expensive.
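
As a rough illustration of the generate-then-index step described above, here is a minimal sketch in plain Python. The names `build_embedding_index`, `embed_batch`, and `toy_embed_batch` are hypothetical and are not part of Aquarium’s client; `toy_embed_batch` is only a stand-in for a real model’s forward pass.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_embedding_index(item_ids, embed_batch, batch_size=2):
    """Run a (hypothetical) embedding function over items in batches.

    Returns a dict mapping each item id to its embedding vector, which is
    the simplest possible "index" for later visualization or querying.
    """
    index = {}
    for batch in batched(item_ids, batch_size):
        vectors = embed_batch(batch)  # one model call per batch
        for item_id, vector in zip(batch, vectors):
            index[item_id] = vector
    return index


def toy_embed_batch(batch):
    """Assumption: a stand-in for real model inference, not a real embedding."""
    return [[float(len(name)), float(name.count("a"))] for name in batch]


index = build_embedding_index(["cat", "banana", "dog"], toy_embed_batch)
print(index)
```

In a real system the per-batch model call dominates the cost, which is why batch size and hardware choice matter so much at scale.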

Teams like Pinterest spend a lot of time trying to optimize the quality of their embeddings. For each query image in the leftmost column, the first row visualizes their old, inferior embeddings while the second row visualizes their new, higher quality embeddings with improved search relevance.

It also turns out that it’s quite hard to generate high-quality embeddings that capture the type of similarity that an ML practitioner cares about. For example, embeddings generated from a model trained on ImageNet will not work well on medical x-ray images, which look very different from ImageNet photos. While the ImageNet embeddings may capture aesthetic similarities (these x-rays have a similarly colored background), they will not capture the similarity that a practitioner cares about (these x-rays are all of broken fibulas).

While some of our customers have been able to overcome some of these issues, a lot of teams struggle to build the infrastructure to generate and index high-quality embeddings at a cost lower than employing an operations team. This makes it difficult for them to get the fullest value out of Aquarium.

The Solution

After hearing from multiple customers about their woes with getting embedding infrastructure set up, we decided to handle this for our customers to make it easier for them to get started. However, our customers want to buy this functionality precisely because they’d otherwise have to build a lot of stuff to get an end-to-end embedding workflow! So we had to build a lot of stuff.

Embedding Model Training

The key to any functional embedding workflow is starting with high quality embeddings that capture the types of similarity that are important to an ML team. With poor embeddings, our similarity search results and embedding visualizations are fairly unhelpful.

While foundation models / pretrained models tend to perform well on data that looks like internet imagery, they perform poorly on specialized domains. As a result, we need to fine-tune these pretrained models on customer data to generate high-quality embeddings for their domain.

CLIP uses contrastive learning to train models that generate embeddings on text and imagery.

How do we do this fine-tuning well? It turns out that there is a lot of literature to draw on. For example, CLIP (which is a key part of DALL-E) uses contrastive pre-training on a large corpus of paired text and image data to train embedding generation models (referred to in the diagram above as encoders). There are also techniques that train on text and imagery alone and do not require labels. However, our customers’ data tends to come prelabeled, so our fine-tuning can improve the quality of the final embeddings by using these labels during the training process.
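
To make the idea of using labels during contrastive training concrete, here is a minimal sketch of a supervised contrastive loss in plain Python. This is one published way to fold labels into a contrastive objective, and it is an assumption on our part for illustration: it is not a description of Aquarium’s actual training code. The function names and the toy data are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average, over anchors, of -log softmax similarity to same-label items.

    The loss is low when items that share a label are close together and
    far from items with other labels. Anchors with no positives are skipped.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other item.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


labels = ["fracture", "fracture", "healthy", "healthy"]
clustered = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]  # labels agree with geometry
mixed = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]      # labels disagree with geometry
print(supervised_contrastive_loss(clustered, labels))
print(supervised_contrastive_loss(mixed, labels))
```

The first loss is smaller than the second: embeddings that already group same-label items are rewarded, which is the signal a fine-tuning step would push the encoder toward.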

Embedding UI Workflows

Aquarium’s similarity search interface allows users to provide feedback on what results are relevant or not relevant, improving the quality of subsequent searches. After a few iterations of this, almost all search results returned are relevant!

Equally important to having good embeddings is having good workflows to use them. Our customers are primarily interested in using embeddings to improve their supervised learning models: using embedding visualizations to find patterns in their model errors, and using embedding similarity search to find examples of rare data that they can then label. However, these workflows are only valuable with an intuitive UI that can be used by someone who is not necessarily an ML expert.

We can start with our similarity search functionality. Aquarium makes it easy for a user to create buckets of query data they would like to search with. However, once a user runs a similarity search with those queries, not all of the results may be what the user wants. For example, a user searches with a small set of query images containing white-feathered birds with blue feet, but gets back results of similar white-feathered birds with red feet.

Aquarium lets users mark search results as “relevant” or “not relevant” to the data that they are trying to collect. Aquarium then trains simple binary classifiers on this feedback to drastically improve the quality of search results. We’ve seen instances where search results go from 50% relevant to 90+% relevant after only a few minutes of feedback.
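
The feedback loop above can be sketched as a small logistic regression trained on the embeddings a user has marked, then used to rerank candidates. This is a minimal illustration under our own assumptions: the text only says “simple binary classifiers,” so the choice of logistic regression, the function names, and the toy 2-D embeddings are all hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))


def train_relevance_classifier(embeddings, is_relevant, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(embeddings[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, is_relevant):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of log loss with respect to the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def relevance_score(weights, bias, embedding):
    """Predicted probability that an embedding is relevant."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, embedding)) + bias)


def rerank(weights, bias, candidates):
    """Sort (name, embedding) pairs from most to least likely relevant."""
    return sorted(
        candidates,
        key=lambda item: relevance_score(weights, bias, item[1]),
        reverse=True,
    )


# Toy feedback: 1 = "relevant" (blue feet), 0 = "not relevant" (red feet).
feedback_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
feedback_labels = [1, 1, 0, 0]
weights, bias = train_relevance_classifier(feedback_embeddings, feedback_labels)

candidates = [("red_feet_bird", [0.15, 0.85]), ("blue_feet_bird", [0.85, 0.15])]
print([name for name, _ in rerank(weights, bias, candidates)])
```

A linear model like this is cheap enough to retrain after every round of feedback, which is what keeps the loop interactive.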

Large Scale Embedding Processing

It’s not enough for the embedding generation and UI workflow functionality to be logically correct. It’s also important to scale them to extremely large pools of data in a fast and cost-effective manner. Running a good embedding search on a small set of data may not produce enough examples of rare scenarios to train a model on. If embedding inference, processing, and querying are too expensive, then it can be cheaper to use operations teams to do searches “manually.”

Aquarium scales to tens of millions of images for embedding generation and processing. While some of this relies on prudent usage of Google Cloud Platform (GCP) managed services, we also use some notable technologies that are less well known.

We use GCP’s Vertex AI to efficiently scale inference workloads across large datasets. This is a significant advantage of using GCP over other cloud providers, especially since GCP offers TPU (tensor processing unit) inference, which we have found to be significantly more cost-effective than running inference on CPU or GPU.

For similarity searches, it’s important to quickly run approximate nearest neighbor (ANN) searches on embedding vectors on large datasets as users interact with the Aquarium UI. While there exist a lot of open source ANN libraries, they all involve engineering work to deploy and scale on cloud infrastructure. As a result, we use Pinecone to manage vector indexing and queries for us.
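
For reference, the exact search that ANN libraries approximate can be sketched in a few lines of plain Python. This brute-force version compares the query against every stored vector, which is why it does not scale to tens of millions of items. It is not Pinecone’s API; `cosine_similarity`, `top_k_similar`, and the toy index are hypothetical names for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query, index, k=3):
    """Return the ids of the k stored vectors most similar to the query.

    This scans the whole index (O(n) per query); ANN indexes trade a little
    accuracy to avoid that full scan.
    """
    scored = (
        (cosine_similarity(query, vector), item_id)
        for item_id, vector in index.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


toy_index = {
    "cactus_1": [1.0, 0.0],
    "cactus_2": [0.9, 0.1],
    "car": [0.0, 1.0],
}
print(top_k_similar([1.0, 0.0], toy_index, k=2))  # ['cactus_1', 'cactus_2']
```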

Why This Matters

Waymo uses embedding similarity search to find rare examples to add to their training datasets. For example: cactuses!

Without embedding technology, ML teams need to resort to ineffective or expensive means to find the right data to add to their datasets. Most often teams will simply hire a pool of offshore labelers / operations personnel to manually scroll through images one-by-one to find rare examples. While this can be effective, it is often expensive and time consuming. On the other hand, a team can assign an engineer to manually write queries on metadata in SQL or in a Jupyter notebook. When it works, this can be very efficient, but it’s often difficult or impossible to query for certain traits (how do you write a SQL query to find something “star-shaped?”) and takes up engineering time that is already in limited supply.

Embedding technology provides a fast and efficient way to find patterns in model failures and to search through large datasets for the specific datapoints that will most improve model performance. This technology has been used for a while in consumer search / recommender systems, and ML teams at companies like Waymo already leverage it for collecting data to label.

However, there haven’t been many openly usable tools that handle the end-to-end lifecycle of embedding generation, processing, and querying for machine learning use cases. Yet this technology can offer massive speedups to the machine learning development cycle. In particular, one of our test customers used to spend hundreds of hours of operational time, spread over multiple days across tens of offshore operations personnel, searching for rare data that they wanted to improve their model performance on. With this flow, they can sift through their unlabeled data to find rare edge cases in ~30 minutes and get them sent to labeling, so Aquarium offered a 1000x+ improvement to productivity!

Try It For Yourself!

So, excited to give it a shot? We’ve worked closely with a select group of alpha customers to refine this workflow and we’re currently onboarding beta users to provide further feedback on the quality of our embedding infrastructure. We will be spending a lot of time and attention on these beta users to implement their feedback at no additional cost.

For these beta users, we are looking for customers who:

Have a set of labeled data that they can integrate with Aquarium. We have a Python client API with accompanying docs and a solutions engineer on hand who can help make this process as smooth as possible!
Are interested in doing similarity searches for data curation use cases, such as searching through unlabeled datasets for rare examples of data to label or discovering what data their models predominantly fail on.
Are comfortable with giving Aquarium read access to run training and inference with embedding models on their datasets. We will never use models trained on one customer’s data to benefit another customer, and we are SOC 2 Type 2 certified to ensure that all uploaded data stays secure. In the future, we will offer an on-prem version of this flow: contact us if you are interested in talking about this!

With all of that said: please reach out to us here and we can talk about getting you better embeddings with way less effort!