Vector Search Optimization via KMeans, Voronoi Cells and Inverted File Index (aka “Cell-Probing”)

Davide Mauri

April 18th, 20240 0

In a previous article I already mentioned how vectors can be easily stored in Azure SQL already and how to calculate dot product and cosine distance using just T-SQL. In this article I will show how to improve the performance in vector search by using a technique that has many name but is really based on something very well know already: KMeans Clustering.

As KMeans clustering is a compute intensive operation, this project uses SciKit-Learn library to perform the clustering and then stores the results in a SQL DB table. The results are then used to perform ANN search on the vector column in pure T-SQL.

Vector Search Optimization

Given a vector, finding the most similar vector among all those stored in a database is a common problem in many applications. The easiest approach to solve this problem is to use a brute force approach, which is to compute the distance between the query vector and all the vectors stored in the database. This is a good approach when the number of vectors is not extremely big, and dimensionality of vectors is not very high, as it guarantees perfect recall, meaning that all relevant items that should be returned are actually returned.

Unfortunately, this approach is not scalable as the number of vectors stored in the database increases, so you may want to exchange a perfect recall for much better performances. This is where approximate nearest neighbor (ANN) search comes into play. ANN search algorithms are able to return the most similar vectors to the query vector, but they do not guarantee perfect recall. In other words, they may return less vectors than all the relevant to the query vector, but they are much faster than the brute force approach.

Voronoi Cells

To speed up the search, it is possible to split the vectors into groups, making sure the create groups so that all vectors that are someone similar to each other are put in the same group. This is the idea behind Voronoi cells.

The idea is to create a set of centroids (i.e. vectors) and then assign each vector to the closest centroid. This way, all vectors that are similar to each other will be assigned to the same centroid. This is a very fast operation, as it is just a matter of computing the distance between the vector and all the centroids and then assign the vector to the closest centroid.

Inverted File Index & Cell-Probing

Once all vectors are assigned to a centroid, it is possible to create an inverted file index that maps each centroid to the list of vectors assigned to it. This way, when a query vector is given, it is possible to find the closest centroid and then return all the vectors assigned to it. This is much faster than computing the distance between the query vector and all the vectors stored in the database.

This project uses KMeans clustering to create the centroids and then create the inverted file index. KMeans clustering is a very popular clustering algorithm that is able to create a given number of clusters (i.e. centroids) by iteratively moving the centroids to the center of the vectors assigned to them. The number of clusters is a parameter that can be tuned to trade off recall and performances. The more clusters are created, the better the recall, but the slower the search. The less clusters are created, the worse the recall, but the faster the search.

Architecture

The architecture of the project is very simple as it is composed of a single container that exposes a REST API to build and rebuild the index and to search for similar vectors. The container is deployed to Azure Container Apps and uses Azure SQL DB to store the vectors and the clusters.

The idea is that compute intensive operations, like calculating KMeans, can be offloaded to dedicated container that is easy to deploy, quick to start and offers serverless scaling for the best performance/cost ratio.

Trained data is loaded from Azure SQL

Once the container is running it is completely independent from the database and can do its work without affecting database performances at all. Even better, if more scalability is needed, data can be partitioned across multiple container instances to achieve parallelism

Once data is processed the Clusters are saved back to Azure SQL

Once the model has been trained, the identified clusters and centroids – and thus the IVF index – are saved back to the SQL DB so that they can be used to perform ANN search on the vector column without the need for the container to remain active. If fact, the container can be stopped completely as SQL DB is completely autonomous now.

Cell probing can be done completely via T-SQL

The data stored back into SQL DB and a set of wrapper functions are also created so that you can then just use the find_similar function to perform KMeans search specifying also how many cells to probe in order to find the best performance/recall ratio for your needs.

The complete example is available as usual on GitHub

https://github.com/Azure-Samples/azure-sql-db-vectors-kmeans

The repo uses Dev Containers so that you don’t need to worry about anything. As long as you have Docker installed on your machine, it will take care of everything, and it will run in a completely isolated environment.

Equally easy is to deploy the project on Azure thanks to AZD integration.

Nice, but why?

Using an external container to do the heavy processing is something I truly believe is a great choice in many scenarios. No matter if you have a database that supports vectors or not (and stay tuned for more on this soon! 😁) there will be cases in which you don’t want to have your database CPUs all at 100% for minutes or even hours to build a vector index. In those cases, having a dedicated compute node is the best option as is greatly balances resource usage, costs, performance and result quality.

Performances

As visible in the following gif, the performance improvement is quite substantial. The gif shows the execution of the `find_similar` function with different number of probed clusters. The query on the left is scanning the whole table (and therefore doing an exact search) while the one on the right is probing only 4 clusters, returning approximate results (with a good recall), in 1/6th of the original time. Of course results may vary and are strongly related to number of cells and the content of the cells themselves. Using a different algorithm to calculate the KMeans clusters (there are many in the Scikit-Learn package) can product better result for you, depending on your data.