{"id":3441,"date":"2024-08-01T07:00:40","date_gmt":"2024-08-01T14:00:40","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sql\/?p=3441"},"modified":"2024-07-31T18:32:47","modified_gmt":"2024-08-01T01:32:47","slug":"faiss-and-azure-sql","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sql\/faiss-and-azure-sql\/","title":{"rendered":"Similarity Search with FAISS and Azure SQL"},"content":{"rendered":"<p>In today\u2019s data-driven world, finding similar items within large datasets is a common challenge. Whether it\u2019s recommending products, identifying similar documents, or clustering data points, efficient similarity search is crucial. This blog post explores how to use FAISS (Facebook AI Similarity Search) and Azure SQL to perform similarity searches on Wikipedia movie plots data. It also covers sample code to help you get started. Check out the <a href=\"https:\/\/youtu.be\/FrR3jZE9z8Y\">Data Exposed video<\/a> for a quick overview and follow along with the code sample.<\/p>\n<h2 data-ccp-props=\"{&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559683&quot;:0,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335562764&quot;:2,&quot;335562765&quot;:0.9,&quot;335562766&quot;:4,&quot;335562767&quot;:10,&quot;335562768&quot;:4,&quot;335562769&quot;:0}\"><span class=\"TextRun SCXP209608355 BCX8\" lang=\"EN-US\" xml:lang=\"EN-US\" data-usefontface=\"false\" data-contrast=\"none\"><span class=\"NormalTextRun SCXP209608355 BCX8\">What is Faiss<\/span><\/span><span class=\"EOP SCXP209608355 BCX8\">?<\/span><\/h2>\n<p><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\"><a href=\"https:\/\/github.com\/facebookresearch\/faiss\">FAISS<\/a> (Facebook AI Similarity Search) is a library that allows <\/span><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">developers to quickly search for embeddings of 
multimedia <\/span><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">documents that are similar to each other. It addresses limitations of <\/span><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">traditional query search engines that are optimized for hash-based <\/span><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">searches and provides more scalable similarity search functions.<\/span><\/p>\n<p><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">FAISS was developed by Facebook AI Research (FAIR) and was first <\/span><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">released as an open-source project in February 2017.<\/span><\/p>\n<p data-ccp-props=\"{&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559683&quot;:0,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335562764&quot;:2,&quot;335562765&quot;:0.9,&quot;335562766&quot;:4,&quot;335562767&quot;:10,&quot;335562768&quot;:4,&quot;335562769&quot;:0}\">Learn more:<\/p>\n<ul>\n<li data-ccp-props=\"{&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559683&quot;:0,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335562764&quot;:2,&quot;335562765&quot;:0.9,&quot;335562766&quot;:4,&quot;335562767&quot;:10,&quot;335562768&quot;:4,&quot;335562769&quot;:0}\"><a href=\"https:\/\/github.com\/facebookresearch\/faiss\"><span data-usefontface=\"false\" data-contrast=\"none\">https:\/\/github.com\/facebookresearch\/faiss<\/span><\/a><\/li>\n<li data-ccp-props=\"{&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559683&quot;:0,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335562764&quot;:2,&quot;335562765&quot;:0.9,&quot;335562766&quot;:4,&quot;335562767&quot;:10,&quot;335562768&quot;:4,&quot;335562769&quot;:0}\"><a 
href=\"https:\/\/github.com\/facebookresearch\/faiss\/wiki\"><span data-scheme-color=\"@344854,,\" data-usefontface=\"true\" data-contrast=\"none\">https:\/\/github.com\/facebookresearch\/faiss\/wiki<\/span><\/a><\/li>\n<\/ul>\n<p><span data-usefontface=\"true\" data-contrast=\"none\">Faiss<\/span><span data-usefontface=\"true\" data-contrast=\"none\"> is a library for efficient similarity search and <\/span><span data-usefontface=\"true\" data-contrast=\"none\">clustering of dense vectors. <\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">It supports various algorithms for searching in sets of <\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">vectors. <\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">Faiss<\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\"> can handle data sizes that do not fit in RAM. <\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">It provides complete Python\/<\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">numpy<\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\"> wrappers and GPU implementations. 
The<\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">\u00a0library is written in C++ with a focus on performance <\/span><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\" data-usefontface=\"true\" data-contrast=\"none\">and scalability.<\/span><\/p>\n<h2><span class=\"TextRun BCX8 SCXP60635428\" lang=\"EN-US\" xml:lang=\"EN-US\" data-usefontface=\"false\" data-contrast=\"none\"><span class=\"NormalTextRun BCX8 SCXP60635428\">Supported Indexes<\/span><\/span><span class=\"LineBreakBlob BlobObject BCX8 SCXP60635428\"><span class=\"BCX8 SCXP60635428\">\u200b<\/span><br class=\"BCX8 SCXP60635428\" \/><\/span><\/h2>\n<p data-ccp-props=\"{&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559683&quot;:0,&quot;335559685&quot;:0,&quot;335559731&quot;:0,&quot;335559737&quot;:0,&quot;335562764&quot;:2,&quot;335562765&quot;:0.9,&quot;335562766&quot;:4,&quot;335562767&quot;:10,&quot;335562768&quot;:4,&quot;335562769&quot;:0}\"><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">FAISS<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> supports various indexes: <a href=\"https:\/\/github.com\/facebookresearch\/faiss\/wiki\/Faiss-indexes\">https:\/\/github.com\/facebookresearch\/faiss\/wiki\/Faiss-indexes<\/a> for efficient similarity search and clustering of dense vectors. 
Let\u2019s explore some of the key index types:<\/span>\u200b<\/p>\n<p><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Flat Indexes<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">:<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">These indexes encode vectors into fixed-size codes and store them in an array.<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">They are simple baselines and do not store vector IDs.<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Examples:<\/span>\u200b<\/p>\n<ul>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexFlatL2 (\u201cFlat\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Performs exact search based on Euclidean distance (L2 norm).<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexFlatIP<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201cFlat\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Compares vectors using the inner product (dot product) similarity.<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexScalarQuantizer<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201cSQ8\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Quantizes vectors to 8-bit integers.<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" 
data-contrast=\"none\">IndexPQ<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201c<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">PQx<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Splits vectors into sub-vectors and quantizes them (usually to 8 bits).<\/span>\u200b<\/li>\n<\/ul>\n<p><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Hierarchical Navigable Small World (HNSW)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">:<\/span>\u200b<\/p>\n<ul>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexHNSWFlat<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201c<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">HNSW,Flat<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Uses a graph exploration approach for fast search.<\/span>\u200b<\/li>\n<li><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Supports scalar quantization and product quantization.<\/span>\u200b<\/li>\n<\/ul>\n<p><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Inverted File Indexes<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">:<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">These indexes assign vectors to inverted lists and perform 
search efficiently.<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Examples:<\/span>\u200b<\/p>\n<ul>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFFlat<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201c<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IVFx,Flat<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Uses another index to assign vectors to inverted lists.<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFScalarQuantizer<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201cIVFx,SQ4\u201d or \u201cIVFx,SQ8\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Combines scalar quantization with inverted file structure.<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFPQ<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\"> (\u201c<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IVFx,PQy<\/span><\/b><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">\u201d)<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Uses product quantization on residuals.<\/span>\u200b<\/li>\n<\/ul>\n<p><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">GPU Acceleration<\/span><\/b><span data-scheme-color=\"@111111,,\" 
data-usefontface=\"true\" data-contrast=\"none\">:<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Some indexes are implemented on the GPU for faster execution.<\/span>\u200b<\/p>\n<p><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Examples:<\/span>\u200b<\/p>\n<ul>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">GpuIndexIVFFlat<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: GPU version of <\/span><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFFlat<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">.<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">GpuIndexIVFPQ<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: GPU version of <\/span><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFPQ<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">.<\/span>\u200b<\/li>\n<\/ul>\n<p><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">Other Indexes<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">:<\/span>\u200b<\/p>\n<ul>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexLSH<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">: Locality-Sensitive Hashing (binary flat index).<\/span>\u200b<\/li>\n<li><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFPQR<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" 
data-contrast=\"none\">: Similar to <\/span><b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">IndexIVFPQ<\/span><\/b><span data-scheme-color=\"@111111,,\" data-usefontface=\"true\" data-contrast=\"none\">, with re-ranking based on codes.<\/span><\/li>\n<\/ul>\n<p>Choosing the right index depends on your specific requirements, such as the size of your dataset and the desired trade-off between speed and accuracy.<\/p>\n<h2>FAISS with SQL Database<\/h2>\n<h4>Using Movie Plots in Azure SQL Database<\/h4>\n<p>For this example, we\u2019ll use the <a href=\"https:\/\/www.kaggle.com\/datasets\/jrobischon\/wikipedia-movie-plots\" target=\"_blank\" rel=\"noopener\">Wikipedia movie plots<\/a> dataset, stored in Azure SQL. We\u2019ll encode these movie plots into dense vectors using a pre-trained model and then create a FAISS index to perform similarity searches.<\/p>\n<div>\n<div>\n<div><code>SELECT *<span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\"> FROM <\/span>[dbo].[movie_plots];<\/code><\/div>\n<\/div>\n<div><\/div>\n<div>Here is what the data looks like:<\/div>\n<\/div>\n<div><\/div>\n<div><\/div>\n<div>\n<div>\n<div><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-3490\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5.png\" alt=\"Image Data 5\" width=\"1632\" height=\"448\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5.png 1632w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5-300x82.png 300w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5-1024x281.png 1024w, 
https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5-768x211.png 768w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-5-1536x422.png 1536w\" sizes=\"(max-width: 1632px) 100vw, 1632px\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<div><\/div>\n<div><\/div>\n<h4>Sample Code<\/h4>\n<p>I have split the sample notebooks into two separate parts. The first notebook, which focuses on creating the FAISS Index, can be found <span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\"> <a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Azure%20SQL_and_Faiss_Index_Creation.ipynb\">here<\/a><\/span>. The second notebook, which covers Similarity Search on the created index, is available <span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\"><a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Faiss_inference.ipynb\">here<\/a>.<\/span> This separation is due to the varying frequency of index creation, which depends on how often your data is updated. You can also consider the trade-off between the total cost of maintaining indexes and the accuracy of the results to decide how often to update the index.<\/p>\n<p><span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\">[Note: These notebooks are designed to run in Microsoft Fabric with PySpark. You can adjust the Python code to suit other notebook environments]. 
<\/span><\/p>\n<h4>Create FAISS Index (<a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Azure%20SQL_and_Faiss_Index_Creation.ipynb\">Sample Code<\/a>)<\/h4>\n<p>Here\u2019s a step-by-step guide to creating a FAISS index and performing similarity searches in Microsoft Fabric (PySpark).<\/p>\n<p>First, we load the data and choose a model.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-8.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-3494\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-8.png\" alt=\"Image Data 8\" width=\"885\" height=\"294\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-8.png 885w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-8-300x100.png 300w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-8-768x255.png 768w\" sizes=\"(max-width: 885px) 100vw, 885px\" \/><\/a><\/p>\n<p>The <strong>all-MiniLM-L6-v2 <\/strong>model from SentenceTransformers is a compact and efficient transformer model that maps text to 384-dimensional sentence embeddings. It balances embedding quality and speed, making it well suited to tasks like semantic search and clustering. This model is particularly useful for applications where computational resources are limited but high-quality embeddings are still required.<\/p>\n<p>Next, we create the FAISS index. 
See the sample code below.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-9.png\"><img decoding=\"async\" class=\"alignnone wp-image-3495 size-full\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-9-e1722466526386.png\" alt=\"Image Data 9\" width=\"754\" height=\"356\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-9-e1722466526386.png 754w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-9-e1722466526386-300x142.png 300w\" sizes=\"(max-width: 754px) 100vw, 754px\" \/><\/a><\/p>\n<p>The <strong>IndexFlatIP <\/strong>index in FAISS is a simple exact index for inner product (dot product) similarity search. It stores all vectors in a flat array and, for each query, computes the inner product against every stored vector to find the most similar ones.<\/p>\n<h4>When to use IndexFlatIP?<\/h4>\n<ul>\n<li><strong>High Accuracy: <\/strong>It provides exact nearest neighbor search results, making it suitable for applications where accuracy is critical.<\/li>\n<li><strong>Small to Medium Datasets: <\/strong>Best used for datasets that fit into memory, as every query performs a linear scan over all stored vectors.<\/li>\n<li><strong>Similarity Search: <\/strong>Ideal for tasks like semantic search, recommendation systems, and clustering where inner product similarity is relevant. When the vectors are L2-normalized, the inner product equals cosine similarity.<\/li>\n<\/ul>\n<h4>Performing Similarity Search (<a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Faiss_inference.ipynb\">Sample Code<\/a>)<\/h4>\n<p>To perform a similarity search, we\u2019ll define a function that encodes the input query, searches the FAISS index, and retrieves the top results. 
See the sample code:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-7.png\"><img decoding=\"async\" class=\"alignnone wp-image-3493 size-full\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-7-e1722465845704.png\" alt=\"Image Data 7\" width=\"688\" height=\"409\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-7-e1722465845704.png 688w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-7-e1722465845704-300x178.png 300w\" sizes=\"(max-width: 688px) 100vw, 688px\" \/><\/a><\/p>\n<h4>Query and Result<\/h4>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-6.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-3492\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-6.png\" alt=\"Image Data 6\" width=\"828\" height=\"307\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-6.png 828w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-6-300x111.png 300w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2024\/07\/Data-6-768x285.png 768w\" sizes=\"(max-width: 828px) 100vw, 828px\" \/><\/a><\/p>\n<p>And there you have it, the top 5 heist movies that are sure to keep you on the edge of your seat!\ud83c\udf7f<\/p>\n<p>You can also explore other types of searches, such as finding similar movies based on genre, director, or even specific actors. Maybe try &#8216;<span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\">A mysterious murder in a small town&#8217; or &#8216;Documentary about groundbreaking scientific discoveries&#8217;. 
<\/span>The possibilities are endless, so have fun experimenting with different criteria to discover new favorites!<\/p>\n<h2>Conclusion and Next Steps<\/h2>\n<p>FAISS is a powerful tool for performing similarity searches on large datasets. By integrating FAISS with Azure SQL and Microsoft Fabric, you can efficiently search for similar items within your data. The sample notebook provided here should help you get started with implementing similarity search in your own projects.<\/p>\n<p>Feel free to experiment with different FAISS indexes and encoding models to optimize the performance and accuracy of your searches. Apply the techniques learned here to your data, and try various searches, including hybrid searches with filters on other columns like genre, cast, etc.<\/p>\n<p>Happy searching!<\/p>\n<h2>References and Resources<\/h2>\n<ul>\n<li>DataSet: <a href=\"https:\/\/www.kaggle.com\/datasets\/jrobischon\/wikipedia-movie-plots\" target=\"_blank\" rel=\"noopener\">Wikipedia Movie Plots<\/a><\/li>\n<li>Check the SQL AI samples at<span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\"> <a href=\"http:\/\/aka.ms\/sqlaisamples\" target=\"_blank\" rel=\"noopener\">aka.ms\/sqlaisamples<\/a>. 
The sample notebooks can be found <a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Azure%20SQL_and_Faiss_Index_Creation.ipynb\">here<\/a> and <a href=\"https:\/\/github.com\/Azure-Samples\/SQL-AI-samples\/blob\/main\/AzureSQLFaiss\/Faiss_inference.ipynb\">here<\/a>.<\/span><\/li>\n<li>YouTube: <a href=\"https:\/\/youtu.be\/FrR3jZE9z8Y\">https:\/\/youtu.be\/FrR3jZE9z8Y<\/a><\/li>\n<li>Microsoft Learn: <a href=\"https:\/\/learn.microsoft.com\/en-us\/shows\/data-exposed\/similarity-search-with-faiss-and-azure-sql-data-exposed\">https:\/\/learn.microsoft.com\/en-us\/shows\/data-exposed\/similarity-search-with-faiss-and-azure-sql-data-exposed<\/a><\/li>\n<li>Check SQL AI documentation <span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\"><a href=\"http:\/\/aka.ms\/sqlai\" target=\"_blank\" rel=\"noopener\">aka.ms\/sqlai<\/a><\/span><\/li>\n<li>Microsoft Fabric Notebooks: <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/how-to-use-notebook\">How to use notebooks &#8211; Microsoft Fabric | Microsoft Learn<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In today\u2019s data-driven world, finding similar items within large datasets is a common challenge. Whether it\u2019s recommending products, identifying similar documents, or clustering data points, efficient similarity search is crucial. This blog post will explore how to leverage FAISS (Facebook AI Similarity Search) and Azure SQL to perform similarity searches on Wikipedia movie plots data. 
[&hellip;]<\/p>\n","protected":false},"author":99201,"featured_media":3477,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[601,1,570,411,619],"tags":[590,588,465,407,626],"class_list":["post-3441","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-azure-sql","category-microsoft-fabric","category-python","category-t-sql","tag-ai","tag-azure-sql-db","tag-azuresql","tag-python","tag-similarity-search"],"acf":[],"blog_post_summary":"<p>In today\u2019s data-driven world, finding similar items within large datasets is a common challenge. Whether it\u2019s recommending products, identifying similar documents, or clustering data points, efficient similarity search is crucial. This blog post will explore how to leverage FAISS (Facebook AI Similarity Search) and Azure SQL to perform similarity searches on Wikipedia movie plots data. [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/3441","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/users\/99201"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/comments?post=3441"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/3441\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media\/3477"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media?parent=3441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"h
ttps:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/categories?post=3441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/tags?post=3441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}