{"id":16359,"date":"2025-09-03T00:00:00","date_gmt":"2025-09-03T07:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16359"},"modified":"2025-09-03T06:41:11","modified_gmt":"2025-09-03T13:41:11","slug":"msfabric-and-openai-for-dataset","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/msfabric-and-openai-for-dataset\/","title":{"rendered":"Unlocking Vector Search with OneLake Indexer and OpenAI Integration in Microsoft Fabric"},"content":{"rendered":"<p>Our team works with customer solutions that integrate AI into data solutions. Some time ago, this post titled <a href=\"https:\/\/blog.fabric.microsoft.com\/en-gb\/blog\/fabric-change-the-game-unleashing-the-power-of-microsoft-fabric-and-openai-for-dataset-search\">Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search | Microsoft Fabric Blog | Microsoft Fabric<\/a> was shared in our Fabric Blog community. It explored how to integrate OpenAI capabilities into Microsoft Fabric and was based on insights from our team\u2019s experience.<\/p>\n<p>At that time, Microsoft Fabric didn\u2019t offer a built-in OneLake Files indexer. Now that it does, we revisited the topic to evaluate how these new capabilities streamline dataset discovery and enhance AI integration, drawing from our continued work with customers.<\/p>\n<p>In this step-by-step guide, we&#8217;ll dive into vector search using the OneLake Files indexer. You&#8217;ll learn how to extract searchable content and metadata from your files to power advanced search experiences. 
We&#8217;ll also explore how Microsoft Fabric and OpenAI work together, making it easier to build AI-driven solutions within your existing data ecosystem.<\/p>\n<p>This approach is especially important today, as many organizations are working to unify their data and AI strategies.<\/p>\n<p>To explore this option in Fabric, you will need:<\/p>\n<ul>\n<li>The <a href=\"https:\/\/learn.microsoft.com\/en-us\/rest\/api\/searchservice\/data-sources\/create-or-update?view=rest-searchservice-2024-05-01-preview&amp;tabs=HTTP&amp;preserve-view=true\">2024-05-01-preview REST API<\/a> or a newer preview REST API.<\/li>\n<li>The <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-get-started-portal-import-vectors\">Import and vectorize data<\/a> wizard in the Azure portal.<\/li>\n<li>A Fabric workspace.<\/li>\n<li>A Lakehouse in a Fabric workspace.<\/li>\n<li>Textual data.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/create-lakehouse-onelake#load-data-into-a-lakehouse\">Upload into a lakehouse directly<\/a><\/li>\n<\/ul>\n<h3><span style=\"text-decoration: underline;\">Step-by-Step:<\/span><\/h3>\n<h4>Search Services<\/h4>\n<p>This is where things get interesting. The OneLake indexer allows us to create an index on top of the data in OneLake. If you already have the data you want to index, you can proceed. If not, you can use sample data from this repo: <a href=\"https:\/\/github.com\/Azure-Samples\/Azure-OpenAI-Docs-Samples\/blob\/main\/Samples\/Tutorials\/Embeddings\/embedding_billsum.ipynb\">Azure-OpenAI-Docs-Samples\/Samples\/Tutorials\/Embeddings\/embedding_billsum.ipynb<\/a>.<\/p>\n<p><strong>But what is an indexer?<\/strong><\/p>\n<p>An <em>indexer<\/em> in Azure AI Search is a tool that crawls cloud data sources, extracts textual data, and populates a search index by mapping fields between the source data and the index. 
<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-indexer-overview\">Indexer overview &#8211; Azure AI Search | Microsoft Learn<\/a>.<\/p>\n<p>An indexer&#8217;s configuration specifies both the data source it reads from (such as OneLake) and the search index it writes to (the destination). We recommend creating a separate indexer for each combination of target index and data source. Multiple indexers can write to the same index, and a single data source can be reused across multiple indexers, as shown in Fig 1 &#8211; Flow.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig1Flow.png\" alt=\"Flow\" \/><\/p>\n<p>Fig 1 &#8211; Flow<\/p>\n<h4>Search Services configuration:<\/h4>\n<ul>\n<li>If you haven&#8217;t already, upload your data to OneLake inside the Lakehouse (<a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/load-data-lakehouse#local-file-upload\">Options to get data into the Lakehouse &#8211; Microsoft Fabric | Microsoft Learn<\/a>).<\/li>\n<li>Deploy the Search Service (<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-create-service-portal\">Create a search service in the Azure portal &#8211; Azure AI Search | Microsoft Learn<\/a>).<\/li>\n<li>Enable the system-assigned managed identity on the Search Service, as shown in Fig 2 &#8211; Identity.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig2Identity.png\" alt=\"Identity\" \/><\/p>\n<p>Fig 2 &#8211; Identity<\/p>\n<ul>\n<li>Create a data source that points to OneLake, as shown in Fig 3 &#8211; DataSource.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig3Datasource.png\" alt=\"DataSource\" \/><\/p>\n<p>Fig 3 &#8211; DataSource<\/p>\n<ul>\n<li>Once you create the data source, you can open the JSON editor (as shown in Fig 4 &#8211; Json). 
The ResourceId in the connection string holds the Fabric workspace GUID, and the container name holds the Lakehouse GUID.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig4json.png\" alt=\"Json\" \/><\/p>\n<p>Fig 4 &#8211; Json<\/p>\n<ul>\n<li>According to the documentation, the indexer can detect when a document is flagged for deletion (soft delete). To configure this, connect to OneLake from Azure Storage Explorer, as shown in Fig 5 &#8211; AzureStorage (<a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/onelake-azure-storage-explorer\">Integrate OneLake with Azure Storage Explorer &#8211; Microsoft Fabric | Microsoft Learn<\/a>).<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig5AzureStorage.png\" alt=\"AzureStorage\" \/><\/p>\n<p>Fig 5 &#8211; AzureStorage<\/p>\n<ul>\n<li>Connect to the subscription and choose ADLS.<\/li>\n<li>In Azure Storage Explorer, add the metadata property to the file, as shown in Fig 6 &#8211; Properties. Its key and value must match the <code>softDeleteColumnName<\/code> and <code>softDeleteMarkerValue<\/code> set in the data source definition.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig6properties.png\" alt=\"Properties\" \/><\/p>\n<p>Fig 6 &#8211; Properties<\/p>\n<ul>\n<li>In Azure AI Search, edit the data source JSON definition to include a &#8220;dataDeletionDetectionPolicy&#8221; property.<\/li>\n<\/ul>\n<pre><code class=\"language-json\">{\r\n  \"@odata.context\": \"https:\/\/onelake.search.windows.net\/$metadata#datasources\/$entity\",\r\n  \"@odata.etag\": \"\\\"0xXX\\\"\",\r\n  \"name\": \"define a name\",\r\n  \"description\": null,\r\n  \"type\": \"onelake\",\r\n  \"subtype\": null,\r\n  \"credentials\": {\r\n    \"connectionString\": \"ResourceId=1111\"\r\n  },\r\n  \"container\": {\r\n    \"name\": \"0000\",\r\n    \"query\": \"data\"\r\n  },\r\n  \"dataDeletionDetectionPolicy\": {\r\n    \"@odata.type\": \"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy\",\r\n    \"softDeleteColumnName\": \"isdeleted\",\r\n    \"softDeleteMarkerValue\": \"true\"\r\n  },\r\n  \"dataChangeDetectionPolicy\": null,\r\n  \"encryptionKey\": null,\r\n  \"identity\": null\r\n}<\/code><\/pre>\n<p>If you see the following error:<\/p>\n<p>&#8220;Failed to update indexer &#8220;name of the indexer&#8221;, error: &#8220;Error with data source: Unable to list items within the Lakehouse using the specified identity as access to the workspace was denied. Please adjust your data source definition in order to proceed.&#8221;<\/p>\n<p>Grant the Search Service&#8217;s managed identity the relevant permissions on the Fabric workspace, as shown in Fig 7 &#8211; Fabric. In this example, &#8216;Fabricrag&#8217; is the name of the Search Service on which the system-assigned managed identity was enabled.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig7Fabric.png\" alt=\"Fabric\" \/><\/p>\n<p>Fig 7 &#8211; Fabric<\/p>\n<ul>\n<li>Create the index and the indexer that match the content you want to search. 
Here is the documentation example: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-how-to-index-onelake-files#add-search-fields-to-an-index\">OneLake indexer (preview) &#8211; Azure AI Search | Microsoft Learn<\/a><\/li>\n<\/ul>\n<pre><code class=\"language-json\">{\r\n    \"name\": \"Define A name\",\r\n    \"fields\": [\r\n        { \"name\": \"ID\", \"type\": \"Edm.String\", \"key\": true, \"searchable\": false },\r\n        { \"name\": \"content\", \"type\": \"Edm.String\", \"searchable\": true, \"filterable\": false },\r\n        { \"name\": \"metadata_storage_name\", \"type\": \"Edm.String\", \"searchable\": false, \"filterable\": true, \"sortable\": true },\r\n        { \"name\": \"metadata_storage_size\", \"type\": \"Edm.Int64\", \"searchable\": false, \"filterable\": true, \"sortable\": true },\r\n        { \"name\": \"metadata_storage_content_type\", \"type\": \"Edm.String\", \"searchable\": false, \"filterable\": true, \"sortable\": true }\r\n    ]\r\n}<\/code><\/pre>\n<p><strong>Indexer:<\/strong><\/p>\n<pre><code class=\"language-json\">{\r\n  \"name\": \"searchragtxt\",\r\n  \"dataSourceName\": \"name of your datasource\",\r\n  \"targetIndexName\": \"name of your index\",\r\n  \"fieldMappings\": [\r\n    {\r\n      \"sourceFieldName\": \"content\",\r\n      \"targetFieldName\": \"content\"\r\n    },\r\n    {\r\n      \"sourceFieldName\": \"metadata_storage_name\",\r\n      \"targetFieldName\": \"metadata_storage_name\"\r\n    },\r\n    {\r\n      \"sourceFieldName\": \"metadata_storage_size\",\r\n      \"targetFieldName\": \"metadata_storage_size\"\r\n    },\r\n    {\r\n      \"sourceFieldName\": \"metadata_storage_content_type\",\r\n      \"targetFieldName\": \"metadata_storage_content_type\"\r\n    }\r\n  ],\r\n  \"parameters\": {\r\n    \"configuration\": {\r\n      \"dataToExtract\": \"contentAndMetadata\",\r\n      \"delimitedTextHeaders\": \"true\",\r\n      \"delimitedTextDelimiter\": \",\"\r\n    }\r\n  }\r\n}<\/code><\/pre>\n<ul>\n<li>Run the indexer, as shown in Fig 8 &#8211; Run:<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig8Run.png\" alt=\"Run-Indexer\" \/><\/p>\n<p>Fig 8 &#8211; Run<\/p>\n<p>Querying the index from Python is straightforward. In Fabric, you can open a notebook, connect to the search service, and perform a simple search.<\/p>\n<p>Keep in mind, this is just a proof of concept and is not intended for production use.<\/p>\n<p>Note: the Search Service key can be obtained from Search Service -&gt; Settings -&gt; Keys.<\/p>\n<p>Simple Example:<\/p>\n<pre><code class=\"language-python\">from azure.core.credentials import AzureKeyCredential\r\nfrom azure.search.documents import SearchClient\r\n\r\n# Placeholders: replace with your own search service, key, and index.\r\nservice_endpoint = \"https:\/\/NAMEOFTHESEARCHSERVICE.search.windows.net\"\r\napi_key = \"APIKEY\"\r\nindex_name = \"NAME OF THE INDEX\"\r\n\r\nsearch_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=AzureKeyCredential(api_key))\r\n\r\nquery = \"Any relevant question for the content uploaded?\"\r\n\r\n# Full-text query that returns the three best-matching documents.\r\nresults = search_client.search(search_text=query, top=3)\r\n\r\n# Collect the extracted text of each hit.\r\nretrieved_docs = []\r\nfor result in results:\r\n    retrieved_docs.append(result['content'])\r\n\r\nprint(\"Retrieved content:\")\r\nfor doc in retrieved_docs:\r\n    print(doc)<\/code><\/pre>\n<h4>Vectorize<\/h4>\n<p>Another option is to use the wizard, which eliminates the need for coding to build an index. 
This lets you start querying your data within minutes.<\/p>\n<p>The wizard automatically creates several objects on your search service: a searchable index, an indexer, and a data source connection. More information: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/vector-search-how-to-configure-vectorizer#define-a-vectorizer-and-vector-profile\">Configure a vectorizer &#8211; Azure AI Search | Microsoft Learn<\/a><\/p>\n<p>You can do this by following the steps in the documentation: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-get-started-portal-import-vectors?tabs=sample-data-storage%2Cmodel-aoai%2Cconnect-data-storage#set-up-embedding-models\">Quickstart: Vector Search in the Azure Portal &#8211; Azure AI Search | Microsoft Learn<\/a>. From there, connect the data source to OneLake in just a few clicks.<\/p>\n<p>Note: Deploying an OpenAI service is required, as mentioned in the documentation. The user performing the deployment must have the <strong>Cognitive Services Contributor<\/strong> role. You&#8217;ll also need to deploy an embedding model, as shown in Fig 9 &#8211; Vectorize. 
An option is <code>text-embedding-ada-002<\/code>, which is an OpenAI model designed to convert text into numerical vector representations, making it useful for tasks like semantic search, recommendation systems, and clustering.<\/p>\n<p>For example, consider these scenarios:<\/p>\n<ul>\n<li><strong>Semantic Search<\/strong>: Instead of relying on keyword matching, searches return results based on meaning.<\/li>\n<li><strong>Recommendation Systems<\/strong>: Similarity between text embeddings helps suggest relevant content.<\/li>\n<li><strong>Clustering &amp; Categorization<\/strong>: Text with similar meanings gets grouped together.<\/li>\n<\/ul>\n<p>Also, ensure that the OpenAI service has <strong>system identity enabled<\/strong> and that the necessary permissions are granted between OpenAI and the Search Service.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/09\/Fig9Vectorize.png\" alt=\"Vectorize\" \/><\/p>\n<p>Fig 9 &#8211; Vectorize<\/p>\n<p>More information about OpenAI:<\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/overview\">What is Azure OpenAI Service? 
\u2013 Azure AI services | Microsoft Learn<\/a><\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/tutorials\/embeddings?tabs=command-line\">Azure OpenAI Service embeddings tutorial \u2013 Azure OpenAI | Microsoft Learn<\/a><\/p>\n<p>From Fabric, you can open a notebook, connect to the search service, and perform the search.<\/p>\n<pre><code class=\"language-python\">from azure.core.credentials import AzureKeyCredential\r\nfrom azure.search.documents import SearchClient\r\nfrom azure.search.documents.models import VectorizableTextQuery\r\n\r\nquery = \"can you fill with a relevant question ?\"\r\ncredential = AzureKeyCredential(\"Search Services KEY\")\r\n\r\nendpoint = \"https:\/\/nameoftheSearchservicecreated.search.windows.net\"\r\nindex_name = \"name of the index created\"\r\n\r\nsearch_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)\r\n\r\n# The query text is vectorized by the vectorizer configured on the index.\r\nvector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields=\"text_vector\", exhaustive=True)\r\n\r\n# Hybrid query: search_text adds keyword matching to the vector query.\r\n# Pass search_text=None instead for a pure vector search.\r\nresults = search_client.search(\r\n    search_text=query,\r\n    vector_queries=[vector_query],\r\n    top=1\r\n)\r\n\r\nfor result in results:\r\n    print(f\"title: {result['title']}\")\r\n    print(f\"Score: {result['@search.score']}\")\r\n    print(f\"chunk: {result['chunk']}\")<\/code><\/pre>\n<h3>Seamless OpenAI and Microsoft Fabric Integration<\/h3>\n<p>Another option is to use Fabric and OpenAI. 
Fabric seamlessly integrates with Azure AI services, enabling you to enhance your data with prebuilt AI models without any prior setup.<\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/ai-services\/ai-services-overview#prebuilt-ai-models-in-fabric-preview\">Use Azure AI services in Fabric &#8211; Microsoft Fabric | Microsoft Learn<\/a><\/p>\n<p>This is the easiest way to use OpenAI and Microsoft Fabric: simply import the library, choose a supported model, and start using it.<\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/ai-services\/how-to-use-openai-sdk-synapse?tabs=python0#chat\">Use Azure OpenAI with Python SDK &#8211; Microsoft Fabric | Microsoft Learn<\/a><\/p>\n<h3>Summary<\/h3>\n<p>In this post, we explored how to leverage OneLake&#8217;s indexer for vector search to extract searchable data and metadata. We also discussed the seamless integration of Microsoft Fabric with OpenAI, showing how these tools work together for efficient data processing and AI-driven insights.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Exploring how Microsoft Fabric OneLake indexer integrates with OpenAI<\/p>\n","protected":false},"author":192662,"featured_media":16360,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[33,41,3592,3613,3612,3496],"class_list":["post-16359","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-ai","tag-analytics","tag-fabric","tag-indexer","tag-onelake","tag-openai"],"acf":[],"blog_post_summary":"<p>Exploring how Microsoft Fabric OneLake indexer integrates with 
OpenAI<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/192662"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16359"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16359\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16360"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}