{"id":16048,"date":"2025-02-06T00:00:00","date_gmt":"2025-02-06T08:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16048"},"modified":"2025-02-06T07:05:27","modified_gmt":"2025-02-06T15:05:27","slug":"unlock-ai-search-potential-the-case-for-azure-functions-in-data-ingestion","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/unlock-ai-search-potential-the-case-for-azure-functions-in-data-ingestion\/","title":{"rendered":"Azure Functions vs. Indexers: AI Data Ingestion"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>Every successful AI search begins with one core requirement &#8211; <strong>searchable data<\/strong>. For most businesses, especially when implementing AI search on a customer\u2019s knowledge base, the challenge starts with <strong>data migration<\/strong> or <strong>data ingestion<\/strong> into the search service.<\/p>\n<p>In this article, we will show efficient alternatives for ingesting data into Azure AI Search. Specifically, we&#8217;ll compare two popular approaches:<\/p>\n<ol>\n<li><strong>Using Azure Functions<\/strong> &#8211; a more independent, flexible option.<\/li>\n<li><strong>Leveraging Pre-built Indexers<\/strong> &#8211; built into Azure AI Search.<\/li>\n<\/ol>\n<p>We will walk through the first approach in detail. 
By the end of this guide, you&#8217;ll have a clear understanding of when to use Azure Functions and how to implement a scalable ingestion pipeline.<\/p>\n<h2>Key Steps in the Data Ingestion Process<\/h2>\n<p>Before diving into the comparison, let\u2019s outline the essential steps involved in setting up a successful ingestion pipeline:<\/p>\n<ol>\n<li><strong>Set Up Azure Blob Storage:<\/strong> Configure a storage account and container to hold the data.<\/li>\n<li><strong>Prepare the Data:<\/strong> Ensure your data is clean, consistent and ready for ingestion.<\/li>\n<li><strong>Describe the File Structure:<\/strong> Common formats include JSON, CSV, PDF, or plain text.<\/li>\n<li><strong>Set Up Azure AI Search:<\/strong> Create and configure your Azure AI (Cognitive) Search service.<\/li>\n<li><strong>Create an Ingestion Mechanism:<\/strong> You can implement this using Azure Functions or a pre-built Indexer.<\/li>\n<\/ol>\n<h2>Pros and Cons: Azure Functions vs. Pre-Built Indexers<\/h2>\n<p>Let&#8217;s compare the two options based on flexibility, ease of use, scalability, and maintenance:<\/p>\n<table>\n<thead>\n<tr>\n<th><strong>Criteria<\/strong><\/th>\n<th><strong>Azure Functions<\/strong><\/th>\n<th><strong>Pre-Built Indexers (Azure AI Search)<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Flexibility<\/strong><\/td>\n<td>Highly flexible, allowing custom logic and advanced workflows.<\/td>\n<td>Limited flexibility with predefined workflows and configurations.<\/td>\n<\/tr>\n<tr>\n<td><strong>Control<\/strong><\/td>\n<td>Full control over data transformation, validation, and ingestion processes.<\/td>\n<td>Less control as it handles data ingestion out-of-the-box with minimal customization.<\/td>\n<\/tr>\n<tr>\n<td><strong>Scalability<\/strong><\/td>\n<td>Easily scalable through Azure Function configurations.<\/td>\n<td>Scalable, but customization is limited for large, complex data sets.<\/td>\n<\/tr>\n<tr>\n<td><strong>Ease of 
Use<\/strong><\/td>\n<td>Requires setup and custom coding (e.g., Python, C#).<\/td>\n<td>Easier to set up; less coding required.<\/td>\n<\/tr>\n<tr>\n<td><strong>Monitoring<\/strong><\/td>\n<td>Custom monitoring setup required.<\/td>\n<td>Built-in monitoring options available.<\/td>\n<\/tr>\n<tr>\n<td><strong>Cost<\/strong><\/td>\n<td>Pay-as-you-go model, with charges based on Function executions.<\/td>\n<td>Typically more cost-efficient with built-in capabilities for basic scenarios.<\/td>\n<\/tr>\n<tr>\n<td><strong>Maintenance<\/strong><\/td>\n<td>Requires ongoing maintenance of custom code.<\/td>\n<td>Less maintenance due to managed services.<\/td>\n<\/tr>\n<tr>\n<td><strong>Search Engine Agnostic<\/strong><\/td>\n<td>Functions allow you to keep flexibility if you&#8217;re unsure whether to use Azure AI Search alone or in combination with other search engines.<\/td>\n<td>Indexers are tightly coupled with Azure AI Search, limiting cross-platform usage.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For those looking to get started with Indexers, you can find a comprehensive guide here: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-indexer-overview\">https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-indexer-overview<\/a><\/p>\n<h2>Deep Dive into the Azure Function Approach<\/h2>\n<h3>Overview<\/h3>\n<p><strong>Azure Functions<\/strong> provide a more powerful and customizable alternative to pre-built Indexers.\nWith Azure Functions, you can control how data is processed, apply custom logic, and ensure that the ingestion meets your specific needs.\nIt is also easier to integrate specific logging and data-analytics tools than it is with pre-built indexers.<\/p>\n<h3>Architecture<\/h3>\n<ol>\n<li><strong>Azure Blob Storage:<\/strong> Stores the data (e.g., PDFs).<\/li>\n<li><strong>Azure Function:<\/strong> Runs on a blob trigger (a file is added or changed, or its metadata changes), processes the data, and sends it to Azure AI 
Search.<\/li>\n<li><strong>Azure AI Search:<\/strong> Indexes the processed data, making it searchable.<\/li>\n<\/ol>\n<p>This architecture offers flexibility and scalability, especially when dealing with complex data structures or custom ingestion logic.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/02\/function-ingestion.png\" alt=\"Architecture for file ingestion leveraging Azure Functions\" \/><\/p>\n<h3>Data Ingestion Pipeline<\/h3>\n<p>Let\u2019s build a <strong>data ingestion pipeline<\/strong> using Azure Functions. For this example, we\u2019ll demonstrate ingesting <strong>PDF documents<\/strong> from Azure Blob Storage into Azure AI Search using <strong>Python<\/strong>.<\/p>\n<h3>Prepare folder structure<\/h3>\n<p>Here\u2019s a breakdown of the folder structure for your Azure Function setup:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/02\/common-function-for-ingestion.png\" alt=\"Common folder structure\" \/><\/p>\n<ul>\n<li><strong><code>components\/<\/code><\/strong>: This folder contains all ingestion-related components, such as modules or classes, that your Azure Function will use.<\/li>\n<li><strong><code>function_app.py<\/code><\/strong>: This is the main entry point for the Azure Function. It contains the logic that triggers the ingestion process and orchestrates data flow from Blob Storage to Azure AI Search.<\/li>\n<li><strong><code>host.json<\/code><\/strong>: This file defines the configuration settings for the Azure Function. You can customize it based on your business requirements.\n<ul>\n<li><strong><code>watchDirectories<\/code><\/strong>: Specifies which directories should be monitored for changes (in this case, the <code>components<\/code> folder).<\/li>\n<li><strong><code>logLevel<\/code><\/strong>: Defines the logging level, so you can control the verbosity of logs. 
Here, all logs are set to <code>Information<\/code> level, except for <code>Azure.Core<\/code>, which is set to <code>Error<\/code>.<\/li>\n<li><strong><code>applicationInsights<\/code><\/strong>: Enables logging for Application Insights, with sampling settings to exclude certain log types.<\/li>\n<li><strong><code>extensionBundle<\/code><\/strong>: Configures the extension bundle for the Azure Function runtime.<\/li>\n<\/ul>\n<p>Here\u2019s a sample configuration:<\/p>\n<pre><code class=\"language-json\">{\r\n    \"version\": \"2.0\",\r\n    \"watchDirectories\": [\r\n        \"components\"\r\n    ],\r\n    \"logging\": {\r\n        \"logLevel\": {\r\n            \"default\": \"Information\",\r\n            \"Azure.Core\": \"Error\"\r\n        },\r\n        \"applicationInsights\": {\r\n            \"samplingSettings\": {\r\n                \"isEnabled\": true,\r\n                \"excludedTypes\": \"Request\"\r\n            }\r\n        }\r\n    },\r\n    \"extensionBundle\": {\r\n        \"id\": \"Microsoft.Azure.Functions.ExtensionBundle\",\r\n        \"version\": \"[4.*, 5.0.0)\"\r\n    }\r\n}<\/code><\/pre>\n<\/li>\n<li><strong><code>tests\/<\/code><\/strong>: This folder contains automated tests for your function, ensuring that your ingestion logic works as expected.<\/li>\n<li><strong><code>requirements.txt<\/code><\/strong>: Lists all the dependencies (e.g., libraries and packages) required by the Azure Function. The function runtime uses this file to install the necessary dependencies during deployment.<\/li>\n<\/ul>\n<h3>Required Environment Variables Setup<\/h3>\n<p>To ensure your ingestion pipeline functions properly, you need to configure several environment variables both locally (in a <code>.env<\/code> file) and in the cloud (for the Azure Function). 
These variables provide access to essential services like Azure Blob Storage and Azure AI Search.<\/p>\n<p>The following environment variables are required:<\/p>\n<ul>\n<li><strong><code>APPLICATION_INSIGHTS_CONNECTION_STRING<\/code><\/strong>: Connection string for sending telemetry data to Application Insights.<\/li>\n<li><strong><code>AZURE_STORAGE_CONNECTION_STRING<\/code><\/strong>: Connection string to access Azure Blob Storage.<\/li>\n<li><strong><code>AZURE_SEARCH_SERVICE_NAME<\/code><\/strong>: The name of the Azure AI Search service.<\/li>\n<li><strong><code>AZURE_SEARCH_ADMIN_KEY<\/code><\/strong>: Admin key for managing the Azure Search service.<\/li>\n<li><strong><code>AZURE_SEARCH_SERVICE_ENDPOINT<\/code><\/strong>: The endpoint URL of the Azure AI Search service.<\/li>\n<li><strong><code>AZURE_SEARCH_INDEX_NAME<\/code><\/strong>: The name of the search index where the data will be ingested.<\/li>\n<li><strong><code>BLOB_STORAGE_DATA_CONNECTION_STRING<\/code><\/strong>: Connection string to the Azure Blob Storage that contains the data to be ingested.<\/li>\n<li><strong><code>BLOB_STORAGE_DATA_CONTAINER_NAME<\/code><\/strong>: The name of the Blob Storage container where the documents (e.g., PDFs) are stored.<\/li>\n<li><strong><code>AZURE_OPENAI_API_VERSION<\/code><\/strong>: The version of the Azure OpenAI API being used.<\/li>\n<li><strong><code>AZURE_OPENAI_ENDPOINT<\/code><\/strong>: The endpoint URL for the Azure OpenAI service.<\/li>\n<li><strong><code>AZURE_OPENAI_KEY<\/code><\/strong>: The API key for accessing the Azure OpenAI service.<\/li>\n<\/ul>\n<p>These environment variables will be accessed within the Azure Function to automate the connection and interaction with the respective services.<\/p>\n<h2>Step-by-Step Guide to Azure Function Implementation<\/h2>\n<h3>Step 1: Setting Up the Azure Function in Azure Cloud<\/h3>\n<p><img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/02\/create-function-app-in-azure.png\" alt=\"Create an Azure function in the cloud UI\" \/><\/p>\n<h3>Step 2: Define the Context for Your Function App<\/h3>\n<p>At this stage, we created an independent <code>PDFIngestionsService<\/code>, which serves as a dedicated service for handling PDF ingestion. This modular approach allows for easy future extension to support other file types, such as text, Excel, HTML, and more.<\/p>\n<p>By structuring the service in this way, you can easily add additional ingestion logic for other formats, ensuring flexibility and scalability in your data ingestion pipeline.<\/p>\n<pre><code class=\"language-python\">import logging\r\nimport os\r\nimport pathlib\r\n\r\nimport azure.functions as func\r\n\r\nDOCUMENT_CONTAINER_NAME = os.environ[\"BLOB_STORAGE_DATA_CONTAINER_NAME\"]\r\napp = func.FunctionApp()\r\n\r\n@app.blob_trigger(arg_name=\"blob\", path=DOCUMENT_CONTAINER_NAME, connection=\"BLOB_STORAGE_DATA_CONNECTION_STRING\")\r\nasync def handle_blob_change_function(blob: func.InputStream):\r\n    # Import our code at execution time to prevent host startup failures\r\n    from components.logging import configure_logging\r\n    from components.PDFs.ingest_service import PDFIngestionsService\r\n\r\n    configure_logging()\r\n    logger = logging.getLogger(__name__)\r\n    data_file_name = blob.name.lower()\r\n    ingestion_service = None\r\n\r\n    try:\r\n        # Determine the file type and select the appropriate ingestion service\r\n        file_extension = pathlib.Path(data_file_name).suffix[1:].lower()\r\n        if file_extension in [\"pdf\"]:\r\n            ingestion_service = PDFIngestionsService()\r\n        else:\r\n            raise ValueError(\"Unsupported file type\")\r\n\r\n        # Ingest the blob using the selected service\r\n        if ingestion_service:\r\n            await ingestion_service.ingest(blob)\r\n\r\n    except Exception as e:\r\n        logger.exception(f\"An error occurred while processing data {data_file_name}: {e}\")\r\n        raise<\/code><\/pre>\n<h3>Step 3: Create a Helper for Setting Up Application Insights Logging<\/h3>\n<p>To ensure proper logging and monitoring within your Azure Function app, we created a helper script that configures logging and integrates with <strong>Azure Application Insights<\/strong>. This allows you to capture logs, errors, and telemetry data in Application Insights for better observability and troubleshooting.<\/p>\n<p><strong>Suggested path:<\/strong> <code>.\/components\/logging.py<\/code><\/p>\n<pre><code class=\"language-python\">import logging\r\nimport os\r\n\r\nfrom opencensus.ext.azure.log_exporter import AzureLogHandler\r\n\r\ndef configure_logging():\r\n    # Configure the logger\r\n    LOG_LEVEL = os.getenv(\"LOG_LEVEL\", \"INFO\").upper()\r\n    LOG_FORMAT = os.getenv(\"LOG_FORMAT\", \"%(asctime)s - %(name)s - %(levelname)s - %(message)s\")\r\n    logging.basicConfig(level=LOG_LEVEL, format=LOG_FORMAT)\r\n\r\n    # Silence noisy loggers\r\n    logging.getLogger(\"azure.core.pipeline.policies.http_logging_policy\").setLevel(logging.WARNING)\r\n    logging.getLogger(\"httpx\").setLevel(logging.WARNING)\r\n\r\n    # Add the Azure Application Insights handler to the root logger (only once)\r\n    root_logger = logging.getLogger()\r\n    appinsights_connection_string = os.getenv(\"APPLICATION_INSIGHTS_CONNECTION_STRING\")\r\n    if appinsights_connection_string and not any(isinstance(handler, AzureLogHandler) for handler in root_logger.handlers):\r\n        azure_handler = AzureLogHandler(connection_string=appinsights_connection_string)\r\n        root_logger.addHandler(azure_handler)<\/code><\/pre>\n<h3>Step 4: Configure Your File Ingestion Service Interface and Its Implementation<\/h3>\n<p>Now is the ideal time 
to set up the interface for your file ingestion service and implement it. This will allow you to streamline how different file types are processed, starting with PDFs, but leaving room for future extensions (e.g., text, Excel, HTML).<\/p>\n<p><strong>Suggested folder structure for components and PDF ingestion service:<\/strong><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/02\/azure-function-components.png\" alt=\"Common folder structure for components\" \/><\/p>\n<p><strong>Ingestion service interface (<code>ingestion_service_interface.py<\/code>):<\/strong><\/p>\n<p>In this interface, we define the base structure for any ingestion service, enabling flexibility for handling different file types. The <code>FileProcessor<\/code> class is also defined to process file streams and handle temporary file storage.<\/p>\n<pre><code class=\"language-python\">import logging\r\nimport os\r\nimport tempfile\r\nfrom abc import ABC, abstractmethod\r\n\r\nimport azure.functions as func\r\n\r\nclass IngestionServiceInterface(ABC):\r\n    @abstractmethod\r\n    async def ingest(self, blob: func.InputStream):\r\n        pass\r\n\r\nclass FileProcessor(ABC):\r\n    async def process_stream(self, blob: func.InputStream) -&gt; str:\r\n        \"\"\"\r\n        Process a file stream by saving it to a temporary file and returning the file path.\r\n        \"\"\"\r\n        logger = logging.getLogger(__name__)\r\n\r\n        try:\r\n            # Extract the original filename from the blob metadata\r\n            original_filename = os.path.basename(blob.name)\r\n\r\n            # Read the data from the stream\r\n            file_data = blob.read()\r\n\r\n            # Create a temporary file with the original filename and write the data to it\r\n            temp_dir = tempfile.gettempdir()\r\n            temporary_file_path = os.path.join(temp_dir, original_filename)\r\n\r\n            with open(temporary_file_path, \"wb\") as temporary_file:\r\n                temporary_file.write(file_data)\r\n\r\n            # Return the path to the temporary file\r\n            return temporary_file_path\r\n        except Exception as error:\r\n            logger.error(f\"An error occurred in process_stream: {str(error)}\", exc_info=True)\r\n            raise\r\n\r\n    @abstractmethod\r\n    async def process_file(self, *, file_path: str, index_name: str):\r\n        \"\"\"\r\n        Process the file at the given path and ingest it into the given index.\r\n        \"\"\"\r\n        pass<\/code><\/pre>\n<p><strong>IngestionServiceInterface<\/strong>: Defines an abstract interface that any ingestion service (e.g., PDF, Excel) must implement. This provides flexibility for adding new file types in the future.<\/p>\n<p><strong>FileProcessor<\/strong>: Provides methods for processing file streams and saving them to a temporary directory. It also abstracts away the logic for file path handling, which will be reused in different implementations.<\/p>\n<p><strong>PDF ingestion service (<code>.\/components\/PDFs\/ingest_service.py<\/code>):<\/strong><\/p>\n<p>This is a concrete implementation of the <code>IngestionServiceInterface<\/code> for processing PDF files. 
It uses the <code>PDFFileProcessor<\/code> to handle the file processing and ingestion.<\/p>\n<pre><code class=\"language-python\">import logging\r\n\r\nimport azure.functions as func\r\nfrom components.ingestion_service_interface import IngestionServiceInterface\r\nfrom components.PDFs.build_index import PDFFileProcessor\r\n\r\nlogger = logging.getLogger(__name__)\r\n\r\nclass PDFIngestionsService(IngestionServiceInterface):\r\n    async def ingest(self, myblob: func.InputStream):\r\n        blob_name = myblob.name.lower()\r\n\r\n        try:\r\n            # Process the blob stream\r\n            await self.process_blob(myblob)\r\n        except Exception as e:\r\n            logger.error(f\"Error ingesting PDF file '{blob_name}': {e}\", exc_info=True)\r\n            raise\r\n\r\n    async def process_blob(self, blob: func.InputStream):\r\n        file_processor = PDFFileProcessor()\r\n\r\n        try:\r\n            file_path = await file_processor.process_stream(blob)\r\n            await file_processor.process_file(file_path=file_path, index_name=\"ada002-sample-index\")\r\n        except Exception as e:\r\n            logger.error(f\"Error during data file processing: {e}\", exc_info=True)\r\n            raise<\/code><\/pre>\n<p><strong>PDFIngestionsService<\/strong>: This class implements the <code>IngestionServiceInterface<\/code> for PDF files. It handles blob ingestion and error logging.<\/p>\n<p><strong>process_blob<\/strong>: This method calls the <code>PDFFileProcessor<\/code> to process the file stream and then processes the file using a specified index in Azure AI Search.<\/p>\n<h3>Step 5: Logging and Local Deployment<\/h3>\n<p>At this point, the main focus is setting up <strong>logging<\/strong> capabilities within the <code>process_blob()<\/code> method. 
This ensures that errors are properly logged during ingestion, making it easier to debug issues during file processing.<\/p>\n<p><strong>To deploy<\/strong>, you can use the <strong>VSCode Azure Functions plugin<\/strong> to deploy the function from your local machine. This plugin provides a simple interface for deploying directly to Azure without needing additional scripts. Follow these steps:<\/p>\n<ol>\n<li>Install the <strong>Azure Functions Extension<\/strong> for VSCode.<\/li>\n<li>Open the project in VSCode and click on the Azure Functions icon in the sidebar.<\/li>\n<li>Log in to your Azure account if needed.<\/li>\n<li>Right-click on your Azure Function in the Functions Explorer and select <strong>Deploy to Function App<\/strong>.<\/li>\n<li>Choose your subscription and an existing or new function app for deployment.<\/li>\n<\/ol>\n<p>Once deployed, you can monitor the logs via Azure Portal or within VSCode&#8217;s Output window to ensure the function behaves as expected.<\/p>\n<p>An <strong>automated deployment script<\/strong> will be provided later in this guide to streamline the deployment process for CI\/CD pipelines.<\/p>\n<h3>Step 6: Deployment to Cloud<\/h3>\n<p>There are many ways to automate the deployment of your Azure Function app, and in this article, I\u2019ll demonstrate a sample deployment process using <strong>GitHub Actions<\/strong>.<\/p>\n<p>GitHub Actions allows you to automate the deployment pipeline, ensuring your Azure Function app is continuously delivered to the cloud whenever changes are made.<\/p>\n<p>Below is a sample GitHub Actions workflow for deploying your function app:<\/p>\n<pre><code class=\"language-yml\"># Documentation for the Azure Web Apps Deploy action: https:\/\/github.com\/azure\/functions-action\r\n# Continuous delivery using GitHub Actions: https:\/\/learn.microsoft.com\/en-us\/azure\/azure-functions\/functions-how-to-github-actions?tabs=linux%2Cpython&amp;pivots=method-manual\r\n\r\nname: Deploy document blob 
trigger\r\n\r\non:\r\n  workflow_dispatch: # Allows triggering the workflow manually\r\n\r\nenv:\r\n  PYTHON_VERSION: \"3.11\" # Set the Python version for the function app\r\n  FUNCTION_APP_DIR: \"data-ingestion-function\" # Directory where the function app is located\r\n  FUNCTION_ZIP_NAME: \"data-ingestion-function.zip\" # Name of the zip package for deployment\r\n\r\njobs:\r\n  deploy:\r\n    runs-on: ubuntu-latest\r\n\r\n    steps:\r\n      - name: Checkout repository\r\n        uses: actions\/checkout@v4\r\n\r\n      - name: Setup Python ${{ env.PYTHON_VERSION }} environment\r\n        uses: actions\/setup-python@v5\r\n        with:\r\n          python-version: ${{ env.PYTHON_VERSION }}\r\n\r\n      - name: Install project dependencies\r\n        working-directory: .\/${{ env.FUNCTION_APP_DIR }}\r\n        run: |\r\n          python -m pip install --upgrade pip\r\n          pip install -r requirements.txt --target=\".python_packages\/lib\/site-packages\"\r\n\r\n      - name: Az CLI login\r\n        uses: azure\/login@v2\r\n        with:\r\n          creds: ${{ secrets.AZURE_CREDENTIALS }}\r\n\r\n      # Set environment variables dynamically or statically here\r\n      # (this step is intentionally skipped in this example)\r\n\r\n      - name: Azure Functions deployment\r\n        uses: Azure\/functions-action@v1\r\n        with:\r\n          app-name: ${{ env.FUNCTION_APP_DIR }} # Azure Function App name\r\n          package: ${{ env.FUNCTION_APP_DIR }} # Directory containing the function code\r\n          scm-do-build-during-deployment: true # Enable build on deployment\r\n          enable-oryx-build: true # Enable Oryx build engine for deployment\r\n\r\n      - name: Set function configuration\r\n        run: |\r\n          az functionapp config appsettings set \\\r\n            --name \"${{ env.FUNCTION_APP_DIR }}\" \\\r\n            --resource-group \"SOME_RESOURCE_GROUP\" \\\r\n            --settings \\\r\n              
\"APPLICATION_INSIGHTS_CONNECTION_STRING=SOME\" \\\r\n              \"AZURE_OPENAI_API_VERSION=SOME\" \\\r\n              \"AZURE_OPENAI_ENDPOINT=SOME\" \\\r\n              \"AZURE_OPENAI_KEY=SOME\" \\\r\n              \"AZURE_SEARCH_ADMIN_KEY=SOME\" \\\r\n              \"AZURE_SEARCH_SERVICE_ENDPOINT=SOME\" \\\r\n              \"BLOB_STORAGE_DATA_CONNECTION_STRING=SOME\" \\\r\n              \"BLOB_STORAGE_DATA_CONTAINER_NAME=SOME\"<\/code><\/pre>\n<ul>\n<li><strong>Checkout the repository<\/strong>: The workflow starts by checking out the repository to access the code.<\/li>\n<li><strong>Set up Python<\/strong>: The specified Python version (<code>3.11<\/code> in this case) is set up using <code>setup-python<\/code>.<\/li>\n<li><strong>Install project dependencies<\/strong>: The workflow installs the required dependencies listed in the <code>requirements.txt<\/code> file into a <code>.python_packages<\/code> directory to be packaged with the function.<\/li>\n<li><strong>Login to Azure CLI<\/strong>: The <code>azure\/login@v2<\/code> action logs in to Azure using credentials stored in GitHub secrets (<code>AZURE_CREDENTIALS<\/code>).<\/li>\n<li><strong>Azure Functions deployment<\/strong>: The <code>Azure\/functions-action@v1<\/code> action deploys the function app to Azure. 
The workflow enables build during deployment by setting <code>scm-do-build-during-deployment<\/code> to <code>true<\/code>.<\/li>\n<li><strong>Set environment variables<\/strong>: The workflow configures environment variables for the function app using <code>az functionapp config appsettings set<\/code>.<\/li>\n<\/ul>\n<hr \/>\n<p>You can manage the environment variables either dynamically or statically, which is why this step is left out of the sample workflow:<\/p>\n<ul>\n<li><strong>Dynamic setup<\/strong>: Use Azure Key Vault to securely pull secrets at runtime.<\/li>\n<li><strong>Static setup<\/strong>: Set the environment variables manually within the workflow.<\/li>\n<\/ul>\n<p>Choose the method that best fits your workflow.<\/p>\n<hr \/>\n<p>When it runs, this workflow will:<\/p>\n<ul>\n<li>Set up the Python environment.<\/li>\n<li>Install dependencies.<\/li>\n<li>Deploy the function app.<\/li>\n<li>Configure the function app\u2019s environment settings.<\/li>\n<\/ul>\n<p>With this setup, your Azure Function will be deployed whenever you trigger the workflow manually, or automatically if you extend it to run on code pushes to the repository.<\/p>\n<h3>Step 7: Verify Your Deployed Azure Function<\/h3>\n<p>Now that your Azure Function is deployed, it&#8217;s time to test it to ensure everything is working correctly.<\/p>\n<ol>\n<li><strong>Upload a Test PDF<\/strong>: Upload a PDF file into the configured Azure Blob Storage container. 
This should trigger the Azure Function if everything is set up correctly.<\/li>\n<li><strong>Check Function Logs<\/strong>\n<ul>\n<li>Navigate to the <strong>Azure Portal<\/strong>.<\/li>\n<li>Open your <strong>Function App<\/strong> and go to the <strong>Logs<\/strong> tab under the <strong>Monitoring<\/strong> section.<\/li>\n<li>Monitor the logs to verify that the function has been triggered and processed the PDF successfully.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Troubleshooting<\/strong>:\n<ul>\n<li>If you don\u2019t see any logs or if the function isn\u2019t triggered, it might indicate an issue with the deployment or configuration.<\/li>\n<li>Double-check your deployed files, environment variables, and ensure that the correct blob storage connection string is set.<\/li>\n<li>You can also use <strong>Application Insights<\/strong> (if configured) to further analyze any errors or performance issues.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>Testing your function with a real PDF upload ensures that the entire pipeline\u2014from blob storage ingestion to logging\u2014functions as expected.<\/p>\n<h3>Step 8: Building the Index<\/h3>\n<p>To enable <strong>full-text search<\/strong> on the ingested PDF documents, you\u2019ll need to build an index in <strong>Azure AI Search<\/strong>. The index defines the structure and schema of the data, making it searchable. This is a critical step in enabling efficient searches through the content of your documents.<\/p>\n<p>We recommend using <strong>Azure Document Intelligence<\/strong> (formerly known as Form Recognizer) to extract and structure the content of your PDF documents, such as the title, author, and text content. 
Document Intelligence can parse complex document layouts, allowing you to extract key-value pairs, tables, and paragraphs for indexing.<\/p>\n<hr \/>\n<p>Here\u2019s a sample index structure for PDF documents, which can be customized based on the specific fields you want to extract:<\/p>\n<ul>\n<li><strong>Id<\/strong>: A unique identifier for each document (used as a primary key).<\/li>\n<li><strong>Title<\/strong>: <code>Edm.String<\/code> \u2013 The title of the document.<\/li>\n<li><strong>Author<\/strong>: <code>Edm.String<\/code> \u2013 The author of the document.<\/li>\n<li><strong>Content<\/strong>: <code>Edm.String<\/code> \u2013 The main searchable text content of the document.<\/li>\n<li><strong>File Size<\/strong>: <code>Edm.Int64<\/code> \u2013 The size of the file, useful for filtering or sorting large documents.<\/li>\n<\/ul>\n<hr \/>\n<p>Here\u2019s a code sample for the <code>.\/components\/PDFs\/build_index.py<\/code> file, which defines the index structure for your Azure AI Search service:<\/p>\n<pre><code class=\"language-python\">import asyncio\r\nimport base64\r\nimport logging\r\nimport os\r\nfrom dataclasses import asdict, dataclass\r\nfrom typing import List\r\n\r\nfrom azure.core.credentials import AzureKeyCredential\r\nfrom azure.search.documents.aio import SearchClient as AsyncSearchClient\r\nfrom azure.search.documents.indexes import SearchIndexClient\r\nfrom azure.search.documents.indexes.models import (\r\n    SearchIndex,\r\n    SearchableField,\r\n    VectorSearchProfile,\r\n    SearchFieldDataType,\r\n    SimpleField\r\n)\r\n\r\nimport pymupdf\r\nfrom components.logging import configure_logging\r\nfrom components.ingestion_service_interface import FileProcessor\r\n\r\n# Configure logging\r\nlogger = logging.getLogger(__name__)\r\n\r\n# Define constants for sleep times\r\nCHUNK_SLEEP_TIME = 2  # in seconds\r\nBATCH_SLEEP_TIME = 7  # in seconds\r\n\r\nclass AISearchHelper:\r\n    @classmethod\r\n    def create_from_env(cls, 
ai_search_index_name: str) -&gt; \"AISearchHelper\":\r\n        search_endpoint = os.environ[\"AZURE_SEARCH_SERVICE_ENDPOINT\"]\r\n        search_api_key = os.environ[\"AZURE_SEARCH_ADMIN_KEY\"]\r\n        ai_search_credential = AzureKeyCredential(search_api_key)\r\n\r\n        return cls(\r\n            ai_search_index_name=ai_search_index_name,\r\n            ai_search_endpoint=search_endpoint,\r\n            ai_search_credential=ai_search_credential,\r\n        )\r\n\r\n    def __init__(self, ai_search_index_name: str, ai_search_endpoint: str, ai_search_credential: AzureKeyCredential) -&gt; None:\r\n        self.ai_search_index_name = ai_search_index_name\r\n        self.ai_search_endpoint = ai_search_endpoint\r\n        self.ai_search_credential = ai_search_credential\r\n\r\n    def create_or_update_index(self, index_fields: List[any]):\r\n        index_client = SearchIndexClient(endpoint=self.ai_search_endpoint, credential=self.ai_search_credential)\r\n\r\n        # Define the fields for the search index\r\n        index_fields = [\r\n            SearchableField(name=\"id\", type=SearchFieldDataType.String, key=True, searchable=True, analyzer_name=\"keyword\"),\r\n            SimpleField(name=\"content\", type=SearchFieldDataType.String, searchable=True)\r\n        ]\r\n\r\n        # Create a simple search index with vector search\r\n        search_index = SearchIndex(\r\n            name=self.ai_search_index_name,\r\n            fields=index_fields,\r\n            vector_search=VectorSearchProfile(name=\"simple_vector\")\r\n        )\r\n\r\n        index_client.create_or_update_index(search_index)\r\n        index_client.close()\r\n\r\n    async def upload_content_to_index_async(self, content: List[dict]):\r\n        search_client = AsyncSearchClient(endpoint=self.ai_search_endpoint, index_name=self.ai_search_index_name, credential=self.ai_search_credential)\r\n        result = await search_client.upload_documents(documents=content)\r\n        await 
search_client.close()\r\n        return result\r\n\r\n@dataclass\r\nclass IndexData:\r\n    id: str\r\n    content: str\r\n\r\n# Subclass for PDF processing\r\nclass PDFFileProcessor(FileProcessor):\r\n\r\n        async def process_file(self, file_path: str, index_name: str):\r\n            configure_logging()\r\n\r\n            # Create search index\r\n            ai_search = AISearchHelper.create_from_env(ai_search_index_name=index_name)\r\n            ai_search.create_or_update_index()\r\n\r\n            # Load the PDF and process pages\r\n            doc = pymupdf.open(file_path)\r\n            tasks = []\r\n            for page in doc.pages():\r\n                page_content = page.get_text(sort=True)\r\n                # URL-safe base64 keeps the key within the characters Azure AI Search allows for document keys\r\n                page_id = base64.urlsafe_b64encode(f\"{file_path}{page.number}\".encode()).decode()\r\n                index_data = asdict(IndexData(id=page_id, content=page_content))\r\n\r\n                tasks.append(asyncio.create_task(ai_search.upload_content_to_index_async(content=[index_data])))\r\n\r\n                # Chunk-level control: Sleep for a defined period between uploads\r\n                await asyncio.sleep(CHUNK_SLEEP_TIME)\r\n\r\n                # Run tasks in batches to avoid memory overload\r\n                if len(tasks) &gt;= 10:  # BATCH_SIZE can be configured\r\n                    await self._execute_tasks(tasks)\r\n                    tasks = []  # Clear the task list after executing the batch\r\n\r\n                    # Batch-level control: Sleep between batch runs\r\n                    await asyncio.sleep(BATCH_SLEEP_TIME)\r\n\r\n            # Process any remaining tasks\r\n            if tasks:\r\n                await self._execute_tasks(tasks)\r\n\r\n            doc.close()\r\n\r\n        async def _execute_tasks(self, tasks: List[asyncio.Task]):\r\n            \"\"\"\r\n            Helper method to execute a list of asyncio tasks and handle errors.\r\n            \"\"\"\r\n            try:\r\n                await asyncio.gather(*tasks)\r\n            except 
Exception as e:\r\n                logger.error(f\"An error occurred during task execution: {e}\", exc_info=True)\r\n\r\n# Usage example in the Azure function\r\nasync def main():\r\n    processor = PDFFileProcessor()\r\n    await processor.process_file(file_path=\"example.pdf\", index_name=\"my-index\")\r\n\r\nif __name__ == \"__main__\":\r\n    asyncio.run(main())<\/code><\/pre>\n<p>Here are the key elements of the code:<\/p>\n<ul>\n<li><strong>SearchableField and SimpleField<\/strong>:\n<ul>\n<li><code>SearchableField<\/code> is used for fields that will be <strong>searched<\/strong> (e.g., <code>content<\/code>, <code>title<\/code>).<\/li>\n<li><code>SimpleField<\/code> is used for fields that are <strong>filterable<\/strong> or <strong>sortable<\/strong> but not necessarily searchable (e.g., <code>file_size<\/code>).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Id Field<\/strong>:\n<ul>\n<li>The <code>id<\/code> field is marked as the <strong>key<\/strong> and uniquely identifies each document; it uses the <code>keyword<\/code> analyzer so key values are matched exactly rather than tokenized.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Content Field<\/strong>:\n<ul>\n<li>The <code>content<\/code> field is the main body of the document and is marked as searchable with a <strong>Lucene analyzer<\/strong> (<code>standard.lucene<\/code>) to support full-text search capabilities.<\/li>\n<\/ul>\n<\/li>\n<li><strong>File Size Field<\/strong>:\n<ul>\n<li>The <code>file_size<\/code> field is useful for filtering or sorting documents based on their size.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<hr \/>\n<p>Once you\u2019ve defined and created the index, you can:<\/p>\n<ol>\n<li>Use <strong>Azure Document Intelligence<\/strong> to extract text and metadata from your PDF documents.<\/li>\n<li>Ingest the extracted content into <strong>Azure AI Search<\/strong> using the defined index structure.<\/li>\n<li>Run searches across the ingested documents, leveraging full-text search, filters, and sorting.<\/li>\n<\/ol>\n<p>This index will enable efficient document retrieval, allowing users to search for specific 
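content quickly.<\/p>\n<p>As an illustration of step 3, here is a minimal, non-authoritative sketch of how a full-text query could be composed against this index via the REST API; the service endpoint and index name below are placeholder values:<\/p>

```python
# Hypothetical sketch: build the URL and JSON body for a full-text query
# against the "id"/"content" index defined above. The endpoint and index name
# are illustrative placeholders -- substitute your own service values.
def build_search_request(endpoint: str, index_name: str, text: str, top: int = 5):
    """Return the URL and JSON body for a full-text search request."""
    url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2021-04-30-Preview"
    body = {"search": text, "select": "id,content", "top": top}
    return url, body

# POST the body (with an "api-key" header) using any HTTP client, e.g. requests.
url, body = build_search_request("https://myservice.search.windows.net", "my-index", "invoice total")
print(url)
print(body)
```

<p>Queries like this let users retrieve specific 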
content or metadata (such as title or author) in a large collection of documents.<\/p>\n<h2>Continuous Document Consistency and Monitoring<\/h2>\n<p>To ensure data consistency and continuously check whether all documents have been ingested properly (including modified or newly uploaded ones), you can create a <strong>Python script<\/strong>. This script will:<\/p>\n<ol>\n<li><strong>Check for missing documents.<\/strong><\/li>\n<li><strong>Optionally trigger re-ingestion<\/strong> if missing or outdated documents are detected.<\/li>\n<li><strong>Automate the process<\/strong> to run at regular intervals using Azure Functions, Azure Logic Apps, or any cron-based scheduler.<\/li>\n<\/ol>\n<p>Example of a Python script for checking document consistency:<\/p>\n<pre><code class=\"language-python\">import logging\r\nimport os\r\nimport requests\r\nfrom azure.storage.blob import BlobServiceClient\r\nfrom components.logging import configure_logging\r\n\r\nlogger = logging.getLogger(__name__)\r\n\r\nPDF_FOLDER = \"PDF\"\r\nEXPECTED_DOCUMENTS_AMOUNT = 50000\r\n\r\n# Folder to index and search field mapping\r\nfolder_to_index = {\r\n    PDF_FOLDER: {\"index_name\": \"Some\", \"search_field\": \"id\"},\r\n}\r\n\r\n# Get blob names in a specific folder\r\ndef get_blob_names(folder_path, blob_connection_string, container_name):\r\n    blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)\r\n    container_client = blob_service_client.get_container_client(container_name)\r\n    blob_list = container_client.list_blobs(name_starts_with=folder_path)\r\n    return {blob.name[len(folder_path) + 1:] for blob in blob_list}\r\n\r\n# Get document names from Azure AI Search\r\ndef get_document_names(index_name, search_field, service_endpoint, api_key):\r\n    search_url = f\"{service_endpoint}\/indexes\/{index_name}\/docs\/search?api-version=2021-04-30-Preview\"\r\n    headers = {\"Content-Type\": \"application\/json\", \"api-key\": api_key}\r\n    query = {\"search\": \"*\", \"searchFields\": 
search_field, \"select\": search_field, \"top\": EXPECTED_DOCUMENTS_AMOUNT}\r\n\r\n    response = requests.post(search_url, headers=headers, json=query)\r\n    response.raise_for_status()\r\n    results = response.json()\r\n    return {doc[search_field] for doc in results.get(\"value\", [])}\r\n\r\ndef run_check():\r\n    # Load environment variables\r\n    service_endpoint = os.getenv(\"AZURE_SEARCH_SERVICE_ENDPOINT\")\r\n    api_key = os.getenv(\"AZURE_SEARCH_ADMIN_KEY\")\r\n    blob_connection_string = os.getenv(\"BLOB_STORAGE_CONNECTION_STRING\")\r\n    container_name = os.getenv(\"BLOB_STORAGE_DOCS_CONTAINER_NAME\")\r\n\r\n    if not all([service_endpoint, api_key, blob_connection_string, container_name]):\r\n        raise ValueError(\"Please ensure all required environment variables are set.\")\r\n\r\n    for folder, settings in folder_to_index.items():\r\n        # Get blobs and document names\r\n        blob_names = get_blob_names(folder, blob_connection_string, container_name)\r\n        document_names = get_document_names(settings[\"index_name\"], settings[\"search_field\"], service_endpoint, api_key)\r\n\r\n        # Compare counts and find missing files\r\n        if len(blob_names) != len(document_names):\r\n            logger.warning(f\"Mismatch in {folder}: Blobs ({len(blob_names)}) vs. 
Documents ({len(document_names)})\")\r\n\r\n        missing_files = blob_names - document_names\r\n        if missing_files:\r\n            logger.warning(f\"Missing in AI Search for {folder}: {missing_files}\")\r\n        else:\r\n            logger.info(f\"All files in {folder} are indexed correctly.\")\r\n\r\nif __name__ == \"__main__\":\r\n    configure_logging()\r\n    run_check()<\/code><\/pre>\n<h2>Manual and Automated Document Ingestion<\/h2>\n<p>You can ingest documents both manually and automatically:<\/p>\n<ul>\n<li><strong>Manual Ingestion:<\/strong> Use a script or function trigger to push specific documents into the AI Search service.<\/li>\n<li><strong>Automated Ingestion:<\/strong> Set up an automated pipeline that checks for new\/modified documents and triggers the ingestion process.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>In this article, we&#8217;ve explored the crucial role that data ingestion plays in unlocking the full potential of AI-powered search, specifically in the context of Azure AI Search. We\u2019ve compared two approaches: the flexibility and independence of using <strong>Azure Functions<\/strong> for custom ingestion pipelines, versus the simplicity of <strong>pre-built Indexers<\/strong>. While Indexers offer a quick, out-of-the-box solution, Azure Functions provide greater control, scalability, and the ability to customize your pipeline to handle complex, real-time scenarios.<\/p>\n<p>For businesses dealing with diverse data sources, Azure Functions empower teams to integrate different types of content (PDFs, Excel files, HTML documents, etc.) into a searchable format, offering a tailored solution for advanced AI search. 
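<\/p>\n<p>Such a multi-format pipeline can be sketched as a small dispatch table; apart from <code>PDFFileProcessor<\/code>, the processor names below are hypothetical placeholders:<\/p>

```python
# Hypothetical dispatch table mapping file extensions to processor classes.
# Only PDFFileProcessor appears earlier in this article; the other names are
# placeholders for additional FileProcessor subclasses you might implement.
import os

PROCESSORS = {
    ".pdf": "PDFFileProcessor",
    ".xlsx": "ExcelFileProcessor",
    ".html": "HTMLFileProcessor",
}

def pick_processor(file_path: str) -> str:
    """Return the processor name registered for the file's extension."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in PROCESSORS:
        raise ValueError(f"No processor registered for '{ext}' files")
    return PROCESSORS[ext]

print(pick_processor("docs/report.PDF"))  # extension matching is case-insensitive
```

<p>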
Coupled with tools like GitHub Actions for deployment automation and the consistency checks provided by the ingestion pipeline, this approach ensures that your search service is not only comprehensive but also robust and scalable.<\/p>\n<p>Whether you\u2019re looking for simplicity or need a more advanced and customizable ingestion framework, Azure AI Search, combined with the right ingestion strategy, opens up new possibilities for transforming your data into actionable insights. Now it&#8217;s up to you to choose the best fit for your organization\u2019s needs and take full advantage of AI search capabilities.<\/p>\n<h2>Interesting Links<\/h2>\n<p>Here are a few useful resources to further explore:<\/p>\n<ul>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\">Azure AI Search Documentation<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-indexer-overview\">AI Search Indexers<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/azure-functions\">Azure Functions Documentation<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/storage\/blobs\">Azure Blob Storage Documentation<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This article compares Azure Functions with pre-built indexers for data ingestion in Azure AI Search, with a focus on using Azure Functions for a flexible, scalable approach. 
It explores key steps like data migration, index creation, and deployment automation.<\/p>\n","protected":false},"author":120373,"featured_media":16058,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[3579,3577,3548,77,3334,3582,3477,3581,3372,3580,3578],"class_list":["post-16048","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-ai-search","tag-azure-ai-indexes","tag-azure-ai-search","tag-azure-functions","tag-cicd","tag-cloud-automation","tag-data-ingestion","tag-data-migration","tag-github-actions","tag-indexing","tag-pre-built-indexers"],"acf":[],"blog_post_summary":"<p>This article compares Azure Functions with pre-built indexers for data ingestion in Azure AI Search, with a focus on using Azure Functions for a flexible, scalable approach. It explores key steps like data migration, index creation, and deployment 
automation.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16048","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/120373"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16048"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16048\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16058"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16048"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16048"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16048"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}