October 11th, 2024

Multimodal RAG with Vision: From Experimentation to Implementation

Introduction

In the rapidly evolving field of generative AI, retrieval-augmented generation (RAG) has emerged as a common pattern for enabling large language models to answer domain-specific user queries grounded in data retrieved from a document store. For many enterprise use cases, these documents contain both textual and image content, such as photographs, diagrams, or screenshots of web pages. This introduces the concept of multimodal RAG, which incorporates multiple input modalities in the context used by the LLM to answer questions and can be accomplished via a variety of design patterns.

As part of a recent project, our team addressed this scenario by following a pattern of multimodal RAG that utilizes a multimodal LLM such as GPT-4V or GPT-4o to effectively transform image content to a text format by generating detailed descriptions of each image. This conversion enables the text content and textual image descriptions to be stored in the same vector database space, which can then be retrieved and used as context for the LLM via the standard RAG pipeline flow. Our goal was to improve search precision and relevance while ensuring meaningful LLM-generated responses to user queries through detailed experimentation; this blog post shares our methods, insights, and practical findings from that experimentation process.

We recognize that our conclusions are based on a setup tailored to a specific customer scenario. Therefore, while we’re excited about our results and learnings, we encourage readers to conduct their own experiments to validate the applicability of our insights to their unique contexts. Join us as we delve into the intricacies of Azure AI Search, Azure AI Services, and Azure OpenAI Service and explore how multimodal RAG pipelines can be fine-tuned for optimal performance, especially in handling vision-enhanced queries.

We’ll start by walking through the end-to-end workflow and our experimentation methodology.

Experimentation Background

Our workflow consists of the following components: the ingestion flow, the enrichment flow (a sub-process of ingestion), and the inference flow. We discuss these in further detail below, along with additional information on our experimentation process, dataset, and metrics used for evaluation.

Ingestion Flow

To enhance the handling of user queries related to images, we decided to ingest image descriptions along with the text in source documents. Initially, we considered using multimodal embeddings such as CLIP, but we ruled these out due to their limitations in capturing detailed visual information. Additionally, most CLIP text encoders cap input at roughly 77 tokens. These embeddings typically offer only generic insights into objects and shapes without providing detailed explanations.

Instead, we opted for a multimodal large language model (MLLM) to interpret both textual and visual data. We focused on ensuring that all information in a document, including images, gets transformed into text by generating detailed image descriptions with the MLLM, enabling a more comprehensive ingestion process. This approach allows for a richer and more accurate understanding of images, leading to more effective responses to image-related queries than if the LLM had to rely solely on text data from the same document set. Note, however, that this offline pre-processing relies solely on the instructions included in the prompt to the multimodal LLM for description generation; it can’t capture every detail that a user may ask about, and, in some cases, the additional guiding context from a user query may be required to extract the relevant details from an image and formulate a proper response. This tradeoff between offline context-independent image processing and online context-aware image processing, as well as how these techniques can work in conjunction, is explored in the section below.

The ingestion process extracts both text and image data from source documents using a custom loader; image data is processed through the enrichment service developed by our team. The extracted data is then ingested into Azure AI Search, making it available for user queries. Below is a high-level overview of this flow.

[Figure: High-level overview of the ingestion flow]
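For illustration, here is a minimal sketch of this ingestion flow using the Azure OpenAI and Azure AI Search Python SDKs. The deployment names, index field names, and helper structure are assumptions for the sake of the example, not the project's actual implementation.

```python
# Simplified ingestion sketch: describe images with a multimodal LLM, embed text and
# image-description chunks, and upload everything to the same Azure AI Search index.
import base64
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-06-01",
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="docs-index",  # illustrative index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def describe_image(image_bytes: bytes, prompt: str) -> str:
    """Ask a multimodal deployment (e.g. GPT-4V or GPT-4o) for a detailed description."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    response = aoai.chat.completions.create(
        model="gpt-4o",  # your multimodal deployment name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def embed(text: str) -> list:
    return aoai.embeddings.create(model="text-embedding-ada-002", input=[text]).data[0].embedding

def ingest(doc_id: str, text_chunks: list, images: list, enrichment_prompt: str) -> None:
    """images is a list of (image_url, image_bytes) tuples extracted by the custom loader."""
    docs = []
    for i, chunk in enumerate(text_chunks):
        docs.append({"id": f"{doc_id}-text-{i}", "content": chunk, "contentVector": embed(chunk)})
    for j, (url, image_bytes) in enumerate(images):
        description = describe_image(image_bytes, enrichment_prompt)
        content = f"![{description}]({url})"
        docs.append({"id": f"{doc_id}-img-{j}", "content": content, "contentVector": embed(content)})
    search.upload_documents(documents=docs)
```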

Enrichment Flow

Image enrichment is a configurable subcomponent of the ingestion flow. As part of the enrichment process, images in the source document are replaced with MLLM-generated textual descriptions if they are deemed to be potentially relevant to the document subject matter. The image tagging functionality in Azure Computer Vision Image Analysis is used for this purpose – we don’t generate descriptions for images that are classified as a logo (based on a configurable confidence score threshold) or that don’t contain any text. This logic was developed based on the nature of our source documents and the types of images typically used to answer questions, with the purpose of reducing cost and latency during the ingestion process – the classifier implementation can and should be customized for your particular use case. The below diagram is a visual representation of this workflow; it also includes details about the caching mechanism we used based on similar cost/latency considerations.
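The sketch below illustrates this kind of classifier gate using the Azure AI Vision Image Analysis SDK. The "logo" tag name, the confidence threshold, and the no-text check are assumptions chosen to mirror the logic described above and should be tailored to your own image types.

```python
# Decide whether an image is worth enriching: skip images confidently tagged as logos
# and images that contain no readable text.
import os

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

vision = ImageAnalysisClient(
    endpoint=os.environ["AZURE_AI_VISION_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_VISION_KEY"]),
)

def should_enrich(image_bytes: bytes, logo_confidence_threshold: float = 0.8) -> bool:
    result = vision.analyze(
        image_data=image_bytes,
        visual_features=[VisualFeatures.TAGS, VisualFeatures.READ],
    )
    # Skip images tagged as a logo above the configured confidence threshold.
    for tag in result.tags.list:
        if tag.name == "logo" and tag.confidence >= logo_confidence_threshold:
            return False
    # Skip images with no readable text.
    has_text = result.read is not None and any(
        line.text.strip() for block in result.read.blocks for line in block.lines
    )
    return has_text
```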


Inference Flow

End users submit queries to receive meaningful responses from the AI system. The following flowchart illustrates the structured approach used to evaluate user queries. Upon receiving a query, the system utilizes Azure AI Search to retrieve initial chunks of relevant data. These data chunks can optionally undergo a reranking process to enhance search result precision. The final chunks are then passed as context to an Azure OpenAI LLM to generate an answer to the user query grounded in the relevant source information, resulting in the final LLM-generated response to the user’s query and related citations.
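Here is a minimal sketch of that retrieval-plus-generation flow using the Azure AI Search and Azure OpenAI Python SDKs; the index name, field names, deployment names, semantic configuration name, and system prompt are illustrative assumptions rather than the project's actual configuration.

```python
# Inference sketch: hybrid (vector + keyword) retrieval, optional semantic reranking,
# then grounding the LLM answer in the retrieved chunks.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-06-01",
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="docs-index",  # illustrative index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def answer(question: str, k: int = 10, use_semantic_ranker: bool = True) -> str:
    query_vector = aoai.embeddings.create(
        model="text-embedding-ada-002", input=[question]
    ).data[0].embedding
    results = search.search(
        search_text=question,
        vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=k, fields="contentVector")],
        query_type="semantic" if use_semantic_ranker else "simple",
        semantic_configuration_name="default" if use_semantic_ranker else None,
        top=k,
    )
    context = "\n\n".join(doc["content"] for doc in results)
    response = aoai.chat.completions.create(
        model="gpt-4o",  # inference deployment name
        messages=[
            {"role": "system", "content": "Answer using only the provided context and cite image URLs."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```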

For those interested in running RAG experiments for a similar vision-based use case, this Azure-Samples repository provides starter code and evaluation flows.

[Figure: RAG experimentation flow]


Experimentation Methodology

We conduct experiments by systematically testing different approaches, adjusting one configuration setting at a time and evaluating its impact against a predefined baseline. Performance is assessed using specific retrieval and generative metrics outlined below. A detailed analysis of these metrics informs our decision on whether to update the baseline with the new configuration or retain the existing one.

[Figure: Experimentation process]
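As a simple illustration of the promote-or-retain decision, the sketch below compares a candidate run's aggregate metrics against the baseline. The metric names, sample numbers, and majority-vote rule are assumptions for illustration only; in practice the decision also weighed latency, cost, and statistical significance.

```python
def compare_to_baseline(baseline: dict, candidate: dict,
                        higher_is_better=("source_recall@10", "img_recall@10_mean", "gpt_correctness_score_mean")) -> bool:
    """Print metric deltas and return True if the candidate improves a majority of them."""
    improved = 0
    for metric in higher_is_better:
        delta = candidate[metric] - baseline[metric]
        print(f"{metric}: baseline={baseline[metric]:.3f} candidate={candidate[metric]:.3f} delta={delta:+.3f}")
        improved += delta > 0
    # A real decision would also weigh latency, cost, and statistical significance.
    return improved > len(higher_is_better) / 2

# Example with illustrative numbers only:
baseline = {"source_recall@10": 0.82, "img_recall@10_mean": 0.74, "gpt_correctness_score_mean": 3.9}
candidate = {"source_recall@10": 0.87, "img_recall@10_mean": 0.78, "gpt_correctness_score_mean": 3.8}
print("promote new baseline:", compare_to_baseline(baseline, candidate))
```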

Q&A evaluation dataset

For accurate evaluation during experimentation, it’s crucial to curate a diverse set of question and answer pairs. These pairs should cover a range of articles, encompassing various data formats, lengths, and subject matters. This diversity ensures comprehensive testing and assessment, contributing to the reliability of the results and insights obtained. Here is a sample of the Q&A dataset that can be considered.

| ID | Document Title | Question | Text Answer | Image link | Type | Source |
|---|---|---|---|---|---|---|
| 1 | Ingestion Process | What are the steps of the ingestion flow in RAG | Chunk documents, enrich documents, embed chunks, and persist data | https://test.com/images/ingestion_steps.png | Vision | https://test.com/RAG/info |
| 2 | KNN | What types of datasets should exhaustive KNN be used for? | KNN may be used for small to medium datasets but not for large datasets. | https://test.com/images/datasets.png | Vision+Text | https://test.com/RAG/datasets |

We ensured our Q&A dataset had a balanced mix of questions derived from text alone, from both image and text, and from images alone. We also made sure the questions were distributed across various source documents.

Our evaluation set was relatively small, but we ensured its diversity by including a wide array of edge cases. Our analysis began with a thorough Exploratory Data Analysis (EDA), where we extracted features such as article length, table lengths, and table counts for text, along with image type, resolution, and counts for images. We then carefully distributed the evaluation set across these features to achieve comprehensive representation and robust coverage of the feature space. Additionally, the system supports alternative sources and images for the same question.

Evaluation Metrics

In our experimental setup, we focused on two primary types of evaluation metrics, outlined below, to assess the performance of Azure AI Search (retrieval metrics) and the LLM’s response to a user query (generative metrics):

Retrieval Metrics

| Metric | Explanation | Row-level metric value |
|---|---|---|
| source_recall@k % ** | Percentage of Q&A pairs for which at least one of the documents marked as a good source was found among the top k chunks | 1 if first source chunk index < k, else 0 |
| all_img_recall@k % ** | Percentage of Q&A pairs for which all expected images were successfully retrieved | 1 if all ground-truth images retrieved, else 0 |
| img_recall@k_mean | Mean image recall | Between 0 and 1 (recall of retrieved URLs in the top k chunks over expected URLs from ground truth) |
| img_recall@k_median | Median image recall | Between 0 and 1 (recall of retrieved URLs in the top k chunks over expected URLs from ground truth) |
| img_precision@k_mean | Mean image precision | Between 0 and 1 (precision of retrieved URLs in the top k chunks over expected URLs from ground truth) |
| img_precision@k_median | Median image precision | Between 0 and 1 (precision of retrieved URLs in the top k chunks over expected URLs from ground truth) |
| similarity_search_time_in_sec_mean | Mean AI Search chunk retrieval time in seconds | Time in seconds |
| similarity_search_time_in_sec_median | Median AI Search chunk retrieval time in seconds | Time in seconds |
| #_source_chunks_sum | The count of all ground-truth retrieved chunks for all Q&A pairs | Between 0 and k |
| #_img_chunks_sum | The count of all ground-truth retrieved image chunks for all Q&A pairs | Between 0 and k |

** k = 10, 5, 3
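To make the row-level definitions concrete, here is a small sketch of how these retrieval metrics can be computed for a single Q&A pair, assuming your evaluation harness supplies the rank of the first good source chunk and the image URLs referenced by the top k retrieved chunks (the function and parameter names are illustrative):

```python
from typing import Optional

def source_recall_at_k(first_source_chunk_rank: Optional[int], k: int) -> int:
    """1 if at least one chunk from a document marked as a good source is in the top k results."""
    return int(first_source_chunk_rank is not None and first_source_chunk_rank < k)

def img_recall_at_k(urls_in_top_k_chunks: list, expected_urls: list) -> float:
    """Recall of ground-truth image URLs over the URLs found in the top k retrieved chunks."""
    retrieved, expected = set(urls_in_top_k_chunks), set(expected_urls)
    return len(retrieved & expected) / len(expected) if expected else 1.0

def img_precision_at_k(urls_in_top_k_chunks: list, expected_urls: list) -> float:
    """Precision of the image URLs found in the top k retrieved chunks against the ground truth."""
    retrieved = set(urls_in_top_k_chunks)
    return len(retrieved & set(expected_urls)) / len(retrieved) if retrieved else 0.0
```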

Generative Metrics

| Metric | Explanation | Row-level metric value |
|---|---|---|
| all_cited_img_recall % | Percentage of ground-truth Q&A pairs for which all the expected images were cited by the LLM | 1 if all ground-truth images cited by the LLM, else 0 |
| cited_img_recall_mean | Mean cited image recall | Between 0 and 1 (recall of URLs in the generated answer over expected URLs in ground truth) |
| cited_img_recall_median | Median cited image recall | Between 0 and 1 (recall of URLs in the generated answer over expected URLs in ground truth) |
| cited_img_precision_mean | Mean cited image precision | Between 0 and 1 (precision of URLs in the generated answer over expected URLs in ground truth) |
| cited_img_precision_median | Median cited image precision | Between 0 and 1 (precision of URLs in the generated answer over expected URLs in ground truth) |
| cited_img_f1_mean | Mean cited image F1 | F1 = 2 × cited_img_recall × cited_img_precision / (cited_img_recall + cited_img_precision) |
| cited_img_f1_median | Median cited image F1 | F1 = 2 × cited_img_recall × cited_img_precision / (cited_img_recall + cited_img_precision) |
| chat_query_time_in_sec_mean | Mean end-to-end response time in seconds | Time in seconds |
| chat_query_time_in_sec_median | Median end-to-end response time in seconds | Time in seconds |
| inference_prompt_tokens_sum | Input tokens to the LLM at inference | inference_prompt_tokens |
| inference_completion_tokens_sum | Output tokens used by the LLM to answer | inference_completion_tokens |
| vision_prompt_tokens_sum | Input tokens to the enrichment service | vision_prompt_tokens |
| vision_completion_tokens_sum | Output tokens generated by the enrichment service | vision_completion_tokens |
| gpt_correctness_score>3 % | Percentage of ground-truth Q&A pairs for which the correctness metric score is above 3 | 1 if gpt_correctness_score > 3, else 0 |
| gpt_correctness_score_mean | Mean correctness score | gpt_correctness_score (1-5) |
| gpt_correctness_score_median | Median correctness score | gpt_correctness_score (1-5) |

The previously described metrics provide valuable insights into the effectiveness of each part of the system. It is essential to measure the retrieval capability of the system and the generative component separately, so you can understand the impact of each experiment on each component.
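The cited-image metrics can be computed in a similar way by extracting image URLs from the generated answer and comparing them against the ground truth. The regex and file extensions below are assumptions for illustration:

```python
import re

IMAGE_URL_PATTERN = re.compile(r"https?://\S+?\.(?:png|jpe?g|gif|svg)", re.IGNORECASE)

def cited_image_metrics(generated_answer: str, expected_urls: list) -> dict:
    """Row-level cited image recall, precision, and F1 for a single Q&A pair."""
    cited = set(IMAGE_URL_PATTERN.findall(generated_answer))
    expected = set(expected_urls)
    recall = len(cited & expected) / len(expected) if expected else 1.0
    precision = len(cited & expected) / len(cited) if cited else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return {"cited_img_recall": recall, "cited_img_precision": precision, "cited_img_f1": f1}
```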

Learnings

Our goal throughout our customer engagement was to boost image-based query performance while maintaining the efficiency of text-based queries. We set out to find features that could achieve this balance and tested our ideas through rigorous experimentation.

Let’s dive into the key insights from our experiments and see how they’ve guided our approach to optimizing both image and text query performance.

Prompt Engineering

Our journey began with the creation of two specialized prompts: one for ingestion and another for inference. This approach was designed to accurately extract image descriptions and enhance query responses.

Prompt for Image Enrichment

For the image enrichment prompt, we conducted a thorough analysis of the source images to understand their content and the typical categories of images seen in our document store. Based on this categorization, we tailored the prompt to ensure the LLM focused on the relevant information that each image type contained. This method ensured that our system could handle a diverse range of images and deliver the precise responses required.

Below is a sample of the prompt we used for image description extraction in the ingestion pipeline. While this prompt was specifically tailored to our use case, it offers an idea of how to address different image types. Depending on the images in the source data, you may need to adjust the prompt to ensure its effectiveness.

You are an assistant whose job is to provide the explanation of images which is going to be used to retrieve the images. Follow the below instructions:

  • If the image contains any bubble tip then explain ONLY the bubble tip.
  • If the image is an equipment diagram then explain all of the equipment and their connections in detail.
  • If the image contains a table, try to extract the information in a structured way.
  • If the image shows a device or product, try to describe that device with all the details related to shape and text on the device.
  • If the image contains a screenshot, try to explain the highlighted steps. Ignore the exact text in the image if it is just an example for the steps and focus on the steps.
  • Otherwise, explain comprehensively the most important items with all the details on the image.

Prompt for Inference

We also developed a specific prompt for generating responses to user queries, tailored to meet customer requirements and ensure relevant and accurate answers. This prompt instructs the language model to return cited images, allowing us to gather citation metrics and assess the quality of the responses.

Below is a sample of the prompt we used for generating responses to user queries. As with the ingestion prompt, this sample may need to be adjusted based on the expected responses to ensure its effectiveness.

You are a helpful AI assistant whose primary goal is to help technicians who maintain and improve the company’s communication infrastructure. According to:\n Context: {context}, \n what is the answer to the \n Question:{question}. Provide step by step instructions if it is a procedural question. Do not attempt to answer if the Context provided is empty. Ask them to elaborate the question instead. The output MUST be in a json format where one attribute will be the answer and the other one will be the image_citation which are the image urls that might be useful as a reference to the user. In the context, all image urls will be in this format: (image url). Example of the output format: ‘answer’: ‘Yes, you can replace the cable’, ‘image_citation’: image url

One challenge we faced was that the model did not always return parsable JSON. When the JSON was malformed, we were still able to extract the URLs to use for citation metrics. We did not invest much effort in improving the JSON formatting because it was not a customer requirement, but it can be addressed with approaches such as Structured Outputs in the OpenAI API.
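As an illustration, a tolerant parser along the lines of the sketch below can keep citation metrics flowing even when the JSON is malformed; the exact fallback we used differed, and the regex here is an assumption.

```python
import json
import re

IMAGE_URL_PATTERN = re.compile(r"https?://\S+?\.(?:png|jpe?g|gif|svg)", re.IGNORECASE)

def parse_llm_output(raw: str) -> dict:
    """Try to parse the expected JSON; fall back to scraping image URLs for citation metrics."""
    try:
        parsed = json.loads(raw)
        return {"answer": parsed.get("answer", raw),
                "image_citation": parsed.get("image_citation", [])}
    except (json.JSONDecodeError, AttributeError):
        # Malformed or non-dict output: keep the raw text and recover any image URLs.
        return {"answer": raw, "image_citation": IMAGE_URL_PATTERN.findall(raw)}
```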

Impact of Metadata

As discussed above, one of the goals of our work was to optimize retrieval performance: we want to identify the most relevant content from the document store that will help the LLM accurately answer user queries. In addition to the raw document content, we can also leverage the available document-level metadata – which often includes fields such as the document’s title, author, date created, summary, keywords, and intended audience – to improve the relevance of search results using Azure AI Search. The stringified metadata content can be included in the search index to be queried, and specific fields can be added to the configuration of Azure AI Search’s semantic ranking feature to further improve search performance by promoting the most semantically relevant results.
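For illustration, the sketch below defines an index that stores stringified metadata alongside the content and promotes metadata fields in the semantic ranking configuration. The field names, configuration names, and vector dimensions are assumptions, not the project's actual schema.

```python
# Illustrative Azure AI Search index with metadata fields prioritized for semantic ranking.
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SemanticConfiguration, SemanticField, SemanticPrioritizedFields,
    SemanticSearch, SimpleField, VectorSearch, VectorSearchProfile,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="metadata", type=SearchFieldDataType.String),  # stringified metadata
    SearchableField(name="keywords", type=SearchFieldDataType.String, collection=True),
    SearchField(
        name="contentVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="default-profile",
    ),
]

semantic_config = SemanticConfiguration(
    name="default",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="content"), SemanticField(field_name="metadata")],
        keywords_fields=[SemanticField(field_name="keywords")],
    ),
)

index = SearchIndex(
    name="docs-index",
    fields=fields,
    semantic_search=SemanticSearch(configurations=[semantic_config]),
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
        profiles=[VectorSearchProfile(name="default-profile", algorithm_configuration_name="hnsw")],
    ),
)
```

Passing this definition to a SearchIndexClient's create_or_update_index call creates the index, and the semantic configuration name is then referenced at query time.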

For these reasons, we conducted an experiment to quantify the benefits of including document-level metadata on retrieval in our use case, with the following results:

[Table: Metadata experiment results]

From this table, we can see that integrating structured metadata alongside the unstructured content during the ingestion process resulted in a statistically significant improvement in source recall performance, so we chose to utilize this approach going forward.

Separate Image chunks vs Inline chunks

When it comes to ingesting image descriptions alongside text in source documents, how those image annotations are included plays a crucial role in how effectively the information is utilized during searches. We annotated image data using the format below and observed that it worked well for our use case:

![image description](image URL)

We then faced a decision: should we store each image description as a separate chunk, with the main text chunk containing only the URL reference, or should we integrate the image description into the main text and chunk them together? To find out, we tried both approaches, aiming to discover the best method in terms of retrieval and generative metrics.

Let’s first look at what the inline and separate image chunk data look like, using a sample image extracted from the Azure documentation here.

[Figure: Sample image from the Azure documentation (storage, vector index size, and index usage charts)]

The image description extracted by the enrichment service for this sample image will look like the example below:

The image displays a set of three circular progress charts, each representing a different type of data metric related to storage and indexing.\n\n1. The first chart on the left is labeled \”Storage\” and shows a progress of 1.8%. Below the chart, there are two rows of information: the current usage is listed as \”459.63 MB,\” and the quota is \”25 GB.\” There is a link or button labeled \”View scale\” beneath these rows.\n\n2. The middle chart is labeled \”Vector index size\” and indicates a progress of 3%. The data below this chart shows the current usage as \”93 MB\” and the quota as \”3 GB.\” Similar to the first chart, there is a link or button labeled \”View indexes.\”\n\n3. The third chart on the right is labeled \”Indexes\” and shows a much larger progress of 48%. The current count is \”24,\” and the quota is \”50.\” This chart also has a \”View indexes\” link or button below the data.\n\nEach chart has a colored segment indicating the used percentage, with the remaining chart area in grey representing the unused portion of the quota. The colors of the segments are not visible in the OCR text but can be seen in the image: blue for \”Storage,\” light blue for \”Vector index size,\” and dark blue for \”Indexes.\”

Here’s a brief inline chunk example for a chunk of text with 2100 characters, considering a chunk size of 510 tokens for our samples:

*The following screenshot shows an S1 service configured with one partition and one replica. This particular service has 24 small indexes, with one vector field on average, each field consisting of 1536 embeddings. The second tile shows the quota and usage for vector indexes. A vector index is an internal data structure created for each vector field. As such, storage for vector indexes is always a fraction of the storage used by the index overall. Other nonvector fields and data structures consume the rest.![The image displays a set of three circular progress charts, each representing a different type of data metric related to storage and indexing.\n\n1. The first chart on the left is labeled \”Storage\” and shows a progress of 1.8%. Below the chart, there are two rows of information: the current usage is listed as \”459.63 MB,\” and the quota is \”25 GB.\” There is a link or button labeled \”View scale\” beneath these rows.\n\n2. The middle chart is labeled \”Vector index size\” and indicates a progress of 3%. The data below this chart shows the current usage as \”93 MB\” and the quota as \”3 GB.\” Similar to the first chart, there is a link or button labeled \”View indexes.\”\n\n3. The third chart on the right is labeled \”Indexes\” and shows a much larger progress of 48%. The current count is \”24,\” and the quota is \”50.\” This chart also has a \”View indexes\” link or button below the data.\n\nEach chart has a colored segment indicating the used percentage, with the remaining chart area in grey representing the unused portion of the quota. The colors of the segments are not visible in the OCR text but can be seen in the image: blue for \”Storage,\” light blue for \”Vector index size,\” and dark blue for \”Indexes.\”](https://learn.microsoft.com/en-us/azure/search/vector-store/***.png) Vector index limits and estimations are covered in another article, but two points to emphasize up front is that maximum storage varies by service tier, and also by when the search service was created. Newer same-tier services have significantly more capacity for vector indexes. For these reasons, take the following actions:*

With the separate chunk approach, the image description follows the annotation format defined above and is stored in its own chunk, while the actual text chunk includes only the image URL, as shown below:

*The following screenshot shows an S1 service configured with one partition and one replica. This particular service has 24 small indexes, with one vector field on average, each field consisting of 1536 embeddings. The second tile shows the quota and usage for vector indexes. A vector index is an internal data structure created for each vector field. As such, storage for vector indexes is always a fraction of the storage used by the index overall. Other nonvector fields and data structures consume the rest.(https://learn.microsoft.com/en-us/azure/search/vector-store/***.png) Vector index limits and estimations are covered in another article, but two points to emphasize up front is that maximum storage varies by service tier, and also by when the search service was created. Newer same-tier services have significantly more capacity for vector indexes. For these reasons, take the following actions:*
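The two strategies can be sketched roughly as follows; the helper names are illustrative, and the real loader also handles chunk-size budgeting:

```python
def inline_annotation(text_before: str, description: str, image_url: str, text_after: str) -> list:
    """Inline strategy: the image description is embedded directly in the main text before chunking."""
    return [f"{text_before}![{description}]({image_url}) {text_after}"]

def separate_annotation(text_before: str, description: str, image_url: str, text_after: str) -> list:
    """Separate strategy: the main text keeps only the URL reference; the description gets its own chunk."""
    main_chunk = f"{text_before}({image_url}) {text_after}"
    image_chunk = f"![{description}]({image_url})"
    return [main_chunk, image_chunk]
```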

Now let’s look at the experiment runs. Below is a high-level summary of the results:

[Table: Separate vs. inline image chunk experiment results]

Based on our experimental runs, we observed that using separate chunks for image annotations resulted in statistically significant improvements in both source document and image retrieval metrics, with no change in search latency. In particular, for vision-related queries, separate chunks proved to be more effective at accurately retrieving the correct source document as well as the correct source images; in addition to improving the context passed to the LLM, this would also benefit citation accuracy if the application UI chose to include citations in the response to users.

Curious to explore the configuration of this loader and chunk structure further, we tested a few other hypotheses. The first was whether including these image annotations as metadata on the source document chunks would improve retrieval quality; however, we didn’t observe a statistically significant improvement in retrieval accuracy with this configuration. Another experiment examined whether we could improve the quality of generated answers by augmenting the context passed to the LLM at inference time with any image information missing from the chunks retrieved from AI Search. However, this approach did not improve image citation quality and increased overall token usage, so we chose not to proceed with this implementation.

Impact of Classifier

First, let’s understand the classifier component of the enrichment service. We introduced a classifier layer to differentiate between images, ensuring that only relevant images are processed by the enrichment service. This was implemented using the Azure AI Vision tag endpoint. By configuring the tags and confidence levels based on the customer’s specific interests and the types of images included in the document store for the use case, we were able to effectively categorize the images and streamline the processing workflow. This approach was designed to enhance the efficiency of data ingestion.

In our customer’s use case, the source documents contained a variety of images, including application logos and images with minimal text. To optimize data processing, these less informative images were excluded from detailed extraction. Instead, only the image URL references were added to the chunk data, ensuring a streamlined and efficient handling of the documents.

We set out to evaluate the impact of excluding less informative images, such as logos and abstract visuals, to reduce noise and improve overall system performance. Specifically, we aimed to ensure that this exclusion did not adversely affect recall while simultaneously improving ingestion latency.

Below, we present a high-level overview of the experimental results that explored this hypothesis.

[Table: Classifier experiment results]

The introduction of a classifier significantly optimized our data ingestion process. We achieved statistically similar results for both source and image recall while substantially reducing ingestion time. By filtering out less informative images, we streamlined the ingestion pipeline, enhancing system efficiency without compromising retrieval accuracy.

For our customer use case, our analysis strongly supported enabling the classifier with a threshold of 0.8, even though the latency was slightly higher than at 0.9. This configuration effectively filters out non-essential images, reducing potential noise and focusing on generating descriptions for the most relevant images.

Impact of including surrounding text in image enrichment

End users reviewing standalone GPT-4V-generated descriptions noted a recurring issue: key details like app and device names—often found in titles or captions preceding images—were frequently missing. This observation prompted a crucial question:

Can including surrounding text in a document help GPT-4V provide better descriptions of images, thus enhancing search results and chat responses?

To enhance image descriptions, we can dynamically extract the text before and after an image in a document using a custom loader and include it as context for the multimodal LLM during the enrichment process, with the aim of generating better image descriptions. For this experiment, we updated the ingestion prompt to include the following additional text at the end:

The image is taken from a document and we are going to provide part of the previous and next text around the image. Use that context to provide more information about the image if it is useful. Previous text:\n (N characters back) Next text:\n (M characters next)
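A sketch of this context extraction is shown below, assuming the custom loader can report the character offset of each image in the document text; the function and parameter names are illustrative.

```python
def prompt_with_surrounding_text(base_prompt: str, document_text: str, image_offset: int,
                                 n_back: int = 600, m_next: int = 300) -> str:
    """Append N characters before and M characters after the image to the enrichment prompt."""
    previous_text = document_text[max(0, image_offset - n_back):image_offset]
    next_text = document_text[image_offset:image_offset + m_next]
    return (
        f"{base_prompt}\n"
        "The image is taken from a document and we are going to provide part of the previous "
        "and next text around the image. Use that context to provide more information about "
        f"the image if it is useful. Previous text:\n{previous_text}\nNext text:\n{next_text}"
    )
```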

Let’s return to the sample image we used previously, but now let’s consider it in the context of the text that surrounds it:

[Figure: Sample image shown with its surrounding text]

Per the parameters of our experiment, we updated the image enrichment prompt to pass the following surrounding text to the MLLM (N=600, M=300):

The image is taken from a document and we are going to provide part of the previous and next text around the image. Use that context to capture the screenshot app or the device name. Previous text: <The paragraph of text before the image, from the screen shot above> Next text: <The paragraph of text after the image, from the screen shot above>

We observed that after making this change, the resulting image description was largely similar but included an additional sentence at the beginning that specifically called out the name of the application the screenshot was from:

The image appears to be a screenshot from the Azure portal, highlighting the configuration and usage metrics of an S1 search service.

The initial results were promising, so we proceeded with a full experiment run. Below, we present a high-level overview of the experimental results that explored whether surrounding text helps with retrieval metrics:

[Table: Surrounding-text experiment results]

Our investigation into enhancing image descriptions by including surrounding text yielded some valuable insights. While the quality of image descriptions improved with the added context, the impact on retrieval metrics was not as significant as anticipated.

On further analysis, we found that the surrounding text in the source files often repeated information already present in the images, making the images more of a reference rather than a unique data source. Moreover, URLs within the documents frequently contained essential details, such as device or tool names, which were crucial for effective information retrieval. The Ground Truth dataset also had a limited number of vision-only queries that required extra context, further limiting the effect of the improved descriptions.

In summary, while surrounding text did enhance the quality of image descriptions, its impact on retrieval metrics was limited due to content redundancy and constraints within the dataset. However, this feature could prove valuable in scenarios where images contain standalone information and the surrounding text provides additional context to the image description. To fully justify this approach, further experiments with different source data would be necessary to validate its effectiveness.

Inference with Multimodal LLM

We are comparing two approaches for handling images in our system. The first approach involves enrichment at ingestion, where we extract image descriptions using a multimodal LLM and store those descriptions in a vector database, so they can be retrieved and used during a chat or search. The second approach involves enrichment at inference, where we send the bytes of the top 10 retrieved images, along with the retrieved chunks, to a multimodal LLM at query time; the LLM processes the images and text together to generate a response.

Our goal was to compare two different approaches:

  1. Enrichment during ingestion and inference by sending only the retrieved text.
  2. Skipping initial enrichment and instead performing inference via multimodal LLM, which involves sending both text and relevant images.
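A sketch of the second approach is shown below; the deployment name and prompt wording are illustrative assumptions, while the ten-image cap mirrors the approach described above.

```python
# Enrichment at inference: send retrieved text chunks plus the bytes of the top-ranked
# images directly to a multimodal LLM at query time.
import base64
import os

from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-06-01",
)

def answer_with_images(question: str, text_chunks: list, images: list, max_images: int = 10) -> str:
    """images is a list of raw image bytes for the top-ranked retrieved images."""
    content = [{"type": "text",
                "text": "Context:\n" + "\n\n".join(text_chunks) + f"\n\nQuestion: {question}"}]
    for image_bytes in images[:max_images]:
        data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
        content.append({"type": "image_url", "image_url": {"url": data_url}})
    response = aoai.chat.completions.create(
        model="gpt-4o",  # multimodal deployment name
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```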

[Figure: Inference with a multimodal LLM]

Below, we present a high-level overview of the experimental results:

[Table: Results for multimodal LLM at inference without enrichment at ingestion]

As expected, we observed a decline in source and image retrieval metrics, which also affected the cited image metrics. Understanding the images is essential for retrieval, especially for image-only questions. That said, enabling enrichment during inference with a multimodal LLM (sending text plus image bytes) can be advantageous as a stopgap solution when managing a large corpus of images, particularly until the ingestion process is fully completed.

Our next step was to explore the benefits of combining enrichment at ingestion with Multimodal LLM during inference, and evaluate how much we gain by doing both.

[Table: Results for multimodal LLM at inference combined with enrichment at ingestion]

We observed an improvement in correctness scores, with answers becoming more precise in vision-only and some vision-text queries. However, the additional 6-second latency introduced by enabling Multimodal LLM may not justify its use, especially when valid images are already being retrieved.

GPT-3.5 vs GPT-4 vs GPT-4o for inference

To determine the optimal GPT model for our customer, we explored various aspects such as generative metrics, cost, and latency. The main focus was to understand the impact of changing the GPT model from multiple perspectives, including cited image precision, recall, F1 scores, and GPT correctness scores.

We began by comparing GPT-3.5 and GPT-4 for inference, specifically to see if GPT-4-32k could better handle responses from retrieved image enrichment chunks.

[Table: GPT-3.5 vs. GPT-4 inference results]

The results showed that GPT-4, with its 32K context size, does better at citing images than GPT-3.5, which has a 16K context size. The correctness score also improved. Despite GPT-4 being slower and more costly, it provided much better performance on generated image citations and vision-based questions, indicating more accurate and complete answers.

We switched to GPT-4-32K for the improved performance. Later, when GPT-4o was introduced at a lower cost, we wanted to evaluate whether it would be as performant for our use case.

[Table: GPT-4o vs. GPT-4 inference results]

GPT-4o, with its 128K context size, showed significant improvements in image citation recall and correctness scores but slightly reduced image precision. The differences were due to GPT-4o providing inline references to URLs, highlighting keywords, and offering more detailed responses, sometimes with more URL references than necessary. Despite the slight drop in precision, these detailed answers were not harmful and could be useful, especially since top search image links, not LLM citations, are displayed in the customer portals.

While GPT-4o used more completion tokens, running GPT-4-32K was still over six times more expensive. As expected, GPT-4o also performed better in terms of latency.

Based on these results, GPT-4o appears to be the right choice for non-vision inference, offering improvements in quality, speed, and cost.

GPT-4V vs GPT-4o for Enrichment

Our experiments aimed to determine whether the newer GPT-4o could replace the original GPT-4V (vision-preview) for document ingestion.

[Table: GPT-4o vs. GPT-4V enrichment results]

The results indicated that GPT-4o was not suitable for this customer, as it performed worse than GPT-4V on several search metrics, leading to worse cited image recall. Despite GPT-4o’s lower cost, the performance degradation was not worth the cost benefit.

GPT-4V (vision-preview) performs better at generating image summaries, improving recall metrics. However, GPT-4o does better at answering questions. Thus, we recommend using GPT-4V for enrichment and GPT-4o for inference to leverage their respective strengths for our customer use case.

Final Conclusion

Our journey with the customer involved a deep dive into optimizing their RAG application, with a particular focus on answering questions by incorporating both text and image information. The insights we’ve shared are just a glimpse of the extensive work we undertook, aimed at fine-tuning image processing and retrieval.

It all started with understanding the data, the types of questions actual end users would ask, and the answers that would help those users do their jobs efficiently. Before we began experimenting, we ensured a solid foundation by developing a robust ground truth dataset and building a configurable experimentation framework that let us run experiments and persist the results for shared reference and analysis. This preparation allowed us to methodically test and analyze various strategies, ultimately leading to these learnings. It is important to remember that, like any product, the solution will need to be adjusted on an ongoing basis, learning from the feedback provided by real users in production as well as from evolving technologies and models.

We hope this blog post highlights the importance of a solid experimental foundation and provides useful insights for anyone working on RAG applications focused on images. Our experiences underscore the significance of tailored solutions and thorough testing to achieve the best outcomes for specific use cases.

Check out the RAG with Vision repository, which provides an application framework for a Python-based retrieval-augmented generation (RAG) pipeline that can utilize both textual and image content from MHTML documents to answer user queries, along with a sample evaluation framework and a detailed guide.


Acknowledgements

We’d like to give thanks to the other authors who helped craft this post: Soubhi Hadri, Hemavathy Alaganandam, Paul Butler, Mahdi Setayesh, and Mona Soliman Habib.

The feature image was generated using Bing Image Creator. Terms can be found here.