Enhancing LLM Capabilities: The Power of Multimodal LLMs and RAG
Author(s): Sunil Rao
Originally published on Towards AI.

Building upon foundational concepts from my previous articles on LLMs, Retrieval-Augmented Generation (RAG), and advanced RAG techniques, this article ventures into the next frontier: Multimodal LLMs.
We’ll begin by demystifying their core principles and exploring prominent models in the field, before delving into the powerful combination of Multimodal LLM RAG and crucial evaluation metrics.
Traditional LLM RAG was groundbreaking but faced fundamental limitations because it was almost exclusively text-centric.
- LLMs with RAG could read any text, but lacked perception of the physical world. They couldn’t directly interpret images, audio, or video.
- Many real-world problems and queries involve implicit visual or auditory information. If a user asks “What’s wrong here?” while providing a picture of a broken machine, a text-only RAG would only be able to process the words, not the visual evidence.
- A huge amount of valuable information exists outside of text (e.g., medical scans, engineering diagrams, surveillance footage, animal sounds). Text-only RAG couldn’t directly access, understand, or retrieve from these rich, non-textual data sources.
- To use non-textual data with text-only RAG, you’d have to first convert it into text (e.g., describe an image, transcribe audio). This conversion often loses crucial nuances and details that are inherent to the original modality.
Multimodal LLMs solved these issues by giving the AI “senses.”
- They integrate specialized encoders (e.g., for images, audio, video) that transform non-textual data into numerical representations (embeddings) that the LLM can understand and process.
- A fusion layer then aligns and combines these multimodal embeddings with text embeddings, creating a unified understanding of the entire context.
What is a Multimodal LLM?
A Multimodal Large Language Model (MLLM) is an advanced AI model that can process and generate information across multiple data modalities. Unlike traditional LLMs that primarily work with text, MLLMs can understand and reason about various types of data, such as:
- Text: Written language.
- Images: Visual data.
- Audio: Sound and speech.
- Video: Sequences of images and audio.
- And potentially other modalities like sensor data, thermal images, etc.
The key difference from unimodal LLMs is the ability to jointly understand and reason across these different forms of input and sometimes produce output in multiple modalities as well.
For example, an MLLM could take an image and a question in text as input and provide a textual answer about the image. Some advanced MLLMs can even generate images from text or provide textual descriptions of videos.
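To make the "image plus question" input format concrete, here is a minimal sketch of how such a request is typically assembled. The JSON layout follows the OpenAI-style chat schema (`"type": "text"` / `"image_url"` content parts); other providers use different field names, so treat the exact keys as an assumption.

```python
import base64
import tempfile

def build_multimodal_message(question: str, image_path: str) -> dict:
    """Build one user message mixing a text part and an image part.
    Field names follow the OpenAI-style chat format; other providers
    use different schemas, so this layout is an assumption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }

# Demo with a placeholder file standing in for a real photo.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\xff\xd8\xff\xe0 not a real photo")
msg = build_multimodal_message("What's wrong with this bike chain?", tmp.name)
print(msg["content"][0]["type"], msg["content"][1]["type"])  # text image_url
```

The single message carries both modalities, so the model sees the question and the pixels in one context, exactly the "show and tell" interaction described above.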
A user becomes dramatically more effective with a Multimodal LLM because they can communicate naturally, just as they would with a human expert who possesses multiple senses. The barrier between different information types is removed.
Ex: Suppose your kid’s bike chain keeps falling off and you’re not a bike mechanic.
Scenario 1: Using a Traditional Text-Only LLM:
- You: “My kid’s bike chain keeps falling off. How do I fix it?”
- LLM: Provides generic text instructions for various reasons a chain might fall off (e.g., derailleur adjustment, bent chainring, worn chain). It might ask clarifying questions like, “Is it a single-speed or geared bike? Is the chain loose?” that you might struggle to answer precisely with just words.
- You’re struggling to translate what you see into words the LLM understands, and the LLM can’t see what you see. You might follow wrong instructions or get stuck.
Scenario 2: Using a Multimodal LLM:
- You: (Takes a photo of the bike’s rear derailleur and chain, highlights the area where the chain is falling) “The chain keeps falling off my kid’s bike. It’s usually when they shift gears.”
- Multimodal LLM:
- Image Understanding: Processes the image, identifies the bike as a geared bike, notes the type of derailleur (e.g., Shimano Tourney), and visually inspects the chain tension and alignment.
- Text Understanding: Processes your text query, understanding “chain keeps falling off” and “when they shift gears.”
- Fusion & Reasoning: Combines visual and textual information to understand the specific problem: a shifting issue likely related to derailleur adjustment or cable tension.
- Response: “Based on the image of your kid’s geared bike with a Shimano Tourney derailleur, and the chain falling off during shifting, the most common culprit is a misadjusted derailleur. You’ll likely need to adjust the H (High) and L (Low) limit screws, and possibly the barrel adjuster for cable tension.
- Locate the two small screws labeled ‘H’ and ‘L’ on the back of the derailleur.
- If the chain falls off the outside of the cassette (past the smallest cog), tighten the ‘H’ screw slightly.
- If the chain falls off toward the spokes (past the largest cog), tighten the ‘L’ screw slightly.
- Also check the cable tension using the barrel adjuster where the cable enters the derailleur. I can also provide a short video demonstrating this specific adjustment if you’d like more visual guidance.”
You communicate naturally (show and tell). The LLM immediately understands the specific type of component and provides precise, actionable steps tailored to your exact situation, even anticipating further needs (video). This is vastly superior to generic instructions.
Foundational Models of Multimodal Large Language Models (MLLMs)
Foundational models are large-scale, general-purpose neural networks trained on massive, diverse datasets and designed to serve as the base for a wide range of downstream applications. In the context of MLLMs, these models are specifically built to process and generate content across multiple data modalities — such as text, images, audio, and sometimes video — enabling more natural and versatile AI interactions.
A brief summary of a few prominent multimodal LLM models:

1. OpenAI (e.g., GPT-4o, GPT-4V): Real-time, voice-native, lightning fast; excels in conversational and interactive applications; widely regarded as among the fastest and most advanced models for multimodal tasks.
- GPT-4o (“omni”): OpenAI’s latest flagship, designed for native multimodality across text, audio, and vision, with impressive real-time performance. It can understand and generate text, images, and audio seamlessly.
- GPT-4V (Vision): The predecessor to GPT-4o’s vision capabilities, allowing GPT-4 to interpret and reason about images provided as input.
2. Google DeepMind (e.g., Gemini): Designed from the ground up for multimodality. Gemini Pro boasts a massive context window and strong reasoning across modalities. Supports “long context understanding” for video and audio.
- Gemini: Google’s most capable and natively multimodal model, designed from the ground up to understand and operate across text, code, audio, image, and video. It comes in various sizes (Ultra, Pro, Nano).
3. Meta (e.g., Llama family — Llama 3, Llama 2): Multilingual, open-source, optimized for diverse language tasks and reasoning; ideal for customization and research.
- Llama 3.2’s multimodal models (11B and 90B) can process both text and images. These models can perform tasks like image captioning, visual reasoning, and answering questions about images. It also includes lightweight text-only models for edge devices.
- Llama 3-V is a separate, community-developed open-source multimodal model reported to perform comparably to much larger models.
4. Alibaba Cloud (e.g., Qwen series): A powerful series of open-source and proprietary models from Alibaba. Low-latency and high-performing, it excels in code generation and real-time tasks, with open-source flexibility.
- Qwen-VL: The foundational vision-language model.
- Qwen-VL-Chat: A fine-tuned version for conversational capabilities, making it more suitable for interactive multimodal applications.
5. xAI (Grok): Integrated with X (Twitter), real-time information processing, unique “Think” and “Deep Search” modes; excels in up-to-date world knowledge.
- Grok-1.5V is xAI’s first-generation multimodal model. In addition to its strong text capabilities, it can process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs.
6. DeepSeek (e.g., DeepSeek-V2, DeepSeek-Coder): Open-source, excels in reasoning and long-form content, highly efficient and cost-effective; strong performance in benchmarks and RAG tasks.
- DeepSeek-VL combines a powerful vision encoder with a Large Language Model backbone.
Basics of Multimodal LLM Architecture

Think of a Multimodal LLM as having specialized “translators” for each sense, all feeding into a central “brain” (the LLM’s core architecture).
- When you show it an image, a Visual Encoder (like a vision transformer) “translates” the pixels into a language the LLM can understand — a rich numerical representation (embedding).
- When you play it audio, an Audio Encoder (like Wav2Vec 2.0 or Whisper) “translates” the sound waves into a numerical representation or direct text.
- All these translated “sensory inputs” are then brought together and aligned in a Fusion Layer (e.g., using cross-attention mechanisms). This layer helps the LLM understand how the different pieces of information relate to each other (e.g., “this sound of a dog barking is related to this image of a dog”).
- Finally, the LLM’s powerful Generative Core processes this unified multimodal understanding, allowing it to generate text, images, or even new audio based on the combined input.
A typical Multimodal LLM (MLLM) architecture can be abstracted into the following key components:
1. Modality Encoders: These are specialized neural networks responsible for processing raw data from different modalities (like images, audio, video) and converting them into embeddings. The goal is to extract features relevant to the content of each modality.
Ex:
- Vision Encoder: For image inputs, models like CLIP’s Vision Transformer (ViT) or OpenCLIP are commonly used. These models are pre-trained to align visual features with textual descriptions. They take an image as input and output a fixed-size vector embedding representing the image’s content.
- Audio Encoder: For audio inputs, models like HuBERT or Whisper can be employed. These models process raw audio waveforms and extract features related to the sound, including speech content, speaker identity, and acoustic characteristics.
- Video Encoder: Video encoders often involve a combination of visual and temporal processing. Models might use 3D CNNs or incorporate attention mechanisms across video frames to capture both spatial and temporal information. Examples include extensions of vision transformers for video.
2. Pre-trained LLM: This is the core of the MLLM, a powerful Transformer-based language model that has been pre-trained on vast amounts of text data. It possesses strong capabilities in understanding and generating natural language, reasoning, and in-context learning.
Ex:
- Models like Llama 3, GPT-3/4, Mistral, Gemini or Claude 3 serve as the LLM backbone. These models have learned complex language patterns and world knowledge during their pre-training phase. The multimodal aspects are integrated by feeding the representations from the modality encoders into this pre-trained LLM.
3. Modality Interface (Connector): This is a crucial component that bridges the gap between the representations from the modality encoders (which are modality-specific) and the input format expected by the pre-trained LLM (which is typically token embeddings). The interface aligns the different modalities into a shared representation that the LLM can understand and reason over.
Ex:
- Projection Layer: A simple yet effective interface can be one or more linear layers (Multi-Layer Perceptrons — MLPs) that project the output embeddings from the vision, audio, or video encoders into the same dimensional space as the word embeddings of the LLM.
For instance, LLaVA uses a linear projection to map visual features to the LLM’s embedding space.
- Q-Former (Querying Transformer): As used in BLIP-2, this involves a set of learnable query tokens that interact with the visual features through cross-attention. The output of the Q-Former is a fixed-length sequence of embeddings that can be fed into the LLM.
4. Generator: Some MLLMs are designed to generate outputs in modalities other than text (e.g., generating images from text). In such cases, an optional modality-specific generator is attached to the LLM backbone.
Ex:
- Image Generator: Models like the decoder part of Stable Diffusion or DALL-E can be used as an image generator. The LLM’s output (in the form of latent vectors or image tokens) can be fed into this generator to produce an image.
- Audio Generator: Similarly, models like VALL-E or other text-to-speech models can be used to generate audio outputs based on the LLM’s text generation.
Three Types of Connectors
The modality interface, or “connector,” plays a vital role in how information from different modalities is integrated with the LLM. Broadly, there are three main types of connectors:
1. Projection-Based Connectors (Token-Level Fusion): These connectors use relatively simple transformations, often linear layers (MLPs), to project the feature embeddings from non-text modalities (e.g., image embeddings) into the same embedding space as the text tokens of the LLM. The projected features are then treated as pseudo-tokens and concatenated with the actual text tokens before being fed into the LLM. This achieves fusion at the token level because the LLM’s Transformer layers process these projected features just like regular word embeddings.
- The modality encoder produces a feature vector.
- This vector is passed through one or more projection layers (MLPs).
- The output of the projection is treated as a sequence of “visual tokens” (or “audio tokens,” etc.).
- These pseudo-tokens are concatenated with the embedded text tokens.
- The combined sequence of tokens is fed into the LLM.
Ex: LLaVA utilizes one or two linear layers to project visual features from a pre-trained vision encoder (CLIP’s ViT) to align their dimensionality with the word embeddings of the Vicuna or Llama LLM. These projected visual features are then prepended to the text input.
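The projection step can be sketched in a few lines of PyTorch. The dimensions below are illustrative placeholders, not the real CLIP/Vicuna sizes:

```python
import torch
import torch.nn as nn

class ProjectionConnector(nn.Module):
    """Project vision-encoder features into the LLM embedding space and
    prepend them to the text embeddings (LLaVA-style sketch; the 1024/4096
    dimensions are illustrative, not any real model's)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.proj(vision_feats)            # (B, N_img, llm_dim)
        # Pseudo-tokens are concatenated in front of the text tokens.
        return torch.cat([visual_tokens, text_embeds], dim=1)

connector = ProjectionConnector()
vision_feats = torch.randn(1, 256, 1024)  # e.g. 256 patch features from a ViT
text_embeds = torch.randn(1, 12, 4096)    # 12 embedded text tokens
fused = connector(vision_feats, text_embeds)
print(fused.shape)  # torch.Size([1, 268, 4096])
```

From the LLM’s point of view the 256 projected vectors are just extra tokens at the start of the sequence, which is exactly what “token-level fusion” means here.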
2. Query-Based Connectors (Token-Level Fusion): These connectors employ a set of learnable query vectors (often called Q-Former) to interact with the features extracted by the non-text modality encoders using cross-attention mechanisms. The query vectors “query” the visual, audio, or video features to extract relevant information. The output of this interaction is a fixed-length sequence of embedding vectors, which are then treated as pseudo-tokens and fed into the LLM along with the text tokens. This is also a form of token-level fusion as the LLM processes these resulting embeddings as part of its input sequence.
- The modality encoder produces a set of feature vectors.
- A small set of learnable query vectors is initialized.
- These query vectors attend to the modality features using cross-attention layers.
- The output of the cross-attention layers (the updated query vectors) represents a compressed and informative representation of the non-text modality.
- These query output vectors are treated as a sequence of “modality tokens.”
- These pseudo-tokens are concatenated with the embedded text tokens and fed into the LLM.
Ex: BLIP-2 uses a Q-Former that takes the output of a frozen image encoder and a set of learnable query embeddings as input. Through several Transformer blocks with self-attention on the queries and cross-attention between the queries and image features, the Q-Former extracts a fixed-length representation of the image. These embeddings are then fed into a frozen LLM.
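A single-layer sketch of this querying mechanism follows. BLIP-2’s actual Q-Former stacks many Transformer blocks and adds self-attention over the queries; the dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Q-Former-style sketch: a fixed set of learnable query vectors attends
    to image features via cross-attention, producing a fixed-length sequence
    of embeddings for the LLM regardless of the input feature count."""
    def __init__(self, num_queries: int = 32, dim: int = 768, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, 32, dim)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out                                          # (B, 32, dim)

image_feats = torch.randn(2, 257, 768)  # variable-length patch features
tokens = QueryConnector()(image_feats)
print(tokens.shape)  # torch.Size([2, 32, 768])
```

Note that 257 input features are compressed to exactly 32 output tokens: the query count, not the image resolution, fixes the length of what the LLM sees.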
3. Fusion-Based Connectors (Feature-Level Fusion): Unlike the previous two types that convert non-text modality features into tokens to be processed sequentially with text, fusion-based connectors enable a deeper interaction and fusion of features within the layers of the LLM itself. This is typically achieved by inserting new layers or modifying existing attention mechanisms in the LLM’s Transformer architecture to allow direct interaction between the textual features and the features from other modalities.
- The modality encoders produce feature vectors.
- The text input is processed through the initial layers of the LLM to obtain text feature embeddings.
- Specialized fusion layers (e.g., cross-attention layers or modified self-attention) are introduced within the LLM’s Transformer blocks. These layers allow the text features to attend to the non-text modality features (and vice versa) at different stages of processing.
- The fused features are then processed by the subsequent layers of the LLM for reasoning and generation.
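A minimal sketch of one such fusion block, with a cross-attention sub-layer inserted between the self-attention and feed-forward parts of a standard Transformer block (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Feature-level fusion sketch: a cross-attention sub-layer inside the
    Transformer block lets text features attend to image features mid-network,
    rather than only seeing them as extra input tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        t = self.n1(text)
        text = text + self.self_attn(t, t, t)[0]            # usual self-attention
        text = text + self.cross_attn(self.n2(text), image, image)[0]  # fusion
        return text + self.ffn(self.n3(text))               # usual feed-forward

text = torch.randn(1, 10, 512)   # text features from earlier LLM layers
image = torch.randn(1, 49, 512)  # image features from the vision encoder
out = FusionBlock()(text, image)
print(out.shape)  # torch.Size([1, 10, 512])
```

The output keeps the text sequence length; the visual information enters only through the cross-attention residual, which is what distinguishes this from token-level fusion.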
Vision Transformer
A Vision Transformer (ViT) is a neural network architecture that applies the self-attention mechanism of the Transformer model directly to images. Unlike traditional convolutional neural networks (CNNs) that use convolutional layers to process local regions of an image, ViTs treat an image as a sequence of image patches.
Here’s a breakdown of how it works:
- The input image is divided into a grid of fixed-size, non-overlapping square patches.
- Each 2D image patch is flattened into a 1D vector. These flattened patches are then projected into a lower-dimensional space using a linear transformation to create patch embeddings. This step is analogous to converting words into token embeddings in NLP.
- To preserve the spatial information lost by flattening the patches and treating them as a sequence, learnable positional embeddings are added to the patch embeddings. These encodings provide the model with information about the original position of each patch in the image.
- The sequence of patch embeddings, now with positional information, is passed through a standard Transformer encoder.
The encoder consists of multiple layers, each containing Multi-Head Self-Attention (MSA) and Feed-Forward Networks (FFN).
The MSA mechanism allows the model to attend to different patches and capture relationships between them across the entire image. The FFN further processes these attended features.
- For tasks like image classification, a special learnable classification token is often prepended to the sequence. The output of the Transformer encoder corresponding to this token is then fed into a classification head (typically a simple MLP) to predict the output. For other tasks like object detection or segmentation, different heads are used to process the encoder’s output.
Ex: Imagine you want to classify an image of a cat. A ViT would divide the image into patches (e.g., 16×16 pixels). Each patch would be flattened and embedded. Positional information would be added to these embeddings. The sequence of embedded patches would then go through the Transformer encoder, where the self-attention mechanism would allow the model to learn how different parts of the cat (e.g., ears, whiskers, tail) relate to each other and the overall image. Finally, the classification head would use the learned representation to determine that the image contains a cat.
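The patching and embedding steps can be sketched as follows, using a 224×224 image and 16×16 patches (giving 14×14 = 196 patch tokens); the 768 embedding width matches ViT-Base and is used here purely for illustration:

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened, non-overlapping patches,
    returning (B, num_patches, C * patch * patch)."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # grid of patches
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x

images = torch.randn(1, 3, 224, 224)
patches = patchify(images)                    # (1, 196, 768) flattened patches
embed = nn.Linear(3 * 16 * 16, 768)           # linear patch projection
pos = nn.Parameter(torch.zeros(1, 196, 768))  # learnable positional embeddings
tokens = embed(patches) + pos                 # the sequence fed to the encoder
print(tokens.shape)  # torch.Size([1, 196, 768])
```

From here the 196 tokens are processed exactly like word embeddings in an NLP Transformer, which is the whole point of the ViT design.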
ViTs can be applied to various computer vision tasks, including:
- Image Classification: Categorizing images into predefined classes (e.g., dog, cat, car).
- Object Detection: Identifying and locating multiple objects within an image.
- Image Segmentation: Dividing an image into segments based on the objects or regions they represent.
- Action Recognition: Identifying human actions in videos.
- Image Captioning: Generating textual descriptions for images.
The fundamental difference between Vision Transformers and Convolutional Neural Networks lies in their approach to capturing spatial information and dependencies:
- CNNs (Convolutional Neural Networks): Use convolutional filters that slide over the image to capture local features. They build a hierarchical representation by progressively combining local features into more complex ones in deeper layers. CNNs have an inductive bias towards locality and translation equivariance.
- ViTs (Vision Transformers): Treat the image as a sequence of patches and use self-attention to capture global dependencies between any pair of patches, regardless of their distance. They have a weaker inductive bias and rely more heavily on large datasets to learn spatial hierarchies and relationships.
The integration of Vision Transformers into multimodal foundational LLMs offers several key advantages:
- Transformers provide a common architectural backbone that can process sequential data, whether it’s text tokens or image patch embeddings.
- The self-attention mechanism in ViTs allows the model to capture long-range dependencies and global context within an image. This is crucial for multimodal tasks where understanding the overall scene and relationships between objects is important for generating relevant text or answering questions.
- ViTs, like their NLP counterparts, benefit significantly from pre-training on massive image datasets. The learned visual representations can then be transferred and fine-tuned for various downstream multimodal tasks, improving performance and reducing the need for task-specific training data.
- Transformers are known for their scalability with increasing data and model size, which is essential for building large foundational multimodal models capable of handling the complexity of real-world multimodal data.
- By embedding images and text into a common representation space (often facilitated by the Transformer architecture), multimodal models can more effectively learn the relationships and alignments between visual and textual information, enabling tasks like image captioning, visual question answering, and text-to-image generation.
Encoders
For multimodal large language models (LLMs), different specialized encoders are used to process various data types (modalities), such as images, audio, video, and code.
Here’s a breakdown of commonly used models and techniques for each modality:
Contrastive Language-Image Pre-training [CLIP]
CLIP is a multimodal model developed by OpenAI that learns visual concepts by connecting them to natural language.
Unlike traditional image classification models trained to predict a fixed set of labels, CLIP learns a shared embedding space for images and text. This allows it to understand the semantic relationship between visual content and textual descriptions.

Let’s explore how CLIP works:
1. Dual Encoder Architecture: CLIP employs two separate encoder networks:
- Image Encoder: This network takes an image as input and transforms it into a high-dimensional vector representing its visual features. CLIP experimented with both ResNet and Vision Transformer (ViT) architectures for the image encoder.
- Text Encoder: This network takes a text description (caption or label) as input and encodes it into a high-dimensional vector representing its semantic meaning. The text encoder in CLIP is a Transformer model, similar to those used in language models like GPT.
2. Shared Embedding Space: The key idea of CLIP is to project the image and text embeddings into the same multi-dimensional vector space. This shared space allows the model to directly compare the representations of images and their corresponding text.
3. Contrastive Learning: CLIP is trained using a contrastive learning objective on a massive dataset of 400 million (image, text) pairs collected from the internet. The training process works as follows:
- Positive Pairs: For each image in a batch, there is at least one correct text description associated with it. The model aims to maximize the cosine similarity between the embedding of the image and the embedding of its correct text description.
- Negative Pairs: For each image, all other text descriptions in the batch are considered incorrect. The model aims to minimize the cosine similarity between the embedding of the image and the embeddings of these incorrect text descriptions.
Similarly, this process is also applied from the perspective of the text encoder. The model tries to match each text description with its corresponding image and differentiate it from other images in the batch.
4. Zero-Shot Transfer: The contrastive learning approach on a vast and diverse dataset enables CLIP to learn a rich understanding of visual concepts and their relationship with language. This results in a remarkable zero-shot transfer capability.
- Image Classification: To perform zero-shot image classification on a new dataset with predefined categories, you can create text prompts for each category (e.g., “a photo of a cat”, “a photo of a dog”). Then, you encode the input image and all the text prompts. CLIP predicts the image’s class as the text prompt whose embedding has the highest cosine similarity with the image embedding. This is done without any fine-tuning on the new dataset.
In essence, CLIP learns to associate images and text by understanding which descriptions are likely to belong to which images. This is achieved by pushing the embeddings of matching (image, text) pairs closer together in the shared embedding space and pushing the embeddings of non-matching pairs further apart.
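This "push together / push apart" objective can be sketched as a symmetric InfoNCE-style loss. This is a simplification: real CLIP also learns the temperature and trains with very large batches.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matching (image, text)
    embedding pairs: the i-th image should match the i-th text and no other
    (a sketch of CLIP's training objective, not the full implementation)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # scaled cosine similarities
    targets = torch.arange(len(img))               # the diagonal holds positives
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(f"loss: {loss.item():.3f}")
```

Minimizing this loss raises the diagonal similarities (matching pairs) relative to the off-diagonal ones (non-matching pairs), which is exactly the geometry described above.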
Flamingo
Flamingo, developed by DeepMind, is a foundational MLLM that gained significant attention for its few-shot learning capabilities across a wide range of vision and language tasks. Unlike models that require extensive fine-tuning for each specific task, Flamingo aimed to rapidly adapt to new tasks by leveraging in-context learning, similar to large language models like GPT-3.
Let’s explore key aspects of the Flamingo MLLM and how it works:
1. Vision Encoder: Flamingo utilizes a pre-trained vision encoder, often a model trained with a contrastive text-image approach similar to CLIP (though they used NFNet models in their initial work).
The role of the vision encoder is to extract rich semantic and spatial features from the input images or video frames. It takes the raw visual data and outputs a set of feature maps.
2. Perceiver Resampler: The output of the vision encoder can have a variable number of feature vectors depending on the input resolution and the architecture. Language models, however, typically prefer a fixed-size input.
Perceiver Resampler acts as a bridge. It takes the variable number of visual features from the vision encoder and uses a set of learnable query vectors and cross-attention mechanisms to output a fixed number of visual tokens. This reduces the dimensionality and provides a consistent input size for the language model.
For video, the Perceiver Resampler processes features from multiple frames, potentially incorporating temporal information.
3. Language Model: Flamingo employs a powerful, pre-trained and frozen LLM as its backbone. The initial work used the Chinchilla family of LLMs developed by DeepMind.
This frozen LLM provides strong generative language abilities and access to a vast amount of knowledge learned during its pre-training on text.
4. GATED XATTN-DENSE Layers (Gated Cross-Attention Dense Layers): The crucial innovation that connects the visual tokens to the frozen LLM is the introduction of novel GATED XATTN-DENSE layers. These layers are interleaved (inserted) between the layers of the pre-trained LLM.
- Cross-Attention: These layers allow the language embeddings to attend to the visual tokens produced by the Perceiver Resampler. This enables the LLM to incorporate visual information when predicting the next text token.
- Gating Mechanism: The “gated” aspect involves a learnable gate that controls how much the visual information influences the language processing. This helps the model selectively integrate visual cues.
- Interleaving: By inserting these cross-attention layers throughout the LLM, the visual information can influence the language generation at multiple stages.

Processing Flow:
1. Input: Flamingo receives a sequence of text tokens interleaved with images or videos. Special tags might be added to the text to indicate the presence of visual inputs.
2. Vision Encoding: Images/videos are processed by the frozen vision encoder to extract feature maps.
3. Perceiver Resampling: The variable visual features are transformed into a fixed number of visual tokens by the Perceiver Resampler.
4. Language Encoding: The text tokens are embedded using the frozen LLM’s embedding layer.
5. Interleaved Processing: The sequence of text embeddings and visual tokens is fed into the modified LLM. At each GATED XATTN-DENSE layer:
- The language embeddings attend to the visual tokens.
- The attended visual information is used to influence the next token prediction by the LLM.
6. Text Generation: The model autoregressively generates text conditioned on both the textual and visual context.
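The gated cross-attention in step 5 can be sketched as a single layer. The key detail is the zero-initialized tanh gate: at initialization the layer contributes nothing, so the frozen LLM’s behavior is initially unchanged and visual influence is learned gradually.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of a Flamingo-style gated cross-attention layer: language
    embeddings attend to visual tokens, and a tanh gate (initialized to
    zero) controls how much visual signal flows into the frozen LLM."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: gate starts closed

    def forward(self, lang: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(lang, visual, visual)
        return lang + torch.tanh(self.gate) * attended

lang = torch.randn(1, 20, 512)    # language embeddings between LLM layers
visual = torch.randn(1, 64, 512)  # visual tokens from the Perceiver Resampler
out = GatedCrossAttention()(lang, visual)
print(torch.allclose(out, lang))  # True: with the gate closed, output == input
```

Because the new layers start as identity functions, training can proceed without destabilizing the pre-trained language model, which is why Flamingo can keep its LLM backbone frozen.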
Large Language and Vision Assistant [LLaVA]
LLaVA (Large Language and Vision Assistant) is an end-to-end trained MLLM that combines a pre-trained vision encoder with an LLM to achieve general-purpose visual and language understanding, along with impressive chat capabilities. Its architecture is relatively simple yet surprisingly powerful and data-efficient.
Let’s explore key aspects of the LLaVA MLLM:
1. Vision Encoder: LLaVA utilizes a pre-trained CLIP (Contrastive Language-Image Pre-training) visual encoder, typically the ViT-L/14 model.
The CLIP vision encoder is responsible for processing input images and extracting high-level visual features that are already aligned with textual concepts due to CLIP’s training. It outputs a fixed-size vector representation of the image.
2. Projection Matrix (Modality Connector): The core of LLaVA’s architecture for connecting vision and language is a simple, trainable projection matrix (a fully connected layer or a small Multi-Layer Perceptron — MLP).
This projection matrix takes the visual feature vector from the CLIP encoder and maps it into the embedding space of the pre-trained LLM. The goal is to transform the visual features into a format that the language model can understand and integrate with textual input.
3. Large Language Model (LLM) Backbone: LLaVA uses a pre-trained large language model as its text processing and generation engine. The original LLaVA work primarily used Vicuna, an open-source chatbot fine-tuned from LLaMA.
The LLM processes the embedded text instructions and the projected visual features to generate a textual response that is relevant to both the text and the image.

Processing Flow:
- Input: LLaVA receives a multimodal input consisting of an image and a text instruction or question.
- Vision Encoding: The input image is processed by the pre-trained CLIP vision encoder to extract a visual feature vector.
- Projection: The visual feature vector is passed through the trainable projection matrix to align its dimensionality and semantic space with the LLM’s word embeddings. These projected visual features are effectively treated as “visual tokens” by the LLM.
- Language Encoding: The text instruction is tokenized and embedded using the LLM’s embedding layer.
- Multimodal Input to LLM: The embedded text tokens and the projected visual features are concatenated (or otherwise combined) and fed into the LLM.
- Text Generation: The LLM processes this combined input and autoregressively generates a textual response that answers the question or follows the instruction based on the visual and textual context.
Training Process:
LLaVA’s training involves two main stages:
1. Feature Alignment Pre-training: In this initial stage, the weights of the pre-trained CLIP vision encoder and the LLM are frozen.
Only the projection matrix is trained. The model is trained on a dataset of image-text pairs (often a subset of CC3M) to align the visual features with the language embeddings. The objective is to predict the text associated with an image based on the projected visual features.
2. Visual Instruction Tuning: In the second and crucial stage, the entire model (including the projection matrix and often the LLM) is fine-tuned on a large-scale visual instruction tuning dataset.
This dataset, often generated using GPT-4, consists of instructions paired with images and corresponding responses. The instructions cover a wide range of tasks, including:
- Conversations: Natural dialogues about the image.
- Detailed Descriptions: Generating comprehensive descriptions of the visual content.
- Complex Reasoning: Answering questions that require understanding relationships and performing reasoning based on the image.
During this stage, the model learns to follow instructions that involve understanding and reasoning about visual information.
C-Former
C-Former is a specialized model architecture used within multimodal large language models (MLLMs) to process and encode audio data, enabling integration with other modalities like text and images. Its design leverages transformer-based techniques tailored to capture the unique properties of audio signals, facilitating effective fusion with language models for downstream tasks such as transcription, audio understanding, and multimodal reasoning.
Let’s explore how C-Former works:
C-Former typically employs a Transformer-based architecture, leveraging the self-attention mechanism’s ability to capture long-range dependencies within the audio sequence and across different modalities. Here’s a breakdown of its key aspects and processing flow:
1. Audio Feature Extraction:
- The raw audio signal is first processed by a dedicated audio encoder (e.g., VGGish, Wav2Vec 2.0, Whisper’s audio encoder).
- This encoder extracts a sequence of lower-dimensional, semantically richer audio features. These features could be frame-level embeddings representing acoustic properties over short time windows.
2. Audio Feature Embedding:
- The sequence of audio features is then passed through an embedding layer. This projects each audio feature vector into a higher-dimensional embedding space that is compatible with the LLM’s embedding space.
- Positional embeddings are often added to the audio feature embeddings to encode the temporal order of the audio frames, similar to how they are used in Transformers for text.
3. Cross-Modal Attention (Crucial Component):
- This is where C-Former truly shines. It utilizes cross-attention mechanisms to enable interaction between the audio embeddings and the embeddings of other modalities (e.g., visual features, text tokens).
- Query, Key, Value: In a typical cross-attention layer:
- The queries might come from the LLM’s text or visual embeddings.
- The keys and values come from the audio embeddings (or vice-versa, depending on the specific architecture).
- Attention Weights: The attention mechanism calculates weights based on the similarity between the queries and the keys. These weights determine how much influence each audio feature has on the representation of other modalities (and vice-versa).
- Contextualized Audio Representation: Through cross-attention, the audio embeddings become contextualized by the information from other modalities. For example, if the text mentions “a dog barking,” the audio embeddings corresponding to barking sounds will likely receive higher attention weights when processing the text. Similarly, the text representation might be enriched by attending to relevant audio cues.
4. Fusion and Integration:
- The output of the cross-attention layers (the contextualized audio representations) is then fused with the embeddings of other modalities. This fusion can happen through concatenation, element-wise operations (addition, multiplication), or further Transformer layers.
- The goal is to create a unified multimodal representation that the LLM can then use for downstream tasks like multimodal understanding, generation, and reasoning.
5. Interaction within Audio (Self-Attention):
- C-Former often also includes self-attention layers that operate solely on the audio embeddings before or alongside the cross-attention. This allows the model to capture temporal dependencies and relationships within the audio stream itself, independent of other modalities.
Ex: Imagine an MLLM is shown an image of a dog barking and hears the sound of barking. The user asks: “What is the animal in the image doing?”
1. Audio Processing: The barking sound goes through the audio encoder, producing a sequence of audio features. C-Former embeds these features and applies self-attention to understand the temporal structure of the bark.
2. Visual Processing: The image of the dog goes through a visual encoder, producing visual features.
3. Cross-Modal Attention (C-Former):
- The visual features (as queries) attend to the audio embeddings (as keys and values). This helps the model understand that the sound is likely related to the visual content.
- Simultaneously (or in a different layer), the text tokens of the question (“What is the animal…”) attend to both the visual and audio embeddings, allowing the model to ground the question in the multimodal input. The audio embeddings corresponding to the barking sound will likely receive high attention when the model focuses on the action.
4. Fusion: The contextualized audio and visual representations are fused with the text embeddings of the question.
5. LLM Reasoning: The LLM processes this unified multimodal representation and reasons about the scene. It understands that the image contains an animal (dog) and the audio indicates the action of barking.
6. Answer Generation: The LLM generates the answer: “The dog in the image is barking.”
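The cross-attention step at the heart of this flow can be sketched with plain numpy. This is a single-head scaled dot-product attention with toy dimensions, using the text-queries/audio-keys pairing described above (the real C-Former stacks many such layers with learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # (n_q, n_kv): similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each query's attention distribution over audio frames
    return weights @ values, weights          # queries contextualized by audio content

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((6, 64))    # 6 text tokens ("What is the animal ... doing?")
audio_emb = rng.standard_normal((20, 64))  # 20 audio frames (the barking sound)

contextualized, weights = cross_attention(text_emb, audio_emb, audio_emb)
print(contextualized.shape)  # (6, 64): text tokens enriched with audio context
```

Each row of `weights` sums to 1, so every text token ends up as a convex combination of audio-frame values; swapping the arguments gives the reverse direction (audio attending to text).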
Wav2Vec 2.0
Wav2Vec 2.0 is a self-supervised model designed to learn powerful speech representations from raw audio waveforms. It is especially effective for speech recognition but can be adapted as an audio encoder within multimodal large language models (MLLMs) to provide rich audio embeddings that integrate with text or other modalities.

Let’s explore how Wav2Vec 2.0 works:
1. Raw Audio Input
- The model takes raw audio waveform sampled typically at 16 kHz as input, without requiring traditional preprocessing like Fourier transforms or handcrafted features.
2. Feature Encoder (Convolutional Neural Network)
- The raw waveform is passed through a 7-layer 1D convolutional neural network (CNN).
- This CNN extracts latent speech features at a frame rate of about one vector per 20 ms, reducing the raw audio to a sequence of lower-dimensional latent representations Z = (z0, z1, …, zT).
- The CNN has 512 channels per layer and a receptive field of about 25 ms of audio, capturing local acoustic patterns.
3. Quantization Module (During Pre-training)
- The latent features are discretized into a finite set of quantized speech units.
- This quantization acts as a target for the model to predict during self-supervised contrastive learning, helping the model learn meaningful speech representations without labels.
4. Context Network (Transformer Encoder)
- The latent features Z are fed into a Transformer encoder (12 layers for the base model, 24 for the large model).
- The Transformer produces contextualized representations C = (c0, c1, …, cT) that capture long-range dependencies and speech context beyond local acoustic features.
- A feature projection layer adjusts the CNN output dimension (512) to match the Transformer input dimension (768 for base, 1024 for large).
5. Self-Supervised Pre-training
- The model is trained by masking portions of the latent audio features and learning to predict the correct quantized representations for these masked parts using contrastive loss.
- Contrastive learning forces the model to distinguish the true quantized feature from distractors, improving representation quality.
6. Fine-tuning for Downstream Tasks
- After pre-training, the model is fine-tuned on labeled data (e.g., transcriptions) for tasks like Automatic Speech Recognition (ASR).
- The Transformer output C is passed through a linear projection to predict phonemes, words, or other targets.
Processing Flow:
- Audio Input: Raw waveform sampled at 16 kHz.
- Feature Extraction: The CNN feature encoder converts the waveform to latent features Z.
- Contextual Encoding: The Transformer encoder produces contextualized embeddings C.
- Projection: The contextual embeddings are projected into a shared embedding space compatible with the LLM.
- Multimodal Fusion: Audio embeddings are concatenated or integrated with other modalities (e.g., text embeddings).
- LLM Processing: The multimodal input is processed by the LLM’s transformer layers for tasks like speech-to-text, audio captioning, or multimodal understanding.
- Output Generation: The LLM generates the final output, such as transcriptions, answers, or descriptions.
Ex: Input: A raw WAV audio file containing spoken language.
- Step 1: The audio waveform is loaded and resampled to 16 kHz.
- Step 2: The waveform is passed through the Wav2Vec 2.0 CNN encoder to get latent features.
- Step 3: Latent features are fed into the Transformer to obtain contextualized embeddings.
- Step 4: The embeddings are projected and fed into a multimodal LLM alongside any text or other modality inputs.
- Step 5: The LLM processes the combined input and generates a transcription text output, e.g., “Hello, how can I help you today?”
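The "about 20 ms per vector" figure follows directly from the feature encoder's strides. A quick sanity check, using the kernel sizes and strides published in the wav2vec 2.0 paper (each conv layer has 512 channels):

```python
# Frame-count arithmetic for the wav2vec 2.0 CNN feature encoder.
# Kernel sizes and strides are from the wav2vec 2.0 paper.
kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]

def num_frames(n_samples):
    """Apply the 1D-conv output-length formula layer by layer: floor((L - k) / s) + 1."""
    length = n_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

sample_rate = 16_000
frames = num_frames(sample_rate)  # latent vectors produced from 1 second of audio

total_stride = 1
for s in strides:
    total_stride *= s             # samples between consecutive latent vectors

print(frames)                             # 49 vectors per second
print(total_stride / sample_rate * 1000)  # 20.0 ms hop between vectors
```

So one second of 16 kHz audio becomes 49 latent vectors spaced 320 samples (20 ms) apart, which is the sequence Z the Transformer then contextualizes.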
Whisper for Multimodal LLMs with Audio
Whisper is an open-source speech-to-text (STT) model developed by OpenAI. Unlike Wav2Vec 2.0, which primarily focuses on learning general-purpose audio representations, Whisper is explicitly trained for speech transcription and translation. This makes it a powerful tool for directly converting audio into text that a multimodal LLM can readily understand and process alongside other modalities.
Key Characteristics of Whisper:
- Multilingual and Multitask: Whisper is trained on a massive dataset of multilingual speech and paired text, enabling it to transcribe audio in multiple languages and translate speech from one language to another.
- Robustness to Noise and Accents: The large and diverse training data makes Whisper relatively robust to various accents, background noise, and different audio qualities.
- Direct Text Output: Whisper’s primary output is text, which aligns naturally with the text-based processing of LLMs.
Let’s explore how Whisper works:
Whisper utilizes a Transformer-based encoder-decoder architecture. Here’s a breakdown of its processing flow:
1. Audio Input: The raw audio waveform is fed into the Whisper model.
2. Audio Encoder (Transformer):
- The audio is first processed into a sequence of log-Mel spectrogram features. This is a standard representation of audio frequency content over time.
- These spectrogram features are then fed into a Transformer encoder. The encoder uses self-attention mechanisms to learn contextualized representations of the audio over time. It captures long-range dependencies and acoustic patterns within the speech.
3. Decoder (Transformer):
- The Transformer decoder takes the encoded audio representations as input and generates the corresponding text.
- During decoding, the model predicts the next token in the text sequence autoregressively, conditioned on the encoded audio and the previously generated tokens.
- The decoder is trained to perform both transcription (predicting the text in the same language as the audio) and translation (predicting the text in a different target language). The specific task is often indicated by a special token at the beginning of the decoding process.
Processing Flow:
- Raw Audio Input: The audio associated with the multimodal input is fed into the Whisper model.
- Whisper Processing: Whisper’s encoder processes the audio into spectrogram features and then into encoded representations. The decoder then generates the corresponding text transcription (or translation, if specified).
- Text Embedding: The transcribed text output from Whisper is then passed through the text embedding layer of the multimodal LLM. This converts the textual representation of the audio into a vector embedding that exists in the LLM’s semantic space.
- Cross-Modal Fusion: The audio’s text embedding is then fused with the embeddings of other modalities (like visual features and potentially other textual inputs) using mechanisms like cross-attention. This allows the LLM to relate the spoken content to the visual information and any other relevant text.
- Unified Multimodal Representation: The fusion process creates a unified representation that integrates information from all processed modalities, including the textual representation of the audio.
- LLM Processing: The LLM processes this unified representation to perform tasks like understanding the content of a video (audio and visuals), answering questions about a scene with spoken descriptions, or generating descriptions that incorporate both visual and auditory elements.
Ex: Imagine an MLLM is shown a video of a person saying “The cat is on the mat” and the audio of them speaking is also provided. The user asks: “What did the person say is on the mat?”
- Audio Processing: The audio of the person speaking is fed into Whisper. Whisper transcribes the audio into the text: “The cat is on the mat.”
- Visual Processing: The video frames are processed by a visual encoder, extracting features of a cat and a mat (if present).
- Text Embedding: The transcribed text “The cat is on the mat” is embedded into a vector representation by the MLLM’s text embedding layer.
- Cross-Modal Fusion: The text embedding of the audio is fused with the visual features of the cat and the mat, as well as the text embedding of the question (“What did the person say is on the mat?”). The attention mechanisms learn to associate the spoken words with the visual objects.
- LLM Reasoning: The LLM processes the unified representation and understands the relationship between the spoken words and the visual elements.
- Answer Generation: The LLM generates the answer: “The person said the cat is on the mat.”
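The log-Mel spectrogram that feeds Whisper's encoder can be computed from scratch to make the front end concrete. This is a simplified stand-in with Whisper-like parameters (25 ms window, 10 ms hop, 80 mel bins at 16 kHz); the real Whisper preprocessing uses its own filterbank, padding, and normalization:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Minimal log-Mel front end: frame, window, FFT power, mel filterbank, log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank: n_mels filters between 0 Hz and Nyquist
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fbank.T  # (n_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone as a stand-in for speech
t = np.arange(16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 80): ~100 frames per second, 80 mel bins
```

It is this (frames × 80) time-frequency image, not the raw waveform, that the Transformer encoder attends over.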
Multimodal LLM RAG
Multimodal LLM RAG is an advanced approach that extends the traditional RAG framework to enable LLMs to understand, retrieve, and generate information by leveraging multiple data modalities beyond just text. This includes data types like images, audio, video, and potentially structured data.

Components Needed in Multimodal LLM RAG:
1. Multimodal Data Sources: The knowledge base needs to store and index data in various formats (text documents, images, audio files, video clips, etc.)
2. Multimodal Data Loaders and Preprocessors: These components are responsible for ingesting and preparing the multimodal data for indexing. This might involve:
- Extracting text from documents (including text within images using OCR).
- Extracting keyframes or features from images and videos.
- Extracting transcripts or features from audio.
3. Multimodal Embedding Models: To perform effective retrieval across different modalities, the query and the data in the knowledge base need to be represented in a shared embedding space. This can be achieved through:
- Joint Embedding Models (e.g., CLIP, ImageBind): These models can embed text and images (and sometimes other modalities) into a common vector space where semantically similar items are close together, regardless of their original modality.
- Separate Encoders with Alignment: Using modality-specific encoders (e.g., a text encoder and an image encoder) and then learning a mapping or projection to align their embeddings.
- Textual Descriptions: Generating text descriptions for non-textual data (e.g., image captioning) and then using standard text embeddings for retrieval.
4. Multimodal Vector Database: A vector database is used to store and efficiently search the embeddings of the multimodal data. It should support indexing and querying vectors from different modalities, ideally within the same space for unified retrieval.
5. Retrieval Mechanism: This component takes a user query (which can also be multimodal, e.g., text and an image) and retrieves the most relevant information from the multimodal vector database based on similarity in the embedding space. The retrieval strategy might involve:
- Unified Multimodal Retrieval: If all modalities are embedded in the same space, a single vector search can retrieve relevant content across all types.
- Separate Retrieval and Fusion: Retrieving relevant content for each modality independently and then fusing the results.
6. Multimodal Large Language Model (MLLM): The core of the generation process is an MLLM that can process information from multiple modalities. This model takes the user query and the retrieved multimodal context as input and generates a coherent and relevant response that can also be in multiple modalities (e.g., text, or sometimes images). Examples of MLLMs include GPT-4o, Gemini, LLaVA, and others.
7. Prompt Engineering for Multimodality: Crafting effective prompts for MLLMs in a multimodal RAG setting is crucial. The prompt needs to clearly instruct the model on how to use the retrieved multimodal context to answer the user’s query.
8. Evaluation Metrics for Multimodal RAG: Evaluating the performance of a multimodal RAG system requires appropriate metrics that can assess the relevance and accuracy of the retrieved context and the generated responses across different modalities. These metrics are still an active area of research but can include extensions of traditional RAG metrics (like relevance, faithfulness, and answer quality) adapted for multimodal outputs.
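To make "unified multimodal retrieval" concrete before walking through the pipeline: once every chunk (text, image description, audio transcript) lives in one shared embedding space, retrieval reduces to a single nearest-neighbor search over all modalities at once. A toy numpy sketch with made-up 4-d embeddings standing in for real model outputs:

```python
import numpy as np

# Toy knowledge base: each chunk carries an embedding in a SHARED space plus metadata.
chunks = [
    {"type": "text",  "ref": "intro paragraph",  "emb": np.array([0.9, 0.1, 0.0, 0.1])},
    {"type": "image", "ref": "architecture.png", "emb": np.array([0.8, 0.2, 0.1, 0.0])},
    {"type": "audio", "ref": "lecture_clip.mp3", "emb": np.array([0.0, 0.1, 0.9, 0.2])},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, chunks, top_k=2):
    """One similarity search ranks chunks of every modality together."""
    scored = sorted(chunks, key=lambda c: cosine(query_emb, c["emb"]), reverse=True)
    return scored[:top_k]

query_emb = np.array([0.85, 0.15, 0.05, 0.05])  # embedding of the user's question
hits = retrieve(query_emb, chunks)
print([(h["type"], h["ref"]) for h in hits])
```

Because the query embedding sits close to both the text and the image chunk, both are returned ahead of the unrelated audio chunk, with no per-modality search needed.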

Let’s explore the step-by-step procedure for each part of a multimodal RAG system.
Step 1: Document Loading
The goal here is to get all the raw content (text, image paths, audio paths, table data) from your complex document.
For documents with mixed modalities (PDFs with text, images, tables, embedded audio/video), you’ll often need a custom parsing approach, or leverage specialized libraries.
Libraries:
- PyMuPDF (fitz) for PDFs (extracting text, images, and detecting tables).
- python-docx for Word documents.
- openpyxl for Excel.
- BeautifulSoup for HTML.
- Custom logic for extracting audio/video paths if they are embedded or referenced.
import fitz  # PyMuPDF
import re
import os
import camelot

def extract_pdf_multimodal_content(pdf_path, output_dir="extracted_content"):
    doc = fitz.open(pdf_path)
    all_content = []
    os.makedirs(output_dir, exist_ok=True)
    for page_num in range(doc.page_count):
        page = doc[page_num]
        page_text = page.get_text()
        # 1. Extract Text
        all_content.append({"type": "text", "content": page_text, "page": page_num + 1})
        # 2. Extract Images
        images = page.get_images(full=True)
        for img_index, img_info in enumerate(images):
            xref = img_info[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = os.path.join(output_dir, f"page{page_num+1}_img{img_index}.{image_ext}")
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)
            all_content.append({"type": "image", "path": image_filename, "page": page_num + 1})
        # 3. Extract Tables
        tables = camelot.read_pdf(pdf_path, pages=str(page_num + 1))
        for table in tables:
            all_content.append({"type": "table", "content": table.df.to_csv(), "page": page_num + 1})
        # 4. Extract Audio Clips
        # This is dependent on how audio clips are referenced in the PDF.
        # Note the non-capturing group (?:...) so findall returns the full URL,
        # not just the matched file extension.
        audio_links = re.findall(r'https?://\S+\.(?:mp3|wav|ogg|flac)', page_text)
        for link in audio_links:
            all_content.append({"type": "audio_link", "url": link, "page": page_num + 1})
    doc.close()
    return all_content
document_content = extract_pdf_multimodal_content("your_multimodal_document.pdf")
print(f"Extracted {len(document_content)} multimodal chunks.")
print(document_content[:5])
Step 2: Multimodal Chunking & Pre-processing
This is crucial. You need to break down your extracted content into meaningful units suitable for embedding and retrieval. “Meaningful” depends on your use case.
- Keep each distinct piece of content (a paragraph, an image, an audio clip’s transcription) as an atomic chunk.
- Crucially, attach rich metadata to each chunk: original page number, proximity to other modalities, original filename, type (text, image, audio, table).
- Use LLM-based chunking to ensure that text chunks represent a complete semantic idea, rather than arbitrary token counts.
- For tables, convert them into a descriptive text format (e.g., CSV, markdown, or a natural language summary) that an LLM can understand.
- For audio clips, transcribe them into text using an STT model (like Whisper). The transcribed text is what gets embedded for search. You might also store the original audio path for playback.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import ChatGoogleGenerativeAI
import io
import whisper  # optional local STT alternative to the cloud API
import google.generativeai as genai
from google.cloud import speech_v1p1beta1 as speech  # For GCP Speech-to-Text

# Configure Gemini for text summarization (for tables)
genai.configure(api_key="YOUR_GEMINI_API_KEY")  # Ensure your API key is set up
llm_for_summarization = ChatGoogleGenerativeAI(model="gemini-pro")

# Configure GCP Speech-to-Text client
# You'll need to set up GCP authentication (e.g., GOOGLE_APPLICATION_CREDENTIALS)
speech_client = speech.SpeechClient()

def process_multimodal_chunks(raw_content_list):
    processed_chunks = []
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        add_start_index=True,
    )
    for item in raw_content_list:
        if item["type"] == "text":
            # Semantic chunking for text
            chunks = text_splitter.create_documents([item["content"]])
            for i, chunk in enumerate(chunks):
                processed_chunks.append({
                    "type": "text",
                    "content": chunk.page_content,
                    "metadata": {
                        "page": item["page"],
                        "original_source": "document_text",
                        "chunk_index": i,
                    },
                })
        elif item["type"] == "image":
            # For images, we pass the image path or base64 representation directly
            processed_chunks.append({
                "type": "image",
                "path": item["path"],
                "content_description": f"Image from page {item['page']}: {item['path']}",  # A placeholder description
                "metadata": {
                    "page": item["page"],
                    "original_source": "document_image",
                },
            })
        elif item["type"] == "table":
            # Convert the table to text and add an LLM-generated summary for retrieval.
            table_text = f"Table from page {item['page']}:\n{item['content']}"
            table_summary = llm_for_summarization.invoke(
                f"Summarize this table content:\n{item['content']}"
            ).content  # .invoke() returns an AIMessage; .content holds the text
            processed_chunks.append({
                "type": "table_text",
                "content": f"{table_text}\n\nSummary: {table_summary}",
                "metadata": {
                    "page": item["page"],
                    "original_source": "document_table",
                },
            })
        elif item["type"] == "audio_link":
            # --- Option A: Google Cloud Speech-to-Text (recommended for production) ---
            audio_file_path = download_audio_from_url(item["url"])  # helper you supply to fetch the file
            with io.open(audio_file_path, "rb") as audio_file:
                content = audio_file.read()
            audio = speech.RecognitionAudio(content=content)
            config = speech.RecognitionConfig(
                encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                sample_rate_hertz=16000,
                language_code="en-US",
            )
            response = speech_client.recognize(config=config, audio=audio)
            transcribed_text = " ".join(
                result.alternatives[0].transcript for result in response.results
            )
            # --- Option B: local Whisper model (offline alternative) ---
            # model = whisper.load_model("base")
            # transcribed_text = model.transcribe(audio_file_path)["text"]
            processed_chunks.append({
                "type": "audio_text",
                "content": transcribed_text,
                "audio_url": item["url"],
                "metadata": {
                    "page": item["page"],
                    "original_source": "document_audio",
                },
            })
    return processed_chunks
processed_multimodal_chunks = process_multimodal_chunks(document_content)
print(f"Processed {len(processed_multimodal_chunks)} chunks.")
print(processed_multimodal_chunks[0])
print(processed_multimodal_chunks[1])
Step 3: Multimodal Embedding
This is where the magic happens for multimodal retrieval. You need a model that can embed text, and ideally images and audio, into a shared vector space. Gemini excels here.
- Gemini’s GenerativeModel can take mixed inputs (text and images) and produce a single embedding. For audio, you’ll generally embed the transcribed text.
- Alternatively, use a text embedding model (e.g., text-embedding-004) for text/table/audio transcriptions and a vision encoder (e.g., from CLIP or custom) for images, ensuring their embedding spaces are compatible or aligned. This is more complex to align accurately.
import google.generativeai as genai
from PIL import Image
# Configure Gemini
genai.configure(api_key="YOUR_GEMINI_API_KEY")
async def get_gemini_multimodal_embedding(content_type, content, model_name="models/embedding-001"):
    """
    Generates an embedding via Gemini's embedding API.
    For text, 'content' should be a string.
    For images, 'content' should be a PIL Image object (only meaningful if the
    chosen embedding model accepts images; text-only models need a caption instead).
    """
    if content_type == "text":
        # embed_content is a module-level function in google-generativeai
        result = genai.embed_content(model=model_name, content=content)
    elif content_type == "image":
        result = genai.embed_content(model=model_name, content=content, task_type="RETRIEVAL_DOCUMENT")
    else:
        raise ValueError(f"Unsupported content type for embedding: {content_type}")
    return result["embedding"]

# Function to load image for embedding
def load_image_for_embedding(image_path):
    try:
        return Image.open(image_path)
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None
async def embed_multimodal_chunks(processed_chunks):
    embedded_chunks = []
    for i, chunk in enumerate(processed_chunks):
        embedding_vector = None
        if chunk["type"] in ["text", "table_text", "audio_text"]:
            embedding_vector = await get_gemini_multimodal_embedding("text", chunk["content"])
        elif chunk["type"] == "image":
            image_pil = load_image_for_embedding(chunk["path"])
            if image_pil:
                embedding_vector = await get_gemini_multimodal_embedding("image", image_pil)
            else:
                print(f"Skipping embedding for corrupted image: {chunk['path']}")
                continue  # Skip if image couldn't be loaded
        if embedding_vector:
            embedded_chunks.append({
                "chunk_id": f"chunk_{i}",  # Unique ID for retrieval
                "embedding": embedding_vector,
                # Image chunks carry a description rather than raw content
                "original_content": chunk.get("content") or chunk.get("content_description", ""),
                "path": chunk.get("path"),  # Store image path
                "audio_url": chunk.get("audio_url"),  # Store audio URL
                "type": chunk["type"],
                "metadata": chunk["metadata"],
            })
    return embedded_chunks

import asyncio

async def main_embedding():
    embedded = await embed_multimodal_chunks(processed_multimodal_chunks)
    print(f"Embedded {len(embedded)} chunks.")
    print(embedded[0])
    return embedded

embedded_data = asyncio.run(main_embedding())
Step 4: Vector Database Storage
Store your chunk_id, embedding, and metadata (including original content, path, URL) in a vector database.
- Dedicated vector database for production: Pinecone, Weaviate, Chroma, Qdrant, Milvus. These are optimized for vector search at scale and let you store vectors along with associated metadata, which is crucial for RAG.
- Local vector store for development/testing: LangChain’s FAISS or Chroma (in-memory/local persistent) are good for quick prototypes.
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import google.generativeai as genai

# Configure Gemini for embedding
genai.configure(api_key="YOUR_GEMINI_API_KEY")  # Ensure API key is set
embeddings_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

def store_in_vector_db(embedded_chunks, db_path="./multimodal_chroma_db"):
    documents_for_chroma = []
    metadatas_for_chroma = []
    ids_for_chroma = []
    for chunk in embedded_chunks:
        # Every chunk is stored via a text stand-in: the content itself for
        # text/table/audio chunks, the stored description for image chunks.
        text_content = chunk["original_content"]
        documents_for_chroma.append(text_content)
        ids_for_chroma.append(chunk["chunk_id"])
        metadata = {
            "type": chunk["type"],
            "page": chunk["metadata"].get("page"),
            "original_source": chunk["metadata"].get("original_source"),
        }
        if chunk.get("path"):  # For images
            metadata["image_path"] = chunk["path"]
        if chunk.get("audio_url"):  # For audio
            metadata["audio_url"] = chunk["audio_url"]
        metadatas_for_chroma.append(metadata)
    # Initialize the Chroma collection
    db = Chroma(
        embedding_function=embeddings_model,
        persist_directory=db_path,
    )
    # add_texts embeds the documents with the configured embedding_function.
    # (To insert precomputed vectors instead, use the raw chromadb client,
    # whose collection.add() accepts an embeddings= argument.)
    db.add_texts(
        texts=documents_for_chroma,
        metadatas=metadatas_for_chroma,
        ids=ids_for_chroma,
    )
    db.persist()
    print(f"Stored {len(embedded_chunks)} chunks in Chroma DB at {db_path}")
    return db

db = store_in_vector_db(embedded_data)
db = store_in_vector_db(embedded_data)
Step 5: Multimodal Retrieval
When a user asks a question, you’ll embed their query and use it to search your vector database.
- Embed the user’s natural language query using the same Gemini text embedding model.
- Perform a similarity search in the vector database to retrieve the most relevant chunks (which could be text, image descriptions, or audio transcriptions).
- If the user’s query itself includes an image or audio, you can embed the entire query (text + image/audio) using Gemini’s multimodal embedding. This allows for truly multimodal search.
import google.generativeai as genai
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma

# Ensure Gemini is configured
genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Load the embeddings model for querying
query_embeddings_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

def retrieve_multimodal_chunks(user_query, db, top_k=5):
    retrieved_docs = db.similarity_search(user_query, k=top_k)
    retrieved_multimodal_content = []
    for doc in retrieved_docs:
        content_item = {
            "type": doc.metadata.get("type"),
            "content": doc.page_content,
            "metadata": doc.metadata,
        }
        if doc.metadata.get("image_path"):
            content_item["path"] = doc.metadata["image_path"]
        if doc.metadata.get("audio_url"):
            content_item["audio_url"] = doc.metadata["audio_url"]
        retrieved_multimodal_content.append(content_item)
    return retrieved_multimodal_content

db = Chroma(embedding_function=query_embeddings_model, persist_directory="./multimodal_chroma_db")
user_query = "Explain the neural network architecture shown in the diagram and its audio properties."
retrieved_info = retrieve_multimodal_chunks(user_query, db)
print(f"Retrieved {len(retrieved_info)} relevant chunks.")
for item in retrieved_info:
    print(f"Type: {item['type']}, Content (excerpt): {item['content'][:100]}..., Path/URL: {item.get('path') or item.get('audio_url')}")
Step 6: Multimodal LLM Integration (with Gemini)
Finally, pass the retrieved multimodal chunks to Gemini for reasoning and generation.
- Gemini can directly accept a list of text strings and PIL Image objects. For audio, you’ll provide the transcribed text from the relevant audio chunks.
- Construct a prompt that incorporates the user’s query and the retrieved content.
- If the LLM doesn’t directly support multimodal input, you’d convert all retrieved modalities to text (e.g., image captions, audio transcriptions, table summaries) and then pass only text to the LLM. You’d then need to explicitly instruct the LLM to refer to the original assets (by path/URL) in its response. Gemini supports direct multimodal input, making this less necessary.
import google.generativeai as genai
from PIL import Image
import asyncio

# Configure Gemini
genai.configure(api_key="YOUR_GEMINI_API_KEY")

async def generate_response_with_gemini_multimodal(user_query, retrieved_multimodal_content):
    model = genai.GenerativeModel('gemini-pro-vision')
    prompt_parts = [
        f"User's question: {user_query}\n\n",
        "Here is some relevant information from my knowledge base:\n\n",
    ]
    for item in retrieved_multimodal_content:
        if item["type"] in ["text", "table_text", "audio_text"]:
            prompt_parts.append(f"<{item['type']}_chunk>\n{item['content']}\n</{item['type']}_chunk>\n\n")
        elif item["type"] == "image" and item.get("path"):
            try:
                image_pil = Image.open(item["path"])
                prompt_parts.append(image_pil)
                prompt_parts.append(f"\nDescription of image from knowledge base: {item['content']}\n\n")
            except Exception as e:
                print(f"Warning: Could not load image {item['path']} for prompt: {e}")
                prompt_parts.append(f"[Image content at {item['path']} could not be loaded. Description: {item['content']}]\n\n")
        if item.get("audio_url"):
            prompt_parts.append(f"(Associated audio clip: {item['audio_url']})\n\n")
    prompt_parts.append("Please provide a comprehensive answer to the user's question based on the provided information, referencing details from the text, images, and audio if relevant.")
    try:
        # generate_content_async is the awaitable variant of generate_content
        response = await model.generate_content_async(prompt_parts)
        return response.text
    except Exception as e:
        print(f"Error generating content with Gemini: {e}")
        return "Sorry, I could not generate a response at this time."

async def main_generation():
    final_answer = await generate_response_with_gemini_multimodal(user_query, retrieved_info)
    print("\n--- Final Answer from LLM ---")
    print(final_answer)

asyncio.run(main_generation())
Evaluation Metrics
Evaluating a Multimodal LLM RAG application is a complex task because you need to assess performance across multiple dimensions: the quality of retrieval, the quality of generation, and critically, how well the system integrates and leverages all modalities.
Here’s a breakdown of evaluation metrics, categorized for clarity:
I. Retrieval Metrics
These metrics assess how well your system retrieves relevant multimodal documents or chunks based on a query. The challenge here is defining “relevance” when the query might be multimodal and the retrieved items are also multimodal.
A. Core Retrieval Metrics:
These are standard IR metrics, but adapted to consider the type and relevance of the retrieved multimodal chunk.
1. Precision@k: Out of the top k retrieved chunks, how many are truly relevant to the query?
A retrieved chunk might be relevant if its text, image, or audio content contributes to answering the query. You’d need human annotators to label relevance across modalities.
Ex: For “Show me dogs playing in water,” if the top 3 retrieved items are an image of a dog swimming, a text passage about water safety for pets, and an audio clip of splashing, you’d assess each for relevance.
2. Recall@k: Out of all truly relevant chunks in the knowledge base, how many were retrieved in the top k?
Requires a comprehensive ground truth of all relevant multimodal chunks for a given query. Challenging to create.
3. F1-Score: The harmonic mean of precision and recall.
4. Mean Average Precision (MAP): A popular metric that considers the order of relevant documents. If a highly relevant document is retrieved early, it scores higher.
5. Normalized Discounted Cumulative Gain (NDCG@k): More sophisticated than MAP. Assigns graded relevance scores (e.g., 0=irrelevant, 1=somewhat relevant, 2=highly relevant) and discounts relevance based on position.
Allows you to give partial credit if, for instance, an image is somewhat relevant, but an accompanying text description is highly relevant.
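Once relevance labels exist, these metrics are simple to compute. A minimal sketch (the chunk ids and graded labels are illustrative; in practice they come from human annotation):

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top k."""
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / len(relevant_ids)

def ndcg_at_k(retrieved_ids, graded_relevance, k):
    """NDCG@k with graded relevance (e.g., 0=irrelevant, 1=somewhat, 2=highly)."""
    gains = [graded_relevance.get(r, 0) for r in retrieved_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For multimodal retrieval, the ids can point at chunks of any modality; the grading step is where annotators decide how much an image, transcript, or text passage actually contributes to answering the query.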
B. Modality-Specific Retrieval Relevance:
1. Text Generation Evaluation
These metrics assess the quality of the natural language response generated by the LLM, often by comparing it to a reference (ground truth) answer.
- Exact Match (EM): A binary metric indicating whether the generated answer is an exact match to the reference answer.
Simple but strict. Useful for factual questions with very precise answers. In RAG, it indicates whether the LLM correctly extracted and formatted the exact information.
Limitation: Fails to capture semantic similarity when the wording differs but the meaning is the same.
- BLEU (Bilingual Evaluation Understudy): Compares n-grams (sequences of words) between the generated text and reference text. Higher scores indicate more overlap and similarity.
Good for assessing fluency and how well the generated answer captures key phrases from the relevant retrieved content.
Limitation: Primarily focuses on lexical overlap, less on semantic meaning.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics (ROUGE-N, ROUGE-L, ROUGE-S) that measure overlap of n-grams, longest common subsequences, or skip-bigrams, typically focusing on recall (how much of the reference is covered by the generated text).
Particularly useful for summarization tasks or when the generated answer is expected to cover a broad range of facts from retrieved documents.
MultiRAGen uses Multilingual ROUGE: this is crucial for multilingual RAG systems. It extends ROUGE to compare texts in different languages or to assess cross-lingual summarization, ensuring that semantic content is preserved regardless of the language of generation.
- METEOR (Metric for Evaluation of Translation with Explicit Ordering): Beyond n-gram overlap, METEOR incorporates synonyms, stemming, and paraphrasing, and considers word order.
Provides a more robust assessment of semantic similarity and fluency than BLEU or ROUGE.
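At their core, BLEU and ROUGE-N are n-gram precision and recall. A minimal pure-Python sketch of both (single reference, no brevity penalty or smoothing; real evaluations should use established libraries such as sacrebleu or rouge-score):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """BLEU-style modified n-gram precision against a single reference."""
    cand, ref = ngrams(candidate.lower().split(), n), ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts
    total = sum(cand.values())
    return overlap / total if total else 0.0

def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: how much of the reference the candidate covers."""
    cand, ref = ngrams(candidate.lower().split(), n), ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

In a RAG context, the "reference" would be a gold-standard answer, and the candidate is the LLM's generated response.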
2. Multimodal Output Evaluation (for Modality-Specific Outputs)
While your primary LLM output might be text, if the multimodal LLM generates images or audio (e.g., image captioning, sound generation from text), these metrics become relevant.
For Image Captioning (e.g., when the LLM generates a caption from an image):
- CIDEr (Consensus-Based Image Description Evaluation): Measures the similarity between a generated caption and a set of human reference captions. It uses TF-IDF (Term Frequency-Inverse Document Frequency) weighting for n-grams, giving more weight to rare but important words.
If your MLLM is part of a system that generates descriptions of retrieved images, CIDEr helps assess the quality of those descriptions.
- SPICE (Semantic Propositional Image Caption Evaluation): Focuses on the semantic content of the caption. It parses captions into semantic “scene graphs” (objects, attributes, relationships) and compares the graphs.
Critical for ensuring the LLM’s captions correctly identify and relate visual elements, which is vital for faithful multimodal understanding.
- SPIDEr (combines CIDEr and SPICE): Often a preferred metric for image captioning tasks, as it balances lexical overlap (CIDEr) with semantic accuracy (SPICE).
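CIDEr's key idea is a TF-IDF-weighted n-gram cosine similarity, so common n-grams (shared across most references) count for little while rare, informative ones dominate. A heavily simplified sketch of that idea (single n-gram order, smoothed IDF; the real metric averages n = 1..4 over many reference captions per image):

```python
import math
from collections import Counter

def _ngram_counts(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def _tfidf_vec(counts, df, num_refs):
    """TF-IDF weights: frequent-in-corpus n-grams get ~0 weight."""
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log((1 + num_refs) / (1 + df[g]))
            for g, c in counts.items()}

def _cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, n=1):
    """Average TF-IDF-weighted n-gram cosine between candidate and each reference."""
    df = Counter()
    for ref in references:
        df.update(set(_ngram_counts(ref, n)))
    num = len(references)
    cand_vec = _tfidf_vec(_ngram_counts(candidate, n), df, num)
    return sum(_cosine(cand_vec, _tfidf_vec(_ngram_counts(r, n), df, num))
               for r in references) / num
```

Note how a candidate that matches one reference exactly still scores below 1.0 when other references disagree; CIDEr rewards consensus across human captions.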
3. Semantic Alignment / Cross-Modal Consistency
These metrics assess how well the different modalities (e.g., text and image, text and audio) are understood in relation to each other, both during retrieval and reasoning.
- BERTScore: Compares contextualized embeddings (from BERT or similar models) of the generated text and the reference text/retrieved content. It essentially measures semantic similarity.
Excellent for evaluating the fluency and semantic quality of the generated response. Also, it can be used to assess the semantic overlap between a text query and the textual components (transcriptions, descriptions) of retrieved multimodal chunks.
It matches words in candidate and reference sentences, computes cosine similarity between their BERT embeddings, and then calculates precision, recall, and F1 based on these similarities.
- CLIP Score: Measures the semantic similarity between an image and a text description using the embeddings from the CLIP (Contrastive Language–Image Pre-training) model.
Retrieval: Assess how well retrieved images semantically match a text query, or vice-versa.
Generation: If your MLLM generates image descriptions, CLIP Score evaluates how semantically aligned those descriptions are with the image itself. It’s a strong proxy for image-text correspondence.
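Both BERTScore and CLIP Score ultimately reduce to cosine similarity between embedding vectors. A minimal sketch of the CLIP-score computation, assuming you already have an image embedding and a text embedding (toy vectors here; in practice they come from the CLIP image and text encoders, and the rescaling weight, commonly 2.5, is a convention rather than a law):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-score convention: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)
```

For retrieval, you would rank candidate images by `cosine_similarity` against the query's text embedding; for generation, you would score a generated caption against its source image.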
4. Image Quality (if the LLM generates or processes images directly)
These metrics focus on the perceptual quality and diversity of generated or processed images.
- FID (Fréchet Inception Distance): Quantifies the similarity between the feature distributions of generated images and real images using features extracted from a pre-trained Inception-v3 network. Lower FID scores indicate higher quality and similarity to real images.
If your MLLM has generative image capabilities (e.g., generating images based on retrieved text descriptions), FID assesses the realism of those generated images.
- KID (Kernel Inception Distance): Provides an unbiased and more robust alternative to FID, using a polynomial kernel on Inception features. Less prone to mode-collapse issues.
Similar to FID, for evaluating the perceptual quality of generated images.
- Inception Score (IS): Evaluates image diversity and quality by measuring the KL divergence between the conditional class distribution of generated images and the marginal class distribution. It uses a pre-trained Inception model to classify images. Higher IS implies higher quality (classifiable images) and diversity (images belonging to different classes).
To ensure generated images are both realistic and varied.
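FID is the Fréchet distance between two Gaussians fitted to feature distributions: ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^½). In one dimension this collapses to (μ1 − μ2)² + (σ1 − σ2)². A toy sketch for scalar features (real FID uses multi-dimensional Inception-v3 features and matrix square roots, e.g., via scipy):

```python
import math

def fid_1d(features_real, features_gen):
    """Fréchet distance between two 1-D Gaussian feature distributions.

    1-D special case of FID:
        (mu1 - mu2)^2 + var1 + var2 - 2 * sqrt(var1 * var2)
    which equals (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    """
    def stats(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, var

    mu1, v1 = stats(features_real)
    mu2, v2 = stats(features_gen)
    return (mu1 - mu2) ** 2 + v1 + v2 - 2 * math.sqrt(v1 * v2)
```

FAD (below) applies the same formula to audio embeddings instead of image features; only the feature extractor changes.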
5. Audio Evaluation (if the LLM processes or generates audio directly)
These metrics assess the quality and relevance of audio.
- Human Annotators for Sound Quality (OVL — Overall Quality Level) and Text Relevance (REL — Relevance to Text): Subjective assessment by humans. OVL rates the perceptual quality of the sound (clarity, absence of noise, naturalness). REL rates how well the audio content corresponds to a given text (e.g., a query or a generated transcription).
Essential for truly understanding the user experience for audio, as automated metrics can miss nuances. In RAG, you might use OVL to ensure the audio retrieved is clear enough for the LLM to process, and REL to ensure the audio content is semantically aligned with the text context.
- Fréchet Audio Distance (FAD): An audio-specific variant of FID. It quantifies the similarity between the feature distributions of generated audio samples and real audio samples using features extracted from a pre-trained audio classification model (e.g., VGGish). Lower FAD indicates higher quality and similarity to real audio.
If your MLLM generates audio, FAD assesses the realism and quality of that generated audio.
II. Generation Metrics
These metrics focus on the quality of the generated answer and its relationship to the retrieved context, specifically considering the multimodal nature of the information.
- Correctness: This is the most straightforward and critical metric. It assesses whether the LLM’s generated answer is factually accurate and provides the correct information in response to the user’s query.
In a multimodal RAG setting, “correctness” means the answer accurately reflects information present in any of the modalities of the retrieved chunks (text, images, audio transcription).
For example, if a user asks “What is the color of the car?”, and a retrieved image clearly shows a red car, the correct answer is “red.”
If an audio clip confirms a date, the answer should reflect that date.
- Relevancy: This assesses whether the generated answer directly addresses the user’s question and provides useful, pertinent information, without including superfluous or off-topic details.
The answer’s relevance is judged in the context of the user’s multimodal intent. If the user provides an image and asks about a specific object within it, the answer must be relevant to that object, leveraging both the visual information and the textual query.
- Text Faithfulness: This assesses whether the generated answer is solely based on and consistent with the information found within the textual parts of the retrieved chunks. It’s about preventing “hallucinations,” i.e., making up information not present in the textual sources.
In a multimodal RAG system, this metric specifically isolates faithfulness to text. It helps determine whether the LLM is correctly extracting and summarizing information from text passages (e.g., from transcribed audio, document text, table summaries). If the text says “The cat sat on the mat,” but the LLM says “The cat sat on the bed,” it fails text faithfulness.
- Image Faithfulness: This assesses whether the generated answer is factually supported by and consistent with the information presented in the retrieved images. It ensures the LLM doesn’t misinterpret or hallucinate details from the visual content.
This is crucial for multimodal RAG. If a retrieved image shows “a blue bird,” but the LLM states “a green bird,” it fails image faithfulness. It directly checks the LLM’s ability to accurately perceive and use visual information.
- Text Context Relevancy: This assesses how relevant the retrieved text chunks (including transcribed audio and table summaries) are to the user’s query. It’s a metric for the retrieval component’s accuracy on textual sources.
While the user’s query might be multimodal (e.g., text + image), this metric specifically checks whether the system pulled the right text documents or chunks for answering. If the text chunk describes “water pumps” but the query is about “faucets,” relevancy is low.
- Image Context Relevancy: This assesses how relevant the retrieved image chunks are to the user’s query. It evaluates the retrieval component’s ability to fetch appropriate visual evidence.
If the user asks about “bike gears” and the system retrieves an image of a bike tire, image context relevancy is low. If it retrieves a detailed diagram of a derailleur, it’s high. This is where your multimodal embedding for images (e.g., via Gemini or CLIP) is crucial.
III. System-Level / End-to-End Metrics
These assess the overall user experience and performance of the entire RAG system.
- Response Time (Latency): How quickly does the system provide an answer? (Multimodal processing can be more intensive).
- Robustness: How well does the system handle noisy inputs (blurry images, muffled audio), ambiguous queries, or missing modalities?
- User Satisfaction: The ultimate metric, often gathered via surveys or implicit feedback (e.g., re-query rate).
- Cost-Effectiveness: The computational cost of running the multimodal encoders, embeddings, and LLM inference.
IV. Evaluation Methodologies
- Human Evaluation (Gold Standard): For nuanced aspects like multimodal coherence, faithfulness, and integration quality, human annotators are often indispensable. They can assess relevance, correctness, and overall answer quality.
- LLM-as-a-Judge: Using a powerful LLM to evaluate the responses of your RAG system based on predefined criteria and the retrieved context. This can automate some aspects of human evaluation but requires careful prompt engineering.
- Quantitative Benchmarks: For retrieval, creating datasets with multimodal queries and rigorously labeled relevant multimodal documents is key. For generation, creating gold-standard answers for comparison.
- A/B Testing: For practical applications, comparing different RAG strategies or models by deploying them to different user groups and measuring engagement, task completion rates, or user satisfaction.
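The LLM-as-a-judge approach above is largely careful prompt engineering. A minimal sketch of a judge-prompt builder (the function name, rubric, and JSON output format are illustrative assumptions, not a standard; the resulting string would be sent to a strong model such as Gemini and its JSON parsed into scores):

```python
def build_judge_prompt(query, retrieved_context, answer,
                       criteria=("correctness", "relevancy", "faithfulness")):
    """Assemble an evaluation prompt for a judge LLM.

    The judge scores the RAG answer against the retrieved context on
    each criterion (1-5) with a short justification, returning JSON so
    the scores can be aggregated programmatically.
    """
    rubric = "\n".join(
        f"- {c}: score 1 (poor) to 5 (excellent), with a one-sentence justification"
        for c in criteria
    )
    return (
        "You are an impartial evaluator of a RAG system.\n\n"
        f"User query:\n{query}\n\n"
        f"Retrieved context (text surrogates of all modalities):\n{retrieved_context}\n\n"
        f"System answer:\n{answer}\n\n"
        "Evaluate the answer on these criteria:\n"
        f"{rubric}\n\n"
        'Respond with JSON only, e.g. {"correctness": {"score": 4, "reason": "..."}, ...}'
    )
```

For multimodal cases, the retrieved context passed to the judge would include image captions and audio transcriptions, since most judge LLM setups score against a textual rendering of the evidence.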
I appreciate you taking the time to read this piece. If you found it insightful, a share with your friends, a clap, or a comment would be greatly appreciated and would help me improve future content.
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.