🚀 Beyond Text: Building Multimodal RAG Systems with Cohere and Gemini
Last Updated on May 6, 2025 by Editorial Team
Author(s): Sridhar Sampath
Originally published on Towards AI.
TL;DR
Traditional RAG fails on visual data. This project uses Cohere’s multimodal embeddings + Gemini 2.5 Flash to build a RAG system that understands both text and images — enabling accurate answers from charts, tables, and visuals inside PDFs.
📉 The Problem: Traditional RAG's Visual Blind Spot
Traditional Retrieval-Augmented Generation (RAG) systems rely on text embeddings to retrieve information from documents. But what if your most valuable insights are hidden in charts, tables, and images?
Whether you’re analyzing financial PDFs, investment research reports, or market slides, much of the relevant information lives in visuals:
- Numerical breakdowns in pie/bar charts (e.g., portfolio allocations)
- Trend visualizations in line graphs (e.g., market performance)
- Structured data in complex tables (e.g., comparison matrices)
- Process flows in diagrams (e.g., system architectures)
- Spatial relationships in maps or layouts
A purely text-based approach fails to capture this crucial layer of information.
💡 The Solution: Multimodal RAG
Multimodal RAG augments traditional RAG by combining text and image understanding. This approach enables:
🔍 Image + Text search from the same document
🧠 Unified vector index with mixed modality support
🤖 Context-aware answers via Gemini using either matched text or matched image
🔧 Key Technologies
- Cohere’s Embed v4.0: Embeds both text and images in the same vector space
- Gemini 2.5 Flash: Processes queries with context (text or image) to generate factual, human-like responses
- FAISS: Indexes and searches vectors from both modalities efficiently; the library supports both exact and approximate nearest-neighbor search (a small sketch follows this list)
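To make the FAISS piece concrete, here is a small, self-contained sketch (independent of Cohere, with random vectors standing in for real embeddings) comparing the exact index used later in this post with an approximate IVF index; the 1024 dimension matches the Embed v4.0 vector size used below.

import faiss
import numpy as np

dim = 1024                                              # Embed v4.0 vector size used later in this post
vectors = np.random.rand(1000, dim).astype(np.float32)  # stand-ins for document embeddings

# Exact search: brute-force L2 distance over every stored vector
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# Approximate search: cluster vectors into cells and probe only a few per query
nlist = 32                                              # number of clusters
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf.train(vectors)                                      # IVF indexes must be trained before adding
ivf.add(vectors)
ivf.nprobe = 4                                          # probe 4 of the 32 cells per query

query = np.random.rand(1, dim).astype(np.float32)
print(flat.search(query, k=3))                          # exact nearest neighbors
print(ivf.search(query, k=3))                           # approximate, faster at scale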
🧭 End-to-End Multimodal RAG Workflow
Below is the high-level system flow for the Multimodal RAG pipeline:
📌 From PDF upload to image/text embedding, vector search, and Gemini-powered answer generation — everything is stitched together using Streamlit, Cohere, FAISS, and Gemini 2.5 Flash.
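As a rough Streamlit skeleton of that stitching (a sketch only: the embedding, FAISS search, and Gemini call are left as comments here because they appear in the snippets below):

import streamlit as st
from pdf2image import convert_from_bytes

st.title("Multimodal RAG with Cohere + Gemini")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Render every page as a PIL image, ready for multimodal embedding
    pages = convert_from_bytes(uploaded.read(), dpi=200)
    st.success(f"Converted {len(pages)} pages")

    query = st.text_input("Ask a question about the document")
    if query:
        # In the full app: embed the query with Cohere, search FAISS,
        # then pass the matched page image (or text chunk) to Gemini 2.5 Flash
        st.image(pages[0], caption="The retrieved page would be shown here")
        st.write("Gemini's answer would be rendered here")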
🎥 Multimodal RAG — Video Demo
Here’s a 9-minute visual walkthrough of the system in action:
See it live: charts being analyzed, tables being interpreted, and complex visuals being understood, all in real time!
Architecture Comparison
🖼️ Multimodal RAG Architecture
In this pipeline, both the text and each page image are embedded using Cohere, stored in FAISS, and served as context to Gemini 2.5 Flash. This allows questions grounded in visuals to be answered — something traditional RAG setups can’t handle.
📝 Text-Only RAG Architecture
This approach extracts text from the PDF, embeds it, and uses it for retrieval — but completely misses information embedded inside charts or graphics.
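For contrast, a minimal sketch of that text-only path (assuming pypdf for extraction; the page-level chunking here is deliberately naive):

import cohere
import faiss
import numpy as np
from pypdf import PdfReader

co = cohere.Client("your-cohere-key")

# Extract raw text page by page; anything drawn as a chart or image is simply lost here
reader = PdfReader("your.pdf")
chunks = [page.extract_text() or "" for page in reader.pages]

# Embed and index the text chunks only
embeddings = co.embed(input_type="search_document", texts=chunks).embeddings
index = faiss.IndexFlatL2(len(embeddings[0]))
index.add(np.array(embeddings, dtype=np.float32))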
Results: Side-by-Side Comparison
We tested both Text-Only and Multimodal RAG apps on the same ETF PDF document:
The results are clear: Text-only RAG struggled with questions grounded in visual data, while Multimodal RAG handled image-based content effectively.

Code Walkthrough: Multimodal Processing
💻 Full Source Code: GitHub Repository
1. PDF to Image Conversion
import pdf2image

# Render each PDF page as a PIL image (200 DPI)
images = pdf2image.convert_from_path(pdf_path, dpi=200)
This gives us a list of page-wise PIL images, which are embedded next.
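It can also help to keep page numbers alongside the images at this stage, so a retrieved vector can later be traced back to its page (a tiny sketch, not the repository's exact structure):

# Keep the 1-based page number next to each rendered page image
pages = list(enumerate(images, start=1))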
2. Embedding with Cohere
if content_type == "text":
    response = cohere.embed(input_type="search_document", texts=[text])
else:
    base64_img = convert_image_to_base64(image)
    response = cohere.embed(
        input_type="search_document",
        inputs=[{"content": [{"type": "image", "image": base64_img}]}]
    )
The output is added to FAISS as a float32 vector.
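Concretely, the bookkeeping can look roughly like this (a sketch continuing the snippet above; the contents list is one simple way to map a FAISS row back to the original text or page image, which is what gets handed to Gemini in the next step):

import faiss
import numpy as np

index = faiss.IndexFlatL2(1024)      # Embed v4.0 vectors are 1024-dimensional
contents = []                        # row i in the index corresponds to contents[i]

vector = np.array([response.embeddings[0]], dtype=np.float32)
index.add(vector)
contents.append(text if content_type == "text" else image)   # what Gemini will see later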
3. Gemini Answering Logic
if isinstance(content, Image.Image):
    response = gemini.generate_content([query, content])
else:
    response = gemini.generate_content(f"Question: {query}\n\nContext: {text}")
Gemini 2.5 Flash intelligently parses charts, titles, and layouts.
🚀 Getting Started — Minimal Example
Here’s a compact script to get you up and running with multimodal RAG using Cohere + Gemini:
⚠️ Note: This is a minimal gist to demonstrate the core flow. The full working code with UI, modular structure, and search logic is available in the [GitHub repository](github.com/SridharSampath/multimodal-rag-demo).
import base64
import io

import cohere
import faiss
import numpy as np
import google.generativeai as genai
from google.generativeai import GenerativeModel
from pdf2image import convert_from_path

# Initialize APIs
co = cohere.Client("your-cohere-key")
genai.configure(api_key="your-gemini-key")
gemini = GenerativeModel("gemini-2.5-flash")

# Convert PDF pages to images
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, dpi=200)

# Create an embedding for a text string or a PIL page image
def get_embedding(content, content_type="text"):
    if content_type == "text":
        response = co.embed(input_type="search_document", texts=[content])
    else:
        # Encode the PIL image as a base64 PNG string for the multimodal embed call
        buffer = io.BytesIO()
        content.save(buffer, format="PNG")
        base64_img = base64.b64encode(buffer.getvalue()).decode("utf-8")
        response = co.embed(
            input_type="search_document",
            inputs=[{"content": [{"type": "image", "image": base64_img}]}]
        )
    return response.embeddings[0]

# Index every page image
dimension = 1024
index = faiss.IndexFlatL2(dimension)
images = pdf_to_images("your.pdf")
for img in images:
    index.add(np.array([get_embedding(img, "image")], dtype=np.float32))

# Retrieve the closest page and let Gemini answer from it
def answer_query(query):
    query_emb = get_embedding(query)
    D, I = index.search(np.array([query_emb], dtype=np.float32), k=1)
    result = images[I[0][0]]
    return gemini.generate_content([query, result]).text
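With the script above in place (and valid API keys plus a your.pdf in the working directory), querying is a one-liner:

if __name__ == "__main__":
    print(answer_query("What is the AUM of Invesco?"))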
⚙️ Project Setup
What You’ll Need
🔑 API Keys:
- Cohere embed-v4.0 → Create Cohere Account
- Gemini 2.5 Flash → Try Gemini on Google AI Studio
💻 System Requirements:
- Python 3.8+
- Poppler (for PDF image conversion)

# Clone repository
git clone https://github.com/SridharSampath/multimodal-rag-demo
cd multimodal-rag-demo
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py
⚠️ System Dependency: Poppler
This project uses pdf2image to convert PDF pages into images, which requires Poppler:
Windows:
- Download from GitHub — Poppler Windows Releases
- Extract to a folder like C:\poppler
- Add C:\poppler\Library\bin to your system’s PATH
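On macOS and Linux, the standard package managers cover this (package names may vary slightly by distribution):
macOS (Homebrew):
- brew install poppler
Linux (Debian/Ubuntu):
- sudo apt-get install poppler-utils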
🧪 Demo Screenshots — Multimodal vs. Text-Only RAG
Visual comparison of the same queries across two apps:
1. ❓ Query: “What is AUM of Invesco?”
Multimodal App: Found in the bar chart
Text-Only App: Missed (text doesn’t mention it)
2. ❓ Query: “How much did BlackRock earn through Technology services?”
Multimodal App: Pulled the value from the BlackRock income statement image
Text-Only App: Missed (text doesn’t mention it)
3. 🍎 Query: “How much Percentage is Apple in S&P?”
Multimodal App: Found in pie chart
Text-Only App: Gave approximate data
4. 🦠 Query: “During Covid pandemic what was the top 10 weight in S&P 500?”
Multimodal App: Parsed timeline chart
Text-Only App: Missed specific figure
5. 💰 Query: “How to track Bitcoin in ETFs?”
Multimodal App: Found in the table image
Text-Only App: Missed specific figure
⚠️ Limitations and Considerations
While multimodal RAG offers significant advantages, be aware of:
- Computational overhead — Processing and embedding images requires more resources
- API costs — Multimodal embedding APIs typically cost more than text-only equivalents
- OCR dependency — Chart text recognition still relies on OCR quality
- Image resolution impact — Low-resolution images may reduce embedding quality
- Complex visualization challenges — Very complex visualizations might still be misinterpreted
Resources & Reference Links
- 🔗 GitHub Repository
- 🎥 Demo Video
- 📚 Cohere Documentation
- 📘 Gemini Documentation
- Multimodal (RAG) with Gemini
- Cohere Multimodal embeddings
🙌 Closing Thoughts
If you’re building LLM apps for financial document QA, research assistant bots, or compliance analytics, you need to look beyond just text. Multimodal RAG delivers context-aware, image-inclusive, and LLM-optimized retrieval that can extract insights from your entire document ecosystem, not just the textual components.
Try it out and let me know your thoughts!
🚀 Let’s Connect!
If you found this useful, feel free to connect with me:
🔗 LinkedIn — Sridhar Sampath
🔗 Medium Blog
🔗 GitHub Repository
✨ End
Originally published at https://sridhartech.hashnode.dev on May 3, 2025.