🚀 Beyond Text: Building Multimodal RAG Systems with Cohere and Gemini
Last Updated on May 6, 2025 by Editorial Team
Author(s): Sridhar Sampath
Originally published on Towards AI.
TL;DR
Traditional RAG fails on visual data. This project uses Cohere’s multimodal embeddings + Gemini 2.5 Flash to build a RAG system that understands both text and images — enabling accurate answers from charts, tables, and visuals inside PDFs.
📉 The Problem: Traditional RAG's Visual Blind Spot
Traditional Retrieval-Augmented Generation (RAG) systems rely on text embeddings to retrieve information from documents. But what if your most valuable insights are hidden in charts, tables, and images?
Whether you’re analyzing financial PDFs, investment research reports, or market slides, much of the relevant information lives in visuals:
- Numerical breakdowns in pie/bar charts (e.g., portfolio allocations)
- Trend visualizations in line graphs (e.g., market performance)
- Structured data in complex tables (e.g., comparison matrices)
- Process flows in diagrams (e.g., system architectures)
- Spatial relationships in maps or layouts
A purely text-based approach fails to capture this crucial layer of information.
💡 The Solution: Multimodal RAG
Multimodal RAG augments traditional RAG by combining text and image understanding. This approach enables:
🔍 Image + Text search from the same document
🧠 Unified vector index with mixed modality support
🤖 Context-aware answers via Gemini using either matched text or matched image
🔧 Key Technologies
- Cohere’s Embed v4.0: Embeds both text and images in the same vector space
- Gemini 2.5 Flash: Processes queries with context (text or image) to generate factual, human-like responses
- FAISS: Indexes and searches vectors from both modalities efficiently; the library supports both exact and approximate nearest-neighbor search (a small sketch follows this list)
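To make the FAISS piece concrete, here is a small, self-contained sketch (independent of Cohere, with random vectors standing in for real embeddings) comparing the exact index used later in this post with an approximate IVF index; the 1024 dimension matches the Embed v4.0 vector size used below.

import faiss
import numpy as np

dim = 1024                                              # Embed v4.0 vector size used later in this post
vectors = np.random.rand(1000, dim).astype(np.float32)  # stand-ins for document embeddings

# Exact search: brute-force L2 distance over every stored vector
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# Approximate search: cluster vectors into cells and probe only a few per query
nlist = 32                                              # number of clusters
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf.train(vectors)                                      # IVF indexes must be trained before adding
ivf.add(vectors)
ivf.nprobe = 4                                          # probe 4 of the 32 cells per query

query = np.random.rand(1, dim).astype(np.float32)
print(flat.search(query, k=3))                          # exact nearest neighbors
print(ivf.search(query, k=3))                           # approximate, faster at scale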
🧭 End-to-End Multimodal RAG Workflow
Below is the high-level system flow for the Multimodal RAG pipeline:
📌 From PDF upload to image/text embedding, vector search, and Gemini-powered answer generation — everything is stitched together using Streamlit, Cohere, FAISS, and Gemini 2.5 Flash.
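As a rough Streamlit skeleton of that stitching (a sketch only: the embedding, FAISS search, and Gemini call are left as comments here because they appear in the snippets below):

import streamlit as st
from pdf2image import convert_from_bytes

st.title("Multimodal RAG with Cohere + Gemini")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Render every page as a PIL image, ready for multimodal embedding
    pages = convert_from_bytes(uploaded.read(), dpi=200)
    st.success(f"Converted {len(pages)} pages")

    query = st.text_input("Ask a question about the document")
    if query:
        # In the full app: embed the query with Cohere, search FAISS,
        # then pass the matched page image (or text chunk) to Gemini 2.5 Flash
        st.image(pages[0], caption="The retrieved page would be shown here")
        st.write("Gemini's answer would be rendered here")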
🎥 Multimodal RAG — Video Demo
Here’s a 9-minute visual walkthrough of the system in action:
See it live: charts being analyzed, tables being interpreted, and complex visuals being understood, all in real time!
Architecture Comparison
🖼️ Multimodal RAG Architecture
In this pipeline, both the text and each page image are embedded using Cohere, stored in FAISS, and served as context to Gemini 2.5 Flash. This allows questions grounded in visuals to be answered — something traditional RAG setups can’t handle.
📝 Text-Only RAG Architecture
This approach extracts text from the PDF, embeds it, and uses it for retrieval — but completely misses information embedded inside charts or graphics.
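For contrast, a minimal sketch of that text-only path (assuming pypdf for extraction; the page-level chunking here is deliberately naive):

import cohere
import faiss
import numpy as np
from pypdf import PdfReader

co = cohere.Client("your-cohere-key")

# Extract raw text page by page; anything drawn as a chart or image is simply lost here
reader = PdfReader("your.pdf")
chunks = [page.extract_text() or "" for page in reader.pages]

# Embed and index the text chunks only
embeddings = co.embed(input_type="search_document", texts=chunks).embeddings
index = faiss.IndexFlatL2(len(embeddings[0]))
index.add(np.array(embeddings, dtype=np.float32))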
Results: Side-by-Side Comparison
We tested both Text-Only and Multimodal RAG apps on the same ETF PDF document:
The results are clear: Text-only RAG struggled with questions grounded in visual data, while Multimodal RAG handled image-based content effectively.

Code Walkthrough: Multimodal Processing
💻 Full Source Code: GitHub Repository
1. PDF to Image Conversion
import pdf2image

# Render each PDF page as a PIL image (200 DPI)
images = pdf2image.convert_from_path(pdf_path, dpi=200)
This gives us a list of page-wise PIL images, which are embedded next.
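It can also help to keep page numbers alongside the images at this stage, so a retrieved vector can later be traced back to its page (a tiny sketch, not the repository's exact structure):

# Keep the 1-based page number next to each rendered page image
pages = list(enumerate(images, start=1))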
2. Embedding with Cohere
if content_type == "text":
    response = cohere.embed(input_type="search_document", texts=[text])
else:
    base64_img = convert_image_to_base64(image)
    response = cohere.embed(
        input_type="search_document",
        inputs=[{"content": [{"type": "image", "image": base64_img}]}]
    )
The output is added to FAISS as a float32 vector.
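Concretely, the bookkeeping can look roughly like this (a sketch continuing the snippet above; the contents list is one simple way to map a FAISS row back to the original text or page image, which is what gets handed to Gemini in the next step):

import faiss
import numpy as np

index = faiss.IndexFlatL2(1024)      # Embed v4.0 vectors are 1024-dimensional
contents = []                        # row i in the index corresponds to contents[i]

vector = np.array([response.embeddings[0]], dtype=np.float32)
index.add(vector)
contents.append(text if content_type == "text" else image)   # what Gemini will see later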
3. Gemini Answering Logic
if isinstance(content, Image.Image):
    response = gemini.generate_content([query, content])
else:
    response = gemini.generate_content(f"Question: {query}\n\nContext: {text}")
Gemini 2.5 Flash intelligently parses charts, titles, and layouts.
🚀 Getting Started — Minimal Example
Here’s a compact script to get you up and running with multimodal RAG using Cohere + Gemini:
⚠️ Note: This is a minimal gist to demonstrate the core flow. The full working code with UI, modular structure, and search logic is available in the [GitHub repository](github.com/SridharSampath/multimodal-rag-demo).
import base64
import io

import cohere
import faiss
import numpy as np
import google.generativeai as genai
from google.generativeai import GenerativeModel
from pdf2image import convert_from_path

# Initialize APIs
co = cohere.Client("your-cohere-key")
genai.configure(api_key="your-gemini-key")
gemini = GenerativeModel("gemini-2.5-flash")

# Convert PDF pages to images
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, dpi=200)

# Create an embedding for a text string or a PIL page image
def get_embedding(content, content_type="text"):
    if content_type == "text":
        response = co.embed(input_type="search_document", texts=[content])
    else:
        # Encode the PIL image as a base64 PNG string for the multimodal embed call
        buffer = io.BytesIO()
        content.save(buffer, format="PNG")
        base64_img = base64.b64encode(buffer.getvalue()).decode("utf-8")
        response = co.embed(
            input_type="search_document",
            inputs=[{"content": [{"type": "image", "image": base64_img}]}]
        )
    return response.embeddings[0]

# Index every page image
dimension = 1024
index = faiss.IndexFlatL2(dimension)
images = pdf_to_images("your.pdf")
for img in images:
    index.add(np.array([get_embedding(img, "image")], dtype=np.float32))

# Retrieve the closest page and let Gemini answer from it
def answer_query(query):
    query_emb = get_embedding(query)
    D, I = index.search(np.array([query_emb], dtype=np.float32), k=1)
    result = images[I[0][0]]
    return gemini.generate_content([query, result]).text
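With the script above in place (and valid API keys plus a your.pdf in the working directory), querying is a one-liner:

if __name__ == "__main__":
    print(answer_query("What is the AUM of Invesco?"))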
⚙️ Project Setup
What You’ll Need
🔑 API Keys:
- Cohere embed-v4.0 → Create Cohere Account
- Gemini 2.5 Flash → Try Gemini on Google AI Studio
💻 System Requirements:
- Python 3.8+
- Poppler (for PDF image conversion)

# Clone repository
git clone https://github.com/SridharSampath/multimodal-rag-demo
cd multimodal-rag-demo
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py
⚠️ System Dependency: Poppler
This project uses pdf2image to convert PDF pages into images, which requires Poppler:
Windows:
- Download from GitHub — Poppler Windows Releases
- Extract to a folder like C:\poppler
- Add C:\poppler\Library\bin to your system’s PATH
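On macOS and Linux, the standard package managers cover this (package names may vary slightly by distribution):
macOS (Homebrew):
- brew install poppler
Linux (Debian/Ubuntu):
- sudo apt-get install poppler-utils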
🧪 Demo Screenshots — Multimodal vs. Text-Only RAG
Visual comparison of the same queries across two apps:
1. ❓ Query: “What is AUM of Invesco?”
Multimodal App: Found in the bar chart
Text-Only App: Missed (text doesn’t mention it)
2. ❓ Query: “How much did BlackRock earn through Technology services?”
Multimodal App: Pulled the value from the BlackRock income statement image
Text-Only App: Missed (text doesn’t mention it)
3. 🍎 Query: “How much Percentage is Apple in S&P?”
Multimodal App: Found in pie chart
Text-Only App: Gave approximate data
4. 🦠 Query: “During Covid pandemic what was the top 10 weight in S&P 500?”
Multimodal App: Parsed timeline chart
Text-Only App: Missed specific figure
5. 💰 Query: “How to track Bitcoin in ETFs?”
Multimodal App: Found in the table image
Text-Only App: Missed specific figure
⚠️ Limitations and Considerations
While multimodal RAG offers significant advantages, be aware of:
- Computational overhead — Processing and embedding images requires more resources
- API costs — Multimodal embedding APIs typically cost more than text-only equivalents
- OCR dependency — Chart text recognition still relies on OCR quality
- Image resolution impact — Low-resolution images may reduce embedding quality
- Complex visualization challenges — Very complex visualizations might still be misinterpreted
Resources & Reference Links
- 🔗 GitHub Repository
- 🎥 Demo Video
- 📚 Cohere Documentation
- 📘 Gemini Documentation
- Multimodal (RAG) with Gemini
- Cohere Multimodal embeddings
🙌 Closing Thoughts
If you’re building LLM apps for financial document QA, research assistant bots, or compliance analytics, you need to look beyond just text. Multimodal RAG delivers context-aware, image-inclusive, and LLM-optimized retrieval that can extract insights from your entire document ecosystem, not just the textual components.
Try it out and let me know your thoughts!
🚀 Let’s Connect!
If you found this useful, feel free to connect with me:
🔗 LinkedIn — Sridhar Sampath
🔗 Medium Blog
🔗 GitHub Repository
✨ End
Originally published at https://sridhartech.hashnode.dev on May 3, 2025.