
🚀 Beyond Text: Building Multimodal RAG Systems with Cohere and Gemini

Last Updated on May 6, 2025 by Editorial Team

Author(s): Sridhar Sampath

Originally published on Towards AI.


TL;DR

Traditional RAG fails on visual data. This project uses Cohere’s multimodal embeddings + Gemini 2.5 Flash to build a RAG system that understands both text and images — enabling accurate answers from charts, tables, and visuals inside PDFs.

📉The Problem: Traditional RAG’s Visual Blindspot

Traditional Retrieval-Augmented Generation (RAG) systems rely on text embeddings to retrieve information from documents. But what if your most valuable insights are hidden in charts, tables, and images?

Whether you’re analyzing financial PDFs, investment research reports, or market slides, much of the relevant information lives in visuals:

  • Numerical breakdowns in pie/bar charts (e.g., portfolio allocations)
  • Trend visualizations in line graphs (e.g., market performance)
  • Structured data in complex tables (e.g., comparison matrices)
  • Process flows in diagrams (e.g., system architectures)
  • Spatial relationships in maps or layouts

A purely text-based approach fails to capture this crucial layer of information.

💡The Solution: Multimodal RAG

Multimodal RAG augments traditional RAG by combining text and image understanding. This approach enables:

🔍 Image + Text search from the same document
🧠 Unified vector index with mixed modality support
🤖 Context-aware answers via Gemini using either matched text or matched image

🔧Key Technologies

  • Cohere’s Embed v4.0: Embeds both text and images in the same vector space
  • Gemini 2.5 Flash: Processes queries with context (text or image) to generate factual, human-like responses
  • FAISS: Efficiently indexes and searches vectors from both modalities using approximate nearest-neighbor search (see the sketch below)
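
Since FAISS appears throughout this post, here is a minimal sketch (with random vectors standing in for real embeddings) contrasting the exact IndexFlatL2 used later in this article with an approximate IVF index, which scales better for large document sets:

import faiss
import numpy as np

dim = 1024  # must match the embedding dimension returned by Cohere
vectors = np.random.rand(10_000, dim).astype(np.float32)  # stand-ins for real embeddings

# Exact search: simple and fine for small document sets (used later in this article)
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(vectors)

# Approximate search: trades a little recall for speed at scale
quantizer = faiss.IndexFlatL2(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, 100)  # partition space into 100 clusters
ivf_index.train(vectors)  # IVF indexes must be trained before adding vectors
ivf_index.add(vectors)

query = np.random.rand(1, dim).astype(np.float32)
distances, ids = ivf_index.search(query, k=3)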

🧭 End-to-End Multimodal RAG Workflow

Below is the high-level system flow for the Multimodal RAG pipeline:


📌 From PDF upload to image/text embedding, vector search, and Gemini-powered answer generation — everything is stitched together using Streamlit, Cohere, FAISS, and Gemini 2.5 Flash.

🎥 Multimodal RAG — Video Demo

Here’s a 9-minute visual walkthrough of the system in action:

See it live: Charts being analyzed, tables being interpreted, and complex visuals being understood all in real-time!

Architecture Comparison

🖼️ Multimodal RAG Architecture

In this pipeline, both the text and each page image are embedded using Cohere, stored in FAISS, and served as context to Gemini 2.5 Flash. This allows questions grounded in visuals to be answered — something traditional RAG setups can’t handle.

📝 Text-Only RAG Architecture

This approach extracts text from the PDF, embeds it, and uses it for retrieval — but completely misses information embedded inside charts or graphics.
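
For contrast, a text-only ingestion step might look like the sketch below. Using pypdf here is an assumption, since the article doesn't name the extractor in its comparison app. Note that extract_text() returns nothing for content rendered inside charts or figures, which is exactly the blind spot described above.

import cohere
from pypdf import PdfReader  # assumed extractor; the comparison app may use another library

co = cohere.Client("your-cohere-key")

def embed_pdf_text(pdf_path):
    reader = PdfReader(pdf_path)
    # Extract raw text page by page; anything rasterized in a chart is invisible here
    pages = [page.extract_text() or "" for page in reader.pages]
    response = co.embed(input_type="search_document", texts=pages)
    return response.embeddings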

Results: Side-by-Side Comparison

We tested both Text-Only and Multimodal RAG apps on the same ETF PDF document:

The results are clear: Text-only RAG struggled with questions grounded in visual data, while Multimodal RAG handled image-based content effectively.


Code Walkthrough: Multimodal Processing

💻 Full Source Code: GitHub Repository

1. PDF to Image Conversion

import pdf2image

# Render each PDF page as a PIL image at 200 DPI
images = pdf2image.convert_from_path(pdf_path, dpi=200)

This gives us a list of page-wise PIL images, which are embedded next.
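
A small optional addition (not in the article's snippet) is to keep page numbers alongside the images, so that answers can later cite their source page:

# Track page numbers alongside images for later provenance (illustrative addition)
pages = [
    {"page": i + 1, "image": img}
    for i, img in enumerate(images)
]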

2. Embedding with Cohere

if content_type == "text":
    response = cohere.embed(input_type="search_document", texts=[text])
else:
    base64_img = convert_image_to_base64(image)
    response = cohere.embed(
        input_type="search_document",
        inputs=[{"content": [{"type": "image", "image": base64_img}]}]
    )

The output is added to FAISS as a float32 vector.
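
Concretely, the indexing step can be sketched like this (the metadata list is an illustrative way to map FAISS's sequential ids back to the original text or page image; the repository's actual bookkeeping may differ):

import faiss
import numpy as np

index = faiss.IndexFlatL2(1024)  # dimension must match the Cohere embedding size
metadata = []                    # parallel list: FAISS id -> source text or page image

def add_to_index(embedding, payload):
    # FAISS expects a 2-D float32 array with one row per vector
    index.add(np.array([embedding], dtype=np.float32))
    metadata.append(payload)     # ids are assigned sequentially, so positions line up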

3. Gemini Answering Logic

if isinstance(content, Image.Image):
    response = gemini.generate_content([query, content])
else:
    response = gemini.generate_content(f"Question: {query}\n\nContext: {text}")

Gemini 2.5 Flash intelligently parses charts, titles, and layouts.

🚀Getting Started — Minimal Example

Here’s a compact script to get you up and running with multimodal RAG using Cohere + Gemini:

⚠️ Note: This is a minimal gist to demonstrate the core flow. The full working code with UI, modular structure, and search logic is available in the [GitHub repository](github.com/SridharSampath/multimodal-rag-demo).

import base64
import io

import cohere
import faiss
import numpy as np
from google.generativeai import GenerativeModel
from pdf2image import convert_from_path

# Initialize APIs
co = cohere.Client("your-cohere-key")
gemini = GenerativeModel("gemini-2.5-flash")

# Convert PDF pages to images
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, dpi=200)

# Encode a PIL image as a base64 PNG string for the embedding API
def image_to_base64(img):
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Create embeddings
def get_embedding(content, content_type="text"):
    if content_type == "text":
        response = co.embed(input_type="search_document", texts=[content])
    else:
        base64_img = image_to_base64(content)
        response = co.embed(
            input_type="search_document",
            inputs=[{"content": [{"type": "image", "image": base64_img}]}]
        )
    return response.embeddings[0]

# Index and query
dimension = 1024
index = faiss.IndexFlatL2(dimension)
images = pdf_to_images("your.pdf")
for img in images:
    index.add(np.array([get_embedding(img, "image")], dtype=np.float32))

def answer_query(query):
    query_emb = get_embedding(query)
    D, I = index.search(np.array([query_emb], dtype=np.float32), k=1)
    result = images[I[0][0]]
    return gemini.generate_content([query, result]).text
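
With the index built, querying is a one-liner; for example, using one of the demo questions from later in this article:

# Example usage (assumes valid API keys and that "your.pdf" was indexed above)
print(answer_query("What is the AUM of Invesco?"))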

⚙️Project Setup

What You’ll Need

🔑 API Keys:

  • Cohere API key (for Embed v4.0)
  • Google Gemini API key (for Gemini 2.5 Flash)

💻 System Requirements:

  • Python 3.8+
  • Poppler (for PDF image conversion)
# Clone repository
git clone https://github.com/SridharSampath/multimodal-rag-demo
cd multimodal-rag-demo

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py

⚠️System Dependency: Poppler

This project uses pdf2image to convert PDF pages into images, which requires Poppler:

Windows: download a Poppler release, extract it, and add its bin folder to your PATH.
Linux: sudo apt-get install poppler-utils
macOS: brew install poppler
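
If Poppler's binaries are not on your PATH (common on Windows), pdf2image accepts an explicit path; the folder below is only an example location:

from pdf2image import convert_from_path

# poppler_path is only needed when Poppler's bin folder is not on the PATH
images = convert_from_path("your.pdf", dpi=200, poppler_path=r"C:\poppler\Library\bin")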

🧪 Demo Screenshots — Multimodal vs. Text-Only RAG

Visual comparison of the same queries across two apps:

1. ❓ Query: “What is AUM of Invesco?”

Multimodal App: Found in bar chart
Text-Only App: Missed (text doesn’t mention it)

2. ❓ Query: “How much did BlackRock earn through Technology services?”

Multimodal App: Pulled value from image (BlackRock income statement)
Text-Only App: Missed (text doesn’t mention it)

3. 🍎 Query: “How much Percentage is Apple in S&P?”

Multimodal App: Found in pie chart
Text-Only App: Gave approximate data

4. 🦠 Query: “During Covid pandemic what was the top 10 weight in S&P 500?”

Multimodal App: Parsed timeline chart
Text-Only App: Missed specific figure

5. 💰 Query: “How to track Bitcoin in ETFs?”

Multimodal App: Found in table image
Text-Only App: Missed specific figure

⚠️ Limitations and Considerations

While multimodal RAG offers significant advantages, be aware of:

  1. Computational overhead — Processing and embedding images requires more resources (a simple size-cap mitigation is sketched after this list)
  2. API costs — Multimodal embedding APIs typically cost more than text-only equivalents
  3. OCR dependency — Chart text recognition still relies on OCR quality
  4. Image resolution impact — Low-resolution images may reduce embedding quality
  5. Complex visualization challenges — Very complex visualizations might still be misinterpreted
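
On the first two points, a simple way to bound per-page cost is to cap the rendered image size before embedding. A minimal sketch, with an illustrative cap (not a documented Cohere limit):

from PIL import Image

MAX_SIDE = 1024  # illustrative cap; tune against your quality/cost tradeoff

def downscale(img: Image.Image) -> Image.Image:
    # Shrink oversized pages while preserving aspect ratio to bound embedding cost
    scale = MAX_SIDE / max(img.size)
    if scale < 1:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    return img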


🙌Closing Thoughts

If you’re building LLM apps for financial document QA, research assistant bots, or compliance analytics, you need to look beyond just text. Multimodal RAG delivers context-aware, image-inclusive, and LLM-optimized retrieval that can extract insights from your entire document ecosystem, not just the textual components.

Try it out and let me know your thoughts!

🚀 Let’s Connect!

If you found this useful, feel free to connect with me:
🔗 LinkedIn — Sridhar Sampath
🔗 Medium Blog
🔗 GitHub Repository

✨ End

Originally published at https://sridhartech.hashnode.dev on May 3, 2025.
