
I Made DocVision: My Messy File Savior That Reads Docs and Pics

Last Updated on April 15, 2025 by Editorial Team

Author(s): Tanisque Bagal

Originally published on Towards AI.

Okay, picture this: I’m staring at a stack of random files — PDFs I forgot I downloaded, a Word doc from last month’s meeting, and a blurry photo of a grocery list I scribbled on a napkin. I just wanted someone to tell me what’s in there without me having to dig through it all. So, I built DocVision AI Assistant, this little chatbot that’s become my new best friend. It reads documents, decodes images, and chats with me like it gets me. I threw it together with Streamlit and some AI tricks from Ollama, and honestly, I’m kind of obsessed. Want the scoop? Let’s dive in.

How It All Started

I’m not gonna lie — my digital life is a mess. I’ve got folders stuffed with notes, reports, and pics I swear I’ll sort out someday. But who has time to read a 20-page PDF just to find one line? I wanted something that could do it for me — something smart, fast, and maybe even a little fun. That’s when I thought, “What if I made a bot that could handle everything — docs, pics, all of it?” A couple of late nights and too much coffee later, DocVision was born.

It’s not fancy, but it’s mine. I used Streamlit to make it look decent (I’m no design pro) and Ollama to give it some serious brainpower. Now, it’s like having a buddy who’s way better at reading than I am.

Photo by Ant Rozetsky on Unsplash

What It Does (And Why I Love It)

DocVision is my chaos-tamer. Here’s what it’s got going on:

  • Doc Reader: I toss in a PDF, Word file, or even a messy CSV, and it figures out what’s inside. I’ll ask, “What’s this about?” and it’s like, “Here’s the gist.”
  • Pic Decoder: I snap a photo — like a sign or a note — and it tells me what it says or what’s in it. So cool.
  • Chat Pal: No files? It still talks to me. I’ll say, “Hey, what’s up?” and it’s got something to say back.
  • Memory: It remembers the last five things we talked about, so it’s not like, “Wait, who are you again?”

The whole thing runs in a browser with a sidebar for uploading and a chat box for talking. It even does this cute “Thinking…” thing while it works. I grin every time.

The Code: How I Made It Tick

I’m not a pro coder — I just hack stuff together until it works. Here’s the breakdown of the key pieces, with more details on why I did it this way and what’s going on.

1. Setting Up the Basics

I start with a bunch of imports — Streamlit for the app, Ollama for the AI, and helpers like PyPDF2 and pandas for file stuff. Then, I set up a little memory bank with session_state to keep track of things:

DEFAULT_STATE = {
    'docs': [],             # List of uploaded docs
    'chat_history': [],     # What we've said
    'vectorizer': None,     # For turning text into numbers
    'vectors': None,        # The number version of text
    'chunks': [],           # Bits of text I chopped up
    'images': [],           # Uploaded pics
    'current_image': None,  # The pic I'm looking at now
    'image_names': set()    # No duplicates, please
}

for key, value in DEFAULT_STATE.items():
    if key not in st.session_state:
        st.session_state[key] = value

Why? Streamlit reruns the whole script on every interaction, so anything sitting in a plain variable gets wiped. Stashing things in session_state keeps my files and chats alive between clicks. It's like a sticky note on my screen.
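For reference, here's roughly what that import block looks like. I'm reconstructing it from the pieces used below, so treat it as a sketch rather than the literal top of the file:

import base64
import io
import os
import tempfile

import numpy as np
import ollama
import pandas as pd
import streamlit as st
from docx import Document              # python-docx, for Word files
from PIL import Image                  # pillow, for images
from PyPDF2 import PdfReader           # PDFs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity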

2. Handling Files

I needed it to eat up all kinds of files — docs and pics. Here’s how:

  • Images:
def process_image(image_file):
    if image_file.name in st.session_state.image_names:
        return True  # Skip if I've seen it
    try:
        image = Image.open(image_file).convert('RGB')  # Load it up
        st.session_state.current_image = image  # Set as current
        st.session_state.images.append({'name': image_file.name, 'image': image})
        st.session_state.image_names.add(image_file.name)  # Track it
        return True
    except Exception as e:
        st.error(f"Failed to process image: {str(e)}")
        return False

This grabs a photo (like a PNG or JPG), makes sure it’s in color, and stashes it. If it messes up — like a corrupted file — it yells at me nicely.

  • Documents:
def extract_text_from_file(uploaded_file):
    file_handlers = {
        'pdf': lambda path: '\n'.join(p.extract_text() for p in PdfReader(path).pages if p.extract_text()),
        'docx': lambda path: '\n'.join(p.text for p in Document(path).paragraphs),
        'txt': lambda path: open(path, 'r', encoding='utf-8').read(),
        'csv': lambda path: pd.read_csv(path).to_string(index=False),
        'xlsx': lambda path: pd.read_excel(path).to_string(index=False)
    }
    file_type = uploaded_file.name.split('.')[-1].lower()
    if file_type not in file_handlers:
        return "Unsupported file type"  # Nope, can't do it
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=f'.{file_type}') as temp_file:
            temp_file.write(uploaded_file.getvalue())  # Save it quick
        return file_handlers[file_type](temp_file.name)  # Pull out the text
    except Exception as e:
        return f"Error processing {uploaded_file.name}: {str(e)}"
    finally:
        if 'temp_file' in locals():
            os.unlink(temp_file.name)  # Clean up

This is my file-chewing machine. It checks the file type, uses the right tool to grab the text (like PyPDF2 for PDFs), and spits it out. I use a temp file because Streamlit hands me bytes, not paths. If it’s something weird, it shrugs and moves on.
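One thing these snippets don't show is how files actually reach the two functions. In the app they come from upload widgets in the Streamlit sidebar; the wiring looks roughly like this (simplified sketch, with labels and flow that are my shorthand rather than the repo's exact sidebar code):

with st.sidebar:
    st.header("Upload files")

    # Documents: pull out the text and stash it for chunking/vectorizing
    doc_files = st.file_uploader("Documents", type=['pdf', 'docx', 'txt', 'csv', 'xlsx'],
                                 accept_multiple_files=True)
    if doc_files:
        for f in doc_files:
            if f.name not in [d['name'] for d in st.session_state.docs]:
                st.session_state.docs.append({'name': f.name, 'text': extract_text_from_file(f)})
        process_documents()  # Re-chunk and re-vectorize whatever we have now

    # Images: load with PIL and remember the current one
    img_file = st.file_uploader("Images", type=['png', 'jpg', 'jpeg'])
    if img_file and process_image(img_file):
        st.success(f"Loaded {img_file.name}")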

3. Chopping Text

Big docs are a pain, so I break them into bite-sized pieces:

def chunk_text(text, chunk_size=500, overlap=50):
    if not text:
        return []  # Nothing to chop
    words = text.split()  # Split into words
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size - overlap)]

This takes all the text, splits it into 500-word chunks, and overlaps by 50 words so I don’t lose context — like if a sentence gets cut off. It’s simple but keeps things manageable.
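If you want to sanity-check the stride math, a throwaway snippet like this (not part of the app) shows how a 1,200-word text lands in three overlapping chunks:

# Step size is chunk_size - overlap = 450 words
sample = ' '.join(f'w{i}' for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, overlap=50)
print(len(chunks))             # 3 chunks
print(len(chunks[0].split()))  # 500 words: w0 through w499
print(chunks[1].split()[0])    # 'w450', i.e. it starts 50 words before chunk 0 ends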

4. Processing Docs for the AI

Once I’ve got chunks, I prep them:

def process_documents():
    if not st.session_state.docs:
        return  # No docs, no work
    text = '\n\n'.join(doc['text'] for doc in st.session_state.docs)  # Smash it all together
    st.session_state.chunks = chunk_text(text)  # Chop it
    if st.session_state.chunks:
        if not st.session_state.vectorizer:
            st.session_state.vectorizer = TfidfVectorizer(lowercase=True)  # Make a word-number converter
        st.session_state.vectors = st.session_state.vectorizer.fit_transform(st.session_state.chunks)

This glues all my doc text together, chops it, and turns it into numbers with TfidfVectorizer. Why numbers? The AI can’t read words, but it loves math. This step’s like translating my mess into AI-speak.

5. Finding the Good Stuff

When I ask a question, it hunts for the best chunks:

def get_relevant_chunks(query, top_k=3):
    if st.session_state.vectors is None or not st.session_state.chunks:
        return []  # Nothing to search
    query_vector = st.session_state.vectorizer.transform([query])  # Turn my question into numbers
    similarities = cosine_similarity(query_vector, st.session_state.vectors).flatten()  # Compare it
    top_indices = np.argsort(similarities)[-top_k:][::-1]  # Pick the top 3 matches
    return [st.session_state.chunks[i] for i in top_indices]

This is the smart part. It takes my question, makes it a number too, and checks how close it is to each chunk. Then it grabs the three most similar ones — like picking the best pages from a book.
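If you want to see the whole vectorize-then-rank idea in isolation, here's a self-contained toy version (same libraries, made-up chunks, not the app's code):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Invoices are due within 30 days of delivery.",
    "The cat sat on the mat and refused to move.",
    "Late payments incur a 2 percent monthly fee.",
]
vectorizer = TfidfVectorizer(lowercase=True)
vectors = vectorizer.fit_transform(chunks)             # one TF-IDF row per chunk

query_vector = vectorizer.transform(["When are invoices due?"])
similarities = cosine_similarity(query_vector, vectors).flatten()
top_indices = np.argsort(similarities)[-2:][::-1]      # two best matches, best first
print(chunks[top_indices[0]])                          # the invoice chunk ranks first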

6. Talking Back

Here’s where the AI shines:

def generate_document_response(query):
    chunks = get_relevant_chunks(query)
    context = "\n\n".join(chunks) if chunks else "No relevant info found."
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    try:
        response = ollama.chat(model="gemma3:12b", messages=[{"role": "user", "content": prompt}])
        return response.get('message', {}).get('content', "Failed to get response")
    except Exception as e:
        return f"Error: {str(e)}"

def analyze_image(query):
    if not st.session_state.current_image:
        return "Please upload and select an image to analyze."
    try:
        img_buffer = io.BytesIO()
        st.session_state.current_image.save(img_buffer, format='PNG')
        img_data = base64.b64encode(img_buffer.getvalue()).decode('utf-8')
        prompt = (
            "Analyze this image and:\n"
            "1. Extract any visible text\n"
            "2. Describe key visual elements\n"
            "3. Answer this question: {query}")
        response = ollama.chat(
            model="llama3.2-vision:latest",
            messages=[{
                "role": "user",
                "content": prompt.format(query=query),
                "images": [img_data]
            }])
        return response.get('message', {}).get('content', "Failed to analyze the image. Please try again.")
    except Exception as e:
        return f"Image analysis failed: {str(e)}"

This feeds the chunks and my question to Gemma 3, a big language model I run locally through Ollama. It's like saying, "Here's the info, now tell me something smart." If it works, I get an answer.

For pics, it's similar but with Llama 3.2 Vision: the image gets base64-encoded, handed to the model, and it reads what's in the picture and answers my question.
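By the way, if you just want to poke at what ollama.chat returns before wiring it into anything, a quick snippet like this works once the model is pulled (I'm using the same .get pattern as above; different versions of the ollama client return slightly different objects, so adjust if yours complains):

import ollama

resp = ollama.chat(model="gemma3:12b",
                   messages=[{"role": "user", "content": "Say hi in five words."}])
print(resp.get('message', {}).get('content', ''))  # the model's reply text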

7. The Chat Part

The UI’s dead simple:

if query := st.chat_input("Ask a question or chat with the AI"):
    st.session_state.chat_history.append({"role": "user", "content": query})  # Save my question
    with st.chat_message("user"):
        st.markdown(query)  # Show it
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_response(query)  # Get the answer
        st.markdown(response)  # Show it
        st.session_state.chat_history.append({"role": "assistant", "content": response})  # Save it

This is the chat box. I type, it saves what I said, shows it, thinks, and talks back. The spinner’s just for fun — it makes it feel alive.
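One function this snippet calls that I haven't shown yet is generate_response. It's a small router that decides which path a question takes: image questions go to the vision model, document questions go through retrieval, and everything else is plain chat. Here's roughly what that looks like (simplified sketch; the repo has the exact version):

def generate_response(query):
    # Route the question: image first, then docs, otherwise plain chat
    # (sketch only; the real ordering/logic may differ)
    if st.session_state.current_image is not None:
        return analyze_image(query)
    if st.session_state.chunks:
        return generate_document_response(query)
    try:
        response = ollama.chat(model="gemma3:12b",
                               messages=[{"role": "user", "content": query}])
        return response.get('message', {}).get('content', "Failed to get response")
    except Exception as e:
        return f"Error: {str(e)}"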

Where It Falls Short

Building DocVision wasn’t all sunshine and rainbows. Here are the hiccups I ran into, polished up a bit:

  • Performance Lag: Large files can bog it down, especially when processing hefty documents or multiple uploads. I capped the chat history at five messages to keep it snappy (a minimal sketch of that trim follows this list), but there's still room to optimize, maybe with caching or smarter chunking.
  • AI Precision: The models are brilliant, but if my question’s vague or the context’s thin, the answers can wander. Tweaking the prompt or trying a different model might sharpen things up.
  • File Quirks: Some PDFs with odd formatting — like scanned pages or funky fonts — throw it for a loop. It’s not a dealbreaker, but it’s a reminder that not every file plays nice with my little bot.
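The history cap mentioned in the first bullet is a small fix: after each exchange, something like this keeps only the most recent messages (the exact cutoff and where it runs are my simplification, not the repo's literal code):

MAX_HISTORY = 5  # keep only the last five messages, per the cap described above

def trim_history():
    # Drop the oldest entries so chat_history doesn't grow without bound
    if len(st.session_state.chat_history) > MAX_HISTORY:
        st.session_state.chat_history = st.session_state.chat_history[-MAX_HISTORY:]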

These aren’t failures, just growing pains — stuff I can tweak as I go.

Why It’s Special to Me — and Where It’s Headed

DocVision’s more than just a tool — it’s a personal win. It’s pulled me out of the weeds at work when I’m drowning in documents, and it’s brought a smile to my face when it decodes my messy scribbles. Sure, it’s got its rough edges, but that’s what makes it real. It’s a project born from my own chaos, and seeing it come to life has been incredibly rewarding.

If you’re curious to give it a spin, the code’s all yours:

https://github.com/Tanx-123/DocVision-AI-Assistant.git

Install the dependencies with pip install streamlit ollama PyPDF2 python-docx pandas scikit-learn pillow numpy (you may also need openpyxl if you want the .xlsx handler to work), set up the Ollama models (gemma3:12b and llama3.2-vision:latest), and launch it with streamlit run docvision.py. I'd love to hear how it works for you, or if you tweak it into something even cooler.

Looking forward, I’ve got ideas brewing: integrating web search capabilities, supporting more languages, or ironing out those performance kinks. For now, though, I’m content with what it is — a trusty companion that lightens my load and sparks a bit of joy along the way. It’s not about building the ultimate AI; it’s about creating something that fits my life and, hopefully, inspires others to tinker too.

Have thoughts or questions? Maybe you’ve built something similar? I’d genuinely love to connect — drop a comment and let’s chat!

Published via Towards AI