
I Made DocVision: My Messy File Savior That Reads Docs and Pics

Last Updated on April 15, 2025 by Editorial Team

Author(s): Tanisque Bagal

Originally published on Towards AI.

Okay, picture this: I’m staring at a stack of random files — PDFs I forgot I downloaded, a Word doc from last month’s meeting, and a blurry photo of a grocery list I scribbled on a napkin. I just wanted someone to tell me what’s in there without me having to dig through it all. So, I built DocVision AI Assistant, this little chatbot that’s become my new best friend. It reads documents, decodes images, and chats with me like it gets me. I threw it together with Streamlit and some AI tricks from Ollama, and honestly, I’m kind of obsessed. Want the scoop? Let’s dive in.

How It All Started

I’m not gonna lie — my digital life is a mess. I’ve got folders stuffed with notes, reports, and pics I swear I’ll sort out someday. But who has time to read a 20-page PDF just to find one line? I wanted something that could do it for me — something smart, fast, and maybe even a little fun. That’s when I thought, “What if I made a bot that could handle everything — docs, pics, all of it?” A couple of late nights and too much coffee later, DocVision was born.

It’s not fancy, but it’s mine. I used Streamlit to make it look decent (I’m no design pro) and Ollama to give it some serious brainpower. Now, it’s like having a buddy who’s way better at reading than I am.

Photo by Ant Rozetsky on Unsplash

What It Does (And Why I Love It)

DocVision is my chaos-tamer. Here’s what it’s got going on:

  • Doc Reader: I toss in a PDF, Word file, or even a messy CSV, and it figures out what’s inside. I’ll ask, “What’s this about?” and it’s like, “Here’s the gist.”
  • Pic Decoder: I snap a photo — like a sign or a note — and it tells me what it says or what’s in it. So cool.
  • Chat Pal: No files? It still talks to me. I’ll say, “Hey, what’s up?” and it’s got something to say back.
  • Memory: It remembers the last five things we talked about, so it’s not like, “Wait, who are you again?”

The whole thing runs in a browser with a sidebar for uploading and a chat box for talking. It even does this cute “Thinking…” thing while it works. I grin every time.

The Code: How I Made It Tick

I’m not a pro coder — I just hack stuff together until it works. Here’s the breakdown of the key pieces, with more details on why I did it this way and what’s going on.

1. Setting Up the Basics

I start with a bunch of imports — Streamlit for the app, Ollama for the AI, and helpers like PyPDF2 and pandas for file stuff. Then, I set up a little memory bank with session_state to keep track of things:

DEFAULT_STATE = {
    'docs': [],             # List of uploaded docs
    'chat_history': [],     # What we've said
    'vectorizer': None,     # For turning text into numbers
    'vectors': None,        # The number version of text
    'chunks': [],           # Bits of text I chopped up
    'images': [],           # Uploaded pics
    'current_image': None,  # The pic I'm looking at now
    'image_names': set()    # No duplicates, please
}

for key, value in DEFAULT_STATE.items():
    if key not in st.session_state:
        st.session_state[key] = value

Why? Streamlit reruns the whole script on every interaction, so anything sitting in a plain variable gets wiped. Stashing things in session_state keeps my files and chats alive between clicks. It's like a sticky note on my screen.
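For reference, here's roughly what that import block looks like. I'm reconstructing it from the pieces used below, so treat it as a sketch rather than the literal top of the file:

import base64
import io
import os
import tempfile

import numpy as np
import ollama
import pandas as pd
import streamlit as st
from docx import Document              # python-docx, for Word files
from PIL import Image                  # pillow, for images
from PyPDF2 import PdfReader           # PDFs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity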

2. Handling Files

I needed it to eat up all kinds of files — docs and pics. Here’s how:

  • Images:
def process_image(image_file):
    if image_file.name in st.session_state.image_names:
        return True  # Skip if I've seen it
    try:
        image = Image.open(image_file).convert('RGB')  # Load it up
        st.session_state.current_image = image  # Set as current
        st.session_state.images.append({'name': image_file.name, 'image': image})
        st.session_state.image_names.add(image_file.name)  # Track it
        return True
    except Exception as e:
        st.error(f"Failed to process image: {str(e)}")
        return False

This grabs a photo (like a PNG or JPG), makes sure it’s in color, and stashes it. If it messes up — like a corrupted file — it yells at me nicely.

  • Documents:
def extract_text_from_file(uploaded_file):
    file_handlers = {
        'pdf': lambda path: '\n'.join(p.extract_text() for p in PdfReader(path).pages if p.extract_text()),
        'docx': lambda path: '\n'.join(p.text for p in Document(path).paragraphs),
        'txt': lambda path: open(path, 'r', encoding='utf-8').read(),
        'csv': lambda path: pd.read_csv(path).to_string(index=False),
        'xlsx': lambda path: pd.read_excel(path).to_string(index=False)
    }
    file_type = uploaded_file.name.split('.')[-1].lower()
    if file_type not in file_handlers:
        return "Unsupported file type"  # Nope, can't do it
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=f'.{file_type}') as temp_file:
            temp_file.write(uploaded_file.getvalue())  # Save it quick
        return file_handlers[file_type](temp_file.name)  # Pull out the text
    except Exception as e:
        return f"Error processing {uploaded_file.name}: {str(e)}"
    finally:
        if 'temp_file' in locals():
            os.unlink(temp_file.name)  # Clean up

This is my file-chewing machine. It checks the file type, uses the right tool to grab the text (like PyPDF2 for PDFs), and spits it out. I use a temp file because Streamlit hands me bytes, not paths. If it’s something weird, it shrugs and moves on.
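One thing these snippets don't show is how files actually reach the two functions. In the app they come from upload widgets in the Streamlit sidebar; the wiring looks roughly like this (simplified sketch, with labels and flow that are my shorthand rather than the repo's exact sidebar code):

with st.sidebar:
    st.header("Upload files")

    # Documents: pull out the text and stash it for chunking/vectorizing
    doc_files = st.file_uploader("Documents", type=['pdf', 'docx', 'txt', 'csv', 'xlsx'],
                                 accept_multiple_files=True)
    if doc_files:
        for f in doc_files:
            if f.name not in [d['name'] for d in st.session_state.docs]:
                st.session_state.docs.append({'name': f.name, 'text': extract_text_from_file(f)})
        process_documents()  # Re-chunk and re-vectorize whatever we have now

    # Images: load with PIL and remember the current one
    img_file = st.file_uploader("Images", type=['png', 'jpg', 'jpeg'])
    if img_file and process_image(img_file):
        st.success(f"Loaded {img_file.name}")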

3. Chopping Text

Big docs are a pain, so I break them into bite-sized pieces:

def chunk_text(text, chunk_size=500, overlap=50):
    if not text:
        return []  # Nothing to chop
    words = text.split()  # Split into words
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size - overlap)]

This takes all the text, splits it into 500-word chunks, and overlaps by 50 words so I don’t lose context — like if a sentence gets cut off. It’s simple but keeps things manageable.
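If you want to sanity-check the stride math, a throwaway snippet like this (not part of the app) shows how a 1,200-word text lands in three overlapping chunks:

# Step size is chunk_size - overlap = 450 words
sample = ' '.join(f'w{i}' for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, overlap=50)
print(len(chunks))             # 3 chunks
print(len(chunks[0].split()))  # 500 words: w0 through w499
print(chunks[1].split()[0])    # 'w450', i.e. it starts 50 words before chunk 0 ends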

4. Processing Docs for the AI

Once I’ve got chunks, I prep them:

def process_documents():
    if not st.session_state.docs:
        return  # No docs, no work
    text = '\n\n'.join(doc['text'] for doc in st.session_state.docs)  # Smash it all together
    st.session_state.chunks = chunk_text(text)  # Chop it
    if st.session_state.chunks:
        if not st.session_state.vectorizer:
            st.session_state.vectorizer = TfidfVectorizer(lowercase=True)  # Make a word-number converter
        st.session_state.vectors = st.session_state.vectorizer.fit_transform(st.session_state.chunks)

This glues all my doc text together, chops it, and turns it into numbers with TfidfVectorizer. Why numbers? The AI can’t read words, but it loves math. This step’s like translating my mess into AI-speak.

5. Finding the Good Stuff

When I ask a question, it hunts for the best chunks:

def get_relevant_chunks(query, top_k=3):
    if st.session_state.vectors is None or not st.session_state.chunks:
        return []  # Nothing to search
    query_vector = st.session_state.vectorizer.transform([query])  # Turn my question into numbers
    similarities = cosine_similarity(query_vector, st.session_state.vectors).flatten()  # Compare it
    top_indices = np.argsort(similarities)[-top_k:][::-1]  # Pick the top 3 matches
    return [st.session_state.chunks[i] for i in top_indices]

This is the smart part. It takes my question, makes it a number too, and checks how close it is to each chunk. Then it grabs the three most similar ones — like picking the best pages from a book.
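If you want to see the whole vectorize-then-rank idea in isolation, here's a self-contained toy version (same libraries, made-up chunks, not the app's code):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Invoices are due within 30 days of delivery.",
    "The cat sat on the mat and refused to move.",
    "Late payments incur a 2 percent monthly fee.",
]
vectorizer = TfidfVectorizer(lowercase=True)
vectors = vectorizer.fit_transform(chunks)             # one TF-IDF row per chunk

query_vector = vectorizer.transform(["When are invoices due?"])
similarities = cosine_similarity(query_vector, vectors).flatten()
top_indices = np.argsort(similarities)[-2:][::-1]      # two best matches, best first
print(chunks[top_indices[0]])                          # the invoice chunk ranks first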

6. Talking Back

Here’s where the AI shines:

def generate_document_response(query):
    chunks = get_relevant_chunks(query)
    context = "\n\n".join(chunks) if chunks else "No relevant info found."
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    try:
        response = ollama.chat(model="gemma3:12b", messages=[{"role": "user", "content": prompt}])
        return response.get('message', {}).get('content', "Failed to get response")
    except Exception as e:
        return f"Error: {str(e)}"

def analyze_image(query):
    if not st.session_state.current_image:
        return "Please upload and select an image to analyze."
    try:
        img_buffer = io.BytesIO()
        st.session_state.current_image.save(img_buffer, format='PNG')
        img_data = base64.b64encode(img_buffer.getvalue()).decode('utf-8')
        prompt = (
            "Analyze this image and:\n"
            "1. Extract any visible text\n"
            "2. Describe key visual elements\n"
            "3. Answer this question: {query}")
        response = ollama.chat(
            model="llama3.2-vision:latest",
            messages=[{
                "role": "user",
                "content": prompt.format(query=query),
                "images": [img_data]
            }])
        return response.get('message', {}).get('content', "Failed to analyze the image. Please try again.")
    except Exception as e:
        return f"Image analysis failed: {str(e)}"

This feeds the chunks and my question to Gemma 3, a big language model I run locally through Ollama. It's like saying, "Here's the info, now tell me something smart." If it works, I get an answer.

For pics, it's similar but with Llama 3.2 Vision: the image gets base64-encoded, handed to the model, and it reads what's in the picture and answers my question.
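By the way, if you just want to poke at what ollama.chat returns before wiring it into anything, a quick snippet like this works once the model is pulled (I'm using the same .get pattern as above; different versions of the ollama client return slightly different objects, so adjust if yours complains):

import ollama

resp = ollama.chat(model="gemma3:12b",
                   messages=[{"role": "user", "content": "Say hi in five words."}])
print(resp.get('message', {}).get('content', ''))  # the model's reply text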

7. The Chat Part

The UI’s dead simple:

if query := st.chat_input("Ask a question or chat with the AI"):
    st.session_state.chat_history.append({"role": "user", "content": query})  # Save my question
    with st.chat_message("user"):
        st.markdown(query)  # Show it
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_response(query)  # Get the answer
        st.markdown(response)  # Show it
        st.session_state.chat_history.append({"role": "assistant", "content": response})  # Save it

This is the chat box. I type, it saves what I said, shows it, thinks, and talks back. The spinner’s just for fun — it makes it feel alive.
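One function this snippet calls that I haven't shown yet is generate_response. It's a small router that decides which path a question takes: image questions go to the vision model, document questions go through retrieval, and everything else is plain chat. Here's roughly what that looks like (simplified sketch; the repo has the exact version):

def generate_response(query):
    # Route the question: image first, then docs, otherwise plain chat
    # (sketch only; the real ordering/logic may differ)
    if st.session_state.current_image is not None:
        return analyze_image(query)
    if st.session_state.chunks:
        return generate_document_response(query)
    try:
        response = ollama.chat(model="gemma3:12b",
                               messages=[{"role": "user", "content": query}])
        return response.get('message', {}).get('content', "Failed to get response")
    except Exception as e:
        return f"Error: {str(e)}"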

Where It Falls Short

Building DocVision wasn’t all sunshine and rainbows. Here are the hiccups I ran into, polished up a bit:

  • Performance Lag: Large files can bog it down, especially when processing hefty documents or multiple uploads. I capped the chat history at five messages to keep it snappy (a minimal sketch of that trim follows this list), but there's still room to optimize, maybe with caching or smarter chunking.
  • AI Precision: The models are brilliant, but if my question’s vague or the context’s thin, the answers can wander. Tweaking the prompt or trying a different model might sharpen things up.
  • File Quirks: Some PDFs with odd formatting — like scanned pages or funky fonts — throw it for a loop. It’s not a dealbreaker, but it’s a reminder that not every file plays nice with my little bot.
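The history cap mentioned in the first bullet is a small fix: after each exchange, something like this keeps only the most recent messages (the exact cutoff and where it runs are my simplification, not the repo's literal code):

MAX_HISTORY = 5  # keep only the last five messages, per the cap described above

def trim_history():
    # Drop the oldest entries so chat_history doesn't grow without bound
    if len(st.session_state.chat_history) > MAX_HISTORY:
        st.session_state.chat_history = st.session_state.chat_history[-MAX_HISTORY:]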

These aren’t failures, just growing pains — stuff I can tweak as I go.

Why It’s Special to Me — and Where It’s Headed

DocVision’s more than just a tool — it’s a personal win. It’s pulled me out of the weeds at work when I’m drowning in documents, and it’s brought a smile to my face when it decodes my messy scribbles. Sure, it’s got its rough edges, but that’s what makes it real. It’s a project born from my own chaos, and seeing it come to life has been incredibly rewarding.

If you’re curious to give it a spin, the code’s all yours:

https://github.com/Tanx-123/DocVision-AI-Assistant.git

Install the dependencies with pip install streamlit ollama PyPDF2 python-docx pandas scikit-learn pillow numpy (you may also need openpyxl if you want the .xlsx handler to work), set up the Ollama models (gemma3:12b and llama3.2-vision:latest), and launch it with streamlit run docvision.py. I'd love to hear how it works for you, or if you tweak it into something even cooler.

Looking forward, I’ve got ideas brewing: integrating web search capabilities, supporting more languages, or ironing out those performance kinks. For now, though, I’m content with what it is — a trusty companion that lightens my load and sparks a bit of joy along the way. It’s not about building the ultimate AI; it’s about creating something that fits my life and, hopefully, inspires others to tinker too.

Have thoughts or questions? Maybe you’ve built something similar? I’d genuinely love to connect — drop a comment and let’s chat!

Published via Towards AI