Document Summarization & QA in RAG without Frameworks (PyMuPDF & ChromaDB)
Last Updated on August 26, 2025 by Editorial Team
Originally published on Towards AI.
Explore how RAG works under the hood by building a PDF summarization pipeline with PyMuPDF and ChromaDB — no frameworks needed.
Table of Contents
· Introduction
· Overview of the Pipeline
· Practical Implementation
∘ Document processing
∘ Document chunking
∘ Image captioning
∘ Document summarization
∘ Creating and storing embeddings
∘ Document retrieval in RAG
· Implementation Result
∘ Document Summarization
∘ Document Retrieval
· Conclusion
Introduction
Retrieval-Augmented Generation (RAG) is a technique that combines retrieval with generation. Retrieval finds relevant information in a vector database through semantic search, while generation uses a language model to produce answers grounded in the retrieved information. Most RAG tutorials rely on frameworks that largely hide what is really happening.
In this guide, we will build a minimal RAG pipeline from scratch, using only PyMuPDF for extracting content from PDFs and ChromaDB for storing and retrieving embeddings. This approach can help you see exactly how document summarization with RAG works under the hood — processing text and images, creating embeddings, retrieving relevant chunks, and generating summaries — without any black-box frameworks.
The code is adapted from the SEAD-Agent repository, where I implemented an AI research assistant with tool calling. One of those tools is the RAG approach covered in this guide, which includes document summarization and document search over a vector database.
Overview of the Pipeline

In this guide, we use research papers as the input. The pipeline starts by processing the uploaded PDF: it extracts the text and images of each page, splits the text of each page into smaller chunks, and applies a multimodal model (VLM) to caption each image before adding the caption to the chunk list. These chunks are fed into the LLM (or, in this case, the multimodal model Pixtral 12B) to generate a summary of each chunk, and the individual summaries are then combined into a final summary.

Once document summarization is finished, the chunks are embedded and indexed before being stored in the vector database for later use in retrieval and generation for Q&A (question answering).
The conceptual overview of this pipeline is as follows (a minimal end-to-end sketch appears after the list):
- Document Processing: Extract text and images from PDFs using PyMuPDF and caption each image with the VLM (Pixtral 12B).
- Document Chunking: Split the text into manageable chunks and add them to the chunk list.
- Document Summarization: Summarize each chunk in the list with the LLM and combine the chunk summaries into a final summary.
- Creating and Storing Embeddings: Generate embeddings for each chunk and store them in ChromaDB with indexing for efficient retrieval.
- Retrieval and Generation for Q&A: Use semantic search to find relevant chunks in the vector database and generate answers to the input query.
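To make the flow concrete, here is a minimal end-to-end sketch of how these stages might be wired together. The names run_pipeline, process_document, summarize_document, and add_chunks are assumptions that mirror the components described in this guide rather than verbatim SEAD-Agent interfaces; only get_document matches a method shown later.
# Hypothetical orchestration sketch; the helper names are assumptions that
# mirror the steps described below, not the verbatim SEAD-Agent interfaces.
def run_pipeline(summarizer, vector_store, pdf_content: bytes, question: str):
    chunks = summarizer.process_document(pdf_content)            # 1. text + image-caption chunks
    summary = summarizer.summarize_document(chunks, "detailed")  # 2-3. per-chunk and final summary
    vector_store.add_chunks(chunks)                              # 4. embed and store in ChromaDB
    results = vector_store.get_document(question)                # 5. semantic search for Q&A
    return summary, results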
Practical Implementation
Document processing
The document processing phase involves extracting text and visual content from PDF files. Here’s the exact implementation in SEAD-Agent:
Open the document with fitz.open(stream=pdf_content, filetype="pdf").
# From SEAD-Agent docsum.py
try:
    doc = fitz.open(stream=pdf_content, filetype="pdf")
    logger.info(f"Successfully opened PDF with {len(doc)} pages")
except Exception as e:
    raise Exception(f"Failed to open PDF content. Error: {e}")
Load each page with doc.load_page(page_num), then extract its text with text = page.get_text() for chunking and its images with image_list = page.get_images(full=True) for captioning.
The extracted text is passed to the _split_text method to handle chunking, and each extracted image is passed to the _generate_image_caption method to create a caption.
Each chunk (chunk text or image caption) is then added to the chunk list with chunks.append(...).
Note that each chunk requires its own unique ID, chunk_id.
# From SEAD-Agent docsum.py
try:
    # Load pages
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        logger.info(f"Page {page_num}: extracted {len(text)} characters")
        if text and isinstance(text, str) and text.strip():
            # Extract text in the page
            text_chunks = self._split_text(text, max_length=max_length)
            logger.info(f"Page: {page_num}, Text chunks: {text_chunks}")
            logger.info(f"Page {page_num}: created {len(text_chunks)} text chunks")
            for i, chunk in enumerate(text_chunks):
                if chunk and chunk.strip():
                    chunk_id = f"text_{page_num}_{i}_{uuid.uuid4().hex[:8]}"
                    chunks.append(DocumentChunk(
                        content=chunk,
                        chunk_type='text',
                        page_number=page_num,
                        chunk_id=chunk_id,
                        metadata={'chunk_id': f'text_{page_num}_{i}'}
                    ))
            # Extract images in the page
            image_list = page.get_images(full=True)
            logger.info(f"Page {page_num}: found {len(image_list)} images")
            for img_index, img in enumerate(image_list):
                try:
                    xref = img[0]
                    pix = fitz.Pixmap(doc, xref)
                    if pix.n - pix.alpha < 4:  # Gray or RGB
                        img_data = pix.tobytes("png")
                        image = Image.open(io.BytesIO(img_data))
                        # image caption
                        caption = self._generate_image_caption(image)
                        chunk_id = f"image_{page_num}_{img_index}_{uuid.uuid4().hex[:8]}"
                        chunks.append(DocumentChunk(
                            content=caption,
                            chunk_type='image',
                            page_number=page_num,
                            chunk_id=chunk_id,
                            metadata={
                                'source_page': page_num + 1,
                                'image_index': img_index,
                            }
                        ))
                    pix = None
                except Exception as e:
                    logger.warning(f"Error processing image on page {page_num}: {e}")
        else:
            logger.warning(f"Page {page_num}: No valid text content found")
finally:
    doc.close()
logger.info(f"Total chunks created: {len(chunks)}")
return chunks
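The DocumentChunk container used above is not shown in this snippet. A minimal sketch, inferred from the fields the code passes in (the actual class in SEAD-Agent may differ), could be:
# Minimal DocumentChunk sketch inferred from its usage above (assumption, not repo code)
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class DocumentChunk:
    content: str                                 # chunk text or image caption
    chunk_type: str                              # 'text' or 'image'
    page_number: int                             # zero-based page index
    chunk_id: str                                # unique ID, e.g. "text_0_1_ab12cd34"
    metadata: Optional[Dict[str, Any]] = None    # extra info stored with the chunk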
Document chunking
The chunking process is integrated into the document processing phase. The _split_text method handles text chunking with a maximum length of 512 characters per chunk. Each chunk is then added to the list with chunks.append(current_chunk.strip()).
# From SEAD-Agent docsum.py
def _split_text(self, text: str, max_length=512) -> List[str]:
    """Split text into chunks of specified maximum length"""
    sentences = text.split(". ")
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if sentence.strip():
            # 512 characters in one chunk
            if len(current_chunk) + len(sentence) + 1 <= max_length:
                current_chunk += sentence + ". "
            else:
                if current_chunk.strip():
                    chunks.append(current_chunk.strip())
                current_chunk = sentence + ". "  # next sentence
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
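As a quick illustration of the splitting behaviour, the sketch below uses a hypothetical max_length of 80, where summarizer stands for an instance of the class that owns _split_text. Sentences are packed into a chunk until adding the next one would exceed the limit (a single very long sentence can still exceed it on its own):
# Usage sketch with a small, hypothetical limit to make the packing visible
sample = ("RAG combines retrieval with generation. Retrieval finds relevant chunks. "
          "Generation produces the grounded answer. Chunking keeps inputs within model limits.")
for i, c in enumerate(summarizer._split_text(sample, max_length=80)):
    print(i, len(c), c)   # each chunk holds whole sentences, roughly 80 characters or fewer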
Image captioning
The _generate_image_caption(self, image) method invokes the VLM (Pixtral 12B) to generate a descriptive caption for each image via self.client.chat.complete(...).
The messages sent in the request require both text and image_url entries, following the general specification for VLM calls.
Note that the raw image bytes need to be base64-encoded before the request is sent: base64.b64encode(...).decode('utf-8').
def _generate_image_caption(self, image):
    # The processing loop passes a PIL Image, so serialize it to PNG bytes
    # before base64-encoding it for the request.
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    img_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
    messages = [
        {"role": "system", "content": "You are a research assistant tasked with analyzing and describing figures in research papers (chart, diagram, screenshot, etc.). Provide a clear and on-point caption that describes what you see in the image."},
        {"role": "user", "content": [
            {"type": "text", "text": "Please describe this image in detail if the image is related to research papers, otherwise skip it (e.g. a logo)"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
        ]}
    ]
    response = self.client.chat.complete(
        model=self.vlm_model,
        messages=messages,
        temperature=0.3,
        max_tokens=150
    )
    return response.choices[0].message.content.strip()
Document summarization
The summarization process begins by summarizing each chunk individually using the DOCUMENT_SUMMARIZATION_PROMPT template. The prompt is prepared with prompt.format(text=chunk.content), which combines the template with the chunk content. The formatted prompt is then passed to the model via self.client.chat.complete(...) to generate the summary.
# From SEAD-Agent prompt_message.py
DOCUMENT_SUMMARIZATION_PROMPT = """
You are a helpful assistant that specializes in analyzing and summarizing academic papers and technical documents.
Please provide a comprehensive summary of the following document:
{text}
Please include:
1. Main objectives and research questions
2. Key findings and conclusions
3. Methodology used (if applicable)
4. Implications and applications
5. Any limitations or future work mentioned
"""
# From SEAD-Agent docsum.py
def _summarize_chunk(self, chunk):
    """
    Summarize a single chunk of the document.
    Args:
        chunk: The DocumentChunk object to summarize
    Returns:
        str: Summary of the chunk
    """
    prompt = PromptMessage.DOCUMENT_SUMMARIZATION_PROMPT
    prompt = prompt.format(text=chunk.content)
    logger.info(f"Summarizing chunk of type: {chunk.chunk_type}, length: {len(chunk.content)} sentences")
    response = self.client.chat.complete(
        model=self.vlm_model,
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that specializes in analyzing and summarizing academic papers and technical documents."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=512
    )
    return response.choices[0].message.content.strip()
After receiving the summary of each chunk, different prompt templates are used depending on the level of detail required (Brief or Detailed).
Here are the exact prompts from SEAD-Agent:
# From SEAD-Agent prompt_message.py
BRIEF_DOCUMENT_SUMMARIZATION = """
Please provide a concise summary of the following content, including both text and visual elements. Focus on:
- Key points and main ideas from the text
- Important information from any images, charts, or diagrams
- Overall message or findings
Content: {content}"""
DETAILED_DOCUMENT_SUMMARIZATION = """
Please provide a comprehensive summary of the following content, including:
- Main topics and themes from the text
- Key findings or conclusions
- Important details and supporting information
- Analysis of any images, charts, diagrams, or visual elements
- Technical terms or concepts mentioned
- Relationships between text and visual content
Content: {content}"""
The list of chunk summaries is then fed into another VLM call, self.client.chat.complete(...), to generate a comprehensive summary of all chunks, using either BRIEF_DOCUMENT_SUMMARIZATION or DETAILED_DOCUMENT_SUMMARIZATION depending on the requested level of detail.
def _create_final_summary(self, chunk_summaries, summary_type):
    """
    Create the final document summary from individual chunk summaries.
    Args:
        chunk_summaries (List[str]): Summaries generated for each processed chunk.
        summary_type (str): Controls the style of the final summary. Use "brief" for a concise
            overview; otherwise produces a detailed summary.
    Returns:
        str: Final comprehensive summary aggregated and rewritten according to the selected style.
    """
    combined_text = "\n\n".join([f"Section {i + 1}: {summary}" for i, summary in enumerate(chunk_summaries)])
    if summary_type == "brief":
        prompt = PromptMessage.BRIEF_DOCUMENT_SUMMARIZATION
    else:
        prompt = PromptMessage.DETAILED_DOCUMENT_SUMMARIZATION
    response = self.client.chat.complete(
        model=self.vlm_model,
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that specializes in analyzing and summarizing academic papers and technical documents."},
            {"role": "user", "content": prompt.format(content=combined_text)},
        ],
        temperature=0.3,
        max_tokens=1024
    )
    return response.choices[0].message.content.strip()
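The glue that calls _summarize_chunk for every chunk and then hands the resulting list to _create_final_summary is not shown above. A minimal sketch of that step, assuming it lives on the same class, might look like this:
# Hypothetical glue method: map each chunk to a summary, then reduce to one final summary
def summarize_document(self, chunks, summary_type="detailed"):
    chunk_summaries = []
    for chunk in chunks:
        try:
            chunk_summaries.append(self._summarize_chunk(chunk))
        except Exception as e:
            logger.warning(f"Skipping chunk {chunk.chunk_id}: {e}")
    return self._create_final_summary(chunk_summaries, summary_type)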
Creating and storing embeddings
The vector store handles embedding generation and storage. In SEAD-Agent, this process begins by creating the vector store, which is built on ChromaDB. Here is the implementation:
Initialize the embedding model, persistent directory, and collection name.
self.embedding_model = embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
self.persistent_directory = persistent_directory or "./chroma"
self.collection_name = collection_name or "papers"
# Initialize the embedding model locally instead of in ChromaDB
self.embedder = SentenceTransformer(self.embedding_model)
Initialize ChromaDB client.
# Initialize ChromaDB client
self.client = chromadb.PersistentClient(
    path=self.persistent_directory,
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True,
        is_persistent=True
    )
)
To manage collections, use self.client.create_collection(...) to create a new one or self.client.get_collection(...) to retrieve an existing one.
When creating a collection, specify the collection name and embedding function via the name and embedding_function arguments.
# Create or get collection
try:
    self.collection = self.client.get_collection(name=self.collection_name)
    logger.info(f"Loaded existing collection '{self.collection_name}'")
except Exception as e:
    self.collection = self.client.create_collection(
        name=self.collection_name,
        embedding_function=self._custom_embedding_function,
        metadata={"description": "PDF document chunks with text and image content"},
    )
    logger.info(f"Created new collection '{self.collection_name}'")
The embedding function _custom_embedding_function(...) is a custom implementation that wraps self.embedder.encode(texts), as follows:
def _custom_embedding_function(self, texts):
    """Generate embeddings for a list of texts using the local `SentenceTransformer` model.
    Args:
        texts (List[str]): Input text strings to embed.
    Returns:
        List[List[float]]: Embeddings as a list of vectors (one per input text).
    """
    embeddings = self.embedder.encode(texts)
    return embeddings.tolist()
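As a quick sanity check (a sketch, where store stands for an instance of the vector store class), the default all-MiniLM-L6-v2 model returns one 384-dimensional vector per input text:
# Sanity-check sketch: one 384-dimensional vector per input text (all-MiniLM-L6-v2)
vectors = store._custom_embedding_function(["embodied carbon", "operational carbon"])
print(len(vectors), len(vectors[0]))   # 2 384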
To store a text chunk in the vector store, the following components must be prepared:
- ids: a unique identifier for each chunk
- documents: the actual text content of the chunk
- metadatas: additional descriptive information about the document or chunk
- embeddings: the vector representation (embedding) of each chunk
Once these components are ready, they can be added to the collection using the self.collection.add(...) method.
# Prepare data for ChromaDB
ids = [chunk.chunk_id for chunk in chunks]
documents = [chunk.content for chunk in chunks]
metadatas = []
for chunk in chunks:
    metadata = {
        'chunk_type': chunk.chunk_type,
        'page_num': chunk.page_number,
        'source_page': chunk.page_number + 1
    }
    if chunk.metadata:
        metadata.update(chunk.metadata)
    metadatas.append(metadata)
# Generate embeddings using local model
embeddings = self._custom_embedding_function(documents)
self.collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas,
    embeddings=embeddings
)
Document retrieval in RAG
The following get_document(...) method retrieves relevant documents from the vector database via semantic search.
It starts by encoding the input query into a vector representation (embedding) with self.embedder.encode(query).tolist(), then uses this embedding to query the vector store for relevant documents via self.collection.query(...).
Logging is applied to inspect the retrieved documents along with their similarity scores.
def get_document(self, query: str):
    """Query the vector store for the most similar document chunks to the input query.
    Args:
        query (str): Natural language query to search against the stored document embeddings.
    Returns:
        dict: Chroma query result containing lists for keys: "documents", "metadatas",
            and "distances" (ordered from most to least similar).
    """
    # Generate query embedding
    query_embedding = self.embedder.encode(query).tolist()
    results = self.collection.query(
        query_embeddings=[query_embedding],
        n_results=3,  # top-k results
        include=["documents", "metadatas", "distances"]
    )
    distances = results['distances'][0]
    logger.info(f"Query: '{query}' - Distances: {[f'{d:.3f}' for d in distances]}")
    logger.info(f"Documents found: {len(results['documents'][0]) if results['documents'] else 0}")
    for i, doc in enumerate(results['documents'][0]):
        logger.info(f"Document {i + 1}: {doc}")
    for i, metadata in enumerate(results['metadatas'][0]):
        logger.info(f"Metadata {i + 1}: {metadata}")
    return results
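The generation half of RAG, which turns these retrieved chunks into an answer, is not part of this snippet. A minimal sketch, assuming the same Mistral-style chat client used in the summarization code (store is the vector store above; llm_client and model are passed in separately), could look like this:
# Hypothetical answer-generation step: ground the chat model on the retrieved chunks
def answer_question(store, llm_client, model: str, query: str) -> str:
    results = store.get_document(query)
    context = "\n\n".join(results["documents"][0])   # top-k retrieved chunks
    response = llm_client.chat.complete(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.3,
        max_tokens=512,
    )
    return response.choices[0].message.content.strip()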
Implementation Result
Document Summarization
Below are the user interface and log outputs generated when a user uploads a document for summarization.

1. Document Chunking
chatbot-backend-1 | 2025-08-22 11:18:01,039 - service.docsum - INFO - Page 2: extracted 4711 characters
chatbot-backend-1 | 2025-08-22 11:18:01,040 - service.docsum - INFO - Page: 2, Text chunks: ['Author(s) properly formatted YEAR \n \nProc. of the 23rd CIB World Building Congress, 19th – 23rd May 2025, Purdue University, West Lafayette, USA \n2 \nemissions by mid-century (United Nations, 2015; Wang et al., 2021). However, reducing emissions in \nurban plann
ing, particularly in developed countries, is challenging due to complex urban infrastructure, \neconomic factors, and resource limitations (Wang and Jiang, 2020; Sharif and Tauqir,
2021).', "Barriers \nsuch as limited high-resolution building data and decision-support tools, especially in the U.S., hinder \nprogress despite the country's potential to signifi
cantly impact global emission reductions (Larch and \nWanner, 2024). \nCurrent policy-making tools often lack building-level insights, underscoring the need for solutions that \nen
able data-driven decisions.", 'With the objective of moving towards urban sustainability, this study \naims to develop a standalone simulation tool applicable to all the cities in
the United States to assess \nthe potential scenarios for embodied and operational carbon in these cities. This research introduces \nEcoSphere, an integrated software solution des
igned to address the need for detailed, building-level \ndata and provide stakeholders with accessible simulations to guide sustainable urban planning.', 'EcoSphere leverages the N
ational Structure Inventory, combined with computer vision and natural \nlanguage processing applied to Google Street View and satellite imagery, to create a high-resolution \ndata
set categorizing buildings by material and structure. Its scenario-based simulations enable \nstakeholders to evaluate the impacts of policy decisions on carbon emissions and costs
, offering a \nrobust, accessible tool for sustainable urban planning.', '2 Literature Review \nData-driven decision-support tools have become increasingly vital for managing carbo
n emissions in \ncities, where built environments play a major role in overall greenhouse gas output (Aumann, 2007; \nFaulin et al., 2010; de Paula Ferreira, Armellini and De Santa
-Eulalia, 2020). Two predominant \nresearch strands underlie this field.', 'The first emphasizes predictive models, using statistical or \nmachine learning methods to forecast emis
sion levels under specific conditions (Saad et al., 2020; Chu \nand Zhao, 2021; Fang, Lu and Li, 2021; Su et al., 2023). Although these approaches offer initial \nbaselines, they t
ypically lack sufficient granularity to inform building-level interventions (Gao et al., \n2023; Hu and Ghorbany, 2024).', 'The second strand harnesses simulation-based techniques—
for \nexample, modeling hypothetical changes in transportation networks (Gao, Hu and Peng, 2014; Wu and \nZhao, 2016) or tracking land expansion (Hu et al., 2022; Wang, Zeng and Ch
en, 2022; Li et al., 2023; \nTian and Zhao, 2024; Wu et al., 2024), to explore “what-if” scenarios. Yet, many of these studies omit \nthe embodied carbon of diverse building stocks
or focus narrowly on a single city sector.', 'Where researchers attempt building-level analyses at scale, top-down estimates often prevail (Hu and \nGhorbany, 2024), aggregating e
missions while overlooking material, structural, and age differences \namong individual buildings (Li and Deng, 2023). Conversely, smaller-scale studies may use audits or \nmachine
learning to derive operational energy and embodied carbon intensities, but they struggle to \nexpand beyond limited samples (Zhang et al., 2021).', 'This reveals a pressing gap: e
xisting frameworks \nrarely provide high-resolution, bottom-up modeling at a citywide scale (Ghorbany and Hu, 2024).', 'Data constraints compound this challenge: open records such
as county assessor database are \nfrequently incomplete, and remote-sensing data must be filtered through specialized computer vision \nprocesses (Ghorbany and Hu, 2024; Ghorbany,
Hu, Sisk, et al., 2024; Ghorbany, Hu, Yao and Wang, \n2024; Ghorbany, Hu, Yao, Wang, et al., 2024; Hu et al., 2025; Yao et al., 2025).', 'Even when such datasets can be constructed
, decision-makers often lack user-friendly simulation \ntools that integrate embodied and operational emissions under various policy scenarios (Hu, 2022; Hu \nand Ghorbany, 2024).
Existing methods typically focus on singular aspects—like renovation costs or \ndemolition rates—without offering a unified framework to capture the interplay of material \nsubstit
utions, building lifespans, and urban expansion.', 'The outcome is limited practical relevance for \ncity planners, who need both financial and environmental metrics to guide polic
y. \nBy developing a scalable, archetype-based decision-support platform, the present study addresses \nthese key deficits. Rather than relying on top-down averages, EcoSphere leverages the National.']
chatbot-backend-1 | 2025-08-22 11:18:01,040 - service.docsum - INFO - Page 2: created 12 text chunks
2. Image Captioning
chatbot-backend-1 | 2025-08-22 11:18:04,234 - service.docsum - INFO - Page 3: found 2 images
chatbot-backend-1 | 2025-08-22 11:18:06,091 - service.docsum - INFO - Page 3, Image 1, Image caption: The image appears to be a logo. It features a stylized representation of a brain with a leaf integrated into its design. The brain is depicted in a simplified, minimalist manner, with the leaf positioned at the center, symbolizing a connection between neuro
science or cognitive processes and nature or environmental aspects. The color scheme is primarily green, which may signify growth, health, or eco-friendliness. This logo could potentially belong to an organization or project that focuses on environmental neuroscience, brain health in natural settings, or similar interdisciplinary fields.
chatbot-backend-1 | 2025-08-22 11:18:08,694 - service.docsum - INFO - Page 3, Image 2, Image caption: This image appears to be a flowchart related to a research paper, likely in the field of urban planning, architecture, or environmental science. The flowchart is divided into four main stages, each represented by a distinct section. Here is a detailed description of each section:
chatbot-backend-1 |
chatbot-backend-1 | ### #1 - Data Creation Stage
chatbot-backend-1 | - **Inputs:**
chatbot-backend-1 | - **City Satellite Imagery:** Used for initial data collection.
chatbot-backend-1 | - **City Google Street Views:** Provides additional visual data.
chatbot-backend-1 | - **NSI Data:** Likely refers to Non-Spatial Information data.
chatbot-backend-1 | - **Processes:**
chatbot-backend-1 | - **Image Segmentation:** Divides the image into meaningful parts.
chatbot-backend-1 | - **Geo-Coordinate Assignment:** Assigns geographical coordinates to the data.
chatbot-backend-1 | - **CNN
3. Chunk Summary
chatbot-backend-1 | 2025-08-22 11:18:44,387 - service.docsum - INFO - Summarizing chunk of type: text, length: 191 sentences
chatbot-backend-1 | 2025-08-22 11:18:48,062 - service.docsum - INFO - Chunk 1 summary: Certainly! Here is a comprehensive summary of the document:
chatbot-backend-1 |
chatbot-backend-1 | ### Main Objectives and Research Questions
chatbot-backend-1 | The primary objective of this study is to develop EcoSphere, a specialized software tool designed to evaluate and balance embodied and operational carbon emissions alongside construction and environmental costs in urban planning. The research questions likely revolve around how to effectively integrate these factors into urban planning processes to achieve sustainable and environmentally friendly outcomes.
chatbot-backend-1 |
chatbot-backend-1 | ### Key Findings and Conclusions
chatbot-backend-1 | The study concludes that EcoSphere is a successful and innovative tool for urban planners. It effectively balances the trade-offs between carbon emissions (both embodied and operational) and the associated costs. The software provides valuable insights that help planners make informed decisions, ultimately leading to more sustainable urban development projects.
chatbot-backend-1 |
chatbot-backend-1 | ### Methodology Used
chatbot-backend-1 | The methodology involves the development of the EcoSphere software, which integrates complex algorithms and models to calculate embodied and operational carbon emissions. The software likely uses life cycle assessment (LCA) methods to quantify these emissions. Additionally, it incorporates cost-benefit analysis to evaluate the environmental and financial implications of different urban planning scenarios.
chatbot-backend-1 |
chatbot-backend-1 | ### Implications and Applications
chatbot-backend-1 | The implications of this study are significant for urban planning and sustainability. EcoSphere can be used by urban planners, policymakers, and developers to:
chatbot-backend-1 | - Assess the environmental impact of different construction materials and methods.
chatbot-backend-1 | - Optimize designs to minimize carbon footprints.
chatbot-backend-1 | - Balance environmental costs with financial constraints.
chatbot-backend-1 | - Promote sustainable urban development practices.
chatbot-backend-1 |
chatbot-backend-1 | ### Limitations and Future Work
chatbot-backend-1 | The study acknowledges certain limitations, such as the complexity of accurately modeling all variables and the need for more extensive validation with real-world data. Future work may include:
chatbot-backend-1 | - Enhancing the software's capabilities to include more detailed and localized environmental data.
chatbot-backend-1 | - Conducting case studies to validate the software's effectiveness in real-world scenarios.
chatbot-backend-1 | - Expanding the software's functionality to cover additional environmental factors and social considerations.
chatbot-backend-1 |
chatbot-backend-1 | This summary encapsulates the core aspects of the document, providing a clear understanding of the study's objectives, methodology, findings, and future directions.
chatbot-backend-1 | 2025-08-22 11:18:48,062 - service.docsum - INFO - Summarizing chunk of type: text, length: 324 sentences
4. All Chunks Summary
chatbot-backend-1 | 2025-08-22 11:19:07,369 - service.docsum - INFO - All chunks summary: ### Summary of the Document "EcoSphere: A Decision-Support Tool for Automated Carbon Emission and Cost Optimization in Sustainable Urban Development"
chatbot-backend-1 |
chatbot-backend-1 | #### Main Objectives and Research Questions
chatbot-backend-1 | - **Objective**: Develop EcoSphere, a decision-support tool to automate the optimization of carbon emissions and costs in sustainable urban development.
chatbot-backend-1 | - **Research Questions**:
chatbot-backend-1 | - How to model and quantify embodied carbon in construction materials and processes?
chatbot-backend-1 | - What methodologies optimize both carbon emissions and costs?
chatbot-backend-1 | - How to design a decision-support tool for stakeholders?
chatbot-backend-1 |
chatbot-backend-1 | #### Key Findings and Conclusions
chatbot-backend-1 | - **Embodied Carbon Analysis**: Embodied carbon in construction materials significantly contributes to greenhouse gas emissions.
chatbot-backend-1 | - **Optimization Techniques**: Life cycle assessment (LCA) and multi-objective optimization algorithms were effective.
chatbot-backend-1 | - **EcoSphere Tool**: Developed to automate evaluation and optimization of carbon emissions and costs, integrating multiple data sources and advanced algorithms.
chatbot-backend-1 | - **Case Studies**: Demonstrated effectiveness in reducing carbon emissions and costs in real-world scenarios.
chatbot-backend-1 |
chatbot-backend-1 | #### Methodology Used
chatbot-backend-1 | - **Data Collection**: Comprehensive data on construction materials, processes, and carbon emissions.
chatbot-backend-1 | - **Modeling**: Mathematical models to quantify embodied carbon and associated costs.
chatbot-backend-1 | - **Optimization Algorithms**: Multi-objective optimization algorithms to balance carbon emissions and costs.
chatbot-backend-1 | - **Tool Development**: Creation of EcoSphere with user-friendly interfaces for stakeholders.
chatbot-backend-1 |
chatbot-backend-1 | #### Implications and Applications
chatbot-backend-1 | - **Policy Making**: Aids policymakers in formulating regulations and incentives for sustainable urban development.
chatbot-backend-1 | - **Industry Practices**: Helps construction companies and urban planners make informed decisions, reducing carbon footprints and costs.
chatbot-backend-1 | - **Education and Research**: Useful in academic settings to educate future professionals about sustainable practices.
chatbot-backend-1 |
chatbot-backend-1 | #### Limitations and Future Work
chatbot-backend-1 | - **Data Accuracy**: Accuracy depends on the quality and comprehensiveness of input data.
chatbot-backend-1 | - **Scalability**: Future work should focus on scaling the tool for larger and more complex projects.
chatbot-backend-1 |
chatbot-backend-1 | ### Visual Elements Summary
chatbot-backend-1 | - **Images/Charts/Diagrams**: Likely depict the process of data collection, modeling, and optimization. Case studies may be represented with before-and-after comparisons showing reductions in carbon emissions and costs.
chatbot-backend-1 |
chatbot-backend-1 | ### Overall Message or Findings
chatbot-backend-1 | The EcoSphere tool is a significant advancement in automating the optimization of carbon emissions and costs in urban development. It integrates advanced methodologies and data sources to provide actionable insights for stakeholders, aiding in the achievement of sustainability goals. While there are limitations related to data accuracy and scalability, future work can address these to enhance the tool's effectiveness and applicability.
5. Add Chunks to Vector Store
chatbot-backend-1 | 2025-08-22 11:18:37,573 - service.docsum - INFO - Total chunks created: 118
chatbot-backend-1 | 2025-08-22 11:18:39,425 - service.vector_store - INFO - Added 118 chunks to ChromaDB collection
Document Retrieval
These are the user interface and log outputs for the Q&A task when a user asks a question related to the uploaded document in the system.

chatbot-backend-1 | 2025-08-22 11:46:32,473 - service.vector_store - INFO - Query: 'city planning, large-scale city planning, engineering design, EcoSphere' - Distances: ['0.797', '0.797', '0.797']
chatbot-backend-1 | 2025-08-22 11:46:32,473 - service.vector_store - INFO - Documents found: 3
chatbot-backend-1 | 2025-08-22 11:46:32,473 - service.vector_store - INFO - Document 1: This is particularly
chatbot-backend-1 | helpful since some cities can be hugely affected by these strategies while the other cities might already
chatbot-backend-1 | be in their optimal status and further deployment of these strategies can downgrade their status.
chatbot-backend-1 | EcoSphere’s Regional Review section includes visualizations that compare emissions and construction
chatbot-backend-1 | costs across different scenarios, supporting a clear understanding of the data.
...
chatbot-backend-1 | 2025-08-22 11:46:32,473 - service.vector_store - INFO - Metadata 1: {'chunk_id': 'text_7_3', 'chunk_type': 'text', 'page_num': 7, 'source_page': 8}
Conclusion
Building a RAG pipeline from scratch with PyMuPDF and ChromaDB provides valuable insights into how document summarization and Q&A systems work under the hood. The key components are:
- Document Processing: Text and images are efficiently extracted using PyMuPDF.
- Image Captioning: Captions are generated with Vision-Language Models (VLMs).
- Chunking Strategy: Text is split into contiguous chunks to fit within model input limits. (This approach simplifies processing though it may slightly reduce context continuity between chunks compared to overlapping strategies.)
- Summarization Method: Each chunk is summarized individually, and the summaries are combined for comprehensive results.
- Vector Storage: Embeddings are stored in ChromaDB with indexing to enable efficient and scalable semantic search.
- RAG Integration: Retrieval is combined with generation to deliver accurate and contextually grounded responses.
The implementation demonstrates how these components work together in a production system. By understanding each piece individually, you can better customize and optimize your own RAG applications.
For the complete implementation, refer to the SEAD-Agent repository:
- Document processing, chunking, and summarization: docsum.py
- Embedding and storing: vector_store.py
- Prompt templates: prompt_message.py