Unlocking the Power of Chatting with Code: Revolutionizing Codebase Understanding and Developer Productivity👨🏽💻🚀
Author(s): Rohit Sharma
Originally published on Towards AI.
Imagine diving into a new codebase without having to sift through endless lines of code, documentation, or relying on colleagues for explanations.
What if you could simply ask questions about the code — like, “What are the main classes in this module?” or “How does this function interact with the database?” — and instantly receive clear, context-rich answers? This is the power of chatting with code, and it’s changing the way developers, both new and experienced, engage with large, complex repositories.🤩
For newcomers ramping up on a project, chat-driven exploration is a game-changer. Instead of navigating the daunting maze of unfamiliar code, they can get personalized insights and instantly clarify how different parts fit together. Meanwhile, even seasoned experts can benefit — whether it’s for understanding how a particular legacy module works or identifying dependencies between components. No more guessing or hours lost to manual code navigation.
Chatting with code accelerates onboarding, enhances team productivity, and ensures that even the most intricate or undocumented parts of a codebase can be understood intuitively. Whether you’re diving into an open-source project, onboarding new developers, or working on a fast-moving codebase — this AI-driven approach is the future of seamless code comprehension.
🛠️ Implementation
In this post, I will walk you through the creation of a conversational AI system that allows you to interact with your own Python code repository and extract meaningful information from it. By leveraging tools like LangChain, Chroma, and OpenAI embeddings, you’ll create a scalable and efficient query system capable of processing documents, retrieving relevant code information, and answering questions intelligently.
🎓What You’ll Learn:
– How to load documents from a file system using LangChain’s `GenericLoader`
– How to split large code files into manageable chunks with `RecursiveCharacterTextSplitter`
– How to build a persistent Chroma vector store for embeddings
– How to create a retrieval-based QA system with memory-aware conversation
📜Prerequisites:
– Basic knowledge of Python
– Understanding of embeddings and vector databases
– Familiarity with LangChain, Chroma, and OpenAI embeddings
⚙️1. Setting up the Environment
First, we need to load environment variables using `dotenv`. This step ensures that API keys and configurations are safely loaded into the environment. Next, we suppress any warnings to keep our logs clean.
import os
from dotenv import load_dotenv
load_dotenv()
import warnings
warnings.filterwarnings("ignore")
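As a quick sanity check, you can confirm that the keys this tutorial relies on were actually loaded — OpenAI for embeddings and Groq for the chat model used later. Both key names are assumptions based on the providers used below:

# Optional check: these two keys are assumed by the rest of this tutorial
for key in ("OPENAI_API_KEY", "GROQ_API_KEY"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")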
📋 2. Loading Documents from the Filesystem
For the scope of this tutorial, I cloned the langchain-core source code from its GitHub repository and performed RAG over it.
LangChain’s `GenericLoader` allows you to easily load documents from a given repository or directory. In this case, we are pointing the loader to a local directory (`repo_path`) that contains the Python source files we want to analyze.
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language
repo_path = "./langchain_core"
loader = GenericLoader.from_filesystem(
    repo_path,
    glob="**/*",
    suffixes=[".py"],
    exclude=["**/non-utf8-encoding.py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
docs = loader.load()
– `repo_path`: The path to your code repository.
– `suffixes`: Filters to only include `.py` files (Python scripts).
– `exclude`: Skips files with non-UTF-8 encoding.
– `LanguageParser`: Parses the files as Python source; `parser_threshold=500` sets the minimum number of lines a file must have before it is segmented into function- and class-level documents.
We now have a collection of documents loaded from the repository. You can inspect metadata to see the origins of the loaded files:
from pprint import pprint

for document in docs:
    pprint(document.metadata)
To verify the content of the loaded files, print their contents:
print("\n\n--8<--\n\n".join([document.page_content for document in docs]))
✂️ 3. Splitting the Documents into Chunks
For large files, breaking them into smaller chunks improves both storage and retrieval performance. We use the `RecursiveCharacterTextSplitter` for this purpose.
from langchain_text_splitters import RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
)
texts = python_splitter.split_documents(docs)
This process breaks the documents into smaller parts, each up to 2,000 characters, with a 200-character overlap between consecutive chunks for better continuity.
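To confirm the split behaved as expected, you can print the chunk count and peek at the first chunk (the numbers will vary with the repository snapshot you cloned):

# Inspect the result of the split
print(f"Loaded {len(docs)} documents, split into {len(texts)} chunks")
print(texts[0].page_content[:500])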
💼 4. Storing Embeddings Using Chroma
To make the content of the repository searchable, we will create a vector store using Chroma. Each document chunk is converted into an embedding (vector representation) using `OpenAIEmbeddings`. Chroma is a vector database that will store these embeddings persistently.
💾Creating a Persistent Chroma Vector Store:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
persist_directory = "./chroma_langchain_RecursiveCharacterTextSplitter_with_Python_language"
db = Chroma.from_documents(
    texts,
    OpenAIEmbeddings(disallowed_special=()),
    persist_directory=persist_directory,
)
Loading an Existing Chroma Vector Store:
If you have already created the vector store and saved it, you can load it using the following code:
db = Chroma(
    persist_directory=persist_directory,
    embedding_function=OpenAIEmbeddings(disallowed_special=()),
)
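Either way, a quick raw similarity search is a handy way to confirm the store is populated (the query string here is just an example):

# Sanity-check the vector store with a plain similarity search
hits = db.similarity_search("How is the Runnable interface defined?", k=2)
for hit in hits:
    print(hit.metadata.get("source"))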
💬 5. Building a Conversational Retriever
With the documents loaded and embedded, we can now create a retriever that searches based on maximum marginal relevance (`mmr`) or similarity. This allows for better query results by prioritizing diverse and relevant answers.
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 8},
)
– `k=8`: Specifies the number of results to retrieve.
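Before wiring the retriever into a chain, you can also invoke it directly and see which source files it surfaces (again, the query is only an example):

# Try the retriever on its own — it returns a list of Documents
docs_for_query = retriever.invoke("Where is RunnableLambda implemented?")
for d in docs_for_query:
    print(d.metadata.get("source"))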
🤖6. Setting Up the Language Model (LLM)
To interact with the repository, we use a large language model (LLM) for natural language queries. Here we use Groq’s hosted Llama 3 70B model via `ChatGroq`, but you can substitute other models such as OpenAI’s GPT (a sketch of that swap follows the code below).
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
llm = ChatGroq(model_name="llama3-70b-8192")
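If you would rather use an OpenAI-hosted model instead of Groq, a drop-in swap might look like the sketch below (assuming `langchain_openai` is installed and `OPENAI_API_KEY` is set; the model name is just an example):

# Alternative: use an OpenAI chat model instead of Groq
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)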
⛓️7. Creating the Query Chains
Two types of prompts are used: one for retrieving relevant documents and another for answering the user’s question based on the retrieved documents.
🔗️Retrieval Chain:
This chain generates a search query from the user input and retrieves the most relevant documents from the Chroma vector store.
prompt1 = ChatPromptTemplate.from_messages(
    [
        ("placeholder", "{chat_history}"),
        ("user", "{input}"),
        ("user", "Generate a search query relevant to the conversation"),
    ]
)
retriever_chain = create_history_aware_retriever(llm, retriever, prompt1)
🔗️Question-Answer Chain:
This chain answers the user’s questions based on the context provided by the retriever.
prompt2 = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer the user's questions based on the context:\n\n{context}"),
        ("placeholder", "{chat_history}"),
        ("user", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, prompt2)
qa = create_retrieval_chain(retriever_chain, question_answer_chain)
∞ 8. The Query Loop
Now, we set up an interactive loop where users can continuously input questions, and the system will retrieve relevant code snippets and provide answers.
while True:
    question = input("Enter your question (or type 'exit' to quit): ")
    if question.lower() == 'exit':
        print("Exiting the loop. Goodbye!")
        break
    result = qa.invoke({"input": question})
    print("Answer:", result["answer"])
    if "context" in result:
        print("\nRetrieved Documents:")
        for doc in result["context"]:
            print(f"- {doc.page_content}\n")
This loop runs until the user types “exit.” The system retrieves the answer and prints the relevant documents for further inspection.
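One thing worth noting: as written, the loop never passes `chat_history`, so the history-aware retriever has no earlier turns to draw on. A minimal sketch of carrying the conversation forward, assuming the standard message types from `langchain_core`:

from langchain_core.messages import AIMessage, HumanMessage

chat_history = []
while True:
    question = input("Enter your question (or type 'exit' to quit): ")
    if question.lower() == 'exit':
        print("Exiting the loop. Goodbye!")
        break
    result = qa.invoke({"input": question, "chat_history": chat_history})
    print("Answer:", result["answer"])
    # Keep the running conversation so follow-up questions have context
    chat_history.append(HumanMessage(content=question))
    chat_history.append(AIMessage(content=result["answer"]))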
🙋9. Example Queries
To illustrate the power of the system, you can ask it questions about the codebase. The sample outputs below use a small `tell_me` helper rather than the interactive loop; a sketch of such a helper is shown first.
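Assuming `tell_me` simply wraps the `qa` chain we built above, a minimal sketch might look like this:

# Hypothetical helper: asks a single question through the retrieval chain
def tell_me(question: str) -> str:
    result = qa.invoke({"input": question})
    print("Answer:", result["answer"])
    return result["answer"]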
👀 Sample Output 1:
tell_me("What are the major blocks in LangChain core?")
Answer: The LangChain Core library is organized into several major blocks, which are:
Models: This block includes the core abstractions for Language Models (LLMs) and Chat Models. LLMs are traditional language models that take a string as input and return a string. Chat Models, on the other hand, use a sequence of messages as inputs and return chat messages as outputs.
Interactions: LLMs can be composed together using the | operator to create a chain of models.
Chat Models can also be composed together, and they support the assignment of distinct roles to conversation messages.
Runnables: This block defines the Runnable interface, which represents a unit of computation that can be executed with input data. Runnables can be composed together to create complex workflows.
Interactions: Runnables can be composed together using the | operator to create a chain of runnables.
Runnables can be configured with retry policies, lifecycle listeners, and other features.
Vector Stores: This block includes the abstractions for vector stores, which are used to store and retrieve embeddings (dense vectors) associated with text data.
Interactions: Vector stores can be used to store and retrieve embeddings generated by language models.
Vector stores can be used in conjunction with example selectors to select relevant examples for a given input.
Embeddings: This block includes the abstractions for embeddings, which are dense vectors that represent text data.
Interactions: Embeddings can be generated by language models.
Embeddings can be stored in vector stores.
Embeddings can be used in conjunction with example selectors to select relevant examples for a given input.
Prompts: This block includes the abstractions for prompts, which are used to generate input data for language models.
Interactions: Prompts can be used to generate input data for language models.
Prompts can be used in conjunction with example selectors to select relevant examples for a given input.
Example Selectors: This block includes the abstractions for example selectors, which are used to select relevant examples from a dataset based on input data.
Interactions: Example selectors can be used to select relevant examples for a given input.
Example selectors can be used in conjunction with prompts and vector stores to generate input data for language models.
Messages: This block includes the abstractions for messages, which are used to represent conversational data.
Interactions: Messages can be used to represent conversational data in chat models.
Messages can be used to store and retrieve conversational history.
These blocks interact with each other to enable the creation of complex language models and conversational AI systems.
👀 Sample Output 2:
tell_me("What are the key methods and classes in LangChain core?")
Answer: Based on the provided context, here are some key methods, classes, and functions in LangChain Core:
Classes:
RunnableLambda: A class that represents a runnable lambda function.
RunnableParallel: A class that represents a runnable parallel operation.
RunnableWithFallbacks: A class that represents a runnable with fallbacks.
BaseCallbackHandler: A base class for callback handlers.
BaseCallbackManager: A base class for callback managers.
AsyncCallbackHandler: A class that represents an asynchronous callback handler.
AsyncCallbackManager: A class that represents an asynchronous callback manager.
BaseRunManager: A base class for run managers.
RunManager: A class that represents a run manager.
CallbackManager: A class that represents a callback manager.
BaseLLM: A base class for language models.
LLM: A class that represents a language model.
Functions:
surface_langchain_beta_warnings(): A function that surfaces LangChain beta warnings.
surface_langchain_deprecation_warnings(): A function that surfaces LangChain deprecation warnings.
try_load_from_hub(): A function that loads a model from a hub.
get_bolded_text(): A function that gets bolded text.
get_colored_text(): A function that gets colored text.
print_text(): A function that prints text.
batch_iterate(): A function that iterates over a batch of items.
comma_list(): A function that converts a list to a comma-separated string.
stringify_value(): A function that converts a value to a string.
guard_import(): A function that guards an import statement.
Methods:
invoke(): A method that invokes a runnable with a given input.
with_retry(): A method that adds a retry policy to a runnable.
with_parallel(): A method that adds parallel execution to a runnable.
get_lc_namespace(): A method that gets the namespace of a LangChain object.
on_llm_start(): A method that is called when an LLM starts.
on_llm_stream(): A method that is called when an LLM streams output.
on_llm_end(): A method that is called when an LLM ends.
These are some of the key methods, classes, and functions in LangChain Core. Note that this is not an exhaustive list, and there may be other important components in the LangChain Core library.
🕵️ The Detailed Flow
💡 Conclusion
In this tutorial, you’ve built a complete query system that interacts with a Python code repository. You’ve learned how to load documents, split them into manageable chunks, embed them into a vector store, and retrieve relevant information using natural language queries. The setup is versatile and can be adapted to various projects, allowing developers to interactively query their codebases.
This system not only improves productivity but also opens new possibilities for AI-driven code analysis and exploration.
Feel free to extend this by integrating additional models or adapting the document loader to handle other file types!
👨💼Follow me on [LinkedIn](https://in.linkedin.com/in/rohit0221) for more updates.