Stop Searching Your Codebase Manually — Build Your Own GenAI to Do It for You
Last Updated on April 15, 2025 by Editorial Team
Author(s): Shubham Gupta
Originally published on Towards AI.
Modern codebases are growing more complex by the day. Developers often waste hours digging through files just to understand how a function works, where it’s used, or how components connect. Whether you’re onboarding to a new project or revisiting an old one, finding answers inside your own code can feel like searching for a needle in a haystack.
What if you could just ask questions like:
"What is this callback doing?"
"Where is the design element of the page?"
"What are all the data files being used in the code?"
…and get clear, AI-generated answers based on your entire codebase?
That’s exactly what I’ve built.

In this blog, I’ll show you how I’ve created a tool that gives a Generative AI model full access to your repository and allows you to interact with your code like a conversation. No more grepping through files. Just ask — and get instant, intelligent responses.
To demonstrate how to build and use your AI tool with a specific repository, let’s walk through an example using a basic Python Dash application. This will illustrate how your tool can access the repository and provide answers to questions about the codebase.
Example Repository: Python Dash Application
To follow along with the example, check out the Python Dash application repository. Here’s how the project is structured:
demo_sample_app/
├── app.py
├── data.py
├── index.py
└── pages/
    ├── page1.py
    └── page2.py
First, let's import all the required libraries:
```python
import ast
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd
from directory_tree import DisplayTree
from openai import OpenAI
```
To interact with OpenAI’s API, you’ll need to generate your own API key and paste it into the designated section of the code below.
```python
client = OpenAI(max_retries=5, api_key="")  # paste your API key here
```
The code below analyzes a Python repo by extracting functions, classes, imports, and metadata, then generates OpenAI embeddings to represent the code semantically. With these embeddings, you can search or ask questions about your codebase and get back the most relevant files or snippets based on semantic similarity.
```python
# Constants
NEWLINE = '\n'
DEF_PREFIXES = ('def ', 'async def ')
CLASS_PREFIX = 'class '
IMPORT_PATTERN = re.compile(r'^(?:from\s+\S+\s+import\s+\S+|import\s+\S+)')
COMMENT_PATTERN = re.compile(r'^\s*#')
DOCSTRING_PATTERN = re.compile(r'^\s*(\'\'\'|\"\"\")')
ASSIGNMENT_PATTERN = re.compile(r'^\s*\w+\s*=')
```
```python
def get_embedding(text: str, model: str = "text-embedding-3-small", **kwargs) -> list:
    # Ensure text is a string
    if not isinstance(text, str):
        raise ValueError("Input text must be a string.")
    # Replace newlines, which can negatively affect embedding quality
    text = text.replace("\n", " ")
    # Create the embedding
    response = client.embeddings.create(input=text, model=model, **kwargs)
    return response.data[0].embedding
```
```python
def extract_function_name(line: str) -> str:
    """
    Extract the function name from a line starting with 'def' or 'async def'.
    """
    for prefix in DEF_PREFIXES:
        if line.startswith(prefix):
            return line[len(prefix):line.index('(')].strip()
    return ""


def extract_class_name(line: str) -> str:
    """
    Extract the class name from a line starting with 'class'.
    """
    if line.startswith(CLASS_PREFIX):
        # Stop at '(' if the class has bases; otherwise trim the trailing ':'
        end = line.index('(') if '(' in line else len(line)
        return line[len(CLASS_PREFIX):end].strip().rstrip(':')
    return ""
```
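These two helpers are pure string manipulation, so they are easy to sanity-check in isolation. Here is a self-contained sketch (reproducing the constants above; the input lines are made up for illustration):

```python
DEF_PREFIXES = ('def ', 'async def ')
CLASS_PREFIX = 'class '

def extract_function_name(line: str) -> str:
    """Extract the function name from a 'def' or 'async def' line."""
    for prefix in DEF_PREFIXES:
        if line.startswith(prefix):
            return line[len(prefix):line.index('(')].strip()
    return ""

def extract_class_name(line: str) -> str:
    """Extract the class name from a 'class' line, with or without bases."""
    if line.startswith(CLASS_PREFIX):
        end = line.index('(') if '(' in line else len(line)
        return line[len(CLASS_PREFIX):end].strip().rstrip(':')
    return ""

print(extract_function_name('def load_data(path):'))              # load_data
print(extract_function_name('async def fetch(url):'))             # fetch
print(extract_class_name('class CodeVisitor(ast.NodeVisitor):'))  # CodeVisitor
print(extract_class_name('class Config:'))                        # Config
```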
```python
def extract_code_block(lines: List[str], start_index: int) -> str:
    """
    Extract a block of code (function or class) starting from the given index.
    """
    block = [lines[start_index]]
    indent_level = len(lines[start_index]) - len(lines[start_index].lstrip())
    for line in lines[start_index + 1:]:
        current_indent = len(line) - len(line.lstrip())
        if current_indent > indent_level or not line.strip():
            block.append(line)
        else:
            break
    return NEWLINE.join(block)
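To make the indentation logic concrete, here is the same helper exercised on a made-up snippet: it collects every line more indented than the starting line (plus blank lines) and stops at the first line that returns to the original indent level:

```python
NEWLINE = '\n'

def extract_code_block(lines, start_index):
    # Keep lines that are more indented than the starting line (or blank)
    block = [lines[start_index]]
    indent_level = len(lines[start_index]) - len(lines[start_index].lstrip())
    for line in lines[start_index + 1:]:
        current_indent = len(line) - len(line.lstrip())
        if current_indent > indent_level or not line.strip():
            block.append(line)
        else:
            break
    return NEWLINE.join(block)

# Hypothetical source, as a list of lines
source = [
    "def greet(name):",
    "    msg = f'Hello, {name}!'",
    "    return msg",
    "",
    "x = greet('World')",
]
block_text = extract_code_block(source, 0)
print(block_text)  # the function body only; 'x = greet(...)' is excluded
```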
```python
def get_file_metadata(filepath: Path) -> Dict:
    """
    Retrieve metadata for the given file.
    """
    stats = filepath.stat()
    return {
        'file_size': stats.st_size,
        'creation_time': datetime.fromtimestamp(stats.st_ctime),
        'modification_time': datetime.fromtimestamp(stats.st_mtime),
        'permissions': oct(stats.st_mode)[-3:]
    }
```
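A quick way to see the shape of this metadata is to run the helper against a throwaway file (a sketch using a temporary directory; the filename and contents are arbitrary):

```python
import tempfile
from datetime import datetime
from pathlib import Path

def get_file_metadata(filepath: Path) -> dict:
    stats = filepath.stat()
    return {
        'file_size': stats.st_size,
        'creation_time': datetime.fromtimestamp(stats.st_ctime),
        'modification_time': datetime.fromtimestamp(stats.st_mtime),
        'permissions': oct(stats.st_mode)[-3:],
    }

with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "demo.py"
    f.write_text("print('hi')\n")  # 12 bytes
    meta = get_file_metadata(f)
    print(meta['file_size'], meta['permissions'])
```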
```python
def analyze_python_file(filepath: Path, code_root: Path) -> Dict:
    """
    Analyze a Python file to extract its structural elements and metadata.
    """
    with open(filepath, 'r', encoding='utf-8', errors='replace') as file:
        content = file.read()

    # Initialize containers for the different code elements
    functions, classes, imports, assignments, top_level_code = [], [], [], [], []

    # Parse the AST of the file content
    try:
        tree = ast.parse(content, filename=str(filepath))
    except SyntaxError:
        print(f"Syntax error in file: {filepath}")
        return {}

    # Visitor class to traverse the AST
    class CodeVisitor(ast.NodeVisitor):
        def __init__(self):
            self.current_class = None

        def visit_Import(self, node):
            imports.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
            imports.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_FunctionDef(self, node):
            func_info = {
                'name': node.name,
                'code': ast.unparse(node).strip()
            }
            if self.current_class:
                func_info['class'] = self.current_class
            functions.append(func_info)
            self.generic_visit(node)

        # Treat async functions the same way as regular ones
        visit_AsyncFunctionDef = visit_FunctionDef

        def visit_ClassDef(self, node):
            classes.append({
                'name': node.name,
                'code': ast.unparse(node).strip()
            })
            # Track the enclosing class while traversing its methods
            self.current_class = node.name
            self.generic_visit(node)
            self.current_class = None

        def visit_Assign(self, node):
            assignments.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_Expr(self, node):
            # Skip bare string expressions (docstrings); keep other top-level code
            if not (isinstance(node.value, ast.Constant) and isinstance(node.value.value, str)):
                top_level_code.append(ast.unparse(node).strip())
            self.generic_visit(node)

    visitor = CodeVisitor()
    visitor.visit(tree)

    return {
        'filepath': filepath.relative_to(code_root),
        'content': content,
        'imports': imports,
        'functions': functions,
        'classes': classes,
        'assignments': assignments,
        'top_level_code': top_level_code,
        **get_file_metadata(filepath)  # reuse the metadata helper defined above
    }
```
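The visitor pattern is the heart of this function, so it is worth seeing in isolation. The stripped-down sketch below runs the same kind of `ast.NodeVisitor` over a small in-memory source string (a made-up module) instead of a file, collecting only names for brevity:

```python
import ast

# Hypothetical module to analyze
source = '''
import pandas as pd
from dash import html

TITLE = "Demo"

def make_layout():
    return html.Div(TITLE)

class Helper:
    def run(self):
        return 42
'''

imports, functions, classes, assignments = [], [], [], []

class CodeVisitor(ast.NodeVisitor):
    def visit_Import(self, node):
        imports.append(ast.unparse(node).strip())

    def visit_ImportFrom(self, node):
        imports.append(ast.unparse(node).strip())

    def visit_FunctionDef(self, node):
        functions.append(node.name)
        self.generic_visit(node)  # descend into nested defs

    def visit_ClassDef(self, node):
        classes.append(node.name)
        self.generic_visit(node)  # descend so methods are also recorded

    def visit_Assign(self, node):
        assignments.append(ast.unparse(node).strip())

CodeVisitor().visit(ast.parse(source))
print(imports)      # ['import pandas as pd', 'from dash import html']
print(functions)    # ['make_layout', 'run']
print(classes)      # ['Helper']
```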
```python
def extract_repository_details(code_root: str) -> List[Dict]:
    """
    Extract details from all Python files in the specified code repository.
    """
    code_root_path = Path(code_root).resolve()
    python_files = list(code_root_path.rglob('*.py'))
    if not python_files:
        print('No Python files found in the specified directory.')
        return []
    # Skip files that failed to parse (analyze_python_file returns {})
    return [data for file in python_files
            if (data := analyze_python_file(file, code_root_path))]
```
```python
def process_files_data(all_files_data: List[Dict], code_root: str) -> pd.DataFrame:
    """
    Process the extracted file data into a DataFrame and generate embeddings.
    """
    entries = []
    for file_data in all_files_data:
        filepath = file_data['filepath']
        content = file_data['content']
        # File-level
        entries.append({
            'filepath': filepath,
            'code': content,
            'type': 'file',
            'name': None,
            'embedding': get_embedding(content),
            'imports': file_data.get('imports', []),
            'comments': file_data.get('comments', []),
            'assignments': file_data.get('assignments', []),
            'top_level_code': file_data.get('top_level_code', [])
        })
        # Function-level
        for func in file_data.get('functions', []):
            entries.append({
                'filepath': filepath,
                'code': func['code'],
                'type': 'function',
                'name': func['name'],
                'embedding': get_embedding(func['code']),
                'imports': file_data.get('imports', []),
                'comments': file_data.get('comments', []),
                'assignments': file_data.get('assignments', []),
                'top_level_code': file_data.get('top_level_code', [])
            })
        # Class-level
        for cls in file_data.get('classes', []):
            entries.append({
                'filepath': filepath,
                'code': cls['code'],
                'type': 'class',
                'name': cls['name'],
                'embedding': get_embedding(cls['code']),
                'imports': file_data.get('imports', []),
                'comments': file_data.get('comments', []),
                'assignments': file_data.get('assignments', []),
                'top_level_code': file_data.get('top_level_code', [])
            })
    return pd.DataFrame(entries)
```
```python
def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Compute the cosine similarity between two vectors.
    """
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
```
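Cosine similarity ranges from -1 to 1, with 1 meaning the two vectors point in exactly the same direction. A few hand-checked cases make the behavior clear:

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = [1.0, 0.0]
print(cosine_similarity(a, [1.0, 0.0]))             # 1.0  (same direction)
print(cosine_similarity(a, [0.0, 1.0]))             # 0.0  (orthogonal)
print(round(cosine_similarity(a, [1.0, 1.0]), 4))   # 0.7071 (45 degrees apart)
```

Note that for OpenAI embeddings, which are normalized to unit length, this reduces to a plain dot product.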
```python
def search_functions(df, code_query, n=3):
    """
    Search for Python files in the DataFrame that are most similar to the code_query.

    Args:
        df (pd.DataFrame): DataFrame containing code data with embeddings.
        code_query (str): The code snippet or question to search for.
        n (int): Number of top similar files to return.

    Returns:
        List[Dict]: File paths and code content of the most similar Python files.
    """
    # Generate embedding for the query
    query_embedding = get_embedding(code_query)
    # Calculate similarities
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, query_embedding))
    # Sort by similarity and filter for file-level entries
    top_files = df[df['type'] == 'file'].sort_values(by='similarity', ascending=False).head(n)
    # Extract file paths and code content
    similar_files = top_files[['filepath', 'code']].to_dict(orient='records')
    return similar_files
```
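The ranking logic is easiest to follow with the embedding call stubbed out. In the sketch below, tiny hand-made vectors stand in for real OpenAI embeddings (the file names match the demo repo, but the vectors and the query embedding are invented for illustration); the rank-by-cosine-similarity step is the same as in `search_functions`:

```python
import numpy as np
import pandas as pd

def cosine_similarity(v1, v2):
    v1, v2 = np.array(v1), np.array(v2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hand-made stand-ins for embeddings: pretend axis 0 = "data loading",
# axis 1 = "layout/UI", axis 2 = "routing".
df = pd.DataFrame({
    'filepath': ['data.py', 'pages/page1.py', 'index.py'],
    'type': ['file', 'file', 'file'],
    'embedding': [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.1, 0.2, 0.9]],
})

# A "where is the data defined?"-style query, as an invented vector
query_embedding = [1.0, 0.0, 0.1]
df['similarity'] = df['embedding'].apply(lambda e: cosine_similarity(e, query_embedding))
top = df[df['type'] == 'file'].sort_values('similarity', ascending=False).head(2)
print(top['filepath'].tolist())  # ['data.py', 'pages/page1.py']
```

The file most aligned with the query's dominant axis ranks first, which is exactly how the real pipeline surfaces relevant files.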
```python
def process_question(df, question, tree_structure, n=5):
    """
    Process a user's question by searching the DataFrame and generating an answer.

    Parameters:
    - df: DataFrame containing the data to search.
    - question: The user's question.
    - tree_structure: The folder structure to consider.
    - n: Number of search results to retrieve (default is 5).

    Returns:
    - The assistant's response containing the code update and file details.
    """
    # Perform the semantic search on the DataFrame
    res = search_functions(df, question, n)
    messages = [
        {"role": "user", "content": f"You are a Python expert. Use the code {res} to answer: {question}. Also mention which files are used. Use this folder structure as well: {tree_structure}"}
    ]
    # Generate the response
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return completion.choices[0].message.content
```
Now use the code below to load the repository you want to query and explore using AI.
```python
# Path to your repository
path = "demo_sample_app/"

# Extract all functions from the repository
all_funcs = extract_repository_details(path)
df = process_files_data(all_funcs, path)

# Repo directory tree as a string
tree_structure = DisplayTree(dirPath=path, stringRep=True)
```
Now, with everything set up, you can use the code below to ask questions about your repo — and let the AI fetch the answers directly from your codebase.
```python
result = process_question(df, "", tree_structure)  # ask a question about your codebase
```
Here are some example questions I asked based on the repository.
```python
result = process_question(df, "What data is being used in the code?", tree_structure)
print(result)
```
Response to the question:
The code provided utilizes three different datasets, each defined in the `data.py` file within the `demo_sample_app` directory:
1. **Table Data** (`table_data`): A DataFrame consisting of a small dataset containing individuals' names, ages, and cities. The dataset is:
- Name: ["Alice", "Bob", "Charlie", "David"]
- Age: [25, 30, 35, 40]
- City: ["New York", "San Francisco", "Los Angeles", "Chicago"]
2. **Chart Data** (`chart_data`): A DataFrame designed for a bar chart with categories and their associated values:
- Category: ["A", "B", "C", "D"]
- Values: [10, 20, 15, 25]
```python
response = process_question(df, "explain the layout of page 1", tree_structure)
print(response)
```
Response to the question:
### Layout of Page 1 (`page1.py`)
#### Purpose:
Page 1 is designed to display a data table using the Dash DataTable component. It is registered as a page with the path "/page-1".
#### Code Explanation:
1. **Imports**:
- `html` and `dash_table`: These are components from the Dash framework to create HTML elements and tables.
- `register_page`: This function is used to register the page as part of the multi-page application, assigning it the path "/page-1".
- `table_data`: This is the data imported from a module `data`. The `table_data` is assumed to be a pandas DataFrame containing the data to be displayed on this page.
2. **Page Registration**:
- `register_page(__name__, path="/page-1")`: This line registers the current module (`page1.py`) as a page with the specified URL path "/page-1".
3. **Page Layout**:
- `layout = html.Div([...])`: This defines the overall layout of the page using a Dash HTML division (`Div`).
- `html.H2("Table Page", ...)`: A heading for the page is centered and styled with some margin at the bottom.
- `dash_table.DataTable(...)`: This component displays the data table.
- `data=table_data.to_dict("records")`: Converts the DataFrame data to a dictionary format suitable for the DataTable.
- `columns=[{"name": col, "id": col} for col in table_data.columns]`: Specifies the columns of the table.
- `page_size=5`: Limits the table display to 5 records per page.
- `style_table={"width": "60%", "margin": "auto"}`: Centers the table and sets its width.
- `style_cell={"textAlign": "center"}`: Aligns the text in table cells to the center.
```python
response = process_question(df, "add a new page named 'bubble chart' alongside the existing pages, and in it add a bubble chart with new dummy data", tree_structure)
print(response)
```
Response to the question:
To add a new page named 'Bubble Chart' to your existing multi-page Dash application, you will need to create a new Python file for this page (i.e., `page3.py`), update the `data.py` file with new dummy data for the bubble chart, and ensure that the new page is integrated with the existing application structure.
Here is a step-by-step guide on how to do this:
### 1. Update `data.py` to include Bubble Chart Data
Add new dummy data for the bubble chart in your `data.py` file. Here's how you can do it:
```python
import pandas as pd
# Dummy data for the table
table_data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]
})

# Dummy data for the bar chart
chart_data = pd.DataFrame({
    "Category": ["A", "B", "C", "D"],
    "Values": [10, 20, 15, 25]
})

# Dummy data for the bubble chart
bubble_chart_data = pd.DataFrame({
    "Category": ["X", "Y", "Z"],
    "Value1": [10, 40, 70],
    "Value2": [20, 50, 30],
    "Size": [100, 150, 200]
})
```
### 2. Create a New Page File for Bubble Chart: `pages/page3.py`
Create a new file named `page3.py` in the `pages` directory. This file will define the layout and content for the Bubble Chart page.
```python
from dash import dcc, html, register_page
import plotly.express as px
from data import bubble_chart_data  # Import bubble chart data

# Register the page
register_page(__name__, path="/page-3")

# Create Bubble Chart
fig = px.scatter(bubble_chart_data, x="Value1", y="Value2", size="Size",
                 color="Category", title="Sample Bubble Chart")

# Layout for Bubble Chart Page
layout = html.Div([
    html.H2("Bubble Chart Page", style={"textAlign": "center", "margin-bottom": "20px"}),
    dcc.Graph(figure=fig)
])
```
### 3. Update `index.py` to Add Navigation to the New Page
Ensure that the `index.py` file includes a link to the new Bubble Chart page in the navigation bar.
```python
from dash import dcc, html, page_container
import dash_bootstrap_components as dbc

# Navigation bar
navbar = dbc.NavbarSimple(
    children=[
        dbc.NavItem(dcc.Link("Table Page", href="/page-1", className="nav-link")),
        dbc.NavItem(dcc.Link("Chart Page", href="/page-2", className="nav-link")),
        dbc.NavItem(dcc.Link("Bubble Chart Page", href="/page-3", className="nav-link")),  # New link added
    ],
    brand="Multi-Page App",
    color="primary",
    dark=True,
)

# Main Layout (Includes Navbar & Page Loader)
layout = html.Div([
    navbar,
    dcc.Location(id="url", refresh=False),  # Handles URL changes
    page_container  # Automatically loads the correct page layout
])
```
Final Thoughts
As developers, we spend a huge chunk of time trying to understand code — whether it’s our own or someone else’s. With the power of GenAI and vector embeddings, we can now transform our codebases into something searchable, conversational, and intelligent. By giving AI full visibility into our repo, we no longer need to dig through files or trace functions manually. Just ask — and let your code answer.
This is just the beginning. Whether you’re debugging faster, onboarding smoother, or simply exploring smarter, tools like this are redefining how we interact with code.
Here are some example use cases:
Debugging: Paste a broken function and find similar implementations that work
Refactoring Help: Find duplicated logic across different files
Learning a New Codebase: Ask questions like “Where is the main API handler defined?” or “What’s the entry point for the app?”
If you found this helpful or have ideas to take it further, feel free to connect or drop a comment — I’d love to hear your thoughts!
All the code for giving GenAI access to your repo is available at the link below.
https://github.com/shubham7169
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.