Stop Searching Your Codebase Manually — Build Your Own GenAI to Do It for You
Last Updated on April 15, 2025 by Editorial Team
Author(s): Shubham Gupta
Originally published on Towards AI.
Modern codebases are growing more complex by the day. Developers often waste hours digging through files just to understand how a function works, where it’s used, or how components connect. Whether you’re onboarding to a new project or revisiting an old one, finding answers inside your own code can feel like searching for a needle in a haystack.
What if you could just ask questions like:
"What is this callback doing?"
"Where is the design element of the page?"
"What are all the data files being used in the code?"
…and get clear, AI-generated answers based on your entire codebase?
That’s exactly what I’ve built.

In this blog, I’ll show you how I’ve created a tool that gives a Generative AI model full access to your repository and allows you to interact with your code like a conversation. No more grepping through files. Just ask — and get instant, intelligent responses.
To demonstrate how to build and use your AI tool with a specific repository, let’s walk through an example using a basic Python Dash application. This will illustrate how your tool can access the repository and provide answers to questions about the codebase.
Example Repository: Python Dash Application
To follow along with the example, check out the Python Dash application repository. Here’s how the project is structured:
demo_sample_app/
├── app.py
├── data.py
├── index.py
└── pages/
    ├── page1.py
    └── page2.py
First, let's import all the required libraries:
```python
import ast
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd
from directory_tree import DisplayTree
from openai import OpenAI
```
To interact with OpenAI’s API, you’ll need to generate your own API key and paste it into the designated section of the code below.
```python
client = OpenAI(max_retries=5, api_key="")  # paste your API key here
```
The code below analyzes a Python repo by extracting functions, classes, imports, and metadata, then generates OpenAI embeddings to represent the code semantically. With these embeddings, you can search or ask questions about your codebase and get back the most relevant files or snippets based on semantic similarity.
```python
# Constants
NEWLINE = '\n'
DEF_PREFIXES = ('def ', 'async def ')
CLASS_PREFIX = 'class '
IMPORT_PATTERN = re.compile(r'^(?:from\s+\S+\s+import\s+\S+|import\s+\S+)')
COMMENT_PATTERN = re.compile(r'^\s*#')
DOCSTRING_PATTERN = re.compile(r'^\s*(\'\'\'|\"\"\")')
ASSIGNMENT_PATTERN = re.compile(r'^\s*\w+\s*=')
```
```python
def get_embedding(text: str, model: str = "text-embedding-3-small", **kwargs) -> list:
    # Ensure text is a string
    if not isinstance(text, str):
        raise ValueError("Input text must be a string.")
    # Replace newlines, which can negatively affect embedding quality
    text = text.replace("\n", " ")
    # Create the embedding
    response = client.embeddings.create(input=text, model=model, **kwargs)
    return response.data[0].embedding
```
```python
def extract_function_name(line: str) -> str:
    """
    Extract the function name from a line starting with 'def' or 'async def'.
    """
    for prefix in DEF_PREFIXES:
        if line.startswith(prefix):
            return line[len(prefix):line.index('(')].strip()
    return ""


def extract_class_name(line: str) -> str:
    """
    Extract the class name from a line starting with 'class'.
    """
    if line.startswith(CLASS_PREFIX):
        # Stop at '(' if the class has bases; otherwise trim the trailing ':'
        end = line.index('(') if '(' in line else len(line)
        return line[len(CLASS_PREFIX):end].strip().rstrip(':')
    return ""
```
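These two helpers are pure string manipulation, so they are easy to sanity-check in isolation. Here is a self-contained sketch (reproducing the constants above; the input lines are made up for illustration):

```python
DEF_PREFIXES = ('def ', 'async def ')
CLASS_PREFIX = 'class '

def extract_function_name(line: str) -> str:
    """Extract the function name from a 'def' or 'async def' line."""
    for prefix in DEF_PREFIXES:
        if line.startswith(prefix):
            return line[len(prefix):line.index('(')].strip()
    return ""

def extract_class_name(line: str) -> str:
    """Extract the class name from a 'class' line, with or without bases."""
    if line.startswith(CLASS_PREFIX):
        end = line.index('(') if '(' in line else len(line)
        return line[len(CLASS_PREFIX):end].strip().rstrip(':')
    return ""

print(extract_function_name('def load_data(path):'))              # load_data
print(extract_function_name('async def fetch(url):'))             # fetch
print(extract_class_name('class CodeVisitor(ast.NodeVisitor):'))  # CodeVisitor
print(extract_class_name('class Config:'))                        # Config
```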
```python
def extract_code_block(lines: List[str], start_index: int) -> str:
    """
    Extract a block of code (function or class) starting from the given index.
    """
    block = [lines[start_index]]
    indent_level = len(lines[start_index]) - len(lines[start_index].lstrip())
    for line in lines[start_index + 1:]:
        current_indent = len(line) - len(line.lstrip())
        if current_indent > indent_level or not line.strip():
            block.append(line)
        else:
            break
    return NEWLINE.join(block)
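To make the indentation logic concrete, here is the same helper exercised on a made-up snippet: it collects every line more indented than the starting line (plus blank lines) and stops at the first line that returns to the original indent level:

```python
NEWLINE = '\n'

def extract_code_block(lines, start_index):
    # Keep lines that are more indented than the starting line (or blank)
    block = [lines[start_index]]
    indent_level = len(lines[start_index]) - len(lines[start_index].lstrip())
    for line in lines[start_index + 1:]:
        current_indent = len(line) - len(line.lstrip())
        if current_indent > indent_level or not line.strip():
            block.append(line)
        else:
            break
    return NEWLINE.join(block)

# Hypothetical source, as a list of lines
source = [
    "def greet(name):",
    "    msg = f'Hello, {name}!'",
    "    return msg",
    "",
    "x = greet('World')",
]
block_text = extract_code_block(source, 0)
print(block_text)  # the function body only; 'x = greet(...)' is excluded
```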
```python
def get_file_metadata(filepath: Path) -> Dict:
    """
    Retrieve metadata for the given file.
    """
    stats = filepath.stat()
    return {
        'file_size': stats.st_size,
        'creation_time': datetime.fromtimestamp(stats.st_ctime),
        'modification_time': datetime.fromtimestamp(stats.st_mtime),
        'permissions': oct(stats.st_mode)[-3:]
    }
```
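A quick way to see the shape of this metadata is to run the helper against a throwaway file (a sketch using a temporary directory; the filename and contents are arbitrary):

```python
import tempfile
from datetime import datetime
from pathlib import Path

def get_file_metadata(filepath: Path) -> dict:
    stats = filepath.stat()
    return {
        'file_size': stats.st_size,
        'creation_time': datetime.fromtimestamp(stats.st_ctime),
        'modification_time': datetime.fromtimestamp(stats.st_mtime),
        'permissions': oct(stats.st_mode)[-3:],
    }

with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "demo.py"
    f.write_text("print('hi')\n")  # 12 bytes
    meta = get_file_metadata(f)
    print(meta['file_size'], meta['permissions'])
```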
```python
def analyze_python_file(filepath: Path, code_root: Path) -> Dict:
    """
    Analyze a Python file to extract its structural elements and metadata.
    """
    with open(filepath, 'r', encoding='utf-8', errors='replace') as file:
        content = file.read()

    # Initialize containers for the different code elements
    functions, classes, imports, assignments, top_level_code = [], [], [], [], []

    # Parse the AST of the file content
    try:
        tree = ast.parse(content, filename=str(filepath))
    except SyntaxError:
        print(f"Syntax error in file: {filepath}")
        return {}

    # Visitor class to traverse the AST
    class CodeVisitor(ast.NodeVisitor):
        def __init__(self):
            self.current_class = None

        def visit_Import(self, node):
            imports.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
            imports.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_FunctionDef(self, node):
            func_info = {
                'name': node.name,
                'code': ast.unparse(node).strip()
            }
            if self.current_class:
                func_info['class'] = self.current_class
            functions.append(func_info)
            self.generic_visit(node)

        # Treat async functions the same way as regular ones
        visit_AsyncFunctionDef = visit_FunctionDef

        def visit_ClassDef(self, node):
            classes.append({
                'name': node.name,
                'code': ast.unparse(node).strip()
            })
            # Track the enclosing class while traversing its methods
            self.current_class = node.name
            self.generic_visit(node)
            self.current_class = None

        def visit_Assign(self, node):
            assignments.append(ast.unparse(node).strip())
            self.generic_visit(node)

        def visit_Expr(self, node):
            # Skip bare string expressions (docstrings); keep other top-level code
            if not (isinstance(node.value, ast.Constant) and isinstance(node.value.value, str)):
                top_level_code.append(ast.unparse(node).strip())
            self.generic_visit(node)

    visitor = CodeVisitor()
    visitor.visit(tree)

    return {
        'filepath': filepath.relative_to(code_root),
        'content': content,
        'imports': imports,
        'functions': functions,
        'classes': classes,
        'assignments': assignments,
        'top_level_code': top_level_code,
        **get_file_metadata(filepath)  # reuse the metadata helper defined above
    }
```
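The visitor pattern is the heart of this function, so it is worth seeing in isolation. The stripped-down sketch below runs the same kind of `ast.NodeVisitor` over a small in-memory source string (a made-up module) instead of a file, collecting only names for brevity:

```python
import ast

# Hypothetical module to analyze
source = '''
import pandas as pd
from dash import html

TITLE = "Demo"

def make_layout():
    return html.Div(TITLE)

class Helper:
    def run(self):
        return 42
'''

imports, functions, classes, assignments = [], [], [], []

class CodeVisitor(ast.NodeVisitor):
    def visit_Import(self, node):
        imports.append(ast.unparse(node).strip())

    def visit_ImportFrom(self, node):
        imports.append(ast.unparse(node).strip())

    def visit_FunctionDef(self, node):
        functions.append(node.name)
        self.generic_visit(node)  # descend into nested defs

    def visit_ClassDef(self, node):
        classes.append(node.name)
        self.generic_visit(node)  # descend so methods are also recorded

    def visit_Assign(self, node):
        assignments.append(ast.unparse(node).strip())

CodeVisitor().visit(ast.parse(source))
print(imports)      # ['import pandas as pd', 'from dash import html']
print(functions)    # ['make_layout', 'run']
print(classes)      # ['Helper']
```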
```python
def extract_repository_details(code_root: str) -> List[Dict]:
    """
    Extract details from all Python files in the specified code repository.
    """
    code_root_path = Path(code_root).resolve()
    python_files = list(code_root_path.rglob('*.py'))
    if not python_files:
        print('No Python files found in the specified directory.')
        return []
    # Skip files that failed to parse (analyze_python_file returns {})
    return [data for file in python_files
            if (data := analyze_python_file(file, code_root_path))]
```
```python
def process_files_data(all_files_data: List[Dict], code_root: str) -> pd.DataFrame:
    """
    Process the extracted file data into a DataFrame and generate embeddings.
    """
    entries = []
    for file_data in all_files_data:
        filepath = file_data['filepath']
        content = file_data['content']
        # File-level
        entries.append({
            'filepath': filepath,
            'code': content,
            'type': 'file',
            'name': None,
            'embedding': get_embedding(content),
            'imports': file_data.get('imports', []),
            'comments': file_data.get('comments', []),
            'assignments': file_data.get('assignments', []),
            'top_level_code': file_data.get('top_level_code', [])
        })
        # Function-level
        for func in file_data.get('functions', []):
            entries.append({
                'filepath': filepath,
                'code': func['code'],
                'type': 'function',
                'name': func['name'],
                'embedding': get_embedding(func['code']),
                'imports': file_data.get('imports', []),
                'comments': file_data.get('comments', []),
                'assignments': file_data.get('assignments', []),
                'top_level_code': file_data.get('top_level_code', [])
            })
        # Class-level
        for cls in file_data.get('classes', []):
            entries.append({
                'filepath': filepath,
                'code': cls['code'],
                'type': 'class',
                'name': cls['name'],
                'embedding': get_embedding(cls['code']),
                'imports': file_data.get('imports', []),
                'comments': file_data.get('comments', []),
                'assignments': file_data.get('assignments', []),
                'top_level_code': file_data.get('top_level_code', [])
            })
    return pd.DataFrame(entries)
```
```python
def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Compute the cosine similarity between two vectors.
    """
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
```
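Cosine similarity ranges from -1 to 1, with 1 meaning the two vectors point in exactly the same direction. A few hand-checked cases make the behavior clear:

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = [1.0, 0.0]
print(cosine_similarity(a, [1.0, 0.0]))             # 1.0  (same direction)
print(cosine_similarity(a, [0.0, 1.0]))             # 0.0  (orthogonal)
print(round(cosine_similarity(a, [1.0, 1.0]), 4))   # 0.7071 (45 degrees apart)
```

Note that for OpenAI embeddings, which are normalized to unit length, this reduces to a plain dot product.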
```python
def search_functions(df, code_query, n=3):
    """
    Search for Python files in the DataFrame that are most similar to the code_query.

    Args:
        df (pd.DataFrame): DataFrame containing code data with embeddings.
        code_query (str): The code snippet or question to search for.
        n (int): Number of top similar files to return.

    Returns:
        List[Dict]: File paths and code content of the most similar Python files.
    """
    # Generate embedding for the query
    query_embedding = get_embedding(code_query)
    # Calculate similarities
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, query_embedding))
    # Sort by similarity and filter for file-level entries
    top_files = df[df['type'] == 'file'].sort_values(by='similarity', ascending=False).head(n)
    # Extract file paths and code content
    similar_files = top_files[['filepath', 'code']].to_dict(orient='records')
    return similar_files
```
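The ranking logic is easiest to follow with the embedding call stubbed out. In the sketch below, tiny hand-made vectors stand in for real OpenAI embeddings (the file names match the demo repo, but the vectors and the query embedding are invented for illustration); the rank-by-cosine-similarity step is the same as in `search_functions`:

```python
import numpy as np
import pandas as pd

def cosine_similarity(v1, v2):
    v1, v2 = np.array(v1), np.array(v2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hand-made stand-ins for embeddings: pretend axis 0 = "data loading",
# axis 1 = "layout/UI", axis 2 = "routing".
df = pd.DataFrame({
    'filepath': ['data.py', 'pages/page1.py', 'index.py'],
    'type': ['file', 'file', 'file'],
    'embedding': [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.1, 0.2, 0.9]],
})

# A "where is the data defined?"-style query, as an invented vector
query_embedding = [1.0, 0.0, 0.1]
df['similarity'] = df['embedding'].apply(lambda e: cosine_similarity(e, query_embedding))
top = df[df['type'] == 'file'].sort_values('similarity', ascending=False).head(2)
print(top['filepath'].tolist())  # ['data.py', 'pages/page1.py']
```

The file most aligned with the query's dominant axis ranks first, which is exactly how the real pipeline surfaces relevant files.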
```python
def process_question(df, question, tree_structure, n=5):
    """
    Process a user's question by searching the DataFrame and generating an answer.

    Parameters:
    - df: DataFrame containing the data to search.
    - question: The user's question.
    - tree_structure: The folder structure to consider.
    - n: Number of search results to retrieve (default is 5).

    Returns:
    - The assistant's response containing the code update and file details.
    """
    # Perform the semantic search on the DataFrame
    res = search_functions(df, question, n)
    messages = [
        {"role": "user", "content": f"You are a Python expert. Use the code {res} to answer: {question}. Also mention which files are used. Use this folder structure as well: {tree_structure}"}
    ]
    # Generate the response
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return completion.choices[0].message.content
```
Now use the code below to load the repository you want to query and explore using AI.
```python
# Path to your repository
path = "demo_sample_app/"

# Extract all functions from the repository
all_funcs = extract_repository_details(path)
df = process_files_data(all_funcs, path)

# Repo directory tree as a string
tree_structure = DisplayTree(dirPath=path, stringRep=True)
```
Now, with everything set up, you can use the code below to ask questions about your repo — and let the AI fetch the answers directly from your codebase.
```python
result = process_question(df, "", tree_structure)  # ask a question about your codebase
```
Here are some example questions I asked based on the repository.
```python
result = process_question(df, "What data is being used in the code?", tree_structure)
print(result)
```
Response to the question:
The code provided utilizes three different datasets, each defined in the `data.py` file within the `demo_sample_app` directory:
1. **Table Data** (`table_data`): A DataFrame consisting of a small dataset containing individuals' names, ages, and cities. The dataset is:
- Name: ["Alice", "Bob", "Charlie", "David"]
- Age: [25, 30, 35, 40]
- City: ["New York", "San Francisco", "Los Angeles", "Chicago"]
2. **Chart Data** (`chart_data`): A DataFrame designed for a bar chart with categories and their associated values:
- Category: ["A", "B", "C", "D"]
- Values: [10, 20, 15, 25]
```python
response = process_question(df, "explain the layout of page 1", tree_structure)
print(response)
```
Response to the question:
### Layout of Page 1 (`page1.py`)
#### Purpose:
Page 1 is designed to display a data table using the Dash DataTable component. It is registered as a page with the path "/page-1".
#### Code Explanation:
1. **Imports**:
- `html` and `dash_table`: These are components from the Dash framework to create HTML elements and tables.
- `register_page`: This function is used to register the page as part of the multi-page application, assigning it the path "/page-1".
- `table_data`: This is the data imported from a module `data`. The `table_data` is assumed to be a pandas DataFrame containing the data to be displayed on this page.
2. **Page Registration**:
- `register_page(__name__, path="/page-1")`: This line registers the current module (`page1.py`) as a page with the specified URL path "/page-1".
3. **Page Layout**:
- `layout = html.Div([...])`: This defines the overall layout of the page using a Dash HTML division (`Div`).
- `html.H2("Table Page", ...)`: A heading for the page is centered and styled with some margin at the bottom.
- `dash_table.DataTable(...)`: This component displays the data table.
- `data=table_data.to_dict("records")`: Converts the DataFrame data to a dictionary format suitable for the DataTable.
- `columns=[{"name": col, "id": col} for col in table_data.columns]`: Specifies the columns of the table.
- `page_size=5`: Limits the table display to 5 records per page.
- `style_table={"width": "60%", "margin": "auto"}`: Centers the table and sets its width.
- `style_cell={"textAlign": "center"}`: Aligns the text in table cells to the center.
```python
response = process_question(df, "add a new page named 'bubble chart' alongside the existing pages, and in it add a bubble chart with new dummy data", tree_structure)
print(response)
```
Response to the question:
To add a new page named 'Bubble Chart' to your existing multi-page Dash application, you will need to create a new Python file for this page (i.e., `page3.py`), update the `data.py` file with new dummy data for the bubble chart, and ensure that the new page is integrated with the existing application structure.
Here is a step-by-step guide on how to do this:
### 1. Update `data.py` to include Bubble Chart Data
Add new dummy data for the bubble chart in your `data.py` file. Here's how you can do it:
```python
import pandas as pd
# Dummy data for the table
table_data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]
})

# Dummy data for the bar chart
chart_data = pd.DataFrame({
    "Category": ["A", "B", "C", "D"],
    "Values": [10, 20, 15, 25]
})

# Dummy data for the bubble chart
bubble_chart_data = pd.DataFrame({
    "Category": ["X", "Y", "Z"],
    "Value1": [10, 40, 70],
    "Value2": [20, 50, 30],
    "Size": [100, 150, 200]
})
```
### 2. Create a New Page File for Bubble Chart: `pages/page3.py`
Create a new file named `page3.py` in the `pages` directory. This file will define the layout and content for the Bubble Chart page.
```python
from dash import dcc, html, register_page
import plotly.express as px
from data import bubble_chart_data  # Import bubble chart data

# Register the page
register_page(__name__, path="/page-3")

# Create Bubble Chart
fig = px.scatter(bubble_chart_data, x="Value1", y="Value2", size="Size",
                 color="Category", title="Sample Bubble Chart")

# Layout for Bubble Chart Page
layout = html.Div([
    html.H2("Bubble Chart Page", style={"textAlign": "center", "margin-bottom": "20px"}),
    dcc.Graph(figure=fig)
])
```
### 3. Update `index.py` to Add Navigation to the New Page
Ensure that the `index.py` file includes a link to the new Bubble Chart page in the navigation bar.
```python
from dash import dcc, html, page_container
import dash_bootstrap_components as dbc

# Navigation bar
navbar = dbc.NavbarSimple(
    children=[
        dbc.NavItem(dcc.Link("Table Page", href="/page-1", className="nav-link")),
        dbc.NavItem(dcc.Link("Chart Page", href="/page-2", className="nav-link")),
        dbc.NavItem(dcc.Link("Bubble Chart Page", href="/page-3", className="nav-link")),  # New link added
    ],
    brand="Multi-Page App",
    color="primary",
    dark=True,
)

# Main Layout (Includes Navbar & Page Loader)
layout = html.Div([
    navbar,
    dcc.Location(id="url", refresh=False),  # Handles URL changes
    page_container  # Automatically loads the correct page layout
])
```
Final Thoughts
As developers, we spend a huge chunk of time trying to understand code — whether it’s our own or someone else’s. With the power of GenAI and vector embeddings, we can now transform our codebases into something searchable, conversational, and intelligent. By giving AI full visibility into our repo, we no longer need to dig through files or trace functions manually. Just ask — and let your code answer.
This is just the beginning. Whether you’re debugging faster, onboarding smoother, or simply exploring smarter, tools like this are redefining how we interact with code.
Here are some example use cases:
Debugging: Paste a broken function and find similar implementations that work
Refactoring Help: Find duplicated logic across different files
Learning a New Codebase: Ask questions like “Where is the main API handler defined?” or “What’s the entry point for the app?”
If you found this helpful or have ideas to take it further, feel free to connect or drop a comment — I’d love to hear your thoughts!
All the code for giving GenAI access to your repo is available at the link below.
https://github.com/shubham7169
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.