
From First Principles: Building Function Calling by Fine-tuning NanoGPT
Last Updated on April 14, 2025 by Editorial Team
Author(s): Suyash Harlalka
Originally published on Towards AI.

Turning AI Assistants into Action-Takers
Imagine talking to a digital assistant that doesn’t just understand your request, but can actually do something about it. You say, “Book me a flight to San Francisco,” and instead of just writing a response, the AI actually starts the booking process. This isn’t science fiction — it’s function calling, and it’s changing how we interact with AI.
Most people see these intelligent systems as black boxes, magically responding to commands. But what if you could peek behind the curtain and understand exactly how they work? What if you could build one yourself?
What You’ll Discover in This Guide
In this hands-on tutorial, you’ll learn:
- How to build function calling from scratch — no black boxes, no magic, just pure PyTorch and Tiktoken
- The secret behind efficient function calling — teaching models to generate structured outputs without bloating prompts
- A streamlined fine-tuning approach that works even for smaller language models
- Practical techniques for training loop optimization including custom loss masking and strategic data preparation
Whether you’re an ML engineer looking to customize LLM capabilities, a researcher exploring model fine-tuning, or a developer wanting to add function calling to your applications, this guide provides the complete blueprint for implementation with minimal dependencies.
Our journey in this blogpost is about demystifying this technology. We’re not just going to explain function calling — we’re going to show you how to build it from scratch, line by line. No shortcuts, almost no pre-built libraries, just pure computational thinking.
What makes this approach truly distinctive is that we’re training the model to generate structured function calls without embedding function definitions in each prompt. Unlike conventional implementations that waste valuable tokens describing available functions, this fine-tuning approach bakes this knowledge directly into the model weights, resulting in more efficient inference and better performance, especially on smaller models.
In this blog post, we’ll walk through implementing function calling capabilities by fine-tuning a NanoGPT-like model using a pure, from-scratch approach. What sets this implementation apart is the deliberate choice to build everything using only PyTorch and tiktoken — completely eschewing high-level libraries like HuggingFace. This approach offers several critical advantages that go beyond typical machine learning implementations:
1. Complete Architectural Control: By building from the ground up, we gain unprecedented flexibility to modify and optimize every component of the model without being constrained by external library APIs.
2. Deep Technical Understanding: The process of implementing each component manually forces a profound, granular understanding of the model’s inner workings, revealing insights that abstracted libraries often obscure.
3. Performance and Efficiency: With full control over the implementation, we can meticulously optimize the code for our specific use case, eliminating unnecessary overhead and tailoring the architecture precisely to our requirements.
4. Minimal Dependency Footprint: This approach results in a lean, dependency-light codebase that’s easier to maintain, deploy, and understand.
The complete code for this implementation is available on GitHub: https://github.com/suyashh94/finetune-function-calling-from-scratch
We’ll dive deep into the code, explaining each component and demonstrating how a relatively simple architecture can be extended to support sophisticated structured function calling. From implementing custom attention mechanisms to developing a flexible function calling approach, this post offers an under-the-hood look at building an intelligent language model system from first principles.
Understanding Supervised Fine-Tuning and Function Calling
Let’s start with the fundamentals. Supervised Fine-Tuning (SFT) is a technique where a pre-trained language model is further trained on specific examples to guide its behaviour towards a particular style or capability. The model learns to map input prompts to desired output responses through this process.
Function calling, despite its seemingly complex nature, is fundamentally just a specialized form of supervised fine-tuning. The key insight here is that we’re teaching the model to produce outputs in a specific structured format rather than free-form text. Instead of:
User: What’s the weather like in San Francisco?
Assistant: It's currently 65°F and sunny in San Francisco.
We want:
User: What's the weather like in San Francisco?
Assistant: <functioncall> {"name": "get_weather", "arguments": "{'location': 'San Francisco'}"} </functioncall>
Here, get_weather is a predefined function in our system that accepts a location parameter and returns weather information.
The crucial difference in this function calling approach lies in avoiding embedded function descriptions within the context window. Traditional implementations often bloat prompts with extensive function metadata, consuming valuable token space and potentially overwhelming smaller language models. By removing these verbose descriptions, we force the model to learn function capabilities more dynamically, relying on context and training examples rather than explicit instructions. This approach not only reduces prompt overhead but also enhances model flexibility, allowing for more intelligent and adaptive function understanding. The result is a more computationally efficient method that scales across different model architectures, prioritizing lean, meaningful interactions over exhaustive technical documentation.
It’s crucial to clarify that “function calling” is a misnomer — the model doesn’t actually execute functions, but rather generates structured function invocation instructions using specialized tokens. The actual function execution is the responsibility of the underlying system, which parses the model’s output and routes the appropriate function call to the correct implementation. This approach leverages the model’s ability to understand context and generate precise, actionable function representations without the overhead of maintaining extensive function descriptions in the prompt. By using carefully designed tokens and training strategies, we enable the model to generate function calls that are both contextually relevant and structurally consistent, while leaving the complex logic of actual function execution to the system’s implementation layer.
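To make that implementation layer tangible, here is a minimal dispatcher sketch, assuming the output format shown above; the function registry and its single get_weather entry are purely hypothetical stand-ins, not part of the repo.

import json
import re

# Hypothetical registry mapping function names the model can emit to real implementations
FUNCTION_REGISTRY = {
    "get_weather": lambda location: f"It's currently 65°F and sunny in {location}.",
}

def dispatch_function_call(model_output: str):
    """Parse a <functioncall> block emitted by the model and route it to the matching function."""
    match = re.search(r"<functioncall>\s*(\{.*?\})\s*</functioncall>", model_output, re.DOTALL)
    if match is None:
        return None  # the model answered in free-form text instead of calling a function
    call = json.loads(match.group(1))
    # In the training data the arguments are a string holding a Python-style dict,
    # so we naively swap quote characters before parsing; a real system would be more careful
    args = json.loads(call["arguments"].replace("'", '"'))
    return FUNCTION_REGISTRY[call["name"]](**args)

print(dispatch_function_call(
    '<functioncall> {"name": "get_weather", "arguments": "{\'location\': \'San Francisco\'}"} </functioncall>'
))

The important point is that all of this logic lives outside the model: the model only produces the structured text, and the surrounding system decides what to do with it.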
In most function calling systems — like those from OpenAI, Anthropic, or Mistral — the model doesn’t know which functions it can use ahead of time. So every time you prompt the model, you have to tell it what functions are available, what they’re called, what parameters they take, and how to use them.
Let’s say you have 5 functions:
get_weather(location)
set_temperature(temperature, unit)
adjust_fan_speed(speed)
play_music(song_name)
turn_on_lights(location)
🧠 Traditional (In-Context Function Calling)
You’d need to include something like this in every prompt:
{
  "functions": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    },
    {
      "name": "set_temperature",
      "description": "Set the car temperature",
      "parameters": {
        "type": "object",
        "properties": {
          "temperature": { "type": "number" },
          "unit": { "type": "string" }
        },
        "required": ["temperature", "unit"]
      }
    },
    {
      "name": "adjust_fan_speed",
      "description": "Adjust fan speed",
      "parameters": {
        "type": "object",
        "properties": {
          "speed": { "type": "string", "enum": ["low", "medium", "high"] }
        },
        "required": ["speed"]
      }
    },
    {
      "name": "play_music",
      "description": "Play a specific song",
      "parameters": {
        "type": "object",
        "properties": {
          "song_name": { "type": "string" }
        },
        "required": ["song_name"]
      }
    },
    {
      "name": "turn_on_lights",
      "description": "Turn on lights in a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ],
  "user": "Can you turn on the lights in the kitchen?"
}
The model now has to read all this, understand all functions, match your request, and then format a correct call.
🧠 Fine-Tuned (No In-Context Definitions)
In our case, the same prompt looks like this:
{
  "user": "Can you turn on the lights in the kitchen?"
}
And the model simply replies:
<functioncall>{"name": "turn_on_lights", "arguments": "{'location': 'kitchen'}"}</functioncall>
That’s it.
There’s no function list, no schema, no metadata, no JSON bloat. By baking function definitions directly into the model’s weights, we turn function calling into a generation task, not a schema-parsing task. The model isn’t “choosing from a list” — it’s generating the function call as if it were a language translation: from natural language to structured command.
This eliminates the constant overhead of including long JSON schemas, makes inference faster and leaner, and drastically improves usability for smaller or embedded models.
Dataset Requirements for Function Calling
Before diving into the implementation, it’s worth briefly discussing dataset requirements. For function calling to work effectively, your training data needs to cover the parameter space of your functions adequately.
Consider this example from our dataset:
{
  "system": "<|im_start|>system\nYou are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.\n<|im_end|>\n",
  "user": "<|im_start|>user\nSet the temperature to 22 degrees Celsius.<|im_end|>\n",
  "assistant": "<|im_start|>assistant\n<functioncall> {\"name\": \"set_temperature\", \"arguments\": \"{'temperature': 22, 'unit': 'celsius'}\"} <|im_end|><|endoftext|>"
}
Now compare it with another example:
{
  "system": "<|im_start|>system\nYou are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.\n<|im_end|>\n",
  "user": "<|im_start|>user\nCan you set it to 65 degrees Fahrenheit?<|im_end|>\n",
  "assistant": "<|im_start|>assistant\n<functioncall> {\"name\": \"set_temperature\", \"arguments\": \"{'temperature': 65, 'unit': 'fahrenheit'}\"} <|im_end|><|endoftext|>"
}
Note how we’ve included examples with different temperature values and unit systems. For functions with categorical parameters (like “high”, “medium”, “low”), you’d want examples covering each possible value:
{
  "system": "<|im_start|>system\nYou are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.\n<|im_end|>\n",
  "user": "<|im_start|>user\nSet the fan speed to high.<|im_end|>\n",
  "assistant": "<|im_start|>assistant\n<functioncall> {\"name\": \"set_fan_speed\", \"arguments\": \"{'speed': 'high'}\"} <|im_end|><|endoftext|>"
}
While comprehensive data curation strategies are beyond this blog’s scope, remember that the model can only learn patterns present in your training data. Ensure your dataset covers different phrasings, parameter combinations, and edge cases.
The cornerstone of effective function calling lies in the richness and diversity of your training data. This approach demands exhaustive coverage of linguistic variations and parameter combinations. For each function, we meticulously generate training examples that explore every possible permutation of parameters, ensuring the model can handle a wide range of user interactions. This means creating multiple training instances for each function that showcase:
- Linguistic Diversity
Capturing the same function call through dozens of different phrasings. For the "adjust_temperature" function, this might include variations like:
– "Make it warmer"
– "I'm cold, can you adjust the heat?"
– "Increase the temperature in the front seats"
- Comprehensive Parameter Exploration
We systematically generate function call examples that cover:
– All possible parameter combinations
– Partial parameter specifications
– Different units of measurement (Celsius vs. Fahrenheit)
– Area-specific variations (front seats, driver’s side, entire car)
– Relative vs. absolute temperature changes
- Interaction Complexity
Training data must include scenarios that demonstrate:
– Implicit parameter inference
– Contextual understanding
– Handling of ambiguous or incomplete requests
The goal is to create a dataset so comprehensive that the model can seamlessly translate virtually any user request into the correct function call, regardless of how it’s phrased or how incomplete the initial request might be. This approach transforms function calling from a rigid, keyword-matching exercise into a flexible, context-aware capability.
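As a toy illustration of how such parameter permutations and phrasings can be enumerated programmatically, consider the sketch below. The templates, parameter grid, and set_fan_speed schema are illustrative stand-ins, not the generator used for the repo’s dataset.

import itertools
import json

# Illustrative parameter grid and phrasing templates (hypothetical, not the repo's generator)
SPEEDS = ["low", "medium", "high"]
AREAS = ["driver", "front-passenger", "rear-left", "rear-right"]
TEMPLATES = [
    "Set the fan to {speed} for the {area}.",
    "Can you make the {area} fan {speed}?",
    "I'd like {speed} airflow on the {area} side.",
]

SYSTEM = (
    "<|im_start|>system\nYou are a helpful assistant. You have to either provide a way "
    "to answer user's request or answer user's query.\n<|im_end|>\n"
)

examples = []
for speed, area, template in itertools.product(SPEEDS, AREAS, TEMPLATES):
    user = template.format(speed=speed, area=area.replace("-", " "))
    call = {"name": "set_fan_speed", "arguments": str({"speed": speed, "area": [area]})}
    examples.append({
        "system": SYSTEM,
        "user": f"<|im_start|>user\n{user}<|im_end|>\n",
        "assistant": f"<|im_start|>assistant\n<functioncall> {json.dumps(call)} <|im_end|><|endoftext|>",
    })

print(len(examples), "examples generated; first user turn:", examples[0]["user"])

Even this tiny grid yields dozens of distinct examples; a real dataset layers many more templates, partial specifications, and multi-area combinations on top.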
The Modified NanoGPT Architecture
This implementation builds upon NanoGPT, Andrej Karpathy’s minimalist GPT implementation. However, we’ve made several key modifications to support function calling capabilities.
The first significant change is in the tokenizer. We’ve extended the base GPT-2 tokenizer with special tokens that help demarcate different parts of the input and output:
import tiktoken


class Encoder:
    def __init__(self):
        gpt2_base = tiktoken.get_encoding("gpt2")
        # In production, load the arguments directly instead of accessing private attributes
        # See openai_public.py for examples of arguments for specific encodings
        enc = tiktoken.Encoding(
            # If you're changing the set of special tokens, make sure to use a different name
            # It should be clear from the name what behaviour to expect.
            name="gpt_instruct",
            pat_str=gpt2_base._pat_str,
            mergeable_ranks=gpt2_base._mergeable_ranks,
            special_tokens={
                **gpt2_base._special_tokens,
                "<|pad_token|>": 50257,
                "<|eop_token|>": 50258,
            },
        )
        enc.pad_token = 50257
        enc.eop_token = 50258
        self.encoder = enc


if __name__ == "__main__":
    encoder = Encoder()
    instruction = "Write a summary of the given text."
    print("Instruction:", instruction)
    encoded_instr = encoder.encoder.encode_ordinary(instruction)
    print("Encoded instruction:", encoded_instr)
    decoded_instr = encoder.encoder.decode(encoded_instr)
    print("Decoded instruction:", decoded_instr)
This encoder adds two special tokens beyond the base GPT-2 vocabulary:
· <|pad_token|> (token ID 50257): Used for padding sequences to a consistent length
· <|eop_token|> (token ID 50258): End-of-prompt token that marks the boundary between the prompt and the response
Note: In an ideal scenario, we would also add <functioncall> and </functioncall> as special tokens in our encoder, since they are fundamental to our function calling implementation. This would make the model recognize these tokens as distinct entities rather than multiple regular tokens, improving efficiency and precision in both training and inference. Adding them as special tokens would reduce the token count needed to represent function calls and potentially improve the model’s understanding of the function calling structure.
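If you do want to experiment with that, the encoder can be extended along these lines; the two extra token IDs below are arbitrary choices just past the existing ones, and the model’s embedding table would need to accommodate them.

import tiktoken

gpt2_base = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt_instruct_fc",
    pat_str=gpt2_base._pat_str,
    mergeable_ranks=gpt2_base._mergeable_ranks,
    special_tokens={
        **gpt2_base._special_tokens,
        "<|pad_token|>": 50257,
        "<|eop_token|>": 50258,
        # Hypothetical additions: treat the call delimiters as single tokens
        "<functioncall>": 50259,
        "</functioncall>": 50260,
    },
)

# With the delimiters registered as special tokens, each maps to one ID instead of several sub-words
ids = enc.encode("<functioncall>", allowed_special={"<functioncall>", "</functioncall>"})
print(ids)  # [50259]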
The base model architecture remains largely unchanged from NanoGPT, with a standard transformer architecture including self-attention blocks and an MLP.
Dataset Processing for Function Calling
The heart of this implementation lies in how we process the dataset. Let’s examine the key function:
def process_dataset(dataset, enc, input_len=1024):
    data = json.loads(dataset["text"])
    system_prompt = "system: " + data["system"] + "\n"
    user_prompt = "user: " + data["user"] + "\n"
    response = "assistant: " + data["assistant"]

    prompt = system_prompt + user_prompt
    prompt_ids = enc.encode_ordinary(prompt)
    prompt_id_len = len(prompt_ids)
    prompt_ids.append(enc.eop_token)

    response_ids = enc.encode_ordinary(response)
    response_ids.append(enc.eot_token)

    prompt_ids = prompt_ids + response_ids
    prompt_response_len = len(prompt_ids)
    prompt_ids = prompt_ids + [enc.pad_token] * (input_len - len(prompt_ids) + 1)
    prompt_ids = np.array(prompt_ids, dtype=np.uint16)

    prompt_mask = np.array([1] * prompt_id_len + [0] * (input_len - prompt_id_len))
    prompt_mask = np.array(prompt_mask, dtype=np.uint8)

    pad_mask = np.array([0] * input_len)
    pad_mask[prompt_response_len - 1 :] = 1
    pad_mask = np.array(pad_mask, dtype=np.uint8)

    out = {
        "output_ids": prompt_ids,
        "length": prompt_response_len,
        "prompt_mask": prompt_mask,
        "pad_mask": pad_mask,
    }
    return out
Let’s break down what’s happening here:
1. We combine the system message and user input to form the complete prompt
2. We encode this prompt into tokens and record its length
3. We append the end-of-prompt (EOP) token to mark the end of the prompt
4. We encode the assistant’s response (which contains the function call)
5. We append the end-of-text (EOT) token to mark the end of the response
6. We combine these sequences and pad them to the required length
7. We create two mask arrays:
· prompt_mask: Marks which tokens belong to the original prompt
· pad_mask: Marks which tokens are padding
This processing allows us to distinguish between different parts of the input sequence during training, which is crucial for how we calculate the loss.
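For intuition, here is roughly how a single record flows through process_dataset. This usage sketch assumes the Encoder and process_dataset definitions above are in scope, and that an eot_token attribute (GPT-2’s <|endoftext|>, ID 50256) is set on the encoder somewhere in the pipeline.

import json

example = {
    "system": "<|im_start|>system\nYou are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.\n<|im_end|>\n",
    "user": "<|im_start|>user\nSet the fan speed to high.<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n<functioncall> {\"name\": \"set_fan_speed\", \"arguments\": \"{'speed': 'high'}\"} <|im_end|><|endoftext|>",
}
record = {"text": json.dumps(example)}  # the function expects the example JSON-serialized under "text"

enc = Encoder().encoder
enc.eot_token = 50256  # assumed: <|endoftext|>; the Encoder above only sets the pad and eop tokens

out = process_dataset(record, enc, input_len=1024)
print(out["length"])             # number of real tokens: prompt + EOP + response + EOT
print(out["prompt_mask"].sum())  # positions belonging to the system + user prompt
print(out["pad_mask"].sum())     # positions treated as padding when masking the loss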
Loss Calculation and Training Process
Now, let’s examine how the training process works. In typical language model training, we calculate loss on all tokens by predicting each token based on the previous ones.
In this implementation, the key components are in the forward method of the GPT model and the training loop:
def forward(self, idx, targets=None, prompt_idx=None, answer_idx=None, return_losses_separately=False):
    device = idx.device
    b, t = idx.size()

    # Standard positioning and embedding
    pos = torch.arange(0, t, dtype=torch.long, device=device)
    tok_emb = self.transformer.wte(idx)
    pos_emb = self.transformer.wpe(pos)
    x = self.transformer.drop(tok_emb + pos_emb)

    # Process through transformer blocks
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)

    # Calculate loss if targets provided
    if targets is not None:
        logits = self.lm_head(x)
        if not return_losses_separately:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=-1,  # Ignore tokens marked with -1
            )
        else:
            # For separate loss tracking (prompt vs answer)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=-1,
                reduction="none",
            )
            prompt_idx_flatten = prompt_idx.view(-1)
            answer_idx_flatten = answer_idx.view(-1)
            prompt_loss = loss[prompt_idx_flatten == 1].mean()
            answer_loss = loss[answer_idx_flatten == 1].mean()
            loss = loss.mean()
            lossDict = {
                "prompt_loss": prompt_loss,
                "answer_loss": answer_loss,
                "loss": loss,
            }
    else:
        # Inference time optimization
        logits = self.lm_head(x[:, [-1], :])
        loss = None

    if return_losses_separately:
        return logits, loss, lossDict
    return logits, loss
And in the training loop:
def train(self):
    # … (training setup code)
    for batch in self.train_loader:
        # Unpack the batch
        input_ids = batch["input_ids"]
        padding_mask = batch["padding_mask"]
        prompt_mask = batch["prompt_mask"]

        # Prepare input and target sequences
        X = input_ids[:, :-1].to(self.device)
        Y = input_ids[:, 1:].to(self.device)

        # Mask out tokens we don't want to calculate loss for
        Y[padding_mask == 1] = -1  # Don't calculate loss for padding tokens

        # Decide whether to calculate loss on prompt tokens
        if hasattr(self.config, "loss_on_prompt") and self.config.loss_on_prompt:
            Y[prompt_mask == 1] = 1  # Include prompt tokens in loss
        else:
            Y[prompt_mask == 1] = -1  # Exclude prompt tokens from loss

        # … (forward pass and optimization code)
The crucial insight here is how we use the masks:
1. We set target tokens that correspond to padding (padding_mask == 1) to -1
2. We set target tokens that correspond to the prompt (prompt_mask == 1) to -1
3. The cross-entropy loss function ignores tokens with the target value of -1 (ignore_index=-1)
This means the model is only trained to predict tokens in the assistant’s response — specifically, the function call. The model learns to generate the correct function call format, name, and arguments based on the user input, without being penalized for not predicting the prompt tokens.
Let’s visualize what this looks like for an example ("phrase-wise tokenization" is assumed for simplicity):
Input sequence:
[“You are a helpful assistant”, “Make the fans blow harder”, “<EOP>”, “<functioncall>”, “adjust_fan_speed”, “increase”, … <PAD> <PAD> <PAD>]
Target sequence (shifted by one position):
[“Make the fans blow harder”, “<EOP>”, “<functioncall>”, “adjust_fan_speed”, “increase”, …, <PAD> <PAD> <PAD> <PAD>]
With masking applied (where -1 means “don’t calculate loss for this token”):
[-1, -1, -1, “<functioncall>”, “adjust_fan_speed”, “increase”, …, -1, -1, -1, -1]
The model is only penalized for incorrectly predicting the function call tokens, encouraging it to learn the mapping between user requests and the appropriate function call format.
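Here is a tiny standalone illustration of that masking mechanic, with a toy vocabulary and hand-picked token IDs (purely illustrative, not real model outputs):

import torch
import torch.nn.functional as F

vocab_size = 10
# Toy targets: positions 0-2 play the role of the prompt, 3-5 the function call, 6-7 padding
targets = torch.tensor([4, 2, 7, 1, 5, 3, 0, 0])
prompt_mask = torch.tensor([1, 1, 1, 0, 0, 0, 0, 0])
pad_mask = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])

# Mask out prompt and padding positions, exactly as in the training loop
masked = targets.clone()
masked[pad_mask == 1] = -1
masked[prompt_mask == 1] = -1
print(masked.tolist())  # [-1, -1, -1, 1, 5, 3, -1, -1]

# Cross-entropy then averages only over the three unmasked (function-call) positions
logits = torch.randn(8, vocab_size)  # stand-in for the model's output
loss = F.cross_entropy(logits, masked, ignore_index=-1)
print(loss)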
Advanced Feature: Loss on Prompt
This implementation includes a unique feature: the ability to calculate loss on prompt tokens as well. This is controlled by the loss_on_prompt configuration parameter:
if hasattr(self.config, "loss_on_prompt") and self.config.loss_on_prompt:
    Y[prompt_mask == 1] = 1
else:
    Y[prompt_mask == 1] = -1
When loss_on_prompt is enabled, the model is trained to predict the prompt tokens as well. While this might seem counterintuitive (why train the model to predict something it already knows?), it can help in certain scenarios:
1. It can act as a form of regularization, preventing the model from “forgetting” its pre-training
2. It can help maintain the model’s general language capabilities while it specializes in function calling
3. It provides an additional learning signal that might improve overall performance
In most function calling scenarios, we keep this disabled, focusing learning on generating the correct function calls rather than predicting the prompt.
Implementation Without HuggingFace
The core model architecture is defined in model.py, an exact replica of nanoGPT’s, which implements the standard GPT components:
· LayerNorm: Layer normalization for stabilizing training
· CausalSelfAttention: Multi-head self-attention mechanism with causal masking
· MLP: Multi-layer perceptron for feed-forward processing
· Block: Combines attention and MLP with residual connections
· GPT: Ties everything together into a complete language model
The model can be initialized from scratch or loaded from a pre-trained GPT-2 checkpoint saved from a nanoGPT pretrained model.
Putting It All Together: Training Execution
Let’s examine how the training process is actually executed. The GPTTrainer class in finetune.py handles the entire training pipeline:
def train(self):
    t0 = time.time()
    raw_model = self.model.module if self.ddp else self.model
    running_mfu = -1.0

    while True:
        if self.iter_num >= self.config.max_iters:
            break

        # Set epoch for samplers
        if self.ddp:
            self.train_loader.sampler.set_epoch(self.iter_num // len(self.train_loader))  # type: ignore
            self.val_loader.sampler.set_epoch(self.iter_num // len(self.train_loader))  # type: ignore

        for batch in self.train_loader:
            # Correctly unpack the batch
            input_ids = batch["input_ids"]
            padding_mask = batch["padding_mask"]
            prompt_mask = batch["prompt_mask"]

            X = input_ids[:, :-1].to(self.device)
            Y = input_ids[:, 1:].to(self.device)
            Y[padding_mask == 1] = -1
            if hasattr(self.config, "loss_on_prompt") and self.config.loss_on_prompt:
                Y[prompt_mask == 1] = 1
            else:
                Y[prompt_mask == 1] = -1

            lr = (
                self.get_lr(self.iter_num)
                if self.config.decay_lr
                else self.config.learning_rate
            )
            for param_group in self.optimizer.param_groups:
                param_group["lr"] = lr

            if self.iter_num % self.config.eval_interval == 0 and self.master_process:
                losses = self.estimate_loss()
                print(
                    f"step {self.iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}"
                )
                if losses["val"] < self.best_val_loss or self.config.always_save_checkpoint:
                    self.best_val_loss = losses["val"]
                    if self.iter_num > 0:
                        checkpoint = {
                            "model": raw_model.state_dict(),
                            "optimizer": self.optimizer.state_dict(),
                            "model_args": self.gptconf.__dict__,
                            "iter_num": self.iter_num,
                            "best_val_loss": self.best_val_loss,
                        }
                        print(f"saving checkpoint to {self.config.out_dir}")
                        torch.save(checkpoint, os.path.join(self.config.out_dir, "ckpt.pt"))
                if self.config.eval_only:
                    return

            if self.iter_num % self.config.sample_interval == 0 and self.master_process:
                self.sample_inference(self.iter_num)

            for micro_step in range(self.config.gradient_accumulation_steps):
                if self.ddp:
                    self.model.require_backward_grad_sync = (
                        micro_step == self.config.gradient_accumulation_steps - 1
                    )
                with self.ctx:
                    logits, loss = self.model(X, Y)
                    loss = loss / self.config.gradient_accumulation_steps
                self.scaler.scale(loss).backward()

            if self.config.grad_clip != 0.0:
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_clip)
            self.scaler.step(self.optimizer)
            self.scaler.update()
            self.optimizer.zero_grad(set_to_none=True)

            t1 = time.time()
            dt = t1 - t0
            t0 = t1
            if self.iter_num % self.config.log_interval == 0 and self.master_process:
                lossf = loss.item() * self.config.gradient_accumulation_steps
                if self.iter_num % self.config.log_interval == 0:
                    mfu = raw_model.estimate_mfu(
                        self.config.batch_size * self.config.gradient_accumulation_steps, dt
                    )
                    running_mfu = mfu if running_mfu == -1.0 else 0.9 * running_mfu + 0.1 * mfu
                print(
                    f"iter {self.iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%"
                )

            self.iter_num += 1
            if self.iter_num >= self.config.max_iters:
                break

    if self.ddp:
        destroy_process_group()
The training loop includes several important components:
1. Gradient accumulation: To simulate larger batch sizes on limited hardware
2. Learning rate scheduling: With warmup and cosine decay (see the sketch after this list)
3. Checkpointing: Saving the model when validation loss improves
4. Distributed training: Support for multi-GPU training using PyTorch’s DistributedDataParallel
5. Sample generation: Periodically generating samples to see how the model is performing
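To make item 2 concrete, here is a nanoGPT-style warmup-plus-cosine schedule, written as a standalone sketch using the learning-rate values from the Config class shown next; the repo’s actual get_lr may differ in its details.

import math

# nanoGPT-style schedule (illustrative sketch); values mirror the Config fields below
learning_rate, min_lr = 5e-6, 1e-6
warmup_iters, lr_decay_iters = 2000, int(0.9 * 60000)

def get_lr(it: int) -> float:
    # 1) linear warmup up to warmup_iters
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(2000), get_lr(27000), get_lr(54000))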
The configuration parameters allow you to control various aspects of the training process:
class Config:
    def __init__(self, pretrained_dir):
        self.init_from = "pretrained"
        self.out_dir = "./outputs/fc_fully_finetuned"
        self.pretrained_dir = pretrained_dir
        self.device = "cuda"
        self.eval_interval = 500
        self.sample_interval = 50
        self.log_interval = 1
        self.eval_iters = 2
        self.eval_only = False
        self.always_save_checkpoint = True
        self.gradient_accumulation_steps = 8
        self.batch_size = 64
        self.learning_rate = 5e-6
        self.max_iters = 60000
        self.weight_decay = 1e-1
        self.beta1 = 0.9
        self.beta2 = 0.95
        self.grad_clip = 1.0
        self.decay_lr = True
        self.warmup_iters = 2000
        self.lr_decay_iters = int(0.9 * 60000)
        self.min_lr = 1e-6
        self.backend = "nccl"
        self.dtype = (
            "bfloat16"
            if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
            else "float16"
        )
        self.compile = True
        self.block_size = None
        self.finetuning_dropout = 0.2
        self.loss_on_prompt = False
By adjusting these parameters, you can customize the training process for your specific needs, hardware capabilities, and dataset characteristics.
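The dtype field also determines how mixed precision is set up. The self.ctx autocast context and self.scaler gradient scaler used in the training loop are typically derived from it along these lines (a nanoGPT-style sketch, not necessarily the repo’s exact code):

import torch
from contextlib import nullcontext

dtype = (
    "bfloat16"
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else "float16"
)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
ptdtype = {"float32": torch.float32, "bfloat16": torch.bfloat16, "float16": torch.float16}[dtype]

# Autocast context wrapped around the forward pass; a no-op on CPU
ctx = (
    nullcontext()
    if device_type == "cpu"
    else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
)
# The gradient scaler is only active for float16; with bfloat16 it is effectively a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == "float16"))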
Training Progress and Model Evolution
During the training process, our model gradually learns to map user requests to the appropriate function calls. Let’s examine how the model’s capabilities evolve over the course of training by looking at its outputs at different stages.
Start of Training (0 iterations)
Input: system: <|im_start|>system
You are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.
<|im_end|>
user: <|im_start|>user
Can you decrease the fan speed?<|im_end|>
<|eop_token|>
----------------------------------
Actual: assistant: <|im_start|>assistant
<functioncall> {"name": "adjust_fan_speed", "arguments": "{'speed': 'decrease'}"} <|im_end|><|endoftext|><|endoftext|>
----------------------------------
Predicted: èrey: {start|}Can you decrease the fan speed? <|im_end|>When can you add an option to fan speed? <|im_end|>When do you update fan speed? <|im_end|>When can you update fan speed? <|im_end|>When will fan speed be modified? <|im_end|>When will fan speed be edited? <|im_end|>When will fan speed be edited ?<|im_end|>when will fan speed be edited ?<|im_end|>When will fan speed be changed? <|im_end|>When will fan speed be changed? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed be deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|> When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>When will fan speed get deleted? <|im_end|>What are your favorite fan presets? <|im_end|> What are your favorite fan presets? <|im_end|>What are your favorite fan presets? <|im_end|>What are your favorite fan presets? <|im_end|>What are your favorite fan presets? <|im_end|>What are your favorite fan presets? <|im_end|>What are your favorite fan presets? <|im_end|>Can you add fan presets that can be edited, only fan presets? <|im_end|>Can you make fan presets that are only fan presets? <|im_end|>Can you make fan presets that are only fan presets? <|im_end|]Can you make fan presets that can be edited, only fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>Can you make fan presets that can be edited, only fan presets? <|im_end|>What are the most popular fan presets? <|im_end|>Can you make fan presets that can be edited, only fan presets? <|im_end|>Can you make fan presets that can be edited, only fan presets? <|im_end|>Can you make fan presets that can be edited, only fan presets? <|im_end|>Can you make fan presets that can
n
Note: No end-of-text (EOT) token was predicted, and the prediction was truncated when the maximum length was reached.
Early Training (500 iterations)
Input: system: <|im_start|>system
You are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.
<|im_end|>
user: <|im_start|>user
Raise the temperature please.<|im_end|>
<|eop_token|>
- - - - - - - - - - - - - - - - -
Actual: assistant: <|im_start|>assistant
<functioncall> {"name": "adjust_temperature", "arguments": "{'action': 'increase'}"} <|im_end|><|endoftext|><|endoftext|>
- - - - - - - - - - - - - - - - -
Predicted: assistant_start|>
<|im_end|>
<|im_start|>
<|im_end|><|endoftext|>
Mid Training (1,000 iterations)
Input: system: <|im_start|>system
You are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.
<|im_end|>
user: <|im_start|>user
Turn the airflow up to high in the driver's seat and the back seats.<|im_end|>
<|eop_token|>
- - - - - - - - - - - - - - - - -
Actual: assistant: <|im_start|>assistant
<functioncall> {"name": "set_fan_speed", "arguments": "{'speed': 'high', 'area': ['rear-right', 'driver', 'rear-left']}"} <|im_end|><|endoftext|><|endoftext|>
- - - - - - - - - - - - - - - - -
Predicted: assistant: <|im_start|>assistant
<functioncall> {"name": "adjust_fan_speed", "arguments": "{'speed': 'high'}"} <|im_end|><|endoftext|><|endoftext|>
Late Training (20,000 iterations)
Input: system: <|im_start|>system
You are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.
<|im_end|>
user: <|im_start|>user
Adjust the fans to medium for the left and right seats in the rear.<|im_end|>
<|eop_token|>
- - - - - - - - - - - - - - - - -
Actual: assistant: <|im_start|>assistant
<functioncall> {"name": "set_fan_speed", "arguments": "{'speed': 'medium', 'area': ['rear-left', 'rear-right']}"} <|im_end|><|endoftext|><|endoftext|>
- - - - - - - - - - - - - - - - -
Predicted: assistant: <|im_start|>assistant
<functioncall> {"name": "set_fan_speed", "arguments": "{'speed': 'medium', 'area': ['rear-right', 'rear-left']}"} <|im_end|><|endoftext|><|endoftext|>
Complex Examples
By the end of training, the model handles complex requests that require understanding multiple parameters and areas:
User: Reduce the fan speed for the front seats, both driver and passenger.
Model Output: <functioncall> {"name": "adjust_fan_speed", "arguments": "{'speed': 'decrease', 'area': ['driver', 'front-passenger']}"} </functioncall>
These examples demonstrate how the model gradually builds an understanding of the relationship between natural language requests and structured function calls. The fine-tuning process effectively teaches the model to extract the relevant intent and parameters from diverse phrasings and map them to the appropriate function call format.
Inference and Usage
Once the model is trained, we can use it to generate function calls in response to new user inputs. The inference process uses the generate_answer_for_question method:
@torch.no_grad()
def generate_answer_for_question(
    self,
    idx,
    max_new_tokens,
    temperature=1.0,
    top_k=None,
    eot_token_id=50256,
    isTraining=False,
    eop_token_id=50258,
    sampleMax=False,
):
    """
    Generates a function call response to a user query

    idx: input token sequence
    max_new_tokens: maximum number of tokens to generate
    temperature: sampling temperature (lower = more deterministic)
    top_k: if set, only sample from the top k most likely tokens
    """
    print("********** Generating Answer **********")
    if isTraining:
        # For training, find the EOP token position
        x_eop_row = torch.where(idx == eop_token_id)[0][0].item()
        x_eop_col = torch.where(idx == eop_token_id)[1][0].item()
        idx = idx[x_eop_row, : x_eop_col + 1]
        idx = idx[None, ...]

    isEndToken = False
    nTokens = 0

    # Auto-regressive generation loop
    while (not isEndToken) and (nTokens <= max_new_tokens):
        # Respect context window size
        idx_cond = (
            idx if idx.size(1) <= self.config.block_size
            else idx[:, -self.config.block_size:]
        )

        # Get logits for next token
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature

        # Apply top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("Inf")

        # Sample or take argmax
        probs = F.softmax(logits, dim=-1)
        if not sampleMax:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            idx_next = torch.argmax(probs, dim=-1).view(1, 1)

        # Append token and check for end condition
        idx = torch.cat((idx, idx_next), dim=1)
        if idx_next == eot_token_id:
            isEndToken = True
        nTokens += 1

    return idx
This method:
1. Takes a user prompt encoded as token IDs
2. Generates tokens autoregressively, sampling from the model’s output distribution
3. Stops when it reaches the end-of-text token or the maximum token limit
4. Returns the complete sequence, including the original prompt and the generated function call
For real-world usage, you would (see the sketch after this list):
1. Encode the user’s request using your custom encoder
2. Add the system prompt and end-of-prompt token
3. Pass this to the model for generation
4. Decode the generated tokens
5. Parse the function call from the decoded text
6. Execute the appropriate function with the specified arguments
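Putting those six steps together, a minimal inference sketch could look like the following; the trained model object, the exact prompt layout, and the quote-normalising argument parser are assumptions carried over from the earlier snippets, not code from the repo.

import json
import re
import torch

enc = Encoder().encoder  # the tiktoken-based encoder defined earlier
device = next(model.parameters()).device  # model: a trained GPT loaded from the fine-tuning checkpoint

# 1-2) Build the prompt in the layout used by process_dataset, then append the end-of-prompt token
system = "system: <|im_start|>system\nYou are a helpful assistant. You have to either provide a way to answer user's request or answer user's query.\n<|im_end|>\n"
user = "user: <|im_start|>user\nCan you decrease the fan speed?<|im_end|>\n"
prompt_ids = enc.encode_ordinary(system + user) + [enc.eop_token]
idx = torch.tensor(prompt_ids, dtype=torch.long, device=device)[None, ...]

# 3) Generate greedily until the end-of-text token
out = model.generate_answer_for_question(idx, max_new_tokens=128, sampleMax=True)

# 4-5) Decode and pull out the structured call
text = enc.decode(out[0].tolist())
match = re.search(r"<functioncall>\s*(\{.*?\})\s*<\|im_end\|>", text)
if match:
    call = json.loads(match.group(1))
    args = json.loads(call["arguments"].replace("'", '"'))
    # 6) Route to the real implementation, e.g. adjust_fan_speed(**args)
    print(call["name"], args)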
Conclusion and Extensions
We’ve walked through a complete implementation of function calling capabilities using a NanoGPT-style model built from scratch. By fine-tuning the model on carefully structured examples, we’ve taught it to respond to user requests with appropriate function calls without embedding function definitions in the context window.
Some potential extensions and improvements to this approach include:
1. Hybrid responses: Training the model to provide either function calls or natural language responses based on the context
2. Multi-function calls: Supporting sequences of function calls for complex requests
3. Function validation: Adding validation logic to ensure the generated function calls have valid parameters
4. Tool use: Extending beyond simple function calls to more complex tool interactions
5. Feedback loops: Incorporating execution results into the model’s context for iterative refinement
This from-scratch implementation gives you a deep understanding of how function calling works at a fundamental level, allowing you to customize and extend it for your specific needs. By building on NanoGPT’s clean and efficient codebase, we’ve created a powerful yet understandable system for adding structured function calling capabilities to language models.
Remember that the quality of your fine-tuned model depends heavily on your training data — invest time in creating diverse, high-quality examples that cover the full range of functions and parameters you want to support. While this repository provides the curated dataset ready for training, the data generation process itself is perhaps the most crucial step in the entire pipeline. In this implementation, we’ve focused on the model architecture and training methodology, but data generation deserves its own spotlight. Stay tuned for an upcoming piece where we’ll dive deep into strategies for generating comprehensive training data that covers parameter variations, linguistic diversity, and edge cases. With the right data and training approach, even relatively small models can perform function calling tasks with impressive accuracy.
By understanding function calling from first principles, we unlock the ability to customize and extend this capability for specific domains and applications.
You can find the complete codebase for this implementation on GitHub: https://github.com/suyashh94/finetune-function-calling-from-scratch. Feel free to clone, fork, and adapt it for your own projects!