
Are You Using the OpenAI API Correctly?
Author(s): Tom Or
Originally published on Towards AI.
Background
The OpenAI API has become a standard in the way we communicate with Large Language Models (LLMs). Many open-source projects (Ollama, llama.cpp) and even enterprise SDKs (Google Vertex AI, Anthropic) provide OpenAI-compatible APIs. This unification allows switching between LLM providers by simply replacing a line.
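For example, the official openai Python client can be pointed at any compatible server just by swapping the base URL (a sketch; the local URL and model name below assume a default Ollama setup):

```python
from openai import OpenAI

# The official OpenAI endpoint.
client = OpenAI(api_key="sk-...")

# The exact same client pointed at a local OpenAI-compatible server (Ollama here).
# Only the base URL and model name change; the application code stays identical.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama2",  # whatever model the local server is serving
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```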

Most language frameworks (LangChain, LangGraph) build on this fact to offer a “provider-agnostic switch”. They provide simple wrappers over the OpenAI client for cleaner usage.
API Spec
The API exposes two main communication channels:
- Chat completion
- Completion
Chat completion takes in a structured list of objects, each specifying a role and content. The “regular” completion, on the other hand, takes in a string and returns one. So what is the difference?
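Before answering, here is what the two calls look like side by side in the openai Python client (a sketch; model names are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Chat completion: a structured list of role/content messages.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(chat.choices[0].message.content)

# Completion: a raw string in, a raw string out.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Who are you?",
)
print(completion.choices[0].text)
```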

Quick refresher on LLM inner workings
To better understand the differences between the endpoints, we need to understand how LLMs understand and generate text.
Given a string S, the LLM splits S into units called “tokens”. This process is called tokenization and is performed by a tokenizer. The LLM processes the token sequence received from the tokenizer and generates the next most probable token. The newly generated token is appended to the previous ones, and the process repeats until a special token indicating “end of generation” is produced.
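Here is a minimal sketch of that loop, using the small GPT-2 model from Hugging Face transformers as a stand-in for any LLM (greedy decoding, for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenization: the string S becomes a sequence of token ids.
tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Autoregressive generation: predict the next token, append it, repeat.
for _ in range(20):
    with torch.no_grad():
        logits = model(tokens).logits
    next_token = logits[0, -1].argmax()  # the most probable next token
    tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)
    if next_token.item() == tokenizer.eos_token_id:
        break  # the special "end of generation" token

print(tokenizer.decode(tokens[0]))
```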
If the LLM is just a next-probable-token machine, how does it know to chat with us? How does it know what the system prompt is? How can we instruct it?
LLMs are trained in multiple stages. The first stage is usually referred to as “pre-training”: training the model on diverse, large datasets (code repositories, Wikipedia, private datasets, ...). The resulting model is referred to as the “base model”.
The second stage, referred to as “post-training”, consists of various fine-tuning processes applied to the base model. Popular examples are “chat” fine-tuning and “instruct” fine-tuning. In this step, many recipes introduce new special tokens.
Example: Llama fine-tuning
Until about two years ago (ages in AI time...), Meta's Llama models were state-of-the-art open-source models. We will focus on two (old) models:
- https://huggingface.co/meta-llama/Llama-2-70b-hf
- https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
The non-chat version is a pure token-completion machine. If we ask, “Who are you?”, the completion is... well... weird: the model simply continues the text instead of answering the question.

On the other hand, the chat version “knows” to differentiate between requests (or instructions) and “understands” that you are asking a question.
How does the transformation work (base model -> chat model)?
The base model is further trained on a curated, relatively small dataset of “chat” or “instruct” data. The architecture of the model stays the same: it still produces token after token. However, the input is not the raw string you send to it. The string, or the messages you send, are formatted into a special template defined by the fine-tuning process and the tokenizer.
For example, the query “Who are you?” for Llama-2-70b-chat will be transformed into something like this (the system prompt shown is illustrative):
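```
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Who are you? [/INST]
```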

[INST], <<SYS>> and additional tokens were added, and the model sees that string as its input. But who does the formatting? The chat model will not “crash” if we don't use the template.
Chat completion vs Completion
When using chat completion, the hosting server does (or should do) the formatting for you automatically. (Pro tip: make sure the server does it properly.)
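You can inspect what the server should be producing by applying the model's own chat template with the Hugging Face tokenizer (a sketch; the Llama-2 repo is gated, so this assumes you have been granted access):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Roughly the string a chat-completion server feeds the model.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```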

When using the completion endpoint, the model receives the input as-is.

So when sending a raw string to a chat model, we will get subpar or even gibberish results because the chat template was not followed.
The caveat
The completion API came before the chat completion API, as chat models are an incremental step over base models. Because of that, many applications, tutorials (and the data LLMs were trained on) are based on the completion API. Nowadays, released models are usually chat- or instruct-tuned, so they have a chat template to interact with. Many (popular) libraries use the completion endpoint as the default, so watch out.
So make sure that if you are using a chat/instruct model, you are either using the chat completion endpoint, or doing the chat formatting on your end and using the completion endpoint, as in the sketch below.
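One way to do the formatting yourself is to apply the template with the Hugging Face tokenizer and send the resulting string (a sketch; the server URL and model name are placeholders for any OpenAI-compatible server):

```python
from openai import OpenAI
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

messages = [{"role": "user", "content": "Who are you?"}]

# Do the chat formatting ourselves...
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# ...and send the already-formatted string to the raw completion endpoint.
response = client.completions.create(model="llama-2-70b-chat", prompt=prompt)
print(response.choices[0].text)
```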
Why “Completion” lingers
The completion endpoint could mess up your GenAI application without you even knowing it. So why is it still here?
The main reason is backward compatibility: many applications rely on it and would break without it.
However, the completion endpoint provides flexibility that cannot be achieved via the chat endpoint. Let's consider this example:
You are developing a chatbot for a French company, and the chatbot should only answer in French. The go-to solution is to state this requirement in the system prompt. This will probably work well in most cases; however, when a user asks something in English, the system may break and answer in English (especially with weaker models).
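The naive, system-prompt-only approach looks something like this (a sketch; the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The instruction lives only in the system prompt.
        {"role": "system", "content": "Tu es un assistant utile. Réponds toujours en français."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
# A strong model usually answers in French here; a weaker one may
# slip into English because the user asked in English.
print(response.choices[0].message.content)
```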

To enforce a more robust solution, one can leverage the nature of the LLM as a probabilistic token machine: if the first token of the generated answer is a French token, it is very likely that the rest of the answer will also be in French. We can “trick” the model by letting it “think” that it has already generated the beginning of an answer in French. This can only be done with the completion endpoint:
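A sketch of the trick, reusing the manual formatting from above (the server URL, model name, and the pre-filled French opener are all illustrative):

```python
from openai import OpenAI
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

messages = [
    {"role": "system", "content": "Réponds toujours en français."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Format the chat template ourselves, ending exactly where the
# assistant's turn begins...
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# ...then pre-fill the start of the answer in French. The model now
# "thinks" it already began answering in French, so the following
# tokens are very likely to be French as well.
prompt += " Bien sûr !"

response = client.completions.create(model="llama-2-70b-chat", prompt=prompt)
print("Bien sûr !" + response.choices[0].text)
```

The chat endpoint offers no clean way to express such a partial assistant turn, and that is exactly the flexibility the completion endpoint preserves.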

Takeaways
- The chat endpoint is a high-level convenience wrapper; underneath it’s still tokens all the way down.
- Special tokens ([INST], <<SYS>>, etc.) govern role boundaries, stops, and tooling hooks.
- Older code using the completion endpoint can silently lose those tokens unless you add them yourself.
Happy prompting!
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.