
Are You Using the OpenAI API Correctly?
Author(s): Tom Or
Originally published on Towards AI.
Background
The OpenAI API has become a standard in the way we communicate with Large Language Models (LLMs). Many open-source projects (Ollama, llama.cpp) and even enterprise SDKs (Google Vertex AI, Anthropic) provide OpenAI-compatible APIs. This unification allows switching between LLM providers by simply replacing a line.
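For example, the official openai Python client can be pointed at any compatible server just by swapping the base URL (a sketch; the local URL and model name below assume a default Ollama setup):

```python
from openai import OpenAI

# The official OpenAI endpoint.
client = OpenAI(api_key="sk-...")

# The exact same client pointed at a local OpenAI-compatible server (Ollama here).
# Only the base URL and model name change; the application code stays identical.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama2",  # whatever model the local server is serving
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```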

Most language frameworks (LangChain, LangGraph) build on this fact to offer a “provider-agnostic switch”. They provide simple wrappers over the OpenAI client for cleaner usage.
API Spec
The API exposes two main communication channels:
- Chat completion
- Completion
Chat completion takes in a structured list of objects, each specifying a role and content. The “regular” completion, on the other hand, takes in a string and returns one. So what is the difference?
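Before answering, here is what the two calls look like side by side in the openai Python client (a sketch; model names are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Chat completion: a structured list of role/content messages.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(chat.choices[0].message.content)

# Completion: a raw string in, a raw string out.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Who are you?",
)
print(completion.choices[0].text)
```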

Quick refresher on LLM inner workings
To better understand the differences between the endpoints, we need to understand how LLMs understand and generate text.
Given a string S, the LLM splits S into units called “tokens”. This process is called tokenization and is performed by a tokenizer. The LLM processes the token sequence received from the tokenizer and generates the next most probable token. The newly generated token is appended to the previous ones, and the process repeats until a special token indicating “end of generation” is produced.
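Here is a minimal sketch of that loop, using the small GPT-2 model from Hugging Face transformers as a stand-in for any LLM (greedy decoding, for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenization: the string S becomes a sequence of token ids.
tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Autoregressive generation: predict the next token, append it, repeat.
for _ in range(20):
    with torch.no_grad():
        logits = model(tokens).logits
    next_token = logits[0, -1].argmax()  # the most probable next token
    tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)
    if next_token.item() == tokenizer.eos_token_id:
        break  # the special "end of generation" token

print(tokenizer.decode(tokens[0]))
```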
If the LLM is just a next-probable-token machine, how does it know to chat with us? How does it know what the system prompt is? How can we instruct it?
LLMs are trained in multiple stages. The first stage is usually referred to as “pre-training”: training the model on diverse, large datasets (code repositories, Wikipedia, private datasets, ...). The resulting model is referred to as the “base model”.
The second stage, referred to as “post-training”, consists of various fine-tuning processes applied to the base model. Popular examples are “chat” fine-tuning and “instruct” fine-tuning. In this step, many recipes introduce new special tokens.
Example: Llama fine-tuning
Until about two years ago (ages in AI time...), Meta's Llama models were state-of-the-art open-source models. We will focus on two (old) models:
- https://huggingface.co/meta-llama/Llama-2-70b-hf
- https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
The non-chat version is a pure token-completion machine. If we ask, “Who are you?”, the completion is... well... weird: the model simply continues the text instead of answering the question.

On the other hand, the chat version “knows” to differentiate between requests (or instructions) and “understands” that you are asking a question.
How does the transformation work (base model -> chat model)?
The base model is further trained on a curated, relatively small dataset of “chat” or “instruct” data. The architecture of the model stays the same: it still produces token after token. However, the input is not the raw string you send to it. The string, or the messages you send, are formatted into a special template defined by the fine-tuning process and the tokenizer.
For example, the query “Who are you?” for Llama-2-70b-chat will be transformed into something like this (the system prompt shown is illustrative):
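```
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Who are you? [/INST]
```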

[INST], <<SYS>> and additional tokens were added, and the model sees that string as its input. But who does the formatting? The chat model will not “crash” if we don't use the template.
Chat completion vs Completion
When using chat completion, the hosting server does (or should do) the formatting for you automatically. (Pro tip: make sure the server does it properly.)
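You can inspect what the server should be producing by applying the model's own chat template with the Hugging Face tokenizer (a sketch; the Llama-2 repo is gated, so this assumes you have been granted access):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Roughly the string a chat-completion server feeds the model.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```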

When using the completion endpoint, the model receives the input as-is.

So when sending a raw string to a chat model, we will get subpar or even gibberish results because the chat template was not followed.
The caveat
The completion API came before the chat completion API, as chat models are an incremental step over base models. Because of that, many applications, tutorials (and the data LLMs were trained on) are based on the completion API. Nowadays, released models are usually chat- or instruct-tuned, so they have a chat template to interact with. Many (popular) libraries use the completion endpoint as the default, so watch out.
So make sure that if you are using a chat/instruct model, you are either using the chat completion endpoint, or doing the chat formatting on your end and using the completion endpoint, as in the sketch below.
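One way to do the formatting yourself is to apply the template with the Hugging Face tokenizer and send the resulting string (a sketch; the server URL and model name are placeholders for any OpenAI-compatible server):

```python
from openai import OpenAI
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

messages = [{"role": "user", "content": "Who are you?"}]

# Do the chat formatting ourselves...
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# ...and send the already-formatted string to the raw completion endpoint.
response = client.completions.create(model="llama-2-70b-chat", prompt=prompt)
print(response.choices[0].text)
```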
Why “Completion” lingers
The completion endpoint could mess up your GenAI application without you even knowing it. So why is it still here?
The main reason is backward compatibility: many applications rely on it and would break without it.
However, the completion endpoint provides flexibility that cannot be achieved via the chat endpoint. Let's consider this example:
You are developing a chatbot for a French company, and the chatbot should only answer in French. The go-to solution is to state this requirement in the system prompt. This will probably work well in most cases; however, when a user asks something in English, the system may break and answer in English (especially with weaker models).
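The naive, system-prompt-only approach looks something like this (a sketch; the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The instruction lives only in the system prompt.
        {"role": "system", "content": "Tu es un assistant utile. Réponds toujours en français."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
# A strong model usually answers in French here; a weaker one may
# slip into English because the user asked in English.
print(response.choices[0].message.content)
```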

To enforce a more robust solution, one can leverage the nature of the LLM as a probabilistic token machine: if the first token of the generated answer is a French token, it is very likely that the rest of the answer will also be in French. We can “trick” the model by letting it “think” that it has already generated the beginning of an answer in French. This can only be done with the completion endpoint:
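A sketch of the trick, reusing the manual formatting from above (the server URL, model name, and the pre-filled French opener are all illustrative):

```python
from openai import OpenAI
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

messages = [
    {"role": "system", "content": "Réponds toujours en français."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Format the chat template ourselves, ending exactly where the
# assistant's turn begins...
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# ...then pre-fill the start of the answer in French. The model now
# "thinks" it already began answering in French, so the following
# tokens are very likely to be French as well.
prompt += " Bien sûr !"

response = client.completions.create(model="llama-2-70b-chat", prompt=prompt)
print("Bien sûr !" + response.choices[0].text)
```

The chat endpoint offers no clean way to express such a partial assistant turn, and that is exactly the flexibility the completion endpoint preserves.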

Takeaways
- The chat endpoint is a high-level convenience wrapper; underneath it’s still tokens all the way down.
- Special tokens ([INST], <<SYS>>, etc.) govern role boundaries, stops, and tooling hooks.
- Older code using the completion endpoint can silently lose those tokens unless you add them yourself.
Happy prompting!
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.