Unleashing the Power of ONNX for Speedier SBERT Inference
Last Updated on May 19, 2026 by Editorial Team
Author(s): Swaraj Patil
Originally published on Towards AI.
SBERT, also known as Sentence-BERT, is a widely used approach for producing sentence embeddings that retain the contextual information within a sentence. However, generating these embeddings can be slow on large amounts of data. One option is batch-based encoding, but batching does not always reduce inference time. In this post, we explore how the ONNX (Open Neural Network Exchange) format, together with ONNX Runtime, can reduce the model's inference time.

P.S. This article does not delve into the internal workings of ONNX. For more in-depth information, please consult the official ONNX documentation.
Let’s begin by installing the required libraries. We can use pip for the installation:
pip install onnx
pip install onnxruntime-gpu
pip install transformers
pip install torch
Once ONNX is installed, we can verify the installation with the snippet below:
import onnx
print(onnx.__version__)
To obtain sentence embeddings, we will use the IMDB dataset sourced from Kaggle. Specifically, we will use the “Overview” column (the overview of each movie) to generate embeddings with SBERT. We will measure the time needed to create embeddings for the 1,000 sentences in the dataset.
We will run four experiments, two on CPU and two on GPU:
- Inference time for 1000 sentences using Vanilla SBERT (CPU).
- Inference time for 1000 sentences using ONNX converted SBERT (CPU).
- Inference time for 1000 sentences using Vanilla SBERT (GPU).
- Inference time for 1000 sentences using ONNX converted SBERT (GPU).
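The per-sentence timing behind these four experiments can be sketched as a small helper. This is a minimal sketch, not the article's exact code: `measure_latency` and `summarize` are hypothetical names, it uses `time.perf_counter` instead of `time.time`, and it adds a short warm-up, which the runs in this article do not have.

```python
import statistics
import time


def measure_latency(fn, inputs, warmup=5):
    """Time fn(x) for each x in inputs; return per-call latencies in ms.

    The first `warmup` inputs are run once untimed so that one-off costs
    (lazy initialisation, cache fills) do not skew the average.
    """
    for x in inputs[:warmup]:
        fn(x)

    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Return mean and median latency in milliseconds."""
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }


# Stand-in workload; in the experiments this would be the model call
# plus pooling for a single sentence.
sentences = ["a short one", "a slightly longer sentence here"] * 10
stats = summarize(measure_latency(lambda s: s.upper(), sentences))
print(stats)
```

Reporting the median alongside the mean helps show whether a few slow calls dominate the average.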
The Sentence BERT model that we will use here is all-MiniLM-L6-v2.
We can load the Sentence BERT model through either the Hugging Face Transformers library or the Sentence Transformers library. The output embeddings from both libraries will be the same. For our experiments, we will use the Hugging Face library. Keep in mind that with the Hugging Face library the model returns token-level embeddings, so additional post-processing such as pooling and normalization is needed. The required steps are listed on the model page on Hugging Face. Perform those steps to get the final sentence embeddings.
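The pooling and normalization step can also be written directly in NumPy, which is convenient because ONNX Runtime returns NumPy arrays. The following is a minimal sketch, assuming token embeddings of shape (batch, seq_len, hidden) and a 0/1 attention mask; `mean_pool_and_normalize` is a hypothetical name and not part of any library.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings, attention_mask, eps=1e-9):
    """Masked mean over tokens, then L2-normalize each sentence vector.

    token_embeddings: float array of shape (batch, seq_len, hidden)
    attention_mask:   0/1 array of shape (batch, seq_len)
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)  # padded tokens contribute 0
    counts = np.clip(mask.sum(axis=1), eps, None)   # guard against division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, eps, None)


# Toy input: one sentence, three token slots (the last is padding), hidden size 2.
tokens = np.array([[[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])
embedding = mean_pool_and_normalize(tokens, mask)
print(embedding)  # approximately [[0.6, 0.8]]
```

The padded token (100, 100) is ignored, so the mean is (1.5, 2.0), and dividing by its length 2.5 gives a unit-length vector.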
Let's first convert the model to ONNX format.
# Load the pretrained model and tokenizer
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
model = AutoModel.from_pretrained(model_name)

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    temp = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9)
    # L2-normalize so that each sentence embedding has unit length
    return F.normalize(temp, p=2, dim=1)
# Use one example sentence to run the model and export it to ONNX
sample = ['Hey, how are you today?']
inputs = tokenizer(sample,
                   padding=True,
                   truncation=True,
                   return_tensors="pt")
# Convert the model to ONNX format
import os
import torch

device = torch.device("cpu")
# Set the model to inference mode, which is required before exporting
# because some operators behave differently in inference and training mode.
model.eval()
model.to(device)

output_dir = os.path.join(".", "onnx_models")
os.makedirs(output_dir, exist_ok=True)
export_model_path = os.path.join(output_dir, 'all_MiniLM_L6-v2.onnx')

# Pass the inputs in the same order as input_names; the tokenizer's own
# key order is not guaranteed to match it.
input_names = ['input_ids', 'attention_mask', 'token_type_ids']
export_args = tuple(inputs[name] for name in input_names)

with torch.no_grad():
    symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
    torch.onnx.export(model,  # model being run
                      args=export_args,  # model inputs, in input_names order
                      f=export_model_path,  # where to save the model (a file or file-like object)
                      opset_version=11,  # the ONNX opset version to export the model to
                      do_constant_folding=True,  # whether to execute constant folding for optimization
                      input_names=input_names,  # the model's input names
                      output_names=['last_hidden_state', 'pooler_output'],  # the model's output names
                      dynamic_axes={'input_ids': symbolic_names,  # variable-length axes
                                    'attention_mask': symbolic_names,
                                    'token_type_ids': symbolic_names,
                                    'last_hidden_state': symbolic_names,
                                    'pooler_output': {0: 'batch_size'}})
print("Model exported at ", export_model_path)
Now that we have converted the Sentence BERT model, let’s measure the inference time for each configuration.
Vanilla SBERT (CPU)
The inference time obtained for the Vanilla SBERT model on the CPU can be found using the snippet below.
import time
import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.read_csv('./imdb_top_1000.csv', usecols=['Overview'])
total_samples = len(df)
latency = []
outputs_cpu = []
with torch.no_grad():
    for i in tqdm(range(total_samples)):
        data = [df.loc[i, "Overview"]]
        inputs = tokenizer(data,
                           padding=True,
                           truncation=True,
                           return_tensors="pt")
        # Time only the forward pass and pooling, not tokenization
        start = time.time()
        outputs_cpu.append(mean_pooling(model(**inputs),
                                        inputs['attention_mask']
                                        ).cpu().detach().numpy())
        latency.append(time.time() - start)
print("\n")
print("PyTorch {} Inference time = {} ms".format(device.type,
      np.round(np.average(latency)*1000, 4)))
100%|██████████| 1000/1000 [00:36<00:00, 27.62it/s]
PyTorch cpu Inference time = 34.2605 ms
ONNX Converted SBERT (CPU)
The inference time obtained for the ONNX SBERT model on the CPU can be found using the below snippet.
import onnxruntime
import numpy as np

sess_options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(export_model_path,
                                       sess_options,
                                       providers=['CPUExecutionProvider'])
latency = []
ort_outputs_cpu = []
for i in tqdm(range(total_samples)):
    data = [df.loc[i, "Overview"]]
    inputs = tokenizer(data,
                       padding=True,
                       truncation=True,
                       return_tensors="pt")
    # ONNX Runtime takes NumPy arrays keyed by input name
    ort_inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
    start = time.time()
    op = session.run(None, ort_inputs)
    op = torch.from_numpy(op[0])  # first output holds the token embeddings
    ort_outputs_cpu.append(mean_pooling([op],
                                        inputs['attention_mask']
                                        ).cpu().detach().numpy())
    latency.append(time.time() - start)
print("\n")
print("OnnxRuntime {} Inference time = {} ms".format(device.type,
      np.round(np.average(latency)*1000, 4)))
100%|██████████| 1000/1000 [00:16<00:00, 60.80it/s]
OnnxRuntime cpu Inference time = 15.5696 ms
Outputs
outputs_cpu[0][:,:10] ## Vanilla SBERT CPU Output
array([[-0.06326339, 0.0414625 , -0.04707527, -0.03361899, -0.02562934,
0.03499832, 0.00804075, -0.05042004, 0.00215668, -0.03816812]],
dtype=float32)
ort_outputs_cpu[0][:,:10] ## Onnx SBERT CPU Output
array([[-0.06326343, 0.04146247, -0.04707528, -0.033619 , -0.02562926,
0.03499835, 0.0080408 , -0.05042008, 0.00215669, -0.03816817]],
dtype=float32)
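To check how closely the two outputs agree, the printed values can be compared numerically. This is a minimal sketch that copies the ten values shown above into NumPy arrays; the tolerance of 1e-6 is an assumption chosen for illustration, not a threshold from the original experiments.

```python
import numpy as np

# First ten embedding values for sentence 0, copied from the printed outputs above.
pytorch_cpu = np.array([-0.06326339, 0.0414625, -0.04707527, -0.03361899, -0.02562934,
                        0.03499832, 0.00804075, -0.05042004, 0.00215668, -0.03816812])
onnx_cpu = np.array([-0.06326343, 0.04146247, -0.04707528, -0.033619, -0.02562926,
                     0.03499835, 0.0080408, -0.05042008, 0.00215669, -0.03816817])

# Largest element-wise difference between the two runtimes
max_abs_diff = np.max(np.abs(pytorch_cpu - onnx_cpu))
print(f"max abs diff: {max_abs_diff:.1e}")
print("within 1e-6:", np.allclose(pytorch_cpu, onnx_cpu, rtol=0.0, atol=1e-6))
```

In the notebook, the same check could be applied to the full arrays, for example by comparing `outputs_cpu[i]` with `ort_outputs_cpu[i]` for every sentence.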
Vanilla SBERT (GPU)
The inference time obtained for the Vanilla SBERT model on the GPU can be found using the snippet below.
device = torch.device("cuda")
# Set the model to inference mode and move it to the GPU.
model.eval()
model.to(device)
total_samples = len(df)
latency = []
outputs_gpu = []
with torch.no_grad():
    for i in tqdm(range(total_samples)):
        data = [df.loc[i, "Overview"]]
        inputs = tokenizer(data,
                           padding=True,
                           truncation=True,
                           return_tensors="pt"
                           ).to(device)
        start = time.time()
        # Copying the result to the CPU waits for the GPU work to finish,
        # so the measured time covers the full forward pass.
        outputs_gpu.append(mean_pooling(model(**inputs),
                                        inputs['attention_mask']).cpu().detach().numpy())
        latency.append(time.time() - start)
print("\n")
print("PyTorch {} Inference time = {} ms".format(device.type,
      np.round(np.average(latency)*1000, 4)))
100%|██████████| 1000/1000 [00:07<00:00, 135.29it/s]
PyTorch cuda Inference time = 6.737 ms
ONNX Converted SBERT (GPU)
The inference time obtained for the ONNX SBERT model on the GPU can be found using the snippet below.
import onnxruntime
import numpy as np

sess_options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(export_model_path,
                                       sess_options,
                                       providers=['CUDAExecutionProvider'])
latency = []
ort_outputs_gpu = []
for i in tqdm(range(total_samples)):
    data = [df.loc[i, "Overview"]]
    inputs = tokenizer(data,
                       padding=True,
                       truncation=True,
                       return_tensors="pt"
                       ).to(device)
    # ONNX Runtime takes NumPy arrays keyed by input name
    ort_inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
    start = time.time()
    op = session.run(None, ort_inputs)
    op = torch.from_numpy(op[0])  # first output holds the token embeddings
    ort_outputs_gpu.append(mean_pooling([op],
                                        inputs['attention_mask'].cpu()).cpu().detach().numpy())
    latency.append(time.time() - start)
print("\n")
print("OnnxRuntime {} Inference time = {} ms".format(device.type,
      np.round(np.average(latency)*1000, 4)))
100%|██████████| 1000/1000 [00:02<00:00, 373.49it/s]
OnnxRuntime cuda Inference time = 1.9466 ms
Outputs
outputs_gpu[0][:,:10] ## Vanilla SBERT GPU
array([[-0.06326333, 0.04146247, -0.0470753 , -0.03361904, -0.02562935,
0.03499833, 0.00804079, -0.05042002, 0.00215669, -0.03816818]],
dtype=float32)
ort_outputs_gpu[0][:,:10] ## ONNX SBERT GPU
array([[-0.06326336, 0.04146249, -0.04707528, -0.03361899, -0.02562931,
0.03499832, 0.0080408 , -0.05042004, 0.00215668, -0.03816817]],
dtype=float32)
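Summary Table
- Vanilla SBERT (CPU): 34.2605 ms average per sentence
- ONNX converted SBERT (CPU): 15.5696 ms average per sentence
- Vanilla SBERT (GPU): 6.737 ms average per sentence
- ONNX converted SBERT (GPU): 1.9466 ms average per sentence
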
Conclusion
Based on the results, the ONNX-converted model produces sentence embeddings in significantly less time, and its outputs match the PyTorch outputs to within floating-point rounding (differences on the order of 1e-7 in the values shown above). The experiments were conducted on Google Colab with a T4 GPU. Results on other hardware may differ, so it is worth re-running the benchmark on your target machine.
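To put a number on the speedup, the ratio of the reported latencies can be computed directly. This short sketch uses only the four averages printed in the runs above; the dictionary layout and the `speedup` helper are illustrative names.

```python
# Average per-sentence latencies in milliseconds, as reported in the runs above.
latency_ms = {
    ("cpu", "pytorch"): 34.2605,
    ("cpu", "onnxruntime"): 15.5696,
    ("cuda", "pytorch"): 6.737,
    ("cuda", "onnxruntime"): 1.9466,
}


def speedup(device):
    """Ratio of PyTorch latency to ONNX Runtime latency on one device."""
    return latency_ms[(device, "pytorch")] / latency_ms[(device, "onnxruntime")]


for device in ("cpu", "cuda"):
    print(f"{device}: {speedup(device):.2f}x faster with ONNX Runtime")
# cpu: 2.20x faster with ONNX Runtime
# cuda: 3.46x faster with ONNX Runtime
```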
In ONNX, we can also have a quantized version of SBERT, which stores its weights as int8. That is worth exploring as well. The Jupyter notebook with the complete experiments is available in the GitHub repo below for further reference.
GitHub – SP2203/onnx-sbert
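For the quantized variant mentioned above, the basic idea of int8 quantization can be illustrated with a few lines of plain Python. This is only a sketch of symmetric per-tensor quantization under simplifying assumptions; it is not how ONNX Runtime implements quantization (its `onnxruntime.quantization` module provides `quantize_dynamic` for real models), and `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half the scale. That is why a quantized model is smaller but may differ slightly in its outputs, so its embeddings should be compared against the original before use.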
References
- https://onnxruntime.ai/
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://github.com/microsoft/onnxruntime/tree/main