An Introduction to Using NVIDIA’s NIM API
Last Updated on July 21, 2024 by Editorial Team
Author(s): Harpreet Sahota
Originally published on Towards AI.
This is THE hands-on coding tutorial using the NIM API you’ve been looking for!
I recently got a chance to hack around with NVIDIA’s NIM API (that’s a lot of capital letters in a row), and I gotta say…it’s actually pretty dope.
NIM, short for NVIDIA Inference Microservices, basically helps you run models how you want to without relying on third parties. And it does this by making it easy to:
- Deploy models on your infrastructure.
- Serve these models for multi-user inference.
- Run models efficiently by making your NVIDIA GPUs go brrrrr.
- Maintain control over your model deployment and customization without relying on third-party APIs.
- Integrate into existing applications (this is because it has an OpenAI API-compatible server).
Alright, so what is a NIM?
It’s basically a Docker container with three main components:
- A server layer that provides an API for external interactions
- A runtime layer that manages model execution
- A model “engine” that contains the model weights and execution information
In this tutorial, we won’t be working with an actual NIM container or Docker. Instead, I’ll show you how to use the NIM API for text generation, video generation, and visual question-answering tasks.
I won’t get into the technical details of the models I’m using, as my main goal with this post is to help you get started using the NIM API as quickly as possible. To get started, you’ll need to sign up for a NIM API key, which you can do here. It’s absolutely free to sign up for the API; no credit card is required, and you get 1000 credits right off the bat.
Full disclosure: I’m part of NVIDIA’s “influencer” program. I don’t get paid any cash money from them, but they hook me up with credits to their API, plus send GPU hardware my way in exchange for reviewing their products and spreading the word about it to the community. By signing up using my link, all you’re doing is signaling to them that they should continue to send me GPUs. Honestly, this isn’t too bad of a deal, considering you’ll also get 1000 credits for the API!
Once you’ve signed up for an API key, go ahead and run the code below so you can start hacking with me in this tutorial.
👨🏽💻 Let’s code!
import getpass
import os
nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
os.environ["NVIDIA_API_KEY"] = nvidia_api_key
I’m taking a minimalist approach in this tutorial; we’re going to call the API using nothing but the requests library.
The NIM API integrates with LangChain and LlamaIndex and is compatible with the OpenAI API. NVIDIA has put together a repository with examples that you can use after going through this basic tutorial.
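Because the server speaks the OpenAI API, you could also point the official openai Python client at it instead of using requests. Here’s a minimal sketch of what that looks like (assuming you’ve installed the openai package; the model string is one we’ll use later in this tutorial):

from openai import OpenAI
import os

# The NIM chat endpoint is OpenAI-compatible, so the standard client works
# once you point base_url at NVIDIA's API and pass your NIM API key.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)

We won’t use this client in the rest of the tutorial, but it’s handy if you want to drop the NIM API into an app that already uses the OpenAI SDK.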
Below is a helper function we’ll use throughout the tutorial.
import requests
import base64

from IPython.display import HTML, display
def call_nim_api(endpoint, payload, headers=None, api_key=nvidia_api_key):
    """
    Make a POST request to an NVIDIA NIM API endpoint.

    Args:
        endpoint (str): Full URL of the API endpoint to call.
        payload (dict): The complete JSON payload for the API request.
        headers (dict, optional): Custom request headers. Defaults to a
            Bearer-token header built from api_key.
        api_key (str): NVIDIA API key for authentication.

    Returns:
        dict: JSON response from the API.

    Raises:
        requests.HTTPError: If the API request fails.
    """
    DEFAULT_HEADERS = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }

    if headers is None:
        headers = DEFAULT_HEADERS

    response = requests.post(
        endpoint,
        headers=headers,
        json=payload
    )

    response.raise_for_status()
    return response.json()
Large Language Models Endpoint
I typically hack around with “small” language models, in the 7–13 billion parameter range, since that’s what the hardware I have available can handle. But since you get 1000 credits right off the bat when you sign up for the API, I took this as an opportunity to play around with some massive language models that I’d otherwise never get to touch.
Here’s what I chose to play around with: Nemotron-4 340B Instruct, Snowflake Arctic, Yi-Large, and Mixtral 8x22B Instruct (the model strings for each appear later in this section).
For this overview, I’m selecting one prompt from the IFEval dataset. I encourage you to try your own prompts, or some prompts from here.
PROMPT = """"The man was arrested for stealing a car. He was later released on bail."
Expand on it angrily in 90s gangster rap style, and make sure there are exactly 4 verses and a hook.
Separated the sections by the markdown divider: ***
"""
messages = [
    {
        "role": "user",
        "content": PROMPT
    }
]
def create_llm_payload(model, messages=messages, **kwargs):
    """
    Create a payload for an NVIDIA NIM API call to an LLM.

    Args:
        model (str): The model to use.
        messages (list): List of message dictionaries.
        **kwargs: Arbitrary keyword arguments for additional payload parameters.

    Returns:
        dict: The payload for the API call.
    """
    # Default sampling parameters
    default_params = {
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
        "stream": False
    }

    # Update default parameters with any provided kwargs
    default_params.update(kwargs)

    # Create the payload
    payload = {
        "model": model,
        "messages": messages,
        **default_params
    }

    return payload
def print_response(response):
    """
    Print the content of the response from an LLM API call.

    Args:
        response (dict): The response dictionary from the API call.

    Returns:
        None
    """
    try:
        content = response['choices'][0]['message']['content']
        print(content)
    except KeyError:
        print("Error: Unable to find the expected content in the response.")
    except Exception as e:
        print(f"An error occurred while trying to print the response: {str(e)}")
You can call the endpoint like so:
LLM_ENDPOINT = "https://integrate.api.nvidia.com/v1/chat/completions"
NEMOTRON = "nvidia/nemotron-4-340b-instruct"
response = call_nim_api(LLM_ENDPOINT, create_llm_payload(model=NEMOTRON))
print_response(response)
This produces a pretty decent output!
***
(Verse 1)
Yo, check it, this tale of a man, livin' life on the edge,
Stealin' cars for a living, never had no privilege.
Cops on his tail, but he's too slick, too quick,
Finally caught him, threw him in the back, no click-click.
(Verse 2)
Down at the station, they book him, no respect,
Just another criminal, no need for no detect.
But this man's got heart, won't let them break his spirit,
Bails out, back on the streets, same old merit.
(Hook)
We ride, we slide, we hide, we seek,
In the shadows of the night, no time for the weak.
The man, the myth, the legend, they can't keep him down,
In this game of life, he wears the crown.
***
(Verse 3)
Now he's out, but the heat's still on,
Gotta watch his back, can't trust no one.
But he's a survivor, a true gangster, you see,
In this world of deceit, he's as real as can be.
(Verse 4)
So here's to the man, the one they couldn't cage,
In this rap game of life, he's on the front page.
Stealin' cars was his sin, but he's more than that,
A symbol of resilience, in a world that's whack.
(Hook)
We ride, we slide, we hide, we seek,
In the shadows of the night, no time for the weak.
The man, the myth, the legend, they can't keep him down,
In this game of life, he wears the crown.
***
Remember, this is just a creative expression and does not promote or glorify criminal activities. It's important to respect the law and others' property.
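Since the response follows the OpenAI chat-completion schema, you can also peek at token accounting alongside the message content. This is a quick check rather than a guarantee that every model returns the same fields:

# OpenAI-style responses include a `usage` block with prompt/completion token counts
print(response.get("usage"))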
To keep this tutorial as short as possible, I won’t share the output from the other models I hacked around with. Making generations is straightforward: all you have to do is change the model string to whatever model you want to use, for example:
ARCTIC = "snowflake/arctic"
YI_LARGE = "01-ai/yi-large"
MIXTRAL = "mistralai/mixtral-8x22b-instruct-v0.1"
There are a lot of other models you can play around with; check out the API reference for more details, including the arguments you can pass to manipulate the model’s output.
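As a quick example of those arguments, the **kwargs on create_llm_payload let you override any of the defaults we set earlier:

# Same prompt through Mixtral, with a higher temperature and a longer output budget
response = call_nim_api(
    LLM_ENDPOINT,
    create_llm_payload(model=MIXTRAL, temperature=0.9, max_tokens=2048),
)
print_response(response)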
I had a blast playing around with these LLMs, especially since I couldn’t otherwise run models this large. Thanks, NVIDIA, for hosting these and also making inference with them pretty damn fast!
Visual Models
The Visual Models endpoint has some standard diffusion models, like various flavors of Stable Diffusion such as SDXL. It also has some of NVIDIA’s specialized models like RetailObjectDetection and OCRNet.
I took this opportunity to play around with Stable Video Diffusion.
Stable Video Diffusion (SVD) is a generative model synthesizing 25-frame video sequences at 576×1024 resolution from a single input image. It uses diffusion-based generation to gradually add details and noise over multiple steps, creating short video clips with customizable frame rates and optional micro-conditioning parameters.
The version of the model available via the NIM API is SVD XT, an image-to-video model (no text prompt). Feel free to use your own images; just note that your image must be smaller than 200KB. Otherwise, it must be uploaded to a presigned S3 bucket using the NVCF Asset APIs.
To start with, here’s a picture of Winnipeg.
You can download the image like so:
!wget https://weexplorecanada.com/wp-content/uploads/2023/05/Things-to-do-in-Winnipeg-Twitter.jpg
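If you want to sanity-check the 200KB limit mentioned above before encoding, a quick check like this works:

import os

# Path matches where wget saves the file in Colab
image_path = "/content/Things-to-do-in-Winnipeg-Twitter.jpg"
size_kb = os.path.getsize(image_path) / 1024
print(f"Image size: {size_kb:.1f} KB")  # needs to be under 200 KB to send inline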
Below are some helper functions to convert and work with images in base64.
import base64

def image_to_base64(image_path):
    """
    Encode an image into base64 format.

    Args:
        image_path: The path to the image file.

    Returns:
        A base64-encoded string of the image.
    """
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()

    encoded_string = base64.b64encode(image_bytes).decode()
    return encoded_string
def save_base64_video_as_mp4(base64_string, output_mp4_path):
    """
    Save a base64-encoded video as an MP4 file.

    Args:
        base64_string (dict): The API response containing the base64-encoded
            video under the 'video' key.
        output_mp4_path (str): The path where the output MP4 should be saved.

    Returns:
        None
    """
    try:
        # Decode the base64 string
        video_data = base64.b64decode(base64_string['video'])

        # Write the binary data to an MP4 file
        with open(output_mp4_path, "wb") as mp4_file:
            mp4_file.write(video_data)

        print(f"MP4 video saved successfully at {output_mp4_path}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
def play_base64_video(base64_string, video_type="mp4"):
    """
    Play a base64-encoded video in a Colab notebook.

    Args:
        base64_string (dict): The API response containing the base64-encoded
            video under the 'video' key.
        video_type (str, optional): The video format (e.g., 'mp4', 'webm'). Defaults to 'mp4'.

    Returns:
        None
    """
    base64_string = base64_string['video']

    # Ensure the base64 string doesn't have the data URI prefix
    if base64_string.startswith('data:video/'):
        # Extract the actual base64 data
        base64_string = base64_string.split(',')[1]

    # Create the HTML video tag
    video_html = f'''
    <video width="640" height="480" controls>
        <source src="data:video/{video_type};base64,{base64_string}" type="video/{video_type}">
        Your browser does not support the video tag.
    </video>
    '''

    # Display the video
    display(HTML(video_html))
This function will create the payload for an image with or without a prompt:
def create_image_payload(image_b64, image_format='jpeg', prompt=None):
    """
    Create a payload with a base64-encoded image, with or without a prompt.

    Args:
        image_b64 (str): The base64-encoded image string (without the data URI prefix).
        image_format (str, optional): The format of the image. Accepted formats are jpg, png, and jpeg.
        prompt (str, optional): The prompt to include before the image. Default is None.

    Returns:
        str: The constructed payload.
    """
    # Ensure the image_b64 doesn't already have the data URI prefix
    if not image_b64.startswith('data:image/'):
        image_b64 = f"data:image/{image_format};base64,{image_b64}"

    if prompt:
        # Scenario with a prompt: inline the image tag after the prompt text
        return f'{prompt} <img src="{image_b64}" />'
    else:
        # Scenario without a prompt
        return image_b64
Let’s convert the image to base64:
winnipeg = image_to_base64("/content/Things-to-do-in-Winnipeg-Twitter.jpg")
Note that the cfg_scale guides how strongly the generated video sticks to the original image. Use lower values to allow the model more freedom to make changes and higher values to correct motion distortions.
SVD_ENDPOINT = "https://ai.api.nvidia.com/v1/genai/stabilityai/stable-video-diffusion"
winnipeg_payload = create_image_payload(winnipeg, image_format='jpeg', prompt=None)
payload = {
    "image": winnipeg_payload,
    "cfg_scale": 2.42,  # must be <= 9
    "seed": 51
}
winnipeg_video = call_nim_api(endpoint = SVD_ENDPOINT, payload = payload)
play_base64_video(winnipeg_video)
Here’s the result:
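If you also want to keep the clip, you can write it to disk with the helper defined earlier:

# Persist the generated clip alongside playing it in the notebook
save_base64_video_as_mp4(winnipeg_video, "winnipeg_svd.mp4")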
Vision-Language Models
The NIM API has about 10 vision-language (aka “multimodal”) models available.
I’ve hacked around with all the ones here locally, but the inference speed via the NIM was quite nice. What caught my eye, though, is the NeVA22B model. NeVA is NVIDIA’s version of the LLaVA model, where they replaced the open-source LLaMA model with a GPT model trained by NVIDIA. In this approach, the image is encoded using a frozen Hugging Face CLIP model and combined with the prompt embeddings before passing through the language model.
This was a fun model to hack around with. It’s quite good and has a bit of a different “personality” than the LLaVA models I’ve hacked with. Those models were trained with either Vicuna, Mistral, or Hermes LLMs, while NeVA uses an LLM trained by NVIDIA. Sadly, I couldn’t find much info (or a paper) about NeVA online.
Here’s the image I passed to NeVA:
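The image gets encoded the same way as the Winnipeg photo. Here’s a sketch, assuming you’ve saved it locally (the filename below is a placeholder for wherever you put your own image):

# Hypothetical local path; point this at the 49ers image (or any image under 200KB)
niners = image_to_base64("/content/niners.jpg")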
And here’s the code for generation:
NEVA22B_ENDPOINT = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"
message_content = create_image_payload(
    image_b64=niners,
    image_format='jpeg',
    prompt="Describe, as a rap in the style of Kendrick Lamar, what you see in this scene. Say 'Compton' and 'Bay Area' at least once each"
)

payload = {
    "messages": [{"role": "user", "content": message_content}],
    "max_tokens": 512,
    "temperature": 1.00,
    "top_p": 0.70,
    "stream": False
}
response = call_nim_api(endpoint = NEVA22B_ENDPOINT, payload = payload)
print_response(response)
This gave a surprisingly good result; the model discerned that this is the SF 49ers football team:
(Verse 1)
Compton, Bay Area, where I'm from
The gridiron, the field, the sun
Red and gold, my team, the 49ers
Feelin' the heat, we're down to ten seconds
(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making a sound
Compton, Bay Area, my roots run deep
I'm playin' for the team, I'm never gonna sleep
(Verse 2)
I'm in the pocket, the clock's tickin' away
The team's dependin' on me, it's a big day
I throw the ball, it's catchin' in the air
Compton, Bay Area, I'm livin' my dream, no fear, no care
(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making a sound
Compton, Bay Area, my roots run deep
I'm playin' for the team, I'm never gonna sleep
(Verse 3)
The crowd's amped up, the energy's high
Compton, Bay Area, I'm feelin' alive
The game's on the line, the pressure's intense
But I'm ready, I'm comin' in for the entrance
(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making a sound
Compton, Bay Area, my roots run deep
I'm playin' for the team, I'm never gonna sleep
(Outro)
Compton, Bay Area, it's all about the game
I'm livin' my dream, I'm ready to claim
The title, the trophy, the top of the charts
Compton, Bay Area, it's all in my heart.
The NIM API also has various models related to Healthcare.
I didn’t hack around with any of these models, but my teammate at Voxel51 (Dan Gural) wrote an awesome blog on Segment Anything in a CT Scan with NVIDIA VISTA-3D, which I recommend checking out.
Final thoughts
It’s cool to see NVIDIA entering the API game.
They’ve got some great models in their model zoo, and I can only see them adding more over the coming months. The biggest thing that stands out to me is the speed. It’s super impressive!
👨🏽💻 I have this post available as a notebook here.
Published via Towards AI