
How NVIDIA NIM Can Revolutionize the Deployment of Generative AI Applications

Last Updated on July 3, 2024 by Editorial Team

Author(s): Suhaib Arshad

Originally published on Towards AI.

(Image source)

There has been a drastic increase in the number of generative AI products since the debut of ChatGPT in 2022. The field has seen exponential development in recent years, from LLMs to text-to-image models and upcoming text-to-video models, with each industry finding a wide variety of applications ranging from code generation to text and image generation. It has become a critical task for enterprises to think about how they are going to adopt these complex AI systems into their existing infrastructure.

Deploying these powerful models into production environments is neither easy nor time-efficient. As we head toward 2025, companies have to move beyond calling APIs of pretrained large language models and think seriously about deploying full-scale models into production environments.


Enterprises need control over things like logging, monitoring, and security, while also striving to integrate AI into their established infrastructure. Building everything in-house may not be a feasible solution for many of these enterprises, as it requires specialized knowledge, tools, and resources they may not have. This is where NVIDIA NIM comes into the picture.

What is NVIDIA NIM?

(source)

NVIDIA Inference Microservices (NIM): in simple terms, NIM is a collection of cloud-native microservices that help deploy generative AI models on GPU-accelerated workstations, cloud environments, and data centers. They reduce the overall time it takes for a generative model to reach the market, streamlining the entire process from development to production for enterprises.

NVIDIA NIM is a component of NVIDIA's AI infrastructure that provides an end-to-end solution: it makes ready-to-use AI models accessible, packages them, and deploys them into NVIDIA-accelerated production environments. For example, the Llama 3 model is optimized specifically to run on NIM, delivering accelerated inference performance.

The new wave of generative AI applications has added complexity to LLM architectures, such as multimodal LLMs equipped with varied capabilities spanning text, image, video, and audio generation, and many permutations in between. Hence NVIDIA, as a market leader in accelerated computing, has stepped in with NIM to help developers not only build endpoints but also deploy these generative AI applications in a standardized and scalable way.

Llama 3 8B NIM model on Hugging Face attains 3x throughput (source)

NIM also enables enterprises to maximize their infrastructure investments. For example, running Meta Llama 3–8B in a NIM produces up to 3x more generative AI tokens on accelerated infrastructure than without NIM. This lets enterprises boost efficiency and use the same amount of compute infrastructure to generate more responses.

(source)

Benefits of NVIDIA NIM

➡ Operate from anywhere

The model can be deployed across various infrastructures, such as cloud, local on-prem data centers, and local workstations, including but not limited to NVIDIA RTX, NVIDIA DGX, and more.

➡ Simple implementation via APIs

Developing AI applications becomes much easier with simple API calls, making this suitable for rolling out fast-paced, scalable AI features inside your organization.

➡ Use models specific to a domain

NVIDIA NIM offers a wide range of domain-specific solutions spanning language, images, audio, healthcare, and more, providing optimized performance for your particular use case.

➡ Leveraging Inference Engines to Provide a Better User Experience

NIM uses inference engines tuned for each model and hardware configuration. This not only delivers lower latency and higher performance in accelerated environments, but also lowers the cost of optimizing models on proprietary data sources. Additionally, NIM-deployed applications can scale dynamically to meet the changing needs of the enterprise and handle increased workloads over time.

➡ Open-source, ready-to-use AI models

NIM supports community AI models such as Llama 3, Mistral, Gemma, and other open-source models.

Clearing the Air: NVIDIA NeMo vs. NIM

NVIDIA NeMo is an end-to-end platform designed for developing customized generative AI solutions. It covers training of state-of-the-art AI models, including large language models (LLMs) and audio, vision, and multimodal models.

Image Source

NVIDIA NIM simplifies the process of building a large-scale AI application, from infrastructure optimization to application deployment. Using pre-built containers and industry-standard APIs, enterprises can run AI applications in an accelerated environment, enabling faster and more complex inference.

How to integrate NIM with your applications

Here is the complete notebook link with all the use cases and examples🥂

Step 1: Create an account and sign in

Visit https://build.nvidia.com/explore/discover?signin=true and create an account to get 1,000 free credits.

Step 2: Choose from the available LLM models

Step 3: Click on Get API Key, then save the key for later use.

Now that we have the API key, let’s get started.

Install the required packages

! pip install langchain langchain-nvidia-ai-endpoints openai langchain-community langchain-qdrant langchainhub sentence-transformers

Set up the API key.

from google.colab import userdata
import os

# Read the key stored in Colab secrets and expose it as an environment variable
os.environ['NVIDIA_API_KEY'] = userdata.get('NVIDIA_API_KEY')

Before we start with LangChain, it’s worth noting that you can directly access NVIDIA AI Foundation Endpoints through the OpenAI package. This integration allows you to seamlessly leverage NVIDIA’s AI capabilities.

from openai import OpenAI

client = OpenAI(
    base_url="http://nim-address:8000/v1"
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": ""}],  # just write your prompt here
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
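
If you want to call NVIDIA's hosted API catalog instead of a self-hosted NIM address, the same OpenAI client works. The following is a minimal sketch, assuming the catalog's OpenAI-compatible base URL (https://integrate.api.nvidia.com/v1) and reusing the NVIDIA_API_KEY set above; adjust the model name as needed.

import os
from openai import OpenAI

# Hosted API catalog endpoint (assumed base URL); reuses the NVIDIA_API_KEY set earlier
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Explain NVIDIA NIM in two sentences."}],
    temperature=0.5,
    max_tokens=256,
)
print(completion.choices[0].message.content)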

You can use ChatNVIDIA.get_available_models() to check the available models.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

ChatNVIDIA.get_available_models()

Output:

[Model(id='mistralai/mistral-large', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-mistral-large']),
Model(id='meta/codellama-70b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-codellama-70b', 'playground_llama2_code_70b', 'llama2_code_70b', 'playground_llama2_code_34b', 'llama2_code_34b', 'playground_llama2_code_13b', 'llama2_code_13b']),
Model(id='writer/palmyra-med-70b-32k', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-palmyra-med-70b-32k']),
Model(id='nvidia/neva-22b', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b', aliases=['ai-neva-22b', 'playground_neva_22b', 'neva_22b']),
Model(id='meta/llama3-70b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-llama3-70b']),
Model(id='ibm/granite-8b-code-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-granite-8b-code-instruct']),
Model(id='mediatek/breeze-7b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-breeze-7b-instruct']),
Model(id='google/recurrentgemma-2b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-recurrentgemma-2b']),
Model(id='microsoft/phi-3-small-128k-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-phi-3-small-128k-instruct']),
Model(id='snowflake/arctic', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-arctic']),
Model(id='seallms/seallm-7b-v2.5', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-seallm-7b']),
Model(id='microsoft/phi-3-small-8k-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-phi-3-small-8k-instruct']),
Model(id='upstage/solar-10.7b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-solar-10_7b-instruct']),
Model(id='mistralai/mixtral-8x7b-instruct-v0.1', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-mixtral-8x7b-instruct', 'playground_mixtral_8x7b', 'mixtral_8x7b']),
Model(id='liuhaotian/llava-v1.6-mistral-7b', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/stg/vlm/community/llava16-mistral-7b', aliases=['ai-llava16-mistral-7b', 'community/llava16-mistral-7b', 'liuhaotian/llava16-mistral-7b']),
Model(id='aisingapore/sea-lion-7b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-sea-lion-7b-instruct']),
Model(id='liuhaotian/llava-v1.6-34b', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/stg/vlm/community/llava16-34b', aliases=['ai-llava16-34b', 'community/llava16-34b', 'liuhaotian/llava16-34b']),
Model(id='microsoft/phi-3-vision-128k-instruct', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/microsoft/phi-3-vision-128k-instruct', aliases=['ai-phi-3-vision-128k-instruct']),
Model(id='microsoft/phi-3-mini-4k-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-phi-3-mini-4k', 'playground_phi2', 'phi2']),
Model(id='google/gemma-7b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-gemma-7b', 'playground_gemma_7b', 'gemma_7b']),
Model(id='microsoft/phi-3-mini-128k-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-phi-3-mini']),
Model(id='adept/fuyu-8b', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/adept/fuyu-8b', aliases=['ai-fuyu-8b', 'playground_fuyu_8b', 'fuyu_8b']),
Model(id='google/deplot', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/google/deplot', aliases=['ai-google-deplot', 'playground_deplot', 'deplot']),
Model(id='mistralai/mistral-7b-instruct-v0.2', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-mistral-7b-instruct-v2', 'playground_mistral_7b', 'mistral_7b']),
Model(id='google/gemma-2b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-gemma-2b', 'playground_gemma_2b', 'gemma_2b']),
Model(id='meta/llama2-70b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-llama2-70b', 'playground_llama2_70b', 'llama2_70b', 'playground_llama2_13b', 'llama2_13b']),
Model(id='google/codegemma-7b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-codegemma-7b']),
Model(id='google/codegemma-1.1-7b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-codegemma-1.1-7b']),
Model(id='ibm/granite-34b-code-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-granite-34b-code-instruct']),
Model(id='microsoft/kosmos-2', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/microsoft/kosmos-2', aliases=['ai-microsoft-kosmos-2', 'playground_kosmos_2', 'kosmos_2']),
Model(id='meta/llama3-8b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-llama3-8b']),
Model(id='google/paligemma', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/google/paligemma', aliases=['ai-google-paligemma']),
Model(id='microsoft/phi-3-medium-4k-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-phi-3-medium-4k-instruct']),
Model(id='writer/palmyra-med-70b', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-palmyra-med-70b']),
Model(id='mistralai/mixtral-8x22b-instruct-v0.1', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-mixtral-8x22b-instruct']),
Model(id='databricks/dbrx-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-dbrx-instruct'])]

Text Generation Task:

Here we will assess the model's ability to generate text from an input prompt; applications like ChatGPT and Claude excel at this. So, we will test out the NVIDIA-powered text generation models.

Now, let's see how NVIDIA AI Foundation Endpoints perform on a standard text generation task with LangChain.

# Import the ChatNVIDIA class from the langchain_nvidia_ai_endpoints package
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Create an instance of the ChatNVIDIA class, configuring it with the specified model and parameters
llm = ChatNVIDIA(
    model="meta/llama3-8b-instruct",  # Specify the model to use for generating the text
    temperature=0.3,                  # Low temperature for more deterministic output
    max_tokens=1024                   # Limit the response to a maximum of 1024 tokens
)

# Invoke the model with a prompt about the theory of relativity
result = llm.invoke("Write the mathematical formula for the theory of relativity?")

# Print the content of the result
print(result.content)

All we have to do here is define the model parameter in the ChatNVIDIA function and invoke it with our prompt.

The following block has been generated by AI in order to evaluate its generative capabilities

Output:

A bold question!

The theory of relativity, developed by Albert Einstein, is a complex and multifaceted concept that encompasses two main parts: special relativity and general relativity. While there isn't a single, concise formula that summarizes the entire theory, I can provide you with some key equations that represent the core ideas.

**Special Relativity (1905)**

1. **Time dilation**: The Lorentz transformation for time dilation:

t' = γ(t - vx/c^2)

where t' is the time measured by an observer in motion, t is the time measured by an observer at rest, v is the relative velocity, x is the position, c is the speed of light, and γ is the Lorentz factor: γ = 1 / sqrt(1 - v^2/c^2).

2. **Length contraction**: The Lorentz transformation for length contraction:

L' = L / γ

where L' is the length measured by an observer in motion, L is the length measured by an observer at rest, and γ is the Lorentz factor.

3. **Relativistic mass-energy equivalence**: The famous equation:

E^2 = (pc)^2 + (mc^2)^2

where E is the total energy, p is the momentum, m is the rest mass, and c is the speed of light.

**General Relativity (1915)**

1. **Einstein's field equations**: The fundamental equations that describe the curvature of spacetime:

Rμν - 1/2Rgμν = (8πG/c^4)Tμν

where Rμν is the Ricci tensor, R is the Ricci scalar, gμν is the metric tensor, G is the gravitational constant, c is the speed of light, and Tμν is the stress-energy tensor.

2. **Geodesic equation**: The equation that describes the motion of objects in curved spacetime:

d^2x^μ/dλ^2 + Γμνρ dx^ν/dλ dx^ρ/dλ = 0

where x^μ is the position, λ is the affine parameter, and Γμνρ is the Christoffel symbol.

These equations form the foundation of the theory of relativity, but keep in mind that they are just a few examples of the many mathematical concepts and formulas that are used to describe the theory.
# Asking it to generate a new song to test its creativity
result = llm.invoke("Write a song about theory of relativity?")

# Print the content of the result
print(result.content)

The following block has been generated by AI in order to evaluate its generative capabilities

Output:

What a fascinating topic! Here's a song about the theory of relativity:

**Title:** "Time and Space"

**Verse 1:**
In the realm of physics, a revolution's born
Einstein's insight, a new dawn is sworn
Time and space, no longer the same
Relative truths, in a world of flame

**Chorus:**
Time and space, a dance we play
Dependent on speed, in a relativistic way
The faster we move, the slower it goes
In the theory of relativity, our minds explode

**Verse 2:**
The speed of light, a constant guide
The fabric of spacetime, where gravity resides
Mass and energy, intertwined they stand
The equivalence principle, a fundamental hand

**Chorus:**
Time and space, a dance we play
Dependent on speed, in a relativistic way
The faster we move, the slower it goes
In the theory of relativity, our minds explode

**Bridge:**
Imagine a clock, ticking away
On a train, moving at incredible sway
From a stationary view, it seems to slow
But for the traveler, time flows, don't you know

**Chorus:**
Time and space, a dance we play
Dependent on speed, in a relativistic way
The faster we move, the slower it goes
In the theory of relativity, our minds explode

**Outro:**
In the realm of relativity, we find our place
A world of wonder, where time and space entwine
The theory of relativity, a masterpiece of mind
A dance of time and space, forever aligned.

Please note that this is a creative interpretation of the theory of relativity, and not a scientifically accurate representation. The song aims to capture the essence of the theory in a lyrical and melodic form, rather than providing a detailed explanation of the mathematical concepts involved.

Code Generation Task:

Now let's move on to assessing the model's ability to generate code from an input prompt. Foundation models like Code Llama are particularly strong at code generation tasks, so we will test the efficiency of the NVIDIA-powered code generation models.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia_ai_endpoints import ChatNVIDIA

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert coding assistant. Strictly follow Python syntax; do not use external libraries; provide only the correct answer and refrain from explanations.",
        ),
        ("user", "{input}"),
    ]
)

chain = prompt | ChatNVIDIA(model="meta/codellama-70b") | StrOutputParser()

leetcode_problem = '''
Solve this problem: https://leetcode.com/problems/median-of-two-sorted-arrays/description/
'''

for txt in chain.stream({"input": leetcode_problem}):
    print(txt, end="")

As you can see, I have defined the codellama-70b model via the ChatNVIDIA function and provided it with the link to a LeetCode problem; it will read through the problem description and produce the output below.

Output:

class Solution:
    def findMedianSortedArrays(self, nums1: List[int], nums2: List[int]) -> float:
        def get_median(x):
            n = len(x)
            return (x[n//2] + x[n//2 - 1]) / 2 if n % 2 == 0 else x[n//2]

        x = nums1 + nums2
        x.sort()
        return get_median(x)
Passing both test cases on leetcode (source)

Multimodal Capabilities

Multimodal AI is the ability of generative AI (Gen AI) models to understand, process, and generate various types of content, including images, videos, text, audio, and more. By leveraging multiple modalities, AI can attain a more comprehensive understanding of how the world works.

The way I think of it: we humans have five faculties (hearing, speaking, seeing, smelling, and touching). AI has already achieved the first three; the latter two are yet to be achieved on the path toward AGI 😱.

Coming back to NVIDIA NIM, here are the kinds of models you can run inference on:

  • 📸 → 📄 Image-to-text generation
  • 📷 → 🎥 Image-to-video generation
  • 📄 → 📷 Text-to-image generation
  • 📷 → 📅 Image-to-tables generation

📸 → 📄 Image-to-text

import IPython
import requests

image_url = "https://th.bing.com/th/id/OIP.AvJzGpdSlqXmr1hNmoGiGgHaEo?rs=1&pid=ImgDetMain"
image_content = requests.get(image_url).content

IPython.display.Image(image_content)

from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

mllm = ChatNVIDIA(model="nvidia/neva-22b")

mllm.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text": "Can you name a movie based on this image"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]
        )
    ]
)

Output:

Yes, the image of the orange and white clownfish swimming near a 
coral reef could remind you of the movie "Finding Nemo". The movie features
a clownfish named Nemo who gets captured by a diver and separated from
his father, Marlin. Marlin goes on a journey to find his son, encountering
various marine creatures and obstacles along the way. The image of the
clownfish swimming near the coral reef could evoke the sense of adventure and
exploration that the movie "Finding Nemo" represents.

📷 → 🎥 Image-to-video generation

import requests

invoke_url = "https://ai.api.nvidia.com/v1/genai/stabilityai/stable-video-diffusion"

headers = {
    "Authorization": f"Bearer {userdata.get('image_nim_keu')}",
    "Accept": "application/json",
}

payload = {
    "image": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAEElEQVR4nGK6HcwNCAAA//8DTgE8HuxwEQAAAABJRU5ErkJggg==",
    "cfg_scale": 2.5,
    "seed": 0
}

response = requests.post(invoke_url, headers=headers, json=payload)

response.raise_for_status()
response_body = response.json()
print(response_body)
Input Image
import base64

# Assuming response_body contains the base64-encoded video data
response_body = {
    'artifacts': [
        {
            'base64': f'{response_body["video"]}'
        }
    ]
}

# Decode the base64 string
video_data = base64.b64decode(response_body['artifacts'][0]['base64'])

# Define the filename for the video
filename = 'generated_video.mp4'

# Write the decoded data to the video file
with open(filename, 'wb') as f:
    f.write(video_data)

print(f"Video saved as {filename}")
output video

📄 → 📷 Text-to-Image Generation

import requests

invoke_url = "https://ai.api.nvidia.com/v1/genai/stabilityai/sdxl-turbo"

headers = {
    "Authorization": f"Bearer {userdata.get('image_nim_keu')}",
    "Accept": "application/json",
}

payload = {
    "text_prompts": [
        {
            "text": "A motion of titanic sinking in the abyss and helicopters circling the ship",
            "weight": 1
        }
    ],
    "seed": 0,
    "sampler": "K_EULER_ANCESTRAL",
    "steps": 2
}

response = requests.post(invoke_url, headers=headers, json=payload)

response.raise_for_status()
response_body = response.json()
print(response_body)

import base64

imgdata = base64.b64decode(response_body['artifacts'][0]['base64'])
filename = 'some_image.jpg'  # I assume you have a way of picking unique filenames
with open(filename, 'wb') as f:
    f.write(imgdata)

IPython.display.Image(filename)
Output image

📷 → 📅 Image-to-tables

Input image
import requests, base64

invoke_url = "https://ai.api.nvidia.com/v1/vlm/google/deplot"
stream = False

with open("/content/top10hostcitiies.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

assert len(image_b64) < 180_000, \
    "To upload larger images, use the assets API (see docs)"

headers = {
    "Authorization": f"Bearer {userdata.get('image_nim_keu')}",
    "Accept": "text/event-stream" if stream else "application/json"
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": f'Generate underlying data table of the figure below: <img src="data:image/png;base64,{image_b64}" />'
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.20,
    "top_p": 0.20,
    "stream": stream
}

response = requests.post(invoke_url, headers=headers, json=payload)

if stream:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))
else:
    print(response.json())

Output:

Output Table

🔧 Building a RAG-based Streamlit application with ChatNVIDIA

I have made a YouTube tutorial on how the Streamlit application works; feel free to watch it.

Alternatively, go through this tutorial to implement your own RAG application.

Navigate to build.nvidia.com, sign up, and click on Get API Key.

Set the environment variable in a .env file:

NVIDIA_API_KEY="<YOUR_NVIDIA_API_KEY>"

Installing Dependencies

!pip install openai python-dotenv langchain_nvidia_ai_endpoints langchain_community streamlit faiss-cpu pypdf

Here is the full Python script for the Streamlit application:

import os
import streamlit as st
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings  # NVIDIA AI endpoints for the LLM and embeddings
from langchain_community.document_loaders import PyPDFDirectoryLoader  # PDF document loader
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Text splitter for document processing
from langchain.chains.combine_documents import create_stuff_documents_chain  # Create document chains
from langchain.prompts import ChatPromptTemplate  # Template for chat prompts
from langchain.chains import create_retrieval_chain  # Create retrieval chains
from langchain_community.vectorstores import FAISS  # FAISS vector store for embeddings

from dotenv import load_dotenv  # Load environment variables from a .env file
load_dotenv()  # Load environment variables from the .env file

# Set the NVIDIA API key from the environment variable
os.environ["NVIDIA_API_KEY"] = os.getenv("NVIDIA_API_KEY")

# Initialize the ChatNVIDIA language model
llm = ChatNVIDIA(model="meta/llama3-70b-instruct")

# Function to generate vector embeddings for documents
def vector_embeddings():
    if "vectors" not in st.session_state:  # Check if vectors are already in the session state
        st.session_state.embeddings = NVIDIAEmbeddings()  # Initialize NVIDIA embeddings
        st.session_state.loader = PyPDFDirectoryLoader("docs")  # Load PDF documents from the "docs" directory
        st.session_state.docs = st.session_state.loader.load()  # Load documents into session state
        st.session_state.text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)  # Initialize text splitter
        st.session_state.final_documents = st.session_state.text_splitter.split_documents(st.session_state.docs[:30])  # Split documents into chunks
        st.session_state.vectors = FAISS.from_documents(st.session_state.final_documents, st.session_state.embeddings)  # Create FAISS vector store from documents

st.title("Nvidia NIM demo")  # Set the title of the Streamlit app

# Define the chat prompt template
prompt = ChatPromptTemplate.from_template(
    """
    Answer the questions based on the provided context only.
    Please provide the most accurate response based on the question
    <context>
    {context}
    <context>
    Questions: {input}
    """
)

# Input field for the user to enter their question
prompt1 = st.text_input("Enter Your Question From Documents")

# Button to trigger the document embedding process
if st.button("Documents Embedding"):
    vector_embeddings()  # Generate vector embeddings
    st.write("Vector Store DB Is Ready")  # Indicate that the vector store is ready

import time  # Time module for performance measurement

# If a question is entered, process it
if prompt1:
    document_chain = create_stuff_documents_chain(llm, prompt)  # Create document chain for processing
    retriever = st.session_state.vectors.as_retriever()  # Use vectors as retriever
    retrieval_chain = create_retrieval_chain(retriever, document_chain)  # Create retrieval chain
    start = time.process_time()  # Start timer
    response = retrieval_chain.invoke({'input': prompt1})  # Get response from retrieval chain
    print("Response time:", time.process_time() - start)  # Print response time
    st.write(response['answer'])  # Display the answer

    # Streamlit expander for document similarity search
    with st.expander("Document Similarity Search"):
        # Find and display the relevant chunks
        for i, doc in enumerate(response["context"]):
            st.write(doc.page_content)  # Display the page content of each relevant document
            st.write("--------------------------------")  # Separator for clarity

Save it as app.py and then run:

streamlit run app.py

👨‍💻 How to Deploy NVIDIA NIM Locally?

✂️ Deploying Nvidia NIM locally via Docker

(60-second clip, clipped by suhaib hussain, from the original YouTube video "Deploy AI Models to Production with NVIDIA NIM" by Prompt…)
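
Once a NIM container is running locally, it exposes an OpenAI-compatible HTTP server (port 8000 in NVIDIA's examples). The following is a minimal sketch, assuming a Llama 3 8B NIM is already serving on localhost:8000; the address and model name are assumptions to adapt to your own deployment.

import requests

base_url = "http://localhost:8000/v1"  # assumed local NIM address and default port

# List the models served by the local microservice (OpenAI-compatible endpoint)
print(requests.get(f"{base_url}/models").json())

# Send a chat completion request to the local NIM
payload = {
    "model": "meta/llama3-8b-instruct",  # assumed model name for the container being run
    "messages": [{"role": "user", "content": "Hello from a locally deployed NIM!"}],
    "max_tokens": 64,
}
response = requests.post(f"{base_url}/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])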

Summary

With the help of cloud-hosted APIs and the NVIDIA API catalog, developers can evaluate the latest cutting-edge generative AI models. Alternatively, they can download a NIM into their local environment and use a self-hosted architecture for their applications. This cuts down the time and cost it takes to deploy and run the models.

With Kubernetes orchestration, NIM now enables deployment of large-scale, state-of-the-art models across your enterprise. NIM eliminates the need for specialized knowledge or exhaustive fine-tuning/customization efforts and integrates easily with the existing infrastructure of your organization.

From a business perspective, NIM can support new entrants looking to build their AI infrastructure without the hassle of complex containerization, sweeping changes to the company's pipeline, or even a deep understanding of AI models. NIM helps lower the cost of operating and scaling AI-powered infrastructure.


Published via Towards AI
