A Beginner’s Guide to CUDA Programming
Last Updated on January 14, 2025 by Editorial Team
Author(s): Aditya Kumar Manethia
Originally published on Towards AI.
Introduction
In this blog, we’ll explore the basics of CUDA programming, understand how GPUs differ from CPUs, and learn how CUDA enables us to unlock the full potential of GPUs for AI and other computationally intensive applications. Whether you’re a beginner or just curious about how GPUs power modern AI, this guide will help you take your first steps into the world of CUDA programming.
What is CUDA?
Compute Unified Device Architecture (CUDA) is a software platform and programming model created by NVIDIA. It allows users to harness GPUs for tasks beyond rendering graphics, such as scientific simulations, machine learning, and high-performance computing, and it lets you write programs that run on the GPU using languages like C/C++.
Importance of CUDA
GPUs are designed for computationally intensive work and can handle thousands of tasks simultaneously, which makes them perfect for parallel computing. CUDA provides the tools and libraries to write programs that can:
- Accelerate computations by running them on the GPU.
- Handle large datasets.
- Optimize performance for tasks like image processing and deep learning.
What is a GPU, and How is it different from a CPU?
The CPU (Central Processing Unit) is the “brain” of your computer. It is optimized for sequential tasks (one thing at a time) and has a few powerful cores (typically 4–16), which is good for general-purpose tasks like running your operating system or web browser.
The GPU was originally designed for rendering graphics, but modern GPUs are optimized for parallel tasks (many things at once). A GPU has thousands of smaller, less powerful cores (e.g., the NVIDIA H100 has 16,896 CUDA cores), which are ideal for tasks like matrix multiplication, the backbone of AI.
Understanding the GPU Architecture
The GPU is a highly parallel processor. Let’s look at its main components:
1. CUDA Cores
CUDA cores are the basic computational units of a GPU. Each CUDA core is a simple processor capable of performing basic arithmetic operations like addition, subtraction, and multiplication.
Think of CUDA cores as the “workers” in a factory. Each worker handles a small part of the overall task, allowing the GPU to process large amounts of data in parallel.
2. Streaming Multiprocessors (SMs)
CUDA cores are grouped into SMs. Each SM contains:
- CUDA cores: Individual processing units.
- Special Function Units (SFUs): For complex mathematical operations like trigonometric functions.
- Registers: Fast, private memory for each thread.
- Shared Memory: Memory shared by all the threads in the same block.
- Warp Scheduler: Manages the execution of threads in groups called warps.
For example, the NVIDIA RTX 4090 has 128 SMs, and each SM contains 128 CUDA cores.
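If you want to check these numbers for your own card, the CUDA runtime exposes them through cudaGetDeviceProperties. Below is a small standalone sketch (separate from the vector-addition example later in this post) that prints the SM count, warp size, and maximum threads per block of GPU 0:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                   // Properties of GPU 0
    printf("GPU name: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}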
3. Threads and Warps
A thread is the smallest unit of execution in CUDA. Each thread performs a specific task, such as processing one element of an array. Threads are grouped into warps, which are sets of 32 threads that execute the same instruction simultaneously.
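To make the thread-and-warp picture concrete, here is a minimal sketch (the kernel name whoAmI is made up for illustration) in which every thread prints its global index, the warp it belongs to within its block, and its lane inside that warp:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoAmI() {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x; // Unique index across the whole grid
    int warpId = threadIdx.x / warpSize;                   // Which warp inside this block
    int laneId = threadIdx.x % warpSize;                   // Position (0-31) inside that warp
    printf("thread %d -> warp %d, lane %d\n", globalIdx, warpId, laneId);
}

int main() {
    whoAmI<<<1, 64>>>();      // One block of 64 threads = exactly two warps
    cudaDeviceSynchronize();  // Wait for the device-side printf output
    return 0;
}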
4. Memory Hierarchy
The GPU has a complex memory hierarchy designed to balance speed and capacity. The main types of memory are:
a. Global Memory: Accessible by all threads but relatively slow; used to store large datasets. Data must be copied from the CPU to the GPU’s global memory before processing.
b. Shared Memory: Shared by all threads in the same block and much faster than global memory. Ideal for sharing intermediate results between threads.
c. Registers: Private to each thread and the fastest type of memory on the GPU. Used to store variables that are frequently accessed.
d. Constant Memory: Read-only memory shared by all threads, used for storing constants that don’t change during execution.
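To see how these levels fit together, here is a minimal kernel-only sketch (the name blockSum is hypothetical; it assumes a launch with 256 threads per block and an input whose length is an exact multiple of 256, so no bounds check is needed). Each block stages its slice of a global array in shared memory, the per-thread variables live in registers, and one thread per block writes the block’s sum back to global memory:
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                      // Shared memory: visible to all threads in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Register: private to this thread

    tile[threadIdx.x] = in[idx];                     // Global memory -> shared memory
    __syncthreads();                                 // Wait until every thread in the block has written

    if (threadIdx.x == 0) {                          // One thread per block sums its tile
        float sum = 0.0f;                            // Register variable
        for (int i = 0; i < blockDim.x; ++i) {
            sum += tile[i];
        }
        out[blockIdx.x] = sum;                       // Write one value per block back to global memory
    }
}
Shared memory pays off when threads in a block reuse the same data; here it simply illustrates where each kind of variable lives.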
5. Tensor Cores
Tensor Cores are specialized hardware units in NVIDIA GPUs designed to accelerate matrix operations, which are critical for AI and machine learning. For example, Tensor Cores are used in deep learning frameworks like TensorFlow and PyTorch to speed up training and inference.
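In practice, Tensor Cores are usually reached through libraries such as cuBLAS or cuDNN, but CUDA also exposes them directly through the WMMA API in mma.h. The kernel-only sketch below (the name tileMatMul is made up; it assumes a GPU with compute capability 7.0 or newer, compilation with an appropriate -arch flag, and a launch with one full warp of 32 threads) multiplies two 16x16 half-precision tiles and accumulates the result in single precision:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tileMatMul(const half *a, const half *b, float *c) {
    // Fragments describe the per-warp pieces of the 16x16x16 tile operation
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // Start the accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);        // Load the 16x16 A tile (leading dimension 16)
    wmma::load_matrix_sync(bFrag, b, 16);        // Load the 16x16 B tile
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // C = A * B + C, executed on the Tensor Cores
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major); // Write the result tile to global memory
}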
6. Streams
GPUs can work on multiple tasks simultaneously using streams. A stream is a sequence of operations that execute in order on the GPU; by using multiple streams, you can overlap computation with memory transfers to improve performance.
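As a rough illustration (all kernel and variable names here are invented for the sketch), the program below splits the work into two chunks and gives each chunk its own stream, so the host-to-device copy for one chunk can overlap with the kernel running on the other. Pinned host memory from cudaMallocHost is used because asynchronous copies from ordinary pageable memory do not overlap with computation:
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f; // Double each element of this chunk
}

int main() {
    const int n = 1 << 20;
    const int bytes = n * sizeof(float);

    float *h0, *h1, *d0, *d1;
    cudaMallocHost((void**)&h0, bytes); // Pinned host memory for chunk 0
    cudaMallocHost((void**)&h1, bytes); // Pinned host memory for chunk 1
    cudaMalloc((void**)&d0, bytes);
    cudaMalloc((void**)&d1, bytes);
    for (int i = 0; i < n; ++i) { h0[i] = 1.0f; h1[i] = 2.0f; }

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream gets its own copy + kernel; work in different streams can overlap
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    scaleKernel<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);

    cudaStreamSynchronize(s0); // Wait for everything queued in stream 0
    cudaStreamSynchronize(s1); // Wait for everything queued in stream 1

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
    cudaFreeHost(h0);
    cudaFreeHost(h1);
    return 0;
}
Operations queued into the same stream still run in order; only work in different streams is allowed to overlap.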
How CUDA Works
CUDA programs that run on the GPU are written in C/C++ with additional CUDA-specific syntax. The CPU (host) sends data and instructions to the GPU (device), and the GPU performs the calculations in parallel. Let’s walk through the basic steps using the addition of two vectors as an example.
Step 1: Define the Kernel Function
A kernel is a function that runs on the GPU and is defined with the __global__ keyword. The addKernel function below contains the code that will be executed in parallel by many threads; each thread calculates its global index idx to determine which element of the data it will process.
Example:
__global__ void addKernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x; // Calculate global thread index
    if (idx < n) {
        c[idx] = a[idx] + b[idx]; // Perform addition
    }
}
Step 2: Allocating Memory
Before the GPU can process data, memory must be allocated on the GPU (device). This is done using the cudaMalloc function. Example:
float *d_a, *d_b, *d_c; // Device pointers
int size = n * sizeof(float); // Size of memory to allocate
cudaMalloc((void**)&d_a, size); // Allocate memory for array a on the GPU
cudaMalloc((void**)&d_b, size); // Allocate memory for array b on the GPU
cudaMalloc((void**)&d_c, size); // Allocate memory for array c on the GPU
Step 3: Copy Data from Host to Device
Data must be transferred from the CPU (host) to the GPU (device) before the kernel can process it. This is done using the cudaMemcpy function. Example:
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); // Copy array a from host to device
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice); // Copy array b from host to device
Step 4: Launch the Kernel
The kernel is launched from the host using the special CUDA syntax <<<numBlocks, threadsPerBlock>>>, which specifies how many blocks and how many threads per block will execute the kernel. Example:
int threadsPerBlock = 256; // Number of threads per block
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock; // Number of blocks
addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
- numBlocks: the number of blocks in the grid.
- threadsPerBlock: the number of threads in each block.
Each thread in the grid processes a specific part of the data.
To understand this better, think of the GPU as a factory: threads are the individual workers, blocks are groups of workers that work together on a specific task, and the grid is made up of all the blocks.
Step 5: Copy Results from Device to Host
After the kernel finishes, the results are copied back from the GPU (device) to the CPU (host) using cudaMemcpy. Example:
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost); // Copy result array c from device to host
Step 6: Free GPU Memory
Once the computation is complete, the memory allocated on the GPU must be freed using cudaFree. Example:
cudaFree(d_a); // Free memory for array a on the GPU
cudaFree(d_b); // Free memory for array b on the GPU
cudaFree(d_c); // Free memory for array c on the GPU
Now, let’s put all the steps together into a complete CUDA program that adds two vectors:
#include <iostream>
#include <cuda_runtime.h>

// Kernel function to add two vectors
__global__ void addKernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x; // Calculate global thread index
    if (idx < n) {
        c[idx] = a[idx] + b[idx]; // Perform addition
    }
}

int main() {
    const int n = 1 << 20;              // 1 million elements
    const int size = n * sizeof(float); // Size of memory to allocate

    // Host memory
    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];

    // Initialize input vectors
    for (int i = 0; i < n; ++i) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size); // Allocate memory on GPU for vector a
    cudaMalloc((void**)&d_b, size); // Allocate memory on GPU for vector b
    cudaMalloc((void**)&d_c, size); // Allocate memory on GPU for vector c

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify results
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) {
            std::cerr << "Error at index " << i << std::endl;
            break;
        }
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Free host memory
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}
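To try the program yourself (assuming the file is saved as vector_add.cu and the CUDA toolkit is installed), compile it with nvcc vector_add.cu -o vector_add and run the resulting executable. If everything works, the program exits without printing anything, because every element of h_c equals 3.0.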
This is a very basic example; in future posts, we will look at more advanced examples and optimize them as well.
How GPUs Power AI
GPUs are at the heart of modern AI because they can handle the massive computational demands of AI workloads. Here’s how:
- Parallelism: GPUs can process thousands of tasks simultaneously.
- Tensor Cores: Accelerate matrix operations, which are critical for deep learning.
- Scalability: GPUs can be scaled across multiple devices to handle even larger workloads.
- AI Frameworks: Popular frameworks like TensorFlow and PyTorch are optimized for GPUs, making it easy to leverage their power.
Conclusion
By understanding the architecture of GPUs — CUDA cores, SMs, threads, warps, and memory hierarchy — you can unlock their full potential for your applications. This blog is a very introductory and basic guide to CUDA programming, designed to help you take your first steps in understanding how GPUs work and how to write simple CUDA programs. In future blogs, we will dive deeper into more complex examples and explore optimization techniques for real-world applications, such as implementing operations like SoftMax, matrix multiplication, and other advanced algorithms used in AI and scientific computing.
Published via Towards AI