A Beginner’s Guide to CUDA Programming
Last Updated on January 14, 2025 by Editorial Team
Author(s): Aditya Kumar Manethia
Originally published on Towards AI.
Introduction
In this blog, we’ll explore the basics of CUDA programming, understand how GPUs differ from CPUs, and learn how CUDA enables us to unlock the full potential of GPUs for AI and other computationally intensive applications. Whether you’re a beginner or just curious about how GPUs power modern AI, this guide will help you take your first steps into the world of CUDA programming.
What is CUDA?
Compute Unified Device Architecture (CUDA) is a software platform and programming model created by NVIDIA. It allows users to harness GPUs for tasks beyond rendering graphics, such as scientific simulations, machine learning, and high-performance computing, and it lets you write programs that run on the GPU using languages like C/C++.
Importance of CUDA
GPUs are designed for computationally intensive work and can handle thousands of tasks simultaneously, which makes them perfect for parallel computing. CUDA provides the tools and libraries to write programs that can:
- Accelerate computations by running them on the GPU.
- Handle large datasets.
- Optimize performance for tasks like image processing and deep learning.
What is a GPU, and How is it different from a CPU?
The CPU (Central Processing Unit) is the “brain” of your computer. It is optimized for sequential tasks (one thing at a time) and has a few powerful cores (typically 4–16), which is good for general-purpose tasks like running your operating system or web browser.
The GPU was originally designed for rendering graphics, but modern GPUs are optimized for parallel tasks (many things at once). A GPU has thousands of smaller, less powerful cores (e.g., the NVIDIA H100 has 16,896 CUDA cores), which are ideal for tasks like matrix multiplication, the backbone of AI.
Understanding the GPU Architecture
The GPU is a highly parallel processor. Let’s look at its main components:
1. CUDA Cores
CUDA cores are the basic computational units of a GPU. Each CUDA core is a simple processor capable of performing basic arithmetic operations like addition, subtraction, and multiplication.
Think of CUDA cores as the “workers” in a factory. Each worker handles a small part of the overall task, allowing the GPU to process large amounts of data in parallel.
2. Streaming Multiprocessors (SMs)
CUDA cores are grouped into SMs. Each SM contains:
- CUDA cores: Individual processing units.
- Special Function Units (SFUs): For complex mathematical operations like trigonometric functions.
- Registers: Fast, private memory for each thread.
- Shared Memory: Memory shared by all the threads in the same block.
- Warp Scheduler: Manages the execution of threads in groups called warps.
For example, the NVIDIA RTX 4090 has 128 SMs, and each SM contains 128 CUDA cores.
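If you want to check these numbers for your own card, the CUDA runtime exposes them through cudaGetDeviceProperties. Below is a small standalone sketch (separate from the vector-addition example later in this post) that prints the SM count, warp size, and maximum threads per block of GPU 0:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                   // Properties of GPU 0
    printf("GPU name: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}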
3. Threads and Warps
A thread is the smallest unit of execution in CUDA. Each thread performs a specific task, such as processing one element of an array. Threads are grouped into warps, which are sets of 32 threads that execute the same instruction simultaneously.
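To make the thread-and-warp picture concrete, here is a minimal sketch (the kernel name whoAmI is made up for illustration) in which every thread prints its global index, the warp it belongs to within its block, and its lane inside that warp:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoAmI() {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x; // Unique index across the whole grid
    int warpId = threadIdx.x / warpSize;                   // Which warp inside this block
    int laneId = threadIdx.x % warpSize;                   // Position (0-31) inside that warp
    printf("thread %d -> warp %d, lane %d\n", globalIdx, warpId, laneId);
}

int main() {
    whoAmI<<<1, 64>>>();      // One block of 64 threads = exactly two warps
    cudaDeviceSynchronize();  // Wait for the device-side printf output
    return 0;
}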
4. Memory Hierarchy
The GPU has a complex memory hierarchy designed to balance speed and capacity. The main types of memory are:
a. Global Memory: Accessible by all threads but relatively slow; used to store large datasets. Data must be copied from the CPU to the GPU’s global memory before processing.
b. Shared Memory: Shared by all threads in the same block and much faster than global memory. Ideal for sharing intermediate results between threads.
c. Registers: Private to each thread and the fastest type of memory on the GPU. Used to store variables that are frequently accessed.
d. Constant Memory: Read-only memory shared by all threads, used for storing constants that don’t change during execution.
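To see how these levels fit together, here is a minimal kernel-only sketch (the name blockSum is hypothetical; it assumes a launch with 256 threads per block and an input whose length is an exact multiple of 256, so no bounds check is needed). Each block stages its slice of a global array in shared memory, the per-thread variables live in registers, and one thread per block writes the block’s sum back to global memory:
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                      // Shared memory: visible to all threads in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Register: private to this thread

    tile[threadIdx.x] = in[idx];                     // Global memory -> shared memory
    __syncthreads();                                 // Wait until every thread in the block has written

    if (threadIdx.x == 0) {                          // One thread per block sums its tile
        float sum = 0.0f;                            // Register variable
        for (int i = 0; i < blockDim.x; ++i) {
            sum += tile[i];
        }
        out[blockIdx.x] = sum;                       // Write one value per block back to global memory
    }
}
Shared memory pays off when threads in a block reuse the same data; here it simply illustrates where each kind of variable lives.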
5. Tensor Cores
Tensor Cores are specialized hardware units in NVIDIA GPUs designed to accelerate matrix operations, which are critical for AI and machine learning. For example, Tensor Cores are used in deep learning frameworks like TensorFlow and PyTorch to speed up training and inference.
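In practice, Tensor Cores are usually reached through libraries such as cuBLAS or cuDNN, but CUDA also exposes them directly through the WMMA API in mma.h. The kernel-only sketch below (the name tileMatMul is made up; it assumes a GPU with compute capability 7.0 or newer, compilation with an appropriate -arch flag, and a launch with one full warp of 32 threads) multiplies two 16x16 half-precision tiles and accumulates the result in single precision:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tileMatMul(const half *a, const half *b, float *c) {
    // Fragments describe the per-warp pieces of the 16x16x16 tile operation
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // Start the accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);        // Load the 16x16 A tile (leading dimension 16)
    wmma::load_matrix_sync(bFrag, b, 16);        // Load the 16x16 B tile
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // C = A * B + C, executed on the Tensor Cores
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major); // Write the result tile to global memory
}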
6. Streams
GPUs can work on multiple tasks simultaneously using streams. A stream is a sequence of operations that execute in order on the GPU; by using multiple streams, you can overlap computation with memory transfers to improve performance.
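As a rough illustration (all kernel and variable names here are invented for the sketch), the program below splits the work into two chunks and gives each chunk its own stream, so the host-to-device copy for one chunk can overlap with the kernel running on the other. Pinned host memory from cudaMallocHost is used because asynchronous copies from ordinary pageable memory do not overlap with computation:
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f; // Double each element of this chunk
}

int main() {
    const int n = 1 << 20;
    const int bytes = n * sizeof(float);

    float *h0, *h1, *d0, *d1;
    cudaMallocHost((void**)&h0, bytes); // Pinned host memory for chunk 0
    cudaMallocHost((void**)&h1, bytes); // Pinned host memory for chunk 1
    cudaMalloc((void**)&d0, bytes);
    cudaMalloc((void**)&d1, bytes);
    for (int i = 0; i < n; ++i) { h0[i] = 1.0f; h1[i] = 2.0f; }

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream gets its own copy + kernel; work in different streams can overlap
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    scaleKernel<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);

    cudaStreamSynchronize(s0); // Wait for everything queued in stream 0
    cudaStreamSynchronize(s1); // Wait for everything queued in stream 1

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
    cudaFreeHost(h0);
    cudaFreeHost(h1);
    return 0;
}
Operations queued into the same stream still run in order; only work in different streams is allowed to overlap.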
How CUDA Works
CUDA programs that run on the GPU are written in C/C++ with additional CUDA-specific syntax. The CPU (host) sends data and instructions to the GPU (device), and the GPU performs the calculations in parallel. Let’s walk through the basic steps using the addition of two vectors as an example.
Step 1: Define the Kernel Function
A kernel is a function that runs on the GPU and is defined with the __global__ keyword. The addKernel function below contains the code that will be executed in parallel by many threads; each thread calculates its global index idx to determine which element of the data it will process.
Example:
__global__ void addKernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x; // Calculate global thread index
    if (idx < n) {
        c[idx] = a[idx] + b[idx]; // Perform addition
    }
}
Step 2: Allocating Memory
Before the GPU can process data, memory must be allocated on the GPU (device). This is done using the cudaMalloc function. Example:
float *d_a, *d_b, *d_c; // Device pointers
int size = n * sizeof(float); // Size of memory to allocate
cudaMalloc((void**)&d_a, size); // Allocate memory for array a on the GPU
cudaMalloc((void**)&d_b, size); // Allocate memory for array b on the GPU
cudaMalloc((void**)&d_c, size); // Allocate memory for array c on the GPU
Step 3: Copy Data from Host to Device
Data must be transferred from the CPU (host) to the GPU (device) before the kernel can process it. This is done using the cudaMemcpy function. Example:
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); // Copy array a from host to device
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice); // Copy array b from host to device
Step 4: Launch the Kernel
The kernel is launched from the host using the special CUDA syntax <<<numBlocks, threadsPerBlock>>>, which specifies how many blocks and how many threads per block will execute the kernel. Example:
int threadsPerBlock = 256; // Number of threads per block
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock; // Number of blocks
addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
- numBlocks: the number of blocks in the grid.
- threadsPerBlock: the number of threads in each block.
Each thread in the grid processes a specific part of the data.
To understand this better, think of the GPU as a factory: threads are the individual workers, blocks are groups of workers that work together on a specific task, and the grid is made up of all the blocks.
Step 5: Copy Results from Device to Host
After the kernel finishes, the results are copied back from the GPU (device) to the CPU (host) using cudaMemcpy. Example:
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost); // Copy result array c from device to host
Step 6: Free GPU Memory
Once the computation is complete, the memory allocated on the GPU must be freed using cudaFree. Example:
cudaFree(d_a); // Free memory for array a on the GPU
cudaFree(d_b); // Free memory for array b on the GPU
cudaFree(d_c); // Free memory for array c on the GPU
Now, let’s put all the steps together into a complete CUDA program that adds two vectors:
#include <iostream>
#include <cuda_runtime.h>

// Kernel function to add two vectors
__global__ void addKernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x; // Calculate global thread index
    if (idx < n) {
        c[idx] = a[idx] + b[idx]; // Perform addition
    }
}

int main() {
    const int n = 1 << 20;              // 1 million elements
    const int size = n * sizeof(float); // Size of memory to allocate

    // Host memory
    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];

    // Initialize input vectors
    for (int i = 0; i < n; ++i) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size); // Allocate memory on GPU for vector a
    cudaMalloc((void**)&d_b, size); // Allocate memory on GPU for vector b
    cudaMalloc((void**)&d_c, size); // Allocate memory on GPU for vector c

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify results
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) {
            std::cerr << "Error at index " << i << std::endl;
            break;
        }
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Free host memory
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}
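To try the program yourself (assuming the file is saved as vector_add.cu and the CUDA toolkit is installed), compile it with nvcc vector_add.cu -o vector_add and run the resulting executable. If everything works, the program exits without printing anything, because every element of h_c equals 3.0.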
This is a very basic example; in future posts, we will look at more advanced examples and optimize them as well.
How GPUs Power AI
GPUs are at the heart of modern AI because they can handle the massive computational demands of AI workloads. Here’s how:
- Parallelism: GPUs can process thousands of tasks simultaneously.
- Tensor Cores: Accelerate matrix operations, which are critical for deep learning.
- Scalability: GPUs can be scaled across multiple devices to handle even larger workloads.
- AI Frameworks: Popular frameworks like TensorFlow and PyTorch are optimized for GPUs, making it easy to leverage their power.
Conclusion
By understanding the architecture of GPUs — CUDA cores, SMs, threads, warps, and memory hierarchy — you can unlock their full potential for your applications. This blog is a very introductory and basic guide to CUDA programming, designed to help you take your first steps in understanding how GPUs work and how to write simple CUDA programs. In future blogs, we will dive deeper into more complex examples and explore optimization techniques for real-world applications, such as implementing operations like SoftMax, matrix multiplication, and other advanced algorithms used in AI and scientific computing.
Published via Towards AI