
TFLOPS Aren’t Everything: The Many Dimensions That Shape GPU Performance
Author(s): Pawel Rzeszucinski, PhD
Originally published on Towards AI.

Introduction
The rapid advancement of AI algorithms, capable of fully leveraging GPUs’ capabilities, has transformed this hardware into an indispensable element of the current AI revolution. While we typically judge hardware quality by its raw ‘speed’, with GPUs it’s not that straightforward — it’s much more nuanced and interesting. When working in AI, choosing the right GPU isn’t simply about picking the fastest one. Like many people, I initially thought it was all about raw speed — more TFLOPS meant better performance. But the deeper I went into the subject — partly thanks to the fantastic 5-hour-long Lex Fridman podcast episode with Dylan Patel and Nathan Lambert — the clearer it became that the GPU universe is far more nuanced. There are several critical GPU parameters beyond raw compute power, each influencing AI performance differently depending on specific use cases and workloads.
I wrote this text to help me organize my knowledge on the subject, and I hope you will find it helpful too. We’ll explore Compute Power (TFLOPS), Memory (VRAM and Memory Bandwidth), Interconnect Bandwidth, and specialized hardware like CUDA and Tensor cores, clarifying why each matters and how they interact. In the end, I’ll also present a few practical examples showing that not all roads lead to “speed.”
NVIDIA A100 specifications — can you speak English, please?
Have a look at the specification from NVIDIA’s official A100 spec sheet below. If you see a lot of things you have no clue about, stick with me. I’ll walk you through the main topics, and by the end, we’ll revisit the sheet. Hopefully, everything will be much clearer by then!

Compute power (TFLOPS)
First, let’s talk about the obvious: compute power, usually measured in TFLOPS (trillions of floating-point operations per second). It represents how many floating-point operations a GPU can handle every second. Initially, I thought higher TFLOPS was always the best indicator of GPU performance. And compute power is indeed crucial, especially in AI training, because neural networks involve countless simultaneous calculations. This metric essentially describes how rapidly the GPU can process the mathematical operations central to AI, such as (but not limited to) matrix multiplications.
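To make TFLOPS more tangible, here is a rough sketch of how you could measure the throughput a GPU actually achieves on a single matrix multiplication. It assumes PyTorch on a CUDA-capable GPU; the matrix sizes and iteration count are arbitrary, and this is a quick illustration rather than a rigorous benchmark.

```python
# Time a large FP16 matrix multiplication and convert the elapsed
# time into achieved TFLOPS. Sizes are illustrative, not a benchmark.
import time
import torch

M = N = K = 8192
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

# Warm up so we don't time one-off allocation and kernel setup costs.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

flops = 2 * M * N * K  # each output element needs K multiplies and K adds
print(f"Achieved ~{flops / elapsed / 1e12:.1f} TFLOPS")
```

The number you measure will usually sit below the peak quoted on the spec sheet, which is exactly why the remaining parameters in this article matter.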
It’s important to understand that AI training and inference can run at different levels of numerical detail (precision), expressed through the bit width of the number formats used. Generally, higher precision provides better accuracy but results in slower, more power-hungry computations. Let me take a short tangent here to explain floating-point precision.
Floating-point precision refers to how detailed or accurate the numerical representation is within GPU calculations. GPUs represent numbers in different precisions: FP64 (double precision), FP32 (single precision), FP16/BF16 (half precision), and INT8 (integer precision). FP64 provides extremely accurate computations — for example, a number might look like 3.141592653589793 — but is slower and usually unnecessary for AI. FP32 offers standard precision, providing a good balance between accuracy and computational speed; a typical FP32 number could be 3.141593. FP16/BF16 significantly accelerates calculations and saves memory, making it popular for deep learning training, where a number might be represented as approximately 3.14. INT8, primarily used for inference, stores whole integers without decimals — like simply 3 — prioritizing speed and efficiency over precision. There are many more representations, but the selection above gives a good rough intuition.
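If you want to see this loss of detail for yourself, here is a tiny NumPy sketch. Note that the INT8 line is just a plain integer cast for illustration, not real quantization (that comes next).

```python
# How the same value loses detail as numerical precision drops.
import numpy as np

pi = 3.14159265358979323846

print(np.float64(pi))  # FP64: 3.141592653589793
print(np.float32(pi))  # FP32: ~3.1415927
print(np.float16(pi))  # FP16: ~3.14 (stored as 3.140625)
print(np.int8(pi))     # INT8: 3, the decimals are gone entirely
```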
It may seem confusing at first that AI training typically takes place at higher precision levels like FP32 or FP16, while inference commonly uses lower precision such as INT8. The key to understanding this difference lies in the process of post-training model quantization. Quantization involves carefully converting or “rounding” model parameters from higher precision (e.g., 3.141593 in FP32) to lower precision (e.g., 3 in INT8) after training is complete. Higher precision during training is essential because it captures subtle adjustments and intricate details necessary for effective learning. Once training finishes, these detailed adjustments become stable, allowing the model to perform nearly as well at lower precision, thus significantly accelerating computations during inference without dramatically sacrificing accuracy.
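As a toy illustration of the idea, symmetric post-training quantization boils down to choosing a scale factor and rounding. Real frameworks add calibration data, per-channel scales, and zero points, so treat this purely as a sketch.

```python
# Minimal symmetric post-training quantization of a small weight tensor:
# map FP32 values onto the INT8 range with one scale factor, then
# dequantize to see how close we stay to the original numbers.
import numpy as np

weights_fp32 = np.random.randn(5).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0   # map the largest magnitude to 127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
weights_dequant = weights_int8.astype(np.float32) * scale   # what inference "sees"

print("original:   ", weights_fp32)
print("quantized:  ", weights_int8)
print("dequantized:", weights_dequant)  # close to the original values
```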
GPU memory (VRAM and memory bandwidth)
Next, let’s consider GPU memory, often called VRAM (Video RAM). While compute speed can get you far, the GPU needs raw material to work with. VRAM is the workspace your GPU uses to temporarily store data: model parameters, intermediate calculations, and input batches during training. Larger VRAM allows for bigger neural networks, more extensive datasets, and larger batch sizes, all of which directly improve training efficiency and reduce overall training time.
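A quick back-of-the-envelope estimate shows why VRAM fills up so fast during training. The numbers below describe a hypothetical 7-billion-parameter model trained in FP16 with the Adam optimizer, and they deliberately ignore activations, FP32 master weights, and framework overhead.

```python
# Rough VRAM estimate for training a 7B-parameter model (illustrative only).
params = 7e9

weights_fp16   = params * 2      # 2 bytes per FP16 parameter
gradients_fp16 = params * 2      # one FP16 gradient per parameter
adam_states    = params * 4 * 2  # two FP32 moment tensors per parameter

total_bytes = weights_fp16 + gradients_fp16 + adam_states
print(f"~{total_bytes / 1e9:.0f} GB before a single activation is stored")  # ~84 GB
```

Even under these optimistic assumptions we are already brushing against the 80 GB of a top-end A100, which is why memory is often the first wall you hit.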
Equally important, yet frequently overlooked, is memory bandwidth, which determines how rapidly data moves in and out of GPU memory. High memory bandwidth drastically reduces data waiting times, ensuring the GPU remains constantly active without idle periods. GPUs using High Bandwidth Memory (HBM) can feed the GPU cores much faster than standard GPU memory types like GDDR6 or GDDR6X, improving overall training throughput. Conversely, low memory bandwidth can create bottlenecks, severely limiting GPU performance and increasing training duration.
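To get a feel for when bandwidth, rather than compute, is the limiting factor, here is a rough roofline-style estimate. The peak figures are approximate A100-class numbers, and the example layer (a single-token matrix-vector multiply, typical of LLM inference) is hypothetical.

```python
# Is this operation compute-bound or bandwidth-bound? (rough estimate)
peak_flops = 312e12       # ~FP16 Tensor Core peak, A100-class
peak_bandwidth = 2.0e12   # ~HBM bandwidth in bytes/s, A100-class

# One token's activation (1 x 4096) times an FP16 weight matrix (4096 x 4096).
flops = 2 * 1 * 4096 * 4096
bytes_moved = (4096 * 4096 + 4096 + 4096) * 2  # weights + input + output, 2 bytes each

compute_time = flops / peak_flops
memory_time = bytes_moved / peak_bandwidth
print("bandwidth-bound" if memory_time > compute_time else "compute-bound")
print(f"memory time is ~{memory_time / compute_time:.0f}x the compute time")
```

In this toy case the GPU spends far longer waiting for weights to arrive from memory than it does multiplying them, which is exactly the bottleneck HBM is designed to relieve.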
Interconnect bandwidth
No single GPU can hold in memory everything required to train a large LLM. Training modern LLMs requires thousands of GPUs orchestrated to work together. As you scale beyond a single GPU to multi-GPU setups, another crucial factor comes into play: interconnect bandwidth. This parameter measures how quickly GPUs communicate and exchange data with each other. In multi-GPU environments, it’s no longer just about individual GPU speed; effective collaboration becomes critical to performance.
Common GPU interconnect standards include PCIe, NVLink, and InfiniBand. PCIe, common in consumer systems, is affordable but offers relatively modest speeds. It’s suitable for simple, smaller setups where multi-GPU interactions aren’t frequent. NVLink, NVIDIA’s specialized GPU-to-GPU connection, drastically reduces latency and significantly enhances data throughput between GPUs within a single server. InfiniBand serves even larger scenarios involving numerous GPUs spread across multiple servers, necessary for massive workloads like training GPT-scale models or extensive deep learning clusters.
At this point, you might wonder — as I did initially — why not always opt for NVLink or InfiniBand? The reality is that these specialized interconnect technologies come with higher cost and complexity. So, to generalize clearly: PCIe is ideal for scenarios with infrequent multi-GPU interactions, NVLink is optimal for larger multi-GPU setups within a single machine, and InfiniBand becomes essential for huge multi-server deployments.
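The operation that actually stresses the interconnect during multi-GPU training is the gradient all-reduce performed on every step. Below is a minimal sketch of that pattern; it assumes PyTorch with the NCCL backend and one process per GPU, launched for example via torchrun.

```python
# Minimal sketch of per-step gradient synchronization across GPUs.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL uses NVLink / InfiniBand when available
rank = dist.get_rank()
torch.cuda.set_device(rank)              # one GPU per process (single-node assumption)

grads = torch.randn(10_000_000, device="cuda")  # stand-in for a model's gradients

# This collective crosses the interconnect on every training step,
# which is why its bandwidth can make or break large-scale training.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()                  # average the gradients
```

Libraries like PyTorch’s DistributedDataParallel do this for you under the hood, but the traffic pattern is the same.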
One detail worth adding here: interconnects don’t just enable GPU-to-GPU communication — they also impact how fast data can move between the CPU and GPU. This CPU–GPU communication can become a bottleneck if the interconnect is slow or if data transfers are frequent, especially in inference-heavy pipelines.
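On the CPU-to-GPU side, a common mitigation is pinned (page-locked) host memory combined with asynchronous copies, so that transfers can overlap with computation instead of stalling it. A minimal PyTorch sketch of the idea:

```python
# Overlap host-to-device transfers with GPU work using pinned memory.
import torch

batch = torch.randn(256, 3, 224, 224).pin_memory()  # page-locked CPU tensor
batch_gpu = batch.to("cuda", non_blocking=True)      # asynchronous host-to-device copy

# In a real training loop you usually get this via
# DataLoader(..., pin_memory=True) while the GPU works on the previous batch.
```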
Specialized GPU hardware: CUDA vs. Tensor Cores
Lastly, let’s touch on CUDA and Tensor cores. CUDA cores are general-purpose processors within NVIDIA GPUs, designed to efficiently handle diverse parallel tasks. Tensor cores are highly specialized units explicitly optimized for matrix multiplications, the foundational operations in deep learning neural networks. Tensor cores perform these operations far more efficiently than CUDA cores because they execute fused multiply-add (FMA) arithmetic on whole matrix tiles in a single operation (matrix multiply-accumulate), significantly accelerating AI computations.
In this text, when I refer to CUDA cores, I’m talking about the physical processing units inside NVIDIA GPUs — part of the underlying hardware architecture.
CUDA is also the name of NVIDIA’s software platform for GPU programming, which includes a programming language (based on C/C++), libraries, and tools that allow developers to write and run code on NVIDIA GPUs.
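In practice you rarely program Tensor cores directly; the framework routes work onto them whenever you compute in an eligible precision. Here is a minimal PyTorch sketch using automatic mixed precision (the layer and sizes are placeholders):

```python
# Under autocast, eligible matmuls run in FP16, which is what lets
# Tensor Cores on GPUs like the A100 accelerate them.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)   # the underlying matmul is executed in FP16

print(y.dtype)     # torch.float16
```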
It’s not all speed then, is it?
Let’s have a look at how different GPU parameters take center stage depending on the workload — and why focusing on TFLOPS alone can be misleading.
Scenario 1: In high-compute tasks like real-time inference on edge devices (e.g., autonomous vehicles), raw compute power (TFLOPS) is paramount. Speed is everything, and a single GPU must deliver predictions in milliseconds. In this case, GPU-to-GPU interconnect bandwidth is largely irrelevant (there is only one GPU involved), and moderate memory is usually enough.
Scenario 2: When training very large models or processing massive datasets, GPU memory (VRAM) and memory bandwidth become critical. If your model or batch doesn’t fit into memory, even the fastest GPU becomes a bottleneck due to constant data swapping. TFLOPS still matter, but they take a backseat to data handling capacity.
Scenario 3: In multi-GPU or multi-node training environments, such as training GPT-scale models across clusters, the spotlight shifts to interconnect speed. Here, the GPUs constantly exchange weights and gradients, and poor interconnect performance can cripple even the most powerful GPUs. NVLink or InfiniBand becomes essential, while compute and memory remain important but secondary.
The bottom line: not all workloads are created equal. GPU performance is multidimensional, and understanding which parameters matter most for your use case can make the difference between an efficient setup and a painfully slow one.
NVIDIA A100 specifications — let’s test our knowledge!
Let’s now practically apply all the GPU parameters we’ve explored. Below is the A100 specification we saw at the beginning. Naturally, my goal here is to help you understand the general concepts rather than to have you memorize exact technical specifications.

One key aspect to consider is the GPU form factor, which refers to the physical configuration of the GPU and how it integrates into computing systems — this is why the specification is divided into two columns. The A100 primarily comes in two forms: PCIe and SXM. PCIe GPUs offer flexibility and are suited to various scenarios, typically on the lower end of the compute scale. They are relatively easy to install and manage, with moderate power consumption and simpler cooling requirements. In contrast, the SXM form factor is specifically optimized for high-performance servers typically found in large-scale data centers. SXM GPUs require advanced cooling solutions due to their higher power demands but offer significantly superior performance, especially beneficial in multi-GPU environments. As you can see in the image I took from datacrunch.io’s blog, they are completely different hardware components with different connection interfaces, electronic circuit designs, etc. Check out the link for a more detailed discussion.

Another aspect we touched on earlier is the relationship between computational precision and GPU performance. As a reminder: as computational precision decreases, the GPU can perform more calculations per second. High-precision formats like FP64 deliver maximum accuracy for complex scientific computations but result in fewer calculations per second, making them rare in typical AI workloads. Single precision (e.g. FP32) offers a balanced level of accuracy and performance, commonly used across general AI tasks. To further boost performance, NVIDIA has integrated specialized Tensor cores that optimize calculations at lower precision levels. Formats like TF32, BFLOAT16, and FP16 significantly increase computational throughput, leveraging the reduced precision to enhance calculation speed dramatically. These are particularly valuable in deep learning and modern AI workloads where speed is often more important than perfect accuracy. At the lowest precision level, INT8 calculations are ideal for inference tasks, where approximate accuracy is sufficient, enabling extremely rapid computation and efficient model deployment. Remember the post-training model quantization?
Even though TF32 has a single-precision label (“32”), it’s actually a hybrid precision introduced specifically by NVIDIA to combine aspects of FP32 and FP16, providing the speed of lower precision with the numerical range of FP32.
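In PyTorch, for example, TF32 on Ampere GPUs such as the A100 is controlled by two backend flags, so plain FP32 code can transparently benefit from Tensor cores (the default settings have changed between releases, so it is worth setting them explicitly):

```python
# Allow FP32 matmuls and convolutions to run as TF32 on Tensor Cores.
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")  # ordinary FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                   # eligible to run as TF32 on Tensor Cores
```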
The memory section is pretty straightforward — having more GPU memory allows larger AI models and datasets to be held directly in memory, significantly reducing the need for slower data transfers. High memory bandwidth ensures data moves quickly between memory and the GPU cores, preventing computational delays that could reduce efficiency during intensive tasks.
One element we didn’t discuss above is Max Thermal Design Power (TDP). I didn’t include it in the main text as it’s somewhat tangential to the computational aspects of the GPU, but briefly: it’s the maximum amount of heat (measured in watts) a GPU is expected to generate under heavy workload conditions. It’s also directly related to the GPU design and interfacing standard (PCIe vs. SXM in this case). In practical terms, a higher TDP usually means more advanced cooling methods are necessary, such as liquid cooling or enhanced airflow, to maintain optimal performance and reliability.
The A100 GPU also supports Multi-Instance GPU (MIG) technology, allowing a single physical GPU to be divided into multiple virtual GPUs. Just as a Virtual Private Server (VPS) allows a single physical server to be partitioned into multiple independent, isolated virtual servers, MIG allows one physical GPU to be split into multiple smaller, virtualized GPU instances. Each MIG instance behaves like a standalone GPU with its dedicated resources, similar to how each VPS has allocated CPU, memory, and storage resources.
The method of interconnecting GPUs is another important factor affecting performance. We spoke about the differences between PCIe, NVLink, and InfiniBand above. You may have noticed that InfiniBand is not mentioned in the specification at all — it’s typically used as an external networking technology for connecting servers, rather than directly linking GPUs within the same machine.
Finally, when choosing between PCIe-based and SXM-based server options, keep in mind that PCIe systems generally feature fewer GPUs per server (typically 1–8), making them suitable for flexible and general-purpose environments. SXM-based servers, like NVIDIA’s HGX A100 systems, can accommodate up to 16 GPUs per server. These are designed specifically for high-density, high-performance scenarios requiring specialized cooling and power infrastructure, often found in dedicated, large-scale data centers.
So, how’s the specification looking now? Less cryptic, I hope!
Final thoughts
In the world of AI, where algorithms often take center stage, it’s easy to overlook the hardware that powers them. But as we’ve seen, GPUs are far more than just raw compute engines — they’re intricate systems where compute power, memory, bandwidth, and interconnects each play critical roles. Understanding these dimensions isn’t just academic; it’s essential for making smart, cost-effective choices when deploying or scaling AI systems. Whether you’re building models on a laptop, training them across clusters, or optimizing for inference at the edge, knowing how to match your workload to the right GPU capabilities is a serious advantage. TFLOPS may grab the headlines, but real performance lives in the details.