But What Is Inside an AI Accelerator?

Last Updated on July 8, 2024 by Editorial Team

Author(s): Aditya Mohan

Originally published on Towards AI.

Photo by Google DeepMind on Unsplash

Heterogeneous computing refers to machines with more than one “kind” of computing “core”. The computing cores can be CPUs, GPUs, TPUs, and many other accelerators that are being developed every day. These specialized “cores” can also be called ASICs, an abbreviation for “Application-Specific Integrated Circuits”.

This is how ARM defines an ASIC:

An application-specific integrated circuit is an integrated circuit (IC) that’s custom-designed for a particular task or application. Unlike FPGA boards that can be programmed to meet a variety of use case requirements after manufacturing, ASIC designs are tailored early in the design process to address specific needs.

Since the release of ChatGPT and other large language models (LLMs), there has been a growing demand for the computing power required to train these models (with billions of parameters) and to generate results from them, which is called inference. This is precisely where AI accelerators come to the rescue!

An overview of what lies ahead in this article…

In this article, I give a short introduction to AI accelerators and how they differ from CPUs and GPUs. Then I dive into the systolic array architecture and how it works. I also peek inside the Google TPU and end the article with possible future research directions.

Introduction to AI Accelerators

AI accelerators are specialized hardware designed to enhance the performance of artificial intelligence (AI) tasks, particularly in machine learning and deep learning. They are designed to perform the large-scale parallel computations (read: matrix multiplications) required by many deep learning models more efficiently than traditional CPUs.

Some key characteristics that differentiate AI Accelerators from CPUs and GPUs are:

  1. They are a type of ASIC specifically designed for deep learning workloads. In contrast, CPUs are built for general-purpose programming and GPUs were originally built for rendering graphics. NVIDIA GPUs, in fact, started out as ASICs for handling computer graphics operations and then transitioned into being used in scientific computing (with the help of CUDA). Later, around 2015, the focus of CUDA shifted toward supporting neural networks.
  2. Massive parallel processing power: GPUs and accelerators are designed to execute many operations in parallel (“high throughput”), whereas CPUs are designed to perform sequential operations in the shortest time (“low latency”). Accelerators offload deep learning workloads from CPUs so that these operations run more efficiently.

Systolic Arrays

A systolic array is a simple and energy-efficient architecture for accelerating general matrix multiplication (GEMM) operations in hardware. Systolic arrays provide an alternative way to implement these operations and support parallel data streaming to improve memory access and promote data reuse. This architecture forms the basis of many commercial accelerator offerings like the Google TPU (tensor processing unit), Intel NPU (neural processing unit), IBM AIU, etc.

Systolic data flow of the MAC array [source: author]

These arrays comprise MAC (multiply-and-accumulate) units that perform the actual operations. Serving the MAC units are the row and column SRAM buffers that feed these units with data. Each MAC unit will save the incoming data in an internal register and then forward the same data to the outgoing connection in the next cycle.
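The latch-and-forward behavior described above can be sketched as a small cycle-by-cycle simulation. This is a toy model under stated assumptions, not a description of any particular chip: operands enter from the left and top edges with a one-cycle skew per row and column, each MAC unit keeps its own accumulator, and the function name `systolic_matmul` is made up for this illustration.

```python
def systolic_matmul(a, b):
    """Simulate a MAC array computing a @ b, one cycle at a time.

    a is rows x depth (streamed in from the left edge),
    b is depth x cols (streamed in from the top edge).
    Returns the accumulators and the number of compute cycles simulated.
    """
    rows, depth, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]       # one accumulator per MAC unit
    a_reg = [[None] * cols for _ in range(rows)]  # operand latched from the left
    b_reg = [[None] * cols for _ in range(rows)]  # operand latched from the top
    cycles = rows + cols + depth - 2              # until the last MAC unit finishes

    for t in range(cycles):
        # Visit units from bottom-right to top-left so that each unit still
        # sees the values its neighbours latched in the previous cycle.
        for i in reversed(range(rows)):
            for j in reversed(range(cols)):
                if j == 0:  # left edge: fed by the row buffer, skewed by i cycles
                    k = t - i
                    a_in = a[i][k] if 0 <= k < depth else None
                else:       # interior: forwarded by the left neighbour
                    a_in = a_reg[i][j - 1]
                if i == 0:  # top edge: fed by the column buffer, skewed by j cycles
                    k = t - j
                    b_in = b[k][j] if 0 <= k < depth else None
                else:       # interior: forwarded by the neighbour above
                    b_in = b_reg[i - 1][j]
                if a_in is not None and b_in is not None:
                    acc[i][j] += a_in * b_in      # multiply-and-accumulate
                a_reg[i][j], b_reg[i][j] = a_in, b_in  # latch for the next cycle
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Only the edge units ever read from the buffers; every other unit receives its operands from a neighbour. The cycle count here covers filling and computing only, not moving the finished results out of the array.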

This behavior results in significant savings in SRAM read requests and makes it possible to exploit data reuse. For example, filter weights remain stationary during a convolution operation as the filter is convolved over the image. This can be exploited by storing the weights in the MAC array while the row buffer loads the different windows of the input image. Fewer read requests are then needed to load the weights, which frees up bandwidth for reading from off-chip memory such as DRAM or HBM.

There are different techniques to exploit data reuse, referred to as dataflows or mapping schemes, which are discussed in the next section.
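To get a feel for the size of the saving, here is a back-of-the-envelope count. It assumes a matrix multiplication that fits the array exactly, and it compares a hypothetical design in which every MAC unit fetches both of its operands from SRAM against one in which only the edge units are fed and operands are forwarded. The function name and the example sizes are illustrative.

```python
def sram_reads(array_rows, array_cols, steps):
    """Count operand reads from SRAM for `steps` accumulation steps."""
    # Every MAC unit fetches its two operands itself, every step.
    without_forwarding = 2 * array_rows * array_cols * steps
    # Only the left and top edges are fed; neighbours forward the rest.
    with_forwarding = (array_rows + array_cols) * steps
    return without_forwarding, with_forwarding


naive, systolic = sram_reads(array_rows=4, array_cols=4, steps=8)
print(naive, systolic, naive / systolic)
```

The ratio grows with the array size, which is one reason large MAC arrays can be fed from comparatively modest buffers.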

Data Flow Techniques

Inputs to the MAC array [source: author]
Output Stationary Dataflow. Note the color coding to follow how the convolution window and weights have been unrolled [source: author]

Although there are no hard and fast rules about which mapping to use with a systolic array architecture, here I will discuss one of the three strategies specified in the Scale-Sim paper (Samajdar et al., 2020). The three strategies are named Output Stationary (OS), Weight Stationary (WS), and Input Stationary (IS). The word “stationary” indicates which part of the computation spends the most time stored in the systolic array.

The output stationary dataflow is depicted in the figure above. “Output” stationary means that each MAC unit is responsible for calculating one output pixel. All the required operands are fed from the left and top edges of the systolic array. Each row entering from the left (IFMAP) consists of the elements of one convolution window, and each column entering from the top (FILTER) represents one unrolled filter. The elements of one row and one column are multiplied and accumulated to calculate one pixel of the output feature map (OFMAP).
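The unrolling described above can be sketched in a few lines. This is a minimal illustration assuming a single-channel image, stride 1, and no padding; the helper names are made up for the example and do not come from Scale-Sim.

```python
def unroll_windows(image, k):
    """IFMAP operand: one row per k x k convolution window, flattened."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]


def unroll_filters(filters):
    """FILTER operand: one column per flattened filter."""
    flat = [[v for row in f for v in row] for f in filters]
    return [list(col) for col in zip(*flat)]


def matmul(a, b):
    """Plain matrix multiply; each entry is one row-by-column accumulation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filters = [[[1, 0], [0, 1]],   # sums the main diagonal of each window
           [[1, 1], [1, 1]]]   # sums every element of each window

ifmap = unroll_windows(image, k=2)    # 4 windows x 4 elements
filt = unroll_filters(filters)        # 4 elements x 2 filters
ofmap = matmul(ifmap, filt)           # 4 output pixels x 2 filters
print(ofmap)
```

In the output stationary mapping, entry `ofmap[i][j]` is the value that the MAC unit in row `i`, column `j` of the array accumulates. Here the array would need 4 rows and 2 columns, and each unit accumulates over 4 steps (the window size).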

Timing Model for a Systolic Array

Timing Model for a Systolic Array following the Output Stationary dataflow mapping [source: author]

Here we calculate the number of cycles that a systolic array takes to perform a matrix multiplication. We assume that the operation never stalls on memory bandwidth (that is, the SRAM buffers are always filled with the data needed for the compute) and that we have unlimited MAC units available to perform the required computation.

Sr and Sc are the dimensions of the systolic array and, in this case, are equal to the number of rows of the IFMAP and the number of columns of the FILTER, respectively. T is the temporal dimension, which in the output stationary case represents the convolution window size.

As described by the figure above, we can conclude that the number of cycles for the systolic array to perform a matrix multiplication is:

Image source: author
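Since the expression lives in the figure, here it is written out. This is my transcription of the runtime equation from the Scale-Sim paper (Samajdar et al., 2020), using the symbols defined above:

```latex
\tau = 2S_r + S_c + T - 2
```

One way to account for the terms (an interpretation, not a quote from the paper): the skewed operands need S_r + S_c - 2 cycles to reach the MAC unit farthest from the buffers, that unit then accumulates for T cycles, and a further S_r cycles are spent moving the finished outputs out of the array.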

Obviously, in the real world, we do not have unlimited MACs. In that case, we divide the workload by the number of available MAC units and therefore get the following expression for timing:

Scale-Up Timing [Image source: author]
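Written out, and again transcribed from Samajdar et al. (2020) rather than read off the figure, the scale-up expression for an array with R rows and C columns serving a workload that needs S_r rows and S_c columns is:

```latex
\tau_{\text{scale-up}} = (2R + C + T - 2)\,\left\lceil \frac{S_r}{R} \right\rceil \left\lceil \frac{S_c}{C} \right\rceil
```

The two ceiling terms count how many times the R × C array has to be reused to cover the required footprint. When R = S_r and C = S_c, both ceilings equal 1 and the expression reduces to 2S_r + S_c + T - 2, the unlimited-MAC case.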

Here, we assume that R and C are the actual dimensions of the systolic array and Sr and Sc are the required dimensions. To decrease this time, we can increase the number of MAC units, a process we can call “scaling up”. Another approach is to have multiple MAC array units that perform the compute in parallel, which can be called “scaling out”. This further reduces the time needed to complete the operation.
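The effect of scaling up versus scaling out can be explored with a short sketch. It assumes the Scale-Sim timing expressions (Samajdar et al., 2020) as I understand them: 2Sr + Sc + T - 2 cycles with unlimited MAC units, and (2R + C + T - 2) multiplied by the number of array reuses otherwise. The scale-out function additionally assumes the workload splits evenly across the arrays and ignores any communication between them, which is a simplification of my own.

```python
import math


def cycles_unlimited(sr, sc, t):
    """Cycles when the array is as large as the workload requires."""
    return 2 * sr + sc + t - 2


def cycles_scale_up(sr, sc, t, r, c):
    """Cycles on a single r x c array that is reused until the workload is covered."""
    reuses = math.ceil(sr / r) * math.ceil(sc / c)
    return (2 * r + c + t - 2) * reuses


def cycles_scale_out(sr, sc, t, r, c, parts_r, parts_c):
    """Cycles with parts_r x parts_c independent r x c arrays working in parallel.

    Assumes an even split of the workload and no inter-array overhead.
    """
    return cycles_scale_up(math.ceil(sr / parts_r), math.ceil(sc / parts_c), t, r, c)


# Illustrative workload: 256 x 256 required footprint, T = 64.
print(cycles_unlimited(256, 256, 64))                    # unlimited MAC units
print(cycles_scale_up(256, 256, 64, 128, 128))           # one 128 x 128 array
print(cycles_scale_up(256, 256, 64, 256, 256))           # scaled up to 256 x 256
print(cycles_scale_out(256, 256, 64, 128, 128, 2, 2))    # four 128 x 128 arrays
```

Under these assumptions, four 128 × 128 arrays finish sooner than one 256 × 256 array with the same number of MAC units, because the fill and drain terms scale with the array dimensions. The model says nothing about the extra memory bandwidth that parallel arrays need.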

Scale-up vs Scale-out [source: (Samajdar et al., 2020)]

A look inside Google TPU

Google TPUv1 [source: Jouppi et al., 2017]

Origins

Back in 2013, a projection at Google showed that if people used voice search for even 3 minutes a day, the computing demand on Google’s datacenters would double. Running inference for DNN-based speech recognition models on traditional CPUs was very expensive. Therefore, Google started working on a custom ASIC (application-specific integrated circuit) that would perform inference efficiently. The goal was a 10x improvement in cost-performance over GPUs (Jouppi et al., 2017). The outcome of this effort was the Google Tensor Processing Unit, which is based on the systolic array architecture.

Floorplan for the TPU die. [source: Jouppi et al., 2017]

TPU v1

As you are now aware, systolic array-based AI accelerators are composed of MAC units. Google’s original TPU implementation consisted of 256×256 MAC units (see the Matrix Multiply Unit in the figure above) that could perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products were then collected in 4 MiB of 32-bit accumulators below the matrix unit. Other components, such as the activation pipeline, then apply activation functions to the resulting matrix.

For more details about the first-generation Google TPU, read this very interesting paper, published in 2017, which discusses the TPU’s design and performance in detail!
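The numbers in this paragraph fit together, which a few lines of arithmetic can show. The sketch below uses only the figures quoted above (256×256 MAC units, 8-bit operands, 4 MiB of 32-bit accumulators) and considers the unsigned case; the variable names are my own.

```python
MXU_DIM = 256
mac_units = MXU_DIM * MXU_DIM                # MAC units in the matrix unit

# Largest product of two unsigned 8-bit operands and the bits it needs.
max_product = 255 * 255
product_bits = max_product.bit_length()

# Summing one such product from each of the 256 rows of a column.
max_column_sum = MXU_DIM * max_product
column_sum_bits = max_column_sum.bit_length()

# 4 MiB of storage holding 32-bit (4-byte) accumulators.
accumulators = (4 * 1024 * 1024) // 4
accumulator_rows = accumulators // MXU_DIM   # rows of 256 accumulators each

print(mac_units, product_bits, column_sum_bits, accumulators, accumulator_rows)
```

A single pass through a column needs only 24 bits in this worst case, so 32-bit accumulators leave headroom, for example for sums that span more than one pass through the array.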

In-datacenter performance analysis of a tensor processing unit | IEEE Conference Publication | IEEE Xplore

TPU v2 and v3

Block Diagram of the TensorCore of TPU v2

Improving upon the design of TPU v1, Google also released the specifications of TPU v2 and v3, which include some major changes:

  1. Interconnect: A critical element of any chip design is the interconnect, which determines how fast inter-chip communication is. An on-device switch called the Interconnect Router (see the figure above) provides deadlock-free routing and enables a 2D torus interconnect topology.
  2. Memory: A major performance bottleneck in TPU v1 was the limited memory bandwidth of DRAM. TPU v2 partly solved this problem by using HBM (High Bandwidth Memory) DRAM. It offers 20 times the bandwidth of TPU v1 by using an interposer substrate that connects the TPU v2 chip via thirty-two 128-bit buses to 4 stacks of DRAM chips.
  3. Multiple smaller MXUs per chip: While TPU v1 featured a single MXU of size 256×256, from TPU v2 onwards the MXU was reduced to 128×128 and each chip has multiple MXUs. Larger MXUs require more memory bandwidth for optimal chip utilization. Google's analysis found that convolutional model utilization ranged between 37% and 48% for four 128×128 MXUs, which was 1.6x that of a single 256×256 MXU (22% to 30%). The explanation Google gives is that some convolutions are naturally smaller than 256×256, which leaves parts of the MXU unused.
A 2D-torus topology [source: Jouppi et al., 2020]
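What makes a 2D torus different from a plain grid is that the links wrap around at the edges, so every chip has exactly four neighbours. A minimal sketch of that addressing, with hypothetical function names and an arbitrary 4×4 size chosen for illustration:

```python
def torus_neighbors(x, y, width, height):
    """The four chips directly linked to chip (x, y) on a 2D torus."""
    return [
        ((x - 1) % width, y),   # left, wrapping around the edge
        ((x + 1) % width, y),   # right
        (x, (y - 1) % height),  # up
        (x, (y + 1) % height),  # down
    ]


def torus_hops(src, dst, width, height):
    """Fewest link traversals between two chips, using wrap-around links."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


print(torus_neighbors(0, 0, 4, 4))
print(torus_hops((0, 0), (3, 3), 4, 4))
```

On a 4×4 grid without wrap-around, opposite corners are 6 hops apart; with the torus links they are 2 hops apart.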

For more details regarding Google TPU v2 and v3:

A Domain Specific Supercomputer for Training Deep Neural Networks | ACM
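The point about small convolutions leaving part of a large MXU idle can be illustrated with a toy model. It counts only padding waste, meaning the MAC units that hold no useful operand when a workload is tiled onto square MXUs. This is not the methodology behind the utilization figures quoted above, and the 160×96 workload is a made-up example.

```python
import math


def mxu_utilization(rows_needed, cols_needed, mxu_dim):
    """Fraction of occupied MAC units doing useful work when a
    rows_needed x cols_needed workload is tiled onto mxu_dim x mxu_dim MXUs."""
    tiles_r = math.ceil(rows_needed / mxu_dim)
    tiles_c = math.ceil(cols_needed / mxu_dim)
    occupied = tiles_r * tiles_c * mxu_dim * mxu_dim
    return rows_needed * cols_needed / occupied


print(mxu_utilization(160, 96, 256))  # one 256 x 256 MXU
print(mxu_utilization(160, 96, 128))  # tiled onto 128 x 128 MXUs
```

In this example the smaller MXUs waste half as many MAC units, because the workload only needs two of the 128×128 tiles rather than a full 256×256 array.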

AI and Memory Wall

Scaling of memory bandwidth in comparison to Peak FLOPS [source: Gholami et al., 2024]

The amount of compute needed to train modern deep learning models and to perform inference with them is growing rapidly. This trend prompted research into AI accelerators with a focus on increasing computing power, sometimes at the expense of memory hierarchies and bandwidth, which creates a memory bottleneck. In this section, I briefly summarize this very interesting paper [Gholami et al., 2024], which points toward future research avenues in the realm of AI accelerators.

But what is a memory wall?

The memory wall refers to the problem where compute is faster than the rate at which data can be fetched from off-chip DRAM, which limits the overall compute that can be performed. The time to complete an operation depends both on the speed of the compute and on how fast data can be fed to the arithmetic units of the hardware.

As can be seen in the graph above, peak compute has increased 60,000x in the last 20 years, whereas DRAM and interconnect bandwidth have increased only 100x and 30x, respectively. This huge gap aggravates the memory wall problem, especially with growing model sizes.
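A minimal sketch of this idea, assuming compute and data movement overlap perfectly so that the slower of the two sets the runtime. The peak throughput and bandwidth below are made-up round numbers, not the specification of any real chip, and the byte counts assume each matrix is moved exactly once.

```python
def op_time(flops, bytes_moved, peak_flops, bandwidth):
    """Return (seconds, limiting resource) for one operation."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    if memory_time > compute_time:
        return memory_time, "memory"
    return compute_time, "compute"


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s off-chip bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

n = 4096
BYTES_PER_VALUE = 2  # FP16

# Matrix-vector product: 2*n*n FLOPs, dominated by reading the n x n matrix.
matvec = op_time(2 * n * n, n * n * BYTES_PER_VALUE, PEAK_FLOPS, BANDWIDTH)

# Matrix-matrix product: 2*n^3 FLOPs over three n x n matrices.
matmat = op_time(2 * n**3, 3 * n * n * BYTES_PER_VALUE, PEAK_FLOPS, BANDWIDTH)

print(matvec[1], matmat[1])
```

With these numbers the matrix-vector product is limited by memory while the matrix-matrix product is limited by compute: the first performs very few operations per byte fetched, so faster arithmetic units would not help it at all.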
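To compare these 20-year totals with the "per two years" rates used later in this section, the totals can be converted to an equivalent compound growth rate. The sketch assumes steady exponential growth over the whole period, which is a simplification.

```python
def growth_per_two_years(total_growth, years=20):
    """Equivalent compound growth factor per 2-year step."""
    return total_growth ** (2 / years)


for name, total in [("peak compute", 60000), ("DRAM bandwidth", 100), ("interconnect bandwidth", 30)]:
    print(name, round(growth_per_two_years(total), 1))
```

Under that assumption, compute grows roughly 3.0x every two years, against about 1.6x for DRAM bandwidth and 1.4x for interconnect bandwidth.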

source: Gholami et al., 2024

As depicted in figure (a) above, the number of parameters in SOTA transformer models has increased at a rate of 410x every two years, whereas AI accelerator memory capacity (green dots) has scaled at only 2x every two years. Figure (b) depicts the amount of compute, measured in petaFLOPs, needed to train SOTA models for computer vision (CV), natural language processing (NLP), and speech, along with the separate scaling of Transformer models (750x every two years).

This problem opens up many research avenues where progress can be made. Techniques like quantization and model pruning are being actively investigated to reduce model size. One of the major breakthroughs in AI accelerators has been the successful adoption of half precision (FP16) instead of single precision, enabling a 10x increase in hardware compute capability. Another possible solution that the authors consider worth investigating is revisiting the organization of the cache hierarchy of AI accelerators, which has been simplified to prioritize computing power.

Do check out the paper for a more detailed analysis and discussion of this topic: [2403.14123] AI and Memory Wall (arxiv.org)
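As a concrete illustration of how quantization shrinks a model, here is a minimal sketch of symmetric 8-bit quantization of a list of weights. It is one simple scheme chosen for illustration, not the method discussed by Gholami et al., and the function names and weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [1.27, -0.64, 0.32, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight now needs 1 byte instead of the 4 bytes of a single-precision float, so four times less data has to cross the memory interface, at the cost of a rounding error of at most half the scale factor per weight.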

References

  1. Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … & Yoon, D. H. (2017, June). In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture (pp. 1–12).
  2. Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., … & Patterson, D. (2020). A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7), 67–78.
  3. Gholami, A., Yao, Z., Kim, S., Hooper, C., Mahoney, M. W., & Keutzer, K. (2024). AI and memory wall. IEEE Micro.
  4. Samajdar, A., Joseph, J. M., Zhu, Y., Whatmough, P., Mattina, M., & Krishna, T. (2020, August). A systematic methodology for characterizing scalability of dnn accelerators using scale-sim. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 58–68). IEEE.
