(Vision) Transformers: Rise of the Chimera
Author(s): Quadric
Originally published on Towards AI.
It's 2023, and transformers are having a moment. No, I'm not talking about the latest installment of the Transformers movie franchise, Transformers: Rise of the Beasts; I'm talking about the deep learning model architecture class, transformers, that is fueling anticipation, excitement, fear, and investment in AI.
Transformers are not so new in the world of AI anymore; they were first introduced by the team at Google Brain in 2017 in their paper "Attention is All You Need". Since their introduction, transformers have inspired a flurry of investment and research which have produced some of the most impactful model architectures and AI products to date, including ChatGPT, which is an acronym for Chat Generative Pre-trained Transformer.
These products, and the transformers they're built with, solve Natural Language Processing (NLP) problems, i.e., they consume "language" inputs in the form of text prompts and produce "language" outputs in the form of strings of words and punctuation that are (hopefully) human-readable. In order to produce meaningful outputs, these transformers are trained on truly mind-boggling amounts of textual data on a scale that few companies can afford to implement. The size and scale of the transformers used in these NLP problem domains have contributed to the new moniker for this class of models: Large Language Models (LLMs).
Equally exciting is the adaptation of these transformer architectures for use in Computer Vision (CV) applications. A new class of models, broadly referred to as Vision Transformers (ViT), was empirically shown to be a viable alternative to more traditional Convolutional Neural Networks (CNN) in the paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale", also published by the team at Google Brain, in 2021.
Unlike the NLP problem space, CV models like Vision Transformers are of a size and scale that are approachable for SoC designers targeting the high-performance, edge AI market. There's just one problem: Vision Transformers are not CNNs, and many of the assumptions made by the designers of first-generation Neural Processing Unit (NPU) and AI hardware accelerators found in today's SoCs do not translate well to this new class of models.
In this article, we'll explain:
- What makes Vision Transformers (ViT) so special in comparison to their CNN counterparts,
- Why these unique architectural features of ViTs are "breaking" almost all NPU and AI hardware accelerators targeting the high-performance edge market, and
- How Quadric's Chimera GPNPU architecture is able to run ViTs today with real-time throughput at ~1.2W.
Lastly, we'll make and defend a simple prediction:
The System-on-Chip (SoC) for Artificial Intelligence (AI) applications that most easily adapts to new model architectures, like Vision Transformers (ViT), will win in the market long-term.
If you're a System-on-Chip (SoC) designer looking to enable state-of-the-art Vision Transformer (ViT) models for your developers today and whichever model architectures become state-of-the-art tomorrow, this article is for you.
What makes Vision Transformers so special?
ViTs garnered a lot of hype because the team at Google Brain proved that they were viable alternatives to CNNs. To understand what makes ViTs so special, let's compare them to their CNN counterparts.
CNNs, as their name suggests, are built using convolutional filters. Let's briefly refresh ourselves on what convolutional filters actually do.
In Figure 1 below, we have a 6×6 input matrix on the left, a 3×3 convolutional filter in the middle, and a 4×4 output tensor on the right. Each value in the output tensor is calculated by taking the element-wise product of a 3×3 window of the input matrix with the 3×3 convolutional filter and summing the results.
This particular convolutional filter, with positive 1 values in its left column, 0 values in its middle column, and negative 1 values in its right column, produces positive output values where vertical edges are found in the original matrix, i.e., in the middle of the example input matrix.
The important thing to note from the above example is that convolutional filters, like this vertical edge detection filter, learn local features within an image. CNNs have many of these filters, and each filter learns what values will extract the most meaningful information from the input image, but each filter only considers information in a localized window of the input image, e.g. a 3×3 crop of the image.
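For readers who prefer code to figures, below is a minimal NumPy sketch of this kind of "valid" convolution with a vertical edge detection filter. The input values are made up for illustration and are not taken from Figure 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), taking the
    element-wise product of each window with the kernel and summing the result."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# A 6x6 input with a vertical edge down the middle: bright left half, dark right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# Vertical edge detection filter: +1 left column, 0 middle column, -1 right column.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

print(conv2d_valid(image, kernel))  # 4x4 output; non-zero only where the edge is
```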
By stacking layers of these convolutional filters on top of one another, i.e., creating deep neural networks (DNN), these local filters gradually build up attention over more abstract patterns that span larger sections of the image, because each filter consumes as inputs the outputs of a collection of adjacent local filters. We can see this progression of learned abstractions (from edges, to textures, to patterns, to object parts, to full objects) by inspecting the intermediate layers of DNNs at different depths, as depicted below in Figure 2.
Vision Transformers are a revolutionary idea in the world of Computer Vision (CV) because they employ global attention at each layer.
Attention, as introduced in the paper βAttention is All You Needβ, is a dense mapping of weights between different elements in a sequence. These weights represent the relative importance of each element in the sequence to all other elements in the sequence.
To more intuitively understand attention, let's look at the sentence below:
"I poured water from the bottle into the cup until it was full."
As seasoned communicators, we might infer from context that the pronoun "it" refers to the noun "cup" in this sentence because of the adjective "full"; however, by changing the word "full" to "empty", the reference object for "it" changes from "cup" to "bottle". We make this change of inference without much thought because of the innate knowledge we have about how the verb "pour" works, i.e., the act of pouring implies that the bottle is losing water and the cup is gaining water. This example demonstrates the relative importance of the word "full" to the context of the word "it" in this sentence.
Notice that in this example, we did not consider groups of three words at a time, but instead considered the entire sentence at the same time. Conceptually, this is what it means to have global attention, and it can be very useful in comparison to local attention in inferring context within a problem space.
There's just one significant problem with the concept of global attention employed by transformers: it's a dense mapping, and dense mappings scale quadratically.
In the above sentence, there are 13 words, i.e., `N=13` elements in the sequence. To achieve global attention on this sequence, we need `W=N*(N-1)` or `W=13*12=156` weights to represent the relative importance of each element to each other element (excluding ground truth class labels and patch delineators).
This operation is expensive, but feasible for this two-dimensional data. Unfortunately, global attention becomes untenable when we try to adapt to higher dimensional data like RGB images used in CV applications, i.e., when `N=224×224=50,176` pixels in an image and we need `W=50176*(50175)=2,517,580,800` weights for global attention.
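To ground the discussion, here is a minimal NumPy sketch of single-head scaled dot-product attention (the dimensions are illustrative, not taken from any particular model). Note that the intermediate weight matrix is N×N; it includes each element's attention to itself, so it is slightly larger than the `W=N*(N-1)` count above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: every element attends to every element, so the
    intermediate weight matrix is N x N and grows quadratically with sequence length."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (N, N) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights                           # (N, d) outputs, (N, N) weights

N, d = 13, 64                            # 13 "words", 64-dim embeddings (illustrative)
Q = K = V = np.random.randn(N, d)
outputs, weights = scaled_dot_product_attention(Q, K, V)
print(outputs.shape, weights.size)       # (13, 64) and 169 pairwise weights
```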
To solve this problem, ViTs preprocess the three-dimensional image data into a two-dimensional representation. They accomplish this by:
- splitting the inputs into patches,
- creating a linear projection or two-dimensional βembeddingβ of each patch, and
- adding a positional embedding to each patch embedding to retain the patch's position within the original image.
To put it more simply, in traditional NLP transformers, the input sequences of data are sentences composed of words. Analogously, in Vision Transformers, each image is a "sentence" and each patch embedding is a "word".
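To make the patch-embedding step concrete, here is a minimal NumPy sketch for a 224×224 RGB image split into 16×16 patches. The projection matrix and positional embeddings are randomly initialized stand-ins for parameters that a real ViT learns during training.

```python
import numpy as np

def patchify_and_embed(image, patch_size=16, embed_dim=768):
    """Split an HxWxC image into non-overlapping patches, flatten each patch,
    project it to embed_dim, and add a positional embedding (toy version)."""
    H, W, C = image.shape
    n_h, n_w = H // patch_size, W // patch_size
    patches = (image[:n_h * patch_size, :n_w * patch_size]
               .reshape(n_h, patch_size, n_w, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n_h * n_w, patch_size * patch_size * C))  # (N, 16*16*3)
    W_proj = np.random.randn(patches.shape[1], embed_dim) * 0.02  # learned in a real model
    pos_embed = np.random.randn(n_h * n_w, embed_dim) * 0.02      # learned in a real model
    return patches @ W_proj + pos_embed                           # (N, embed_dim)

image = np.random.rand(224, 224, 3)
tokens = patchify_and_embed(image)
print(tokens.shape)  # (196, 768): 196 patch "words" make up one image "sentence"
```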
At face value, these concepts do not seem to be so revolutionary. Dense or fully-connected layers were implemented as a part of Multilayer Perceptrons (MLP), the earliest proof-of-concept of neural networks. Similarly, the preprocessing needed for image vectorization is, fundamentally, just a form of mathematical embedding learned by a neural network.
Why then should ViT be challenging to run on my SoC's NPU or AI accelerator?
Why is it so hard to get ViT to run on my AI accelerator?
To understand why it's so hard to run ViT on most AI accelerators, we need to understand:
- the sequence and type of operators that make up a transformer encoder, and
- the architectural assumptions made by NPU and AI accelerator designers.
Transformer Encoder vs. CNN
Earlier, in Figure 4, we looked at the pre-processing needed to adapt three-dimensional image data to work with a transformer architecture. Below, in Figure 5, we zoom out to see what happens after the image data is preprocessed:
Specifically, we want to look at the Transformer Encoder block on the right side of the image above. These encoder blocks are stacked `L` times for different sizes of ViT models, just as ResNet-18 and ResNet-50 are the same architecture with different numbers of stacked residual blocks.
The key difference to note between the ViT Transformer Encoder block and most CNN blocks is that it has Normalization (represented as `Norm` layers in Figure 5) and Softmax layers (used inside the Multi-Head Attention block to normalize the attention weights) in the middle of the network. In almost all CNN architectures, normalization is performed once at the beginning of inference, and Softmax is performed once at the end of the network.
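The sketch below shows, in NumPy, roughly what one pre-norm ViT encoder block computes (single-head attention only, with toy weight shapes, so it is a simplification of the real multi-head block). The comments flag the normalization and softmax steps that sit in the middle of the network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Mean, variance, square root, division: none of these are MAC operations.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    # Exponentials and a division per element: also not MAC operations.
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One simplified pre-norm encoder block: Norm -> self-attention -> residual,
    then Norm -> MLP -> residual."""
    h = layer_norm(x)                                    # Norm in the middle of the network
    q, k, v = h @ Wq, h @ Wk, h @ Wv                     # MAC-friendly matrix multiplies
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v   # Softmax in the middle of the network
    x = x + attn @ Wo                                    # residual connection
    h = layer_norm(x)                                    # another Norm
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))
    return x + gelu(h @ W1) @ W2                         # MLP + residual

tokens = np.random.randn(197, 768)                       # 196 patches + 1 class token (illustrative)
Wq, Wk, Wv, Wo = (np.random.randn(768, 768) * 0.02 for _ in range(4))
W1, W2 = np.random.randn(768, 3072) * 0.02, np.random.randn(3072, 768) * 0.02
out = encoder_block(tokens, Wq, Wk, Wv, Wo, W1, W2)      # same shape in, same shape out
```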
Normalization and Softmax are simple enough mathematical operations, but in the context of DNNs they operate on large tensors. The challenge they pose to many AI SoCs targeting the edge is that they cannot be accelerated by linear algebra accelerators and, in heterogeneous compute platforms, must instead be processed by a DSP, GPU, or CPU.
Architectural Assumptions Made by NPU Designers
Heterogeneous compute nodes are computing devices with different architectures optimized for specific tasks, e.g., an AI SoC might include a CPU, a DSP, and a Neural Processing Unit (NPU) like the design on the left in Figure 6 below:
Heterogeneous computing, as a design principle for AI, requires that programs be segmented into their component tasks and that each task target its most optimal compute node at runtime. If a task is programmed or compiled incorrectly to target an inefficient compute node, e.g., the CPU instead of the AI accelerator, the runtime performance of the program can suffer greatly.
Heterogeneous computing platforms, and the NPU cores used within them, have been optimized for performance on most Convolutional Neural Networks (CNN). Since most CNNs do not have any softmax or normalization operators in the middle of the network, most NPUs have been designed to optimize for only the convolutional compute, which is just basic linear algebra.
NPUs have been optimized for the multiply-accumulate (MAC) operations that constitute this linear algebra with great success, and heterogeneous computing platforms that use these NPUs have excelled at running CNNs because there's very infrequent, if any, data movement between compute nodes during inference. The entire inference program can be easily pipelined into three stages:
- Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
- formatted data is off-loaded to the NPU for the multiply-accumulate (MAC), linear algebra operations like convolutions and fully-connected layers, and
- convolutional outputs are off-loaded to the GPU or DSP for softmax activation.
Heterogeneous computing platforms can hide most of the expensive memory-movement operations in these types of programs by pipelining the compute. Latency, the time it takes to produce the first inference result, may be long, but throughput, the rate at which inferences complete on average, is limited only by the slowest stage in this pipeline.
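A toy model of this three-stage pipeline, using made-up stage times rather than measurements from any real platform, shows why latency is the sum of the stages while throughput is set only by the slowest one:

```python
# Illustrative stage times for a CNN-style pipeline (not measured on real hardware).
stages_ms = {"pre-process (DSP/GPU)": 2.0, "MAC layers (NPU)": 8.0, "softmax (DSP/GPU)": 1.0}

latency_ms = sum(stages_ms.values())                 # first result after 11.0 ms
throughput_fps = 1000.0 / max(stages_ms.values())    # steady state: 125 inferences per second
print(latency_ms, throughput_fps)
```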
This runtime strategy, when applied to ViT architectures, creates a pipeline that requires frequent data movement between the different compute nodes:
- Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
- formatted data is off-loaded to the NPU for the linear projection of image patches into two dimensions,
- image patches are sent back to the GPU or DSP for normalization,
- normalized patches are sent back to the NPU for attention mapping,
- back to the GPU or DSP for normalization,
- back to the NPU for the MLP layer,
- back to the GPU or DSP for Softmax activation,
- …
- Repeat steps 3–7 for `L` stacked transformer encoder blocks. (The ViT-Base model has `L=12`, ViT-Large has `L=24`, and ViT-Huge has `L=32`.)
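As a rough illustration of how the hand-offs accumulate, assume one cross-node tensor transfer per step listed above (an assumption for the sake of arithmetic, not a measurement):

```python
# Back-of-the-envelope count of cross-node tensor hand-offs per inference.
L = 12                               # encoder blocks in ViT-Base
cnn_handoffs = 2                     # pre-process -> NPU -> post-process
vit_handoffs = 2 + 5 * L             # steps 3-7 bounce tensors five times per encoder block
print(cnn_handoffs, vit_handoffs)    # 2 vs. 62 transfers between compute nodes
```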
This frequent movement of intermediate tensors between different compute nodes requires complex scheduling algorithms and introduces significant overhead. The overhead of moving data between compute nodes substantially reduces the runtime efficiency of the model and burns excessive power. In AI SoCs targeting power-sensitive edge applications, those extra memory-movement operations may render the system unviable.
Optimizing AI SoCs for performance on CNNs has bred a lack of curiosity about how to accelerate inference more broadly. Heterogeneous computing platforms use existing hardware IP and optimize it for performance on AI tasks using complex software tricks. The only new hardware block that has been invented to address AI applications is the NPU, and it was assumed that the only operations it would need to accelerate were the multiply-accumulate (MAC) operations that make up the convolutional and dense layers in the middle of CNNs. The pervasiveness of this mindset can be seen in some NPU developers reporting model complexity as a number of multiply-accumulate (MAC) operations. If MAC counts alone were indicative of a model's complexity, ViTs would not be so challenging to run on AI SoCs that are optimized with these assumptions.
Conclusion
After reading this article, hopefully, you've come to appreciate three things:
- ViTs are a clever adaptation of the popular transformer architecture that works for Computer Vision applications,
- ViTs employ a unique permutation of common ML operators in comparison to the previously most popular CV architecture, the Convolutional Neural Network (CNN), and
- ViTs are problematic for heterogeneous AI SoCs with NPUs that were designed to only accelerate multiply-accumulate (MAC) operations.
One piece of context we have not yet added: the original ViT is already outdated. The hard truth is that the original ViT, introduced to the world in 2021, will be remembered the way the AlexNet architecture is remembered: as a proof-of-concept. AlexNet got everyone excited about the potential of CNNs, and it was quickly improved upon.
Similarly, numerous variants of ViT have already improved upon the original architecture in the two short years since transformers were proven viable for CV applications.
One of the greatest challenges facing hardware designers today is making an AI SoC future-proof, because no one can predict which artificial neural network architectures will become the most popular among developers in the future.
To ensure that your AI hardware remains relevant, you need hardware that is generic to the compute problem-space of AI and not over-optimized for the current state-of-the-art solutions, i.e., Convolutional Neural Networks (CNN). If you're worried about your NPU being rendered extinct by the next wave of DNN architectures, ask your in-house NPU team or third-party provider:
- Can you run Vision Transformers (ViT) or similar model architectures with non-MAC compute layers like normalization, softmax, and patch creation interleaved with convolutional and dense layers?
- Are your tensor transformation operators, like resize and transpose, easily programmable and easily parallelized?
- If tensor transformation operations are found in the middle of a model, like those proposed in Swin transformers, does that significantly hurt compute efficiency?
- Can you easily run models that are quantized asymmetrically to maximize the effective range of your lower-precision datatypes?
- Can developers easily program these algorithms for your NPU or system in a user-friendly language like C++? Or must they write machine assembly or use specific intrinsics to optimize for your hardware?
Quadric is solving this problem by defining a new, hybrid architecture capable of running scalar, vector, and matrix instructions. Our Chimera General-Purpose NPU (GPNPU) processors are designed to be a single-processor solution for all AI/ML compute. A single Chimera core can handle image pre-processing, inference, and post-processing. Because all computing is handled in a single core with a shared memory hierarchy, no data movement is needed between compute nodes for different types of ML operators.
Always having intermediate tensor data in local memory gives our Chimera Graph Compiler (CGC) an enormous amount of flexibility for operator fusion, which further reduces memory-movement overhead and improves program efficiency. In short, GPNPUs deliver the matrix-optimized performance you expect from a CNN-optimized compute engine plus the ability to compute non-MAC operations, all in a single processor architecture. Further, these hardware capabilities are easily accessible to software developers via C++ libraries.
This design approach has enabled us to run an int8-quantized version of ViT-base-patch16-224 in real time at ~1.2W.
If youβre a hardware designer looking to enable ViTs for your developers today and the DL architectures of the future tomorrow, consider signing up for a Quadric DevStudio account to learn more about the Chimera GPNPU processor IP from Quadric.
Published via Towards AI