Have… You Met the Vision Transformer?
Last Updated on January 3, 2025 by Editorial Team
Author(s): Kim Hyun Bin
Originally published on Towards AI.
Technology never stops evolving.
I believe Vision Transformers are a prime example of this idea. Four years after the Transformer architecture and its attention mechanism were introduced, mainly for translation tasks at the time, a group of researchers found a way to apply the same architecture to a different task: computer vision.
We have to understand that until Vision Transformers were introduced, Convolutional Neural Networks (CNNs) were the cornerstone of the computer vision field. They gave birth to the almighty ResNet architecture. Convolution was an excellent way for a neural network to consider neighboring pixels: early layers learn simple local features such as edges, and deeper layers combine them into progressively more complex patterns.
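To make "considering neighboring pixels" concrete, here is a minimal sketch of a single convolution in plain Python. The 5x5 image, the hand-picked vertical-edge kernel, and the helper name `conv2d_valid` are all assumptions for illustration; a real CNN learns its kernel weights during training and stacks many such layers.

```python
# Minimal sketch of one 3x3 convolution (technically cross-correlation, which
# is what most deep learning libraries compute), with no padding and stride 1.
# The image, kernel, and function name are hypothetical, for illustration only.

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; each output value is a weighted sum
    of one small neighborhood of pixels."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Only the kh x kw window starting at (i, j) contributes.
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy grayscale image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3], [0, 3, 3]]
```

The first output column is 0 because its 3x3 window sees only dark pixels, while the other two columns respond because their windows straddle the edge. Each output value depends only on a small local neighborhood, which is the locality that convolution builds in.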
However, once Transformers arrived, everyone was drawn to the computational efficiency and scalability of the architecture, especially with hardware such as GPUs available to support it. This is where this group of researchers was able to bring Transformers into computer vision, introducing a new state-of-the-art architecture.
In this article, I will explain the architecture of Vision Transformers and how they work. I will also showcase some code to…