Transformers for Videos
Last Updated on January 3, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Note: this post continues an earlier blog post in which I discussed different models for the video classification task.
A video is nothing but a sequence of images, so to make use of this sequence information researchers wanted to try sequence models such as RNNs, LSTMs, GRUs and Transformers on video datasets !!
Since the transformer is the most prominent sequence model, below I will only discuss transformers, but you can design similar architectures with other sequence models too !!
Taking inspiration from the vision transformers that we saw in the image classification task, researchers designed an analogous architecture for videos, in which attention is calculated across the patches of all the frames.
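To make the input to such a model concrete, here is a minimal NumPy sketch of how a video could be turned into patch tokens. The function name `video_to_patch_tokens`, the toy sizes, and the choice of per-frame 2D patches are my assumptions for illustration; the original diagram may tokenize the video differently.

```python
import numpy as np

def video_to_patch_tokens(video, patch):
    """Split a (T, H, W, C) video into flattened, non-overlapping patches.

    Returns an array of shape (T * N, patch * patch * C), where
    N = (H // patch) * (W // patch) is the number of patches per frame.
    Hypothetical helper, not taken from any library.
    """
    T, H, W, C = video.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch"
    gh, gw = H // patch, W // patch
    # (T, gh, patch, gw, patch, C) -> (T, gh, gw, patch, patch, C)
    x = video.reshape(T, gh, patch, gw, patch, C).transpose(0, 1, 3, 2, 4, 5)
    # One row per patch; frames are laid out one after another.
    return x.reshape(T * gh * gw, patch * patch * C)

# Toy example: 8 frames of 64x64 RGB, cut into 16x16 patches.
video = np.arange(8 * 64 * 64 * 3, dtype=np.float32).reshape(8, 64, 64, 3)
tokens = video_to_patch_tokens(video, patch=16)

num_tokens = tokens.shape[0]        # 8 frames * 16 patches per frame
attention_pairs = num_tokens ** 2   # every token attends to every token
print(tokens.shape, attention_pairs)
```

If one transformer attends over all of these tokens at once, the number of query-key pairs grows with the square of (frames x patches per frame), which is where the cost comes from.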
The issue with the above architecture is that far too much attention is calculated, so the model takes a long time to compute, and researchers wanted to reduce this computation. They proposed more efficient architectures, and one of them was proposed by Google in 2021 in its paper ViViT: A Video Vision Transformer. First, let us understand the intuition behind this architecture.
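A rough back-of-the-envelope count shows why splitting attention into a spatial part and a temporal part helps. The sketch below counts query-key pairs only, for one layer and one head. It assumes that spatial attention runs among the N patches of one frame and temporal attention runs among the T frames at one patch position. The function names and the sizes T = 8, N = 196 are my own illustrative choices.

```python
def joint_attention_pairs(T, N):
    """All T * N tokens attend to each other."""
    return (T * N) ** 2

def factorised_attention_pairs(T, N):
    """Spatial attention within each frame plus temporal attention per patch position."""
    spatial = T * N * N    # T frames, each with an N x N attention map
    temporal = N * T * T   # N patch positions, each with a T x T attention map
    return spatial + temporal

# Assumed sizes: 8 frames; 224x224 frames with 16x16 patches give 14 * 14 = 196 patches.
T, N = 8, 196
joint = joint_attention_pairs(T, N)
factorised = factorised_attention_pairs(T, N)
print(joint, factorised, joint / factorised)
```

With these assumed sizes the joint version needs 2,458,624 pairs against 319,872 for the factorised version, so roughly 7.7 times fewer, and the gap widens as the number of frames grows.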
Now, to implement this, the architecture diagram looks something like the following ...
In the above architecture I have only shown 1 spatial transformer and 1 temporal transformer, but you can add more of these transformer layers to improve your system. Just remember:
- The spatial positional encoding and the temporal positional encoding are inputs only to the very first spatial transformer and the very first temporal transformer respectively !!
- Also, you will have to reshape the inputs every time before passing them to a spatial transformer or a temporal transformer, to make sure the attention is calculated correctly as shown in the intuition.
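The two points above can be sketched in NumPy as follows. This is only a toy under my own assumptions: tokens are stored as `(B, T, N, D)` (batch, frames, patches per frame, embedding size), the "transformer" is reduced to a single-head self-attention with no learned weights, and all function names are hypothetical. The point is just the reshaping and where the positional encodings enter.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention over axis 1 of x, shape (batch, seq, dim).

    No learned projections: queries, keys and values are all x itself.
    """
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)         # (batch, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                                      # (batch, seq, dim)

def spatial_block(x):
    """x: (B, T, N, D). Attention among the N patches of each frame."""
    B, T, N, D = x.shape
    y = self_attention(x.reshape(B * T, N, D))   # frames folded into the batch axis
    return y.reshape(B, T, N, D)

def temporal_block(x):
    """x: (B, T, N, D). Attention among the T frames at each patch position."""
    B, T, N, D = x.shape
    y = x.transpose(0, 2, 1, 3).reshape(B * N, T, D)   # patch positions folded into batch
    y = self_attention(y)
    return y.reshape(B, N, T, D).transpose(0, 2, 1, 3)

def encoder(x, spatial_pos, temporal_pos, depth=2):
    """Alternate spatial and temporal blocks `depth` times.

    spatial_pos: (N, D), added only before the first spatial block.
    temporal_pos: (T, D), added only before the first temporal block.
    """
    x = spatial_block(x + spatial_pos[None, None, :, :])
    x = temporal_block(x + temporal_pos[None, :, None, :])
    for _ in range(depth - 1):
        x = temporal_block(spatial_block(x))   # later layers get no positional encoding
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 6, 8))              # B=2, T=4, N=6, D=8
out = encoder(x, rng.normal(size=(6, 8)), rng.normal(size=(4, 8)), depth=2)
print(out.shape)
```

Folding the frames (or the patch positions) into the batch axis is what keeps the attention restricted: inside `spatial_block` a token never sees another frame, and inside `temporal_block` it never sees another patch position.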