Review of Multimodal Technologies: ViT Series (ViT, Pix2Struct, FlexiViT, NaViT)
Author(s): tangbasky Originally published on Towards AI. In computer vision, CNNs have long been the dominant architecture for extracting visual features. Meanwhile, the Transformer has achieved great success in NLP, which has encouraged some researchers to …
Qwen2.5-VL: A hands-on code walkthrough
Author(s): tangbasky Originally published on Towards AI. Twin article: "Qwen2-VL: A hands-on code walkthrough" (understand the working mechanism of multimodal LLMs, medium.com). Qwen-VL is difficult to understand for first-time readers. The key barrier lies not in …
The Comparison between the Encoder and the Decoder
Author(s): tangbasky Originally published on Towards AI. This article discusses the advantages and disadvantages of large language models built on encoder and decoder architectures. Both architectures derive from the Transformer model. Initially, this encoder-decoder architecture was …
The Evolution of GRPO: DAPO
Author(s): tangbasky Originally published on Towards AI. Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) is a reinforcement learning optimization algorithm. To thoroughly understand DAPO, we trace its development step by step: PPO -> GRPO -> DAPO. Proximal Policy …