Can Mixture of Experts (MoE) Models Push GenAI to the Next Level?

Last Updated on August 8, 2024 by Editorial Team

Author(s): Nick Minaie, PhD

Originally published on Towards AI.

Having worked in the AI/ML field for many years, I vividly recall the early days of GenAI when creating even simple coherent text was a Herculean task. I worked on a project where we had to generate summaries of large sales documents, and I’ll never forget the puzzled look on the client’s face when our model spat out awkward, incoherent summaries. Those were challenging times, but they also taught us a lot. Fast forward to today, and it’s unbelievable how far we’ve come over the past couple of years. Now, we have models that can write like humans, create breathtaking images, and even compose music. However, these advancements come with their own set of challenges. GenAI models still struggle with scalability, require massive computational power, and often fall short when tackling diverse tasks. These hurdles are significant roadblocks to achieving what we dream of as Artificial General Intelligence (AGI). But our journey is far from over.

If you’re interested in the fascinating journey towards AGI, you might enjoy reading my article: The Quest for Artificial General Intelligence (AGI): When AI Achieves Superpowers

In my experience leading AI/ML teams on large-scale projects, I’ve discovered that one of the most promising solutions to these challenges is the Mixture of Experts (MoE) model. Picture a team of specialized experts, each excelling in specific tasks, working seamlessly together, guided by a system that knows precisely which expert to deploy and when. This is the essence of MoE models. Although the concept was introduced in 1991 by Jacobs et al., it’s only now, with today’s powerful GPUs and vast datasets, that we can fully understand and leverage its potential. As generative AI continues to evolve, the ability of MoE models to employ specialized sub-models for different tasks makes them incredibly relevant. So, let’s dive deep into what MoEs are and how they are leveraged in language, vision, and recommender models.

Generated by Midjourney (Digiguru)

Understanding Mixture of Experts (MoE) Models

Over the past few years, we’ve witnessed the rise of ever-larger models, each striving to surpass the previous best in various benchmarks. However, it appears that these GenAI models eventually hit a plateau, and moving the needle becomes even more challenging. In my opinion, the more recent GenAI models face significant challenges in scalability, computational efficiency, and generalization. MoE models offer a solution by using multiple specialized sub-models, or ‘experts,’ each handling different aspects of a task. This approach not only optimizes performance but also ensures efficient resource utilization, distinguishing MoE models from traditional monolithic AI models.

Technical Architecture of MoE Models

Let’s take a closer look at the architecture of a typical MoE model. Imagine a team of experts, each one a specialist in a particular area. These experts are the specialized neural networks. Then, there’s the gating network, like a smart manager who knows exactly which expert to call on based on the task at hand. Finally, the combiner acts like a project coordinator, pulling together the outputs from each expert into a seamless, cohesive result (not like my document summarization project a few years ago!).

Mixture of Experts (MoE) Concept from Mixtral of Experts paper (https://arxiv.org/pdf/2401.04088)
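
To make these three roles concrete, here’s a minimal PyTorch sketch of a dense MoE layer. The class and dimension names are my own illustration (not the implementation of any particular paper), and it leaves out the sparse routing and load-balancing tricks that real MoE systems rely on.

```python
# Minimal sketch of a dense Mixture of Experts layer (illustrative only;
# real MoE systems add sparse routing, load balancing, and parallelism).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # The "experts": independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network: scores each expert for every input.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). The gate produces a probability per expert.
        weights = F.softmax(self.gate(x), dim=-1)                          # (batch, num_experts)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, d_model)
        # The "combiner": weighted sum of the expert outputs.
        return torch.einsum("be,bed->bd", weights, expert_outputs)

moe = SimpleMoE(d_model=64, d_hidden=256, num_experts=4)
out = moe(torch.randn(8, 64))   # (8, 64)
```

In this dense version every expert processes every input and the gate merely weights their outputs; the sparse variants we’ll look at next activate only one or two experts per token, which is where the big efficiency gains come from.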

The MoE concept isn’t limited to just the Transformer architecture; it can be applied to various neural network setups. However, its most exciting recent applications have been with Transformer-based models. The Transformer architecture, introduced back in 2017, revolutionized AI, particularly language models. Transformers use a lot of computational power to handle massive datasets and parameters. MoE models build on this by enhancing the architecture. Transformers use self-attention mechanisms to figure out which parts of the input data are most important. By integrating MoE, the feed-forward layers in each Transformer block can call on multiple experts. The gating network acts like a dispatcher, directing each piece of data to the right expert, optimizing both efficiency and performance. MoE in Transformers is illustrated below.

MoE layer in Transformer from the Switch Transformers paper (https://arxiv.org/abs/2101.03961)
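
To show the dispatcher idea in code, here’s a simplified sketch of top-k sparse routing for the MoE feed-forward layer of a Transformer block. Take it as a rough approximation: it drops the capacity limits, load-balancing auxiliary loss, and expert parallelism that Switch Transformer-style systems actually need, and the function and variable names are mine.

```python
# Illustrative top-k sparse routing for a Transformer MoE feed-forward layer.
# Simplified: no expert capacity limits, no load-balancing auxiliary loss.
import torch
import torch.nn.functional as F

def sparse_moe_ffn(tokens, gate, experts, k=1):
    """tokens: (num_tokens, d_model); gate: nn.Linear(d_model, num_experts);
    experts: list of feed-forward modules mapping d_model -> d_model."""
    logits = gate(tokens)                              # (num_tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)       # each token picks its k best experts
    weights = F.softmax(topk_vals, dim=-1)             # renormalize over the chosen experts
    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = topk_idx[:, slot] == e              # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
    return out

# Usage with k=1 (Switch-style routing: one expert per token)
d_model, num_experts = 64, 8
gate = torch.nn.Linear(d_model, num_experts)
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 256), torch.nn.ReLU(),
                               torch.nn.Linear(256, d_model)) for _ in range(num_experts)]
y = sparse_moe_ffn(torch.randn(16, d_model), gate, experts, k=1)
```

Because only the selected experts run for each token, the parameter count can grow with the number of experts while the compute per token stays roughly constant.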

MoE in Language Models

Some of my favorite uses of MoE are in language models. These models have experts specializing in different linguistic features, like syntax, semantics, and sentiment analysis. For instance, if an MoE model is processing a complex sentence, it might send tricky phrases to syntax experts and emotional words to sentiment experts. This not only makes the model more efficient but also more accurate. One standout example is Google’s Switch Transformer, which uses this approach brilliantly.

MoE in Vision Models

What’s my next favorite topic? Yes, vision! Vision models apply similar principles. Vision Transformers (ViTs) break down an image into smaller patches, processing each one independently. In an MoE-enhanced ViT, the gating network evaluates each patch and assigns it to the most suitable expert based on characteristics like texture, color, shape, and motion. This selective activation allows MoE models to handle high-resolution images and large datasets efficiently, making them highly effective for tasks like image classification and object detection. Vision MoE (V-MoE) is a good example of this approach.
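
Here’s a rough sketch of what patch-level routing with limited expert capacity could look like. It’s a deliberate simplification in the spirit of V-MoE’s priority routing, with hypothetical names and sizes, not the published implementation.

```python
# Illustrative capacity-limited routing over image patches (inspired by V-MoE;
# a simplification of its batch-priority routing, not the published code).
import torch

def route_patches_with_capacity(patch_embeddings, gate, num_experts, capacity):
    """patch_embeddings: (num_patches, d_model). Each patch goes to its top-1 expert,
    but every expert accepts at most `capacity` patches; the rest are skipped."""
    scores = gate(patch_embeddings)                 # (num_patches, num_experts)
    top_score, top_expert = scores.max(dim=-1)
    assignments = {e: [] for e in range(num_experts)}
    # Higher-scoring patches get priority for the limited expert slots.
    for patch_idx in torch.argsort(top_score, descending=True).tolist():
        e = top_expert[patch_idx].item()
        if len(assignments[e]) < capacity:
            assignments[e].append(patch_idx)
    return assignments   # expert id -> list of patch indices it will process

gate = torch.nn.Linear(64, 4)
assign = route_patches_with_capacity(torch.randn(196, 64), gate, num_experts=4, capacity=64)
```

Patches that don’t fit within any expert’s capacity simply bypass the MoE layer (via the residual connection), which is one way these models keep compute bounded on high-resolution inputs.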

MoE in Recommender Systems

Recommender systems are making a comeback to the front row thanks to applications of Mixture of Experts (MoE). Traditional recommendation algorithms often struggle with personalization and scalability. MoE models address this by using specialized experts, each focusing on different user behaviors and preferences, for example, short-term interests vs. long-term habits, leading to a better user experience. Multi-gate MoE (MMoE), illustrated below, is a successful implementation of this concept for recommender systems. This architecture enhances multi-task learning by sharing expert submodels across all tasks, with a gating network trained to optimize performance for each specific task.

Multi-gate MoE model from Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts (https://dl.acm.org/doi/pdf/10.1145/3219819.3220007)
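
To see how the multi-gate idea fits together, here’s a minimal MMoE sketch with shared experts, one gate per task, and one small prediction “tower” per task (think engagement vs. satisfaction). The class and dimension choices are illustrative, not taken from the paper’s implementation.

```python
# Minimal sketch of Multi-gate Mixture-of-Experts (MMoE): experts are shared
# across tasks, but each task has its own gating network and prediction tower.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    def __init__(self, d_in, d_expert, num_experts, num_tasks):
        super().__init__()
        # Shared experts used by every task.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU()) for _ in range(num_experts)]
        )
        # One gate per task: each task learns its own mixture over the shared experts.
        self.gates = nn.ModuleList([nn.Linear(d_in, num_experts) for _ in range(num_tasks)])
        # One small prediction head ("tower") per task.
        self.towers = nn.ModuleList([nn.Linear(d_expert, 1) for _ in range(num_tasks)])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, d_expert)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1)                             # (batch, experts)
            mixed = torch.einsum("be,bed->bd", w, expert_out)
            outputs.append(tower(mixed))                               # one prediction per task
        return outputs

model = MMoE(d_in=32, d_expert=64, num_experts=4, num_tasks=2)
engagement_pred, satisfaction_pred = model(torch.randn(8, 32))
```

Because each task learns its own mixture over the shared experts, loosely related tasks can share representations without being forced into a single compromise weighting, which is the core idea behind MMoE.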

Some of the Noteworthy MoE Models (As of August 2024)

Now that we’ve explored what MoE models are and how they help scale GenAI, let’s take a look at some of the most noteworthy MoE models that have been widely adopted by the AI community.

Mixtral 8x7B by Mistral AI made a big splash back in December 2023, when it was released with stunning evaluation metrics. It is an advanced MoE model comprising eight distinct expert modules, each with 7 billion parameters (hence the name 8x7B). Its performance has set a new benchmark in the field.

Switch Transformers was developed by Google and released back in 2021. It employs an MoE approach to achieve impressive scalability with a 1.6 trillion parameter model ᕦ( ͡° ͜ʖ ͡°)ᕤ. It uses a sparse activation method, where only a subset of experts is activated for each input.

V-MoE (Vision Mixture of Experts) was developed for computer vision tasks and released in 2021, and what I love about it is that it applies the MoE architecture to Vision Transformers (ViT). It partitions images into patches and dynamically selects the most appropriate experts for each patch.

GShard, another offering from Google, is a framework for scaling large models efficiently using MoE. It allows for the training of models with up to trillions of parameters (⌐■_■) by dividing them into smaller, specialized expert networks.

Z-code is Microsoft’s initiative that leverages the MoE architecture for natural language processing tasks, such as translation. It scales to massive numbers of model parameters while keeping computational requirements constant, enhancing efficiency and performance.

MMoE (Multi-Gate Mixture of Experts) was proposed by Google researchers for YouTube video recommendation systems back in 2018, and it uses multiple gating networks to optimize predictions for different user behaviors, such as engagement and satisfaction, improving the accuracy of recommendations.

If you’ve had experience with any other MoE models, I’d love to hear about them! Feel free to share your thoughts in the comments below.

Final Thoughts …

Mixture of Experts (MoE) models are a game-changer for GenAI. I’ve watched AI grow from simple tasks to creating complex art and text, but it hits a wall with scalability and efficiency. MoE models offer a smart way around this by using specialized experts that handle different parts of a task, making everything faster and more efficient. MoE models have been applied to LLMs, computer vision, and recommender systems, improving accuracy and speed while reducing computational load.

I believe that as generative AI continues to evolve, the role of MoE models will become even more crucial. We might soon see these models tackling even more complex tasks with ease, pushing the boundaries of what we thought possible to the next level. BUT, WHAT IS THE NEXT-LEVEL AI? ¯\_(ツ)_/¯ Only time will tell.

Published via Towards AI
