

Most Powerful Machine Learning Models Explained (Transformers, CNNs, RNNs, GANs …)

Last Updated on August 30, 2023 by Editorial Team

Author(s): Oleks Gorpynich

Originally published on Towards AI.

Midjourney Generated Image

Machine Learning is a massive field, and it’s understandably difficult to find a source that gives an overview of which models and techniques are at the bleeding edge right now. That being said, this article is a conceptual exploration rather than a rigorous scientific analysis of each of these models, and I recommend diving deeper into each one if possible. I’d also like to provide examples of where these models are used because, in my opinion, theory should always be tied to practice. If I miss any information, please feel free to provide feedback and request more.

Before I begin, here is the list of the models which will be covered.

  1. CNNs (Convolutional Neural Networks)
  2. RNNs (Recurrent Neural Networks)
  3. Transformers
  4. GANs (Generative Adversarial Networks)


A CNN (Convolutional Neural Network) is a type of neural network (Arthur Arnx) that works well on data with a grid-like topology, such as images, and is modified to be better at pattern detection. So how is it different?
Well, to start, let me give you a brief general overview of what a neural network is.

A neural network, in short, is a “map” of nodes that process input data and produce an output. It consists of layers that map one set of nodes into another, propagating and transforming the input into a result. The propagation is done through weights, which are the values changing our input at each propagation step to produce the result we desire. After each propagation step, a bias is applied. The weights and biases are what we are really looking for: they are the numbers changed when training the network.
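To make the idea of weights and biases concrete, here is a minimal sketch of a forward pass through two layers in plain Python. The weight and bias values are hand-picked for illustration (training would normally find them), and the `tanh` nonlinearity applied after each layer is an assumption of this sketch, since the description above does not name an activation function.

```python
import math

def dense_layer(inputs, weights, biases):
    """Map one set of nodes to the next: weighted sum, add bias, squash with tanh."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(math.tanh(total))  # tanh is an assumed activation
    return outputs

# Hypothetical, hand-picked parameters; training would adjust these numbers.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 nodes, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[1.0, -1.0, 0.5]]                      # 1 node, 3 inputs
output_biases = [0.2]

def forward(inputs):
    """Propagate the input through the hidden layer, then the output layer."""
    hidden = dense_layer(inputs, hidden_weights, hidden_biases)
    return dense_layer(hidden, output_weights, output_biases)

result = forward([1.0, 2.0])
print(result)
```

Real libraries express the same computation as matrix multiplications, which is much faster, but the arithmetic is the same.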

Figure 1 (source)

How is a CNN special though? Well, what separates a CNN is that it utilizes convolutional layers in its layer stack. That's not to say a CNN cannot have other types of layers (it usually does), but convolution makes it special. Here is how this layer works.

If you imagine each pixel in your image to be a brightness value, your image becomes just a 2D matrix of numbers. Convolution will take this matrix and produce an output matrix by applying a kernel to it. A kernel is just a smaller matrix that acts like a filter for every area in your image.

The smaller matrix “steps” through the bigger matrix of the image to produce an output matrix.
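The stepping process can be sketched in a few lines of plain Python. This is a simplified version that assumes a stride of 1 and no padding, and the tiny image and the vertical-edge kernel are made-up values for illustration. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the output matrix."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the kernel with the patch under it and sum the products.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 4x4 "image": dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# A kernel that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The output is bright only in the middle column, exactly where the dark-to-bright edge sits in the input.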

Figure 2 (source)

There are a few key ideas here.

  1. The kernel is in a way applied to every pixel of the image and its surrounding area, but it stays the same throughout. This is because a kernel is meant to detect a pattern or feature within the area of that pixel.
  2. Kernels are usually considerably smaller than the image itself which helps with training greatly.
  3. The idea behind kernels is that any image is just a set of patterns we can break down. For instance, say we have a face. Imagine that there is a kernel which is able to detect circles. Its output will contain 2 bright spots near the top of the image (the eyes that it detected). Now imagine that there is another that can detect two lines close together. The output will have a bright spot near the bottom (the mouth that it detected). Finally, imagine that the final kernel applied can detect this formation of 2 circles and the bottom 2 lines. Well, then it would recognize the face.
  4. A convolutional layer can have multiple of these kernels applied to produce multiple new images. These are then stacked together and fed forward in the network. Then another convolutional layer would have another set of kernels to be applied.
  5. CNNs usually also contain pooling layers which are there to reduce image size and complexity.
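As a rough illustration of the last point, here is a sketch of one common kind of pooling, max pooling, which keeps only the largest value in each window. The choice of max pooling with non-overlapping 2x2 windows is an assumption of this sketch (other pooling types exist), and the feature map values are made up.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4x4 input shrinks to 2x2, while the strongest response in each region survives.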

Obviously, there's a lot more detail and math which I missed here, but the main intuition behind CNNs lies in the kernel.

Some popular tools and products that use CNNs include Google Photos, DeepMind’s AlphaGo, and Tesla’s Autopilot systems.


As you’ve seen CNNs are primarily used for image processing. RNNs on the other hand are used mostly for NLP (natural language processing) and some other domains such as time series analysis. To understand the architecture behind RNNs, let’s first highlight some problems with using a simple neural network for NLP.
Let’s look at a standard NLP problem: text autocomplete. The input to our model is a piece of text and the output is another piece of text. The first issue is that our input has a variable size (it can be a few words or a lot of words), while simple neural networks typically have a fixed input size. The other issue is capturing the complex relationships between words in our input to produce the correct output. Remember, there are thousands of words in the English language, and changing the order of the words in a sentence doesn’t necessarily change its meaning. So, how do you ensure that the sentence “The fluffy cat came here on Sunday” is treated as similar to “On Sunday, the cat which was fluffy came here”, but different from “The Sunday came here on a fluffy cat”?

The intuition behind RNNs comes from how information flows through them. Let’s take a sentence for example and see how an RNN would process it — “The cat eats”.

Let’s look at the sentence as a sequence of words: “The”, “cat”, and “eats” (in reality it would probably be represented as a sequence of numbers or vectors). The RNN will now process this sequence…sequentially (which is where the “Recurrent” part of the name comes from). First, it will take in the word “The” and produce some output x1 by piping “The” through its own set of weights and biases. Then, it will take x1 and the next word in the sequence, “cat”, and pipe them through this same set of weights and biases to get the next output x2. Then, it will take x2 and the next word in the sequence, “eats”, to get the next output x3. In this way, you can see how the RNN takes its previous output and the next input to produce some new output. The current “state” of the RNN is called the hidden state. Here is an animation that might help you build this intuition, from a great article written by Michael Phi.

Figure 3 (source)
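The recurrence described above can be sketched in plain Python. This is a minimal version under several assumptions that are not in the description itself: words are encoded as one-hot vectors, the hidden state has two values, `tanh` is the nonlinearity, and the weights are hand-picked rather than trained.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical one-hot word vectors and untrained, hand-picked weights.
vocab = {"The": [1, 0, 0], "cat": [0, 1, 0], "eats": [0, 0, 1]}
W_x = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]  # input -> hidden
W_h = [[0.2, 0.7], [-0.6, 0.3]]             # previous hidden -> hidden
b = [0.0, 0.1]

def run_rnn(words):
    """Feed the words one at a time, reusing the same weights at every step."""
    h = [0.0, 0.0]  # the hidden state starts empty
    states = []
    for word in words:
        h = rnn_step(vocab[word], h, W_x, W_h, b)
        states.append(h)
    return states

states = run_rnn(["The", "cat", "eats"])
for word, state in zip(["The", "cat", "eats"], states):
    print(word, state)
```

Because each step depends on the previous hidden state, feeding the same words in a different order gives a different final state, which is how word order gets captured.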

How can this be used to predict the next word? Well, imagine that each output (x1, x2, x3) actually represents a new word. We can train the network such that each output is a prediction of the word that comes next. So, let’s look at our sentence being processed again.

“The” -> Piped through our model -> produces x1. We train our model such that x1 can be correctly extrapolated into “cat”

“cat” and previous output x1 -> Piped through our model -> produces x2. After training, x2 can be correctly extrapolated into “eats”

“eats” and previous output x2 -> Piped through our model -> produces x3. We find out that x3 now represents the word “tuna”! We can use this for our next “input”

“tuna” and previous output x3 -> Piped through our model -> produces x4… And so on and on
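The loop above, where each predicted word is fed back in as the next input, can be sketched as follows. Everything specific here is an assumption for illustration: the four-word vocabulary, the one-hot encoding, the hand-picked weights (biases are left out for brevity), and the rule of always picking the highest-scoring word. Since the weights are untrained, the generated words are meaningless; the point is only the shape of the loop.

```python
import math

VOCAB = ["The", "cat", "eats", "tuna"]

# Hypothetical untrained weights; a trained model would have learned these.
W_x = [[0.5, -0.3, 0.8, 0.1], [0.1, 0.9, -0.4, 0.6]]        # input -> hidden
W_h = [[0.2, 0.7], [-0.6, 0.3]]                             # hidden -> hidden
W_out = [[0.3, -0.5], [0.9, 0.2], [-0.4, 0.8], [0.1, 0.1]]  # hidden -> word scores

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]

def step(word, hidden):
    """Consume one word, update the hidden state, and predict the next word."""
    x = one_hot(word)
    new_hidden = [
        math.tanh(sum(W_x[i][j] * x[j] for j in range(len(VOCAB)))
                  + sum(W_h[i][k] * hidden[k] for k in range(len(hidden))))
        for i in range(len(W_h))
    ]
    scores = [sum(W_out[v][i] * new_hidden[i] for i in range(len(new_hidden)))
              for v in range(len(VOCAB))]
    predicted = VOCAB[scores.index(max(scores))]  # pick the highest-scoring word
    return predicted, new_hidden

def generate(prompt, extra_words):
    """Read a non-empty prompt, then keep feeding each prediction back in as input."""
    hidden = [0.0, 0.0]
    predicted = None
    for word in prompt:
        predicted, hidden = step(word, hidden)
    words = list(prompt)
    for _ in range(extra_words):
        words.append(predicted)
        predicted, hidden = step(predicted, hidden)
    return words

generated = generate(["The", "cat", "eats"], 2)
print(generated)
```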

The main intuition behind RNNs lies in the fact that

  1. RNNs always keep track of what was seen before through this hidden state and this captures the relationships between words, or any sequential data for that matter.
  2. The same model is applied to every part of the sequence recurrently which makes RNNs feasible to train (as opposed to having a huge model process the whole input at once)

Here you can probably already envision some issues with this approach. Once the text gets long, our initial few words barely contribute to the current hidden state, which isn’t ideal. Additionally, we are forced to do this processing sequentially, so both our processing and training speeds are limited by the algorithm itself.
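The first issue can be seen even in a toy setting. The sketch below uses a one-number hidden state with made-up weights (`w_x = 1.0`, `w_h = 0.5`, both assumptions) and measures how much the final hidden state changes when only the first input is changed, for sequences of increasing length.

```python
import math

def final_hidden(inputs, w_x=1.0, w_h=0.5):
    """Run a one-number RNN over the inputs and return the last hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_input_influence(length):
    """How much does changing only the first input move the final hidden state?"""
    base = [0.1] * length
    changed = [0.9] + [0.1] * (length - 1)
    return abs(final_hidden(changed) - final_hidden(base))

influences = {n: first_input_influence(n) for n in (2, 5, 20)}
for n, value in influences.items():
    print(n, value)
```

With these weights the effect of the first input shrinks at every step, so by 20 steps it has almost vanished from the hidden state.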

Still, there's a lot more to these powerful models, so I encourage you to look deeper!

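Some popular tools and products that have used RNNs include Google Translate (whose 2016 neural translation system was built on LSTMs, a kind of RNN) and Spotify’s recommendation system. Note that OpenAI’s GPT-2, despite also generating text one token at a time, is a transformer rather than an RNN.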


Transformers! The current rage of the Machine Learning world. Both GPT-4 and BERT (Google’s own advanced language model) are based on the transformer architecture. So what are they about?

Well, transformers are mostly used in NLP problems, just like RNNs, and so they must solve the same language-processing issues that I described previously. There are a few key ideas, however, that address these issues differently from RNNs.

  1. Positional Encoding — While the sequence importance in language is naturally preserved in RNNs via their hidden states, Transformers embed this information directly into the input. Positional encodings are added to word embeddings (vector representations of words), ensuring each word’s position in a sentence is captured. So, the representation of “dog” gets modified based on its position in the text.
  2. Huge training dataset size — To utilize the benefit of positional encodings, transformers must be trained on huge datasets. These differences in word order are captured in the data, and so without seeing all sorts of different possibilities in order and word type, the model will function sub-optimally.
  3. Self-attention — The model learns to “care” more about certain words and how they relate to every other word in the input. After all, some words will carry a lot more meaning and power in prediction or translation, especially if grouped together with others. How does it learn to do this? Again, it is through the massive training dataset size and its architecture.
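To illustrate the first idea, here is a sketch of the sinusoidal positional encoding used in the original transformer paper, added to a word embedding. The choice of this particular scheme is an assumption (many models learn their positional embeddings instead), and the four-number embedding for “dog” is made up.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(d_model):
        # Each even/odd pair of indices shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the positional encoding to a word embedding, element by element."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

dog = [0.2, -0.1, 0.4, 0.7]  # hypothetical embedding for "dog"
dog_at_0 = add_position(dog, 0)
dog_at_5 = add_position(dog, 5)
print(dog_at_0)
print(dog_at_5)
```

The same word ends up with a different vector depending on where it appears, which is how position information reaches the rest of the model.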

The architecture behind transformers, however, is a bit more complex and difficult to explain in such a short article. Still, I’ll try to paint a very high-level picture for you. The original transformer is composed of an encoder and a decoder. The encoder consists of a stack of identical layers whose role is to process the text and pass the decoder a representation that captures its most important information. The decoder’s role is to take this representation and produce the output that we desire, using a similar stack of identical layers. Below is a figure showcasing this architecture, with a great article (by Ketan Doshi) linked that elaborates on this topic.

Figure 4 (source)

I strongly recommend diving deeper into this topic, and especially into the “self-attention” mechanism which lies at the heart of transformers.
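As a starting point, here is a heavily simplified sketch of scaled dot-product self-attention in plain Python. It assumes the word vectors themselves serve as queries, keys, and values; a real transformer first passes them through learned projection matrices and uses several attention heads, both of which are left out here. The three two-number word vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """For each word, mix all words together, weighted by how strongly they match it."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current word against every word in the input (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
outputs, attention = self_attention(word_vectors)
for row in attention:
    print(row)
```

Each row of `attention` shows how much one word “cares” about every word in the input. Unlike the RNN, every row can be computed independently, so the whole input can be processed in parallel.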


GANs (Generative Adversarial Networks) are fundamentally about two opposing models competing with each other. GAN most often refers to the learning method for training these models, as opposed to the models themselves. In fact, the architecture of the 2 models is not overly important to the concept itself, as long as one is generative and the other is a classifier.

Let’s begin with describing the standard supervised learning technique for training models.

  1. We feed some input to our model, and it produces some output
  2. We compare this output with the desired output and update the model in some way to do better
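These two steps can be sketched with the smallest model possible: a single weight `w` that predicts `w * x`. The training data (targets equal to twice the input), the squared-error comparison, and the learning rate are all assumptions chosen for illustration.

```python
def train_linear(examples, learning_rate=0.1, epochs=50):
    """Fit prediction = w * x by repeatedly comparing outputs to targets."""
    w = 0.0
    for _ in range(epochs):
        for x, target in examples:
            prediction = w * x            # step 1: feed the input, get an output
            error = prediction - target   # step 2: compare with the desired output
            w -= learning_rate * error * x  # nudge w down the gradient of 0.5 * error**2
    return w

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data: target = 2 * x
w = train_linear(examples)
print(w)
```

After training, `w` ends up very close to 2, the value that maps every input to its desired output.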

An issue arises, however, when we want to create a model that generates realistic output that doesn’t necessarily have to be identical to any sort of output we have (think image generation or music generation). This is where GANs come in. In a GAN we have two models, the Generator Model and the Discriminator Model.

As an example, take a generator model which produces images. We begin by prompting this model to produce a number of fake images and then also find some real images. We then feed a combination of all these images into our discriminator model, which classifies them as real or fake. If our generator model does well, the discriminator model is stumped and will get the answer right only around half the time. Obviously, this won’t be the case at the start (our discriminator is usually pre-trained slightly), and so in a supervised learning fashion (we ourselves know which images are real or fake), we train the discriminator model to do better next time. We can also train the generator model, as we know how well it was able to fool the discriminator model. Training is done when we can generate images that fool the discriminator model around 50% of the time.
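The feedback each model receives can be sketched as two loss functions. This assumes the binary cross-entropy losses from the original GAN formulation (with the commonly used “non-saturating” generator loss), which the description above does not spell out. The scores stand for the discriminator’s estimated probability that an image is real, and the example scores are made up.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Low when real images score near 1 and fake images score near 0."""
    real_term = sum(-math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_loss(fake_scores):
    """Low when the discriminator scores the fake images as real (near 1)."""
    return sum(-math.log(s) for s in fake_scores) / len(fake_scores)

# Early in training: the discriminator easily tells real from fake.
early_d = discriminator_loss([0.9, 0.95], [0.1, 0.05])
early_g = generator_loss([0.1, 0.05])

# End of training: the discriminator is stumped and answers 0.5 for everything.
stumped_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
stumped_g = generator_loss([0.5, 0.5])

print(early_d, early_g)
print(stumped_d, stumped_g)
```

Early on the generator’s loss is high and the discriminator’s is low; as the generator improves, the two move toward the stumped values, which corresponds to the 50% point described above.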

Figure 5 (source)

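Some examples of GANs being put to real-world use are NVIDIA’s StyleGAN (the model behind the “This Person Does Not Exist” face generator) and GAN models available in Runway ML. Midjourney art generation (check out my previous article on art!) and OpenAI’s DALL·E are often mentioned alongside GANs, but DALL·E 2 is a diffusion model and Midjourney has not published its architecture.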


Published via Towards AI
