
Building GPT From First Principles: Code and Intuition
Last Updated on April 22, 2025 by Editorial Team
Author(s): Akhil Shekkari
Originally published on Towards AI.
The main goal of this blog post is to understand each component inside GPT with intuition and be able to code it in plain PyTorch.
Please have a look at the two figures below.
1. Our implementation will heavily follow Figure 1.
2. I will be taking lots of ideas and concepts from Figure 2 (taken from Anthropic's paper at https://transformercircuits.pub/2021/framework/index.html). We will use this figure for intuition and understanding.
For every component, I will first go over the required theory. This is important because we have to understand why that particular component/concept is used. Then I will go over the coding part.
Let's look at all the individual components of a Transformer:
1. Residual Stream(Also Known as Skip Connections)
2. Embedding Matrix
3. Layer Normalization
4. Positional Encoding
5. Self-Attention Mechanism (Causal Masking)
6. Multi-Layer Perceptron
7. UnEmbedding Matrix
Before looking at the residual stream, it is always good to approach concepts with an example in mind. One of the main reasons people give for finding it difficult to code Transformers is keeping track of input and output dimensions. Before and after every transformation, we should know how the vector and its dimensions change.
Let the example sentence be "Messi is the greatest of all time".
For this example, there are 7 tokens (1 word = 1 token for simplicity). Let each token be represented in 50 dimensions; we call this d_model.
Batch size is the number of examples we feed to the model at a given point in time. Since we are working with a single demo example, let us consider a batch_size of 1. Let us assume the max length of any sentence in our dataset is less than or equal to 10; we call this seq_len. Let the total number of tokens in our vocabulary be 5000; we call this d_vocab.
So the configuration of our toy example is:
d_model = 50
d_vocab = 5000
batch_size = 1
seq_len = 10
Note: The above config is for the toy example. In our code, we will be working with actual GPT-level configs (see below).
Let's define our Config.
Note: There are a lot of hyperparameters here which you haven't seen yet. Don't worry, we will cover all of them in later parts of the blog.
from dataclasses import dataclass

## imports used by the code in the rest of this post
import torch as t
import torch.nn as nn
import einops
from torch import Tensor
from jaxtyping import Float, Int

## lets define all the parameters of our model
@dataclass
class Config:
    d_model: int = 768
    debug: bool = True
    layer_norm_eps: float = 1e-5
    d_vocab: int = 50257
    init_range: float = 0.02
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12

cfg = Config()
print(cfg)
Note that @dataclass simplifies a lot of things for us. With @dataclass we get a constructor and a clean printed representation of the class's parameters, with no need for boilerplate code. Without it, we would have to write the same class (shortened here to two fields) like this:
class Config:
    def __init__(self, d_model=768, d_vocab=50257):
        self.d_model = d_model
        self.d_vocab = d_vocab

    def __repr__(self):
        return f"Config(d_model={self.d_model}, d_vocab={self.d_vocab})"
Some common implementation details for all the components:
1. For every component, we define a class.
2. Every class needs to subclass nn.Module. This is important for many reasons, like storing model parameters and using helper functions. You can read more about this at https://pytorch.org/tutorials/beginner/nn_tutorial.html
3. super().__init__() makes sure the constructor of nn.Module gets called. https://www.geeksforgeeks.org/python-super-with-__init__-method/
4. We then pass the config object to that class to set the values of our parameters as required. (A minimal skeleton following this pattern is sketched right after this list.)
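Putting those four points together, here is a minimal sketch of the skeleton every component below follows. The class name MyComponent and its single weight are placeholders for illustration, not part of GPT itself.
## minimal skeleton every component below follows (names here are placeholders)
class MyComponent(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()          # run nn.Module's constructor so parameters get registered
        self.cfg = cfg              # keep the config around for hyperparameters
        # example of a trainable weight sized from the config
        self.W = nn.Parameter(t.empty((cfg.d_model, cfg.d_model)))
        nn.init.normal_(self.W, std=cfg.init_range)

    def forward(self, x: Float[Tensor, "batch posn d_model"]) -> Float[Tensor, "batch posn d_model"]:
        # every component reads a tensor of a known shape and returns a tensor of a known shape
        return x @ self.W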
What is the Embedding matrix?
This is just a plain lookup table: you look up the embedding vector of a particular token.
Questions to ask before coding:
Q. What is the input to the Embedding matrix?
A. Int[Tensor, "batch position"]
Here [batch, position] are the dimensions; position refers to the token position.
Q. What is the output of the Embedding matrix?
A. Float[Tensor, "batch seq_len d_model"]
It returns the corresponding embedding vectors in the above shape.
class Embed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_E = nn.Parameter(t.empty((cfg.d_vocab, cfg.d_model)))
        nn.init.normal_(self.W_E, std=self.cfg.init_range)

    def forward(self, tokens: Int[Tensor, 'batch position']) -> Float[Tensor, 'batch seq_len d_model']:
        return self.W_E[tokens]
Every trainable parameter in a neural network needs to be tracked and updated based on gradients. PyTorch simplifies this with nn.Parameter().
Any tensor wrapped in nn.Parameter is automatically registered as a trainable weight.
nn.init.normal_ fills the tensor with values drawn from a normal distribution (a.k.a. Gaussian), in-place.
Our Embedding matrix will be of the shape (d_vocab, d_model). Intuitively, we can read it as, for every token the matrix row will represent its corresponding embedding vector.
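As a quick sanity check of the shapes (a sketch, assuming the Config and Embed classes above), we can feed a batch of fake token ids and confirm that the lookup returns one d_model-dimensional vector per token:
embed = Embed(cfg)
# a fake batch: 1 example, 7 token ids (stand-ins for "Messi is the greatest of all time")
tokens = t.randint(0, cfg.d_vocab, (1, 7))
token_embeddings = embed(tokens)
print(token_embeddings.shape)   # torch.Size([1, 7, 768]) -> [batch, seq_len, d_model]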
What is Positional Embedding?
This can also be thought of as a lookup table. Here, instead of token ids, we look up positions. A positional embedding is a learned vector assigned to each position (just like a token embedding). Think of it as the model learning that certain positions tend to hold certain kinds of tokens and relationships, which is useful for the attention computation downstream.
Small clarification:
In the original paper "Attention Is All You Need", the authors used positional encoding. It is not learned; it is a fixed function (based on sine and cosine) that you add to the input embeddings. In our GPT, we use a learned positional embedding.
More Intuition:
For the example
"Akhil plays football."
Positional embeddings evolve such that:
pos[0] → helps identify "Akhil" as the subject
pos[1] → contributes to verb detection
pos[2] → contributes to object prediction
Questions to ask before coding:
Q. What is the Input for Positional Embedding ?
A. Int[Tensor, "batch position"]
Here [batch, position] are the dimensions; position refers to the token position.
Q. What is the output of the Positional Embedding?
A. Float[Tensor, "batch seq_len d_model"]
It returns the corresponding positional vectors in the above shape.
class PosEmbed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_pos = nn.Parameter(t.empty((cfg.n_ctx, cfg.d_model)))
        nn.init.normal_(self.W_pos, std=self.cfg.init_range)

    def forward(self, tokens: Int[Tensor, "batch position"]) -> Float[Tensor, "batch position d_model"]:
        batch, seq_len = tokens.shape
        return einops.repeat(self.W_pos[:seq_len], "seq d_model -> batch seq d_model", batch=batch)
Here n_ctx is the context length of our model. That means at any given time, we will have at most n_ctx tokens to position.
In the forward pass, we slice out the relevant position vectors from our learned embedding matrix and repeat them across the batch. This gives us a tensor of shape [batch, seq_len, d_model], so each token gets a learnable embedding for its position, which we can then add to the token embeddings.
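Here is a small sketch (assuming the Embed and PosEmbed classes above) of how the two lookups are added together to form the tensor that enters the residual stream:
embed, pos_embed = Embed(cfg), PosEmbed(cfg)
tokens = t.randint(0, cfg.d_vocab, (1, 7))       # 1 example, 7 fake token ids
residual = embed(tokens) + pos_embed(tokens)     # element-wise sum of the two lookups
print(residual.shape)                            # torch.Size([1, 7, 768]) -> [batch, seq_len, d_model]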
What is a Residual Stream?
It is the straight path from Embed to UnEmbed in Figure 2. You can think of it as the central backbone of a Transformer. Information inside this stream flows forward; by forward, I mean from the embedding stage to the unembedding stage.
The tokens are represented by their corresponding embeddings via the embedding table. These embeddings then enter the residual stream.
We represent the example "Messi is the greatest of all time" inside the residual stream with the following dimensions:
[batch_size, seq_len, d_model] ==> [1, 10, 50]
(since each token is a 50-dimensional vector and we have 7 tokens; we pad the remaining 3 positions with zeros to maintain the dimensions.)
Next steps, in general:
- The input gets sent to LayerNorm.
- Attention heads read information from this residual stream. Attention heads are responsible for moving information between tokens, based on the attention matrix. (More on this in the Attention section.)
- The MLP does explicit read and write operations (new vectors) on this residual stream. It can also delete information from the residual stream.
(Will explain more on this in later sections.)
What is Layer Normalization?
The fundamental reason we normalize is to keep the data flowing nicely through the network so that gradients neither vanish nor explode.
From Figure 5, we can see two learnable parameters: gamma (a scaling factor) and beta (a shifting factor). We first normalize the values inside each embedding vector: y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta, where E[x] is the mean and Var[x] is the variance. The gamma and beta terms then give the model a little room to scale and shift as training progresses, and the small epsilon avoids a division-by-zero error.
Questions to ask:
Q. What does LayerNorm take as input?
A. A residual stream tensor of shape [batch posn d_model] (for example, the residual going into attention or into the MLP).
Q. What does it return?
A. It just normalizes the existing values of the embedding vector; it doesn't add anything new. So it returns normalized values of the same shape.
Note: dim = -1 means the operation is performed along the last dimension, which here is d_model. So we take the mean and variance along the embedding vector of each token independently.
### LayerNorm Implementation
class LayerNorm(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.w = nn.Parameter(t.ones(cfg.d_model))   ## these are gamma and beta
        self.b = nn.Parameter(t.zeros(cfg.d_model))  ## learnable scale and shift

    def forward(self, residual: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_model']:
        residual_mean = residual.mean(dim=-1, keepdim=True)
        residual_std = (residual.var(dim=-1, keepdim=True, unbiased=False) + self.cfg.layer_norm_eps).sqrt()
        residual = (residual - residual_mean) / residual_std
        residual = residual * self.w + self.b
        return residual
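As a quick check (a sketch, assuming the LayerNorm class above), each token's vector should come out with roughly zero mean and unit variance along d_model, and at initialization (w = 1, b = 0) the result should match PyTorch's built-in nn.LayerNorm:
ln = LayerNorm(cfg)
x = t.randn(1, 7, cfg.d_model)                      # fake residual: [batch, posn, d_model]
out = ln(x)
print(out.mean(dim=-1)[0, :3])                      # ~0 for every token
print(out.var(dim=-1, unbiased=False)[0, :3])       # ~1 for every token
ref = nn.LayerNorm(cfg.d_model, eps=cfg.layer_norm_eps)(x)
print(t.allclose(out, ref, atol=1e-5))              # True at initialization (w=1, b=0)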
Multi-Head Attention:
Okay. Let's think in simple terms first. Before talking about multiple attention heads, let us understand what happens in a single attention head.
Questions to ask:
Q. What does an attention head get as input?
A. The attention head reads what is present in the residual stream, i.e., Float[Tensor, "batch seq_len d_model"]. From our toy setup, this might be an example like "Messi is the greatest of all time".
Q. After the self-attention process (inside the attention block) completes, what does the output look like?
A. Float[Tensor, "batch seq_len d_model"]. The output is still the same example, but there has been a lot of information movement. Let's go through that in detail.
Information Movement (Intuition):
Let's take two tokens from above.
(For convenience, we represent each token in 4 dimensions.)
Below is the state of the embedding vectors before entering the attention block.
Messi → [0.1 0.9 2.3 7.1]
greatest → [2.1 4.4 0.6 1.8]
Once these tokens enter the attention block, each token starts to attend to the tokens that came before it in order to include more context and change its representation. This is called causal self-attention.
Messi is the greatest of all time.
In this example, when greatest wants to encode some context inside itself, it can only use the context from the words Messi, is and the. From these words, the representation of greatest changes. After attention:
Messi → [0.1 0.9 2.3 7.1]
greatest → [0.2 1.1 0.6 1.8] (changed representation)
What does that mean? Look at the greatest vector. It now carries a bit of "Messi" inside it: while constructing its new vector, "greatest" pulled in some of Messi's information. This is what information movement means.
But we still want to know how this process happens exactly.
Let me introduce a few matrices which are important in this process. In the literature, they are called Queries, Keys and Values.
Q = Input * Wq
K = Input * Wk
V = Input * Wv
Here Input is our example "Messi is the greatest of all time" (as a stack of embedding vectors). The idea of Q, K and V is to linearly transform the input into different spaces where it is represented in a more useful way. Let's see the dimensions of these matrix multiplications on our toy example.
Input/residual = [1 10 50] [batch seq d_model]
The Wq matrix dimension depends on how many heads we want in our model. This is a very important point because, if we decide to have only one attention head, then we can have
Wq = [n_head, d_model, d_model] ==> [1, 50, 50]
If we decide to have n_heads heads, then each head works in a space of dimension d_model/n_heads. We call this quantity d_head.
So, if we want to have 5 heads, the dimensions of Wq will be
[n_head, d_model, d_head] ==> [5, 50, 10].
Let's say we want to have 5 heads, then
Q = [1 10 50] * [5, 50, 10] ==> [batch seq d_model] * [n_head d_model d_head] ==> [1 10 5 10] [batch seq_len n_head d_head]
The extra dimension at the beginning is the batch. For Q, K and V we will clearly see how all of this fits together in a diagram.
The same applies for the K and V matrices. First let's talk about K.
K = [1 10 50] * [5 50 10] ==> [1 10 5 10]
Attention scores are calculated by multiplying the matrices Q and K. Remember, the attention matrix will always be a square matrix of shape [seq_len, seq_len].
Please look at the diagram I made; it tries to communicate what those dimensions actually mean. Look at the left part: [batch seq d_head d_model].
I took two example sentences:
1. I good
2. You bad
In the left representation, one batch holds two examples, and each example has 2 tokens. For each token, all the heads are kept together, which is like the full d_model dimension; we are not yet computing attention per head.
But we want, for every batch and for every head, those tokens to be processed by the different attention heads in parallel. The right side of the representation helps with that. That is the reason why, while computing attention, we permute the shapes. (Hope this helps!)
Note: Don't worry, all of these transformations can be done very intuitively through einsum. You will see this in the code.
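To make the shapes concrete before the real code, here is a sketch with the toy dimensions (batch = 1, seq = 10, d_model = 50, 5 heads of d_head = 10), reusing the imports from the setup block above. The weight tensors are random stand-ins for the learned W_Q and W_K; the point is only how einops.einsum produces the per-head Q, K and the square attention-score matrix:
batch, seq, d_model, n_head, d_head = 1, 10, 50, 5, 10
resid = t.randn(batch, seq, d_model)            # the toy residual stream
Wq = t.randn(n_head, d_model, d_head)           # random stand-in for the learned W_Q
Wk = t.randn(n_head, d_model, d_head)           # random stand-in for the learned W_K

q = einops.einsum(resid, Wq, "batch seq d_model, n_head d_model d_head -> batch seq n_head d_head")
k = einops.einsum(resid, Wk, "batch seq d_model, n_head d_model d_head -> batch seq n_head d_head")
print(q.shape)                                  # torch.Size([1, 10, 5, 10])

# per head, every query position is scored against every key position
scores = einops.einsum(q, k, "batch q_seq n_head d_head, batch k_seq n_head d_head -> batch n_head q_seq k_seq")
print(scores.shape)                             # torch.Size([1, 5, 10, 10]) -> square [q_seq, k_seq] per head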
Now that we have understood how attention scores are computed, let's get back to our Messi example.
Earlier we talked about how "greatest" would attend to "Messi". We get a [10, 10] matrix of all words in our toy example attending to all the other words.
After getting the attention-score matrix, we apply causal masking to prevent words from attending to future words: "greatest" cannot attend to "time". After that, we apply a softmax to the attention matrix. The softmax gives us scores that sum to 1 along each row.
For the word greatest, its row tells us what fraction of attention it should pay to "Messi", how much to "is", how much to "the", and how much to itself.
I took another example from Google to make things visually simple. You can easily connect this with our Messi example.
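Here is a tiny sketch of the masking-plus-softmax step on a made-up 4x4 score matrix (standing in for one head of the [10, 10] matrix above). Positions above the diagonal correspond to future tokens, so they are filled with -inf before the softmax, which turns them into exactly zero probability:
scores = t.randn(4, 4)                                   # fake attention scores for 4 tokens
mask = t.triu(t.ones(4, 4), diagonal=1).bool()           # True above the diagonal = future positions
masked = scores.masked_fill(mask, float("-inf"))
pattern = masked.softmax(dim=-1)
print(pattern)                                           # upper triangle is 0.0
print(pattern.sum(dim=-1))                               # every row sums to 1.0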
Once this is done, the next step is to multiply this matrix with our value vectors.
As discussed above, the value vectors are again nothing but a linear transformation of the input into another space.
V = Input * Wv ==> [1 10 50] * [5 50 10] ==> [1 10 5 10]
Z = V * A ==> [batch seq_len n_head d_head] * [batch n_head q_seq k_seq] ==> [1 10 5 10] * [1 5 10 10] ==> [1 10 5 10] [batch seq_len n_head d_head]
Again, once you look at the einsum code, this is self-explanatory.
Z holds the output of every head: for each of the 5 heads we get a [1 10 10] result, and keeping the head dimension gives [1 10 5 10]. Concatenating the heads side by side would give [1 10 50]; in the einsum implementation we instead keep the head dimension and let the next multiplication sum over it.
The per-head outputs are then multiplied with one final output matrix (Wo), which can intuitively be thought of as learning how to combine the outputs from the different heads.
(Z from all the heads) * Wo
[1 10 5 10] * [5 10 50] [n_head d_head d_model] ==> [1 10 50]
This is how information is moved between tokens. I know there are a lot of dimensions here, but this is the core part; once you get the gist of it, everything looks straightforward. The result is then written back into the residual stream. Look at the code implementing Attention below. There are bias initializations, which are self-explanatory.
Note: I use "posn" and "seq_len" interchangeably. They are the same.
Implementation details:
- The causal mask is implemented with the tril and triu functions in PyTorch. Please look them up; they are straightforward.
- register_buffer creates tensors that belong to the module but do not require gradient tracking. Registered buffers also get the nice functionality of moving between CPU and GPU along with the module.
class Attention(nn.Module):
    ### register your buffer here
    IGNORE: Float[Tensor, '']

    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_Q = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_K = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_V = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_O = nn.Parameter(t.empty((cfg.n_heads, cfg.d_head, cfg.d_model)))
        self.b_Q = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_K = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_V = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_O = nn.Parameter(t.zeros((cfg.d_model)))
        nn.init.normal_(self.W_Q, std=self.cfg.init_range)
        nn.init.normal_(self.W_K, std=self.cfg.init_range)
        nn.init.normal_(self.W_V, std=self.cfg.init_range)
        nn.init.normal_(self.W_O, std=self.cfg.init_range)
        # registered buffers move with the module between CPU/GPU, so no explicit device is needed here
        self.register_buffer('IGNORE', t.tensor(float('-inf'), dtype=t.float32))

    def forward(self, normalized_resid_pre: Float[Tensor, 'batch pos d_model']) -> Float[Tensor, 'batch pos d_model']:
        ### calculate query, key and value vectors and go according to the formula
        q = (
            einops.einsum(
                normalized_resid_pre, self.W_Q, "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head"
            )
            + self.b_Q
        )
        k = (
            einops.einsum(
                normalized_resid_pre, self.W_K, "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head"
            )
            + self.b_K
        )
        v = (
            einops.einsum(
                normalized_resid_pre, self.W_V, "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head"
            )
            + self.b_V
        )
        attn_scores = einops.einsum(
            q, k, "batch posn_Q nheads d_head, batch posn_K nheads d_head -> batch nheads posn_Q posn_K"
        )
        attn_scores_masked = self.apply_causal_mask(attn_scores / self.cfg.d_head**0.5)
        attn_pattern = attn_scores_masked.softmax(-1)
        # Take weighted sum of value vectors, according to attention probabilities
        z = einops.einsum(
            v, attn_pattern, "batch posn_K nheads d_head, batch nheads posn_Q posn_K -> batch posn_Q nheads d_head"
        )
        # Calculate output (by applying matrix W_O and summing over heads, then adding bias b_O)
        attn_out = (
            einops.einsum(z, self.W_O, "batch posn_Q nheads d_head, nheads d_head d_model -> batch posn_Q d_model")
            + self.b_O
        )
        return attn_out

    def apply_causal_mask(
        self, attn_scores: Float[Tensor, "batch n_heads query_pos key_pos"]
    ) -> Float[Tensor, "batch n_heads query_pos key_pos"]:
        """
        Applies a causal mask to attention scores, and returns masked scores.
        """
        # Define a mask that is True for all positions we want to set probabilities to zero for
        all_ones = t.ones(attn_scores.size(-2), attn_scores.size(-1), device=attn_scores.device)
        mask = t.triu(all_ones, diagonal=1).bool()
        # Apply the mask to attention scores, then return the masked scores
        attn_scores.masked_fill_(mask, self.IGNORE)
        return attn_scores
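As a quick sanity check (a sketch, assuming the Config, LayerNorm and Attention classes above), we can pass a random normalized residual through the attention block and confirm the output has the same [batch posn d_model] shape, so it can be added straight back into the residual stream:
attn = Attention(cfg)
resid = t.randn(2, 5, cfg.d_model)          # [batch=2, posn=5, d_model]
out = attn(LayerNorm(cfg)(resid))           # attention reads the normalized residual
print(out.shape)                            # torch.Size([2, 5, 768]) -> same shape, so it can be added back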
Important takeaway
What information we copy depends on the source token's residual stream, but this doesn't mean it only depends on the value of that token, because the residual stream can store more information than just the token identity (the purpose of the attention heads is to move information between vectors at different positions in the residual stream). What does that mean?
Messi is the greatest of all time
So when greatest is attending back to Messi, it doesn't just see the token "Messi". The residual stream stores much more than the identity. It holds things like:
Messi is the subject.
Messi is a person, etc.
All of this is stored in the residual stream.
Now the input goes into the MLP.
Multi-Layer Perceptron (MLP Layer)
This is a very important layer: roughly two-thirds of the model's parameters are in the MLPs. These layers are responsible for the non-linear transformation of the input vectors.
The main intuition of this layer is to form rich projections and to store facts.
There is a very intuitive video made by 3Blue1Brown about this. It's a must watch: https://www.youtube.com/watch?v=9-Jl0dxWQs8&t=498s
Intuition
You can loosely think of the MLP as working like a key-value function, where:
Input = "key" (what the token currently holds in the residual stream)
Output = "value" (what features we want to add to the residual stream)
For example, the key is the token's current context vector coming from the residual stream; it represents the meaning of the token so far (including attention context).
The value is a non-linear mix of learned features.
It could be:
1. "This is a named entity"
2. "This clause is negated"
3. "A question is being asked"
4. "Boost strength-related features"
5. "Trigger the next layer's copy circuit"
So the MLP says:
"Oh, you're a token that's the subject of a sentence AND you were just negated? Cool. Let me output features relevant to that situation." Hope you got the intuition.
The hidden layer has 3072 neurons; we call this d_mlp and have declared it in our config. The second linear layer projects this back into d_model space. These are the W_in and W_out matrices in the code.
We use the GeLU non-linearity.
class MLP(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_in = nn.Parameter(t.empty(cfg.d_model, cfg.d_mlp))
        self.b_in = nn.Parameter(t.zeros(cfg.d_mlp))
        self.W_out = nn.Parameter(t.empty(cfg.d_mlp, cfg.d_model))
        self.b_out = nn.Parameter(t.zeros(cfg.d_model))
        nn.init.normal_(self.W_in, std=self.cfg.init_range)
        nn.init.normal_(self.W_out, std=self.cfg.init_range)

    def forward(self, normalized_resid_mid: Float[Tensor, 'batch posn d_model']):
        ## per-token matmul: project up to d_mlp, apply the non-linearity, project back down to d_model
        pre = einops.einsum(normalized_resid_mid, self.W_in, 'batch posn d_model, d_model d_mlp -> batch posn d_mlp') + self.b_in
        post = t.nn.functional.gelu(pre, approximate='tanh')  # GPT-2 style "gelu_new" approximation
        mlp_out = einops.einsum(post, self.W_out, 'batch posn d_mlp, d_mlp d_model -> batch posn d_model') + self.b_out
        return mlp_out
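One property worth checking (a sketch, assuming the MLP class above): the MLP acts on each token position independently, so perturbing one token's vector changes the output only at that position. Attention is the only component where positions interact:
mlp = MLP(cfg)
x = t.randn(1, 7, cfg.d_model)
y1 = mlp(x)
x2 = x.clone()
x2[0, 3] = t.randn(cfg.d_model)        # perturb only token 3
y2 = mlp(x2)
changed = (y1 - y2).abs().sum(dim=-1)  # per-position difference
print(changed)                          # non-zero only at position 3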
With this, we have completed one layer of what we call a Transformer block. There are 12 such layers in GPT-2, and each attention block has 12 heads, so n_heads = 12 and n_layers = 12; these are already set in the config. Our GPT model uses 768-dimensional embeddings (d_model) and a vocabulary (d_vocab) of 50257 tokens.
So this Transformer block is repeated 12 times. The code for TransformerBlock just connects LayerNorm + Attention + MLP with skip connections.
class TransformerBlock(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.ln1 = LayerNorm(cfg)
        self.attn = Attention(cfg)
        self.ln2 = LayerNorm(cfg)
        self.mlp = MLP(cfg)

    def forward(self, resid_pre: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_model']:
        resid_mid = self.attn(self.ln1(resid_pre)) + resid_pre    ### skip connection
        resid_post = self.mlp(self.ln2(resid_mid)) + resid_mid    ### skip connection
        return resid_post
Here the skip connections are nothing but adding the input directly back into the residual stream alongside the attention and MLP outputs. resid_pre is the residual before the block (the raw input), and resid_mid is the residual after attention, which gets added again after the MLP. This is done in order to stabilize training over long runs.
UnEmbed
The UnEmbed matrix maps the learned representations back onto the vocabulary, giving a score for every token in the vocab.
Questions to ask:
Q. What input does it take?
A. A residual stream vector per token: [batch posn d_model].
Q. What does it give out?
A. It gives out logits: a score for every token in the vocabulary indicating how likely it is to come next, i.e.,
a matrix of size [batch posn d_vocab].
(A softmax over the last dimension turns these logits into probabilities.) Look at the logits computation below to see precisely how it is calculated.
class UnEmbed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_U = nn.Parameter(t.empty(cfg.d_model, cfg.d_vocab))
        nn.init.normal_(self.W_U, std=self.cfg.init_range)
        self.b_U = nn.Parameter(t.zeros((cfg.d_vocab), requires_grad=False))

    def forward(self, normalized_resid_final: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_vocab']:
        logits = einops.einsum(normalized_resid_final, self.W_U, 'batch posn d_model, d_model d_vocab -> batch posn d_vocab') + self.b_U
        return logits
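To see how these logits get used (a sketch, assuming the classes above), we can apply a softmax over the vocabulary dimension and take the argmax at the last position, which is how the most likely next token would be picked greedily at inference time:
unembed = UnEmbed(cfg)
resid_final = t.randn(1, 7, cfg.d_model)              # pretend final residual for 7 tokens
logits = unembed(resid_final)
print(logits.shape)                                   # torch.Size([1, 7, 50257]) -> [batch, posn, d_vocab]
probs = logits.softmax(dim=-1)                        # probabilities over the vocab at each position
next_token = probs[0, -1].argmax()                    # most likely next token after the last position
print(next_token)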
Transformer
Finally, we arrive at the last part. Here, we just need to put all the components we have seen together. Let's do that!!!
class Transformer(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.embed = Embed(cfg)
        self.posembed = PosEmbed(cfg)
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_final = LayerNorm(cfg)
        self.unembed = UnEmbed(cfg)

    def forward(self, tokens: Int[Tensor, 'batch posn']) -> Float[Tensor, 'batch posn d_vocab']:
        residual = self.embed(tokens) + self.posembed(tokens)
        for block in self.blocks:
            residual = block(residual)
        logits = self.unembed(self.ln_final(residual))
        return logits
Here we go from taking tokens as input, to running the residual through the 12 Transformer blocks, to producing logits.
Implementation detail: Since all the Transformer blocks have their own parameters to be tracked, we need to define them in an nn.ModuleList. This is the proper way of initializing a list of blocks. Each block reads from the residual stream and writes its contribution back into the residual stream.
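Here is a final end-to-end sanity check (a sketch, assuming everything defined above): instantiate the model, push a batch of random token ids through it, check the logits shape, and count the trainable parameters.
model = Transformer(cfg)
tokens = t.randint(0, cfg.d_vocab, (2, 35))           # [batch=2, posn=35] random token ids
logits = model(tokens)
print(logits.shape)                                   # torch.Size([2, 35, 50257]) -> [batch, posn, d_vocab]
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} trainable parameters")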
That's it, guys!!!!! Hope you have gained a ton of knowledge on how to build your own GPT. Support and follow me for more cool blogs!
Thanks to Neel Nanda and Callum McDougall!!!! I have learnt a lot from their materials and videos. This blog is inspired by their work.
Connect with Me on: https://www.linkedin.com/in/akhilshekkari/
Published via Towards AI