# Bigram Language Modeling From Scratch

**Author(s): Abhishek Chaudhary**

Originally published on Towards AI.

Language modeling is all about how computers understand and generate human language. Itβs a key part of making AI systems that can communicate with us effectively. In recent years, ChatGPT has been a prominent topic of discussion in tech articles. Chat-**G**enerative **P**re-trained **T**ransformer has allowed general users to make use of a highly proficient language model to answer their queries. How does ChatGPT accomplish this task? Firstly, itβs pre-trained on a solid chunk of the internet, which, when coupled with careful preprocessing, training, and fine-tuning, can work wonders. Secondly, it works on the principle of finding the next chunk, given an input sentence. This coupled of attention mechanism allows ChatGPT to create whole coherent paragraphs.

In this article, weβll discuss a similar, much less powerful language model. Our model will be a character-level model as opposed to a word/chunk-level model and would predict the next character given a previous character.

*The code for this article can be found in the following J**upyter notebook*

## Bigram model

The essence of the bigram model in language modeling is to approximate the probability of a word sequence by considering the probability of each word given its immediate predecessor.

The probability of a word sequence *(W = w_1, w_2, β¦, w_n)* is represented as follows:

`P(W) = P(w_1, w_2, ..., w_n) β P(w_1) * P(w_2 U+007C w_1) * P(w_3 U+007C w_2) * ... * P(w_n U+007C w_{n-1})`

Where:

`P(w_1)`

is the probability of the first word in the sequence.`P(w_i U+007C w_{i-1})`

is the conditional probability of the word`w_i`

given that the previous word is`w_{i-1}`

.

The conditional probability `P(w_i U+007C w_{i-1})`

for a bigram is estimated from a corpus of text as follows:

`P(w_i U+007C w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})`

`Count(w_{i-1}, w_i)`

represents the number of times the word`w_i`

follows the word`w_{i-1}`

in the corpus.`Count(w_{i-1})`

is the number of times the word`w_{i-1}`

appears in the corpus.

## Implementation of the Bigram Model

For input corpus, weβll make use of all the strings in `names.txt`

`words = open('names.txt', 'r').read().splitlines()`

words[:10]

`['emma',`

'olivia',

'ava',

'isabella',

'sophia',

'charlotte',

'mia',

'amelia',

'harper',

'evelyn']

`print(f"Number of words in corpus {len(words)}")`

print(f"Shortest name in corpus {min(len(w) for w in words)}")

print(f"Longest name in corpus {max(len(w) for w in words)}")

`Number of words in corpus 32033`

Shortest name in corpus 2

Longest name in corpus 15

Next, weβll create bigram pairs and store their count in a dictionary. Weβll add a special token/character to denote the start and end of the word

`bg_pairs = dict()`

for word in words:

word = ['.'] + list(word) + ['.']

for ch1, ch2 in zip(word, word[1:]):

bg_pair = (ch1, ch2)

bg_pairs[bg_pair] = bg_pairs.get(bg_pair, 0) + 1

Printing out the top 10 bigram pair based on their frequency

`sorted(bg_pairs.items(), key = lambda kv: -kv[1])[:10]`

`[(('n', '.'), 6763),`

(('a', '.'), 6640),

(('a', 'n'), 5438),

(('.', 'a'), 4410),

(('e', '.'), 3983),

(('a', 'r'), 3264),

(('e', 'l'), 3248),

(('r', 'i'), 3033),

(('n', 'a'), 2977),

(('.', 'k'), 2963)]

Bigram `(n, .)`

is the most frequent pair, which means names end most frequently with `n`

in our dataset. Let's try to visualize the data that we have. We'll make use of matplotlib, but since it doesn't recognize text, we'll convert our characters into integer representation.

`character_list = sorted(list(set(''.join(words))))`

character_list

`['a',`

'b',

'c',

'd',

'e',

'f',

'g',

'h',

'i',

'j',

'k',

'l',

'm',

'n',

'o',

'p',

'q',

'r',

's',

't',

'u',

'v',

'w',

'x',

'y',

'z']

`stoi = {s:i+1 for i,s in enumerate(character_list)} # Adding 1 to each index so that special character can be given index 0`

stoi['.'] = 0

itos = {i:s for s,i in stoi.items()} # Create reverse mapping as well

stoi

`{'a': 1,`

'b': 2,

'c': 3,

'd': 4,

'e': 5,

'f': 6,

'g': 7,

'h': 8,

'i': 9,

'j': 10,

'k': 11,

'l': 12,

'm': 13,

'n': 14,

'o': 15,

'p': 16,

'q': 17,

'r': 18,

's': 19,

't': 20,

'u': 21,

'v': 22,

'w': 23,

'x': 24,

'y': 25,

'z': 26,

'.': 0}

Now that we have the dictionary to map the characters to integers, letβs make use of a PyTorch tensor to store the frequency of *bg_pairs*

`import torch`

N = torch.zeros((27, 27), dtype=torch.int32)

for word in words:

word = ['.'] + list(word) + ['.']

for ch1, ch2 in zip(word, word[1:]):

idx1 = stoi[ch1]

idx2 = stoi[ch2]

N[idx1, idx2] += 1

`import matplotlib.pyplot as plt`

import numpy as np

%matplotlib inline

plt.figure(figsize=(16,16))

plt.imshow(N, cmap='Blues')

for i in range(27):

for j in range(27):

chstr = itos[i] + itos[j]

plt.text(j, i, chstr, ha="center", va="bottom", color='gray')

plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')

plt.axis('off');

In the diagram above, the value corresponding to each pair `(ch1, ch2)`

denotes how many times has *ch1* appeared before *ch2*. Pair like `(a, .)`

show how many names have ended with `a`

and similarly pairs like `(., a)`

show the count of names starting with `a`

.

Now, weβll need to convert the above frequency distribution to a probabilistic distribution. Letβs see what that looks like for the first row.

`N[0]`

`tensor([ 0, 4410, 1306, 1542, 1690, 1531, 417, 669, 874, 591, 2422, 2963,`

1572, 2538, 1146, 394, 515, 92, 1639, 2055, 1308, 78, 376, 307,

134, 535, 929], dtype=torch.int32)

`p = N[0].float() # Convert to float so that probability doesn't roundoff to 0`

p /= N[0].sum()

p

`tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,`

0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,

0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])

This gives us the probabilistic distribution. For the example above, for row-indexed `0`

we have this probability distribution. To sample an index from a probability distribution, we can make use of `torch.multinomial`

`gen = torch.Generator().manual_seed(2147483647) # We'll use generator with a seed so as to make this experiment deterministic`

[itos[ix.item()] for ix in torch.multinomial(p, num_samples=3, replacement=True, generator=gen)] # Sampling 3 characters from the probability distribution p

`['m', 's', 'n']`

Now we know how to get `num_samples`

from a probability distribution `p`

. Next we'll convert the tensor `N`

from frequency distribution to probability distribution. Since `N`

is a 2D tensor of shape `(27, 27)`

, to normalize it we'll need to divide each element `N[i, j]`

by `SUM(N[i])`

`P = (N).float()`

P /= N.sum(1, keepdims=True) # 1 denotes summation along idx 1, which is column.

P.shape

`torch.Size([27, 27])`

## Broadcasting

In the code above we used `N.sum(1, keepdims=True)`

. Let's look deeper into this expression to understand the nuances of broadcasting in pytorch. The shape of N.sum(1) is `torch.Size([27])`

whereas shape of N.sum(1, keepdims=True) is `torch.Size([27, 1])`

`N.shape`

`torch.Size([27, 27])`

`N.sum(1).shape`

`torch.Size([27])`

When we try to divide shape `(27, 27)`

by `(27)`

, shape is converted to `(1, 27)`

and then stretched out across every row to create shape `(27, 27)`

. This means on division we'll get `N[i, j]/ sum of N[:, j]`

`N.sum(1, keepdims=True).shape`

`torch.Size([27, 1])`

When we try to divide shape `(27, 27)`

by `(27, 1)`

, shape is stretched out across every column to create shape `(27, 27)`

. Which means on division we'll get `N[i, j]/ sum of N[i, :]`

which is what we expect.

## Likelihood

Now, letβs calculate the probability our model assigns to some of the words from our corpus. Ideally this probability would be `1`

since those words are present in the dataset.

`gen = torch.Generator().manual_seed(2147483647)`

for i in range(10):

out = []

ix = 0

while True:

p = P[ix]

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()

out.append(itos[ix])

if ix == 0:

break

print(''.join(out))

`junide.`

janasah.

p.

cony.

a.

nn.

kohin.

tolian.

juee.

ksahnaauranilevias.

The above result might look like gibberish, to convince ourselves that our model is indeed learning something, letβs try this out

`for word in words[:3]:`

word = ['.'] + list(word) + ['.']

for ch1, ch2 in zip(word, word[1:]):

idx1 = stoi[ch1]

idx2 = stoi[ch2]

print(f"{ch1} {ch2} {P[idx1, idx2]: .4f}")

`. e 0.0478`

e m 0.0377

m m 0.0253

m a 0.3899

a . 0.1960

. o 0.0123

o l 0.0780

l i 0.1777

i v 0.0152

v i 0.3541

i a 0.1381

a . 0.1960

. a 0.1377

a v 0.0246

v a 0.2495

a . 0.1960

Here we see the probability of each pair, now if the model wasnβt learning any patterns each of this would have been equally likely i.e `1/27 ~ 0.037`

, but the models have higher probability(likelihood) for some pairs and less probability(likelihood) for others.

## Evaluating the quality of the model using Negative log-likelihood

Likelihood is just another term for the probability of an item or bigram in our case. Since probabilities can go from 0 to 1, to have a smooth, continuous loss function, it makes sense to have a log of probability or log-likelihood

`x = np.linspace(0.000001, 1, 100)`

y = np.log(x)

plt.plot(x, y, label='y = log(x)')

Calculating log-likelihood of the pairs

`log_likelihood = 0`

for word in words[:3]:

word = ['.'] + list(word) + ['.']

for ch1, ch2 in zip(word, word[1:]):

idx1 = stoi[ch1]

idx2 = stoi[ch2]

prob = P[idx1, idx2]

logprob = torch.log(prob)

log_likelihood += logprob

print(f"{ch1} {ch2} {prob: .4f} {logprob: .4f}")

print(f"{log_likelihood=}")

negative_log_likelihood = -log_likelihood

print(f"{negative_log_likelihood=}")

`. e 0.0478 -3.0408`

e m 0.0377 -3.2793

m m 0.0253 -3.6772

m a 0.3899 -0.9418

a . 0.1960 -1.6299

. o 0.0123 -4.3982

o l 0.0780 -2.5508

l i 0.1777 -1.7278

i v 0.0152 -4.1867

v i 0.3541 -1.0383

i a 0.1381 -1.9796

a . 0.1960 -1.6299

. a 0.1377 -1.9829

a v 0.0246 -3.7045

v a 0.2495 -1.3882

a . 0.1960 -1.6299

log_likelihood=tensor(-38.7856)

negative_log_likelihood=tensor(38.7856)

We can see that for numbers with higher probability like `m a`

loglihelihood is closer to 0, whereas for pair like `. e`

reaches value `-3.0408`

. To evaluate the model, we would like to have a number that represents the efficiency of the model. For this, we can add the log probabilities for each word. Which gives us `-38.7856`

.

The lower value of log-likelihood means the model is performing badly, which is counterintuitive to the definition of the loss function, so what we want is to have a negative log-likelihood and try to minimize that. We can also make use of averaging the value of negative log-likelihood. Running this for the entire dataset now

`log_likelihood = 0.0`

n = 0

for w in words:

chs = ['.'] + list(w) + ['.']

for ch1, ch2 in zip(chs, chs[1:]):

ix1 = stoi[ch1]

ix2 = stoi[ch2]

prob = P[ix1, ix2]

logprob = torch.log(prob)

log_likelihood += logprob

n += 1

print(f'{log_likelihood=}')

nll = -log_likelihood

print(f'{nll=}')

print(f'{nll/n}')

`log_likelihood=tensor(-559891.7500)`

nll=tensor(559891.7500)

2.454094171524048

## Inference

Letβs try to get a few name recommendations by inputting itβs length and starting character only.

`ch = 'a'`

for i in range(10):

out = [ch]

while len(out) < 6:

ix = stoi[ch]

p = P[ix]

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()

if itos[ix] == '.': continue # ignore '.'

out.append(itos[ix])

print(f"{i} : {''.join(out)}")

`0 : anlylr`

1 : ailnyh

2 : amnrir

3 : arrylr

4 : asgirn

5 : adrhyi

6 : amnrkn

7 : anvjyn

8 : asnihd

9 : ahhuln

**Using Neural Network**

In the above method, we created a bigram model by extracting information regarding bigram frequency from the corpus and converting it to a probability distribution. We can look at the problem from a different perspective by modeling it using a neural network. In my previous article discussing neural networks from scratch, we came across the expression β W*X+b, where W is the weight for an individual neuron, X is the input to that neuron, and b is the bias. Letβs try to convert the data that we have into something that can be used by a neural network. Weβll use a diagram from a single word so that itβs easier to understand whatβs going on.

Weβll separate the training data into x and y. Such that y[i] is the expected character for input x

`## Preparing training dataset`

Xs=[]

Ys=[]

for w in words[:1]:

chs = ['.'] + list(w) + ['.']

for ch1, ch2 in zip(chs, chs[1:]):

ix1 = stoi[ch1]

ix2 = stoi[ch2]

Xs.append(ix1)

Ys.append(ix2)

Xs = torch.tensor(Xs)

Ys = torch.tensor(Ys)

Xs, Ys

`(tensor([ 0, 5, 13, 13, 1]), tensor([ 5, 13, 13, 1, 0]))`

Neural networks donβt understand indexes, so weβll convert the indexes into one hot encoding vector.

**One hot encoding**

One hot encoding takes the integer and num of classes as input and creates a vector of dimension (1 x num of classes) with a single-bit set.

`import torch.nn.functional as F`

F.one_hot(Xs[1], num_classes=27)

`tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,`

0, 0, 0])

Weβll do the same for entire X

`Xenc = F.one_hot(Xs, num_classes=27).float()`

plt.imshow(Xenc)

The image above shows set bit in one-hot encoding for each of the 5 input indexes *[ 0, 5, 13, 13, 1].* Using the neural network expression we saw above *W*X +b*, letβs create a tensor *W*. For the purpose of demonstration, letβs have a single neuron. This neuron would take input of shape *(1,27)* and output a tensor of shape *(1,1)*.

## Formulating Neural Network as Bigram Model

Weβll use a very simple neural network β A single linear layer and no bias.

As we are already familiar with neural network expression

Y = W*X + b

Weβll initialize w with random values and take a dot product with input tensor.

`w = torch.randn((27, 1))`

Xenc @ w # Matrix dot product

`tensor([[ 0.3378],`

[ 1.0748],

[ 0.1643],

[ 0.1643],

[-1.3647]])

For each of 5 *Xs[i]* we get an output tensor, now letβs bump up the number of neurons to *27*, thereby making the shape of *W (27,27)* and shape of output *(1, 27)*

`W = torch.randn((27, 27))`

logits = Xenc @ W

logits

`tensor([[ 1.6538, -0.8016, -1.0628, 0.5207, -0.8133, 1.8273, 1.5478, -0.9953,`

0.5832, 0.9147, -0.3186, 1.9207, 2.1282, 1.2778, 0.4173, -1.0522,

0.8507, 0.0043, -0.4466, 0.9321, -0.9775, 0.5061, 1.1806, 1.1374,

0.3909, 0.6887, -0.5974],

[ 0.3734, -0.9440, 2.1173, 0.4307, -0.8386, 0.2973, -0.1508, -1.1140,

1.5063, 0.4548, 0.3725, 0.3407, -0.0376, 1.6323, -0.7899, -0.9287,

1.0687, 1.3284, 0.2154, 0.1661, -0.4255, -1.4457, 0.0080, 1.1457,

0.1350, 0.6581, 0.5713],

[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,

-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,

-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,

-0.4243, -1.3913, -1.1660],

[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,

-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,

-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,

-0.4243, -1.3913, -1.1660],

[ 0.4654, -0.5457, 0.0731, 1.3982, -0.6943, -0.3172, 0.6360, -2.0058,

-1.7574, 0.1172, 0.2438, -0.4918, -0.7632, 0.4355, 0.0921, 1.0822,

0.6615, 0.2039, -0.2937, 0.9257, -0.1299, -0.1696, -0.8557, 0.3851,

-1.4590, -0.7883, -1.2211]])

In the example above, ** Xenc @ W[i, j]** is the activation/output of jth neuron for ith input. The values we get out of the neurons are termed as logits or log-counts. To convert them to a values resembling counts we can take exp of the logits and normalize them to get probabilities.

**This entire operation of converting logits to probabilities is also referred to as Softmax operation.**

`counts = logits.exp()`

probs = counts/counts.sum(1, keepdim=True)

probs[0], probs[0].sum()

`(tensor([0.0816, 0.0070, 0.0054, 0.0263, 0.0069, 0.0970, 0.0734, 0.0058, 0.0280,`

0.0389, 0.0113, 0.1065, 0.1311, 0.0560, 0.0237, 0.0054, 0.0365, 0.0157,

0.0100, 0.0396, 0.0059, 0.0259, 0.0508, 0.0487, 0.0231, 0.0311, 0.0086]),

tensor(1.))

Evaluating the model for the 5 bigram pairs above by calculating log-likelihood and negative log-likelihood.

`nlls = torch.zeros(5)`

for i in range(5):

ix = Xs[i].item()

iy = Ys[i].item()

print(f"Input to neural network {itos[ix]}")

print(f"Expected output from neural netork {itos[iy]}")

likelihood_y = probs[i, iy]

print(f"Probability of {itos[iy]}: {likelihood_y}")

log_likelihood_y = likelihood_y.log()

print(f"log likelihood {log_likelihood_y}")

negative_log_likelihood_y = -log_likelihood_y

print(f"negative log likelihood {negative_log_likelihood_y}")

nlls[i] = negative_log_likelihood_y

loss = nlls.mean()

print(f"Average negative log likelihood {loss}")

`Input to neural network .`

Expected output from neural netork e

Probability of e: 0.09700655937194824

log likelihood -2.3329765796661377

negative log likelihood 2.3329765796661377

Input to neural network e

Expected output from neural netork m

Probability of m: 0.10330255329608917

log likelihood -2.2700932025909424

negative log likelihood 2.2700932025909424

Input to neural network m

Expected output from neural netork m

Probability of m: 0.01063988171517849

log likelihood -4.543146133422852

negative log likelihood 4.543146133422852

Input to neural network m

Expected output from neural netork a

Probability of a: 0.025809239596128464

log likelihood -3.657022714614868

negative log likelihood 3.657022714614868

Input to neural network a

Expected output from neural netork .

Probability of .: 0.051668521016836166

log likelihood -2.9629065990448

negative log likelihood 2.9629065990448

Average negative log likelihood 3.153228998184204

We get an average log-likelihood of 3.15, which isnβt great. Letβs try to improve this model and train it on the entire dataset

## Using Entire dataset

## Creating dataset

`# create the dataset`

xs, ys = [], []

for w in words:

chs = ['.'] + list(w) + ['.']

for ch1, ch2 in zip(chs, chs[1:]):

ix1 = stoi[ch1]

ix2 = stoi[ch2]

xs.append(ix1)

ys.append(ix2)

xs = torch.tensor(xs)

ys = torch.tensor(ys)

num = xs.nelement()

print('number of examples: ', num)

# initialize the 'network'

g = torch.Generator().manual_seed(2147483647)

W = torch.randn((27, 27), generator=g, requires_grad=True)

`number of examples: 228146`

## Training loop

Bringing together what we discussed above

- Forward pass to convert input character index to one hot encoding, calculate dot product with neural network weights, performing softmax to obtain probabilities from logits, calculating nll
- Backward pass involving setting gradient to None and performing backpropagation across the neural network
- Updating parameter wrt to the gradients

`# gradient descent`

for k in range(50):

# forward pass

xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding

logits = xenc @ W # predict log-counts

counts = logits.exp() # counts, equivalent to N

probs = counts / counts.sum(1, keepdims=True) # probabilities for next character

loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()

print(loss.item())

# backward pass

W.grad = None # set to zero the gradient

loss.backward()

# update

W.data += -50 * W.grad

`3.7686190605163574`

3.3788065910339355

3.16109037399292

3.0271859169006348

2.9344840049743652

2.867231607437134

2.8166539669036865

2.777146100997925

2.745253801345825

2.7188303470611572

2.696505308151245

2.6773722171783447

2.6608052253723145

2.6463513374328613

2.633665084838867

2.622471570968628

2.6125476360321045

2.6037068367004395

2.595794916152954

2.5886809825897217

2.5822560787200928

2.5764293670654297

2.5711236000061035

2.566272735595703

2.5618226528167725

...

2.5137410163879395

2.512698173522949

2.511704444885254

2.5107581615448

An interesting thing to note in the above code is

loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()

specially the *0.01*(W**2).mean()*. This is also termed a regularization factor and is used for model smoothening. Essentially, what we control is the smoothness of the model weights, by keeping the constant large, we force the mean of W**2 to be small and have values closer to each other to reduce loss, and by keeping the constant small, we reduce the contribution of W weights, thus allowing W to have larger weights as well.

## Inference

`# finally, sample from the 'neural net' model`

g = torch.Generator().manual_seed(2147483647)

ch = 'a'

for i in range(10):

out = [ch]

ix = stoi[ch]

while len(out) < 6:

xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()

logits = xenc @ W # predict log-counts

counts = logits.exp() # counts, equivalent to N

p = counts / counts.sum(1, keepdims=True) # probabilities for next character

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()

if ix == 0:

continue

out.append(itos[ix])

print(f"{i} : {''.join(out)}")

`0 : asonde`

1 : adiana

2 : asahpx

3 : anayan

4 : ankohi

5 : antoli

6 : araste

7 : azzada

8 : aheiau

9 : ayanil

## Conclusion

There you have it. A bigram model that learns from the input list of names and evaluates itself using the log-likelihood method. *Granted that the Bigram method only takes into account local relationships between pairs and, in turn, ignores the context of the word as a whole, itβs a great place to start when learning language modeling.* With the above approach of using a neural network, we achieved a loss of 2.5107581615448, which isnβt better than what we had using a probabilistic model where we counted the occurrences. This is to be expected due to the nature of the bigram problem. Since, at any moment, we have very limited information (previous character), the manual approach of counting and using frequency to predict the next character turns out to be the same as using gradient-based methods to optimize neural network weights. Thus, the output of both methods remains the same. One thing to note here is that even though both models perform equally badly, the neural network model is very flexible in terms of input. We can extend the input to include a bunch of characters if we want, frequency distribution method on the other hand doesnβt scale well with the increase in number of input characters.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI