
Bigram Language Modeling From Scratch

Author(s): Abhishek Chaudhary

Originally published on Towards AI.

https://cdn.analyticsvidhya.com/wp-content/uploads/2019/08/language_model.png

Language modeling is all about how computers understand and generate human language. It’s a key part of making AI systems that can communicate with us effectively. In recent years, ChatGPT has been a prominent topic of discussion in tech articles. The Chat Generative Pre-trained Transformer has allowed general users to tap into a highly proficient language model to answer their queries. How does ChatGPT accomplish this? Firstly, it’s pre-trained on a sizable chunk of the internet, which, when coupled with careful preprocessing, training, and fine-tuning, can work wonders. Secondly, it works on the principle of predicting the next chunk of text given an input sentence. This, coupled with the attention mechanism, allows ChatGPT to produce whole coherent paragraphs.

In this article, we’ll build a similar, though much less powerful, language model. Our model will be a character-level model, as opposed to a word/chunk-level model, and will predict the next character given the previous character.

The code for this article can be found in the accompanying Jupyter notebook.

Bigram model

The essence of the bigram model in language modeling is to approximate the probability of a word sequence by considering the probability of each word given its immediate predecessor.

The probability of a word sequence (W = w_1, w_2, …, w_n) is represented as follows:

P(W) = P(w_1, w_2, ..., w_n) ≈ P(w_1) * P(w_2 | w_1) * P(w_3 | w_2) * ... * P(w_n | w_{n-1})

Where:

  • P(w_1) is the probability of the first word in the sequence.
  • P(w_i | w_{i-1}) is the conditional probability of the word w_i given that the previous word is w_{i-1}.

The conditional probability P(w_i | w_{i-1}) for a bigram is estimated from a corpus of text as follows:

P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})

  • Count(w_{i-1}, w_i) represents the number of times the word w_i follows the word w_{i-1} in the corpus.
  • Count(w_{i-1}) is the number of times the word w_{i-1} appears in the corpus.
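
To make the estimate concrete, here is a minimal sketch on a tiny made-up corpus (the three sentences are purely illustrative and not part of this article's dataset):

from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]
bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

# P(cat | the) = Count(the, cat) / Count(the) = 2 / 3
print(bigram_counts[("the", "cat")] / unigram_counts["the"])  # 0.666...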

Implementation of the Bigram Model

For the input corpus, we’ll use all the names in names.txt

words = open('names.txt', 'r').read().splitlines()
words[:10]
['emma',
'olivia',
'ava',
'isabella',
'sophia',
'charlotte',
'mia',
'amelia',
'harper',
'evelyn']
print(f"Number of words in corpus {len(words)}")
print(f"Shortest name in corpus {min(len(w) for w in words)}")
print(f"Longest name in corpus {max(len(w) for w in words)}")
Number of words in corpus 32033
Shortest name in corpus 2
Longest name in corpus 15

Next, we’ll create bigram pairs and store their counts in a dictionary. We’ll add a special token/character '.' to denote the start and end of each word.

bg_pairs = dict()
for word in words:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        bg_pair = (ch1, ch2)
        bg_pairs[bg_pair] = bg_pairs.get(bg_pair, 0) + 1

Printing out the top 10 bigram pairs based on their frequency

sorted(bg_pairs.items(), key = lambda kv: -kv[1])[:10]
[(('n', '.'), 6763),
(('a', '.'), 6640),
(('a', 'n'), 5438),
(('.', 'a'), 4410),
(('e', '.'), 3983),
(('a', 'r'), 3264),
(('e', 'l'), 3248),
(('r', 'i'), 3033),
(('n', 'a'), 2977),
(('.', 'k'), 2963)]

The bigram (n, .) is the most frequent pair, which means that names in our dataset most frequently end with n. Let's try to visualize the data that we have using matplotlib. Since we'll store the counts in a tensor indexed by integers, we'll first convert our characters into an integer representation.

character_list = sorted(list(set(''.join(words))))
character_list
['a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z']
stoi = {s:i+1 for i,s in enumerate(character_list)} # Adding 1 to each index so that special character can be given index 0
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()} # Create reverse mapping as well
stoi
{'a': 1,
'b': 2,
'c': 3,
'd': 4,
'e': 5,
'f': 6,
'g': 7,
'h': 8,
'i': 9,
'j': 10,
'k': 11,
'l': 12,
'm': 13,
'n': 14,
'o': 15,
'p': 16,
'q': 17,
'r': 18,
's': 19,
't': 20,
'u': 21,
'v': 22,
'w': 23,
'x': 24,
'y': 25,
'z': 26,
'.': 0}

Now that we have the dictionary to map the characters to integers, let’s make use of a PyTorch tensor to store the frequency of bg_pairs

import torch
N = torch.zeros((27, 27), dtype=torch.int32)
for word in words:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        N[idx1, idx2] += 1
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');
Bigram frequency distribution

In the diagram above, the value for each pair (ch1, ch2) denotes how many times ch1 appears immediately before ch2. Pairs like (a, .) show how many names end with a, and pairs like (., a) show how many names start with a.
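
We can also read individual counts straight out of N. A quick sanity check against the bigram frequencies printed earlier:

print(N[stoi['a'], stoi['.']].item())  # names ending with 'a' -> 6640
print(N[stoi['.'], stoi['a']].item())  # names starting with 'a' -> 4410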

Now, we’ll need to convert the above frequency distribution to a probabilistic distribution. Let’s see what that looks like for the first row.

N[0]
tensor([ 0, 4410, 1306, 1542, 1690, 1531, 417, 669, 874, 591, 2422, 2963,
1572, 2538, 1146, 394, 515, 92, 1639, 2055, 1308, 78, 376, 307,
134, 535, 929], dtype=torch.int32)
p = N[0].float() # Convert to float so that probability doesn't roundoff to 0
p /= N[0].sum()
p
tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])

This gives us the probability distribution for row index 0, i.e., over the characters that can start a name. To sample an index from a probability distribution, we can make use of torch.multinomial

gen = torch.Generator().manual_seed(2147483647) # We'll use generator with a seed so as to make this experiment deterministic
[itos[ix.item()] for ix in torch.multinomial(p, num_samples=3, replacement=True, generator=gen)] # Sampling 3 characters from the probability distribution p
['m', 's', 'n']

Now we know how to draw num_samples from a probability distribution p. Next, we'll convert the tensor N from a frequency distribution into a probability distribution. Since N is a 2D tensor of shape (27, 27), to normalize it we'll need to divide each element N[i, j] by the sum of row i, i.e., N[i].sum()

P = (N).float()
P /= N.sum(1, keepdims=True) # dim=1 sums across the columns, producing one sum per row
P.shape
torch.Size([27, 27])

Broadcasting

In the code above we used N.sum(1, keepdims=True). Let's look deeper into this expression to understand the nuances of broadcasting in PyTorch. The shape of N.sum(1) is torch.Size([27]), whereas the shape of N.sum(1, keepdims=True) is torch.Size([27, 1])

N.shape
torch.Size([27, 27])
N.sum(1).shape
torch.Size([27])

When we try to divide shape (27, 27) by (27,), the shape is first treated as (1, 27) and then stretched out across every row to create shape (27, 27). This means that on division we'd get N[i, j] / sum of N[j, :], i.e., each element divided by the sum of the wrong row.

N.sum(1, keepdims=True).shape
torch.Size([27, 1])

When we divide shape (27, 27) by (27, 1), the (27, 1) tensor is stretched out across every column to create shape (27, 27), which means that on division we get N[i, j] / sum of N[i, :], which is what we want.
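
Here is a small sketch on a toy 2x2 tensor (my own example, not the names data) showing both cases:

M = torch.tensor([[1., 2.],
                  [3., 4.]])

wrong = M / M.sum(1)                 # (2, 2) / (2,) -> broadcast as (1, 2), stretched down the rows
right = M / M.sum(1, keepdim=True)   # (2, 2) / (2, 1) -> stretched across the columns

print(wrong.sum(1))  # rows do not sum to 1
print(right.sum(1))  # tensor([1., 1.]) -- each row is a proper distribution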

Sampling and likelihood

Now that we have the probability distribution P, let's sample a few names from the model. Starting from the start token '.', we repeatedly sample the next character from the row of P corresponding to the current character until we sample '.' again.

gen = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
junide.
janasah.
p.
cony.
a.
nn.
kohin.
tolian.
juee.
ksahnaauranilevias.

The above results might look like gibberish. To convince ourselves that our model is indeed learning something, let's look at the probabilities it assigns to bigrams from words that are actually in our corpus; a model that has learned the data should assign these relatively high probabilities.

for word in words[:3]:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        print(f"{ch1} {ch2} {P[idx1, idx2]: .4f}")
. e 0.0478
e m 0.0377
m m 0.0253
m a 0.3899
a . 0.1960
. o 0.0123
o l 0.0780
l i 0.1777
i v 0.0152
v i 0.3541
i a 0.1381
a . 0.1960
. a 0.1377
a v 0.0246
v a 0.2495
a . 0.1960

Here we see the probability of each pair. If the model weren't learning any patterns, each pair would be equally likely, i.e., 1/27 ≈ 0.037; instead, the model assigns higher probability (likelihood) to some pairs and lower probability (likelihood) to others.

Evaluating the quality of the model using Negative log-likelihood

Likelihood is just another term for the probability of an item, a bigram in our case. Since probabilities range from 0 to 1, and since we want a smooth, continuous quantity to build a loss function on, it makes sense to take the log of the probability, i.e., the log-likelihood.
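
Since the probability of a whole word is a product of bigram probabilities, taking the log turns that product into a sum:

log P(W) = log P(w_1) + log P(w_2 | w_1) + ... + log P(w_n | w_{n-1})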

x = np.linspace(0.000001, 1, 100)
y = np.log(x)
plt.plot(x, y, label='y = log(x)')
Log likelihood

Calculating log-likelihood of the pairs

log_likelihood = 0
for word in words[:3]:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        prob = P[idx1, idx2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        print(f"{ch1} {ch2} {prob: .4f} {logprob: .4f}")
print(f"{log_likelihood=}")
negative_log_likelihood = -log_likelihood
print(f"{negative_log_likelihood=}")
. e 0.0478 -3.0408
e m 0.0377 -3.2793
m m 0.0253 -3.6772
m a 0.3899 -0.9418
a . 0.1960 -1.6299
. o 0.0123 -4.3982
o l 0.0780 -2.5508
l i 0.1777 -1.7278
i v 0.0152 -4.1867
v i 0.3541 -1.0383
i a 0.1381 -1.9796
a . 0.1960 -1.6299
. a 0.1377 -1.9829
a v 0.0246 -3.7045
v a 0.2495 -1.3882
a . 0.1960 -1.6299
log_likelihood=tensor(-38.7856)
negative_log_likelihood=tensor(38.7856)

We can see that for pairs with higher probability, like (m, a), the log-likelihood is closer to 0, whereas for a pair like (., e) it drops to -3.0408. To evaluate the model, we would like a single number that represents how well it fits the data. For this, we can add up the log probabilities of all bigrams, which gives us -38.7856 for the first three words.

A lower (more negative) log-likelihood means the model is performing worse, which is the opposite of how a loss function should behave, so we negate it to get the negative log-likelihood and try to minimize that instead. We can also average the negative log-likelihood over the number of bigrams. Let's run this over the entire dataset now.
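
Concretely, the quantity computed below is the average negative log-likelihood over all n bigrams (ch1, ch2) in the corpus:

NLL / n = -(1/n) * Σ log P(ch2 | ch1)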

log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')
log_likelihood=tensor(-559891.7500)
nll=tensor(559891.7500)
2.454094171524048

Inference

Let’s try to get a few name recommendations by specifying only the desired length and the starting character.

ch = 'a'
for i in range(10):
    out = [ch]
    ix = stoi[ch]
    while len(out) < 6:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()
        if itos[ix] == '.':
            continue  # ignore '.' so the name reaches the desired length
        out.append(itos[ix])

    print(f"{i} : {''.join(out)}")
0 : anlylr
1 : ailnyh
2 : amnrir
3 : arrylr
4 : asgirn
5 : adrhyi
6 : amnrkn
7 : anvjyn
8 : asnihd
9 : ahhuln

Using Neural Network

https://cs231n.github.io/assets/nn1/neural_net2.jpeg

In the above method, we created a bigram model by extracting bigram frequencies from the corpus and converting them to a probability distribution. We can look at the problem from a different perspective by modeling it with a neural network. In my previous article discussing neural networks from scratch, we came across the expression W*X + b, where W is the weight for an individual neuron, X is the input to that neuron, and b is the bias. Let’s try to convert the data that we have into something that can be used by a neural network. We’ll use data from a single word so that it’s easier to understand what’s going on.

We’ll separate the training data into Xs and Ys, such that Ys[i] is the expected next character for input Xs[i].

## Preparing training dataset
Xs = []
Ys = []
for w in words[:1]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        Xs.append(ix1)
        Ys.append(ix2)

Xs = torch.tensor(Xs)
Ys = torch.tensor(Ys)
Xs, Ys
(tensor([ 0, 5, 13, 13, 1]), tensor([ 5, 13, 13, 1, 0]))

Neural networks don’t understand indexes, so we’ll convert the indexes into one-hot encoded vectors.

One hot encoding

One-hot encoding takes an integer and the number of classes as input and creates a vector of dimension (1 x num_classes) with a single bit set.

import torch.nn.functional as F
F.one_hot(Xs[1], num_classes=27)
tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0])

We’ll do the same for the entire Xs tensor.

Xenc = F.one_hot(Xs, num_classes=27).float()
plt.imshow(Xenc)
One-hot encoding

The image above shows the set bit in the one-hot encoding for each of the 5 input indexes [0, 5, 13, 13, 1]. Using the neural network expression we saw above, W*X + b, let’s create a tensor W. For the purpose of demonstration, let’s start with a single neuron. This neuron takes an input of shape (1, 27) and outputs a tensor of shape (1, 1).

Formulating Neural Network as Bigram Model

We’ll use a very simple neural network — A single linear layer and no bias.

We are already familiar with the neural network expression

Y = W*X + b

We’ll initialize w with random values and take the dot product with the input tensor.

w = torch.randn((27, 1))
Xenc @ w # Matrix dot product
tensor([[ 0.3378],
[ 1.0748],
[ 0.1643],
[ 0.1643],
[-1.3647]])

For each of the 5 inputs Xs[i], we get a single output value. Now let's bump up the number of neurons to 27, making the shape of W (27, 27) and the shape of the output (1, 27) for each input.

W = torch.randn((27, 27))
logits = Xenc @ W
logits
tensor([[ 1.6538, -0.8016, -1.0628, 0.5207, -0.8133, 1.8273, 1.5478, -0.9953,
0.5832, 0.9147, -0.3186, 1.9207, 2.1282, 1.2778, 0.4173, -1.0522,
0.8507, 0.0043, -0.4466, 0.9321, -0.9775, 0.5061, 1.1806, 1.1374,
0.3909, 0.6887, -0.5974],
[ 0.3734, -0.9440, 2.1173, 0.4307, -0.8386, 0.2973, -0.1508, -1.1140,
1.5063, 0.4548, 0.3725, 0.3407, -0.0376, 1.6323, -0.7899, -0.9287,
1.0687, 1.3284, 0.2154, 0.1661, -0.4255, -1.4457, 0.0080, 1.1457,
0.1350, 0.6581, 0.5713],
[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,
-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,
-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,
-0.4243, -1.3913, -1.1660],
[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,
-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,
-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,
-0.4243, -1.3913, -1.1660],
[ 0.4654, -0.5457, 0.0731, 1.3982, -0.6943, -0.3172, 0.6360, -2.0058,
-1.7574, 0.1172, 0.2438, -0.4918, -0.7632, 0.4355, 0.0921, 1.0822,
0.6615, 0.2039, -0.2937, 0.9257, -0.1299, -0.1696, -0.8557, 0.3851,
-1.4590, -0.7883, -1.2211]])

In the example above, (Xenc @ W)[i, j] is the activation/output of the j-th neuron for the i-th input. The values we get out of the neurons are termed logits, or log-counts. To convert them to values resembling counts, we can take the exp of the logits and normalize them to get probabilities. This entire operation of converting logits to probabilities is also referred to as the softmax operation.
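
Written out, the softmax for the i-th input is:

probs[i, j] = exp(logits[i, j]) / Σ_k exp(logits[i, k])

which is exactly what the next two lines compute.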

counts = logits.exp()
probs = counts/counts.sum(1, keepdim=True)
probs[0], probs[0].sum()
(tensor([0.0816, 0.0070, 0.0054, 0.0263, 0.0069, 0.0970, 0.0734, 0.0058, 0.0280,
0.0389, 0.0113, 0.1065, 0.1311, 0.0560, 0.0237, 0.0054, 0.0365, 0.0157,
0.0100, 0.0396, 0.0059, 0.0259, 0.0508, 0.0487, 0.0231, 0.0311, 0.0086]),
tensor(1.))

Evaluating the model for the 5 bigram pairs above by calculating log-likelihood and negative log-likelihood.

nlls = torch.zeros(5)
for i in range(5):
    ix = Xs[i].item()
    iy = Ys[i].item()
    print(f"Input to neural network {itos[ix]}")
    print(f"Expected output from neural network {itos[iy]}")
    likelihood_y = probs[i, iy]
    print(f"Probability of {itos[iy]}: {likelihood_y}")
    log_likelihood_y = likelihood_y.log()
    print(f"log likelihood {log_likelihood_y}")
    negative_log_likelihood_y = -log_likelihood_y
    print(f"negative log likelihood {negative_log_likelihood_y}")
    nlls[i] = negative_log_likelihood_y

loss = nlls.mean()
print(f"Average negative log likelihood {loss}")
Input to neural network .
Expected output from neural network e
Probability of e: 0.09700655937194824
log likelihood -2.3329765796661377
negative log likelihood 2.3329765796661377
Input to neural network e
Expected output from neural network m
Probability of m: 0.10330255329608917
log likelihood -2.2700932025909424
negative log likelihood 2.2700932025909424
Input to neural network m
Expected output from neural network m
Probability of m: 0.01063988171517849
log likelihood -4.543146133422852
negative log likelihood 4.543146133422852
Input to neural network m
Expected output from neural network a
Probability of a: 0.025809239596128464
log likelihood -3.657022714614868
negative log likelihood 3.657022714614868
Input to neural network a
Expected output from neural network .
Probability of .: 0.051668521016836166
log likelihood -2.9629065990448
negative log likelihood 2.9629065990448
Average negative log likelihood 3.153228998184204

We get an average negative log-likelihood of 3.15, which isn’t great. Let’s try to improve this model by training it on the entire dataset.

Using Entire dataset

Creating dataset

# create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
number of examples: 228146

Training loop

Bringing together what we discussed above:

  • Forward pass: convert the input character indexes to one-hot encodings, compute the dot product with the network weights, apply softmax to obtain probabilities from the logits, and calculate the negative log-likelihood loss
  • Backward pass: set the gradient to None and backpropagate the loss through the network
  • Update: nudge the parameters in the direction opposite to their gradients

# gradient descent
for k in range(50):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()

    # update
    W.data += -50 * W.grad
3.7686190605163574
3.3788065910339355
3.16109037399292
3.0271859169006348
2.9344840049743652
2.867231607437134
2.8166539669036865
2.777146100997925
2.745253801345825
2.7188303470611572
2.696505308151245
2.6773722171783447
2.6608052253723145
2.6463513374328613
2.633665084838867
2.622471570968628
2.6125476360321045
2.6037068367004395
2.595794916152954
2.5886809825897217
2.5822560787200928
2.5764293670654297
2.5711236000061035
2.566272735595703
2.5618226528167725
...
2.5137410163879395
2.512698173522949
2.511704444885254
2.5107581615448

An interesting thing to note in the above code is

loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()

especially the 0.01*(W**2).mean() term. This is also termed a regularization factor and is used for model smoothing. Essentially, it controls the smoothness of the model weights: with a large coefficient, we force the mean of W**2 to be small, pushing the weights toward zero and closer to each other to reduce the loss; with a small coefficient, we reduce the contribution of this term, allowing W to take on larger values as well.
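
As a rough sketch (reusing the xs, ys, num, and trained W from above), we can recompute the forward pass and compare how much each term contributes to the total loss; scaling the 0.01 coefficient up or down shifts this balance:

xenc_all = F.one_hot(xs, num_classes=27).float()
probs_all = (xenc_all @ W).exp()
probs_all = probs_all / probs_all.sum(1, keepdim=True)   # softmax over the logits
data_loss = -probs_all[torch.arange(num), ys].log().mean()
reg_loss = 0.01 * (W**2).mean()
print(f"data loss: {data_loss.item():.4f}  regularization term: {reg_loss.item():.4f}")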

Inference

# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)
ch = 'a'
for i in range(10):
    out = [ch]
    ix = stoi[ch]
    while len(out) < 6:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # predict log-counts
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True) # probabilities for next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            continue
        out.append(itos[ix])

    print(f"{i} : {''.join(out)}")
0 : asonde
1 : adiana
2 : asahpx
3 : anayan
4 : ankohi
5 : antoli
6 : araste
7 : azzada
8 : aheiau
9 : ayanil

Conclusion

There you have it: a bigram model that learns from the input list of names and evaluates itself using the negative log-likelihood. Granted, the bigram method only takes into account local relationships between pairs of characters and, in turn, ignores the context of the word as a whole, but it’s a great place to start when learning language modeling. With the neural network approach, we achieved a loss of roughly 2.51, which isn’t better than what we had with the probabilistic model where we simply counted occurrences. This is to be expected given the nature of the bigram problem: since at any moment we have very limited information (just the previous character), the manual approach of counting and using frequencies to predict the next character turns out to be equivalent to using gradient-based methods to optimize the neural network weights, so the output of both methods remains the same. One thing to note is that even though both models perform equally badly, the neural network model is far more flexible in terms of input: we can extend the input to include several characters if we want, whereas the frequency-distribution method doesn’t scale well as the number of input characters grows.


Published via Towards AI
