Bigram Language Modeling From Scratch
Author(s): Abhishek Chaudhary
Originally published on Towards AI.
Language modeling is all about how computers understand and generate human language. It's a key part of making AI systems that can communicate with us effectively. In recent years, ChatGPT has been a prominent topic of discussion in tech articles. The Chat Generative Pre-trained Transformer has allowed general users to make use of a highly proficient language model to answer their queries. How does ChatGPT accomplish this? Firstly, it's pre-trained on a sizable chunk of the internet, which, when coupled with careful preprocessing, training, and fine-tuning, can work wonders. Secondly, it works on the principle of predicting the next chunk of text given an input sentence. This, coupled with the attention mechanism, allows ChatGPT to create whole coherent paragraphs.
In this article, we'll discuss a similar, much less powerful language model. Our model will be a character-level model, as opposed to a word/chunk-level model, and will predict the next character given the previous character.
The code for this article can be found in the following Jupyter notebook
Bigram model
The essence of the bigram model in language modeling is to approximate the probability of a word sequence by considering the probability of each word given its immediate predecessor.
The probability of a word sequence (W = w_1, w_2, β¦, w_n) is represented as follows:
P(W) = P(w_1, w_2, ..., w_n) ≈ P(w_1) * P(w_2 | w_1) * P(w_3 | w_2) * ... * P(w_n | w_{n-1})
Where:
- P(w_1) is the probability of the first word in the sequence.
- P(w_i | w_{i-1}) is the conditional probability of the word w_i given that the previous word is w_{i-1}.
The conditional probability P(w_i | w_{i-1}) for a bigram is estimated from a corpus of text as follows:
P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
Where:
- Count(w_{i-1}, w_i) is the number of times the word w_i follows the word w_{i-1} in the corpus.
- Count(w_{i-1}) is the number of times the word w_{i-1} appears in the corpus.
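To make the estimation concrete, here is a minimal sketch (using a tiny made-up corpus rather than names.txt, purely for illustration) of how these counts turn into a bigram probability:
from collections import Counter
# Tiny made-up corpus, only for demonstrating the counting
corpus = ["the cat sat", "the cat ran", "the dog sat"]
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))
# P(cat | the) = Count(the, cat) / Count(the) = 2 / 3
print(bigram_counts[("the", "cat")] / unigram_counts["the"])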
Implementation of the Bigram Model
For the input corpus, we'll make use of all the strings in names.txt.
words = open('names.txt', 'r').read().splitlines()
words[:10]
['emma',
'olivia',
'ava',
'isabella',
'sophia',
'charlotte',
'mia',
'amelia',
'harper',
'evelyn']
print(f"Number of words in corpus {len(words)}")
print(f"Shortest name in corpus {min(len(w) for w in words)}")
print(f"Longest name in corpus {max(len(w) for w in words)}")
Number of words in corpus 32033
Shortest name in corpus 2
Longest name in corpus 15
Next, we'll create bigram pairs and store their counts in a dictionary. We'll add a special token/character (.) to denote the start and end of each word.
bg_pairs = dict()
for word in words:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        bg_pair = (ch1, ch2)
        bg_pairs[bg_pair] = bg_pairs.get(bg_pair, 0) + 1
Printing out the top 10 bigram pairs based on their frequency:
sorted(bg_pairs.items(), key = lambda kv: -kv[1])[:10]
[(('n', '.'), 6763),
(('a', '.'), 6640),
(('a', 'n'), 5438),
(('.', 'a'), 4410),
(('e', '.'), 3983),
(('a', 'r'), 3264),
(('e', 'l'), 3248),
(('r', 'i'), 3033),
(('n', 'a'), 2977),
(('.', 'k'), 2963)]
The bigram ('n', '.') is the most frequent pair, which means that names in our dataset most frequently end with n. Let's try to visualize the data that we have. We'll make use of matplotlib, but since it doesn't recognize text, we'll convert our characters into an integer representation.
character_list = sorted(list(set(''.join(words))))
character_list
['a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z']
stoi = {s:i+1 for i,s in enumerate(character_list)} # Adding 1 to each index so that special character can be given index 0
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()} # Create reverse mapping as well
stoi
{'a': 1,
'b': 2,
'c': 3,
'd': 4,
'e': 5,
'f': 6,
'g': 7,
'h': 8,
'i': 9,
'j': 10,
'k': 11,
'l': 12,
'm': 13,
'n': 14,
'o': 15,
'p': 16,
'q': 17,
'r': 18,
's': 19,
't': 20,
'u': 21,
'v': 22,
'w': 23,
'x': 24,
'y': 25,
'z': 26,
'.': 0}
Now that we have the dictionary to map the characters to integers, let's make use of a PyTorch tensor to store the frequency of bg_pairs:
import torch
N = torch.zeros((27, 27), dtype=torch.int32)
for word in words:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        N[idx1, idx2] += 1
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');
In the diagram above, the value corresponding to each pair (ch1, ch2) denotes how many times ch1 appears before ch2. Pairs like ('a', '.') show how many names end with a, and similarly, pairs like ('.', 'a') show the count of names starting with a.
Now, we'll need to convert the above frequency distribution into a probability distribution. Let's see what that looks like for the first row.
N[0]
tensor([ 0, 4410, 1306, 1542, 1690, 1531, 417, 669, 874, 591, 2422, 2963,
1572, 2538, 1146, 394, 515, 92, 1639, 2055, 1308, 78, 376, 307,
134, 535, 929], dtype=torch.int32)
p = N[0].float() # Convert to float so that probability doesn't roundoff to 0
p /= N[0].sum()
p
tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])
This gives us the probability distribution for row index 0, i.e., the distribution over the first character of a name. To sample an index from a probability distribution, we can make use of torch.multinomial:
gen = torch.Generator().manual_seed(2147483647) # We'll use generator with a seed so as to make this experiment deterministic
[itos[ix.item()] for ix in torch.multinomial(p, num_samples=3, replacement=True, generator=gen)] # Sampling 3 characters from the probability distribution p
['m', 's', 'n']
Now we know how to draw num_samples samples from a probability distribution p. Next, we'll convert the tensor N from a frequency distribution to a probability distribution. Since N is a 2D tensor of shape (27, 27), to normalize it we'll need to divide each element N[i, j] by the sum of row i, i.e., N[i].sum().
P = (N).float()
P /= N.sum(1, keepdims=True) # Sum along dim 1 (across the columns of each row), so each row is divided by its own total
P.shape
torch.Size([27, 27])
Broadcasting
In the code above, we used N.sum(1, keepdims=True). Let's look deeper into this expression to understand the nuances of broadcasting in PyTorch. The shape of N.sum(1) is torch.Size([27]), whereas the shape of N.sum(1, keepdims=True) is torch.Size([27, 1]).
N.shape
torch.Size([27, 27])
N.sum(1).shape
torch.Size([27])
When we try to divide a tensor of shape (27, 27) by one of shape (27), the divisor is treated as shape (1, 27) and then stretched down every row to create shape (27, 27). This means on division we'd get N[i, j] divided by the sum of row j rather than row i, which is not the normalization we want.
N.sum(1, keepdims=True).shape
torch.Size([27, 1])
When we try to divide a tensor of shape (27, 27) by one of shape (27, 1), the divisor is stretched across every column to create shape (27, 27). This means on division we get N[i, j] divided by the sum of row i, i.e., each element divided by its own row total, which is exactly what we want.
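To see the difference concretely, here is a small sketch (not from the original notebook) on a tiny 2x2 tensor; only the keepdims version divides each row by its own total so that every row sums to 1:
import torch
M = torch.tensor([[1., 1.], [1., 3.]])
wrong = M / M.sum(1) # divisor broadcast as shape (1, 2): columns get divided by the wrong totals
right = M / M.sum(1, keepdims=True) # divisor of shape (2, 1): each row divided by its own total
print(wrong.sum(1)) # tensor([0.7500, 1.2500]) -> rows do not sum to 1
print(right.sum(1)) # tensor([1., 1.])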
Likelihood
Before measuring likelihood, let's sample a few names from the model. Starting from the start token ., we repeatedly sample the next character from the corresponding row of P until we sample the end token again.
gen = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
junide.
janasah.
p.
cony.
a.
nn.
kohin.
tolian.
juee.
ksahnaauranilevias.
The above results might look like gibberish. To convince ourselves that our model is indeed learning something, let's calculate the probability the model assigns to bigrams from some of the words in our corpus; ideally, these probabilities would be close to 1, since those words are present in the dataset.
for word in words[:3]:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        print(f"{ch1} {ch2} {P[idx1, idx2]: .4f}")
. e 0.0478
e m 0.0377
m m 0.0253
m a 0.3899
a . 0.1960
. o 0.0123
o l 0.0780
l i 0.1777
i v 0.0152
v i 0.3541
i a 0.1381
a . 0.1960
. a 0.1377
a v 0.0246
v a 0.2495
a . 0.1960
Here we see the probability of each pair. If the model weren't learning any patterns, each of these would be equally likely, i.e., 1/27 ≈ 0.037, but the model assigns higher probability (likelihood) to some pairs and lower probability (likelihood) to others.
Evaluating the quality of the model using Negative log-likelihood
Likelihood is just another term for the probability of an item, or of a bigram in our case. Since probabilities range from 0 to 1, to obtain a smooth, continuous loss function it makes sense to work with the log of the probability, or log-likelihood:
x = np.linspace(0.000001, 1, 100)
y = np.log(x)
plt.plot(x, y, label='y = log(x)')
Calculating log-likelihood of the pairs
log_likelihood = 0
for word in words[:3]:
    word = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(word, word[1:]):
        idx1 = stoi[ch1]
        idx2 = stoi[ch2]
        prob = P[idx1, idx2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        print(f"{ch1} {ch2} {prob: .4f} {logprob: .4f}")
print(f"{log_likelihood=}")
negative_log_likelihood = -log_likelihood
print(f"{negative_log_likelihood=}")
. e 0.0478 -3.0408
e m 0.0377 -3.2793
m m 0.0253 -3.6772
m a 0.3899 -0.9418
a . 0.1960 -1.6299
. o 0.0123 -4.3982
o l 0.0780 -2.5508
l i 0.1777 -1.7278
i v 0.0152 -4.1867
v i 0.3541 -1.0383
i a 0.1381 -1.9796
a . 0.1960 -1.6299
. a 0.1377 -1.9829
a v 0.0246 -3.7045
v a 0.2495 -1.3882
a . 0.1960 -1.6299
log_likelihood=tensor(-38.7856)
negative_log_likelihood=tensor(38.7856)
We can see that for pairs with higher probability, like (m, a), the log-likelihood is closer to 0, whereas a pair like (., e) drops to -3.0408. To evaluate the model, we would like a single number that represents its quality. For this, we can add up the log probabilities across all the bigrams, which gives us -38.7856 for these three words.
A lower log-likelihood means the model is performing badly, which is the opposite of the usual convention for a loss function, so what we want is the negative log-likelihood, which we then try to minimize. We can also average the negative log-likelihood over the number of bigrams. Running this for the entire dataset now:
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')
log_likelihood=tensor(-559891.7500)
nll=tensor(559891.7500)
2.454094171524048
Inference
Let's try to get a few name recommendations by specifying only the length and the starting character.
ch = 'a'
for i in range(10):
    out = [ch]
    ix = stoi[ch]
    while len(out) < 6:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item()
        if itos[ix] == '.': continue # ignore '.' so the name reaches the desired length
        out.append(itos[ix])
    print(f"{i} : {''.join(out)}")
0 : anlylr
1 : ailnyh
2 : amnrir
3 : arrylr
4 : asgirn
5 : adrhyi
6 : amnrkn
7 : anvjyn
8 : asnihd
9 : ahhuln
Using Neural Network
In the above method, we created a bigram model by extracting bigram frequencies from the corpus and converting them to a probability distribution. We can look at the problem from a different perspective by modeling it using a neural network. In my previous article discussing neural networks from scratch, we came across the expression W*X + b, where W is the weight for an individual neuron, X is the input to that neuron, and b is the bias. Let's try to convert the data that we have into something that can be used by a neural network. We'll use data from a single word so that it's easier to understand what's going on.
We'll separate the training data into x and y, such that y[i] is the expected output character for input x[i].
## Preparing training dataset
Xs=[]
Ys=[]
for w in words[:1]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        Xs.append(ix1)
        Ys.append(ix2)
Xs = torch.tensor(Xs)
Ys = torch.tensor(Ys)
Xs, Ys
(tensor([ 0, 5, 13, 13, 1]), tensor([ 5, 13, 13, 1, 0]))
Neural networks don't understand indexes, so we'll convert each index into a one-hot encoded vector.
One hot encoding
One-hot encoding takes an integer and the number of classes as input and creates a vector of dimension (1 x num_classes) with a single bit set.
import torch.nn.functional as F
F.one_hot(Xs[1], num_classes=27)
tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0])
We'll do the same for the entire input tensor Xs:
Xenc = F.one_hot(Xs, num_classes=27).float()
plt.imshow(Xenc)
The image above shows the set bit in the one-hot encoding for each of the 5 input indexes [0, 5, 13, 13, 1]. Using the neural network expression we saw above, W*X + b, let's create a tensor W. For the purpose of demonstration, let's start with a single neuron. This neuron takes an input of shape (1, 27) and outputs a tensor of shape (1, 1).
Formulating Neural Network as Bigram Model
We'll use a very simple neural network: a single linear layer and no bias.
As we are already familiar with neural network expression
Y = W*X + b
We'll initialize w with random values and take the matrix product with the input tensor.
w = torch.randn((27, 1))
Xenc @ w # Matrix dot product
tensor([[ 0.3378],
[ 1.0748],
[ 0.1643],
[ 0.1643],
[-1.3647]])
For each of the 5 inputs Xs[i], we get an output value. Now let's bump up the number of neurons to 27, making the shape of W (27, 27) and the shape of the output for each input (1, 27).
W = torch.randn((27, 27))
logits = Xenc @ W
logits
tensor([[ 1.6538, -0.8016, -1.0628, 0.5207, -0.8133, 1.8273, 1.5478, -0.9953,
0.5832, 0.9147, -0.3186, 1.9207, 2.1282, 1.2778, 0.4173, -1.0522,
0.8507, 0.0043, -0.4466, 0.9321, -0.9775, 0.5061, 1.1806, 1.1374,
0.3909, 0.6887, -0.5974],
[ 0.3734, -0.9440, 2.1173, 0.4307, -0.8386, 0.2973, -0.1508, -1.1140,
1.5063, 0.4548, 0.3725, 0.3407, -0.0376, 1.6323, -0.7899, -0.9287,
1.0687, 1.3284, 0.2154, 0.1661, -0.4255, -1.4457, 0.0080, 1.1457,
0.1350, 0.6581, 0.5713],
[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,
-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,
-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,
-0.4243, -1.3913, -1.1660],
[-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499,
-0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648,
-0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099,
-0.4243, -1.3913, -1.1660],
[ 0.4654, -0.5457, 0.0731, 1.3982, -0.6943, -0.3172, 0.6360, -2.0058,
-1.7574, 0.1172, 0.2438, -0.4918, -0.7632, 0.4355, 0.0921, 1.0822,
0.6615, 0.2039, -0.2937, 0.9257, -0.1299, -0.1696, -0.8557, 0.3851,
-1.4590, -0.7883, -1.2211]])
In the example above, (Xenc @ W)[i, j] is the activation/output of the jth neuron for the ith input. The values we get out of the neurons are termed logits, or log-counts. To convert them to values resembling counts, we can take the exp of the logits and then normalize to get probabilities. This entire operation of converting logits to probabilities is also referred to as the softmax operation.
counts = logits.exp()
probs = counts/counts.sum(1, keepdim=True)
probs[0], probs[0].sum()
(tensor([0.0816, 0.0070, 0.0054, 0.0263, 0.0069, 0.0970, 0.0734, 0.0058, 0.0280,
0.0389, 0.0113, 0.1065, 0.1311, 0.0560, 0.0237, 0.0054, 0.0365, 0.0157,
0.0100, 0.0396, 0.0059, 0.0259, 0.0508, 0.0487, 0.0231, 0.0311, 0.0086]),
tensor(1.))
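As a side note, this exp-and-normalize sequence is exactly what PyTorch's built-in softmax computes. A quick sanity check (reusing the logits and probs tensors from above) could look like this:
import torch
import torch.nn.functional as F
# F.softmax normalizes each row of logits into a probability distribution
print(torch.allclose(probs, F.softmax(logits, dim=1))) # expected: True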
Evaluating the model for the 5 bigram pairs above by calculating log-likelihood and negative log-likelihood.
nlls = torch.zeros(5)
for i in range(5):
    ix = Xs[i].item()
    iy = Ys[i].item()
    print(f"Input to neural network {itos[ix]}")
    print(f"Expected output from neural network {itos[iy]}")
    likelihood_y = probs[i, iy]
    print(f"Probability of {itos[iy]}: {likelihood_y}")
    log_likelihood_y = likelihood_y.log()
    print(f"log likelihood {log_likelihood_y}")
    negative_log_likelihood_y = -log_likelihood_y
    print(f"negative log likelihood {negative_log_likelihood_y}")
    nlls[i] = negative_log_likelihood_y
loss = nlls.mean()
print(f"Average negative log likelihood {loss}")
Input to neural network .
Expected output from neural network e
Probability of e: 0.09700655937194824
log likelihood -2.3329765796661377
negative log likelihood 2.3329765796661377
Input to neural network e
Expected output from neural network m
Probability of m: 0.10330255329608917
log likelihood -2.2700932025909424
negative log likelihood 2.2700932025909424
Input to neural network m
Expected output from neural network m
Probability of m: 0.01063988171517849
log likelihood -4.543146133422852
negative log likelihood 4.543146133422852
Input to neural network m
Expected output from neural network a
Probability of a: 0.025809239596128464
log likelihood -3.657022714614868
negative log likelihood 3.657022714614868
Input to neural network a
Expected output from neural network .
Probability of .: 0.051668521016836166
log likelihood -2.9629065990448
negative log likelihood 2.9629065990448
Average negative log likelihood 3.153228998184204
We get an average negative log-likelihood of 3.15, which isn't great. Let's try to improve this model and train it on the entire dataset.
Using Entire dataset
Creating dataset
# create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)
# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
number of examples: 228146
Training loop
Bringing together what we discussed above
- Forward pass: convert the input character indexes to one-hot encodings, compute the matrix product with the network weights, apply softmax to obtain probabilities from the logits, and calculate the negative log-likelihood
- Backward pass: reset the gradient to None and backpropagate the loss through the network
- Update: adjust the parameters with respect to their gradients
# gradient descent
for k in range(50):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())
    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()
    # update
    W.data += -50 * W.grad
3.7686190605163574
3.3788065910339355
3.16109037399292
3.0271859169006348
2.9344840049743652
2.867231607437134
2.8166539669036865
2.777146100997925
2.745253801345825
2.7188303470611572
2.696505308151245
2.6773722171783447
2.6608052253723145
2.6463513374328613
2.633665084838867
2.622471570968628
2.6125476360321045
2.6037068367004395
2.595794916152954
2.5886809825897217
2.5822560787200928
2.5764293670654297
2.5711236000061035
2.566272735595703
2.5618226528167725
...
2.5137410163879395
2.512698173522949
2.511704444885254
2.5107581615448
An interesting thing to note in the above code is
loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
especially the 0.01*(W**2).mean() term. This is termed a regularization factor and is used for model smoothing. Essentially, it controls the smoothness of the model weights: by keeping the constant large, we force the mean of W**2 to be small, pushing the weights toward zero and closer to each other in order to reduce the loss; by keeping the constant small, we reduce the contribution of this term, allowing W to take on larger weights as well.
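To get a feel for how much this term actually contributes, here is a small sketch (assuming the W, probs, ys, and num variables from the training loop above are still in scope) that evaluates the two components of the loss separately:
# Split the loss into its data term and its regularization term
data_loss = -probs[torch.arange(num), ys].log().mean()
reg_loss = 0.01 * (W**2).mean()
print(f"data loss: {data_loss.item():.4f}, regularization loss: {reg_loss.item():.4f}")
# A larger coefficient than 0.01 penalizes large weights more strongly (a smoother model);
# a smaller one lets the data term dominate.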
Inference
# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)
ch = 'a'
for i in range(10):
    out = [ch]
    ix = stoi[ch]
    while len(out) < 6:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # predict log-counts
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True) # probabilities for next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            continue # ignore '.'
        out.append(itos[ix])
    print(f"{i} : {''.join(out)}")
0 : asonde
1 : adiana
2 : asahpx
3 : anayan
4 : ankohi
5 : antoli
6 : araste
7 : azzada
8 : aheiau
9 : ayanil
Conclusion
There you have it: a bigram model that learns from the input list of names and evaluates itself using the log-likelihood method. Granted, the bigram method only takes into account local relationships between pairs and, in turn, ignores the context of the word as a whole, but it's a great place to start when learning language modeling. With the above approach of using a neural network, we achieved a loss of 2.5107581615448, which isn't better than what we had using a probabilistic model where we counted the occurrences. This is to be expected due to the nature of the bigram problem. Since, at any moment, we have very limited information (the previous character), the manual approach of counting and using frequencies to predict the next character turns out to be equivalent to using gradient-based methods to optimize neural network weights, so the output of both methods remains the same. One thing to note here is that even though both models perform equally badly, the neural network model is very flexible in terms of input. We can extend the input to include a bunch of characters if we want; the frequency-distribution method, on the other hand, doesn't scale well as the number of input characters increases.
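To illustrate that flexibility, here is a minimal sketch (not part of the original code; the two-character context and the weight tensor W2 are hypothetical) of how the input could be extended to two characters by concatenating their one-hot vectors, so a single linear layer maps 54 inputs to 27 logits:
import torch
import torch.nn.functional as F
context = ['e', 'm'] # hypothetical two-character context, encoded with the stoi mapping from above
xenc2 = torch.cat([F.one_hot(torch.tensor(stoi[c]), num_classes=27) for c in context]).float() # shape (54,)
W2 = torch.randn((54, 27)) # weights now map a 54-dimensional input to 27 logits
logits2 = xenc2 @ W2
probs2 = logits2.exp() / logits2.exp().sum() # softmax over the next-character distribution
print(probs2.shape) # torch.Size([27])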
Published via Towards AI