Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

# Bigram Language Modeling From Scratch

#### Author(s): Abhishek Chaudhary

Originally published on Towards AI.

Language modeling is all about how computers understand and generate human language. It’s a key part of making AI systems that can communicate with us effectively. In recent years, ChatGPT has been a prominent topic of discussion in tech articles. Chat-Generative Pre-trained Transformer has allowed general users to make use of a highly proficient language model to answer their queries. How does ChatGPT accomplish this task? Firstly, it’s pre-trained on a solid chunk of the internet, which, when coupled with careful preprocessing, training, and fine-tuning, can work wonders. Secondly, it works on the principle of finding the next chunk, given an input sentence. This coupled of attention mechanism allows ChatGPT to create whole coherent paragraphs.

In this article, we’ll discuss a similar, much less powerful language model. Our model will be a character-level model as opposed to a word/chunk-level model and would predict the next character given a previous character.

The code for this article can be found in the following Jupyter notebook

## Bigram model

The essence of the bigram model in language modeling is to approximate the probability of a word sequence by considering the probability of each word given its immediate predecessor.

The probability of a word sequence (W = w_1, w_2, …, w_n) is represented as follows:

`P(W) = P(w_1, w_2, ..., w_n) ≈ P(w_1) * P(w_2 U+007C w_1) * P(w_3 U+007C w_2) * ... * P(w_n U+007C w_{n-1})`

Where:

• `P(w_1)` is the probability of the first word in the sequence.
• `P(w_i U+007C w_{i-1})` is the conditional probability of the word `w_i` given that the previous word is `w_{i-1}`.

The conditional probability `P(w_i U+007C w_{i-1})` for a bigram is estimated from a corpus of text as follows:

`P(w_i U+007C w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})`
• `Count(w_{i-1}, w_i)` represents the number of times the word `w_i` follows the word `w_{i-1}` in the corpus.
• `Count(w_{i-1})` is the number of times the word `w_{i-1}` appears in the corpus.

## Implementation of the Bigram Model

For input corpus, we’ll make use of all the strings in `names.txt`

`words = open('names.txt', 'r').read().splitlines()words[:10]`
`['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']`
`print(f"Number of words in corpus {len(words)}")print(f"Shortest name in corpus {min(len(w) for w in words)}")print(f"Longest name in corpus {max(len(w) for w in words)}")`
`Number of words in corpus 32033Shortest name in corpus 2Longest name in corpus 15`

Next, we’ll create bigram pairs and store their count in a dictionary. We’ll add a special token/character to denote the start and end of the word

`bg_pairs = dict()for word in words: word = ['.'] + list(word) + ['.'] for ch1, ch2 in zip(word, word[1:]): bg_pair = (ch1, ch2) bg_pairs[bg_pair] = bg_pairs.get(bg_pair, 0) + 1`

Printing out the top 10 bigram pair based on their frequency

`sorted(bg_pairs.items(), key = lambda kv: -kv[1])[:10]`
`[(('n', '.'), 6763), (('a', '.'), 6640), (('a', 'n'), 5438), (('.', 'a'), 4410), (('e', '.'), 3983), (('a', 'r'), 3264), (('e', 'l'), 3248), (('r', 'i'), 3033), (('n', 'a'), 2977), (('.', 'k'), 2963)]`

Bigram `(n, .)` is the most frequent pair, which means names end most frequently with `n` in our dataset. Let's try to visualize the data that we have. We'll make use of matplotlib, but since it doesn't recognize text, we'll convert our characters into integer representation.

`character_list = sorted(list(set(''.join(words))))character_list`
`['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']`
`stoi = {s:i+1 for i,s in enumerate(character_list)} # Adding 1 to each index so that special character can be given index 0stoi['.'] = 0itos = {i:s for s,i in stoi.items()} # Create reverse mapping as wellstoi`
`{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}`

Now that we have the dictionary to map the characters to integers, let’s make use of a PyTorch tensor to store the frequency of bg_pairs

`import torchN = torch.zeros((27, 27), dtype=torch.int32)for word in words: word = ['.'] + list(word) + ['.'] for ch1, ch2 in zip(word, word[1:]): idx1 = stoi[ch1] idx2 = stoi[ch2] N[idx1, idx2] += 1`
`import matplotlib.pyplot as pltimport numpy as np%matplotlib inlineplt.figure(figsize=(16,16))plt.imshow(N, cmap='Blues')for i in range(27): for j in range(27): chstr = itos[i] + itos[j] plt.text(j, i, chstr, ha="center", va="bottom", color='gray') plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')plt.axis('off');`

In the diagram above, the value corresponding to each pair `(ch1, ch2)` denotes how many times has ch1 appeared before ch2. Pair like `(a, .)` show how many names have ended with `a` and similarly pairs like `(., a)` show the count of names starting with `a`.

Now, we’ll need to convert the above frequency distribution to a probabilistic distribution. Let’s see what that looks like for the first row.

`N[0]`
`tensor([ 0, 4410, 1306, 1542, 1690, 1531, 417, 669, 874, 591, 2422, 2963, 1572, 2538, 1146, 394, 515, 92, 1639, 2055, 1308, 78, 376, 307, 134, 535, 929], dtype=torch.int32)`
`p = N[0].float() # Convert to float so that probability doesn't roundoff to 0p /= N[0].sum()p`
`tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273, 0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029, 0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])`

This gives us the probabilistic distribution. For the example above, for row-indexed `0` we have this probability distribution. To sample an index from a probability distribution, we can make use of `torch.multinomial`

`gen = torch.Generator().manual_seed(2147483647) # We'll use generator with a seed so as to make this experiment deterministic[itos[ix.item()] for ix in torch.multinomial(p, num_samples=3, replacement=True, generator=gen)] # Sampling 3 characters from the probability distribution p`
`['m', 's', 'n']`

Now we know how to get `num_samples` from a probability distribution `p`. Next we'll convert the tensor `N` from frequency distribution to probability distribution. Since `N` is a 2D tensor of shape `(27, 27)`, to normalize it we'll need to divide each element `N[i, j]` by `SUM(N[i])`

`P = (N).float()P /= N.sum(1, keepdims=True) # 1 denotes summation along idx 1, which is column.P.shape`
`torch.Size([27, 27])`

## Broadcasting

In the code above we used `N.sum(1, keepdims=True)`. Let's look deeper into this expression to understand the nuances of broadcasting in pytorch. The shape of N.sum(1) is `torch.Size([27])` whereas shape of N.sum(1, keepdims=True) is `torch.Size([27, 1])`

`N.shape`
`torch.Size([27, 27])`
`N.sum(1).shape`
`torch.Size([27])`

When we try to divide shape `(27, 27)` by `(27)`, shape is converted to `(1, 27)` and then stretched out across every row to create shape `(27, 27)`. This means on division we'll get `N[i, j]/ sum of N[:, j]`

`N.sum(1, keepdims=True).shape`
`torch.Size([27, 1])`

When we try to divide shape `(27, 27)` by `(27, 1)`, shape is stretched out across every column to create shape `(27, 27)`. Which means on division we'll get `N[i, j]/ sum of N[i, :]` which is what we expect.

## Likelihood

Now, let’s calculate the probability our model assigns to some of the words from our corpus. Ideally this probability would be `1` since those words are present in the dataset.

`gen = torch.Generator().manual_seed(2147483647)for i in range(10): out = [] ix = 0 while True: p = P[ix] ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item() out.append(itos[ix]) if ix == 0: break print(''.join(out))`
`junide.janasah.p.cony.a.nn.kohin.tolian.juee.ksahnaauranilevias.`

The above result might look like gibberish, to convince ourselves that our model is indeed learning something, let’s try this out

`for word in words[:3]: word = ['.'] + list(word) + ['.'] for ch1, ch2 in zip(word, word[1:]): idx1 = stoi[ch1] idx2 = stoi[ch2] print(f"{ch1} {ch2} {P[idx1, idx2]: .4f}")`
`. e 0.0478e m 0.0377m m 0.0253m a 0.3899a . 0.1960. o 0.0123o l 0.0780l i 0.1777i v 0.0152v i 0.3541i a 0.1381a . 0.1960. a 0.1377a v 0.0246v a 0.2495a . 0.1960`

Here we see the probability of each pair, now if the model wasn’t learning any patterns each of this would have been equally likely i.e `1/27 ~ 0.037`, but the models have higher probability(likelihood) for some pairs and less probability(likelihood) for others.

## Evaluating the quality of the model using Negative log-likelihood

Likelihood is just another term for the probability of an item or bigram in our case. Since probabilities can go from 0 to 1, to have a smooth, continuous loss function, it makes sense to have a log of probability or log-likelihood

`x = np.linspace(0.000001, 1, 100)y = np.log(x)plt.plot(x, y, label='y = log(x)')`

Calculating log-likelihood of the pairs

`log_likelihood = 0for word in words[:3]: word = ['.'] + list(word) + ['.'] for ch1, ch2 in zip(word, word[1:]): idx1 = stoi[ch1] idx2 = stoi[ch2] prob = P[idx1, idx2] logprob = torch.log(prob) log_likelihood += logprob print(f"{ch1} {ch2} {prob: .4f} {logprob: .4f}")print(f"{log_likelihood=}")negative_log_likelihood = -log_likelihoodprint(f"{negative_log_likelihood=}")`
`. e 0.0478 -3.0408e m 0.0377 -3.2793m m 0.0253 -3.6772m a 0.3899 -0.9418a . 0.1960 -1.6299. o 0.0123 -4.3982o l 0.0780 -2.5508l i 0.1777 -1.7278i v 0.0152 -4.1867v i 0.3541 -1.0383i a 0.1381 -1.9796a . 0.1960 -1.6299. a 0.1377 -1.9829a v 0.0246 -3.7045v a 0.2495 -1.3882a . 0.1960 -1.6299log_likelihood=tensor(-38.7856)negative_log_likelihood=tensor(38.7856)`

We can see that for numbers with higher probability like `m a` loglihelihood is closer to 0, whereas for pair like `. e` reaches value `-3.0408`. To evaluate the model, we would like to have a number that represents the efficiency of the model. For this, we can add the log probabilities for each word. Which gives us `-38.7856`.

The lower value of log-likelihood means the model is performing badly, which is counterintuitive to the definition of the loss function, so what we want is to have a negative log-likelihood and try to minimize that. We can also make use of averaging the value of negative log-likelihood. Running this for the entire dataset now

`log_likelihood = 0.0n = 0for w in words: chs = ['.'] + list(w) + ['.'] for ch1, ch2 in zip(chs, chs[1:]): ix1 = stoi[ch1] ix2 = stoi[ch2] prob = P[ix1, ix2] logprob = torch.log(prob) log_likelihood += logprob n += 1print(f'{log_likelihood=}')nll = -log_likelihoodprint(f'{nll=}')print(f'{nll/n}')`
`log_likelihood=tensor(-559891.7500)nll=tensor(559891.7500)2.454094171524048`

## Inference

Let’s try to get a few name recommendations by inputting it’s length and starting character only.

`ch = 'a'for i in range(10): out = [ch] while len(out) < 6: ix = stoi[ch] p = P[ix] ix = torch.multinomial(p, num_samples=1, replacement=True, generator=gen).item() if itos[ix] == '.': continue # ignore '.' out.append(itos[ix]) print(f"{i} : {''.join(out)}")`
`0 : anlylr1 : ailnyh2 : amnrir3 : arrylr4 : asgirn5 : adrhyi6 : amnrkn7 : anvjyn8 : asnihd9 : ahhuln`

## Using Neural Network

In the above method, we created a bigram model by extracting information regarding bigram frequency from the corpus and converting it to a probability distribution. We can look at the problem from a different perspective by modeling it using a neural network. In my previous article discussing neural networks from scratch, we came across the expression — W*X+b, where W is the weight for an individual neuron, X is the input to that neuron, and b is the bias. Let’s try to convert the data that we have into something that can be used by a neural network. We’ll use a diagram from a single word so that it’s easier to understand what’s going on.

We’ll separate the training data into x and y. Such that y[i] is the expected character for input x

`## Preparing training datasetXs=[]Ys=[]for w in words[:1]: chs = ['.'] + list(w) + ['.'] for ch1, ch2 in zip(chs, chs[1:]): ix1 = stoi[ch1] ix2 = stoi[ch2] Xs.append(ix1) Ys.append(ix2)Xs = torch.tensor(Xs)Ys = torch.tensor(Ys)Xs, Ys`
`(tensor([ 0, 5, 13, 13, 1]), tensor([ 5, 13, 13, 1, 0]))`

Neural networks don’t understand indexes, so we’ll convert the indexes into one hot encoding vector.

## One hot encoding

One hot encoding takes the integer and num of classes as input and creates a vector of dimension (1 x num of classes) with a single-bit set.

`import torch.nn.functional as FF.one_hot(Xs[1], num_classes=27)`
`tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])`

We’ll do the same for entire X

`Xenc = F.one_hot(Xs, num_classes=27).float()plt.imshow(Xenc)`

The image above shows set bit in one-hot encoding for each of the 5 input indexes [ 0, 5, 13, 13, 1]. Using the neural network expression we saw above W*X +b, let’s create a tensor W. For the purpose of demonstration, let’s have a single neuron. This neuron would take input of shape (1,27) and output a tensor of shape (1,1).

## Formulating Neural Network as Bigram Model

We’ll use a very simple neural network — A single linear layer and no bias.

As we are already familiar with neural network expression

Y = W*X + b

We’ll initialize w with random values and take a dot product with input tensor.

`w = torch.randn((27, 1))Xenc @ w # Matrix dot product`
`tensor([[ 0.3378], [ 1.0748], [ 0.1643], [ 0.1643], [-1.3647]])`

For each of 5 Xs[i] we get an output tensor, now let’s bump up the number of neurons to 27, thereby making the shape of W (27,27) and shape of output (1, 27)

`W = torch.randn((27, 27))logits = Xenc @ Wlogits`
`tensor([[ 1.6538, -0.8016, -1.0628, 0.5207, -0.8133, 1.8273, 1.5478, -0.9953, 0.5832, 0.9147, -0.3186, 1.9207, 2.1282, 1.2778, 0.4173, -1.0522, 0.8507, 0.0043, -0.4466, 0.9321, -0.9775, 0.5061, 1.1806, 1.1374, 0.3909, 0.6887, -0.5974], [ 0.3734, -0.9440, 2.1173, 0.4307, -0.8386, 0.2973, -0.1508, -1.1140, 1.5063, 0.4548, 0.3725, 0.3407, -0.0376, 1.6323, -0.7899, -0.9287, 1.0687, 1.3284, 0.2154, 0.1661, -0.4255, -1.4457, 0.0080, 1.1457, 0.1350, 0.6581, 0.5713], [-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499, -0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648, -0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099, -0.4243, -1.3913, -1.1660], [-0.3007, -0.1158, 1.0404, 0.3564, -0.3480, 0.1582, -0.5061, 0.5499, -0.1747, -1.1983, -0.1389, -1.2891, 0.8116, -1.0019, 1.2577, 1.1648, -0.0425, 0.9556, -0.8966, 1.4691, -0.8577, -0.2066, -0.1065, 0.1099, -0.4243, -1.3913, -1.1660], [ 0.4654, -0.5457, 0.0731, 1.3982, -0.6943, -0.3172, 0.6360, -2.0058, -1.7574, 0.1172, 0.2438, -0.4918, -0.7632, 0.4355, 0.0921, 1.0822, 0.6615, 0.2039, -0.2937, 0.9257, -0.1299, -0.1696, -0.8557, 0.3851, -1.4590, -0.7883, -1.2211]])`

In the example above, Xenc @ W[i, j] is the activation/output of jth neuron for ith input. The values we get out of the neurons are termed as logits or log-counts. To convert them to a values resembling counts we can take exp of the logits and normalize them to get probabilities. This entire operation of converting logits to probabilities is also referred to as Softmax operation.

`counts = logits.exp()probs = counts/counts.sum(1, keepdim=True)probs[0], probs[0].sum()`
`(tensor([0.0816, 0.0070, 0.0054, 0.0263, 0.0069, 0.0970, 0.0734, 0.0058, 0.0280, 0.0389, 0.0113, 0.1065, 0.1311, 0.0560, 0.0237, 0.0054, 0.0365, 0.0157, 0.0100, 0.0396, 0.0059, 0.0259, 0.0508, 0.0487, 0.0231, 0.0311, 0.0086]), tensor(1.))`

Evaluating the model for the 5 bigram pairs above by calculating log-likelihood and negative log-likelihood.

`nlls = torch.zeros(5)for i in range(5): ix = Xs[i].item() iy = Ys[i].item() print(f"Input to neural network {itos[ix]}") print(f"Expected output from neural netork {itos[iy]}") likelihood_y = probs[i, iy] print(f"Probability of {itos[iy]}: {likelihood_y}") log_likelihood_y = likelihood_y.log() print(f"log likelihood {log_likelihood_y}") negative_log_likelihood_y = -log_likelihood_y print(f"negative log likelihood {negative_log_likelihood_y}") nlls[i] = negative_log_likelihood_yloss = nlls.mean()print(f"Average negative log likelihood {loss}")`
`Input to neural network .Expected output from neural netork eProbability of e: 0.09700655937194824log likelihood -2.3329765796661377negative log likelihood 2.3329765796661377Input to neural network eExpected output from neural netork mProbability of m: 0.10330255329608917log likelihood -2.2700932025909424negative log likelihood 2.2700932025909424Input to neural network mExpected output from neural netork mProbability of m: 0.01063988171517849log likelihood -4.543146133422852negative log likelihood 4.543146133422852Input to neural network mExpected output from neural netork aProbability of a: 0.025809239596128464log likelihood -3.657022714614868negative log likelihood 3.657022714614868Input to neural network aExpected output from neural netork .Probability of .: 0.051668521016836166log likelihood -2.9629065990448negative log likelihood 2.9629065990448Average negative log likelihood 3.153228998184204`

We get an average log-likelihood of 3.15, which isn’t great. Let’s try to improve this model and train it on the entire dataset

## Creating dataset

`# create the datasetxs, ys = [], []for w in words: chs = ['.'] + list(w) + ['.'] for ch1, ch2 in zip(chs, chs[1:]): ix1 = stoi[ch1] ix2 = stoi[ch2] xs.append(ix1) ys.append(ix2)xs = torch.tensor(xs)ys = torch.tensor(ys)num = xs.nelement()print('number of examples: ', num)# initialize the 'network'g = torch.Generator().manual_seed(2147483647)W = torch.randn((27, 27), generator=g, requires_grad=True)`
`number of examples: 228146`

## Training loop

Bringing together what we discussed above

• Forward pass to convert input character index to one hot encoding, calculate dot product with neural network weights, performing softmax to obtain probabilities from logits, calculating nll
• Backward pass involving setting gradient to None and performing backpropagation across the neural network
• Updating parameter wrt to the gradients
`# gradient descentfor k in range(50): # forward pass xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding logits = xenc @ W # predict log-counts counts = logits.exp() # counts, equivalent to N probs = counts / counts.sum(1, keepdims=True) # probabilities for next character loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean() print(loss.item())  # backward pass W.grad = None # set to zero the gradient loss.backward()  # update W.data += -50 * W.grad`
`3.76861906051635743.37880659103393553.161090373992923.02718591690063482.93448400497436522.8672316074371342.81665396690368652.7771461009979252.7452538013458252.71883034706115722.6965053081512452.67737221717834472.66080522537231452.64635133743286132.6336650848388672.6224715709686282.61254763603210452.60370683670043952.5957949161529542.58868098258972172.58225607872009282.57642936706542972.57112360000610352.5662727355957032.5618226528167725...2.51374101638793952.5126981735229492.5117044448852542.5107581615448`

An interesting thing to note in the above code is

loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()

specially the 0.01*(W**2).mean(). This is also termed a regularization factor and is used for model smoothening. Essentially, what we control is the smoothness of the model weights, by keeping the constant large, we force the mean of W**2 to be small and have values closer to each other to reduce loss, and by keeping the constant small, we reduce the contribution of W weights, thus allowing W to have larger weights as well.

## Inference

`# finally, sample from the 'neural net' modelg = torch.Generator().manual_seed(2147483647)ch = 'a'for i in range(10): out = [ch] ix = stoi[ch] while len(out) < 6: xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float() logits = xenc @ W # predict log-counts counts = logits.exp() # counts, equivalent to N p = counts / counts.sum(1, keepdims=True) # probabilities for next character ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item() if ix == 0: continue out.append(itos[ix]) print(f"{i} : {''.join(out)}")`
`0 : asonde1 : adiana2 : asahpx3 : anayan4 : ankohi5 : antoli6 : araste7 : azzada8 : aheiau9 : ayanil`

## Conclusion

There you have it. A bigram model that learns from the input list of names and evaluates itself using the log-likelihood method. Granted that the Bigram method only takes into account local relationships between pairs and, in turn, ignores the context of the word as a whole, it’s a great place to start when learning language modeling. With the above approach of using a neural network, we achieved a loss of 2.5107581615448, which isn’t better than what we had using a probabilistic model where we counted the occurrences. This is to be expected due to the nature of the bigram problem. Since, at any moment, we have very limited information (previous character), the manual approach of counting and using frequency to predict the next character turns out to be the same as using gradient-based methods to optimize neural network weights. Thus, the output of both methods remains the same. One thing to note here is that even though both models perform equally badly, the neural network model is very flexible in terms of input. We can extend the input to include a bunch of characters if we want, frequency distribution method on the other hand doesn’t scale well with the increase in number of input characters.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI