Generate Quotes with Web Scraping, GloVe Embeddings, and LSTM in PyTorch
Last Updated on July 20, 2023 by Editorial Team
Author(s): Lakshmi Narayana Santha
Originally published on Towards AI.
Introduction
With the rapid advances in NLP research, especially in language models, text generation has become a classic machine learning task, traditionally solved with recurrent networks.
In this article we walk through generating English quotations from scratch with a word-level language model, proceeding as follows:
1. Web scraping to get a dataset
2. GloVe embedding training
3. LSTM model
4. Prediction
Web Scraping
We use web scraping to collect quotes from the website Wise Old Sayings.
This website hosts almost 4,000 quotes across different categories. A full scraping tutorial is out of scope for this article, but scraping this site is permitted by its robots.txt.
The home page lists all categories, and each category URL differs from the home URL only by its suffix. So first gather all category links, then crawl each category and extract the quotes from its pages.
Note: A single category may span multiple pages like page-1, page-2…
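A minimal sketch of this two-step crawl is below. The base URL, the link filter, the CSS class, and the pagination pattern are assumptions for illustration; the actual site markup may differ.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.wiseoldsayings.com/"   # assumed base URL

def get_category_links():
    # Step 1: collect the category links listed on the home page (link filter is an assumption)
    soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")
    return {urljoin(BASE_URL, a["href"])
            for a in soup.find_all("a", href=True) if "quotes" in a["href"]}

def get_quotes(category_url):
    # Step 2: crawl one category, following page-1, page-2, ... until a page is missing
    quotes, page = [], 1
    while True:
        resp = requests.get(f"{category_url.rstrip('/')}/page-{page}")  # pagination scheme assumed
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        found = [q.get_text(strip=True) for q in soup.select(".quote")]  # CSS class assumed
        if not found:
            break
        quotes.extend(found)
        page += 1
    return quotes

all_quotes = []
for link in get_category_links():
    all_quotes.extend(get_quotes(link))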
I have uploaded the extracted quotes to Kaggle as a dataset.
English Quotations
4,000 English quotations, web scraped
GloVe (Global Vectors for Word Representation)
For language models, mapping words to dense embeddings keeps the dimension space low compared to one-hot encoding, where the dimension equals the (very large) vocabulary size.
The GloVe method captures both the global context and the local context of word pairs.
Read more on GloVe in the paper published here.
Using pre-trained GloVe weights would already place words in a meaningful vector space, but in this article we train our own GloVe embeddings.
Words in the vocabulary are represented as embeddings and trained with the GloVe approach, so we build the vocabulary while reading the raw quotes from the dataset.
Keep only quotes with a minimum word count of 5 (user's choice) to maintain a minimum quotation length.
Since quotes contain ',', '.', and '-' right next to a word, as in "Ciao Bello, Howdy? ", add a space between words and special characters for better tokenization.
Also append ';' as the end-of-quote token.
Collect the unique words across all quotes to build the vocabulary.
Since words cannot be fed directly into a neural network, map words to integers, and keep the reverse mapping from integers to words to recover the words later. A minimal preprocessing sketch follows.
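Here is a sketch of this preprocessing, assuming the quotes are already loaded as a list of strings; the function and variable names are illustrative, not the article's exact code.

import re

def preprocess(quotes, min_words=5):
    processed, vocab = [], set()
    for q in quotes:
        # put a space around punctuation so it tokenizes as separate symbols
        q = re.sub(r"([,.\-!?;:])", r" \1 ", q)
        tokens = q.split()
        if len(tokens) < min_words:   # drop very short quotes
            continue
        tokens.append(";")            # ';' marks end-of-quote
        processed.append(tokens)
        vocab.update(tokens)
    word2int = {w: i for i, w in enumerate(sorted(vocab))}
    int2word = {i: w for w, i in word2int.items()}
    return processed, word2int, int2word

quotes_tokens, word2int, int2word = preprocess(raw_quotes)  # raw_quotes: list of quote strings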
To train GloVe embeddings we need the co-occurrence matrix of the words in the vocabulary.
Follow a sliding-window method: fix a window size (the context range between words), iterate over each quote, and calculate a similarity score as the reciprocal of the distance between two words within the window.
The co-occurrence matrix captures the global context of words in the corpus. Here it is stored as a dictionary with word pairs as keys and the accumulated similarity scores as values.
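A sketch of building such a dictionary, reusing the tokenized quotes and word2int mapping from above. The window size of 5 matches the context range quoted later; the exact storage scheme is an assumption.

from collections import defaultdict

def build_co_occurrence(quotes_tokens, word2int, window_size=5):
    co_occ = defaultdict(float)
    for tokens in quotes_tokens:
        ids = [word2int[w] for w in tokens]
        for i, wi in enumerate(ids):
            # look ahead up to window_size words; weight by 1 / distance
            for offset in range(1, window_size + 1):
                if i + offset >= len(ids):
                    break
                wj = ids[i + offset]
                co_occ[(wi, wj)] += 1.0 / offset
                co_occ[(wj, wi)] += 1.0 / offset   # store both a-b and b-a, as noted later
    return co_occ

co_occ_matrix = build_co_occurrence(quotes_tokens, word2int)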
Observe the distribution of similarity scores between word pairs:
import torch
import seaborn as sns

topk, ind = torch.topk(torch.tensor(list(co_occ_matrix.values())), 200)
sns.distplot(topk)
The plot shows the distribution of the top 200 similarity scores, which mostly lie below 100; this is due to the small size of the corpus.
elem_den = []
for i in co_occ_matrix.values():
    if i > 0 and i < 10:
        elem_den.append(i)
sns.violinplot(x=elem_den)
The violin plot shows that most word pairs appear in the corpus very rarely, with a similarity score below 1 for the given context range of 5.
These word pairs, stored as dictionary keys, capture the local context, and we use them to train the embeddings.
GloVe tries to learn vector embeddings that respect both contexts, moving the embeddings of locally co-occurring word pairs towards their global co-occurrence similarity.
Like all neural networks, GloVe embeddings are trained by defining a loss function, shown below.
Think of the dot product plus the bias terms as the output y-hat of a classic MLP; we train these vectors to drive the loss towards zero, mapping y-hat to the target y, which here is the log of the similarity score.
The loss is a weighted sum, where the weight reflects how much each word pair contributes: rare word pairs are down-weighted by raising their score to a fractional power.
In the weighting function, alpha is less than one; word pairs with a similarity score below the threshold get a reduced weight, while scores above the threshold keep a weight of one.
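For reference, the loss described here follows the standard GloVe objective from the paper. With co-occurrence score X_ij, word vectors u_i and u_j, and biases b_i and b_j, it can be written as:

J = \sum_{i,j} f(X_{ij}) \, (u_i^\top u_j + b_i + b_j - \log X_{ij})^2

f(x) = (x / x_{max})^{\alpha}  if x < x_{max},  else 1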
We use 128 dimensions for word embeddings and train for only 100 epochs to avoid overfitting as the corpus size is small.
The input to the GloVe model is a batch of word pairs of shape (batch_size, indices_1, indices_2), where the indices are the integer ids of the words in the batch; the PyTorch embedding layer automatically creates a weight matrix of shape (vocab_len, num_dim) for these indices.
The weights trained in the GloVe model, of shape (vocab_len, num_dim), are the embeddings of each vocabulary word in num_dim dimensions.
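A minimal sketch of such a module, consistent with the glove.ui / glove.uj layers referenced below. The bias layer names, the x_max and alpha values, and the loss wiring are assumptions following the standard GloVe formulation.

import torch
import torch.nn as nn

class Glove(nn.Module):
    def __init__(self, vocab_len, num_dim=128):
        super().__init__()
        self.ui = nn.Embedding(vocab_len, num_dim)   # embedding for the first word of a pair
        self.uj = nn.Embedding(vocab_len, num_dim)   # embedding for the second word of a pair
        self.bi = nn.Embedding(vocab_len, 1)         # per-word bias terms (assumed names)
        self.bj = nn.Embedding(vocab_len, 1)

    def forward(self, i_idx, j_idx):
        # dot product of the two word vectors plus both biases -> y-hat
        dot = (self.ui(i_idx) * self.uj(j_idx)).sum(dim=1)
        return dot + self.bi(i_idx).squeeze(1) + self.bj(j_idx).squeeze(1)

def weighted_glove_loss(y_hat, x_ij, x_max=100.0, alpha=0.75):
    # f(x) down-weights rare pairs; the target is the log of the co-occurrence score
    weights = torch.clamp((x_ij / x_max) ** alpha, max=1.0)
    return (weights * (y_hat - torch.log(x_ij)) ** 2).mean()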
Since we stored each word pair in both directions, a-b and b-a, to increase the input size, we simply add the two weight matrices Ui and Uj.
emb_i = glove.ui.weight.cpu().data.numpy()
emb_j = glove.uj.weight.cpu().data.numpy()
emb = emb_i + emb_j
emb holds the final word embeddings for the vocabulary, of shape (vocab_len, num_dim).
Visualize these embeddings by mapping the high-dimensional vector space to 2D with t-SNE.
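A quick sketch of such a projection with scikit-learn and matplotlib; plotting only every 50th word is an assumption to keep the labels readable.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# project the embeddings down to 2D
emb_2d = TSNE(n_components=2, random_state=0).fit_transform(emb)

# plot a small subset of words so the labels stay legible
plt.figure(figsize=(10, 10))
for idx in range(0, len(emb_2d), 50):
    x, y = emb_2d[idx]
    plt.scatter(x, y, s=5)
    plt.annotate(int2word[idx], (x, y), fontsize=8)
plt.show()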
Quotation_Generation_Glove_LSTM_Pytorch
Run the model in Google Colab if you want to.
Quotation Generation: LSTM Model
As with all language models, variable-length sequences cannot be fed directly to the network, since neural networks accept only fixed-size inputs. So we take a fixed sequence length and roll any extra words over into the next input.
We follow the classic text generation setup: pass a fixed-size input sequence and predict the next word as output. We prepare the dataset by pairing each fixed-size input sequence with the word that follows it, doing this for every quote in the corpus; a sketch of this preparation follows the example below.
Ex: Sequence "Love means never having to say you're sorry . ;"
';' is the end-of-quote token. With a fixed input length of 7, the input sequences and next-word outputs are:

Input                                      Next word
_______________________________________    _________
Love means never having to say you're      sorry
means never having to say you're sorry     .
never having to say you're sorry .         ;
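A sketch of this dataset preparation, reusing the tokenized quotes and word2int mapping from earlier; the sequence length of 7 matches the example, and the variable names are illustrative.

import torch

def make_sequences(quotes_tokens, word2int, seq_len=7):
    inputs, targets = [], []
    for tokens in quotes_tokens:
        ids = [word2int[w] for w in tokens]
        # every window of seq_len words predicts the word that follows it
        for i in range(len(ids) - seq_len):
            inputs.append(ids[i:i + seq_len])
            targets.append(ids[i + seq_len])
    return torch.tensor(inputs), torch.tensor(targets)

X, y = make_sequences(quotes_tokens, word2int)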
We use a simple model: a single LSTM layer followed by a Dense (linear) layer that produces scores over the vocabulary for the next word, trained with cross-entropy loss.
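A minimal sketch of such a model, initialising the embedding layer from the trained GloVe weights; the hidden size and the choice to copy emb into the embedding layer are assumptions consistent with the article.

import torch
import torch.nn as nn

class QuoteLSTM(nn.Module):
    def __init__(self, emb, hidden_dim=256):
        super().__init__()
        vocab_len, num_dim = emb.shape
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_len, num_dim)
        self.embedding.weight.data.copy_(torch.tensor(emb))   # start from the trained GloVe embeddings
        self.lstm = nn.LSTM(num_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_len)             # scores over the vocabulary

    def forward(self, x, hidden):
        out, hidden = self.lstm(self.embedding(x), hidden)
        return self.fc(out[:, -1, :]), hidden                  # predict from the last time step

    def init_hidden(self, batch_size, device="cpu"):
        # fresh (hidden activation, cell state), re-initialised each epoch as noted below
        return (torch.zeros(1, batch_size, self.hidden_dim, device=device),
                torch.zeros(1, batch_size, self.hidden_dim, device=device))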
At each epoch the cell state (h_c) and hidden activation (h_h) must be re-initialized; otherwise the loss won't decrease and there is effectively no training. Since the vocabulary is large and activations can grow with network size, gradient clipping prevents exploding gradients, which would otherwise turn the loss into NaN.
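A sketch of a training loop reflecting those two points; the batching, learning rate, epoch count, and clip value are assumptions.

import torch
import torch.nn as nn

model = QuoteLSTM(emb)                          # model sketched above
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_size = 64

for epoch in range(50):
    hidden = model.init_hidden(batch_size)      # re-initialise (h_h, h_c) every epoch
    for i in range(0, len(X) - batch_size, batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        hidden = tuple(h.detach() for h in hidden)   # detach so gradients don't flow across batches
        optimizer.zero_grad()
        logits, hidden = model(xb, hidden)
        loss = criterion(logits, yb)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # gradient clipping against NaN losses
        optimizer.step()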
Generating Quotes
After training the model, which outputs probabilities for the next word given an input sequence, quote generation works as follows: feed an initial seed to the network to get the next-word probabilities, sample from them using temperature sampling, append the sampled word to the end of the input while dropping its first word so the window size stays fixed, and feed the result back into the network. Repeat until the end-of-quote character is produced or the generated length reaches max_len.
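A sketch of that generation loop; the temperature value, max_len, and the seed handling are assumptions, and ';' is the end-of-quote token from earlier.

import torch
import torch.nn.functional as F

def generate_quote(model, seed_words, word2int, int2word, max_len=50, temperature=0.8):
    model.eval()
    words = list(seed_words)
    window = [word2int[w] for w in words]
    with torch.no_grad():
        while len(words) < max_len:
            hidden = model.init_hidden(1)                 # fresh state for each window
            logits, _ = model(torch.tensor([window]), hidden)
            probs = F.softmax(logits.squeeze(0) / temperature, dim=-1)   # temperature sampling
            next_id = torch.multinomial(probs, 1).item()
            next_word = int2word[next_id]
            if next_word == ";":                          # stop at the end-of-quote token
                break
            words.append(next_word)
            window = window[1:] + [next_id]               # drop the first word, append the new one
    return " ".join(words)

print(generate_quote(model, ["love", "is"], word2int, int2word))  # example seed, assuming these words are in the vocabulary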
With few training epochs, a small corpus, and no hyper-parameter tuning, the model generated this quote:
"there is no one love was . life being you'll or and the , is all her all heaven comes with order they to you . , you hurt get when hope that for , of the which for for no others having take being the gone free is me the by knows where for"
Kaggle Notebook:
Quote Generation Glove training+LSTM pytorch
www.kaggle.com
Github repo:
santhalakshminarayana/Quotation_Generation_Glove_LSTM_Pytorch
Quotation generation using LSTM – Pytorch with custom embeddings trained with Glove model.
github.com
Published via Towards AI