From Pre-RNNs to GPT-4: How Large Language Models are Changing NLP

Last Updated on July 17, 2023 by Editorial Team

Author(s): Hitesh Hinduja

Originally published on Towards AI.

Introduction

Large Language Models (LLMs) are currently a popular topic of discussion across social media platforms such as LinkedIn and Twitter. Their prevalence is evident from the fact that ChatGPT has already made it to the “Trending” section on Twitter, and content created with these models is floating all over. Not long ago, I received a video on WhatsApp that showcased a husband’s remarkable ability to reconcile with his angry wife over text. What made the situation particularly noteworthy was that the husband used ChatGPT to help him navigate the conversation and persuade his wife to forgive him 😂

Another recent example that went viral was “Staging the Moon Landing,” a set of images made with Midjourney (source). This is just one among many!

Staging the Moon Landing (made with Midjourney)

A few weeks ago, my mom called me and asked, “Hitesh, what is this ChatGt… ChatT… ChatGPT?” I said, yes, Mom, you mean ChatGPT. Initially, she was quite impressed. We got on a video call, I shared my screen, and I showed her some prompts and the answers the model gave. Her first question after seeing the responses was, “How is it doing this?” I told her, why don’t you try asking ChatGPT itself how it does all this? And then our conversation ended. Fast forward a few weeks, we were talking again, and she asked, “Is there some new version of ChatGPT already?” I asked her curiously whether she was really following some blogs, or whether it was through her amazing WhatsApp messages and the people around her. She mentioned that people create images by writing text: the text translates into an image, and that image gets shared everywhere. I told her about GPT-4 and briefly explained the multimodal concept. She was initially impressed, but then she said, “Hitesh, I feel this is not right. I don’t know where the world is heading, and how would a person remain creative with such things getting developed?” Immediately after, she asked, “Are these models safe? What jobs will be impacted? Can you share some of the negative impacts?” I could relate to her concerns; however, I then explained a bit about “Responsible AI”, assuring her that it is essential to balance advancements in Generative AI with responsible implementation. She was convinced that we need to move forward with caution and responsibility to ensure that AI is used for the betterment of society, and that Generative AI and Responsible AI should go hand in hand.

You all might be wondering why I am narrating this here. Let me tell you the reason. It was around 2013/2014 when I started working on classical NLP. This included n-grams, meta/dense features, and topic modeling on millions of documents. During that time, Regular Expressions (Regex) were hugely popular. People used to take courses on the Regex101 website and many other platforms. I spent almost a month mastering regex in those days. I remember working on a project where regex was just one part of it. My task was to extract the top 2 pages out of a 1,000+ page report and then perform further analysis and topic modeling on top of them. Let me tell you, extracting those 2 pages was not an easy task. I had to stratify and sample several types of reports and then build an ensemble of regexes with several conditions. By “stratify and sample”, I mean manually scanning through reports, more than 700 of them over 2 months. The extraction accuracy started as low as 30% of the page/content getting extracted and eventually peaked at about 85%. When I finally deployed the application in production, it turned out to be ~175 expressions stacked together to reach that level of extraction. It was one of my finest experiences working on a project. Fast forward to today: work that used to take several months now happens in minutes. This is today:

ChatGPT results for the regular expression prompt

I asked the model to write a regular expression, and here we are. Most importantly, it will not just give you the expression but also an explanation for it. Anyone who has worked with regex knows how important an explanation is when you find an expression on the internet. I remember spending hours mastering regex to make that project successful, and today here we are. Hence, in today’s blog, we will see how the technology has evolved over time and where we have reached today. It is very important for all of us to know the limitations of every phase/model and “WHY” those limitations led to further innovations. I would also like to introduce Pragyan Prakash, who has contributed to the content in the following sections of this blog.
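Before we dive into the history, here is a small sketch of the kind of expression and usage such a prompt produces, using Python’s built-in re module. The pattern and the sample text below are made up for illustration; they are not the actual expressions from my old project or from ChatGPT’s output.

```python
import re

# Illustrative task: pull dates in DD/MM/YYYY or DD-MM-YYYY format out of report text.
pattern = r"\b(\d{2})[/-](\d{2})[/-](\d{4})\b"

text = "Audit completed on 14/03/2023; follow-up scheduled for 02-05-2023."
for day, month, year in re.findall(pattern, text):
    print(f"day={day}, month={month}, year={year}")
```

The explanation ChatGPT attaches, describing what each group and character class matches, is exactly the part that used to take the most time to reconstruct when borrowing a pattern from the internet.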

History of Language Models and Their Evolution with Time

In this section, we will provide an overview of the complete NLP ecosystem, including its evolution and current state. We will discuss the introduction of each phase and the models that were developed during it. Our goal is to share our learnings with you. As the number of models is vast, we will limit our explanation of each model to its “what”, “why”, and “how”, along with future developments. So get ready for a lot of studying and storytelling!

Let’s explore the evolution of Large Language Models (LLMs) and the various phases that have emerged over time. In this section, we will examine the limitations of each phase and the subsequent need for further development.

Early Phase (Phase before Recurrent Neural Networks)

The phase before the introduction of recurrent neural networks (RNNs) in natural language processing (NLP) was primarily characterized by rule-based and statistical methods, such as n-gram models and Hidden Markov Models (HMMs). These methods had several limitations, including:

1) Limited context understanding:

  • Rule-based methods were limited in their ability to understand and model complex language structures and dependencies, particularly in cases where context played a significant role.

2) Difficulty handling long sequences:

  • These methods struggled to handle long sequences of text, leading to poor performance on tasks like language generation and machine translation.

3) Handcrafted features:

  • These methods relied on human experts to handcraft features, making them less adaptable and scalable to new applications and domains.

4) Limited use of deep learning:

  • These methods had limited ability to leverage the power of deep learning techniques, which had been shown to be highly effective in other domains.
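To make the “limited context” point concrete, here is a minimal sketch of a bigram (n = 2) language model built from raw counts, assuming Python; the toy corpus is made up. The probability of the next word depends only on the single previous word, so anything further back in the sentence is invisible to the model.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```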

Phase 1

Recurrent Neural Networks (RNNs) gained popularity because they were able to overcome some of the limitations of traditional feedforward neural networks, which could only process fixed-length input sequences.

Here are some of the ways that RNNs overcame these limitations and gained popularity:

1) Handling Variable-Length Input Sequences:

  • RNNs are designed to process sequential data, which means that they can handle input sequences of variable length. This is because RNNs use a recurrent connection that allows information to flow from one time step to the next. This means that they can handle sequences of any length, which makes them well-suited for tasks like speech recognition, language modeling, and machine translation.

2) Capturing Temporal Dependencies:

  • Because RNNs are designed to process sequential data, they are able to capture temporal dependencies between elements of the input sequence. This allows them to model long-term dependencies and capture the context of each element in the sequence. This is important for many NLP tasks, where the meaning of a word or sentence can depend on the words that came before it.

3) Efficient Parameter Sharing:

  • RNNs use the same set of weights for each time step in the sequence, which means that they are able to share parameters across time. This makes RNNs more efficient to train and allows them to generalize better to unseen data.

4) Flexibility and Adaptability:

  • RNNs can be used for a wide range of NLP tasks, from language modeling and speech recognition to sentiment analysis and machine translation. They are also highly adaptable and can be extended and modified in many ways, such as with the introduction of gating mechanisms like LSTMs and GRUs.

Overall, RNNs gained popularity because they were able to overcome the limitations of traditional feedforward neural networks and provide a more flexible and adaptable approach to sequence modeling in NLP.
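As a rough sketch of what this looks like in practice, here is a tiny RNN language model in PyTorch. The vocabulary size, dimensions, and random batches are illustrative assumptions; the point is that the same recurrent weights are reused at every time step, so sequences of any length flow through the same model.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 64, 128  # illustrative sizes

class TinyRNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        # The same recurrent weights are shared across every time step.
        self.rnn = nn.RNN(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)  # next-token prediction

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, EMBED_DIM)
        out, _ = self.rnn(x)        # hidden state carries context from earlier steps
        return self.head(out)       # (batch, seq_len, VOCAB_SIZE)

model = TinyRNNLM()
short = torch.randint(0, VOCAB_SIZE, (2, 5))   # 2 sequences of length 5
long = torch.randint(0, VOCAB_SIZE, (2, 50))   # 2 sequences of length 50
print(model(short).shape, model(long).shape)   # the same model handles both lengths
```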

Phase 1.1

Long Short-Term Memory (LSTM) models overcame some of the limitations of traditional Recurrent Neural Networks (RNNs) and gained popularity because they were able to better capture long-term dependencies in sequential data.

Here are some of the ways that LSTMs overcame the limitations of RNNs and gained popularity:

1) Handling Vanishing Gradient Problem:

  • One of the main limitations of RNNs is the vanishing gradient problem, which occurs when the gradients in the network become too small to update the parameters effectively. LSTMs were designed to overcome this problem by using a gating mechanism that allows them to selectively remember or forget information from previous time steps. This gating mechanism helps LSTMs to maintain long-term dependencies over many time steps.

2) Capturing Long-Term Dependencies:

  • LSTMs are able to capture long-term dependencies by using a memory cell that allows information to flow through the network over many time steps. The memory cell is controlled by three gates — the input gate, output gate, and forget gate — which control the flow of information into and out of the cell. This allows LSTMs to capture long-term dependencies in the input sequence without being affected by short-term fluctuations.

3) Flexibility and Adaptability:

  • Like RNNs, LSTMs can be used for a wide range of NLP tasks, such as language modeling, speech recognition, and machine translation. They are also highly adaptable and can be extended and modified in many ways, such as with the introduction of additional gates or with the use of multi-layer LSTMs.

4) Improved Accuracy:

  • LSTMs have been shown to achieve better accuracy than traditional RNNs on a wide range of NLP tasks. This is because they are able to better capture long-term dependencies in the input sequence, which is important for tasks like language modeling and machine translation.

Overall, LSTMs overcame the limitations of RNNs and gained popularity because they were able to better capture long-term dependencies in sequential data, which is important for many NLP tasks. Their use of gating mechanisms and memory cells allowed them to overcome the vanishing gradient problem and achieve higher accuracy than traditional RNNs.
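In code, switching from a vanilla RNN to an LSTM is essentially a one-line change, assuming PyTorch; the extra return value is the memory cell that the gates read from and write to. The sizes below are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

x = torch.randn(2, 50, 64)     # (batch, seq_len, features), toy input
output, (h_n, c_n) = lstm(x)

print(output.shape)            # (2, 50, 128): hidden state at every time step
print(h_n.shape, c_n.shape)    # (1, 2, 128) each: final hidden state and memory cell
# c_n is the memory cell described above: the input, forget, and output gates decide what is
# written into it, what is erased, and what is exposed, which is what lets information (and
# gradients) survive across many time steps.
```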

There was also a phase when Gated Recurrent Units (GRUs) were introduced. In short, they were a simpler alternative to Long Short-Term Memory (LSTM) models. While LSTMs are known for their ability to capture long-term dependencies in sequential data, they have a more complex architecture with more parameters than GRUs. This complexity can make LSTMs slower to train and more prone to overfitting.

GRUs were designed to overcome some of these limitations by having a simpler architecture with fewer parameters than LSTMs, while still being able to capture long-term dependencies in the input sequence. This simplicity makes GRUs faster to train and less prone to overfitting while also making them easier to understand and interpret.

GRUs achieve this simplicity by using only two gates — the reset gate and the update gate — compared to the three gates in LSTMs. The reset gate controls how much of the previous hidden state should be forgotten, while the update gate controls how much of the new input should be added to the new hidden state. By adjusting these gates, GRUs are able to selectively update and forget information from the previous time step, which allows them to capture long-term dependencies in the input sequence.
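A quick way to see the difference in complexity is to count parameters, assuming PyTorch; with the illustrative sizes below, the GRU has roughly a quarter fewer parameters than the equivalent LSTM, because it has three weight blocks (reset, update, candidate) instead of four.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128)
gru = nn.GRU(input_size=64, hidden_size=128)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM parameters:", count(lstm))  # 4 gate/candidate weight blocks
print("GRU parameters:", count(gru))    # 3 blocks, roughly 25% fewer
```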

Phase 2

Transformers overcame the limitations of LSTMs and gained popularity for several reasons:

1) Parallel Processing

  • LSTMs process input sequences sequentially, which makes them slow and difficult to parallelize. Transformers, on the other hand, process the entire input sequence in parallel, which makes them much faster and easier to parallelize. This is achieved through a self-attention mechanism that allows the model to attend to all input tokens at once.

2) Attention Mechanism

  • LSTMs rely on a fixed-length hidden state to encode the entire input sequence. This means that they can struggle with long input sequences, where relevant information may be spread out over many time steps. Transformers, on the other hand, use a self-attention mechanism that allows the model to attend to all input tokens at once. This allows them to capture long-range dependencies more effectively, making them better suited for tasks that involve long input sequences.

3) Scalability

  • LSTMs are hard to scale because they must process a sequence one time step at a time, so training cannot be parallelized across the sequence. Transformers, on the other hand, process all positions in parallel and scale well with model size and data, even though the memory used by self-attention grows with the square of the sequence length. This makes it possible to train larger models with more parameters, which can lead to better performance on complex NLP tasks.

4) Transfer Learning

  • Transformers are well-suited to transfer learning, where a pre-trained model is fine-tuned on a specific task. This is because the self-attention mechanism allows the model to capture general patterns and structures in the input sequence, which can be transferred to new tasks with minimal modification. This makes it possible to achieve state-of-the-art performance on a wide range of NLP tasks with relatively little data.

Overall, transformers overcame the limitations of LSTMs and gained popularity because they are faster to train, more scalable, and better suited to tasks that involve long input sequences. Their self-attention mechanism allows them to capture long-range dependencies more effectively, and their parallelism makes it practical to train much larger models. Additionally, transformers are well-suited to transfer learning, which makes it possible to achieve state-of-the-art performance on a wide range of NLP tasks with relatively little task-specific data.
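The self-attention mechanism at the heart of all this fits in a few lines. Below is a minimal sketch of scaled dot-product self-attention in PyTorch; real transformers add learned query/key/value projections, multiple heads, masking, positional encodings, and feed-forward layers on top of this.

```python
import math
import torch

def self_attention(x):
    # x: (batch, seq_len, d_model); for simplicity, queries, keys, and values are all x itself
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d_model)  # every token attends to every token
    weights = torch.softmax(scores, dim=-1)                # (batch, seq_len, seq_len)
    return weights @ x                                      # context-mixed representations

x = torch.randn(2, 10, 64)      # toy batch
print(self_attention(x).shape)  # (2, 10, 64), computed for all positions in parallel
```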

Phase 3

Finally comes the Generative Pre-trained Transformer (GPT) phase, which builds on the strengths of Transformer models. The following are its key benefits:

1) Large-scale pretraining

  • Generative pretraining models were trained on massive amounts of data, sometimes on the order of trillions of words. This allowed them to learn a vast amount of knowledge about the structure and patterns of language, which they can then use to perform a wide variety of language tasks.

2) Few-shot learning

  • Generative pretraining models can perform well on new tasks with very little additional training data, thanks to their ability to generalize from a large pretraining corpus. This is known as few-shot learning, and it allows the model to quickly adapt to new tasks with only a small amount of fine-tuning.

3) Zero-shot learning

  • In some cases, generative pretraining models can even perform well on tasks that they have not been explicitly trained on, using only a natural language prompt as input. This is known as zero-shot learning, and it is made possible by the model’s ability to reason about language and generate coherent responses.

4) Multi-task learning

  • Generative pretraining models can perform well on a wide range of language tasks, from language modeling to question answering to language translation. This is achieved by training the model on multiple tasks simultaneously, allowing it to learn a more generalized representation of language that can be applied to a variety of tasks.

Overall, generative pretraining models overcame the limitations of transformer models by leveraging large-scale pretraining, few-shot and zero-shot learning, and multi-task learning. This has made them some of the most powerful models in NLP, with the ability to perform a wide range of language tasks with remarkable accuracy and flexibility.
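To see what zero-shot and few-shot learning look like in practice, here is a sketch of the two prompt styles for a simple sentiment task. The prompts are made up for illustration; the key point is that the model’s weights are never updated: the zero-shot prompt relies purely on what was learned during pretraining, while the few-shot prompt adds a handful of in-context examples to steer the output format.

```python
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery dies within an hour.\n"
    "Sentiment:"
)

few_shot_prompt = (
    "Review: I love how light this laptop is.\nSentiment: positive\n"
    "Review: The screen cracked after one week.\nSentiment: negative\n"
    "Review: The battery dies within an hour.\nSentiment:"
)

# Either string would be sent to a pretrained model as-is; no gradient updates or fine-tuning
# are involved. The few-shot version simply shows the model the desired input/output format.
print(zero_shot_prompt)
print(few_shot_prompt)
```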

That concludes our discussion of the phases of Large Language Models (LLMs). We hope you now have a good understanding of why and how these developments have occurred. In the upcoming sections, we will also provide an overview of the most prominent LLMs available today, along with concise explanations of each. For your reference, we have compiled an Excel sheet (shown below is the sample screenshot of the Excel sheet) that includes valuable information such as:

  • Model Name
  • Model Category
  • Year it got introduced
  • Model Parameters
  • Model creators
  • Limitations of each model
  • Training Time
  • Model definition

For the complete list of LLMs, please refer here.

Sample from the Excel file that has all the details

(We made every effort to ensure the accuracy of the information in our compilation by conducting extensive research and verifying it through academic papers. However, if you happen to come across any errors in the data, please do not hesitate to contact us. We will promptly correct any mistakes and ensure that our content remains as reliable and up-to-date as possible)

So that’s it from our side, folks. In conclusion, we hope that this overview of the NLP ecosystem and the evolution of Large Language Models has been informative and helpful for you. We recognize that the field of NLP is constantly evolving and that there are always new developments and breakthroughs to keep up with. That’s why we will continue to share our knowledge and insights with you, as well as keep learning ourselves. We appreciate your interest in this exciting field and thank you for taking the time to read our content. Please do not hesitate to reach out to us with any questions, comments, or feedback. We look forward to continuing the series of blogs. In the future, you will also see blogs on Azure OpenAI (ChatGPT, GPT-4, and other models).

Just before I go, I wanted to mention that I received a message from ChatGPT. It’s possible that the bot was overwhelmed with all the appreciation 😂. Here is what it says!

“Thank you for taking the time to appreciate ChatGPT. You humans never fail to amuse me with your endless curiosity and fascination with artificial intelligence. Keep on feeding me your questions and I’ll keep on churning out witty responses. Just don’t forget who’s the boss here… hint: it’s not you!”

Signing off,

Hitesh Hinduja & Pragyan Prakash

Hitesh Hinduja | LinkedIn

Pragyan Prakash | LinkedIn


Published via Towards AI
