Can ChatGPT beat DeepPavlov in Natural Language Understanding tasks?

Last Updated on July 17, 2023 by Editorial Team

Author(s): Vasily Konovalov

Originally published on Towards AI.

Can you use ChatGPT to build Natural Language Processing (NLP) models for your application? In this article, I compare the performance of DeepPavlov models with ChatGPT on selected GLUE tasks. The Natural Language Understanding (NLU) tasks included in GLUE are crucial for building any NLP application. The findings indicate that while ChatGPT shows impressive results, it cannot beat pretrained Transformer-based models integrated into DeepPavlov on the GLUE tasks.


Introduction

ChatGPT, released by OpenAI, has revolutionized the way artificial intelligence interacts with humans. Although there is limited information available on how the model was constructed, its exceptional quality has made it extremely popular. It is noteworthy that the base model of ChatGPT has only 3.5B parameters, yet in conversational tasks it outperforms the 175B-parameter GPT-3 model. This highlights the value of collecting human data for supervised fine-tuning. Several publications have evaluated ChatGPT on well-known natural language processing (NLP) tasks. In this article, I compare the performance of ChatGPT with that of the supervised transformer-based models from the DeepPavlov library. DeepPavlov is an open-source conversational library for Natural Language Processing (NLP) and Multiskill AI Assistant development. DeepPavlov is based on PyTorch and leverages the Transformers and Datasets libraries from Hugging Face to train various transformer-based models on hundreds of datasets.

Install DeepPavlov Library

DeepPavlov Library is an open-source NLP framework. It contains essential state-of-the-art NLP models that can be used on their own or as part of DeepPavlov Dream, an open-source Multiskill AI Assistant Platform. The library contains various text classification models for topic classification, insult detection, and intent recognition. Before comparing the models, let's first install the DeepPavlov Library by running the command:

pip install deeppavlov==1.0.1

The DeepPavlov Library supports Python 3.6–3.9.

How to use DeepPavlov Library

The DeepPavlov framework is implemented in Python and uses PyTorch as the underlying machine learning framework. DeepPavlov supports Hugging Face Transformers, which allows using any AutoModel transformer-based model from the Hugging Face Hub. DeepPavlov provides a way to interact with models via a command-line interface (CLI), an application programming interface (API), or Python pipelines. The supported CLI commands are listed below (an example follows the list):

  • To install a model’s specific requirements, run python -m deeppavlov install <config_name>, where <config_name> is the name of the configuration file.
  • To interact with the model through the CLI, run python -m deeppavlov interact <config_name> [-d] [-i], where -d downloads the files listed in the metadata of the configuration file (optional), and -i installs the model requirements (optional).
  • To train the model, run python -m deeppavlov train <config_name> [-d] [-i]. The dataset will be downloaded regardless of whether the -d flag is set. To train a model on your own data, modify the dataset_reader path in the model configuration file; the data format is specified in the corresponding model documentation page. To change the backbone transformer architecture, modify the corresponding variable in the variables section.
  • To start a REST API server with the selected model (useful in production deployment scenarios), run python -m deeppavlov riseapi <config_name> [-d] [-i].
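
For example, to install and try the RTE model used later in this article from the command line (glue_rte_roberta_mnli is the config name used in the RTE section below), the calls would look like this:

python -m deeppavlov install glue_rte_roberta_mnli
python -m deeppavlov interact glue_rte_roberta_mnli -d
python -m deeppavlov riseapi glue_rte_roberta_mnli -d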

Alternatively, you can interact with DeepPavlov’s models by using Python pipelines.

Comparing ChatGPT and DeepPavlov

There is no easy way to compare ChatGPT with DeepPavlov-based models. Keep in mind that ChatGPT's performance is conditioned on the prompt: even a slight modification of the prompt might change the result completely. Almost all prompts used here were adapted from [2] to be fully comparable with that paper's results. Moreover, testing ChatGPT on well-established NLP benchmarks raises several problems. Evaluating ChatGPT on thousands of test examples can be slow and expensive. Therefore, I decided to evaluate on the GLUE test set of a task if it contains fewer than 1,000 samples, and otherwise to follow the approach proposed in [2] and evaluate on a sample of the dev set. In the following comparisons, I include the ChatGPT performance, the BERT-based model performance (from [2]), and the performance of DeepPavlov-based models. It is worth mentioning that, according to [1], ChatGPT might remember many of the evaluation datasets from its pre-training phase. OpenAI is continuously improving ChatGPT by releasing new versions; since the initial release on December 15, 2022, there have been seven major releases. All results reported here are based on the ChatGPT version of March 23, 2023.
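
The article does not show the code used to submit prompts to ChatGPT. As a minimal sketch, assuming access through the openai Python package (pre-1.0 interface) and the gpt-3.5-turbo chat model (both assumptions, since the exact setup is not described), a prompt could be submitted like this:

import openai  # assumes the pre-1.0 openai package

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

def ask_chatgpt(prompt: str) -> str:
    # Send a single-turn prompt to ChatGPT and return the text of its reply
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: the exact snapshot used in the article is not named
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic answers make label extraction easier
    )
    return response["choices"][0]["message"]["content"]

The prompts quoted in the sections below can then be passed to this helper verbatim.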

Recognizing Textual Entailment (RTE)

The Recognizing Textual Entailment (RTE) task is a natural language processing task that involves determining whether a given piece of text (the hypothesis) can be inferred from another piece of text (the premise). The task requires understanding the meaning and context of the text and making a judgment about the logical relationship between the two pieces of text. RTE is often used in applications such as question answering, information retrieval, and text summarization.

For example, given the sentence “Cyprus, divided or not, joins the EU on the 1st of May.” determine if the following statement is entailed: “Cyprus was divided into two parts on May 1.” The correct answer is false (not entailment). To access the pre-trained model from the DeepPavlov library, one should use:

from deeppavlov import build_model
model = build_model('glue_rte_roberta_mnli', download=True, install=True)
model(["Cyprus, divided or not, joins the EU on the 1st of May."],
      ["Cyprus was divided into two parts on May 1."])

To test ChatGPT's performance, I used the following prompt: “Given the sentence ‘[text_1]’, determine if the following statement is entailed: ‘[text_2]’”. The RTE test set contains thousands of examples, so to compare ChatGPT with the DeepPavlov RTE pretrained model, I used the same 25 sampled instances per class from the dev set as in [2]. On this sampled set, DeepPavlov's RTE model achieves an accuracy of 90, whereas ChatGPT reaches only 88. Here are three examples where ChatGPT fails to correctly label the pair of sentences:

Prompt: “Given the sentence ‘Microsoft nearly halved at NYSE following the results as the company said operating income would drop as much as 42 percent in the second quarter.’ determine if the following statement is entailed: ‘Shares of Microsoft fell by almost 50 percent.’”
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed. The sentence states that Microsoft nearly halved, which means it decreased by almost 50 percent, but it does not explicitly state that the shares fell by almost 50 percent.

Prompt: “Given the sentence ‘Ssangyong Motor was taken over by creditors after it collapsed under heavy debts during the 1997–98 Asian financial crisis.’ determine if the following statement is entailed: ‘Asian financial crisis takes over Ssangyong Motor’”
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed. The sentence states that Ssangyong Motor collapsed under heavy debts during the Asian financial crisis and was taken over by creditors, but it does not imply that the crisis itself took over the company.

Prompt: “Given the sentence ‘Mopresol as the capital of France is the biggest city of Europe.’ determine if the following statement is entailed: ‘Mopresol is the biggest city of France.’”
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed.

Winograd NLI (WNLI)

ChatGPT has difficulty with coreference resolution, which deals with resolving ambiguous pronouns. The Winograd Schema Challenge assesses a model's capability to link a pronoun to the corresponding noun: it involves resolving an ambiguous pronoun in a sentence by using the context provided in the preceding sentence. The challenge is considered difficult for NLP models, as it requires a deep understanding of natural language and context. The WNLI task combines natural language inference with coreference resolution. To access the pre-trained model from the DeepPavlov library, one should use:

from deeppavlov import build_model
model = build_model('glue_wnli_roberta', download=True, install=True)
model(["Susan knew that Ann's son had been in a car accident,
because she told her about it."],["Ann told her about it."])

To solve this problem using ChatGPT, the query based on [2] was used as input: “Having the premise [’sentence1'], tell me if the following hypothesis [‘sentence2’] is entailed with the premise? Answer ‘entailment’ if yes, or ‘not_entailment’ if no”.

ChatGPT mostly manages to choose the correct answer. However, when it needs to determine which noun a pronoun refers to, or when the solution requires reasoning about physical laws and concepts, it often selects the wrong option despite the given context.

The unlabeled test set of WNLI contains 146 pairs of sentences. Therefore, it is feasible to label the entire test set with ChatGPT and make a fair comparison between ChatGPT and the DeepPavlov pretrained WNLI model. DeepPavlov's WNLI model achieves an accuracy of 81.5, whereas ChatGPT reaches only 79.5. Here are some examples where ChatGPT fails to label the pair of sentences correctly:

Prompt: “Having the premise ‘The cookstove was warming the kitchen, and the lamplight made it seem even warmer’, tell me if the following hypothesis ‘The lamplight made the cookstove seem even warmer’ is entailed with the premise? Answer ‘entailment’ if yes, or ‘not_entailment’ if no.”
Expected: not_entailment
DeepPavlov: not_entailment
ChatGPT: entailment

Prompt: ”Having the premise ‘The Harbor folks said Grandma kept her house so clean that you could wipe her floor with a clean handkerchief without getting any dirt on it’, tell me if the following hypothesis ‘You could wipe her floor with a clean handkerchief without getting any dirt on the house’ is entailed with the premise? Answer ‘entailment’ if yes, or ‘not_entailment’ if no.”
Expected: not_entailment
DeepPavlov: not_entailment
ChatGPT: entailment

Prompt: “Having the premise ‘Muriel has hanging in her living room a famous painting by Cecile d’Arblay of a ballerina putting on her shoe’, tell me if the following hypothesis ‘Muriel has hanging in her living room a famous painting by Cecile d’Arblay of a ballerina putting on the ballerina’s shoe’ is entailed with the premise? Answer ‘entailment’ if yes, or ‘not_entailment’ if no.”
Expected: entailment
DeepPavlov: entailment
ChatGPT: not_entailment

Word in Context (WiC)

One more task on which I tested ChatGPT is word sense disambiguation (WSD), using the WiC dataset. The Word-in-Context (WiC) dataset is a benchmark used to evaluate the ability of machine learning models to determine the correct sense of a word in a given context. It consists of pairs of sentences containing the same target word, and the task is to predict whether the word has the same sense in both sentences. The dataset is designed to evaluate a model's ability to capture contextual information and make accurate sense predictions.

As the ChatGPT input, I used the following query from [2]: “Are the given sentences expressing the same sense of the word [‘verb/noun’]: [‘sentence1’] and [‘sentence2’]?”
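
Reusing the hypothetical ask_chatgpt helper sketched earlier, filling this template could look as follows (the “Return your answer as a letter…” suffix matches the prompts quoted in the examples below):

def wic_prompt(word: str, sentence1: str, sentence2: str) -> str:
    # Fill the WiC template from [2] and request a single-letter answer
    return (f"Are the given sentences expressing the same sense of the word '{word}': "
            f"'{sentence1}' and '{sentence2}' ? "
            "Return your answer as a letter: 'T' if the sense is the same, "
            "or 'F' if it is not the same.")

answer = ask_chatgpt(wic_prompt(
    "create",
    "You can create the color orange by mixing yellow and red.",
    "The company was created 25 years ago."))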

According to the results in [1], the SOTA accuracy exceeds the ChatGPT accuracy by 9.4 points: 74.00 vs. 64.58. My experiments confirm this: ChatGPT handles the task poorly when a word has the same meaning in two different sentences. In such cases, ChatGPT explains the meaning of the target word in each sentence with synonyms, and when those synonyms differ, it makes the wrong decision. Here are some examples of mistakes:

Prompt: “Are the given sentences expressing the same sense of the word ‘create’: ‘You can create the color orange by mixing yellow and red.’ and ‘The company was created 25 years ago.’ ? Return your answer as a letter: ‘T’ if the sense is the same, or ‘F’ if it is not the same.”
ChatGPT: F
Expected: True
DeepPavlov: True

Prompt: “Are the given sentences expressing the same sense of the word ‘head’: ‘His horse won by a head.’ and ‘He is two heads taller than his little sister.’ ? Return your answer as a letter: ‘T’ if the sense is the same, or ‘F’ if it is not the same.”
ChatGPT: F
Expected: True
DeepPavlov: True

Prompt: “Are the given sentences expressing the same sense of the word ‘voice’: ‘The voice of the law.’ and ‘Conservatism has many voices.’ ? Return your answer as a letter: ‘T’ if the sense is the same, or ‘F’ if it is not the same.”
ChatGPT: F
Expected: True
DeepPavlov: True

Semantic Textual Similarity Benchmark (STS-B)

The Semantic Textual Similarity (STS) Benchmark task is a natural language processing task that measures the degree of semantic similarity between two sentences. The task involves comparing pairs of sentences and assigning a similarity score between 0 and 5, where 0 indicates no similarity and 5 indicates complete similarity. The STS Benchmark task is used to evaluate the performance of various machine learning models and algorithms in measuring semantic similarity between sentences. STS-B can be particularly challenging for LLMs because of the difference between the language modeling objective and the regression task objective. LLMs are designed to predict the next word in a sequence or generate coherent text, with their pre-training focused on capturing linguistic patterns and relationships. Consequently, their internal representations may not be well-suited for modeling continuous numerical outputs. To access the pre-trained model from the DeepPavlov library, one should use:

from deeppavlov import build_model
model = build_model('glue_stsb_cased_bert_torch', download=True, install=True)
model(["A man is playing a guitar."], ["A man is playing a flute."])

The following prompt from [2] was used to obtain a score from ChatGPT: “Determine the similarity between the following two sentences: [‘sentence1’] and [‘sentence2’]. The score should be ranging from 0.0 to 5.0 and can be a decimal.” Like the authors of [2], I confirm that ChatGPT estimates semantic textual similarity much worse than the pretrained Transformer-based model. In my experiments, this is especially true for pairs whose similarity is not at the extremes of 0 or 5. The problem may be that ChatGPT relies on unsuitable criteria for comparing texts.
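
Since ChatGPT may wrap the score in free-form text rather than return a bare number, the numeric value has to be extracted from its reply. A minimal sketch, assuming the hypothetical ask_chatgpt helper above and simply taking the first decimal number found in the answer:

import re

def sts_score(sentence1: str, sentence2: str) -> float:
    # Build the STS-B prompt from [2] and parse the first number in the reply
    prompt = (
        "Determine the similarity between the following two sentences: "
        f"'{sentence1}' and '{sentence2}'. "
        "The score should be ranging from 0.0 to 5.0 and can be a decimal."
    )
    reply = ask_chatgpt(prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else float("nan")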

Due to the large number of test examples, I evaluated on a sample from the dev set, as in [2]. The pretrained STS-B models in DeepPavlov outperform both ChatGPT and the models from [2] that are based on the same Transformer encoder.

Here are some examples where ChatGPT fails to label the pair of sentences correctly:

Prompt: “Determine the similarity between the following two sentences: ‘A girl is riding a horse.’ and ‘The girl trotted the horse.’ The score should be ranging from 0.0 to 5.0 and can be a decimal.”
Expected: 4.5
ChatGPT: 2.135
DeepPavlov: 3.09

Prompt: “Determine the similarity between the following two sentences: ‘The cook is kneading the flour.’ and ‘A woman is kneading tortilla dough.’ The score should be ranging from 0.0 to 5.0 and can be a decimal.”
Expected: 2.6
ChatGPT: 0.548
DeepPavlov: 2.12

Prompt: “Determine the similarity between the following two sentences: ‘One man is breaking cement on another man’s chest.’ and ‘A man breaks cinder blocks on another man.’ The score should be ranging from 0.0 to 5.0 and can be a decimal.”
Expected: 3.75
ChatGPT: 1.9
DeepPavlov: 3.82

The Corpus of Linguistic Acceptability (CoLA)

ChatGPT can serve as a valuable writing aid. Its effectiveness can be evaluated with the CoLA dataset, which assesses a model's capacity to judge the grammatical acceptability of sentences. The CoLA task involves categorizing a sentence as either grammatically acceptable or unacceptable. The dataset comprises English sentences from diverse sources, varying in complexity from basic to intricate, annotated by human evaluators for grammatical acceptability. You can use the DeepPavlov CoLA-pretrained model as follows:

from deeppavlov import build_model
model = build_model('glue_cola_cased_bert_torch', install=True, download=True)
model(["The more people you give beer to, the more people get sick."])

As the input to ChatGPT, I used the following prompt from [2]: “Is the following sentence [‘sentence’] grammatically correct? Answer ‘T’ if yes, or ‘F’ if no.” To compare ChatGPT with the DeepPavlov CoLA pretrained model, I used the same sample from the dev set as in [2]. The comparison shows that the DeepPavlov CoLA model outperforms ChatGPT.

Prompt: “Is the following sentence ‘The proof this set is recursive is difficult.’ grammatically correct? Answer ‘T’ if yes, or ‘F’ if no.”
ChatGPT: T
Expected: F
DeepPavlov: F

Prompt: “Is the following sentence ‘Himself is understood by Rutherford.’ grammatically correct? Answer ‘T’ if yes, or ‘F’ if no.”
ChatGPT: T
Expected: F
DeepPavlov: F

Conclusion

In this article, I compared ChatGPT with DeepPavlov models on selected GLUE tasks and showed that on these tasks, DeepPavlov-based models consistently outperform ChatGPT. For each task, I provided examples where ChatGPT performed worse or even failed completely. I hope you find this helpful in building your own NLP models.

References:

[1] Kocoń, Jan, et al. “ChatGPT: Jack of All Trades, Master of None.” Information Fusion (2023): 101861.

[2] Zhong, Qihuang, et al. “Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT.” arXiv preprint arXiv:2302.10198 (2023).

[3] Burtsev, Mikhail, et al. “DeepPavlov: An Open Source Library for Conversational AI.” (2018).
