Can ChatGPT beat DeepPavlov in Natural Language Understanding tasks?
Author(s): Vasily Konovalov
Originally published on Towards AI.
Can you use ChatGPT to build a Natural Language Processing (NLP) model for your application? In this article, I compare the performance of DeepPavlov models with ChatGPT on selected GLUE tasks. The Natural Language Understanding (NLU) tasks included in GLUE are crucial for building any NLP application. The findings indicate that while ChatGPT shows astonishing results, it cannot beat the pretrained Transformer-based models integrated into DeepPavlov on the GLUE tasks.
Introduction
ChatGPT, released by OpenAI, has revolutionized the way artificial intelligence interacts with humans. Although there is limited information available on how the model was constructed, its exceptional quality has made it extremely popular. It is noteworthy that in OpenAI's related InstructGPT experiments, outputs of a model with only 1.3B parameters, fine-tuned on human feedback, were preferred in conversation tasks over those of the 175B-parameter GPT-3. This shows the high relevance of collecting data from humans for supervised model fine-tuning. There are several publications on ChatGPT evaluation that test its effectiveness on well-known natural language processing (NLP) tasks. In this article, I am going to compare the performance of ChatGPT with that of the supervised Transformer-based models from the DeepPavlov library. DeepPavlov is an open-source conversational library for Natural Language Processing (NLP) and Multiskill AI Assistant development. DeepPavlov is based on PyTorch and leverages the Transformers and Datasets packages from HuggingFace to train various transformer-based models on hundreds of datasets.
Install DeepPavlov Library
DeepPavlov Library is an open-source NLP framework. It contains all essential state-of-the-art NLP models that can be used alone or as a part of DeepPavlov Dream, an open-source Multiskill AI Assistant platform. The library contains various text classification models for topic classification, insult detection, and intent recognition. Before comparing the models, let's first install DeepPavlov Library by running the command:
pip install deeppavlov==1.0.1
The DeepPavlov Library supports Python 3.6–3.9.
How to use DeepPavlov Library
The DeepPavlov framework is implemented in Python and uses PyTorch as the base machine learning framework. DeepPavlov supports HuggingFace Transformers, which allows using all AutoModel transformer-based models from the HuggingFace Hub. DeepPavlov provides a way to interact with models via a command-line interface (CLI), an application programming interface (API), or via Python pipelines. The list of CLI-supported commands is as follows:
- to install a model's specific requirements, run python -m deeppavlov install <config_name>, where <config_name> is the name of the configuration file.
- to interact with the model through the CLI, run python -m deeppavlov interact <config_name> [-d] [-i], where -d downloads files from the metadata of the configuration file (optional), and -i installs the model requirements (optional).
- to train the model, run python -m deeppavlov train <config_name> [-d] [-i]. The dataset will be downloaded regardless of whether the -d flag is present. To train a model on your own data, modify the dataset_reader path in the model configuration file. The data format is specified in the corresponding model documentation page. To change the backbone transformer architecture, modify the corresponding variable in the variables section.
- riseapi starts a REST API server with the selected model, which might be useful in production deployment scenarios; run python -m deeppavlov riseapi <config_name> [-d] [-i].
Alternatively, you can interact with DeepPavlov's models by using Python pipelines.
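For instance, here is a minimal sketch of the Python pipeline interface; the insults_kaggle_bert classification config is used purely as an illustration, and any other config name from the library works the same way:
from deeppavlov import build_model

# build the pipeline: download the pretrained files (download=True)
# and install the model's requirements (install=True)
model = build_model('insults_kaggle_bert', download=True, install=True)

# the pipeline returns one predicted label per input text
print(model(['you are a fool', 'have a great day']))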
Comparing ChatGPT and DeepPavlov
There is no easy way to compare ChatGPT with DeepPavlov-based models. Keep in mind that ChatGPT's performance is conditioned on the prompt: even the slightest modification of the prompt might change the result completely. Almost all prompts used here were adopted from [2] to be fully comparable with the paper's results. Moreover, testing ChatGPT on well-established NLP benchmarks poses practical problems: evaluating it on thousands of test examples is slow and expensive. Therefore, I decided to either evaluate ChatGPT on the GLUE test set of the corresponding task, if the test set contains fewer than 1000 samples, or follow the approach proposed in [2] by evaluating a sample of the dev set. In the following comparisons, I included the performance of ChatGPT, of the BERT-based models (from [2]), and of the DeepPavlov-based models. It is worth mentioning that, according to [1], ChatGPT might remember many of the evaluation datasets from the pre-training phase. OpenAI is continuously improving ChatGPT by releasing new versions; since the initial release on Nov 30, 2022, there have been seven major updates. All results mentioned here are based on the ChatGPT version of March 23, 2023.
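The ChatGPT answers below were collected through the chat interface. If you want to reproduce such an evaluation programmatically, a minimal sketch with the openai Python package (pre-1.0 API; gpt-3.5-turbo assumed as the closest available model) could look like this:
import openai

openai.api_key = 'YOUR_API_KEY'  # assumption: your own OpenAI API key

def ask_chatgpt(prompt):
    # send a single-turn prompt and return the model's text reply
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0,  # keep the answers as deterministic as possible
    )
    return response['choices'][0]['message']['content']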
Recognizing Textual Entailment (RTE)
The Recognizing Textual Entailment (RTE) task is a natural language processing task that involves determining whether a given piece of text (the hypothesis) can be inferred from another piece of text (the premise). The task requires understanding the meaning and context of the text and making a judgment about the logical relationship between the two pieces of text. RTE is often used in applications such as question answering, information retrieval, and text summarization.
For example, given the sentence "Cyprus, divided or not, joins the EU on the 1st of May." determine if the following statement is entailed: "Cyprus was divided into two parts on May 1." The correct answer is false (not entailment). To access the pre-trained model from the DeepPavlov library, one should use:
from deeppavlov import build_model

model = build_model('glue_rte_roberta_mnli', download=True, install=True)
model(["Cyprus, divided or not, joins the EU on the 1st of May."],
      ["Cyprus was divided into two parts on May 1."])
To test ChatGPT performance, I used the following prompt: "Given the sentence '[text_1]', determine if the following statement is entailed: '[text_2]'". The RTE test set contains thousands of examples, so to compare ChatGPT with the DeepPavlov RTE pretrained model, I used the same 25 sampled instances for each class from the dev set as in [2]. On this sampled set, DeepPavlov's RTE model achieves an accuracy of 90, whereas ChatGPT reaches only 88. A sketch of this sampled evaluation is shown below.
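This is a rough sketch, not the exact script: the fixed random seed and the label mapping are my assumptions, and in the GLUE RTE dataset label 0 means entailment and 1 means not_entailment.
import random
from datasets import load_dataset
from deeppavlov import build_model

model = build_model('glue_rte_roberta_mnli', download=True, install=True)

dev = load_dataset('glue', 'rte', split='validation')
random.seed(42)  # assumption: [2] does not publish the exact sample

# sample 25 instances per class (0 = entailment, 1 = not_entailment)
sampled = []
for label in (0, 1):
    rows = [row for row in dev if row['label'] == label]
    sampled += random.sample(rows, 25)

# assumption: the model returns string labels such as 'entailment'
id2label = {0: 'entailment', 1: 'not_entailment'}
predictions = model([row['sentence1'] for row in sampled],
                    [row['sentence2'] for row in sampled])
accuracy = sum(pred == id2label[row['label']]
               for pred, row in zip(predictions, sampled)) / len(sampled)
print(accuracy)
Here are three examples where ChatGPT fails to correctly label the pair of sentences: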
Prompt: "Given the sentence 'Microsoft nearly halved at NYSE following the results as the company said operating income would drop as much as 42 percent in the second quarter.' determine if the following statement is entailed: 'Shares of Microsoft fell by almost 50 percent.'"
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed. The sentence states that Microsoft nearly halved, which means it decreased by almost 50 percent, but it does not explicitly state that the shares fell by almost 50 percent.

Prompt: "Given the sentence 'Ssangyong Motor was taken over by creditors after it collapsed under heavy debts during the 1997-98 Asian financial crisis.' determine if the following statement is entailed: 'Asian financial crisis takes over Ssangyong Motor'"
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed. The sentence states that Ssangyong Motor collapsed under heavy debts during the Asian financial crisis and was taken over by creditors, but it does not imply that the crisis itself took over the company.

Prompt: "Given the sentence 'Mopresol as the capital of France is the biggest city of Europe.' determine if the following statement is entailed: 'Mopresol is the biggest city of France.'"
Expected: entailment
DeepPavlov: entailment
ChatGPT: No, the statement is not entailed.
Winograd NLI (WNLI)
ChatGPT has difficulty with coreference resolution, which deals with resolving ambiguous pronouns. The Winograd Schema Challenge assesses a machine learning model's capability to resolve pronouns to the corresponding noun. It involves resolving an ambiguous pronoun in a sentence by using the context provided in the preceding sentence. The challenge is considered difficult for NLP models, as it requires a deep understanding of natural language and context. The WNLI task mixes natural language inference with coreference resolution. To access the pre-trained model from the DeepPavlov library, one should use:
from deeppavlov import build_model

model = build_model('glue_wnli_roberta', download=True, install=True)
model(["Susan knew that Ann's son had been in a car accident, because she told her about it."],
      ["Ann told her about it."])
To solve this problem using ChatGPT, the query based on [2] was used as input: "Having the premise ['sentence1'], tell me if the following hypothesis ['sentence2'] is entailed with the premise? Answer 'entailment' if yes, or 'not_entailment' if no".
Mostly, ChatGPT manages to choose the correct answer. However, when it needs to determine which noun the pronoun substitutes for, or when the solution requires applying physical laws and concepts, ChatGPT often picks the wrong option despite the given context.
The unlabeled test set of WNLI contains 146 pairs of sequences. Therefore, it's feasible to label the entire test set with ChatGPT and make a fair comparison between ChatGPT and the DeepPavlov pretrained WNLI model. DeepPavlov's WNLI model achieves an accuracy of 81.5, whereas ChatGPT reaches only 79.5. Here are some examples where ChatGPT fails to label the pair of sentences correctly:
Prompt: "Having the premise 'The cookstove was warming the kitchen, and the lamplight made it seem even warmer', tell me if the following hypothesis 'The lamplight made the cookstove seem even warmer' is entailed with the premise? Answer 'entailment' if yes, or 'not_entailment' if no."
Expected: not_entailment
DeepPavlov: not_entailment
ChatGPT: entailment

Prompt: "Having the premise 'The Harbor folks said Grandma kept her house so clean that you could wipe her floor with a clean handkerchief without getting any dirt on it', tell me if the following hypothesis 'You could wipe her floor with a clean handkerchief without getting any dirt on the house' is entailed with the premise? Answer 'entailment' if yes, or 'not_entailment' if no."
Expected: not_entailment
DeepPavlov: not_entailment
ChatGPT: entailment

Prompt: "Having the premise 'Muriel has hanging in her living room a famous painting by Cecile d'Arblay of a ballerina putting on her shoe', tell me if the following hypothesis 'Muriel has hanging in her living room a famous painting by Cecile d'Arblay of a ballerina putting on the ballerina's shoe' is entailed with the premise? Answer 'entailment' if yes, or 'not_entailment' if no."
Expected: entailment
DeepPavlov: entailment
ChatGPT: not_entailment
Word in Context (WiC)
One more task where I tested ChatGPT is word sense disambiguation (WSD), using the WiC dataset. The Word-in-Context (WiC) dataset is a benchmark used to evaluate the ability of machine learning models to determine the correct sense of a word in a given context. It consists of pairs of sentences containing the same target word, and the task is to predict whether the word has the same sense in both sentences or not. The dataset is designed to evaluate models' ability to capture contextual information and make accurate sense predictions.
As the ChatGPT input, I used the following query from [2]: "Are the given sentences expressing the same sense of the word ['verb/noun']: ['sentence1'] and ['sentence2']?"
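The exact query string, including the T/F answer format used in the examples below, can be templated with a small helper and passed to the ask_chatgpt sketch from above:
def make_wic_prompt(word, sentence1, sentence2):
    # template adopted from [2], with the T/F answer format appended
    return (f'Are the given sentences expressing the same sense of the word '
            f'"{word}": "{sentence1}" and "{sentence2}" ? Return your answer '
            f'as a letter: "T" if the sense is the same, or "F" if it is not the same.')

print(make_wic_prompt('create',
                      'You can create the color orange by mixing yellow and red.',
                      'The company was created 25 years ago.'))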
According to the results in [1], the SOTA accuracy exceeds the ChatGPT accuracy by 9.42 points: 74.00 vs. 64.58. My experiments confirm this: ChatGPT doesn't solve this task well when a word has the same meaning in two different sentences. In such cases, ChatGPT uses synonyms to explain the meaning of the studied word in each sentence, and sometimes these synonyms differ, which leads to the wrong decision. Here are some examples of mistakes:
Prompt: "Are the given sentences expressing the same sense of the word 'create': 'You can create the color orange by mixing yellow and red.' and 'The company was created 25 years ago.' ? Return your answer as a letter: 'T' if the sense is the same, or 'F' if it is not the same."
ChatGPT: F
Expected: True
DeepPavlov: True

Prompt: "Are the given sentences expressing the same sense of the word 'head': 'His horse won by a head.' and 'He is two heads taller than his little sister.' ? Return your answer as a letter: 'T' if the sense is the same, or 'F' if it is not the same."
ChatGPT: F
Expected: True
DeepPavlov: True

Prompt: "Are the given sentences expressing the same sense of the word 'voice': 'The voice of the law.' and 'Conservatism has many voices.' ? Return your answer as a letter: 'T' if the sense is the same, or 'F' if it is not the same."
ChatGPT: F
Expected: True
DeepPavlov: True
Semantic Textual Similarity Benchmark (STS-B)
The Semantic Textual Similarity (STS) Benchmark task is a natural language processing task that measures the degree of semantic similarity between two sentences. The task involves comparing pairs of sentences and assigning a similarity score between 0 and 5, where 0 indicates no similarity and 5 indicates complete similarity. The STS Benchmark task is used to evaluate the performance of various machine learning models and algorithms in measuring semantic similarity between sentences. STS-B can be particularly challenging for LLMs because of the difference between the language modeling objective and the regression task objective. LLMs are designed to predict the next word in a sequence or generate coherent text, with their pre-training focused on capturing linguistic patterns and relationships. Consequently, their internal representations may not be well-suited for modeling continuous numerical outputs. To access the pre-trained model from the DeepPavlov library, one should use:
from deeppavlov import build_model

model = build_model('glue_stsb_cased_bert_torch', download=True, install=True)
model(["A man is playing a guitar."], ["A man is playing a flute."])
The following prompt was used to determine the score with ChatGPT [2]: "Determine the similarity between the following two sentences: ['sentence1'] and ['sentence2']. The score should be ranging from 0.0 to 5.0 and can be a decimal." Like the authors of [2], I confirm that ChatGPT determines semantic textual similarity much worse than the pretrained Transformer-based model. From what I have seen, this applies mostly to sentence pairs whose similarity is neither 0 nor 5. The problem may be that ChatGPT relies on unsuitable criteria for comparing texts.
Due to the large number of test examples, I performed the evaluation on a sample from the dev set, as in [2]. The pretrained STS-B models in DeepPavlov outperform both ChatGPT and the models from the article that are based on the same Transformer encoder. A sketch of the correlation-based evaluation is shown below.
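STS-B is a regression task whose official metric is the Pearson/Spearman correlation between predicted and gold scores. A minimal sketch of this computation on the dev set, assuming the model returns one float score per sentence pair:
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from deeppavlov import build_model

model = build_model('glue_stsb_cased_bert_torch', download=True, install=True)

dev = load_dataset('glue', 'stsb', split='validation')
# assumption: the model outputs a single similarity score per pair
predictions = [float(score) for score in model(dev['sentence1'], dev['sentence2'])]

print('Pearson:', pearsonr(dev['label'], predictions)[0])
print('Spearman:', spearmanr(dev['label'], predictions)[0])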
Here are some examples where ChatGPT fails to score the pair of sentences correctly:
Prompt: "Determine the similarity between the following two sentences: 'A girl is riding a horse.' and 'The girl trotted the horse.' The score should be ranging from 0.0 to 5.0 and can be a decimal."
Expected: 4.5
ChatGPT: 2.135
DeepPavlov: 3.09

Prompt: "Determine the similarity between the following two sentences: 'The cook is kneading the flour.' and 'A woman is kneading tortilla dough.' The score should be ranging from 0.0 to 5.0 and can be a decimal."
Expected: 2.6
ChatGPT: 0.548
DeepPavlov: 2.12

Prompt: "Determine the similarity between the following two sentences: 'One man is breaking cement on another man's chest.' and 'A man breaks cinder blocks on another man.' The score should be ranging from 0.0 to 5.0 and can be a decimal."
Expected: 3.75
ChatGPT: 1.9
DeepPavlov: 3.82
The Corpus of Linguistic Acceptability (CoLA)
ChatGPT can serve as a valuable writing aid. Its effectiveness can be evaluated with the CoLA dataset, which assesses a model's capacity to determine the grammatical correctness of sentences. The CoLA task involves categorizing a sentence as either grammatically acceptable or unacceptable. The dataset comprises English sentences from diverse origins, varying in complexity from basic to intricate, annotated by human evaluators for grammatical acceptability. You can use the DeepPavlov CoLA-pretrained model as follows:
from deeppavlov import build_model

model = build_model('glue_cola_cased_bert_torch', install=True, download=True)
model(["The more people you give beer to, the more people get sick."])
As the input sequence to ChatGPT, I used the following prompt from [2]: "Is the following sentence ['sentence'] grammatically correct? Answer 'T' if yes, or 'F' if no." To compare ChatGPT with the DeepPavlov CoLA pretrained model, I used the sampled set from the dev set, as in [2]. The comparison shows that the DeepPavlov CoLA model outperforms ChatGPT; an evaluation sketch and failure examples follow.
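The official CoLA metric is the Matthews correlation coefficient (MCC) rather than plain accuracy. A minimal sketch of the dev-set evaluation; the mapping of the model's output labels to the GLUE encoding is my assumption, so adjust it to the actual model output:
from datasets import load_dataset
from sklearn.metrics import matthews_corrcoef
from deeppavlov import build_model

model = build_model('glue_cola_cased_bert_torch', download=True, install=True)

dev = load_dataset('glue', 'cola', split='validation')
predictions = model(dev['sentence'])

# assumption: the model returns string labels; GLUE encodes
# 1 = grammatically acceptable, 0 = unacceptable
label2id = {'acceptable': 1, 'unacceptable': 0}
pred_ids = [label2id.get(pred, pred) for pred in predictions]

print('MCC:', matthews_corrcoef(dev['label'], pred_ids))
Here are some examples where ChatGPT fails to classify the sentence correctly: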
Prompt: "Is the following sentence 'The proof this set is recursive is difficult.' grammatically correct? Answer 'T' if yes, or 'F' if no."
ChatGPT: T
Expected: F
DeepPavlov: F

Prompt: "Is the following sentence 'Himself is understood by Rutherford.' grammatically correct? Answer 'T' if yes, or 'F' if no."
ChatGPT: T
Expected: F
DeepPavlov: F
Conclusion
In this article, I compared ChatGPT with DeepPavlov models on selected GLUE tasks and showed that on these tasks, the supervised DeepPavlov-based models consistently outperform ChatGPT. For each task, I provided examples where ChatGPT performed worse or failed completely. I hope you find this helpful in building your own NLP models.
References:
[1] Kocoń, Jan, et al. "ChatGPT: Jack of all trades, master of none." Information Fusion (2023): 101861.
[2] Zhong, Qihuang, et al. "Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT." arXiv preprint arXiv:2302.10198 (2023).
[3] Burtsev, Mikhail, et al. "DeepPavlov: An open source library for conversational AI." (2018).