Zero-Shot Text Classification Experience With Scikit-LLM

Last Updated on November 5, 2023 by Editorial Team

Author(s): Claudio Giorgio Giancaterino

Originally published on Towards AI.

Text classification is one of the most common applications of natural language processing (NLP). It’s the task of assigning one or more predefined classes to a piece of text.
Text classification is helpful in many applications, such as sentiment analysis, spam detection, topic labeling, and more.

The standard approach to text classification is to train a model in a supervised manner. With this methodology, however, results depend on the availability of hand-labelled training data. In real-world applications, labeled data is often scarce, and zero-shot text classification is an approach that is becoming more popular for such cases.

What is zero-shot text classification?

Before introducing zero-shot text classification, it is necessary to talk about zero-shot learning, which aims to build models with little or no labeled data for the target task. It can be thought of as an instance of transfer learning, a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. It exploits the idea of learning from prior experience, and it is useful when little labeled data is available.

Text classification is an NLP task where the model predicts the classes of pieces of text. The traditional approach requires a large amount of labeled data to train the model and breaks down when there is not enough of it. Applying zero-shot learning to text classification gives zero-shot text classification: classifying text documents without having seen any labeled examples of the target classes during training. One way to do this is natural language inference (NLI), as proposed by Yin et al. (2019). Implementations of zero-shot classification are available in the Hugging Face Transformers library, with pre-trained models hosted on the Hugging Face Hub.
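
To make the NLI idea more concrete, here is a minimal sketch of the framing: each candidate label is turned into a hypothesis sentence, the text is treated as the premise, and the label whose hypothesis scores highest is chosen. The function names, the hypothesis template, and the word-overlap scorer are all assumptions made for illustration. In a real setup the scorer would be a pre-trained NLI model, not the toy function used here.

```python
from typing import Callable, Sequence


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model (an assumption for this sketch).

    Returns the fraction of hypothesis words that also occur in the premise.
    A real system would return the entailment probability of an NLI model.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().rstrip(".").split()
    hits = sum(1 for word in hypothesis_words if word in premise_words)
    return hits / len(hypothesis_words)


def zero_shot_classify(
    text: str,
    labels: Sequence[str],
    score: Callable[[str, str], float] = toy_entailment_score,
    template: str = "This text is about {}.",
) -> str:
    """Pick the label whose hypothesis is best supported by the text."""
    # One hypothesis per candidate label, e.g. "This text is about sport."
    hypotheses = {label: template.format(label) for label in labels}
    # Score every (premise, hypothesis) pair and keep the best label.
    scores = {label: score(text, hyp) for label, hyp in hypotheses.items()}
    return max(scores, key=scores.get)


example_text = "The central bank raised rates and the business sector reacted"
predicted = zero_shot_classify(example_text, ["sport", "business", "health"])
print(predicted)
```

No labeled examples are involved at any point: the candidate labels are supplied only at prediction time, which is what makes the approach zero-shot.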

What is Scikit-LLM?

Scikit-learn is one of the most well-known and widely used open-source Python libraries in machine learning, popular with data scientists for its wide range of models and ease of use. It covers many tasks, from regression to classification and from clustering to dimensionality reduction, within a single library. Scikit-LLM is a Python library that integrates large language models into the scikit-learn framework.
It’s a tool for performing natural language processing (NLP) tasks entirely within a scikit-learn pipeline.
Scikit-LLM is growing: it started by integrating OpenAI models (such as ChatGPT) and now also supports PaLM 2.
For the OpenAI models, it acts as a wrapper around the OpenAI API.

With reference to the OpenAI interface, Scikit-LLM provides the following features:

-Zero-Shot Text Classification

-Few-Shot Text Classification

-Dynamic Few-Shot Text Classification

-Multi-Label Zero-Shot Text Classification

-Text Vectorization

-Text Translation

-Text Summarization
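
To illustrate what wrapping a language model in a scikit-learn style interface can look like, the sketch below defines a tiny classifier with the familiar fit and predict methods. It is not Scikit-LLM's actual code: the class name, the prompt wording, and the keyword-based fake_llm function (which stands in for an API call) are hypothetical and only show the pattern.

```python
from typing import Callable, List, Optional, Sequence


class ToyZeroShotClassifier:
    """Hypothetical scikit-learn style wrapper around a text-completion function."""

    def __init__(self, llm: Callable[[str], str]):
        # `llm` is any function mapping a prompt to a text answer.
        self.llm = llm

    def fit(self, X, y: Sequence[str]) -> "ToyZeroShotClassifier":
        # Nothing is trained: fit only records the candidate labels.
        # Labels are assumed to be lowercase strings in this sketch.
        self.classes_ = sorted(set(y))
        return self

    def _prompt(self, text: str) -> str:
        options = ", ".join(self.classes_)
        return f"Classify the text into one of: {options}.\nText: {text}\nLabel:"

    def predict(self, X: Sequence[str]) -> List[Optional[str]]:
        predictions = []
        for text in X:
            answer = self.llm(self._prompt(text)).strip().lower()
            # Answers outside the candidate labels are returned as None.
            predictions.append(answer if answer in self.classes_ else None)
        return predictions


def fake_llm(prompt: str) -> str:
    """Keyword-based stand-in for a model call (assumption, no network access)."""
    text = prompt.split("Text:", 1)[1].lower()
    if "profit" in text or "rose" in text:
        return "positive"
    if "loss" in text or "fell" in text:
        return "negative"
    return "neutral"


clf = ToyZeroShotClassifier(fake_llm).fit(None, ["positive", "neutral", "negative"])
preds = clf.predict([
    "Quarterly profit rose 12%.",
    "The company reported a loss.",
    "The meeting is on Monday.",
])
print(preds)
```

Because the object follows the fit and predict convention, it can be swapped for a supervised classifier in existing evaluation code with few changes.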

Goal of the analysis

The goal of this work is to explore the zero-shot text classification performance of the following GPT models:

-GPT-3.5 turbo with a capacity of 4,097 tokens

-GPT-3.5 turbo-16k with a capacity of 16,385 tokens

-GPT-4 with a capacity of 8,192 tokens

The models are evaluated on two data sets.

The first one is about sentiment analysis on a financial data set with 3 polarities: positive, neutral, and negative.

The second one is about text classification on a data set of CNN articles with 6 labels: business, entertainment, health, news, politics, and sport.

In both cases, to limit the computational effort, the experiments use a sample equal to 10% of the whole data set, drawn with stratified sampling.
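
As an illustration of stratified sampling, the sketch below draws the same fraction of rows from every class, so the sample keeps the class proportions of the full data set. It uses only the standard library and invented toy data. The function name, the seed, and the per-class rounding are assumptions, and the notebook may well rely on a library routine instead.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_of, fraction=0.10, seed=42):
    """Draw `fraction` of the rows from each class, preserving class proportions."""
    rng = random.Random(seed)
    # Group the rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        # Keep at least one row per class, even for very small classes.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data set (invented): 500 neutral, 300 positive, 200 negative sentences.
toy_rows = (
    [(f"sentence {i}", "neutral") for i in range(500)]
    + [(f"sentence {i}", "positive") for i in range(300)]
    + [(f"sentence {i}", "negative") for i in range(200)]
)
toy_sample = stratified_sample(toy_rows, label_of=lambda row: row[1])
sample_counts = Counter(label for _, label in toy_sample)
print(sample_counts)
```

Because the rounding happens per class, the total sample size can differ by a row or two from exactly 10% of the data set.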

Since both data sets are labeled, the results can be evaluated, first with the confusion matrix and then with the F1 score adapted to the multi-class setting: the micro-averaged F1 score.
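
The two evaluation tools can be sketched in plain Python as follows. The labels and predictions are invented toy values, not results from the experiments, and the function names are chosen for this example. Micro-averaging pools true positives, false positives, and false negatives over all classes before computing precision and recall. When every text has exactly one true label and one predicted label, the micro-averaged F1 score equals the accuracy.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def micro_f1(y_true, y_pred, labels):
    """F1 computed from counts pooled over all classes."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this class, and it was right
            elif p == label:
                fp += 1  # predicted this class, but it was another one
            elif t == label:
                fn += 1  # missed a text of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = ["negative", "neutral", "positive"]
y_true = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]

matrix = confusion_matrix(y_true, y_pred, labels)
score = micro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(matrix)
print(round(score, 4), round(accuracy, 4))
```

Correct predictions sit on the main diagonal of the matrix, so a good classifier concentrates its counts there.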

The experiment can be followed in this notebook.


The first task uses a data set for financial sentiment analysis based on financial sentences with sentiment labels and 5842 rows.

There are 3 sentiment labels, with the “neutral” class predominant, and the experiment was carried out on 584 rows.

Looking at the GPT-4 confusion matrix, most of the predictions fall on the main diagonal, which indicates a fairly good classification.

All models reach an F1 score above 70%. GPT-4, as expected, is the best-performing model in the first experiment.

The second task uses a data set for multi-class text classification based on CNN news collected from 2013 to 2022 with 11 variables and 9307 rows.

The “part_of” column holds the news category and has been used as the target variable, while the “Description” column has been used as the input text for zero-shot text classification.

There are 6 classes, with the “news” and “sport” classes predominant, and the experiment was carried out on 931 rows.

Looking at the GPT-4 confusion matrix, the predictions are concentrated on the main diagonal even more than in the first task.

In terms of F1 score, the GPT-3.5 models perform slightly worse than in the first task but still score above 70%. GPT-4 clearly outperforms the other models, scoring above 80%.

Final thoughts

GPT-3.5 turbo 16k performs slightly worse than GPT-3.5 turbo, but it’s faster. GPT-4, on the other hand, outperforms the other models in both sentiment analysis and multi-class text classification, and is much better in the latter, but it’s slower and more expensive than the others.

Running the notebook on the whole data sets could give slightly different results, because I took a stratified sample equal to 10% of each data set. This is equivalent to splitting the data set into ten folds with stratified cross-validation and picking just one fold. In addition, at the time of writing, Scikit-LLM does not offer a way to tune the temperature parameter to obtain more deterministic results.

Still, I think it is useful to have an idea of what these models can achieve with a zero-shot text classification approach. Zero-shot text classification could be a solution when training data is scarce or doesn’t exist. The universal applicability of these models makes them very appealing, although fine-tuned large pre-trained models certainly still perform better.
Zero-shot learning could grow in relevance over the next few years, because large language models are becoming a game changer in the way we use models to solve tasks.

The last thing to say about Scikit-LLM is that it’s a powerful tool for NLP, combining the versatility of the scikit-learn library with the potential of large language models. It is not comparable with LangChain, but it is growing and already helpful.



Financial data

CNN news articles

-Zero-shot text classification by statworx

-Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach

-OpenAI models



Published via Towards AI
