Making Models Smart: GPT-4 and Scikit-Learn
Last Updated on June 14, 2023 by Editorial Team
Author(s): Ulrik Thyge Pedersen
Originally published on Towards AI.
An Introduction to the seamless integration of ChatGPT-4 with Scikit-Learn
ChatGPT has allowed for convenient and efficient approaches to constructing text classification models. Scikit-learn is the conventional library in Python to create machine learning models. The combination of the two, with Scikit-LLM, allows for more powerful models without the need to interact manually with OpenAI’s API.
Some common natural language processing (NLP) tasks are classification and labeling. These tasks have traditionally required collecting labeled data, training a model, deploying endpoints, and setting up inference. This can be time-consuming and expensive, and it often requires multiple models to cover the various tasks.
Large language models (LLMs) like ChatGPT have given us a novel approach to these NLP tasks. Instead of training and deploying one model per task, we can employ a single model and use prompt engineering to handle a wide range of NLP tasks.
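To make the idea of prompt engineering concrete, here is a minimal sketch of how one instruction template could turn a general-purpose chat model into a classifier. The function name build_zero_shot_prompt and the prompt wording are assumptions made for illustration; they are not the prompt that scikit-LLM actually sends.

```python
def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Build a classification prompt for a chat model (hypothetical wording)."""
    if not labels:
        raise ValueError("at least one candidate label is required")
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        "You are a text classifier.\n"
        f"Choose exactly one label from: {label_list}.\n"
        "Answer with the label only.\n\n"
        f"Text: {text}"
    )


# The same model can serve a different task just by changing the prompt.
review = "The battery died after two days."
sentiment_prompt = build_zero_shot_prompt(review, ["positive", "negative", "neutral"])
topic_prompt = build_zero_shot_prompt(review, ["hardware", "software", "billing"])
print(sentiment_prompt)
```

Only the list of candidate labels changes between the two prompts, which is why no task-specific training is needed.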
Follow along, as we delve into the process of making a multiclass, multilabel text classification model powered by ChatGPT. We will introduce the useful new library scikit-LLM, which serves as a scikit-learn wrapper for OpenAI’s API, allowing us to create a powerful model, just like we would a regular scikit-learn model. Let's get started!
Setting Up
Let's start by installing the scikit-LLM package; use pip, poetry, or your favorite package manager:
pip install scikit-llm
Obtaining an OpenAI API Key
To harness the full power of scikit-LLM, we need to provide our OpenAI API key. Let's import the config module and specify our key:
# Import SKLLMConfig to configure OpenAI API (key and organization)
from skllm.config import SKLLMConfig
# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")
# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
If you want to follow along, please consider this:
- A free OpenAI trial is not sufficient, since we need more than three requests per minute. Please switch to the “pay as you go” plan first.
- Make sure to provide your organization ID, not the name, to SKLLMConfig.set_openai_org. You can find your ID here: https://platform.openai.com/account/org-settings
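Rather than pasting the key into a script, it is common to read it from the environment. The sketch below assumes two environment variable names, OPENAI_API_KEY and OPENAI_ORG_ID, chosen here for illustration; scikit-LLM itself does not require these names. The helper load_openai_credentials is hypothetical as well.

```python
import os
from typing import Mapping, Optional, Tuple


def load_openai_credentials(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Read the API key (required) and organization ID (optional).

    The variable names are assumptions made for this sketch.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    org = env.get("OPENAI_ORG_ID", "").strip() or None
    return key, org


# Demonstration with a fake environment so the snippet runs without real secrets.
demo_env = {"OPENAI_API_KEY": "sk-demo", "OPENAI_ORG_ID": "org-demo"}
key, org = load_openai_credentials(demo_env)

# With scikit-LLM installed, the values would be handed over like this:
# SKLLMConfig.set_openai_key(key)
# if org is not None:
#     SKLLMConfig.set_openai_org(org)
```

Keeping the key out of the source file also makes it harder to commit it to version control by accident.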
We are all set up. Let's make some models!
Zero-Shot GPTClassifier
Text classification is one of the most impressive features of ChatGPT. It can even provide Zero Shot classification, which doesn’t require specific training for the task, instead relying on descriptive labels to perform classification. This can be done using the ZeroShotGPTClassifier class:
# Importing the ZeroShotGPTClassifier class and the classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset
# Get the demo classification dataset that ships with scikit-LLM
X, y = get_classification_dataset()
# Define the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
# Fit the data
clf.fit(X, y)
# Make predictions
labels = clf.predict(X)
Scikit-LLM takes care of the API-related aspects and checks that each response contains a valid label. If a response lacks one, scikit-LLM selects a label at random, weighted by how often each label appears in the data passed to fit, so you always receive usable labels.
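The fallback for missing labels can be sketched in a few lines. This is an illustration of the idea, not scikit-LLM's actual implementation; the helper name resolve_label and its arguments are hypothetical.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def resolve_label(
    response: str,
    training_labels: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Return the model's answer if it is a known label, else a weighted random one."""
    if not training_labels:
        raise ValueError("training_labels must not be empty")
    if rng is None:
        rng = random.Random()
    counts = Counter(training_labels)
    cleaned = response.strip().strip("'\"")
    if cleaned in counts:
        return cleaned
    # Fallback: sample a label in proportion to how often it appeared in y.
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


y_seen = ["positive", "positive", "positive", "negative"]
print(resolve_label(" 'negative' ", y_seen))                     # a valid answer is kept
print(resolve_label("I am not sure", y_seen, random.Random(0)))  # falls back to a known label
```

With these example labels, an unusable response would be replaced by "positive" about three times out of four.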
Multi-Label Zero-Shot Text Classification
In the previous section, we saw Zero Shot classification with a single label per text, but the same approach also works in a multi-label setting. Instead of applying a single label, scikit-LLM can assign several of the existing labels to one text, which gives a more nuanced description of its content:
# Importing the MultiLabelZeroShotGPTClassifier class and the multilabel dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset
# Get the demo multilabel classification dataset that ships with scikit-LLM
X, y = get_multilabel_classification_dataset()
# Define the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
# Fit the model
clf.fit(X, y)
# Make predictions
labels = clf.predict(X)
In the code, the only difference between Zero-Shot and Multi-Label Zero-Shot is which class you use. To perform Multi-Label classification, we use the MultiLabelZeroShotGPTClassifier class and set max_labels; in this example, we limit each text to a maximum of 3 labels.
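To illustrate what a label cap means in practice, the sketch below post-processes a raw model answer into a list of at most max_labels valid labels. The comma-separated answer format and the helper parse_multilabel_response are assumptions for this example; they do not reflect how scikit-LLM parses responses internally.

```python
from typing import Iterable, List


def parse_multilabel_response(
    response: str, candidate_labels: Iterable[str], max_labels: int = 3
) -> List[str]:
    """Keep known labels from a comma-separated answer, capped at max_labels."""
    allowed = set(candidate_labels)
    picked: List[str] = []
    for part in response.split(","):
        if len(picked) >= max_labels:
            break  # the cap is reached, ignore the rest of the answer
        label = part.strip().strip("'\"[]")
        if label in allowed and label not in picked:
            picked.append(label)
    return picked


candidates = ["Quality", "Price", "Delivery", "Service", "Product Variety"]
raw = "Quality, Price, Shipping, Delivery, Service"
print(parse_multilabel_response(raw, candidates, max_labels=3))
```

Here "Shipping" is dropped because it is not a candidate label, and "Service" is dropped because the cap of three has already been reached.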
Text Vectorization
Another NLP task is the conversion of textual data into numerical representations that the machine can understand and further analyze. This process is called Text Vectorization and is also within scikit-LLM’s capability. Here is an example of how to do it using the GPTVectorizer:
# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer
# Creating an instance of the GPTVectorizer class
# and assigning it to the variable 'model'
model = GPTVectorizer()
# Transforming the text data
vectors = model.fit_transform(X)
As with any regular scikit-learn model, we can fit the model and use it to transform the text using the fit_transform method.
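For readers new to the scikit-learn transformer convention, the following sketch shows what fit, transform, and fit_transform do, using a toy bag-of-words vectorizer as a stand-in. It is not how GPTVectorizer works internally; ToyVectorizer is a hypothetical class that merely counts words so the example can run offline without an API key.

```python
from typing import Dict, List


class ToyVectorizer:
    """Bag-of-words stand-in that follows the fit/transform convention."""

    def fit(self, texts: List[str]) -> "ToyVectorizer":
        # Learn a fixed vocabulary: one vector position per distinct word.
        words = sorted({w for t in texts for w in t.lower().split()})
        self.vocabulary_: Dict[str, int] = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts: List[str]) -> List[List[int]]:
        # Map every text to a count vector of the same length.
        vectors = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for word in text.lower().split():
                index = self.vocabulary_.get(word)
                if index is not None:  # words unseen during fit are ignored
                    row[index] += 1
            vectors.append(row)
        return vectors

    def fit_transform(self, texts: List[str]) -> List[List[int]]:
        # Convenience method: learn from the texts, then convert the same texts.
        return self.fit(texts).transform(texts)


docs = ["great phone great price", "slow delivery"]
toy_vectors = ToyVectorizer().fit_transform(docs)
print(toy_vectors)
```

Whatever the vectorizer, the contract is the same: every text comes out as a numeric vector of the same length, which is what downstream models expect.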
Let's take it to another level!
The output from the GPTVectorizer can be used in a machine learning pipeline. In this case, we use it to prepare data for an XGBoost classifier, so that a single pipeline both vectorizes and classifies the text:
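# Importing the necessary modules and classes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
# Assumption: the original does not show how the train/test split was made.
# Here we reload the single-label demo dataset and hold out 20% for testing
X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)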
# Creating an instance of LabelEncoder class
le = LabelEncoder()
# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)
# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)
# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]
# Creating a pipeline with the defined steps
clf = Pipeline(steps)
# Fitting the pipeline on the training data 'X_train'
# and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)
# Predicting the labels for the test data 'X_test'
# using the trained pipeline
yh = clf.predict(X_test)
First, we apply the text vectorization; then we classify using XGBoost. We encode the labels as integers, fit the pipeline on the training data, and use it to predict labels for the test data. Nice!
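Note that yh holds encoded integer classes rather than the original label strings. With the code above, le.inverse_transform(yh) would map them back. The stdlib-only sketch below mimics that round trip so it can run without scikit-learn; SimpleLabelEncoder is a hypothetical stand-in, not the scikit-learn class, and the example labels are made up.

```python
from typing import List, Sequence


class SimpleLabelEncoder:
    """Minimal stand-in for a label encoder: strings <-> integer ids."""

    def fit(self, labels: Sequence[str]) -> "SimpleLabelEncoder":
        # Distinct labels in sorted order; the position becomes the integer id.
        self.classes_: List[str] = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels: Sequence[str]) -> List[int]:
        # Raises KeyError for labels that were not seen during fit.
        return [self._index[label] for label in labels]

    def inverse_transform(self, ids: Sequence[int]) -> List[str]:
        return [self.classes_[i] for i in ids]


train_labels = ["positive", "negative", "neutral", "positive"]
encoder = SimpleLabelEncoder().fit(train_labels)
encoded = encoder.transform(train_labels)

# Pretend these integers came back from clf.predict(X_test).
predicted_ids = [2, 0]
decoded = encoder.inverse_transform(predicted_ids)
print(encoded, decoded)
```

The encoder must be fitted on the training labels only and then reused for the test labels and predictions, so that the same integer always refers to the same class.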
Text Summarization
Our last example is a very commonly used NLP task, Text Summarization. ChatGPT is very efficient at these common NLP tasks and excels at anything to do with language. Scikit-LLM provides a useful GPTSummarizer class that can be used in two ways: independently or as part of your preprocessing pipeline. Let's see what it can do:
# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer
# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset
# Calling the get_summarization_dataset function to retrieve input data 'X'
X = get_summarization_dataset()
# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)
# Applying the fit_transform method of the GPTSummarizer
# instance to the input data 'X'.
# It fits the model to the data and generates the summaries,
# which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)
One thing to note is the max_words parameter, which we set to 15. It serves as a flexible limit on how many words the generated summary contains. It provides a rough target length but is not strictly enforced, so summaries can exceed the specified limit.
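If a hard limit matters downstream, one option is to enforce it after the fact. The helper truncate_words below is a hypothetical post-processing step, not part of scikit-LLM, and the example summary is made up.

```python
def truncate_words(summary: str, max_words: int, ellipsis: str = "...") -> str:
    """Cut a summary to at most max_words words, marking the cut."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = summary.split()
    if len(words) <= max_words:
        return summary  # already short enough, return unchanged
    return " ".join(words[:max_words]) + ellipsis


long_summary = "The product arrived late but works well and the support team was helpful"
short_summary = truncate_words(long_summary, max_words=5)
print(short_summary)
```

Cutting at a word boundary can leave an incomplete sentence, so this trades readability for a guaranteed length.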
Wrap-Up
Wow, what a journey! We explored the power and versatility of Scikit-LLM, a Python library that enables the seamless integration of scikit-learn and ChatGPT. We learned how to improve text classification and build smart models using LLMs.
Whether it's text classification, using one or many labels, zero-shot classification, or text summarization, many common NLP tasks can be performed efficiently by combining scikit-learn models with the power of ChatGPT.
The future is bright for machine learning, and building models has never been easier, thanks to Scikit-LLM!
Thank you for reading my story!
Published via Towards AI