Quick Guide to Using Scikit-LLM for Text Classification

Rajat Roy
3 min read · Jul 21, 2024

AI Generated Image — https://ideogram.ai/g/juSmnGyATFm5D3euy-v3GQ/3

Introduction

This article focuses on how to put LLMs to use for text classification tasks. Scikit-LLM integrates LLMs into the Scikit-learn API, so an LLM-backed classifier is used through the familiar fit and predict methods. The underlying architecture of this library is based on prompt engineering combined with LLM orchestration.

In the following sections, I share code examples showing various techniques by which LLMs can be leveraged for text classification tasks.

The example used in this article is spam classification. The dataset is taken from Kaggle and contains two columns: the actual text and its label (either spam or ham). You can find the dataset here.
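The code in the following sections assumes that X_train, X_test, y_train and y_test already exist. As a rough, dependency-free sketch of how such a split could be produced, the block below parses CSV text with a label column v1 and a text column v2 (the column names used later in this article) and holds out a few rows for testing. The sample rows, the helper names load_messages and split, and the split size are made up for illustration; the original notebook presumably uses pandas instead.

```python
import csv
import io
import random

# Tiny in-memory stand-in for the Kaggle CSV. The rows are invented for
# illustration; v1 is the label column and v2 is the text column.
SAMPLE_CSV = """v1,v2
ham,Are we still meeting for lunch today?
spam,WINNER!! Claim your free prize now by calling 0900000000
ham,I'll call you later tonight
spam,Free entry in a weekly competition. Text WIN to 80000
ham,Can you send me the report?
ham,Happy birthday! Have a great day
"""


def load_messages(csv_text):
    """Return (texts, labels) parsed from CSV text with v1/v2 columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    texts, labels = [], []
    for row in reader:
        texts.append(row["v2"])
        labels.append(row["v1"])
    return texts, labels


def split(texts, labels, n_test=2, seed=42):
    """Shuffle with a fixed seed and hold out n_test rows for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


texts, labels = load_messages(SAMPLE_CSV)
X_train, X_test, y_train, y_test = split(texts, labels)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

With real data, a stratified split is usually preferable so that the rare spam class appears in both parts.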

Installing Scikit-LLM

!pip install scikit-llm

Zero Shot Classifier

In this example, we predict the label of each text directly with the LLM, without giving it any prior training examples to refer to. The only thing we provide to the LLM during model fitting is the set of labels.

from sklearn.metrics import classification_report, confusion_matrix
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# OPENAI_API_KEY and ORGANIZATION_ID hold your own OpenAI credentials
clf = ZeroShotGPTClassifier(model="gpt-4o-mini",
                            key=OPENAI_API_KEY,
                            org=ORGANIZATION_ID)

# only provide labels to the LLM and not the training data
clf.fit(None, ["spam", "ham"])

# predict
y_pred = clf.predict(X_test.values)

print("Classification Report: \n", classification_report(y_test, y_pred))
print("\n\n")
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        12
        spam       1.00      1.00      1.00         3

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15


Confusion Matrix:
[[12  0]
 [ 0  3]]
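
To make the idea of zero-shot prompting more concrete, here is a minimal sketch of how a prompt could be assembled from nothing but the candidate labels and the input text. The function build_zero_shot_prompt and the prompt wording are hypothetical; this is not the template Scikit-LLM actually uses internally.

```python
def build_zero_shot_prompt(text, labels):
    """Build a classification prompt that lists the labels but no examples."""
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        f"Classify the text below into exactly one of these categories: {label_list}.\n"
        "Answer with the category name only.\n\n"
        f"Text: {text}"
    )


# The labels are the only task-specific information the model receives.
prompt = build_zero_shot_prompt("Win a free prize now", ["spam", "ham"])
print(prompt)
```

This mirrors why fit only needs the label list in the zero-shot case: there is no training text to place in the prompt.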

Few Shot Classifier

Now, let's provide some training samples to the LLM. The few-shot classifier passes them to the model as labeled examples inside the prompt.

from skllm.models.gpt.classification.few_shot import FewShotGPTClassifier

clf = FewShotGPTClassifier(model="gpt-4o-mini",
                           key=OPENAI_API_KEY,
                           org=ORGANIZATION_ID)

# adding training samples
clf.fit(X_train.values, y_train.values)

# predict on unseen data
y_pred = clf.predict(X_test.values)

print("Classification Report: \n", classification_report(y_test, y_pred))
print("\n\n")
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        12
        spam       1.00      1.00      1.00         3

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15


Confusion Matrix:
[[12  0]
 [ 0  3]]
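
As a rough illustration of what few-shot prompting means, the sketch below places every labeled training sample in the prompt ahead of the text to classify. The function build_few_shot_prompt, the sample messages and the prompt wording are assumptions made for this example and are not taken from Scikit-LLM.

```python
def build_few_shot_prompt(text, train_texts, train_labels):
    """Build a prompt that shows every labeled sample before the query text."""
    # Remove duplicate labels while keeping their first-seen order.
    labels = list(dict.fromkeys(train_labels))
    label_list = ", ".join(f"'{label}'" for label in labels)
    examples = "\n\n".join(
        f"Text: {sample}\nCategory: {label}"
        for sample, label in zip(train_texts, train_labels)
    )
    return (
        f"Classify the final text into one of these categories: {label_list}.\n\n"
        f"{examples}\n\n"
        f"Text: {text}\nCategory:"
    )


# Made-up training samples for illustration only.
train_texts = ["Free entry! Text WIN to 80000", "See you at lunch tomorrow"]
train_labels = ["spam", "ham"]
prompt = build_few_shot_prompt("Call now to claim your prize", train_texts, train_labels)
print(prompt)
```

Because every training sample ends up in the prompt, this approach only suits small training sets; the prompt grows with each sample added.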

Dynamic Few Shot Classifier

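Use the dynamic few-shot classifier when the training set is too large to send in full with every request. Instead of including every sample, it selects a fixed number of training examples per class (n_examples) for each input, picking the ones most similar to the text being classified.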

from skllm.models.gpt.classification.few_shot import DynamicFewShotGPTClassifier

# use 4 examples for each class
clf = DynamicFewShotGPTClassifier(model="gpt-4o-mini",
                                  key=OPENAI_API_KEY,
                                  org=ORGANIZATION_ID,
                                  n_examples=4)

# the classifier must be fitted so it has training samples to select from
clf.fit(X_train.values, y_train.values)

y_pred = clf.predict(X_test.values)

print("Classification Report: \n", classification_report(y_test, y_pred))
print("\n\n")
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        12
        spam       1.00      1.00      1.00         3

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15


Confusion Matrix:
[[12  0]
 [ 0  3]]
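
The selection step behind a dynamic few-shot approach can be sketched as follows: for each class, rank the training texts by similarity to the input and keep the top n_examples. This sketch uses a simple bag-of-words cosine similarity as a stand-in for the embedding-based similarity a real implementation would rely on. The helper names and sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_examples(text, train_texts, train_labels, n_examples=1):
    """Pick the n_examples most similar training texts for each class."""
    query = bag_of_words(text)
    selected = []
    for label in sorted(set(train_labels)):
        candidates = [t for t, l in zip(train_texts, train_labels) if l == label]
        candidates.sort(
            key=lambda t: cosine_similarity(query, bag_of_words(t)), reverse=True
        )
        selected.extend((t, label) for t in candidates[:n_examples])
    return selected


# Made-up training samples for illustration only.
train_texts = [
    "See you at lunch tomorrow",
    "Can you call me tonight",
    "You won a free prize call now",
    "Cheap loans approved today",
]
train_labels = ["ham", "ham", "spam", "spam"]

selected = select_examples("Call now to claim your prize", train_texts, train_labels)
for sample, label in selected:
    print(f"{label}: {sample}")
```

Only the selected pairs would then be placed in the prompt, which keeps its length bounded regardless of how large the training set is.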

Chain of Thought Classifier

The chain-of-thought classifier returns each predicted label together with the model's reasoning for it.

from skllm.models.gpt.classification.zero_shot import CoTGPTClassifier

clf = CoTGPTClassifier(model="gpt-4o-mini",
                       key=OPENAI_API_KEY,
                       org=ORGANIZATION_ID)

# 'v2' is the text column and 'v1' is the label column
clf.fit(X_train['v2'], y_train['v1'])

y_pred = clf.predict(X_test['v2'])

Actual Text

X_test['v2'].iloc[4]
URGENT! Your mobile was awarded a £1,500 Bonus Caller Prize on 27/6/03. 
Our final attempt 2 contact U! Call 08714714011

Label with Reason

label, reason = y_pred[4]

print(f"Category: {label}")
print(f"Reason: {reason}")
Category: spam
Reason: {"'ham'": "The text does not fit the 'ham' category as it is
not a legitimate or personal communication. 'Ham' typically refers to
non-commercial, genuine messages, such as personal emails or friendly
communications. This text is clearly promotional and unsolicited.",
"'spam'": "The text fits the 'spam' category as it is a promotional
message that attempts to solicit a response from the recipient. It
uses urgent language ('URGENT!') and offers a prize, which is a common
tactic in spam messages. Additionally, it includes a phone number to
call, which is often associated with scams or unwanted solicitations."}
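
Since each prediction here is a (label, reason) pair rather than a bare label, the labels have to be separated out before they can be compared with the true labels, for example in a classification report. A minimal sketch, using a hand-written stand-in for y_pred (the rows and the helper split_predictions are made up for illustration):

```python
def split_predictions(y_pred):
    """Separate (label, reason) pairs into a list of labels and a list of reasons."""
    labels = [row[0] for row in y_pred]
    reasons = [row[1] for row in y_pred]
    return labels, reasons


# Hand-written stand-in for the classifier output; each row is (label, reason).
example_pred = [
    ("spam", "Uses urgent language and offers a prize."),
    ("ham", "Reads like a personal message between friends."),
]

pred_labels, pred_reasons = split_predictions(example_pred)
print(pred_labels)
```

The resulting list of labels can then be passed to classification_report and confusion_matrix in the same way as in the earlier sections.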

Conclusion

This completes my article showing code examples of various techniques you can apply to any text dataset for classification tasks. Scikit-LLM is a library that empowers AI developers to quickly apply a text classifier and get the predicted labels, optionally with reasoning, without needing to pre-process the text or apply any transformation.

The code used in this article is available in this notebook.
