

It is raining Language Models! All about the new Small Language Model— Phi-2

Last Updated on January 5, 2024 by Editorial Team

Author(s): Syed Huma Shah

Originally published on Towards AI.

Image by author, generated using Dall-E


Over the past year, language models have developed very rapidly. Since ChatGPT launched, the tech world has been under pressure to deliver new advances and products in natural language processing. Until now, we have seen LLMs of many different sizes; now small language models (SLMs) have entered the space as a cost-effective, environmentally friendly, agile, and efficient alternative to LLMs. They represent a significant shift in the AI and machine learning landscape, focusing on quality and efficiency rather than just scale.

A small language model (SLM) is a language model with a relatively small number of parameters. There is no single universal definition yet, but roughly 10 billion parameters or fewer seems to be the usual cutoff. SLMs can be thought of as scaled-down versions of large language models, often designed for particular tasks, that require less compute, offer faster inference, and are more agile. Examples of SLMs that have been released include Orca 2, DistilBERT, Gemini Nano, and TinyBERT.

The latest SLM released is Phi-2, developed by Microsoft. It is the newest model in the Phi series: Phi-1 was introduced in June 2023, and Phi-1.5 followed in September 2023.

Model Description

Phi-2 is a small language model with just 2.7 billion parameters. Its architecture is Transformer-based, and it is trained with a next-word prediction objective. According to Microsoft, it matches or outperforms models up to 25 times its size, such as Llama-2 (70B), on some benchmarks. Phi-2 is open-source and available on Hugging Face and the Azure AI Catalog.
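To make the size gap more concrete, here is a back-of-the-envelope sketch of the memory needed just to hold each model's weights. It assumes 2 bytes per parameter (16-bit floats) and ignores activations, caches, and framework overhead, so the results are rough lower bounds and not measured figures.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate memory (in GB, where 1 GB = 1e9 bytes) needed to store the weights."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the article
models = {"Phi-2": 2.7e9, "Llama-2-70B": 70e9}

for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in fp16")

size_ratio = models["Llama-2-70B"] / models["Phi-2"]
print(f"Llama-2-70B has ~{size_ratio:.0f}x as many parameters as Phi-2")
```

Under these assumptions, Phi-2's weights fit comfortably on a single consumer GPU, while a 70B model needs multiple high-memory accelerators or aggressive quantization.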

Phi-2 is far smaller than a typical LLM. As a loose analogy: if Phi-2's parameter count were a small car, GPT-3.5's would be a large commercial airliner, and GPT-4's would be about ten such airliners.

Because Phi-2 was trained on a smaller dataset, its capabilities are less general. It works best on specific tasks involving coding, mathematics, general reasoning, and common sense. Because it has not been instruction fine-tuned, it does not respond well to subtle adjustments and nuances in prompts.

Evolving Landscape of Phi-2

Phi-2's predecessors were Phi-1 and Phi-1.5. Phi-1 was designed specifically for Python coding tasks. It has 1.3 billion parameters and was trained on data curated from several sources: The Stack v1.2, StackOverflow, competition code, and synthetic Python textbooks and exercises generated by the gpt-3.5-turbo-0301 model. The next version, Phi-1.5, was developed for common sense reasoning and language understanding. It is also a Transformer-based model with 1.3 billion parameters, trained on data that includes synthetic NLP texts. Phi-2 is a natural progression that combines the capabilities of both predecessors: writing code, language understanding, reasoning, and math.

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. The Microsoft research team carefully curated the training data with a focus on "textbook-quality" data, which helped make this small yet powerful language model. This textbook-quality data was taken from the web and also synthesized using GPT-3.5 (1B tokens). Training Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model: it has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned.
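The training figures above translate into a rough compute budget. The sketch below multiplies out the GPU-hours and, as an assumption not taken from Microsoft's write-up, applies the common "6 x parameters x tokens" rule of thumb for dense Transformer training FLOPs. Treat the FLOP figure as an order-of-magnitude estimate only.

```python
# Figures quoted in the article
training_days = 14
num_gpus = 96
num_params = 2.7e9   # Phi-2 parameter count
num_tokens = 1.4e12  # training tokens seen, including repeated passes

# Total accelerator time, assuming all GPUs ran for the full 14 days
gpu_hours = training_days * 24 * num_gpus

# Assumption: the widely used "6 * N * D" approximation for training FLOPs
approx_train_flops = 6 * num_params * num_tokens

print(f"GPU-hours: {gpu_hours:,}")
print(f"Approximate training compute: {approx_train_flops:.2e} FLOPs")
```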

Small Language Models (SLMs) are a fascinating development in the world of artificial intelligence and natural language processing, and Phi-2 showcases their potential. Despite being much smaller than typical large language models, Phi-2 has demonstrated impressive capabilities. Because SLMs require less computing power and are more environmentally friendly, they serve as a more practical alternative to LLMs.

Model Results

Phi-2 is open-source and intended for research purposes. It was compared against four models: Mistral (7B) and Llama-2 (7B, 13B, and 70B). Despite its smaller size, Phi-2 outperforms Mistral-7B and the Llama-2 7B and 13B models on BigBench Hard, common sense reasoning, and mathematics, and it outperforms the Llama-2 7B and 13B models on language understanding. On coding, it significantly outperforms all four models; on mathematics, it significantly outperforms all of them except Llama-2-70B.

Image from Microsoft Research Blog

Since this model is meant for the research community, it was also extensively tested on topics common in research, such as physics problems, and it performed well. It was even able to spot the errors in a student's physics solution.

Phi-2 surpasses larger language models, namely Mistral (7B) and Llama-2 (7B and 13B), on various aggregated benchmarks. The benchmarks used are as follows: Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)). According to Microsoft, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size. [Source: Microsoft Research Blog]
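The "k-shot" labels in the benchmark list mean that k solved examples are placed in the prompt before the question being tested, and "CoT" (chain of thought) means those examples spell out their reasoning. The sketch below shows one way such a prompt could be assembled. The "Question:/Answer:" layout and the sample problems are illustrative assumptions, not the exact templates used in Microsoft's evaluation.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join k solved examples and one unsolved question into a single prompt."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final question is left unanswered so the model completes it
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical worked examples; the written-out reasoning is what makes them chain-of-thought
examples = [
    ("Tom has 3 apples and buys 2 more. How many does he have?",
     "He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5."),
    ("A book costs 4 dollars. How much do 3 books cost?",
     "Each book is 4 dollars, so 3 * 4 = 12. The answer is 12."),
]

# Two examples in the prompt makes this a 2-shot prompt
prompt = build_few_shot_prompt(examples, "Sara has 10 pens and gives away 4. How many are left?")
print(prompt)
```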

Image from Microsoft Research Blog

Apart from showing good performance in the above benchmarks, Phi-2 has also shown better behavior concerning toxicity and bias compared to some existing open-source models, even without undergoing reinforcement learning from human feedback.

Image from Microsoft Research Blog

What makes this model special?

One aspect that makes Phi-2 powerful is the quality of its training data. Its strong performance in areas like common sense reasoning, language comprehension, math, and coding is the result of several meticulous strategies. The team used innovative training techniques, such as scaling up knowledge from its predecessor model (Phi-1.5) and embedding it in Phi-2. In addition, high-quality ("textbook-quality") data was both selected and specifically synthesized to teach the model common sense reasoning, language comprehension, and other areas such as science, daily activities, theory of mind, math, and coding.

This model is intended for research use only. It is a high-performing model relative to size and is ideal for experimentation and exploration of fine-tuning measures and introducing safety, bias, and toxicity improvements.

Traditional LLMs are expensive to train and deploy, and they take a toll on the environment. Making small language models (SLMs) good enough to be comparable to these LLMs on a wide variety of specialized tasks offers a much cheaper, more eco-friendly, and scalable option, which in turn should help many language-model-based applications flourish.

SLMs offer numerous benefits and have a wide range of use cases in various industries. They are particularly valuable in scenarios where resource limitations are a factor, such as in on-device or on-premises deployments. Their smaller size allows for efficient processing, making them ideal for applications requiring real-time response and agility.

Getting Started with Phi-2: Practical Steps for Researchers and Developers

You can find the Phi-2 model on Hugging Face (microsoft/phi-2) and try it for yourself. Experiment with it and explore its capabilities. You can use it to solve a variety of problems and to build applications that take advantage of its speed and efficient processing.

Here are some ideas to engage and experiment with Phi-2:

Explore Phi-2 on Hugging Face

Start by exploring the model itself before moving on to applications or full use cases. Access Phi-2 on Hugging Face and familiarize yourself with its functionality. Experiment with the model by giving it different types of text and analyzing its outputs.

Use the code below to get started with the model.

First, make sure you have the required libraries installed (the pipeline also needs a backend such as PyTorch):

pip install transformers
pip install einops
pip install torch

After installing the libraries, use the code below to explore the model:

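from transformers import pipeline

# Initialize the pipeline (downloads the model weights on first use)
phi2_pipeline = pipeline("text-generation", model="microsoft/phi-2", trust_remote_code=True)

# Generate text, capping the length of the continuation
input_text = "Enter your text prompt here"
generated_text = phi2_pipeline(input_text, max_new_tokens=100)[0]['generated_text']

# Print the prompt together with the generated continuation
print(generated_text)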


Develop a Domain-Specific Application

Use Phi-2's efficient processing to handle domain-specific queries and provide tailored responses. Experiment with a question-answering system for a specific domain, such as finance or customer service, leveraging Phi-2's ability to understand and generate human-like text.

A question-answering system usually calls for a model fine-tuned on QA tasks. If a QA-fine-tuned version of Phi-2 is available, use it; otherwise, here is an example with the general model:

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load Phi-2 model and tokenizer
model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name,trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Initialize text generation pipeline with Phi-2
phi2_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True)

# Combine question and context
question = "Who developed Phi-2?"
context = "Phi-2 is a small language model developed by Microsoft."
input_text = f"Question: {question}, Context: {context}"

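# Generate an answer, capping the length of the continuation
generated_text = phi2_pipeline(input_text, max_new_tokens=50)[0]['generated_text']

# Print the answer (the output includes the original prompt)
print(generated_text)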
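Because Phi-2 is a base model, the way the prompt is laid out can matter. The Phi-2 model card on Hugging Face describes a QA-style "Instruct: <prompt>\nOutput:" layout. The helper below is a minimal sketch of that layout. Placing the context ahead of the question inside the instruction is an assumption made here for illustration, and both function names are hypothetical. The second helper strips the echoed prompt, since text-generation pipelines return the prompt along with the continuation by default.

```python
def format_qa_prompt(question: str, context: str = "") -> str:
    """Build an 'Instruct: ... Output:' prompt, optionally preceded by supporting context."""
    instruction = question if not context else f"{context}\n{question}"
    return f"Instruct: {instruction}\nOutput:"


def extract_answer(generated_text: str, prompt: str) -> str:
    """Strip the echoed prompt from a generation result, keeping only the completion."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = format_qa_prompt(
    "Who developed Phi-2?",
    "Phi-2 is a small language model developed by Microsoft.",
)
print(prompt)

# Stand-in for a real model output, so the sketch runs without downloading the model
fake_generation = prompt + " Microsoft developed Phi-2.\n"
print(extract_answer(fake_generation, prompt))
```

With the real model, the same helpers would wrap the pipeline call: pass `prompt` in, then pass the returned `generated_text` to `extract_answer`.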

Conduct Comparative Studies

If you’re in research, compare Phi-2’s performance with other models on tasks like text generation, sentiment analysis, or language translation. Document how its smaller size impacts its effectiveness in these tasks.

For a comparative study, you'll need to set up a similar pipeline for another model and compare outputs. GPT-3 itself is not available through Hugging Face, so here's how you might set it up with GPT-Neo 2.7B, an open model of the same size as Phi-2, for instance:

from transformers import pipeline

# Initialize Phi-2 pipeline
phi2_pipeline = pipeline("text-generation", model="microsoft/phi-2", trust_remote_code=True)
# Initialize GPT-Neo pipeline (an open GPT-style model with 2.7B parameters)
gpt_neo_pipeline = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")

# Input text
input_text = "Explain the top productivity hacks shared by James Clear"

# Generate text with both models, using the same length cap for a fair comparison
phi2_text = phi2_pipeline(input_text, max_new_tokens=100)[0]['generated_text']
gpt_neo_text = gpt_neo_pipeline(input_text, max_new_tokens=100)[0]['generated_text']

print("Phi-2 Output:", phi2_text)
print("GPT-Neo Output:", gpt_neo_text)

Experiment with Fine-Tuning

For those with a technical background, try fine-tuning Phi-2 on a niche dataset. This could be particularly insightful for applications in less-common languages (your regional language, perhaps) or specialized industrial contexts. You could also create specialized language models for unique fields like archaeology, quantum physics, or ancient languages.

Below is the code for basic fine-tuning of the Phi-2 model on a custom dataset. The training is done using Hugging Face’s Trainer class, which simplifies the process. However, keep in mind that working with large models like Phi-2 may still require considerable computational resources.

You need to specify the path to your dataset. Ensure that your dataset is in a format that the model expects. For a language model, this usually means a dataset of text.

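from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the Phi-2 model and tokenizer
model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Make sure to set the pad token if it's not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load your dataset
# Replace 'path_to_your_dataset.json' with the path to your dataset
dataset = load_dataset("json", data_files="path_to_your_dataset.json", split="train")

# Tokenize the dataset
# Assumption: each record has a "text" field; adjust the field name to match your data
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",         # Output directory for model checkpoints
    per_device_train_batch_size=1,  # Batch size for training
    num_train_epochs=3,             # Number of training epochs
    logging_steps=10,               # Log training information every 10 steps
    learning_rate=5e-5,             # Learning rate
    weight_decay=0.01,              # Weight decay for regularization
)

# Initialize the Trainer
# The collator pads each batch and copies input IDs to labels for causal language modeling
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train the model
trainer.train()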
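The `load_dataset("json", ...)` call above reads a file of JSON records. One common layout is JSON Lines, with one object per line. The sketch below writes and sanity-checks such a file using only the standard library. The "text" field name, the sample sentences, and both helper names are assumptions for illustration, so match them to whatever your own data and tokenization step expect.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path, field="text"):
    """Return the record count, raising if any record lacks a non-empty string `field`."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"line {line_number}: missing or empty '{field}' field")
            count += 1
    return count


# Hypothetical sample records for a niche domain
records = [
    {"text": "Stratigraphy is the study of rock and soil layers."},
    {"text": "Radiocarbon dating estimates the age of organic material."},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)
print(validate_jsonl(path), "records written to", path)
```

Checking the file before training is cheap and catches empty or malformed records that would otherwise surface as confusing errors partway through a run.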

Apart from the above examples, you can also experiment with integrating Phi-2 into existing digital products, particularly chatbots, customer service interfaces, or content creation tools, to enhance their language capabilities. Furthermore, it can serve as a personal tutor: you can use it to learn new concepts and languages or to work through coding and math problems, as it notably outperforms many other models in these domains.


The development of SLMs is a testament to the growing understanding that efficiency, precision, and adaptability can coexist in more compact forms of AI models. They are not just alternatives to their larger counterparts, but they complement them, each with unique strengths and applications. The future of AI looks promising, with both small and large models playing pivotal roles in diverse fields and their applications.

