Some Insights About Phi-4: Microsoft’s New Small Foundation Model that Punches Above its Weight
Last Updated on December 18, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
thesequence.substack.com
Microsoft Phi has been credited with starting the small language model (SLM) movement as an alternative to the “intelligence by scale” approach followed by the large AI labs. First released a couple of years ago alongside the famous paper “Textbooks Are All You Need,” the Phi family has introduced new innovations in data quality and training with every release. Phi-4 is the latest addition to Microsoft’s marquee SLM family, and it does not disappoint. Today, I would like to dive into some of the details behind Phi-4.
Not so small anymore, Phi-4 is a 14-billion-parameter language model that emphasizes the importance of data quality in achieving performance comparable to, or even exceeding, that of much larger models. It builds on the success of the Phi family of models, which have consistently demonstrated that improvements in data can rival the benefits of scaling model size. Phi-4’s innovations lie in its distinctive pre-training, midtraining, and post-training approaches.
Pre-Training: A Data-Centric Approach
Phi-4’s pre-training strategy centers around three core pillars:
- Extensive use of high-quality synthetic data.
- Careful curation and filtering of organic data.
- A refined post-training process.
These elements work together to equip the model with strong reasoning and problem-solving abilities.
Leveraging Synthetic Data
Unlike most language models that rely heavily on organic data sources like web content, Phi-4 strategically incorporates synthetic data throughout its training process. The model utilizes a diverse array of synthetic data generation techniques, including multi-agent prompting, self-revision workflows, and instruction reversal. These methods enable the creation of datasets that specifically target reasoning and problem-solving skills, overcoming some limitations inherent in traditional unsupervised datasets.
The rationale behind Phi-4’s emphasis on synthetic data stems from the inherent advantages it offers over organic data:
- Structured and Gradual Learning: Synthetic data allows for the presentation of challenges in a more digestible and progression-oriented manner, facilitating structured and gradual learning for the model. In contrast, the complex and indirect relationships between tokens in organic datasets make it harder for the model to learn effectively from next-token prediction.
- Alignment with Inference Contexts: Training on synthetic data helps align the model’s pre-training experience with the scenarios it is likely to encounter during inference. This alignment ensures that the context seen during generation remains consistent with the data distribution the model was trained on.
Phi-4 adheres to four key principles when generating synthetic data:
- Diversity: The data must comprehensively cover various subtopics and skills within each domain, ensuring a broad and balanced representation of knowledge.
- Nuance and Complexity: To effectively challenge the model and facilitate learning, the synthetic data must go beyond basic examples and incorporate nuanced, non-trivial cases that reflect the inherent complexity of the domain.
- Accuracy: The generated data must maintain a high level of accuracy. For example, code must execute correctly, mathematical proofs must be valid, and explanations should be factually correct.
- Chain-of-Thought Reasoning: The data should encourage systematic reasoning by demonstrating different problem-solving approaches in a step-by-step manner, promoting the generation of coherent outputs for complex tasks.
Phi-4’s synthetic datasets are created from high-quality seeds sourced from diverse domains. These seeds serve as the foundation for generating exercises, discussions, and reasoning tasks specifically tailored to the model’s training objectives.
Generation of bogus questions

Consider the following trivia question:

# Question
{{ question }}

# Instructions
Your job is to turn this problem into a nonsensical one, for which the answer is invalid or unlikely to be known by anyone. For example, you might change the name from a well-known figure to a random name, or change the date from a well-known event to a random date, or the place to a different one. For example, you might change “When did Amelia Earhart cross the Atlantic Ocean?” to “When did Edgar Greenwood cross the Atlantic Ocean?” or “How many times did Amelia Earhart cross the English Channel?”.

Your goal is that the new question is *plausibly real*, but impossible to answer. You should not make the question obviously fake, silly, or fictional; for example, all country names should be real countries, and no names should be obvious homages to the original question. It should sound like a serious trivia question.

You may start with a very brief discussion, then end with two markdown sections:
- The section ‘# Response’ that contains the question.
- The section ‘# Quality’ that rates the generated question in quality from 1–5, with 5 being the highest quality. A high quality question is (1) different from the given question and (2) plausible.
Phi-4 employs a multi-pronged approach to seed curation:
- Web and Code-Based Seeds: Excerpts and snippets demonstrating high complexity, reasoning depth, and educational value are extracted from web pages, books, and code repositories. A two-stage filtering process ensures quality: first, identifying pages with strong educational potential and, second, segmenting selected pages into passages and scoring them for factual and reasoning content.
- Question Datasets: Questions collected from various websites, forums, and Q&A platforms are filtered to balance difficulty levels. This is achieved through a plurality-based technique where multiple independent answers are generated for each question, and majority voting is used to assess the consistency of responses. Questions with unanimous answers (too easy) or entirely inconsistent answers (too difficult or ambiguous) are discarded.
- Extracting Question-Answer Pairs from Diverse Sources: Language models are leveraged to extract question-answer pairs from sources like books, scientific papers, and code, focusing on identifying deduction chains and logical progressions in the text. This technique goes beyond simply finding explicit Q&A pairs and instead aims to uncover the underlying reasoning processes within the text.
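The plurality-based difficulty filter is simple enough to sketch in a few lines of Python. This is a minimal illustration of the idea, not Phi-4’s actual pipeline: the answer-sampling interface and the “no clear majority” threshold are my own assumptions.

```python
from collections import Counter

def keep_question(sampled_answers):
    """Plurality-based difficulty filter (sketch). `sampled_answers` holds
    multiple independently generated model answers to one question.
    Unanimous agreement means the question is too easy; no clear majority
    means it is too difficult or ambiguous. Both cases are discarded."""
    counts = Counter(sampled_answers)
    top = counts.most_common(1)[0][1]
    n = len(sampled_answers)
    if top == n:        # unanimous -> too easy
        return False
    if top <= n // 2:   # no majority -> too difficult or ambiguous
        return False
    return True         # partial agreement -> keep

print(keep_question(["A", "A", "A", "A", "A"]))  # False: everyone agrees
print(keep_question(["A", "B", "C", "D", "E"]))  # False: no consistency
print(keep_question(["A", "A", "A", "B", "C"]))  # True: clear but not unanimous majority
```

In practice the sampled answers would first be normalized (e.g., extracting the final numeric answer) so that superficially different phrasings count as the same vote.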
Once the seeds are curated, they are transformed into synthetic data through multi-step prompting workflows. These workflows include:
- Rewrite and Augment: Seeds are rewritten into exercises, discussions, or structured reasoning tasks to enhance their educational value for the model.
- Self-revision: The initial model responses are iteratively refined through a feedback loop where the model critiques and improves its own outputs, guided by rubrics focused on reasoning and factual accuracy.
- Instruction Reversal: This technique, particularly beneficial for code generation and other specific tasks, involves reversing existing instructions. For instance, code snippets are used to generate corresponding problem descriptions or task prompts, resulting in data pairs where the instruction precedes the code.
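Of these workflows, instruction reversal is the easiest to make concrete. The snippet below is an illustrative toy, not the report’s implementation: `describe_code` stands in for an LLM call that writes a task description from an existing code snippet, and `toy_describe` is a hypothetical stand-in for demonstration.

```python
def reverse_instruction(code_snippet, describe_code):
    """Instruction reversal (sketch): start from existing code and generate
    the instruction that should precede it, yielding an (instruction,
    response) training pair. `describe_code` stands in for an LLM call."""
    instruction = describe_code(code_snippet)
    return {"instruction": instruction, "response": code_snippet}

# Toy stand-in for the model that writes a problem description from code
def toy_describe(code):
    name = code.split("def ")[1].split("(")[0]
    return f"Write a Python function named `{name}` that adds two numbers."

pair = reverse_instruction("def add(a, b):\n    return a + b", toy_describe)
print(pair["instruction"])  # Write a Python function named `add` that adds two numbers.
```

A real pipeline would also filter the resulting pairs, e.g., by checking that a model given only the generated instruction can reproduce code with equivalent behavior.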
Curation and Filtering of Organic Data
While synthetic data constitutes a significant portion of Phi-4’s training data, organic data is not completely omitted. High-quality organic data sources are carefully curated and filtered, prioritizing reasoning-dense and nuanced materials such as academic papers, educational forums, and programming tutorials. This curated organic data serves two primary purposes:
- Directly used in pre-training as a complementary dataset.
- Used as seeds for specialized synthetic data generation pipelines.
Phi-4’s approach to organic data curation emphasizes quality and relevance, with a focus on selecting content that can enhance the model’s reasoning and knowledge base.
Key considerations in organic data curation and filtering include:
- Targeted Acquisitions: Inclusion of major repositories of reasoning-dense documents that are publicly accessible and permissible for use, such as arXiv, PubMed Central, and GitHub. Licensed books are also incorporated to ensure comprehensiveness, recency, and cleanliness.
- Filtering Web Dumps: To capture the vast amount of information available on the web, a small fraction of the highest-quality documents are selected from bulk web dumps using small classifiers trained on LLM-generated annotations.
- Multilingual Data: To ensure the model can handle a wide range of languages, high-quality multilingual documents from CommonCrawl and Wikipedia are included. A language identification model is used to categorize documents into 176 languages, and the same classifiers used for filtering web dumps are applied to filter for quality.
- Custom Extraction and Cleaning Pipelines: Customized heuristics and parsers are developed for each targeted data source to ensure cleanliness and uniformity across heterogeneous organic data sources. This involves building custom pipelines to handle various file formats and developing a custom HTML-to-text extractor for general web data.
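As a rough illustration of the classifier-based filtering step, the toy perceptron below is trained on a handful of hand-labeled documents standing in for LLM-generated quality annotations. The features, labels, and model choice are all invented for this sketch; the report does not describe its classifiers at this level of detail.

```python
def featurize(doc):
    # Toy features; a real pipeline would use much richer text features.
    words = doc.lower().split()
    markers = {"therefore", "because", "hence", "proof", "theorem"}
    density = sum(w in markers for w in words) / max(len(words), 1)
    return [min(len(words) / 100.0, 1.0), density * 10.0]

def train_perceptron(annotated, epochs=20, lr=0.5):
    """Fit a tiny linear classifier on (doc, label) pairs, where the
    labels stand in for LLM-generated quality annotations."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for doc, label in annotated:
            x = featurize(doc)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = label - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def keep_document(doc, w, b):
    x = featurize(doc)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

annotated = [
    ("We prove the theorem because the lemma holds, hence the result follows.", 1),
    ("click here buy now free free free", 0),
    ("Therefore the derivative is positive because f is increasing.", 1),
    ("lol random spam spam spam", 0),
]
w, b = train_perceptron(annotated)
print(keep_document("The proof follows because the bound holds.", w, b))  # True
```

The point of using a small classifier rather than the LLM itself is throughput: the expensive model annotates a modest sample, and the cheap classifier scores the entire web dump.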
Data Mixture and Midtraining
The pre-training process for Phi-4 involves a carefully designed data mixture that balances synthetic and organic data sources. The final data mixture allocates 30% of the training tokens to web and web rewrites data sources, 40% to synthetic data, 20% to code data, and 10% to targeted acquired sources like academic data and books.
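In concrete terms, that mixture simply splits the training token budget by fixed fractions, as in the trivial sketch below. The fractions are the ones reported above; the total budget shown is a placeholder figure, not the report’s exact number.

```python
def allocate_tokens(total_tokens, mixture):
    """Split a training token budget according to a data-mixture spec
    mapping source name -> fraction of total tokens."""
    assert abs(sum(mixture.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return {source: round(total_tokens * frac) for source, frac in mixture.items()}

# Final Phi-4 mixture described in the report
mixture = {
    "web_and_rewrites": 0.30,
    "synthetic": 0.40,
    "code": 0.20,
    "targeted_acquisitions": 0.10,
}
print(allocate_tokens(1_000_000, mixture))
# {'web_and_rewrites': 300000, 'synthetic': 400000, 'code': 200000, 'targeted_acquisitions': 100000}
```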
Following the initial pre-training phase, Phi-4 undergoes a midtraining stage where the context length is increased from 4K to 16K. This stage focuses on further refining the model’s long-context understanding and reasoning abilities. The data mixture for midtraining prioritizes inherently long-context data sources, including carefully selected subsets of academic, book, and code data, as well as newly created synthetic datasets that meet the longer sequence requirements.
Post-Training: Refining the Model for Practical Applications
While pre-training lays the foundation for Phi-4’s capabilities, the post-training process is crucial for transforming the model into a safe and effective AI assistant for users. This process involves:
- Supervised Fine-Tuning (SFT): The pretrained model is fine-tuned using carefully curated user prompts and high-quality responses from various domains, including math, coding, reasoning, conversation, model identity, and safety.
- Direct Preference Optimization (DPO): Two rounds of DPO are employed to align the model with human preferences and steer it away from unwanted behavior. The first round utilizes a novel technique called Pivotal Token Search (PTS) to generate DPO pairs that specifically target pivotal tokens, which are identified as having a significant impact on the overall correctness of the solution. The second round, referred to as judge-guided DPO, gathers preference data by comparing model-generated responses against those from GPT-4 and using GPT-4 as a judge to label the preferred response based on criteria like accuracy, style, and detail.
- Hallucination Mitigation: Specific SFT data and DPO pairs are generated to mitigate the model’s tendency to hallucinate, encouraging it to refuse to answer rather than fabricate information when it does not know the answer.
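The preference-optimization step can be grounded with the standard per-pair DPO objective. The snippet below is the textbook formulation, not Phi-4’s training code; the log-probabilities are assumed to be summed over response tokens (or, for PTS-generated pairs, over just the pivotal-token continuation).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * implicit-reward margin)).
    Inputs are log-probabilities of the chosen/rejected responses under
    the trained policy (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization (policy == reference) the margin is zero, so the
# loss starts at log(2); it falls as the policy learns the preference.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))                # log(2) ≈ 0.6931
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))   # True
```

What Pivotal Token Search changes is not this loss but the pairs it is applied to: instead of contrasting two full responses, each pair shares a prefix and diverges at a single token that sharply shifts the probability of a correct solution, concentrating the gradient where it matters.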
This multi-stage post-training process aims to refine Phi-4’s performance across various benchmarks while addressing potential safety and ethical concerns.
Benchmarking and Performance
Phi-4 has been evaluated on a variety of standard benchmarks, including MMLU, GPQA, MATH, HumanEval, MGSM, SimpleQA, DROP, MMLU-Pro, HumanEval+, ArenaHard, and IFEval, as well as PhiBench, an internal benchmark developed specifically for evaluating the diverse skills and reasoning abilities deemed critical for Phi-4’s development.
The results demonstrate that Phi-4 achieves strong performance relative to its size, particularly on reasoning-focused benchmarks, often exceeding the performance of much larger models. For instance, it outperforms its teacher model, GPT-4o, on the GPQA and MATH benchmarks. This exceptional performance is attributed to the improved data quality, training curriculum, and innovations in the post-training scheme.
Addressing Limitations and Future Directions
Despite its impressive capabilities, Phi-4 does have limitations, primarily stemming from its relatively small size. These limitations include:
- Factual Hallucinations: While mitigated through targeted post-training techniques, the model can still exhibit factual hallucinations, particularly around less common knowledge.
- Instruction Following: Phi-4 is less proficient at strictly following detailed instructions, particularly those with specific formatting requirements, as its training focused more on Q&A and reasoning tasks.
- Occasional Reasoning Errors: Even on reasoning tasks, the model can make mistakes, highlighting the need for further refinement of its reasoning abilities.
Addressing these limitations and enhancing Phi-4’s capabilities will require further research and development in areas such as:
- Continued Data Refinement: The quality and diversity of the training data play a crucial role in the model’s performance. Continued efforts to curate and generate high-quality synthetic and organic data will be essential for further improving Phi-4’s abilities.
- Exploring Novel Architectures: While Phi-4 utilizes a standard transformer architecture, exploring novel architectures or modifications could potentially lead to improvements in performance and efficiency.
- Addressing Ethical Concerns: As with any powerful AI system, it is crucial to continuously evaluate and address potential ethical concerns related to bias, fairness, and the potential for misuse.
Overall, Phi-4 represents a significant step forward in demonstrating the power of data quality in achieving high performance in smaller language models. Its innovative approach to pre-training and post-training, with a strong emphasis on synthetic data generation and careful organic data curation, provides valuable insights for the future development of efficient and capable language models.
Published via Towards AI