Synthetic Data Generation for Fine-Tuning LLMs
Last Updated on January 27, 2025 by Editorial Team
Author(s): Kamban Parasuraman
Originally published on Towards AI.
General-purpose large language models (LLMs) like LLaMA, Mistral, and Phi excel at answering a wide range of generic questions because they are trained on vast amounts of publicly available data, giving them a broad understanding of many topics. However, when applied to specialized domains, they are less effective at grasping an industry's nuanced language. Enterprises collect and store large volumes of data, including proprietary and sensitive information, for internal use. Because this information is inaccessible to the public, general-purpose LLMs face limitations, such as not understanding domain-specific terms and acronyms, when applied to industry-specific tasks.
In the insurance sector, this dynamic is particularly evident. Insurance companies collect proprietary information, including detailed client profiles, risk assessments, and claims histories. This data is crucial for accurate underwriting, policy pricing, and fraud detection. However, because this information isn't shared publicly, general-purpose LLMs cannot access it, leading to sub-optimal performance when applied to insurance-specific applications. Custom-trained models, built with industry-specific data, are essential for delivering the precision and insights required in the insurance industry.
This article will explore how we can generate synthetic data for fine-tuning general-purpose LLMs. Dive in and discover how to overcome the limitations of general-purpose LLMs and adapt them to suit your specialized needs.
What is Synthetic Data?
Synthetic data is information that is artificially generated rather than produced by events in the real world. It is derived from existing datasets or models, and it replicates the statistical properties and characteristics of real-world data.
There are several ways to create synthetic data, including:
(a) Adding noise: Generate Gaussian or uniform noise and add it to the data. This approach captures the variability and randomness we can expect in real-world data (see the sketch after this list).
(b) Transformations: Apply geometric transformations such as rotations, translations, dilations, or scaling to produce different perturbations of the data.
(c) Statistical: Statistical models sample from different distributions (e.g., Normal, Exponential, Gamma) to create a multivariate dataset. Copulas are a widely used statistical approach for generating samples that preserve the statistical properties of individual attributes as well as their correlation structure (also sketched below).
(d) Generative Adversarial Networks (GANs): GANs are a class of generative models that can also be applied to tabular data. A GAN architecture comprises two sub-models, a generator and a discriminator, that compete with each other to produce datasets that mimic real-world data. The principal role of the generator is to produce synthetic data that mimics the training dataset so closely that the discriminator cannot distinguish it from real data. For more details on GANs, please see my blog on applying GANs to simulate stochastic weather data.
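To make approaches (a) and (c) concrete, here is a minimal sketch using NumPy. The array names and parameter values are illustrative placeholders, not part of the application described later in this article.

import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a real two-attribute dataset (e.g., age and annual premium).
real = rng.normal(loc=[50.0, 100.0], scale=[5.0, 20.0], size=(1000, 2))

# (a) Adding noise: perturb each record with Gaussian noise scaled to
# a fraction of each attribute's standard deviation.
noisy = real + rng.normal(scale=0.1 * real.std(axis=0), size=real.shape)

# (c) Statistical sampling: fit a multivariate normal to the data's mean
# and covariance, then draw new records that preserve the correlation
# structure (a Gaussian copula with normal marginals reduces to this case).
synthetic = rng.multivariate_normal(real.mean(axis=0), np.cov(real, rowvar=False), size=1000)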
Why do we Need Synthetic Data?
Just as real-life experiences shape biological neural networks, data shapes artificial neural networks.
Real data can be hard to come by, or it may contain sensitive or confidential information that we can't readily access. Real-world data also tends to be unstructured. In contrast, synthetic data is cheap to generate and can be perfectly labeled and structured. It can help organizations fine-tune and test their models on scenarios that don't exist in their data, and it can reduce the bias present in real-world datasets, helping to make AI models fairer, more accurate, and more trustworthy.
While synthetic data provides domain-specific, well-labeled, high-volume data at a reasonable cost, it comes with its own set of challenges. Because synthetic data is derived from existing data and models, it cannot generate genuinely novel, yet plausible, events that have not occurred.
Synthetic Data Generator Framework
We will develop a Streamlit application that allows users to upload documents and generate synthetic datasets using the LLaMA 3 8B model. All of this will run locally on our machines using Ollama. For a deeper dive into developing local RAG applications with open-source models, please see my earlier blog post.
Typically in RAG applications, LLMs generate answers to a given question based on a specified context, and the quality of those responses depends on the quality of the context retrieved. Here, however, we employ an "UNO-reverse" strategy: instead of seeking answers, we prompt the LLM to generate questions it can answer. We first segment the document into smaller context windows, and for each context we ask the LLM to generate a question that it can answer from that context. This leverages the LLM's ability to understand a passage and produce contextually relevant questions.
The ultimate goal of the synthetically generated data is to fine-tune general-purpose LLMs. Along with the generated questions, we also extract corresponding reference answers, which can then serve as the "ground truth" against which an LLM is fine-tuned. In summary, for each context window we prompt the LLM to generate a question, and then use the generated question together with the context to elicit a reference answer from the LLM. At the end of this step, we have a tuple consisting of Context, Question, and Reference Answer. Below is the system prompt used to generate the question-answer pairs for a given context window.
question_answer_prompt = """\
You are an AI system that constructs question-answer pairs based on a given Context.
Your response should always have two parts, and label the parts as follows:
Question: [Insert question here]
Answer: [Insert answer here]
Strictly follow the response format. Please do not include any
additional text before or after the Question and Answer labels.
Context:
{context}
"""
The source code for the Synthetic Data Generation application is available on my GitHub page.
Generating Synthetic Data
We'll use the "Attention Is All You Need" paper as an input to the application to demonstrate how it works and generates synthetic datasets. To create the context windows, we employ the standard RecursiveCharacterTextSplitter, which produces 31 chunks for the referenced paper, each representing a context window. The application generates a Question and Reference Answer for each context using the LLaMA 3 8B model, while the Mistral model generates a response to compare against the Reference Answer. The resulting synthetic dataset consists of 31 data points for fine-tuning LLMs.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def TextSplit(raw_text):
    """Split a list of loaded Document objects into overlapping chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # The maximum size of each text chunk
        chunk_overlap=80,  # The number of characters to overlap between chunks
        length_function=len,  # Function used to measure chunk length
        is_separator_regex=False,  # Indicates whether the separators are regexes
    )
    # Split the loaded documents into chunks and return the result
    return text_splitter.split_documents(raw_text)
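Tying the pieces together, a hypothetical end-to-end loop might look like the following. The PDF file name is a placeholder, and generate_qa_pair refers to the earlier sketch rather than the application's actual code.

from langchain_community.document_loaders import PyPDFLoader

# Load the paper, split it into context windows, and generate one
# question-answer pair per chunk.
docs = PyPDFLoader("attention_is_all_you_need.pdf").load()
chunks = TextSplit(docs)
dataset = [generate_qa_pair(chunk.page_content) for chunk in chunks]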
The table below presents the synthetic dataset generated by the application; for brevity, only 4 of the 31 data points are shown. The "Context" column contains text chunks from the document. The "Question" and "Reference Answer" columns display the questions and reference answers generated by the LLaMA 3 8B model. The "Response" column includes the responses from the Mistral model. As the sample training instances in the table demonstrate, the application generated a variety of questions, including fact-based and contextual ones.
Fine-Tuning LLMs
Fine-tuning LLMs is beyond the scope of this blog and will be addressed in a future article. However, to illustrate how the synthetic data generated by this application can be used to fine-tune generic LLMs, we use the Mistral 7B model to provide a different perspective on each generated question. This stands in for the variation we can expect when fine-tuning the weights of a general-purpose LLM. The objective during fine-tuning would be to minimize the cross-entropy loss between the Reference Answer and the model's Response at each iteration of the weight updates.
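To make that objective concrete, here is a minimal, illustrative sketch of a token-level cross-entropy loss in PyTorch. It is not the application's code; the model and tokenizer are assumed to be a Hugging Face causal LM such as Mistral 7B, and the function name is hypothetical.

import torch
import torch.nn.functional as F

def qa_loss(model, tokenizer, question, reference_answer):
    # Concatenate the question (prompt) and reference-answer token IDs.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    target_ids = tokenizer(reference_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Each reference-answer token is predicted from the tokens before it,
    # so take the logits at the positions just before each target token.
    answer_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    # Cross-entropy between the predicted distributions and the reference tokens.
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        target_ids.reshape(-1),
    )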
Final Thoughts
Off-the-shelf, general-purpose LLMs excel at contextualizing information, but they fall short when applied to domain-specific tasks. While prompt engineering can help tease out some relevant information, these language models inherently struggle to understand industry-specific lingo.
Fine-tuning offers a viable solution to overcome these limitations, transforming general-purpose LLMs to fit specific industry needs. Fine-tuning LLMs requires a substantial amount of quality data (Question-Reference Answer pairs), and the application shared in this article provides an effective solution to quickly build and scale enterprise-specific synthetic datasets.
Thanks for reading this article! All feedback is appreciated. For any questions, feel free to contact me.
If you liked this article, here are some other articles you may enjoy:
Stochastic Weather Generator using Generative Adversarial Networks
Modeling Multivariate Distributions using GANs
Building a Local Committee-of-Expert (CoE) RAG Application for Document Discovery
Transforming the Landscape of Insurance and Reinsurance
The views expressed in this article are my own and do not necessarily reflect the views of my employer.
Published via Towards AI