Synthetic Data Generation for Fine-Tuning LLMs
Last Updated on January 27, 2025 by Editorial Team
Author(s): Kamban Parasuraman
Originally published on Towards AI.
General-purpose large language models (LLMs) like LLaMA, Mistral, and Phi excel at answering a wide range of generic questions because they are trained on vast amounts of publicly available data, giving them a broad understanding of many topics. However, when applied to specialized domains, they are less effective at grasping an industry's nuanced language. Enterprises collect and store large volumes of data, including proprietary and sensitive information, for internal use. Because this information is inaccessible to the public, general-purpose LLMs face limitations, such as not understanding domain-specific terms and acronyms, when applied to industry-specific tasks.
In the insurance sector, this dynamic is particularly evident. Insurance companies collect proprietary information, including detailed client profiles, risk assessments, and claims histories. This data is crucial for accurate underwriting, policy pricing, and fraud detection. However, because this information isn't shared publicly, general-purpose LLMs cannot access it, leading to sub-optimal performance when applied to insurance-specific applications. Custom-trained models, built with industry-specific data, are essential for delivering the precision and insights required in the insurance industry.
This article will explore how we can generate synthetic data for fine-tuning general-purpose LLMs. Dive in and discover how to overcome the limitations of general-purpose LLMs and adapt them to suit your specialized needs.
What is Synthetic Data?
Synthetic data is information that is artificially generated rather than produced by events in the real world. It is derived from existing datasets or models, and it replicates the statistical properties and characteristics of real-world data.
There are several ways to create synthetic data, including:
(a) Adding noise: Generate Gaussian or uniform noise and add it to the data. This approach captures the variability and randomness we can expect in real-world data (see the sketch after this list).
(b) Transformations: Apply geometric transformations such as rotations, translations, dilations, or scaling to produce different perturbations of the data.
(c) Statistical: Statistical models sample from different distributions (e.g., Normal, Exponential, Gamma) to create a multivariate dataset. Copulas are a widely used statistical approach for generating samples that preserve the statistical properties of individual attributes as well as their correlation structure (also sketched below).
(d) Generative Adversarial Networks (GANs): GANs are a class of generative models that can also be applied to tabular data. A GAN architecture comprises two sub-models, a generator and a discriminator, that compete with each other to produce datasets that mimic real-world data. The principal role of the generator is to produce synthetic data that mimics the training dataset so closely that the discriminator cannot distinguish it from real data. For more details on GANs, please see my blog on applying GANs to simulate stochastic weather data.
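To make approaches (a) and (c) concrete, here is a minimal sketch using NumPy. The array names and parameter values are illustrative placeholders, not part of the application described later in this article.

import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a real two-attribute dataset (e.g., age and annual premium).
real = rng.normal(loc=[50.0, 100.0], scale=[5.0, 20.0], size=(1000, 2))

# (a) Adding noise: perturb each record with Gaussian noise scaled to
# a fraction of each attribute's standard deviation.
noisy = real + rng.normal(scale=0.1 * real.std(axis=0), size=real.shape)

# (c) Statistical sampling: fit a multivariate normal to the data's mean
# and covariance, then draw new records that preserve the correlation
# structure (a Gaussian copula with normal marginals reduces to this case).
synthetic = rng.multivariate_normal(real.mean(axis=0), np.cov(real, rowvar=False), size=1000)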
Why do we Need Synthetic Data?
Just as real-life experiences shape biological neural networks, data shapes artificial neural networks.
Real data can be hard to come by, or it may contain sensitive or confidential information that we can't readily access. Real-world data also tends to be unstructured. In contrast, synthetic data is cheap to generate and can be perfectly labeled and structured. It can help organizations fine-tune and test their models on scenarios that don't exist in their data, and it can reduce the bias present in real-world datasets, helping to make AI models fairer, more accurate, and more trustworthy.
While synthetic data provides domain-specific, well-labeled, high-volume data at a reasonable cost, it comes with its own set of challenges. Because synthetic data is derived from existing data and models, it cannot generate genuinely novel, yet plausible, events that have not occurred.
Synthetic Data Generator Framework
We will develop a Streamlit application that allows users to upload documents and generate synthetic datasets using the LLaMA 3 8B model. All of this will run locally on our machines using Ollama. For a deeper dive into developing local RAG applications with open-source models, please see my earlier blog post.
Typically in RAG applications, LLMs generate answers to a given question based on a specified context, and the quality of those responses depends on the quality of the context retrieved. Here, however, we employ an "UNO-reverse" strategy: instead of seeking answers, we prompt the LLM to generate questions it can answer. We first segment the document into smaller context windows, and for each context we ask the LLM to generate a question that it can answer from that context. This leverages the LLM's ability to understand a passage and produce contextually relevant questions.
The ultimate goal of the synthetically generated data is to fine-tune general-purpose LLMs. Along with the generated questions, we also extract corresponding reference answers, which can then serve as the "ground truth" against which an LLM is fine-tuned. In summary, for each context window we prompt the LLM to generate a question, and then use the generated question together with the context to elicit a reference answer from the LLM. At the end of this step, we have a tuple consisting of Context, Question, and Reference Answer. Below is the system prompt used to generate the question-answer pairs for a given context window.
question_answer_prompt = """\
You are an AI system that constructs question-answer pairs based on a given Context.
Your response should always have two parts, and label the parts as follows:
Question: [Insert question here]
Answer: [Insert answer here]
Strictly follow the response format. Please do not include any
additional text before or after the Question and Answer labels.
Context:
{context}
"""
The source code for the Synthetic Data Generation application is available on my GitHub page.
Generating Synthetic Data
We'll use the "Attention Is All You Need" paper as an input to the application to demonstrate how it works and generates synthetic datasets. To create the context windows, we employ the standard RecursiveCharacterTextSplitter, which produces 31 chunks for the referenced paper, each representing a context window. The application generates a Question and Reference Answer for each context using the LLaMA 3 8B model, while the Mistral model generates a response to compare against the Reference Answer. The resulting synthetic dataset consists of 31 data points for fine-tuning LLMs.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def TextSplit(raw_text):
    """Split a list of loaded Document objects into overlapping chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # The maximum size of each text chunk
        chunk_overlap=80,  # The number of characters to overlap between chunks
        length_function=len,  # Function used to measure chunk length
        is_separator_regex=False,  # Indicates whether the separators are regexes
    )
    # Split the loaded documents into chunks and return the result
    return text_splitter.split_documents(raw_text)
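Tying the pieces together, a hypothetical end-to-end loop might look like the following. The PDF file name is a placeholder, and generate_qa_pair refers to the earlier sketch rather than the application's actual code.

from langchain_community.document_loaders import PyPDFLoader

# Load the paper, split it into context windows, and generate one
# question-answer pair per chunk.
docs = PyPDFLoader("attention_is_all_you_need.pdf").load()
chunks = TextSplit(docs)
dataset = [generate_qa_pair(chunk.page_content) for chunk in chunks]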
The table below presents the synthetic dataset generated by the application; for brevity, only 4 of the 31 data points are shown. The "Context" column contains text chunks from the document. The "Question" and "Reference Answer" columns display the questions and reference answers generated by the LLaMA 3 8B model. The "Response" column includes the responses from the Mistral model. As the sample training instances in the table demonstrate, the application generated a variety of questions, including fact-based and contextual ones.
Fine-Tuning LLMs
Fine-tuning LLMs is beyond the scope of this blog and will be addressed in a future article. However, to illustrate how the synthetic data generated by this application can be used to fine-tune generic LLMs, we use the Mistral 7B model to provide a different perspective on each generated question. This stands in for the variation we can expect when fine-tuning the weights of a general-purpose LLM. The objective during fine-tuning would be to minimize the cross-entropy loss between the Reference Answer and the model's Response at each iteration of the weight updates.
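To make that objective concrete, here is a minimal, illustrative sketch of a token-level cross-entropy loss in PyTorch. It is not the application's code; the model and tokenizer are assumed to be a Hugging Face causal LM such as Mistral 7B, and the function name is hypothetical.

import torch
import torch.nn.functional as F

def qa_loss(model, tokenizer, question, reference_answer):
    # Concatenate the question (prompt) and reference-answer token IDs.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    target_ids = tokenizer(reference_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Each reference-answer token is predicted from the tokens before it,
    # so take the logits at the positions just before each target token.
    answer_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    # Cross-entropy between the predicted distributions and the reference tokens.
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        target_ids.reshape(-1),
    )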
Final Thoughts
Off-the-shelf, general-purpose LLMs excel at contextualizing information, but they fall short when applied to domain-specific tasks. While prompt engineering can help tease out some relevant information, these language models inherently struggle to understand industry-specific lingo.
Fine-tuning offers a viable solution to overcome these limitations, transforming general-purpose LLMs to fit specific industry needs. Fine-tuning LLMs requires a substantial amount of quality data (Question-Reference Answer pairs), and the application shared in this article provides an effective solution to quickly build and scale enterprise-specific synthetic datasets.
Thanks for reading this article! All feedback is appreciated. For any questions, feel free to contact me.
If you liked this article, here are some other articles you may enjoy:
Stochastic Weather Generator using Generative Adversarial Networks
Modeling Multivariate Distributions using GANs
Building a Local Committee-of-Expert (CoE) RAG Application for Document Discovery
Transforming the Landscape of Insurance and Reinsurance
The views expressed in this article are my own and do not necessarily reflect the views of my employer.
Published via Towards AI