Fine-Tuning LLMs with Synthetic Data for High-Quality Content Generation
Last Updated on July 17, 2024 by Editorial Team
Author(s): Vin Busquet
Originally published on Towards AI.
Table of Contents
· Table of Contents
· The POC Trek Begins
· Fine-Tuning VS RAG
∘ What is fine-tuning?
∘ So, what is an LLM?
∘ And what is this RAG thing?
· Choosing the Right Format
· Generating Synthetic Data
∘ An Introduction to Synthetic Data: Foundations and Techniques
∘ What I Did and How I Did It: Distillation in Action
· Fine-Tuning in Action
∘ Training and Validation Set
∘ Training Costs
∘ Training Jobs
∘ Analysis of the Training Logs
· In-Context Learning Setup
· Evaluating Performance
∘ Evaluation Conclusion
· Extra Surprise: Detecting AI Content
· Reflecting on the Journey
· References
The POC Trek Begins
A global consulting company hired me a few months ago to work with their Head of Technology and Innovation and Head of Data Science on developing a Proof of Concept (POC, as I will abbreviate in this article) AI app for a technical document generator using GenAI (LLM-based, to be more specific).
Using Azure's OpenAI models, the company had already built an in-house prototype using prompt engineering and RAG over their data sources months before my contract, but the results were far from ideal. It struggled to replicate the original document structures, especially their specific technical language and complexity. So this sounded to me like a compelling case for LLM fine-tuning.
They also had an extensive repository of over 700,000 high-quality technical documents produced over the last 5 years. They imposed two non-negotiable constraints on me: the final prototype had to use Azure for their entire infrastructure and internal integration logic, and I was restricted to OpenAI models, specifically those managed by Microsoft under Azure AI Studio. The main reason is that the Azure-hosted models come with the compliance certifications and standards they must adhere to, which the default OpenAI APIs don't provide.
The prototype should follow the same user experience as its predecessor: the specialist fills out a form with a bunch of structured (fixed) questions, and the document generator should create a technical document as close to what a human specialist would do. They gave me access to approximately 1,500 existing technical documents that covered some categories, as well as some limited access to their data sources for use in the generation logic.
After we agreed on the scope and limitations of the POC, the work started.
Fine-Tuning VS RAG
Before discussing the details of this project, I would like to outline the differences between these two approaches. Contrary to what the title of this section suggests, the two solutions are complementary and can be used together, which in some cases leads to a synergistic solution.
What is fine-tuning?
While GenAI buzzwords are all over the internet, based on conversations I've been having lately with ordinary people, it seems like the burning questions are, "What exactly is an LLM, and how does it chat so naturally?" and, "What on earth is a 'language model' anyway?".
Check out the following image for a nice explanation (don't worry about the math details in the formal explanation):
A language model is a system that probabilistically predicts the following words or characters for a given sequence of words or characters.
Prompt is what we call the model's input, and Completion is the name of the language model's output.
You have probably been using language models every day for decades without even realizing it (predictive text and autocomplete are classic examples).
So, what is an LLM?
A Large Language Model (commonly referred to as an LLM) is an advanced type of language model whose main differences lie in its architecture, which favors parallel training, and in its sheer size and complexity. To put it simply, the architecture of these models favors masking multiple different inputs and adding attention mechanisms. The Transformer's self-attention mechanism, in particular, is a key innovation that enables LLMs to handle context and relationships within the text more effectively and to parallelize training on an extensive corpus of text. The math behind it and the parallelization allow the use of highly expensive GPU clusters for the training cycle, scaling up the training and the models' knowledge by a huge factor. A training session can span weeks or even months and cost several million dollars.
The Transformer architecture was developed by Google researchers and introduced in the 2017 paper "Attention Is All You Need".
Once the training period is finished, the model not only exhibits fundamental knowledge of a language and its structure but also showcases much more than that; it appears to gain several insights into general world-model concepts and connections, demonstrating elements of reasoning and some level of mathematical logic. The full extent of LLMs' emergent capabilities is still a hotly debated topic and an active research area.
This process results in a pretrained model, which is basically a frozen model that contains a wealth of knowledge. But yet, it is still a language model: given a text sequence input, it will try to predict the next sequence of words.
To make more use of it, a set of fine-tuning training processes happens on top of the previous pre-trained model in a way that avoids destroying its previous knowledge. This process aims to train the model on a set of tasks that are more focused on Q&A and chat style, thereby transforming it from a pure language model to a more interactive and user-centered assistant. This places the model in a category known as instruction-tuned LLM.
Prior to making the model available to the public, there is a phase called Model Alignment. This process ensures that the model's outputs align with human values, intentions, and objectives. It involves training the model to avoid producing harmful content and to focus on generating safe and responsible results.
Just a side note: to avoid confusion, in mainstream media and marketing material, the term pretrained model is often used to refer to the publicly released model, not to the initial LLM training cycle that I mentioned. Big, publicly released models like this are also called foundation models.
Finally, after this lengthy explanation, we can discuss user-custom fine-tuning, which some companies, such as OpenAI, allow the API user to do with their closed model (for open source, obviously, it is always available and typically involves a more complex process). Those custom fine-tunings, which I will refer to in the rest of this article as fine-tuning only, help adapt the publicly available large language model to perform well on specific tasks, making it more task-specific and sometimes even gaining knowledge over proprietary data.
In the particular case of the project's POC that this article is discussing, the goal of fine-tuning is to enable the model to generate documents with the appropriate structure and technical language, a feature that was not achieved with prompt engineering and RAG alone.
And what is this RAG thing?
As I previously mentioned, the models don't learn in real time; they only learn during training sessions, and this is usually true for the entire machine learning field. As the training process for LLMs is resource-intensive, costly, and time-consuming, it happens only at intervals of months (sometimes more), and the model's knowledge quickly becomes outdated. Frequent custom fine-tuning cycles are an option, but beyond being expensive, doing so indiscriminately can lead to a problem known as Catastrophic Forgetting (catastrophic interference is another common term for this phenomenon), where the model forgets previously learned knowledge. Plus, the models don't have access to real-time data. A more viable solution to deal with this is RAG.
RAG stands for Retrieval Augmented Generation, the name given to a family of processes that focus on connecting the LLM to external sources through retrieval mechanisms. It combines the generative capabilities of the model with the ability to search for and incorporate relevant information from one knowledge base (or several).
There are different ways of classifying such systems, but most of them vary based on a few factors:
- Source of Information: Those sources can be literally anything from traditional databases, vector databases, knowledge graphs, to the internet itself.
- Retrieval Mechanism: As the sources are so varied, the same is true for the methods used to collect information, such as search engines, APIs, customized database searches, etc.
- Integration Method: It is also common to classify RAG systems based on how they are incorporated with the LLM to generate the completion process.
I will only focus on explaining the difference in the integration logic in this article, as it was the only noticeable change I made regarding the original prototype.
The RAG mechanism can be integrated as soon as the user submits the prompt, BEFORE the information reaches the LLM for completion. In this case, the RAG process happens every time the user enters a new input prompt, and the results of this process are used to enhance the user prompt by the time it hits the model.
Or the RAG process can occur AFTER the prompt reaches the LLM. In this scenario, the model is used as a reasoning engine to decide whether it needs to trigger RAG processes or not (and what mechanisms to use) to generate the appropriate completion based on the perceivable context. This process is usually known as Agentic RAG. In this scenario, the retrieval process doesnβt happen all the time, like with the other integration approach.
As a last note, it is also common to classify the RAG process based on its internal logic and complexity. Following this approach, we typically divide it into naive RAG, advanced (complex) RAG, modular RAG, hybrid RAG, etc. Since this is a diverse and complex area with reliable sources, I'll just mention that we used Advanced RAG for POC purposes because their previous prototype did so. If you are interested in learning more about different RAG mechanisms, I recommend Vipra Singh's article on Advanced RAGs.
The main change I made to the POC's RAG process was related to how it is triggered: I used the agentic RAG approach and made all the changes and enhancements to the existing complex RAG mechanisms needed to accommodate that. Additionally, I fine-tuned the model to decide which specific RAG strategy would be more effective in improving its completion.
Choosing the Right Format
Coming back to the POC, the first step was to decide on the best file format for the documents and how exactly the training set was going to be built.
All the available files were in PDF and docx formats. Neither seemed suitable, because they carry too much unneeded data related to text styling, fonts, etc., and we only needed the semantic content and some level of textual structure.
Considering the requirements, the Markdown format (also known as MD) appeared to be a more viable option because it preserves structure (tables, headings, lists), keeps some level of semantics (bold, italics, code blocks), and offers a good level of context preservation (it allows for the inclusion of image links, alt text, etc.). In addition, MD is heavily used online, so it is also a widely known format among LLMs.
To convert the docx files into MD, I used the pypandoc library, as you can check in the following code:
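A minimal sketch of that conversion looks like this (the folder paths and the choice of GitHub-flavored Markdown output are assumptions for illustration):

```python
# Minimal sketch: convert every .docx file to Markdown with pypandoc.
# Assumes pypandoc and the Pandoc binary are installed; paths are illustrative.
from pathlib import Path

import pypandoc

SOURCE_DIR = Path("data/docx")       # hypothetical input folder
OUTPUT_DIR = Path("data/markdown")   # hypothetical output folder
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for docx_file in SOURCE_DIR.glob("*.docx"):
    md_path = OUTPUT_DIR / f"{docx_file.stem}.md"
    # convert_file reads the docx and writes GitHub-flavored Markdown.
    pypandoc.convert_file(
        str(docx_file),
        to="gfm",
        outputfile=str(md_path),
        extra_args=["--wrap=none"],  # keep each paragraph on a single line
    )
```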
After that, the immediate step was more related to understanding the individual size and complexity of the existing documents.
So I created a dedicated Jupyter notebook to run some traditional NLP analysis on the documents. Not all of the analyses are worth mentioning, but I will share a few that I find interesting and can share here.
One of the initial metrics I wanted to know was the token size for each document.
As of this writing, the OpenAI models can only generate a maximum completion of 4096 tokens, so I needed to keep only the documents at or below this token limit. The team agreed that dealing with multi-prompt logic for document generation would be too complex to handle properly in this POC and also more prone to completion distortion.
So, we trimmed the document set down to 1139 documents for the project.
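A sketch of how such a filter can be computed with tiktoken (assuming the cl100k_base encoding used by the GPT-3.5/GPT-4 families; paths are illustrative):

```python
# Sketch: count tokens per document with tiktoken and keep only those that
# fit the 4096-token completion limit. Encoding choice and paths are assumptions.
from pathlib import Path

import pandas as pd
import tiktoken

MAX_COMPLETION_TOKENS = 4096
encoding = tiktoken.get_encoding("cl100k_base")

rows = []
for md_file in Path("data/markdown").glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    rows.append({"document": md_file.name, "tokens": len(encoding.encode(text))})

docs = pd.DataFrame(rows)
eligible = docs[docs["tokens"] <= MAX_COMPLETION_TOKENS]
print(f"{len(eligible)} of {len(docs)} documents fit within the completion limit")
```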
Another interesting metric to share is the average readability score. For that, I used Textstat, a Python library for calculating statistics from text, more specifically readability, complexity, and grade level.
For more details on how to use it and on what each metric means, please check https://github.com/textstat/textstat, as those details are out of the scope of this article. The following is a snippet of the code used:
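The code was roughly along these lines (a sketch; the metrics shown are the usual ones Textstat exposes, and the file path is illustrative):

```python
# Sketch: compute a few standard Textstat readability metrics per document.
from pathlib import Path

import textstat

def readability_report(text: str) -> dict:
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        "consensus_grade": textstat.text_standard(text, float_output=True),
    }

# Illustrative path to one of the converted Markdown documents.
sample = Path("data/markdown/example_document.md").read_text(encoding="utf-8")
print(readability_report(sample))
```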
The readability results suggest the documents are difficult for both humans and LLMs to fully comprehend. The average scores across the different metrics indicate at least a college reading level, with some metrics showing graduate or higher levels.
This helped me better understand why the previous prototype, using prompt engineering and RAG alone, failed. It also reinforced the idea that fine-tuning on top of the foundation model was required to teach the model the thought process needed to generate accurate, high-quality documents from this data.
Maybe it would've required more data, but at the time, I believed that 1000–1500 documents were enough to prove the point for a POC.
Generating Synthetic Data
As I already said, fine-tuning is a way to take a model that has already been trained and make it work better on a certain task or dataset.
An Introduction to Synthetic Data: Foundations and Techniques
In other areas of machine learning, synthetic data generation has already proven, when well done, to be useful in helping with model training.
Instead of using data gathered from the internet, curated, or labeled by human beings, synthetic data uses other AI models or heuristics in simulated settings to generate data for training a model. It is also useful to mitigate privacy and copyright problems, as it doesn't rely on real user data or material that is safeguarded by intellectual property rights.
The creation of synthetic data is usually achieved through two different approaches: distillation, which extracts information from a more powerful model, and self-improvement, which uses the model's own outputs. Distillation transfers information and reasoning skills from a highly skilled model to a less skilled one, while self-improvement has the model iteratively learn from its own replies to enhance its outputs.
The most prominent publications in this field were released within 24 hours of each other in December 2022: "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor", which focuses on data generation by distilling it from a more powerful model, and "Self-Instruct: Aligning Language Models with Self-Generated Instructions", which bootstraps synthetic data from the model itself.
Feel free to check for more details on each paper.
Since the release of "Unnatural Instructions", several models have been fine-tuned using distilled synthetic data techniques, usually from OpenAI APIs.
For instance, Stanford's Center for Research on Foundation Models (CRFM) developed Alpaca, an instruction-following model that is a fine-tuned version of Meta's LLaMA 7B model. The study used 175 human-written instruction-output pairs from the Self-Instruct paper (a seed set made available on GitHub) and prompted GPT-3 to generate more instructions using the seed set as examples. The process was simple and cost-effective, resulting in 52K unique instructions and outputs, and they reported that it cost less than $500.
Also, other researchers have studied complex distillation approaches in models like Vicuna, WizardLM (Microsoft), Orca (Microsoft), and an ever-growing list, usually refining smaller models using synthetic data, mostly from GPT-3.5 and GPT-4.
On the other hand, Self-Alignment with Instruction Backtranslation (Meta) is a famous self-improvement example, in which the authors demonstrated progressively improved performance by utilizing the same model's ability to create and improve synthetic data.
What I Did and How I Did It: Distillation in Action
For the POC, I opted for the distillation technique: creating synthetic data using larger models like GPT-4, gathering enough data to fine-tune GPT-3.5 Turbo, a smaller model, and, as you will see, producing a task-specific model for high-quality technical documentation.
As of writing this article, OpenAI and Azure OpenAI exclusively provide fine-tuning for the GPT-3.5 family.
According to their documentation, you must format the dataset as a JSONL file, where each line is a JSON object containing the system prompt, the user input, and the assistant/model completion. OpenAI provides an illustrative example in their documentation:
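The expected structure looks like the following sketch (the contents are placeholders):

```python
# Shape of one chat-format fine-tuning example (placeholder contents).
import json

example = {
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user input>"},
        {"role": "assistant", "content": "<target completion>"},
    ]
}

# In the actual .jsonl file, every example sits on its own single line:
with open("training_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```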
Note: each JSON object should sit on a single line in the .jsonl file; it is shown expanded above only to help visualize its attributes.
More specifically, in this case, as I was using the agentic RAG approach, this was the expected dataset format (fine-tuning with function calling; another example from the documentation):
Again, as this is a JSONL file, everything should be on one line, one line per object.
You can see that the fine-tuning logic is limited to this conversational structure. Later, I will go into more detail about it, but for now, I just wanted to point out this limitation, at least compared to open-source models.
For the POC training set, the required data was a basic system prompt for document generation, a set of inputs with the questions and answers as the user prompt, the existing document as the assistant's completion, and a mapping to the RAG mechanisms it could trigger. Since we didn't have any sort of input or associated historical data for the docs, creating synthetic data really seemed like the closest viable solution, and my second notebook was focused exclusively on that.
I worked with the specialists to expand the available data for 12 files by creating the Q&A inputs that would serve as the user prompt for document generation.
The idea was, for every existing document, to create answers to the static, structured questions we wanted to use in the technical document generator, and also to list which data sources (and, consequently, which RAG mechanisms) would have been consulted to gather the data needed to build that document.
Obviously, it wasn't feasible for the specialists to do this for all 1139 existing documents, as it was a very expensive and time-consuming process, and that's why we needed an effective data generation logic.
For each doc, the specialists also created an independent set of free-form questions and answers, simulating data that could have been used to generate that same document. With both sets of data, figuring out which model generated the best output took some time and was very iterative, with back-and-forths between me and the specialist team. Eventually, we found that GPT-4o had the best performance and was also the cheapest model in the GPT-4 branch.
To generate the data, I provided the 12 proposals in a big prompt to the model, using a prompt engineering technique called few-shot learning. In this setting, we provide the model with a set of examples of a specific input and its expected output, trying to teach the model a pattern within the prompt itself, without performing any training. In this case, the input example was the proposal, and the output was the Q&A created by the specialists.
Although it seems to work poorly for more complex data patterns, few-shot learning is extremely effective for use cases like text classification, sentiment analysis, etc.
One of the disadvantages of this technique is that you need to provide a dense prompt with every request, which considerably increases the cost per generation.
It is also worth mentioning that GPT-4o family usage costs 10x more per token than the default GPT-3.5 family.
An example of code logic used (you can check more details about it in LangChain docs about few-shot learning):
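The logic looked roughly like this sketch, assuming LangChain's FewShotChatMessagePromptTemplate and the Azure OpenAI chat integration (the deployment name, prompts, and the two placeholder examples are illustrative; the real setup used the 12 specialist-curated examples):

```python
# Sketch of the few-shot setup with LangChain; names and contents are assumptions.
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)
from langchain_openai import AzureChatOpenAI

# Placeholder examples; the real notebook used 12 specialist-curated pairs.
examples = [
    {"document": "<full Markdown of technical document 1>",
     "answers": "<specialist answers to the structured questions>"},
    {"document": "<full Markdown of technical document 2>",
     "answers": "<specialist answers to the structured questions>"},
]

example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{document}"),
    ("ai", "{answers}"),
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer a fixed set of structured questions based on a technical document."),
    few_shot_prompt,
    ("human", "{document}\n\n{structured_questions}"),
])

# Hypothetical Azure deployment name and API version.
llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-02-01", temperature=0)
chain = final_prompt | llm
# response = chain.invoke({"document": doc_markdown, "structured_questions": questions})
```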
In this case, the input was the existing document itself, and the output was the answers to the static set of questions (which I'm calling structured questions). I supplied the 12 examples in the system prompt, and the subsequent human message consisted of the document and the static, structured questions, expecting the model to generate the answers based on the document content.
It was a very iterative process, as I generated samples and sought validation from the specialists. They provided me with a great deal of help until we identified the appropriate prompt and setup for the model to start generating valuable answers.
Once that was in place, I used the optimized setup to generate two different types of data from all the remaining 1026 documents:
- Answers for the Structured Questions: the inputs were the existing document and the fixed, structured questions, and the output was the generated answers to those questions, based on the document content.
- Free-Form Q&A: the input was the existing document, and the output was a set of free-form questions and answers that could've been used to generate that document, following the specialists' few-shot examples.
The entire synthetic data generation, which produced both structured and free-form data for each of the 1139 documents, cost approximately $680.
With this data ready, the next step was to create the JSONL dataset files.
Fine-Tuning in Action
Finally, the anticipated moment of fine-tuning is here. As previously discussed, it involves a training approach that is kind of similar to other machine learning training cycles. Let me give a basic explanation of how it works.
The fourth notebook was all focused on fine-tuning: "LLM_Fine_Tuning_for_Technical_Document_Generation".
Training and Validation Set
The following JSON object is an example of the data each line in the training JSONL file contains. In this case, it is pretty-printed just to show the reader the object's internal structure, but in the training JSONL, each line is an entire object inlined, representing one item. In our case, the system message is the default system message that will be used in the POC once this model is fine-tuned, the user prompt is a string with the questions and answers, and the assistant completion is an existing proposal that the questions and answers map to.
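It was shaped roughly like this (all contents here are invented placeholders, not the real system prompt, questions, or document):

```python
# Illustrative shape of one training item for this POC; contents are placeholders.
training_item = {
    "messages": [
        {
            "role": "system",
            "content": "You are a technical document generator. Given the answers "
                       "to the intake questions, write the complete document.",
        },
        {
            "role": "user",
            "content": "Q1: What is the project scope?\nA1: ...\n"
                       "Q2: Which systems are involved?\nA2: ...",
        },
        {
            "role": "assistant",
            "content": "# <Technical document title>\n\n## Overview\n\n"
                       "<full existing document in Markdown>",
        },
    ]
}
```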
For training, it is also standard to put around 70–80% of the data in the training set and 20–30% in the validation set. This ensures the model learns from a broader dataset while being tested on unseen data to validate its performance.
So I created 3 different datasets, each composed of 2 files (a sketch of how the mixed one was assembled follows the listing below):
Structured Answers Dataset
Where each line contains the fixed/structured questions and their generated answers as the user input and the associated existing technical document as the assistant completion.
structured_training_set_v1.jsonl (containing 727 entries)
structured_validation_set_v1.jsonl (containing 311 entries)
Free-form Question & Answers Dataset
Each line contains the generated free-form Q&A as the user input and the associated existing document as the assistant completion.
free_form_training_set_v1.jsonl (containing 727 entries)
free_form_validation_set_v1.jsonl (containing 311 entries)
Mixed Dataset
I joined the previous datasets and shuffled the lines (items) to have a more distributed and richer dataset that could possibly help avoid overfitting (a phenomenon that happens when the model gets ultra-specialized on the training set but performs badly on unseen data, like the validation set and real model usage).
mixed_training_set_v1.jsonl (containing 1,454 entries)
mixed_form_validation_set_v1.jsonl (containing 662 entries)
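A sketch of how the mixed dataset files can be assembled from the other two (file names follow the listing above; the fixed seed is only for reproducibility):

```python
# Sketch: build the mixed dataset by concatenating and shuffling the other two.
import json
import random

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(path, items):
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")

mixed_training = (
    read_jsonl("structured_training_set_v1.jsonl")
    + read_jsonl("free_form_training_set_v1.jsonl")
)
mixed_validation = (
    read_jsonl("structured_validation_set_v1.jsonl")
    + read_jsonl("free_form_validation_set_v1.jsonl")
)

random.seed(42)
random.shuffle(mixed_training)
random.shuffle(mixed_validation)

write_jsonl("mixed_training_set_v1.jsonl", mixed_training)
write_jsonl("mixed_form_validation_set_v1.jsonl", mixed_validation)
```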
Training Costs
As part of the same notebook, I wanted to know how much this fine-tuning training cycle would cost, so I wrote some code to estimate it. I didn't include the code that generated the following output, but you can check here the pricing and the logic behind the costs.
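The estimate can be computed along these lines (a sketch: the price per 1K training tokens and the epoch count below are placeholder assumptions; always check the current Azure OpenAI fine-tuning pricing):

```python
# Sketch: rough training-cost estimate = training tokens x price x epochs.
# The price and epoch count are placeholder assumptions.
import json

import tiktoken

PRICE_PER_1K_TRAINING_TOKENS = 0.008  # placeholder, USD; check current pricing
N_EPOCHS = 3                          # assumed number of training epochs

encoding = tiktoken.get_encoding("cl100k_base")

def estimate_training_cost(path: str) -> float:
    total_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            for message in json.loads(line)["messages"]:
                total_tokens += len(encoding.encode(message["content"] or ""))
    return total_tokens / 1000 * PRICE_PER_1K_TRAINING_TOKENS * N_EPOCHS

for name in ["structured", "free_form", "mixed"]:
    cost = estimate_training_cost(f"{name}_training_set_v1.jsonl")
    print(f"{name}: ~${cost:.2f}")
```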
The actual result ended up being pretty close to the estimate, actually a little bit lower, as I rounded up the values on the estimate.
Training Jobs
With everything set, it was time to call the remote jobs to start the training. The following is the source code used to start the training jobs:
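It boiled down to something like the following (a sketch using the openai v1 Python SDK against Azure OpenAI; the endpoint, API version, base-model name, and suffixes are assumptions that depend on your Azure resource):

```python
# Sketch: upload the dataset files and start one fine-tuning job per dataset.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed API version
)

datasets = {
    "structured": ("structured_training_set_v1.jsonl", "structured_validation_set_v1.jsonl"),
    "free_form": ("free_form_training_set_v1.jsonl", "free_form_validation_set_v1.jsonl"),
    "mixed": ("mixed_training_set_v1.jsonl", "mixed_form_validation_set_v1.jsonl"),
}

jobs = {}
for name, (train_path, valid_path) in datasets.items():
    # Upload the files first; the fine-tuning job references them by id.
    train_file = client.files.create(file=open(train_path, "rb"), purpose="fine-tune")
    valid_file = client.files.create(file=open(valid_path, "rb"), purpose="fine-tune")

    jobs[name] = client.fine_tuning.jobs.create(
        model="gpt-35-turbo-0613",  # Azure naming for GPT-3.5 Turbo; adjust to your region
        training_file=train_file.id,
        validation_file=valid_file.id,
        suffix=f"tech-doc-{name}",
    )
    print(name, jobs[name].id, jobs[name].status)
```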
A typical response from the previous code is the following:
As the code output suggests, I ran the 3 jobs in parallel, which took around 1 hour in total to complete.
Analysis of the Training Logs
After it finished, I downloaded the training logs for evaluation.
Here is the source code for the analysis I did:
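Conceptually, it pulls each job's result file (a CSV of per-step metrics) and plots the loss curves. A sketch, reusing the client and jobs objects from the training snippet above (the exact column names can vary slightly between API versions):

```python
# Sketch: download each job's result CSV and plot train/validation loss.
# Reuses `client` and `jobs` from the training-job snippet above.
import io

import matplotlib.pyplot as plt
import pandas as pd

def load_training_metrics(job_id: str) -> pd.DataFrame:
    job = client.fine_tuning.jobs.retrieve(job_id)
    result_file_id = job.result_files[0]           # CSV with per-step metrics
    raw = client.files.content(result_file_id).read()
    return pd.read_csv(io.BytesIO(raw))

plt.figure(figsize=(10, 6))
for name, job in jobs.items():
    metrics = load_training_metrics(job.id)
    plt.plot(metrics["step"], metrics["train_loss"], label=f"{name} - train loss")
    if "valid_loss" in metrics.columns:
        valid = metrics.dropna(subset=["valid_loss"])
        plt.plot(valid["step"], valid["valid_loss"], label=f"{name} - valid loss")

plt.xlabel("Step")
plt.ylabel("Loss")
plt.legend()
plt.show()
```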
Looking at the training results, it's clear that the type of data we feed into the model makes a big difference.
It seems that the "Mixed" dataset offered the best balance of training stability and validation performance, making it the preferred choice for future fine-tuning. I believe the bigger dataset and the data variability were the main reasons for that.
The "Structured Answers" dataset also performs well but slightly underperforms compared to the "Mixed" dataset.
The "Free-Form" dataset shows higher noise and less reliable validation results, suggesting it may not be the best standalone option for fine-tuning, or, at least, not suitable for this dataset size.
In-Context Learning Setup
Before starting to evaluate the trained models, I wanted to have some baseline comparisons for the evaluation, so I created another notebook: "In_Context_Learning_Evaluation_for_Technical_Document_Generation".
As I already mentioned, in-context learning is a prompt engineering technique that uses pure prompting strategies to try to guide the LLM toward specific goals. I wanted to create code and functions for zero-shot learning, mimicking their original prototype, and, once again, few-shot learning, this time for document generation rather than answer generation. Again, as with the synthetic data, I used the most advanced GPT-4 family models available at the time.
Similar to what I did when creating the fine-tuning dataset, I used few-shot examples where the inputs were the structured questions and generated answers and the outputs were the documents, plus a separate set of tests where the few-shot examples were the free-form questions and answers and the output was the technical document. The following is a VERY REDUCED example of it:
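Something along these lines (a heavily reduced sketch; the prompts and the deployment name are placeholders):

```python
# Very reduced sketch of the zero-shot and few-shot document generation setups.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

# Hypothetical Azure deployment name and API version.
llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-02-01", temperature=0)

# Zero-shot: instructions plus the answered questions only.
zero_shot_prompt = ChatPromptTemplate.from_messages([
    ("system", "You write complete technical documents from the answers below."),
    ("human", "{questions_and_answers}"),
])

# Few-shot: same instructions, preceded by one (Q&A -> document) example;
# the real prompt carried 12 specialist-curated examples.
few_shot_prompt = ChatPromptTemplate.from_messages([
    ("system", "You write complete technical documents from the answers below."),
    ("human", "{example_questions_and_answers}"),
    ("ai", "{example_document}"),
    ("human", "{questions_and_answers}"),
])

zero_shot_chain = zero_shot_prompt | llm
few_shot_chain = few_shot_prompt | llm
```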
I also ran some tests with both functions, and the results for few-shot were better than zero-shot, but they weren't quite there yet, as they lacked most of the document structure and technical language.
Evaluating Performance
It was imperative to have ways to quantify how much better (or worse) the different generation methodologies were compared to each other.
The gold standard for evaluating LLM apps is humans, usually domain experts in a particular field. For that, I created a small Streamlit app, which was the new POC prototype. It consists of a long web app form with 26 different inputs (most of them optional) where the specialists can fill in the answers for the inputs, and select one or more generation methodologies to generate one or multiple technical documents for the same input, which is useful for comparing the quality of the methods.
I included the work done in the in-context learning notebook and the original prototype, as well as GPT-4o, which didn't exist when the first prototype was released.
But human evaluation is expensive and slow, especially on a system like this, so a more effective way to evaluate the application across the different methodologies was required.
So here the LangSmith Evaluator framework comes in as a nice tool to help. LangSmith, as LangChain states, "is an all-in-one developer platform for every step of the LLM-powered application lifecycle, whether you're building with LangChain or not."
It allows you to closely monitor and evaluate your application, trace every call to the model, and inspect internal actions, among other things, but the coolest part to me is the Evaluation framework.
Evaluators in LangSmith score your application's performance on dataset examples, returning a metric key, score, and comment. Key approaches include Human evaluation for manual review, Heuristic evaluators using predefined rules, LLM-as-judge evaluators leveraging language models for scoring, and Pairwise evaluators comparing two outputs to determine the better one.
LangChain offers off-the-shelf evaluators for Python too. You can apply evaluators within LangChain's evaluation chains, run application-specific evaluation experiments, and more.
A full explanation of the Evaluation framework is outside the scope of this article. Feel free to read more about it in the official docs.
Before running any experiment, you need to upload your datasets. For our case, I took 24 technical docs from the validation set (data never seen by the model in training), covering all possible categories and subcategories. Then I asked the human specialists to improve the inputs, and once they provided me with 24 new/improved inputs for those docs, I used them to create the evaluation dataset with code very similar to the following snippet:
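A sketch of it, assuming the langsmith client with LANGCHAIN_API_KEY set in the environment (the dataset name and field names are illustrative):

```python
# Sketch: create the evaluation dataset in LangSmith and add the 24 examples.
from langsmith import Client

ls_client = Client()  # reads LANGCHAIN_API_KEY from the environment

dataset = ls_client.create_dataset(
    dataset_name="technical-docs-eval-v1",  # illustrative name
    description="24 curated inputs with their reference technical documents",
)

# Placeholder structure for the 24 specialist-improved inputs and reference docs.
curated_examples = [
    {"input": "<improved questions and answers #1>", "document": "<reference document #1>"},
    # ... 24 items in total
]

for item in curated_examples:
    ls_client.create_example(
        inputs={"questions_and_answers": item["input"]},
        outputs={"reference_document": item["document"]},
        dataset_id=dataset.id,
    )
```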
By running it, the dataset gets created and filled, and it becomes visible on the Langsmith website.
After everything is in place, you can set up the evaluators and run the experiments. Check out the following snippet on how I did it:
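A sketch of one experiment run with LangSmith's evaluate helper; the target function wraps a single generation methodology, and the toy heuristic evaluator below only illustrates the expected return shape (the real experiments also included an LLM-as-judge evaluator):

```python
# Sketch: run one experiment with the LangSmith evaluation framework.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langsmith.evaluation import evaluate

# Assumed deployment of one fine-tuned model; the name is illustrative.
fine_tuned_llm = AzureChatOpenAI(
    azure_deployment="gpt35-tech-doc-mixed", api_version="2024-02-01", temperature=0
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a technical document generator."),
    ("human", "{questions_and_answers}"),
])
fine_tuned_chain = prompt | fine_tuned_llm | StrOutputParser()

def generate_document(inputs: dict) -> dict:
    # Wraps one methodology, e.g. the fine-tuned model + agentic RAG chain.
    completion = fine_tuned_chain.invoke({"questions_and_answers": inputs["questions_and_answers"]})
    return {"document": completion}

def has_minimum_structure(run, example) -> dict:
    # Toy heuristic evaluator: does the output contain Markdown headings?
    generated = run.outputs["document"]
    return {"key": "has_minimum_structure", "score": 1.0 if "## " in generated else 0.0}

results = evaluate(
    generate_document,
    data="technical-docs-eval-v1",
    evaluators=[has_minimum_structure],
    experiment_prefix="gpt35-mixed-fine-tuned-agentic-rag",  # one prefix per methodology
    num_repetitions=3,  # each dataset item runs 3 times to reduce variability
)
```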
Just a note: I ran one experiment for each one of the 7 methodologies, and 3 times for each item in the dataset (so 72 times in total per methodology) to reduce variability.
You can also follow the experiment by accessing the Langsmith website dashboard, as shown below:
This experimentation had a considerable cost. LangSmith, at least at this usage rate, is free, but for the document generation itself I was expecting a considerable cost, especially because GPT-4 and GPT-4o are more expensive and their few-shot learning prompt with 12 samples took 48k input tokens. So I estimated the cost before running the experiments, arriving at a value close to $85. Check the reasoning behind it:
It ended up being a good estimate. Here is the real value (I haven't included the cost of the embedding models required by some evaluators, nor the LLM-as-judge we used):
Note: The usage of the GPT-3.5 Turbo Fine-tuned models costs 6x more per token than the default GPT-3.5 Turbo.
Once the experiment was done, I downloaded the data and ran my own data analysis, comparisons, and some visualization algorithms. The following is the code to download the experimentation logs:
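A sketch of pulling the runs back with the LangSmith client (the project name is whatever LangSmith assigned to the experiment, and the fields collected here are just examples):

```python
# Sketch: download the experiment runs and their feedback scores into a DataFrame.
import pandas as pd
from langsmith import Client

ls_client = Client()

runs = ls_client.list_runs(
    project_name="gpt35-mixed-fine-tuned-agentic-rag-<experiment-id>",  # hypothetical name
)

records = []
for run in runs:
    scores = {fb.key: fb.score for fb in ls_client.list_feedback(run_ids=[run.id])}
    records.append({"run_id": str(run.id), "run_type": run.run_type, **scores})

experiment_df = pd.DataFrame(records)
print(experiment_df.describe())
```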
The following images are part of my official report on the evaluation results, based on the downloaded logs:
As additional material, I also created some data visualizations of the results.
Evaluation Conclusion
Checking the results, the methodologies "GPT-3.5 Turbo Structured (Fine-tuned + Agentic RAG)" and "GPT-3.5 Turbo Mixed (Fine-tuned + Agentic RAG)" show up at the top of the scores for almost all metrics, by far, followed, not so closely, by "GPT-4o - few-shot learning (Agentic RAG)" on some metrics.
The human evaluations via the Streamlit POC app that happened during the weeks following the release of the prototype also corroborated these findings: the specialists were divided between those two fine-tuned models as the best solution.
And they are also the cheapest models/methodologies: they cost around $0.03 to generate a technical document. The third (or fourth, depending on how the total average score is calculated) best approach is "GPT-4o - few-shot learning (Agentic RAG)", which costs $0.30 per technical document. That is 10x more!
Extra Surprise: Detecting AI Content
I was talking about this project with a great friend of mine, Leandro Cunha, who happens to be a great Machine Learning Engineer, and he gave me an intriguing idea: why not test some of the generated documents against the most famous AI detector services?
There are a bunch of services that try to detect whether a text or document was generated by any of the most famous LLMs, and what percentage of it might have been created or paraphrased by an AI. They are called AI writing detectors; these detection methods are still evolving and are not infallible. Explaining how they work is out of scope here, but for a more in-depth understanding, you can check the sources in the References section [19], [20], [21], [22], and [23].
For this experiment, I took 10 generated documents per methodology plus the original document for the same input, out of the 24 curated technical documents used in the evaluation runs. Why 10? From the 24 docs, I filtered 10 that were written before 2020–2021. I wanted to make sure the original documents were created by the specialists without any sort of GenAI influence, which seems to creep into docs post-2022.
What I actually did was semi-manual testing, 10x on each methodology, with different documents, against 6 different AI detection services:
- Copyleaks AI Detector (I used the paid version)
- Quillbot (Premium version) AI Content Detector
- Sapling AI Detector
- ZeroGPT
- UNDETECTABLE AI
- Scribbr (Free)
Most of the services were free. Copyleaks, for example, has a very low free quota for testing, which forced me to spend $20 on credits to run the full experiment; the good thing is that, by doing so, I was allowed to use their API to automate the experiment. QuillBot is also a premium service, but it has a free version (I'm not sure about the daily limit), and since I'm already a QuillBot subscriber, I could use the service without extra costs. I decided to limit the test to Scribbr's free version (which caps input at 500 words) because the paid detector is expensive, as it is part of another service they offer, the Plagiarism Checker.
Here are the results, averaged over the 10 runs per methodology (80 runs per service, as I had 10 original docs and 70 generated ones). For QuillBot, I also collected the averages of the fine-grained metrics, since it was the only service that provided 4 extra outputs beyond the general percentage.
Reviewing the results, it is amazing how effective the fine-tuning also was at tricking most of those AI detectors. In this case, the "GPT-3.5 Turbo Mixed (Fine-tuned + Agentic RAG)" methodology had the upper hand on more detectors.
Copyleaks also had trouble detecting pure GPT-4o output when it used the few-shot prompt. ZeroGPT seemed to produce somewhat erratic results; I even ran some of those tests twice to make sure the output wasn't changing for the same input, but all the detectors were pretty much deterministic.
Ironically, Undetectable AI lived up to its name: it didn't detect any AI at all!
Reflecting on the Journey
This journey finally came to an end. Well, so, what can I say about it? I had more fun than I had expected, and that's why I decided to write about it.
This project has opened my eyes to the possibilities and usefulness of training LLMs with synthetic data, and I hope some of you find inspiration in this account of my POC journey. As we build on this foundation and improve the models with more data and categories than the prototype covered, the future of this project looks bright.
I hope you have found this journey helpful in some way. Thank you very much for your time, and congrats to those who have read this lengthy post!
References
Note: Unless otherwise noted, all images are by the author.