Synthetic Data Generation with Language Models: A Practical Guide

Last Updated on October 5, 2024 by Editorial Team

Author(s): Ehssan

Originally published on Towards AI.

Created with Nightcafe — Image property of Author

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation — a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, or developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.

Beyond addressing data shortages, synthetic data enhances AI development by balancing imbalanced datasets (e.g., in fraud detection or rare medical conditions), simulating rare events, and augmenting limited data with realistic variations. Companies can accelerate development, improve model robustness, and experiment with datasets otherwise unavailable.

While the benefits of synthetic data — such as scalability, privacy preservation, and the ability to simulate hard-to-capture scenarios — are clear, it also has limitations, including limited real-world authenticity, overfitting, and bias, which require careful consideration.

In this article, we’ll explore synthetic data generation, discuss its limitations and ways to overcome them, and show you how to implement your own synthetic data generator in Python.

How to Overcome the Limitations of Synthetic Data

1. Lack of Real-World Authenticity

Synthetic data may not fully capture the nuances and variability of real-world data, leading to models that perform well in controlled environments but fail in real-world applications.

How to Overcome:

  • Hybrid Approach: Use synthetic data to augment real data, not replace it. A combination ensures that the model can generalize to unseen, real-world scenarios (a minimal sketch of such a split follows this list).
  • Validation on Real Data: Always validate models on real-world datasets, even if training is done with synthetic data, to assess performance in practical applications and to ensure robustness.
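
For illustration, here is a minimal sketch of such a hybrid split, assuming the real and synthetic samples already live in pandas DataFrames (the function name is hypothetical).

import pandas as pd
from sklearn.model_selection import train_test_split

def make_hybrid_split(real_df, synthetic_df, test_size=0.2, seed=42):
    """Combine real and synthetic samples for training, but keep a
    held-out slice of real data for validation."""
    real_train, real_val = train_test_split(real_df, test_size=test_size, random_state=seed)
    # Augment the real training split with synthetic samples...
    train_df = pd.concat([real_train, synthetic_df], ignore_index=True)
    # ...and shuffle the mix; the validation split stays purely real-world
    return train_df.sample(frac=1, random_state=seed).reset_index(drop=True), real_val

Training on the combined split while reporting metrics only on real_val keeps the evaluation grounded in real-world data.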

2. Overfitting and Bias

Models trained on synthetic data might overfit to patterns in that data that do not exist in real-world data, leading to poor generalization when deployed. Synthetic data can also inherit or amplify biases present in the models used to generate it, which can result in biased predictions.

How to Overcome:

  • Data Regularization: Apply data augmentation techniques and introduce noise in synthetic data to mimic the randomness and variability of real-world data (a minimal word-dropout sketch follows this list).
  • Diverse Data Generation: Ensure diversity in the synthetic data by using multiple models and methods to generate data from different perspectives.
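
As one concrete (and deliberately simple) illustration of introducing noise, a word-dropout pass can lightly perturb synthetic texts; this sketch is only an example and is not part of the generator built later in this article.

import numpy as np

def add_word_dropout(text, drop_prob=0.05, seed=None):
    """Randomly drop a small fraction of words to mimic real-world variability."""
    rng = np.random.default_rng(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > drop_prob]
    return " ".join(kept) if kept else text  # never return an empty string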

In addition, keep in mind that ensuring the quality and representativeness of synthetic data can be difficult; a little experimentation with few-shot learning (FSL) and chain-of-thought (CoT) prompting often goes a long way. We illustrate both techniques in more detail below.

Synthetic Data Generator Implementation

You can run this tutorial on the Intel® Tiber™ Developer Cloud free environment, which is equipped with a 4th Generation Intel® Xeon® CPU. This platform provides ample computing resources, ensuring smooth execution of our code.

Environment Setup

Let’s begin by importing the necessary libraries. In our demo we shall use Llama 3.1, and you will need a Hugging Face token to access this model’s gated repository. You may create and access your tokens directly from your Hugging Face account. To do so, select “Access Tokens” from your settings menu and create a token with the “write” permission.

Snapshot of the Hugging Face token creation page — Image by Author

Now, you can insert your token in your Python script. (Do not share your Access Tokens with anyone; Hugging Face removes any leaked Access Tokens.)

import torch
import numpy as np
from transformers import pipeline
import pandas as pd
from huggingface_hub import login

login("your_token")

Next, go to meta-llama/Meta-Llama-3.1-8B-Instruct and read the license before providing your information and submitting the Llama 3.1 access request.

Implementation

Let’s say we want to generate synthetic customer service texts classified by the following labels

labels = ["polite", "somewhat polite", "neutral", "impolite"]

in these contexts

category_type = {
    "travel": ["air", "train"],
    "stores": ["appliances", "toys and games"],
}

We shall randomly select labels and categories and instruct the language model to generate synthetic data based on the specified categories and labels.

Randomness helps with data regularization; see the second challenge (Overfitting and Bias) above. Once we have selected a context category, we randomly choose a corresponding type from our dictionary as follows.


def diversify(category):
    """
    Randomly selects a value from the list associated with a given key in the category_type dictionary.

    Args:
        category (str): A key in the category_type dictionary.

    Returns:
        str: A randomly chosen value from the list associated with the provided key.
    """
    return np.random.choice(category_type[category])

Here’s how we go about the full implementation: we generate data in batches and our function randomly assigns labels and categories to the batch’s samples. For each sample in the batch, the sdg function:

  • Creates a prompt that instructs the language model to generate a synthetic customer service response based on the assigned label and category.
  • Uses the language model to generate a response to the prompt.
  • Extracts the relevant text from the generated response. You can leave the text_extraction function as an identity function for now, since its exact definition depends on factors like the prompt; it can easily be handled with regular expressions, for example (a minimal placeholder is sketched below).

Finally, each batch of the generated responses, along with their labels and the model used, is appended to a CSV file.
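
For reference, here is a minimal placeholder for text_extraction. It behaves as an identity function unless the model happens to echo an OUTPUT: prefix, which a simple regular expression strips; the prefix handling is an assumption about the response format, not a guarantee.

import re

def text_extraction(text):
    """Placeholder extractor: return the generated text as-is,
    stripping an 'OUTPUT:' prefix if the model echoes one."""
    match = re.match(r"\s*OUTPUT:\s*(.*)", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()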

def sdg(
    sample_size,
    labels,
    categories,
    batch_size=20,
    output_path="./output.csv",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
):
    """
    Generates synthetic data based on specified categories and labels.

    Args:
        sample_size (int): The number of synthetic data samples to generate.
        labels (list of str): The labels used to classify the synthetic data.
        categories (list of str): The categories for data generation and diversification.
        batch_size (int): The number of samples per batch to append to the output file.
        output_path (str): The file path where the generated data will be saved.
        model (str): The large language model used for generating the synthetic data.
    """
    # If sample_size is not divisible by batch_size, an extra batch is added
    num_batches = (sample_size + batch_size - 1) // batch_size

    print(f"Synthetic data will be appended to {output_path} in {num_batches} batches.")

    # Load the text-generation pipeline once instead of re-creating it for every sample
    generator = pipeline("text-generation", model=model)

    for batch in range(num_batches):
        # Calculate the start and end indices for the current batch
        start = batch * batch_size
        end = min(start + batch_size, sample_size)

        # Store results of the current batch
        batch_data = []

        # Assign random labels to the current batch
        batch_random_labels = np.random.choice(labels, batch_size, replace=True)

        # Assign random categories to the current batch
        batch_random_categories = np.random.choice(categories, batch_size, replace=True)

        for i in range(start, end):
            prompt = f"""I am creating synthetic OUTPUT to fine-tune
my BERT model. The use case is customer service chatbots.
You should generate only one OUTPUT for the classification
LABEL: {batch_random_labels[i - start]} in CATEGORY:
{batch_random_categories[i - start]} and TYPE
{diversify(batch_random_categories[i - start])}.

Examples.
OUTPUT: The fee you’re seeing is likely related
to our standard account maintenance charges. I can provide
more details if needed.

OUTPUT: You can return it, but only if you have the
receipt and it’s within the return window.

OUTPUT: It's not our fault your baggage didn't make it.
What do you expect us to do about it now?

OUTPUT: I apologize for the trouble you’ve had with the
heater. We can certainly look into a return or exchange.
Please bring in your receipt, and we’ll take care of it
for you.

Only return one OUTPUT and not the LABEL or the CATEGORY.
"""

            messages = [
                {
                    "role": "system",
                    "content": f"You are a helpful assistant designed to generate synthetic customer service data with labels {labels} in categories {list(category_type.keys())}.",
                },
                {"role": "user", "content": prompt},
            ]
            result = generator(messages, max_new_tokens=128)[0]["generated_text"][-1][
                "content"
            ]

            result = text_extraction(result)
            batch_data.append(
                {
                    "text": result,
                    "label": batch_random_labels[i - start],
                    "model": model,
                }
            )

        # Convert the batch results to a DataFrame
        batch_df = pd.DataFrame(batch_data)

        # Append the DataFrame to the CSV file
        if batch == 0:
            # If it's the first batch, write headers
            batch_df.to_csv(output_path, mode="w", index=False)
        else:
            # For subsequent batches, append without headers
            batch_df.to_csv(output_path, mode="a", header=False, index=False)
        print(f"Saved batch number {batch + 1}/{num_batches}")

Here’s a sample output.

| text | label | model |
|------|-------|-------|
| You're still whining about your membership renewal fee? It's not like we're the ones who raised the prices, it's the board's decision. You should just deal with it and stop complaining. | impolite | meta-llama/Meta-Llama-3.1-8B-Instruct |
| I'm not sure why our membership fees are higher this quarter, but I can check on the pricing for our tennis courts and see if there's a way to adjust your plan to fit your budget better. | somewhat polite | meta-llama/Meta-Llama-3.1-8B-Instruct |

Further Improvements

To improve the quality of our data generator’s outputs, we could modify the prompt and diversify the models used. We discuss each of these briefly.

Prompt

It’s good practice to pass explicit label descriptions to the model through the prompt. For instance, we could add the lines

polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

to our prompt. Additionally, we could require the language model to provide its reasoning to support the text generation for the specified label. Here is such an improved prompt.

prompt = f"""You should create synthetic data for specified labels and categories. 
This is especially useful for developing customer service chatbots.

Label descriptions:
- polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
- somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
- neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
- impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

Examples.

LABEL: somewhat polite
CATEGORY: travel
TYPE: train
OUTPUT: I understand your concern about your booking, and I'll check what options we have for you.
REASONING: This text would be classified as "somewhat polite."
The acknowledgment of the customer's concern shows a basic level of respect.
The sentence is direct and lacks additional warmth or formality, but it communicates a willingness to help.
The use of "I'll check" is a straightforward commitment to action without additional courteous phrases that would make it fully polite.

LABEL: neutral
CATEGORY: stores
TYPE: appliances
OUTPUT: Your TV will be delivered within three to five business days.
REASONING: This text would be classified as "neutral."
The sentence is purely informational, providing the facts about delivery time without any emotional undertones.
There are no phrases that express politeness or rudeness; it's a straightforward statement.
The tone is impersonal and focused solely on conveying the necessary information.
####################
You should generate one OUTPUT for the classification below.
Only return the OUTPUT and REASONING.
Do not return the LABEL, CATEGORY, or TYPE.

LABEL: {batch_random_labels[i - start]}
CATEGORY: {batch_random_categories[i - start]}
TYPE: {diversify(batch_random_categories[i - start])}
OUTPUT:
REASONING:
"""

Diversity

To further diversify the output data, one can pass multiple different language models to the synthetic data generator. When we used identical generators and prompts on Llama-3.1-8B-Instruct, gemma-2-9b-it, and Mixtral-8x7B-Instruct-v0.1, we observed the following percentages of duplicated data.

  • Llama: 0.04%
  • Gemma: 94.6% (Note: This model wasn’t trained with any system instructions, so you need to modify messages accordingly.)
  • Mixtral: 7%
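
These duplication rates can be estimated directly from the generated CSV. Here is one way to do it with pandas, counting exact duplicate texts (which may differ slightly from however you prefer to define a duplicate).

import pandas as pd

df = pd.read_csv("output.csv")
dup_rate = df["text"].duplicated().mean() * 100  # share of rows whose text already appeared
print(f"Duplicated samples: {dup_rate:.2f}%")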

Gotcha Alert: In some edge cases, the language model might generate the same text for different labels! For instance, when we ran the generator with Llama 3.1, the following output was generated for both the neutral and somewhat polite labels.

I'm afraid the toy you're looking for is currently out of stock, but we do have a similar product that might interest you. Would you like me to check availability?
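
One simple safeguard is to drop texts that end up assigned to more than one label before fine-tuning. Here is a minimal sketch, again operating on the generated CSV rather than anything from the original generator.

import pandas as pd

df = pd.read_csv("output.csv")
# Keep only texts that appear under exactly one label
labels_per_text = df.groupby("text")["label"].nunique()
clean_df = df[df["text"].map(labels_per_text) == 1]
print(f"Dropped {len(df) - len(clean_df)} ambiguous samples")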

Conclusion

Synthetic data generation with language models is a powerful tool that has the potential to reshape the future of AI. Whether you’re a researcher, developer, or business leader, understanding this technology could provide a competitive edge in the evolving AI landscape.

If you’re interested in exploring how synthetic data can revolutionize your AI projects, consider diving deeper into language models, writing your custom data generators, and experimenting with existing data generation tools to unlock new possibilities.

For more AI development how-to content, visit Intel® AI Development Resources.

Published via Towards AI
