Accelerating Data Annotation with LLMs: A Practical Guide
Author(s): Abdullah Al Munem
Data annotation stands as perhaps the most formidable challenge in supervised machine learning development. This critical process typically consumes 60–80% of project timelines and budgets, creating a significant bottleneck in ML development cycles. In my experience working with ML teams across multiple domains, I’ve witnessed many projects stall at this stage, with annotation difficulties delaying or even derailing promising applications.
But what if we could dramatically accelerate this process?
The emergence of Large Language Models (LLMs) has created an opportunity to transform how we approach annotation tasks. Rather than using LLMs directly for inference (which requires significant computational resources), we can leverage them as annotation assistants to create high-quality labeled datasets for training lightweight, specialized models.
In this article, I’ll share practical implementations from two recent projects where we used LLMs to annotate datasets that would have otherwise required weeks of manual labor. I’ll provide the exact code we used, explain our rationale, and share the lessons learned along the way.
The Challenge of Data Annotation
For supervised learning, data annotation is the most daunting, time-consuming, and labor-intensive task, and it can be expensive as well. Traditional annotation workflows typically follow a labor-intensive path:
1. Define annotation guidelines
2. Train annotators
3. Begin manual annotation
4. Perform quality-control checks and validate the annotations
5. Apply active learning to accelerate the process, shifting effort from manual annotation to validating model predictions
6. Resolve human errors and finalize the ground truth
From Active Learning to LLM-Assisted Annotation
One family of methods trains a model on a limited amount of annotated data, then uses that model to predict labels for unlabeled data. We validate the predictions, add the corrected examples to the ground truth, retrain the model, and annotate more data. This approach is known as “active learning.”
Active learning is a good approach, but it still requires manual annotation, and if the model is poorly trained, its predictions can be badly wrong. With the emergence of LLMs, we can annotate unlabeled data in a more sophisticated way.
With LLMs, we can reimagine this workflow:
1. Define annotation guidelines and examples (Few-shot prompting)
2. Use LLMs to automatically annotate unlabeled data
3. Filter results based on confidence scores
4. Validate a small subset to ensure quality
5. Train lightweight models on the annotated data
This approach can reduce annotation time from weeks to days while maintaining comparable quality. Let’s look at two real-world examples where we applied this method.
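To make the workflow concrete, here is a minimal sketch of the loop in Python. The helper names (annotate_with_llm, validate_manually) and the thresholds are placeholders for whatever LLM client and review process you use; they are not taken from the projects below.
import random

# Illustrative thresholds; tune these per project.
CONFIDENCE_THRESHOLD = 90.0   # keep only high-confidence LLM annotations
VALIDATION_FRACTION = 0.03    # manually spot-check ~3% of what is kept

def build_dataset(unlabeled_samples, annotate_with_llm, validate_manually):
    annotated = []
    for sample in unlabeled_samples:
        label, confidence = annotate_with_llm(sample)   # step 2: LLM annotation
        if confidence >= CONFIDENCE_THRESHOLD:          # step 3: confidence filter
            annotated.append((sample, label))
    # Step 4: manually validate a small random subset
    sample_size = max(1, int(len(annotated) * VALIDATION_FRACTION))
    error_rate = validate_manually(random.sample(annotated, sample_size))
    print(f"Spot-check error rate: {error_rate:.1%}")
    return annotated  # step 5: train a lightweight model on this dataset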
Practical Use Case 1: Computer Vision for Fashion Classification
Working with a leading fashion brand in Bangladesh, we developed a Virtual Try-On (VTON) system that required a robust model capable of accurately classifying clothing types across both Western and ethnic styles. This meant annotating a dataset of 50,000 images, a task that would traditionally take weeks of manual work.
A live demo of the Virtual Try-On is available on the product detail pages of the brand’s website.
Implementation
We built a FastAPI service that uses a vision-capable LLM to classify garment images. Here’s the actual code we implemented:
import io
import traceback

import ollama
from fastapi import FastAPI, Request
from PIL import Image

app = FastAPI()

# Map specific garment types to the broader categories the VTON pipeline needs.
cloth_category = {
    "sweater": "upper",
    "tunic": "upper",
    "shirt": "upper",
    "sweat shirt": "upper",
    "t-shirt": "upper",
    "polo shirt": "upper",
    "hoodie": "upper",
    "tops": "upper",
    "salwar kameez": "overall",
    "panjabi": "overall",
    "frock": "overall",
    "saree": "overall",
    "ghagra choli": "overall",
    "abaya": "overall",
    "skirt": "overall",
    "jacket": "upper",
    "coat": "upper"
}
cloths = list(cloth_category.keys())

@app.post("/infer")
async def infer(request: Request):
    try:
        # Read raw bytes from the request body
        image = await request.body()

        # Open the image using PIL
        image_pil = Image.open(io.BytesIO(image))

        # Convert to PNG if the image format is not JPEG or PNG
        if image_pil.format not in ["JPEG", "PNG"]:
            output_format = "PNG"  # Convert to PNG by default
            output = io.BytesIO()
            image_pil.convert("RGB").save(output, format=output_format)
            output.seek(0)
            image = output.getvalue()  # Get bytes from the converted image

        # Call the Ollama API with a vision-capable model
        response = ollama.chat(
            model="llama3.2-vision:11b",  # alternatives: "llava:7b", "llava:13b", "minicpm-v"
            messages=[{
                'role': 'user',
                'content': f'What cloth category is the person wearing? Answer within {cloths}. Answer just the cloth category.',
                'images': [image]
            }],
            stream=False,
            keep_alive=-1
        )

        # Normalize the raw LLM answer, then map it to the broad category
        cloth_type = response['message']['content'].lower().replace(".", "").strip()
        print(cloth_type)
        final_cloth_type = cloth_category.get(cloth_type, "upper")
        print("final_cloth_type:", final_cloth_type)

        return {"cloth_type": final_cloth_type}

    except Exception as e:
        print(traceback.format_exc())
        return {"error": str(e)}
A key insight from our implementation was the importance of prompt engineering. Notice how we:
1. Provided a constrained list of options (cloths)
2. Instructed the model to “Answer just the cloth category”
3. Added post-processing to normalize responses
4. Mapped specific garment types to the broader categories (upper/overall) that this API was built to produce
This approach yielded impressive results; the LLM correctly classified approximately 94% of the images in our test set.
Sample Outputs
Input: [Image of a man wearing a traditional panjabi]
LLM response: "panjabi"
Final classification: "overall"
Input: [Image of a woman in a floral saree]
LLM response: "saree"
Final classification: "overall"
Input: [Image of a teenager in a hoodie]
LLM response: "hoodie"
Final classification: "upper"
The key advantage here wasn’t just reducing annotation time and manual labor; it was adaptability. Unlike conventional computer vision models that struggle with domain-specific clothing items, especially ethnic wear, the LLM could leverage its broad knowledge to accurately classify diverse garment types without extensive retraining. We initially used the LLM itself as our cloth-type classifier, but vision LLMs consume a lot of GPU memory and have comparatively high inference latency. So we used the LLM to annotate our dataset instead, then trained a lightweight, fast model that is time- and memory-efficient while delivering similar results.
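The article doesn’t pin down which lightweight architecture we trained, so here is a minimal, illustrative fine-tuning sketch using a torchvision MobileNetV3 and an ImageFolder-style directory of LLM-annotated images; both the backbone choice and the llm_annotated/ layout are assumptions:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: llm_annotated/{upper,overall}/*.jpg exported from the annotation API
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("llm_annotated", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# A small, fast backbone; replace the classifier head for our category set
model = models.mobilenet_v3_small(weights="DEFAULT")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, len(dataset.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()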
Practical Use Case 2: NLP for Multilingual Spam Detection
In a separate project, we needed to develop a spam detection system for a telecommunications client who receives millions of SMS messages daily in multiple languages (English, Bengali, and mixed text or “Banglish”).
The challenge? We started with zero labeled data.
Implementation
Our approach combined few-shot learning with LLM-based annotation. Here’s the actual code we implemented:
import json
import re
import traceback

import ollama
from fastapi import FastAPI, Request

app = FastAPI()

# Few-shot examples spanning English, Bengali, and Banglish messages
examples = [
    {"label": "ham", "text": "Dear Sir Your Customer ID is: 620 Current month bill is 750TK. Kindly pay your internet bill before 10th by bKash or Nagad Pay Bill Antaranga"},
    {"label": "ham", "text": "Dear Sir Your Customer ID is: 104614 Current month bill is 417TK. Kindly pay your internet bill before 10th by bKash or Nagad Pay Bill Antaranga"},
    {"label": "ham", "text": "আপনার আইপি টিভি এর পিন নাম্বার ১২৩৪। এটি কারও সাথে শেয়ার করবেন না। ধন্যবাদ।"},
    {"label": "ham", "text": "ভাই, আজকে সন্ধ্যা ৭টায় আসছি। খাবার রেডি রাখিস।"},
    {"label": "ham", "text": "Meeting rescheduled to 3pm tomorrow. Please confirm attendance."},
    {"label": "spam", "text": "বিনামূল্যে ৳৩৮ অ্যাপ ডাউনলোড, বিনামূল্যে টিকিট https://bit.ly/4fDpH48"},
    {"label": "spam", "text": "প্রথমবার জমা করুন আর স্লট গেমে ১০০ ফ্রি স্পিন জিতুন! xdss.net/4g0KWg9"},
    {"label": "spam", "text": "Congratulations! You've been selected for a FREE iPhone 15. Claim now: http://bit.ly/claim-prize"},
    {"label": "spam", "text": "অভিনন্দন! আপনি লটারি জিতেছেন! আপনার জিতে নেওয়া ১০,০০০ টাকা পেতে কল করুন 01XXXXXXXX"},
    {"label": "spam", "text": "Your account will be suspended. Update your information: http://tiny.cc/update-now"}
]

@app.post("/infer-single")
async def infer_single(request: Request):
    reply = ""
    try:
        req_json = await request.json()
        sms_text = req_json.get("message")
        if not sms_text:
            return {"error": "No message provided."}

        # Build the prompt: instructions, output format, then few-shot examples
        prompt = """You are a Banglish SMS spam classifier. Given a message, classify it as 'spam' or 'ham', and give a confidence score (0.0 to 100%) representing how confident you are that the message is spam. A clearly spammy message should receive a higher confidence value. Consider any bad word or slang in Banglish as spam. Use your Bangla language knowledge to determine spam. Please follow the exact response pattern: return only JSON with keys and values, no extra text, with all strings in double quotes and in valid JSON format:

<response>
{
    "message": the given SMS (in double quotes),
    "spam": 1 if spam, 0 if not spam,
    "confidence": confidence score (0.0 to 100%) of being spam
}
</response>
"""
        for ex in examples:
            prompt += f"Message: {ex['text']}\nClass: {ex['label']}\n\n"
        prompt += f"Message: {sms_text}\nClass:"

        # LLM call via Ollama
        response = ollama.chat(
            model="llama3:8b",
            messages=[{
                "role": "user",
                "content": prompt
            }],
            stream=False,
            keep_alive=-1
        )
        reply = response['message']['content'].strip().lower()
        print("LLM Reply:\n", reply)

        # Parse the structured response
        match = re.search(r"<response>\s*(\{.*?\})\s*</response>", reply, re.DOTALL)
        if not match:
            return {"error": "Failed to parse model response."}
        json_str = match.group(1)
        print(json_str)
        parsed = json.loads(json_str)

        return {
            "message": parsed.get("message", sms_text),
            "spam": int(parsed.get("spam", 0)),
            "confidence": round(float(parsed.get("confidence", 50.0)), 2)
        }

    except Exception as e:
        print(traceback.format_exc())
        print(reply)
        return {"error": str(e), "reply": reply}
Several technical details made this approach successful:
1. Few-shot learning: We provided just 10 examples (5 spam, 5 ham) across multiple languages
2. Structured output: We forced the LLM to return JSON with confidence scores
3. Robust parsing: We implemented a precise response format and error handling for malformed responses
4. Domain knowledge integration: We included specific instructions about Banglish slang; any domain-specific pattern the LLM might otherwise miss can be added to the prompt in the same way
Sample Outputs
Input: "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg"
LLM Reply:
<response>
{
"message": "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg",
"spam": 1,
"confidence": 95.5
}
</response>
Final output: {"message": "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg", "spam": 1, "confidence": 95.5}
Input: "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you."
LLM Reply:
<response>
{
"message": "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you.",
"spam": 0,
"confidence": 25.8
}
</response>
Final output: {"message": "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you.", "spam": 0, "confidence": 25.8}
We used few-shot prompting with just five examples each of spam and ham SMS, and used the LLM to annotate a large unlabeled dataset. Based on the confidence scores, we selected 5,000 high-confidence spam messages and 5,000 ham messages. This initial 10,000-message dataset was then used to train our first spam detection model.
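As an illustration of that confidence-based selection, here is a minimal sketch assuming the LLM annotations were collected into a CSV with message, spam, and confidence columns (the file name, column names, and thresholds are assumptions):
import pandas as pd

# Assumed columns: "message", "spam" (0/1), "confidence" (0-100, confidence of being spam)
df = pd.read_csv("llm_annotations.csv")

# High-confidence spam, plus messages the LLM was confident are NOT spam
spam = df[(df["spam"] == 1) & (df["confidence"] >= 90)].head(5000)
ham = df[(df["spam"] == 0) & (df["confidence"] <= 10)].head(5000)

seed_dataset = pd.concat([spam, ham]).sample(frac=1, random_state=42)  # shuffle
seed_dataset.to_csv("seed_training_set.csv", index=False)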
The key insight was iterative improvement (active learning): we used this initial model to annotate additional data from our unlabeled pool, manually verified a sample of its predictions, and added the validated predictions back to our training set. To speed up verification, we clustered the messages with the k-means algorithm so that similar SMS could be reviewed together instead of checking every message one by one. With each iteration, the model’s performance improved significantly. This bootstrapping approach allowed us to efficiently create a high-quality labeled dataset from millions of unlabeled messages.
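To illustrate the clustering trick, here is a minimal sketch that groups similar SMS with TF-IDF character n-grams and scikit-learn’s KMeans so a reviewer can validate whole clusters at once; the feature choice and cluster count are assumptions, not the exact setup we used:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_for_review(messages, n_clusters=50):
    """Group similar SMS so a reviewer can validate clusters instead of single messages."""
    # Character n-grams behave reasonably across English, Bengali, and Banglish text
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=20000)
    features = vectorizer.fit_transform(messages)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_ids = kmeans.fit_predict(features)
    clusters = {}
    for message, cid in zip(messages, cluster_ids):
        clusters.setdefault(int(cid), []).append(message)
    return clusters  # review a few messages per cluster rather than every SMS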
Technical Insights and Best Practices
Through these projects, we identified several critical factors for successful LLM-assisted annotation:
1. Prompt Engineering is Critical
The quality of your annotations depends heavily on your prompt design. Our guidelines, with a small template sketch after the list:
- Be explicit about output format (use response templates)
- Include representative examples across categories
- Provide domain-specific context when relevant
- Request confidence scores to enable quality filtering
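Putting these guidelines together, a small prompt-builder might look like the sketch below; the template wording is illustrative rather than the exact prompt from either project:
def build_annotation_prompt(task_description, labels, examples, item):
    """Assemble an annotation prompt: explicit format, few-shot examples, and a confidence request."""
    prompt = (
        f"{task_description}\n"
        f"Answer with exactly one of: {', '.join(labels)}.\n"
        "Also return a confidence score between 0 and 100.\n"
        'Respond only with JSON: {"label": "...", "confidence": ...}\n\n'
    )
    for ex in examples:  # representative examples across categories
        prompt += f"Input: {ex['text']}\nLabel: {ex['label']}\n\n"
    prompt += f"Input: {item}\nLabel:"
    return prompt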
2. Model Selection Matters
We experimented with various models:
- For vision tasks: Llama 3.2 Vision (11B) provided the best balance of accuracy and speed
- For NLP tasks: Llama 3 (8B) worked surprisingly well, even for multilingual content
- Hosting locally with Ollama provided the best cost/performance ratio for our use cases
3. Post-Processing Logic is Essential
Raw LLM outputs require normalization; a helper sketch follows this list:
- Strip punctuation and standardize casing
- Parse structured outputs carefully with error handling
- Implement confidence thresholds to filter results
- Map raw classifications to your target schema
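In code, these post-processing steps might look like the helpers sketched below; the regex and fallback behavior mirror the spam example above but are illustrative, not a fixed API:
import json
import re

def normalize_label(raw_text, label_map, default):
    """Strip punctuation/casing and map the raw LLM answer onto the target schema."""
    cleaned = raw_text.lower().strip().strip(".!?\"'")
    return label_map.get(cleaned, default)

def parse_json_annotation(reply, confidence_threshold=80.0):
    """Extract a JSON payload and drop annotations below the confidence threshold."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None  # malformed response: queue the item for re-annotation
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if float(parsed.get("confidence", 0.0)) < confidence_threshold:
        return None  # low confidence: leave for manual review
    return parsed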
4. Validation Strategy
Even with LLM assistance, validation remains important; a sampling sketch follows this list:
- Randomly sample and manually check 2–5% of annotations, or use a clustering method to speed up your manual validation
- Focus validation efforts on edge cases and low-confidence predictions
- Create a feedback loop to improve prompts based on validation findings
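One way to implement this sampling strategy is sketched below; the 3% rate and the confidence cut-off are illustrative defaults, not figures from the projects above:
import random

def select_for_review(annotations, sample_rate=0.03, low_conf_threshold=70.0):
    """Pick every low-confidence annotation plus a random slice of the rest for manual review."""
    low_conf = [a for a in annotations if a["confidence"] < low_conf_threshold]
    rest = [a for a in annotations if a["confidence"] >= low_conf_threshold]
    sample_size = max(1, int(len(rest) * sample_rate)) if rest else 0
    return low_conf + random.sample(rest, sample_size)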
Beyond Classification: Other Applications
While we’ve focused on classification tasks, LLM-assisted annotation works well for a wide variety of tasks across both vision and NLP domains:
NLP Tasks
- Named Entity Recognition: Identifying people, organizations, locations, etc. in text
- Relationship Extraction: Determining relationships between entities in text
- Sentiment Analysis: Labeling text as positive, negative, or neutral
- Intent Classification: Categorizing user queries by their intended purpose
- Text Summarization: Creating reference summaries for training extractive or abstractive models
- Question Answering: Generating question-answer pairs from passages
- Machine Translation: Creating parallel corpora across languages
- Coreference Resolution: Identifying expressions that refer to the same entity
- Part-of-Speech Tagging: Labeling words with their grammatical categories
- Text Style Transfer: Identifying examples of different writing styles
- Dialogue State Tracking: Annotating user intents and dialogue states
- Content Moderation: Flagging toxic, harmful, or inappropriate content
- Semantic Role Labeling: Identifying predicates and their arguments
- Aspect-Based Sentiment Analysis: Identifying sentiment toward specific aspects
- Data Augmentation: Generating variations of existing examples
Computer Vision Tasks
- Object Detection: Identifying and localizing objects in images
- Image Classification: Categorizing images into predefined classes
- Semantic Segmentation: Pixel-level classification of image content
- Instance Segmentation: Identifying and separating individual object instances
- Image Captioning: Creating descriptive captions for images
- Visual Question Answering: Generating QA pairs about images
- Facial Expression Recognition: Labeling emotions from facial images
- Pose Estimation: Identifying human body poses in images
- Scene Understanding: Describing the overall context of an image
- Action Recognition: Labeling activities in images or video frames
- Visual Relationship Detection: Identifying relationships between objects
- Depth Estimation: Creating reference depth maps for training depth models
- Anomaly Detection: Identifying unusual patterns in visual data
- Visual Attribute Classification: Labeling specific attributes of objects
- Medical Image Annotation: Identifying and segmenting anatomical structures
Conclusion: A New Annotation Paradigm
LLM-assisted annotation represents a fundamental shift in how we approach data preparation for ML systems. By leveraging LLMs as annotation assistants rather than end-to-end solutions, we gain:
- Speed: Annotation that would take weeks can be completed in days
- Cost efficiency: 80–90% reduction in annotation costs
- Consistency: More uniform application of annotation guidelines
- Flexibility: Works across domains, languages, and modalities
This approach doesn’t eliminate the need for human oversight, but it dramatically changes the equation, transforming annotation from a project bottleneck into a streamlined process that accelerates the entire ML development cycle.
What’s particularly exciting is how accessible this approach has become. With open-source models like Llama 3 and local inference servers like Ollama, even small teams can implement LLM-assisted annotation workflows without significant infrastructure investments.
The next time you’re facing a data annotation challenge, consider letting an LLM be your first annotator. Your future self (and budget) will thank you.
What annotation challenges have you faced in your ML projects? Have you experimented with using LLMs for annotation? Share your experiences in the comments below!
THANK YOU.
ABOUT ME
Abdullah Al Munem
Machine Learning Engineer at REVE Systems
LinkedIn: https://www.linkedin.com/in/abdullah-al-munem/