Accelerating Data Annotation with LLMs: A Practical Guide
Author(s): Abdullah Al Munem
Data annotation stands as perhaps the most formidable challenge in supervised machine learning development. This critical process typically consumes 60–80% of project timelines and budgets, creating a significant bottleneck in ML development cycles. In my experience working with ML teams across multiple domains, I’ve witnessed many projects stall at this stage, with annotation difficulties delaying or even derailing promising applications.
But what if we could dramatically accelerate this process?
The emergence of Large Language Models (LLMs) has created an opportunity to transform how we approach annotation tasks. Rather than using LLMs directly for inference (which requires significant computational resources), we can leverage them as annotation assistants to create high-quality labeled datasets for training lightweight, specialized models.
In this article, I’ll share practical implementations from two recent projects where we used LLMs to annotate datasets that would have otherwise required weeks of manual labor. I’ll provide the exact code we used, explain our rationale, and share the lessons learned along the way.
The Challenge of Data Annotation
For supervised learning, data annotation is the most daunting, time-consuming, and labor-intensive task, and it can be expensive as well. Traditional annotation workflows typically follow a labor-intensive path:
1. Define annotation guidelines
2. Train annotators
3. Begin manual annotation
4. Perform quality-control checks and validate the annotations
5. Apply active learning to accelerate the process, shifting effort from manual annotation to validating model predictions
6. Resolve human errors and finalize the ground truth
From Active Learning to LLM-Assisted Annotation
One family of methods trains a model on a limited amount of annotated data, then uses that model to predict labels for unlabeled data. We validate the predictions, add the corrected examples to the ground truth, retrain the model, and annotate more data. This approach is known as “active learning.”
Active learning is a good approach, but it still requires manual annotation, and if the model is poorly trained, its predictions can be badly wrong. With the emergence of LLMs, we can annotate unlabeled data in a more sophisticated way.
With LLMs, we can reimagine this workflow:
1. Define annotation guidelines and examples (Few-shot prompting)
2. Use LLMs to automatically annotate unlabeled data
3. Filter results based on confidence scores
4. Validate a small subset to ensure quality
5. Train lightweight models on the annotated data
This approach can reduce annotation time from weeks to days while maintaining comparable quality. Let’s look at two real-world examples where we applied this method.
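To make the workflow concrete, here is a minimal sketch of the loop in Python. The helper names (annotate_with_llm, validate_manually) and the thresholds are placeholders for whatever LLM client and review process you use; they are not taken from the projects below.
import random

# Illustrative thresholds; tune these per project.
CONFIDENCE_THRESHOLD = 90.0   # keep only high-confidence LLM annotations
VALIDATION_FRACTION = 0.03    # manually spot-check ~3% of what is kept

def build_dataset(unlabeled_samples, annotate_with_llm, validate_manually):
    annotated = []
    for sample in unlabeled_samples:
        label, confidence = annotate_with_llm(sample)   # step 2: LLM annotation
        if confidence >= CONFIDENCE_THRESHOLD:          # step 3: confidence filter
            annotated.append((sample, label))
    # Step 4: manually validate a small random subset
    sample_size = max(1, int(len(annotated) * VALIDATION_FRACTION))
    error_rate = validate_manually(random.sample(annotated, sample_size))
    print(f"Spot-check error rate: {error_rate:.1%}")
    return annotated  # step 5: train a lightweight model on this dataset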
Practical Use Case 1: Computer Vision for Fashion Classification
Working with a leading fashion brand in Bangladesh, we developed a Virtual Try-On (VTON) system that required a robust model capable of accurately classifying clothing types across both Western and ethnic styles. This meant annotating a dataset of 50,000 images, a task that would traditionally take weeks of manual work.
A live demo of the Virtual Try-On is available on the product detail pages of the brand’s website.
Implementation
We built a FastAPI service that uses a vision-capable LLM to classify garment images. Here’s the actual code we implemented:
import io
import traceback

import ollama
from fastapi import FastAPI, Request
from PIL import Image

app = FastAPI()

# Map specific garment types to the broader categories the VTON pipeline needs.
cloth_category = {
    "sweater": "upper",
    "tunic": "upper",
    "shirt": "upper",
    "sweat shirt": "upper",
    "t-shirt": "upper",
    "polo shirt": "upper",
    "hoodie": "upper",
    "tops": "upper",
    "salwar kameez": "overall",
    "panjabi": "overall",
    "frock": "overall",
    "saree": "overall",
    "ghagra choli": "overall",
    "abaya": "overall",
    "skirt": "overall",
    "jacket": "upper",
    "coat": "upper"
}
cloths = list(cloth_category.keys())

@app.post("/infer")
async def infer(request: Request):
    try:
        # Read raw bytes from the request body
        image = await request.body()

        # Open the image using PIL
        image_pil = Image.open(io.BytesIO(image))

        # Convert to PNG if the image format is not JPEG or PNG
        if image_pil.format not in ["JPEG", "PNG"]:
            output_format = "PNG"  # Convert to PNG by default
            output = io.BytesIO()
            image_pil.convert("RGB").save(output, format=output_format)
            output.seek(0)
            image = output.getvalue()  # Get bytes from the converted image

        # Call the Ollama API with a vision-capable model
        response = ollama.chat(
            model="llama3.2-vision:11b",  # alternatives: "llava:7b", "llava:13b", "minicpm-v"
            messages=[{
                'role': 'user',
                'content': f'What cloth category is the person wearing? Answer within {cloths}. Answer just the cloth category.',
                'images': [image]
            }],
            stream=False,
            keep_alive=-1
        )

        # Normalize the raw LLM answer, then map it to the broad category
        cloth_type = response['message']['content'].lower().replace(".", "").strip()
        print(cloth_type)
        final_cloth_type = cloth_category.get(cloth_type, "upper")
        print("final_cloth_type:", final_cloth_type)

        return {"cloth_type": final_cloth_type}

    except Exception as e:
        print(traceback.format_exc())
        return {"error": str(e)}
A key insight from our implementation was the importance of prompt engineering. Notice how we:
1. Provided a constrained list of options (cloths)
2. Instructed the model to “Answer just the cloth category”
3. Added post-processing to normalize responses
4. Mapped specific garment types to the broader categories (upper/overall) that this API was built to produce
This approach yielded impressive results; the LLM correctly classified approximately 94% of the images in our test set.
Sample Outputs
Input: [Image of a man wearing a traditional panjabi]
LLM response: "panjabi"
Final classification: "overall"
Input: [Image of a woman in a floral saree]
LLM response: "saree"
Final classification: "overall"
Input: [Image of a teenager in a hoodie]
LLM response: "hoodie"
Final classification: "upper"
The key advantage here wasn’t just reducing annotation time and manual labor; it was adaptability. Unlike conventional computer vision models that struggle with domain-specific clothing items, especially ethnic wear, the LLM could leverage its broad knowledge to accurately classify diverse garment types without extensive retraining. We initially used the LLM itself as our cloth-type classifier, but vision LLMs consume a lot of GPU memory and have comparatively high inference latency. So we used the LLM to annotate our dataset instead, then trained a lightweight, fast model that is time- and memory-efficient while delivering similar results.
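The article doesn’t pin down which lightweight architecture we trained, so here is a minimal, illustrative fine-tuning sketch using a torchvision MobileNetV3 and an ImageFolder-style directory of LLM-annotated images; both the backbone choice and the llm_annotated/ layout are assumptions:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: llm_annotated/{upper,overall}/*.jpg exported from the annotation API
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("llm_annotated", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# A small, fast backbone; replace the classifier head for our category set
model = models.mobilenet_v3_small(weights="DEFAULT")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, len(dataset.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()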
Practical Use Case 2: NLP for Multilingual Spam Detection
In a separate project, we needed to develop a spam detection system for a telecommunications client who receives millions of SMS messages daily in multiple languages (English, Bengali, and mixed text or “Banglish”).
The challenge? We started with zero labeled data.
Implementation
Our approach combined few-shot learning with LLM-based annotation. Here’s the actual code we implemented:
import json
import re
import traceback

import ollama
from fastapi import FastAPI, Request

app = FastAPI()

# Few-shot examples spanning English, Bengali, and Banglish messages
examples = [
    {"label": "ham", "text": "Dear Sir Your Customer ID is: 620 Current month bill is 750TK. Kindly pay your internet bill before 10th by bKash or Nagad Pay Bill Antaranga"},
    {"label": "ham", "text": "Dear Sir Your Customer ID is: 104614 Current month bill is 417TK. Kindly pay your internet bill before 10th by bKash or Nagad Pay Bill Antaranga"},
    {"label": "ham", "text": "আপনার আইপি টিভি এর পিন নাম্বার ১২৩৪। এটি কারও সাথে শেয়ার করবেন না। ধন্যবাদ।"},
    {"label": "ham", "text": "ভাই, আজকে সন্ধ্যা ৭টায় আসছি। খাবার রেডি রাখিস।"},
    {"label": "ham", "text": "Meeting rescheduled to 3pm tomorrow. Please confirm attendance."},
    {"label": "spam", "text": "বিনামূল্যে ৳৩৮ অ্যাপ ডাউনলোড, বিনামূল্যে টিকিট https://bit.ly/4fDpH48"},
    {"label": "spam", "text": "প্রথমবার জমা করুন আর স্লট গেমে ১০০ ফ্রি স্পিন জিতুন! xdss.net/4g0KWg9"},
    {"label": "spam", "text": "Congratulations! You've been selected for a FREE iPhone 15. Claim now: http://bit.ly/claim-prize"},
    {"label": "spam", "text": "অভিনন্দন! আপনি লটারি জিতেছেন! আপনার জিতে নেওয়া ১০,০০০ টাকা পেতে কল করুন 01XXXXXXXX"},
    {"label": "spam", "text": "Your account will be suspended. Update your information: http://tiny.cc/update-now"}
]

@app.post("/infer-single")
async def infer_single(request: Request):
    reply = ""
    try:
        req_json = await request.json()
        sms_text = req_json.get("message")
        if not sms_text:
            return {"error": "No message provided."}

        # Build the prompt: instructions, output format, then few-shot examples
        prompt = """You are a Banglish SMS spam classifier. Given a message, classify it as 'spam' or 'ham', and give a confidence score (0.0 to 100%) representing how confident you are that the message is spam. A clearly spammy message should receive a higher confidence value. Consider any bad word or slang in Banglish as spam. Use your Bangla language knowledge to determine spam. Please follow the exact response pattern: return only JSON with keys and values, no extra text, with all strings in double quotes and in valid JSON format:

<response>
{
    "message": the given SMS (in double quotes),
    "spam": 1 if spam, 0 if not spam,
    "confidence": confidence score (0.0 to 100%) of being spam
}
</response>
"""
        for ex in examples:
            prompt += f"Message: {ex['text']}\nClass: {ex['label']}\n\n"
        prompt += f"Message: {sms_text}\nClass:"

        # LLM call via Ollama
        response = ollama.chat(
            model="llama3:8b",
            messages=[{
                "role": "user",
                "content": prompt
            }],
            stream=False,
            keep_alive=-1
        )
        reply = response['message']['content'].strip().lower()
        print("LLM Reply:\n", reply)

        # Parse the structured response
        match = re.search(r"<response>\s*(\{.*?\})\s*</response>", reply, re.DOTALL)
        if not match:
            return {"error": "Failed to parse model response."}
        json_str = match.group(1)
        print(json_str)
        parsed = json.loads(json_str)

        return {
            "message": parsed.get("message", sms_text),
            "spam": int(parsed.get("spam", 0)),
            "confidence": round(float(parsed.get("confidence", 50.0)), 2)
        }

    except Exception as e:
        print(traceback.format_exc())
        print(reply)
        return {"error": str(e), "reply": reply}
Several technical details made this approach successful:
1. Few-shot learning: We provided just 10 examples (5 spam, 5 ham) across multiple languages
2. Structured output: We forced the LLM to return JSON with confidence scores
3. Robust parsing: We implemented a precise response format and error handling for malformed responses
4. Domain knowledge integration: We included specific instructions about Banglish slang; any domain-specific pattern the LLM might otherwise miss can be added to the prompt in the same way
Sample Outputs
Input: "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg"
LLM Reply:
<response>
{
"message": "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg",
"spam": 1,
"confidence": 95.5
}
</response>
Final output: {"message": "আপনার SIM কার্ড আগামীকাল বন্ধ হয়ে যাবে। ৭২ ঘন্টার মধ্যে রেজিস্ট্রেশন করুন: http://bit.ly/sim-reg", "spam": 1, "confidence": 95.5}
Input: "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you."
LLM Reply:
<response>
{
"message": "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you.",
"spam": 0,
"confidence": 25.8
}
</response>
Final output: {"message": "Dear customer, your internet package will expire tomorrow. Please recharge to continue service. Thank you.", "spam": 0, "confidence": 25.8}
We used few-shot prompting with just five examples each of spam and ham SMS, and used the LLM to annotate a large unlabeled dataset. Based on the confidence scores, we selected 5,000 high-confidence spam messages and 5,000 ham messages. This initial 10,000-message dataset was then used to train our first spam detection model.
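As an illustration of that confidence-based selection, here is a minimal sketch assuming the LLM annotations were collected into a CSV with message, spam, and confidence columns (the file name, column names, and thresholds are assumptions):
import pandas as pd

# Assumed columns: "message", "spam" (0/1), "confidence" (0-100, confidence of being spam)
df = pd.read_csv("llm_annotations.csv")

# High-confidence spam, plus messages the LLM was confident are NOT spam
spam = df[(df["spam"] == 1) & (df["confidence"] >= 90)].head(5000)
ham = df[(df["spam"] == 0) & (df["confidence"] <= 10)].head(5000)

seed_dataset = pd.concat([spam, ham]).sample(frac=1, random_state=42)  # shuffle
seed_dataset.to_csv("seed_training_set.csv", index=False)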
The key insight was iterative improvement (active learning): we used this initial model to annotate additional data from our unlabeled pool, manually verified a sample of its predictions, and added the validated predictions back to our training set. To speed up verification, we clustered the messages with the k-means algorithm so that similar SMS could be reviewed together instead of checking every message one by one. With each iteration, the model’s performance improved significantly. This bootstrapping approach allowed us to efficiently create a high-quality labeled dataset from millions of unlabeled messages.
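To illustrate the clustering trick, here is a minimal sketch that groups similar SMS with TF-IDF character n-grams and scikit-learn’s KMeans so a reviewer can validate whole clusters at once; the feature choice and cluster count are assumptions, not the exact setup we used:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_for_review(messages, n_clusters=50):
    """Group similar SMS so a reviewer can validate clusters instead of single messages."""
    # Character n-grams behave reasonably across English, Bengali, and Banglish text
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=20000)
    features = vectorizer.fit_transform(messages)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_ids = kmeans.fit_predict(features)
    clusters = {}
    for message, cid in zip(messages, cluster_ids):
        clusters.setdefault(int(cid), []).append(message)
    return clusters  # review a few messages per cluster rather than every SMS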
Technical Insights and Best Practices
Through these projects, we identified several critical factors for successful LLM-assisted annotation:
1. Prompt Engineering is Critical
The quality of your annotations depends heavily on your prompt design. Our guidelines, with a small template sketch after the list:
- Be explicit about output format (use response templates)
- Include representative examples across categories
- Provide domain-specific context when relevant
- Request confidence scores to enable quality filtering
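Putting these guidelines together, a small prompt-builder might look like the sketch below; the template wording is illustrative rather than the exact prompt from either project:
def build_annotation_prompt(task_description, labels, examples, item):
    """Assemble an annotation prompt: explicit format, few-shot examples, and a confidence request."""
    prompt = (
        f"{task_description}\n"
        f"Answer with exactly one of: {', '.join(labels)}.\n"
        "Also return a confidence score between 0 and 100.\n"
        'Respond only with JSON: {"label": "...", "confidence": ...}\n\n'
    )
    for ex in examples:  # representative examples across categories
        prompt += f"Input: {ex['text']}\nLabel: {ex['label']}\n\n"
    prompt += f"Input: {item}\nLabel:"
    return prompt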
2. Model Selection Matters
We experimented with various models:
- For vision tasks: Llama 3.2 Vision (11B) provided the best balance of accuracy and speed
- For NLP tasks: Llama 3 (8B) worked surprisingly well, even for multilingual content
- Hosting locally with Ollama provided the best cost/performance ratio for our use cases
3. Post-Processing Logic is Essential
Raw LLM outputs require normalization; a helper sketch follows this list:
- Strip punctuation and standardize casing
- Parse structured outputs carefully with error handling
- Implement confidence thresholds to filter results
- Map raw classifications to your target schema
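In code, these post-processing steps might look like the helpers sketched below; the regex and fallback behavior mirror the spam example above but are illustrative, not a fixed API:
import json
import re

def normalize_label(raw_text, label_map, default):
    """Strip punctuation/casing and map the raw LLM answer onto the target schema."""
    cleaned = raw_text.lower().strip().strip(".!?\"'")
    return label_map.get(cleaned, default)

def parse_json_annotation(reply, confidence_threshold=80.0):
    """Extract a JSON payload and drop annotations below the confidence threshold."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None  # malformed response: queue the item for re-annotation
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if float(parsed.get("confidence", 0.0)) < confidence_threshold:
        return None  # low confidence: leave for manual review
    return parsed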
4. Validation Strategy
Even with LLM assistance, validation remains important; a sampling sketch follows this list:
- Randomly sample and manually check 2–5% of annotations, or use a clustering method to speed up your manual validation
- Focus validation efforts on edge cases and low-confidence predictions
- Create a feedback loop to improve prompts based on validation findings
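One way to implement this sampling strategy is sketched below; the 3% rate and the confidence cut-off are illustrative defaults, not figures from the projects above:
import random

def select_for_review(annotations, sample_rate=0.03, low_conf_threshold=70.0):
    """Pick every low-confidence annotation plus a random slice of the rest for manual review."""
    low_conf = [a for a in annotations if a["confidence"] < low_conf_threshold]
    rest = [a for a in annotations if a["confidence"] >= low_conf_threshold]
    sample_size = max(1, int(len(rest) * sample_rate)) if rest else 0
    return low_conf + random.sample(rest, sample_size)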
Beyond Classification: Other Applications
While we’ve focused on classification tasks, LLM-assisted annotation works well for a wide variety of tasks across both vision and NLP domains:
NLP Tasks
- Named Entity Recognition: Identifying people, organizations, locations, etc. in text
- Relationship Extraction: Determining relationships between entities in text
- Sentiment Analysis: Labeling text as positive, negative, or neutral
- Intent Classification: Categorizing user queries by their intended purpose
- Text Summarization: Creating reference summaries for training extractive or abstractive models
- Question Answering: Generating question-answer pairs from passages
- Machine Translation: Creating parallel corpora across languages
- Coreference Resolution: Identifying expressions that refer to the same entity
- Part-of-Speech Tagging: Labeling words with their grammatical categories
- Text Style Transfer: Identifying examples of different writing styles
- Dialogue State Tracking: Annotating user intents and dialogue states
- Content Moderation: Flagging toxic, harmful, or inappropriate content
- Semantic Role Labeling: Identifying predicates and their arguments
- Aspect-Based Sentiment Analysis: Identifying sentiment toward specific aspects
- Data Augmentation: Generating variations of existing examples
Computer Vision Tasks
- Object Detection: Identifying and localizing objects in images
- Image Classification: Categorizing images into predefined classes
- Semantic Segmentation: Pixel-level classification of image content
- Instance Segmentation: Identifying and separating individual object instances
- Image Captioning: Creating descriptive captions for images
- Visual Question Answering: Generating QA pairs about images
- Facial Expression Recognition: Labeling emotions from facial images
- Pose Estimation: Identifying human body poses in images
- Scene Understanding: Describing the overall context of an image
- Action Recognition: Labeling activities in images or video frames
- Visual Relationship Detection: Identifying relationships between objects
- Depth Estimation: Creating reference depth maps for training depth models
- Anomaly Detection: Identifying unusual patterns in visual data
- Visual Attribute Classification: Labeling specific attributes of objects
- Medical Image Annotation: Identifying and segmenting anatomical structures
Conclusion: A New Annotation Paradigm
LLM-assisted annotation represents a fundamental shift in how we approach data preparation for ML systems. By leveraging LLMs as annotation assistants rather than end-to-end solutions, we gain:
- Speed: Annotation that would take weeks can be completed in days
- Cost efficiency: 80–90% reduction in annotation costs
- Consistency: More uniform application of annotation guidelines
- Flexibility: Works across domains, languages, and modalities
This approach doesn’t eliminate the need for human oversight, but it dramatically changes the equation, transforming annotation from a project bottleneck into a streamlined process that accelerates the entire ML development cycle.
What’s particularly exciting is how accessible this approach has become. With open-source models like Llama 3 and local inference servers like Ollama, even small teams can implement LLM-assisted annotation workflows without significant infrastructure investments.
The next time you’re facing a data annotation challenge, consider letting an LLM be your first annotator. Your future self (and budget) will thank you.
What annotation challenges have you faced in your ML projects? Have you experimented with using LLMs for annotation? Share your experiences in the comments below!
THANK YOU.
ABOUT ME
Abdullah Al Munem
Machine Learning Engineer at REVE Systems
LinkedIn: https://www.linkedin.com/in/abdullah-al-munem/