From raw text to training gold: How to collect and prepare data for custom LLMs
Author(s): Laura Verghote Originally published on Towards AI. Practical guidance for building clean, domain-relevant datasets for fine-tuning, continued pretraining, or training from scratch. If you’ve worked on language models beyond a quick prototype, you already know where the real bottleneck is. It’s …
Debugging Spark at Scale: Slow to Shipped
Author(s): Diogo Santos Originally published on Towards AI. A stepwise playbook to locate the true bottleneck — I/O, shuffle, Python, or memory — and fix it with minimal changes and hard measurements. If you’re here, you’ve got a Spark job that should …
Langfuse: A Technical Guide to Observability in LLM Applications
Author(s): Rachit Originally published on Towards AI. Large Language Models (LLMs) are incredibly powerful, but they’re also stochastic black boxes. You can design the perfect prompt, and yet in production, responses may vary …
Mastering Python Data Pipelines in 2025
Author(s): Code with Margaret Originally published on Towards AI. How I built scalable ETL workflows without losing my sanity. Over the past four years, I’ve built more Python data pipelines than I can count. Some of them ran beautifully; others… well, let’s …
Mastering RAG: Precision from Table-Heavy PDFs
Author(s): Vicky’s Notes Originally published on Towards AI. I just wrapped a customer pilot where “documents” really meant PDFs stuffed with tables, footnotes, and odd layouts. The goal sounded simple: answer two kinds of questions reliably. For semantic questions like “What changed …
AWS Lambda: Serverless Application Is Like Cooking Pasta With a Magic Machine!!!
Author(s): Henry Originally published on Towards AI. How AWS Lambda Powers AI & Data Engineering. AWS Lambda is a serverless compute service that runs your code, so you do not need to maintain servers yourself. It is like …
Beyond Pandas: The Modern Data Analytics and Engineering Techniques With Python (Part 1)
Author(s): Gift Ojeabulu Originally published on Towards AI. Outline: Introduction; The Data Size Decision Framework & Comprehensive Decision Flowchart; A Diagrammatic representation based on Team Syntax Preference, and Performance or Integration Requirements; Real-World Examples: Log file Analysis, E-commerce instance …
How to Augment Wildfire Datasets with Historical Weather Data using Python and Google Earth Engine
Author(s): Ruiz Rivera Originally published on Towards AI. Picture this: You’re a data scientist working with wildfire data, and all you have are basic fire records — location coordinates, timestamps, and maybe a unique fire ID. …
How to Build Bulletproof Data Pipelines with PySpark That Actually Scale
Author(s): Yuval Mehta Originally published on Towards AI. We’re past the era when a CSV, a Pandas DataFrame, and a single machine could handle everything you threw at them. Data is heavier now. It arrives fast, …
Machine Learning at Scale: Why PySpark MLlib Still Wins in 2025
Author(s): Yuval Mehta Originally published on Towards AI. Machine learning may be glamorous when you’re tuning models on Kaggle datasets or demoing GPT wrappers. But in production? It’s a grind. You’re not just building a model. …
Take a Dive Into Delta Lake
Author(s): Disha Verma Originally published on Towards AI. That’s Jerry — the frustrated Data Steward! Remember the time we spoke about Data Warehouse, Data Lake, and Data Lakehouse? Today, we will learn about Delta Lake, which belongs to the same data architecture …
Pipelines to Prompts: Getting started with Databricks and AWS
Author(s): Devi Originally published on Towards AI. A Beginner’s Guide to Building GenAI Applications from Raw Data Using Databricks and AWS. As with my other blogs, we start with the theory, move on to practice, and wrap up with some lessons learnt. NAVIGATION: Why Data …
The Data Stack for AI: RDBMS, Graph, and HTAP
Author(s): Vicky’s Notes Originally published on Towards AI. Earlier this year, I was working with an insurance client who was eager to adopt generative AI to improve customer engagement, but they kept hitting a wall. The AI models were fine; the real …
Beyond Pandas: The Modern Data Processing Toolkit for Data Engineering (Part 1)
Author(s): Gift Ojeabulu Originally published on Towards AI. Outline: Introduction; The Data Size Decision Framework & Comprehensive Decision Flowchart; A Diagrammatic representation based on Team Syntax Preference, and Performance or Integration Requirements; Real-World Examples: Log file Analysis, E-commerce instance …
Premium SSD vs Ultra SSD: Azure Storage Performance for Distributed Databases
Author(s): Richie Bachala Originally published on Towards AI. When building distributed systems in the cloud, storage performance can make or break your application’s success. In this post, we’ll explore how different Azure disk types perform under distributed database workloads, using YugabyteDB as …