Demystifying Google's Data Gemma
Author(s): Chirag Agrawal Originally published on Towards AI. Photo by Alvaro Reyes on Unsplash. Discover how Google's Data Gemma leverages the Data Commons knowledge graph to tackle AI hallucinations. In this blog post, we'll explore how Data Gemma aims to improve the …
How I'd Learn to Become a Data Engineer in 2025.
Author(s): Kamireddy Mahendra Originally published on Towards AI. A clear guide, if I could start over again from the beginning. Photo by ThisisEngineering on Unsplash. My journey into the world …
What are Vector Databases?
Author(s): Ayo Akinkugbe Originally published on Towards AI. Photo from Unsplash. Introduction: Vector databases are databases designed specifically for storing vector embeddings. If a vector is a data representation having magnitude and direction, what then are vector embeddings? Vector embeddings …
Build and Run Data Pipelines with SageMaker Pipelines
Author(s): Jake Teo Originally published on Towards AI. Leverage AWS's MLOps platform to run your large data processing workloads seamlessly. Image from Amazon's SageMaker official website [1]. In this article, I will show how you can run long-running, repetitive, centrally managed and …
Volga - Open-source Feature Engine for real-time AI - Part 2
Author(s): Andrey Novitskiy Originally published on Towards AI. This is the second part of a 2-post series describing Volga's architecture and technical details. For motivation and the problem's background, see the first part. Volga river. TL;DR: Volga is an open-source real-time feature …
Volga - Open-source Feature Engine For Real-time AI - Part 1
Author(s): Andrey Novitskiy Originally published on Towards AI. This is the first part of a 2-post series describing the background and motivation behind Volga. For technical details, see the second part. Volga river. TL;DR: Volga is an open-source, self-serve, scalable data/feature calculation …
Unlocking the Gates to Success: Dive into SQL Interview Questions from Leading MAANG Companies
Author(s): Kamireddy Mahendra Originally published on Towards AI. "Consistent practice is the key to unlocking success in clearing any coding interview." Concepts used: window functions, CTEs, joins, subqueries, and GROUP BY. Photo by Christian Wiediger on Unsplash. Q1. Assume you're given a …
Simplify Your Data Engineering Journey: The Essential PySpark Cheat Sheet for Success!
Author(s): Kamireddy Mahendra Originally published on Towards AI. "It is not important to complete tasks blindly. It is important to complete tasks more efficiently and more effectively." Photo by Markus Winkler on Unsplash. Yes, it is important to understand before getting …
Revolutionising Machine Learning: Achieving Top 4% in Kaggle with AutoGluon in Just 7 Lines of Code
Author(s): Daniel Voyce Originally published on Towards AI. AutoGluon forecasting. Since starting a new Data Engineering role at Slalom _build, I realized I needed to refresh my ML experience as it was a couple of years out of date. A couple of …
Deletion Vectors in Delta Tables: Speeding Up Operations in Databricks
Author(s): Muttineni Sai Rohith Originally published on Towards AI. Traditionally, Delta Lake supports only the Copy-On-Write paradigm, in which underlying data files are changed anytime a file has been written. Example: When a single row in a file is deleted, the entire …
Understanding Data Lineage: From Source to Destination
Author(s): Muttineni Sai Rohith Originally published on Towards AI. I went to a restaurant yesterday, "Anthera." After eating my fourth or fifth piece of pepper chicken, which, by the way, was delicious, I started to be amazed by our capability to digest …
Data Cleaning in Python
Author(s): Louis Adibe Originally published on Towards AI. Master data cleaning in Python using the pandas library. Photo by Scott Graham on Unsplash. Today, I will show you how to implement data cleaning using pandas. The dataset used in this publication comes from OpenRice Hong Kong …
Understanding SCD - Slowly Changing Dimensions
Author(s): Saniya Parveez Originally published on Towards AI. Introduction: In the dynamic realm of data management, the concept of Slowly Changing Dimensions (SCD) emerges as a crucial paradigm. SCD constitutes a fundamental principle in the field of data warehousing and database administration, …
Orchestrate Machine Learning Pipelines with AWS Step Functions
Author(s): Mike Shakhomirov Originally published on Towards AI. Advanced Data Engineering and ML Ops with Infrastructure as Code. Photo by Markus Winkler on Unsplash. This story explains how to create and orchestrate machine learning pipelines with AWS Step Functions and deploy them using Infrastructure …
Chat with Your BigQuery Data
Author(s): Benedict Neo Originally published on Towards AI. Made with Excalidraw. Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating code. One popular use case for code generation is in Text-To-SQL tasks, where the goal is to automatically …