


Getting Started with Apache Kafka — Beginners Tutorial

Last Updated on July 20, 2023 by Editorial Team

Author(s): Vivek Chaudhary

Originally published on Towards AI.


The objective of this article is to build an understanding of what Kafka is, why it is used, its architecture, and the roles of producers, consumers, brokers, and the other components of the Kafka ecosystem, followed by a small coding exercise with Python and Kafka.

Kafka Ecosystem

1. What is Kafka?

Apache Kafka is a distributed, publish-subscribe-based, durable messaging system for exchanging data between processes, applications, and servers. In other words, it is an enterprise messaging system that is highly scalable, fault-tolerant, and agile.

Kafka Messaging system

Different applications can connect to a Kafka cluster and push messages/records to a topic. A message/record can be anything from database table records to application events or web server logs. A record consists of a key and a value, which are the mandatory attributes; a timestamp and headers are optional. A topic, in simple terms, can be understood as a database table that holds records or messages.
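As a quick illustration, here is a minimal, hedged sketch of sending a single record with a key, a value, and a header using the kafka-python client (the topic name demo1, the key/value/header contents, and the local broker address are just example values, matching the setup used later in this article):

#record_sketch.py (illustrative only)
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
# key and value are sent as bytes; headers are optional (name, bytes) pairs
producer.send('demo1',
              key=b'user-42',
              value=b'page_view',
              headers=[('source', b'web-server')])
producer.flush()  # wait until the record has actually been delivered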

2. Kafka Architecture

Architecture

The story of Apache Kafka architecture revolves around four core APIs: Producer, Consumer, Streams, and Connector. A cluster is basically a collection of one or more Kafka servers, or brokers, that together form the distributed messaging system. Let’s understand the APIs in more detail.

Producer API: enables an application to connect to Kafka Brokers and publish a stream of records or messages to one or more Kafka topics.

Consumer API: enables an application to connect to Kafka Brokers and consume a stream of records or messages from one or more Kafka topics.

Streams API: enables applications to consume an input stream from one or more topics and produce an output stream to one or more topics, thus allowing the transformation of streams.

Connector API: allows writing reusable producer and consumer code, e.g., reading data from a database and publishing it to a Kafka topic, or consuming data from a Kafka topic and writing it to a database.

What is a Broker?

A broker is a single server, or instance, of a Kafka cluster. A broker stores the messages of one or more topics. The broker instance that we connect to first in order to access the messages or records is known as the bootstrap server. A cluster can have hundreds of brokers.

What is Zookeeper?

ZooKeeper can be understood as the coordination and resource-management part of the whole ecosystem. It keeps metadata about the processes running in the system, performs health checks, handles broker leader election, etc.

What is a Topic?

A topic is the structure that holds the messages or records published to the Kafka brokers. Internally, a topic is divided into partitions, which is exactly where the data is written. These partitions are distributed across the brokers in the cluster. A topic is identified by a name, just like a table in a database. Each partition of a topic has a leader that serves the read/write operations for that partition.

Partitioned Topics

That’s all about the theory part of Kafka, and there are other concepts involved as well that we will cover in upcoming blogs.

3. Use Case and Coding:

With the help of some very basic Python code, we will publish messages in key-value (JSON) format to a Kafka broker and then consume those messages from the broker. But before that, a few services need to be up and running so that we can access the Kafka cluster.
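The Python examples in this article use the kafka-python client library, so make sure it is installed (assuming pip is available on your machine):

pip install kafka-python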

Step 1: Start ZooKeeper

D:\Kafka_setup\kafka_2.12-2.5.0\bin\windows\zookeeper-server-start.bat D:\Kafka_setup\kafka_2.12-2.5.0\config\zookeeper.properties

Step 2: Start the Kafka Broker

D:\Kafka_setup\kafka_2.12-2.5.0\bin\windows\kafka-server-start.bat D:\Kafka_setup\kafka_2.12-2.5.0\config\server.properties

Step 3: Create a topic

D:\Kafka_setup\kafka_2.12-2.5.0\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --create --topic demo1 --partitions 2 --replication-factor 2

I have created a topic with two partitions and a replication factor of 2. Since the cluster has at least two brokers, the partitions are evenly distributed across them, and the replicas of each partition are placed on another broker. With a replication factor of 2, there is no data loss even if a broker goes down. To achieve resilience, the replication factor should be greater than one and at most the number of brokers in the cluster.
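To check how the partitions and their replicas were assigned, the same kafka-topics script can describe the topic; the output lists the leader, replicas, and in-sync replicas for each partition (shown here with the same installation path as above):

D:\Kafka_setup\kafka_2.12-2.5.0\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --describe --topic demo1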

Step 4: Start the consumer console

D:\Kafka_setup\kafka_2.12-2.5.0\bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic demo1

Step 5: Run the producer code to push messages to the Kafka broker

#producer.py
import json
from kafka import KafkaProducer

# serialize each Python dict to JSON and encode it as UTF-8 bytes before sending
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: json.dumps(x).encode('utf-8'))
try:
    for num in range(1, 5):
        msg = {'num': num}
        producer.send('demo1', value=msg)
    producer.send('demo1', {'name': 'vivek'})
    producer.send('demo1', {'name': 'vishakha'})
    producer.flush()  # block until all buffered records have reached the broker
    print('Messages sent to Kafka topic demo1')
except Exception as e:
    print(e)
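The records above are sent without keys, so Kafka spreads them across the two partitions of demo1. As a small, hedged variation (the key_serializer and the key values below are illustrative and not part of the original example), a key can be attached so that all records sharing the same key always land on the same partition:

#producer_keyed.py (illustrative only)
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         key_serializer=lambda k: k.encode('utf-8'),
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))
# both records use the key 'user-1', so they go to the same partition
producer.send('demo1', key='user-1', value={'event': 'login'})
producer.send('demo1', key='user-1', value={'event': 'logout'})
producer.flush()

The console output shown in the next step comes from the original producer.py run, not from this variation.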

Step 6: Check messages on the consumer console

Broker messages

Step 7: Run the consumer code to subscribe to messages

#consumer.py
from json import loads
from kafka import KafkaConsumer

# deserialize each message value from JSON bytes back into a Python dict
consumer = KafkaConsumer('demo1',
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         group_id='my-group',
                         value_deserializer=lambda x: loads(x.decode('utf-8')))
for message in consumer:  # blocks and keeps polling for new messages; stop with Ctrl+C
    msg = message.value
    print(msg)
consumer.close()
#subscribed messages:
{"num": 1}
{"num": 2}
{"num": 3}
{"num": 4}
{"name": "vivek"}
{"name": "vishakha"}

Let’s understand the properties we have used in the producer and consumer code:

bootstrap_servers: sets the host and port the client (producer or consumer) should contact to bootstrap its initial cluster metadata. It is not strictly necessary to set this here, since the default is localhost:9092.

value_serializer: the function that defines how the data should be serialized before sending it to the broker. Here, we serialize each value to a JSON string and encode it as UTF-8 bytes.

enable_auto_commit: makes sure the consumer commits its read offsets periodically, at the configured auto-commit interval (a manual-commit variation is sketched below).

auto_offset_reset: one of the most important arguments. It controls where the consumer starts reading when its group has no committed offset yet (for example, on its first run) or when the committed offset is no longer valid. It can be set to earliest or latest. When set to latest, the consumer starts reading from the end of the log; when set to earliest, it starts reading from the beginning of the partition, and that is exactly what we want in this case.

value_deserializer: applied at the consumer end; it is the reverse of serialization and converts each value from a byte array back into a Python object (here, a dict parsed from JSON).
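As mentioned under enable_auto_commit, offsets can also be committed manually. Here is a small, hedged sketch (the group id and file name are illustrative, not part of the original example) that turns auto-commit off and calls commit() after each message is processed, giving explicit control over when a message counts as read:

#consumer_manual_commit.py (illustrative only)
from json import loads
from kafka import KafkaConsumer

consumer = KafkaConsumer('demo1',
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest',
                         enable_auto_commit=False,  # we commit offsets ourselves
                         group_id='my-group-manual',
                         value_deserializer=lambda x: loads(x.decode('utf-8')))
for message in consumer:
    print(message.value)
    consumer.commit()  # mark this message's offset as processed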

Hurray, we learned some basics of the Apache Kafka messaging system. This blog should help in building a basic understanding of Kafka and of what producer/consumer code looks like.

Summary:

· Kafka is a distributed pub-sub messaging system.

· Kafka architecture and different components.

· Kafka concepts such as producer, consumer, broker, topic, partition, zookeeper, etc.

· Services related to the Kafka ecosystem.

· Topic creation, partition, and replication factor.

· A simple coding exercise with a producer and a consumer in Python.

Thanks to all for reading my blog, and if you like my content and explanation, please follow me on Medium and share your feedback, which will always help all of us enhance our knowledge.


