Getting Started with Kafka

Varuni Punchihewa
8 min read · Mar 28, 2019

What is Kafka?

Kafka is a distributed streaming platform.

It can:

  1. Publish and subscribe to streams of records
  2. Store streams of records durably, with replication
  3. Process streams of records in real time

Kafka Use Cases

  1. As a messaging system
  2. As a storage system
  3. As a stream processor

As a Messaging System

Let us dive deeper into Kafka’s messaging system.

In a typical messaging system, you can find three components.

Messaging system model
  1. A Producer (Publisher) - a client application that sends messages.
  2. A Broker - receives messages from producers and stores them.
  3. A Consumer - reads messages from the broker.
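The three roles above can be sketched with a toy in-memory broker. This is purely illustrative; a real Kafka broker runs as a separate process, persists messages to disk, and speaks a binary protocol over the network:

```python
# Toy sketch of the producer / broker / consumer roles (not real Kafka).
class Broker:
    """Receives messages from producers and stores them in a log."""
    def __init__(self):
        self.log = []          # stored messages, in arrival order

    def receive(self, message):
        self.log.append(message)

class Producer:
    """Client that sends messages to the broker."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, message):
        self.broker.receive(message)

class Consumer:
    """Client that reads messages from the broker, remembering its position."""
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0        # index of the next message to read

    def poll(self):
        messages = self.broker.log[self.offset:]
        self.offset = len(self.broker.log)
        return messages

broker = Broker()
Producer(broker).send("hello")
consumer = Consumer(broker)
print(consumer.poll())  # ['hello']
```

Note that the broker keeps the message even after it has been read; the consumer merely advances its own offset, which is also how Kafka decouples storage from consumption.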

In a large organization, you might find many source systems and destination systems. In order to deliver messages in such a system, you need to have proper data pipelining.

Data pipelining in a large organization

This looks like a mess and is hard to maintain. By introducing a messaging system, we can turn this into a much simpler and neater architecture.

Using a Message System to simplify data pipelining

Kafka is a highly scalable and fault-tolerant messaging system.

Kafka APIs

A Kafka cluster is a group of brokers running on a set of machines. They take message records from producers and store them in message logs. Consumers read these logs from the Kafka cluster, process the records and perform their specific operations.

Stream-processors: These applications read continuous streams of data from the Kafka cluster, process them, and then either store the results back in the cluster or send them directly to other systems.

Connectors: These applications import data from databases into the Kafka cluster, or export data from the cluster to databases.

ZooKeeper: This is an open-source project that came out of the Hadoop project. It provides coordination services for distributed systems. Since Kafka is a distributed system with multiple brokers, it needs a system like ZooKeeper to coordinate various things among these brokers.

Fault-tolerance in Kafka

Most of the time, Kafka will spread your data in partitions across various machines in the cluster. So what happens if one or two machines in the cluster fail? That is where fault tolerance comes into play.

By definition, fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. (Wikipedia)

One simple solution is to make multiple copies of the data and keep them on separate machines. In Kafka, the number of copies is controlled by a setting called the “replication-factor”.

replication-factor = number of total copies

For example, let us say you have three copies of a partition (replication-factor = 3). Kafka stores these copies on three different machines, so even if one machine fails, you still have two copies of your data.

N.B.: The replication-factor is set per topic, not per partition, but it applies to all the partitions within that topic. (You will see this more clearly when setting up Kafka on your machine.)

So, How does it really work?

Kafka uses the “Leader and Follower” model to implement this. For every partition, one broker is elected as the leader, and it then takes care of all the client interactions. That means, when a producer wants to send some data, it connects to the leader and starts sending. It is the leader’s responsibility to receive the messages, store them on its local disk and send back an acknowledgment to the producer.

Similarly, when a consumer wants to read data, it sends a request to the leader. Now, it is the leader’s responsibility to send the requested data back to the consumer.

For every partition, we have a leader and the leader takes care of all the requests and responses.

Leader-Follower model of a Kafka Cluster

For example, if we create a topic with the replication-factor set to 3, the leader of each partition already maintains the first copy. We need two more copies, so Kafka chooses two other brokers as followers to hold them. The followers then copy the data from the leader; they do not talk to the producer or the consumer. This requires a 3-node Kafka cluster (one leader and two followers per partition).
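The leader-follower placement can be sketched as a simple round-robin assignment. This is a simplification of what Kafka actually does (the real assignment also accounts for racks and existing load), and the broker IDs here are hypothetical:

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """For each partition, pick a leader plus (replication_factor - 1)
    followers, rotating over the brokers so leaders are spread out."""
    assert replication_factor <= len(brokers), "need at least as many brokers as copies"
    assignment = {}
    for p in range(num_partitions):
        # rotate the broker list so each partition starts on a different broker
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment

# A 3-node cluster, a topic with 2 partitions and replication-factor 3:
print(assign_replicas(2, 3, [0, 1, 2]))
# {0: {'leader': 0, 'followers': [1, 2]}, 1: {'leader': 1, 'followers': [2, 0]}}
```

Notice that each partition gets a different leader, which is how Kafka spreads the client-facing load across the whole cluster rather than funnelling everything through one broker.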

Another great feature of Kafka is that it combines scalable processing with multi-subscriber support.

Typically, there are two message models.

  1. Queuing
  2. Publish-Subscribe (Pub/Sub)

In a queue, each record goes to one of the consumers in a pool of consumers. In pub/sub the record is broadcast to all consumers.

Queuing allows you to divide up the processing of data over multiple consumer instances, thus making it scalable. Pub/sub allows you to broadcast data across multiple processes but has no way of scaling processing since every message goes to every subscriber.

Kafka generalizes both of the above models through the “consumer group” concept, making it both scalable in processing and multi-subscriber.

As with a queue, a consumer group allows you to divide up the processing over its members. As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
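The queue-plus-broadcast behaviour can be sketched like this: within each group, partitions are divided among the members (queue semantics), while every group independently receives all partitions (pub/sub semantics). This only loosely mirrors Kafka's actual assignors, and the group and consumer names are made up:

```python
def assign_partitions(partitions, groups):
    """Divide partitions among the consumers of each group (queue
    semantics); every group sees all partitions (pub/sub semantics)."""
    plan = {}
    for group, consumers in groups.items():
        plan[group] = {c: [] for c in consumers}
        for i, partition in enumerate(partitions):
            consumer = consumers[i % len(consumers)]  # round-robin within the group
            plan[group][consumer].append(partition)
    return plan

groups = {"billing": ["b1", "b2"], "analytics": ["a1"]}
print(assign_partitions([0, 1, 2, 3], groups))
# {'billing': {'b1': [0, 2], 'b2': [1, 3]}, 'analytics': {'a1': [0, 1, 2, 3]}}
```

Adding a consumer to the "billing" group would shrink each member's share (more parallelism), while adding a whole new group would give another application its own full copy of the stream.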

Setting up Kafka in your local machine

Here I am using Ubuntu 18.04.1.

First, you need to download Kafka from the Apache Kafka downloads page.

Untar it and go inside the directory.

tar -xzf kafka_2.12-2.2.0.tgz
cd kafka_2.12-2.2.0

You need to start the ZooKeeper server before starting the Kafka server. It is available inside your bin directory.

bin/zookeeper-server-start.sh config/zookeeper.properties

Now open another terminal and start the Kafka server.

bin/kafka-server-start.sh config/server.properties ------------ (1)

Now let’s create a topic named “FirstTopic” with two partitions and one replica.

Kafka distributes partitions evenly over the available brokers. Since we have only a single broker here, Kafka will create both partitions on the same machine. replication-factor indicates the total number of copies of a partition that Kafka maintains; we discussed this under Fault-tolerance in Kafka above.

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 2 --topic FirstTopic
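Which of the two partitions a record lands in depends on its key. A simplified sketch of that routing is below; the real default partitioner uses murmur2 hashing (crc32 stands in here), and records without keys are spread across partitions:

```python
import zlib

def partition_for(key, num_partitions):
    """Map a record key to a partition. The same key always lands on
    the same partition, which preserves per-key ordering."""
    # Kafka's default partitioner uses murmur2; crc32 is a stand-in.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# With 2 partitions, a given key is always routed to the same partition:
print(partition_for("user-42", 2) == partition_for("user-42", 2))  # True
```

This is why choosing a good key matters: all records for one key go to one partition, so a few very hot keys can leave some partitions idle.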

You can check whether the topic is created or not.

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Now let’s create a producer.

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic FirstTopic

Now the producer is up and running. Whatever you type now in the terminal, the producer will send that to the broker. But before sending any messages, we’ll start a consumer by opening a new terminal.

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic FirstTopic --from-beginning

Go back to the terminal where your producer is up and running and type some messages to be sent.

>first message
>message 2

You can see those messages appearing on the consumer side.

Creating a Kafka Cluster

I already started a Kafka cluster with my first broker in the previous lines of code. Now I’m going to start two more brokers on the same machine.

In an ideal cluster, each broker runs on its own machine, but Kafka lets you start multiple brokers on a single machine as well.

We’ll use command (1) above to start the new brokers. But before that, we’ll make copies of the broker config file and modify them, because multiple brokers cannot run with the same properties.

Go inside the config folder and locate the server.properties file. Make two copies of that file, namely server2.properties and server3.properties.

We need to change three properties in these files.

  • Open the server2.properties file and locate “broker.id” and set it to 1. (set this to 2 in server3.properties)
broker.id=1

Broker ID is a unique identifier for the broker.

  • Locate “listeners=PLAINTEXT://:9092” and change the port to 9093. If the line is commented out, uncomment it. (Set this to 9094 in server3.properties.)
listeners=PLAINTEXT://:9093

This is the broker port. It is a network port number to which the broker will bind itself. The broker uses this port number to communicate with producers and consumers.

You do not need to change these port numbers if you are starting the brokers on separate machines.

  • Locate “log.dirs=/tmp/kafka-logs” and change the name of the log directory.
log.dirs=/tmp/kafka-logs-2

This is the main data directory of a broker. We do not want all brokers to write into the same directory.
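The three edits above can also be scripted. A minimal sketch, assuming the file names and values used in this article, that rewrites the three properties in a copy of server.properties:

```python
def make_broker_config(src_text, broker_id, port, log_dir):
    """Return server.properties text with broker.id, listeners and
    log.dirs rewritten for an additional broker."""
    out = []
    for line in src_text.splitlines():
        if line.startswith("broker.id="):
            line = f"broker.id={broker_id}"
        elif line.lstrip("#").startswith("listeners=PLAINTEXT://"):
            line = f"listeners=PLAINTEXT://:{port}"  # also uncomments the line
        elif line.startswith("log.dirs="):
            line = f"log.dirs={log_dir}"
        out.append(line)
    return "\n".join(out)

# Example usage, run from the Kafka directory (e.g. with pathlib):
# src = open("config/server.properties").read()
# open("config/server2.properties", "w").write(
#     make_broker_config(src, 1, 9093, "/tmp/kafka-logs-2"))
# open("config/server3.properties", "w").write(
#     make_broker_config(src, 2, 9094, "/tmp/kafka-logs-3"))
```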

Now let’s start two more brokers.

bin/kafka-server-start.sh config/server2.properties
bin/kafka-server-start.sh config/server3.properties

Now I have a 3-node Kafka cluster up and running.

Let’s create a new topic.

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 2 --topic NewTopic

There is a “describe” command which tells you everything that you want to know about a topic.

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic NewTopic

The output will look similar to this (your leader and replica assignments may differ):

Topic: NewTopic  PartitionCount: 2  ReplicationFactor: 3
  Topic: NewTopic  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2
  Topic: NewTopic  Partition: 1  Leader: 2  Replicas: 2,0,1  Isr: 2,0,1

In my case, the ID of the first partition is 0 and of the second is 1. For the first partition, broker 0 is the leader. That means broker 0 stores and maintains the first copy of this partition, and it fulfills all client requests for it. Similarly, broker 2 is the leader for the second partition. Next, the output shows the list of replicas: for the first partition, broker 0 holds the first copy, broker 1 the second and broker 2 the third, so brokers 1 and 2 are the followers. Isr stands for in-sync replicas. You might have three copies, but one of them may not be in sync with the leader, so Isr lists only the replicas that are in sync with it. In my case, all three are in sync.
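The per-partition lines can also be consumed programmatically. A small sketch that parses one tab-separated describe line into its fields (the sample line is illustrative, modeled on the format shown here):

```python
def parse_describe_line(line):
    """Parse one tab-separated 'kafka-topics --describe' partition
    line into a dict of its fields."""
    fields = dict(part.split(": ", 1) for part in line.strip().split("\t"))
    return {
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(b) for b in fields["Replicas"].split(",")],
        "isr": [int(b) for b in fields["Isr"].split(",")],
    }

sample = "Topic: NewTopic\tPartition: 0\tLeader: 0\tReplicas: 0,1,2\tIsr: 0,1,2"
print(parse_describe_line(sample))
# {'partition': 0, 'leader': 0, 'replicas': [0, 1, 2], 'isr': [0, 1, 2]}
```

Comparing the replicas list against the isr list in a script like this is a quick way to spot under-replicated partitions after a broker failure.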

Create a producer for the NewTopic

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic NewTopic

Create consumers for the NewTopic

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic NewTopic --from-beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic NewTopic --from-beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9094 --topic NewTopic --from-beginning

Send a message from the Producer. You can see the message on all three consumers now.

That is all for this article 🙂 If you are thirsty for more knowledge on Kafka, please refer to the official documentation, which includes more details.


Varuni Punchihewa

Software Engineer | Graduate of University of Kelaniya