Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It is designed as a high-throughput, fault-tolerant, horizontally scalable messaging system. Originally developed at LinkedIn, it was later open-sourced and donated to the Apache Software Foundation. Kafka is primarily used for three key functions:
Publish and Subscribe: Kafka allows applications to publish and subscribe to streams of records, which makes it similar to a message queue or enterprise messaging system.
Store Streams of Data: Kafka stores streams of records in a fault-tolerant, durable manner. Records are retained for a configurable period, making Kafka suitable for applications that need to process and analyze historical data as well as live data.
Process Streams: Kafka allows applications to process streams of data in real time as they are produced. This is useful for real-time analytics, monitoring systems, and event-driven architectures.
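To make the publish step concrete, here is a minimal producer sketch in Java, assuming a broker at localhost:9092 and a hypothetical topic named user-events:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "user-events" is a placeholder topic; the key determines the partition
            producer.send(new ProducerRecord<>("user-events", "user-42", "page_view"));
        } // close() flushes any records still buffered in memory
    }
}
```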
Producers: Producers are applications that write data (records) to Kafka topics.
Consumers: Consumers are applications that read data from topics.
Topic: A topic is a category or feed name to which records are published. Each record in Kafka belongs to a topic.
Partition: A subdivision of a topic. Each partition is an ordered, immutable sequence of messages, which allows Kafka to scale horizontally.
Broker: A Kafka server that stores messages in topics and serves client requests. A Kafka cluster consists of multiple brokers.
Replication: The process of copying data across multiple brokers to ensure durability and availability. Each partition can have multiple replicas.
Leader and Follower: In a replicated partition, one broker acts as the leader (handling all reads and writes), while the others are followers (replicating data from the leader).
Offset: A unique identifier for each message within a partition, allowing consumers to track their progress.
Consumer Lag: The difference between the latest message offset in a topic and the offset of the last message processed by a consumer. It indicates how far a consumer is behind.
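The consumer side ties several of these terms together: a consumer subscribes as part of a group, reads records along with their offsets, and commits its progress (the gap between the committed offset and the newest offset is its lag). A minimal sketch, assuming the same broker and topic as the producer example above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // consumer group id (placeholder)
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");         // commit offsets manually below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // The offset identifies this record's position within its partition
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                consumer.commitSync(); // record progress; unprocessed records show up as lag
            }
        }
    }
}
```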
Schema Registry: A service for managing the schemas used in Kafka messages, ensuring that producers and consumers agree on data formats. It supports Avro, Protobuf, and JSON Schema, and ensures that schema evolution is handled safely (e.g., forward and backward compatibility).
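As a sketch of how a producer opts in, here is a configuration wired to Confluent's Schema Registry; the localhost addresses are assumptions:

```java
import java.util.Properties;

public class AvroProducerConfig {
    // A sketch of producer settings that route values through the Schema Registry,
    // assuming the registry runs at http://localhost:8081.
    static Properties avroProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer registers and looks up schemas in the registry,
        // rejecting records whose schema breaks the configured compatibility rules
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}
```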
Kafka Connect: A framework for integrating Kafka with external systems (databases, file systems, cloud services, etc.). Kafka Connect provides source connectors (to pull data into Kafka) and sink connectors (to push data out of Kafka).
Kafka Streams: A client library for building real-time applications that process data stored in Kafka, allowing for transformations, aggregations, and more.
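The canonical Kafka Streams example is a word count. A hedged sketch, assuming input and output topics named text-input and word-counts:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
             .groupBy((key, word) -> word)   // repartition the stream by word
             .count()                        // stateful aggregation backed by a local store
             .toStream()
             .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```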
Topic Retention: The policy that dictates how long messages are kept in a topic. This can be based on time (e.g., retain messages for 7 days) or size (e.g., retain up to 1 GB of messages).
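Retention is configured per topic. As an illustration, here is how a topic might be created with both time- and size-based limits using the AdminClient; the topic name, partition count, and replication factor are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("clickstream", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",       // keep messages for 7 days
                            "retention.bytes", "1073741824")); // or until a partition log hits 1 GiB
            admin.createTopics(List.of(topic)).all().get(); // block until the broker confirms
        }
    }
}
```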
Transactional Messaging: A feature that allows for exactly-once processing semantics, enabling producers to send messages to multiple partitions atomically.
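A sketch of a transactional producer; the topic names and the transactional.id are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("transactional.id", "order-processor-1"); // stable id required for transactions
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions(); // fences off any older producer with the same id
            producer.beginTransaction();
            try {
                // Both writes commit or abort together, even across topics and partitions
                producer.send(new ProducerRecord<>("orders", "order-1", "created"));
                producer.send(new ProducerRecord<>("payments", "order-1", "pending"));
                producer.commitTransaction();
            } catch (Exception e) {
                // Consumers with isolation.level=read_committed never see aborted records
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```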
Log Compaction: A process that reduces the storage footprint of a topic by retaining only the most recent message for each key, useful for maintaining a state snapshot.
KSQL: KSQL (now ksqlDB) is a streaming SQL engine for Apache Kafka that lets you query, transform, and aggregate data in Kafka topics using SQL-like statements.
Zookeeper: While not part of Kafka's core data path, ZooKeeper has traditionally been used for managing cluster metadata, broker coordination, and leader election. Note that newer Kafka releases replace ZooKeeper with the built-in KRaft consensus mode.
Durability: Kafka guarantees durability by writing data to disk and replicating it across multiple brokers. Even if some brokers fail, the data remains safe.
High Throughput: Kafka can handle a high volume of data with low latency. It achieves this by batching messages, writing them sequentially to append-only logs, and leveraging the operating system's zero-copy transfer optimization.
Fault Tolerance: Kafka replicates data across brokers, ensuring that if one broker fails, the data can still be read from another broker that holds the replica.
Scalability: Kafka’s partition-based architecture allows horizontal scaling. You can add more brokers to the cluster and reassign partitions across them to rebalance load.
Retention: Kafka allows for configuring the retention policy of messages. You can store messages indefinitely or delete them after a certain period or when the log reaches a specific size. This makes Kafka flexible for different use cases, whether you need short-term processing or long-term storage.
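Several of these guarantees map directly to producer configuration. The sketch below shows settings commonly associated with durability and throughput; the specific values are illustrative, not recommendations:

```java
import java.util.Properties;

public class TunedProducerConfig {
    // A sketch of producer settings trading a little latency for
    // durability and throughput.
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("acks", "all");                 // durability: wait for all in-sync replicas
        props.put("enable.idempotence", "true");  // no duplicates on retry
        props.put("linger.ms", "10");             // throughput: wait up to 10 ms to fill batches
        props.put("batch.size", "65536");         // throughput: 64 KiB batches
        props.put("compression.type", "lz4");     // smaller payloads on the wire and on disk
        return props;
    }
}
```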
- Real-Time Analytics: Kafka is widely used in big data environments where companies want to process massive streams of events in real time. For example, LinkedIn uses Kafka for tracking activity data and operational metrics, feeding into both batch and stream processing systems.
- Log Aggregation: Kafka can aggregate logs from multiple services or applications, making it easier to analyze them or store them for future reference. This is useful for monitoring, diagnostics, and troubleshooting.
- Event Sourcing: Kafka is often used in event-driven architectures, where systems communicate by publishing events to Kafka topics. Consumers can process these events in real-time or later, enabling systems to handle complex workflows and state changes.
- Messaging System: Kafka can replace traditional message brokers like RabbitMQ or ActiveMQ, especially when dealing with high-throughput messaging needs.
- Data Pipelines: Kafka serves as a backbone for large-scale data pipelines, allowing the integration of data across multiple systems, such as databases, analytics platforms, and machine learning systems.
Companies that run Kafka at scale include:
- LinkedIn (where Kafka was originally developed)
- Netflix (for real-time monitoring and analytics)
- Uber (for geospatial tracking and event-based communication)
- Airbnb (for real-time data flow management)
- Twitter (for its log aggregation and stream processing systems)
To run Kafka yourself, you'll need:
- Java: Kafka runs on the JVM, so ensure that a compatible JDK is installed.
- ZooKeeper: Kafka has traditionally used ZooKeeper to manage brokers, topics, and other cluster metadata, and it comes bundled with the Kafka distribution; newer releases can instead run in KRaft mode without ZooKeeper.
- Configure Kafka for production: You'll need to modify the server.properties file (e.g., set the broker ID, configure log retention, tune replication settings).
- Monitoring and logging: Set up metrics and logging tools such as Prometheus and Grafana, or use Kafka's built-in JMX metrics.
For a Spring Boot application that talks to Kafka, add these starter dependencies (a usage sketch follows this list):
- Spring Web
- Spring for Apache Kafka
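Together, those two starters support a simple publish/consume round trip inside a Spring Boot application. A hedged sketch: the topic name, group id, and endpoint path are placeholders, and broker settings would go in application.properties via spring.kafka.bootstrap-servers:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// A REST endpoint publishes to Kafka via the auto-configured KafkaTemplate,
// and a listener consumes from the same (placeholder) topic.
@RestController
class EventController {
    private final KafkaTemplate<String, String> kafkaTemplate;

    EventController(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/events")
    void publish(@RequestBody String payload) {
        kafkaTemplate.send("demo-events", payload); // fire-and-forget publish
    }
}

@Component
class EventListener {
    @KafkaListener(topics = "demo-events", groupId = "demo-group")
    void onEvent(String payload) {
        System.out.println("Received: " + payload); // replace with real processing
    }
}
```

With Spring Boot's auto-configuration, no explicit consumer or producer factory beans are needed; the listener container and template are wired from application properties.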