Big Data - Kafka Series 2 - Kafka Concepts Introduction

Major Reference: https://www.conduktor.io/kafk...

What is Kafka

Apache Kafka: a highly scalable, distributed platform for creating and processing real-time data streams.

Publisher = Producer
Subscriber = Consumer

Clusters

Kafka runs as a cluster of one or more servers, each of which is called a broker.

Topics

A topic is an arbitrary name given to a data set: a unique name for a data stream.

Partitions

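As Kafka is a distributed platform, it can break a topic into smaller partitions and store those partitions on different machines.
This solves the storage capacity problem: a topic can hold more data than fits on any single machine.
A partition is the smallest unit of storage. It is not split further, so each copy of a partition sits entirely on a single machine.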
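
As a rough illustration of the idea above, the sketch below spreads the partitions of one topic over several brokers and maps a message key to a partition. The broker names, the key, and both function names are hypothetical, and CRC32 is only a stand-in hash chosen for simplicity; it is not the hash Kafka itself uses.

```python
import zlib
from typing import Dict, List


def place_partitions(num_partitions: int, brokers: List[str]) -> Dict[int, str]:
    """Spread partitions over brokers round-robin; each partition lives on one broker."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Assumption: CRC32 is used purely for illustration, not Kafka's real hash.
    """
    return zlib.crc32(key) % num_partitions


placement = place_partitions(6, ["broker-0", "broker-1", "broker-2"])
partition = choose_partition(b"store-42", 6)
print(partition, placement[partition])
```

Because the mapping depends only on the key and the partition count, messages with the same key always land in the same partition, and therefore on the same machine.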

Offsets

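An offset is the unique sequence ID of a message within its partition.
The broker assigns it automatically to every message record as it arrives in the partition.
Offsets are unique only within a partition, so locating a message requires the topic name, the partition number, and the offset.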
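
A minimal sketch of how offsets behave, assuming a partition can be modelled as an append-only list. The class name `PartitionLog` is hypothetical and exists only for this example.

```python
class PartitionLog:
    """Append-only log for one partition; the list index serves as the offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset assigned to it on arrival."""
        offset = len(self._records)  # next sequence ID in this partition
        self._records.append(record)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


p0, p1 = PartitionLog(), PartitionLog()
offsets = [p0.append("a"), p0.append("b"), p1.append("c")]
print(offsets)  # [0, 1, 0]
```

The third value is 0 again because each partition numbers its own records independently.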

Kafka Client API

Java library
Producer API / Consumer API

Producers

A producer is an application that sends data (messages, also called message records) to Kafka.
e.g. each line of a text file, or each record of a DB query result, is sent as one message
Database -> Producer --(query result as message)--> Broker -> Consumer
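
The "one line, one message" idea can be sketched without a real cluster. In the example below, `InMemoryBroker` and `produce_lines` are hypothetical names, and the broker is just a dictionary of lists standing in for a real Kafka broker; a real producer would use the Producer API instead.

```python
import io
from collections import defaultdict


class InMemoryBroker:
    """Hypothetical stand-in for a broker: keeps received records per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def receive(self, topic, record):
        self.topics[topic].append(record)


def produce_lines(text_source, broker, topic):
    """Send each non-empty line of a text source as one message record."""
    sent = 0
    for line in text_source:
        line = line.rstrip("\n")
        if line:
            broker.receive(topic, line)
            sent += 1
    return sent


broker = InMemoryBroker()
source = io.StringIO("id=1,total=20\nid=2,total=35\n")  # stands in for a text file
count = produce_lines(source, broker, "invoices")
print(count, broker.topics["invoices"])
```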

Consumers

A consumer is an application that subscribes to topics and reads messages from the brokers.

Consumer Groups

Multiple consumers form a consumer group to share the workload.
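
One way to picture workload sharing is to divide the partitions of a topic among the members of a group so that every partition has exactly one owner. The sketch below uses a simple round-robin rule as an assumption; Kafka's own assignment strategies are more involved, and the function and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the group's consumers; one owner per partition."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


two = assign_partitions(range(4), ["c1", "c2"])
three = assign_partitions(range(4), ["c1", "c2", "c3"])
print(two)    # {'c1': [0, 2], 'c2': [1, 3]}
print(three)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```

Adding a third consumer reduces the load on the other two, which is the point of forming a group.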

Brokers

A broker is a Kafka server: the central system that receives messages from producers, stores them, and serves them to consumers.

ZooKeeper

Kafka Connect

A component of Kafka for moving data between external systems and a Kafka cluster.

Kafka Connect itself runs as a cluster. Each individual unit is called a worker, and workers can be added dynamically as the workload requires.
Source connectors get data from common data sources into Kafka. (Source Tasks)
Sink connectors publish data from Kafka to common data stores. (Sink Tasks)

Part of an ETL pipeline:
Data source (e.g. IBM DB2) -> Source Connector cluster (internally uses the Kafka Producer API) -> Kafka cluster -> Sink Connector cluster (internally uses the Kafka Consumer API) -> Storage (e.g. Snowflake)

Connectors are reusable: for common systems you can use an existing connector without writing a single line of code.
A connector is packaged as a JAR file and can be used by others.
To support a new system, implement the connector's Java classes and then configure them.

https://www.confluent.io/hub/

Kafka Connect Transformations

Single Message Transformations (SMTs)

Add a new field in your record using static data or metadata
Filter or Rename fields
Mask some fields with a Null Value
Change the Record Key
Route the record to a different Kafka Topic

Pros: We can apply simple transformations or changes to each message on the fly.
Cons: We cannot apply complex transformations, and we cannot validate data in real time.
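
To make the per-message nature of SMTs concrete, the sketch below models a record as a plain dictionary and chains a few small functions that mirror the operations listed above. All function names and the record layout are hypothetical; real SMTs are configured on the connector rather than written like this.

```python
def insert_static_field(record, name, value):
    """Add a new field with static data to the record value."""
    return {**record, "value": {**record["value"], name: value}}


def mask_field(record, name):
    """Mask a field by replacing its content with a null value."""
    return {**record, "value": {**record["value"], name: None}}


def rename_field(record, old, new):
    """Rename a field in the record value."""
    value = dict(record["value"])
    value[new] = value.pop(old)
    return {**record, "value": value}


def route(record, new_topic):
    """Send the record to a different topic."""
    return {**record, "topic": new_topic}


def apply_chain(record, transforms):
    """Apply each transformation in order to a single record."""
    for transform in transforms:
        record = transform(record)
    return record


record = {"topic": "invoices", "key": "42", "value": {"store": "NY-1", "card": "4111"}}
chain = [
    lambda r: insert_static_field(r, "source", "pos"),
    lambda r: mask_field(r, "card"),
    lambda r: rename_field(r, "store", "store_id"),
    lambda r: route(r, "invoices-clean"),
]
result = apply_chain(record, chain)
print(result)
```

Each function sees one record at a time and nothing else, which is why this style suits simple changes but not validation or logic that spans several messages.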

Kafka Streams

A Java/Scala library for creating real-time stream processing applications.
The input data must be in a Kafka topic.
Kafka Streams can be embedded into microservices.
Deploy anywhere; no separate processing cluster is needed.
Out-of-the-box parallel processing, scalability, and fault tolerance.

KSQL

A SQL interface to Kafka Streams, used for querying and processing streaming data in real time.
KSQL runs in 2 modes: interactive mode and headless mode.

Applications:

Grouping and aggregating on Kafka topics
Grouping and aggregating over a time window

Apply filters

SELECT store_id, COUNT(*)
FROM invoices
WINDOW TUMBLING (SIZE 1 HOUR)
WHERE UCASE(locality)='NY'
GROUP BY store_id;

Join two topics
Sink the result of your query into another topic
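
The windowed query above can be approximated in plain Python to show what a tumbling window does: each record falls into exactly one fixed, non-overlapping one-hour bucket. The `ts` field (event time in epoch seconds) and the sample invoices are assumptions made for this sketch; they do not appear in the original query.

```python
from collections import Counter

WINDOW_SECONDS = 3600  # WINDOW TUMBLING (SIZE 1 HOUR)


def count_by_store_per_hour(invoices):
    """Count NY invoices per (store_id, window start), one bucket per hour."""
    counts = Counter()
    for inv in invoices:
        if inv["locality"].upper() != "NY":  # WHERE UCASE(locality)='NY'
            continue
        window_start = inv["ts"] - inv["ts"] % WINDOW_SECONDS
        counts[(inv["store_id"], window_start)] += 1  # GROUP BY store_id
    return dict(counts)


invoices = [
    {"store_id": "s1", "locality": "ny", "ts": 10},
    {"store_id": "s1", "locality": "NY", "ts": 3599},
    {"store_id": "s1", "locality": "NY", "ts": 3600},
    {"store_id": "s2", "locality": "CA", "ts": 20},
]
result = count_by_store_per_hour(invoices)
print(result)  # {('s1', 0): 2, ('s1', 3600): 1}
```

The record at `ts` 3600 starts a new window, and the CA invoice is dropped by the filter before any counting happens.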

金金
