Major Reference: https://www.conduktor.io/kafk...

What is Kafka

Apache Kafka: a highly scalable, distributed platform for creating and processing real-time data streams.


Publisher = Producer
Subscriber = Consumer


Clusters

Kafka runs as a cluster of one or more servers, each of which is called a broker.

Topics

A topic is an arbitrary name given to a data set: a unique name for a data stream.


Partitions

As Kafka is a distributed platform, it can break a topic into smaller partitions and store those partitions on different machines.
This solves the storage capacity problem of keeping a whole topic on one machine.
A partition is the smallest unit, and each partition sits on a single machine.
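
A minimal sketch of how records might be spread across partitions. It is a toy model, not Kafka's code: it assumes records carry a key and uses CRC32 as a stand-in hash (Kafka's own default partitioner uses a different hash function), and the function name `choose_partition` and the sample keys are hypothetical.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (toy stand-in for a partitioner)."""
    return zlib.crc32(key) % num_partitions

# Hypothetical topic with 3 partitions; in a real cluster each list would
# live on some broker's disk.
partitions = {p: [] for p in range(3)}
for key, value in [(b"store-1", "a"), (b"store-2", "b"), (b"store-1", "c")]:
    partitions[choose_partition(key, 3)].append(value)
```

Because the partition is derived only from the key, records with the same key land in the same partition and keep their relative order there.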

Offsets

A unique sequence ID of a message within its partition.
Automatically assigned by the broker to every message record as it arrives in the partition.
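
The offset idea can be sketched with an append-only list. This is an illustrative model only; the class name `Partition` and its methods are hypothetical and not part of any Kafka API.

```python
class Partition:
    """Toy append-only log: the offset is the record's position in the log."""

    def __init__(self):
        self._log = []

    def append(self, record):
        # The "broker" assigns the next sequence number on arrival.
        offset = len(self._log)
        self._log.append(record)
        return offset

    def read(self, offset):
        # A record is addressed by its offset within this partition.
        return self._log[offset]
```

Offsets here are only meaningful inside one partition, so locating a record takes the topic name, the partition number, and the offset together.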

Kafka Client API

Java library
Producer API / Consumer API

Producers

An application that sends data (messages, also called message records).
e.g. each line of a text file, or each record in the result of a DB query, is sent as one message
Database -> Producer --(query result as message)--> Broker -> Consumer


Consumers

An application that reads (subscribes to) message records from the broker.

Consumer Groups

Multiple consumers form a consumer group to share the workload.
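
A minimal sketch of sharing the workload, assuming the partitions of a topic are divided among the group's consumers so that each partition is read by exactly one consumer in the group. The round-robin strategy and the function name `assign` are illustrative choices, not Kafka's actual assignor.

```python
def assign(partitions, consumers):
    """Deal partitions out to the consumers of one group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

For example, `assign([0, 1, 2, 3], ["c1", "c2"])` gives `c1` partitions 0 and 2 and `c2` partitions 1 and 3. With more consumers than partitions, the extra consumers receive an empty list and sit idle.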

Brokers

Data is stored on the Kafka server (broker = central server system).


ZooKeeper


Kafka Connect

A component of Kafka for moving data between external systems and the Kafka cluster.
Kafka Connect itself runs as a cluster. Each individual unit is called a Worker, and Workers can be added dynamically as the configuration requires.
Source connectors get data from common data sources. (Source Tasks)
Sink connectors publish data to common data stores. (Sink Tasks)

Part of ETL pipeline
Data source like IBM DB2 -> Source Connector cluster (internally uses the Kafka Producer API) -> Kafka cluster -> Sink Connector cluster (internally uses the Kafka Consumer API) -> Storage like Snowflake

Connectors are reusable: with an existing connector there is no need to write a single line of code, only configuration.
A connector is exported as a JAR file and can be used by others.
To support a new system, implement the connector's Java classes and configure them.

https://www.confluent.io/hub/

Kafka Connect Transformations

Single Message Transformations (SMTs)

  1. Add a new field in your record using static data or metadata
  2. Filter or Rename fields
  3. Mask some fields with a Null Value
  4. Change the Record Key
  5. Route the record to a different Kafka Topic

Pros: We can apply simple transformations or changes to each message on the fly.
Cons: We cannot apply complex transformations, and we cannot validate data in real time.
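
The transformations listed above can be pictured as small functions applied to one record at a time. This is a conceptual sketch in plain Python, not the Kafka Connect API: the record layout (a dict with `topic`, `key`, and `value`) and every function name are assumptions made for illustration.

```python
def add_static_field(record, name, value):
    """Add a new field with static data to the record value."""
    return {**record, "value": {**record["value"], name: value}}

def mask_field(record, name):
    """Mask a field by replacing its value with None (null)."""
    return {**record, "value": {**record["value"], name: None}}

def route(record, new_topic):
    """Send the record to a different topic."""
    return {**record, "topic": new_topic}

def apply_chain(record, transforms):
    """Apply each single-message transformation in order."""
    for transform in transforms:
        record = transform(record)
    return record

record = {"topic": "invoices", "key": "42",
          "value": {"store_id": "S1", "card": "1234"}}
chain = [
    lambda r: add_static_field(r, "source", "db2"),
    lambda r: mask_field(r, "card"),
    lambda r: route(r, "invoices-masked"),
]
out = apply_chain(record, chain)
```

Each function sees only the single record passed to it, which is why this style suits field-level changes but not joins, aggregations, or validation against other data.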

Kafka Streams

A Java/Scala library for creating real-time stream processing applications.
Input data is read from Kafka topics.
Kafka Streams can be embedded in microservices.
Deploy anywhere; no separate processing cluster is needed.
Out-of-the-box parallel processing, scalability, and fault tolerance.

KSQL

A SQL interface to Kafka Streams, used for real-time, database-style querying and processing.
KSQL runs in two modes: interactive mode and headless mode.


Applications:

  • Grouping and Aggregating on Kafka Topics
  • Grouping and Aggregating over a time window
  • Apply filters

    SELECT store_id, COUNT(*)
    FROM invoices
    WINDOW TUMBLING (SIZE 1 HOUR)
    WHERE UCASE(locality)='NY'
    GROUP BY store_id;
  • Join two topics
  • Sink the result of your query into another topic
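
To make the windowed query in the list above concrete, here is a rough Python equivalent of what it computes: a per-store count of NY invoices in one-hour tumbling windows. It is a batch sketch over a list, not a streaming engine, and the field names `ts` (event time in seconds), `store_id`, and `locality` are assumed for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 3600  # SIZE 1 HOUR

def tumbling_counts(invoices):
    """Count invoices per (window start, store_id), keeping only NY."""
    counts = Counter()
    for inv in invoices:
        if inv["locality"].upper() != "NY":  # WHERE UCASE(locality)='NY'
            continue
        # Tumbling windows do not overlap: each timestamp falls in one window.
        window_start = inv["ts"] - inv["ts"] % WINDOW_SECONDS
        counts[(window_start, inv["store_id"])] += 1  # GROUP BY store_id
    return dict(counts)
```

An invoice at second 3599 and one at second 3600 fall into different windows, so they are counted separately even for the same store.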


金金