Major Reference: https://www.conduktor.io/kafk...
What is Kafka
Apache Kafka: a highly scalable, distributed platform for creating and processing real-time data streams.
Publisher = Producer
Subscriber = Consumer
Clusters
Kafka runs as a cluster of one or more servers, each of which is called a broker.
Topics
A topic is an arbitrary name given to a data set: a unique name for a data stream.
Partitions
As Kafka is a distributed platform, it can break a topic into smaller partitions and store those partitions on different machines.
This solves the storage capacity problem: a topic can grow beyond the disk of any single machine.
A partition is the smallest unit, and each partition sits on a single machine.
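A minimal sketch of how keyed records could be spread over partitions. It assumes a hash-modulo scheme and uses CRC32 as a stand-in hash (not the hash Kafka's own partitioner uses); `choose_partition`, the three-partition topic, and the sample records are hypothetical.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (stand-in hash: CRC32)."""
    return zlib.crc32(key) % num_partitions

# Hypothetical topic with 3 partitions; each list could live on a different machine.
partitions = {p: [] for p in range(3)}
records = [(b"store-1", "inv-100"), (b"store-2", "inv-101"), (b"store-1", "inv-102")]
for key, value in records:
    partitions[choose_partition(key, 3)].append(value)
```

Because the partition is derived from the key alone, records with the same key always land in the same partition.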
Offsets
A unique sequence ID of a message within its partition.
Assigned automatically by the broker to every message record as it arrives in the partition.
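The offset idea can be sketched as an append-only list whose index is the offset. This is an in-memory illustration only; `PartitionLog` is a hypothetical class, not part of any Kafka API.

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store the record and return the offset assigned to it."""
        offset = len(self._records)  # next sequence number in this partition
        self._records.append(record)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]

log = PartitionLog()
first = log.append("a")
second = log.append("b")
```

Each partition keeps its own sequence, so an offset is only meaningful together with its partition.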
Kafka Client API
Java library
Producer API / Consumer API
Producers
An application that sends data (a message, also called a message record) to Kafka.
e.g. each line of a text file, or each row of a DB query result, is sent as one message
Database -> Producer --(query result as message)--> Broker -> Consumer
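A rough sketch of the "one line, one message" idea. The `send` callback is a hypothetical stand-in for a real producer's send call, and the topic name and sample lines are made up for illustration.

```python
from typing import Callable, Iterable

def produce_lines(lines: Iterable[str], topic: str,
                  send: Callable[[str, str], None]) -> int:
    """Send each non-empty line as one message; return how many were sent."""
    count = 0
    for line in lines:
        message = line.rstrip("\n")
        if message:  # skip blank lines
            send(topic, message)
            count += 1
    return count

# Collect messages in a list instead of talking to a real broker.
sent = []
n = produce_lines(["id=1,total=10\n", "\n", "id=2,total=25\n"],
                  "invoices", lambda t, m: sent.append((t, m)))
```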
Consumers
Consumer Groups
Multiple consumers form a consumer group to share the workload.
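A small sketch of how a group might split a topic's partitions. It assumes a simple round-robin assignment, which is a simplification (Kafka ships several assignment strategies); `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
# With more consumers than partitions, the extra consumer gets nothing.
with_idle = assign_partitions([0], ["c1", "c2"])
```

Each partition has exactly one owner in the group, so adding consumers beyond the partition count does not add throughput.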
Brokers
Data is stored on the Kafka server (broker = the central server system).
ZooKeeper
Kafka Connect
A component of Kafka for moving data between external systems and the Kafka cluster.
Kafka Connect itself runs as a cluster; each individual unit is called a worker, and workers can be added dynamically as needed.
Source connectors get data from common data sources (source tasks).
Sink connectors publish data to common data stores (sink tasks).
Part of an ETL pipeline
Data source like IBM DB2 -> Source Connector cluster (internally uses the Kafka Producer API) -> Kafka cluster -> Sink Connector cluster (internally uses the Kafka Consumer API) -> storage like Snowflake
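Connectors are reusable code: when a connector already exists for a system, there is no need to write a single line of code, only configuration.
A connector is exported as a JAR file and can be used by others.
For a new system, just implement the connector's Java classes and configure them.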
Kafka Connect Transformations
Single Message Transformations (SMTs)
- Add a new field in your record using static data or metadata
- Filter or Rename fields
- Mask some fields with a Null Value
- Change the Record Key
- Route the record to a different Kafka Topic
Pros: we can apply simple transformations or changes to each message on the fly.
Cons: we cannot apply complex transformations, and we cannot validate data in real time.
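The per-message nature of SMTs can be sketched as a chain of small functions, each taking one record and returning a new one. This is a plain-Python illustration, not the Kafka Connect transformation API; the record layout and the helper names (`insert_field`, `mask_field`, `route_to`, `apply_chain`) are hypothetical.

```python
def insert_field(name, value):
    """Add a static field to the record value."""
    def apply(record):
        return {**record, "value": {**record["value"], name: value}}
    return apply

def mask_field(name):
    """Replace a field's content with a null value."""
    def apply(record):
        return {**record, "value": {**record["value"], name: None}}
    return apply

def route_to(topic):
    """Send the record to a different topic."""
    def apply(record):
        return {**record, "topic": topic}
    return apply

def apply_chain(record, transforms):
    """Run the transforms in order, one record at a time."""
    for transform in transforms:
        record = transform(record)
    return record

record = {"topic": "invoices", "key": "store-1",
          "value": {"card": "4111", "total": 10}}
out = apply_chain(record, [insert_field("source", "db2"),
                           mask_field("card"),
                           route_to("invoices-masked")])
```

Each step sees only the current record, which is why this style suits field-level edits but not joins or aggregations across messages.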
Kafka Streams
A Java/Scala/... library for creating real-time stream processing applications.
Input data must be in a Kafka topic.
Embed Kafka Streams into microservices.
Deploy anywhere, no separate processing cluster needed.
Out of the box parallel processing, scalability, and fault tolerance.
KSQL
SQL interface to Kafka Streams. Used for real-time, database-style querying and processing.
KSQL runs in two modes: interactive mode and headless mode.
Applications:
- Grouping and Aggregating on Kafka Topics
- Grouping and Aggregating over a time window
- Apply filters, for example:
SELECT store_id, COUNT(*) FROM invoices WINDOW TUMBLING (SIZE 1 HOUR) WHERE UCASE(locality)='NY' GROUP BY store_id;
- Join two topics
- Sink the result of your query into another topic
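To make the tumbling-window query above concrete, here is a plain-Python sketch of the same logic: filter on locality, bucket each invoice into a one-hour window, and count per store. It assumes each invoice carries a millisecond timestamp in a hypothetical `ts_ms` field; `count_by_store` and the sample data are made up for illustration.

```python
from collections import Counter

WINDOW_MS = 60 * 60 * 1000  # one-hour tumbling window

def count_by_store(invoices):
    """Count NY invoices per (window start, store_id)."""
    counts = Counter()
    for inv in invoices:
        if inv["locality"].upper() != "NY":  # WHERE UCASE(locality)='NY'
            continue
        # Tumbling windows do not overlap: each timestamp maps to one window.
        window_start = inv["ts_ms"] - inv["ts_ms"] % WINDOW_MS
        counts[(window_start, inv["store_id"])] += 1
    return dict(counts)

invoices = [
    {"store_id": "s1", "locality": "ny", "ts_ms": 10},
    {"store_id": "s1", "locality": "NY", "ts_ms": 20},
    {"store_id": "s2", "locality": "CA", "ts_ms": 30},
    {"store_id": "s1", "locality": "NY", "ts_ms": WINDOW_MS + 5},
]
result = count_by_store(invoices)
```

Unlike this batch-style loop, the KSQL query runs continuously and updates its counts as new invoices arrive.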