Author | Bai Yu

Data silos, often called "data islands," are familiar to every technical team. As IT development progresses, companies inevitably build a variety of business systems. These systems operate independently, and the data each one generates stays isolated from the rest, making it difficult for the company to share and integrate data and giving rise to "data islands."

Because data is scattered across different databases and message queues, a computing platform that accesses it directly may run into problems with availability, transmission latency, and even system throughput. Rising to the business level, we find these scenarios come up all the time: summarizing business transaction data, migrating data from a legacy system to a new one, and integrating data across different systems. To integrate data in a more real-time and efficient way and support a wide range of business scenarios, companies usually turn to ETL tools to achieve these goals.

Hence the variety of solutions enterprises have explored: custom scripts, the Enterprise Service Bus (ESB), Message Queues (MQ), and Enterprise Application Integration (EAI), which designs an underlying structure that spans an enterprise's heterogeneous systems, applications, and data sources to achieve seamless data sharing and exchange.

Although these methods achieve effective real-time processing to some degree, they also force enterprises into a trade-off: real-time but not scalable, or scalable but limited to batch processing. Meanwhile, as data technology and business requirements keep evolving, enterprises' demands on ETL keep rising:

  • In addition to transactional data, it must be able to handle increasingly rich data sources such as logs and metrics;
  • Batch processing speed needs to improve further;
  • The underlying technical architecture needs to support real-time processing and evolve toward being event-centric.

The stream processing/real-time processing platform is thus the cornerstone of event-driven interaction. It gives companies a global data/event link, instant data access, a single system for managing global data, and continuous indexing/query capabilities. It is precisely in the face of these technical and business needs that Kafka offers a new approach:

  • As a real-time, scalable message bus, it removes the need for enterprise application integration middleware;
  • It provides streaming data pipelines to all message-processing destinations;
  • It serves as the basic building block of stateful stream-processing microservices.

Let's take the data analytics scenario of a shopping website as an example. To achieve refined operations, the operations team and product managers need to aggregate many kinds of user behavior data, business data, and other data, including but not limited to the following (a sketch of how such an event is published appears after the list):

  1. User behavior data such as clicks, page views, add-to-cart actions, and logins;
  2. Basic log data;
  3. Data actively uploaded by the app;
  4. Data from databases;
  5. Others.
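All of these are published into Kafka by the producing applications. As a minimal sketch, assuming the kafka-python client, a broker at localhost:9092, and a hypothetical topic and event schema, publishing one click event might look like this:

```python
# Minimal sketch: publish a user-behavior event to Kafka.
# Assumptions: the kafka-python client, a broker at localhost:9092,
# and a hypothetical "user_behavior" topic and event schema.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Example click event; real payloads depend on your tracking schema.
event = {"user_id": "u-1001", "action": "click", "page": "/item/42", "ts": 1700000000}
producer.send("user_behavior", value=event)
producer.flush()  # make sure the event is actually sent before exiting
```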

Once these data are collected into Kafka, the data analysis tools obtain what they need from Kafka for analysis and computation. Because Kafka ingests many data sources in a variety of formats, the data must be cleaned, for example filtered and reformatted, before it enters the downstream analysis tools. Here the R&D team has two options: (1) write code that consumes messages from Kafka, cleans them, and sends them to the target Kafka topic; (2) use an existing component for data cleaning and transformation, such as Logstash, Kafka Streams, Kafka Connect, or Flink.
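For option (1), the shape of the code is straightforward: a long-running consumer that reads from the raw topic, applies the cleaning rules, and forwards valid records. A minimal sketch, again assuming the kafka-python client and hypothetical topic names and cleaning rules:

```python
# Minimal sketch of option (1): consume, clean, and forward.
# Topic names, the cleaning rules, and the broker address are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "user_behavior_raw",  # hypothetical source topic
    bootstrap_servers="localhost:9092",
    group_id="etl-cleaner",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    # Filtering: drop records that lack a user_id.
    if not event.get("user_id"):
        continue
    # Formatting: normalize the action field.
    event["action"] = str(event.get("action", "")).strip().lower()
    producer.send("user_behavior_clean", value=event)  # hypothetical target topic
```

The catch is that this consumer now has to be deployed, scaled, monitored, and kept alive by the R&D team itself, which leads directly to the problems discussed below.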

At this point a question naturally arises. Kafka Streams, as a stream-processing library, directly provides concrete classes for developers to call, and the developer largely controls how the whole application runs, which makes it convenient to use and debug. Is there any problem with this? Although the approaches above can indeed solve the problem quickly, their drawbacks are just as obvious:

  • The R&D team has to write the code itself and keep maintaining it over time, so the operations and maintenance cost is relatively high;
  • For many lightweight or simple computing requirements, the technical cost of introducing a new component is too high, and a technology-selection process is required;
  • Once a component is chosen, the R&D team has to keep learning and maintaining it, which brings unpredictable learning and maintenance costs.


To solve these problems, we provide a more lightweight solution: the Kafka ETL feature.

With the Kafka ETL feature, you only need to do some simple configuration in the Kafka console and write a piece of cleaning code online to achieve ETL. Concerns such as high availability and maintenance are handed over entirely to Kafka.

Next, we show how to quickly create a data ETL task in just three steps.

Step 1: Create the task

Select the Kafka source instance and source topic, then the corresponding Kafka target instance and target topic, and configure the initial consumer position, the failure-handling policy, and the resource creation method.
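For orientation, the options configured in this step can be summarized as follows. The key names here are purely illustrative; they mirror the console fields rather than any real API:

```python
# Hypothetical summary of the Step 1 console options.
# Key names and values are illustrative only, not a real API.
etl_task = {
    "source_instance": "alikafka-source-instance",  # Kafka source instance
    "source_topic": "user_behavior_raw",            # source topic
    "target_instance": "alikafka-target-instance",  # Kafka target instance
    "target_topic": "user_behavior_clean",          # target topic
    "initial_position": "latest",                   # where consumption starts
    "failure_policy": "retry",                      # how failed messages are handled
    "resource_creation": "auto",                    # how resources are created
}
```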


Step 2: Write the main ETL logic

We can choose Python 3 as the function language.

A variety of data cleaning and data transformation templates are also provided here, covering common functions such as rule-based filtering, string replacement, and adding prefixes/suffixes.
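As a minimal sketch, a Python 3 cleaning function combining these template-style operations might look like the following. The handle() entry point and its contract are assumptions for illustration; the actual function signature is defined by the console's template:

```python
# Minimal sketch of Step 2 cleaning logic in Python 3, combining
# rule-based filtering, string replacement, and adding a prefix.
# The handle() entry point is hypothetical; the real signature is
# defined by the Kafka ETL console template.
import json


def handle(message: bytes):
    """Return the cleaned record, or None to filter the message out."""
    event = json.loads(message.decode("utf-8"))

    # Rule filtering: keep only click events.
    if event.get("action") != "click":
        return None

    # String replacement: normalize the page path separator.
    event["page"] = event.get("page", "").replace("\\", "/")

    # Add a prefix to mark records that have been cleaned.
    event["user_id"] = "clean_" + str(event.get("user_id", ""))

    return json.dumps(event).encode("utf-8")
```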


Step 3: Configure the task's runtime and exception-handling parameters, then execute it


As you can see, with no extra component integration and no complex configuration, the lighter, lower-cost Kafka ETL needs only three to five visual configuration steps to launch an ETL task. For teams with relatively simple data ETL requirements, Kafka ETL is an excellent choice that lets them focus more on business development.

Such an easy and convenient ETL feature is not to be missed! Say goodbye to cumbersome scripts and to component selection and integration: scan the QR code or click the link ( https://www.aliyun.com/product/kafka?utm_content=se_1009650951 ) to experience a more relaxed ETL!

[QR code]

