Flink + Iceberg 0.11 Practices at the Qunar Data Platform

Author: Yu Dong

Abstract: This article introduces some practices of the Qunar Data Platform using Flink + Iceberg 0.11. The content includes:

  • Background and pain points
  • Iceberg architecture
  • Pain point 1: Kafka data loss
  • Pain point 2: Near real-time Hive is under great pressure
  • Iceberg optimization practice
  • Summary

GitHub address: https://github.com/apache/flink
Everyone is welcome to give Flink a like and a star~

1. Background and pain points

1. Background

In the process of using Flink for real-time data warehousing and data transmission, we ran into some problems, such as Kafka data loss and the performance of the near-real-time data warehouse built with Flink and Hive. The new features of Iceberg 0.11 solve the problems we encountered in these business scenarios. Compared with Kafka, Iceberg has its own advantages in certain scenarios, and here we share some of our practice based on Iceberg.

2. Original architecture scheme

The original architecture uses Kafka to store real-time data, including logs, orders, and ticket data. Flink SQL or the Flink DataStream API then consumes the data and moves it along the pipeline. The platform for submitting SQL and DataStream jobs is developed in-house, and real-time jobs are submitted through it.

3. Pain points

  • Kafka has high storage costs and a large amount of data. Because of the pressure, the Kafka data expiration time is set relatively short; when there is back pressure or backlog, data that is not consumed within that period expires, which causes data loss.
  • Flink supports near-real-time reads and writes on Hive. To relieve the pressure on Kafka, we put data with low real-time requirements into Hive and let Hive use minute-level partitions. However, as metadata keeps growing, the pressure on the Hive metadata becomes increasingly significant, queries become slower, and the pressure on the database that stores the Hive metadata also increases.

2. Iceberg architecture

1. Iceberg architecture analysis

img

Term Analysis

  • Data files

    The files in which an Iceberg table actually stores data; they are generally kept in the data directory and end with ".parquet".

  • Manifest file

    Each row is a detailed description of a data file, including its status, file path, partition information, and column-level statistics (such as the maximum and minimum values of each column and the number of null values). With this file, irrelevant data can be filtered out and retrieval becomes faster.

  • Snapshot

    A snapshot represents the state of a table at a certain moment. Each snapshot version contains the list of all data files at that time. Data files are recorded in different manifest files, the manifest files are listed in a manifest list file, and one manifest list file represents one snapshot.

2. Iceberg query plan

The query plan is the process of finding the files required for a query in the table.

  • metadata filtering

    The manifest file includes the partition data tuple and column-level statistics of each data file. During planning, query predicates are automatically converted into predicates on the partition data and applied first to filter data files. Next, the column-level value counts, null counts, and lower and upper bounds are used to eliminate files that cannot match the query predicate.

  • Snapshot ID

    Each snapshot ID is associated with a group of manifest files, and each group contains many manifest files.

  • Manifest file list

    Each manifest file records the metadata of the current data block, including the maximum and minimum values of the file's columns. Based on this metadata, the query is indexed to the specific data files, so the data can be queried faster, as sketched below.
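
As a hedged illustration (the catalog, table, and values here are hypothetical and simply reuse the names from the demo later in this article), a query whose predicates fall on the partition and sorted columns lets the planner skip every data file whose partition tuple or lower/upper bounds cannot match the predicate:

    -- hypothetical query; days is the partition column, province_id a sorted column
    select *
        from Iceberg_catalog.Iceberg_db.tbl1
        where days = '2021-02-28' and province_id = 110000;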

3. Pain point 1: Kafka data loss

1. Introduction to pain points

We usually choose Kafka for the real-time data warehouse and log transmission. Kafka itself has high storage costs, and data is only retained for a limited time; once consumption lags and the data passes its expiration time, it is lost without ever being consumed.

2. Solution

Bring real-time business data that is not latency-critical into the lake, for example, business that can accept a delay of 1 to 10 minutes. Iceberg 0.11 supports real-time reads via SQL and can also retain historical data. This both reduces the pressure on online Kafka and ensures that data can be read in near real time without being lost.

3. Why can Iceberg only go into the lake in near real time?

img

  1. Iceberg commits transactions at file granularity. Committing a transaction every second is not feasible, otherwise the number of files would explode;
  2. There is no online service node. For real-time, high-throughput, low-latency writes, a purely real-time response cannot be achieved;
  3. Flink writes in units of checkpoints. After the physical data has been written to Iceberg, it cannot be queried directly; only when a checkpoint is triggered is the metadata file written, and the data then changes from invisible to visible. Each checkpoint execution takes a certain amount of time.

4. Analysis of how Flink writes into the lake

img

Component introduction

  • IcebergStreamWriter

    Mainly used to write records into the corresponding avro, parquet, or orc files, generate the corresponding Iceberg DataFiles, and send them to the downstream operator.

  • IcebergFilesCommitter

    Mainly used to collect all DataFile files when a checkpoint arrives and commit the transaction to Apache Iceberg, completing the data write for that checkpoint.

    A DataFile list is maintained for each checkpointId, that is, map<Long, List<DataFile>>, so that even if the transaction of one checkpoint fails to commit, its DataFile files are still kept in state and the data can still be committed to the Iceberg table by a subsequent checkpoint.

5. Flink SQL Demo

Flink + Iceberg real-time ingestion: consume Kafka data, write it to Iceberg, and read data from Iceberg in near real time.

img

5.1 Preliminary work

  • Enable the streaming execution mode for real-time reads and writes

    set execution.type = streaming

  • Turn on the table SQL hint feature so that the OPTIONS attribute can be used

    set table.dynamic-table-options.enabled=true

  • Register an Iceberg catalog for operating on Iceberg tables

    CREATE CATALOG Iceberg_catalog WITH (
      'type'='iceberg',
      'catalog-type'='hive',
      'uri'='thrift://localhost:9083'
    );
  • Ingest Kafka real-time data into the lake (a sketch of possible Kafka_tbl and tbl1 definitions follows this list)

    insert into Iceberg_catalog.Iceberg_db.tbl1
        select * from Kafka_tbl;
  • Real-time transfer between lake tables: tbl1 -> tbl2

      insert into Iceberg_catalog.Iceberg_db.tbl2
        select * from Iceberg_catalog.Iceberg_db.tbl1
        /*+ OPTIONS('streaming'='true',
            'monitor-interval'='10s',
            'start-snapshot-id'='3821550127947089987') */;
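
The demo above assumes that the Kafka source table Kafka_tbl and the Iceberg table tbl1 already exist. As a minimal sketch of what those definitions could look like (the schema, topic name, and Kafka address are assumptions for illustration, not the actual production tables):

    -- Kafka source table in the default catalog (assumed schema and connection settings)
    CREATE TABLE Kafka_tbl (
        id BIGINT,
        days STRING,
        province_id BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ods_tbl1',                            -- assumed topic name
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'iceberg_ingest',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    );

    -- Iceberg target table managed by the catalog registered above
    CREATE DATABASE IF NOT EXISTS Iceberg_catalog.Iceberg_db;
    CREATE TABLE Iceberg_catalog.Iceberg_db.tbl1 (
        id BIGINT,
        days STRING,
        province_id BIGINT
    ) PARTITIONED BY (days);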

5.2 Parameter explanation

  • monitor-interval

    The time interval for continuously monitoring newly submitted data files (default value: 1s).

  • start-snapshot-id

    Read data starting from the specified snapshot ID. Each snapshot ID is associated with a group of manifest metadata files, and each metadata file maps to its own real data files; through the snapshot ID, the data of a certain version can be read.

6. Pitfalls we ran into

We once wrote data into Iceberg through the SQL Client. The files in the data directory kept being updated, but there was no metadata, so queries returned no rows, because Iceberg queries need the metadata to index the real data files. The SQL Client does not enable checkpointing by default and it has to be enabled through a configuration file, so the data directory was written while the metadata directory was not.

PS: Whether ingesting into the lake through SQL or through the DataStream API, checkpointing must be enabled.
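
For jobs submitted through the SQL Client, the checkpoint settings come from the Flink configuration, for example flink-conf.yaml. Below is a minimal sketch with assumed values (the interval and checkpoint path are placeholders, not our production settings); in a DataStream job, calling enableCheckpointing on the StreamExecutionEnvironment serves the same purpose.

    # flink-conf.yaml excerpt (assumed values)
    execution.checkpointing.interval: 60s
    execution.checkpointing.mode: EXACTLY_ONCE
    state.checkpoints.dir: hdfs:///flink/checkpoints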

7. Data sample

The following two pictures show the effect of querying Iceberg in real time; they were taken one second apart and the data has changed in between.

  • one second ago

img

  • Data refreshed after one second

img

4. Pain point 2: the near-real-time Flink + Hive pipeline gets slower and slower

1. Introduction to pain points

Although the near-real-time Flink + Hive architecture supports real-time reads and writes, it brings the following problems as the number of tables and partitions grows:

  • Too much metadata

    Hive partitions were changed to hour or minute level. Although this improves data freshness, the pressure on the metastore becomes obvious: too much metadata slows down the generation of query plans and also affects the stability of other online services.

  • Database pressure increases

    As metadata grows, so does the pressure on the database that stores the Hive metadata; after a while the database has to be expanded, for example its storage space.

img

img

2. Solution

Migrate the original near-real-time Hive tables to Iceberg. Why can Iceberg handle a large amount of metadata, while Hive easily becomes a bottleneck when its metadata grows large?

  • Iceberg maintains its metadata on a scalable distributed file system and has no centralized metadata system;
  • Hive maintains partition-level metadata in the metastore (too many partitions put huge pressure on MySQL), while the metadata within a partition is actually maintained in files (starting a job requires listing a large number of files to determine whether each one needs to be scanned, and the whole process is very time-consuming).

img

5. Optimization practice

1. Small file handling

  • Before Iceberg 0.11, small files were merged by periodically triggering the batch API. The files could be merged, but a set of Actions code had to be maintained, and the merging was not real-time.

    Table table = findTable(options, conf);   // findTable is an in-house helper that loads the Iceberg table
    Actions.forTable(table)
            .rewriteDataFiles()
            .targetSizeInBytes(10 * 1024)      // 10 KB target size, demo value
            .execute();
  • Iceberg 0.11 has a new feature that supports merging small files in streaming mode.

    Data is written with hash shuffling on the partition/bucket key, so files are merged directly at the source. The benefit is that one task processes the data of a given partition and commits its own DataFiles; for example, a task only handles the data of its corresponding partition. This avoids the problem of many tasks each processing and committing lots of small files, and no extra maintenance code is required. You only need to specify the attribute write.distribution-mode when creating the table; the parameter is shared with other engines, such as Spark. A usage sketch follows the table definition below.

    CREATE TABLE city_table (
        province BIGINT,
        city STRING
    ) PARTITIONED BY (province, city) WITH (
        'write.distribution-mode'='hash'
    );
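
As a minimal usage sketch (Kafka_city_tbl is a hypothetical source table with matching columns), a streaming insert into this table shuffles rows by the partition key, so each writer task only produces files for its own partitions:

    insert into city_table
        select province, city from Kafka_city_tbl;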

2. Iceberg 0.11 sort

2.1 Introduction to Sorting

Before Iceberg 0.11, Flink did not support Iceberg's sorting feature, so it could previously only be combined with Spark to sort in batch mode. Version 0.11 adds support for sorting, which means we can also enjoy this benefit in real time.

The essence of sorting is to scan faster: after the data is clustered by the sort key, it is ordered from small to large, and max-min filtering can skip a lot of irrelevant data.

img

2.2 Sorting demo

insert into Iceberg_table select days, province_id from Kafka_tbl order by days, province_id;

3. Detailed explanation of the manifest after sorting

img

Parameter explanation

  • file_path: physical file location.
  • partition: the partition the file belongs to.
  • lower_bounds: the minimum values of the sort fields in this file; in the figure, these are the minimums of days and province_id.
  • upper_bounds: the maximum values of the sort fields in this file; in the figure, these are the maximums of days and province_id.

Whether a file_path file needs to be read is decided from the partition and the column upper and lower bounds. After the data is sorted, the column-level information is also recorded in the metadata, and the query plan locates the files from the manifest; there is no need to record this information in the Hive metadata, which reduces the pressure on the Hive metadata and improves query efficiency.

Using the sorting feature of Iceberg 0.11, with day as the partition and the data sorted by day, hour, and minute, the manifest files record this sort order, which improves query efficiency when retrieving data. This achieves the retrieval benefits of Hive partitions while avoiding the pressure of excessive Hive metadata.

6. Summary

Compared with previous versions, Iceberg 0.11 adds many practical features, summarized as follows:

  • Flink + Iceberg sorting function

    Before Iceberg 0.11, the sorting feature was integrated with Spark but not with Flink. At that time we migrated a batch of Hive tables with Spark + Iceberg 0.10. The benefit for BI: originally, BI built multi-level partitions to speed up Hive queries, which caused too many small files and too much metadata. During ingestion into the lake, we used Spark to sort on the conditions that BI queries most often, combined with implicit partitioning. This improved BI retrieval speed without the small-file problem, and because Iceberg has its own metadata, it also reduces the pressure on the Hive metadata.

    Iceberg 0.11 supports sorting from Flink, which is a very useful feature. We can move the original Flink + Hive partitioning scheme to Iceberg sorting, which achieves the effect of Hive partitions while also reducing small files and improving query efficiency.

  • Read data in real time

    Data can be read in real time through SQL. The advantage is that data with low real-time requirements, for example business that can accept a delay of 1 to 10 minutes, can be put into Iceberg. This reduces the pressure on Kafka while still allowing near-real-time reads and retaining historical data.

  • Merge small files in real time

    Before Iceberg 0.11, Iceberg's merge API had to be used to maintain small-file merging; the API required table information and timing settings, and the merging ran in batch rather than in real time. In terms of code it adds maintenance and development costs; in terms of timeliness it is not real-time. In 0.11, the hash mode merges data at the source in real time: you only need to specify the ('write.distribution-mode'='hash') attribute when creating the table in SQL, without manual maintenance.
