Abstract: This article describes how to use DeltaStreamer, a tool that ships with Hudi, to ingest data into the data lake in real time.

This article is shared from the Huawei Cloud Community article "Huawei FusionInsight MRS in Practice: Best Practices for the Hudi Real-Time DeltaStreamer Tool", by Jin Hongqing.

Background

The architecture of a traditional big data platform is designed around offline data processing, and data is commonly imported in batches by scheduled Sqoop jobs. As real-time requirements for data analysis keep rising, hourly or even minute-level data synchronization has become more and more common, which has driven the development of (quasi) real-time synchronization systems built on the Spark/Flink stream processing engines.

However, real-time synchronization has faced several challenges from the start:

  • Small files. Whether Spark's micro-batch mode or Flink's record-by-record mode is used, each write to HDFS produces a file of only a few MB or even tens of KB. The huge number of small files accumulated over time puts tremendous pressure on the HDFS NameNode.
  • Support for updates. HDFS itself does not allow data to be modified, so records cannot be updated during synchronization.
  • Transactionality. Whether data is being inserted or updated, transactional guarantees are needed: data should be written to HDFS exactly once, at the commit of the stream processing job, and data that has been fully or partially written must be removable when the job rolls back.

Hudi is one of the solutions to the above problems. Data is written to Hudi with the DeltaStreamer tool that ships with Hudi, and enabling --enable-hive-sync synchronizes the data to a Hive table.

Introduction to the Hudi DeltaStreamer write tool

For DeltaStreamer usage, refer to https://hudi.apache.org/cn/docs/writing_data.html

The HoodieDeltaStreamer utility (part of hudi-utilities-bundle) provides a way to ingest data from different sources such as DFS or Kafka, and offers the following capabilities:

  • Exactly-once ingestion of new events from Kafka, incremental imports from Sqoop, the output of HiveIncrementalPuller, or files in a DFS folder
  • Support for incoming data in JSON, Avro, or custom record formats
  • Management of checkpoints, rollback, and recovery
  • Avro schemas from DFS or the Confluent schema registry
  • Support for custom transformation operations
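
The full list of command-line options can be printed with --help; the bundle path below is only illustrative and depends on your Hudi installation.

spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/path/to/hudi-utilities-bundle.jar --help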

Scenario description

[Figure: overall data flow from MySQL through Debezium and Kafka into Hudi and Hive on MRS]

  1. Data from the production database is written in real time to a designated topic in the MRS cluster's Kafka by the CDC tool (Debezium).
  2. The DeltaStreamer tool provided by Hudi reads and parses the data from the specified Kafka topic.
  3. DeltaStreamer then writes the processed data to Hive in the MRS cluster.

Introduction to sample data

Raw data in the MySQL production database:
[Figure: sample rows of the table hudi.hudisource3]

Introduction to the CDC tool Debezium

For the integration steps, see: https://fusioninsight.github.io/ecosystem/zh-hans/Data_Integration/DEBEZIUM/

After the integration is complete, insert, update, and delete operations on the MySQL production database each produce a corresponding Kafka message.

Insert operation: insert into hudi.hudisource3 values (11,"Jiang Yutang","38","女","picture","player","28732");

The corresponding Kafka message body:
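
A simplified, illustrative Debezium envelope for this insert is shown below; real messages also carry a schema section and source metadata inside the payload.

{
  "payload": {
    "before": null,
    "after": {
      "uid": 11,
      "uname": "Jiang Yutang",
      "age": "38",
      "sex": "女",
      "mostlike": "picture",
      "lastview": "player",
      "totalcost": "28732"
    },
    "op": "c"
  }
}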

Update operation: UPDATE hudi.hudisource3 SET uname='Anne Marie333' WHERE uid=11;

The corresponding Kafka message body:
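
Again simplified and illustrative: for an update, before holds the pre-update row and after holds the post-update row.

{
  "payload": {
    "before": { "uid": 11, "uname": "Jiang Yutang", "age": "38", "sex": "女", "mostlike": "picture", "lastview": "player", "totalcost": "28732" },
    "after": { "uid": 11, "uname": "Anne Marie333", "age": "38", "sex": "女", "mostlike": "picture", "lastview": "player", "totalcost": "28732" },
    "op": "u"
  }
}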

Delete operation: delete from hudi.hudisource3 where uid=11;

The corresponding Kafka message body:
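
Simplified as well: for a delete, before holds the deleted row and after is null. Debezium additionally emits a tombstone record with a null value after the delete event, which the custom Kafka source shown later filters out.

{
  "payload": {
    "before": { "uid": 11, "uname": "Anne Marie333", "age": "38", "sex": "女", "mostlike": "picture", "lastview": "player", "totalcost": "28732" },
    "after": null,
    "op": "d"
  }
}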

Procedure

Obtaining the Huawei MRS Hudi sample project

Obtain the sample code from GitHub for your actual MRS version: https://github.com/huaweicloud/huaweicloud-mrs-example/tree/mrs-3.1.0

Open the project SparkOnHudiJavaExample

Sample code modifications and walkthrough


1.debeziumJsonParser

Description: parses the Debezium message body and extracts the op field.

The source code is as follows:

package com.huawei.bigdata.hudi.examples;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.TypeReference;

public class debeziumJsonParser {

    public static String getOP(String message){

        JSONObject json_obj = JSON.parseObject(message);
        String op = json_obj.getJSONObject("payload").get("op").toString();
        return  op;
    }
}

2.MyJsonKafkaSource

Note: by default, DeltaStreamer uses org.apache.hudi.utilities.sources.JsonKafkaSource to consume data from the specified Kafka topic. If the data needs to be parsed during consumption, a custom MyJsonKafkaSource must be written to handle it.

The source code, with comments added, is as follows:

package com.huawei.bigdata.hudi.examples;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.parser.Feature;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.hudi.utilities.sources.InputBatch;
import org.apache.hudi.utilities.sources.JsonSource;
import org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen;
import org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.CheckpointUtils;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;
import java.util.Map;

/**
 * Read json kafka data.
 */
public class MyJsonKafkaSource extends JsonSource {

    private static final Logger LOG = LogManager.getLogger(MyJsonKafkaSource.class);

    private final KafkaOffsetGen offsetGen;

    private final HoodieDeltaStreamerMetrics metrics;

    public MyJsonKafkaSource(TypedProperties properties, JavaSparkContext sparkContext, SparkSession sparkSession,
                             SchemaProvider schemaProvider) {
        super(properties, sparkContext, sparkSession, schemaProvider);
        HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder();
        this.metrics = new HoodieDeltaStreamerMetrics(builder.withProperties(properties).build());
        properties.put("key.deserializer", StringDeserializer.class);
        properties.put("value.deserializer", StringDeserializer.class);
        offsetGen = new KafkaOffsetGen(properties);
    }

    @Override
    protected InputBatch<JavaRDD<String>> fetchNewData(Option<String> lastCheckpointStr, long sourceLimit) {
        OffsetRange[] offsetRanges = offsetGen.getNextOffsetRanges(lastCheckpointStr, sourceLimit, metrics);
        long totalNewMsgs = CheckpointUtils.totalNewMessages(offsetRanges);
        LOG.info("About to read " + totalNewMsgs + " from Kafka for topic :" + offsetGen.getTopicName());
        if (totalNewMsgs <= 0) {
            return new InputBatch<>(Option.empty(), CheckpointUtils.offsetsToStr(offsetRanges));
        }
        JavaRDD<String> newDataRDD = toRDD(offsetRanges);
        return new InputBatch<>(Option.of(newDataRDD), CheckpointUtils.offsetsToStr(offsetRanges));
    }

    private JavaRDD<String> toRDD(OffsetRange[] offsetRanges) {
        return KafkaUtils.createRDD(this.sparkContext, this.offsetGen.getKafkaParams(), offsetRanges, LocationStrategies.PreferConsistent()).filter((x)->{
            // Filter out empty lines and dirty data
            String msg = (String)x.value();
            if (msg == null) {
                return false;
            }
            try{
                String op = debeziumJsonParser.getOP(msg);
            }catch (Exception e){
                return false;
            }
            return true;
        }).map((x) -> {
            // Parse the data coming from Debezium into a map and then return the map's toString(); this keeps structural changes to a minimum
            String msg = (String)x.value();
            String op = debeziumJsonParser.getOP(msg);
            JSONObject json_obj = JSON.parseObject(msg, Feature.OrderedField);
            Boolean is_delete = false;
            String out_str = "";
            Object out_obj = new Object();
            if(op.equals("c")){
                out_obj =  json_obj.getJSONObject("payload").get("after");
            }
            else if(op.equals("u")){
                out_obj =   json_obj.getJSONObject("payload").get("after");
            }
            else {
                is_delete = true;
                out_obj =   json_obj.getJSONObject("payload").get("before");
            }
            Map out_map = (Map)out_obj;
            out_map.put("_hoodie_is_deleted",is_delete);
            out_map.put("op",op);

            return out_map.toString();
        });
    }
}

3.TransformerExample

Description: specifies the fields required when writing to the Hudi table or Hive table.

The source code, with comments added, is as follows:

package com.huawei.bigdata.hudi.examples;

import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/**
 * Function description:
 * formats the fetched data
 */
public class TransformerExample implements Transformer, Serializable {

    /**
     * format data
     *
     * @param jsc JavaSparkContext
     * @param sparkSession SparkSession
     * @param rowDataset Dataset<Row>
     * @param properties TypedProperties
     * @return Dataset<Row>
     */
    @Override
    public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset,
        TypedProperties properties) {
        JavaRDD<Row> rowJavaRdd = rowDataset.toJavaRDD();
        List<Row> rowList = new ArrayList<>();
        for (Row row : rowJavaRdd.collect()) {

            Row one_row = buildRow(row);
            rowList.add(one_row);
        }
        JavaRDD<Row> stringJavaRdd = jsc.parallelize(rowList);
        List<StructField> fields = new ArrayList<>();
        builFields(fields);
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> dataFrame = sparkSession.createDataFrame(stringJavaRdd, schema);
        return dataFrame;
    }

    private void builFields(List<StructField> fields) {
        fields.add(DataTypes.createStructField("uid", DataTypes.IntegerType, true));
        fields.add(DataTypes.createStructField("uname", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("age", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("sex", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("mostlike", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("lastview", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("totalcost", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("_hoodie_is_deleted", DataTypes.BooleanType, true));
        fields.add(DataTypes.createStructField("op", DataTypes.StringType, true));
    }

    private Row buildRow(Row row) {
        Integer uid = row.getInt(0);
        String uname = row.getString(1);
        String age = row.getString(2);
        String sex = row.getString(3);
        String mostlike = row.getString(4);
        String lastview = row.getString(5);
        String totalcost = row.getString(6);
        Boolean _hoodie_is_deleted = row.getBoolean(7);
        String op = row.getString(8);
        Row returnRow = RowFactory.create(uid, uname, age, sex, mostlike, lastview, totalcost, _hoodie_is_deleted, op);
        return returnRow;
    }
}

4.DataSchemaProviderExample

Note: specifies the data format returned by MyJsonKafkaSource as the source schema, and the data format written by TransformerExample as the target schema.

The following is the source code

package com.huawei.bigdata.hudi.examples;

import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Function description:
 * provides the source and target schemas
 */
public class DataSchemaProviderExample extends SchemaProvider {

    public DataSchemaProviderExample(TypedProperties props, JavaSparkContext jssc) {
        super(props, jssc);
    }
    /**
     * source schema
     *
     * @return Schema
     */
    @Override
    public Schema getSourceSchema() {
        Schema avroSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"hoodie_source\",\"fields\":[{\"name\":\"uid\",\"type\":\"int\"},{\"name\":\"uname\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"string\"},{\"name\":\"sex\",\"type\":\"string\"},{\"name\":\"mostlike\",\"type\":\"string\"},{\"name\":\"lastview\",\"type\":\"string\"},{\"name\":\"totalcost\",\"type\":\"string\"},{\"name\":\"_hoodie_is_deleted\",\"type\":\"boolean\"},{\"name\":\"op\",\"type\":\"string\"}]}");
        return avroSchema;
    }
    /**
     * target schema
     *
     * @return Schema
     */
    @Override
    public Schema getTargetSchema() {
        Schema avroSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"mytest_record\",\"namespace\":\"hoodie.mytest\",\"fields\":[{\"name\":\"uid\",\"type\":\"int\"},{\"name\":\"uname\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"string\"},{\"name\":\"sex\",\"type\":\"string\"},{\"name\":\"mostlike\",\"type\":\"string\"},{\"name\":\"lastview\",\"type\":\"string\"},{\"name\":\"totalcost\",\"type\":\"string\"},{\"name\":\"_hoodie_is_deleted\",\"type\":\"boolean\"},{\"name\":\"op\",\"type\":\"string\"}]}");
        return avroSchema;
    }
}

Upload the project package (hudi-security-examples-0.7.0.jar) and the JSON parsing package (fastjson-1.2.4.jar) to the MRS client.
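
For example, assuming SSH access to the client node and that the target directory matches the paths used in the startup command below (host and user are placeholders):

scp fastjson-1.2.4.jar hudi-security-examples-0.7.0.jar user@mrs-client-node:/opt/hudi-demo2/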

DeltaStreamer startup command

Log in to the client and run the following commands to load the environment variables and authenticate:

source /opt/hadoopclient/bigdata_env
kinit developuser
source /opt/hadoopclient/Hudi/component_env

The DeltaStreamer startup command is as follows:

spark-submit --master yarn-client \
--jars /opt/hudi-demo2/fastjson-1.2.4.jar,/opt/hudi-demo2/hudi-security-examples-0.7.0.jar \
--driver-class-path /opt/hadoopclient/Hudi/hudi/conf:/opt/hadoopclient/Hudi/hudi/lib/*:/opt/hadoopclient/Spark2x/spark/jars/*:/opt/hudi-demo2/hudi-security-examples-0.7.0.jar \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
spark-internal --props file:///opt/hudi-demo2/kafka-source.properties \
--target-base-path /tmp/huditest/delta_demo2 \
--table-type COPY_ON_WRITE  \
--target-table delta_demo2  \
--source-ordering-field uid \
--source-class com.huawei.bigdata.hudi.examples.MyJsonKafkaSource \
--schemaprovider-class com.huawei.bigdata.hudi.examples.DataSchemaProviderExample \
--transformer-class com.huawei.bigdata.hudi.examples.TransformerExample \
--enable-hive-sync --continuous

kafka-source.properties configuration:

// Hudi configuration
hoodie.datasource.write.recordkey.field=uid
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.delete.shuffle.parallelism=10
hoodie.upsert.shuffle.parallelism=10
hoodie.bulkinsert.shuffle.parallelism=10
hoodie.insert.shuffle.parallelism=10
hoodie.finalize.write.parallelism=10
hoodie.cleaner.parallelism=10
hoodie.datasource.write.precombine.field=uid
hoodie.base.path = /tmp/huditest/delta_demo2
hoodie.timeline.layout.version = 1

// hive config
hoodie.datasource.hive_sync.table=delta_demo2
hoodie.datasource.hive_sync.partition_fields=
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.datasource.hive_sync.use_jdbc=false

// Kafka Source topic
hoodie.deltastreamer.source.kafka.topic=hudisource
// checkpoint
hoodie.deltastreamer.checkpoint.provider.path=hdfs://hacluster/tmp/delta_demo2/checkpoint/

// Kafka props
bootstrap.servers=172.16.9.117:21005
auto.offset.reset=earliest
group.id=a5
offset.rang.limit=10000

Note: the Kafka broker configuration allow.everyone.if.no.acl.found must be set to true.
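
That is, the setting below must be enabled on the Kafka brokers (on MRS this is typically adjusted through the cluster management interface rather than by editing server.properties directly):

allow.everyone.if.no.acl.found=true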

Querying with Spark

spark-shell --master yarn

val roViewDF = spark.read.format("org.apache.hudi").load("/tmp/huditest/delta_demo2/*")
roViewDF.createOrReplaceTempView("hudi_ro_table")
spark.sql("select * from  hudi_ro_table").show()

Result of querying the Hudi table in Spark after the MySQL insert operation:

Result of querying the Hudi table in Spark after the MySQL update operation:

Result after the MySQL delete operation:

Querying with Hive

beeline

select * from delta_demo2;

Result of querying the Hive table after the MySQL insert operation:

Result of querying the Hive table after the MySQL update operation:

Result of querying the Hive table after the MySQL delete operation:


