This article is based on a talk by Wang Ye (Mengdou), Senior Technical Expert at Alibaba Cloud, at the joint Apache Hudi and Apache Pulsar Meetup in Hangzhou. It introduces how Alibaba Cloud builds a Lakehouse with Hudi and OSS object storage: what a Lakehouse is, how the Alibaba Cloud Database OLAP team builds one, the problems and challenges encountered along the way, and how they were solved.

Slides from this talk:

"Alibaba Cloud Building Lakehouse Practice Based on Hudi".pdf 1613f131be68ea

Other materials:

Shaofeng (Fengze), Alibaba Cloud Technical Expert: "Apache Hudi-based CDC Data into the Lake".pdf

Zhai Jia, Co-founder of StreamNative and Apache Pulsar PMC Member: "Pulsar 2.8.0 Feature Overview and Planning".pdf

Yufan, StreamNative Software Engineer: "Design, Development and Use of the New Pulsar Connector Based on Flink".pdf

1. Data Lake and Lakehouse

At the 2021 Developers Conference, one of our researchers gave a talk that cited a lot of figures. The main point is that the industry has reached a stage where data is expanding dramatically and growing at a staggering rate, whether measured by data scale, by the evolution from real-time to intelligent production and processing, or by the accelerating migration of data to the cloud.

These figures come from analyses by Gartner and IDC, distilled from the industry's most authoritative reports. They mean that we face great opportunities and challenges in the data field, especially in analytics.


Facing massive amounts of data, we run into many challenges before we can truly mine and use its value. First, existing architectures have to be migrated to the cloud gradually; second, the sheer volume of data; third, Serverless pay-as-you-go billing is slowly changing from a tentative choice into the default choice; fourth, there are diversified applications and heterogeneous data sources. Anyone who has worked with the cloud knows that every cloud vendor offers many cloud services, especially a large number of data services. A large number of data sources inevitably makes analysis difficult: when you want to do correlation analysis, how to connect heterogeneous data sources is a big problem. Then there are differentiated data formats. We usually pick convenient, simple formats when writing data, such as CSV or JSON, but for analysis these formats are often very inefficient; once the data reaches the TB or PB level there is no way to analyze it. This is why analysis-oriented columnar formats such as Parquet and ORC were created. Add to that link security, differentiated user groups, and so on, and the expansion of data adds a great deal of difficulty to analysis.
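For illustration of the format point, here is a minimal Spark sketch of rewriting row-oriented CSV as columnar Parquet; the paths and column names are made up, and it assumes an OSS-compatible Hadoop filesystem is configured:

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet-sketch")
      .getOrCreate()

    // Read a row-oriented CSV file (hypothetical path and schema).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("oss://my-bucket/raw/orders.csv")

    // Rewrite it as columnar Parquet, partitioned by date, so analytical
    // queries can prune partitions and read only the columns they need.
    raw.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("oss://my-bucket/parquet/orders/")

    spark.stop()
  }
}
```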


In real customer scenarios, a lot of data has already been moved to the cloud and "into the lake". What is the lake? Our definition and understanding of a lake is closer to AWS S3 or Alibaba Cloud OSS: a simple, easy-to-use API that can store a wide variety of data formats, with unlimited capacity, pay-as-you-go billing, and other benefits. Previously, analyzing data on the lake was very troublesome: you often needed to build T+1 warehouses and wire up various cloud services, and when the data format was wrong, manual ETL was required. Even with the data already in the lake, meta-information discovery and analysis were still needed, and the whole operation and maintenance chain was complicated and full of problems. These are the offline data lake problems our customers actually face; some have high priority, some low, but in short there are many of them.


Around 2019, Databricks began shifting its research focus from Spark to Lakehouse. They published two papers that give a theoretical definition of how data in the lake can be accessed in a unified and better way.

The new Lakehouse concept aims to shield the various differences in formats and provide different applications with a unified interface and simplified data access and analysis capabilities. Architecturally, the claim is that data warehouses, data lakes, and the Lakehouse will evolve step by step.

The two papers expound many new concepts. First, how to design and implement MVCC, so that an offline data warehouse can have MVCC capabilities like a database and thus satisfy most batch transaction needs. Second, providing different storage modes that can adapt to different read and write workloads. Third, providing near real-time write and merge capabilities to keep the pipeline flowing at volume. In short, these ideas can better solve the problems of offline data analysis.


There are currently three relatively popular products in this space. The first is Delta Lake, the data lake management protocol released by Databricks itself; the second is Iceberg, an Apache open source project; the third is Hudi, which was originally developed internally at Uber and later open-sourced (in the early days, Hive ACID was used more often for this purpose). All three can sit on top of lake storage because they talk to the HDFS API, and OSS can adapt to the HDFS storage interface. Since the core principles are similar, the capabilities of the three products are gradually converging. With the theoretical support of the papers, we had a direction to practice in.

For us, Hudi was chosen at the time because of its product maturity and its database-oriented data access ability, which matched the CDC-shaped business needs of our database team.

The early definition of Hudi was the abbreviation of Hadoop Upserts anD Incrementals, later extended to cover Update, Delete, and Insert on Hadoop. The core logic is transaction versioning, state machine control, and asynchronous execution: it simulates full MVCC logic and incrementally manages column-store files such as Parquet and ORC through object lists, to achieve efficient storage and reading. It is very similar to the Lakehouse concept defined by Databricks and overlaps with Iceberg, and its capabilities keep improving in this direction.
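To make these mechanics concrete, here is a minimal, hedged Spark/Scala sketch of writing a Hudi table; the table name, path, and columns are invented, and it assumes the Hudi Spark bundle and an OSS-compatible filesystem are available.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-write-sketch")
      // Hudi requires Kryo serialization for its internal structures.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    import spark.implicits._

    // A hypothetical batch of changed rows: primary key, payload, version column.
    val changes = Seq(
      (1L, "alice", "2021-07-01 10:00:00"),
      (2L, "bob",   "2021-07-01 10:05:00")
    ).toDF("id", "name", "ts")

    changes.write.format("hudi")
      // Record key and precombine (version) field drive Hudi's upsert/merge logic.
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      // COPY_ON_WRITE or MERGE_ON_READ, matching the "different storage modes" idea.
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.table.name", "demo_table")
      .mode(SaveMode.Append)
      .save("oss://my-bucket/lakehouse/demo_table")

    spark.stop()
  }
}
```

Each such write creates a new commit on Hudi's timeline, which is the transaction-versioning mechanism referred to above: readers always see a consistent snapshot of the committed files.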

The architecture shown on Hudi's official website takes this form. When we did technical selection and research, we found that many peers had already chosen Hudi for data ingestion into the lake and for offline data management. First, the product is relatively mature; second, it meets our CDC needs; third, Delta Lake maintains an open source version and an internally optimized version, and only the open source version is provided externally, so we felt it might not expose the best of what they have. Iceberg started relatively late, and its capabilities were less complete than the other two products in the early days, so it was not considered. Because we are a Java team and have our own Spark product, Hudi fit our ability to support data ingestion into the lake with our own runtime, so we chose Hudi.

Of course, we keep watching the development of all three products. Later, a Chinese open source project, StarLake, started doing similar things. Each product is improving, and in the long run their capabilities are basically converging; I think they will gradually match the capabilities defined in the papers.


"Based on the open source Hudi columnar and multi-version format, the heterogeneous data sources are incrementally and low-latency into the lake, stored in open, low-cost object storage, and in this process, data layout optimization and meta- The ability to evolve information will ultimately realize the unified management of offline data, and support the above calculation and analysis capabilities indiscriminately. This is the overall plan. "This is our understanding of Lakehouse and the direction of our technological exploration.

2. Alibaba Cloud Lakehouse Practice

Next I will introduce Alibaba Cloud Lakehouse's technical exploration and concrete practice. First, a brief introduction to the "integration of database, warehouse, and lake" strategy that the Alibaba Cloud database team has been proposing in recent years.


Database-related products can be divided into four layers: first, the DB itself; second, NewSQL/NoSQL products; third, data warehouse products; fourth, data lake products. The higher the layer, the greater the value density of the data, and the data appears in the form of tables and warehouses that are ready for correlated analysis; DB data formats, for example, are very simple and clear. The lower the layer, the larger the data volume and the more complex the data formats, with a wide variety of storage formats: data lakes hold structured, semi-structured, and unstructured data, and to analyze it you must do a certain amount of refining and mining to truly extract its value.

Each of the four storage directions has its own domain, and at the same time there are requirements for correlated analysis across them. The main goal is to break down data silos and integrate the data so that its value becomes more multi-dimensional. If you only do some log analysis, such as correlating region and customer source, you only need relatively simple capabilities such as GroupBy or Count. For the lower-level data, multiple rounds of cleaning and backflow may be needed before it can serve online, high-concurrency scenarios layer by layer: not only writing data from the lake directly into the database, but also into the warehouse, into NoSQL/NewSQL products, into KV systems, and so on, making good use of their online query capabilities.

The reverse also holds: data in these databases, NewSQL products, and even data warehouses flows downward to build low-cost, large-capacity storage for backup and archiving, which reduces the storage pressure and analysis throughput pressure above, and together they form a powerful combination for analysis. This is my own understanding of the integration of database, warehouse, and lake.

Having talked about the development direction and positioning of the database, let's look at Lakehouse's positioning within the layered OLAP data warehouse system underneath the database. Those who have built data warehouse products know this better than I do; as the slide shows, it is basically such a layered system. At the beginning there are various non-warehouse or non-lake systems holding the data. In our understanding, through Lakehouse's ability to ingest into the lake and build the warehouse, and through cleaning, precipitation, and aggregation, the ODS or CDM layer is formed. Here we have done preliminary data aggregation, forming the concept of a data mart.


On Alibaba Cloud, we store this data on OSS in the Parquet file format under the Hudi protocol. Internally, the initial data sets are further aggregated through ETL into clearer, more business-oriented data sets, which are then imported into real-time data warehouses and so on, or served directly to low-frequency interactive analysis and BI analysis, or to engines such as Spark for machine learning, and finally output to data applications. That is the overall layered system.

Throughout the process, we connect to a unified meta-information system. If each subsystem keeps its own copy of meta-information in its own terms, the OLAP system becomes fragmented, so the meta-information must be unified. The same goes for scheduling: tables at different warehouse levels have to be chained together across different places, so there must be complete, unified scheduling capabilities. That is my understanding of Lakehouse's positioning in the OLAP system: mainly the ability to build the near-source layer and aggregate offline data.

The previous part introduced Lakehouse's positioning within the database and OLAP teams; the next part focuses on how Lakehouse is designed in our field. Because I previously built analysis systems on the cloud with K8s, I am quite familiar with many K8s concepts.


When designing our own system, we tried to refer to and learn from the K8s system. K8s has the DevOps concept we often mention, which is a practical paradigm. Under this paradigm, many instances are created and many applications are managed inside them; these applications are ultimately scheduled atomically as Pods, and business logic runs in the various Containers inside the Pods.

We think Lakehouse is also a paradigm, a paradigm for processing offline data. Here the data set is our core concept, for example a set of data sets built for a certain scenario or direction; we can define different data sets A, B, and C, and in our view each is an instance. Around a data set we arrange various Workloads, such as ingesting a DB into the lake. There are also analysis-and-optimization workloads, such as index building, z-ordering, clustering, compaction, and other techniques that improve query performance. There are also management workloads, such as regularly cleaning up historical data and doing hot/cold storage tiering, since OSS provides many such capabilities that are worth using well. The bottom layer is the individual jobs: we build offline computing capability on Spark, arrange a Workload into small jobs that run one after another, and execute all atomic jobs on Spark. That is the domain design of our technical practice for Lakehouse.


This is the overall technical architecture. First, there are various data sources on the cloud; various workloads are defined through orchestration and run on our own elastic Spark compute. The core storage is based on Hudi + OSS, and we also support other HDFS-compatible systems such as Alibaba Cloud's LindormDFS. An internal meta-information system manages meta-information about databases, tables, columns, and so on, and all the management and control services are scheduled on K8s. The upper layer connects computation and analysis capabilities through the native Hudi interface. This is the entire elastic architecture.

Serverless Spark is our computing foundation, providing job-level elasticity; since Spark itself also supports Spark Streaming, stream computing can be realized by spinning up a Spark job in a short time. OSS and LindormDFS were chosen as the storage base mainly for their low cost and unlimited capacity.


Within this architecture, how do we connect to the user's data and realize ingestion into the lake, storage, and analysis? The figure above shows our VPC-based security solution. First, we use a shared cluster model: the user side connects through the SDK and VPDN network, and the Alibaba Cloud internal gateway opens up the computing cluster for management and scheduling. Then, through Alibaba Cloud's elastic network interface (ENI) technology, we connect into the user's VPC to establish the data path, while also providing routing and network isolation; different users may have conflicting subnets, and with elastic network interface technology even the same subnet ranges can be connected to the same computing cluster at the same time.

Anyone who has used Alibaba Cloud OSS knows that OSS itself sits in a shared area of the Alibaba Cloud VPC network and does not require a complicated network setup. RDS and Kafka are deployed in the user's VPC, and multiple networks can be connected through one set of network architecture; unlike VPC subnets, shared areas have no conflict problem. Second, data isolation: the ENI has end-to-end restrictions, for example VPC ID tags and different authorization requirements. If an illegitimate user tries to connect to the VPC, the network packets cannot get through unless they come from that network interface, which achieves secure isolation of the data path.


With the network architecture settled, how does it run? In the overall design we take the K8s DSL design as a reference. As mentioned earlier, many lake-ingestion tasks are defined, and one Workload may contain many small tasks, so we need DSL orchestration capabilities: define job1, job2, then job3 as a set of orchestration scripts. These orchestration scripts are submitted through entrances such as the SDK and console, received through an API Server, and scheduled by a Scheduler. The Scheduler communicates with Spark's gateway to handle task management, status management, task distribution, and so on, and finally schedules internal K8s to pull up jobs for execution. Some full-load jobs run once, such as pulling a DB once; there are also resident streaming jobs, triggered asynchronous jobs, timed asynchronous jobs, and so on, and the same scheduling capability can be extended to these different forms. Throughout the process, job status, intermediate statistics, and so on are continuously fed back. In K8s, the K8s Master plays this role, along with the API Server and Scheduler; it is similar here, and HA of the scheduling capability is realized through a one-master, multiple-slave architecture and so on.
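As a rough illustration of the idea only (this is not our actual DSL; every name here is hypothetical), a workload could be modeled as a set of atomic jobs of different kinds:

```scala
object WorkloadModelSketch {
  // Hypothetical model of a Workload decomposed into atomic jobs; the real
  // orchestration DSL is not shown here, so all names are made up.
  sealed trait JobKind
  case object FullLoad  extends JobKind // runs once, e.g. initial DB snapshot
  case object Streaming extends JobKind // resident incremental ingestion
  case object Async     extends JobKind // triggered, e.g. compaction/clustering
  case object Timed     extends JobKind // scheduled, e.g. cleanup, tiering

  final case class Job(name: String, kind: JobKind, dependsOn: Seq[String] = Nil)
  final case class Workload(dataset: String, jobs: Seq[Job])

  // One DB-ingestion workload expressed as three jobs the scheduler can run
  // and scale independently.
  val dbIngestion: Workload = Workload(
    dataset = "dataset_a",
    jobs = Seq(
      Job("job1-full-sync",   FullLoad),
      Job("job2-incremental", Streaming, dependsOn = Seq("job1-full-sync")),
      Job("job3-compaction",  Async,     dependsOn = Seq("job2-incremental"))
    )
  )
}
```

Each job can then be scheduled, retried, and scaled independently, which is exactly what the next paragraph motivates.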

Why do we split a user-facing Workload into N different jobs? If all of these tasks run in a single process, the resource water level of the whole Workload varies greatly and elastic scheduling becomes very difficult. A full-load task only needs to run once, but how many resources are appropriate? In many cases Spark is not that flexible; asynchronous and timed tasks in particular are expensive to pull up, and after they finish you do not know when the next one will come, which is hard to predict. It is like signal processing, where a Fourier transform splits a complex waveform into multiple simple waveforms so that processing becomes simple. We have the same intuition: using different jobs to perform the different roles inside a Workload makes elasticity easy to achieve. For a timed or ad-hoc triggered job, a job is pulled up temporarily and its resource consumption is completely independent of the resident streaming task, so it does not affect the stability of the streaming task, the ingestion latency, and so on. The thinking behind this design is to simplify complex problems: from the perspective of elasticity, the simpler the extracted waveform, the better the elasticity and the simpler the prediction.

Lake ingestion involves many users' account and secret information, because not all cloud products use AWS IAM or Alibaba Cloud RAM to build fully cloud-based resource permission control; many products, including user-built systems and database systems, still use account/secret authentication and authorization. This means users have to hand us their connection accounts and secrets, so how do we manage them more safely? We rely on two Alibaba Cloud systems: KMS, which encrypts users' data with a hardware-backed encryption system, and STS, a fully cloud-based third-party authentication capability that enables secure access to user data. Together with isolation and protection mechanisms for sensitive data, this forms our current system.


There is another problem: different users are completely isolated from each other through various mechanisms, but a single user has many tasks. In the Lakehouse concept there is a four-layer structure: a data set contains multiple databases, a database contains multiple tables, a table has different partitions, and a partition has different data files. Users have sub-account systems and perform various operations, so operations on the data may affect each other.


For example, two different lake-ingestion tasks want to write the same table: online task A has been running normally, and then another user configures task B to write into the same location, which could wash away all the data already written by task A, a very dangerous thing. Other users may delete jobs and thereby delete the data of tasks that are still running, while other tasks still accessing it cannot perceive the change. Or other cloud services, other programs in the VPC, or self-deployed services may operate on this table and cause data problems. Therefore, we designed a complete set of mechanisms. On the one hand, locks are implemented at the table level: if a task is the first to obtain permission to write a piece of data, subsequent tasks are not allowed to write it until the end of that task's lifecycle, so nothing gets written dirty.
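Conceptually, the table-level lock behaves like an atomic claim against the metadata service. The sketch below is a stand-in for illustration only; the registry, method names, and ids are hypothetical, not our actual implementation:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for the metadata service that records
// which workload currently owns the write lock on each table.
final class TableLockRegistry {
  private val owners = new ConcurrentHashMap[String, String]()

  // Atomically claim the table for a workload; only the first claimant wins,
  // so a second ingestion task writing the same table fails fast.
  def tryAcquire(table: String, workloadId: String): Boolean =
    owners.putIfAbsent(table, workloadId) == null || owners.get(table) == workloadId

  // Released only when the owning workload's lifecycle ends.
  def release(table: String, workloadId: String): Unit =
    owners.remove(table, workloadId)
}

object LockDemo extends App {
  val locks = new TableLockRegistry
  require(locks.tryAcquire("db1.orders", "workload-A"))  // task A wins the lock
  assert(!locks.tryAcquire("db1.orders", "workload-B"))  // task B is rejected
  locks.release("db1.orders", "workload-A")              // released at end of lifecycle
}
```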

On the other hand, based on OSS's Bucket Policy capability, we built permission verification for different programs: only Lakehouse tasks are allowed to write data, while other programs are not allowed to write but can read. Data within the same account is meant to be shared, analyzed, and accessed by various application scenarios; it can be read, but it must not be polluted. We have done reliability work in these areas.


We have talked a lot about the architecture; now let's step back and look at how to understand the data model. We see the whole process as centered on rows (because the warehouse still stores data row by row within the scope of a table), and we build unified ingestion, storage, analysis, and meta-information models around row data. First there are a variety of data sources (text or binary; a binlog is binary data, and something like Kafka can carry a variety of binary payloads). These data pass through various Connectors or Readers (different systems name them differently), which read the data and map it into rows. Within these rows there is key descriptive information, such as source information and change type, as well as a variable column set. The rows then go through a series of transformation rules, such as filtering out some data, generating primary keys, defining versions, type conversion, and so on; finally, through Hudi Payload encapsulation, conversion, metadata maintenance, file generation, and so on, they are written into the lake storage.
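A rough sketch of this row-centric model follows; all type and field names are hypothetical and only illustrate the shape of the pipeline:

```scala
object RowModelSketch {
  // Hypothetical row model: fixed descriptive fields plus a variable column set.
  final case class SourceRecord(
    source: String,               // e.g. "mysql-binlog", "kafka-topic-x"
    changeType: String,           // INSERT / UPDATE / DELETE / HEARTBEAT
    columns: Map[String, String]  // variable column set from the source
  )

  final case class LakeRow(
    primaryKey: String,
    version: Long,                // later used as Hudi's precombine field
    columns: Map[String, String]
  )

  // A transform rule is a partial mapping: it may drop a record entirely.
  type Transform = SourceRecord => Option[SourceRecord]

  val dropHeartbeats: Transform =
    r => if (r.changeType == "HEARTBEAT") None else Some(r)

  def toLakeRow(r: SourceRecord): LakeRow =
    LakeRow(
      primaryKey = r.columns.getOrElse("id", ""),
      version    = r.columns.get("binlog_offset").map(_.toLong).getOrElse(0L),
      columns    = r.columns
    )

  // filter -> key/version generation; Hudi payload encapsulation and the
  // actual file write would follow, as in the earlier write sketch.
  def pipeline(records: Seq[SourceRecord]): Seq[LakeRow] =
    records.flatMap(r => dropHeartbeats(r)).map(toLakeRow)
}
```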

In the storage, through maintenance of meta-information, partitions, and so on, subsequent computation and analysis can seamlessly see the meta-information of all the data in the lake warehouse and connect seamlessly to different kinds of application scenarios.


Let me introduce our support for common data source ingestion. DB ingestion into the lake is the most common scenario. On Alibaba Cloud there are products such as RDS and PolarDB. Taking the MySQL engine as an example, there are usually a master library, slave libraries, and perhaps an offline library, with master and slave access points, but it is essentially the same. When a DB enters the lake, a full synchronization must be performed first, followed by incremental synchronization. For the user, DB ingestion is one clear Workload; for the system, the full sync runs first and the incremental sync must connect to it automatically, stitched together by a certain mechanism to guarantee data correctness. The scheduling process obtains DB information through the unified management and control service, automatically selects a slave library or the instance with the least online pressure, performs the full synchronization and writes it into the lake, and maintains the corresponding Watermark: from what point the full sync started, how much delay there is between the slave library and the master library, and so on. After the full load completes, the incremental task starts, using binlog synchronization services such as DTS and backtracking from the previous Watermark. Hudi's Upsert capability then merges data by the user-defined PK and version according to a certain logic, guaranteeing the eventual consistency of the data and the correctness of the analysis side.
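For illustration, here is a hedged sketch of how one incremental batch might be merged into a Hudi table with Spark; the table and column names are made up, and it assumes the change rows already carry a primary key, a version column (for example the binlog timestamp), and a change-type flag:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

object CdcMergeSketch {
  // `cdcBatch` holds change rows pulled back from the binlog service, with
  // columns: id (primary key), name, ts (version), op ("I"/"U"/"D").
  def mergeIncrement(cdcBatch: DataFrame, basePath: String): Unit = {
    val upserts = cdcBatch.filter(col("op") =!= "D").drop("op")
    val deletes = cdcBatch.filter(col("op") === "D").drop("op")

    // Upserts: per key, Hudi keeps the row with the largest precombine (version) value.
    upserts.write.format("hudi")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.table.name", "mysql_orders")
      .mode(SaveMode.Append)
      .save(basePath)

    // Deletes go out as a separate write using Hudi's delete operation.
    if (!deletes.isEmpty) {
      deletes.write.format("hudi")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "delete")
        .option("hoodie.table.name", "mysql_orders")
        .mode(SaveMode.Append)
        .save(basePath)
    }
  }
}
```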

There are many considerations in maintaining the Watermark. If the full sync fails and is retried, where should it resume from? If the incremental sync fails, we must consider not only where the increment had reached but also where the incremental watermark should restart; it cannot fall back to the position before the initial full sync every time the increment pauses, or the data delay afterwards would be far too severe. This information is maintained at the Lakehouse table level, so that as a Workload runs, restarts, and retries it can be stitched together automatically, transparently to the user.


The second case is messaging products entering the lake, where we have also made technical explorations and business attempts. Their data is not as clear-cut as a DB's. Take Alibaba Cloud's existing Kafka service: its schema has only two fields, Key and Value. The Key identifies the message, and the Value is user-defined, most of the time a JSON string or binary data. First we have to work out how to map this into rows, which involves a lot of logic: some schema inference first, to recover the original structure. JSON's nested format is easy to store but laborious to analyze; it is only convenient to analyze as a wide table, so logic such as nested flattening and format expansion is needed, followed by the core logic mentioned above, and finally file writing, meta-information merging, and so on. Meta-information merging is needed because the set of source columns is not fixed: a column may exist for some rows and not for others. For Hudi, this meta-information needs to be maintained at the application layer. Schema Evolution in Lakehouse covers schema merging, compatible handling of columns, automatic maintenance of newly added columns, and so on.
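Below is a hedged sketch of the flattening step with Spark Structured Streaming; the topic, broker address, and schema are hypothetical, the schema is hard-coded here whereas the real pipeline infers and merges it, and the spark-sql-kafka connector is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object KafkaJsonFlattenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-json-flatten").getOrCreate()

    // Hypothetical nested JSON schema; the real pipeline infers and merges it.
    val schema = new StructType()
      .add("order_id", StringType)
      .add("user", new StructType()
        .add("id", StringType)
        .add("region", StringType))
      .add("amount", DoubleType)

    val raw = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "orders")
      .load()

    // Kafka only provides key/value; parse the value as JSON, then flatten the
    // nested struct into top-level columns so it analyzes like a wide table.
    val wide = raw
      .select(from_json(col("value").cast("string"), schema).as("j"))
      .select(
        col("j.order_id").as("order_id"),
        col("j.user.id").as("user_id"),
        col("j.user.region").as("user_region"),
        col("j.amount").as("amount")
      )

    // Downstream, `wide` would be written to Hudi as in the earlier sketches;
    // a console sink is used here just to keep the example self-contained.
    wide.writeStream.format("console").start().awaitTermination()
  }
}
```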

Internally we also have a Lindorm-based solution. Lindorm is our self-developed KV database, compatible with wide-table interfaces such as HBase and Cassandra. It holds many historical files and a large amount of log data. Through scheduling by the internal LTS service, its full and incremental data are converted by the Lakehouse approach into column-store files that support analysis.

Both Kafka and SLS have the concept of sharding (Partition, Shard). When traffic changes greatly, capacity needs to scale out and in automatically, so the consumer side must actively perceive these changes and keep consuming without affecting data correctness. This kind of data is also Append-Only, so Hudi's small-file merging capability can be put to good use, making downstream analysis simpler, faster, and more efficient.

3. Customer Best Practices

That was the technical exploration; now let me introduce how it is applied with customers. One earlier customer, a cross-border e-commerce company, had the problem that its DB data was hard to analyze. They run PolarDB and MongoDB systems and wanted to import all data into the lake on OSS in near real time for analysis. Federated analytics is popular in the industry now, but querying the source database directly puts a lot of pressure on it; the better way is to ingest into the lake and analyze in the offline lake warehouse. So the offline lake warehouse is built with the Lakehouse approach, and computation and analysis, or further ETL, are connected downstream, avoiding any impact on the online data. On the same architecture they built their overall data platform, with applications and analysis flourishing, without affecting the online systems.


The difficulty with this customer is that they have many databases and tables and all kinds of usage patterns. We did a lot of optimization on Hudi, completed more than 20 patches and contributed them to the community, including improvements to meta-information handling and some Schema Evolution capabilities, which are also applied on the customer side.

Another class of customers does near real-time analysis of Kafka logs. Their original plan required many manual steps, including ingestion into the lake, data management, and merging small files. With the Lakehouse solution, the customer's data is connected, merged into the lake automatically, and its meta-information maintained, so the customer can use it directly; everything is wired up internally.

There is also the small-file problem. For their scenario we took part in building the Clustering capability with the Hudi community. Clustering automatically merges small files into large files, because large files are better for analysis. In addition, during merging, the data can be sorted by certain specific columns, so that subsequent access to those columns performs much better.

4. Future Prospects

Finally, let me share our team's thinking about the future and how Lakehouse can be applied going forward.


First, richer data sources entering the lake. An important value of Lakehouse lies in shielding the differences among data sources and breaking data silos. Many systems on the cloud hold data with great analytical value. In the future more data sources will be unified; supporting only one DB or Kafka does not maximize customer value. Only by gathering enough data into a large offline warehouse and shielding its complexity does the value to users become more and more obvious. Besides cloud products, there are other forms of ingestion, such as proprietary clouds, self-built systems, and self-service upload scenarios. The main thing is to strengthen the near-source layer.


Second, lower-cost, more reliable storage centered on data lifecycle management. Alibaba Cloud OSS has very rich billing options and supports multiple storage classes (standard, infrequent access, archive, and cold archive); there are dozens of items in the billing logic, and most people are not fully aware of them. But for users, cost is always a central design concern, especially when building a massive offline lake warehouse, because the data volume keeps growing and so does the cost.

I once worked with a customer who needed to store 30 years of data. Their business is stock analysis, and all the data from exchanges and brokerages has to be crawled and loaded into the big lake warehouse. Because 30 years of analysis are to be done, cost optimization is critical. They originally chose an online system and could not hold out for more than a few months, because the data volume was too large. Analytical data access has hot and cold characteristics, with some data accessed at high frequency and some at low frequency. By defining rules and logic, Lakehouse exploits these characteristics and automatically shields users from the complex maintenance of which directories need cold storage and which need hot storage, taking users one step further.


Third, stronger analytical capabilities. Among Hudi's analysis-acceleration capabilities, besides the Clustering mentioned earlier, there is also Compaction. Clustering is the merging of small files: in log scenarios, for example, one file is generated every time a batch is written. These files are generally not large, but the smaller the files, the more fragmented and costly the access becomes during analysis, since each file access requires authentication, connection establishment, and meta-information access. Accessing one large file pays that cost once, while accessing many small files multiplies it, and the overhead is very high. In Append scenarios, Clustering quickly merges small files into large files to avoid the linear degradation of analysis performance as writes accumulate, and keeps analysis efficient.
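Here is a hedged sketch of enabling inline clustering on a Hudi write; the thresholds and sort column are illustrative values, not recommendations:

```scala
object ClusteringOptionsSketch {
  // Extra options merged into a Hudi write (see the earlier write sketch) so
  // that Hudi periodically rewrites small files into large, sorted ones.
  val clusteringOptions: Map[String, String] = Map(
    "hoodie.clustering.inline"                              -> "true",
    // Trigger clustering every 4 commits (illustrative value).
    "hoodie.clustering.inline.max.commits"                  -> "4",
    // Files under ~100 MB are candidates for merging.
    "hoodie.clustering.plan.strategy.small.file.limit"      -> (100L * 1024 * 1024).toString,
    // Aim for merged files around 1 GB.
    "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (1024L * 1024 * 1024).toString,
    // Sort by a frequently filtered column so queries on it skip more data.
    "hoodie.clustering.plan.strategy.sort.columns"          -> "user_region"
  )
}
```

These options would be passed alongside the normal write options, for example `df.write.format("hudi").options(ClusteringOptionsSketch.clusteringOptions)` followed by the key, precombine, and table settings shown earlier.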

In Hudi, a Merge On Read table writes Deletes and Updates quickly into log files and merges the data during subsequent reads to form a complete logical view. The problem here is obvious: if there are 1,000 log files, every read has to merge 1,000 of them, and analysis performance degrades severely. This is where Hudi's Compaction capability periodically merges the log files. As mentioned earlier, if this were done entirely within the same ingestion job, especially the file merging, the computational overhead would be very high and the latency of the ingestion link would suffer greatly, so it must be scheduled asynchronously to guarantee write latency. These processes are also elastic: whether there are 100 files or 10,000, they can be handled quickly and elastically without affecting latency, which is a big advantage.
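For example, the ingestion job writing a MERGE_ON_READ table can keep compaction out of its own write path; a hedged sketch of the relevant options follows (values are illustrative):

```scala
object MorCompactionOptionsSketch {
  // Write options for the ingestion job on a MERGE_ON_READ table: keep the heavy
  // compaction work out of the write path so ingestion latency is unaffected.
  val morWriteOptions: Map[String, String] = Map(
    "hoodie.datasource.write.table.type"      -> "MERGE_ON_READ",
    // Do not compact synchronously inside the ingestion job.
    "hoodie.compact.inline"                   -> "false",
    // Roughly how many delta commits may accumulate before a compaction should
    // be scheduled (illustrative value).
    "hoodie.compact.inline.max.delta.commits" -> "10"
  )
  // The compaction itself then runs as a separate, elastically scheduled Spark
  // job (for example Hudi's compactor utility or a dedicated asynchronous
  // workload), merging log files into base files without blocking ingestion.
}
```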


Fourth, richer scenario-based applications. I personally feel that Lakehouse is still oriented toward building the near-source layer, together with a certain degree of aggregation. For higher-level aggregation and real-time needs there are more real-time data warehouse options: the currently popular DorisDB and ClickHouse have great advantages for real-time, high-frequency analysis, while real-time analysis based on Hudi, Lakehouse, and OSS does not have many advantages there, so building the near-source layer is the main focus.

We originally targeted near real-time ingestion scenarios, but some users do not need that much real-time capability; a periodic T+1 logical warehouse build is enough for them. They can use the Hudi + Lakehouse capability to query a portion of the logically incremental data each day, write it into Hudi, maintain the partitions, and get Schema Evolution along the way.

In the early days, as data volumes grew, customers split their data logically with sharded databases and tables. At analysis time, they found there were too many databases and tables, making analysis and correlation difficult. Here we can build the capability to merge multiple databases and multiple tables into one table in the warehouse for analysis.

Then there is cross-region federated analysis. Many customers, especially overseas ones, ask for this. Some customers serve overseas users, so part of their business must run overseas; this is especially common in cross-border e-commerce, where the procurement, warehousing, logistics, and distribution systems are all built in China, and they want to analyze all the data together. How? OSS provides cross-region replication, but only at the data level, without any logic. Here, Lakehouse can be used to build a logical layer that mixes data from different regions and aggregates it into the same region, providing unified SQL capabilities such as Join and Union.
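As a simple illustration (paths and table names are hypothetical), once the per-region tables have been replicated into one region they can be combined with plain Spark SQL:

```scala
import org.apache.spark.sql.SparkSession

object CrossRegionUnionSketch {
  def buildGlobalView(spark: SparkSession): Unit = {
    // Hypothetical paths of the same logical table replicated from two regions.
    val cnOrders = spark.read.format("hudi").load("oss://lake-cn/orders")
    val sgOrders = spark.read.format("hudi").load("oss://lake-sg/orders")

    cnOrders.createOrReplaceTempView("orders_cn")
    sgOrders.createOrReplaceTempView("orders_sg")

    // One logical view across regions: a plain UNION ALL that can then be
    // joined with other tables just like any local table.
    spark.sql(
      """
        |SELECT 'cn' AS region, * FROM orders_cn
        |UNION ALL
        |SELECT 'sg' AS region, * FROM orders_sg
      """.stripMargin).createOrReplaceTempView("orders_global")
  }
}
```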

Finally, Hudi has TimeTravel and incremental query capabilities. With these, incremental ETL can be built to clean different tables, generalized to some extent so that users can apply it more easily. Going forward we will build in more scenario-based capabilities, making it easier for users to build and apply lake warehouses.
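Below is a hedged sketch of both read modes with the Hudi Spark datasource; the paths and instant timestamps are placeholders, and the as.of.instant option assumes a Hudi version that supports time travel reads:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object HudiReadModesSketch {
  // Incremental query: only the records committed after `beginInstant`, which
  // is what an incremental ETL job would consume on each run.
  def readIncrement(spark: SparkSession, path: String, beginInstant: String): DataFrame =
    spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", beginInstant) // e.g. "20210701000000"
      .load(path)

  // Time travel: the table as it looked at a past instant (requires a Hudi
  // version that supports the as.of.instant read option).
  def readAsOf(spark: SparkSession, path: String, instant: String): DataFrame =
    spark.read.format("hudi")
      .option("as.of.instant", instant) // e.g. "20210701120000"
      .load(path)
}
```

With these two read modes, an incremental ETL job only processes the commits since its last run, while historical snapshots remain queryable.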
