
This article walks through common data lake interview questions and explains the key data lake concepts behind them.

Contents:

  1. What is a data lake
  2. The development of data lakes
  3. What are the advantages of data lakes
  4. What capabilities should a data lake have
  5. What problems are encountered when implementing a data lake
  6. The difference between a data lake and a data warehouse
  7. Why do you need a data lake? What's the difference?
  8. Data lake challenges
  9. Lake-warehouse integration
  10. What open source data lake components are currently available
  11. Comparison of the three major data lake components

1. What is a data lake

This article was first published on the public account [Learn Big Data in Five Minutes], which also hosts the companion piece: a nanny-level tutorial on data warehouse construction.

A data lake is an evolving, scalable infrastructure for big data storage, processing, and analysis. It is data-oriented: it can fully acquire, fully store, process in multiple modes, and manage over the full life cycle data of any source, any speed, any scale, and any type; and by interacting and integrating with external heterogeneous data sources, it supports a wide range of enterprise-level applications.

This can be explained quickly with an architecture diagram, taking Alibaba's data architecture as an example:

  • ODS (operational data store / staging area) stores the raw data from the various business (production) systems — this is the data lake.
  • CDM holds data that has been consolidated and cleaned. The DWS summary layer is a subject-oriented data warehouse (in the narrow sense), used to produce BI reports.

Simply put, a data lake is the storage area for the raw data. Although the concept is rarely discussed under this name in China, most Internet companies already have one: in China the entire HDFS cluster is commonly called the data warehouse (in the broad sense), that is, the place where all data is stored.
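For illustration, here is a minimal PySpark sketch of this layering, assuming a Hive-enabled Spark setup; the paths, databases, and table names (ods.ods_orders, dws.dws_order_stats_1d) are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("layering-demo")
         .enableHiveSupport()
         .getOrCreate())

# ODS (data lake layer): land the raw order events as-is, with no modeling applied.
raw_orders = spark.read.json("hdfs:///data/raw/orders/")      # hypothetical landing path
raw_orders.write.mode("overwrite").saveAsTable("ods.ods_orders")

# DWS (warehouse summary layer): a subject-oriented aggregate built for BI reports.
daily_stats = (spark.table("ods.ods_orders")
               .groupBy("order_date", "city")
               .agg(F.count("order_id").alias("order_cnt"),
                    F.sum("amount").alias("gmv")))
daily_stats.write.mode("overwrite").saveAsTable("dws.dws_order_stats_1d")
```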

2. The development of data lakes

The data lake is a concept first proposed by Pentaho CTO James Dixon in 2011. He argued that data marts and data warehouses, because of their ordered, pre-modeled nature, inevitably create data silos, whereas a data lake, because of its openness, can solve the data silo problem.

Why not a data river?

Because the data must be retained, not flow away like a river running eastward.

Why not a data pool?

Because it needs to be big enough: big data is too large to fit in a single pool.

Why not data sea?

Because enterprise data must have boundaries: it can be circulated and exchanged, but privacy and security come first.

Therefore the data must be retainable, the storage must be large enough, and the storage must have boundaries. Enterprise data also needs to be accumulated over a long period — hence a "data lake".

At the same time, lake water naturally stratifies to suit different ecosystems, which matches how enterprises build a unified data center and manage data in tiers: hot data stays in the upper layer for easy circulation and use, while warm and cold data sit on different storage media, balancing storage capacity against cost.

But as data lakes were adopted across enterprises, the thinking became: this data looks useful, put it in; that data looks useful too, put it in as well. All data was thrown into the lake indiscriminately, with no rules or boundaries in the surrounding technologies and tools. When all data is considered useful, all data becomes garbage, and the data lake turns into a data swamp that imposes high costs on the enterprise.

3. What are the advantages of data lakes

  • Easier data collection : A big difference between a data lake and a data warehouse is Schema On Read: schema information is needed only when the data is used. A data warehouse is Schema On Write: a schema must be designed before data is stored. Because the lake places no restrictions on writes, it collects data far more easily (see the sketch after this list).
  • Extract more value from data : Data warehouses and data marts can only answer pre-defined questions, because they retain only some of the attributes of the data. A data lake stores all of the raw, detailed data and can therefore answer many more questions. It also lets every role in the organization analyze data through self-service tools, and apply AI and machine learning to extract further value.
  • Eliminate data silos : Data lakes aggregate data from various systems, which eliminates data silos.
  • Better scalability and agility: The data lake can utilize a distributed file system to store data, so it has high scalability. The use of open source technologies also reduces storage costs. Data lakes are less rigid in structure and therefore inherently more flexible, increasing agility.
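As a rough illustration of the first point above, here is a hedged PySpark sketch contrasting schema-on-write (declaring a warehouse table schema up front) with schema-on-read (inferring the schema of raw lake files at query time); the database, table, view, and path names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("schema-demo")
         .enableHiveSupport()
         .getOrCreate())

# Schema on Write (warehouse style): the table schema is declared up front,
# and every load has to conform to it before the data is stored.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.user_events (
        user_id    BIGINT,
        event_type STRING,
        event_time TIMESTAMP
    ) STORED AS PARQUET
""")

# Schema on Read (lake style): raw JSON lands in the lake unchanged;
# the schema is only inferred and applied when the data is read.
raw = spark.read.json("hdfs:///lake/raw/user_events/")       # hypothetical landing path
raw.printSchema()                                            # schema discovered at read time
raw.createOrReplaceTempView("raw_user_events")
spark.sql("SELECT event_type, count(*) AS cnt FROM raw_user_events GROUP BY event_type").show()
```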

4. What capabilities should a data lake have?

  1. Data integration capabilities:

The lake needs the ability to integrate data from a variety of sources, and its storage should also be diverse, e.g. HDFS, Hive, HBase, etc.

  2. Data governance capabilities:

The core of governance is maintaining the metadata of the data. Requiring all data entering the lake to come with metadata should be treated as the minimum governance control; without metadata, a data lake risks becoming a data swamp. Richer capabilities include:

  • Automatically extract metadata and classify data according to metadata to form a data catalog.
  • Automatically analyze the data catalog and use AI and machine learning to discover relationships between datasets.
  • Automatically build data lineage diagrams.
  • Track data usage so that data can be used as a product, forming a data asset.
  3. Data search and discovery capabilities:

Think of the entire Internet as one huge data lake. The reason people can use the data in that lake so effectively is search engines like Google: by searching, people easily find the data they want and then analyze it. Search is therefore a very important capability of a data lake.

  4. Data security management and control capabilities:

Controlling data usage rights and desensitizing or encrypting sensitive data are also the capabilities that data lakes must have for commercial use.

  5. Data quality inspection capabilities:

Data quality is the key to correct analysis, so data entering the lake must be quality-checked and quality problems in the lake must be identified promptly, to guarantee effective data exploration.

  6. Self-service data exploration capabilities:

There should be a set of usable data analysis tools so that all kinds of users can explore the data in the lake on a self-service basis, including:

  • Supports joint analysis capabilities for multiple repositories such as streaming, NoSQL, and graphs
  • Support interactive big data SQL analysis
  • Support AI, machine learning analysis
  • Supports OLAP-like BI analysis
  • Support report generation

5. What are the problems encountered in the implementation of the data lake

When the data lake was first proposed, it was just a simple concept. From an idea to a system that can be implemented, there are many practical problems that have to be considered:

First, the idea of storing all raw data rests on the premise that storage is cheap. With data now being generated faster and in ever larger volumes, whether it is economically acceptable to store all raw data regardless of its value is an open question.

Secondly, the data lake stores all kinds of raw, detailed data, including transaction data, user data, and other sensitive data. How is the security of this data ensured? How are user access rights controlled?

Again, how to manage the data in the lake? Who is responsible for the quality of the data, the definition of the data, the changes to the data? How to ensure the consistency of data definition and business rules?

The idea of the data lake is good, but it still lacks the kind of methodological foundation that data warehousing has, together with a set of operable tools and an ecosystem. Because of this, using Hadoop to process specific, high-value data and build data warehouse models has seen more success, while projects that tried to implement the data lake concept have run into a series of failures. Some typical reasons for data lake failure are summarized here:

  1. Data swamps : Data swamps occur when more and more data is fed into the data lake, but there is no effective way to track this data. In this kind of failure, people put everything in HDFS hoping to discover something later, but it didn't take long for them to forget what was there.
  2. Big ball of data mud : All kinds of new data are dumped into the lake with varying organization and quality. The lack of self-service tools for examining, cleaning, and reorganizing the data makes it hard to create value from it.
  3. Lack of self-service analytics tools : Directly analyzing data in the data lake is difficult due to the lack of easy-to-use self-service analytics tools. Typically, a data engineer or developer creates a small, curated dataset that is delivered to a wider range of users for data analysis using familiar tools. This limits wider participation in exploring big data, reducing the value of the data lake.
  4. Lack of modeling methodologies and tools : In a data lake, every job seems to start from scratch, because there is little way to reuse the data produced by previous projects. We say data warehouses are hard to change to fit new needs partly because modeling the data takes a lot of time, yet it is exactly this modeling that makes data shareable and reusable. Data lakes also need data modeling; otherwise analysts have to start from scratch every time.
  5. Lack of data security management : The common idea is that everyone has access to all data, but that doesn't work. Enterprises have an instinct to protect their own data, and ultimately they must manage data security.
  6. One data lake does it all : Everyone is excited about the idea of storing all their data in one repository. However, there will always be new repositories outside the data lake, and it's hard to kill them all. In fact, what most companies need is joint access to multiple repositories. Whether it is stored in one place or not is not that important.

6. The difference between a data lake and a data warehouse

A data warehouse , to be precise, is used for the accumulation and analysis of historical data , and has three characteristics:

  • The first is integration . Due to the large number of data sources, technologies and specifications are required to unify the storage methods;
  • The second is non-volatile and changes over time . The data warehouse stores snapshots of each day in the past and is usually not updated. Users can compare data changes forward or backward on any day;
  • The third is subject orientation : data is effectively organized and encoded by business subject, so that a theoretically optimal structure can be realized in applications.

A data lake , to be precise, starts from the goal of complementing what data warehouses lack, such as real-time processing and interactive analysis . Its most important feature is its rich set of compute engines: batch, streaming, interactive, machine learning — whatever the enterprise needs. Data lakes also have three characteristics:

  • One is flexibility . Business uncertainty is assumed to be the norm: when future changes cannot be anticipated, the technical infrastructure must be able to fit the business "on demand";
  • The second is manageability . The data lake needs to keep both the raw and the processed information, and the flow of data through access, storage, analysis, and use must be traceable along dimensions such as data source, data format, and data life cycle;
  • The third is polymorphism . The engines themselves need to be as rich as possible, because business scenarios are not fixed; polymorphic engine support and extensibility adapt better to rapid business change.

7. Why do you need a data lake? What's the difference?

The difference between a data lake and a data warehouse is the difference between raw data and a warehouse model. The tables in a data warehouse (narrow sense) are mainly fact and dimension tables used for BI and report generation, which differ from the raw data.

Why the emphasis on data lakes?

The real reason is that data science and machine learning have entered the mainstream and need raw data for analysis, whereas a data warehouse's dimensional model is usually already aggregated.

On the other hand, the data used in machine learning is not limited to structured data. Unstructured data such as user comments and images can also be applied to machine learning.

But there is actually a bigger difference behind the data lake:

  • The traditional data warehouse works in a centralized manner: business personnel send requirements to the data team, and the data team processes and develops dimension tables according to the requirements for the business team to query through BI reporting tools.
  • A data lake is open and self-service: data is opened up for everyone to use; the data team provides tools and environments (though centralized dimension table construction is still needed), and the business teams do their own development and analysis.

In other words, the difference lies in organizational structure and division of labor: the data team in a traditional enterprise may be treated as IT, spending all day fulfilling data extraction requests, while in newer Internet/technology teams the data team is responsible for providing easy-to-use tools, and business departments work with the data directly.

8. Data Lake Challenges

Converting from a traditional centralized data warehouse to an open data lake is not easy, and many problems will be encountered:

  • Data discovery: How to help users discover data and understand what data is there?
  • Data Security: How to manage data permissions and security? Because some data is sensitive or should not be directly available to everyone (such as phone numbers, addresses, etc.)
  • Data management: multiple teams use the data, so how are data products (such as user profiles, features, and metrics) shared to avoid duplicate development?

This is also the direction that major Internet companies are currently improving!

9. Lake-warehouse integration

In 2020, Databricks first proposed the concept of the Data Lakehouse, hoping to merge data lake and data warehouse technologies into one. As soon as the concept appeared, the various cloud vendors followed suit.

The Data Lakehouse is a new data architecture that absorbs the advantages of both the data warehouse and the data lake. Data analysts and data scientists can work on data in the same data store, and it also makes the company's data governance more convenient.

1) The current data storage scheme

Historically, we have used two data stores to structure our data:

  • Data warehouse : It mainly stores structured data organized by relational database. The data is transformed, consolidated, and cleaned, and imported into the target table. In the data warehouse, the structure of the data storage is strongly matched with its defined schema.
  • Data Lake : Store any type of data, including unstructured data like pictures and documents. Data lakes are usually larger and their storage costs are cheaper. The data stored in it does not need to meet a specific schema, and the data lake does not attempt to enforce a specific schema on it. Instead, the owner of the data typically parses the schema when reading the data (schema-on-read), and applies transformations to it when processing the corresponding data.

Nowadays, many companies often build two storage architectures of data warehouse and data lake at the same time, one large data warehouse and multiple small data lakes. In this way, the data will have some redundancy in the two storages.

2) Data Lakehouse (lake-warehouse integration)

The Data Lakehouse attempts to bridge the gap between data warehouses and data lakes. Building the warehouse on top of the lake makes storage cheaper and more flexible; at the same time, a lakehouse can effectively improve data quality and reduce data redundancy . In building a lakehouse, ETL plays a very important role: it converts unstructured data in the lake layer into structured data in the warehouse layer.
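As a minimal sketch of this ETL step, the following PySpark job reads raw JSON from the lake layer and writes a cleaned, structured Delta table back into the same lakehouse storage. Delta Lake is used here only as one possible lakehouse table format; the paths and column names are hypothetical, and the delta-spark package is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession, functions as F

# Assumes the delta-spark package is available
# (e.g. spark-submit --packages io.delta:delta-core_2.12:2.4.0).
spark = (SparkSession.builder.appName("lakehouse-etl")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Raw (lake) layer: semi-structured click logs exactly as they landed.
raw = spark.read.json("s3a://lake/raw/clicks/")              # hypothetical landing path

# Cleaned, structured (warehouse) layer stored in the same lakehouse storage as a Delta table.
cleaned = (raw.filter(F.col("user_id").isNotNull())
              .withColumn("event_time", F.to_timestamp("ts"))
              .select("user_id", "page", "event_time"))
cleaned.write.format("delta").mode("append").save("s3a://lake/curated/clicks/")
```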

The following is explained in detail:

Data Lakehouse :

According to Databricks' definition, a Lakehouse is a new paradigm that combines the advantages of data lakes and data warehouses and addresses the limitations of data lakes. The Lakehouse uses a new system design: it implements data structures and data management features similar to those of a data warehouse directly on top of the low-cost storage used by the data lake.

Explanation expansion :

The integration of lake and warehouse, a simple understanding is to combine enterprise-oriented data warehouse technology with data lake storage technology to provide enterprises with a unified and sharable data base.

It avoids data movement between a traditional data lake and data warehouse: raw data, processed and cleaned data, and modeled data are all stored in one integrated "lakehouse", which can serve the business with high-concurrency, precise, high-performance queries over historical and real-time data, and can also support analytical reporting, batch processing, data mining, and other analytical workloads.

The emergence of the integrated solution of lake and warehouse helps enterprises to build a new and integrated data platform. Through the support of machine learning and AI algorithms, the closed loop of data lake + data warehouse is realized to improve business efficiency. The capabilities of the data lake and the data warehouse are fully combined to form complementarity, and at the same time connect to the diverse computing ecology of the upper layer.

10. What open source data lake components are currently available

At present, the open-source data lakes include Hudi, Delta Lake, and Iceberg, known as the "Three Musketeers of the data lake".

1) Hudi

Apache Hudi is a data lake storage format that provides the ability to update and delete data and consume changed data on top of the Hadoop file system.

Hudi supports the following two table types:

  • Copy On Write

Data is stored in Parquet format. Updates to a Copy On Write table are implemented by rewriting the files.

  • Merge On Read

Data is stored using a mix of columnar file formats (Parquet) and row-based file formats (Avro). Merge On Read uses columnar format to store Base data and row format to store incremental data. The newly written incremental data is stored in the row file, and the COMPACTION operation is executed according to the configurable policy to merge the incremental data into the column file.
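A hedged PySpark sketch of how the table type is chosen at write time is shown below; the base path, table, and column names are hypothetical, and the Hudi Spark bundle is assumed to be available:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hudi-table-types")
         # Hudi requires Kryo serialization; the Hudi Spark bundle jar must be on the classpath.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "2023-01-01 10:00:00", 99.9)], ["order_id", "ts", "amount"])

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # COPY_ON_WRITE rewrites Parquet files on update;
    # MERGE_ON_READ appends row-based logs and compacts them into Parquet later.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("hdfs:///lake/hudi/orders"))      # hypothetical base path
```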

Application scenarios

  • Near real-time data ingestion

Hudi supports the ability to insert, update and delete data. It can ingest log data such as message queue (Kafka) and log service SLS into Hudi in real time, and also supports real-time synchronization of change data generated by the database Binlog.

Hudi optimizes for small files generated during data writing. Therefore, compared to other traditional file formats, Hudi is more friendly to the HDFS file system.
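A hedged sketch of this ingestion path, using Spark Structured Streaming to upsert a Kafka topic into a Hudi table (HoodieDeltaStreamer is another option not shown here); the broker address, topic, fields, and paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("hudi-streaming-ingest")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Read the raw event stream from Kafka (broker and topic are hypothetical).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "user_events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(F.get_json_object("json", "$.user_id").alias("user_id"),
                  F.get_json_object("json", "$.event_time").alias("event_time"),
                  F.get_json_object("json", "$.action").alias("action")))

# Continuously upsert the stream into a Hudi table on the lake.
(events.writeStream.format("hudi")
       .option("hoodie.table.name", "user_events")
       .option("hoodie.datasource.write.recordkey.field", "user_id")
       .option("hoodie.datasource.write.precombine.field", "event_time")
       .option("checkpointLocation", "hdfs:///lake/checkpoints/user_events")
       .outputMode("append")
       .start("hdfs:///lake/hudi/user_events"))
```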

  • Near real-time data analysis

Hudi supports multiple data analysis engines, including Hive, Spark, Presto, and Impala. As a file format, Hudi does not need to rely on additional service processes, and is more lightweight in use.

  • Incremental data processing

Hudi supports an incremental query type: through Spark Streaming, it can query the data that changed after a given COMMIT. Hudi thus provides the ability to consume changed data on HDFS, which can be used to optimize existing system architectures.
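A minimal sketch of such an incremental query is shown below; the commit timestamp and table path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hudi-incremental-query")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Pull only the records committed after a given instant (the timestamp is a placeholder).
incremental = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20230101000000")
               .load("hdfs:///lake/hudi/user_events"))       # hypothetical table base path

incremental.createOrReplaceTempView("changed_events")
spark.sql("SELECT action, count(*) AS cnt FROM changed_events GROUP BY action").show()
```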

2) Delta Lake

Delta Lake is a storage middle layer, carrying schema information, that sits between the Spark computing framework and the storage system. It brings three main features to Spark:

First, Delta Lake enables Spark to support data update and delete functions;

Second, Delta Lake enables Spark to support transactions;

Third, it supports data version management, allowing users to query historical data snapshots.
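A hedged sketch of these three features using the delta-spark Python API; the table path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-features")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3a://lake/delta/users"            # hypothetical Delta table path
tbl = DeltaTable.forPath(spark, path)

# 1) & 2) Updates and deletes run as ACID transactions on the table.
tbl.update(condition="country = 'CN'", set={"tier": "'vip'"})
tbl.delete(condition="is_test = true")

# 3) Time travel: read an earlier snapshot of the same table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```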

Core Features

  • ACID Transactions: Provides ACID transactions for data lakes to ensure data integrity when multiple data pipelines read and write data concurrently.
  • Data versioning and time travel: Data snapshots are provided, enabling developers to access and restore earlier versions of data for review, rollback, or replay experiments
  • Scalable metadata management: store the metadata information of tables or files, and treat the metadata as data, and store the corresponding relationship between metadata and data in the transaction log;
  • Unified stream and batch processing: a Delta table is both a batch table and a streaming source and sink;
  • Auditing of data operations: The transaction log records details of every change made to data, providing a complete audit trail of changes;
  • Schema management: automatically verifies that the schema of written data is compatible with the table's schema, and supports explicitly adding columns and automatic schema updates;
  • Data table operations (similar to traditional database SQL): merge, update, delete, etc., with a fully Spark-compatible Java/Scala API;
  • Unified format: All data and metadata in Delta is stored as Apache Parquet.

3) Iceberg

The Iceberg official definition: Iceberg is a general table format (data organization format) that can be adapted to engines such as Presto and Spark, providing high-performance read/write and metadata management.

Compared with traditional data warehouses, the most obvious feature of data lakes is the excellent T+0 capability, which solves the stubborn problems of data analysis in the Hadoop era. The traditional data processing flow from data storage to data processing usually requires a long link and involves many complex logics to ensure data consistency. Due to the complexity of the architecture, the entire pipeline has a significant delay.

Iceberg's ACID capability simplifies the design of the whole pipeline, reduces its latency, and lowers the cost of correcting data. Traditional Hive/Spark has to read data out, modify it, and write it back, which makes modification very expensive; Iceberg's modify and delete capabilities effectively reduce this overhead and improve efficiency.

  1. ACID capability: seamlessly fills in the missing piece of stream-batch unified data storage

With the continuous development of technologies such as Flink, the stream-batch unified ecosystem keeps improving, but stream-batch unified data storage has always been a gap, one that data lake technologies such as Iceberg are slowly filling.

Iceberg provides ACID transaction capabilities, and upstream data can be seen as soon as it is written, without affecting current data processing tasks, which greatly simplifies ETL;

Iceberg provides upsert and merge into capabilities, which can greatly reduce the data storage delay;
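A minimal sketch of such a merge, written as Spark SQL against an Iceberg catalog; the catalog, database, table, and column names are hypothetical, and the target table is assumed to exist with matching columns:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-merge")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "hadoop")
         .config("spark.sql.catalog.lake.warehouse", "hdfs:///lake/iceberg")
         .getOrCreate())

# A batch of change rows to apply (in practice this could come from a CDC stream).
spark.createDataFrame([(1, 120.0)], ["order_id", "amount"]) \
     .createOrReplaceTempView("updates")

# Upsert the changes into the target Iceberg table in a single ACID commit.
spark.sql("""
    MERGE INTO lake.db.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```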

  2. Unified data storage, seamless connection between compute engines and data storage

Iceberg provides a streaming-based incremental computing model and a batch-based full-scale computing model. Batch and streaming tasks can use the same storage model, and data is no longer isolated; Iceberg supports hidden partitions and partition evolution, which is convenient for businesses to update data partition policies.

Iceberg shields the differences in underlying data storage formats, providing support for Parquet, ORC and Avro formats. Iceberg acts as an intermediate bridge, transferring the capabilities of the upper-level engine to the lower-level storage format.

  3. Open architecture design, development and maintenance costs are relatively controllable

The architecture and implementation of Iceberg are not bound to a specific engine. It implements a general data organization format, which can be used to easily interface with different engines. Currently, the computing engines supported by Iceberg include Spark, Flink, Presto and Hive.

Compared with Hudi and Delta Lake, Iceberg's architecture is more elegantly implemented, with a complete definition and evolution design for data formats and the type system, and it is optimized for object storage: in its data organization Iceberg fully considers the characteristics of object stores, avoiding time-consuming listing and rename operations, which makes it more suitable for data lake architectures built on object storage.

  4. Incremental data reading, a sharp sword for real-time computing

Iceberg supports reading incremental data as a stream, and supports Structured Streaming as well as the Flink table source.
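A minimal sketch of an incremental batch read between two snapshots via the Spark DataFrame reader; the catalog configuration mirrors the earlier sketch, and the snapshot IDs and table name are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-incremental")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "hadoop")
         .config("spark.sql.catalog.lake.warehouse", "hdfs:///lake/iceberg")
         .getOrCreate())

# Read only the data appended between two snapshots (IDs are placeholders,
# discoverable from the table's snapshot metadata).
inc = (spark.read.format("iceberg")
       .option("start-snapshot-id", "6023652274956613543")
       .option("end-snapshot-id",   "8072949486264603202")
       .load("lake.db.orders"))
inc.show()
```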

11. Comparison of the three major data lake components

1) Overview

Delta Lake

Thanks to Apache Spark's great commercial success, Delta Lake, launched by Databricks, the company behind Spark, is particularly eye-catching. Before the Delta data lake, Databricks customers typically used the classic Lambda architecture to build their stream and batch scenarios.

Hudi

Apache Hudi is a data lake project designed by Uber engineers to meet the needs of its internal data analysis. The fast upsert/delete and compaction functions it provides can be said to precisely hit the pain points of the masses. Coupled with the active community construction of project members, including technical details sharing, domestic community promotion, etc., it is also gradually attracting the attention of potential users.

Iceberg

Netflix's data lake was originally built with the help of Hive, but after discovering many flaws in Hive's design, it began to develop its own Iceberg, and eventually evolved into Apache's next highly abstract and general open source data lake solution.

Apache Iceberg seems relatively unremarkable at present: its community attention is, for now, not as high as Delta's and its features are not as rich as Hudi's, but it is a dark horse. Its high level of abstraction and very elegant design lay a good foundation for becoming a general data lake solution.

2) Common ground

All three are data-lake storage middle layers whose data management functions are based on a series of meta files. The role of the meta files is similar to a database's catalog/WAL, handling schema management, transaction management, and data management. Unlike a database, these meta files are stored in the storage system together with the data files and are directly visible to users. This approach inherits the big-data tradition that data is visible to users, but it also increases the risk of the data being inadvertently damaged: once the meta directory is deleted, the table is destroyed and recovery is very difficult.

The meta files contain the table's schema information, so the system can track schema changes on its own and support schema evolution. The meta files also act as a transaction log (which requires a file system with atomicity and consistency guarantees). Every change to the table generates new meta files, so the system gets ACID and multi-version support and can provide access to historical versions. In these respects, the three are identical.

3) About Hudi

Hudi's design goal is just like its name, Hadoop Upserts Deletes and Incrementals (formerly Hadoop Upserts anD Incrementals): it focuses on upserts, deletes, and incremental data processing. Its main write paths are the Spark Hudi DataSource API and its own HoodieDeltaStreamer, which support three write modes: UPSERT, INSERT, and BULK_INSERT. Deletes are supported by specifying certain options at write time; there is no dedicated pure delete interface.

In terms of query, Hudi supports Hive, Spark, and Presto.

In terms of performance, Hudi designed the HoodieKey , something like a primary key. For query performance, the general approach is to generate filter conditions from query predicates and push them down to the data source. Hudi does not do much work here; its performance relies entirely on the engine's built-in predicate pushdown and partition pruning.

Another great feature of Hudi is that it supports Copy On Write and Merge On Read. The former does data merge when writing, and the writing performance is slightly worse, but the reading performance is higher. The latter does merge when reading, and the read performance is poor, but the writing data will be more timely, so the latter can provide near real-time data analysis capabilities. Finally, Hudi provides a script called run_sync_tool to synchronize the schema of data to Hive tables. Hudi also provides a command line tool for managing Hudi tables.

4) About Iceberg

Iceberg has no design similar to HoodieKey and does not emphasize primary keys. Without a primary key, operations such as update/delete/merge must be implemented via joins, and joins require an execution engine with SQL-like capabilities.

Iceberg has done a lot of work on query performance. Worth mentioning is its hidden partition feature: for the data a user writes, the user can select some columns and apply appropriate transforms to form new columns used as partition columns. The partition columns are only used to partition the data and do not appear directly in the table's schema.
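A minimal sketch of hidden partitioning and partition evolution in Spark SQL; the catalog, table, and column names are hypothetical, and the Iceberg SQL extensions are assumed to be configured:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-hidden-partition")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "hadoop")
         .config("spark.sql.catalog.lake.warehouse", "hdfs:///lake/iceberg")
         .getOrCreate())

# The partition spec is a transform of event_ts; it never shows up as an extra
# column in the schema, and queries filtering on event_ts are pruned automatically.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        user_id  BIGINT,
        action   STRING,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Partition evolution: the partition strategy can change later without rewriting old data.
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, user_id)")
```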

5) About Delta

Delta is positioned as a stream-batch unified Data Lake storage layer that supports update/delete/merge. Since it comes from Databricks, all of Spark's write paths are supported, including DataFrame-based batch and streaming as well as SQL Insert and Insert Overwrite (open-source Delta does not yet support SQL writes; EMR has added support). It does not emphasize primary keys, so update/delete/merge are implemented on top of Spark's join. For writes, Delta is strongly bound to Spark, which differs from Hudi: Hudi's writes are not bound to Spark (you can use Spark, or Hudi's own write tools).

In terms of queries, open-source Delta currently supports Spark and Presto, but Spark is indispensable because it is needed to process the Delta log. This means that to query Delta with Presto you also need a Spark job. Worse, Presto queries are based on SymlinkTextInputFormat: before querying, a Spark job must generate the symlink files, and if the table is updated in real time you have to run Spark SQL and then Presto for every query. EMR has improved this so that a Spark task does not have to be started in advance.
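For reference, a hedged sketch of that manifest-generation step with the delta-spark API; the table path is hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-presto-manifest")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Generate the symlink manifest files that Presto's SymlinkTextInputFormat reads.
# After the table changes, the manifest must be regenerated (or configured to auto-update).
tbl = DeltaTable.forPath(spark, "s3a://lake/delta/users")    # hypothetical table path
tbl.generate("symlink_format_manifest")
```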

In terms of query performance, the open source Delta has hardly any optimizations.

Delta is not as good as Hudi at data merging and not as good as Iceberg at query performance. Does that make Delta useless? Not at all. One of Delta's great strengths is its integration with Spark, especially its stream-batch unified design: with a multi-hop data pipeline it can support analytics, machine learning, CDC, and other scenarios. Flexible usage and complete scenario coverage are its biggest advantages over Hudi and Iceberg. In addition, Delta claims to be an improvement on both the Lambda and Kappa architectures, with no need to worry about which to choose; on this point Hudi and Iceberg cannot match it.

6) Summary

The original intentions of the three engines are not exactly the same: Hudi targets incremental upserts, Iceberg is positioned for high-performance analytics and reliable data management, and Delta is positioned for stream-batch unified data processing. These differences in target scenarios led to differences in design, with Hudi's design in particular standing out from the other two. Whether they will converge over time or each build barriers around their respective strengths remains to be seen.

Among the three open-source projects, Delta and Hudi are deeply bound to Spark's code, especially on the write path; both projects essentially took Spark as their default compute engine from the start. Apache Iceberg's direction is very firm: its goal is a generally designed Table Format.

It thoroughly decouples the compute engine from the underlying storage system, which makes it easy to diversify compute engines and file formats, and it properly fills in the Table Format layer of the data lake architecture, so it is better placed to become the open-source de facto standard for that layer.

On the other hand, Apache Iceberg is also evolving toward stream-batch unified data storage. The design of manifests and snapshots effectively isolates the changes of different transactions, which is very convenient for batch processing and incremental computing. Moreover, Apache Flink is already a stream-batch unified compute engine; the two fit perfectly and can together form a stream-batch unified data lake architecture.

The community behind the Apache Iceberg project is also very strong. Abroad, companies such as Netflix, Apple, LinkedIn, and Adobe run petabytes of production data on Apache Iceberg; in China, giants such as Tencent also run huge volumes of data on it, with the largest business writing tens of terabytes of incremental data every day.

Reference links

  1. Nanny-level tutorial on data warehouse construction
  2. The most complete guide to data warehouse construction specifications
  3. The most complete collection of classic big data SQL interview questions
  4. Meituan's data platform and data warehouse construction practice, a 100,000+ word summary


