This article was first published on my Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
| Version | Date | Remarks |
| --- | --- | --- |
| 1.0 | 2021.5.9 | First article |
These are my study notes: a large number of excerpts from the Internet and from books, organized to present the content I consider most relevant.
Big data starts from the data source and, after collection, analysis, and mining, finally yields value. This generally involves six main links: data collection, data storage, resource management and service coordination, the computing engine, data analysis, and data visualization (the overall technical system is shown in the accompanying figure). Each link faces technical challenges of a different degree.
Data source
The data collection layer is composed of modules that connect directly to the data sources and is responsible for collecting their data in real time or near real time. Data sources are distributed, heterogeneous, diverse, and generated as streams:
❑ Distributed: Data sources are usually spread across different machines or devices and connected through a network.
❑ Heterogeneous: Any system that generates data can serve as a data source, such as web servers, databases, sensors, fitness wristbands, and video cameras.
❑ Diverse: Data formats vary widely, including relational data such as basic user information and non-relational data such as pictures, audio, and video.
❑ Streaming generation: A data source is like a faucet that continuously produces "flowing water" (data); the collection system should forward that data to the backend in real time or near real time so it can be analyzed promptly.
Data collection layer
❑ The data collection layer is mainly composed of relational and non-relational data collection components, plus a distributed message queue.
❑ Sqoop/Canal: Relational data collection and import tools that bridge relational databases (such as MySQL) and Hadoop (such as HDFS). Sqoop can bulk-import data from a relational database into Hadoop and export it back the other way, while Canal can be used for incremental data import.
❑ Flume: A non-relational data collection tool, mainly for streaming log data, which it can collect, filter, and aggregate in near real time before loading it into storage systems such as HDFS.
❑ Kafka: A distributed message queue, generally used as a data bus, that lets multiple data consumers subscribe to the data they are interested in. Compared with other message queues, its distributed, highly fault-tolerant design makes it better suited to big data scenarios (see the sketch below).
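To make the "data bus" idea concrete, here is a minimal sketch of one producer publishing collected events and one consumer group subscribing to them, using the third-party kafka-python package. The broker address, topic name, and group id are illustrative assumptions, not part of the original notes.

```python
# Minimal Kafka "data bus" sketch: a producer publishes collected events,
# and an independent consumer group subscribes to them.
# Assumes a broker at localhost:9092 and the kafka-python package;
# the topic name "collected-logs" is made up for illustration.
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("collected-logs", {"host": "web-01", "event": "page_view"})
producer.flush()

# Each consumer group receives its own copy of the stream, so several
# downstream systems can subscribe to the same data independently.
consumer = KafkaConsumer(
    "collected-logs",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break  # stop after one message in this sketch
```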
Data storage layer
In the era of big data, the data collection system continuously sends all kinds of data to a centralized storage system. This places high demands on the scalability, fault tolerance, and storage model of the data storage layer, summarized as follows:
❑ Scalability: In practice, the amount of data keeps growing and the storage capacity of an existing cluster soon reaches its upper limit; new machines must then be added to expand capacity, so the storage system itself needs very good linear scalability.
❑ Fault tolerance: For cost reasons, big data systems are assumed from the outset to be built on cheap machines, so the system itself must provide a fault-tolerance mechanism that guarantees no data is lost when a machine fails.
❑ Storage model: Because the data is diverse, the data storage layer should support multiple data models so that both structured and unstructured data can be stored easily.
Typical applications (an HDFS sketch follows the list):
- HDFS
- Kudu
- HBase
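As a taste of how an application talks to this layer, the sketch below writes and reads a file on HDFS through its WebHDFS interface using the third-party `hdfs` (HdfsCLI) Python package. The NameNode URL, user name, and path are illustrative assumptions.

```python
# Minimal HDFS read/write sketch via WebHDFS, assuming the "hdfs"
# (HdfsCLI) package and a NameNode reachable at the address below.
from hdfs import InsecureClient

# NameNode web address and user name are illustrative assumptions.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file; HDFS replicates its blocks across DataNodes,
# which is what gives the layer its fault tolerance.
client.write("/tmp/demo.txt", data="hello big data", overwrite=True)

# Read it back.
with client.read("/tmp/demo.txt") as reader:
    print(reader.read())
```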
Resource management and service coordination layer
This layer mainly solves two problems:
- Low resource utilization
- High operation and maintenance costs
IaaS, Kubernetes (K8s), and Omega all belong to this layer.
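As a small illustration of this layer's job (one shared cluster serving many workloads), the sketch below lists the pods running across all namespaces using the official Kubernetes Python client. It assumes a reachable cluster and a local kubeconfig.

```python
# Minimal sketch: inspect what is sharing cluster resources, using the
# official "kubernetes" Python client. Assumes a valid ~/.kube/config.
from kubernetes import client, config

config.load_kube_config()  # read credentials from the local kubeconfig
v1 = client.CoreV1Api()

# Every pod, whichever team or framework owns it, draws on the same
# pooled resources that this layer schedules and accounts for.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```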
Computing engine layer
The computing engine layer is the most active layer in big data technology; to this day, new computing engines are still being proposed. Generally speaking, according to latency requirements, computing engines can be divided into three categories:
❑ Batch processing: This type of engine has the loosest latency requirements; processing generally takes minutes to hours, or even days, and the goal is high throughput, i.e., processing as much data per unit time as possible. Typical applications include search-engine index construction and batch data analysis (a minimal batch sketch follows this list).
❑ Interactive processing: This type of engine has tighter latency requirements, generally at the second level. Such systems interact with people directly, so they usually provide a SQL-like language for users. Typical applications include ad-hoc data queries and parameterized report generation.
❑ Real-time processing: This type of engine has the strictest latency requirements, with processing delays generally within seconds. Typical applications include advertising systems and public opinion monitoring.
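For a concrete example of the batch category, here is a minimal PySpark word-count job: throughput-oriented, minutes-level latency, reading from and writing back to HDFS. The input and output paths are illustrative assumptions.

```python
# A minimal batch-processing sketch (word count) in PySpark.
# Paths are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

# Split each log line into words and count occurrences of each word.
lines = spark.read.text("hdfs:///data/logs/*.log")
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Batch engines optimize for total throughput, not per-record latency.
counts.write.mode("overwrite").csv("hdfs:///data/output/wordcount")
spark.stop()
```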
Data analysis layer
The data analysis layer interfaces directly with user applications, providing them with easy-to-use data processing tools. To make analysis easier, the computing engines expose a variety of tools, including application APIs, SQL-like query languages, and data mining SDKs.
When solving practical problems, data scientists need to select appropriate tools from the data analysis layer according to the characteristics of the application, and in most cases several tools are combined. The typical usage pattern is: first use a batch processing framework to analyze the original massive data and produce a smaller-scale data set, then use interactive processing tools to quickly query that data set and obtain the final result (sketched below).
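A hedged sketch of that combined pattern in PySpark: a batch step condenses raw events into a small summary table, then a SQL step queries the summary interactively. All paths, columns, and table names are illustrative assumptions.

```python
# Batch step + interactive step, sketched with PySpark and Spark SQL.
# Paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analysis-pattern").getOrCreate()

# 1) Batch: reduce massive raw events to a compact daily summary.
raw = spark.read.json("hdfs:///data/raw/events/*.json")
daily = raw.groupBy("date", "page").agg(F.count("*").alias("views"))
daily.write.mode("overwrite").parquet("hdfs:///data/summary/daily_views")

# 2) Interactive: register the small summary and run ad-hoc SQL on it.
spark.read.parquet("hdfs:///data/summary/daily_views") \
    .createOrReplaceTempView("daily_views")
spark.sql("""
    SELECT page, SUM(views) AS total_views
    FROM daily_views
    GROUP BY page
    ORDER BY total_views DESC
    LIMIT 10
""").show()
spark.stop()
```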
Data visualization layer
The data visualization layer presents results directly to users. Since this layer connects to users directly and is the "portal" through which the value of big data is displayed, visualization is of great significance. Given that big data is large in volume, complex in structure, and high in dimensionality, visualizing it is extremely challenging.