This article was first published on my Jianshu page: https://www.jianshu.com/u/204b8aaab8ba
| Version | Date | Remarks |
| --- | --- | --- |
| 1.0 | 2021.5.9 | First article |
These are my study notes: a large number of excerpts from the Internet and from books, organized to present the content I consider most relevant.
Big data starts from the data source and, after collection, analysis, and mining, finally yields value. This generally involves six main links: data collection, data storage, resource management and service coordination, the computing engine, data analysis, and data visualization (the overall technical system is shown in the accompanying figure). Each link faces technical challenges of a different degree.
Data source
The data collection layer is composed of modules that connect directly to the data sources and is responsible for collecting their data in real time or near real time. Data sources are distributed, heterogeneous, diverse, and generated as streams:
❑ Distributed: Data sources are usually spread across different machines or devices and connected through a network.
❑ Heterogeneous: Any system that generates data can serve as a data source, such as web servers, databases, sensors, fitness wristbands, and video cameras.
❑ Diverse: Data formats vary widely, including relational data such as basic user information and non-relational data such as pictures, audio, and video.
❑ Streaming generation: A data source is like a faucet that continuously produces "flowing water" (data); the collection system should forward that data to the backend in real time or near real time so it can be analyzed promptly.
Data collection layer
❑ The data collection layer is mainly composed of relational and non-relational data collection components, plus a distributed message queue.
❑ Sqoop/Canal: Relational data collection and import tools that bridge relational databases (such as MySQL) and Hadoop (such as HDFS). Sqoop can bulk-import data from a relational database into Hadoop and export it back the other way, while Canal can be used for incremental data import.
❑ Flume: A non-relational data collection tool, mainly for streaming log data, which it can collect, filter, and aggregate in near real time before loading it into storage systems such as HDFS.
❑ Kafka: A distributed message queue, generally used as a data bus, that lets multiple data consumers subscribe to the data they are interested in. Compared with other message queues, its distributed, highly fault-tolerant design makes it better suited to big data scenarios (see the sketch below).
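To make the "data bus" idea concrete, here is a minimal sketch of one producer publishing collected events and one consumer group subscribing to them, using the third-party kafka-python package. The broker address, topic name, and group id are illustrative assumptions, not part of the original notes.

```python
# Minimal Kafka "data bus" sketch: a producer publishes collected events,
# and an independent consumer group subscribes to them.
# Assumes a broker at localhost:9092 and the kafka-python package;
# the topic name "collected-logs" is made up for illustration.
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("collected-logs", {"host": "web-01", "event": "page_view"})
producer.flush()

# Each consumer group receives its own copy of the stream, so several
# downstream systems can subscribe to the same data independently.
consumer = KafkaConsumer(
    "collected-logs",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break  # stop after one message in this sketch
```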
Data storage layer
In the era of big data, the data collection system continuously sends all kinds of data to a centralized storage system. This places high demands on the scalability, fault tolerance, and storage model of the data storage layer, summarized as follows:
❑ Scalability: In practice, the amount of data keeps growing and the storage capacity of an existing cluster soon reaches its upper limit; new machines must then be added to expand capacity, so the storage system itself needs very good linear scalability.
❑ Fault tolerance: For cost reasons, big data systems are assumed from the outset to be built on cheap machines, so the system itself must provide a fault-tolerance mechanism that guarantees no data is lost when a machine fails.
❑ Storage model: Because the data is diverse, the data storage layer should support multiple data models so that both structured and unstructured data can be stored easily.
Typical applications (an HDFS sketch follows the list):
- HDFS
- Kudu
- HBase
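As a taste of how an application talks to this layer, the sketch below writes and reads a file on HDFS through its WebHDFS interface using the third-party `hdfs` (HdfsCLI) Python package. The NameNode URL, user name, and path are illustrative assumptions.

```python
# Minimal HDFS read/write sketch via WebHDFS, assuming the "hdfs"
# (HdfsCLI) package and a NameNode reachable at the address below.
from hdfs import InsecureClient

# NameNode web address and user name are illustrative assumptions.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file; HDFS replicates its blocks across DataNodes,
# which is what gives the layer its fault tolerance.
client.write("/tmp/demo.txt", data="hello big data", overwrite=True)

# Read it back.
with client.read("/tmp/demo.txt") as reader:
    print(reader.read())
```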
Resource management and service coordination layer
This layer mainly solves two problems:
- Low resource utilization
- High operation and maintenance costs
IaaS, Kubernetes (K8s), and Omega all belong to this layer.
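As a small illustration of this layer's job (one shared cluster serving many workloads), the sketch below lists the pods running across all namespaces using the official Kubernetes Python client. It assumes a reachable cluster and a local kubeconfig.

```python
# Minimal sketch: inspect what is sharing cluster resources, using the
# official "kubernetes" Python client. Assumes a valid ~/.kube/config.
from kubernetes import client, config

config.load_kube_config()  # read credentials from the local kubeconfig
v1 = client.CoreV1Api()

# Every pod, whichever team or framework owns it, draws on the same
# pooled resources that this layer schedules and accounts for.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```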
Computing engine layer
The computing engine layer is the most active layer in big data technology; to this day, new computing engines are still being proposed. Generally speaking, according to latency requirements, computing engines can be divided into three categories:
❑ Batch processing: This type of engine has the loosest latency requirements; processing generally takes minutes to hours, or even days, and the goal is high throughput, i.e., processing as much data per unit time as possible. Typical applications include search-engine index construction and batch data analysis (a minimal batch sketch follows this list).
❑ Interactive processing: This type of engine has tighter latency requirements, generally at the second level. Such systems interact with people directly, so they usually provide a SQL-like language for users. Typical applications include ad-hoc data queries and parameterized report generation.
❑ Real-time processing: This type of engine has the strictest latency requirements, with processing delays generally within seconds. Typical applications include advertising systems and public opinion monitoring.
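For a concrete example of the batch category, here is a minimal PySpark word-count job: throughput-oriented, minutes-level latency, reading from and writing back to HDFS. The input and output paths are illustrative assumptions.

```python
# A minimal batch-processing sketch (word count) in PySpark.
# Paths are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

# Split each log line into words and count occurrences of each word.
lines = spark.read.text("hdfs:///data/logs/*.log")
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Batch engines optimize for total throughput, not per-record latency.
counts.write.mode("overwrite").csv("hdfs:///data/output/wordcount")
spark.stop()
```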
Data analysis layer
The data analysis layer interfaces directly with user applications, providing them with easy-to-use data processing tools. To make analysis easier, the computing engines expose a variety of tools, including application APIs, SQL-like query languages, and data mining SDKs.
When solving practical problems, data scientists need to select appropriate tools from the data analysis layer according to the characteristics of the application, and in most cases several tools are combined. The typical usage pattern is: first use a batch processing framework to analyze the original massive data and produce a smaller-scale data set, then use interactive processing tools to quickly query that data set and obtain the final result (sketched below).
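A hedged sketch of that combined pattern in PySpark: a batch step condenses raw events into a small summary table, then a SQL step queries the summary interactively. All paths, columns, and table names are illustrative assumptions.

```python
# Batch step + interactive step, sketched with PySpark and Spark SQL.
# Paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analysis-pattern").getOrCreate()

# 1) Batch: reduce massive raw events to a compact daily summary.
raw = spark.read.json("hdfs:///data/raw/events/*.json")
daily = raw.groupBy("date", "page").agg(F.count("*").alias("views"))
daily.write.mode("overwrite").parquet("hdfs:///data/summary/daily_views")

# 2) Interactive: register the small summary and run ad-hoc SQL on it.
spark.read.parquet("hdfs:///data/summary/daily_views") \
    .createOrReplaceTempView("daily_views")
spark.sql("""
    SELECT page, SUM(views) AS total_views
    FROM daily_views
    GROUP BY page
    ORDER BY total_views DESC
    LIMIT 10
""").show()
spark.stop()
```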
Data visualization layer
The data visualization layer presents results directly to users. Since this layer connects to users directly and is the "portal" through which the value of big data is displayed, visualization is of great significance. Given that big data is large in volume, complex in structure, and high in dimensionality, visualizing it is extremely challenging.