Abstract: This Kunpeng BoostKit training camp introduces developers to accelerating application performance with the Kunpeng BoostKit enablement kit, focusing on performance optimization technologies and key capabilities.

This article is shared from the HUAWEI CLOUD community article "[Cloud-based co-creation] How does the Kunpeng BoostKit enablement kit deliver multifold performance improvements in big data scenarios?", by the original author Bailu first commander.

Preface

In the era of the digital economy, the diversity of services and data calls for a new computing architecture, and the growth of massive data brings ever higher computing requirements. In this process, the Kunpeng computing industry is becoming the IP base of a new generation of computing for more and more scenarios. Built on Huawei Kunpeng processors, Kunpeng's full-stack IT infrastructure, industry applications, and services are committed to continuously providing advanced computing power for the intelligent world, so that all industries can achieve digital transformation. Migrating and optimizing application software has always been the key difficulty of the Kunpeng software ecosystem. This Kunpeng BoostKit training camp introduces developers to accelerating application performance with the Kunpeng BoostKit enablement kit and focuses on performance optimization technologies and key capabilities.
image.png

1. The origin of open source big data and Kunpeng's multi-core architecture

1.1. The problem of massive data processing

With the development of science and technology, more and more industries need to collect more and more data. How to analyze massive data and obtain the results we want has become a common problem, and the rapid development of big data technology makes this problem easy to solve.

1.2. The characteristics of big data parallel computing naturally match Kunpeng's multi-core architecture

Massive data requires higher concurrency to speed up processing. On very large data sets, single-core (serial) execution may be infeasible or extremely inefficient, which is unacceptable. Massive data therefore needs to be processed with higher concurrency, and Kunpeng's multi-core computing is a perfect match for this demand: it accelerates big data computing performance and increases the concurrency of big data tasks.

Take the MapReduce model as an example. As shown in the figure below, the source data is a passage of English text, and we need to count how many times each word appears in it.
image.png

The process runs as follows: first the source data is split; the splits are then mapped (Map) to the nodes for processing; the intermediate results are sorted (Sort) and merged (Merge); finally the results are aggregated (Reduce) into the final output.

It can be seen that we distribute a large amount of computation across the nodes. This is distributed computing, and it is also what we mean by "concurrency". If the concurrency increases, the execution time of the whole model should, in theory, shorten accordingly.
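A minimal sketch of the same split/map/shuffle/reduce flow, expressed here with Spark's RDD API (the input and output paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("hdfs:///data/input.txt")        // split: read the source text as lines
      .flatMap(_.split("\\s+"))                  // map: break each line into words
      .map(word => (word, 1))                    // map: emit (word, 1) pairs
      .reduceByKey(_ + _)                        // sort/merge/reduce: sum the counts per word
      .saveAsTextFile("hdfs:///data/wordcount")  // write the aggregated result

    spark.stop()
  }
}
```

Each partition of the input is processed by its own task, so raising the number of partitions (the concurrency) shortens the wall-clock time, up to the number of available cores.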

2. Introduction to Open Source Big Data and Components

Above, we introduced the concept of open source big data and how it matches Huawei Kunpeng multi-core computing. The following introduces some of the components often used in big data development.

2.1. Big data component: Hadoop-HDFS module

HDFS is one of the three core modules of the Hadoop ecosystem and is responsible for distributed storage. The specific structure is shown in the figure below:
image.png

  • HDFS: a distributed storage system with a master-slave (Master/Slave) structure, mainly composed of a NameNode and DataNodes. HDFS splits each file into blocks of a fixed size and stores them across the DataNodes in a distributed manner. Each file can have multiple replicas; the default number of replicas is 3.
  • NameNode: the master node, responsible for metadata management and for handling client requests.
  • DataNode: a slave node, responsible for data storage and read/write operations.

Usage flow: to read data stored in HDFS, a user first contacts the NameNode, which tells the client which DataNodes hold the data; the data is then read from those DataNodes and returned to the user.
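A minimal sketch of this read flow using the standard Hadoop FileSystem API (the NameNode address and file path are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsRead {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // fs.defaultFS points at the NameNode (address assumed for illustration).
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000")
    val fs = FileSystem.get(conf)

    // open() asks the NameNode which DataNodes hold the blocks of the file;
    // the returned stream then reads the block data from those DataNodes.
    val in = fs.open(new Path("/data/input.txt"))
    try {
      Source.fromInputStream(in).getLines().take(5).foreach(println)
    } finally {
      in.close()
      fs.close()
    }
  }
}
```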

2.2. Big data component: Hadoop-Yarn module

Yarn is one of the three core modules of the Hadoop ecosystem, responsible for resource allocation and management. The specific structure is shown in the figure below:

  • Yarn: a distributed resource scheduling framework with a master-slave (Master/Slave) structure, mainly composed of the master node ResourceManager, the ApplicationMaster, and the slave node NodeManager. It is responsible for resource management and scheduling across the entire cluster (a minimal sketch follows this list).
  • ResourceManager: the global resource manager, responsible for resource management and allocation for the entire cluster.
  • NodeManager: runs on each Slave node and is responsible for the resource management and usage of that node.
  • ApplicationMaster: started when a user submits an application; responsible for requesting resources from the ResourceManager, managing the application, and interacting with the NodeManagers. Through the ApplicationMaster, users can track the progress of the current task and see which jobs have been executed.
  • Container: Yarn's resource abstraction and the basic unit in which a specific application executes; any job or application must run in one or more Containers.
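As a small illustration of this architecture, the sketch below (assuming a yarn-site.xml on the classpath that points at the ResourceManager) asks the ResourceManager for reports on the NodeManagers it manages:

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnNodes {
  def main(args: Array[String]): Unit = {
    val conf = new YarnConfiguration()      // reads yarn-site.xml from the classpath
    val yarn = YarnClient.createYarnClient()
    yarn.init(conf)
    yarn.start()

    // The ResourceManager holds the global view of the cluster's resources,
    // so cluster-wide queries like this one go through it.
    yarn.getNodeReports().forEach { node =>
      println(s"${node.getNodeId} state=${node.getNodeState} used=${node.getUsed}")
    }

    yarn.stop()
  }
}
```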

2.3. Big data component: Hadoop-MapReduce module

MapReduce is one of the three core modules of the Hadoop ecosystem and is responsible for distributed computing. The specific structure is shown in the figure below:
image.png

  • MapReduce: a distributed computing framework consisting mainly of two stages, Map and Reduce. It supports dividing a computing task into multiple subtasks, which are distributed to the cluster nodes for parallel computing.
    Map stage: the initial data is divided into multiple splits and processed in parallel by multiple Map tasks.
    Reduce stage: the outputs of the Map tasks are collected and merged, finally forming a file as the result of the Reduce stage.

2.4. Big data component: Spark platform

Apache Spark is a unified analytics engine for large-scale data processing. It is scalable, computes in memory, and has become a unified platform for lightweight, fast processing of big data. A variety of applications, such as real-time stream processing, machine learning, and interactive query, can be built with Spark on different storage and operating systems. The specific structure is shown in the figure below:
image.png

  • Apache Spark Core: Spark Core is the underlying general execution engine of the Spark platform, on top of which all other functionality is built. It provides in-memory computing and the ability to reference data sets in external storage systems.
  • Spark SQL: Spark SQL is a component on top of Spark Core. It introduces a data abstraction called SchemaRDD and provides support for structured and semi-structured data (see the sketch after this list).
  • Spark Streaming: Spark Streaming leverages Spark Core's fast scheduling to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches.
  • MLlib: MLlib is a distributed machine learning framework on Spark, made possible by Spark's distributed, memory-based architecture.
  • GraphX: GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation, lets you model user-defined graphs with the Pregel abstraction API, and provides an optimized runtime for this abstraction.
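As a small example of the Spark SQL component above, the sketch below (path and field names are illustrative) runs the same query through SQL and through the DataFrame API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate()

    // Read semi-structured data into a DataFrame; the schema is inferred.
    val logs = spark.read.json("hdfs:///data/logs.json")
    logs.createOrReplaceTempView("logs")

    // SQL and the DataFrame API compile to the same execution plan.
    spark.sql("SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level").show()
    logs.groupBy("level").agg(count("*").as("cnt")).show()

    spark.stop()
  }
}
```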

3. Introduction to the Kunpeng BoostKit enablement kit

3.1. What is Kunpeng BoostKit?

BoostKit is an application enablement kit; it is not a single software package but a collection of many software packages.

BoostKit is deployed in the Kunpeng full-machine (server) environment. The specific structure is shown in the figure below:
image.png

The Kunpeng application enablement kit BoostKit unleashes multifold performance gains and provides application enablement kits for eight scenarios: big data, distributed storage, database, virtualization, ARM native, Web/CDN, NFV, and HPC. Below, we introduce it in three parts in turn.
image.png

3.2. Open source enablement: making open source software available and easy to use

image.png

  • Huawei: contributes to and leads open source, enabling mainstream open source software to support Kunpeng with high performance.
  • Partners: obtain high-performance open source components from the open source and Kunpeng communities and compile/deploy them directly.

Let's take Hadoop as an example. First, we need to run Hadoop on Kunpeng servers. But that alone is far from enough: we also develop related features according to requirements so that Hadoop runs more smoothly and conveniently on Kunpeng. At the same time, we contribute the new product features we develop back to the open source community.

3.3. Basic acceleration: application performance beyond the industry level

This layer is called "basic" because many applications use these acceleration packages, for example NUMA optimization, the KAE acceleration library, and IO intelligent prefetching.
image.png

  • Huawei: provides basic acceleration software packages and documentation, covering basic performance optimization, basic acceleration libraries, and acceleration algorithms, together with guidance on how to use them.
  • Partners: obtain the basic acceleration software packages from the Kunpeng community and perform compilation, deployment, and performance optimization under the guidance of the Kunpeng Innovation Center.

Let's take the KAE acceleration library as an example: it accelerates functions such as compression, encryption, and decryption. If an upper-layer application uses compression, encryption/decryption, or other related functions, its performance is greatly improved.

3.4. Application acceleration: extreme, multifold application performance

image.png

  • Huawei: provides application acceleration software packages and documentation, such as application innovation acceleration components and algorithm innovation components.
  • Partners: partners and Huawei carry out joint solution design, development, and commercial practice. The mode of cooperation varies with the acceleration function.

4. The results of BoostKit in open source enablement

BoostKit has invested heavily in the open source community, mainly in the following two aspects: fully supporting open source big data components, and getting ARM CI running on the community versions.

4.1. Full support for open source big data

  • Support for open source Apache big data components.
  • Support for the open source HDP big data components and the Ambari management component.
  • Support for open source CDH big data components (note: the CDH Manager management component is closed source and currently not supported).

4.2. Open source communities embrace the ARM ecosystem

image.png

  • The open source communities of core components such as Hadoop, Hive, HBase, Spark, Flink, ElasticSearch, and Kudu support ARM (note: the Hadoop and ElasticSearch communities already provide official ARM packages), promoting the development of the ARM open source software ecosystem.

5. How does Kunpeng BoostKit deal with the key challenges of big data?

BoostKit focuses on the key challenges of big data and provides solutions to the existing pain points, making data processing faster and easier.
image.png

5.1. Problems encountered

  1. Diversified queries cannot be unified and are inefficient: inconsistent query methods, such as Spark SQL versus Hive, lead to low query efficiency.
  2. The performance of IO-intensive components cannot meet requirements.
  3. Disk IO is a bottleneck, and HDFS performance is difficult to improve.
  4. During data collection, the diversity of data and data formats makes it difficult to read data across data sources.
  5. Data is not shared, and accessing it across data centers is difficult: data is mostly stored in different data centers and cannot be shared, so reading data across data centers is a hard problem.

5.2. How to deal with key challenges?

  • For problems 1, 4, and 5: cross-source and cross-domain query acceleration. The openLooKeng virtualization engine unifies the data entry point, supports cross-source and cross-domain analysis, and improves query performance.
  • For problem 2: Spark performance acceleration. The native machine learning/graph algorithms are deeply optimized, doubling Spark's performance.
  • For problem 3: HDFS performance acceleration. IO intelligent prefetching and efficient data fetching improve Spark/HBase performance by 20%.

6. Deep optimization of BoostKit machine learning/graph algorithms

6.1. Examples of algorithm deep optimization

The BoostKit machine learning/graph algorithms are deeply optimized versions of the native algorithms and have doubled Spark's performance; they are already used in the businesses of Huawei's partners. In the two real scenarios shown in the figure below, machine learning and graph analysis algorithms are used to model massive data sets; model training time is greatly shortened and performance is greatly improved.
image.png

The optimization of BoostKit's machine learning/graph algorithms increases computing performance by an average of 5x in real application scenarios, and the upper-layer application does not need to be modified!

6.2. Kunpeng algorithm library

  • Including the machine learning GBDT algorithm and the graph analysis PageRank algorithm mentioned above, the Kunpeng algorithm library has delivered 20+ algorithms, covering the common algorithm types.
    image.png
  • Maintains class and interface definitions fully consistent with the native Spark algorithms, so the upper-layer application needs no modification; you just add the algorithm package when submitting the task, as sketched below.
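A sketch of that drop-in usage, assuming the library behaves as described: the application code below is plain Spark ML (GBTClassifier, as an example of the GBDT family), and nothing in it refers to the algorithm library; the data path is illustrative.

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.SparkSession

object GbdtTrain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gbdt-train").getOrCreate()
    val data = spark.read.format("libsvm").load("hdfs:///data/train.libsvm")

    // Exactly the native Spark ML API; no BoostKit-specific calls appear here.
    val model = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(50)
      .fit(data)

    println(s"trained ${model.getNumTrees} trees")
    spark.stop()
  }
}
```

The only change would then be at submit time: pass the algorithm package (the jar name depends on the release) via spark-submit's --jars option so that its optimized implementations are picked up.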

On publicly available multi-dimensional, multi-scale data sets, algorithm performance improves by 50% to 10x.
image.png

7. What deep optimizations has BoostKit made?

7.1. Kunpeng affinity optimization effect

Key optimization points:

  • Communication avoidance: reduces unnecessary data communication between nodes.
  • Multi-core parallel computing: taking advantage of Kunpeng's strengths, the algorithms' multi-core parallelism is improved, data parallelism and model parallelism are increased, and the communication (shuffle) bottleneck is reduced, speeding up training.
    image.png

With the same computing accuracy, across different data sets and the supported machine learning algorithms (Covariance, Pearson, Spearman), performance improves by more than 50%, as shown in the figure below:
image.png

7.2. Machine learning algorithm optimization scheme: distributed SVD algorithm

The SVD algorithm, i.e., singular value decomposition, is a matrix factorization algorithm commonly used in linear algebra. SVD can be used not only for feature decomposition in dimensionality reduction algorithms, but also in recommendation systems, natural language processing, and other fields. It is a cornerstone of many machine learning algorithms.
image.png
image.png
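For reference, a minimal sketch of the native distributed SVD in Spark MLlib, which is the baseline these optimizations target (the tiny in-line matrix stands in for a real data set):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object SvdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("svd-demo").getOrCreate()
    val sc = spark.sparkContext

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)
    ))

    // Compute the top-2 singular values and vectors: A ≈ U * S * V^T.
    val svd = new RowMatrix(rows).computeSVD(2, computeU = true)
    println(s"singular values: ${svd.s}")

    spark.stop()
  }
}
```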

We have also innovated on the traditional SVD algorithm, as shown in the figure below:
image.png

7.3. Graph analysis algorithm optimization scheme: distributed PageRank algorithm

The PageRank algorithm, i.e., the page ranking algorithm (also known as the page level algorithm, the Google left-side ranking algorithm, or the Page rank algorithm), ranks the web pages returned by search engines. In essence, it estimates the importance of a web page, taking the number and quality of the hyperlinks between pages as the main factors: more important pages are cited by more other pages, and each page's PR value is computed from the links that cite it. The higher a page's PR value, the more important the page. PageRank is a method Google uses to identify the rank/importance of a web page and one of the standards by which Google measures the quality of a website.
image.png

As shown in the figure above, we regard each web page as a vertex and each link between pages as an edge, which together form a graph data structure; this graph is what we process. So how do we optimize?

  • Memory occupancy optimization: based on a sparse compressed data representation, the algorithm's memory footprint is reduced by 30%, effectively solving the memory bottleneck at large data scales.
  • Convergence detection optimization: the convergence detection process is simplified, reducing the overall task volume by 5%-10%.
  • Full iteration + residual iteration combination optimization: effectively reduces the shuffle bottleneck caused by data expansion in earlier rounds; overall performance improves by 0.5x-2x.
    image.png

As shown in the figure above, after optimization, adaptively switching the algorithm's computation mode reduces the overall shuffle volume by 50%, and performance improves by an average of more than 50% compared with before the optimization.
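For reference, a minimal sketch of calling the native Spark GraphX PageRank that these optimizations build on (the edge-list path is illustrative; each line of the file is "srcId dstId"):

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

object PageRankDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pagerank-demo").getOrCreate()
    val sc = spark.sparkContext

    // Load the web graph: vertices are pages, edges are hyperlinks.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

    // Iterate until the per-vertex rank changes fall below the tolerance.
    val ranks = graph.pageRank(tol = 0.0001).vertices

    ranks.sortBy(-_._2).take(5).foreach { case (page, rank) =>
      println(s"page $page -> rank $rank")
    }

    spark.stop()
  }
}
```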

8. Spark performance acceleration practice with Kunpeng BoostKit machine learning & graph algorithms

The Spark performance acceleration practice for the Kunpeng BoostKit machine learning & graph algorithms can be carried out in the "sandbox laboratory" on the HUAWEI CLOUD platform.

8.1. Environment preparation

Before the experiment, the environment is prefabricated, as shown in the figure below:
image.png

8.2. Environment configuration

Since the algorithm runs on a 4-node cluster, i.e., on 4 ECSs, prefabricating the environment may take a while, about three minutes, and some components still need to be configured on the cloud servers. As shown in the figure below, there is one master node and three slave nodes.
image.png

8.3. Deploy Hadoop, Spark and other components

During environment prefabrication, some Zookeeper-related configuration is already completed. We only need to log in to each agent node and do a small amount of configuration to start Zookeeper. The specific process is shown in the figure below:
image.png

The same is true for Hadoop: when the environment is prefabricated, Hadoop has already been installed and configured on the server node. For the agent nodes, only a small amount of configuration is needed on the compute nodes: start the JournalNode on the agent side, then start the other Hadoop components on the server side. The specific process is shown in the figure below:
image.png

For Spark, the system has not done any related deployment; it has only downloaded Spark to the cluster. We therefore need to add the Spark environment variables, modify the Spark configuration files, synchronize them to the other nodes, and then submit tasks. The specific process is shown in the figure below:
image.png

8.4. Hands-on practice with the algorithm library optimization effect

8.4.1. Run the SVD algorithm

To call the algorithm library, run:

sh bin/ml/svd_run.sh D10M1k

Without the algorithm library, run:

sh bin/ml/svd_run_raw.sh D10M1k

8.4.2. Run the PageRank algorithm

To call the algorithm library, run:

sh bin/graph/pr_run.sh cit_patents run

Without the algorithm library, run:

sh bin/graph/pr_run_raw.sh cit_patents run

Since the algorithm's current default parameters do not fully utilize the resources of the ECS cluster, the Spark-layer parameters need to be tuned.
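An illustrative sketch (e.g., pasted into spark-shell) of the kind of Spark-layer settings involved; the names are standard Spark configuration keys, but the values here are hypothetical and depend on the ECS flavor:

```scala
import org.apache.spark.sql.SparkSession

// Raise executor count, cores, memory, and default parallelism so that the
// tasks can actually occupy the whole 4-node cluster (values are illustrative).
val spark = SparkSession.builder()
  .appName("boostkit-tuned")
  .config("spark.executor.instances", "12")
  .config("spark.executor.cores", "8")
  .config("spark.executor.memory", "20g")
  .config("spark.default.parallelism", "192")  // roughly 2-3 tasks per core
  .getOrCreate()
```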
image.png

Summary

Facing the era of diverse computing, Huawei is fully opening up Kunpeng's full-stack capabilities and sharing its diverse computing tool suites, the Kunpeng application enablement kit BoostKit and the Kunpeng development kit DevKit, to accelerate industry innovation, enable minimalist development, and build the Kunpeng computing industry ecosystem together with partners. This series of courses is aimed mainly at Kunpeng developers and ISV partners, helping you quickly understand the capabilities and best practices of the eight scenarios supported by the BoostKit application enablement kit, the Kunpeng full-R&D-workflow tool kit DevKit, Kunpeng basic open source software, and other related content, and lighting up a new era of diverse computing together with global developers.
