
This article is a review of the full second lecture of the 2021 MAXP open course series, delivered by Dr. Zhang Jian of the Amazon Cloud Technology Shanghai Artificial Intelligence Research Institute. It gives an overview of graph machine learning tasks, focuses on the graph machine learning problem of this competition, and includes the open-source address of the baseline model.

In the past few years, the rise and application of neural networks have greatly advanced research in pattern recognition and data mining. Many machine learning tasks that once relied heavily on hand-crafted features have been transformed by end-to-end deep learning paradigms. Although traditional deep learning methods have been very successful at extracting features from Euclidean data, the data in many practical applications comes from non-Euclidean spaces, where the performance of traditional deep learning methods remains unsatisfactory.

In recent years, interest in extending deep learning methods to graphs has grown steadily. Driven by several successes, researchers have borrowed ideas from convolutional networks, recurrent networks, and deep autoencoders to define and design neural network architectures for graph data, giving rise to a new research hotspot: the graph neural network.

On October 11th, biendata invited Dr. Zhang Jian of the Amazon Cloud Technology Shanghai Artificial Intelligence Research Institute to give an online live lecture for the 2021 MAXP proposition contest: graph machine learning tasks based on DGL. Dr. Zhang gave the audience a detailed explanation of the graph neural network framework DGL and the official baseline model of this competition.

Model address: https://github.com/dglai/maxp_baseline_model

Live video review: https://www.bilibili.com/video/BV1cL4y1B75C?spm_id_from=333.999.0.0

About 2021 MAXP

The 2021 MAXP High Performance Cloud Computing Innovation Competition (2021 MAXP) is guided by the High Performance Computing Professional Committee of the China Computer Federation and the China Academy of Information and Communications Technology, co-sponsored by the ACM China High Performance Computing Expert Committee (ACM SIGHPC China) and the Cloud Computing Open Source Industry Alliance, supported by Amazon Cloud Technology and Tencent Cloud, with participation from vendors such as Alibaba Cloud, Huawei Cloud, UCloud, and Tianyi Cloud. With the theme of high-performance cloud computing, the competition aims to further promote the development of domestic high-performance computing. Participants compete for up to 400,000 yuan in prizes, as well as internship opportunities and authoritative certificates of honor.

Live review

This article introduces the graph data task in five modules.

  • Module 1 explains why graph-related data tasks are worth doing.
  • Module 2 introduces recent progress in graph machine learning, especially graph neural networks, and the open-source Deep Graph Library (DGL) graph network framework led mainly by the Amazon Cloud Technology Shanghai Artificial Intelligence Research Institute. This is also the framework required for the competition.
  • Module 3 explains the problem of this track.
  • Module 4 is the core of this article. It explains the DGL baseline model, especially the data preprocessing and DGL model construction for the preliminary contest.
    Model address: https://github.com/dglai/maxp_baseline_model
  • Module 5 shares resources for learning and mastering DGL.

The graph data used in this competition represents a new gold mine of artificial intelligence data: graph-structured data. In real life and work, graph data is everywhere. Social networks, knowledge graphs, and the relationships between users are all graphs. A chemical molecule can also be viewed as a graph, which is a new way of constructing graph data.

The image data and natural language data used in computer vision and natural language processing are essentially graphs too, usually called grid-like graphs. In image data, for example, each node represents a pixel, and the edges express the relationship between that pixel and its 8 adjacent pixels.

The essence of the convolutional neural network used in computer vision is to model the pixels in a region and the relationships between them. Natural language, in turn, can be regarded as a chain-shaped graph, in which the words of a sentence are connected linearly from the first word to those that follow.

Many models used in natural language processing, such as GPT, essentially model the latent relationships between nodes on this linear chain graph.

Graph data is relatively new, and the number of academic papers on graphs has continued to increase in recent years.

The number of papers with "graph" in the title has grown exponentially. In particular, since graph deep learning algorithms took off in 2016, the field has sustained an annual growth rate of more than 40%; by 2020, there were at least 2,000 papers discussing graphs, graph data, and machine learning on graph data.

With the widespread application of graph data and the emergence of new artificial intelligence ideas, some algorithms and models related to graph data have experienced a spurt of growth.

One typical representative is the GCN (graph convolutional network) paper that appeared in 2016, which marks the beginning of graph neural networks.

Machine learning on graphs is usually used to complete three types of tasks.

The first category is node and edge classification, used, for example, to identify suspicious accounts or to find valuable users.

The second category is link prediction, a task specific to graphs that is used, for example, in e-commerce product recommendation. The essence of this task is to predict whether an association exists between a user and a product; if an association is likely, the product is recommended.

The third category is graph classification and regression, used, for example, to predict the properties of chemical molecules.

These are common machine learning tasks on graphs. Before deep learning was applied to graph data, the field had already developed many methods and models for these tasks.

After neural networks, with connectionism at their core, became the new mainstream of artificial intelligence in 2012, many scholars began to study how to apply deep learning methods to graph-structured data.

In 2016, with the appearance of the related papers, a new method of graph machine learning emerged: the graph neural network.

A graph neural network is essentially a member of the neural network family, with most research focused on the features of nodes, edges, and graphs.

The essence of a graph neural network is to perform convolution operations over the main entities of graph-structured data, the nodes and edges, and their features. Using the nonlinear activation functions and regularization methods common to neural networks, multiple layers of convolution are applied to the nodes, edges, and their features, finally producing a vectorized representation of a node, an edge, or the entire graph.

These vectorized representations can then be used to complete downstream tasks, such as classification or regression of nodes, edges, and graphs.
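To make the convolution idea concrete, here is a minimal numpy sketch of a single GCN-style layer: aggregate neighbor features through a normalized adjacency matrix, transform them with a weight matrix, and apply a nonlinearity. The graph, features, and weights are toy stand-ins, not the competition data.

```python
import numpy as np

# Toy graph: 4 nodes, undirected edges (0-1, 1-2, 2-3) as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Add self-loops and symmetrically normalize: A_hat = D^{-1/2} (A + I) D^{-1/2}.
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))

H = np.random.rand(4, 8)   # 8-dimensional input feature per node
W = np.random.rand(8, 2)   # weight matrix (learned in a real model)

# One graph-convolution layer: aggregate, transform, apply ReLU.
H_next = np.maximum(A_hat @ H @ W, 0)
print(H_next.shape)  # (4, 2): each node now has a 2-dimensional representation
```

Stacking several such layers, each node's representation mixes in information from progressively larger neighborhoods, which is exactly the multi-layer convolution described above.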

A variety of machine learning and deep learning frameworks already exist, and tools such as TensorFlow and PyTorch are very easy to use. In practice, however, if you develop a GNN using only these general-purpose frameworks, you will run into problems specific to graph-structured data.

Because of the particularity of graph structure, problems such as running out of memory easily occur when using traditional deep learning frameworks.

In graph-structured data, a node often has many neighbors, and the number of neighbors varies from node to node. In a convolutional neural network, by contrast, every pixel has a fixed number of surrounding neighbors, so structures designed for fixed-size tensors are prone to memory overflow on graphs.

When developing graph neural networks, therefore, we often need to handle unbalanced or inconsistent data distributions, and traditional neural network frameworks run into various problems here. A general framework designed specifically for graph neural networks is needed, and DGL is one of them.

DGL is an open-source tool built specifically for writing graph neural networks. Its purpose is to make graph neural networks easier to write and quick to apply to business scenarios.

Simply put, DGL is a deep learning framework for graph neural networks.

Unlike other deep learning frameworks for graph neural networks, DGL is compatible with multiple backends, which makes it a backend-agnostic framework: it supports PyTorch, MXNet, and TensorFlow as backends. DGL itself focuses on graph-structured data and graph neural network models.

DGL also supports the training modes commonly used in deep learning, including single-machine single-GPU, single-machine multi-GPU, and multi-machine distributed training.

At the same time, commonly used GNN modules have built-in implementations. In addition, DGL ships more than 70 examples of classic and cutting-edge GNN models, which help users learn GNNs and use the latest models to meet business needs or develop new ones.

This past September, the DGL open-source project won the OSCAR Pinnacle Open Source Project and Open Source Community award hosted by the China Academy of Information and Communications Technology, a recognition of DGL as an open-source community.

Competition problem explanation

Common graph machine learning tasks fall into three categories: node and edge classification/regression, link prediction, and whole-graph classification/regression.

This competition tackles the most common one, node classification: labeling the nodes in the graph.

The competition data, provided by the data supplier, is based on the Microsoft Academic paper citation graph.

The data is a paper citation graph: the nodes are papers, and the edges are one-way citation relationships between papers. Each node has a 300-dimensional feature vector and belongs to one of 23 categories, which is the subject label of the paper.

The task is to learn from the known labels and predict the labels of the unknown nodes.

For this task, we developed a GNN baseline model based on DGL. Its main purpose is to help participants quickly master DGL for developing GNN models and apply it to the competition.

The code structure consists of two parts.

The first part is four data preprocessing files, each with a different purpose.

The second part is the GNN directory, which includes the GNN model files written in DGL, the model training files, and the corresponding helper files for the models and the competition data.

First, the code logic of the data preprocessing part.

The main purpose of File 1 is to build the node list for the preliminary-round data.

The preliminary-round data contains a node list and paper-to-paper edge lists, but the edge lists contain some paper IDs that do not appear in the node list. Data exploration and processing therefore need to merge the paper IDs of all nodes into a complete node list and give each paper ID a new node index.
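The merging step can be sketched in a few lines of numpy. The IDs below are hypothetical stand-ins for the real paper IDs; the point is that the union of IDs from the node list and both ends of the edge list becomes the complete node list, with consecutive indices assigned.

```python
import numpy as np

# Hypothetical raw data: node list and edge lists use raw paper IDs,
# and some IDs ("p4" here) appear only in the edge list.
node_ids = np.array(["p1", "p2", "p3"])
src_ids  = np.array(["p1", "p4"])
dst_ids  = np.array(["p2", "p3"])

# Union of every paper ID seen anywhere -> complete, deduplicated node list.
all_ids = np.unique(np.concatenate([node_ids, src_ids, dst_ids]))

# Assign each paper ID a new consecutive node index.
id_to_idx = {pid: i for i, pid in enumerate(all_ids)}
print(id_to_idx)  # {'p1': 0, 'p2': 1, 'p3': 2, 'p4': 3}
```

The resulting `id_to_idx` mapping is what a later preprocessing step can use to rewrite the edge lists in terms of node indices.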

File 2 mainly reads all the 300-dimensional features out of the original files and builds compressed data.

Because the original data files are stored in text format, they are very time-consuming to parse and demand a lot of memory. File 2 therefore saves the preprocessed data so it can be used directly in the subsequent modeling process, which greatly improves speed.
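The pattern is a one-time slow parse of the text followed by a fast binary save. A minimal sketch with numpy (the file names and the tiny feature matrix are illustrative, not the competition format):

```python
import numpy as np
import os
import tempfile

# Hypothetical: features stored as comma-separated text, one node per row.
tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "features.txt")
with open(txt_path, "w") as f:
    f.write("0.1,0.2,0.3\n0.4,0.5,0.6\n")

# Slow one-time parse of the text file ...
feats = np.loadtxt(txt_path, delimiter=",", dtype=np.float32)

# ... then save in a compact binary format for fast reloads.
npy_path = os.path.join(tmpdir, "features.npy")
np.save(npy_path, feats)

reloaded = np.load(npy_path)
assert reloaded.shape == (2, 3) and (reloaded == feats).all()
```

On a million-node, 300-dimensional feature matrix, loading the binary file is orders of magnitude faster than re-parsing text every run, which is exactly why the baseline preprocesses once and reuses the result.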

The main purpose of File 3 is to use the newly created node index to replace each paper ID in the original edge lists with its corresponding node index.

It then constructs the DGL graph using DGL's graph class and saves it as a graph.bin file with the help of DGL's save-graph function.

The main purpose of File 4 is to split the labels. The labeled data in train_nodes.csv is split into training and validation sets, and the test set is built from the indices of the unlabeled nodes in validation_nodes.csv.
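A minimal sketch of such a split, using hypothetical node counts and a 90/10 train/validation ratio (the baseline's actual ratio may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: indices of the labeled nodes (as read from train_nodes.csv).
labeled_idx = np.arange(100)
perm = rng.permutation(labeled_idx)

# 90/10 split of the labeled nodes into training and validation sets.
train_idx, val_idx = perm[:90], perm[90:]

# Unlabeled nodes (as read from validation_nodes.csv) form the test set.
test_idx = np.arange(100, 130)

assert len(train_idx) + len(val_idx) == len(labeled_idx)
```

Shuffling before splitting avoids any ordering bias in the CSV, and saving these index arrays lets the training script reuse the exact same split across runs.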

Second, the code logic of the GNN part, which mainly contains two files.

The first file is the models.py file.

In models.py, three commonly used GNN modules built into DGL serve as the basic model modules, and the model code is built on the neighbor-sampling pattern.

The neighbor-sampling model code comes in three variants: GraphSAGE, GCN, and GAT.

The competition data is relatively large, reaching millions of nodes and tens of millions of edges. With full-graph training, putting both the graph and the 300-dimensional features entirely on the GPU would easily overflow GPU memory and abort training, while putting everything on the CPU would be very slow.

So the mini-batch method is used for training. For node classification, the neighbor-sampling mode reduces the amount of data computed per batch, ensuring that the GPU can be used to complete training quickly.

Each model takes input parameters after construction. Some of the input parameters are shared by the three models, while others are unique to particular models.

The second file is model_train.py, the core file of the baseline code. It uses the constructed GNN models and the preprocessed data files to build the model training code.

The first step of the main logic is to read the preprocessed data, using helper functions to load it quickly for use in training the models.

The training code supports three modes:

One is CPU training; another is mini-batch training on a single GPU; the third is data-parallel mini-batch training on a single machine with multiple GPUs.

Each has a corresponding entry in the main function. CPU training is mostly for code debugging: to modify the model or develop your own, first debug-train on the CPU to make sure the code is error-free, then move to the GPU for training.

The training logic is the same for all three modes. For data-parallel training on multiple GPUs, the dataset must be partitioned, with different partitions sent to different GPUs for training; DGL's node sampler and data loader complete the mini batching, and finally the chosen GNN model is built.

Because the code is written on the PyTorch backend, you can use the usual PyTorch loss functions and optimizers, run the mini-batch training and validation loops inside the epoch loop, and save the best model.
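The epoch loop with a PyTorch loss, optimizer, and best-model checkpointing can be sketched as follows. A plain `nn.Linear` stands in for the GNN, and validation reuses the training data for brevity; both are simplifications, not the baseline's actual setup.

```python
import torch
import torch.nn as nn

# Stand-in for the GNN: the real baseline forwards sampled blocks instead.
model = nn.Linear(8, 23)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.rand(32, 8), torch.randint(0, 23, (32,))
best_val_loss = float("inf")

for epoch in range(3):
    # Mini-batch training step (one batch here for brevity).
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # Validation pass; keep the checkpoint with the lowest validation loss.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x), y).item()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```

Checkpointing on validation loss rather than training loss is what makes the saved model the "best" one in the sense the baseline intends.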

For the competition, we also provide core DGL resources for everyone to learn and master.

There are two main core resources for the DGL graph neural network framework. The first is the official website, dgl.ai, whose core content is divided into two parts:
One is Getting Started, which helps with installing DGL. The installation command differs by operating system, by whether a GPU is used, and by the graphics card and its driver version, so the Getting Started page lets you quickly find the command for your environment and install DGL with a single command.

The other is Docs, which contains all the core help documents for DGL.

Docs includes DGL's public API reference and a Chinese-language user guide, making it convenient for users in China to quickly build their first hello-world DGL code.

The second core resource is the DGL GitHub site.

Here you can see the DGL source code and official DGL model implementation examples. Beyond GCN, GAT, and GraphConv, many classic examples can be found in the examples directory. These can be used for reference, and the models can also be modified to complete this competition.

Questions about DGL can be discussed in DGL's official discussion channels.

The first channel is the discuss forum on the DGL official website;

The second channel is the issue area on the DGL GitHub site;

The third channel is to search for the DGL user groups (DGL User Group 1 and DGL User Group 2);

The fourth channel is to ask questions in the Slack channel, where relevant colleagues will reply and answer participants' questions.

