On June 26, Amazon Cloud Technology Community Day was held in Shanghai. Amazon Cloud Technology's chief developer evangelist, a senior data scientist, a senior applied scientist, and an Amazon Cloud Technology Machine Learning Hero were all present to share and discuss technology trends and practical projects in open source AI.
Issue 1: AI open source in the eyes of experts | Wang Yubo: Four in One, Building an Open Source Machine Learning Ecosystem
Issue 2: AI open source in the eyes of experts | Wang Jijie: The exploration and research of deep graph in artificial intelligence
Issue 3: AI open source in the eyes of experts | Wu Lei: The application and
In this issue, Amazon Cloud Technology senior data scientist Zhang Jian discusses the practical application of graph neural networks and DGL.
Dr. Zhang Jian is a senior data scientist at Amazon Cloud Technology. An important part of his work is using graph neural networks and DGL as tools to help customers solve core business problems and increase business value in real customer scenarios. In this talk, he describes the challenges he has encountered in putting graph neural networks and DGL into production, along four dimensions (data, models, speed, and interpretation), and shares his thinking on each.
Does your graph contain enough information?
In academia, many researchers use open datasets to build models and improve algorithms. The datasets most commonly used in graph neural network research are usually highly connected graphs in which nodes of the same class cluster together. Graph neural networks built on such graphs tend to perform well. In real business scenarios, however, customers are limited by their means of collecting and storing data and by their capacity to process it, so the graph they construct is sometimes very sparse. A great deal of time and effort then goes into tuning the model, yet the results are still not ideal. If the connectivity of the customer's graph is too low, then whatever graph neural network model is used, it will eventually degenerate into an ordinary MLP. In addition, the business graphs that customers provide often contain very little label data. In a graph with hundreds of millions of nodes, only hundreds of thousands of nodes may have labels, so that only 0.01% of the data is labeled. This makes it hard to reach other labeled nodes by following connections from a given labeled node, which greatly reduces the effectiveness of graph neural networks.
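The sparsity and label-coverage concerns above can be checked before any model is trained. The following is only a minimal sketch in plain Python, not something from the talk: the function name `graph_health` and the four statistics it reports are illustrative assumptions, and it assumes an undirected graph with at least one node and integer node ids.

```python
def graph_health(num_nodes, edges, labeled_nodes):
    """Summarize sparsity and label coverage of an undirected graph.

    num_nodes: node ids are assumed to be 0 .. num_nodes - 1 (num_nodes > 0)
    edges: iterable of (u, v) pairs; duplicate edges are ignored
    labeled_nodes: iterable of node ids that carry a label
    """
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labeled = set(labeled_nodes)
    # Nodes with no edges at all: a GNN sees nothing but their own features.
    isolated = sum(1 for nbrs in neighbors if not nbrs)
    total_degree = sum(len(nbrs) for nbrs in neighbors)
    # Labeled nodes that can reach another labeled node in one hop.
    linked_labeled = sum(1 for v in labeled if neighbors[v] & labeled)

    return {
        "mean_degree": total_degree / num_nodes,
        "isolated_fraction": isolated / num_nodes,
        "label_fraction": len(labeled) / num_nodes,
        "labeled_with_labeled_neighbor": (
            linked_labeled / len(labeled) if labeled else 0.0
        ),
    }


# Toy graph: 6 nodes, 2 edges, 3 labeled nodes.
stats = graph_health(num_nodes=6, edges=[(0, 1), (1, 2)], labeled_nodes=[0, 2, 5])
print(stats)
```

A high isolated fraction or a label fraction near zero suggests the graph itself, rather than the choice of model, is the bottleneck.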
There is a saying among data scientists: the features of the data determine the upper limit of model performance, and the model can only approach that ceiling. Rather than spending more effort on the model, it is better to think harder about the data. If the information in the graph determines the upper limit, then what is the information in a graph? How can "information" be measured? Can a measure of information guide GNNs? How should the graph be constructed? These are questions that machine learning practitioners, and even developers, often face. Zhang Jian raised them in the hope that everyone can brainstorm solutions together.
Under what circumstances is the GNN model more advantageous?
"I know that graph neural networks come in many different models. Can you tell which model is suitable for our graph?" An industrial customer once asked Dr. Zhang Jian this question, and it is difficult to answer. First, the design space of models is far larger than any list of available options. Second, different business scenarios come with different business requirements, so it is not easy to judge which model design or model selection fits a specific business. In addition, the core programming paradigm of DGL is message passing (MP), yet some problems in graph machine learning can already be solved without MP. Graph machine learning also has no counterpart yet to GPT in NLP, a model that can quickly solve most problems.
Zhang Jian said that this is far from the biggest worry. The biggest worry is the customer who challenges him directly: "Dr. Zhang, look, our XGBoost and other models do better than this GNN!" In one case, after a knowledge graph had been used to obtain the various relationships between customers, LightGBM was applied directly; combined with more than a thousand feature dimensions, it beat the graph neural network model outright. Although the graph neural network model later surpassed the client's LightGBM model with the help of some additional techniques, the episode left a lot to think about. For example, in what ways are graph neural network models better than traditional machine learning models? Which is better?
Zhang Jian believes that the vast majority of traditional machine learning models are based on features, but in real business scenarios not every feature can be obtained for every node, especially as privacy protection regulations tighten and the supervision of big data becomes stricter. The stricter it gets, the harder it is to collect data. A graph neural network model, by contrast, can still establish relationships between nodes even when features are missing, and this is its advantage.
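The message passing (MP) idea, and the point that a node without usable features can still draw on its relationships, can be illustrated with one round of mean aggregation. This is a simplified sketch in plain Python, not DGL's API and not a model from the talk: the function name `mean_message_passing`, the choice of mean aggregation, and the use of `0.0` as a placeholder for a missing feature are all assumptions made for illustration.

```python
def mean_message_passing(features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    features: list of feature vectors (lists of floats), one per node
    edges: iterable of (u, v) pairs of node indices
    Returns a new list of feature vectors of the same shape.
    """
    n = len(features)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = []
    for v in range(n):
        # Each node averages its own vector with the messages (feature
        # vectors) received from its neighbors.
        group = [v] + neighbors[v]
        dim = len(features[v])
        updated.append(
            [sum(features[u][k] for u in group) / len(group) for k in range(dim)]
        )
    return updated


# Node 0 has no usable feature (placeholder 0.0) but two neighbors.
# Node 3 has a feature but no edges.
features = [[0.0], [4.0], [8.0], [10.0]]
edges = [(0, 1), (0, 2)]
result = mean_message_passing(features, edges)
print(result)
```

After one round, node 0 carries information from nodes 1 and 2 despite having no feature of its own. Node 3, which has no edges, keeps exactly its input, which is why a graph with very low connectivity leaves a GNN behaving like a per-node model.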
Graph neural network models and traditional machine learning models are not an either-or choice. Which one to use depends on the business scenario and the business problem, and the two can even be combined to solve a problem. What is the applicability of different GNN models? How should node and edge features be used? Is it necessary to use a GNN at all? How can GNNs be combined with other models? Zhang Jian left these questions for everyone to think about.
Can the graph model be used for real-time inference?
Once a model proves effective, customers frequently ask whether it can go online for real-time inference. This question involves two levels. The data in a graph structure is correlated, so unlike in traditional CV and NLP, the data points are not IID. Inference on graph data has two modes: transductive and inductive.
In transductive mode, the nodes/edges to be predicted already exist in the graph during the training phase, and the training nodes can "see" them. The problem with this mode is that when a prediction is needed, these nodes must already exist and the graph must already be constructed, so there is almost no way to do it in real time: to be real-time, the model has to handle nodes that only appear in the future.
In inductive mode, the nodes to be predicted are not in the graph during the training phase and are invisible to the model. They become visible only at inference time, when the model is applied to a graph. There are two cases when using inductive mode to infer unseen nodes. The first is batch prediction. In anti-fraud, for example, a model is trained on a graph built from the data of the past seven days. To detect user behavior that occurs tomorrow, tomorrow's data must be combined with the data of the previous seven days into a graph, and the trained model is then used for inference on it. That is batch inference, not real-time inference. The second is true real-time inference, which requires adding the nodes/edges to be predicted to the existing graph in real time, extracting their N-hop subgraphs, and handing these to the trained model for inference.
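The real-time step described above (add the new node to the existing graph, then extract its N-hop subgraph for the trained model) can be sketched with an in-memory adjacency map and a breadth-first search. This is only an illustration in plain Python under stated assumptions: the helper names `add_node` and `n_hop_subgraph` are hypothetical, the graph is undirected, and a production system would need a graph store fast enough to do this under load.

```python
from collections import deque


def add_node(adjacency, node, neighbors):
    """Insert `node` into an undirected adjacency map and link it to `neighbors`."""
    adjacency.setdefault(node, set())
    for nb in neighbors:
        adjacency.setdefault(nb, set())
        adjacency[node].add(nb)
        adjacency[nb].add(node)


def n_hop_subgraph(adjacency, seed, n_hops):
    """Return (nodes, edges) of the subgraph within `n_hops` of `seed`."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == n_hops:
            continue  # do not expand beyond the hop limit
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    nodes = set(dist)
    # Keep every edge whose two endpoints are both inside the neighborhood;
    # sorting each pair stores an undirected edge only once.
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes
    }
    return nodes, edges


# Existing graph: a - b - c - d
graph = {}
add_node(graph, "a", ["b"])
add_node(graph, "b", ["c"])
add_node(graph, "c", ["d"])

# A new node arrives at inference time and connects to "a".
add_node(graph, "new", ["a"])
nodes, edges = n_hop_subgraph(graph, "new", n_hops=2)
print(sorted(nodes), sorted(edges))
```

Only the extracted subgraph, not the full graph, would be passed to the trained model, which is what keeps the per-request cost bounded.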
According to Zhang Jian, not only the graph community but the entire machine learning community, including the big data community, has not yet designed a method for storing, extracting, and querying real-time (for example, streaming) graph data. Existing graph databases are often not fast enough at inserts and lookups. In particular, when sampling around a given node or edge as the center, the sampling speed of the graph database cannot keep up with the speed that real-time inference requires. For the system architecture of real-time inference, the industry does not yet have a particularly mature approach. This is a problem that needs to be solved now, and it is also a very big opportunity for developers.
How to interpret the results of the model?
After the model is launched, one of the problems is how to interpret its results. Academia has produced some research on this issue, but such discussions are rarely seen in industry.
For example, after a graph model produces a prediction for a node, the business person asks why. If you tell them it is because the neighboring node has the greatest influence on it, they will certainly not accept that as an explanation.
In addition, although a graph neural network model can identify some patterns through the graph structure, the nodes in it all carry features, and those features are ultimately just real numbers. After a series of linear transformations and nonlinear changes, the relationships between them go far beyond human understanding of cause and effect. How should the results of a graph model be interpreted? Developers still have a long way to go.
Putting graph neural networks into production faces many challenges, and Zhang Jian compared them to sending a rocket to the moon. The data is the fuel, the model is the engine, the data pipelines and the implementation architecture are the overall rocket design, and model interpretation is like the flight control center. Only when the problems at all four levels are solved can the rocket really fly to the moon.
Written at the end
Over the years, Amazon Cloud Technology has accumulated many projects and much practical experience in the field of artificial intelligence, and has been committed to co-creation with developers around the world, hoping to bring new vitality to the field. The Shanghai stop of the 2021 Amazon Cloud Technology China Summit, with the theme of "Building a New Pattern and Reshaping the Cloud Era", joins hands with leading technology practitioners in the cloud computing industry to share stories of reshaping and building in the cloud. The Shanghai stop is only the vanguard of this summit. The Amazon Cloud Technology China Summit will also meet you online from September 9th to September 14th.
The summit covers more than 100 technical sessions and has a sub-forum on technologies in the field of artificial intelligence, bringing you substantive and exciting content. In addition, it will join hands with the open source community and technical experts for a lively online exchange of views.