
On June 26, Amazon Cloud Technology Community Day was held in Shanghai. Amazon Cloud Technology's chief developer evangelist, senior data scientists, senior applied scientists, and Amazon Cloud Technology Machine Learning Heroes gathered to share and discuss technology trends and hands-on projects in open source AI.

1. Wang Yubo: Amazon's contribution and practice in the field of open source machine learning

The concept of open source originated in the 1980s. In recent years, with the continuous development of machine learning and cloud computing, open source has gradually become central to many developers' discussions, and its importance has grown significantly. Currently, four of the top five open source contributors, and seven of the top ten, are cloud computing vendors. Wang Yubo said that cloud computing is an important driving force behind open source: cloud computing leads open source forward, and open source in turn promotes the development of cloud computing.

As a cloud computing service platform, Amazon adheres to a customer-first philosophy and integrates a range of open source tools with its cloud services, so that developers can use open source tools to move quickly into production in the cloud. In addition, when developers want to realize new ideas with new tools, Amazon actively builds and contributes open source code to help them meet those needs.

According to Wang Yubo, the number of open source contributors and open source projects within Amazon Cloud Technology has been increasing year by year. Amazon now has more than 2,500 open source repositories, covering fields such as data, analytics, security, and machine learning. Many projects revolve around open source, such as an open source analytics platform based on OpenSearch and an open source architecture based on container microservices. Amazon firmly believes that combining the cloud with open source can empower developers more quickly and foster more exchange and interaction, helping developers make good use of open source on the cloud.

When it comes to combining open source and machine learning, Wang Yubo believes that we should pay attention not only to how open source leads the development of machine learning, but more importantly to the problems developers face in actual production, so that more developers can learn and master open source technology and quickly build machine learning applications. He summarized Amazon's efforts in building an open source machine learning ecosystem along four dimensions: product, research, empowerment, and community.

The first is products. Amazon's cloud offers a range of machine learning and artificial intelligence products, many of which are built on open source projects. Amazon hopes these products will accelerate the application of open source machine learning in production.

The second is research. Amazon has many scientists around the world engaged in artificial intelligence and machine learning research. They continue to contribute to academia and publish cutting-edge papers. Amazon hopes this research can be combined with production practice and quickly put into use, building a good environment for developers.

The third is empowerment. Amazon believes artificial intelligence and machine learning should be in the hands of every developer, and uses a series of products and capabilities to help everyone get started and learn quickly, so that everyone can gain more growth opportunities from open source and machine learning.

Finally, there is the community. By building a machine learning community, Amazon helps developers gain a deeper understanding of open source and machine learning, so that they can move forward and grow faster.

Wang Yubo introduced each of these four dimensions in detail at the Community Day event.

Amazon's machine learning products form a very complete stack, ranging from frameworks and platforms to SaaS-style applications, with many products and services in each area to help developers build quickly. All of these machine learning cloud services rest on a solid open source foundation built by Amazon.


From a global perspective, Amazon is the preferred platform for developers building applications with the open source frameworks TensorFlow and PyTorch. Amazon SageMaker helps developers implement machine learning quickly, and there are two ways to extend it: bringing your own training script and bringing your own Docker container. Both are simple. SageMaker itself makes heavy use of container technology, but its users do not need to understand or operate the underlying architecture. With a bring-your-own-script workflow, developers can use almost the same code as in a local or other environment; they only need to pass parameters and generate a few files, while SageMaker pulls a standard framework image from its container registry and combines it with the script to train quickly and well. SageMaker also supports bring-your-own-container: developers package the script into a self-built container, publish it to a container registry, and train with equally good results. For now, bringing your own script is the simplest approach: developers can develop and test locally, then run distributed training and deployment in the cloud, or use capabilities in the cloud to iterate quickly and build better machine learning applications.
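As an illustration, a minimal bring-your-own-script sketch with the SageMaker Python SDK might look like the following; the entry point train.py, the S3 paths, and the IAM role are placeholders, not part of the original talk.

```python
# Minimal "bring your own training script" sketch with the SageMaker Python SDK.
# train.py, the S3 URIs, and the IAM role are placeholders for illustration.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical role ARN

estimator = PyTorch(
    entry_point="train.py",          # your local training script, largely unchanged
    role=role,
    framework_version="1.8.1",
    py_version="py3",
    instance_count=2,                # distributed training across two instances
    instance_type="ml.p3.2xlarge",
    hyperparameters={"epochs": 10, "lr": 1e-3},
)

# SageMaker pulls the standard PyTorch image, injects the script, and trains.
estimator.fit({"training": "s3://my-bucket/train-data/"})
```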

In addition, SageMaker comes with many built-in capabilities. For example, its automatic model tuning can quickly adjust hyperparameters, and managed Spot Training can greatly reduce the cost of training machine learning models.
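A hedged sketch of how these two capabilities might be used together, building on the estimator setup above (metric names, ranges, and the regex are illustrative assumptions):

```python
# Sketch: automatic model tuning plus managed Spot Training.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

spot_estimator = PyTorch(
    entry_point="train.py",
    role=role,
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,   # managed Spot Training to cut training cost
    max_run=3600,              # max training time in seconds
    max_wait=7200,             # max time to wait for Spot capacity
)

tuner = HyperparameterTuner(
    estimator=spot_estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-2)},
    metric_definitions=[{"Name": "validation:loss",
                         "Regex": "val_loss=([0-9\\.]+)"}],  # assumes train.py logs this
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"training": "s3://my-bucket/train-data/"})
```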

Wang Yubo also introduced some open source machine learning projects initiated by Amazon.

The first is Gluon, an open source deep learning interface that lets developers build machine learning models more easily and quickly without sacrificing performance. Through toolkits built on it, Amazon hopes to help more developers quickly use leading algorithms and pre-trained models from papers. In fields such as computer vision and natural language processing, Amazon's toolkits GluonCV, GluonNLP, and GluonTS have all reproduced SOTA results from top conferences, and Amazon makes these toolkits available to more customers and developers.
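As a simple sketch of the toolkit idea (assuming GluonCV and MXNet are installed; the model name is just one of the pre-trained detectors in the model zoo, and the image path is a placeholder):

```python
# Sketch: load a pre-trained GluonCV detector from the model zoo and run inference.
# Assumes `pip install mxnet gluoncv`; "street.jpg" is a placeholder image path.
from gluoncv import model_zoo, data

net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True)

# Standard pre-processing helper for YOLO-style detectors.
x, img = data.transforms.presets.yolo.load_test("street.jpg", short=512)
class_ids, scores, bboxes = net(x)   # detection results for the image
```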

The second is the Deep Java Library. Many independent developers use Java for deep learning development, and Amazon hopes that through the Deep Java Library they can train and deploy machine learning models in Java in a portable, efficient way. Currently the Deep Java Library supports multiple engines and offers as many as 70 pre-trained models.

In addition, Wang Yubo also introduced from several other fields.

The first is Jupyter, which helps developers think with code and data, build a narrative around them, and communicate these code- and data-driven insights to others. Amazon continues to optimize the Jupyter experience, for example by providing notebook sharing for enterprise developers. At the same time, Amazon keeps contributing to the Jupyter community; Jupyter steering committee members currently work at Amazon, helping Jupyter further integrate open source and the cloud.

The second is Amazon SageMaker Clarify, which is built on open source products and gives machine learning developers deeper insight into their training data and models, so that they can identify and limit bias and explain predictions.

The third is PennyLane. Amazon began participating in the PennyLane open source project at the end of last year, and PennyLane can now run on Amazon Braket in the cloud. Amazon hopes that through the cloud, quantum computing and machine learning can be better integrated.
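A minimal sketch of the idea, assuming the Braket PennyLane plugin is installed; the local Braket simulator is used here so no device ARN or cloud resources are needed:

```python
# Sketch: a small variational circuit on the local Braket simulator via PennyLane.
# Assumes `pip install pennylane amazon-braket-pennylane-plugin`.
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("braket.local.qubit", wires=2)

@qml.qnode(dev)
def circuit(theta):
    qml.RY(theta, wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

theta = np.array(0.3, requires_grad=True)
print(circuit(theta))               # expectation value
print(qml.grad(circuit)(theta))     # gradient, usable in hybrid quantum-ML training
```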

In addition, Amazon provides many fun, hands-on learning tools that use open source solutions to help developers start their machine learning journey.


Wang Yubo said: "Hands-on practice is a critical process for developers. Through technical leadership, technical guidance, and technical lectures, Amazon drives the developer community to flourish, stimulates a good atmosphere for technical discussion, and provides developers with more help and influence."

2. Wang Minyi: Exploration and research of deep graphs in artificial intelligence

Speaking of the exploration and research of deep graphs in artificial intelligence, we must first clarify a concept: what is artificial intelligence? Wang Minyi believes there are two important points on the way to true artificial intelligence. The first is to understand why current artificial intelligence algorithms make mistakes, and the second is to explore the structural consistency between artificial intelligence algorithms and the human brain.

"The study shows that the order of Chinese characters does not affect the reading." For example, after reading this sentence, you find that the characters in this display are all messy. When people understand natural language, they do not understand it in a linear way, but understand the text in blocks. Many models understand text in a linear manner.

From the perspective of image recognition, if an algorithm is used to recognize a picture of a dog sitting on a motorcycle, it can only tell that the picture contains a dog and a motorcycle; it cannot obtain more structured information, whereas the human brain can appreciate what makes the picture interesting.

A lot of data in life exists in the form of graph structures, from microscopic molecules to large-scale production and daily life, so completing machine learning tasks on graphs is a very common requirement.


In recent years, how to apply deep learning to graph data has become a focus of developers' attention, and Graph Neural Networks (GNNs) were born as a result. A graph neural network is a type of deep neural network used to learn vector representations of nodes, edges, or the entire graph, and its core idea is message passing. For example, to judge which NBA team a person likes, you can look at which teams his friends like on a social network: if 80% of his friends like a team, he probably likes that team as well. When modeling a node, information is collected from its neighboring nodes; this process is message passing.

The messages from all neighboring nodes are aggregated, for example by a weighted sum, and the aggregated message is then used to update the node's existing representation through an update function. This is the most basic mathematical formulation of a graph neural network.
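Written out (a standard formulation of the scheme described above, with notation chosen here for illustration: h_v^(l) is the representation of node v at layer l, N(v) its neighbors, e_uv an optional edge feature, M the message function, and U the update function):

```latex
% One message-passing layer: aggregate messages from neighbors, then update the node.
m_v^{(l+1)} = \sum_{u \in \mathcal{N}(v)} M\!\left(h_u^{(l)},\, h_v^{(l)},\, e_{uv}\right),
\qquad
h_v^{(l+1)} = U\!\left(h_v^{(l)},\, m_v^{(l+1)}\right)
```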


Graph neural networks have very wide applications in different fields.

Molecular medicine: The first application is molecular property prediction. The input is a molecular structure graph; through message-passing modeling, a graph neural network produces a vector representation that is fed to a downstream classifier, which can judge a chemical's properties, toxicity, and so on. The second is drug molecule generation: an encoding model is built, the molecule is turned into a vector representation through a graph neural network, and guidance is added during generation so that the generated molecules meet our needs. The third is drug repurposing. Here Amazon has constructed a drug knowledge graph, DRKG, to represent the relationships between drugs, disease proteins, compounds, and other entities. After modeling this data with a graph neural network, links between drug nodes and disease-protein nodes can be predicted, thereby suggesting potential drugs for treating new diseases. Currently, 11 of the 41 drugs recommended by graph-neural-network modeling have been used in clinical practice.

Knowledge graph: In knowledge graphs, graph neural networks can be used for many downstream tasks, such as knowledge graph completion and node classification.

Recommendation system: Mainstream recommendation systems are mainly based on interaction data between users and products. If user A purchases a product, the system records the purchase. Through data analysis, if user B's purchase history is found to be similar to user A's, then there is a high probability that user B will be interested in other products user A has purchased. Recommendation systems based on graph neural networks have already been commercialized.

Computer vision: A scene graph is taken as input and modeled with a graph neural network; with an image generator attached at the end, the scene graph can be used in reverse to generate better images.

Natural language processing: Graph structures are also ubiquitous in natural language processing. In TreeLSTM, for example, a sentence is not treated as a linear structure: it has a grammatical structure, and training on the sentence's syntax tree yields a better analysis model. In addition, the currently popular Transformer can also be viewed as a variant of a graph model.

Graph neural networks, whether in academia or industry, already have some very good implementations, but many problems still urgently need to be solved. How do we model graphs as their scale keeps growing? How do we extract structured data from unstructured data? This requires good tools for developing models.

It is not easy to write graph neural networks with traditional deep learning frameworks (TensorFlow, PyTorch, MXNet, and so on). Message passing is a fine-grained computation, while tensor programming interfaces define coarse-grained computations, and this gap between coarse and fine granularity makes writing graph neural networks very difficult. Amazon developed DGL to bridge this gap. Wang Minyi introduced DGL from three directions: programming interface design, low-level system optimization, and open source community building.

The first is programming interface design, which is built around the concept of the graph. Wang Minyi believes developers should first understand that the graph is the "first-class citizen" of a graph neural network: all DGL functions and NN modules can accept and return graph objects, including the core message passing API.
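A minimal sketch of what this looks like in practice, assuming DGL with a PyTorch backend; the toy graph and feature sizes below are invented for illustration:

```python
# Sketch: graphs as first-class objects in DGL, with one round of message passing.
# Assumes `pip install dgl torch`; the tiny graph here is made up for illustration.
import dgl
import dgl.function as fn
import torch

# A small directed graph with edges 0->1, 1->2, 2->0.
g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])))
g.ndata["h"] = torch.randn(g.num_nodes(), 8)   # 8-dim node features

# One message-passing step: copy each source node's feature as the message,
# then sum the incoming messages at every destination node.
g.update_all(fn.copy_u("h", "m"), fn.sum("m", "h_sum"))
print(g.ndata["h_sum"].shape)   # (3, 8)

# Built-in NN modules also take the graph object directly.
conv = dgl.nn.GraphConv(8, 16)
h_out = conv(g, g.ndata["h"])   # (3, 16)
```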

The second is optimization of the underlying system. Other graph neural network frameworks, such as PyTorch Geometric (PyG), often use gather/scatter primitives to support message passing, which generates a large number of redundant message objects during computation and consumes a lot of memory bandwidth. DGL instead uses efficient sparse operators to accelerate graph neural networks, making it 2 to 64 times faster than PyG, saving 6.3 times the memory, and handling giant graphs well.

Finally, Wang Minyi shared his experience in building the open source community, covering the following points.

First, code is not the only thing that matters; documentation accounts for half of an open source project. Amazon has designed documentation at different levels. For beginners, there is a getting-started tutorial that teaches DGL in 120 minutes: just download and run it to learn training hands-on. For advanced users, there is a user guide covering the design concepts and a DGL API reference, so users can grow from novices to experts step by step.

Second, an open source community needs a rich set of GNN model examples. The field develops very fast, and keeping pace with it means covering the many different application scenarios of GNNs with example models. At present, DGL has about 70 examples of classic GNN models, covering various fields and research directions.

Third, we need to focus on community interaction. Amazon has organized many community activities for developers to communicate with each other, such as regular GNN user group sharing sessions that invite cutting-edge scholars and developers from academia and industry to present results in the GNN field. In addition, user forums, Slack, and WeChat groups provide communication channels on different platforms.

3. Wu Lei: Application and landing of large-scale machine learning in computational advertising

As a company that provides advertising services to thousands of customers, FreeWheel is committed to building a unified trading platform that brings buyers and sellers together. While connecting media and advertisers, it provides computational advertising services that span the full scenario, cover both brand and performance advertising, and reach multiple screens.

From the perspective of marketing appeal and purpose, computational advertising is divided into brand advertising and performance advertising. In brand advertising, FreeWheel uses machine learning for advertising inventory forecasting and inventory recommendation. In performance advertising, when FreeWheel participates in the market as an SSP on the traffic-owner side, it uses machine learning to optimize the system; when it participates as a DSP on the advertiser side, it combines historical bidding records with machine learning to build a predictive model that can estimate the win rate for a given price or, given a target win rate, recommend the corresponding price. Combined with advertising inventory forecasts and the fluctuations of traffic and prices in the market, it can flexibly acquire traffic and buy for the largest ROI.


Inventory forecasting plays a pivotal role in computational advertising. Whether for brand advertising or performance advertising, inventory forecasting lays a solid foundation for supply-and-demand planning and bidding strategies. Inventory forecasting means predicting the available ad inventory over a future period under different targeting conditions. For advertisers, the biggest appeal is reaching the most relevant users with the lowest advertising budget, so targeting conditions such as gender, age, and geography must first be grouped, and predictions are then made for each group.

There are many targeting conditions used to characterize traffic in computational advertising, and combining different dimensions produces a Cartesian product: the number of combinations explodes exponentially as the number of dimensions and the cardinality of each dimension grow. Assuming there are 1 million combinations, there are 1 million time series to predict. With traditional methods such as ARIMA, millions of models would need to be trained and maintained, which is obviously unrealistic. In addition, the actual scenario requires predicting 2,160 future time units at hourly granularity, that is, 90 days; for such a long sequence, ensuring both accuracy and prediction efficiency is a big challenge, because to predict 2,160 time units accurately, at least the same length of history must be looked back over. At FreeWheel, daily ad-serving logs are on the order of a billion records, so the overall data volume is very large.

In summary, Wu Lei believes inventory forecasting faces four main challenges: dimensional explosion, engineering complexity, ultra-long time series, and massive data samples.

To meet these four challenges, FreeWheel designed and implemented a customized deep model based on the Wide & Deep architecture proposed by Google in 2016.

First, for the problems of dimensional explosion and engineering complexity, FreeWheel uses Wide & Deep to encode the targeting conditions together with the corresponding time series, so that a single model can handle millions of different time series; training and maintaining only one model also greatly reduces engineering complexity.
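As an illustration only (not FreeWheel's actual architecture), a minimal Wide & Deep sketch in Keras, where the wide path takes crossed targeting features and the deep path takes dense embeddings, with one output per future hour; all sizes below are invented:

```python
# Hypothetical Wide & Deep sketch: wide path for crossed targeting features,
# deep path for dense features, shared 2,160-step forecast head.
import tensorflow as tf

wide_in = tf.keras.Input(shape=(1000,), name="wide")   # one-hot / crossed targeting
deep_in = tf.keras.Input(shape=(64,), name="deep")     # dense or embedded features

deep = tf.keras.layers.Dense(256, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)

merged = tf.keras.layers.concatenate([wide_in, deep])
out = tf.keras.layers.Dense(2160)(merged)              # one value per future hour

model = tf.keras.Model([wide_in, deep_in], out)
```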

Second, to deal with the ultra-long time series, FreeWheel designed an element-wise loss function, so that backpropagation for each of the 2,160 time units is independent and the units do not affect each other.
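Continuing the sketch above, and again only as an assumed illustration of the idea rather than FreeWheel's implementation, an element-wise loss keeps each step's error term independent:

```python
# Hypothetical element-wise loss over the 2,160-step horizon: each output step
# contributes its own error, with no coupling across steps.
HORIZON = 2160  # hourly steps, i.e. 90 days

def elementwise_mae(y_true, y_pred):
    # y_true, y_pred: (batch, HORIZON). Per-step absolute errors are independent;
    # averaging them only rescales the gradients uniformly.
    return tf.reduce_mean(tf.abs(y_true - y_pred), axis=-1)

model.compile(optimizer="adam", loss=elementwise_mae)
```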

Finally, to handle the massive data, FreeWheel chose Amazon SageMaker from Amazon Cloud Technology and migrated this business from its data center to Amazon Cloud Technology. Compared with building and maintaining a distributed environment on its own, this greatly saves time and energy. Wu Lei said this is in line with FreeWheel's consistent philosophy of leaving professional work to professionals.

As for model effectiveness, model design and tuning are of course important, but the energy and time invested across the entire pipeline basically follow the 80/20 rule: in real applications, 80% of the time and energy is often spent processing data and preparing features and training samples.

FreeWheel mainly uses Apache Spark for sample engineering, feature engineering, and related data processing. Wu Lei briefly introduced this process.

For time-series problems, the first issue is sample completion. User behavior is often not continuous in time, so when it is turned into a time series, some time slots are missing and samples need to be filled in. FreeWheel's approach is to first prepare all combinations in advance with ad impressions set to zero at every time slot, then summarize the "positive samples" for each combination and time period from the online logs, and finally left-join the two tables to get the desired result. However, joining the two tables directly in Spark performs very poorly: it took nearly 7 hours on a 10-node EC2 Spark cluster. To reduce the execution time, the FreeWheel team tuned Spark by replacing the large, numerous join keys with hash values. After tuning, on the same cluster, the execution time dropped to under 20 minutes.
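A simplified PySpark sketch of this idea; the table and column names are invented, hash collisions are ignored for brevity, and the real pipeline is more involved:

```python
# Simplified sketch of the sample-completion left join with hashed join keys.
# Table and column names are placeholders for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dims = spark.table("all_targeting_combinations")   # every gender/age/geo combination
hours = spark.table("forecast_hours")               # one row per hour
logs = spark.table("hourly_positive_samples")       # aggregated from ad-serving logs

# 1) Build the full skeleton: every combination at every hour, impressions = 0.
skeleton = dims.crossJoin(hours).withColumn("impressions", F.lit(0))

# 2) Replace the wide composite join key with a single hash column on both sides.
key_cols = ["gender", "age", "geo", "hour"]
skeleton = skeleton.withColumn("k", F.hash(*key_cols))
logs = logs.withColumn("k", F.hash(*key_cols)).select("k", F.col("impressions").alias("imp"))

# 3) Left join and keep the logged impressions where they exist.
samples = (skeleton.join(logs, on="k", how="left")
                   .withColumn("impressions", F.coalesce("imp", "impressions"))
                   .drop("imp", "k"))
```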

After the time-series samples are obtained, feature engineering is needed, which has two main parts. The first uses Spark Window operations to slide a window over the impressions sorted by hour, which is what actually creates the time-series samples. The second is feature generation, such as deriving various time features from timestamps. Because the data is ultimately fed to a TensorFlow deep model, all fields also need to be encoded in advance.
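A small sketch of the windowing step, continuing the earlier join sketch; the column names and the one-week (168-hour) look-back are chosen purely for illustration:

```python
# Sketch: sliding-window features over hourly impressions with Spark Window,
# plus simple time features derived from a timestamp column "ts".
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.partitionBy("gender", "age", "geo")
           .orderBy("hour")
           .rowsBetween(-167, 0))     # current hour plus the previous 167 hours

features = (samples
    .withColumn("imp_7d_avg", F.avg("impressions").over(w))
    .withColumn("imp_7d_max", F.max("impressions").over(w))
    .withColumn("hour_of_day", F.hour("ts"))
    .withColumn("day_of_week", F.dayofweek("ts")))
```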

After the samples are ready, the next step is model training and inference. For training, to balance model quality and execution efficiency, FreeWheel borrows the idea of transfer learning: the model is pre-trained on a large batch of data to ensure quality and then fine-tuned with incremental data every day. For inference, because the model serves different downstream consumers and some require batch prediction results, the work is divided into the following four training and inference task types.


After the model went online, the finest-grained MAPE was kept at around 20%, and the aggregated MAPE below 10%. In terms of execution efficiency, the offline cold start, that is, pre-training the model, takes 2 hours; incremental training takes only about 10 minutes; and batch inference completes in 5 minutes.

4. Zhang Jian: Practical application of graph neural network and DGL

As a senior data scientist at Amazon Cloud Technology, an important part of Dr. Zhang Jian's job is to use graph neural networks and DGL as tools in real customer scenarios, helping customers solve core business problems and enhance business value. In this talk, he introduced the challenges encountered when putting graph neural networks and DGL into practice in four aspects, data, model, speed, and interpretation, along with his thoughts on them.


Does your graph contain enough information?
In academia, many scholars use open datasets to build models and improve algorithms. The most commonly used datasets in graph neural network research are Cora, Citeseer, and PubMed. These graphs are usually highly connected, and nodes of the same category cluster together, so graph neural networks built on them often perform well. In actual business scenarios, however, limited by how data is collected, stored, and processed, the constructed graph data is sometimes very sparse; a lot of effort and time goes into model tuning, yet the results are not ideal. If the connectivity of the customer's graph is too low, no matter what graph neural network model is used, it will eventually degenerate into an ordinary MLP. In addition, business graphs provided by customers often have very little label data: in a graph with hundreds of millions of nodes, only 100,000 nodes may be labeled, a label rate of about 0.01%. This makes it difficult to reach other labeled nodes from any given labeled node, which greatly reduces the effectiveness of the graph neural network.

There is a saying among data scientists: data and features determine the upper limit of a model's performance, and the model can only approach that ceiling. Rather than pouring all the effort into the model, it is often better to look for solutions on the data side. But if the information in the graph determines the upper limit, what exactly is the information of a graph? How do we measure it? Can such an information measure guide GNN design? Should we construct a graph at all? These questions confront machine learning practitioners and even development engineers, and Zhang Jian raised them in the hope that everyone will brainstorm solutions together.

Under what circumstances is the GNN model more advantageous?
"I know that your graph neural network has a variety of models, can you see what model is suitable for our graph?" Industrial customers once asked Dr. Zhang Jian. And this question is difficult to answer. First of all, the design space of the model is much larger than the options. Second, different business scenarios correspond to different business needs. It is not easy to judge how the model design or model selection in the business scenario is specific to the specific business. In addition, the core development model of DGL It is message passing (MP). In the field of graphs, some problems can already be realized without MP. We have also seen that in the field of graph machine learning, there has not yet been a model like GPT in the NLP field, which can quickly solve most of the problems.

Zhang Jian said the most worrying thing goes beyond this: customers sometimes question directly, "Dr. Zhang, look, our XGBoost and other models do better than this GNN!" One customer in the financial industry, after extracting various relationships between customers from a financial knowledge graph, simply used LightGBM; with more than a thousand features combined, it handily beat the graph neural network model. Although, with the help of some techniques, the graph neural network model eventually surpassed the customer's LightGBM model, the episode left plenty of room for thought: in what ways is a graph neural network better than a traditional machine learning model, and when is it better?

Zhang Jian believes that the vast majority of traditional machine learning models are feature-based, yet in real business scenarios not every node or every feature can be obtained, especially as privacy regulations strengthen and supervision of big data becomes stricter, making data harder and harder to collect. A graph neural network, by contrast, can still establish associations even when features are missing, and this is its advantage.

The graph neural network model and traditional machine learning models are not an either-or choice. How to choose must be decided by the business scenario and the business problem, and the two can even be combined. What is the applicability of different GNN models? How should node and edge features be used? Must GNNs be used at all? How can GNNs be combined with other models? Zhang Jian left these questions for everyone to think about.

Can the graph model do real-time inference?
After a model proves effective, whether it can go online for real-time inference is a question customers frequently ask, and it involves two aspects. In a graph structure the data points are correlated, so unlike traditional CV and NLP data they are not independent and identically distributed. There are two modes for inference on graph data: transductive and inductive. In the transductive mode, the nodes or edges to be predicted already exist in the graph during training, and the trained nodes can "see" them. The problem with this mode is that the points to be predicted must already exist and the graph must already be constructed, so there is almost no way to achieve real-time inference, because real-time inference must deal with future points. In the inductive mode, the nodes to be predicted are not in the graph during training and are invisible; they are only seen at inference time, when the trained model is applied to a new graph. Using the inductive mode to infer unseen nodes covers two situations. The first is batch prediction, for example in anti-fraud: a graph is built from the past seven days of data to train a model, and to examine user behavior that happens tomorrow, tomorrow's data is combined with the previous seven days to build a new graph on which the trained model runs. This is batch inference, not real-time inference. To truly achieve real-time inference, the nodes or edges to be predicted must be added to the existing graph in real time, and the N-hop subgraph around them extracted and fed to the trained model.
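As a rough sketch of that last step only (inductive, near-real-time inference in DGL; the graph, features, 2-hop depth, and `model` are all placeholders, not a production design):

```python
# Rough sketch: add an incoming node and its edges to the existing graph,
# extract its 2-hop in-neighborhood, and run a trained GNN on that subgraph only.
# The graph, features, hop count, and `model` are placeholders for illustration.
import dgl
import torch

def realtime_predict(g, model, neighbor_ids, new_feat):
    # 1) Append the new node and its edges to the existing graph.
    g = dgl.add_nodes(g, 1, data={"h": new_feat.unsqueeze(0)})
    new_id = g.num_nodes() - 1
    g = dgl.add_edges(g, neighbor_ids, torch.full_like(neighbor_ids, new_id))

    # 2) Collect the 2-hop in-neighborhood around the new node.
    seeds = {int(new_id)}
    frontier = torch.tensor([new_id])
    for _ in range(2):
        src, _ = g.in_edges(frontier)
        seeds.update(src.tolist())
        frontier = src
    sub = dgl.node_subgraph(g, sorted(seeds))

    # 3) Run the trained GNN on the small subgraph rather than the full graph.
    with torch.no_grad():
        return model(sub, sub.ndata["h"])
```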

According to Zhang Jian, not only the graph community but the entire machine learning community, including the big data community, has yet to design storage, extraction, and query methods for real-time (for example, streaming) graph data. Existing graph databases are often not fast enough at insertion and lookup; in particular, when sampling around a given node or edge, the sampling speed of the graph database cannot keep up with what real-time inference requires. For the system architecture of real-time inference, the industry has no particularly mature approach yet. This is a problem that urgently needs solving, and also a very big opportunity for developers.

How to interpret the results of the model?
After a model goes online, one of the problems it faces is how to interpret its results. There are some research results on this question in academia, but such discussions are rarely seen in industry.

For example, after a graph model produces a prediction for a node, the business staff ask why. If they are told it is because the "neighbors" next to that node have the greatest influence on it, they certainly will not accept that as an explanation.

In addition, although a graph neural network can identify some patterns through the graph structure, the nodes in it all carry features, which in the end are just real numbers; after a series of linear and nonlinear transformations, the relationships between them go far beyond human intuitions about cause and effect. How to interpret the results of a graph model remains a long road for developers.

The landing of graph neural networks faces many challenges. Zhang Jian said these challenges are like the parts of a moon rocket: data is the fuel, the model is the engine, the data pipelines and deployment architecture are the overall rocket design, and model interpretation is the flight control center. Only by solving problems at all four levels can the rocket really fly to the moon.

5. Final thoughts

Over the years, Amazon has accumulated many projects and much practical experience in artificial intelligence, and it has been committed to co-creating with developers around the world, hoping to bring new vitality to the field. The 2021 Amazon Cloud Technology China Summit will officially open in Shanghai on July 21. Under the theme "Building a New Pattern, Reshaping the Cloud Era," the conference will join hands with leading practitioners in the cloud computing industry to share stories of reshaping and building in the cloud era. The Shanghai stop is only the opening leg: the Amazon Cloud Technology China Summit will continue in Beijing in August and Shenzhen in September.


The summit covers more than one hundred technical sessions and includes dedicated sub-forums on artificial intelligence, bringing hands-on practice, technical architecture, and related content around databases, big data, and intelligent lake house construction, as well as technical interpretations of customer cases and practices. There will also be a dedicated open source sub-forum on site, with many well-known speakers invited to share. Scan the QR code at the bottom of the article to learn more about the summit!

