On June 1, the third Beijing Zhiyuan Conference officially opened. At the opening ceremony, the Zhiyuan Research Institute released "Enlightenment 2.0", the world's largest super-large-scale intelligent model.
The "Enlightenment 2.0" model has a parameter scale of 1.75 trillion, ten times that of GPT-3, breaking the record of 1.6 trillion parameters previously set by Google's Switch Transformer pre-trained model. It is China's first trillion-parameter-scale model and currently the largest model in the world.
From 1.0 to 2.0: "Enlightenment" explores general artificial intelligence
On March 20 this year, the Zhiyuan Research Institute released the super-large-scale intelligent model "Enlightenment 1.0", which trained a series of models for Chinese, multimodality, cognition, and protein prediction. Introducing the thinking behind "Enlightenment", Professor Huang Tiejun, president of the Zhiyuan Research Institute, noted that in recent years the development of artificial intelligence has gradually moved from a stage of "mass-producing models" toward one of "refining large models": it is an inevitable trend to integrate as much data as possible through advanced design, pool a large amount of computing power, and intensively train large models for use by a large number of enterprises.
The "Enlightenment 2.0" released yesterday is another successful exploration of the "Large Model".
From GPT-2 with 1.5 billion parameters, to GPT-3 with 175 billion, to the Switch Transformer with 1.6 trillion, deep learning models have embraced a brute-force aesthetic of scale, yet none of these models are based on Chinese. With 1.75 trillion parameters, Enlightenment 2.0 is not only a breakthrough in parameter count; it is also the first trillion-parameter-scale Chinese pre-trained model. Zhang Hongjiang, chairman of the Zhiyuan Research Institute, believes that "large models plus large computing power" is currently a feasible path toward general artificial intelligence.
Professor Tang Jie, academic vice president of the Zhiyuan Research Institute, said that "Enlightenment" aims to create cognitive intelligence driven by the twin wheels of data and knowledge, allowing machines to think like humans and to achieve cognitive capabilities beyond the Turing test. In developing large-scale pre-trained models, the "Enlightenment" team has done a great deal of foundational work and formed an independent technical innovation system for super-large-scale intelligent models, covering the complete chain from pre-training theory and techniques, to pre-training tools, to model construction and evaluation; the chain is technically complete and mature. Through a series of original innovations and technical breakthroughs, the newly released "Enlightenment 2.0" achieves being "big and smart", featuring large scale, high precision, and high efficiency.
Enlightenment 2.0: "Big and Smart"
The parameter scale of Enlightenment 2.0 reached a record-breaking 1.75 trillion. According to reports, the new generation of FastMoE technology is the cornerstone that made the "trillion-parameter model" of Enlightenment 2.0 possible.
Previously, MoE (Mixture of Experts), the core technology behind Google's trillion-parameter model, was tightly bound to Google's distributed training framework and custom hardware, so most researchers had no opportunity to use or study it. FastMoE, developed and open-sourced by the "Enlightenment" team, is the first MoE system to support the PyTorch framework. It is simple to use, flexible, and high-performance, and supports large-scale parallel training. The new generation of FastMoE supports complex load-balancing strategies such as those of Switch and GShard, allows different experts to use different model structures, and fills in the last missing piece for realizing the trillion-parameter model.
Figure: FastMoE's data-parallel mode, in which each worker hosts multiple experts and data parallelism runs across workers. A top-2 gate means the gating network selects the two expert networks with the highest activation scores.
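To make the routing concrete, below is a minimal single-device sketch of a top-2 gated mixture-of-experts layer in plain PyTorch. It illustrates the gating mechanism described above, not FastMoE's actual API; the class name and expert design are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal top-2 gated mixture-of-experts layer (illustrative sketch only)."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the gating network
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (num_tokens, num_experts)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)  # keep the 2 highest-scoring experts
        out = torch.zeros_like(x)
        for rank in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, rank] == e                # tokens routed to expert e at this rank
                if mask.any():
                    out[mask] += top_val[mask, rank].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoE(d_model=64, num_experts=8)
y = moe(torch.randn(16, 64))  # 16 tokens, each processed by its top-2 experts
```

In a distributed system such as FastMoE, the experts would additionally be sharded across workers and tokens exchanged between workers before and after expert computation; this sketch keeps everything on one device to show only the gating logic.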
The high precision of "Enlightenment 2.0" comes from a series of core technological innovations, for example:
- GLM 2.0: an innovation in model architecture and a more general pre-trained model. Its first version broke the barrier between BERT and GPT for the first time, pioneering a single model compatible with all mainstream architectures. The new generation exemplifies doing more with less: with only 10 billion parameters, it matches Microsoft's 17-billion-parameter Turing-NLG model and achieves better results on multiple tasks.
- P-tuning 2.0: an algorithm that greatly narrows the gap between few-shot learning and fully supervised learning, putting its few-shot ability far ahead (see the sketch after this list).
- CogView: a new framework for text-to-image generation. It overcomes the key problem of numerical overflow and non-convergence when training joint text-image models, combines VQ-VAE with a Transformer, and delivers SOTA (state-of-the-art) performance: its FID on MS COCO is better than DALL·E and other models. The model can also score its own generations, similar to OpenAI's CLIP, and can generate multiple styles such as Chinese painting, oil painting, cartoon, and contour drawing.
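As a concrete illustration of the P-tuning idea mentioned above, learning continuous prompt embeddings while reusing a pre-trained backbone, here is a minimal PyTorch sketch. The tiny stand-in backbone and all names are hypothetical, and this shows only the shallow variant; P-tuning 2.0 additionally inserts trainable prompts at every layer of the model.

```python
import torch
import torch.nn as nn

class PromptTunedClassifier(nn.Module):
    """Illustrative continuous-prompt tuning: only the prompt and head are trained."""

    def __init__(self, backbone: nn.Module, d_model: int,
                 num_prompts: int = 20, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze the pre-trained backbone
            p.requires_grad = False
        # trainable continuous prompt vectors, prepended to every input sequence
        self.prompt = nn.Parameter(torch.randn(num_prompts, d_model) * 0.02)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        h = self.backbone(torch.cat([prompts, token_embeds], dim=1))
        return self.head(h.mean(dim=1))        # mean-pool, then classify

# Stand-in for a real pre-trained encoder (hypothetical, for demonstration only)
d_model = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptTunedClassifier(backbone, d_model=d_model)
logits = model(torch.randn(8, 32, d_model))    # 8 sequences of 32 token embeddings
```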
In addition, during the development of the "Enlightenment" models, the Zhiyuan Research Institute built WuDaoCorpora2.0, the world's largest corpus database, which includes the world's largest Chinese text dataset (3 TB), the world's largest multimodal dataset (90 TB), and the world's largest dialogue dataset (181 GB), providing rich data support for the industry's research and development of large-scale intelligent models.
Beyond the release of the Enlightenment 2.0 model, this AI event invited more than 200 top experts in artificial intelligence from China and abroad for in-depth discussions of cutting-edge research progress and trends in the field. Around the international academic frontiers and industrial hot spots of artificial intelligence, the conference set up 29 thematic forums, including "Pre-trained Models", "Machine Learning", "Swarm Intelligence", "Mathematical Foundations of Artificial Intelligence", "Intelligent System Architecture and Chips", "Precision Intelligence", "Intelligent Information Retrieval and Mining", "Qingyuan Academic Annual Conference", "AI Entrepreneurship", "AI Pharmaceuticals", "AI Systems", "AI Openness and Sharing", and "Women in AI Technology".
At the opening ceremony on June 1, Turing Award winner Yoshua Bengio, Dr. Zhu Min, dean of the National Institute of Financial Research at Tsinghua University, and E Weinan, academician at Peking University, delivered keynote speeches on System 2 logical reasoning, data assets, and science and intelligence, respectively.
For more details, please refer to the official website of the conference: https://2021.baai.ac.cn/