The 2021 ECUG Con, hosted by ECUG (Effective Cloud User Group), was held in Shanghai on April 10-11, 2021. At the meeting, Qiniu Cloud CEO Xu Shiwei gave a keynote titled "Data Science and Go+". He talked about his understanding of the changes in data science and his vision and plans for the new language Go+, and boldly pointed out that data science is entering an explosive period in which more and more companies of a new type will appear. The following is the content of the speech.
Just now I was saying that ECUG keeps getting more upscale. In fact, I am becoming more and more like a plain lecturer. This year is the 14th year of the ECUG community, and this event is also the 14th ECUG Con. This session was supposed to be held last year, but it was postponed because of the epidemic.
There are two principles that I have always followed at ECUG:
First, keep writing code myself. Every time I come to ECUG I get nervous, because I cannot show up empty-handed. So this is also a good opportunity to keep myself on the front line of technology;
Second, the topics I share each year have a certain continuity and present my own thinking about the future.
Since last year I have been talking about data science. For the three years before that, I talked about practices on the client side. The reason is that I think the first era of cloud computing belonged to machine computing, that is, virtual machines; the second generation is cloud native, which I regard as an "infrastructure" revolution. In other words, the first stage is resources and the second stage is infrastructure. The third stage, in my judgment, is application computing, which involves collaboration between the front end and the back end.
Since last year, my talks have turned to data science. A very important factor and trend is the arrival of the data age. Especially since 2017, as large amounts of data have been digitized, applications involving data science have been appearing widely in all walks of life.
Last year, quite by coincidence, I created a language on impulse. I have built quite a few languages before, and some of them have users. But I was always clear about one thing: I never expected that any of them could be commercialized one day. Some companies may happen to use them commercially, but from the moment they were born they were not aimed at commercialization.
I spent a lot of energy on evangelizing Go in 2012, because as a start-up company it was too difficult to recruit people. A better recruiting logic is to let others find you interesting and see that the company's technical atmosphere is very good. Go+ is the first language I earnestly hope to commercialize, but there has not been much publicity yet, and 1.0 has not been released yet. I want to talk about my own thoughts on Go+ and data science, and why I think Go+ has a commercial opportunity.
Today's talk covers four topics:
- Language development
- The development of data science
- Go+ design philosophy
- Go+ implementation iteration
First of all, let's talk about the development of languages, a topic programmers are very interested in. I divide the history of language development into three parts.
First, the history of static languages. I picked from the top 20 of the current ranking of the most popular languages and ordered them by when they appeared. The earliest release is C, which is still in the top three of the rankings. Then come C++, Objective-C, Java, C#, Go, Swift, and Go+. We can see an interesting phenomenon: almost every 6-8 years, a new and influential static language appears, which is a sign of productivity iteration.
Third, the development of languages related to data science. For data science I chose from the top 50, because the top 20 contains too few of them. This is also quite interesting: the first is SQL, followed by SAS, MATLAB, Python, R, and Julia. Python was not originally conceived as a data science language, but it eventually became the most popular language in the field of artificial intelligence.
There is another obvious feature here: the time span is as long as that of static languages, so data science actually has a long history, but it has not developed as fast. Static languages have an iteration almost every 6-8 years, but data science languages do not, and the gaps between them are particularly large. But I think we are now entering an accelerated period of data science.
You may be thinking, why should I analyze the history of language development? Several conclusions are key.
First of all, I think that the scripting language is a product of a specific historical stage. In the long run, static languages are more vital.
Second, data science was the original requirement for computers: the earliest computers were used for calculation. It has a long history but slow progress, because the era of data explosion had not yet arrived.
The development of data science
After talking about the development of languages, let's talk about the development of data science. Data science can also be divided into several stages. The first stage I call the "primitive period", or the "mathematical software era". This period can basically be summarized by two characteristics. The first is limited domains, most typically BI (Business Intelligence); the second is limited data scale, typically like Excel, where the number of rows and columns is very limited, and other software is basically the same.
What are the characteristics of data science during this period? First of all, it is not infrastructure. It is actually mathematical application software, but it is very capable and powerful, covering statistics, forecasts, insights, planning, decision-making, and so on.
The second period I call the "data science infrastructure period", which truly made data science infrastructure. Its most typical representative is the rise of big data. MapReduce is a paper published by Google in 2004. Hadoop appeared in 2006 and Spark appeared in 2009. I think this is the stage of the rise of big data and the beginning of data science infrastructure. This period differs from the mathematical software just mentioned: it puts large-scale processing capability first rather than powerful functionality, and its functionality is relatively limited.
There is a long time interval between the rise of big data and the rise of deep learning. Deep learning got TensorFlow in 2015 and PyTorch in 2017. These are the two best-known deep learning frameworks. The essence of deep learning is to automatically derive the function F in y = F(x) from data. Usually this F is implemented by programmers, but the core concept of deep learning is to let the machine generate F automatically so as to achieve the best curve fit. It is actually an automatic computation based on measurement results.
Suppose Newton's three laws did not exist today, but I had a pile of measurement data; in theory, I should be able to find Newton's three laws from it. This is the core logic of deep learning. It does not replace big data; it enhances it, and is really about how to make the capabilities of big data go further and become more powerful.
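As a toy illustration of recovering a law from measurements, here is a minimal Go sketch. It uses ordinary least squares on a one-parameter model rather than deep learning, and the data points are invented for the example; the function name fitSlope is also just an assumption of this sketch.

```go
package main

import "fmt"

// fitSlope returns the least-squares estimate of k in y = k*x
// (a line through the origin): k = sum(x*y) / sum(x*x).
// The caller must pass slices of equal length with at least one non-zero x.
func fitSlope(xs, ys []float64) float64 {
	var sxy, sxx float64
	for i := range xs {
		sxy += xs[i] * ys[i]
		sxx += xs[i] * xs[i]
	}
	return sxy / sxx
}

func main() {
	// Hypothetical measurements for one object: acceleration a (m/s^2)
	// and the force F (N) that produced it. Values are made up.
	a := []float64{1, 2, 3, 4}
	f := []float64{2, 4, 6, 8}

	// The fitted slope plays the role of the mass m in F = m*a.
	fmt.Printf("estimated mass: %.2f kg\n", fitSlope(a, f))
}
```

A deep learning framework does the same kind of fitting with far more parameters and a far more flexible F, but the idea of choosing F to match the measurements is the same.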
There is a view that two technological factors are the core drivers behind today's economic development: one is computing, and the other is data.
The core of data is the data science we are talking about today. Data science has actually reached a new paradigm. There is a term called the "fourth paradigm", and there is a company in China also called Fourth Paradigm. We believe that data is a higher-level kind of productive capacity; compared with computing, it stands at a higher dimension.
Those are the first two stages of data science, so what is the third stage? I think it is the data science explosion period, which is today; in Jack Ma's words, the "DT era". The primitive period was a capability confined to limited domains and limited data scale. The future is, first, all domains, no longer limited to a category such as Business Intelligence (BI); second, large-scale data; third, everywhere: in the cloud, in smartphones, in embedded devices and so on, what we call data intelligence will be embedded.
The rise of the mobile Internet has made many companies very successful, and the popularization of the Internet and the birth of Internet applications gave birth to BAT. But we know that the success of a newer company such as ByteDance is not the success of the Internet, but the success of data science. Even today, data science cannot be said to be accessible to everyone; its threshold is still very high.
However, we can see that smart applications have already emerged. Smart applications will not be limited to amplifying productivity in one local area, as Douyin does. All walks of life will be affected by data intelligence, which is the fourth paradigm we just mentioned.
Data and data science will surely become the support of the next generation of productivity. Today, emerging companies such as ByteDance and Kuaishou have appeared, but they are just the beginning, not the end.
In the primitive era of data science, data was just a by-product. Consider the BI field: data is only a by-product there, used only for later operational decision-making.
But today we see that in a large number of applications, data is the raw material. This is a very different state, which is why I call it the data science explosion period. This is the reason why I think Go+ is needed today, and it is also the historical background behind it.
The future of data science must be the integration of general-purpose languages and mathematical software, so as to complete the infrastructure of data science in the true sense. But today, the infrastructure of data science is far from complete. This is my own judgment.
Today's Python is very good, why do you need Go+?
Of course, many people will ask: today's Python is already very good and has been widely used in the field of deep learning, so why is Python not enough, and why is Go+ needed? In fact, I think that Python is not infrastructure. It is a scripting language, and I think it is only needed for a specific historical stage.
Data science itself is a computing power revolution. Even in the chip field, data is driving how computation is done. This is the core reason why Nvidia has overtaken Intel. This is even more true in the upper-level software field, where a new infrastructure carrier will definitely need to emerge.
Computing power is essentially a computationally intensive business. Behind Python stands C, and Python alone is not enough. Today, C and Python together support all of deep learning, but data science must sink further down the stack. What is the result of that sinking?
This is why we need Go+ today! In the previous section, I mainly talked about why I think Go+ has commercialization opportunities. Of course, what I said about commercialization is not necessarily making money. Don’t get me wrong. Language may be an unprofitable thing in most people’s minds, but this does not mean that it is not important, it is very important.
Go+ design philosophy
After talking about the development of data science, let's talk about the design philosophy of Go+. Why is Go+ the way it is today? Behind computing stands the programmer; behind data science stands the data scientist or analyst. The two roles are actually different, although both are technical jobs. I think it is relatively easy to train programmers. Today, the number of programmers is very large, but the number of data scientists is relatively small. This is why, after the rise of deep learning in the past few years, the salaries of so-called AI engineers were bid up far above those of programmers. In fact, it is because data scientists are not easy to find.
This role carries the connection between technology and business, and it is difficult to find people with both abilities. Data science is first of all a technical job, it needs technical ability, and it needs to understand business. Today, there is still no systemic ability to train data scientists, and there is no such systematic methodology.
So what is the core idea of Go+?
First, we are trying to use Go+ to unify programmers and data scientists, so that they share a common vocabulary and can have a natural dialogue. I think this is the core idea of Go+. A very important core logic of Go+ is to use one language to let the two roles talk to each other.
On this basis, we derived some design decisions. First, Go+ is a static language whose syntax is fully compatible with Go; second, it is more script-like than Go in form, with a lower learning threshold. Although Go is a static language with a fairly low learning threshold, it is not low enough, not as low as Python's; third, since we want to be a data science language, it must have more concise syntax support for mathematical operations; fourth, it has a dual engine, which supports static compilation into executable files and also supports compilation into bytecode for interpreted execution.
Why did we choose syntax that is fully compatible with Go? First of all, I personally firmly believe that static languages have stronger vitality and can outlast historical cycles. It is easy to understand that a language needs to span cycles, because the life cycle of a language is usually very long. We cannot short-sightedly let whatever is currently popular decide the language design. In fact, we have to find the elements that can outlast the cycle.
Second, why Go? I personally think that Go has the most streamlined syntax design and the lowest learning threshold among static languages. It is easy to learn Go even if you have never learned a static language before. Our company was the first to recruit Go programmers, but most of the people we recruited did not know Go. When we started using Go, not many people in the world thought Go would be the popular language of the future. Our own practical experience shows that two weeks of learning is basically enough for the Go language; it is a static language with a very low threshold.
But for a data science language, the threshold of Go is not low enough. Although Go+ is fully compatible with Go, we hope it has a lower threshold than Go. So it is more script-like in form than Go, because scripts are often easier to understand. We hope that the learning threshold of Go+ is at the same level as Python's.
Go+ was born only in May or June of last year, and around October I started to let three children aged 13-14, from grade 6 to the first year of junior high school, try to learn Go+. This practice has shown that the idea is feasible. They can understand the design of Go+ and can write code in Go+ freely. This also shows that all the simplification efforts we made on top of Go are very cost-effective.
I have briefly listed some Go+ syntax here, of course not all of it, only some that I think is relatively concise. Python has no built-in rational numbers. We believe that rational numbers are still very common in data science, especially in lossless numerical operations, so Go+ has built-in support for rational numbers. Of course, Python basically has equivalents of map and slice as well.
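To make "lossless numerical operations" concrete, here is a small sketch in plain Go using the standard library's math/big package. It does not show Go+'s own rational syntax, which the talk does not spell out; the helper addRat is a name invented for this example.

```go
package main

import (
	"fmt"
	"math/big"
)

// addRat adds the fractions an/ad and bn/bd exactly and returns the
// result in lowest terms as a string. Denominators must be non-zero
// (big.NewRat panics on a zero denominator).
func addRat(an, ad, bn, bd int64) string {
	a := big.NewRat(an, ad)
	b := big.NewRat(bn, bd)
	return new(big.Rat).Add(a, b).RatString()
}

func main() {
	// 1/10 + 2/10 is exactly 3/10 with rationals; no rounding error
	// accumulates, unlike binary floating point.
	fmt.Println(addRat(1, 10, 2, 10))
	// Whole-number results are printed without a denominator.
	fmt.Println(addRat(1, 3, 2, 3))
}
```

Built-in rationals in a language would let such expressions be written directly instead of through method calls like these.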
List comprehension is something Python also has, but our support for list comprehension is very complete. Basically, once you understand how to write for loops in Go+, you understand list comprehension. Beyond that, there are concise expressions for some routine data science operations. The above is a general overview of the syntax; for friends who have not seen Go+ before, I hope it gives a general picture.
One very interesting point about Go+ is that it is the only language that chose a dual-engine design: it supports both static compilation and interpreted execution.
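The link between for loops and list comprehension can be illustrated in plain Go, which has no comprehensions. The sketch below writes out by hand the loop that a comprehension like "the square of every even element of xs" stands for; the function name and the sample data are assumptions of this example, and Go+'s actual comprehension syntax is not shown here.

```go
package main

import "fmt"

// squaresOfEvens is the explicit loop form of a comprehension:
// iterate over xs, keep the elements that pass the filter (x is even),
// and collect the mapped value (x*x) into a new slice.
func squaresOfEvens(xs []int) []int {
	var out []int
	for _, x := range xs {
		if x%2 == 0 {
			out = append(out, x*x)
		}
	}
	return out
}

func main() {
	fmt.Println(squaresOfEvens([]int{1, 2, 3, 4, 5, 6}))
}
```

A comprehension folds the loop header, the filter, and the collected expression into a single expression, which is why knowing the loop form is enough to read it.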
Why do we want a dual engine? Because I think the demands of programmers and data scientists are different. Data scientists like single-step execution. Recall the mathematical software you have seen, including SAS and MATLAB: the interaction model of mathematical software is single-step execution.
This is not because data scientists are lazy. Programmers can hold program logic in their heads, and we know in our heads whether the logic is written correctly. However, when data scientists do calculations, they cannot know whether the results are correct, because human computing power is much weaker than a computer's. So they must execute one step at a time and look at the result to know what to do next. This is a point where the working model of data scientists is completely different from that of programmers.
Because he is doing calculations rather than doing a kind of program logic, it is difficult for him not to do single-step execution.
But when a data scientist has built a model and finally puts it to use, he still wants maximum execution efficiency. He certainly does not want the code to run slowly, so at that point he needs to compile it statically and run it. This is why Go+ is designed with a dual engine: the working mode in the debugging phase is completely different from that in the production phase.
Iteration on Go+ implementation
After talking about the design philosophy of Go+, we come to the last part, the iteration of the Go+ implementation. Where is Go+ now? Although Go+ has not yet released version 1.0, about 60 to 70% of the syntax is currently supported, which is fairly complete.
The source code of Go+ is turned into Go+ tokens by a scanner, and then into a Go+ abstract syntax tree by a parser. Most languages do this. After the abstract syntax tree there are two branches. One generates Go code so that it can be statically compiled, and the other generates bytecode for interpreted execution. The two branches are made polymorphic through an execution specification (exec.spec), which is actually an abstract interface.
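The scanner and parser stages can be seen in miniature with Go's own standard library. The sketch below runs go/scanner and go/parser on a tiny Go snippet; it is not the Go+ toolchain, only an analogous pipeline for Go source, and the helper names countTokens and funcNames are invented for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/scanner"
	"go/token"
)

const src = `package main

func add(a, b int) int { return a + b }
`

// countTokens runs the scanner stage: source text in, tokens out.
// It returns the number of tokens seen before EOF.
func countTokens(code string) int {
	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(code))
	var s scanner.Scanner
	s.Init(file, []byte(code), nil, 0)
	n := 0
	for {
		_, tok, _ := s.Scan()
		if tok == token.EOF {
			break
		}
		n++
	}
	return n
}

// funcNames runs the parser stage: source text in, abstract syntax
// tree out. It then walks the tree and collects function names.
func funcNames(code string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", code, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		if fn, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
		return true
	})
	return names, nil
}

func main() {
	fmt.Println("tokens:", countTokens(src))
	names, err := funcNames(src)
	fmt.Println("functions:", names, "error:", err)
}
```

Everything after this point in a compiler, such as code generation, consumes the tree that the parser produced.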
Recently, I personally noticed a problem during iteration: a person who has just joined the Go+ team needs a while to become familiar with the whole code base. The execution specification part of Go+ is actually an abstract SAX-style interface, which is event-driven. An event is sent to a receiver, and the receiver processes the event according to its own needs. This is quite common in text processing.
The interface we designed before basically uses an event-driven model to connect the different components. The compiler walks the abstract syntax tree and sends out events. These events are received by the two code-generating modules, which each do their work according to their own needs. Code in this style is still a bit hard to understand, especially when the compiler does something complicated, which makes it even harder. If you know the implementation logic behind Go, you know that type inference in Go is fairly complicated. In fact, most of the complexity of our compiler comes from type inference.
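The event-driven arrangement can be sketched in Go as one interface with two receivers. This is a heavily simplified, hypothetical model and not the real exec.spec: the interface Builder, its two methods, and both receivers are invented for illustration, and the second receiver evaluates directly instead of emitting bytecode.

```go
package main

import "fmt"

// Builder is a hypothetical stand-in for an execution specification:
// the compiler front end calls these methods as events while it walks
// the abstract syntax tree.
type Builder interface {
	Push(v int) // event: an integer literal was seen
	Add()       // event: a binary "+" was seen
}

// goGen reacts to the events by building Go source text.
type goGen struct{ stack []string }

func (g *goGen) Push(v int) { g.stack = append(g.stack, fmt.Sprint(v)) }
func (g *goGen) Add() {
	n := len(g.stack)
	expr := "(" + g.stack[n-2] + " + " + g.stack[n-1] + ")"
	g.stack = append(g.stack[:n-2], expr)
}
func (g *goGen) Source() string { return g.stack[len(g.stack)-1] }

// interp reacts to the same events by computing the value at once
// (a simplification of generating and then running bytecode).
type interp struct{ stack []int }

func (p *interp) Push(v int) { p.stack = append(p.stack, v) }
func (p *interp) Add() {
	n := len(p.stack)
	p.stack = append(p.stack[:n-2], p.stack[n-2]+p.stack[n-1])
}
func (p *interp) Result() int { return p.stack[len(p.stack)-1] }

// compile stands in for the compiler: it emits the event sequence for
// the expression 1 + 2 + 3 without knowing which receiver is attached.
func compile(b Builder) {
	b.Push(1)
	b.Push(2)
	b.Add()
	b.Push(3)
	b.Add()
}

func main() {
	g := &goGen{}
	compile(g)
	fmt.Println(g.Source())

	p := &interp{}
	compile(p)
	fmt.Println(p.Result())
}
```

The front end stays the same for both back ends, but each receiver has to rebuild whatever context it needs from the event stream, which hints at why this style becomes hard to follow once type inference is involved.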
I am currently trying to refactor this logic. I want the execution specification part to be no longer an abstract interface, but a DOM with a standard implementation. This DOM itself includes the ability to do type inference, which makes the compiler relatively simple. I cannot go into the implementation details today; I will expand on them when there is a chance later.
Next, I want to talk about the focus of Go+'s next step.
First, the core goal: we still hope to release version 1.0 this year, but the most important thing for 1.0 is to settle the user-facing paradigm as far as possible, and I hope that, as with Go, there will be few syntax changes after 1.0. The most important work now is to clarify which core syntax Go+ needs and to try to support it in version 1.0, unless there are specific considerations; Go, for example, left a particularly complex syntax feature such as generics to later versions. Go+ is similar: we may defer some particularly complex syntax features, but we will try as far as possible to settle most of the syntax features we need in version 1.0.
For Go+ 1.0, we will first iterate with a single engine. We will build the statically compiled engine first, and after 1.0 is released we will iterate on the script engine. This is also a decision based on the principle of putting the user paradigm first, as mentioned above.
Finally, we hope to operate Go+ in a commercial way, and we will also recruit Go+ team members. Welcome to join the Go+ team!
I think the core of Go+ is to first unify the language of programmers and data scientists, so that the two sides can have a natural dialogue. In addition, I firmly believe that Go+ will be the next revolution in data science. I am very excited to be able to do such a thing, and I welcome people who recognize it to join us.
This is how to contact us. The first is the address of the project ( https://github.com/goplus ), the second is the email address for submitting your resume (email@example.com), and the third is my Twitter handle (@xushiwei).
The author, Xu Shiwei, is the founder and CEO of Qiniu Cloud, chief evangelist of the Go language in Greater China, creator of the Go+ language, and initiator of the ECUG community. He has worked at Kingsoft (Jinshan) and Shanda, and has more than ten years of research and development experience in search and distributed storage technologies. At Kingsoft, he led the architecture design and development of WPS Office 2005 as chief architect. After founding Kingsoft Lab, he led the development of distributed storage as technical director, then joined the Shanda Innovation Institute and successfully launched "Shanda Netdisk" and "Shanda Cloud". In 2020, Xu Shiwei was selected as one of the "2020 Chinese Open Source Pioneers: 33 People".