The China Open Source Code Power List is a list of Chinese open source developers co-sponsored by SegmentFault, Open Source Society, Tengyuan Club, and X-lab.
The OpenDigger team from X-lab analyzed the archived logs that GitHub makes publicly available and screened out the top 10,000 accounts by GitHub collaboration influence in 2021. Dozens of developers and more than a dozen partner communities were then invited to verify the annotation information through open collaboration and to exclude bot accounts. In the first stage, 99 Chinese developers were selected.
After the China Open Source Code Power List was released, it attracted the attention of many developers, many of whom want to know how the ranking was generated and what algorithm model is behind it. We invited Dr. Zhao Shengyu, the founder of the OpenDigger open source project, to write a blog post sharing the algorithm model behind the list.
Zhao Shengyu is a 2022 director of the Open Source Society, a PhD candidate in computer science at Tongji University, and the initiator of several open source projects. In 2020, he was selected as one of 33 Chinese open source pioneers.
The following content is reproduced from Dr. Zhao Shengyu's blog post "The algorithm model behind the open source code list".
The China Open Source Code Power List, recently released in cooperation with SegmentFault, has attracted the attention of many developers. Most are curious about how the ranking was produced, what algorithm is behind it, and why some developers are on the list while others are not. This post explains the algorithm behind the list. I also hope to get feedback from you, so that the list can be continuously improved and made more comprehensive and fair.
Open Source Value Network
The previous three blog posts introduced a heterogeneous-graph PageRank algorithm, namely an open source value network based on collaboration data. This time we use a GitHub-wide developer-project value network that contains only collaboration data. Its structure is as follows:
This is a simplified version of the originally designed value network. It does not include the attention relationships from developers to projects (star, fork), the attention relationships between developers (follow), or the dependency relationships between projects (dependent). The main reasons are limited computing power and the fact that the algorithm is not yet robust to missing data.
After building the complete network, we jointly rank all developers and projects across GitHub on a monthly basis and obtain a value ranking for each of them. In other words, we can produce rankings of all active developers and projects for every month from 2015 to the present, while the China Open Source Code Power List uses developer data summed over the whole of 2021.
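The yearly aggregation described above can be sketched in a few lines. This is only an illustration under one assumption: that "summed over the whole of 2021" means adding up a developer's twelve monthly values. The function name `annual_ranking` and the `"YYYY-MM"` key format are hypothetical and not taken from OpenDigger.

```python
from collections import defaultdict


def annual_ranking(monthly_values, year):
    """Sum each developer's monthly values for one year and rank them.

    monthly_values: {"YYYY-MM": {developer: value}} (hypothetical layout)
    Returns a list of (developer, total) pairs, highest total first.
    """
    totals = defaultdict(float)
    for month, values in monthly_values.items():
        if month.startswith(f"{year}-"):  # keep only the months of that year
            for developer, value in values.items():
                totals[developer] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


example = {
    "2020-12": {"alice": 10.0},            # ignored: outside 2021
    "2021-01": {"alice": 2.0, "bob": 1.0},
    "2021-02": {"bob": 3.5},
}
print(annual_ranking(example, 2021))  # [('bob', 4.5), ('alice', 2.0)]
```

A developer who was very valuable in a single month can therefore rank below one who was moderately valuable all year, which matches the emphasis on sustained activity.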
Compared to traditional PageRank
This algorithm model is similar to the traditional PageRank algorithm in that it uses global relationship data for joint ranking. It rests on several basic assumptions about value:
1. More valuable projects more easily attract more valuable developers to contribute.
2. The more valuable a project is, the more developers it attracts to contribute.
3. More valuable developers are active on more valuable projects.
The difference from the traditional PageRank algorithm is that, in the open source value network, different types of nodes (developers, projects) can be calculated in different ways. The algorithm also introduces prior knowledge, that is, the inherent properties of each node, as part of its input rather than relying on network relationship data alone.
In other words, in the open source value network the value of a project or developer in a given month depends not only on the activity of developers and projects in that month, but also in part on the value inherited from the previous month. This makes the results of the algorithm very smooth, and it reflects our belief that the long-term value of open source does not depend only on the current situation.
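To make the idea concrete, here is a minimal sketch of one month of such a calculation. It is not the OpenDigger implementation: the propagation rule (each node passes its value along its activity edges in proportion to edge weight), the function name `monthly_values`, and the fixed number of rounds are all assumptions made for illustration. The inheritance ratios are plain parameters of the sketch.

```python
from collections import defaultdict


def monthly_values(activity, prev_dev, prev_proj, dev_inherit, proj_inherit, rounds=50):
    """Estimate developer and project values for one month (illustrative only).

    activity: {(developer, project): activity weight in this month}
    prev_dev, prev_proj: values from the previous month (the prior knowledge)
    dev_inherit, proj_inherit: share of the new value taken from the prior
    """
    edges = {pair: w for pair, w in activity.items() if w > 0}
    dev_total = defaultdict(float)   # total activity of each developer
    proj_total = defaultdict(float)  # total activity received by each project
    for (dev, proj), w in edges.items():
        dev_total[dev] += w
        proj_total[proj] += w

    # Nodes that are active for the first time start from a value of 1.
    dev_prior = {d: prev_dev.get(d, 1.0) for d in dev_total}
    proj_prior = {p: prev_proj.get(p, 1.0) for p in proj_total}
    dev_value, proj_value = dict(dev_prior), dict(proj_prior)

    for _ in range(rounds):
        dev_net = defaultdict(float)
        proj_net = defaultdict(float)
        for (dev, proj), w in edges.items():
            # A project receives value from its contributors ...
            proj_net[proj] += dev_value[dev] * w / dev_total[dev]
            # ... and a developer receives value from the projects worked on.
            dev_net[dev] += proj_value[proj] * w / proj_total[proj]
        # Blend the network result with the inherited value, per node type.
        dev_value = {d: dev_inherit * dev_prior[d] + (1 - dev_inherit) * dev_net[d]
                     for d in dev_prior}
        proj_value = {p: proj_inherit * proj_prior[p] + (1 - proj_inherit) * proj_net[p]
                      for p in proj_prior}
    return dev_value, proj_value


example = {("alice", "big"): 3.0, ("bob", "big"): 1.0, ("bob", "small"): 1.0}
developers, projects = monthly_values(example, {}, {}, dev_inherit=0.5, proj_inherit=0.3)
print(developers)
print(projects)
```

Because part of every value comes from the prior, the iteration is a contraction and settles quickly, and a single unusually busy month cannot completely overturn the ranking.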
Specific parameters
In this model, we used the following parameters:
1. Developer-project activity, calculated in the same way as in the laboratory's China Open Source Annual Report and GitHub Insight Report of previous years, namely
$$ A=\sqrt{1 \cdot C_{\text{issue\_comment}} + 2 \cdot C_{\text{open\_issue}} + 3 \cdot C_{\text{open\_pull}} + 4 \cdot C_{\text{pull\_review\_comment}} + 2 \cdot C_{\text{merged\_pull}}} $$
That is, 1 point for each issue comment, 2 points for opening a new issue, 3 points for opening a PR, 4 points for each review comment on a PR, and 2 points for a merged PR. The final square root dampens excessively high activity.
2. The initial value of a developer or project, that is, its value when it is active for the first time, is 1.
3. Each month, 50% of a developer's value comes from the developer's own historical value and 50% comes from that month's open source value network.
4. Each month, 30% of a project's value comes from the project's own historical value and 70% comes from that month's open source value network.
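The parameters above translate directly into code. The following is a small sketch, not the laboratory's implementation: the weights and ratios come from the list above, while the names `activity`, `blend`, and the constants are hypothetical.

```python
import math

# Event weights from the activity formula in item 1.
WEIGHTS = {
    "issue_comment": 1,
    "open_issue": 2,
    "open_pull": 3,
    "pull_review_comment": 4,
    "merged_pull": 2,
}
INITIAL_VALUE = 1.0      # item 2: value when first active
DEVELOPER_INHERIT = 0.5  # item 3: share taken from historical value
PROJECT_INHERIT = 0.3    # item 4: share taken from historical value


def activity(counts):
    """Activity of one developer on one project; missing event types count as 0."""
    weighted = sum(WEIGHTS[kind] * counts.get(kind, 0) for kind in WEIGHTS)
    return math.sqrt(weighted)


def blend(historical, network, inherit):
    """Mix last month's value with this month's value from the network."""
    return inherit * historical + (1 - inherit) * network


# 3 opened PRs (9) + 3 merged PRs (6) + 1 issue comment (1) = 16 points.
print(activity({"open_pull": 3, "merged_pull": 3, "issue_comment": 1}))  # 4.0
print(blend(2.0, 4.0, DEVELOPER_INHERIT))  # 3.0
```

The square root means that quadrupling the weighted event count only doubles the activity score, so one extremely noisy account cannot dominate a project.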
Frequently asked questions
- Why are the authors of some projects with extremely large user bases and high popularity, such as the author of Vue, not on the list?
As the description above makes clear, this algorithm is based mainly on collaboration relationships and does not include indicators such as the number of users of open source software. (The number of users has always been very difficult to obtain; even the project itself may not know the exact figure.) The algorithm therefore favors projects with a large number of continuously active developers, and it cannot reflect a project's large user base. This is closely tied to the data used and to the parameters of the model. Vue in particular is a relatively independently maintained project with Evan You at its core (refer to 2019 Vue Project Collaboration Network Diagram).
- Why are some very active developers not on the list?
Although we jointly ranked developers and projects across all of GitHub, we have no way of knowing exactly which accounts belong to Chinese developers. We put a lot of effort into manual labeling, but omissions are still inevitable. The list of accounts already labeled as Chinese developers has been stored in OpenDigger, where it can be viewed. If an account should be labeled as a Chinese developer, you can submit an issue to OpenDigger, and the account will be included in calculations once the change is merged.
Future improvements
1. Introduce star and fork relationships. For this list we left out data such as star and fork because of computing power. The time complexity of PageRank-like iterative algorithms is positively correlated with graph density, and because starring is such a low-cost action, it makes the graph denser very quickly. This greatly increases the running time, especially when jointly ranking tens of millions of nodes.
2. Introduce the follow relationship between developers. The follow relationship is a good guide for identifying KOLs among developers, but it raises a mathematical problem: solving the Rank Sink problem under incomplete data. We need a low-cost way to handle the problems caused by one-way relationships.
3. Project dependencies. From the user's point of view, if the number of users cannot be obtained effectively, dependency relationships between projects are very suitable data, because they identify usage relationships between projects, especially within a language ecosystem. However, they raise the same issues as the developer follow relationship above, plus some additional engineering issues.
- Taking the Node.js ecosystem as an example, the dependencies of packages that have been published to npm are easy to track, but for projects whose packages have not been published to npm, the dependencies must be analyzed further from the contents of the package.json file in the repository. The cost of doing this globally is extremely high.
- Taking the Java ecosystem as an example, the metadata in the Maven central repository does not contain the address of the upstream source repository, so linking an artifact to its repository is a big problem. In addition, under Java release practices, repositories and artifacts are in a many-to-many relationship, which makes building project dependencies even more complicated.
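The Rank Sink problem mentioned in item 2 can be illustrated with a tiny follow graph. This sketch assumes the usual meaning of the term from classic PageRank: a group of nodes that only point to each other absorbs all the rank. The damping factor shown is the classic remedy, not necessarily the approach this list will adopt. The function `pagerank` and the sample accounts are hypothetical, and every node is assumed to follow at least one other node in the graph.

```python
def pagerank(out_links, damping, rounds=100):
    """Plain power iteration; out_links maps each node to the nodes it follows."""
    nodes = sorted(out_links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(rounds):
        # Every node first receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in out_links.items():
            for target in targets:
                # The rest of a node's rank is split among the nodes it follows.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# "a" follows "b"; "b" and "c" only follow each other, so they form a sink.
follows = {"a": ["b"], "b": ["c"], "c": ["b"]}

no_damping = pagerank(follows, damping=1.0)
with_damping = pagerank(follows, damping=0.85)
print(no_damping["a"])    # 0.0: all rank has drained into the b/c pair
print(with_damping["a"])  # stays above zero thanks to the teleport share
```

With incomplete follow data such closed groups become more likely, which is why a low-cost way to handle one-way relationships is needed before follow edges can be added to the value network.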
If you are familiar with the problems above and know how to solve them, please contact me.
About the China Open Source Code Power List
The China Open Source Code Power List is a list of Chinese open source developers co-sponsored by SegmentFault, Open Source Society, Tengyuan Club, and X-lab.
The OpenDigger team from X-lab analyzed the archived logs that GitHub makes publicly available and screened out the top 10,000 accounts by GitHub collaboration influence in 2021. Dozens of developers and more than a dozen partner communities were then invited to verify the annotation information through open collaboration and to exclude bot accounts. In the first stage, 99 Chinese developers were selected.
After the list was released, we received feedback from the community and added one more Chinese open source developer who meets the criteria, Huan (Li Zhuohuan). We also invite every developer in the open source world to contact us and send feedback through the GitHub project address. We will continue to update the list.
Through the China Open Source Code Power List, we hope that the outstanding contributors of the open source world and the developers behind open source projects can be known, recognized, and respected by more people, and that more people will pay attention to open source and to the growth of open source developers.
Project address: https://github.com/OpenSourceWin
Official project website: http://opensource.win/