
Over the past few days, the topic programmers have been most concerned about is Copilot, the new AI programming tool launched by GitHub.

Billed as "Your AI pair programmer", it is built on a new AI system called OpenAI Codex and was trained on billions of lines of publicly available code from GitHub, along with English-language text. It can autocomplete an entire line of code or an entire function, generate code from comments, write tests, and quickly suggest alternative ways to solve a problem.


How Copilot works

Copilot has been praised for its productivity gains, but problems have surfaced one after another.

Is GitHub taking the lead in plagiarizing code?

Copilot is currently available as a technical preview, and its official website states that if the technical preview is successful, GitHub plans to build a commercial version.

However, Copilot's training data is publicly available data, including terabytes of public code on GitHub. Is GitHub turning the open source code contributed by developers into a paid product and then selling it back to those same developers?

Developer Eevee said:

By its own admission, Copilot was trained on a large amount of GPL code. Isn't this a form of laundering open source code into a commercial product? The assurance that "it usually doesn't reproduce exact code blocks" is not very satisfying.

Copyright covers not only copying and pasting but also derivative works. GitHub Copilot was trained on open source code, and everything it knows comes from open source code. No interpretation of "derivative" can fail to include this.

The GPL (GNU General Public License) is a widely used free software license that gives end users the freedom to run, study, share, and modify software. Note that the GPL is a copyleft license, which means that derivative works can only be distributed under the same license terms.

In answering the question "Does the GPL require the source code of modified versions to be disclosed?", the GNU official website states: if you release the modified version to the public in some way, the GPL requires you to make the modified source code available to the program's users.

On GitHub Copilot being trained on GPL code, Eevee said:

The GPL's message is clear: "Don't put my work in proprietary software" (also known as non-free software). Yet Copilot's mechanism does exactly that: it puts that work into proprietary software.

Likewise, ten hours ago, Flask author Armin Ronacher tweeted that Copilot does not apply the correct license. It turns out that Copilot reproduced code from Quake without stating the correct license (Quake is licensed under GPLv2).

Below the tweet, many users reported similar findings: Copilot generates some strange license notices. Most of them are MIT, and one even attributes the code to the US Dept. of Energy.

Privacy and security issues

Beyond code infringement, Copilot also faces privacy concerns, since its training set contains personal data. Copilot's official website states that, according to internal testing, GitHub Copilot's suggestions rarely contain personal data copied verbatim from the training set. GitHub Copilot may sometimes suggest personal data such as email addresses, API keys, or phone numbers, but these are synthesized from patterns in the training data. In the technical preview, GitHub has implemented filters that block email addresses when they appear in standard formats.
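To make the idea of filtering "emails in standard formats" concrete, here is a minimal sketch in Python. It is purely illustrative and is not GitHub's implementation: the pattern, the function name `redact_emails`, and the placeholder text are all assumptions made for this example. It also shows the obvious limitation of such a filter, namely that an address written in a non-standard form passes through untouched.

```python
import re

# Hypothetical, simplified pattern for addresses of the form local@domain.tld.
# Real-world email syntax is broader; this is only an illustration.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(suggestion: str, placeholder: str = "[email removed]") -> str:
    """Replace anything that looks like a standard-format email address."""
    return EMAIL_PATTERN.sub(placeholder, suggestion)


# A standard-format address is caught by the filter.
print(redact_emails('author = "Jane Doe <jane.doe@example.com>"'))

# An obfuscated address does not match the pattern and is left unchanged.
print(redact_emails("Contact: jane dot doe at example dot com"))
```

A filter of this kind only matches the shape of the text. It cannot tell whether a name, biography, or phone number belongs to a real person, so personal data in other forms is outside its reach.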

However, this does not seem to be the case.

When software engineer Kyle used Copilot to generate an "About me" page, it produced the personal information of another developer, David Celis.

Meanwhile, the real David Celis is still waiting for his Copilot application to be approved...

Some people are amazed by the efficiency gains Copilot brings, while others are shocked by GitHub's disregard for licenses, copyright, and privacy. Would you still use Copilot?



小魔