These days, the hottest topic among programmers is Copilot, GitHub’s new AI programming tool.

Billed as “Your AI Pair Programmer,” the tool is built on a new model called OpenAI Codex and trained on terabytes of publicly available code from GitHub, along with English-language text. It can automatically complete entire lines of code or whole functions, generate code from comments, write tests, and quickly suggest alternative ways to solve a problem.
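As a rough illustration of what “generating code from comments” means, the sketch below pairs a comment a developer might type with a function body of the kind a completion tool could propose. It is hand-written for this article and is not actual Copilot output; the function name `days_between` and its behavior are assumptions chosen only for the example.

```python
from datetime import date

# Comment a developer might type as the prompt:
# Return the number of whole days between two ISO-8601 dates.
def days_between(start: str, end: str) -> int:
    # Hand-written stand-in for a suggested completion (not real Copilot output).
    # date.fromisoformat parses strings such as "2021-06-29".
    delta = date.fromisoformat(end) - date.fromisoformat(start)
    return abs(delta.days)
```

Calling `days_between("2021-06-29", "2021-07-06")` returns 7. The point is the workflow, not the code itself: the developer writes the intent in plain English and reviews whatever the tool proposes.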



How Copilot works

Copilot has won plenty of praise for its productivity gains, but problems have surfaced as well.

GitHub takes the lead in copying code?

Copilot is currently available as a technical preview. According to its website, GitHub will build a commercial version if the preview is successful.

However, Copilot’s training data consists of publicly available material, including terabytes of public code hosted on GitHub. Is GitHub going to turn open source code contributed by developers into a “paid product” and sell it back to developers?

According to developer Eevee:

Copilot has, by its own admission, been trained on large amounts of GPL code. Isn’t that a form of laundering open source code into a commercial product? “It usually doesn’t reproduce an exact block of code” is not a satisfying answer.

Copyright covers not only copying and pasting, but also derivative works. GitHub Copilot is trained on open source code, and everything it knows comes from that code. There is no reading of “derivative” that does not include this.

The GPL (General Public License) is a widely used free software license that gives end users the freedom to run, study, share, and modify software. It is also a copyleft license, which means that derivative works can only be distributed under the same license terms.

In response to the question “Does the GPL require the modified source code to be made public?” the GNU website states that the GPL does not force you to release a modified version at all, but if you do release the modified version to the public in some way, the GPL requires you to make the modified source code available to its users under the GPL.

Regarding GitHub Copilot’s training with GPL code, Eevee said:

The GPL explicitly says “do not turn my work into proprietary software,” and Copilot is a mechanism for turning that work into proprietary software.

Just a few hours ago, Flask creator Armin Ronacher tweeted that Copilot does not attach the correct license to what it produces. In his example, Copilot reproduced code from Quake without stating the correct license (Quake is licensed under GPLv2).

In replies to the tweet, a number of users reported something similar: Copilot generates some strange license headers. Most are MIT licenses, and one is even attributed to the U.S. Dept. of Energy.

Privacy and security issues

In addition to code infringement issues, Copilot also faces privacy concerns, since the training set contains personal data. According to Copilot’s website, internal testing has shown that GitHub Copilot’s suggestions rarely contain personal data copied verbatim from the training set. GitHub Copilot may sometimes suggest personal data such as email addresses, API keys, or phone numbers, but the website says that data is synthesized from patterns in the training data. In the technical preview, Copilot implements filters that block email addresses displayed in standard formats.
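To make the idea of filtering “standard format” email addresses concrete, here is a minimal sketch of how such a filter could work. It is not GitHub’s actual implementation, which is not public in the sources cited here; the pattern `EMAIL_PATTERN`, the helper names, and the placeholder text are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for "standard format" addresses such as user@example.com.
# This is an assumption for illustration, not the filter Copilot really uses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_email(suggestion: str) -> bool:
    """Return True if the suggestion contains a standard-format address."""
    return EMAIL_PATTERN.search(suggestion) is not None


def redact_emails(suggestion: str, placeholder: str = "<email removed>") -> str:
    """Replace every standard-format address in the suggestion."""
    return EMAIL_PATTERN.sub(placeholder, suggestion)
```

A filter of this kind only catches addresses written in the usual form. An obfuscated address such as “jane dot doe at example dot com” passes through untouched, and so do names, biographies, and other personal details that follow no fixed pattern.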

In practice, however, this does not always seem to be the case.

Kyle, a software engineer, was given the personal information of another developer, David Ceils, when he used Copilot to generate an About Me page.

And the real David Ceils is still waiting for Copilot to approve his application…

Some are amazed by the efficiency gains that Copilot has brought; others are alarmed by GitHub’s disregard for licenses, copyright, and privacy. So, can you still use Copilot?

Reference links:

  • https://twitter.com/eevee/sta…
  • https://mobile.twitter.com/mi…
  • https://mobile.twitter.com/ky…
  • https://copilot.github.com/
  • https://www.gnu.org/licenses/…
  • https://www.gnu.org/licenses/…