Introduction: Principle and demonstration.

01 / What is code encryption?

Code encryption is a product developed in-house by the cloud efficiency team. It is the first code hosting service in China to support code encryption and, at present, the first in the world to implement an encryption scheme on top of native Git.

Encrypting the repositories hosted on Codeup effectively prevents anyone other than the data owner from accessing users’ plaintext data, which avoids data leakage in the cloud. At the same time, the encryption process is completely transparent to users, who can use any official Git client (including but not limited to Git, JGit, and libgit2) to access the repositories on Codeup.

02 / A review of a major security incident in the Linux community

In late August 2011, several servers used to maintain and distribute the Linux operating system were infected with malware powerful enough to gain root access, modify the system software on those servers, and change login passwords. Yet the community’s maintainers said that the Linux source code itself was not affected by the intrusion.

Why is that? Because the code is maintained with Git, each of the nearly 40,000 files of Linux kernel code is hashed to ensure uniqueness, which makes it difficult to change an older version without being noticed.

While Git can address the source tampering concerns of the open source community, it can’t address the data leakage concerns of the enterprise. For enterprise code management, the issues today are not only data security, but also reliability and cost.

When an enterprise is small and has modest reliability requirements, a self-built code hosting service seems to suffice. But as the enterprise scales up and its code grows, it may need better server configurations to support many people working together, or even dedicated maintenance staff to ensure reliability, and cost becomes a concern.

A cloud code hosting service offers higher reliability and lower cost than a self-built one. However, because it does not give users direct access to the underlying storage, users tend to feel that the security of their code is out of their control.

Code encryption technology addresses exactly this concern: it takes the underlying storage from being outside the user’s control to being almost fully under it, which removes users’ worries about moving code to the cloud.

03 / Is it really safer to build yourself than to go to the cloud?

Before answering that question, let’s get some background on Git’s storage structure.

When you commit code with Git, the first things you come across are commit records and branches. Branches and tags are collectively referred to as references. Each reference is stored in its own file: the file path is the reference name and the content is the corresponding version hash. Since branch names are generally unrelated to the business, it can be assumed that they do not contain sensitive data.
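
As a rough illustration of this layout, the following Python sketch builds a throwaway directory that mimics loose reference files and reads a branch back. The helper read_ref, the branch name, and the hash value are hypothetical placeholders, and the sketch ignores other ways Git can store references (such as a packed-refs file).

```python
import tempfile
from pathlib import Path


def read_ref(git_dir: Path, ref_name: str) -> str:
    """Return the hash stored in a loose reference file (hypothetical helper)."""
    return (git_dir / ref_name).read_text().strip()


def demo() -> str:
    # Build a throwaway directory that mimics the layout of loose references.
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)
        ref_file = git_dir / "refs" / "heads" / "main"
        ref_file.parent.mkdir(parents=True)
        # Placeholder 40-character hash, not a real commit.
        ref_file.write_text("0123456789abcdef0123456789abcdef01234567\n")
        # The path is the reference name; the content is the version hash.
        return read_ref(git_dir, "refs/heads/main")


if __name__ == "__main__":
    print(demo())
```

Nothing in such a file reveals code content, which is consistent with leaving references unencrypted.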

Besides commit records, file contents are stored in blob objects, file names and directory information are stored in tree objects, and tags that carry additional information are stored in tag objects.

Objects are the basic unit of data in Git. Typically, each object is stored in its own file named after the hash of its content; these are called loose objects. After gc (garbage collection) runs, the objects are packed together into a packfile. Because file contents and file names live in blob and tree objects, it can be assumed that objects contain the user’s code data, i.e. sensitive data.
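
To make the naming rule concrete, here is a minimal Python sketch of how an object name is derived, assuming a repository that uses the default SHA-1 object format. The function names git_object_id and loose_object_path are hypothetical; the hash covers a short type-and-length header followed by the content.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Compute the SHA-1 name of a Git object from its type and content."""
    # The hashed bytes are "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return the relative path of a loose object inside the Git directory."""
    # The first two hex characters form the directory, the rest the file name.
    return f"objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid = git_object_id("blob", b"hello world\n")
    print(oid, loose_object_path(oid))
```

The result for a given content should match what git hash-object reports for the same bytes.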

To reduce disk usage, Git compresses stored objects with zlib. Decompression alone is enough to recover the data, so this is effectively plaintext storage. That is, anyone with access to the storage can view the code data kept there.
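
The point can be sketched in a few lines of Python: no key is involved anywhere, only zlib. The helpers make_loose_object and read_loose_object are hypothetical names, the sample content is invented, and compression level 1 is an assumption chosen for the example.

```python
import zlib


def make_loose_object(content: bytes) -> bytes:
    """Build the stored bytes of an unencrypted loose blob object."""
    raw = b"blob " + str(len(content)).encode() + b"\x00" + content
    # Compression level 1 is an assumption made for this sketch.
    return zlib.compress(raw, 1)


def read_loose_object(stored: bytes) -> bytes:
    """Recover the original content using nothing but zlib."""
    raw = zlib.decompress(stored)
    # Drop the "<type> <size>" header that precedes the first NUL byte.
    _header, _, content = raw.partition(b"\x00")
    return content


if __name__ == "__main__":
    stored = make_loose_object(b"password = 'example'\n")
    print(stored[:2].hex(), read_loose_object(stored))
```

Anyone who can read the stored bytes can run the second function, which is why compressed storage offers no confidentiality.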

Trust issues caused by plaintext storage

To answer the earlier question: because Git does not store code securely, a self-built code hosting service must guard not only against attacks from outside but also against insiders, since enterprise code leaks usually originate internally.

A cloud code hosting service can rely on Alibaba Cloud’s security capabilities to effectively avoid the risk of external attacks. That leaves one question: how can users come to trust the cloud code hosting service, so that the code is invisible even to operation and maintenance personnel?

This is where code encryption comes in. Code data hosted in the cloud is encrypted with the user’s own key, which both strengthens the security of data at rest and hides the code from operation and maintenance personnel, removing users’ concerns about moving to the cloud.

04 / Code encryption technology revealed

I divide it into three problems to solve:

1. Key management

Keys must be escrowed in a secure, compliant way so that key storage is safe and the encryption can be trusted. This can be done with Alibaba Cloud’s Key Management Service (KMS).

2. Use the key

Git is a computation-intensive service, so calling the key management service directly for every encryption and decryption operation would result in unacceptable performance.

So what can we do instead? We can use envelope encryption. As the name implies, a data key encrypts the plaintext code data, and digital envelope technology protects that key while it is stored, transmitted, and used. Since only the ciphertext data key and the ciphertext code data are stored, user authorization is required to decrypt code data at runtime. Operation and maintenance personnel cannot read code data at rest.
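
The flow of envelope encryption can be sketched as follows. This is an illustration under heavy assumptions, not the scheme Codeup actually uses: all function names are hypothetical, the master key is a local stand-in for a key that would normally stay inside the key management service, and the cipher is a toy built from the Python standard library purely to keep the example self-contained. Real code should use a vetted authenticated cipher.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). For illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], pad))
    return bytes(out)


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt and authenticate: returns nonce + ciphertext + tag."""
    nonce = secrets.token_bytes(12)
    body = _keystream_xor(key, nonce, plaintext)
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def open_sealed(key: bytes, sealed: bytes) -> bytes:
    """Verify the tag and decrypt; raises ValueError on a wrong key."""
    nonce, body, tag = sealed[:12], sealed[12:-32], sealed[-32:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted data")
    return _keystream_xor(key, nonce, body)


def envelope_encrypt(master_key: bytes, code: bytes) -> tuple[bytes, bytes]:
    """Encrypt code with a fresh data key, then wrap that key."""
    data_key = secrets.token_bytes(32)
    # Only the wrapped data key and the ciphertext are ever stored.
    return seal(master_key, data_key), seal(data_key, code)


def envelope_decrypt(master_key: bytes, wrapped_key: bytes, ciphertext: bytes) -> bytes:
    """Unwrap the data key (requires the master key), then decrypt the code."""
    data_key = open_sealed(master_key, wrapped_key)
    return open_sealed(data_key, ciphertext)
```

The design point is that the expensive or remote operation (unwrapping the data key) happens once, while bulk data is handled locally with the data key.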

3. Encryption implementation based on native Git

The encryption capability is delivered as a code encryption patch on top of native Git, so that the advantages of native Git are preserved as far as possible.

Native Git has a top-down, layered architecture, as shown in the figure, much like a typical application architecture.

At the top is the presentation layer, which contains a complex array of command-line entries that are directly exposed to application services for invocation.

The middle layer is the business processing layer, which can be divided into reference operation and object operation from the perspective of data content. An encryption and decryption module is added to encrypt data in memory and write ciphertext data to disk, thus ensuring the security of static data.

For maximum performance, only the object data related to user code assets is encrypted, while data such as reference lists and object indexes remain in plaintext. With hardware acceleration, the extra performance cost of code encryption is kept to about 10%, which is barely noticeable to users.

Demonstration of local Git code encryption

We start with a repository configured for code encryption. The repository is empty.

Let’s add a file to it.

Looking at the binary contents of the resulting file with hexdump -c, we can see that it starts with the two bytes 78 01, which is a typical zlib compression header.

Next, we turn on encryption and create an encrypted commit record with git commit. Looking at the stored binary content of the commit record, we find that the object data created this time no longer starts with 78 01 but with the encryption marker we specified.

Note that, for reasons of time, we do not directly compare the unencrypted and encrypted states of the same object. Instead, we judge from the file header whether an object is encrypted.
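
A header check of this kind could look like the following Python sketch. The function name looks_like_zlib is hypothetical, and because the article does not state the actual encryption marker, the sketch only tests for a valid zlib header. It is a heuristic: an arbitrary marker could pass the same test by coincidence.

```python
import zlib


def looks_like_zlib(header: bytes) -> bool:
    """Check whether the first two bytes form a valid zlib header."""
    if len(header) < 2:
        return False
    cmf, flg = header[0], header[1]
    # The low nibble of the first byte must be 8 (deflate), and the two
    # bytes read as a big-endian number must be a multiple of 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


if __name__ == "__main__":
    print(looks_like_zlib(bytes.fromhex("7801")))
    print(looks_like_zlib(zlib.compress(b"example")))
```

An object whose header fails this test is not a plain zlib-compressed loose object.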

After the loose objects are encrypted, we can use git gc to convert them into packed objects and then look at what happens to the packfile. As the figure shows, the version field in the header of the encrypted packfile is no longer the original 00 00 00 02 but 82 00 00 02, which adds a specific identifier. In addition, the pack header grows from the original 12 bytes to 24 bytes: a 12-byte nonce is added to increase security.
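
The header change can be sketched in Python as follows. A standard pack header consists of the 4-byte signature PACK, a 4-byte version, and a 4-byte object count, all in network byte order. Two things are assumptions of this sketch rather than facts from the article: that the identifier is the top byte of the version field, and that the nonce occupies the 12 bytes directly after the standard header. The function name parse_pack_header is hypothetical.

```python
import struct

ENCRYPTED_MARKER = 0x82  # assumption: identifier carried in the top byte


def parse_pack_header(data: bytes) -> dict:
    """Parse a pack header and report whether it looks encrypted."""
    if len(data) < 12:
        raise ValueError("pack header is too short")
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    info = {
        "marker": version >> 24,
        "version": version & 0x00FFFFFF,
        "objects": count,
        "header_size": 12,
        "nonce": None,
    }
    if info["marker"] == ENCRYPTED_MARKER:
        if len(data) < 24:
            raise ValueError("encrypted pack header is too short")
        # Assumed layout: the 12-byte nonce follows the standard header.
        info["header_size"] = 24
        info["nonce"] = data[12:24]
    return info


if __name__ == "__main__":
    plain = b"PACK" + bytes.fromhex("00000002") + (3).to_bytes(4, "big")
    print(parse_pack_header(plain))
```

A Git build without the patch reads the same four version bytes as one large number, which would explain why it treats such a packfile as an unsupported version.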

So, after we remove the key configuration, can we continue to access the repository?

After we remove the key and try to check the current version with git show HEAD, the missing key produces an error message saying that the key was not provided.

This error message comes from our customized Git, which adds the code encryption capability on top of native Git.

For an encrypted packfile, the message indicates that the pack version is too new and asks the user to upgrade Git. For loose objects, the message says that the file header is incorrect, because it is not a zlib compression header.

You are welcome to try Codeup, the cloud efficiency code management service, which protects enterprise code assets comprehensively and helps enterprises achieve safe, stable, and efficient code hosting and R&D management.

This article is original Alibaba Cloud content and may not be reproduced without permission.