CODING Code Asset Security Series: Building Full-Link Security Capabilities to Safeguard Code Assets

Author: Wang Zhenwei, R&D Director at CODING
A member of CODING's founding team, with many years of system software development experience and expertise in Linux, Golang, Java, Ruby, Docker and related technologies. For the past two years he has worked on system architecture and operations at CODING.

Different types of corporate assets are managed in different ways, but asset security is the top priority without exception. Even so, there is no shared understanding of how to keep code assets secure. This article gives a comprehensive account of code asset security and attempts a full-link analysis across the life cycle of code management, so that readers can use it to review the security of their own company's code assets.

What is code asset security

Code asset security is not equal to information security

Code asset security is not the same as information security, which is easy to understand. An enterprise's information systems consist of far more than code assets; in most cases code assets are not involved at all. These systems are usually made up of basic computing facilities, network platforms, software, and databases. Information security focuses on the security problems that arise in operating these facilities after they go into production. Most software packages are compiled from source code and run separately from the source code itself. Information security therefore has a broader scope: code asset security is only one part of it, and often not the part that receives the most attention.

Code asset security is not equal to code security

Code asset security is not the same as code security either, which is less obvious. Code security usually refers to the security of the code itself, for example whether it contains remote code execution vulnerabilities, injection vulnerabilities, and so on. Code asset security is a management concept: it emphasizes the security of the management process rather than of the code itself. For example, a research institution that studies a computer virus needs to store the virus's source code in its source code repository, and that source code is an important asset of the institution.

The source code management system does not care whether the stored code contains vulnerabilities or malicious behavior; it must faithfully ensure that the code files are stored exactly as they were submitted.

Code asset management is whole-life-cycle management centered on the code repository

The core of code asset management is the code repository, which stores all of the company's code, configuration files, and every historical version. Safeguarding code assets means building full-link security capabilities around three key links of the code repository: check-in, storage, and check-out.

Check-in security

Check-in is the process by which a developer edits code in the development environment and transfers it to the code repository. This link focuses on two aspects: confidentiality and integrity.

Confidentiality

Confidentiality means that the code cannot be stolen by a third party while a developer checks it in from the development environment to the code repository. It is generally achieved by encrypting the transmission. Git repositories most commonly use the HTTPS and SSH transport protocols.

HTTPS is HTTP layered on Transport Layer Security (TLS). HTTP is a clear-text protocol, which means that without TLS any routing device along the network path can easily read the code. TLS establishes a bidirectionally encrypted channel on top of TCP, and HTTP carried over that channel is HTTPS. The HTTPS client and server first use asymmetric cryptography to negotiate an encryption algorithm and a key, and then use the negotiated algorithm and key for symmetrically encrypted transmission. This article does not discuss the security of specific algorithms; because algorithms keep pace with advances in cryptography, we can treat the encryption algorithm itself as secure.

However, this process alone is not enough. An attacker can set up an intermediate server and trick the client into connecting to it, so that the client carries out its encrypted communication with the attacker's server. The code is then stolen by the malicious server even though the transmission is encrypted. This is a man-in-the-middle attack.

The industry introduced the CA (Certificate Authority) mechanism to deal with this problem. Before a server offers encrypted transmission, it must bind its public key to its service domain name and have that binding certified by a publicly trusted CA. When an HTTPS client tries to establish an encrypted connection, it asks the server to present the certificate issued by the CA. The client uses the CA public keys pre-installed in the operating system or browser to verify the certificate and confirm that the server owns the domain name, which guards against man-in-the-middle attacks. CAs may be publicly trusted industry CAs or enterprise-internal CAs; the latter require the internal CA's certificate file to be installed on the client.
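As a minimal sketch of what this client-side check amounts to, Python's standard ssl module can build a context that requires a certificate chaining to a trusted CA and matching the requested host name. The helper name strict_tls_context is an illustrative assumption and is not part of Git or CODING.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Return a client context that trusts only the system CA store."""
    context = ssl.create_default_context()
    # Both settings are already the defaults; they are repeated to document intent.
    context.check_hostname = True            # certificate must match the domain name
    context.verify_mode = ssl.CERT_REQUIRED  # certificate must chain to a trusted CA
    return context


context = strict_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

For an enterprise-internal CA, the CA certificate file would additionally be loaded into the context with load_verify_locations.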

Qualys, a well-known security company, offers an online service that evaluates the SSL/TLS configuration of HTTPS servers and produces a report. The following picture shows its evaluation of the code hosting servers run by two cloud computing companies in China:

Although HTTPS secures the transmission, a Git repository served over HTTPS still relies on the Basic Auth mechanism to authenticate users. The Git server asks the HTTPS client for an account name and password, which the client attaches to the request header and sends to the server so that the server can confirm the operator's identity. The account name and password are encrypted by TLS along with the rest of the request, so there is no need to worry about the password leaking in transit. However, developers usually do not want to type the password for every operation, so they let the computer remember it, and careless handling here can lead to leaks. Above all, never splice the account name and password into the remote repository URL. The correct approach is to use Git's credential manager for the operating system: for example, the keychain on macOS and Git Credential Manager for Windows on Windows.
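The rule against splicing a password into the remote URL can be checked mechanically. The following is a small sketch using only the Python standard library; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def has_embedded_password(remote_url: str) -> bool:
    """True if an HTTP(S) remote URL carries a password in its userinfo part."""
    parts = urlsplit(remote_url)
    return parts.scheme in ("http", "https") and parts.password is not None


# Hypothetical remote URLs, for illustration only.
print(has_embedded_password("https://alice:s3cret@git.example.com/team/app.git"))  # True
print(has_embedded_password("https://git.example.com/team/app.git"))               # False
```

A check like this could run over the output of git remote -v in a developer onboarding script; storing the password itself is left to the credential helper configured through Git's credential.helper setting.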

SSH is a secure encrypted protocol commonly used to administer Linux/Unix servers remotely, and it serves many other purposes. Git-based code hosting often uses it for encrypted code transmission as well. Once a user has registered their public key on the server, the server can confirm their identity in subsequent transfers.

SSH uses asymmetric cryptography (the user's key pair) to confirm identity and symmetric encryption to transmit data. Unlike HTTPS, SSH does not tie the server's key to a domain name through publicly trusted CAs, so the CA mechanism described above cannot be used to prevent man-in-the-middle attacks.

However, when an SSH client connects to an unknown server, it displays the fingerprint of the server's public key. The user should compare the fingerprint shown on the command line with the fingerprint published by the service provider to confirm the server's identity and rule out a man-in-the-middle attack.

The figure shows the public key fingerprint of Tencent Cloud CODING SSH server:

The following figure shows the server public key fingerprint that the SSH client asks the user to confirm when first connecting to the server:

After the user confirms the server's identity (by typing yes and pressing Enter), the SSH client records the server's public key in ~/.ssh/known_hosts and connects directly the next time without asking.
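The fingerprint comparison can also be done programmatically. The sketch below computes an OpenSSH-style SHA256 fingerprint (the base64 of the SHA-256 digest of the raw key blob, without padding) from a public key line and compares it with a published value. The key material in the example is made up for illustration and is not a real server key.

```python
import base64
import hashlib


def ssh_sha256_fingerprint(public_key_line: str) -> str:
    """Fingerprint of a line of the form '<key-type> <base64-blob> [comment]'."""
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest in base64 without the trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def matches_published(public_key_line: str, published_fingerprint: str) -> bool:
    """Compare the computed fingerprint with the one the provider announced."""
    return ssh_sha256_fingerprint(public_key_line) == published_fingerprint


# Hypothetical key line: the blob is arbitrary bytes, not a valid SSH key.
fake_blob = base64.b64encode(b"illustrative key bytes, not a real key").decode("ascii")
example_line = "ssh-ed25519 " + fake_blob + " demo@example"
print(ssh_sha256_fingerprint(example_line))
```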

Summary of main points

Transfer code over an encrypted protocol; both HTTPS and SSH are suitable
With HTTPS, check the authority of the issuer (CA) of the server's certificate
With HTTPS, check whether an untrusted CA certificate has been installed on the client (to prevent CA fraud)
Use a Git credential manager to store the account name and password used for Git over HTTPS
With SSH, carefully compare the public key fingerprint presented by the server with the fingerprint published by the service provider to prevent man-in-the-middle attacks
On the client, guard against attackers tampering with ~/.ssh/known_hosts or with the SSH client configuration (which can be set to skip the server public key check)
Keep the SSH private key file (usually ~/.ssh/id_rsa) safe; for example, on Linux set its permissions to 400 so that other users cannot read it

Integrity

The integrity of code check-in includes two aspects:

Whether the code changes submitted by a developer arrive complete (the content has not been tampered with)
Whether a given commit really was made by the developer it names (the author has not been impersonated)

Taking Git as an example, this version control software has a built-in mechanism that protects content from tampering: it uses a Merkle-like hash tree to verify content level by level.

A hash is an algorithm that maps data of arbitrary length to data of a fixed length, and the mapping is irreversible. A small change in the original data produces a large and unpredictable change in the result. The fixed-length result is called a fingerprint.

Hashing is well suited to quickly checking whether two pieces of data are identical (if the fingerprints match, the originals almost certainly match). Comparing the public key fingerprint shown by the SSH server with the fingerprint published by the service provider, as described above, is an application of this principle.
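A short illustration of these properties, using SHA-1 from Python's standard hashlib module (the input strings are arbitrary examples): inputs of different lengths map to fingerprints of the same length, and a one-character change yields a different fingerprint.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-1 fingerprint of the data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


short_input = fingerprint(b"print('hello')\n")
changed_input = fingerprint(b"print('hallo')\n")     # one character differs
long_input = fingerprint(b"x" * 1_000_000)           # a much larger input

print(len(short_input), len(changed_input), len(long_input))  # 40 40 40
print(short_input == changed_input)                           # False
```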

Merkle hash tree:

Git hashes the content of each file in the repository together with its basic metadata. The paths and hashes of all files under a directory are combined and hashed again to form the hash of that directory tree. The directory tree hash is then combined with the commit information and hashed once more, and the result is the Git version number (the commit ID). Every commit therefore produces a completely different version number, which is itself a hash. Given a version number, we can be confident that all the file content, history, commit information, and directory structure behind that version are exactly as they were: content behind a fixed version number cannot be altered without changing the number.
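The layered hashing can be reproduced in a few lines. The sketch below follows Git's object format for blobs and for a flat tree of regular files; it is a simplified illustration (no subdirectories, no commit object) rather than a full reimplementation, and the file names are made up.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over the header '<kind> <size>' plus a NUL byte, followed by the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    """Object ID of a file's content."""
    return git_object_id("blob", content)


def tree_id(files: dict[str, bytes]) -> str:
    """Object ID of a flat directory of regular files (mode 100644), sorted by name."""
    body = b""
    for name in sorted(files):
        entry_hash = bytes.fromhex(blob_id(files[name]))
        body += b"100644 " + name.encode("utf-8") + b"\0" + entry_hash
    return git_object_id("tree", body)


print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

before = tree_id({"README.md": b"v1\n", "main.py": b"print('hi')\n"})
after = tree_id({"README.md": b"v2\n", "main.py": b"print('hi')\n"})
print(before != after)  # True: changing one file changes the tree hash
```

Because the tree hash feeds into the commit hash, the same change propagates up to the version number.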

Hash algorithms have a small probability of collision (the same fingerprint corresponding to different original data), which can defeat the consistency check. Hash algorithms therefore also keep advancing: MD5, for example, is now considered obsolete. Git currently uses SHA-1 and may move to the more secure SHA-256 in the future.

The following figure shows the content information of a directory tree in Git:

Even though Git's layered hashing ensures that the content of a submitted version cannot be maliciously tampered with, there is still a risk of impersonation.

Git does not verify the committer's identity when a commit is created, and a commit can be transferred and displayed by different people along the way. Imagine how dangerous it would be if an attacker made a commit in the name of a company employee and other employees took it for an insider's work. The common industry practice with Git is to introduce the GPG signature mechanism.

GPG is an application of asymmetric cryptography. A private key is used to process a piece of information and produce a new piece of information that only that private key could have generated; the corresponding public key can then be used to verify where it came from. This new piece of information is called a digital signature.

Put simply, the publisher signs the information to be published (the document) with their private key (their private seal) and sends both the original document and the digital signature to the recipient. The recipient holds the publisher's public key and uses it to check the signature against the document, confirming that it really was issued by the publisher and not by an impersonator. This is similar to stamping a seal on the document.
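The sign-and-verify idea can be sketched with textbook RSA in pure Python. This is a toy under stated assumptions: the key is built from two small known primes, there is no padding, and it is not how GPG is implemented; it only shows that a signature made with the private exponent verifies with the public one and fails for altered content.

```python
import hashlib

# Toy key: two Mersenne primes. Far too small for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def digest_as_int(message: bytes) -> int:
    """Hash the message and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Only the holder of D can produce this value."""
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Anyone holding (N, E) can check the signature."""
    return pow(signature, E, N) == digest_as_int(message)


commit_text = b"fix login bug\nauthor: alice@example.com\n"  # hypothetical commit
signature = sign(commit_text)
print(verify(commit_text, signature))                     # True
print(verify(commit_text + b"tampered", signature))       # False
```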

The figure shows a Git commit to which the developer has added a GPG signature:

Summary of main points

Git's own hashing mechanism ensures that content is not tampered with
Sign commits with GPG to prevent impersonation
The server needs to verify the email address declared in each Git commit against its GPG signature

Storage security

Storage security concerns how to ensure the confidentiality, integrity, and availability of data after the code has been checked into the repository. Leaving aside the security of the underlying infrastructure, the data of a code hosting service usually consists of database data and repository files. Here we focus on the storage security of the repository files.

Confidentiality

Most code repositories are stored directly on the operating system's disk. When the server software reads and writes them there is no network transmission and hence no transmission confidentiality risk, but files written directly to disk are not otherwise protected: many unrelated processes on the operating system can often read and write them at will, and such unexpected reads and writes introduce additional risk.

One approach is to control the read and write permissions of each file, for example by uniformly setting them to 600. Another is simply to run only one business process on the server, which achieves isolation at the operating system level.
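The first approach might look like the following sketch, which walks a repository directory and restricts every file to its owner. The directory layout in the demonstration is made up, and the code assumes a POSIX file system.

```python
import os
import stat
import tempfile


def restrict_to_owner(root: str) -> None:
    """Set every file under root to 600 and every directory to 700."""
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chmod(dirpath, 0o700)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), 0o600)


def is_owner_only(path: str) -> bool:
    """True if neither group nor others hold any permission bit on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstration on a temporary directory standing in for a repository.
with tempfile.TemporaryDirectory() as repo_root:
    object_file = os.path.join(repo_root, "objects", "pack.bin")
    os.makedirs(os.path.dirname(object_file))
    with open(object_file, "wb") as handle:
        handle.write(b"example data")
    os.chmod(object_file, 0o644)          # readable by everyone
    restrict_to_owner(repo_root)
    print(is_owner_only(object_file))     # True
```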

Container technology offers a good way to isolate processes. Under Kubernetes, for example, the code repository is stored on a PV (PersistentVolume) that is mounted only in the application container serving the repository, so only that container reads and writes it. Container-based scheduling and elasticity also support high availability well and avoid wasting resources.

Integrity and availability

Git's hash verification ensures the integrity of a repository, but only if the repository files themselves are intact. If files in the repository are lost or damaged, hash verification alone cannot bring them back. There are many ways to protect data integrity. The most common ones, such as cold backup, semi-real-time backup, real-time backup, and disk snapshots, all aim to make lost or damaged files recoverable. In general, however, backup is an after-the-fact recovery method and cannot heal the repository immediately, so recovery based on backups usually affects availability.

Although the industry has no general high-availability solution for code repositories, the database master-slave strategy and the RAID mechanism are two practices worth borrowing from. Here is a brief introduction.

In one form of the database master-slave strategy, data is written to the master database and the slave database synchronizes it incrementally and automatically. When the master fails, the slave takes over automatically. Code storage can work the same way, with storage nodes divided into master nodes and slave nodes.

RAID is a redundancy mechanism that stores data in stripes across disks. There are many variants. RAID 5, for example, stripes the data and also stores parity information, so that when any one disk fails its data can be rebuilt from the parity information.
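The parity idea can be shown with byte-wise XOR. This is a simplified sketch: real RAID 5 rotates the parity block across disks, whereas here three example data blocks and one parity block are plain byte strings.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data_blocks = [b"repo", b"obj1", b"obj2"]   # one stripe spread over three disks
parity = xor_blocks(data_blocks)            # stored on a fourth disk

# Suppose the disk holding data_blocks[1] fails: XOR the survivors with the parity.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(rebuilt == data_blocks[1])  # True
```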

Tencent Cloud CODING DevOps has studied this area in depth and combined the master-slave and RAID ideas into a high-availability strategy for code repositories that safeguards repository integrity.

As shown in the figure, the master copy of repository D, D(m), is stored on the second node, and its slave copy, D(s), is stored on the first node (more slave copies can be configured; only one is shown to keep the figure simple). With this design no node's computing resources sit idle, and a damaged node can be recovered quickly.

Check-out security

Code can only be used after it is checked out. Check-out involves the same transmission confidentiality issues as check-in. For Git repositories, integrity during check-out is guaranteed by Git's hash verification mechanism and is not a major concern. Security problems at check-out are usually caused by inappropriate permission policies and key management, which lead to code leaks.

Code inside an enterprise is usually accessed in the following four scenarios:

Check-out for development
Read-only review
Automated execution (CI, automated testing, etc.)
Management and audit

Check-out permissions for development

Distinguish the scope of what each developer can read and write, and protect key resources and secrets, according to the following principles:

Store code in separate repositories by business line, component, and so on, and isolate the repositories
Configure repository permissions according to departments and organizational relationships
Set read and write permissions on branches so that only authorized members can write
Use file locking to protect sensitive files from accidental modification
Standardize on one transport protocol, for example allow only HTTPS or only SSH
Set a validity period for personal passwords, tokens, public keys, etc.
Audit the usage records of passwords, tokens, public keys, etc.
Set read and write permissions on directories so that only designated developers can read or write certain directories
Prohibit force pushes to prevent code from being rolled back
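The last principle comes down to accepting only fast-forward updates: the branch's old tip must still be reachable from the new tip. A minimal sketch of that check follows; the commit names and the parents mapping are hypothetical stand-ins for a real commit graph.

```python
def is_ancestor(ancestor: str, commit: str, parents: dict[str, list[str]]) -> bool:
    """True if ancestor is reachable from commit by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def allow_push(old_tip: str, new_tip: str, parents: dict[str, list[str]]) -> bool:
    """Accept only fast-forward updates, so existing history is never dropped."""
    return is_ancestor(old_tip, new_tip, parents)


# Hypothetical history: c1 <- c2 <- c3, plus x2 which rewrites history after c1.
history = {"c1": [], "c2": ["c1"], "c3": ["c2"], "x2": ["c1"]}
print(allow_push("c2", "c3", history))  # True: normal push
print(allow_push("c3", "x2", history))  # False: force push that discards c2 and c3
print(allow_push("c3", "c2", history))  # False: rollback to an older commit
```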

As shown in the figure, set directory permissions in the repository:

Read-only review permissions

Reviewers need to read the source code and supporting information and give a review result without writing code. Follow these principles:

Separate members into read-write and read-only groups, and disable write permission for the latter
Separate in-depth review from lightweight review; for the latter, disable code check-out and allow the source code to be viewed only on web pages
Use the CODEOWNERS mechanism to assign reviewers automatically
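How a CODEOWNERS file maps changed paths to reviewers can be sketched as follows. This is a simplified model: it assumes the common convention that the last matching rule wins, and it uses shell-style wildcards from Python's fnmatch module instead of full gitignore-style patterns. The rules and owner names are invented, and CODING's actual matching rules may differ.

```python
from fnmatch import fnmatch


def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Parse lines of the form '<pattern> <owner> [<owner> ...]'."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """Return the owners of the last rule that matches the path."""
    matched: list[str] = []
    for pattern, owners in rules:
        in_directory = path.startswith(pattern.rstrip("/") + "/")
        if fnmatch(path, pattern) or in_directory:
            matched = owners
    return matched


# Hypothetical CODEOWNERS content.
rules = parse_codeowners("""
# default owners
*        @dev-team
*.go     @backend
deploy/  @ops
""")
print(owners_for("main.go", rules))           # ['@backend']
print(owners_for("deploy/prod.yaml", rules))  # ['@ops']
print(owners_for("README.md", rules))         # ['@dev-team']
```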

As shown in the figure, set the CODEOWNERS of the repository:

Automated execution permissions

Automated check-out has no individual person behind it and does not write code back. Follow these principles:

Forbid members from using their personal passwords, tokens, and keys for automated execution
Use project/repository tokens and deploy public keys so that each token or key is authorized only for the specified repository
Create a dedicated token for each scenario; do not share tokens across scenarios or use them for other purposes
Set a validity period for tokens, public keys, etc.
Deny write permission to tokens, public keys, etc.
Audit the usage records of tokens, public keys, etc.
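These principles can be expressed as a small authorization check. The DeployToken type, its fields, and the repository names below are assumptions made for illustration; they do not describe CODING's token model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DeployToken:
    repository: str        # the only repository this token may access
    can_write: bool        # automation tokens should normally be read-only
    expires_at: datetime   # every token has a validity period


def is_allowed(token: DeployToken, repository: str, action: str, now: datetime) -> bool:
    """Allow an action only within the token's repository, scope, and lifetime."""
    if now >= token.expires_at:
        return False
    if repository != token.repository:
        return False
    if action == "read":
        return True
    if action == "write":
        return token.can_write
    return False


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
ci_token = DeployToken("team/app", can_write=False, expires_at=now + timedelta(days=30))
print(is_allowed(ci_token, "team/app", "read", now))                        # True
print(is_allowed(ci_token, "team/app", "write", now))                       # False
print(is_allowed(ci_token, "team/other", "read", now))                      # False
print(is_allowed(ci_token, "team/app", "read", now + timedelta(days=31)))   # False
```

Logging each call to a check like this would also provide the usage records that the audit principle asks for.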

As shown in the figure, set the permissions and validity period of the token:

Management and audit permissions

In this scenario, non-technical staff want to see repository statistics, activity, the progress of the R&D process, and so on. Follow these principles:

Give members access to the web pages for the list and details of the repositories under their responsibility
Forbid members from checking out source code locally over HTTPS or SSH
Forbid members from downloading source code archives from the web

As shown in the figure, set permissions such as prohibiting writes to the repository:

Summary

Code asset management is a systematic undertaking, and its security cannot be guaranteed at any single point. Risk must be analyzed along the whole chain across three links: check-in, storage, and check-out. Many companies take security seriously but focus on the wrong areas; they may invest a great deal of effort and still run a serious risk of losing or leaking their code assets. I hope this article helps companies take the security of code assets seriously and gives those who manage code assets a basic framework for reviewing it.

Let CODING safeguard your code assets
