Preface

Man-machine confrontation aims to unite the various security teams in jointly combating black- and gray-market operations. For historical reasons, business teams accessed the various security capabilities through more than a dozen different systems and protocols. This fragmented state was neither convenient for businesses accessing security capabilities from the outside, nor conducive to coordination and joint construction among the security teams on the inside. To improve efficiency on all of these fronts, cloud services were used extensively in building the man-machine confrontation service, with good results. Looking back, moving our security capabilities to the cloud was a journey from vague to clear, from hesitant to firm; this post is a brief share of that experience.

Moving to the cloud

What is the cloud

As I understand it, the essence of the cloud is free and flexible resource sharing. Resources can join at any time and leave at any time, like clouds drifting in the sky: they come and go without certainty, rising and falling, and the cloud you look at is never quite the same cloud twice.

Expectations of the cloud

From a computing perspective, the ideal cloud is one where computing, networking, and storage resources become the water and electricity of daily life: flip a switch and they are simply there, with nothing left to worry about. Compared with the era of physical-machine deployment, cloud users no longer go mad over data lost to a machine crash, nor work overtime to restore a service because of hardware damage.

  • Automatic disaster recovery (restart on abnormal exit, failure migration)
  • Easy multi-site deployment (multiple clusters)
  • Resource isolation: a misbehaving neighbor crashes alone and takes no one down with it
  • Fast capacity expansion: resources arrive on call, ready for battle

    How perfect everything would be: good news for the development, operations, and testing folks!

Analyzing the move to the cloud

With the improvement of the company's basic services, the existing in-house infrastructure can now support our move to the cloud. After investigation, we found that the company's cloud-related platforms and deployment modes are as follows:

CVM

CVM virtualizes resources such as disks and CPUs on top of physical machines. The way users operate it is essentially the same as a physical machine, although it avoids the pain of machine decommissioning; at the level of user experience there is no essential change.

Containerized deployment platform

Docker containerization is currently the mainstream cloud deployment mode in the industry. It lets us build once and run anywhere, which perfectly meets the requirements of free scheduling and resource isolation. The system environment is strongly maintained by construction: every program, script, and configuration lives in the image, so nothing can be lost or forgotten. The physical-machine-era problems of unrecoverable scripts and configurations after machine damage, and of system maintenance that relied on self-discipline or imposed conventions, simply disappear.

The emergence of container orchestration and scheduling systems such as Kubernetes (K8S) brings powerful platform features such as automatic disaster recovery, failure migration, and multi-cluster deployment, which takes us much closer to the goal of true cloud services. On top of the K8S container scheduling mechanism, the company has developed a series of deployment platforms, such as the 123 platform, GaiaStack, and TKE, which integrate neatly with the automatic association management of addressing services such as L5 and Polaris, providing complete platform support for cloud services. In addition, flexible resource allocation through the resource management platform makes using cloud computing even more convenient. For example, applying for TKE container resources (CPU, memory, storage, and so on) is as smooth as placing an order on Taobao: resources arrive quickly, and with an expedited approval they can be in place within minutes. My first experience left me surprised and full of praise.

Based on an in-depth understanding and analysis of the company's services, we finally decided to move the man-machine confrontation service to the cloud on the TKE deployment platform, using Docker container deployment.

The core impact of the cloud on development

The core change brought by moving to the cloud is that resources become mutable. To make system resource scheduling easier, the IP of a service node is allowed to change. After moving to the cloud, we must face IP changes in the upstream business side, in the service itself, and in the downstream dependencies, which gives rise to a series of constraints and requirements:

  1. Upstream changes: authenticating clients by source IP is no longer feasible; a more flexible authentication method is needed
  2. Changes to the service itself: the external service address can be provided by binding the service address to Polaris. If a downstream dependency authenticates us by source IP, it must be modified to support a more flexible authentication method. In most cases the service also performs routine operations on itself, such as frequent configuration changes; the old operation-and-maintenance tools are no longer feasible, so a centralized configuration center is needed
  3. Downstream changes: not a problem, as long as L5 or Polaris automatic addressing is provided, and the platform already offers this service management capability

System architecture and cloud planning

The main module of the man-machine confrontation data center is the variable-sharing platform, which has two core modules: the query service module, and the Web module that backs the variable-management API. Both are developed on the TRPC-GO framework. The system architecture diagram is as follows:

Leaving aside some dependent systems, TKE is currently used only to deploy the two core parts on the cloud. The overall TKE deployment architecture is as follows:

The deployment plan for the whole system is to create two workloads on TKE, BLACK_CENTRE and HTTP_APISERVER. These two parts are the core: BLACK_CENTRE carries users' variable queries, while requests from the Web side pass through the smart gateway and CLB and then enter HTTP_APISERVER for the real business processing, which mainly supports case inspection, system variable management, variable-query access applications, and other functions. You may notice that HTTP_APISERVER is reached through CLB rather than directly from the smart gateway. The main reason is that after the module moves to the cloud, the IP of a compute node may change at any time, but when applying for a domain name the company's process only supports a fixed IP for the backing service address, not a Polaris or L5 configuration. CLB provides a fixed VIP, which solves this problem nicely.

A quick word on CLB (Cloud Load Balancer): it is a service that distributes traffic across multiple cloud servers. CLB extends an application's external service capacity through traffic distribution and improves availability by eliminating single points of failure. The most common usage is to forward traffic automatically to a rule-bound workload based on associated forwarding rules (access port, HTTP URL, and so on). Back in the man-machine confrontation scenario, we mainly deal with low-load HTTP services, and multiple services can share the same service address as long as each configures its own URL forwarding rules. For example, suppose cluster services A and B both provide external HTTP services and both need port 80. In the traditional way at least two machines would be required, but with a shared CLB we only need to configure URL distribution rules to route different interfaces to the corresponding cloud services. Nginx has similar features, of course, but in ease of use, maintainability, and stability, CLB is a notch above.
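For intuition only, here is a minimal Go sketch of the URL-based fan-out described above, using the standard library's reverse proxy. CLB itself is a managed service configured in the console, not code you write, and the backend addresses below are hypothetical:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy that forwards to the given backend.
func newProxy(backend string) *httputil.ReverseProxy {
	u, err := url.Parse(backend)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Hypothetical workload addresses; a real CLB keeps these in its
	// forwarding rules rather than in code.
	serviceA := newProxy("http://10.0.0.10:8080")
	serviceB := newProxy("http://10.0.0.20:8080")

	mux := http.NewServeMux()
	mux.Handle("/a/", serviceA) // URL rule: /a/* -> cluster service A
	mux.Handle("/b/", serviceB) // URL rule: /b/* -> cluster service B

	// A and B share one entry address on port 80.
	log.Fatal(http.ListenAndServe(":80", mux))
}
```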

When deploying an application on TKE, containers scale in and out frequently and a container is generally not pinned to a fixed IP, so a cloud application should be stateless by design. The image should be as small as possible, and the business logic should follow a microservice style, avoiding too much logic mixed into one service. Since the man-machine confrontation modules are new, they carry little legacy burden: although the protocol is flexible and compatible, each module is essentially an independent function with a single responsibility, which fits this model well.

As for how to apply for resources, create a namespace, create a workload, and so on, the process is long and I have lost many of the screenshots; the product help documentation provides good guidance, see https://cloud.tencent.com/doc… . Although I use the intranet TKE, the overall experience of the internal and external cloud services is much the same.

Throughout my use of TKE I have consistently found it powerful and stable, but the feature I most urgently want is cloning a workload from an existing one: in real usage scenarios, I often want to take an existing workload, tweak two or three parameters (most likely the load level or image version), and create a new workload quickly. Common cases are test validation, multi-site deployment, redeploying a heavily customized workload (many workload parameters cannot be changed and force a rebuild), or even deploying a new service with the same resource usage and operating pattern as an existing one. At present, creating a new workload means filling in a long list of parameters, which feels tedious. A small ask, a big step forward: if this were solved, I believe TKE's convenience would rise another notch.

Finding your service in the cloud

From the architecture analysis above, with the image ready, our service is already running on the TKE platform. But how can others find my service address? Some might say: just enter the service address into L5 or Polaris. But don't forget that while cloud nodes are running, service IPs can change at any time, so we need a way to associate the changing addresses of the cloud service with Polaris and have the Polaris address list managed through that association, so that a workload can move without anyone noticing. TKE provides exactly such a feature, which is perfect for our purpose. The following steps solve the problem:

  1. Create the workload, that is, get the service running; ours is already fine.
  2. Create a corresponding Polaris service for the workload, for later use.
  3. Create a new Polaris association rule: first enter the Polaris association page, as shown in the figure below

Note: select the corresponding business cluster, then enter the creation page:

Here we fill in the information of the Polaris service we just created, associate it with the specified container service, and submit. Polaris is now bound to the dynamic service addresses

As we scale the container services under the workload, we can see addresses being added to or removed from the Polaris service's address list accordingly, so the service address seen by the business side remains valid through deployment changes and disaster-recovery migrations.

At the same time, to stay compatible with the habits of old L5 users, Polaris supports creating an L5 alias for a Polaris service; using the alias, users can happily address the same Polaris-published service the L5 way. Go to "Service Management" -> "Service Alias" in the left menu bar of the Polaris console to create an alias

Past analysis shows that old versions of L5Agent are not compatible enough in this respect, which can be solved by upgrading L5Agent to the latest version.

How the cloud changes authentication thinking

After the system moves to the cloud, node IPs become mutable, and the most typical consequence is the impact on authentication. The old mode of authenticating by source IP no longer applies; a more flexible method is needed. Two authentication schemes are common in the industry:

  1. SSL/TLS authentication, which is more often used for transport encryption than for access control, and is already supported by the TRPC-GO API
  2. Token-based authentication, which is the key scheme this system implements

Although there are many methods, we need to choose the solution according to the scenario and its requirements.

Upstream access authentication

When a user's application accesses us, it needs to be authenticated, and authenticating users by source IP certainly no longer works. On balance, we chose token-based authentication: when users apply for access, we issue them an appid and a token; on each request, we require them to carry a timestamp, a sequence number, the appid, a random string, and a signature in the header. The signature is generated from the combination of timestamp, sequence number, appid, random string, and token. When the server receives a request, it generates the signature from the corresponding fields using the same algorithm as the client and compares it with the signature in the request: if they match, the request passes; otherwise it is rejected. The timestamp can additionally be used to prevent replays.
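A minimal Go sketch of this signing scheme. The original does not name the signature algorithm or the field order, so HMAC-SHA256 and the concatenation order below are assumptions; the replay window is likewise illustrative:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// sign derives the request signature from the header fields and the
// secret token. HMAC-SHA256 and the field order are assumptions; client
// and server only need to agree on both.
func sign(timestamp, seq, appID, randStr, token string) string {
	mac := hmac.New(sha256.New, []byte(token))
	mac.Write([]byte(timestamp + seq + appID + randStr))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify re-computes the signature on the server side and checks the
// timestamp window to reject replayed requests.
func verify(timestamp, seq, appID, randStr, sig, token string) bool {
	var ts int64
	if _, err := fmt.Sscanf(timestamp, "%d", &ts); err != nil {
		return false
	}
	if d := time.Since(time.Unix(ts, 0)); d > 5*time.Minute || d < -5*time.Minute {
		return false // stale or future timestamp: possible replay
	}
	expected := sign(timestamp, seq, appID, randStr, token)
	return hmac.Equal([]byte(expected), []byte(sig))
}

func main() {
	ts := fmt.Sprintf("%d", time.Now().Unix())
	sig := sign(ts, "1", "app-123", "n0nce", "secret-token")
	fmt.Println("verified:", verify(ts, "1", "app-123", "n0nce", sig, "secret-token"))
}
```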

Knocknock, an internal verification platform, offers a simple token-signature verification method based on TRPC-GO authentication. However, to keep user access simple and reduce platform dependencies, man-machine confrontation ultimately built its own token authentication along the lines above.

Downstream dependency authentication

At present our downstream is relatively simple. The big-data query agent (see the architecture diagram) has been modified to support token authentication, and CKV+ authenticates by password at login time, so apart from CDB there are few problems. For CDB access, the security specification forbids using root and requires IP-based authorization, and authorization must name a specific IP in advance, while in the cloud container IPs drift frequently. For these cases, TKE has been planning an authorization integration with downstream services, including CDB, and each service can also integrate with TKE to get registration working. In essence this adds an automatic registration mechanism between CDB and TKE: as service IPs change, they are automatically registered into CDB's authorization list, logically similar to the association between Polaris and TKE workload changes.

Why TRPC-GO

At the beginning of building the man-machine confrontation service platform, we faced the question of language and framework selection. The department was used to C++, and on the framework side there were SecAppFramework and SPP Framework. How should we choose for the new system? Before answering, let's go back to the problems and goals we faced:

  1. Large traffic and high concurrency requirements, demanding high machine resource utilization
  2. Rapid business growth that needs efficient support across development, operations, releases, problem locating, and data analysis
  3. The cloud is where every service in the company is heading, and we would be using the company's various cloud platforms and services; the language and framework chosen should make those capabilities convenient and fast to use. C++ with AppFramework felt too heavyweight, with many services poorly supported or hard to use

Facing choices such as the department's old framework, SPP Framework, TRPC-CPP, and TRPC-GO, we finally selected TRPC-GO after weighing performance, development convenience, concurrency control, and the richness of surrounding service support, all with the goal of improving engineering efficiency. The detailed analysis was as follows:

  1. Golang is a simple language with a complete and rich package ecosystem, and its performance is close to C++, but it supports coroutines natively and makes concurrency control simple. A simple concurrent design can wring the machine's resources dry, with a lower mental burden and stronger productivity than C++.
  2. The company's coroutine frameworks, such as SPP Framework and TRPC-CPP, are all C++-based. Moreover, under SPP a single worker process can use at most one core, and the proxy itself becomes a bottleneck. We had used TRPC-CPP before as well, and its complexity is relatively high
  3. TRPC is an OTeam project promoted by the company and is constantly improving. Developing a TRPC-GO service interface is simple, and the supporting components for surrounding services are rich: add them to the configuration file and they just run, covering Polaris/L5 addressing, Zhiyan logging and monitoring, the various storage access components (MySQL, Redis, and so on), and the R configuration service. The essentials of every development step are basically covered, and given our prior experience and familiarity, calling the various services is straightforward

In building the system with TRPC-GO, apart from a few issues while getting familiar with it, we hit no big pits, and coding mistakes could be located quickly; we never ran into mysterious, inexplicable problems. Overall it was smooth sailing, with little mental burden, more focus on the business, and strong productivity

The power of the ecosystem

Beyond the module's core logic, a series of peripheral services are needed to make the service more stable and operations more efficient, such as logging, monitoring, and a configuration center

Unified logging

On cloud nodes, writing local logs is a poor fit for problem locating. The core reason: with local logs you first have to find which node hit the problem before you can even look, which is a hassle in itself, and a node restart may well lose the logs entirely. A unified networked log center is therefore imperative. The current mainstream log services in the company are:

  • Zhiyan, a TEG product, simple to operate and easy to use, already proven in the CAPTCHA service. Under the TRPC-GO framework it is even easier to use: with a little configuration, logs are forwarded to the log center without modifying business code
  • Eagle Eye, rich in functions, but complex to access; its ease of use needs strengthening
  • ULS, CLS, etc.

The final choice is Zhiyan logging, which meets the requirements while being simple and easy enough to use. With TRPC-GO support, only the following is needed:

  1. Import the plugin package in your code
  2. Add a simple YAML configuration, and it is ready to use:
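A minimal sketch of what those two steps might look like. The Zhiyan writer plugin is internal, so its import path and configuration keys below are illustrative assumptions; the log API itself is the standard TRPC-GO one:

```go
package main

import (
	"trpc.group/trpc-go/trpc-go/log"
	// Hypothetical import path for the Zhiyan log-writer plugin; the
	// real internal path will differ.
	// _ "git.example.com/trpc-go/trpc-log-zhiyan"
)

// Illustrative trpc_go.yaml fragment (the writer name and keys are assumed):
//
// plugins:
//   log:
//     default:
//       - writer: zhiyan   # forward logs to the Zhiyan log center
//         level: info
//
// Business code keeps using the normal TRPC-GO log API; the configured
// plugin ships the output to the log center with no code changes.
func main() {
	log.Infof("variable query finished, cost=%dms", 42)
}
```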

Then, when locating a problem, you can view the logs on the Web side:



The specific middleware implementation details and help can be found in the Zhiyan logging plugin project under TRPC-GO

Powerful monitoring

Monitoring services include:

  1. Zhiyan: a TEG product with multi-dimensional monitoring, powerful and easy to use, repeatedly used and verified in the department
  2. Monitor: attribute-based monitoring, an old product, mature and stable, but the monitoring model is limited to a single mode
  3. 007: rich in features, but complex to access; its ease of use needs strengthening

The final choice is Zhiyan monitoring. Its multi-dimensional monitoring is richer and locates problems faster, and it is simple enough to use: under the TRPC-GO system, you only need to add configuration, register the plugin, and call the reporting API to get routine data monitoring. With multi-dimensional monitoring, well-chosen dimension definitions make the monitoring data more three-dimensional and more helpful for analyzing problems:

In the example pictured above, you can define a variable-query metric with two associated dimensions, source IP and processing result. When we look at the corresponding business access, we can see the traffic from each source IP as well as the processing result of each slice of traffic (pictured above): the whole state of the service at a glance. Compared with the original Monitor, this is an upgrade in dimensionality.
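A sketch of reporting such a multi-dimensional metric, assuming the open-source TRPC-GO metrics API (the Zhiyan sink itself would be wired in as a plugin; the metric and dimension names are illustrative):

```go
package main

import (
	"trpc.group/trpc-go/trpc-go/metrics"
)

// reportQuery reports one query event with two dimensions, source IP and
// processing result, so monitoring can slice the traffic by either one.
func reportQuery(sourceIP, result string) {
	rec := metrics.NewMultiDimensionMetricsX(
		"variable_query", // metric group name (illustrative)
		[]*metrics.Dimension{
			{Name: "source_ip", Value: sourceIP},
			{Name: "result", Value: result},
		},
		[]*metrics.Metrics{
			// Count one request; aggregated upstream by SUM.
			metrics.NewMetrics("req_count", 1, metrics.PolicySUM),
		},
	)
	_ = metrics.Report(rec) // delivered to Zhiyan by the configured plugin
}

func main() {
	reportQuery("10.1.2.3", "pass")
	reportQuery("10.1.2.4", "reject")
}
```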

Unified Configuration Center

Our business typically runs on some configuration, and changing that configuration is often part of a release. The traditional approach was to log in to each machine, change the configuration file, and restart; or to store the configuration centrally and have the module on each machine read it periodically and load it into the program. With OTeam collaboration, the company provides configuration synchronization services; there are currently two main options:

T configuration service

  • Simple to use, with average functional richness
  • Permission control over configuration is basic, and upon inquiry there is no planned version for data encryption

R configuration service

  • A company-level solution, with dedicated OTeam support
  • Configuration synchronization comes with permission control, and encryption features will be supported later
  • Rich configuration formats are supported: JSON, YAML, String, XML, etc.
  • Supports public/private configuration groups, which makes configuration reuse and per-module partitioning convenient, and supports grayscale/staged releases

Both services are simple and easy to use under the TRPC-GO development model, and changes take effect immediately after being modified in the backend. After comprehensive consideration, the R configuration service was selected as the configuration synchronization platform. Its usage is fairly simple:

  1. First register the project with the R configuration service and configure the groups
  2. Connect to the R configuration service in your code to read data and watch mutable configuration items for changes, as in the sketch below.
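A sketch of step 2 using the open-source TRPC-GO config interfaces. The provider name "rainbow" and the key name are assumptions, and the internal plugin import path is omitted:

```go
package main

import (
	"context"
	"log"

	"trpc.group/trpc-go/trpc-go/config"
	// The R configuration service plugin registers itself as a named KV
	// provider; its import path is internal, so only a placeholder here.
	// _ "git.example.com/trpc-go/trpc-config-r"
)

func main() {
	ctx := context.Background()

	// "rainbow" is an assumed provider name; the real one comes from
	// the plugin's registration.
	kv := config.Get("rainbow")

	// Read the current value of a configuration item.
	resp, err := kv.Get(ctx, "publish_variable_ids")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("initial value: %s", resp.Value())

	// Watch the item: every change published on the Web side is pushed
	// to each cloud node and takes effect immediately.
	ch, err := kv.Watch(ctx, "publish_variable_ids")
	if err != nil {
		log.Fatal(err)
	}
	for r := range ch {
		log.Printf("config changed: %s", r.Value())
	}
}
```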

The management interface is also easy to use:

Take the configuration of externally published variable IDs as an example. This ID list keeps growing and changing with demand, and as soon as a user modifies and publishes it on the Web side, every node on the cloud perceives the change, which takes effect immediately. Both service stability and release efficiency are qualitatively better than with the traditional modify-and-release approach

TRPC-GO's plugin-based service design greatly simplifies calling the various services, and together with the active open-source projects under TRPC-GO it has improved engineering efficiency by more than 50%. Take MySQL/Redis as an example: with the usual open-source libraries, used directly or wrapped, you must handle exceptions, connection pooling, and addressing (L5 or Polaris inside the company), and development plus testing basically takes one to two days. Using the corresponding open-source component under TRPC-GO/database, that drops to about two hours. Likewise, wiring up the configuration synchronization or log center services takes four hours instead of the two days that self-development or semi-self-development would cost. Objectively speaking, anything self-developed in such a short time could only be a bare-bones customization, and in stability and generality it would be hard-pressed to match the platform services. The key point is that the company's code collaboration and the various OTeams keep getting richer and more capable, and their growing maturity translates into increased productivity for the whole organization. Overall, using the company's mature middleware is the right choice
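For flavor, a sketch of the kind of call site this enables, based on my reading of the open-source trpc-database MySQL wrapper; the service name is a placeholder, and the exact signatures should be checked against the component's documentation:

```go
package main

import (
	"context"
	"log"

	"trpc.group/trpc-go/trpc-database/mysql"
)

func main() {
	// The proxy name is resolved through Polaris/L5 by the framework's
	// addressing configuration; no pooling or addressing code is written
	// by hand. The name below is a placeholder.
	proxy := mysql.NewClientProxy("trpc.mysql.blackcentre.db")

	// Exceptions, pooling, and addressing are handled by the component.
	_, err := proxy.Exec(context.Background(),
		"UPDATE variable SET published = ? WHERE id = ?", 1, 42)
	if err != nil {
		log.Fatal(err)
	}
}
```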

What moving to the cloud brings

At present, the man-machine confrontation service is continuously taking on traffic from the old system and newly connected services. The system's peak read/write traffic exceeds 12 million requests/min, and variable-dimension access traffic reaches 140 million/min (a single access request can touch multiple variables). The whole service has run stably on the cloud for more than 3 months, living up to expectations.

Stability improvement

Setting aside the code quality of the module itself, the stability gain comes primarily from several platform features of container deployment on the cloud:

  1. TKE supports service heartbeat checks, restarting abnormal applications, and failure migration
  2. Easy support for multi-site deployment and multi-region disaster recovery
  3. Resource isolation, so an abnormal business does not affect normal ones

If 99.999% availability is what we pursue for service stability, then compared with the hour- to day-level machine recovery times of the old deployment mode, it is predictable that service stability will improve by one to two nines

Improved resource utilization

In the physical machine/CVM deployment mode, 20% machine resource utilization is already considered good, especially for some small modules. Under TKE's elastic scaling mechanism on the cloud, however, container resource utilization can easily reach 70% with the right configuration.

For example, the following is the resource monitoring of my current application. To cope with possible bursts of heavy traffic, I configured automatic scaling and set a generous number of base nodes, which is why CPU occupancy is low; all of this is configurable. Container deployment on the cloud can improve system resource utilization by 50% compared with the traditional mode.

Improvements after moving to the cloud

  1. Demand development efficiency increased by more than 50% thanks to the company's open-source technology
  2. With improved machine resource utilization, the machine budget for man-machine confrontation has been reduced by 50%
  3. The centralization of services makes releases and problem locating faster; for example, with the R configuration service and Zhiyan, releasing and problem solving went from half an hour down to the minute level
  4. With container deployment, the system naturally enters a strongly maintained state. Because the image is fully reproducible, every fix is reliably committed back into the repository, with no fear of loss or omission. In the physical machine/CVM era, programs, scripts, and configurations were likely to live on a single machine only (especially in some offline preprocessing systems); after a machine failure or system crash they were simply gone. Of course you could say physical-machine deployments can be backed up promptly, but that relies on people's self-discipline or strongly enforced process, which carries a very different cost; and more often than not, Murphy's Law turns into a nightmare you cannot escape

Conclusion

Looking back on the efforts the center and the company have made over the years to improve engineering efficiency: system frameworks evolved from the earliest SRV_Framework, to the center's SecAppFramework, to coroutine frameworks such as SPP Framework, and finally to TRPC; data synchronization and consistency went from manual synchronization and master-slave synchronization to the synchronization center and the company's various distributed storage services; system deployment went from manual deployment on physical machines, through an assortment of release tools, to Zhiyun and Blue Whale; and hosting went from physical machines and CVM to cloud services validated by massive business traffic. Engineering efficiency has improved all along the way, growing from a toddler into a fleet-footed teenager. Standing on the shoulders of modern cloud systems, the company's understanding and exploration of the cloud has likewise gone from vague to clear, from hesitant to firm, and this time our efforts did not let us down: the improvement in every metric and the recognition from users are an affirmation and encouragement that inspire us toward the future.

Understanding and building the cloud, innovating and developing on the cloud: our feet are on the ground, but our eyes are on the stars. As the Master said by the river, "Time passes like this stream, never ceasing day or night."

The above is a bit of my practice and reflection from moving man-machine confrontation to the cloud, shared here with everyone.
