The author

Yuhuliu is an R&D engineer at Tencent who focuses on storage, big data, and cloud native technologies.

Abstract

As it grew rapidly, the medical information business came to span dozens of businesses covering different scenarios, users, and channels, backed by thousands of services. To meet users' diverse needs efficiently, the medical technology team moved to the cloud on TKE and adopted the Coding DevOps platform and cloud observability technology to improve R&D efficiency and reduce operation and maintenance costs. This article describes our practices and experience during the cloud migration, along with some of the considerations and choices we made.

Business background

  • Stage 1: The medical information business mainly comprises three core businesses: medical code, doctor, and medicine. Medical code mainly provides access to medical content and popularizes medical knowledge; doctor connects doctors with patients; medicine serves pharmaceutical enterprises. As the business developed, we built a large number of backend services on the TAF platform and quickly stood up the initial business. Because there were many businesses and many of them had multi-region requirements, we ended up deploying multiple business clusters on the TAF platform. At this stage, release, operation and maintenance, and troubleshooting relied entirely on manual work and were inefficient.

Moving the business to the cloud

  • Stage 2: As the business scaled up rapidly, the traditional development and operations model increasingly constrained business iteration in terms of agility, resources, and efficiency. As the company's project to move self-developed services to the cloud advanced, embracing cloud native became the right choice for the business: K8s could meet the business's diverse resource demands with flexible scheduling, and the existing mature DevOps platform could support agile iteration. The medical backend team therefore started migrating all of its services to the cloud.

  • Before going to the cloud, we had several questions to consider:

1: With so many services, how should the code be managed?

2: How can problems be located and troubleshot quickly after moving to the cloud?

3: How should the monitoring and alarm platform be selected?

4: How should the base image be chosen?

About service code management

We use Git for version control, set up project groups by business, give each service its own code repository, and apply one naming convention to all repository names.

About troubleshooting

Our survey found that a mature ELK service was available on the cloud. Services only need to write their logs to the same directory; Filebeat collects them, and ETL logic then imports them into Elasticsearch. Another advantage of this approach is that it can collect logs from both frontend and backend services. The technology is mature and reuses existing component capabilities. By adding a traceid to each request, problems can be located along the entire call chain.
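As a minimal sketch of the traceid idea, the Python snippet below attaches one trace id to every log line written while a request is handled, and emits one JSON object per line so that a collector such as Filebeat could parse it. The field names, the `handle_request` helper, and the use of JSON lines are assumptions made for illustration; they are not the team's actual log format.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace id of the request currently being handled (hypothetical name).
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every log record."""

    def filter(self, record):
        record.traceid = _trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log collector can parse it."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "traceid": record.traceid,
            "message": record.getMessage(),
        })


def build_logger(service, stream):
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace id if present, otherwise start a new trace."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = _trace_id.set(trace_id)
    try:
        logger.info("request received")
        logger.info("request finished")
    finally:
        _trace_id.reset(token)
    return trace_id
```

In a real service the stream would be a file in the shared log directory, and the incoming trace id would be read from the request so that every service on the call chain logs the same value.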

About the monitoring and alarm platform

CSIG provides CMS, a monitoring platform based on logs. After service logs are imported into CMS, monitoring and alarms can be configured on the reported logs, with user-defined dimensions and metrics. We use dimensions such as caller, callee, and interface name, and metrics such as call volume, latency, and failure rate, to meet our service monitoring and alarm requirements. Log-based monitoring reuses the same data collection pipeline, so the system architecture stays unified and simple.
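To make the dimensions and metrics concrete, here is a small Python sketch that groups call records by (caller, callee, interface) and computes call volume, average latency, and failure rate. It only illustrates the aggregation idea; the record fields (`caller`, `callee`, `interface`, `cost_ms`, `ok`) and the 5% alert threshold are assumptions, and this is not how CMS itself is implemented or configured.

```python
from collections import defaultdict


def aggregate(records):
    """Group call records by (caller, callee, interface) and compute metrics."""
    groups = defaultdict(list)
    for record in records:
        key = (record["caller"], record["callee"], record["interface"])
        groups[key].append(record)

    metrics = {}
    for key, calls in groups.items():
        failures = sum(1 for call in calls if not call["ok"])
        metrics[key] = {
            "calls": len(calls),
            "avg_ms": sum(call["cost_ms"] for call in calls) / len(calls),
            "failure_rate": failures / len(calls),
        }
    return metrics


def alerts(metrics, max_failure_rate=0.05):
    """Return the dimension keys whose failure rate exceeds the threshold."""
    return sorted(
        key for key, m in metrics.items() if m["failure_rate"] > max_failure_rate
    )
```

Because the metrics are derived from the same log records used for troubleshooting, no second reporting path is needed, which is the simplification described above.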

About base images

To make the early cloud migration easier and to unify service startup and data collection and reporting, we needed to prepare base service images: create directories in advance and provide scripts and tools for fast service onboarding. We provide base images for different languages and versions, and the container starts both Filebeat and the business service.

DevOps

  • Stage 2: During the cloud migration, we worked with our QA colleagues to gradually turn the manual steps of the development process into pipelines, which improved iteration efficiency and standardized the development process. Unit tests and automated dial tests improved service stability. After adopting the unified pipelines, development and deployment time dropped from hours to minutes.

The Coding platform is mainly used here. To distinguish environments, we set up four pipeline templates: development, test, pre-release, and production. We also introduced a merge mechanism that adds a manual code review stage.

In the merge stage: an MR hook automatically polls the code review result to ensure that the code is approved before the pipeline proceeds to the next step (different teams may have different requirements).

In the CI stage: code quality analysis improves code standardization, and unit tests ensure service quality.

In the CD stage: manual approval and automated dial tests improve service stability.
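The merge-stage gate can be pictured as a simple polling loop. The Python sketch below is a hypothetical illustration, not the Coding platform's API: `fetch_status` stands in for whatever call returns the review state, and the status strings, timeout, and interval are all assumed values.

```python
import time


def wait_for_approval(fetch_status, timeout_s=600.0, interval_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll a review status until it is approved or rejected.

    `fetch_status` is a caller-supplied function (hypothetical) that returns
    "approved", "rejected", or any other string while the review is pending.
    Returns True when approved and False when rejected; raises TimeoutError
    if no decision is reached within `timeout_s` seconds.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "approved":
            return True
        if status == "rejected":
            return False
        if clock() >= deadline:
            raise TimeoutError("code review did not finish in time")
        sleep(interval_s)
```

A pipeline step would continue only when the function returns True, and would fail the run on False or on a timeout.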

Resource utilization improvement

  • Stage 3: Many businesses require multi-region deployment (Guangzhou, Nanjing, Tianjin, and Hong Kong), and each service needs four environments (development, testing, pre-release, and production). After the migration, a preliminary count showed more than 3,000 workloads in total. Because traffic to the different businesses was highly uncertain, in the early stage resources were mostly allocated for the ideal case, which wasted a lot of resources.

To improve overall resource utilization, we carried out a series of optimizations, roughly according to the following guidelines:

With HPA enabled, service containers are scaled out and in dynamically. If traffic is still being routed to a container while it is stopping, requests may fail. Therefore, preStop and readiness detection need to be enabled on TKE in advance.

1: Graceful process stop: wait for the Polaris and CL5 route caches to expire before stopping. Entry: TKE -> Workload -> Specific Business -> Update workload. If the service uses CL5 for service discovery, a preStop of 70s is recommended; with Polaris, 10s is sufficient.

2: Readiness and liveness detection: route traffic to the container only after the process has started. Entry: TKE -> Workload -> Specific Business -> Update workload. Configure detection modes and intervals according to each service.
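The two settings above correspond to standard Kubernetes fields, whichever console is used to set them. As a rough sketch, the Python snippet below builds the relevant part of a pod spec as a plain dictionary. The 70s and 10s preStop waits come from the text; the port, the `/healthz` path, the probe intervals, and the extra 30s of grace period are assumptions made for illustration.

```python
import json

# preStop wait per service-discovery system, taken from the text above.
PRESTOP_SECONDS = {"cl5": 70, "polaris": 10}


def container_spec(name, image, discovery, port=8080, health_path="/healthz"):
    """Build a container spec with a preStop sleep and HTTP probes.

    The port, health path, and probe timings are illustrative assumptions.
    """
    wait = PRESTOP_SECONDS[discovery]
    probe = {"httpGet": {"path": health_path, "port": port}}
    return {
        "name": name,
        "image": image,
        # Keep serving while the route cache of the discovery system expires.
        "lifecycle": {
            "preStop": {"exec": {"command": ["sh", "-c", f"sleep {wait}"]}}
        },
        # Traffic is routed to the container only once this probe succeeds.
        "readinessProbe": {**probe, "periodSeconds": 5},
        # The container is restarted if this probe keeps failing.
        "livenessProbe": {**probe, "initialDelaySeconds": 10, "periodSeconds": 10},
    }


def pod_spec(name, image, discovery):
    """Build a pod spec whose grace period is longer than the preStop sleep."""
    wait = PRESTOP_SECONDS[discovery]
    return {
        # The grace period also covers the preStop hook, so it must exceed it.
        "terminationGracePeriodSeconds": wait + 30,
        "containers": [container_spec(name, image, discovery)],
    }


if __name__ == "__main__":
    print(json.dumps(pod_spec("demo-service", "demo-service:latest", "cl5"), indent=2))
```

One detail worth noting is that the default Kubernetes termination grace period is 30 seconds, so a 70s preStop sleep would be cut short unless `terminationGracePeriodSeconds` is raised as well.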

These adjustments greatly improved our resource utilization. With TKE's elastic scaling, we have largely solved the problem of insufficient resources during local traffic peaks while keeping business access normal, which avoids wasted resources and improves service stability. However, maintaining multiple environments still causes some waste.

Observability techniques

  • Stage 4: In the initial stage, our logging, metrics, and tracing setup met the early requirements of getting services onto the cloud quickly and troubleshooting more efficiently. However, as the business grew, ever larger log volumes consumed more and more resources, and log backlogs became the norm during peak periods. CMS alarms arrived about half an hour after the actual problem occurred, and the maintenance cost of ELK rose sharply. Cloud native observability technology became necessary. We adopted the observability solution recommended by Coding application management, which collects business data through a unified coding-sidecar:

Monitoring: Cloud monitoring center

Log: CLS

Tracing: APM

By onboarding to these platforms, we greatly improved the efficiency of detecting, locating, and troubleshooting problems, and greatly reduced business operation and maintenance costs. Monitoring and tracing also revealed many latent system problems, which helped us improve service quality.

Closing

Finally, I would like to thank all the developers for their hard work during the cloud migration, and all the R&D leaders for their strong support.
