Author | Tong Zilong, Architect, Zhangmen Education Infrastructure Department

**Takeaway:** This article is compiled from the author's talk "The Cloud Native Implementation Practice of Zhangmen Education" at the 2020 Cloud Native Microservices Conference. It introduces Zhangmen Education's cloud native practice, built around the Spring Cloud Alibaba, Nacos, Sentinel and Arthas microservice technology stack and implemented on Docker and Alibaba Cloud Kubernetes containers. It covers the high-availability deployment and monitoring of Nacos Server, the high-availability bidirectional synchronization and disaster recovery of the Nacos-Eureka synchronization service, and the integration with the DevOps release platform.

Reply "818" in the background of the Alibaba Cloud Native official account to get the replay address and the conference PPT collection.

Background

Since its full transformation to online education in 2014, Zhangmen Education has adhered to the vision of "making education share intelligence, making learning efficient and happy," and has kept empowering education with technology such as cloud computing, big data, artificial intelligence, AR/VR/MR and 5G. Zhangmen's business has grown rapidly in recent years; this year's epidemic in particular has made online education a new outlet and brought new opportunities to Zhangmen Education.

With the further expansion of business scale, the explosion of traffic and the growth in the number of microservices, the Eureka registry used by the old microservice system became overwhelmed. Meanwhile, the Spring Cloud ecosystem has evolved to its second generation; the first-generation Eureka registry no longer suits the current business logic and scale, is officially in maintenance mode in Spring Cloud, and will not evolve further. How to choose a better, more suitable registry was the question placed before us.

Why Spring Cloud Alibaba & Nacos

After in-depth investigation and comparison of open-source registries such as Alibaba Nacos and HashiCorp Consul, the features of each registry compare as follows:

  • Nacos
    • Supports both AP and CP consistency modes
    • Supports cross-language service registration and discovery through the DNS-F agent
    • Supports load balancing and avalanche-protection mechanisms
    • Supports multiple data centers and cross-registry migration
  • Consul
    • Supports CP only
    • Supports HTTP/DNS
  • Kubernetes CoreDNS
    • Supports the DNS protocol

Conclusion: Nacos meets our current service governance technology stack and allows a smooth migration of the registry; its community is very active, and the features provided by Spring Cloud Alibaba & Nacos make it easy to build dynamic service registration and discovery for cloud native applications.

I. Nacos Server Implementation

1. Deploy the Nacos Server

(Nacos Server Deployment Overview Diagram)

  • Nacos Server environment and domain name

Zhangmen's application environments are divided into four sets: DEV, FAT, UAT and PROD, corresponding to development, testing, pre-production and production respectively. Nacos Server is therefore also split into four independent environments. The DEV environment is deployed on a single machine, while the other environments are deployed as clusters. External access goes through domain names with SLB load balancing, both for SDK connections to Nacos Server and for access to the Nacos Server Dashboard.

  • Nacos Server environment isolation and call isolation

The Nacos data model consists of Namespace / Group / Service. Different namespaces can be created within the same application environment for finer-grained partitioning, isolating service registration and discovery. In some scenarios, a developer's local service needs to connect to the test environment's Nacos Server while the other test services must not route calls back to the developer's machine; in that case the enabled property of NacosDiscoveryProperties can be set to false for the local instance.
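A hedged sketch of that last scenario (not taken from the article's code), assuming the Spring Cloud Alibaba Nacos Discovery starter; the article mentions the enabled flag, while the closely related register-enabled switch, the profile name and the package names shown here are assumptions that may differ by version:

```java
// Hedged sketch: the local developer instance still discovers services from the
// test-environment Nacos Server, but does not register itself, so test services
// never route calls to it.
import com.alibaba.cloud.nacos.NacosDiscoveryProperties;

import javax.annotation.PostConstruct;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;

@Configuration
@Profile("local-dev") // hypothetical profile used only on developer machines
public class LocalDevNacosConfig {

    @Autowired
    private NacosDiscoveryProperties nacosDiscoveryProperties;

    @PostConstruct
    public void disableLocalRegistration() {
        // Same effect as spring.cloud.nacos.discovery.register-enabled=false:
        // the client still pulls the service list, but does not register itself.
        nacosDiscoveryProperties.setRegisterEnabled(false);
    }
}
```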

  • Nacos Server integrates with LDAP

The Nacos Server Dashboard is integrated with the company's LDAP service, and user information is recorded when a user logs in for the first time.

2. Nacos Server console

 

  • Nacos console permissions

When a user logs into the Nacos Server Dashboard for the first time, they are assigned an ordinary user role (rather than ROLE_ADMIN) by default and have no permission for any operation other than queries. Otherwise, a misoperation could take a business service offline or bring it online abnormally.

  • The Nacos console displays a service overview

Statistics on the total number of services and instances were added to the Nacos Server Dashboard page; the figures refresh every 5 seconds.

 

3. Nacos monitoring

 

  • Standard monitoring

Nacos is monitored at the system level, based on the company's existing Prometheus, Grafana and AlertManager stack.

  • Advanced monitoring

Following the Nacos monitoring guide, Nacos metrics are collected with Prometheus and displayed with Grafana.

  • Service instance status monitoring (see the illustrative sketch after this list)

  • Listen for instance offline events
  • Listen for instance deregistration events
  • Listen for instance registration events
  • Listen for instance online events
  • Listen for instance heartbeat timeout events
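These events are monitored on the Nacos Server side. As a hedged, client-side illustration only (not the author's server-side implementation), the Nacos Java SDK can subscribe to a service and observe instance changes; the server address and service name below are placeholders:

```java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.listener.NamingEvent;

public class InstanceChangeWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder address; replace with the environment's Nacos domain name.
        NamingService naming = NamingFactory.createNamingService("nacos.example.com:8848");
        naming.subscribe("demo-service", event -> {
            if (event instanceof NamingEvent) {
                NamingEvent e = (NamingEvent) event;
                // Fires when instances register, deregister or change health status.
                System.out.println("Service " + e.getServiceName()
                        + " now has instances: " + e.getInstances());
            }
        });
        Thread.currentThread().join(); // keep the subscription alive for the sketch
    }
}
```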

4. Nacos logs

 

  • Log merging and JSON formatting

The logs of the various Nacos modules were merged at the INFO, WARN and ERROR levels, with a schema field defined to mark each module. The logs are written to files in JSON format for ELK to collect and display.

5. Nacos alerts

 

  • Alert when a business service goes offline

  • Alert when a service name contains uppercase characters

6. Nacos performance testing

 

  • Core script
```python
import random
import time

import nacos

nacos_host = "nacos.example.com:8848"  # placeholder: address of the Nacos Server under test


def registry(ip):
    fo = open("service_name.txt", "r")
    service_name_list = fo.read().split(";")
    service_name = service_name_list[random.randint(0, len(service_name_list) - 1)]
    fo.close()
    client = nacos.NacosClient(nacos_host, namespace='')
    print(client.add_naming_instance(service_name, ip, 333, "default", 1.0,
                                     {'preserved.ip.delete.timeout': 86400000}, True, True))
    while True:
        print(client.send_heartbeat(service_name, ip, 333, "default", 1.0, "{}"))
        time.sleep(5)
```
  • Load test data

  • Load test result charts

Summary: with a Nacos Server cluster of three 1C4G machines, 1,499 services and 12,715 instances registered at the same time, CPU and memory stayed within a reasonable range over a long period. Nacos performance is indeed quite solid.

II. Nacos Eureka Sync Implementation

 

1. Nacos Eureka Sync scheme selection

 

① Official Sync solution

After research we adopted the official Nacos Eureka Sync solution and trialed it on a small scale with good results. However, once it was deployed to the FAT environment, it proved unworkable: one synchronization server could not withstand the frequent heartbeats of nearly 660 business services (not instances), and the solution had no high-availability characteristics.

② Highly available consistent Hash + Zookeeper Sync solution

If one server is not enough, add more; but how do we make them highly available?

The first idea was consistent hashing. When one or several synchronization servers fail, the Watch mechanism on temporary Zookeeper nodes detects the failure and notifies the remaining synchronization servers to ReHash, so the survivors take over the failed servers' work. Consistent hashing is used to distribute the synchronized business service list evenly, with the hash key derived from a binary transformation of the service name. We implemented the algorithm ourselves and found the distribution was not ideal. Suspecting a problem with our algorithm, we tried Kafka's own hash (see utils.murmur2) and found it was still not ideal, because the distribution of business service names is itself uneven; so we went back and optimized our own algorithm until it basically met expectations, as discussed in detail below. To be honest, though, the distribution is still not perfectly even.

③ Highly available active/standby + Zookeeper Sync solution

This scheme was a small episode: when a synchronization server goes down, its standby takes over, with the active/standby switchover also based on the Watch mechanism on temporary Zookeeper nodes. After discussion the active/standby scheme was dropped, because its machine cost is high and the implementation is not as elegant as consistent hashing.

④ Highly available consistent Hash + Etcd Sync solution

After several iterations we noticed that the synchronized business service list was persisted in a database while the ReHash notification on synchronization-server failure was handled by Zookeeper. Could the two be combined into one middleware to reduce cost? That led us to Etcd, which handles persistence of the synchronized business service list, notification of business service additions and removals, and ReHash notification when a synchronization server fails. At this point the solution was finally settled: a bidirectional synchronization scheme between the two registries (Eureka and Nacos), with Etcd as the bridge.

2. Nacos Eureka Sync implementation practice

① Nacos Eureka Sync objectives and principles

Registry migration objectives:

  • The migration cannot be accomplished overnight; business services are migrated gradually, and online calls must not be affected. For example, if business service A is still registered with Eureka and business service B has migrated to Nacos, mutual calls between A and B must remain normal;
  • During the migration, both registries must contain every business service, and the number and status of business service instances in the target registry must be kept strictly consistent with the source registry in real time.

Registry migration principles:

  • A business service registers with only one registry at a time, never both;
  • Whether a business service is registered with Eureka or with Nacos, the end result is equivalent;
  • In most cases a business service has only one synchronization task: a business service registered with Eureka has a Eureka -> Nacos task, and vice versa. During a smooth migration, a business service with some instances on Eureka and others on Nacos has two synchronization tasks, one in each direction;
  • The synchronization direction of a business service is determined by the syncSource tag in the metadata of its instances.

② Nacos Eureka Sync pain points

  • A Nacos Eureka Sync node reports heartbeats to Nacos Server on behalf of the business service instances it proxies. Heartbeat requests are put into a queue consumed by a fixed number of threads; once the number of instances handled by one node exceeds a certain threshold, heartbeats can no longer be sent in time and instances are lost unexpectedly (see the sketch after this list);
  • When a Nacos Eureka Sync node goes down, all the heartbeat tasks it was handling are lost, causing large numbers of online call failures, with disastrous consequences;
  • While Nacos Eureka Sync is running, business services (not instances) newly added on Eureka or Nacos need to be sensed in real time.
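A minimal sketch of the first pain point, with illustrative names only (not the Sync source code): heartbeats for proxied instances go through a bounded queue drained by a fixed thread pool, so once the instance count outgrows the consumers, heartbeats miss the Nacos timeout window:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class HeartbeatForwarder {
    private final BlockingQueue<String> pendingBeats = new LinkedBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    public void start() {
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String instanceKey = pendingBeats.take();
                        sendBeatToNacos(instanceKey); // network call; latency here backs up the queue
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    public void enqueue(String instanceKey) {
        // When producers outpace the fixed consumers, the queue backs up and heartbeats
        // miss the Nacos timeout window -- the failure mode described above.
        pendingBeats.offer(instanceKey);
    }

    private void sendBeatToNacos(String instanceKey) {
        // placeholder for the actual heartbeat call to Nacos Server
    }
}
```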

③ Nacos Eureka Sync architecture approach

  • Obtain the business service list from each registry, initialize the business service synchronization task list, and persist it to the Etcd cluster;
  • Business services added later during the migration are persisted to the Etcd cluster through an API, and the migration process is integrated with the DevOps release platform, so the whole migration is fully automated and omissions caused by manual operation are avoided;
  • The synchronization service subscribes to the Etcd cluster to obtain the task list and to listen to the node status of the synchronization cluster (a hedged Etcd-client sketch follows this list);
  • The synchronization service locates the node responsible for each task with a consistent Hash over the surviving nodes. Back-end API requests arrive through SLB load balancing, so a task deletion instruction may be polled to any node: if the current node owns the task it removes the heartbeat itself, otherwise it finds the owning node and delegates the request to it;
  • The synchronization service monitors the status of every business service instance in the source registry and synchronizes healthy instances to the target registry, keeping instance status in both registries consistent in real time;
  • After all instances of a business service have moved from Eureka to Nacos, the business team notifies the infrastructure team, which manually removes the synchronization task on the Nacos Eureka Sync console.
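A hedged sketch of the subscription step, assuming the jetcd client is used to talk to the Etcd cluster (the article does not name the client library); the key and endpoints are placeholders:

```java
// Hedged sketch: watch the key that holds the synchronization task list so every
// sync node learns about task additions and removals in real time.
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchEvent;

import java.nio.charset.StandardCharsets;

public class SyncTaskWatcher {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder()
                .endpoints("http://etcd1:2379", "http://etcd2:2379", "http://etcd3:2379")
                .build();
        ByteSequence taskListKey =
                ByteSequence.from("/nacos-eureka-sync/tasks", StandardCharsets.UTF_8);

        client.getWatchClient().watch(taskListKey, Watch.listener(response -> {
            for (WatchEvent event : response.getEvents()) {
                // PUT = task list updated (business service added), DELETE = task removed
                System.out.println(event.getEventType() + " -> "
                        + event.getKeyValue().getValue().toString(StandardCharsets.UTF_8));
            }
        }));

        Thread.currentThread().join(); // keep watching for the sketch
    }
}
```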

④ Nacos Eureka Sync cluster sharding and high-availability scheme

Consistent Hash sharding of business services:

  • Based on the multi-cluster deployment shown in Figure 1, a configurable number of virtual nodes is set for each node so that the virtual nodes are evenly distributed on the Hash ring;
  • Compute each business service's Hash value with the FNV1_32_HASH algorithm over its service name, find the nearest node clockwise on the ring, and hand the task to that node (see the sketch after this list).
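The sketch referenced above, assuming a TreeMap-based ring and the commonly used FNV1_32_HASH implementation; the node names, virtual-node count and "##" separator are illustrative, not the Sync source code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRouter {

    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRouter(List<String> nodes, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        nodes.forEach(this::addNode);
    }

    /** Put a configurable number of virtual nodes for each real node on the ring. */
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(fnv1_32Hash(node + "##" + i), node);
        }
    }

    /** Remove a crashed node; its services are then resharded onto the survivors. */
    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(fnv1_32Hash(node + "##" + i));
        }
    }

    /** Route a business service to the nearest node clockwise on the ring. */
    public String route(String serviceName) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(fnv1_32Hash(serviceName));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    /** Widely used FNV1_32_HASH variant for consistent hashing. */
    private static int fnv1_32Hash(String key) {
        final int prime = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            hash = (hash ^ key.charAt(i)) * prime;
        }
        hash += hash << 13;
        hash ^= hash >>> 7;
        hash += hash << 3;
        hash ^= hash >>> 9;
        hash += hash << 5;
        return hash & Integer.MAX_VALUE; // keep the value non-negative
    }

    public static void main(String[] args) {
        ConsistentHashRouter router = new ConsistentHashRouter(
                Arrays.asList("sync-node-A", "sync-node-B", "sync-node-C"), 100);
        System.out.println("order-service -> " + router.route("order-service"));
    }
}
```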

Failover on synchronization node downtime:

  • Node monitoring: each node monitors the survival status of the others. The Etcd cluster lease TTL is configured so that at least 5 renewal heartbeats are sent within the TTL, avoiding the loss of a node due to network jitter (a sketch follows this list);
  • Node downtime: when a node goes down, its tasks are transferred to the other nodes. Because the crashed node's virtual nodes have expired from the ring, its tasks are ReSharded across the survivors; for example, the tasks of virtual nodes ##1 and ##2 move to nodes C and A respectively, which prevents one node from taking over all of the crashed node's tasks and triggering a cascading avalanche of the remaining nodes;
  • Node recovery: as shown in Figure 3, the recovered node's virtual nodes are added back to the Hash ring and the Sharding rules change, so the recovered node takes over part of the other nodes' tasks according to the new ring. Heartbeat tasks created on a node do not disappear by themselves, so the now-redundant tasks on the other nodes (that is, the tasks reassigned to the recovered node) must be cleaned up to reduce their load (this step is very important, otherwise it may cause a cascading avalanche of the cluster) and return the cluster to its original, normal synchronization state;
  • Node disaster recovery: if the Etcd cluster becomes unreachable, the surviving nodes are obtained from a configuration file; the cluster keeps running normally but loses its disaster-recovery capability.
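A hedged sketch of the node-liveness part above, again assuming the jetcd client: each sync node registers itself under a lease and renews it well within the TTL, matching the idea of "at least 5 renewal heartbeats within the TTL" to ride out network jitter. The key, value and TTL are placeholders:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.lease.LeaseKeepAliveResponse;
import io.etcd.jetcd.options.PutOption;
import io.grpc.stub.StreamObserver;

import java.nio.charset.StandardCharsets;

public class SyncNodeRegistrar {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://etcd1:2379").build();

        long leaseId = client.getLeaseClient().grant(15).get().getID(); // 15s TTL

        ByteSequence key =
                ByteSequence.from("/nacos-eureka-sync/nodes/sync-node-A", StandardCharsets.UTF_8);
        ByteSequence value = ByteSequence.from("10.0.0.21:8080", StandardCharsets.UTF_8);
        client.getKVClient()
                .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
                .get();

        // Keep renewing the lease; if this node dies, the key expires after the TTL
        // and the surviving nodes trigger a ReHash.
        client.getLeaseClient().keepAlive(leaseId, new StreamObserver<LeaseKeepAliveResponse>() {
            @Override public void onNext(LeaseKeepAliveResponse response) { /* lease renewed */ }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onCompleted() { }
        });

        Thread.currentThread().join(); // keep the process alive for the sketch
    }
}
```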

3. Nacos Eureka Sync safeguard measures

 

① Nacos Eureka Sync synchronization console

The following console ensures that Nacos Eureka Sync senses a new business service (not an instance) on Eureka or Nacos in real time. We also took it a step further toward intelligence and automation:

  • Adding a synchronization task: combined with the DevOps release platform, when a new business service (not an instance) goes online, the platform determines which registry it is launched from and calls back the Nacos Eureka Sync API to add the synchronization task automatically. For example, for a business service registered with Eureka, the DevOps release platform automatically adds its Eureka -> Nacos synchronization task, and vice versa. The same operation can also be performed manually from the console below.

  • Deleting a synchronization task: the DevOps release platform cannot judge whether a business service (not an instance) has gone offline or has already migrated completely to the other registry. One might suggest using an instance count of zero as the criterion, but the count can also drop to zero during a network failure when all heartbeats are lost, so it is not a rigorous basis. This judgment is therefore left to the business staff, assisted by DingTalk robot alert reminders, and a member of the infrastructure team performs the operation from the console below.

② Nacos Eureka Sync Etcd monitoring

The following page shows whether the business service list is evenly distributed across the synchronization service cluster under consistent hashing.

③ Nacos Eureka Sync alerts

  • Nacos Eureka Sync alert

  • Alert indicating that business service synchronization is complete

4. Nacos Eureka Sync upgrade drill

  • Starting at 10:00 PM one night in July, a drill was run in the FAT environment; using the automation tool Ansible, one-click upgrade and rollback were rehearsed twice;
  • At 11:30 PM a destructive operation was carried out to observe intelligent recovery: 3 of the 9 Nacos Eureka Sync nodes were killed and only one instance was lost, recovering after 5 minutes (later investigation located the cause as an abnormal state of that service instance on Eureka);
  • At 11:45 PM, 2 more nodes were killed, leaving only 4; failover worked and synchronization stayed normal;
  • At 11:52 PM, 2 Nacos Eureka Sync nodes were restored; the cluster ReHashed and rebalanced, and synchronization stayed normal;
  • At 11:55 PM, all nodes were restored; the cluster ReHashed and rebalanced, and synchronization stayed normal;
  • At 12:14 AM, an extreme disaster drill was run: 8 of the 9 nodes were killed and the remaining one held up; failover worked and synchronization stayed normal;
  • At 12:22 AM, the UAT environment was upgraded smoothly;
  • At 1:22 AM, the PROD environment was upgraded smoothly;
  • Throughout, the ReHash time was under 1 minute; that is, even with a large-scale failure of the Nacos Eureka Sync service, recovery takes less than 1 minute.

III. Solar Cloud Native Microservice Practice

(Solar Cloud Native Microservice System)

The Solar microservice system includes microservice governance components, usability wrappers around middleware and basic components, and an alerting and monitoring system. It links Zhangmen's business services with the underlying infrastructure; every service follows a strong contract and evolves toward a cloud native microservice architecture.

1. SDKs based on Spring Cloud Alibaba, Nacos and Sentinel

1) Alibaba Nacos

  • Solar Nacos SDK has the domain names of the four environments DEV | FAT | UAT | PROD built in, transparently to business systems;
  • Solar Nacos SDK, based on Spring Cloud Alibaba, integrates Ctrip VI Cornerstone to implement microservice warm-up and pull-in/pull-out;
  • Solar Nacos SDK supports blue-green grayscale release and sub-environment routing across registries during the transitional period when Nacos and Eureka coexist;
  • Solar Nacos SDK integrates grayscale blue-green tracking points into SkyWalking;
  • Solar Nacos SDK encapsulates standard Spring Boot / Spring Cloud / Apollo / Zuul annotations behind @EnableSolarService and @EnableSolarGateway, reducing the usage cost for business teams;
  • Solar Nacos SDK and Solar Eureka SDK support upgrade and rollback;
  • Solar Nacos SDK, combined with a Java Agent, solves context loss across threads in asynchronous invocation scenarios.

2) Alibaba Sentinel

  • Solar Sentinel SDK has the domain names of the four environments DEV | FAT | UAT | PROD built in, transparently to business systems
    • The SDK determines which Sentinel address to connect to from the environment of the current machine
  • Solar Sentinel SDK is deeply integrated with the Apollo SDK
    • Sentinel configuration is persisted in the service's own Apollo namespace and loaded from Apollo into memory at the next restart
    • Circuit-breaking and flow-limiting rules are applied at the appId dimension
  • Solar Sentinel SDK integrates OpenTracing and SkyWalking to output Sentinel tracking points to SkyWalking
    • Using the OpenTracing toolkit provided by SkyWalking, points are buried manually, span information is collected and pushed to SkyWalking for persistence in ES
  • Solar Sentinel SDK Dashboard persistence retrofitted with InfluxDB & Grafana
    • Sentinel metrics data is persisted to the time-series database InfluxDB and then displayed via Grafana
  • Solar Sentinel SDK limit-app circuit-breaking extension
    • If a version-matching failure is detected during a blue-green call, a BlockException is thrown
  • Solar Sentinel SDK gateway flow control and single-machine rate limiting for microservices (see the sketch after this list)
    • Full-link load testing helps us understand the service's capacity to carry peak traffic
    • Semaphore isolation / QPS and concurrent-thread limiting / circuit breaking on average response time, second-level exception ratio and minute-level exception count
    • Based on a TCP BBR-style system protection algorithm, system bottlenecks are detected automatically to protect system stability
  • Solar Sentinel SDK cluster rate limiting
    • Cluster rate limiting solves the problem that per-machine limits are inaccurate overall when traffic across the machines in a cluster is uneven
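As a hedged illustration of single-machine flow control, the sketch below uses the open-source Sentinel core API rather than the Solar SDK wrapper; the resource name and threshold are placeholders:

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

public class OrderQueryLimiter {

    public static void main(String[] args) {
        // In the Solar SDK these rules come from the service's Apollo namespace;
        // here they are loaded in code purely for illustration.
        FlowRule rule = new FlowRule();
        rule.setResource("queryOrder");             // placeholder resource name
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS); // QPS-based single-machine limit
        rule.setCount(100);                         // at most 100 QPS on this instance
        FlowRuleManager.loadRules(Collections.singletonList(rule));

        Entry entry = null;
        try {
            entry = SphU.entry("queryOrder");
            // protected business logic goes here
        } catch (BlockException ex) {
            // rejected by flow control (or by an extension such as the blue-green
            // version-mismatch check mentioned above)
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```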

2. Grayscale blue-green release and environment isolation

  • Based on Spring Cloud Alibaba, the Nacos SDK and the open-source Nepxion Discovery framework (github.com/Nepxion/Dis…)
  • Blue-green grayscale release: grayscale release by version matching and by version weight
  • Multi-region routing: grayscale routing by region matching and by region weight
  • Environment isolation: environment isolation and routing

3. Intelligent, semi-automated grayscale blue-green release based on the DevOps release platform

 

  • Intelligent release entry page

  • Blue-green condition-driven mode page

  • Blue-green weight-based traffic mode page

 

  • Full release and rollback page

4. Rolling lossless release based on the DevOps release platform

  1. The operations CD release platform sets the instance status to disabled and pulls the instance out of the registry (see the sketch after this list);
  2. Consumers subscribed to the registry are notified that the instance is unavailable;
  3. Consumers stop routing traffic to the unavailable instance;
  4. In-flight traffic on the instance continues to be processed; the release script for the instance starts 30 seconds later;
  5. After the instance restarts successfully, the CD platform checks its health by calling the VI interface embedded in the service's web container;
  6. Once the health check passes, the instance is registered with the Nacos registry;
  7. Consumers subscribe to the new instance and requests are load-balanced normally.
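A hedged sketch of step 1 only: one way a CD platform can mark an instance as disabled is through the Nacos v1 Open API before restarting it. This is an illustration, not the platform's actual tooling; the host, service name, IP and port are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InstancePuller {
    public static void main(String[] args) throws Exception {
        // PUT /nacos/v1/ns/instance with enabled=false marks the instance as disabled,
        // so subscribed consumers stop routing traffic to it.
        String url = "http://nacos.example.com:8848/nacos/v1/ns/instance"
                + "?serviceName=demo-service&ip=10.0.0.12&port=8080&enabled=false";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("disable instance -> " + response.body());
    }
}
```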

5. SkyWalking distributed tracing and APM system

  • Monitoring of all service exceptions

 

  • Interface performance indicators

  • The service-dimension exception monitoring dashboard aggregates the number of exceptions on call links, helping teams quickly locate problems

 

  • Integrates Solar full-link grayscale blue-green tracking points

  • Integrates Sentinel rate-limiting, degradation and circuit-breaking tracking points

  • Integrates performance indicators at the service, instance and interface dimensions
  • Quickly locates faults, slow SQL and circuit-broken links
  • Correlates links with logs
  • Provides topology at the service, instance, interface and link dimensions
  • Integrates the application diagnosis system (Arthas & Bistoury)

6. Arthas & Bistoury application diagnosis system

  • No need to log in to the machine
  • Integrates Arthas: application diagnosis from a web console, covering CPU, threads, memory, the JVM and more
  • Hotspot method analysis
  • Online debugging

IV. Solar Cloud Native Container Practice

1. CI/CD continuous delivery

  • The CD platform compiles and packages Jar packages and Docker images through Jenkins, and uploads images via Harbor to the OSS platform, with Harbor made highly available through SLB;
  • The self-developed Hyperion intermediate service acts as the API middle layer connecting Harbor and Alibaba Cloud Kubernetes;
  • The CD platform invokes the Alibaba Cloud Kubernetes API through Hyperion to release an image; Kubernetes pulls the image from Harbor and deploys it to a physical node;
  • Alibaba Cloud Kubernetes pushes status events to Hyperion, and Hyperion pushes the data to the CD platform to display real-time release status information.

2. Log collection

  1. A Pod writes logs to /opt/logs/{appid}/xxx;
  2. The logs are mounted on the physical node under /var/log/{appid}/{podid}/xxx;
  3. Each physical node starts one FileBeat process to collect the log information of all Pods on that node.

PS: After comprehensive load testing, one FileBeat process per physical node collecting all Pod logs on that node performs fine. If log collection ever lags, a FileBeat process can instead be mounted alongside each Pod, at the cost of more system resources.

  4. FileBeat pushes the logs to Kafka;
  5. GoHangout consumes the Kafka data concurrently and persists it to the ES cluster.

GoHangout performs well at consuming Kafka data concurrently.

  6. Kibana displays all the log index data stored in ES.

3. Elastic scaling and self-healing of microservices

  • CPU load
  • Memory
  • Metrics: elastic scaling and self-healing are driven by the metrics provided by the circuit-breaking, rate-limiting and monitoring components;
  • Capacity planning based on class-scheduling data: as shown in the figure below, class schedules are very regular and peak traffic is distributed just as regularly. The carrying capacity required of the microservices over the next few hours is planned from the class-scheduling data, and Pods are allocated according to the plan.

4. Smooth migration network solution

 

  • The Alibaba Cloud Terway Kubernetes cluster provides a CNI plugin that puts Pod IPs and Node IPs on the same network plane;
  • A Pod's IP changes on every release, so the internal network keeps using the access and governance scheme of the virtual-machine era;
  • External access goes through Alibaba Cloud SLB, which automatically detects IP changes;
  • The infrastructure team has accumulated deep experience in microservice construction; by making full use of the Spring Cloud Alibaba and Nacos service governance stack, our cloud native containerization process migrated smoothly to Alibaba Cloud Kubernetes.

Conclusion: the above is the overall picture of the cloud native microservice technology system that Zhangmen Education has built around Spring Cloud Alibaba. Zhangmen Education will keep pace with the community and evolve toward microservice Mesh and serverless function computing. Years ago we discussed the risk of putting services on the cloud; as cloud native technology continues to mature, today we talk more about the risk of services not being on the cloud. I hope this sharing provides some practical reference and guidance for readers who plan to adopt Spring Cloud Alibaba microservices and are moving, or about to move, to the cloud.

The Spring Cloud Alibaba seven-day training camp is open!

PC click the link for details: developer.aliyun.com/learning/tr…

"Alibaba Cloud Native focuses on microservices, Serverless, containers, Service Mesh and other technical fields, follows cloud native technology trends and large-scale cloud native implementation practices, and aims to be the official account that best understands cloud native developers."