Author | Yin Ming, Yanpu Technology Architect

This article is adapted from the Architect Development series live lecture on March 19. Follow the “Alibaba Cloud Native” public account and reply “319” to get the live replay link and the PPT download link.

Yanpu Technology was born out of the beauty industry. “Yanpu Expert” is a SaaS platform built specifically for merchants in the beauty industry. To provide merchants with a more secure, stable, and efficient platform, we have made many attempts in technology, and after several rounds of evolution the system has become more stable and reliable. Today I would like to share the architecture evolution of Yanpu Technology and the application practice of Nacos at Yanpu.

Single application era

The picture above is the structure diagram of our monolithic service. It is divided into many modules, such as membership, orders, and stores. The structure looks clear enough, but once you open the actual package structure it is enough to make you tear your hair out, and changing code is a real pain. The monolithic service presented several challenges:

  • **Slow release cycle:** Although the business volume was not large at the time, the codebase was large, every business iteration touched it, and the entire service had to be recompiled, packaged, and deployed for each release. Especially in the early days, when we had no build tooling, publishing meant running a pile of commands by hand, always with the feeling of “a flurry of furious activity with nothing to show for it”.

  • **Low collaboration efficiency:** Merge conflicts were frequent; sometimes you would happily write code only to have someone else happily delete it while resolving a conflict, adding a lot of communication cost.

  • **Poor stability:** When the service failed, the whole system could become unavailable. The system was also hard to scale: by the time a merchant’s promotion was over, the servers might still not have finished scaling out.

  • **Poor performance:** In the monolithic service, some developers, to satisfy their own business needs, wrote SQL that joined tables across unrelated business domains and paid no attention to performance, leading to frequent load alarms online.

In addition, we faced some business challenges:

  • **Business transformation:** In June 2018, our company decided to switch from a pan-industry focus to the beauty industry, and needed to build a SaaS platform dedicated to providing technical support for beauty-industry merchants.

  • **Capturing the market quickly:** The business transformation brought new demands from more merchants. If we could not iterate quickly, we would be eliminated from the market, so it was urgent to improve development efficiency and capture the market quickly.

  • **Poor merchant experience:** As more and more merchants moved in, performance and reliability problems gradually emerged and could not be fixed in time. The merchant experience deteriorated, which violated our principle of putting customers first.

To sum up, we concluded that a servitization transformation was urgently needed.

Microservice transformation

After discussion among the company’s development team, we finally decided to carry out the transformation in two steps.

Objectives of Servitization 1.0:

  • Connect the new and old merchant platforms with the minimum transformation cost, so that functions can be migrated quickly;
  • Business abstraction: abstract the parts common to the old and new merchant centers and clean up the code logic of the old merchant center, paving the way for the subsequent construction of the business middle platform.

The objective of Servitization 2.0 is to build an initial business middle platform, so that the platform’s capabilities can be quickly reused and combined to support business exploration and development more quickly.

Expected results of Servitization 1.0:

  • We hoped the old merchant center could continue to provide services externally while also, as a provider, supporting the new merchant center.
  • The new merchant center would only provide services externally and would not connect to the database directly; its business layer would handle only the logic specific to the beauty industry.

So the idea was that the new merchant center would make remote calls directly to the interfaces exposed by the old merchant center’s Controllers, and we decided to try Spring Cloud for this.

Service discovery selection:

  • Consul supports service discovery as well as KV storage. Since we also wanted KV storage for a configuration center, we wanted to give Consul a try;
  • Its service health checks are relatively more detailed;
  • During our selection process, the announcement that open-source work on Eureka 2.x was being discontinued suddenly appeared. Although it turned out to have little impact on us, at the time it pushed our decision toward Consul.

Servitization 1.0 Architecture Diagram:

In Servitization 1.0 our technical plan was to register the old merchant center on Consul; the new merchant center fetches the provider list from Consul and makes remote calls through Feign, connecting the functions of the old and new merchant centers.
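As a rough sketch of this calling pattern (the service name, path, and return type below are hypothetical, assuming a Spring Cloud version with spring-cloud-starter-openfeign and the Consul discovery starter on the classpath, not our actual code), the new merchant center might declare a Feign client like this:

```java
import java.util.Map;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Hypothetical Feign client in the new merchant center: "old-merchant-center" is the
// service name the old merchant center registers on Consul; Spring Cloud resolves it
// to a concrete instance from the Consul server list at call time.
@FeignClient(name = "old-merchant-center")
public interface OldMerchantCenterClient {

    // Remote call to a Controller endpoint already exposed by the old merchant center.
    @GetMapping("/api/member/{id}")
    Map<String, Object> getMember(@PathVariable("id") Long id);
}
```

With `@EnableFeignClients` and `@EnableDiscoveryClient` on the application class, the new merchant center only depends on the registered service name, not on any fixed host address.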

After Servitization 1.0, we had solved the following problems:

  • **Rapid completion of functions:** Functions of the old merchant center were quickly carried over to the new merchant center, completing the adaptation to the beauty industry;
  • **Faster iteration:** Most functions of the new merchant center could be delivered by modifying and adapting the old merchant center, laying a good foundation for the subsequent business abstraction;
  • **Performance optimization:** While developing the business, we also optimized the legacy code of the old merchant center, improving performance and stability.

However, some challenges remained unsolved after Servitization 1.0:

  • **Release cycles still not fast enough:** Most of the code still lived in the merchant centers and was still touched by every business iteration;
  • **Collaboration efficiency not improved:** Code conflicts and communication costs remained high, and compatibility problems between the new and old businesses gave developers headaches;
  • **Maintenance cost:** Consul is developed in Go, which made it hard for us to maintain. The development experience with Spring Cloud was also poor: while writing business code we had to learn Spring Cloud best practices and spend time on Spring Cloud infrastructure.

So we decided to start Servitization 2.0.

Expected results of Servitization 2.0:

  • Complete the initial construction of the business middle platform and split the modules into independent services;
  • The new and old merchant center services handle only their own business compatibility and expose interfaces externally;
  • Add C-end web services supporting H5 and mini programs. Because of the poor Spring Cloud experience, we decided to try Dubbo as the RPC framework for the base services in Servitization 2.0, which meant we had to select a registry.

First, we think a registry should have these basic capabilities:

  • Service registration and discovery are timely, and abnormal instances go offline promptly;
  • Service management: the ability to manually restore or remove services;
  • Health checks to verify whether services are available;
  • Metadata management;
  • The registry itself guarantees high availability.

ZooKeeper:

  • It cannot guarantee that every service request is served: when the ZK cluster leader goes down, an election is required, during which the service is unavailable;
  • Cross-datacenter routing is not supported. Eureka zones, for example, can route to another datacenter when the current one is unavailable;
  • The “thundering herd” effect: when there are too many ZK nodes, a change to a service’s nodes notifies all clients at the same time; the instantaneous traffic may saturate the network card, and there can be repeated notifications.

Nacos:

  • The registry part focuses more on availability;
  • Service discovery and service management;
  • Management of service metadata;
  • Dynamic configuration management.

During this period, we also looked at Spring Cloud Alibaba. Alibaba’s technology has stood the test of “Double 11” for many years, so its performance and stability are trustworthy. The open-source community around the Spring Cloud Alibaba components is highly active and easier to communicate with than those of foreign open-source projects. The components are developed in Java, which makes them easier for us to maintain and lets us locate and fix problems more quickly when they occur. Working with Alibaba also makes it easier to move to the cloud; for example, Nacos can work with Alibaba’s MSE and ACM to move the registry and configuration management entirely onto the cloud.

Therefore, we decided to embrace the Alibaba technology stack.

Servitization 2.0 Architecture Diagram:

We extracted the previous modules into base services such as membership, orders, and stores, which act as providers, expose their own services, and register with Nacos. The new merchant center service handles the business logic of the beauty industry, the old merchant center service handles the pan-industry business logic, and the C-end service provides services externally in the same way; remote calls are made through Dubbo.
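To make the calling pattern concrete, here is a minimal, hypothetical sketch (the interface, implementation, and names are illustrative, not our actual code) of a Dubbo 2.7-style base service registered with Nacos and referenced by the new merchant center:

```java
import org.apache.dubbo.config.annotation.Reference;
import org.apache.dubbo.config.annotation.Service;
import org.springframework.stereotype.Component;

// Shared API contract, placed in a common module (hypothetical name).
public interface StoreQueryService {
    String findStoreName(Long storeId);
}

// Provider side, inside the store base service. With a Nacos registry configured,
// e.g. dubbo.registry.address=nacos://<nacos-address>:8848 and dubbo.scan.base-packages
// set in application.properties, this implementation is registered with Nacos on startup.
@Service
class StoreQueryServiceImpl implements StoreQueryService {
    @Override
    public String findStoreName(Long storeId) {
        return "store-" + storeId; // placeholder business logic
    }
}

// Consumer side, inside the new merchant center: Dubbo pulls the provider list
// from Nacos and makes the remote call transparently through the injected proxy.
@Component
class StoreFacade {
    @Reference
    private StoreQueryService storeQueryService;

    public String storeName(Long storeId) {
        return storeQueryService.findStoreName(storeId);
    }
}
```

The registry address is the only place that references Nacos directly; the business code depends on the shared Java interface rather than on any registry- or address-specific detail.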

The effects of Servitization 2.0 are as follows:

  • Server costs reduced by 30%: 20+ servers, downsized from 4-core 16G to 2-core 8G;
  • System reliability improved by 80%: load alarms dropped significantly, and online faults can be fixed and deployed quickly;
  • Code conflicts reduced by 75%: conflicts dropped significantly because boundaries are basically maintained separately;
  • Release iteration efficiency increased by 50%: a team of 5 used to complete 30 story points per iteration and now completes around 45.

Nacos landing practice and problem analysis

At our company, Nacos does more than serve as the registry. Let me talk about how we use Nacos and the problems we encountered.

The first is usage:

  • Deployment mode: single-node deployment in the development and test environments, and a three-node cluster in the production environment;
  • Version: started with 0.9.0 in production, currently on 1.1.4 in production;
  • Usage time: in production since March 2019;
  • Number of services: 20+ servers online, providing more than 600 services;
  • Stability: no major problems in a year, and upgrades have been smooth;
  • Compatibility: compatible with both our old Spring 4.3+ projects and our new Spring Boot projects.

Nacos Registry:

  • Service registration: backend services are registered with Nacos and invoked through Dubbo. We are currently testing Seata in the development environment and registering the Seata service with Nacos as well;
  • Namespace: services are registered in the public namespace.

Nacos Configuration Management:

Each service has a separate namespace.

  • Service configuration files: everything in application.properties is moved into Nacos, and only the Nacos-related configuration is kept in the project’s configuration file;
  • Business-level KV configuration: such as service switches, default values of attributes, and scheduled task configuration (see the sketch after this list);
  • Dynamic configuration of MQ topics: the binlog collection service reads the topics to publish to and the required table information dynamically from the Nacos configuration;
  • Sentinel rule configuration: the Sentinel rate-limiting rules are persisted to Nacos.
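As an illustration of the business-level KV configuration mentioned above, a minimal sketch using the Nacos Java client might look like the following (the server address, namespace, dataId, and group are placeholders, not our actual configuration):

```java
import java.util.Properties;
import java.util.concurrent.Executor;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.PropertyKeyConst;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;
import com.alibaba.nacos.api.exception.NacosException;

public class BusinessSwitchDemo {
    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        props.put(PropertyKeyConst.SERVER_ADDR, "nacos.internal:8848"); // placeholder address
        props.put(PropertyKeyConst.NAMESPACE, "store-service");         // each service has its own namespace

        ConfigService configService = NacosFactory.createConfigService(props);

        // Read the current value of a business switch (dataId/group are illustrative).
        String value = configService.getConfig("store.promotion.switch", "DEFAULT_GROUP", 3000);
        System.out.println("current switch: " + value);

        // Listen for changes so switches, attribute defaults, or scheduled-task settings
        // take effect without redeploying the service.
        configService.addListener("store.promotion.switch", "DEFAULT_GROUP", new Listener() {
            @Override
            public Executor getExecutor() {
                return null; // use the client's default notification thread
            }

            @Override
            public void receiveConfigInfo(String configInfo) {
                System.out.println("switch updated to: " + configInfo);
            }
        });
    }
}
```

The Sentinel rule persistence in the last item works on the same idea: the sentinel-datasource-nacos module provides a NacosDataSource that subscribes to a rule dataId in Nacos, so rule changes pushed to Nacos take effect in the running services.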

Problem Description:

On December 31, 2019, at around 3:15 PM, a large number of service alarms suddenly appeared online and the Dubbo services reported errors. The whole episode lasted about 3 minutes. No business team had released anything that day, and the database was in good condition.

According to the logs, the errors were caused by failed calls to the store service, yet the store service log has no call records during the problem period. When the system returned to normal, there were many service registration notifications.

Therefore, we focused the investigation on Nacos. A review of the Nacos logs showed a large number of services coming back online during the recovery.

During the investigation, the same alarm suddenly appeared online again, and services on the Nacos list began to turn unhealthy in large numbers, so we urgently restarted the online Nacos. We went through more than 3 minutes of turbulence before things calmed down again.

Problem analysis:

  • In both incidents the problem was with the store service, yet the JVM and database were healthy while the problem occurred;
  • All the error messages were Dubbo call timeouts, and the store service received no traffic during the problem;
  • When the problem occurred, the services registered on Nacos began to turn unhealthy in large numbers; when it recovered, services went offline and came back online in bulk.

Putting this together, we began to suspect the network.

Problem confirmation:

Our investigation found that most of our services were deployed in availability zone B of Alibaba Cloud East China 1, and only the store service and the Nacos cluster were not in zone B, which suggested that a network partition had occurred between availability zone B and the other zones during that period.

As a result, we urgently deployed the store service into availability zone B as well, and there have been no further problems.

After communicating with Alibaba, it was confirmed that starting at about 14:05 (Beijing time) on December 31, 2019, some users reported that part of the network in availability zone B of Alibaba Cloud East China 1 was abnormal, affecting access to some cloud resources.

Incident review:

  • Problem occurred: at around 3:00 p.m., a series of service alarms appeared and the Dubbo services reported errors;
  • Nacos: a large number of services in the Nacos service list became unhealthy;
  • Network failure: availability zone B was cut off from the other zones, so services could not be invoked;
  • Emergency deployment: the missing store service was deployed in zone B;
  • Back to normal.

Reflections:

  • Service deployment: application services and Nacos should be deployed across multiple server rooms; even on the cloud, deploy across availability zones;
  • Disaster recovery: when the problem occurred, Nacos was not deployed in availability zone B, yet the services inside zone B could still call each other and read the configuration from Nacos, so we believe the disaster-recovery strategies of both Nacos and Dubbo are reliable.

Review and Outlook:

After rapid iteration, “Yanpu Expert” continues to help beauty-industry merchants manage their stores, operate efficiently, perform data analysis, and run their businesses digitally, building a complete member lifecycle management system and providing beauty-industry merchants with integrated solutions for operation and management, giving the traditional business model of the beauty industry an internet upgrade. To date, we have served more than 3,000 brands and 1.1 million+ stores. We provide six major system capabilities: store management, member management, marketing, big-data decision making, supply chain management, and employee performance management. We also support operation on PC, mobile app, and POS terminals, meeting merchants’ multi-terminal needs and covering all scenarios of store management.

Future planning

Improve system availability

  • **Seata:** Currently our company relies mainly on MQ-based compensation for distributed transactions. This year we plan to introduce Seata to improve distributed transaction handling, ensure data consistency, and reduce the development effort spent on repairing data.

  • **Sentinel:** At present Sentinel is only turned on when merchants run promotions, so we need to work out the best-practice configuration for our company to ensure the high availability of the system;

  • **Full-link tracing:** We currently rely mainly on logs and alarms to locate problems and cannot do full-link tracing, so we need to build this capability to achieve fast fault location, performance analysis of each invocation link, and data analysis.

  • **Remote disaster recovery:** As more and more merchants come from different provinces across China, we need to protect merchants’ data from loss and ensure the reliability of services.

Community feedback

Because we are still a small company, what we can do for now is use the latest releases as much as possible, try out new features, and address issues as we find them; we also hope to contribute to the Nacos open-source community.

**Yin Ming**, Yanpu Technology architect, is responsible for the middleware application and practice of the Yanpu SaaS platform, led the evolution of the platform architecture at Yanpu to a distributed architecture, and is also responsible for the practice and implementation of big data at Yanpu.

“Alibaba Cloud Native” focuses on technical fields such as microservices, Serverless, containers, and Service Mesh, tracks popular cloud-native technology trends and large-scale cloud-native implementation practices, and aims to be the public account that best understands cloud-native developers.