

Amid the wave of digital transformation, cloud computing services have become the preferred way for traditional enterprises to improve business agility and reduce operation and maintenance costs. Zhang Liang, senior solution architect at NetEase Cloud, used a real case from a logistics enterprise customer to show how the architecture of a traditional business system moved to the cloud can meet requirements for high data reliability and high service availability, and summarized the common problems and solutions that arise when traditional businesses move to the cloud.


Cloud migration requirements of a logistics enterprise's business system

For logistics enterprises, internal communication and supply chain coordination are essential to optimizing supply chain efficiency and strengthening core competitiveness. The customer in this case, an industry leader, built an enterprise mobile office platform that integrates instant messaging (IM), the internal ERP and OA systems, and core supply chain information. Internal staff use it for communication and collaboration, meetings, schedules, supply chain information queries, and for querying and processing ERP and OA approval workflows. The platform also lets upstream and downstream partners query supply chain information to support supply chain collaboration. The system was initially deployed in the customer's self-built data center to serve internal staff. To offer similar capabilities to some large partners, the customer also chose NetEase Cloud to support building and operating the system.


The customer's self-built data center differs from the NetEase Cloud infrastructure in several ways. The self-built data center has shared SAN and NAS storage, uses a tape library and commercial backup software to meet data compliance requirements (cold storage, archived data), and runs its MySQL databases on physical machines. NetEase Cloud, as a mature Internet-style service, does not provide a shared block storage service, and its databases run on virtual machines.


The system is offered to third parties as a commercial service, and the first phase of the project covers tens of thousands of users. Given the infrastructure differences, the customer cared most about the following four aspects:


Data security: Data that affects the customer's business, including employee communication records, schedules, and supply chain information, must be protected with the highest level of security and reliability.


Business availability: The business system is a service that the customer provides to its own customers, so the SLA is very important. The customer requires the ability to recover the business quickly in extreme situations (when the entire production site is unavailable).


Functions: Application probing, alarm monitoring, intranet load balancing (LB), and intranet DNS directly affect the customer's O&M capabilities and how easily highly available (HA) services can be deployed.


Performance: The customer's original database ran on physical machines, and the customer worried that virtual machines would not perform well enough. Delays or even crashes under high concurrency must be avoided.


NetEase Cloud cross-data-center DR architecture solution


The customer's original business system sits on a traditional two-data-center architecture with same-city active-passive disaster recovery (DR): if the primary data center fails, the standby data center takes over. On the application side it uses a Redis cache, Kafka, and intranet load balancing, which resembles an Internet-style architecture. A hardware global server load balancer (GSLB) implements inter-site DR. Back-end storage includes traditional SAN, NAS, tape libraries, and virtual tape libraries (VTLs). Most persistent data is stored on these storage devices, and the functions needed to protect it are generally provided by advanced licenses for commercial disk arrays. These licenses are expensive, and the DR capability of the whole solution depends heavily on the DR capability of the disk array: once the disk array fails, the stored data is at risk. In addition, Jetty and the Java applications are designed with both software load balancing and external hardware load balancing.


Deploying the business system on NetEase Cloud was a gradual process of testing the waters and improving step by step. The instant messaging capability of the business system is provided by the SDK of the NetEase Cloud communication service (Yunxin), and the customer builds its own upper-layer encapsulation. The Yunxin service and the application system are deployed in different data centers and communicate over the high-speed internal network of NetEase Cloud. Redis, Kafka, RDS, and NAS each map to a component of the original system architecture. The GSLB was dropped, and Elasticsearch is used for log search.


The next step was to improve cross-data-center DR. For the original system, the O&M department ran a DR switchover drill every six months, so the customer expected DR on the cloud to meet the same requirements for data reliability and service availability.


Depending on the distance between data centers and the network latency, DR falls into two types. Cross-availability-zone (cross-AZ) DR is generally used for services that need to run active-active. Cross-region DR is usually remote: the two data centers are far apart, and the network latency cannot meet active-active requirements. NetEase Cloud designed a cross-AZ DR solution for the customer. The customer initially built its own Redis and Kafka on cloud hosts, but later found that it had to take care of monitoring and high availability itself, whereas these functions are built into the existing Redis and Kafka services of NetEase Cloud.




NetEase Cloud provides the following DR services:


Load balancing: Each data center has two NLB instances (active and standby), but customer traffic sees only a single public IP address. If AZ1 at the production site goes down, the IP address floats over and is associated with AZ2 at the DR site.


Data persistence: There are two RDS high-availability instances (active and standby) at the production site and one at the DR site, all accessed through internal DNS. Applications are unaware of an RDS failure on one side, because the system automatically resolves the domain name to the RDS instance at the DR site.


File system service: Data is synchronized through cross-data-center replication, so users do not need to worry about data persistence.


Kafka: MirrorMaker synchronizes data across data centers, so consumers can consume Kafka messages at the DR site. If only some services fail, the system can provide cross-data-center access, although latency will be somewhat higher than within a single data center. Whether that is acceptable depends on how sensitive the application is to data access latency. If the latency requirements are very strict, the application may need to be unitized: minimizing cross-data-center access and keeping latency-sensitive requests within one data center is a matter of business architecture design.
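
The DNS-based failover described for RDS in the list above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how NetEase Cloud implements its internal DNS: the `Endpoint` and `InternalDns` classes, the name `rds.internal.example`, and the IP addresses are all hypothetical. It shows why the application does not need to change anything: it keeps resolving the same name and simply receives a different address once the production instance is marked unhealthy.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    az: str            # availability zone the instance lives in
    address: str       # private IP of the database instance (made up)
    healthy: bool = True


class InternalDns:
    """Toy internal DNS: each name maps to endpoints in priority order."""

    def __init__(self):
        self._records = {}

    def register(self, name, endpoints):
        self._records[name] = list(endpoints)

    def resolve(self, name):
        # Return the first healthy endpoint; later entries act as DR targets.
        for endpoint in self._records[name]:
            if endpoint.healthy:
                return endpoint.address
        raise LookupError(f"no healthy endpoint for {name}")


production = Endpoint("AZ1", "10.0.1.10")
disaster_recovery = Endpoint("AZ2", "10.0.2.10")
dns = InternalDns()
dns.register("rds.internal.example", [production, disaster_recovery])

print(dns.resolve("rds.internal.example"))  # production address while AZ1 is up
production.healthy = False                   # simulate an AZ1 outage
print(dns.resolve("rds.internal.example"))  # same name, now the DR address
```

In a real deployment the health flag would come from monitoring or probing, and DNS caching (TTL) on the client side would affect how quickly applications notice the change.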


Zhang Liang added that NetEase Cloud supports applications accessing RDS in another data center, but customers need to deploy DR between applications themselves. The NLB can determine, based on intranet DNS, which AZ a request comes from, so that applications access the correct underlying services.


The NLB does not provide cross-data-center high-availability instances, and the external IP addresses differ, so an external service such as GSLB or DNS is required to switch traffic. Other replication, such as for RDS and Kafka, depends on VPC interconnection: some back-end operations on the core switches are needed to connect the two VPCs (their internal networks are isolated by default). Otherwise, the two VPCs have to be connected over the public network, with traffic routed back in from the front end, and they cannot reach each other directly over the internal dedicated line between data centers.


The cross-region RPO (recovery point objective) is worse than the cross-AZ one. The RPO of cross-AZ RDS can be 0, because a write is reported as successful only after the data has been written on both sides. If a write succeeded at the primary site but had not yet been copied to the DR site when the primary site failed, the data would be permanently lost. In the traditional Oracle RAC architecture, too, active-active can be achieved only with dual writes.
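
The difference between the two write paths can be made concrete with a toy model. The sketch below is an illustration only and makes no claim about how RDS replication is implemented: `Site`, `SyncPair`, `AsyncPair`, and `lost_on_primary_failure` are hypothetical names. A synchronous pair acknowledges a write only after both sites hold it, so nothing acknowledged can be lost (RPO = 0). An asynchronous pair acknowledges first and ships later, so whatever is still in the backlog when the primary fails is lost.

```python
class Site:
    """A data center holding an ordered log of committed records."""

    def __init__(self, name):
        self.name = name
        self.log = []


class SyncPair:
    """Acknowledge a write only after both sites have stored it."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, record):
        self.primary.log.append(record)
        self.standby.log.append(record)
        return True  # acknowledged after both copies exist


class AsyncPair:
    """Acknowledge after the primary write; replicate in the background."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.backlog = []

    def write(self, record):
        self.primary.log.append(record)
        self.backlog.append(record)
        return True  # acknowledged before the standby has the record

    def ship(self):
        # Copy everything queued so far to the standby site.
        self.standby.log.extend(self.backlog)
        self.backlog.clear()


def lost_on_primary_failure(pair):
    """Acknowledged records missing at the standby if the primary fails now."""
    return [r for r in pair.primary.log if r not in pair.standby.log]


sync_pair = SyncPair(Site("AZ1"), Site("AZ2"))
for record in ["w1", "w2", "w3"]:
    sync_pair.write(record)
print(lost_on_primary_failure(sync_pair))   # nothing is lost

async_pair = AsyncPair(Site("region-A"), Site("region-B"))
async_pair.write("w1")
async_pair.write("w2")
async_pair.ship()
async_pair.write("w3")                      # primary fails before the next ship()
print(lost_on_primary_failure(async_pair))  # the unshipped record is lost
```

The price of the synchronous path is that every write waits for the slower site, which is why it is practical between nearby AZs and not across distant regions.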


Therefore, cross-region DR is generally used as the remote DR tier in a "two sites, three centers" layout. If service availability requirements are high, the cross-AZ DR deployment architecture is recommended. For the customer, the cloud-based cross-AZ DR solution differs from the self-built data center in terms of DR, monitoring, and service usage, but it meets the RPO and RTO requirements.


Zhang Liang added that the cross-AZ DR switchover is carried out by the customer, so the RTO depends on how long the customer's operations team takes to detect the failure and how long it takes to get the services at the DR site configured. The reason is that the customer's application configuration is fairly traditional: back-end service IP addresses and DNS domain names are written into configuration files in advance, rather than being published dynamically through a service registry/service discovery mechanism. If the application is refactored into microservices, there is still a lot of room to improve the RTO.
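
To illustrate the contrast between addresses fixed in configuration files and a service registry/service discovery mechanism, here is a minimal in-process sketch. It is an assumption-laden toy rather than any particular registry product: `ServiceRegistry`, its methods, the service name `order-db`, and the addresses are hypothetical. The point is that the application asks the registry for an address each time it connects, so a switchover becomes a registry update instead of an edit to every application's configuration.

```python
class ServiceRegistry:
    """Toy registry: service name -> list of currently available addresses."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service, address):
        self._instances[service].remove(address)

    def lookup(self, service):
        instances = self._instances.get(service, [])
        if not instances:
            raise LookupError(f"no instance registered for {service}")
        return instances[0]


# Static style: the address is frozen when the configuration file is written.
STATIC_CONFIG = {"order-db": "10.0.1.10"}

# Dynamic style: the address is looked up whenever a connection is made.
registry = ServiceRegistry()
registry.register("order-db", "10.0.1.10")   # instance in AZ1

print(registry.lookup("order-db"))           # AZ1 address

# Switchover: the AZ1 instance is withdrawn and the AZ2 instance published.
registry.deregister("order-db", "10.0.1.10")
registry.register("order-db", "10.0.2.10")

print(registry.lookup("order-db"))           # AZ2 address, no config edit
print(STATIC_CONFIG["order-db"])             # still the failed AZ1 address
```

With the static dictionary, someone has to edit and redeploy the configuration before the application recovers, and that time counts toward the RTO.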


Common problems and countermeasures when moving to the cloud


For traditional industries that use hosted IDCs or build their own data centers, the change in infrastructure must be considered when moving to the cloud. Based on NetEase Cloud's experience supporting traditional businesses on the cloud, Zhang Liang concluded by summarizing the common problems and their solutions.


Shared block storage is rarely available on the cloud. The Always On failover cluster instance high-availability solution provided by SQL Server requires shared block storage. However, enterprises can now use SQL Server high-availability solutions that do not require shared storage. Solutions supported by SQL Server 2012 and later, such as log shipping and Always On availability groups, do not rely on any form of shared storage.





VPC networks do not support broadcast or multicast, yet some applications rely on broadcast or multicast to form clusters. The options are as follows.


Keepalived: Upgrade to a newer version that supports unicast. With unicast peers specified in the configuration file, Keepalived works properly in a VPC network.


Tomcat: By default, Tomcat cluster members discover each other through multicast. There are two options. The first is static cluster membership, in which the cluster nodes are listed manually in the configuration file. This makes elastic scaling cumbersome and suits only small-scale services. The second is to make Tomcat stateless: move sessions out of Tomcat and into a service such as Redis.


N2N: With the N2N peer-to-peer VPN, you can build another VPN network on top of the VPC network to carry broadcast traffic. However, this mode incurs a performance loss. It is suitable for POC (proof-of-concept) testing and should be used with caution in production environments.
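
The stateless option mentioned for Tomcat above can be sketched in a few lines. This is a language-neutral illustration written in Python rather than a Tomcat configuration, and every name in it is hypothetical: `InMemoryStore` merely stands in for an external key-value service such as Redis, and `AppNode` stands in for an application server. Because session data lives in the shared store, any node can serve any request, and the nodes never need to discover each other over broadcast or multicast.

```python
import json
import uuid


class InMemoryStore:
    """Stand-in for an external key-value service such as Redis (assumption)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class AppNode:
    """A stateless application node: sessions live in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        # Create a session and persist it outside the node.
        session_id = uuid.uuid4().hex
        self.store.set(f"session:{session_id}", json.dumps({"user": user}))
        return session_id

    def whoami(self, session_id):
        # Any node can read the session, whichever node created it.
        raw = self.store.get(f"session:{session_id}")
        return json.loads(raw)["user"] if raw is not None else None


shared_store = InMemoryStore()
node_a = AppNode("node-a", shared_store)
node_b = AppNode("node-b", shared_store)

session = node_a.login("alice")   # the load balancer sent the login to node-a
print(node_b.whoami(session))     # a later request lands on node-b
```

Scaling out then only means starting another node that points at the same store, with no cluster membership list to maintain.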


No physical tape library. Some industries have compliance requirements under which infrequently accessed cold data is stored in tape libraries. On the cloud there are two solutions. The first is a VTL storage gateway provided by the cloud service provider: backup software connects to the gateway and uses the back end as if it were a tape library. The second is that many backup software products support the object storage services of mainstream cloud service providers, so data can be backed up directly to object storage from within the backup software, at a much lower cost than disk arrays or NAS.



