Author | Lin Yuzhi (alias: Yuansan)

Senior expert at Ant Group, focusing on microservices and service discovery

Proofreading | Xudong Li

8,624 words | about an 18-minute read

Overview |

Service discovery is one of the most important dependencies for building distributed systems. In Ant Group, this responsibility is carried by the registry and Antvip: the registry provides service discovery within a data center, while Antvip provides service discovery across data centers.

This article focuses on the registry and its multi-cluster deployment pattern (one cluster per IDC); data synchronization between clusters is not covered.

PART 1. Looking Back

Looking back at the evolution of the registry in Ant Group, it began around 2007/2008 and has been evolving for more than 13 years. Both the shape of the business and the registry's own capabilities have changed dramatically since then.

A brief review of the history of the registry:

V1: ConfigServer introduced from Taobao

V2: Horizontal scaling

Starting from this version, Ant and Alibaba evolved independently; the main difference lies in the choice of data storage. Ant chose horizontal scaling with sharded data storage, while Alibaba chose vertical scaling by increasing the memory of the data nodes.

This choice affected the storage architectures of SOFARegistry and Nacos years later.

V3 / V4: LDC support and disaster recovery

V3 added support for LDC units.

V4 added a decision mechanism and a runtime node list, removing the need for manual intervention when a single machine goes down, which improved availability and reduced operation and maintenance costs to some extent.

V5: SOFARegistry

The first four versions were called ConfReg. In 2017, the V5 project, SOFARegistry, was launched with the following objectives:

1. Code maintainability: ConfReg carries a heavy historical code burden

  • A few modules use Guice for dependency management, but most modules interact through static classes, making it hard to separate core modules from extension modules and unfriendly to open sourcing the product.

  • The client-server interaction model is deeply nested and complex, expensive to understand, and unfriendly to multi-language support.

2. O&M pain points: Raft was introduced to solve the serverList maintenance problem, and O&M of the whole cluster (including Raft) was simplified with an operator.

3. Robustness: a multi-replica backup mechanism was added on the consistent hash ring (three replicas by default); even with two replicas down, the business does not notice.

4. Cross-cluster service discovery: within Ant, cross-cluster service discovery requires additional Antvip support, and we hoped to unify the two sets of facilities. Meanwhile, commercial scenarios also require cross-data-center data synchronization.

Some of these goals were achieved, but others fell short. For example, O&M still has pain points, and cross-cluster service discovery faces serious stability challenges under the large-scale data of the main site.

V6: SOFARegistry 6.0

In November 2020, the SOFARegistry team took stock of the lessons learned from internal and commercial deployments and kicked off a large-scale refactoring for version 6.0 to address future challenges.

It took 10 months to complete the development and rollout of the new version, while also landing application-level service discovery.

PART 2. Challenges

The problems we face today

Cluster size challenges

  • Data growth: as the business grows, the number of business instances grows, and the number of pubs/subs grows accordingly. Taking one cluster as an example, with 2019 as the baseline, pubs were already close to ten million in 2020.

The figure below compares this cluster's data on Singles' Day over the years and shows the effect of the application-level optimization. Compared with Singles' Day 2019, interface-level pubs and subs grow by 200% and 80% respectively in 2021.

  • Larger fault blast radius: the more instances a cluster serves, the more services and instances are affected by a fault. Ensuring stability is the most basic requirement and has the highest priority.

  • Horizontal scalability is put to the test: once a cluster reaches a certain scale, whether it can still scale horizontally becomes the question. Scaling from 10 nodes to 100 is not the same as scaling from 100 to 500.

  • High availability: as the number of cluster nodes increases, so does the overall hardware failure rate. Can the cluster recover quickly from all kinds of machine failures? Anyone with O&M experience knows that the difficulty of operating a large cluster grows exponentially compared with a small one.

  • Push performance: most service discovery products choose eventual consistency for their data, but how long does "eventual" take at different cluster sizes? Hardly any product gives clear numbers.

We actually regard this metric as the core of a service discovery product. The duration directly affects calls: newly added addresses receive no traffic, and removed addresses are not taken out of rotation in time. Inside Ant Group, PaaS constrains the registry's push delay with an SLO: if a change is not pushed within the agreed time, the address list on the business side is considered wrong. We have had failures in our history caused by delayed pushes.

The growth of business instances also puts pressure on push performance: the number of publisher instances per pub grows, and so does the number of subscriber instances. A simple estimate: if both pubs and subs double, the amount of data pushed grows by 2 × 2 = 4 times, since it is a product relationship. Push performance also determines how many concurrent O&M operations the cluster can sustain; for example, in an emergency scenario where services are restarted at scale, a push bottleneck directly lengthens the recovery time.
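As a back-of-envelope illustration of this product relationship, here is a minimal sketch (all numbers are hypothetical, not production figures):

```java
public class PushVolumeEstimate {
    // Rough model: each subscriber of a service receives the full publisher list on a change,
    // so the pushed data volume scales with publishers * subscribers per service.
    static long pushedRecords(long pubsPerService, long subsPerService, long services) {
        return pubsPerService * subsPerService * services;
    }

    public static void main(String[] args) {
        long baseline = pushedRecords(100, 200, 10_000);
        long doubled  = pushedRecords(200, 400, 10_000); // pubs and subs both double
        System.out.println((double) doubled / baseline); // prints 4.0: a 2x * 2x = 4x increase
    }
}
```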

Cluster scale can be considered the biggest challenge: the core architecture determines the upper limit, and retrofitting it is very costly. Often, by the time we discover the bottleneck we are already in the middle of firefighting. We need an architecture that raises the technical ceiling of the product.

Operation and maintenance challenges

One of SOFARegistry's main goals at project launch was to provide better O&M capability than ConfReg: a Meta role was introduced as the cluster's control plane, elected via Raft and storing meta information. But facts proved that we still underestimated the importance of operations. As the joking quote attributed to Lu Xun goes: "A programmer's job consists of only two things: one is operations, the other is operations."

Measured against today's needs, the goal set three years ago has fallen seriously short.

  • Growth in the number of clusters: Ant Group's services are deployed at different sites (each site is an independent deployment with its own isolation requirements), multiple clusters must be deployed at each site, and development requires multiple environments. The number of deployed sites has grown beyond our imagination; there are now hundreds of clusters, still growing rapidly, at a rate reminiscent of the Federal Reserve's money supply in recent years. In the past we could tolerate rough edges in some O&M work, but as the number of clusters grew, failures became so frequent that they consumed the energy of the development and O&M engineers, leaving no bandwidth to plan for the longer term.

  • Business interruption: business O&M runs 24/7: capacity auto-scaling, self-healing, a site-wide MOSN upgrade once a month per version, and so on. The figure below shows the number of machine batches under O&M per minute; as you can see, O&M tasks never stop, even on weekends and late at night.

Engineers at Ant Group are all too familiar with (and tired of) the registry's O&M announcements. Because the business is so sensitive, the registry has to stop pushing during releases and O&M operations, which means locking release/restart actions for the whole site. To minimize the impact on the business, the registry team can only sacrifice sleep (and hair) and perform these operations in the late-night low-traffic window. Even so, zero interruption to the business is still out of reach.

Naming challenges in the cloud native era

In the era of cloud native technology, some trends can be observed:

  • The spread of microservices and FaaS leads to more lightweight applications: more instances are needed to support a larger business scale

  • Application instance lifecycles get shorter: on-demand FaaS, auto-scaling and similar mechanisms make instances come and go more frequently, so the registry's performance largely determines how quickly instance changes take effect

  • Multi-language support: in the past Ant Group's main development stack was Java, and non-Java languages were second-class citizens when connecting to the infrastructure. With AI and new business lines, non-Java scenarios keep growing. Maintaining an SDK per language would be a nightmare. A sidecar (MOSN) is of course the answer, but can we also support a low-intrusion, or even SDK-free, access mode?

  • Service routing: in most past scenarios endpoints were treated as equal, and the registry only provided an address list for communication. In Mesh's fine-grained routing scenario, Pilot provides both EDS (the address list) and RDS (routing), so the registry needs to enrich its capabilities accordingly.

  • K8s: Kubernetes has become the de facto distributed operating system. How do we bridge K8s Services with the registry? Going further, can service discovery for K8s Services across multiple clusters be solved?

“Summary”

To sum up, besides keeping our feet on the ground to solve today's problems, we also need to look up at the stars: exploring how to solve the naming challenges of the cloud native era is also a major goal of the V6 refactoring.

PART 3. SOFARegistry 6.0: Efficiency-Oriented

SOFARegistry 6.0 is more than just a registry engine: it works together with the surrounding facilities to improve the efficiency of development, operations, and emergency response, and addresses the following issues (the modules in red in the figure are the more challenging areas).

SOFARegistry 6.0-related work includes:

Architecture optimization

The idea of the architecture transformation: keep V5's data sharding architecture, and focus on optimizing meta-information consistency and on ensuring that correct data is pushed.

Meta information consistency

V5 introduced Raft-based strong consistency in the meta role to elect a leader and store meta information, including the node list and configuration. Sharding data by consistent hashing over the node list fetched from meta has two problems:

  • Raft/operator are complex to operate and maintain

    (1) Custom O&M procedures are needed to support orchestration such as change-peer. Inside Ant Group, these specialized procedures are costly and unfriendly to commercial output.

    (2) Implementing a robust operator is very costly, including integrating it with change control and managing changes to the operator itself.

    (3) It is sensitive to network and disk availability. In output scenarios the hardware can be poor, and the troubleshooting cost is high.

  • Fragile strong consistency

The meta information is maintained with strong consistency, but if a network problem occurs, for example a partition in which a session node cannot reach meta, an incorrect routing table may cause the data to be split. A mechanism is needed so that data correctness can still be maintained for a short period even when the meta information is inconsistent, leaving a buffer for emergency response.

Pushing correct data

When data nodes undergo large-scale O&M, the node list changes drastically, causing continuous data migration and risking data integrity and correctness. V5 mitigated this with three replicas: as long as one replica is intact, the data is correct. However, this constraint places a heavy burden on the O&M process.

In V5 and earlier, O&M was rather crude: pushing is stopped during a release, PaaS is locked to block business changes, and pushing is re-enabled only after the data nodes stabilize, to avoid the risk of pushing wrong data.

This covers planned O&M work, but there is still the risk of several data nodes going down unexpectedly at the same time.

We need a mechanism that, when data migration is triggered by changes in the data node list, trades a small, acceptable extra push delay for guaranteed correctness of the pushed data.

“Achievement”

  • Meta storage/election components are made pluggable: instead of Raft, a DB is used for leader election and for storing configuration information, reducing O&M costs (a lease-based sketch follows this list).

  • Data is sharded into a fixed number of slots, and meta provides the scheduling capability. Slot scheduling information is stored in the slotTable; session/data nodes tolerate weak consistency of this information, which improves robustness.

  • Multi-replica scheduling reduces the cost of data migration when data nodes change. With the current online data volume, promoting a follower to leader takes about 200 ms (the follower already holds most of the data), whereas directly assigning the leader role to a node without the data would require 2 s to 5 s of data synchronization.

  • Data communication and replication links are optimized for better performance and scalability.

  • Large-scale O&M no longer needs to lock PaaS late at night, reducing business disruption, protecting the hair of the O&M engineers, and improving their happiness.
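The article does not spell out how the DB-based election works; one common pattern is a lease row that candidate nodes try to grab or renew, roughly as sketched below (the `meta_leader` table and its columns are hypothetical, not the actual SOFARegistry schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;

/** Lease-based leader election on a shared DB table: whoever grabs or renews the lease is leader. */
public class DbLeaderElector {
    private static final long LEASE_MILLIS = 10_000;

    /** Returns true if this node now holds (or has just renewed) leadership. */
    static boolean tryAcquire(Connection conn, String nodeId) throws Exception {
        // Take over the single leader row if the previous lease expired, or renew it if we own it.
        String sql = "UPDATE meta_leader SET owner = ?, lease_until = ? "
                   + "WHERE lease_until < ? OR owner = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            Instant now = Instant.now();
            ps.setString(1, nodeId);
            ps.setTimestamp(2, Timestamp.from(now.plusMillis(LEASE_MILLIS)));
            ps.setTimestamp(3, Timestamp.from(now));
            ps.setString(4, nodeId);
            return ps.executeUpdate() == 1; // exactly one leader row exists in the table
        }
    }
}
```

In this pattern the losing nodes simply retry on a timer; because the shared table is the single source of truth, no dedicated Raft cluster has to be operated.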

Data link and Slot scheduling:

  • Slot sharding follows the Redis Cluster approach of virtual hash slots: every dataId is mapped by a hash function to an integer slot between 0 and N.

  • The meta leader learns the list of live data nodes through heartbeats, assigns the slot replicas to data nodes as evenly as possible, stores the mapping in the slotTable, and notifies session/data nodes of any changes.

  • Session/data nodes also fetch the latest slotTable through their heartbeats, guarding against the risk of a meta notification being lost.

  • A Migrating -> Accept -> Moved state machine runs on the data node for each slot; the slot's data is served only once it is up to date, ensuring the integrity of pushed data (see the sketch below).
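A minimal sketch of the slot routing just described (the hash function, slot count, and type names are illustrative, not the actual SOFARegistry implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

/** Illustrative slot routing: dataIds hash into a fixed number of slots; meta assigns slots to data nodes. */
public class SlotTableSketch {
    static final int SLOT_NUM = 256; // hypothetical; the real slot count is cluster configuration

    /** Map a dataId to a slot id, Redis-Cluster style (virtual hash slots). */
    static int slotOf(String dataId) {
        CRC32 crc = new CRC32();
        crc.update(dataId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % SLOT_NUM);
    }

    /** One slotTable entry: which data node leads the slot and which nodes hold follower copies. */
    record SlotEntry(int slotId, long epoch, String leaderDataNode, List<String> followers) {}

    /** Per-slot state on a data node; data is served only once the slot reaches Moved. */
    enum SlotState { MIGRATING, ACCEPT, MOVED }

    public static void main(String[] args) {
        System.out.println(slotOf("com.example.DemoService:1.0")); // deterministic slot id
    }
}
```

In this sketch a session node would route a dataId by computing its slot and looking up the current leader in its local copy of the slotTable; the epoch field lets it discard a stale table.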

Data migration with data node changes:

We tested the push capability in a cluster with 100K+ connected clients: push volume reaches 12M per minute, while push delay P999 stays below 8 s. Session CPU sits at 20% and data CPU at 10%, so the physical resource watermark is low and there is ample push headroom.

We also verified horizontal scalability online by expanding the cluster to 370 session nodes, 60 data nodes, and 3 meta nodes. Because meta processes the heartbeats of all nodes, its CPU reaches 50%, which calls for vertical scaling to 8 cores or further optimization of the heartbeat overhead. Taking 2 million pubs as the safe watermark of a single data node (each pub costs about 1.5 KB), and reserving headroom so the cluster can still serve with one third of the data nodes down as pubs grow, the cluster can support 120 million pubs; with two copies configured, it can support 60 million pubs.

Application-level service discovery

The registry keeps the pub format very flexible. When RPC frameworks implement service discovery, many map one interface to one pub; SOFA, HSF, and Dubbo 2 all use this model, which is quite natural, but it makes the pub/sub count and the push volume swell considerably.

Dubbo 3 proposed application-level service discovery and its underlying principles [1]. In its implementation, SOFARegistry 6.0 follows Dubbo 3's approach, integrating a metadata center module on the session side, plus some compatibility adaptations.

“Application Level Service PUB Data Split”
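A rough sketch of this data split with illustrative record types (not the actual SOFARegistry data model): at the interface level every interface of an instance becomes its own publisher record, while at the application level the instance publishes once and the per-interface details move into the metadata center.

```java
import java.util.List;

public class ServiceDataModels {
    /** Interface-level model: one publisher record per (interface, instance); N interfaces => N pubs. */
    record InterfacePublisher(String dataId /* interface name + version etc. */,
                              String instanceIp, int port) {}

    /** Application-level model: one publisher record per instance, keyed by application name. */
    record AppPublisher(String appName, String instanceIp, int port,
                        String revision /* points at the interface list in the metadata center */) {}

    /** Metadata-center entry shared by all instances of the same application revision. */
    record AppMetadata(String appName, String revision, List<String> interfaces) {}
}
```

With this shape, an instance exposing, say, 50 interfaces shrinks from 50 publisher records to a single one, which is where the pub reduction reported below comes from.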

“Compatibility”

A difficulty with application-level service discovery is keeping interface-level and application-level discovery compatible at low cost. Although most applications can eventually be upgraded to the application level, the upgrade process faces the following problems:

  • There are many applications, and their upgrade times are spread far apart

  • Some applications cannot be upgraded at all, such as certain legacy systems

Our solution adopts application-level discovery while remaining compatible with the interface level:

During the upgrade, two versions of SOFARegistry coexist, each reachable through its own domain name. Upgraded applications (MOSN in the figure) use dual subscription and dual publication, and traffic is switched over gradually via gray release. During the switchover, applications that are not connected to MOSN, or have not enabled the feature, are unaffected.

After most applications migrate to SOFARegistry 6.0, a few applications remain that are not connected to MOSN; these old applications are also pointed to SOFARegistry 6.0 via the domain name and keep publishing and subscribing at the interface level. To let upgraded and non-upgraded applications discover each other, two kinds of support were added:

  • Conversion from application-level publishers to interface-level publishers: interface-level subscribers cannot directly consume application-level publish data, so for interface-level subscriptions an AppPublisher is converted into an InterfacePublisher on demand (as sketched after this list). Applications without MOSN can then subscribe to this data smoothly; since only a few applications lack MOSN access, only a small number of application-level publishers need to be converted.

  • Application-level subscribers additionally issue an interface-level subscription to receive data from applications that have not been upgraded. Because such applications are few, the vast majority of these interface-level subscriptions never trigger push tasks, so they add no push pressure.
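A minimal sketch of the on-demand conversion from the first bullet, reusing the hypothetical record types from the earlier sketch (the real conversion carries more attributes):

```java
import java.util.List;
import java.util.stream.Collectors;

public class AppToInterfaceConverter {
    /**
     * Expand one application-level publisher into interface-level publishers so that
     * interface-level subscribers (applications not yet on MOSN) can still discover it.
     */
    static List<ServiceDataModels.InterfacePublisher> convert(
            ServiceDataModels.AppPublisher appPub,
            ServiceDataModels.AppMetadata metadata) {
        return metadata.interfaces().stream()
                .map(itf -> new ServiceDataModels.InterfacePublisher(
                        itf, appPub.instanceIp(), appPub.port()))
                .collect(Collectors.toList());
    }
}
```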

“Effect”

The figure above shows the effect in one cluster after switching to the application level. The interface-level pubs that remain after the switch are the converted data kept for compatibility, and the interface-level subs are not reduced because they are kept for compatibility with interface-level publishing. Leaving compatibility aside, pub data drops by up to 97%, greatly reducing the data scale and the pressure on the cluster.

SOFARegistryChaos: automated testing

The registry's eventual consistency model has always been a testing challenge:

  • How long does it take to converge?

  • Is any wrong data pushed before convergence?

  • Is any data missing from the pushes before convergence?

  • What is the impact on data correctness and latency when cluster failures or data migration occur?

  • What happens when clients call the APIs frequently and in arbitrary order?

  • What happens when clients disconnect and reconnect frequently?

To answer these questions we built SOFARegistryChaos, which provides complete testing capabilities dedicated to eventual consistency, on top of functional, performance, large-scale stress, and chaos testing. Through a plug-in mechanism it can also be hooked up to test other service discovery products, and its K8s-based deployment lets us spin up the test components quickly.

With these capabilities we can test not only our own product; for example, we can also quickly measure ZooKeeper's behavior as a service discovery system for comparison.

Test observability

Key data points are exposed as metrics and visualized through Prometheus (a small export sketch follows these lists):

  • Push delay

  • The configured time limit for reaching eventual consistency

  • The points in time at which faults were injected

  • Completeness of the pushed data while converging to eventual consistency

Testing this last point is an interesting innovation: we pin a subset of clients and their corresponding pubs, so every push triggered by other changes must still carry that pinned data completely and correctly.

  • Push count

  • Pushed data volume
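For illustration, here is how metrics of this kind could be exported with the Prometheus Java simpleclient (a sketch only; the metric names and port are hypothetical, not what SOFARegistryChaos actually exposes):

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class ChaosMetricsSketch {
    // Push latency distribution, observed at the test client when a change is pushed to it.
    static final Histogram PUSH_DELAY = Histogram.build()
            .name("registry_push_delay_seconds")
            .help("Time from data change to push received by the subscriber")
            .buckets(0.5, 1, 2, 4, 8, 16)
            .register();

    // Pushes whose content did not match the expected publisher set.
    static final Counter WRONG_PUSH = Counter.build()
            .name("registry_wrong_push_total")
            .help("Pushes containing incomplete or stale data")
            .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(9091); // expose /metrics for Prometheus to scrape
        PUSH_DELAY.observe(1.2);                  // record one observed push delay, in seconds
        WRONG_PUSH.inc();                         // record one bad push
        server.stop();
    }
}
```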

Troubleshooting failed cases

In the test scenarios, the timing of client operations and fault injection is randomized. All operation commands and their timing are recorded and collected by the SOFARegistryChaos master, so when a case fails we can quickly locate the problem from the details of the failed data and the API call sequence of each client.

For example, the failed case below shows a subscriber on one node failing verification for a dataId: it should have been pushed an empty list, but one piece of data was pushed instead. The case also shows the operation traces of all publishers associated with that dataId during the test.

Black box detection

Have you ever run into situations like these:

  • The business suddenly reports a problem with your system, and you are baffled: the system shows no anomaly at all

  • A fault in the system has already had a serious impact on the business

Because of the registry's nature, its impact on the business is often delayed. For example, if only 1K out of 2K IP addresses are pushed, the business does not immediately notice anything wrong, yet something already is wrong. A registry therefore needs, even more than other systems, the ability to detect such latent problems early.

Here we introduce black-box probing: simulate user behavior in a broad sense and check whether the whole link works.

SOFARegistryChaos is essentially an (enhanced) user of the registry, which makes it a natural fit for providing end-to-end alerting.

We deployed SOFARegistryChaos online with a small amount of traffic as a monitoring item. This gives us a chance to intervene while a registry anomaly has not yet had a perceptible impact on the business, reducing the risk of an incident escalating into a major failure.
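In spirit, such a black-box probe can be as small as the loop below: act as a real client, publish a marker service, subscribe to it, and alert if the push does not arrive within the SLO. This is a hedged sketch with a hypothetical minimal client interface, not the SOFARegistryChaos implementation.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** End-to-end probe: behave like a real client and check that a publish is pushed back within the SLO. */
public class BlackBoxProbe {
    /** Hypothetical minimal client surface; a real probe would use the registry's own SDK. */
    interface RegistryClient {
        void publish(String dataId, String address);
        void subscribe(String dataId, Consumer<List<String>> onPush);
        void unpublish(String dataId, String address);
    }

    /** Returns true if the probe address is pushed back within the SLO; false should raise an alert. */
    static boolean probe(RegistryClient client, long sloMillis) {
        String dataId = "blackbox.probe." + System.nanoTime();
        String address = "127.0.0.1:12200";
        CompletableFuture<List<String>> pushed = new CompletableFuture<>();
        client.subscribe(dataId, pushed::complete);
        client.publish(dataId, address);
        try {
            return pushed.get(sloMillis, TimeUnit.MILLISECONDS).contains(address);
        } catch (Exception e) {
            return false; // SLO breached or probe broken: trigger an end-to-end alert
        } finally {
            client.unpublish(dataId, address); // clean up the marker publish
        }
    }
}
```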

Sharpening the axe does not delay the cutting of firewood

With SOFARegistryChaos, the efficiency of verifying core capabilities improved greatly while quality stayed under control, and writing code became easier for the developers. In the three and a half months from July to mid-October we iterated through five releases, roughly one every three weeks. This development pace was previously unimaginable, and we gained solid end-to-end alerting along the way.

Operation and maintenance automation

Nightly build

Although we have a large number of clusters, they span multiple environments, and some of them, such as those below the gray-release stage, have slightly lower stability requirements than production. Can new versions be rolled out to these clusters quickly and cheaply while still guaranteeing quality? Together with our quality and SRE colleagues, and building on SOFARegistryChaos, we are setting up a nightly build facility.

SOFARegistryChaos acts as the admission gate for changes: a new version is deployed automatically, passes the SOFARegistryChaos test suite, and is then automatically rolled out to the clusters below the gray-release stage; manual intervention is only needed for the production release.

The nightly build greatly reduces the cost of releasing to non-production environments and lets new versions be validated with real traffic as early as possible.

Failure drills

We do a lot of quality-related work, but how do we actually fare online when various failures hit? As the saying goes, whether it's a mule or a horse, you have to take it out for a walk to find out.

Together with the SRE team we regularly run online disaster recovery drills, covering network faults, large-scale machine outages, and more. Drills cannot be a one-off: disaster recovery capability that is not regularly exercised is effectively zero. In the simulation and gray-release clusters, we make disaster recovery routine and keep cycling through drill and iteration.

Fault localization and diagnosis

Once disaster recovery drills become routine, quickly locating the source of a fault becomes the next problem on the table; otherwise drills are too inefficient.

SOFARegistry has invested heavily in the observability of every node, and SRE's diagnostic system performs real-time diagnosis on that data: in the example shown, a session node failure causes an SLO breach. With this localization capability, the self-healing system can also kick in; for instance, when a network fault is diagnosed on a session node, the self-healing system can automatically replace the faulty node.

Today most of our disaster recovery drills and emergency responses require no human intervention, which is the only way drills can be run routinely at low cost.

“Gains”

By exposing problems through constant practice and rapid iterative fixes, SOFARegistry improves stability over time.

“Summary”

Beyond optimizing the registry itself, SOFARegistry 6.0 invests heavily in testing, operations, and emergency response. The goal is to improve the efficiency of R&D, quality, and operations engineers, freeing them from inefficient manual toil and improving their happiness.

PART 4. Open Source: One Person Can Go Fast, but a Group Goes Farther

SOFARegistry is an open source project and an important part of the SOFAStack open source community. We want the community, not just Ant Group engineers, to drive SOFARegistry forward.

Over the past year the open source side of SOFARegistry has stood still because of our focus on the 6.0 refactoring, and this is an area where we have not done well.

We have a community plan for the next six months: in December we will open source 6.0 based on the internal version. The open source code contains all the core capabilities of the internal version; the only difference is that the internal version carries additional compatibility support for the ConfReg client.

In addition, from 6.1 onwards we hope design and discussion will also happen in the community, making the whole R&D process more transparent and open.

We are still on our way

For SOFARegistry, 2021 has been a year of reviewing the past, consolidating the foundation, and improving efficiency.

Of course, we are still early in the journey and there is a long way to go. For example, at this year's Singles' Day scale we ran into several thorny problems:

  • A single application can have too many instances in one cluster (hot applications reach up to 7K instances per cluster), so the CPU and memory cost of receiving address pushes on the business side becomes too high.

  • The full address list is pushed, resulting in too many connections.

There are other challenges:

  • Incremental push, to reduce the pushed data volume and the resource cost on the client

  • Unified service discovery across clusters

  • Adapt to new trends under cloud native

  • Community operation

  • Product usability

“Reference”

[1] Dubbo3 proposed application-level service discovery and related principles:

Dubbo.apache.org/zh/blog/202…

About us:

The Ant Application Service team is a core technology team serving the entire Ant Group. It builds a world-leading financial-grade distributed infrastructure platform, leads in cloud native fields such as Service Mesh, and develops and maintains the world's largest Service Mesh cluster; Ant Group's message-oriented middleware handles trillions of messages every day.

We welcome engineers interested in Service Mesh, microservices, and service discovery to join us.
