The rapid growth of Ele.me's business has brought the technology team the challenges of massive request volume, high concurrency, and microservices. Meanwhile, the development team's fast-paced iteration and the demand for fast service launches also push the operations team to provide stable and efficient O&M services.

As the head of technical operations at Ele.me, I have witnessed the rapid development of the business firsthand. I remember that when I joined Ele.me in 2015, we handled only 300,000 orders a day. By 2017, we were doing more than 10 million orders a day.

Given the size of the overall market and the upper limit of roughly 20 million orders in a single machine room, we gradually moved forward with a new plan for 100 percent redundancy.

Today’s sharing is divided into three parts:

  • Multi-active scenarios and service patterns

  • Ele.me's multi-active operations challenges

  • Exploration of Ele.me's operations system

Multi-active scenarios and service patterns

Current state of Ele.me's multi-active setup

First, let me introduce the current situation of Ele.me: we have two machine rooms, in Beijing and Shanghai, that provide production services. A machine room and an eZone are two different concepts: one machine room can host multiple eZones, although currently the relationship is one-to-one.

We also have two access points deployed on the public cloud that serve as the nationwide traffic entry points; they handle traffic from the north and the south respectively. These access points are deployed on Aliyun and are used for O&M and disaster recovery.

Considering the possibility that the two cloud entry points might go down at the same time, we are preparing a backup access point in our own IDC as a disaster recovery solution.

Since our first successful multi-active drill in May 2017, we have gone through 16 full traffic switches.

These 16 switches include both routine drills and actual switches triggered by failures. The most recent one was due to a failure of the public network egress of our Shanghai machine room, and we switched all of its traffic to Beijing.

Background of the multi-active implementation

Next, I would like to introduce the background before the multi-active implementation from five aspects:

  • Business characteristics

  • Technical complexity

  • Bottom-line operations

  • Frequent failures

  • Machine room capacity

Business characteristics: We have three main traffic entry points, namely the customer app, the merchant app, and the rider app.

A typical ordering flow is: the user opens the app and places an order, the store accepts the order on the merchant side, and a waybill is then generated for logistics delivery.

The difference from a traditional e-commerce order is that in a mall scenario it does not matter much if the back-end merchant only processes the order the next day; the delay is acceptable.

Ele.me is different. Takeout is extremely time-sensitive: if the merchant does not accept the order within 10 minutes, users will either complain or cancel the order and switch to Meituan or Baidu Waimai, which means losing users.

Our business is also highly regional. Orders generated in Shanghai, for example, are generally only relevant in Shanghai and do not need to be sent elsewhere.

At the same time, our traffic has pronounced peaks: the lunch peak is generally around 11 a.m., and the dinner peak falls between 5 and 6 p.m.

We can see the requests across the whole link at a glance through the monitoring curves. These are the business characteristics of our company and of the takeout industry as a whole.

Technical complexity: The diagram above shows the entire technical architecture of a traffic request, from the entry point down to the bottom layer.

An SOA (Service Oriented Architecture) system is not complex in itself; in fact, most Internet companies' technology architectures end up evolving in a similar direction.

Our real complexity is that the components, the infrastructure, and the entire access layer are multi-language.

Before 2015, our front end was written in PHP and our back end in Python. After two years of evolution, we have now replaced everything that was written in PHP. To support multiple languages, each of our components has to be adapted once more for every additional language.

For example, if we want full-link tracing across multiple languages, we need to develop an SDK for each language and spend a lot of effort maintaining those SDKs.

As you can see, complexity often lies not in how many components we have, but in the maintenance we provide for each component.

Our current SOA framework architecture is primarily oriented toward two languages, Python and Java, and is evolving to be more Java-oriented.

The API Everything layer in the middle contains various API projects developed for different application scenarios. Our infrastructure includes storage and cache as a whole, as well as public and private clouds.

Bottom-line operations: During rapid business growth, our operations team has been doing more and more "backstop" work.

At the last count, we have nearly 16,000 servers, 1,600 applications, 1,000 developers, four physical IDCs, and two clouds where the protective access layer is deployed. There are also some very small third-party cloud platforms, including AWS and Aliyun.

As the business grew, based on the overall IDC infrastructure environment, we customized the delivered server models and improved the procurement supply chain, including standardized whole-rack delivery and data cleaning.

For the databases and caches used by applications, we have also done a lot of resource splitting and transformation: database critical-path isolation, vertical splitting, sharding, SQL audit, access through the database middleware DAL, and governance of the Redis cache, including migrating to the Redis cluster proxy Corvus. Together with the framework team, this standardized and servitized the way storage is used.

One relatively big challenge we faced was database DDL. Table design has its own characteristics at every company; at companies such as Alibaba and Baidu, for example, weekly DDL changes are very few.

We, however, have on the order of hundreds of DDL changes per week, which is related to our engineering culture and business delivery pace.

The DBA team and the DAL team did several things: table data redlines, improved online schema change tooling based on gh-ost, and EDB self-service releases. This greatly reduced the database DDL accident rate and improved change efficiency.
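
To make the "table data redline" idea concrete, here is a minimal Python sketch of how such a check might gate self-service DDL. The threshold and function names are illustrative assumptions, not Ele.me's actual tooling.

```python
# Hypothetical sketch: gate self-service DDL behind a table-size "redline".
# The threshold and helper names are illustrative, not Ele.me's actual tooling.

ROW_COUNT_REDLINE = 5_000_000  # assumed redline; real values depend on policy


def check_ddl_redline(table_stats: dict, table: str) -> bool:
    """Return True if the table is small enough for a direct self-service DDL."""
    rows = table_stats.get(table, 0)
    if rows > ROW_COUNT_REDLINE:
        # Large tables go through the gh-ost based online schema change flow
        # with DBA review instead of direct self-service release.
        return False
    return True


if __name__ == "__main__":
    stats = {"orders": 120_000_000, "coupon_templates": 80_000}
    for t in ("orders", "coupon_templates"):
        print(t, "self-service OK" if check_ddl_redline(stats, t) else "needs DBA review")
```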

During the multi-active transformation, tooling development lagged behind: in deploying services and in promoting and governing components, most of the work was still driven manually.

We are also responsible for the stability of the whole network and for fault management, including contingency-plan drills, fault discovery, emergency response, incident reviews, and incident damage rating.

Fault management is not about assigning blame. It is about recording and analyzing the cause of each failure and following up on improvement measures to prevent it from happening again.

We also define a whole-network stability counter, which records the accumulated time without major incidents. When a fault of level P2 or above occurs, the counter is cleared and starts over.

Our longest stability record in history is 135 days; Meituan has exceeded 180 days, so there is still some gap.
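
As a rough illustration of how such a counter behaves, here is a minimal Python sketch, with assumed field names, of a stability counter that accumulates days and resets on any P2-or-higher incident.

```python
# Minimal sketch of the whole-network stability counter described above:
# it accumulates days since the last P2-or-higher incident and resets
# when such an incident occurs. Field names are illustrative assumptions.
from datetime import date


class StabilityCounter:
    def __init__(self, start: date):
        self.last_reset = start

    def record_incident(self, level: str, when: date) -> None:
        # P0, P1, P2 clear the counter; P3-P5 do not.
        if level in ("P0", "P1", "P2"):
            self.last_reset = when

    def days_stable(self, today: date) -> int:
        return (today - self.last_reset).days


counter = StabilityCounter(start=date(2017, 1, 1))
counter.record_incident("P2", date(2017, 3, 10))
print(counter.days_stable(date(2017, 7, 23)))  # days since the last P2+ incident
```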

Frequent failures: As you can see from the "Frequent failures" chart above, 2015 and 2016 were pretty miserable.

We often had incidents of P2 level or above; at the worst, there was more than one P2 incident per day.

We had to improve, so we formed a team called NOC (Notification Operation Center).

Modeled on Google SRE's 7*24 emergency response team, the NOC is responsible for preliminary cause assessment, routine drills, organizing incident reviews, and following up to make sure review action items actually land.

The NOC defines the company-wide failure rating and loss/liability criteria: the P0 to P5 incident scale, which is based on four dimensions of our business characteristics:

  • Whether the impact occurred during peak or off-peak periods, including the time and duration of the damage.

  • Loss ratio to net business orders.

  • Amount of loss.

  • The impact on public opinion, including competition with Meituan, Baidu Waimai, and other platforms. Unlike issues such as food quality, what we are talking about here are technical failures.

    For example, merchants cancel customers' orders for no reason, or the number of complaints to customer service rises for various other reasons.

These dimensions, combined with the difference between peak and off-peak periods, form our grading criteria.

Based on the incident rating and liability specifications, we established a response and troubleshooting SOP (standard operating procedure), and then used reports to run the statistics.

In addition to the number of failures, MTTR (mean time to recovery) is also an important metric. Through the response SOP, we can analyze, for each fault, whether it took a long time to discover, a long time to respond to, or a long time to fix.

By standardizing the process and tracking MTTR in the reports, we can analyze which phase took the longest after a failure.
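
To make the breakdown concrete, here is a small Python sketch, with assumed timestamp names, of how MTTR can be split into detection, response, and repair phases.

```python
# Sketch of the MTTR breakdown used in the reports: given incident timestamps,
# split total recovery time into detection, response, and repair phases.
# The timestamp field names are illustrative assumptions.
from datetime import datetime


def mttr_breakdown(started: datetime, detected: datetime,
                   responded: datetime, recovered: datetime) -> dict:
    return {
        "time_to_detect_min": (detected - started).total_seconds() / 60,
        "time_to_respond_min": (responded - detected).total_seconds() / 60,
        "time_to_repair_min": (recovered - responded).total_seconds() / 60,
        "mttr_min": (recovered - started).total_seconds() / 60,
    }


print(mttr_breakdown(
    started=datetime(2017, 7, 1, 11, 0),
    detected=datetime(2017, 7, 1, 11, 4),
    responded=datetime(2017, 7, 1, 11, 10),
    recovered=datetime(2017, 7, 1, 11, 35),
))
```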

When it comes to failure frequency, we assume that all failures, whether component failures or underlying server failures, will eventually show up on the business curves.

Therefore, our NOC office has a big screen displaying the important business curves. When a curve trends abnormally, we can respond in time and notify the right people.

During order peaks we care most about timeliness. The first thing we do after a failure, and our real goal, is to stop the loss quickly rather than spend time locating the root cause.

That is what multi-active is for: it is what keeps us alive. Originally we had only one machine room; if its equipment failed at business peak, the consequences would be unimaginable.

Machine room capacity: Let's look at the capacity of the machine rooms. Before 2015 there were few orders, and our servers were scattered around the machine room with miscellaneous models.

By 2015 we had about 1,500 servers; in 2016 we grew to 6,000; and in 2017 we reached nearly 16,000. These figures do not include the ECS instances on the cloud.

Those who have worked with IDCs may know that large companies usually contract delivery by whole modules.

But at the beginning we did not know the business would grow so fast. We shared modules and racks with other companies, the servers were old and non-standard, and the network environment was very complicated. There was even a time when we had the money to buy servers but no room to expand into.

Why we had to go multi-active

There are four reasons for going multi-active: disaster recovery, service scaling, single machine room capacity limits, and other factors.

As shown on the right side of the figure above, we evaluate this with a curve on an X/Y axis: as the business grows and technology investment and services expand, the loss caused by a failure no longer grows in simple proportion to them.

Ele.me's multi-active operations challenges

Next, let me share the operations planning we did at the time, which is divided into five parts:

  • Multi-active technical architecture

  • IDC planning

  • SOA Service Transformation

  • Database transformation

  • Disaster recovery safeguards

Multi-active technical architecture

Kunshan, for example, can be split between Shanghai and Suzhou (this has nothing to do with administrative regions, only with delivery radius). We therefore proposed the concept of geo-fencing and developed the GZS component.

Through geo-fencing, GZS (Global Zone Service) divides the whole country into 32 shards. When a request belonging to a shard enters the system, GZS determines which machine room the request should be routed to.
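
As a rough illustration of the idea (not the real GZS API), the following Python sketch maps a request's location to one of the 32 shards and then to the eZone that owns it.

```python
# Hypothetical sketch of the GZS idea: a geo-fence maps an order's location to
# one of the 32 shards, and each shard is owned by exactly one eZone, so the
# whole order flow for that shard stays in one machine room. The shard table
# and lookup function are illustrative assumptions, not the real GZS API.

SHARD_TO_EZONE = {shard_id: ("ezone-sh" if shard_id % 2 == 0 else "ezone-bj")
                  for shard_id in range(32)}  # assumed ownership table


def locate_shard(lat: float, lng: float) -> int:
    """Stand-in for the real geo-fence lookup; here just a toy hash of the location."""
    return int(abs(lat * 100) + abs(lng * 100)) % 32


def route_request(lat: float, lng: float) -> str:
    shard = locate_shard(lat, lng)
    return SHARD_TO_EZONE[shard]


print(route_request(31.23, 121.47))  # a Shanghai-area order routed to its owning eZone
```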

As shown at the bottom of the figure, for data with strong consistency requirements we proposed the concept of the Global zone. For a database that belongs to the Global zone, writes can be performed in only one machine room, while reads can be served by local slaves in the different zones.

Five core components of the multi-active technical architecture:

  • API Router: the traffic entry point and our first core component, providing request proxying and routing.

  • GZS: manages geo-fencing data and shard allocation rules.

  • DRC: the Data Replication Center, a cross-room database replication tool that also supports data change subscription for cache synchronization.

  • SOA Proxy: handles calls between multi-active and non-multi-active services.

  • DAL: originally the database middleware; in the multi-active project it was modified to prevent data from being routed to the wrong machine room and causing data inconsistency.

The core goal of the entire multi-active architecture is to ensure that the whole order flow is always completed within one machine room.

To achieve this goal, the five functional components above were developed, data with strong consistency requirements was investigated and identified, and the overall planning and transformation were carried out together.

IDC planning

At the end of 2016 we kicked off the multi-active project, settled on two machine rooms in the north and the south plus the traffic entry points, and started IDC selection. I made field visits to several IDC providers in Shanghai and finally chose a Wanguo Data (GDS) machine room. At the same time, we prepared the server budget for 100% of the traffic and submitted the purchasing requirements to the procurement department.

We planned the multi-active test environment, simulated dual eZones as in production, divided the VPCs, and finally carried out the service transformation in parallel.

As shown on the right of the figure above, take two different traffic flows as an example: traffic entering through the access layer in different regions goes to the Beijing or Shanghai machine room respectively. Under normal circumstances, the whole order flow is processed within the machine room of its own region, and traffic can be dispatched to the other room when necessary.

SOA Service Transformation

We also made some changes to SOA service registration and discovery. Let's first look at how things worked before multi-active: when an application service (AppId) went online, the physical cluster environment was prepared, and a cluster was registered during SOA registration.

Within larger clusters, different swimlanes are carved out for different service calls, and these swimlanes are mapped to different application clusters when applications are released. That is the logic of the whole AppId deployment.

This is very simple with a single machine room, but in a two-room scenario the same AppId must be transformed so that it only calls the SOA cluster of its own machine room. On top of swimlanes and deployment clusters, we introduced an eZone concept similar to a unit.

The SOA mode transformation plan includes the following three modes (a small sketch follows the list):

  • Orig: compatibility mode, the default service registration and discovery mode.

  • Prefix: a way to take back control of service registrations and unify SOA service registration. This mode targets our many new multi-active applications; some older businesses default to the Orig mode.

  • Route: the final mode, which takes back control of SOA service invocation and further unifies SOA service registration and discovery. The IDC, eZone, and O&M architecture details become transparent to the business side, reducing the SOA maintenance workload for business teams.
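
Here is the sketch promised above: an illustrative Python example, under assumed key formats rather than the real SOA framework API, of how the three modes might differ in registration and discovery.

```python
# Illustrative sketch (not the real SOA framework API) of the three modes:
# Orig keeps the legacy, eZone-unaware key; Prefix namespaces the registration
# by eZone; Route additionally filters discovery to same-eZone instances.

def register(app_id: str, cluster: str, ezone: str, mode: str) -> str:
    if mode == "orig":  # compatibility mode: legacy key
        return f"{app_id}/{cluster}"
    # prefix and route modes: registrations are unified under an eZone prefix
    return f"{ezone}/{app_id}/{cluster}"


def discover(instances: list, local_ezone: str, mode: str) -> list:
    if mode == "route":
        # Route mode: only call the SOA cluster inside the local eZone.
        return [i for i in instances if i["ezone"] == local_ezone]
    return instances


instances = [{"addr": "10.0.0.1:8080", "ezone": "ezone-sh"},
             {"addr": "10.1.0.1:8080", "ezone": "ezone-bj"}]
print(register("app.order", "cluster-a", "ezone-sh", "prefix"))
print(discover(instances, "ezone-sh", "route"))
```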

Database transformation

The databases are also planned according to how applications are deployed, that is, multi-active, non-multi-active, and the strongly consistent Global zone.

We researched business data consistency and planned replication consistency: multi-active clusters were converted to two-way replication through DRC, while the Global zone uses native replication.

The specific transformation can be divided into three parts:

  • Database cluster transformation: working backward from the deadline, dedicated teams followed up, and the whole process was broken down into a detailed operation plan.

  • Database middleware (DAL) transformation: adding a check function to ensure SQL is never written to the wrong machine room, giving us wrong-write data protection as a final safety net (see the sketch after this list).

  • DRC transformation: setting up multi-active DRC two-way replication between the instances in the two machine rooms.
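
The DAL safety check mentioned in the second item can be pictured with the following minimal Python sketch; the shard ownership table and names are assumptions, not the real middleware.

```python
# A minimal sketch, under assumed names, of the DAL safety check: before a
# write is executed, verify that the shard the row belongs to is owned by the
# local eZone; otherwise reject the write as a final safety net against
# cross-room data corruption.

SHARD_OWNER = {0: "ezone-sh", 1: "ezone-bj"}  # assumed shard ownership table
LOCAL_EZONE = "ezone-sh"


class WrongEzoneWrite(Exception):
    pass


def guard_write(shard_id: int, sql: str) -> str:
    owner = SHARD_OWNER.get(shard_id)
    if owner != LOCAL_EZONE:
        # Reject instead of silently writing into a room that does not own the shard.
        raise WrongEzoneWrite(f"shard {shard_id} is owned by {owner}, not {LOCAL_EZONE}")
    return sql  # in the real middleware this would be forwarded to the database


print(guard_write(0, "UPDATE orders SET status = 'PAID' WHERE id = 42"))
```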

Disaster recovery safeguards

DR protection is classified into three levels:

  • Traffic entry failures, including DNS resolution changes, network egress failures, backbone line failures of a province or city, and API Router (AR) failures.

  • Failures inside an IDC, commonly failed change releases, historical bugs being triggered, misconfiguration, hardware failures, network failures, capacity problems, and so on.

  • A single machine room becoming completely unavailable. This has not actually happened yet, but we are currently running disconnection drills: we simulate all zones of one machine room going down due to force majeure and make sure every application in that room can be switched to the other room so that services remain available.

Of course, this does not fundamentally solve the case where both machine rooms fail at the same time. When problems occur in both rooms simultaneously, we still rely on experienced engineers and our automated fault location services.

Exploration of Ele.me's operations system

During Ele.me's operations transformation, how do we turn organizational capability into operational capability? Here are our five focus areas:

  • Application release

  • Monitoring system

  • Plan and drill

  • Capacity planning

  • Single machine room cost analysis

Application release

Let's start with application release. In the single machine room case, one AppId corresponds to one or more SOA clusters. The operations team also configures grayscale machine groups and requires key applications to stay in grayscale for at least 30 minutes.

So how does an application release work in the multi-active scenario? We adopted two options in the planning (sketched in code after the list):

  • Treat all zones as one large "cluster", following the grayscale machine group release strategy: we first grayscale in a single machine room, then extend the grayscale to all zones so that every key application still satisfies the 30-minute grayscale rule, and finally release fully to all zones.

  • Treat each zone as its own cluster, so there are as many clusters as zones: first grayscale zoneA and then fully release zoneA, then grayscale zoneB and fully release zoneB.

    Alternatively, you can grayscale zoneA and zoneB together, observe and verify the release at the same time, and then fully release zoneA and zoneB. Teams can choose whichever fits their own situation.
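
To make the two rollout orders concrete, here is a small Python sketch that emits the release stages for each strategy; only the 30-minute grayscale window comes from the text, everything else is illustrative.

```python
# Sketch of the two rollout strategies described above, expressed as ordered
# release stages. The 30-minute grayscale window is from the text; stage names
# and everything else are illustrative assumptions.

GRAYSCALE_MINUTES = 30


def strategy_one(zones: list) -> list:
    """All zones as one big cluster: grayscale in one zone, then all zones, then full."""
    return ([f"grayscale {zones[0]} ({GRAYSCALE_MINUTES} min)"]
            + [f"grayscale {z}" for z in zones[1:]]
            + [f"full release {z}" for z in zones])


def strategy_two(zones: list) -> list:
    """Each zone as its own cluster: grayscale and fully release zone by zone."""
    stages = []
    for z in zones:
        stages += [f"grayscale {z} ({GRAYSCALE_MINUTES} min)", f"full release {z}"]
    return stages


for stage in strategy_two(["zoneA", "zoneB"]):
    print(stage)
```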

Monitoring system

Ele.me currently has three monitoring systems:

  • Full-link monitoring. When the agent starts, it reads a local file to learn which zone it is in, then adds an ezoneID tag to each metric and aggregates metrics accordingly (see the sketch after this list). By design, one machine room can contain multiple zones.

  • Service monitoring. It is deployed per machine room: StatsD data is reported to the StatsD in the machine room where the service resides, and you switch the Data Source when viewing.

  • Infrastructure monitoring. For servers and network devices, the eZone dimension is not needed; they are monitored along the host dimension.
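
The eZone tagging in the first item can be sketched roughly as follows; reading the zone from an environment variable (rather than a local file) is an assumption made only to keep the example runnable.

```python
# Sketch of the full-link monitoring idea: the agent learns its own eZone at
# startup (here from an environment variable, an assumption) and attaches an
# ezone_id tag to every metric so that curves can be aggregated or filtered
# per eZone.
import os
import time
from typing import Optional


def read_local_ezone() -> str:
    # The real agent reads a local file at startup; an env var keeps this sketch runnable.
    return os.environ.get("EZONE_ID", "ezone-sh")


def emit_metric(name: str, value: float, tags: Optional[dict] = None) -> dict:
    tags = dict(tags or {})
    tags["ezone_id"] = read_local_ezone()  # the key multi-active addition
    point = {"name": name, "value": value, "tags": tags, "ts": int(time.time())}
    print(point)  # a real agent would ship this to the metrics backend
    return point


emit_metric("order.create.latency_ms", 42.0, {"app_id": "app.order"})
```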

Plan and drill

We make contingency plans for common failures, develop routine drill plans, and run drills regularly. We are also building a drill orchestration system, which should work even better once it goes online.

Capacity planning

As for capacity planning, we currently only collect the server CPU utilization per AppId.

Combined with the normal load of the existing machine rooms in the two cities, the Beijing zone carries 52% of the traffic and the Shanghai machine room carries 48%.

We normally run a full-link load test every Wednesday. Through this evaluation we know the capacity of the overall critical path.

Going forward, we also want to answer: if traffic grows by another 15%, how many servers need to be added on top of each existing AppId.

At the same time, based on current order volume, we estimate the maximum order request load that a single existing SOA cluster can carry.
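
As a back-of-the-envelope illustration of these two questions, here is a Python sketch under a linear-scaling assumption and an assumed target CPU utilization; none of the numbers are measured values.

```python
# Rough capacity sketch: given an AppId's current server count and peak CPU
# utilization, estimate how many extra servers a 15% traffic increase needs,
# and extrapolate the order volume the existing cluster can carry.
# The 60% target utilization and linear scaling are illustrative assumptions.
import math

TARGET_CPU = 0.60  # assumed safe peak utilization


def servers_needed(current_servers: int, current_cpu: float, traffic_growth: float) -> int:
    projected_cpu = current_cpu * (1 + traffic_growth)
    if projected_cpu <= TARGET_CPU:
        return 0
    # assume load spreads linearly across servers
    return math.ceil(current_servers * (projected_cpu / TARGET_CPU - 1))


def order_capacity(current_orders_per_day: int, current_cpu: float) -> int:
    # naive linear extrapolation up to the target utilization
    return int(current_orders_per_day * TARGET_CPU / current_cpu)


print(servers_needed(current_servers=40, current_cpu=0.55, traffic_growth=0.15))
print(order_capacity(current_orders_per_day=10_000_000, current_cpu=0.55))
```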

As shown in the figure above, looking at the AppId utilization statistics, we can see that because of the earlier explosive business growth, the servers we bought without much regard for cost actually have rather low utilization.

Single machine room cost analysis

For the existing IDC costs, we amortize them into each month according to certain depreciation standards, compare them with the total number of orders completed in that month, and finally work out the IT cost per order and the cost per CPU core.

In addition, we can compare this cost with the cost of renting equivalent cloud services to see how well we are doing on cost.

For shared, pooled resources, the cost of the pooled component services is apportioned across departments and individual AppIds.

This tells us how many servers each AppId uses and what its IT cost is, so we can do further cost analysis.
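
The accounting described in this section boils down to simple arithmetic; the following Python sketch, with entirely made-up numbers, shows the per-order cost calculation and the apportionment of shared component cost by server count.

```python
# Sketch of the single-room cost accounting described above: depreciate IDC
# spend into a monthly figure, divide by the month's completed orders to get
# IT cost per order, and apportion shared component cost to AppIds by their
# server counts. All numbers are made up for illustration.

def monthly_depreciation(capex: float, months: int) -> float:
    return capex / months


def cost_per_order(monthly_cost: float, monthly_orders: int) -> float:
    return monthly_cost / monthly_orders


def apportion_shared(shared_cost: float, servers_by_app: dict) -> dict:
    total = sum(servers_by_app.values())
    return {app: shared_cost * n / total for app, n in servers_by_app.items()}


idc_month = monthly_depreciation(capex=36_000_000, months=36) + 1_200_000  # capex + opex
print(round(cost_per_order(idc_month, monthly_orders=300_000_000), 4))
print(apportion_shared(200_000, {"app.order": 120, "app.search": 80}))
```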

Cheng Yanling, current technical operations director of Ele.me, moved from databases to operations and then to technology plus operations. I am mainly responsible for the operation and maintenance of thousands of AppIds, IDC construction, and stability assurance at Ele.me. I joined Ele.me in 2015 and have experienced its vigorous growth in scale and technology over the past two years, growing together with the technical operations team through its challenges and difficulties. As a ten-year operations veteran, I hope to share my story with you and look forward to learning and exchanging with everyone.

This article is reprinted from the public account “51CTO Technology Stack”
