Welcome to follow our WeChat official account: Shishan100

My new course, **"C2C E-commerce System Microservice Architecture 120-Day Practical Training Camp"**, is now online on the official account Ruxihu Technology Nest. Interested readers can click the link below for details:

120-Day Training Camp of C2C E-commerce System Micro-Service Architecture

The 100-Million-Level Traffic Architecture column:

  • How to support the storage and computation of tens of billions of data records
  • How to design a highly fault-tolerant distributed computing system
  • How to design a high-performance architecture to carry ten-billion-level traffic
  • How to design a high-concurrency architecture for hundreds of thousands of queries per second
  • How to design a full-link 99.99% high-availability architecture

1. Foreword

The previous article, "Evolution of a Large-Scale System Architecture to Support Storage and Computation of Tens of Billions of Data Records," covered the first phase of the merchant data platform's architecture evolution. Through technical means such as separating the offline and real-time computing links, incremental computing optimization for offline computing, a sliding-time-window computing engine for real-time computing, and database/table sharding plus read/write separation, the platform came to support the storage and computation of tens of billions of data records.

Picking up from there, this article discusses how the architecture should continue to evolve in the face of technical challenges such as high concurrency, high availability, and high performance.

2. Active-Standby HA Architecture

Take another look at the architecture diagram above: did you spot a fatal problem? The problem is how to avoid a single point of failure in the system!

In the initial deployment architecture, the data platform system had high requirements for CPU, memory, and disk, so we deployed it on a virtual machine with a 16-core CPU, 64 GB of memory, and an SSD. This configuration ensured that the data platform system could operate normally under high load.

However, deploying the data platform system on a single machine creates a fatal single point of failure: if that one machine goes down, the entire system immediately becomes unavailable.

Therefore, in the initial stage, we implemented an active-standby high-availability architecture for the data platform: two machines are deployed in total, but only one runs at any given time while the other stands by. The system in the active state writes the computation state and results of the sliding-window computing engine to ZooKeeper, where they are stored as metadata.

For storing this metadata in ZooKeeper, we drew heavily on the architecture of the open-source Storm streaming computing engine. As an excellent distributed streaming computing system, Storm also needs to read and write a large amount of intermediate computation state under high concurrency, and it stores that state in ZooKeeper.

ZooKeeper offers high read/write performance, high availability, and many of the capabilities a distributed system needs, including distributed locks, distributed coordination, master election, and active/standby switchover.
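The article itself does not show this metadata layer in code, but here is a minimal sketch, assuming the Apache Curator client and an invented znode path, of checkpointing the sliding-window engine's state to ZooKeeper:

```java
import org.apache.curator.framework.CuratorFramework;

/**
 * Minimal sketch: checkpoint the sliding-window engine's state to a znode
 * so the standby node can recover it after a failover. The path and the
 * byte[] serialization are assumptions, not the platform's actual design.
 */
public class EngineStateStore {
    private static final String STATE_PATH = "/data-platform/engine-state"; // hypothetical path
    private final CuratorFramework client;

    public EngineStateStore(CuratorFramework client) {
        this.client = client;
    }

    /** Called by the active node after each window computation. */
    public void checkpoint(byte[] serializedState) throws Exception {
        if (client.checkExists().forPath(STATE_PATH) == null) {
            client.create().creatingParentsIfNeeded().forPath(STATE_PATH, serializedState);
        } else {
            client.setData().forPath(STATE_PATH, serializedState);
        }
    }

    /** Called by the standby node when it takes over. */
    public byte[] recover() throws Exception {
        return client.getData().forPath(STATE_PATH);
    }
}
```

Note that a single znode holds roughly 1 MB of data by default, so real intermediate state would have to stay compact or be split across multiple znodes.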

On top of ZooKeeper, we therefore implemented automatic active/standby switchover: if the active node goes down, the standby node automatically switches to active, reads the computing engine's intermediate state from the shared ZooKeeper storage, and resumes the computation from where it left off.
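As an illustration of the election itself, here is a minimal sketch built on Curator's LeaderLatch recipe; the connection string, latch path, and recovery hooks are assumptions for the example, not the system's actual code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ActiveStandbyElector {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // hypothetical ensemble
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Both machines contend on the same latch path; exactly one becomes active.
        LeaderLatch latch = new LeaderLatch(client, "/data-platform/active-election");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // This node just became active: recover the engine's intermediate
                // state from ZooKeeper and resume the sliding-window computation.
            }

            @Override
            public void notLeader() {
                // Lost the active role (e.g., ZooKeeper session expired): stop computing.
            }
        });
        latch.start();
        Thread.currentThread().join(); // keep the process alive as active or standby
    }
}
```

Because the latch is backed by an ephemeral znode, a crashed active node's latch disappears once its session expires, and the standby is promoted automatically.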

The diagram below illustrates the active/standby switchover process.

With this active-standby architecture in place, the system's single point of failure was eliminated, guaranteeing basic availability. It has also performed well in the actual online production environment: the system fails a few times a year, but each time the standby machine takes over automatically and keeps running stably.

Here are a few examples of machine failures seen in the production environment. Since the VMs are deployed in the company's cloud environment, possible failures include but are not limited to the following:

  • The host machine where the VM resides goes down
  • The VM's network fails
  • The disk fails under high load

So in a high-load online environment, never assume a machine will never go down; always be prepared for it to fail! The system must have sufficient failure anticipation, a high-availability architecture, and fault drills to ensure that it keeps running in all kinds of scenarios.

3. Distributed Computing System with Master-Slave Architecture

But then another problem arose. The core task of the data platform system is to compute the data in each time window, yet the daily data volume kept growing, so the amount of data in each time window became larger and larger, which in turn drove the computational load of the data platform system higher and higher.

In the online production environment, the CPU load on the machine hosting the data platform system climbed steadily; at peak times it could easily hit 100%, putting the machine under great pressure. A new round of system refactoring was imperative.

First, we completely refactored the data platform system into a distributed computing system, separating the two responsibilities of task scheduling and task computation. A dedicated Master node is responsible for reading the segmented data shards (the so-called time windows: each window is one data shard) and then distributing the computation task for each data shard across multiple Slave nodes.

The job of a Slave node is to receive computation tasks one by one; each computation task executes a complex SQL statement of hundreds to thousands of lines against one data shard to produce the corresponding data analysis result.
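To make the division of labor concrete, here is a hypothetical sketch of the Master side; every name in it (DataShard, ComputeTask, SlaveClient) is invented for illustration, and plain round-robin stands in for the smarter scheduling described in the next section:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch: the Master reads data shards (time windows) and
 * dispatches one compute task per shard to a Slave node.
 */
public class MasterDispatcher {
    private final List<SlaveClient> slaves;     // RPC stubs to Slave nodes (assumed)
    private final AtomicLong nextTaskId = new AtomicLong();
    private int cursor = 0;                     // simple round-robin cursor

    public MasterDispatcher(List<SlaveClient> slaves) {
        this.slaves = slaves;
    }

    /** One shard is one time window; one task runs that shard's SQL on a Slave. */
    public void dispatch(DataShard shard) {
        ComputeTask task = new ComputeTask(nextTaskId.incrementAndGet(), shard);
        SlaveClient slave = slaves.get(cursor++ % slaves.size());
        slave.submit(task);                     // Slave executes the SQL and stores the result
    }

    // Hypothetical supporting types, named only for this example.
    record DataShard(long windowStart, long windowEnd, String sql) {}
    record ComputeTask(long taskId, DataShard shard) {}
    interface SlaveClient { void submit(ComputeTask task); }
}
```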

Meanwhile, to avoid a single point of failure at the Master node, we reused the earlier active-standby architecture: one Active Master and one Standby Master are deployed online. Normally the Active node runs; once it fails, the Standby node switches to Active and automatically resumes scheduling the computation tasks.

Once deployed, this architecture worked very well, because the Master node only reads the data shards, constructs a computation task for each shard, and distributes the tasks to the Slave nodes for computation.

The Master node itself has little heavy work to do, so deploying it on a single well-configured machine is perfectly fine.

The load falls mainly on the Slave nodes. Since multiple Slave machines are deployed, each performing part of the computation tasks, the load on any single Slave is greatly reduced. Moreover, whenever necessary, the Slave cluster can be expanded with more machines, so no matter how heavy the computation workload gets, the Slave cluster can keep scaling out to ensure that no single Slave machine is overloaded.

4. Elastic Computing Resource Scheduling Mechanism

After solving the problem of high computing load on a single machine, we ran into the next one: in the online production environment, we occasionally found that a certain computation task took too long, causing a large backlog of computation tasks to pile up unprocessed on one Slave machine.

This problem is mainly caused by the difference in data volume between the system's peaks and troughs.

As you can imagine, during peak periods data floods in instantaneously, so a single data shard may contain far too much data, several times or even dozens of times the size of an ordinary shard. That is one cause.

Another cause is that, up to this point, every computation was performed by executing the complex, hundreds-to-thousands-of-lines SQL against a MySQL read replica.

During peak hours, therefore, the CPU and I/O load on the database server hosting that MySQL replica could be very high, degrading SQL execution performance several-fold. With a large data shard and slow execution, a single computation task could easily take far too long.

A final cause of the unbalanced load is that each computation task corresponds to one data shard and one SQL statement, and different SQL statements execute with very different efficiency: some finish in just 200 milliseconds, while others take a second. These differences in SQL execution efficiency lead to differences in task execution time.

Therefore, we added mechanisms to the Master node such as task metrics reporting, task duration estimation, task execution status monitoring, machine resource management, and elastic resource scheduling.

The effect achieved is roughly as follows:

  • The Master node senses in real time each machine's computation task execution status, queue backlog, and resource usage.
  • It also collects historical metrics for the computation tasks run on each machine.
  • Then, based on a task's historical metrics, its estimated duration, and each Slave machine's current load, new computation tasks are dispatched to the Slaves with the lowest load (see the sketch below).
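Here is a minimal sketch of what such load-aware dispatch could look like, assuming Slaves report actual run times back to the Master; the data structures and the moving-average weighting are illustrative assumptions, not the platform's actual code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch: estimate a task's duration from its history, charge that
 * estimate to the chosen Slave's backlog, and always pick the least-loaded Slave.
 */
public class ElasticScheduler {
    /** Estimated queued work (millis) currently assigned to each Slave. */
    private final Map<String, Long> slaveBacklogMillis = new ConcurrentHashMap<>();
    /** Moving average of past run times per task type, from Slave reports. */
    private final Map<String, Long> historicalAvgMillis = new ConcurrentHashMap<>();

    public String pickSlave(String taskType, List<String> slaves) {
        long estimate = historicalAvgMillis.getOrDefault(taskType, 1_000L); // default guess: 1s
        String best = slaves.stream()
                .min(Comparator.comparingLong(s -> slaveBacklogMillis.getOrDefault(s, 0L)))
                .orElseThrow();
        slaveBacklogMillis.merge(best, estimate, Long::sum); // charge the estimate to it
        return best;
    }

    /** Called when a Slave reports completion with the actual run time. */
    public void onTaskFinished(String slave, String taskType, long actualMillis) {
        long charged = historicalAvgMillis.getOrDefault(taskType, 1_000L);
        slaveBacklogMillis.merge(slave, -charged, Long::sum);   // release the charged estimate
        historicalAvgMillis.merge(taskType, actualMillis,
                (old, cur) -> (old * 7 + cur) / 8);             // exponential moving average
    }
}
```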

Through this mechanism, we can fully ensure balanced utilization of the online Slave cluster's resources and avoid any single machine being overloaded or computation tasks queuing for too long. After production use and some tuning, this mechanism has worked very well.

5. High Fault-Tolerance Mechanism for the Distributed System

In fact, once the system is refactored into a distributed architecture, all kinds of problems can occur, and at that point a complete set of fault-tolerance mechanisms must be built.

Generally speaking, the problems this system may encounter in the online production environment include but are not limited to:

  • A Slave node suddenly goes down during execution
  • A computation task takes too long to execute
  • A computation task fails to execute

Therefore, the Master node needs a fault-tolerance mechanism for the computation tasks it schedules onto Slave nodes. The general idea is as follows:

  1. The Master node monitors the running status of each computation task and of each Slave node
  2. If a Slave goes down, the Master redistributes that Slave's unfinished computation tasks to other Slave nodes
  3. If a computation task fails on a Slave, the Master retries it several times and then reassigns it to another Slave node
  4. If a computation task fails on multiple Slaves, it is placed in a delayed in-memory queue and retried after an interval, for example once the peak period has passed
  5. If a computation task has been executing for too long, perhaps because the Slave running it has hung, the Master node bumps the task's version number and assigns the task to another Slave node to execute
  6. The version number is bumped to prevent a stale write: after the newly assigned Slave finishes and writes its result, the previous Slave, which hung for a while and then recovered, might write its own computed result to storage and overwrite the correct one. The version-number mechanism prevents exactly this (see the sketch below).
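Here is a minimal sketch of the version-number guard on result writes; the in-memory maps stand in for the real result storage, and in a real store the check-then-write would need to be a single conditional update:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch: a result write is accepted only if it carries the task's
 * current version, so a Slave that hung and later recovered cannot
 * overwrite a newer, correct result with a stale one.
 */
public class ResultStore {
    record VersionedResult(long version, String payload) {}

    private final Map<Long, Long> currentVersion = new ConcurrentHashMap<>(); // taskId -> version
    private final Map<Long, VersionedResult> results = new ConcurrentHashMap<>();

    /** The Master calls this when it reassigns a task that appears stuck. */
    public long bumpVersion(long taskId) {
        return currentVersion.merge(taskId, 1L, Long::sum);
    }

    /** Slaves call this to write a result; stale versions are rejected. */
    public boolean writeResult(long taskId, long version, String payload) {
        if (version != currentVersion.getOrDefault(taskId, 0L)) {
            return false; // stale write from a Slave that hung and came back
        }
        results.put(taskId, new VersionedResult(version, payload));
        return true;
    }
}
```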

6. Stage Summary

At this stage of its evolution, the system architecture ran quite well. Under the million-level daily request and data volumes of the time, it held up nicely: for high write concurrency on the database we could keep adding primary databases, for high read concurrency we could keep adding read replicas, tables with large data volumes could be split into more shards, and the Slave compute nodes could likewise be scaled out as needed.

Computation performance could also keep up with this level of requests and data, because the data-shard computing engine (the sliding window) guarantees second-level computation performance, while the elastic resource scheduling mechanism keeps the load on each Slave compute node balanced.

In addition, the whole distributed system has high-availability and high-fault-tolerance mechanisms: the Master node is an active-standby pair that can fail over automatically, and any Slave node failure is detected by the Master, which automatically retries its computation tasks.

7. Prospects for the Next Stage

If traffic had stayed at the million-level daily request volume, this architecture could have held up. The problem is what came next: **request traffic grew toward billions or even tens of billions of requests per day**, at which point the architecture above could no longer cope, and the system architecture had to be refactored and evolved yet again.

END

In the next article, we will talk about **"How to Design a High-Performance Architecture for Ten-Billion-Level Traffic"**; stay tuned.

Upcoming articles in the series:

100-Million-Level Traffic System Architecture: How to Design a Highly Fault-Tolerant Distributed Computing System

How to Design a High-Performance Architecture for Ten-Billion-Level Traffic

How to Design a High-Concurrency Architecture for Hundreds of Thousands of Queries per Second

How to Design a Full-Link 99.99% High-Availability Architecture for a 100-Million-Level Traffic System

If you found this helpful, please help share it. Your encouragement is the author's greatest motivation. Thank you!

A large wave of original articles on microservices, distributed systems, high concurrency, and high availability

The articles are on their way. Please scan the QR code below and stay tuned:

Shishan's Architecture Notes (ID: Shishan100)

More than ten years of architecture experience at BAT companies

**Recommended reading:**

1. Please! Don't ask me about the underlying principles of Spring Cloud in interviews

2. [Behind the Double 11 Carnival] How does a microservice registry support tens of millions of visits in large-scale systems?

3. [Performance Optimization] Spring Cloud parameter tuning practice for tens of thousands of concurrent requests per second

4. How does a microservice architecture guarantee 99.99% high availability during the Double 11 Carnival?

5. Dude, let me tell you in plain English what Hadoop architecture is all about

6. How can Hadoop NameNode support thousands of concurrent accesses per second in large-scale clusters

7. [Secrets of Performance Optimization] How does Hadoop optimize the upload performance of terabyte-scale large files by 100 times?

8. Please! Don't ask me about the implementation principles of TCC distributed transactions in interviews, it's such a pain!

9. [What a pain!] How do eventually consistent distributed transactions ensure 99.99% high availability in real production?

10. Please! Don't ask me about the implementation principles of Redis distributed locks in interviews!

11. [Eye-opening!] See how Hadoop's underlying algorithms elegantly improve large-scale cluster performance by more than 10 times

12. How to support the storage and computation of tens of billions of data records in a 100-million-level traffic system architecture