On a sunny Tuesday afternoon, I was at work reading about the features of the upcoming Java 12 release when my girlfriend called.

That evening after work, my girlfriend came home and told me that Taobao had been inaccessible for about ten minutes.

System availability

System availability is the ratio of the time a system's services run without interruption to the total elapsed time. Availability is therefore a percentage, such as 99.9%.

We often hear the term “high availability” (HA). It means that the time the system's services run without interruption makes up a very large share of the total elapsed time.

To understand availability, there are three important indicators to know: MTTF, MTTR, and MTBF.

MTTF means Mean Time To Failure. It is the average time the system runs normally before a fault occurs.

MTTR means Mean Time To Repair. It is the average time between the occurrence of a fault and the completion of the repair.

MTBF means Mean Time Between Failures. It is the average time between the start of one failure and the start of the next.

The diagram above shows the relationship between the three indicators. As can be seen:


MTBF = MTTF + MTTR

Availability is therefore MTTF / MTBF * 100%, which is the same as MTTF / (MTTF + MTTR) * 100%.
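As a rough illustration, these indicators can be estimated from an outage log. The sketch below assumes a hypothetical log format (a sorted list of non-overlapping failure and repair timestamps inside one observation window) and simply divides total uptime and total downtime by the number of failures; the function name and the sample data are made up for the example.

```python
from datetime import datetime, timedelta


def availability_metrics(outages, start, end):
    """Return (mttf, mttr, mtbf, availability) for the window [start, end].

    `outages` is a list of (failed_at, repaired_at) datetime pairs, sorted,
    non-overlapping, and inside the window. This log format is an assumption
    made for the example.
    """
    if not outages:
        raise ValueError("at least one outage is needed to estimate the means")
    total = end - start
    downtime = sum((up - down for down, up in outages), timedelta())
    uptime = total - downtime
    n = len(outages)
    mttf = uptime / n        # average running time per failure (simplified)
    mttr = downtime / n      # average repair time per failure
    mtbf = mttf + mttr       # MTBF = MTTF + MTTR
    availability = mttf / mtbf
    return mttf, mttr, mtbf, availability


# Hypothetical 30-day window with a 1-hour and a 3-hour outage.
start = datetime(2019, 1, 1)
end = datetime(2019, 1, 31)
outages = [
    (datetime(2019, 1, 5, 10), datetime(2019, 1, 5, 11)),
    (datetime(2019, 1, 20, 2), datetime(2019, 1, 20, 5)),
]
mttf, mttr, mtbf, availability = availability_metrics(outages, start, end)
print(mttf, mttr, mtbf, f"{availability:.4%}")
```

With this sample data the window is 720 hours long with 4 hours of downtime, so MTTF is 358 hours, MTTR is 2 hours, and availability is 358/360, or roughly 99.44%.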

In practice, many systems are composed of several subsystems, so how do we calculate the availability of the whole system? That depends on the architecture of the system.

For series systems:




For parallel systems:




For composite systems:
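The three cases can be sketched in Python using the standard reliability formulas, under the usual assumption that components fail independently. A series system is up only when every component is up, so the availabilities multiply. A parallel system is down only when every component is down, so the unavailabilities multiply. A composite system nests the two. The function names and the example figures below are illustrative only.

```python
from math import prod


def series(*parts):
    """Series: every part must be up, so availabilities multiply."""
    return prod(parts)


def parallel(*parts):
    """Parallel: one working part is enough, so unavailabilities multiply."""
    return 1 - prod(1 - a for a in parts)


# Hypothetical composite system:
# load balancer -> two redundant app servers -> database
lb, app, db = 0.9999, 0.99, 0.999
total = series(lb, parallel(app, app), db)
print(f"app tier: {parallel(app, app):.4%}, whole system: {total:.4%}")
```

Two 99% servers in parallel already give 99.99% for that tier, but the whole chain can never be more available than its weakest series component.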

Measurement of availability

The availability of a system is usually stated in a Service Level Agreement (SLA) as a number of nines. It’s not uncommon to see companies claiming 99.99%, 99.999%, and so on.

The industry typically measures an SLA by the time between a failure and recovery, usually expressed as the total time the system is unavailable over a year. The specific relationship is shown in the following table:
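The relationship between the number of nines and yearly downtime is simple arithmetic, so it can also be computed directly. The sketch below assumes a 365-day year; the function name is made up for the example.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: a non-leap year


def allowed_downtime(nines, period=YEAR):
    """Maximum downtime per period for an availability with `nines` nines."""
    unavailability = 10 ** -nines   # e.g. 3 nines -> 0.001
    return period * unavailability


for n in range(1, 6):
    availability = 1 - 10 ** -n
    print(f"{availability:.3%}", allowed_downtime(n))
```

Under this assumption, three nines allow about 8.76 hours of downtime per year, four nines about 52.6 minutes, and five nines about 5.26 minutes.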




Murphy’s Law says “anything that can go wrong will go wrong”, so 100% availability is out of reach.

For SLA metrics, the more nines, the higher the availability and the lower the permitted downtime. However, more nines also mean a greater challenge for the system and a higher cost. For example, five nines allow only about five minutes of downtime per year, while four nines allow less than an hour (about 53 minutes) per year. Meeting such targets requires work at several levels, including design, infrastructure, and data backup, and often increased infrastructure investment.

“When you’re dealing with a life-and-death situation, or when you lose a million dollars a minute in business, you can think about 99.99 percent reliability.” Robertson, Linux High Availability project developer

Different systems have different availability requirements. For example, e-commerce systems such as Taobao and Jingdong have huge numbers of users in different regions using the system at all hours, so their availability requirements are high.

Based on past failure statistics and rough, imprecise measurements for these systems, their current availability is estimated at around three to four nines. In contrast, enterprise office software has lower availability requirements because it is usually used only during working hours, only in certain regions, or only by certain groups of people at certain times.

Assurance of availability

There are many factors that affect availability, including system failures, infrastructure failures, data failures, security attacks, system stress, and more.

Ensuring availability involves many levels, including but not limited to:

  • The quality of software design, coding, testing, release, and configuration management

  • The skill level of the engineers

  • The level of operations and maintenance management and technical skill

  • The level of data center operation management

  • How well dependencies on third-party services are managed

  • Attitudes towards technology

  • The engineering culture of a company

  • Leader’s respect for engineering

The following table lists common high availability issues and solutions.



Ensuring the high availability of a system is not a simple matter, and the list above covers only part of the methodology. Truly ensuring high availability requires a great deal of practice.
