Huya Live operations director Zhang Guanshi | Demystifying the six capabilities of SRE and Huya's operation and maintenance practice

Zhang Guanshi has more than 10 years of experience in website development, architecture, and operation and maintenance. He currently focuses on reliability system engineering for Internet services, the planning and construction of operation and maintenance platforms, and high availability website architecture. He has accumulated rich experience in audio and video transmission quality assessment and in microservice operation and maintenance.

1. Architecture and operation and maintenance challenges of the live broadcast platform
(1) Audio and video transmission process and challenges
(2) Process of a live broadcast room
(3) Operation and maintenance challenges of the live broadcast platform
2. Our thinking and operation and maintenance practice
(1) Introduction to Google SRE
• What is SRE
• Methodology of Google SRE
(2) Our thinking: six abilities of operation and maintenance
(3) Our operation and maintenance practice

Reliability management
Perception capability
Repair capability
Anti-fragility capability
Guarantee capability
Security capability

Introduction to Huya Live

Huya Live is a live broadcasting platform with games as its main content, also covering entertainment, variety shows, education, outdoor, sports, and other content. It was listed on the New York Stock Exchange in May 2018. Within the live broadcast industry, Huya is a company that places particular emphasis on technology. If you compare the viewing experience across several platforms, ours should be the best. League of Legends S8 is the world's largest esports event. It is currently in full swing, and the knockout stage begins today. IG is a Chinese team. This year three Chinese teams reached the final eight, the best result in years. The matches are very exciting. If I had not come to give this talk today, I would probably be watching at home or on duty at the company. You are welcome to watch the live broadcast on Huya Live and cheer for the LPL! (By the time this article was published, IG had won the championship for China, and the audience numbers on the Huya platform had set a new record, with no major problems during the live broadcast.)

Today's talk happens to be about the operation and maintenance work behind this event.

On a typical website such as an e-commerce site, the users are sellers and buyers. The seller first edits the product information, and buyers see it only after it is published and they refresh the page. The process is asynchronous: the seller can take time to make changes and to correct mistakes. On a live broadcasting platform, once an anchor starts broadcasting and appears in front of the camera, tens of thousands of people may be watching at the same time. The anchor cannot fiddle with things or walk away. Restarting a broadcast is too costly: after a 10-minute interruption the audience will have left. If the interaction is not smooth, big spenders will not want to keep watching. An anchor is also unlikely to go off the air to cooperate with operation and maintenance staff on technical adjustments. In these respects live broadcasting platforms differ from traditional websites, and the challenge to operation and maintenance is greater. The technology of a live broadcast platform is also quite complex. First, audio and video processing involves many advanced technologies. Second, the numbers of viewers and anchors are very large, and the real-time requirements are high. This year's League of Legends S8 finals stream, for example, is relayed back from Korea over a complicated route.

1. Architecture and operation and maintenance challenges of the live broadcast platform

(1) Audio and video transmission process and challenges

The audio and video flow refers to the series of steps a stream passes through on the platform, from the anchor starting a broadcast to the audience watching it.

This year, Huya's operation and maintenance R&D team introduced P2P technology, which makes the architecture much more complex than before.

(2) Process of a live broadcast room

(3) Operation and maintenance challenges of the live broadcast platform

The complexity of audio and video and the real-time nature of the service pose great challenges to operation and maintenance. Traditional operation and maintenance can deploy, configure, and optimize open source components and make them highly available. Audio and video technology, however, changes rapidly and forms a system of its own. The anchor side and the audience side carry a lot of logic, and there are many intermediate transmission routes, so it is difficult for operation and maintenance personnel to get involved. We therefore had to change our way of working. Google's SRE gave us great inspiration. Under the guidance of SRE methodology, we became deeply involved in the audio and video transmission business. We are not called SRE (we are still called business operation and maintenance), but we have absorbed many SRE ideas. Today I am sharing this part of our experience, and I hope it gives you some inspiration.

2. Our thinking and operation and maintenance practice

(1) Introduction to Google SRE

• What is SRE?
S stands for Site/Service/Software: the object of operation and maintenance, that is, the online services of the website business. R stands for Reliability: quality, and the value delivered to external end users. E stands for Engineer and Engineering. The essence of operation and maintenance is systems engineering that involves both people and machines. Unlike software engineering, we are responsible for stable operation, reliability, quality, and cost after the business goes live. Some people compare the relationship between business R&D and operation and maintenance to the question: which is harder, giving birth to a child or raising one?

• Google SRE methodology:
• Focus on R&D work and reduce toil
• Guarantee SLOs and measure risk
• Monitoring and golden metrics
• Emergency response
• Change management
• Demand forecasting and capacity planning
• Resource deployment
• Efficiency and performance
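The "SLO and measured risk" item in the list above can be made concrete with a small sketch. It assumes a service indicator defined as a request success rate and an SLO expressed as a minimum success rate. The function names and the 99.9% target are hypothetical illustrations, not values taken from the talk or from Google.

```python
def success_rate(successes: int, total: int) -> float:
    """Service level indicator (SLI): fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return successes / total


def meets_slo(successes: int, total: int, target: float) -> bool:
    """Return True when the measured success rate reaches the SLO target."""
    return success_rate(successes, total) >= target


# Hypothetical measurement window: 10,000 requests against a 99.9% target.
TARGET = 0.999
print(meets_slo(9995, 10000, TARGET))  # 5 failures in the window
print(meets_slo(9980, 10000, TARGET))  # 20 failures in the window
```

Measuring risk this way turns "how reliable is the service" into a number that can be compared with an agreed target for each end-to-end link.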

(2) Our thinking: six abilities of operation and maintenance

People often ask us what operation and maintenance do. We say we do quality, efficiency and cost. It is difficult to explain in a few words how to do it and how to do it. The book “SRE Google Operation, Maintenance and Decryption” emphasizes practical methodology, which can be implemented, but is not systematic enough. Different chapters may be written by different people. I had the opportunity to follow the path of reliability, find reliability research in traditional industry, and discover another world. Everyone thinks that SRE was invented by Google, but SRE in traditional industries has been around for decades and has become a discipline. After my personal research, I think this subject is more systematic and more complete, so I hope to apply it to the Internet service. Referring to some reliability theories of traditional industries, I made some migration of the framework, and transformed my thinking into a thinking framework of operation and maintenance, which is called the Six capabilities of operation and maintenance, which can be divided into the following six points:

① Reliability Management

First, analyze the reliability model of the target business, then draw a reliability logic block diagram to evaluate the reliability of each link and of the whole. The measurement and evaluation can be qualitative or quantitative.

② Perception capability

After the business goes online and is connected, learn how to perceive its status, changes, and problems.

③ Repair capability

When reliability was not designed perfectly at the design stage, repair capability helps us repair a fault before users become aware of it.

④ Anti-fragility capability

A business runs in a particular internal and external environment. Look for its fragile points and risk points, analyze them, design anti-fragility capabilities, and ultimately push business R&D to modify the technical architecture.

⑤ Guarantee capability

Many businesses need guarantee capability: build a guarantee design, deliver resources quickly, and put capabilities in place quickly.

⑥ Security capability

How do we ensure our business security and data security?

(3) Our operation and maintenance practice

1. Reliability management

We mainly focus on the core indicators of the core services of the business we are responsible for. We regard each end-to-end link as a service, so the service indicator can be success rate, latency, or something else, and the objective is for that indicator to reach a specified level. The R&D and operation and maintenance teams draw deployment architecture diagrams and reliability logic diagrams for the service (see figure below). We also:
• Build reliability models for the business and do some FMECA (failure mode, effects, and criticality analysis);
• Analyze failure modes and their impact, and discuss design solutions;
• For some critical services, draw fault trees, measure risks, select the priority risks, and push solutions forward.

Reliability is achieved through management and through operation and maintenance, but first of all it is designed in. The methods of reliability design include fault avoidance, fault correction, and fault tolerance.
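A reliability logic diagram of the kind mentioned above can be evaluated quantitatively once each block has an estimated reliability. The following is a minimal sketch using the standard series and parallel formulas from reliability engineering. It assumes that blocks fail independently, which real systems often violate. The stage names and all numbers are hypothetical and are not Huya's actual figures or topology.

```python
from typing import Iterable


def series(reliabilities: Iterable[float]) -> float:
    """Series structure: every block must work, so R = product of R_i."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(reliabilities: Iterable[float]) -> float:
    """Parallel (redundant) structure: R = 1 - product of (1 - R_i)."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


# Hypothetical end-to-end link: redundant uplinks, one transcoding stage,
# and several redundant CDN lines (numbers are made up for illustration).
uplink = parallel([0.99, 0.99])
transcoding = 0.999
cdn = parallel([0.995, 0.995, 0.995])

end_to_end = series([uplink, transcoding, cdn])
print(f"uplink={uplink:.6f} cdn={cdn:.6f} end_to_end={end_to_end:.6f}")
```

In a model like this, the single non-redundant stage dominates the end-to-end figure, which is the kind of finding that helps select priority risks.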

2. Perception capability

Perception includes, but is not limited to, monitoring coverage, alarm timeliness, alarm accuracy, alarm reach rate, problem location ability, and trend prediction ability.

(1) Monitoring and state perception: use monitoring data as the foundation to improve both human perception and machine perception. Monitoring is the basis of perception, but having monitoring indicators does not by itself mean having perception capability. That is far from enough.

(2) Fault perception: help operation and maintenance personnel perceive business status, changes, and other problems.

(4) AIOps: for the most part it uses big data to strengthen the perception capability of operation and maintenance. Examples include intelligent alarms, automated testing, stress testing, dial testing, APM, and log traces that can be read and analyzed.

3. Repair capability

SRE is systems engineering that fights failures. No matter how well a program is written, it is difficult to make it completely failure-proof. Repair capability is measured by MTTR (mean time to repair). For most faults we should know the failure mode. Based on the failure mode we can write a fault plan (specifying the conditions and the time allowed for the repair) and build repair tools according to the plan, whether for manual repair or for intelligent self-healing. When unexpected situations occur, maintenance, capacity expansion, or optimization is needed. Repair capability is evaluated by the average and maximum repair time.

Some practices at Huya:
• Anchor uplink switching: repairing an uplink problem went from the anchor restarting the broadcast in the early days, to manual switching in the backend, to automatic switching on the anchor side. Repair time (MTTR) dropped from half an hour, to five minutes, to seconds.
• Audience scheduling system: scheduling by anchor side and audience side, scheduling for small carriers, seamless switching, scheduling by protocol, and one-click takedown of a machine room.

The next level of repair capability is self-healing, which reflects the degree to which repair capability has been built into the software architecture design.
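The evaluation by average and maximum repair time described above can be sketched in a few lines. This is only an illustration: the `Incident` record, its field names, and the sample timestamps are hypothetical, and the sketch assumes repair time is measured from fault detection to service restoration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """One fault, from detection to restoration (hypothetical record)."""
    detected_at: datetime
    restored_at: datetime

    @property
    def repair_time(self) -> timedelta:
        return self.restored_at - self.detected_at


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to repair: total repair time divided by incident count."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.repair_time for i in incidents), timedelta())
    return total / len(incidents)


def max_repair_time(incidents: List[Incident]) -> timedelta:
    """Worst-case repair time in the evaluation window."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return max(i.repair_time for i in incidents)


# Made-up sample data for one evaluation window.
sample = [
    Incident(datetime(2018, 10, 1, 20, 0), datetime(2018, 10, 1, 20, 10)),
    Incident(datetime(2018, 10, 5, 21, 0), datetime(2018, 10, 5, 21, 20)),
]
print("MTTR:", mttr(sample), "max:", max_repair_time(sample))
```

Tracking the maximum alongside the mean matters because a single long outage can hide behind many fast, automated repairs.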

4. Anti-fragility capability

Anti-fragility design ensures that a service remains tolerably robust under fragile conditions. Software always runs in different environments and under different conditions, which reliability engineering calls “specified conditions”. The environment always contains many fragile points, so it is necessary to do fragility analysis, anti-fragility design, and a final assessment and review. Common fragility factors for Internet services include failures of machine rooms, carriers, networks, and single machines; heavy load and traffic from sudden business events; and timeouts of microservice requests. Countermeasures include robust design, disaster recovery design, high availability design, and resource redundancy. This corresponds to what Google SRE describes as embracing risk, measuring risk, and assessing risk tolerance.

Anti-fragility design of the S8 source streams

5. Guarantee capability

Some of our practices are as follows:
• Bandwidth resource guarantee: bandwidth can be scheduled at the minute level, and traffic can be switched within 1 minute.
• Server guarantee: it takes 3 minutes to obtain servers in multiple machine rooms and 3 minutes to deploy core services.
• Architecture design and interface design for guarantee capability: we have made some special designs for live broadcast rooms.

Guarantee capability is the comprehensive embodiment of various capabilities:
• It tests the degree of automation: there must be supporting systems and automation tools behind it.
• Manpower and personnel must be planned; it tests how quickly people are in place when a failure occurs.
• The supply of hardware and software resources must be guaranteed.
• It places requirements on the software architecture, such as whether it supports smooth scale-out.
• There should be drills to ensure that the plans can actually be executed.

6. Security capability

Security is the most basic capability and one of the biggest risks.
• Data security: data leaks and incidents involving the confidentiality of user information occur again and again.
• Business security: coupon abuse, payment loopholes, anchors' on-air speech and conduct, login risk control, and so on.
• User security: for example, Didi's security incident.

The above content was shared by Mr. Zhang Guanshi.

The 7th TOP100 Global Software Case Study Summit, hosted by MSUP, will be held at the Beijing National Convention Center from November 30 to December 3. Mr. Zhang Guanshi will speak at the conference on the topic “Operation and Maintenance Guarantee Practice of Live Broadcasting Platform”.

Case objectives

Compared with web services, the operation and maintenance of live audio and video is more specialized, and the industry offers little good experience to draw on. When I first took over, the operation and maintenance challenge in this area was considerable.

(1) Huya Live currently runs a heterogeneous, multi-cloud architecture. Across the whole link, any viewer may watch any anchor over any line, which makes the system highly complex.

(2) R&D personnel and each team focus only on the links they own. After Huya introduced multiple CDNs, technical and management complexity increased greatly. With video stream paths this complex, the operation and maintenance team had to get deeply involved in audio and video work, which places higher requirements on operation and maintenance quality and on the skills of operation and maintenance staff.

Key points of success (or lessons)

The quality evaluation system of live broadcast audio and video transmission, the whole link monitoring of audio and video quality data, and the thinking of Internet service reliability system engineering.

Key points of the case

Improve operation and maintenance efficiency and live broadcast quality.

Case takeaways

Because the live broadcast platform is unusual, with an architecture different from any we had worked with before, and because video technology was limited at the time, we had to find the focus of operation and maintenance as quickly as possible. We solved this problem with DevOps and SRE, which we have been advocating for years.
