A Huawei engineer at SRECon Asia: a focus on reliability, resource optimization, and performance improvement

Content source: on June 17, 2017, Huawei software architect Maguire gave the talk “SRECon Asia 2017 Stories” at the “Xi'an | Enterprise MeetUp”. IT big said, as the exclusive video partner, publishes it with the review and authorization of the organizers and the speaker.

Video replay and slides (PPT) of the talk: suo.im/4ViT57

Abstract

40%-90% of the cost of a software system lies in maintenance. For large companies that care about software availability, reliability, and performance, software engineering has become a way to solve problems in operations and maintenance. Google therefore set up a reliability-focused organization, SRE (Site Reliability Engineering), which created Borg and Borgmon. Beyond Google, other large Internet companies focused on reliability, such as Facebook, eBay, Dropbox, LinkedIn, Baidu, and Alibaba, have adopted similar practices. SRECon is a conference where companies share SRE practices in technology, culture, and more. I recently had the opportunity to attend SRECon Asia in Singapore, and I would like to share some interesting topics and ideas from it, along with some trends I observed in the SRE field.

What is SRE

SRE stands for Site Reliability Engineer. The role demands a very high skill level. About 50-60% of Google's SREs are standard software engineers; the rest meet 80-90% of the software engineer requirements and also know Unix internals and networking in depth.

SREs apply software engineering thinking to operational problems and are responsible for availability, performance, efficiency, monitoring, transaction processing, and so on.

SRE methodology

SRE is primarily focused on R&D: maximizing iteration speed while still meeting the service's SLA/SLO. It covers monitoring systems, emergency response, change management, demand forecasting and capacity planning, resource deployment, and efficiency and performance.

SRECon Asia

SRECon is hosted by USENIX, and the Asian conference is mainly sponsored by Baidu, Facebook, and LinkedIn. About 250 people attended. The speakers came from big Internet companies such as Google, Facebook, LinkedIn, PayPal, CloudFlare, Dropbox, Yahoo, Atlassian, and REA Group. Domestic participants included Baidu, Alibaba, Didi, QiNiu, Tingyun, and Tsinghua.

Monitoring and Alarm

As shown in the figure, the most basic requirement of a software service is monitoring; everything else is built on top of it. Only when monitoring shows what kind of incident has happened can the corresponding emergency response be carried out. Afterwards, the incident is summarized and its root cause analyzed. Improvements are then made and tested, remaining problems are identified, and the code is modified and released.

Open-Falcon: Motivation

Zabbix: It is difficult to scale horizontally when you manage more than 2000 servers.

OpenTSDB: Good write performance and good horizontal scaling, but slow queries.

InfluxDB: Used by some small companies abroad. Query performance is excellent and aggregation is powerful, but horizontal scaling is difficult.

Open-Falcon: Performance

It scales horizontally with ease, processing millions of operations per minute (query/judge/store/search) and comfortably supporting more than 100,000 hosts. With RRA (round-robin archive), one year of historical data for 100+ metrics can be queried with a response time on the order of seconds. Metric history can be stored for more than 10 years.
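The RRA idea mentioned above (round-robin archive, as popularized by RRDtool) can be illustrated with a minimal sketch: raw samples are consolidated into a fixed-size archive of coarser points, so a long time range is answered from a small number of pre-aggregated values. The class name, the averaging consolidation, and the parameters `steps` and `rows` are assumptions made for illustration; this is not Open-Falcon's actual code.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive keeping the average of every `steps` raw samples."""

    def __init__(self, steps, rows):
        self.steps = steps                # raw samples per consolidated point
        self.points = deque(maxlen=rows)  # oldest points are overwritten
        self._pending = []                # raw samples not yet consolidated

    def add(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.points.append(sum(self._pending) / self.steps)
            self._pending = []


rra = RoundRobinArchive(steps=5, rows=3)
for v in range(1, 21):  # 20 raw samples: 1, 2, ..., 20
    rra.add(float(v))
archive = list(rra.points)  # only the 3 most recent averages are kept
```

Because the archive has a fixed number of rows, storage stays constant regardless of how long the metric has been collected, which is what makes multi-year retention cheap.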

The problem

Operating OpenStack and fixing its problems requires complex knowledge and too many manual operations. This knowledge is difficult to transfer.

The solution

Use natural language to query the system status, which works better than the CLI and regular expressions.

Use a set of basic rules to discover system knowledge automatically and build a knowledge graph (SOSG). Queries about a specific system then become graph traversals, and anomaly detection uncovers hidden problems.

Talk: “Talking to an OpenStack Cluster in Plain English”, by Xu Wei from Tsinghua
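To make the "query becomes graph traversal" idea concrete, here is a minimal sketch. The graph contents, the node names, and the `depends_on` function are all hypothetical; the talk's actual SOSG structure and query engine are not described in this article. The sketch only shows how a question such as "what does vm-1 depend on?" maps to a breadth-first traversal.

```python
from collections import deque

# Hypothetical toy graph: edges point from a resource to what it depends on.
GRAPH = {
    "vm-1": ["host-a", "volume-7"],
    "host-a": ["rack-3"],
    "volume-7": ["storage-node-2"],
    "rack-3": [],
    "storage-node-2": [],
}


def depends_on(graph, start):
    """Return everything reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


result = depends_on(GRAPH, "vm-1")
```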

Service life cycle

Topics in this area included distributed consensus algorithms such as Paxos; reliable launches at scale and the launch checklist; and seamlessly managing changes on Yahoo's Hadoop infrastructure servers, where 45,000 nodes are managed by Chef.

Reliable Launches at Scale

Before a launch, the team checks the architecture, capacity, reliability, monitoring, automation, growth trend, and the readiness of third-party (internal Google) services, and the launch goes ahead only after all of these are confirmed to be OK.
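A launch checklist of this kind can be treated as a simple gate. The sketch below is only an illustration: the item names are taken from the list above, while the function name and the sample results are assumptions, and it does not represent Google's actual tooling.

```python
CHECKLIST = [
    "architecture",
    "capacity",
    "reliability",
    "monitoring",
    "automation",
    "growth trend",
    "dependency readiness",
]


def launch_blockers(results):
    """Return the checklist items that are missing or not yet confirmed."""
    return [item for item in CHECKLIST if not results.get(item, False)]


# Sample review results (made up): everything is confirmed except monitoring.
results = {item: True for item in CHECKLIST}
results["monitoring"] = False

blockers = launch_blockers(results)
ready = not blockers  # launch only when nothing blocks it
```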

Managing Server Secrets at Scale with a Vaultless Password Manager

The number of keys/credentials grows with the number of servers.

One approach is to store secrets in the configuration management tool. However, starting the configuration management tool itself requires a key pair or similar credential, and because every server must have a different password, keys do not scale and key rotation is not feasible.

Another approach is to store the secret on the server and generate it when the server starts. This is difficult for root passwords and disk encryption, and if the server is stateless, nothing can be stored on its disk.

Incident management

Some of the challenges of incident management:

How to achieve a shorter MTTR;

Many incidents are simple to handle, for example with a restart; how can these be handled automatically;

How to reduce false alarms;

How to make alarms carry the right information so that the problem can be located quickly.
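For reference, MTTR (mean time to repair) is the total repair time divided by the number of incidents. The following sketch computes it from a list of (detected, resolved) timestamps; the incident records are made-up sample values used only for illustration.

```python
from datetime import datetime, timedelta

# Made-up (detected, resolved) timestamps, for illustration only.
incidents = [
    (datetime(2017, 6, 1, 10, 0), datetime(2017, 6, 1, 10, 30)),  # 30 min
    (datetime(2017, 6, 3, 2, 15), datetime(2017, 6, 3, 3, 45)),   # 90 min
    (datetime(2017, 6, 9, 18, 0), datetime(2017, 6, 9, 19, 0)),   # 60 min
]


def mean_time_to_repair(records):
    """Average of (resolved - detected) over all incident records."""
    total = sum((resolved - detected for detected, resolved in records),
                timedelta())
    return total / len(records)


mttr = mean_time_to_repair(incidents)
```

Tracking this value over time shows whether automation of simple fixes and better alarm content are actually shortening recovery.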

Service extension

Small, Cheap, and Effective Testing for Production Engineers.

Merou: A Decentralized, Audited Authorization Service

These talks were shared by Facebook and Dropbox.

Capacity planning/performance tuning

Capacity Planning and Flow Control

Capacity estimation: stress testing on a single machine;

Simulation: ab / JMeter / Gatling;

Replication: replaying a copy of production traffic;

Redirection: redirecting traffic;

Load balancing: adjusting weights.
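The weight-based approach in the list above can be sketched with simple arithmetic: under weighted load balancing, each host receives traffic in proportion to its weight, so raising one host's weight puts more production load on that machine. The host names, weights, and total QPS below are assumed sample values, and the function is a hypothetical helper rather than any load balancer's API.

```python
def traffic_by_weight(weights, total_qps):
    """Expected QPS per host when traffic is split in proportion to weight."""
    total_weight = sum(weights.values())
    return {host: total_qps * w / total_weight for host, w in weights.items()}


# Assumed sample values: host-c is given double weight to load it harder.
weights = {"host-a": 1, "host-b": 1, "host-c": 2}
expected = traffic_by_weight(weights, total_qps=1000)
```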

Why Flow Control

Queue buildup: server performance deteriorates and response time increases, affecting applications and the user experience.

Avalanche effect;

Overloaded traffic needs to be limited.

And a Formula!

Calculation principle:

EntranceSize = volume * RT (response time)

Requests = constant * LOAD * RT

Flow control principle: if the system is overloaded, the volume is limited; if the system is under normal load, the limit is removed.

Use dynamic threshold control.
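The first formula and the flow control principle above can be sketched as follows. This is a minimal illustration under stated assumptions: "volume" is read as requests per second, the load metric and its threshold are hypothetical, and the class is not the implementation presented in the talk. The second formula is left out because its constant is not given.

```python
def entrance_size(volume_per_second, response_time_seconds):
    """EntranceSize = volume * RT: requests in flight at that volume and RT."""
    return volume_per_second * response_time_seconds


class FlowController:
    """Caps in-flight requests only while the system is overloaded."""

    def __init__(self, load_threshold, limit):
        self.load_threshold = load_threshold  # assumed overload boundary
        self.limit = limit                    # cap applied when overloaded
        self.in_flight = 0

    def try_enter(self, current_load):
        overloaded = current_load > self.load_threshold
        if overloaded and self.in_flight >= self.limit:
            return False  # overloaded and cap reached: reject the request
        self.in_flight += 1
        return True       # normal load, or still under the cap: admit

    def leave(self):
        self.in_flight -= 1


# 200 requests/s at 0.5 s response time keeps about 100 requests in flight.
size = entrance_size(200, 0.5)

fc = FlowController(load_threshold=0.8, limit=2)
normal = [fc.try_enter(current_load=0.5) for _ in range(3)]  # no limit applied
rejected = fc.try_enter(current_load=0.9)                    # cap now enforced
fc.leave()
fc.leave()
admitted = fc.try_enter(current_load=0.9)                    # back under the cap
```

A dynamic threshold would recompute `limit` from recent measurements (for example from `entrance_size`) instead of fixing it at construction time.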

Conclusion

SRECon has a large number of participants and offers a good exchange of ideas.

You can learn what different companies are doing, such as CloudFlare and Amazon’s A9.

While many of the topics may seem small, most of them offer something to learn.

One trend we can feel in operations and maintenance is data pipeline + big data + machine learning + AI + bots.

That’s all for today’s sharing, thank you!