Hu Kai is head of operations at Bilibili. He previously worked in operations at Kingsoft Software, Kingsoft Network, and Cheetah Mobile. Bilibili is the largest youth pop-culture entertainment community in China and a UGC platform for sharing bullet-comment (danmaku) videos that is well known "across the galaxy".

Anime culture and the post-95 generation have made Bilibili (hereinafter "Station B"), famous for its bullet comments and its uploaders (UP), more and more popular. Millions of young people visit Station B from computers, mobile phones, TVs, and other devices to watch videos with bullet comments. Traffic is especially heavy when new shows go online, which puts huge pressure on Station B's IT operations team. Hu Kai joined Station B's newly established operations department last year.

This article is based on what the author shared in the Monitoring and Performance Sharing Group.

Operations at Station B has three main pain points: a shortage of manpower, frequent failures, and an operations system that lags behind. Station B tackles these three pain points in three ways.


Free the labor force

At present, Station B's CDN is mainly self-built, with Tb-level bandwidth and N PB of video storage, so the operations pressure is very high. Hiring can solve the problem, but in Shanghai, the "Magic City", recruiting the right operations staff cannot be done overnight. So what do we do about the shortage of manpower? Find a way to free people from the grind of day-to-day operations.

Because there was no dedicated operations department before, all permissions on the IT systems were in the hands of the developers. When problems occurred, operations had to follow the developers to find the cause, and the inefficient communication often led to further problems.

So our first step was to automate publishing with Ansible + Jenkins. Ansible is a relatively simple batch management tool that also supports advanced functions such as template management. With automated publishing in place, developers need server access far less often: once code is committed to the Git trunk, a release is triggered automatically.


For Git we use GitLab. For security, we also put a layer of LDAP proxy in front. The effect is like a single pass for everything: the operator, Git, and Jenkins all use OpenLDAP for unified authentication. Redmine, Grafana, Zabbix, and the other systems added later are also connected to OpenLDAP authentication. Everyone has a dynamic password, which is required for every authentication.

A one-stop monitoring and alerting system

Because the original monitoring could not keep up with the rapid growth of the business, we deployed the open source monitoring system Zabbix. The operations team could use Zabbix well, but colleagues in other departments found it hard to use, and many kinds of customized monitoring were troublesome to implement.


Also, focus your energy on the things that matter most. We could spend a long time polishing Zabbix and Open-Falcon, but that would only take us from a score of 80 to 90, which is not a significant gain, and a lot of monitoring is simply not what Zabbix and Open-Falcon are good at.

StatsD can be embedded very flexibly into code for monitoring (even shell scripts can use it). Because it uses the UDP protocol, the performance or failure of the StatsD server does not affect the calling program. This makes business-level statistics such as QPS and response time possible.
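As a rough illustration of why UDP keeps monitoring from affecting the caller, here is a minimal Python sketch of a StatsD-style client. The metric names, the host and port (8125 is the conventional StatsD port), and the class and function names are assumptions made for this example, not Station B's actual code.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'api.play.requests:1|c'."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Fire-and-forget StatsD client (hypothetical helper for illustration)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        # UDP socket: no connection setup and no waiting for a reply.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            # Monitoring must never break the calling program.
            pass

    def incr(self, name, count=1):
        """Counter: summing these per second gives QPS."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name, millis):
        """Timer: response time in milliseconds."""
        self._send(format_metric(name, int(millis), "ms"))
```

A request handler would call something like `client.incr("api.play.requests")` and `client.timing("api.play.latency", elapsed_ms)`. If the StatsD server is slow or down, the datagrams are simply lost and the handler carries on.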

The final result for a single alert looks like this:


Station B has built its own CDN, with hundreds of nodes covering the whole of China. Monitoring the CDN has always been difficult: when a link has problems, traditional Zabbix and Open-Falcon monitoring struggles to detect them. We developed our own http-Monitor, which can monitor and alert on website availability, but we still use the third-party service Jiankongbao ("Monitoring Treasure") for the sake of independent resources and data reliability, and to probe network quality from the client side. Jiankongbao is simple to use, has practical features, and has many monitoring points. Its distributed monitoring finds network problems promptly, and its snapshot function helps locate problems quickly and view the details. Moreover, Jiankongbao is an independent third party, so it can issue an SLA certificate for the website, which serves as a basis for internal performance assessment at Station B.


Love and hate of open source systems


Station B has a strong technical atmosphere and loves open source and new technologies, so it uses a lot of open source components, including SheepDog (which lost data) and GlusterFS (which stalled badly). The biggest pitfall was SD cards + Ceph storage. Ceph itself is very well designed, but using it the wrong way can end in disaster. For example, one server cluster at Station B ran its operating system from SD cards. When the SD cards failed, the systems went down with them, and the disk I/O of all virtual machines froze or the machines even crashed. After continuous tuning, the system is finally stable. The biggest comfort Ceph gives me: it doesn't lose data, it doesn't lose data!

In addition, open source systems such as Redis 3.0, Codis, and Twemproxy have all been used at Station B. In the end we developed BiliTW (open source) ourselves. The main reasons are that Codis was no longer being updated and Twemproxy performs poorly, especially when there are many Redis instances on the back end (like Redis, it uses only a single core). BiliTW's biggest improvement is support for multiple cores, plus some features that make operations easier.

Finally, to summarize how the operations team of Station B has grown: because failures were frequent, we had to focus on the entry points and the big issues; because the operations system could not keep up, we had to fight back with open source; and because we used so many open source systems, we fell into a lot of pitfalls.


Q: How is the dynamic password implemented? Did you develop it yourselves or use an open source auth tool?

A: We use Google's dynamic passwords, via the open source Google Authenticator.
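For readers unfamiliar with how such dynamic passwords are computed, the following is a minimal Python sketch of the time-based one-time password algorithm (TOTP, RFC 6238, built on HOTP from RFC 4226) that Google Authenticator-compatible apps implement. The function names and the tolerance of one 30-second step on either side are assumptions for illustration. It says nothing about how Station B wires this into OpenLDAP.

```python
import base64
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for one counter value."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def decode_secret(secret_b32: str) -> bytes:
    """Decode a base32 secret, tolerating lowercase and missing padding."""
    padded = secret_b32 + "=" * (-len(secret_b32) % 8)
    return base64.b32decode(padded, casefold=True)


def totp(secret_b32: str, now=None, step: int = 30) -> str:
    """Time-based password: HOTP over the number of elapsed time steps."""
    now = time.time() if now is None else now
    return hotp(decode_secret(secret_b32), int(now) // step)


def verify(secret_b32: str, candidate: str, now=None,
           step: int = 30, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps either side."""
    now = time.time() if now is None else now
    secret = decode_secret(secret_b32)
    counter = int(now) // step
    for c in range(counter - window, counter + window + 1):
        if c >= 0 and hmac.compare_digest(hotp(secret, c), candidate):
            return True
    return False
```

The server and the phone share only the secret and a roughly synchronized clock, so a code is useless to an attacker shortly after it is used. The small window trades a little security for tolerance of clock drift.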

Q: Does deploying Ceph in production require any special handling? What problems did you run into?

A: With Ceph, pay attention to the version: use a stable release, and one that large companies already run. Ceph is also very resource-intensive. Station B uses all SSDs, and Ceph's internal traffic runs on a dedicated 10-gigabit network. The biggest problem with Ceph is one of understanding: Ceph is a distributed single point of storage, made up of several nodes and several copies. In a large KVM block storage cluster with 64 nodes and 3 copies of the data, problems are complex to solve and require people who love digging in and can read the code.

Q: How many people are in the operation and maintenance team of Station B?

A: We started from 0 last year, and now we have more than 20 people, covering application, R&D, security, information and so on.

Q: Does GlusterFS storage stall?

A: I think GlusterFS is only good for cold storage for large files.

Q: Why use KVM instead of Docker?

A: We also use Docker. Docker has been attracting a lot of attention, but not many people actually use it; only large companies have invested a lot of resources in it. Since Docker 1.9.0, we have been running our core SLB on Docker in host mode. One of our big goals for the second half of this year is to move other online services onto Docker. We are already trying out the Mesos Macvlan mode and working through its pitfalls.

Q: Do you also handle Hadoop-related operations?

A: We handle big data too, but for now there is no full-time staff for it. Because we lack dedicated people for technical research, I assign research topics to the application operations engineers. Big data is assigned to one application operations engineer, who learns alongside the developers.

Q: Do you bond the server network cards?

A: All our servers use dual network card bonding: a 10-gigabit bond0.

Q: With so many faults, how do you troubleshoot them quickly?

A: It's very difficult. On the one hand, you need to understand the business; on the other hand, you need data and tools. At the beginning, we tracked down problems very slowly. We then improved gradually, for example by improving monitoring, adding fault anchor points, and writing up fault summaries. Recently many companies have built Dapper-style link tracing. In essence, the request is tagged at each hop and then selectively analyzed in real time. The end result of Dapper is like the browser's Inspect Element, where you can see where the time goes.
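To make the idea of tagging each hop of a request concrete, here is a toy Python sketch of Dapper-style tracing: every hop records a span that carries the same trace ID and a pointer to its parent, and only a sampled fraction of requests is kept. All names (`new_trace`, `span`, `handle_request`, the hop names, the in-memory `COLLECTED` list) are hypothetical; a real system would ship spans to a collector and pass the trace ID between services in request headers.

```python
import random
import time
import uuid
from contextlib import contextmanager

COLLECTED = []  # stand-in for a real trace collector


def new_trace(sample_rate=1.0):
    """Start a trace; only a sampled fraction of requests is recorded."""
    return {
        "trace_id": uuid.uuid4().hex,
        "sampled": random.random() < sample_rate,
    }


@contextmanager
def span(trace, name, parent=None):
    """Time one hop of a request and record it under the trace ID."""
    record = {
        "trace_id": trace["trace_id"],
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        if trace["sampled"]:
            COLLECTED.append(record)


def handle_request(trace):
    """Hypothetical request path: a front end calling a cache and a database."""
    with span(trace, "frontend") as root:
        with span(trace, "cache.get", parent=root):
            time.sleep(0.001)
        with span(trace, "db.query", parent=root):
            time.sleep(0.002)
```

After `handle_request(new_trace())`, grouping the records in `COLLECTED` by `trace_id` and following `parent_id` rebuilds the call tree with a duration per hop, which is the waterfall-style view the answer compares to the browser's Inspect Element.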

Q: With mode 0, do you get the total bandwidth or only that of one network card? I was testing mode=4 together with dynamic link aggregation on the switch, and the problem I ran into was that when servers transmitted to each other, the bandwidth was only the speed of one network card.

A: Mode 0 works best with matching configuration on the switch. Traffic runs over both network cards, which gives both redundancy and load sharing. Our self-built CDN needs very high bandwidth, so each machine has 20G. At Cheetah, mode 4 was used, and it was also good. Mode 6 requires no special switch configuration, but traffic is unbalanced in one direction. Mode 4 worked best in earlier tests, but the company ended up using mode 6 because it was easy to maintain.

As for the bandwidth problem, you need at least 2 clients transmitting to one server at the same time to reach the bandwidth of both network cards. When I tested mode 0 in the past, I found that the bandwidth could not be saturated, so I used mode 6. That was years ago, probably on CentOS 5 or 6. Station B now uses Debian 8, and we have not found any problem with mode 0.
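One way to see why a single pair of servers tops out at one card's speed under hash-based modes such as mode 4: the outgoing card is chosen by hashing packet addresses, so every packet between the same two hosts takes the same card. The Python sketch below is a simplified model of a layer 2 style hash (XOR of the last MAC bytes, modulo the number of cards). The real Linux formula depends on the configured `xmit_hash_policy` and kernel version, switches typically apply their own hash in the other direction, and the MAC addresses and function names here are made up.

```python
def last_octet(mac: str) -> int:
    """Return the last byte of a MAC address such as '52:54:00:00:00:10'."""
    return int(mac.split(":")[-1], 16)


def pick_card(src_mac: str, dst_mac: str, card_count: int = 2) -> int:
    """Simplified layer 2 hash: the same address pair always maps to one card."""
    return (last_octet(src_mac) ^ last_octet(dst_mac)) % card_count


def cards_used(flows, card_count: int = 2) -> set:
    """Set of card indexes that a list of (src, dst) flows would occupy."""
    return {pick_card(src, dst, card_count) for src, dst in flows}


# Made-up addresses for illustration only.
SERVER = "52:54:00:00:00:10"
CLIENT_A = "52:54:00:00:00:01"
CLIENT_B = "52:54:00:00:00:02"
```

In this model `cards_used([(CLIENT_A, SERVER)])` contains a single card no matter how many packets are sent, while `cards_used([(CLIENT_A, SERVER), (CLIENT_B, SERVER)])` covers both. Mode 0 (round-robin) instead alternates cards per packet, which is why it can exceed one card's speed for a single flow, at the cost of possible packet reordering.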

Q: Is your Redis 3.0 cluster stable?

A: Redis 3.0 is quite stable. Its Java client is relatively good; for other languages you may have to develop a client yourself. We use many languages here, and some services still go through a proxy. We are developing a cache management system that will eventually be compatible with the various access methods, and we plan to open source it in the future.

Q: Is BiliTW this project: GitHub – anewhuahua/bilitw: twemproxy multi process?

A: No, this is a multi-process version based on Twemproxy made by a former colleague. A new version will be rebuilt in the future and published under Bilibili on GitHub.

Q: Does Station B use the cloud much?

A: Internally, it’s like a private cloud. Games use the public cloud more.