As everyone knows, even though I'm a programmer, I really love sports, dancing in particular. Every day when I get home, I study the latest routines in Bilibili's dance section before going to bed.

Yesterday was no exception. As soon as I'd washed up, I rushed to the computer and opened Bilibili's dance section, ready to learn the new moves from Yaorenmao, Xin Xiaomeng, Xiaoxianruo and the others. I have to say, the waifus are really good; even an introvert like me couldn't help swaying along.

Just as I was about to learn the next move, the page turned into a 404 Not Found.

As a developer, my first instinct was that their system had gone down, though I briefly suspected my own network. My phone's mobile data was fine and other websites loaded normally on my computer, so I knew their developers were in trouble.

I refreshed a few more times and it was still down. I felt a bit sorry for the developers on duty; their year-end bonus is probably gone. (The site still hadn't recovered when I started writing this.)

Out of old professional habit, I started thinking about how Bilibili's architecture is probably put together, and which pieces might turn out to have failed once this incident gets post-mortemed.

First, let's sketch a simplified architecture for a site like this, and then guess where the problem might lie.

Since I'm staying up late to write this, and I've never worked at a company whose core business is video and livestreaming, I'm not very familiar with that exact stack. So the sketch follows the general logic of the e-commerce systems I do know; please go easy on me.

From top to bottom it runs roughly: entry point, CDN content distribution, front-end servers, back-end servers, distributed storage, big data analytics, risk control, and search/recommendation. It's only a sketch, but I doubt the real architecture differs all that much.

I did a quick search on comparable companies such as Douyu, Bilibili, and AcFun; their main technology stacks and technical pain points roughly include:

Video access storage

  • Traffic
  • Routing to the nearest node
  • Video encoding and decoding
  • Resumable transfer (much hairier than the toy IO examples we write; see the sketch after this list)
  • Database system & file system isolation
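
To make "resumable transfer" concrete: at the protocol level it mostly comes down to the HTTP Range header. Here's a minimal Java sketch of the client side, my own toy example rather than anything Bilibili actually runs:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

/** Toy resumable download: if a partial file exists, request only the
 *  remaining bytes with an HTTP Range header and append to it. */
public class ResumableDownload {
    public static void download(String url, File target) throws IOException {
        long already = target.exists() ? target.length() : 0;

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (already > 0) {
            conn.setRequestProperty("Range", "bytes=" + already + "-");
        }
        // 206 Partial Content means the server honoured the Range request
        boolean resuming = conn.getResponseCode() == HttpURLConnection.HTTP_PARTIAL;

        try (InputStream in = conn.getInputStream();
             RandomAccessFile out = new RandomAccessFile(target, "rw")) {
            if (resuming) {
                out.seek(already);   // continue right after the bytes we already have
            } else {
                out.setLength(0);    // server ignored Range (or fresh download): start over
            }
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```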

Concurrent access

  • Streaming media servers (every major vendor has them; the bandwidth cost is high)
  • Data clustering, distributed storage, caching
  • CDN content distribution
  • Load balancing
  • Search engine (sharding)

Barrage (danmaku) system

  • Concurrency and threading
  • Kafka
  • NIO frameworks such as Netty (see the sketch below)
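
To make the Netty item concrete, here's a toy danmaku broadcast server: every line a client sends is fanned out to all connected viewers. It's a minimal sketch of the NIO pattern, not Bilibili's actual system, which would also persist comments (e.g. through Kafka) and handle auth, rooms, and rate limits.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.LineBasedFrameDecoder;
import io.netty.handler.codec.string.StringDecoder;
import io.netty.handler.codec.string.StringEncoder;
import io.netty.util.concurrent.GlobalEventExecutor;

public class DanmakuServer {
    // every connected viewer; the ChannelGroup drops channels automatically when they close
    private static final ChannelGroup viewers =
            new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts connections
        EventLoopGroup workers = new NioEventLoopGroup(); // handles IO events
        try {
            ServerBootstrap b = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline()
                              .addLast(new LineBasedFrameDecoder(1024)) // one danmaku per line
                              .addLast(new StringDecoder())
                              .addLast(new StringEncoder())
                              .addLast(new SimpleChannelInboundHandler<String>() {
                                  @Override
                                  public void channelActive(ChannelHandlerContext ctx) {
                                      viewers.add(ctx.channel());
                                  }

                                  @Override
                                  protected void channelRead0(ChannelHandlerContext ctx, String msg) {
                                      // broadcast the comment to everyone watching
                                      viewers.writeAndFlush(msg + "\n");
                                  }
                              });
                        }
                    });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```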

In truth they use the same technologies the rest of us learn; the difference is the language mix of their microservices, where Go, PHP, Vue, and Node probably account for a larger share.

Now let's go through the likely causes and the places where this incident could have happened:

1. Someone deleted the database and ran

Weimob already went through exactly this, so I'd hope no company still grants ops permissions that broad; on the hosts, destructive commands like rm -rf, fdisk, and drop can simply be forbidden outright.

Besides, databases today are usually multi-master with multi-site backups, and disaster recovery should be well rehearsed. And even if only the database had blown up, the many static resources on the CDN should still have loaded, yet the whole page went straight to 404.

2. A single microservice failure dragging down the whole cluster

Front end and back end are separated nowadays, so if only the back end had failed, most of the front end would still load and merely error out when fetching data. So perhaps the front-end cluster failed as well, or front and back went down together. But the same objection applies: right now even the static resources are unreachable.

Still, there is a scenario where a few services crashed, errors piled up, and the whole cluster went down with them. On top of that, the more this happens, the harder people hammer the refresh button, which makes it even harder for the surviving services to restart. But I don't rate this as likely as the last possibility below.
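
The standard defence against this kind of cascade is a circuit breaker with a fallback: once a downstream service keeps failing, stop calling it for a while instead of piling more load onto it. Here's a hand-rolled minimal sketch of the idea (a real project would more likely use a library such as Resilience4j or Sentinel):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Minimal circuit breaker: after N consecutive failures, fail fast for a
 *  cool-down period instead of piling more load onto a struggling service. */
public class CircuitBreaker {
    private final int failureThreshold;
    private final Duration coolDown;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private volatile Instant openedAt; // null means the circuit is closed

    public CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        Instant opened = openedAt;
        if (opened != null && Instant.now().isBefore(opened.plus(coolDown))) {
            return fallback.get();              // circuit open: skip the remote call entirely
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures.set(0);         // success closes the circuit again
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                openedAt = Instant.now();       // trip the breaker
            }
            return fallback.get();              // degrade instead of cascading the error
        }
    }
}
```

A caller would wrap each downstream call, e.g. `breaker.call(() -> commentClient.hotComments(videoId), Collections::emptyList)`, where `commentClient` is a hypothetical service client used purely for illustration.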

3. A problem on the server vendor's side

A site this big is CDN + SLB + application clusters, and in principle the rate limiting, graceful degradation, and load balancing should all be done properly, so what remains is a problem with the vendor's hardware underneath those front-facing services.
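
For reference, the rate limiting mentioned above usually boils down to something like a token bucket sitting at the gateway. A toy sketch of just that idea, not any particular gateway's implementation:

```java
/** Toy token-bucket rate limiter: admit at most `ratePerSecond` requests on
 *  average, with bursts up to `capacity`; everything else is rejected at the
 *  edge (e.g. answered with HTTP 429) instead of reaching the backend. */
public class TokenBucketLimiter {
    private final long capacity;          // maximum burst size
    private final double refillPerNano;   // tokens added per nanosecond
    private double tokens;
    private long lastRefill = System.nanoTime();

    public TokenBucketLimiter(long capacity, double ratePerSecond) {
        this.capacity = capacity;
        this.refillPerNano = ratePerSecond / 1_000_000_000.0;
        this.tokens = capacity;
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;   // request admitted
        }
        return false;      // over the limit: reject or degrade
    }
}
```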

What puzzles me, though, is that Bilibili's BFF layer should route users to access nodes in relatively nearby data centers. If so, with people surfing from all over the country, some regions should have been fine and others broken. But right now it looks like everyone is broken.

In theory, everything from the CDN to distributed storage, big data, and search should have had plenty of safeguards in place; if it really was all concentrated in one location, that was not very wise.

My gut feeling is that the move to the cloud isn't finished, an on-premises machine had a problem, and what isn't on the cloud happens to be the critical business. Companies these days run a hybrid of public and private cloud, but the private-cloud part is Bilibili's own internal business, so it shouldn't be their own machine room at fault.

If it really is what I described, everything bet on a single server vendor and the physical machines at fault, then data recovery could be slow. I used to do big data work myself, so I know backups are incremental plus full; the nodes that are still healthy can pull data from other regions, but if everything sits in one place, that is real trouble.

Analysis

Whatever the root cause turns out to be, what we engineers and our companies should really be thinking about is how to prevent this kind of thing from happening to us.

Data backup: backups are non-negotiable, otherwise a real natural disaster would hurt badly. That's why many cloud vendors now build data centers in my home province of Guizhou, where natural disasters are rare, or at the bottom of lakes and the sea (cooling costs drop a lot there).

Full and incremental backups should basically run all the time: rolling daily, weekly, and monthly incrementals plus scheduled full backups keep potential losses small. Then the only thing left to fear is the spinning disks in every region dying at once (with remote disaster recovery, short of the planet being destroyed, you can recover).
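
The full + incremental scheme is simple enough to sketch. This toy planner (the `Backup` record is a hypothetical stand-in, not any real backup tool's API) just decides what to apply and in what order: the newest full backup on or before the target date, then every incremental after it.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy restore planner for a "full + incremental" backup scheme. */
public class RestorePlanner {
    record Backup(LocalDate date, boolean isFull, String location) {}

    /** Newest full backup taken on or before the target date,
     *  followed by every later incremental up to the target date, in order. */
    static List<Backup> planRestore(List<Backup> all, LocalDate target) {
        Backup latestFull = all.stream()
                .filter(b -> b.isFull() && !b.date().isAfter(target))
                .max(Comparator.comparing(Backup::date))
                .orElseThrow(() -> new IllegalStateException("no full backup before " + target));

        List<Backup> plan = new ArrayList<>();
        plan.add(latestFull);
        all.stream()
           .filter(b -> !b.isFull()
                     && b.date().isAfter(latestFull.date())
                     && !b.date().isAfter(target))
           .sorted(Comparator.comparing(Backup::date))
           .forEach(plan::add);
        return plan;   // apply these in order to rebuild the data as of `target`
    }
}
```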

Ops permissions need to be reined in, or you're always one step away from a delete-the-database-and-run incident. I run rm -rf on servers all the time myself, but access normally goes through a jump host, where dangerous commands can be blocked outright.

Cloud + cloud native: cloud products are very mature now, and enterprises should be able to place reasonable trust in their cloud vendor (choosing the right one still matters). The individual product capabilities are one thing; the disaster recovery and emergency response mechanisms for the critical moment are something most companies simply don't have on their own.

Cloud native is the technology everyone has been watching in recent years: the Docker + K8s combination, together with the cloud's own capabilities, can give you unattended operation, dynamic scale-out and scale-in, and the kind of emergency response mentioned above. Adopting it carries its own cost, though, and I don't know how well a video system like Bilibili's fits the model.
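
The "dynamic scale-out and scale-in" part is less magical than it sounds: a Kubernetes-style horizontal autoscaler essentially keeps recomputing a replica count from observed load versus target load. A toy version of that arithmetic only, with no real Kubernetes API calls:

```java
/** Toy version of the scaling rule a Kubernetes-style horizontal autoscaler
 *  applies: desired = ceil(current * observedLoad / targetLoad), clamped to
 *  a min/max range. */
public class AutoscaleMath {
    static int desiredReplicas(int currentReplicas, double observedCpu, double targetCpu,
                               int minReplicas, int maxReplicas) {
        int desired = (int) Math.ceil(currentReplicas * observedCpu / targetCpu);
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 10 pods running at 90% CPU against a 60% target -> scale out to 15 pods
        System.out.println(desiredReplicas(10, 0.90, 0.60, 2, 50));
    }
}
```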

In fact, whether you're on the cloud or not, you can't lean too heavily on the cloud vendor; you still need your own core technical capabilities and emergency mechanisms. What if the vendor really lets you down? How to achieve genuine high availability is something every company's engineers need to think through for themselves.

Many cloud vendors, for example, slice one physical machine into multiple virtual machines to sell, so several tenants share the same host. If one tenant is an e-commerce shop running a Double Eleven sale and another is a game studio, they fight over network bandwidth, packets get dropped, and the game's players get a miserable experience. That's why I say don't trust and depend on cloud vendors too much.

And if the neighbouring tenant bought the machine to mine crypto, it's even worse: the compute gets squeezed dry and everything runs at full load, which is even more painful.

For Bilibili, it's fortunate the problem surfaced this early, and at night, leaving a long traffic trough in which to recover. As I write this, most of the site has come back, though I notice it's still only a partial recovery.

Whether the next one can be prevented entirely is hard to say, but I believe Bilibili will spend a long stretch after this busy overhauling its architecture to achieve genuine high availability.

I just hope I can watch the dance section in peace in the evenings, instead of staring blankly at the 2233 mascot girls on 502 and 404 pages.