We have a blogging service built on WordPress. It was originally intended for internal use, so it was designed for very low QPS and deployed on a single machine.

Having the service overwhelmed was painful, but tracking down the cause turned out to be fun. Credit goes to our operations (O&M) team, who handled the incident:

  1. The O&M team found that the blogging service was under heavy load, with the machine's CPU at 100%, and notified our team immediately.

  2. The O&M team quickly scaled from one machine to five, but CPU on all five machines soon hit 100% as well.

  3. The O&M team then throttled the blog traffic, and the service and machines stabilized.

We then started tracing the source of the traffic spike:

  1. From the access logs, we identified the most-requested URL.

  2. The logs showed traffic rising sharply from 15:30, which suggested the URL had been pushed to an app or placed at some high-traffic entry point.

  3. The logs confirmed that all requests came from mobile phones and carried no Referer header. The client IPs were mostly from India, and the logs could tell us nothing more.

  4. We asked the product, planning, and operations leads; all of them said they had not run any related campaigns.

  5. From the content behind the URL, we suspected another team in the group was involved. We reached out to that team's leader, but he was in a meeting and did not respond.

  6. We opened that team's app and found that its launch (splash) page was the blog URL. The source was identified.

After we informed O&M, they decided to redirect the traffic to our mall's home page, so every user opening that team's app saw the mall home page first. The service was back up, and we even harvested some traffic for ourselves.

The other team's leader eventually saw our message, apologized, and removed the launch page.

Tracking down the problem was an interesting journey, with a big reversal from being overwhelmed to harvesting traffic. But it also exposed several problems:

  1. When integrating with another group's service, confirming with the business side alone is not enough; confirm with the service owner as well. The business side may lack the technical perspective, which makes it easy for a team's traffic to be harvested, doing the work only for someone else to reap the benefit, as happened here.
  2. We had no standby machine resources. Capacity was limited: only four machines were available for expansion at the time, and even after adding them we still could not meet the demand. This will be resolved once we move to containers.
  3. Our handling was not optimal. Scaling out should not have been the first step: given the traffic volume, the machines could not cope even after expansion, and their CPUs also hit 100%, affecting the other services running on them. The better practice was to rate-limit first and, once the hot URL was known, cache it at the nginx layer (see the config sketch right after this list).
  4. We had no contingency plan. Every service should have an emergency plan for sudden high concurrency so that incidents can be handled in a more orderly way.
  5. We lack degradation, circuit breaking, and similar capabilities, which any mature engineering organization needs; we will add them later (a minimal circuit-breaker sketch follows the nginx example below).
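
To make point 3 concrete, here is a minimal sketch of rate limiting plus nginx-layer caching for a hot URL. It assumes the blog already sits behind nginx; the zone names, sizes, rates, and the upstream address are illustrative assumptions, not our actual configuration.

```nginx
http {
    # Rate limiting keyed on client IP: roughly 10 req/s per IP, with a small burst.
    limit_req_zone $binary_remote_addr zone=blog_limit:10m rate=10r/s;

    # Cache for rendered blog pages (anonymous traffic like the hot launch-page URL).
    proxy_cache_path /var/cache/nginx/blog levels=1:2 keys_zone=blog_cache:50m
                     max_size=1g inactive=10m use_temp_path=off;

    upstream blog_backend {
        server 127.0.0.1:8080;   # the single WordPress machine (address assumed)
    }

    server {
        listen 80;

        location / {
            limit_req zone=blog_limit burst=20 nodelay;

            proxy_cache blog_cache;
            proxy_cache_key $scheme$host$request_uri;
            proxy_cache_valid 200 1m;                      # serve the hot page from cache for a minute
            proxy_cache_use_stale error timeout updating;  # fall back to stale content if the backend struggles
            add_header X-Cache-Status $upstream_cache_status;

            proxy_pass http://blog_backend;
        }
    }
}
```

Caching the rendered page for even a minute means most of a burst like this never reaches WordPress at all, which is exactly what the single machine needed.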
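
On point 5, the idea behind a circuit breaker can be shown in a few lines of Go (Go because that is what our other posts, such as the Gin one, use). This is only a sketch to illustrate the concept: the thresholds, the cooldown, and the simulated failing call are made-up values, and a real service would more likely rely on a maintained library.

```go
// Minimal circuit-breaker sketch: open after N consecutive failures,
// reject calls during a cooldown, then let traffic through again.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit breaker is open")

type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int           // consecutive failures before the breaker opens
	cooldown    time.Duration // how long to stay open before trying again
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open; callers should fall back to a
// degraded response (cached page, default data) when ErrOpen is returned.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	b.failures = 0 // success closes the breaker again
	return nil
}

func main() {
	b := NewBreaker(3, 5*time.Second)
	for i := 0; i < 5; i++ {
		err := b.Call(func() error {
			return errors.New("backend overloaded") // simulate a failing dependency
		})
		fmt.Println(i, err)
	}
}
```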

Finally

If you like my articles, you can follow my WeChat public account (Programmer Malatang):

mp.weixin.qq.com/s/Y6Axp4wfa…

Commonly used caching techniques: mp.weixin.qq.com/s/xElsNUjxi…

How to integrate with third-party payment effectively: mp.weixin.qq.com/s/NM34aevx3…

A concise guide to the Gin framework: mp.weixin.qq.com/s/X9pyPZU63…