Shift-right testing: a 360 DSP online monitoring case study

Odd technical guidelines

You have probably heard of shifting testing left and shifting testing right. The most common form of shift-right testing is monitoring the production environment and collecting user feedback in real time. Today, let's talk about online monitoring.

This article was first published on Qtest Tao and is reproduced with permission from 360 Technologies.

Preface

You have probably heard of shifting testing left and shifting testing right. The product development cycle can be sketched as a simple timeline. In the past, testing mostly happened after the development phase and before the product went live. Today, many test teams have moved part of their testing activities to the left: tests are designed before development starts, through practices such as behavior-driven development (BDD), unit testing, and code review. Shifting right, in its most common form, means monitoring the production environment and collecting user feedback in real time. Today, let's talk about online monitoring.

Why we do online monitoring

Let me tell you a story. One day at 7:30 p.m., a program failure in an advertising business caused a large number of channels to be blocked by mistake. Because revenue in that off-peak period made up only a small share of the day's total, and the alarm threshold on whole-day revenue was 20%, no alarm was triggered. The next day, a product colleague noticed that peak-hour revenue and requests had dropped by half and immediately asked engineering to investigate. Two hours later the rollback was complete, the problem was resolved, and requests and revenue returned to normal. By then, 16 hours had passed between the start of the problem and its resolution, and the losses were considerable.

I wonder how that story makes you feel.

The DSP business line I work on receives billions of ad requests every day. Even one minute of abnormal service at peak time can cause considerable losses. For a business this important to revenue, a lack of effective monitoring is a huge hidden danger, like a time bomb: nobody can guarantee that the system always works well, and nobody can watch it by hand around the clock. So online monitoring was urgent.

With this background in mind, we started our "shift-right" monitoring journey. What did we do?

What kind of monitoring we did

Interface-level monitoring

The first thing we did was build on the existing "Three Musketeers" Ialert monitoring system (http://www.jiekouceshi.com/ialert/): we set up separate interface-level monitoring for every server, with reasonable business-logic assertions. Once an interface fails its assertions more than 3 times, an SMS alarm is triggered. Monitoring the interface itself means we are always aware of anomalies on each of our servers. The following figure shows the monitoring overview of one interface in the Ialert system.
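The alarm rule above can be sketched roughly as follows. This is a minimal illustration, assuming that "more than 3 times" means more than three consecutive failed assertions and that a passing probe resets the count. `InterfaceMonitor`, `has_ad_payload`, and the response fields `status` and `ads` are hypothetical names, not part of Ialert.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class InterfaceMonitor:
    """Tracks assertion failures for one interface on one server."""
    name: str
    max_failures: int = 3          # alarm once failures exceed this count
    consecutive_failures: int = 0

    def record(self, response: Dict, assertion: Callable[[Dict], bool]) -> bool:
        """Record one probe result; return True when an alarm should fire."""
        try:
            passed = bool(assertion(response))
        except Exception:
            # A malformed response that breaks the assertion counts as a failure.
            passed = False
        if passed:
            # Assumption: a passing probe resets the failure count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures > self.max_failures


def has_ad_payload(response: Dict) -> bool:
    """Hypothetical business assertion: HTTP 200 and at least one ad returned."""
    return response["status"] == 200 and len(response["ads"]) > 0
```

In a real setup the caller would probe each server on a schedule and send the SMS whenever `record` returns True; the probing and SMS delivery are left out here.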

However, the upstream and downstream links of the DSP service are long, and the creatives are ultimately returned to the ADX (the advertising exchange platform) for bidding. If something goes wrong upstream, interface-level monitoring cannot see it. How do we solve this?

UI-level monitoring

Since the end result is that an ad is displayed and can be clicked normally, could we approach the problem from there? That led us to UI-level monitoring: we used PhantomJS to analyze the DOM structure of the final ad page, confirm that the browser rendered the ad correctly, and simulate a click to check that it jumped to the intended landing page. From this we could judge whether delivery was working. One problem is that ads are placed through RTB bidding, so loading a real media page does not guarantee that our ad appears. Fortunately, we already had a test page, built in advance, that shows our ad on every request, so we monitored this page together with ad slots on other internal media. For an ad like the one below, as long as parsing the DOM structure correctly finds the image and the text entry, and the jump address is correct, we consider the upstream and downstream path for that ad request to be working.
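The DOM check can be illustrated with a small sketch. Note that it swaps in Python's standard `html.parser` on static markup instead of driving PhantomJS, so it does not execute JavaScript or simulate a click; it only shows the "find the image and the jump address inside the ad slot" step. The container id `ad-slot`, the example URLs, and the names `AdSlotParser` and `ad_rendered_ok` are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Tags that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class AdSlotParser(HTMLParser):
    """Collects the first link target and image source inside the ad container."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0                          # > 0 while inside the container
        self.landing_url: Optional[str] = None
        self.image_src: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        attr = dict(attrs)
        if self.depth == 0:
            # Not inside the ad slot yet: only look for the container itself.
            if attr.get("id") == self.container_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and self.landing_url is None:
            self.landing_url = attr.get("href")
        elif tag == "img" and self.image_src is None:
            self.image_src = attr.get("src")

    def handle_endtag(self, tag: str) -> None:
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def ad_rendered_ok(html: str, container_id: str, expected_landing_prefix: str) -> bool:
    """True if the slot contains an image and a link to the expected landing page."""
    parser = AdSlotParser(container_id)
    parser.feed(html)
    parser.close()
    return (
        bool(parser.image_src)
        and bool(parser.landing_url)
        and parser.landing_url.startswith(expected_landing_prefix)
    )
```

A check like this would run against the dedicated test page on a schedule, with a failed result feeding the same alarm channel as the interface-level checks.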

Then another problem arises. This kind of monitoring covers only specific channel numbers (or ad slots). It is effective for global program failures, but for an incident like the mistakenly blocked channels described at the beginning, it may still miss the problem.

Revenue monitoring

In the advertising business, revenue is often the most intuitive metric, and a sudden drop in revenue at some point is a reason to be alert. After some investigation, we found that the company's BA system (figure below) can report revenue for the current period at 5-minute intervals. So we took its interface, request it automatically every 5 minutes, and compare the current time point with the same point yesterday and last week. Once the difference exceeds the set threshold, an SMS alarm is sent.
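The comparison step might look like the sketch below. The 30% threshold, the function names, and the choice to alarm when either baseline is exceeded are assumptions for illustration; the article does not state the real threshold or how the two comparisons are combined. Fetching the figures from the BA system is left out.

```python
from typing import List, Optional


def drop_ratio(current: float, baseline: float) -> Optional[float]:
    """Fractional drop of current versus baseline; None if baseline is unusable."""
    if baseline <= 0:
        return None
    return (baseline - current) / baseline


def revenue_alerts(
    current: float,
    yesterday: float,
    last_week: float,
    threshold: float = 0.3,   # assumed value, not from the article
) -> List[str]:
    """Compare one 5-minute slot with the same slot yesterday and last week."""
    alerts = []
    for label, baseline in (("yesterday", yesterday), ("last week", last_week)):
        drop = drop_ratio(current, baseline)
        if drop is not None and drop > threshold:
            alerts.append(f"revenue down {drop:.0%} vs same slot {label}")
    return alerts
```

Because each 5-minute slot is compared with the same slot on earlier days, a drop confined to an off-peak period can still stand out, which a threshold on whole-day revenue would not show.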

Daily monitoring of key indicators

In addition, anyone familiar with the advertising business knows that it has many key indicators, such as the number of requests, the number of successful bids, clicks, CPM, and CPC. The business side mainly follows these through a monitoring platform built by the development team. To make the daily data easier to check, we used the interfaces that platform provides to send a daily email reminder of the key indicators. It clearly shows yesterday's data compared with the same day last week and with the day before yesterday, which makes it easy to find problems in time.
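As a rough sketch of the daily comparison, the body of such an email could be built as below. The metric names, the dictionary inputs, and the functions `pct_change` and `build_daily_report` are hypothetical; fetching data from the monitoring platform and sending the email are not shown.

```python
from typing import Dict


def pct_change(current: float, previous: float) -> str:
    """Signed percentage change, or 'n/a' when there is no usable baseline."""
    if previous == 0:
        return "n/a"
    return f"{(current - previous) / previous:+.1%}"


def build_daily_report(
    yesterday: Dict[str, float],
    day_before: Dict[str, float],
    last_week: Dict[str, float],
) -> str:
    """Plain-text table: yesterday's value, day-over-day and week-over-week change."""
    header = f"{'metric':<12}{'yesterday':>16}{'vs day before':>16}{'vs last week':>16}"
    lines = [header]
    for name, value in yesterday.items():
        dod = pct_change(value, day_before.get(name, 0))
        wow = pct_change(value, last_week.get(name, 0))
        lines.append(f"{name:<12}{value:>16,.2f}{dod:>16}{wow:>16}")
    return "\n".join(lines)
```

For example, `build_daily_report({"requests": 110.0}, {"requests": 100.0}, {"requests": 220.0})` yields one row for `requests` showing a rise against the day before and a fall against last week.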

How effective the monitoring is

Since we set up this four-layer monitoring system, it has helped resolve many business problems. Here are two examples.

1. When an interface was released on a working day, some servers became abnormal because of process errors. Within a few minutes, SMS alarms were firing repeatedly, and the RD team fixed the problem within 15 minutes.

2. On a Saturday, a large volume of abnormal traffic hit the servers and exhausted their resources, so many interface requests could not return normally. After the alarm fired, the servers were repaired promptly. Without monitoring, it would have been hard to detect the problem this quickly at a time like the weekend.

Closing thoughts

The above is some of our practice in online monitoring for the DSP service. Admittedly, monitoring itself inevitably produces false positives: both server jitter and problems in the upstream data it depends on can cause them. Our goal is to reduce false positives while improving sensitivity and shortening the time it takes to discover problems, so that shifting testing right genuinely improves the online reliability of the business.
