Accident Summary Highlights 01 - The Bloodbath Caused by a Push Message (weekly)

Of all e-commerce accidents, I believe push-message accidents have the widest reach and hit the hardest. One example is the 36Kr "push-gate" incident in 2018, which made He Wenhui famous within a few days. I imagine the heart of anyone responsible for testing push messages sank at the time.

Let’s take a look at another push message accident.

The handling timeline was as follows:

17:29 Someone in the group chat asked us to check a problem. At that time, R&D was discussing this phase's test points with the testers and did not see the group message in time.
17:31 In the pre-release test, only users who had visited pre-release in the last 30 days were supposed to be selected;
Feedback in the group showed that real users, not just test accounts, were receiving the messages;
17:43:47 The pre-release push worker was stopped, but the messages had already reached a large number of users.

[Root cause]

1. Pre-release uses production-environment users, because creating a separate set of users for pre-release is not practical: the pre-release data is incomplete;
2. The developer did not know that the pre-release message channel can reach real users;
3. R&D has not developed the habit of using realistic copy when sending messages in the test environment.

“TODO”

1. Business-side tightening

Check-in pushes are limited to at most one every 7 days; the configuration is complete.
Pushes to users with unpaid orders on the checkout page are stopped; done.
Back-in-stock reminders for out-of-stock goods are stopped; done.

2. Training and Assessment

In the past, the onboarding pack was handed out to new employees, but the R&D group organized no corresponding training or assessment.
The weekly review of historical accidents had no follow-up assessment either, so we cannot tell how much people actually absorbed;
New staff need systematic training and assessment, followed by a second assessment a month later, to ensure that the key issues have been fully understood and mastered;

3. System limitations

Underlying channel, pre-release environment: add a whitelist so that messages can be sent only to R&D and testers; all other PINs are filtered out;
Underlying channel, pre-release and production environments: filter the copy (no Chinese characters, too few characters, and so on) and add risk-control sensitive-word filtering; business systems have been connected, and the check-in system has passed the review.
Business systems: sort out which channels they access and add an approval flow; business systems have been connected, and the check-in system has passed the review.
Hidden-risk audit: check whether business systems use the JSF retry mechanism when calling the message center, coupon issuing, or WeChat contact messages, and disable every such retry;
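
As a rough illustration of the two channel-level filters above (the pre-release whitelist and the copy check), here is a minimal Python sketch. It is not the real channel code: `check_push`, the whitelist contents, the sensitive-word list, the environment labels, and the minimum length of 6 characters are all assumptions made for the example.

```python
import re
from typing import Tuple

# Hypothetical values for illustration only.
PRE_RELEASE_WHITELIST = {"dev_pin_1", "qa_pin_1"}  # PINs of R&D and testers
SENSITIVE_WORDS = {"test", "测试"}                   # stand-in for the risk-control word list
MIN_COPY_LENGTH = 6                                 # assumed threshold for "too few characters"

_CJK = re.compile(r"[\u4e00-\u9fff]")


def check_push(env: str, pin: str, copy: str) -> Tuple[bool, str]:
    """Return (allowed, reason); every rejection carries an explanation."""
    if env == "pre-release" and pin not in PRE_RELEASE_WHITELIST:
        return False, f"PIN {pin} is not on the pre-release whitelist"
    if not _CJK.search(copy):
        return False, "copy contains no Chinese characters"
    if len(copy) < MIN_COPY_LENGTH:
        return False, "copy is too short"
    lowered = copy.lower()
    for word in SENSITIVE_WORDS:
        if word in lowered:
            return False, f"copy contains sensitive word: {word}"
    return True, "ok"
```

Returning a reason with every rejection keeps a filtered message from looking like a silent failure to the person who sent it.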

[Second review TODO]

System TODO:

Which high-risk interfaces in our systems would send duplicates if retried? To be audited: sending SMS, sending push, sending template messages, sending official-account messages, claiming coupons, and redeeming red envelopes;
Which high-risk interfaces across all our systems are not yet connected to the approval flow? These need to be sorted out;
The underlying channel filters by whitelist in pre-release. It should return an error message when it filters, because new employees who have not yet been added to the whitelist may otherwise be unable to work out why they receive no messages when they try a new system;
Whitelist: the physical gateway already has one. It could expose an interface so that all systems depend on this single whitelist, maintained in one place and used everywhere;
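
To make the retry question above concrete, the following Python sketch simulates a send interface that delivers the message but loses its response, which is the case where a retry turns into a duplicate push. `FakeMessageCenter`, `call_with_retry`, and `deliveries` are hypothetical names, and the sketch does not model JSF's actual retry behavior.

```python
class FakeMessageCenter:
    """Stand-in for a message service whose reply is lost after delivery."""

    def __init__(self):
        self.delivered = []

    def send_push(self, pin, copy):
        self.delivered.append((pin, copy))   # the user receives the message...
        raise TimeoutError("response lost")  # ...but the caller only sees a timeout


def call_with_retry(func, retries, *args):
    """Call func, retrying up to `retries` extra times after a timeout."""
    for _ in range(retries + 1):
        try:
            func(*args)
            return True
        except TimeoutError:
            continue
    return False


def deliveries(retries):
    """Count how many copies one logical send produces for a given retry setting."""
    center = FakeMessageCenter()
    call_with_retry(center.send_push, retries, "user_pin", "您有一张优惠券即将过期")
    return len(center.delivered)
```

With two retries the user receives three copies of the same push; with retries disabled the user receives one, which is why the non-idempotent interfaces listed above are the ones to audit.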

Off topic:

An interceptor that prevents a pre-release package from being deployed online: new projects often forget to add it. Can the project scaffolding generate it by default?
UMP system alarms: the on-call roster has changed, so the alarm recipients need to be updated centrally; otherwise important alarms will not be received;
For every requirement, ask the product manager whether the new and old versions will run as an A/B test and whether version compatibility is needed. Because a mini program goes from release to full rollout very quickly, the old version usually does not need to be kept, but A/B scenarios usually require it;
User data for the pre-release environment;
Development tools that automatically pull branches, change the compilation parameters, compile, and publish would improve efficiency and also avoid the risk of publishing an online package to pre-release;
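
The interceptor idea above could look something like the startup guard below, sketched in Python. It assumes the build step records a target environment inside the package and that the runtime environment is known at startup; `assert_package_matches_environment` and the environment labels are hypothetical and not taken from any existing scaffold.

```python
class EnvironmentMismatchError(RuntimeError):
    """Raised when a package built for one environment starts in another."""


def assert_package_matches_environment(build_target: str, runtime_env: str) -> None:
    """Refuse to start if the package was not built for this environment."""
    if build_target != runtime_env:
        raise EnvironmentMismatchError(
            f"package built for '{build_target}' must not run in '{runtime_env}'"
        )
```

If the scaffolding called a check like this once at application startup, a new project would get the protection without anyone having to remember to add it.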

Workers: we suggest adding a remarks field (in Chinese) to describe the risks. Whether in the production environment or the pre-release environment, a worker in the disabled state is disabled for a reason and must not be started at will.
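
A minimal Python sketch of this suggestion follows, assuming a simple in-memory worker record. `Worker`, `WorkerDisabledError`, and the `remark` field are hypothetical names, not the real scheduler's data model.

```python
from dataclasses import dataclass


class WorkerDisabledError(RuntimeError):
    """Raised when someone tries to start a disabled worker."""


@dataclass
class Worker:
    name: str
    enabled: bool = True
    remark: str = ""  # free-text risk note; may be written in Chinese

    def disable(self, reason: str) -> None:
        """Disable the worker and record why; an empty reason is rejected."""
        if not reason.strip():
            raise ValueError("a reason is required to disable a worker")
        self.enabled = False
        self.remark = reason

    def start(self) -> str:
        """Start the worker, or fail with the recorded reason if it is disabled."""
        if not self.enabled:
            raise WorkerDisabledError(f"{self.name} is disabled: {self.remark}")
        return f"{self.name} started"
```

Surfacing the remark in the error puts the risk description in front of whoever tries to start the worker.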

In the event of an accident

Stop the loss first. Do not investigate the cause yet; think of a fix immediately and implement it fast;
Then review: find the cause and reflect on how to avoid a repeat. Calm down and think it through slowly, and do not tell yourself "I still have unfinished work, I have to hurry back to it";

Optimize the rhythm

After a requirement finishes testing, I ask the product manager about the upcoming plans. Given those plans, I think about which optimizations the system needs, what value each one delivers, and how much development each one takes. After discussing them with the product manager, I raise the optimizations together with the next version's requirements and apply for testing resources for both at once.

Weekly meeting & review

Let's resume the weekly meeting at 2pm every Wednesday. The main content is still the review of historical accidents. People in the group take turns hosting. The host must prepare in advance and state each key point in one sentence so that everyone can understand it.

Also hold sharing and knowledge-exchange sessions on some internal systems;

Specification

Comments in the RPC layer should include the cf address of the interface documentation, the risks, and the current contact person for the interface;
Amount calculations need care: how many decimal places to keep, whether integers keep the decimal point, precision problems, and so on;
For high-risk scenarios, first work out how to split up the self-test;
When doing your own testing, first ask colleagues who have tested similar requirements, and use their test cases to guide your own;
Self-test step by step: first test channel connectivity with your own account, then comment out the channel call and test the business logic;
While testing, watch the logs and use the data volume and execution speed to sense the difference between pre-release and production;
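
For the amount-calculation item above, a small Python sketch shows the kind of precision problem meant and one common way around it, using the standard `decimal` module. Two decimal places and half-up rounding are assumptions for the example; `to_amount` and `order_total` are hypothetical helpers.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed precision: two decimal places


def to_amount(value: str) -> Decimal:
    """Parse an amount from its string form and round it to two decimal places."""
    return Decimal(value).quantize(CENT, rounding=ROUND_HALF_UP)


def order_total(unit_price: str, quantity: int, discount_rate: str) -> Decimal:
    """Multiply in exact decimal arithmetic and round once, at the end."""
    total = Decimal(unit_price) * quantity * Decimal(discount_rate)
    return total.quantize(CENT, rounding=ROUND_HALF_UP)
```

With binary floats, `0.1 + 0.2 == 0.3` is false, whereas `to_amount("0.1") + to_amount("0.2")` equals `Decimal("0.30")`. An integer input such as `to_amount("5")` is rendered as `5.00`, which makes the "does an integer keep the decimal point" decision explicit.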

Do message testing with reasonable copy

Calls to the production environment are risky and need to be thought through in advance. HTTP interfaces should distinguish production from pre-release by domain name;
Avoid handing a system back and forth between owners;
Workers usually cannot be tested by the testers, so we need to write our own test cases (testUnit) and split them finely enough that no single case exercises too much business logic at once.
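
One way to read the last point is to split a worker into small pure steps plus a thin shell that does the sending, so each step can be tested without touching the channel. The Python sketch below is an assumption about what that could look like; `select_recipients`, `build_copy`, `run_worker`, and the copy text are all hypothetical.

```python
from datetime import date, timedelta


def select_recipients(last_visit_by_pin, today, window_days=30):
    """Pure step: PINs whose last visit falls within the window, in sorted order."""
    cutoff = today - timedelta(days=window_days)
    return sorted(pin for pin, seen in last_visit_by_pin.items() if seen >= cutoff)


def build_copy(reward_name):
    """Pure step: render the push copy."""
    return f"您的{reward_name}已到账，快来领取"


def run_worker(last_visit_by_pin, today, send):
    """Thin shell: the only part that sends; `send` is injected so tests can fake it."""
    recipients = select_recipients(last_visit_by_pin, today)
    for pin in recipients:
        send(pin, build_copy("签到奖励"))
    return len(recipients)
```

In a test, `send` can be a function that appends to a list, so the selection logic and the copy are checked separately and nothing reaches a real user.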
