Incident description:

During a customer's promotion last week, the customer reported that refunds had gone through for some orders that had already shipped. This caused heavy losses and directly damaged the customer's trust. We conducted a dedicated review of the incident afterwards.

Impact:

More than 100 problem orders were involved, totaling nearly ten thousand yuan.

What caused it?

After the WMS shipped an order, the logistics information was returned to the OMS, but the call to the order delivery interface failed. Because the order delivery interface did not handle this exception, the order status was not synchronized to the platform in time. The correct business logic is that when a buyer submits a refund application, the OMS automatically intercepts the WMS shipment, provided this happens before customer service staff manually click the ship button to synchronize the platform status. In this case the interception failed because the WMS had already shipped, yet the platform order status had not been updated, so the refund application was approved by default. The root cause was that one instance of the order service failed to load its MQ configuration file, so that instance could not send MQ messages, and there was no message retry mechanism.
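The inconsistent state described above can be made concrete with a small sketch. It is illustrative only: the `Order` fields, the function name, and the outcome labels are hypothetical, and the behavior for an order whose platform status was correctly synchronized is an assumption that the description implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    wms_shipped: bool       # the warehouse has physically shipped the order
    platform_shipped: bool  # the platform has been told the order shipped


def refund_outcome(order: Order) -> str:
    """Hypothetical model of what happens to a buyer's refund application."""
    if not order.wms_shipped:
        # Normal path: the shipment can still be stopped at the warehouse.
        return "shipment_intercepted"
    if order.platform_shipped:
        # Assumption: a synchronized "shipped" status blocks the default approval.
        return "not_auto_approved"
    # The state in this incident: shipped in the WMS, stale on the platform.
    return "approved_by_default"
```

Only the third branch loses money: the goods have left the warehouse, but nothing the platform can see says so.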

Why was the problem not detected in time?

The project is privately deployed by the customer. The release process is complicated, releases and maintenance are carried out by the customer's operations team, and our monitoring system was replaced by the customer's own.
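An instance that starts without its MQ configuration is a condition the instance can report itself, whichever monitoring system is polling it. The following is a minimal sketch of such a readiness check. The config keys in `REQUIRED_MQ_KEYS` and the shape of the result are assumptions made for illustration, not the project's actual settings.

```python
from typing import Optional

# Hypothetical settings an instance needs before it can send MQ messages.
REQUIRED_MQ_KEYS = ("host", "port", "topic")


def mq_config_problems(config: Optional[dict]) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    if config is None:
        return ["MQ configuration was not loaded"]
    return [
        f"missing MQ setting: {key}"
        for key in REQUIRED_MQ_KEYS
        if not config.get(key)
    ]


def readiness(config: Optional[dict]) -> dict:
    """Summarize whether this instance should receive traffic."""
    problems = mq_config_problems(config)
    return {"ready": not problems, "problems": problems}
```

A health endpoint could expose the result of `readiness`, so that an instance in this state is reported as not ready instead of silently dropping messages.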

What was done when the exception was found?

  1. Identified the problem by analyzing logs.
  2. Contacted the customer's operations staff to take the faulty order service instance out of service.
  3. Identified the problem orders by technical means and sent the list to the customer's business staff so that they could intercept those orders.

How to avoid it in the future?

The review of the incident led to the following measures:

  1. The interface throws exceptions promptly so that the caller can handle the corresponding business logic.
  2. The message sending service provides an automatic retry mechanism. If a message fails to send, the system automatically retries it three times.
  3. Send SMS and DingTalk alerts when a critical node fails, so that the fault is handled promptly.
  4. Improve the monitoring system so that it tracks the status of every instance and problem containers are dealt with promptly.
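Measures 1 to 3 can be sketched together: retry a failed send three times, alert if every attempt fails, and raise so that the caller can react. This is a minimal sketch under stated assumptions, not the project's implementation. `send_with_retry`, `MessageSendError`, and the `send` and `alert` callables are hypothetical names, and the real MQ client and alert channel are not shown.

```python
import time
from typing import Callable


class MessageSendError(Exception):
    """Raised when a message could not be sent after all retries."""


def send_with_retry(
    send: Callable[[dict], None],
    message: dict,
    retries: int = 3,
    delay_seconds: float = 0.0,
    alert: Callable[[str], None] = print,
) -> int:
    """Send once, then retry up to `retries` times; return the attempts used."""
    last_error = None
    for attempt in range(1, retries + 2):  # 1 initial attempt + `retries` retries
        try:
            send(message)
            return attempt
        except Exception as exc:  # hypothetical: narrow this to the client's errors
            last_error = exc
            if attempt <= retries:
                time.sleep(delay_seconds)
    # Every attempt failed: notify people, then surface the failure to the caller.
    alert(f"message send failed after {retries} retries: {last_error}")
    raise MessageSendError(str(last_error)) from last_error
```

Raising after the last retry matters as much as the retry itself. The caller, here the order delivery flow, learns that the platform status was not synchronized and can hold the order instead of leaving it in a stale state.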

Incident summary:

Face up to every incident: analyze its cause, fix that cause, and improve prevention so that the next, possibly more serious, incident is avoided. Let's hope technologists revere every line of code!