In today's distributed systems and microservice architectures, failed service invocations have become the norm. How to handle these failures and how to keep data consistent are hard problems in microservice design.

Solutions vary according to service scenarios. Common solutions are as follows:

  • Blocking retry;
  • Traditional 2PC and 3PC transactions;
  • Queues with asynchronous background processing;
  • TCC compensation transaction;
  • Local message tables (asynchronous assurance);
  • MQ transactions.

This article focuses on the items other than traditional 2PC and 3PC transactions. Plenty of material on those is already available online, so they are not repeated here.

Blocking retry

Blocking retry is a common approach in microservice architectures.

Example pseudocode:

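m := db.Insert(sql)

err := postWithRetry("B-Service", m)

// postWithRetry POSTs body to url and makes at most three attempts.
func postWithRetry(url string, body interface{}) (err error) {
  for i := 0; i < 3; i++ {
    _, err = request.POST(url, body)
    if err == nil {
      break
    }
    log.Print(err) // log the failed attempt, then try again
  }
  return err // nil on success, otherwise the last error
}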

As above, when a request to service B's API fails, the call is attempted up to three times. If all three attempts fail, the error is logged, and execution either continues or returns an error to the upper layer.

This approach brings the following problems:

  1. Service B actually processed the request, but the current service treats the call as failed because of a network timeout and retries it. Service B then generates two identical records.
  2. The call fails because service B is unavailable, and it still fails after three attempts. The record that the current service inserted into the DB in the code above becomes dirty data.
  3. Retries increase latency for upstream callers, and if the downstream service is already under heavy load, retries amplify the pressure on it.

The first problem is solved by making service B's API idempotent.

The second problem can be mitigated by a scheduled background script that corrects the data, but this is not a good approach.

The third problem is a necessary sacrifice for the consistency and availability gained through blocking retry.

Blocking retry applies to scenarios where services are not sensitive to consistency requirements. If data consistency is required, additional mechanisms must be introduced to address it.
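To make the idempotency requirement on service B concrete, here is a minimal sketch in Go. It assumes the caller sends a unique idempotency key with each logical request; `OrderStore`, `Create`, and the key format are hypothetical names used only for illustration, and an in-memory map stands in for service B's database.

```go
package main

import (
	"fmt"
	"sync"
)

// OrderStore is a hypothetical in-memory stand-in for service B's database.
type OrderStore struct {
	mu     sync.Mutex
	byKey  map[string]int // idempotency key -> ID of the record already created
	nextID int
}

func NewOrderStore() *OrderStore {
	return &OrderStore{byKey: make(map[string]int)}
}

// Create inserts a record at most once per idempotency key.
// A retried request with the same key gets the original record ID back.
func (s *OrderStore) Create(key string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.byKey[key]; ok {
		return id, false // duplicate request: nothing new is written
	}
	s.nextID++
	s.byKey[key] = s.nextID
	return s.nextID, true
}

func main() {
	store := NewOrderStore()
	id1, created1 := store.Create("req-42")
	id2, created2 := store.Create("req-42") // retry after a timeout
	fmt.Println(id1, created1, id2, created2) // prints: 1 true 1 false
}
```

With this in place, a retry that follows a timed-out but successful call returns the existing record instead of creating a second one.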

Asynchronous queues

Introducing a queue is a common and desirable step as a solution evolves. The following is an example:

m := db.Insert(sql)

err := mq.Publish("B-Service-topic",m)

After the current service writes data to the DB, it pushes a message to MQ, and a separate service consumes the message and processes the business logic. Compared with blocking retry, this relies on MQ, which is far more stable than an ordinary business service, but the call that pushes the message to MQ can still fail, for example because of network problems or an outage of the current service. This leads to the same problem as blocking retry: the DB write succeeds but the push fails.

In theory, any code in a distributed system that involves multiple service invocations is exposed to this situation, and over a long enough period an invocation failure is certain to occur. This is one of the difficulties of distributed system design.

TCC compensation transactions

When transactional guarantees are required and the services cannot easily be decoupled, TCC compensation transactions are a better choice.

TCC breaks each service invocation into two phases and three operations:

  • Phase 1, Try: check and reserve service resources, for example check the inventory and withhold stock.
  • Phase 2, Confirm: confirm the resources reserved by the Try operation, for example turn the withheld stock into an actual deduction.
  • Phase 2, Cancel: after a Try operation fails, release the resources withheld by Try, for example add the withheld stock back.

TCC requires each service to implement APIs for all three operations. Work that was done in a single call before the service adopted TCC now has to be done as three operations across two phases.

For example, a shopping application needs to call inventory service A, amount service B, and points service C, with the following pseudocode:

m := db.Insert(sql)
aResult, aErr := A.Try(m)
bResult, bErr := B.Try(m)
cResult, cErr := C.Try(m)
// Only cErr is checked here to keep the example short;
// real code must also check aErr and bErr.
if cErr != nil {
    A.Cancel()
    B.Cancel()
    C.Cancel()
} else {
    A.Confirm()
    B.Confirm()
    C.Confirm()
}

The code calls the Try APIs of services A, B, and C to check and reserve resources, and if they all succeed it calls their Confirm operations. If the Try operation of service C fails, the Cancel APIs of A, B, and C are called in turn to release the reserved resources.

TCC solves the problem of data consistency across multiple services and databases in a distributed system. However, TCC still has some problems that need attention in practice, including the call failures discussed in the previous sections.

Empty release

If C.Try() in the code above genuinely fails, the extra C.Cancel() call that follows releases a resource that was never locked. The Cancel is still necessary because the current service cannot tell whether the failed call actually locked C's resource. If Cancel were skipped and the Try had in fact succeeded but was reported as failed because of a network problem, C's resource would stay locked and never be released.

Empty releases occur frequently in production environments, so services should support empty-release execution when implementing their TCC transaction APIs.

Ordering

If C.Try() fails in the code above, C.Cancel() is then called. Because of network problems, the C.Cancel() request may reach service C first and the C.Try() request may arrive later. The Cancel is then an empty release, and the late Try locks C's resources with nothing left to release them.

So service C should reject a Try() that arrives after the resource has been released. In the implementation, a unique transaction ID can be used to distinguish a first Try() from a Try() that arrives after the release.
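The two rules above (accept an empty release, reject a Try that arrives after the release) can be sketched in Go as follows. The `Inventory` type, its fields, and the error values are hypothetical; an in-memory map keyed by transaction ID stands in for whatever storage service C really uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type state int

const (
	tried state = iota + 1
	cancelled
)

var (
	ErrAlreadyCancelled = errors.New("transaction already cancelled")
	ErrInsufficient     = errors.New("insufficient stock")
)

// Inventory is a hypothetical TCC participant that reserves stock per transaction ID.
type Inventory struct {
	mu       sync.Mutex
	free     int
	reserved map[string]int   // txID -> quantity withheld by Try
	states   map[string]state // txID -> last recorded state
}

func NewInventory(stock int) *Inventory {
	return &Inventory{free: stock, reserved: map[string]int{}, states: map[string]state{}}
}

// Try withholds qty for txID. A transaction that was already cancelled is
// rejected, so a Try that arrives after its Cancel cannot lock anything.
func (inv *Inventory) Try(txID string, qty int) error {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	switch inv.states[txID] {
	case cancelled:
		return ErrAlreadyCancelled
	case tried:
		return nil // duplicate Try is a no-op
	}
	if qty > inv.free {
		return ErrInsufficient
	}
	inv.free -= qty
	inv.reserved[txID] = qty
	inv.states[txID] = tried
	return nil
}

// Cancel releases whatever txID withheld. When nothing was withheld (an
// empty release) it still succeeds and records the cancellation.
func (inv *Inventory) Cancel(txID string) error {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	inv.free += inv.reserved[txID] // zero if Try never ran
	delete(inv.reserved, txID)
	inv.states[txID] = cancelled
	return nil
}

// Free reports the stock that is not withheld by any transaction.
func (inv *Inventory) Free() int {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.free
}

func main() {
	inv := NewInventory(10)
	_ = inv.Cancel("tx-1")      // Cancel overtakes Try on the network
	err := inv.Try("tx-1", 3)   // the late Try is rejected
	fmt.Println(err, inv.Free()) // prints: transaction already cancelled 10
}
```

Recording the cancelled state per transaction ID is what lets the service tell a first Try from a late one; how long those records are kept is a separate design decision not covered here.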

Call failure

Calls to Cancel and Confirm can themselves fail, for example for common network reasons.

If Cancel() or Confirm() fails, the resource stays locked and is never released. This situation is common, and solutions include:

  • Blocking retry. This has the same problems as before, such as downtime or calls that keep failing.
  • Writing to a log or queue, and then having a separate asynchronous service handle it automatically or with manual intervention. This has problems too, because the write to the log or queue can also fail.

In theory, any two pieces of code that are not executed atomically within a transaction have an intermediate state, and wherever there is an intermediate state there is a possibility of failure.

Local message tables

Local message tables were originally proposed by eBay. The message table lives in the same database as the business data tables, so a local transaction can be used to provide transactional guarantees.

The approach is to insert a message record together with the business data in one local transaction and then perform the subsequent operations. If the subsequent operations succeed, delete the message. If they fail, keep the message; an asynchronous listener picks it up and retries.

Local message tables are a good idea and can be used in several ways:

With MQ

Sample pseudocode:

messageTx := tc.NewTransaction("order")
messageTxSql := tx.TryPlan("content")
m, err := db.InsertTx(sql, messageTxSql)
if err != nil {
    return err
}
aErr := mq.Publish("B-Service-topic", m)
if aErr != nil { // Push to MQ failed
    messageTx.Confirm() // Update the status of the message to confirm
} else {
    messageTx.Cancel() // Delete the message
}

// Asynchronously process messages in the confirm state and push them again
func OnMessage(task *Task) {
    err := mq.Publish("B-Service-topic", task.Value())
    if err == nil {
        messageTx.Cancel()
    }
}

messageTxSql is the SQL that inserts into the local message table:

insert into `tcc_async_task` (`uid`,`name`,`value`,`status`) values ('?','?','?','?')

It is executed in the same transaction as the business SQL, so the two either succeed together or fail together.

If the message is successfully pushed to the queue, the local message is deleted by calling messageTx.Cancel(). If the push fails, the message is marked as confirm. The status column of the local message table has two values: try and confirm. Messages in either state can be picked up in OnMessage to initiate a retry.

The local transaction guarantees that the message and the business data are written to the database together. Asynchronous listening then follows up on execution, so whether the service goes down or a push fails because of the network, the message is eventually pushed to MQ.

MQ in turn guarantees delivery to the consumer service, and MQ's QoS policies ensure that the consumer either processes the message or posts it on to the next business queue, which preserves the integrity of the transaction.

With service invocation

Sample pseudocode:

messageTx := tc.NewTransaction("order")
messageTxSql := tx.TryPlan("content")
body, err := db.InsertTx(sql, messageTxSql)
if err != nil {
    return err
}
aErr := request.POST("B-Service", body)
if aErr != nil { // Call to B-Service failed
    messageTx.Confirm() // Update the status of the message to confirm
} else {
    messageTx.Cancel() // Delete the message
}

// Asynchronously process messages in the confirm state and call B-Service again
func OnMessage(task *Task) {
    // request.POST("B-Service", body)
}

This is an example of a local message table combined with calls to other services, without introducing MQ. Asynchronous retry together with the local message table guarantees that the message is reliably delivered. It solves the problems of blocking retry and is common in day-to-day development.

If there is no local business write to the DB, you can write only to the local message table and handle the message in OnMessage in the same way:

messageTx := tc.NewTransaction("order")
messageTx := tx.Try("content")
aErr := request.POST("B-Service",body)
// ....

Message expiration

Configure handlers for the Try and Confirm messages in the local message table:

TCC.SetTryHandler(OnTryMessage())
TCC.SetConfirmHandler(OnConfirmMessage())

In the message handler, you need to check whether the current message task has existed for too long. For example, if a task has been retried for an hour and still fails, send an email or SMS and raise a logged alarm so that someone can intervene manually.

func OnConfirmMessage(task *tcc.Task) {
    if time.Now().Sub(task.CreatedAt) > time.Hour {
        err := task.Cancel() // Delete the message and stop the retry.
        // doSomeThing(): alert so that someone can intervene manually
        return
    }
}

The Try handler additionally needs to check whether the message was created too recently. A message in the try state may have only just been created and not yet been moved to confirm or deleted. Retrying it would duplicate the normal business execution, which means calls that succeed would also be retried. To avoid this, check whether the message's creation time is very recent and, if so, skip it.

The retry mechanism necessarily relies on the downstream API's business logic being idempotent. Skipping this check would therefore still work, but the design should avoid interfering with normal requests wherever possible.

Independent message service

The independent message service is an upgraded version of the local message table, in which the message table is split out into a separate service. Before any operation, add a message to the message service. If the subsequent operations succeed, delete the message; if they fail, submit the message as confirmed.

Asynchronous logic then listens for messages and handles them, in essentially the same way as with the local message table. However, because adding a message to the message service cannot be placed in the same transaction as the local operations, a message may be added successfully while the subsequent operation fails, which leaves a useless message behind.

Consider the following scenario:

err := request.POST("Message-Service", body)
if err != nil {
    return err
}
aErr := request.POST("B-Service", body)
if aErr != nil {
    return aErr
}

The message service therefore needs to check whether the operation behind a message was actually carried out. If it was not, the message is deleted; if it was, the subsequent logic continues. Compared with the try and confirm states of the local message table, the message service has one more state, prepare, which comes before them.
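One way to read the prepare state is as a check-back step: the message service asks the calling service whether the operation behind each prepared message really happened. The Go sketch below illustrates that reading only. `MessageService`, `ResolvePrepared`, and the `check` callback are hypothetical, and moving a verified message to confirm (rather than try) is an assumption, not something the text specifies.

```go
package main

import "fmt"

// Message is one record held by the hypothetical message service.
type Message struct {
	ID     string
	Status string // "prepare", "try", or "confirm"
}

// MessageService is an in-memory stand-in for the independent message service.
type MessageService struct {
	msgs map[string]*Message
}

func NewMessageService() *MessageService {
	return &MessageService{msgs: make(map[string]*Message)}
}

// Prepare registers a message before the caller starts its own operation.
func (s *MessageService) Prepare(id string) {
	s.msgs[id] = &Message{ID: id, Status: "prepare"}
}

// ResolvePrepared asks the caller, through check, whether the operation
// behind each prepared message was carried out. Messages whose operation
// never happened are deleted; the others move on to "confirm" (assumed).
func (s *MessageService) ResolvePrepared(check func(id string) bool) {
	for id, m := range s.msgs {
		if m.Status != "prepare" {
			continue
		}
		if check(id) {
			m.Status = "confirm"
		} else {
			delete(s.msgs, id)
		}
	}
}

// Status returns the status of a message and whether it still exists.
func (s *MessageService) Status(id string) (string, bool) {
	m, ok := s.msgs[id]
	if !ok {
		return "", false
	}
	return m.Status, true
}

func main() {
	svc := NewMessageService()
	svc.Prepare("order-1") // the caller went on to complete its operation
	svc.Prepare("order-2") // the caller failed right after adding the message

	done := map[string]bool{"order-1": true}
	svc.ResolvePrepared(func(id string) bool { return done[id] })

	fmt.Println(svc.Status("order-1")) // prints: confirm true
	fmt.Println(svc.Status("order-2")) // prints:  false
}
```

The check-back is what removes the useless messages left behind when adding the message succeeds but the operation that follows does not.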

MQ transactions

Some MQ implementations, such as RocketMQ, support transactions. An MQ transaction can be viewed as a concrete implementation of the independent message service, and the logic is the same.

Before any operation, send a message to MQ. If the subsequent operation succeeds, Confirm commits the message; if it fails, Cancel deletes the message. MQ transactions also have a prepare state, which requires MQ's consumption-processing logic to confirm whether the business operation succeeded.

Conclusion

Practice with distributed systems shows that wherever data consistency must be guaranteed, additional mechanisms have to be introduced.

The advantages of TCC are that it works at the business service layer, does not depend on a specific database, is not coupled to a specific framework, and allows flexible resource-lock granularity, which makes it very suitable for microservice scenarios. The disadvantage is that each service has to implement three APIs and handle all kinds of failures and exceptions, which is intrusive and requires substantial changes to the business code. It is hard for developers to cover every case, and adopting a mature framework, such as Alibaba's Fescar (since renamed Seata), can greatly reduce the cost.

The advantages of local message tables are that they are simple, do not require changes to other services, work well with both service invocation and MQ, and are practical in most business scenarios. The disadvantage is the extra message table in the local database, which is coupled to the business tables.

The advantage of MQ transactions and the independent message service is that the transaction problem is handled by a shared, separate service, which avoids coupling a message table to every service and avoids adding processing complexity to the services themselves. The disadvantages are that very few MQ implementations support transactions, and that calling an API to add a message before every operation increases overall call latency, which is unnecessary overhead in the majority of business scenarios where calls respond normally.

TCC reference: https://www.sofastack.tech/blog/seata-tcc-theory-design-realization/

MQ transaction reference: https://www.jianshu.com/p/eb571e4065ec