A few months ago, our project ran into the problem of distributed transactions. What is that problem? In a nutshell: we have two services, A and B, whose data sources are a_db and b_db respectively. Service A receives a request and, while performing an operation, needs to invoke service B; B continues the work on its side, deleting and inserting rows in b_db. After B finishes, A carries on executing and, at that moment, throws an exception. Because B is a separate service with its own data source, its work does not belong to A's transaction, so a_db rolls back while b_db does not. The data is now simply wrong. This is where the distributed transaction problem in our system came from.

Later, we introduced Seata, the distributed transaction component for Spring Cloud. I won't explain the installation of each Seata mode in detail one by one; but to make the later problem and its solution easier to follow, I'll first walk through how Seata solves distributed transactions.

An introduction to how Seata solves distributed transactions

Seata is divided into three parts: the TC (Transaction Coordinator), the TM (Transaction Manager), and the RM (Resource Manager). The TC is a standalone seata-server service that coordinates the whole distributed transaction; it can be downloaded from GitHub. The TM is the initiator of the global transaction and manages it end to end; it corresponds to the caller service in a project, i.e., service A in ours. There can be more than one RM; these are the callee services, i.e., service B in our project.

Their execution flow is as follows:

  • 1. TM sends a request to TC indicating that it wants to start a global transaction.
  • 2. TC receives the request from TM and generates an XID as the unique identifier of the global transaction, which it returns to TM.
  • 3. TM starts calling the other RMs and passes the XID along to them (the code sketch after this list shows what this looks like on the TM side).
  • 4. An RM receiving the XID knows that its local transaction belongs to that global transaction. It registers its local transaction with TC as a branch transaction under the XID and reports its execution result to TC.
  • 5. After each microservice has executed, TC knows the result of every branch transaction under this XID, and so does TM.
  • 6. When TM sees that all branch transactions succeeded, it sends TC a commit request; otherwise, it sends TC a rollback request.
  • 7. After receiving the request, TC sends the corresponding command to all branch transactions under the XID.
  • 8. After receiving TC's command, each microservice executes it and reports the result to TC.
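To make steps 1–6 concrete, here is a minimal sketch of the TM side in AT mode. The class and method names (OrderService, StorageClient, createOrder, deduct) are hypothetical, not from our actual project; StorageClient is a Feign client for service B, sketched further below.

```java
import io.seata.spring.annotation.GlobalTransactional;
import org.springframework.stereotype.Service;

@Service
public class OrderService { // plays the TM role: initiates the global transaction

    private final StorageClient storageClient; // hypothetical Feign client for service B (an RM)

    public OrderService(StorageClient storageClient) {
        this.storageClient = storageClient;
    }

    // Steps 1-2: entering this method asks the TC for an XID.
    // Step 3: the XID is propagated to B automatically on the Feign call.
    // Step 6: if this method throws, the TM asks the TC to roll everything back.
    @GlobalTransactional(name = "create-order", rollbackFor = Exception.class)
    public void createOrder(Long productId, int count) {
        // ... local work on a_db ...
        storageClient.deduct(productId, count); // branch transaction on b_db
        // an exception thrown here rolls back both a_db and b_db
    }
}
```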

To help you understand better, here is a diagram for demonstration:

The aftermath of introducing Seata

Our project is a non-standard microservice setup. It is split into services A and B, but there is no registry. Feign, a Spring Cloud component, is used for inter-service calls; without a registry, however, services cannot be discovered by service name, so we could only communicate through Feign's url mode. This, of course, sowed the seeds of the problem.
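For reference, a Feign client in url mode looks roughly like this; StorageClient, the endpoint path, and the `service-b.url` property are hypothetical placeholders:

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;

// No registry, so the target address is hard-coded via the url attribute;
// the request goes to whatever that URL resolves to (a direct ip:port, or nginx).
@FeignClient(name = "service-b", url = "${service-b.url}")
public interface StorageClient {

    @PostMapping("/storage/deduct")
    void deduct(@RequestParam("productId") Long productId,
                @RequestParam("count") int count);
}
```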

Seata actually supports several modes, such as AT, TCC, XA and Saga. We chose AT mode, which is non-invasive to business code and is also the officially recommended mode. As for registry and configuration options, there are many: file, Nacos, Eureka, Redis, ZooKeeper, Consul, etcd3, Sofa, and so on. In the end, because of the peculiarity of our project having no registry, we had to choose the single-machine file mode for both registry and configuration.
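With file mode, the registry.conf on each service looks roughly like this; exact keys can differ slightly between Seata versions, so treat this as a sketch:

```
registry {
  # no registry in our project, so Seata also runs in file mode
  type = "file"
  file {
    name = "file.conf"
  }
}

config {
  type = "file"
  file {
    name = "file.conf"
  }
}
```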

Then we downloaded and started seata-server, changed the seata-server address in file.conf, put the file.conf and registry.conf configuration files into resources, switched the data source to Seata's proxy data source, added the undo_log table to a_db and b_db, and changed @Transactional to @GlobalTransactional on the calling side in A. With that, the Seata distributed transaction framework was in place. We ran simulated unit tests, the distributed transaction problem was handled perfectly, so the code went to the test server and, after some time, to the pre-production server.
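A minimal sketch of the proxy data source wiring for AT mode; the pool choice and connection details are placeholders:

```java
import javax.sql.DataSource;

import com.zaxxer.hikari.HikariDataSource;
import io.seata.rm.datasource.DataSourceProxy;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DataSourceConfig {

    // Wrap the real data source in Seata's DataSourceProxy so AT mode can
    // record before/after images into undo_log and roll back automatically.
    @Bean
    public DataSource dataSource() {
        HikariDataSource raw = new HikariDataSource();
        raw.setJdbcUrl("jdbc:mysql://localhost:3306/a_db"); // placeholder
        raw.setUsername("user");                            // placeholder
        raw.setPassword("password");                        // placeholder
        return new DataSourceProxy(raw);
    }
}
```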

At the time I had too much work, so I only simulated it locally and left the server-side checks to the testers, not worrying much about the matter; after all, it was perfect locally. Then, a month after it reached the pre-production environment, a business exception produced bad data in a method that also involved a distributed transaction. Looking at the database, we made a strange discovery: the distributed transaction had not taken effect, and service B had not rolled back!!

Step-by-step analysis and solution of the Seata problem

After the problem occurred, I immediately reproduced the scenario locally and found that on my own machine the distributed transaction still worked. The code was identical and unchanged, so why did it fail in pre-production? It was baffling.

I started thinking about the differences between pre-production and my local tests. Locally, the url in the Feign call between the two services was filled with an ip:port that hits the specified service directly; pre-production uses a domain name, which means nginx forwards each request to a corresponding service instance based on the domain name. Suppose the first request lands on instance B1 and the second is forwarded to B2: they are not the same instance. A colleague suggested this might be related to the B service being a cluster. I still felt something was off, because as far as I knew, although pre-production B was a cluster, all the B instances connected to the same data source; even so, the theory sounded quite plausible.
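The pre-production routing, roughly reconstructed; the upstream name and addresses are made up:

```nginx
# Requests for B's domain are load-balanced across the two B instances
# (round-robin by default), so consecutive calls may hit B1 then B2.
upstream service_b {
    server 10.0.0.11:8081;  # B1
    server 10.0.0.12:8081;  # B2
}

server {
    listen 80;
    server_name service-b.example.com;

    location / {
        proxy_pass http://service_b;
    }
}
```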

After all, the test server has no cluster: it is just one A and one B, also calling each other by domain name rather than ip:port. I expected things to work there, but the result was miserable: even without a B cluster, the distributed transaction still failed. So clustering apparently had nothing to do with it; the test environment is one service to one service, every request hits the same B, and it still did not take effect. What was going on?

There was nothing left to do but change the Feign url to ip:port to solve the urgent problem, but this was not a good idea: in the production environment it would mean A always calling one particular B, making the cluster meaningless.

After some time, a colleague said that changing nginx's load-balancing policy to ip_hash mode would fix it, but nobody verified this, and I had no permission to change the server's nginx.conf to try. Besides, there was no guarantee anyway: distributed transactions failed even in the 1:1 test environment, where load balancing is irrelevant, so what good would switching to ip_hash do?
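For completeness, the colleague's suggestion would have looked like this, reusing the hypothetical upstream from above; ip_hash pins each client IP to one backend, which would mask instance-hopping but, as noted, could not explain the 1:1 failure:

```nginx
upstream service_b {
    ip_hash;                # same client IP always goes to the same instance
    server 10.0.0.11:8081;
    server 10.0.0.12:8081;
}
```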

Eventually the real cause turned up. By default, nginx filters out request headers whose names contain underscores, so such headers are lost when nginx forwards a request. Seata propagates its distributed transaction ID in a header named TX_XID. So when TM calls RM, the XID header is dropped at nginx; RM cannot register its branch transaction under the global transaction identified by that XID, is therefore never managed by it, and of course the transaction does not take effect. The fix is simply to stop nginx from filtering out headers with underscores.

Having thought of this, I was so excited that I wanted to reproduce the problem and its solution on the server, but the project's nginx.conf was not something a junior developer like me could change at will! Since I had nothing urgent to do for the moment, I wrote a few demo services on my computer, forwarded through nginx in my virtual machine, also set up a seata-server, wired up all the configuration, and started simulating. Sure enough, I reproduced the problem from the server. The next step was to remove nginx's underscore filtering by adding one line to the http block of nginx.conf: underscores_in_headers on;. Tested again: success. It no longer matters that the url is written as a domain name.
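The one-line fix in context, using the hypothetical config from above:

```nginx
http {
    # Without this, nginx silently drops headers whose names contain
    # underscores, including Seata's TX_XID, breaking XID propagation.
    underscores_in_headers on;

    upstream service_b {
        server 10.0.0.11:8081;
        server 10.0.0.12:8081;
    }

    server {
        listen 80;
        server_name service-b.example.com;

        location / {
            proxy_pass http://service_b;
        }
    }
}
```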

The following is my simulation record: