Background

Our Spring Boot applications run as WAR packages inside a Tomcat container. When a WAR package is updated, there is a window during which the service returns 404, which is unacceptable for an online service. A layer-4 load balancer could take a node offline automatically once port 80 stops answering, but the intranet servers sit behind the bastion host and, per company regulations, cannot run an SSH service, so the port cannot be closed by a remote script. Another approach is needed.

Experiment materials

  1. Nginx as the web server and layer-7 load balancer
  2. Two Tomcat instances as the application backend
  3. GitLab CE for code version control
  4. Jenkins as the publishing platform

Basic principle

The basic idea is to put two Tomcat containers behind Nginx, one active and one standby. Under normal conditions no traffic reaches the standby container through Nginx, but additional means (such as accessing its port directly) can be used to confirm that it is able to serve. During a release the containers are updated one at a time, so that one container is always serving while the other redeploys, and the service is never interrupted.

Actual operation

Create a Spring Boot project

See the notes on Spring Boot with embedded vs. standalone Tomcat and other considerations.
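In short, for the WAR to deploy on a standalone Tomcat, the application class must extend SpringBootServletInitializer. A minimal sketch, assuming a Spring Boot 2.x project (the class name mirrors the spring-demo project used here):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.web.servlet.support.SpringBootServletInitializer;

@SpringBootApplication
public class SpringDemoApplication extends SpringBootServletInitializer {

    // Entry point used by a standalone Tomcat when it deploys the WAR
    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder builder) {
        return builder.sources(SpringDemoApplication.class);
    }

    // Entry point used when running as an executable jar during development
    public static void main(String[] args) {
        SpringApplication.run(SpringDemoApplication.class, args);
    }
}

In pom.xml, packaging must also be set to war and spring-boot-starter-tomcat marked as provided, so the embedded container does not conflict with the standalone Tomcat.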

Write different versions of the same interface

// tag v1
@RestController
public class HelloController {
    @GetMapping("/hello")
    public String hello() {
        return "V1";
    }
}

// tag v2
@RestController
public class HelloController {
    @GetMapping("/hello")
    public String hello() {
        return "V2";
    }
}

Packaging

mvn clean package -Dmaven.test.skip=true

Create two Tomcat containers

docker run -itd --name tomcat-active -v /tmp/tomcat/active:/usr/local/tomcat/webapps -p 32771:8080 tomcat
docker run -itd --name tomcat-standby -v /tmp/tomcat/standby:/usr/local/tomcat/webapps -p 32772:8080 tomcat

Copy the WAR package into the container

Possibly due to a Docker Toolbox issue, the directory cannot be mounted, so the WAR package has to be copied into the containers manually.

docker cp ~/workspace/spring-demo/target/spring-demo-0.0.1-SNAPSHOT.war tomcat-active:/usr/local/tomcat/webapps/
docker cp ~/workspace/spring-demo/target/spring-demo-0.0.1-SNAPSHOT.war tomcat-standby:/usr/local/tomcat/webapps/

Access services in both containers

After a while, the services in the two containers are deployed automatically and can be accessed through their respective ports. A simple load test reaches 2000+ QPS without errors.
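A quick sanity check first (the host and ports come from the docker run commands above; the context path is derived from the WAR file name):

$ curl http://192.168.99.100:32771/spring-demo-0.0.1-SNAPSHOT/hello
V1
$ curl http://192.168.99.100:32772/spring-demo-0.0.1-SNAPSHOT/hello
V1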

$ wrk -c 20 -d 10 -t 4 http://192.168.99.100:32771/spring-demo-0.0.1-SNAPSHOT/hello
Running 10s test @ http://192.168.99.100:32771/spring-demo-0.0.1-SNAPSHOT/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.20ms    8.70ms  122.66ms
    Req/Sec   554.18    167.66              63.25%
  22088 requests in 11.02s, 2.43MB read
Transfer/sec:    247.89KB

$ wrk -c 20 -d 10 -t 4 http://192.168.99.100:32772/spring-demo-0.0.1-SNAPSHOT/hello
Running 10s test @ http://192.168.99.100:32772/spring-demo-0.0.1-SNAPSHOT/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.30ms   14.24ms  186.52ms   92.95%
    Req/Sec   557.54    207.91     11.24k    67.17%
  22025 requests in 11.03s, 2.42MB read
Requests/sec:   2196.36
Transfer/sec:    247.05KB

Configure Nginx

upstream ha {
	server 192.168.99.100:32771;
	server 192.168.99.100:32772 backup;
}
server {
	listen       80;
	server_name  _;

	location / {
		proxy_next_upstream http_502 http_504 http_404 error timeout invalid_header;
		proxy_pass http://ha/spring-demo-0.0.1-SNAPSHOT/;
	}
}

Note: by default, a failed request is retried on the next server only for idempotent methods (GET/HEAD/PUT/DELETE/OPTIONS); POST requests are not retried. If POST requests should be retried as well, the non_idempotent option must be added, giving the following overall configuration:

upstream ha {
	server 192.168.99.100:32771;
	server 192.168.99.100:32772 backup;
}
server {
	listen       80;
	server_name  _;

	location / {
		proxy_next_upstream http_502 http_504 http_404 error timeout invalid_header non_idempotent;
		proxy_pass http://ha/spring-demo-0.0.1-SNAPSHOT/;
	}
}

The key directive is proxy_next_upstream http_502 http_504 http_404 error timeout invalid_header;. In our scenario, a Tomcat that is redeploying the WAR package returns 404 (http_404); without this directive Nginx would not forward those requests to the next server. One problem with this simple configuration is that Nginx never removes the failing backend from the upstream: requests still reach the realserver that is being updated first, and only then does Nginx forward them to the next healthy realserver, which increases response time. There are three ways to health-check backend nodes for Nginx load balancing; for details, see Nginx load balancing.
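As a partial mitigation with stock open-source Nginx, a passive health check can be sketched with the max_fails/fail_timeout parameters (the thresholds below are illustrative; active health checks require a third-party module such as nginx_upstream_check_module, or the health_check directive in the commercial version):

upstream ha {
	# After 3 failed attempts within 30s, mark the server unavailable
	# for 30s instead of sending every request to it first. What counts
	# as a failure is governed by proxy_next_upstream.
	server 192.168.99.100:32771 max_fails=3 fail_timeout=30s;
	server 192.168.99.100:32772 backup;
}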

Pressure testing through Nginx

Basic testing

  1. Load test while both Tomcat nodes are healthy
$ wrk -c 20 -d 10 -t 4 http://192.168.99.100:32778/hello
Running 10s test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    57.36ms   32.06ms  335.36ms   71.29%
    Req/Sec    89.29     48.20    390.00     85.25%
  3577 requests in 10.05s, 562.30KB read
Requests/sec:    355.77
Transfer/sec:     55.93KB

Compared with the earlier load test without Nginx, the most obvious changes are an 84% drop in QPS and a roughly five-fold increase in average response time. That is a serious problem.

  2. Immediately after the load test starts, delete the WAR package and its exploded directory in the tomcat-active container, with the following results
$ wrk -c 20 -d 10 -t 4 http://192.168.99.100:32778/hello
Running 10s test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    57.29ms   28.69ms  181.88ms   67.38%
    Req/Sec    87.93     39.51    240.00     75.25%
  3521 requests in 10.05s, 553.50KB read
Requests/sec:    350.22
Transfer/sec:     55.05KB

Again, there were no non-200 responses, and the overall response was close to normal.

  3. Load test while only the standby node is serving
$ wrk -c 20 -d 10 -t 4 http://192.168.99.100:32778/hello
Running 10s test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.12ms   35.99ms  240.89ms   68.34%
    Req/Sec    70.04     31.84    180.00     76.50%
  2810 requests in 10.05s, 441.71KB read
Requests/sec:    279.48
Transfer/sec:     43.93KB

As you can see, response time rises significantly and QPS drops noticeably, which confirms the problem described above: requests that hit the 404 node are forwarded to the working node, but the failing node is never removed, so response time grows.

Further testing

To rule out the possibility that the load test finished before the deletion of the WAR package could affect the service, the test duration was extended to 60 seconds.

  1. Both nodes are normal
$ wrk -c 20 -d 60 -t 4 http://192.168.99.100:32778/hello
Running 1m test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    55.53ms   28.10ms  306.58ms   70.07%
    Req/Sec    91.52     39.35    300.00     69.23%
  21906 requests in 1.00m, 3.36MB read
Requests/sec:    364.66
Transfer/sec:     57.32KB

The overall picture is the same as in the test above. According to the logs, the backup node received no requests at all. To check whether this was caused by the worker_processes setting, the value was changed to 4 and the test repeated.
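For reference, the directive in question is a single line in nginx.conf (assuming the stock configuration layout):

worker_processes  4;

The re-test results: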

$ wrk -c 20 -d 60 -t 4 http://192.168.99.100:32778/hello
Running 1m test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    41.55ms   24.92ms  227.15ms   72.21%
    Req/Sec   125.06     46.88    373.00     71.76%
  29922 requests in 1.00m, 4.59MB read
Requests/sec:    498.11
Transfer/sec:     78.29KB

As you can see, QPS improves by roughly a third, but it is still far from what we expected.

  2. Update the WAR package on the active node as soon as the test starts
$ wrk -c 20 -d 60 -t 4 http://192.168.99.100:32778/hello
Running 1m test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.40ms   33.76ms  329.73ms   70.53%
    Req/Sec    95.85     56.28    420.00     81.60%
  22914 requests in 1.00m, 3.52MB read
Requests/sec:    381.42
Transfer/sec:     59.95KB

There was no significant change. After the test started, the standby node received requests for a while, and then all requests went back to the active node. Probably because the service is so simple that it reloads very quickly, only a small number of requests (5750) were forwarded to the standby node, so the overall result was barely affected.

  3. Delete the WAR package on the active node immediately after the test starts

$ wrk -c 20 -d 60 -t 4 http://192.168.99.100:32778/hello
Running 1m test @ http://192.168.99.100:32778/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.11ms   34.33ms  346.24ms   69.54%
    Req/Sec    70.16     29.78    191.00     67.23%
  16813 requests in 1.00m, 2.58MB read
Requests/sec:    279.84
Transfer/sec:     43.99KB

After the WAR package is deleted, every request first reaches the active node and is then forwarded to the standby node by Nginx, so throughput drops significantly and latency rises markedly.

Effect testing

  1. Direct access to the active node
$ wrk -c 20 -d 60 -t 4 http://10.75.1.42:28080/web-0.0.1-SNAPSHOT/hello
Running 1m test @ http://10.75.1.42:28080/web-0.0.1-SNAPSHOT/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.56ms   25.16ms  203.83ms   95.82%
    Req/Sec     7.54k     0.11k    8.31k
  1803421 requests in 1.00m, 217.03MB read
Requests/sec:  30006.18
Transfer/sec:      3.61MB

The server is clearly much faster than the local environment.

  2. Release a new version during the pressure test
$ wrk -c 20 -d 60 -t 4 http://10.75.1.42:28080/web-0.0.1-SNAPSHOT/hello
Running 1m test @ http://10.75.1.42:28080/web-0.0.1-SNAPSHOT/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.47ms   22.31ms  401.95ms   96.67%
    Req/Sec     7.58k     1.88k   11.26k
  1811140 requests in 1.00m, 285.84MB read
  Non-2xx or 3xx responses: 72742
Requests/sec:  30181.93
Transfer/sec:      4.76MB

Releasing a new version resulted in 4% of requests failing.

  3. Access the service through Nginx
$ wrk -c 20 -d 60 -t 4 http://10.75.1.42:28010/web/hello
Running 1m test @ http://10.75.1.42:28010/web/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.94ms   16.21ms  248.18ms
    Req/Sec                        6.92k     83.38%
  1437098 requests in 1.00m, 260.33MB read
Requests/sec:  23948.20
Transfer/sec:      4.34MB

Although the server's Nginx is configured with worker_processes auto (which in practice runs 40 worker processes), throughput through Nginx is still lower than accessing the Java service directly.

  4. Release a new version during the Nginx pressure test
$ wrk -c 20 -d 60 -t 4 http://10.75.1.42:28010/web/hello
Running 1m test @ http://10.75.1.42:28010/web/hello
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.09ms    2.50ms             97.12%
    Req/Sec     5.89k    733.62    6.86k     84.85%
  1404463 requests in 1.00m, 253.67MB read
Requests/sec:  23401.54
Transfer/sec:      4.23MB

As you can see, latency increases noticeably while overall QPS barely drops, again because the failed requests are simply forwarded to the healthy node.

Thoughts

How much extra load does running two Tomcat containers instead of one put on a machine? Memory and CPU usage can be observed by connecting VisualVM to the remote Tomcat instances.
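For reference, a sketch of how a container could expose JMX for VisualVM; the JVM flags are standard JDK options, while the port 9010 and the hostname are illustrative values for the Docker Toolbox setup above:

docker run -itd --name tomcat-active \
	-e CATALINA_OPTS="-Dcom.sun.management.jmxremote \
	   -Dcom.sun.management.jmxremote.port=9010 \
	   -Dcom.sun.management.jmxremote.rmi.port=9010 \
	   -Dcom.sun.management.jmxremote.authenticate=false \
	   -Dcom.sun.management.jmxremote.ssl=false \
	   -Djava.rmi.server.hostname=192.168.99.100" \
	-p 9010:9010 -p 32771:8080 \
	-v /tmp/tomcat/active:/usr/local/tomcat/webapps tomcat

VisualVM can then attach via "Add JMX Connection" to 192.168.99.100:9010.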

As you can see, under normal circumstances the backup container's load on the server is negligible. Even during a release, the backup container only takes over for the active container while the latter redeploys, and goes back to idling immediately afterwards.

After the new version is running online, the backup container needs to be released again so that it also holds the latest version before the next release. At that point all traffic is on the active node, so releasing and updating the backup node does not affect the load.

Conclusion

Nginx's backup mechanism can be used to release new versions without interrupting the service. The overall release process is as follows (a command sketch follows the list):

  1. Publish the new version to the active container
  2. Release the new version to the backup container after confirming that the new version is stable
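A minimal sketch of these two steps as shell commands, using the container names, paths, and ports from the experiment above (in a real setup they would run from the Jenkins job):

# Step 1: publish to the active container; while it redeploys,
# Nginx fails requests over to the backup container.
docker cp target/spring-demo-0.0.1-SNAPSHOT.war tomcat-active:/usr/local/tomcat/webapps/

# Verify the new version through the active container's own port.
curl http://192.168.99.100:32771/spring-demo-0.0.1-SNAPSHOT/hello

# Step 2: once the new version is confirmed stable, publish it to the
# backup container so it is up to date for the next release.
docker cp target/spring-demo-0.0.1-SNAPSHOT.war tomcat-standby:/usr/local/tomcat/webapps/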

Advantages

  1. At least one Tomcat container on the machine is guaranteed to be available at any time, so the service is never interrupted
  2. Instead of bringing machines online one by one, a release can go straight to full rollout, while a problematic new version still cannot affect the online service

Disadvantages

  1. Every release has to be published twice
  2. Nginx and an extra backup Tomcat container must be installed on every machine that runs a Tomcat container
  3. The backup container consumes some resources even while on standby