Jing Tianyou, new house R&D Department, joined Lianjia in March 2017 as a PHP R&D engineer, responsible for the research and development of new house Link.

An accidental mistake encountered before, discuss how fat four ~

1

A strange online problem

Upstream sent too big header while reading response from upstream The result is service 502 bad gateway, but the strange thing is that some specific requests are stable, but other requests are always fine and return normally.

For a while, we don’t really know why, the only exception is that warninig logs are very large, we know that under normal circumstances, PHP warning does not cause the process to break. Php-fpm may add the warning log to the response header, resulting in a large header and an error.

First, let’s take a look at what the error code looks like. There are error fragments in the PHP code like the following:

An error similar to warning is reported:

A large number of warnings, if added to the header, will trigger nginx configuration (if any), causing an error that seems to make sense. But questions remain:

1). Is it really because the response header is too large?

2). Can this be resolved by increasing the size of the fastcGI_BUFFer_SIZE configuration as stated on the web?

So let’s talk about that

2

Briefly analyze the logging behavior of a request

Since the problem is online, there is no way to capture what happened between PHP-FPM and Nginx at the time of the problem, so we plan to reproduce it in our own development environment, but first we need to be able to capture data from the communication between nginx and php-FPM processes.

2.1 Fetch communication data between PHP-FPM and Nginx

 

2.1.1 PHP listens for port communication mode

The speed is slower than Unix socket, but it supports cross-machine invocation. We can grab using tcpdump.

Ps: open the PHP port to provide the fastCGI service

2.1.2 Capture of communication data through Unix socket

This method is fast and has high performance. It can get data by listening socket channel.

2.1.3 The general appearance of captured data

2.2 Simple Analysis

Let’s look at the log. First, we can see the TCP connection data, and we can intuitively understand the TCP three-way handshake 😛

2.2.1 Nginx-error log for a request (nginx-error)

2.2.2 Tcpdump Logs of one Request (tcpdump-log)

2.2.3 Simple conclusions

Observed phenomena

• PHP-FPM does pass both warning and error logs to nginx

• In tcpdump logs, the response header is larger than the fastcGI_buffer_SIZE value, but no error is reported

• Nginx-error logs, as you can obviously see, are segmented and truncated, and have the same size as fastcgi_buffer_size

The preliminary conclusions obtained

Comparing the number of nginx logs with the number of tcpdump logs, we can see:

PHP – FPM does pass warning and error to nginx

2). No matter how much fastcgi_buffer_size is set, the response header of php-fpm will be read according to the size of fastcgi_buffer_size. Read section by section and write to nginx’s log.

This error is not due to the size of the header, or to the fact that php-fpm is passing more data to nginx than the value of fastcgi_buffer_size.

3

What happened inside Nginx

First we know:

Nginx sends the request data and receives the response from the backend server simultaneously, which is a full-duplex mode. For different received data; Nginx transmits data to the back end in a different way. For response header, it will forward the data to the client only after receiving the response header completely, while for Response Body, it will forward the data to the client while receiving the data.

3.1 fastcgi_buffer_size

We know that, The fastcgi_buffer_size and proxy_buffer_size parameters affect the response header, Proxy_buffer_size affects the parameter used by nginx as a reverse proxy.

The document, The first part of the response received from the FastCGI server. This parameter specifies the first part of the response received from the FastCGI server Header), in the LNMP scenario, fastcGI-server is the PHP part of the first part. Since upstream is a generic component, it does not know the back-end protocol, whereas in the HTTP scenario, Since HTTP requires headers, the backend protocol may not have headers, so we need to parse the backend protocol to set the headers sent to the client, and finally sent to the client. Through the observation above, we find that PHP error messages will also be included in the header.

Here is the so-called normal structure of the first part of the response (header) :

  • The format of the received upstream header is shown above. As you can see, there are two types of headers: fastCGI header and HTTP header. Therefore, two state machines are required for parsing.

  • A FastCGI header, which may be followed by one or more HTTP headers;

  • A FastCGI header is grouped with a pair of HTTP headers, which is the minimum unit nginx uses to parse headers. Each group uses two state machines to retrieve the response header information.

3.2 Nginx code error analysis

The source code error location is as follows &error:

As you can see, an error is reported when process_header returns NGX_AGAIN and the end of response has been read.

3.2.1 Pseudo-code for processing the header callback function process_header


The process code above, which is not shown in detail, parses HTTP and FastCGI headers separately

If the fastCGI and HTTP headers are incorrectly parsed, NGX_AGAIN will be returned by the two state machines. If the header is not properly formatted or missing, the NGX_AGAIN will be returned by the state machines. Did PHp-fpm send a defective response header or did not send a response header?

Step forward and test the knife

4

Try to reappear

If we can reproduce 502, we can test our guess. We can keep increasing the number of warnings while listening for ports and logs. When 502 occurred, the interception of what data phP-Fpm passed to nginx was caused by too much warning written into fastcgi_buffer_size.

4.1 PHP script

We can try to increase the size of error step by step to observe the log situation when sending 502. According to this idea, we can use the following similar script:

4.2 Log Listening

The script runs while listening on the phP-bound port to get TCPDMP logs (fastcGI_BUFFer_size 4K at this point).

4.3 Pressure measurement method

A loop is used to request the script, continuously increasing the size and iteration number of a single log, making it write more bytes to the log, and filtering 502 returned and the number of iterations and the size of each log when an error is reported.

The above is the port phP-FPM is using to communicate with nginx.

By the way, the performance of Unix socket communication is really high, the same test method, for Unix socket communication LNMP, the test statement did not report 502. As can be seen from the above, we reproduce 502. Next, we make a tense analysis of the log.

4.4 Result Analysis

Nginx error:

You can see that we successfully duplicate the too Big header error 😛

Normal nginx-error logs:


Stress analysis

In the above logs, one is 39:01 and the other is 39:02. According to our test script, the later the time is, the larger the log volume will be (at least three times the previous one), but why did the log of 39:01 seconds with smaller data volume trigger 502, while the log of 39:02 with larger data volume did not report an error? Also, nginx logs the same amount of logs, obviously truncated, and significantly larger than 4K, which confirms our hypothesis. Fastcgi_buffer_size is not a check value, and exceeding this value will not generate an error. It’s more like the size of a temporary space.

Further tension analysis

We went back to the tcpdump logs we were monitoring and found something interesting.

TCP log for a normal request:

TCP log for a 502 request:

Found that under the condition of 502 error, tcpdump caught package, at the end of the lack of the HTTP response headers, it is verified before we speculate that PHP – FPM before returning to the body, Nginx did not pass the complete response header to nginx, causing nginx to report an error.

4.5 Tentative conclusions

Nginx failed to parse the header to the end of the header, so for now we can infer:

Php-fpm can send incomplete response header data to nginx under certain circumstances, causing nginx to parse fastCGI and HTTP headers incorrectly, resulting in 502.

It may seem strange that the downstream service was in error and not because the fastcgi_buffer_size in Nginx was too small, but the problem is not impossible:


Launch a new round of discussion, technology!

5

Case 502 with different fastCGI_BUFFer_size

5.1 fastcgi_buffer_size 1 k

Nginx log

– error log 502 will not be reported:

Error log 502 will be reported:

Tcpdump log

– error log 502 will not be reported:

Error log 502 will be reported:

The test results

5.2 fastcgi_buffer_size 2 k

Nginx log

– error log 502 will not be reported:

Error log 502 will be reported:

Tcpdump log

– error log 502 will not be reported:

Error log 502 will be reported:

The test results

5.3 fastcgi_buffer_size 4 k

Nginx log

– error log 502 will not be reported:

Error log 502 will be reported:

Tcpdump log

– error log 502 will not be reported:

Error log 502 will be reported:

The test results

5.4 Observations and Findings

Based on the comparison results of 1K, 2K and 4K, we can draw the following conclusions:

By multiplying the size of the error (502) by the iteration value, we can see that phP-FPM sends the same data to nginx with different fastCgi_buffer_sizes. Increasing the size of fastcgi_buffer_SIZE does not reduce the risk of too Big Headers, but it is still recommended to set it to the operating system paging size.

Finally out of the oven, xiaobian hand cramps!

6

conclusion

After exploration, we can finally get:

1. Php-fpm does pass warning and error to nginx;

2. No matter how much fastcgi_buffer_size is set, the response header of php-fpm will be read one by one based on the size of fastcgi_buffer_size. Read section by section, then write to nginx log;

3. Php-fpm may send incomplete response header data to nginx under certain circumstances, causing nginx to parse fastCGI and HTTP headers, resulting in 502;

4. Increasing the fastcGI_BUFFer_SIZE does not reduce the risk of too Big Headers.

Let’s talk about it a little bit

7

doubt

After the above exploration, we can see the above conclusions, but there are still many questions:

1. Why does phP-fpm send data without headers

2. Is there any other circumstances that may cause too Big Header errors

3. The problem online is that only certain request parameters are stable and others are not, what is going on? Is it the cause of our inquiry?

The above questions need to be further explored.

Program ape brother is persistent!

8

Further analysis of the question

After the guidance of the old driver, further analysis of the data obtained from the pressure test above, we explored the error situation of 502 under different FASTCGI_BUFFer_size, the specific log is as follows:



8.1 observe

Is there anything weird about these results? Why is it that the first error is reported when size=121 and Iteration =30?

8.2 Further analysis results

If we multiply the size of each iteration by iteration, we can get something like this: (Add the leading PHP message: these 13 characters)

We know that 1024 is Programmer day, is there anything sensitive about the data?

.

9

Jilt pan

Compared to the above data, it seems that the PHP log size will steadily reproduce the problem on a certain NK (most likely 4K on my test machine). This is also a good explanation for why some requests on the line will steadily reproduce 502, while others will not trigger 502 even if more PHP warning logs are typed.

This time, we can leave the pot to PHP-FPM, and we can take a wild guess:

Is there a similar buffer inside PHP-FPM (maybe 4K on my test machine), maybe this buffer is the buffer for TCP to send messages, and when the bytes sent are close to 4K, some bugs will be triggered. Results in the loss of some tail data, but when the size of the data differs greatly from the integer multiple of 4K, the data will not be lost. Emmm may have a problem when redistributing buffer, resulting in data loss 😛

 

The above exploration is only personal opinion, may be wrong, welcome to exchange ~

 Eamil:[email protected]

Author: Jing Tianyou


Supervisor: CAI Baiyin

Editor: Zhong Yan

Website: tech.lianjia.com

Please pay attention to identify the QR code

Lianjia product technology team

Product technology first