
The story continues from the previous post…

5. The ghost reappears

Since the test environment was not fully consistent with the production environment, O&M spun up another AWS cloud database (PostgreSQL) environment and assigned it a domain name for the test calls.

To be cautious, I ran a multithreaded test against it before handing it over to the product group. No problem showed up, so I passed it on.
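(For reference, a minimal sketch of such a concurrent smoke test; the connection string and query below are placeholders, not the real ones.)

using System.Threading.Tasks;
using Npgsql;

class ConcurrencySmokeTest
{
    // Placeholder connection string; substitute the real test-environment values.
    const string ConnString = "Host=test-db.example.com;Username=app;Password=secret;Database=test";

    static async Task Main()
    {
        // Fire 50 queries in parallel, each on its own pooled connection.
        var tasks = new Task[50];
        for (var i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(async () =>
            {
                await using var conn = new NpgsqlConnection(ConnString);
                await conn.OpenAsync();
                await using var cmd = new NpgsqlCommand("SELECT 1", conn);
                await cmd.ExecuteScalarAsync();
            });
        }
        await Task.WhenAll(tasks);
    }
}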

However, after the product group started using it, they found that the problem was still there.

This fucking BUG, it’s like a ghost. What’s going on?

I feel like I’m stuck in a loop.

6. Calm down

Start over, don’t give up, don’t give up!

To solve this problem, I looked through the cloud service’s database logs and found a useful clue.

Database log:

2021-11-10 05:28:03 UTC:218.76.52.112(26828):xxx@test:[26849]:LOG: could not receive data from client: Connection reset by peer
2021-11-10 05:28:16 UTC:218.76.52.112(26346):xxx@postgres:[24160]:LOG: could not receive data from client: Connection reset by peer
2021-11-10 05:28:16 UTC:218.76.52.112(26504):xxx@test:[25374]:LOG: could not receive data from client: Connection reset by peer
2021-11-10 05:28:16 UTC:218.76.52.112(26361):xxx@test:[24280]:LOG: could not receive data from client: Connection reset by peer

Connection reset by peer. Reset by whom, exactly?

7. The connection is reset

A TCP connection can be reset in the following situations:

  1. If the socket on one end has been closed (either closed actively or closed because the process exited abnormally) and the other end keeps sending data, the first packet sent triggers the exception (Connection reset by peer).

By default a socket connection is kept for 60 seconds; if there is no interaction within those 60 seconds, i.e. no data is read or written, the connection is closed automatically.

  2. If one end exits without closing the connection and the other end is reading data from that connection, a Connection reset exception is thrown.

Simply put, the exception is caused by reading or writing after the connection has already been torn down.
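To make the failure mode concrete, here is a minimal local sketch (not code from this incident): one side closes its socket so that an RST is sent, the other side keeps writing, and one of the writes fails with a reset error. Which write fails is OS- and timing-dependent.

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class ResetDemo
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 15432);
        listener.Start();

        var client = new TcpClient();
        await client.ConnectAsync(IPAddress.Loopback, 15432);
        using var serverSide = await listener.AcceptTcpClientAsync();

        // Close abruptly: Linger(0) makes Close() send an RST instead of a normal FIN.
        serverSide.Client.LingerState = new LingerOption(true, 0);
        serverSide.Close();
        await Task.Delay(100); // give the RST time to arrive

        try
        {
            var stream = client.GetStream();
            stream.Write(new byte[] { 1 }, 0, 1); // may still appear to succeed
            stream.Write(new byte[] { 1 }, 0, 1); // typically fails with a reset error
        }
        catch (IOException ex)
        {
            Console.WriteLine(ex.Message); // e.g. "Connection reset by peer"
        }
    }
}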

Common causes:

  • The number of concurrent connections to the server exceeds its capacity, and the server closes some of them.

If the number of concurrent connections has not exceeded the server's capacity, abnormal network traffic caused by a virus or Trojan may also make the server drop connections.

  • The client shuts down the connection while the server is still sending data to the client.
  • Firewall problems

If the network connection passes through a firewall, the firewall usually has a timeout mechanism. If the network connection does not transmit data for a long time, the TCP session is closed.

If you try to read or write after the session has been closed, an exception is thrown. If disabling the firewall makes the problem go away, you need to either reconfigure the firewall or have the program maintain a TCP long connection.

To maintain a TCP long connection, you define a heartbeat protocol and send a heartbeat message at regular intervals to keep the connection alive.
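As a sketch of both approaches in .NET (illustrative values, not from the original post; the fine-grained keep-alive options need .NET Core 3.0 or later): either switch on OS-level TCP keep-alive probes, or run your own heartbeat timer.

using System;
using System.Net.Sockets;
using System.Threading;

static class LongConnection
{
    // Option 1: OS-level TCP keep-alive probes.
    public static void EnableKeepAlive(Socket socket)
    {
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 30);    // idle seconds before probing
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 5); // seconds between probes
    }

    // Option 2: an application-level heartbeat, writing a small message every 30 seconds.
    public static Timer StartHeartbeat(NetworkStream stream) =>
        new Timer(_ => stream.Write(new byte[] { 0 }, 0, 1),
                  null, TimeSpan.Zero, TimeSpan.FromSeconds(30));
}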

8. Solving the problem

Right: there was no problem when I tested against the AWS database from my local machine, while the test environment connecting to the same AWS server failed frequently. The biggest difference between the two is the network path the traffic goes through.

The conclusion: database connections are cached in the connection pool, and by the time one is taken out again it may already have been closed by some proxy/firewall/router along the network path, hence the exception.

In Npgsql, the .NET driver for PostgreSQL, two connection-string parameters address exactly this (a sample connection string follows the list):

  1. Set Connection Lifetime to 60 seconds, so that pooled connections are not reused beyond their maximum lifetime.
  2. Set Keepalive to 30 seconds, so that idle connections send a heartbeat and are not reclaimed by devices along the path.
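For illustration, a connection string carrying both settings might look like the sketch below (host, database and credentials are placeholders):

using Npgsql;

var connString = "Host=test-db.example.com;Username=app;Password=secret;Database=test;" +
                 "Keepalive=30;" +          // send a keepalive query after 30 s of inactivity
                 "Connection Lifetime=60;";  // don't reuse pooled connections older than 60 s

await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();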

I pushed hard for the product group to apply these two settings. Half an hour later they reported back: the problem was solved!

9. Reading the source code to find the root cause

Going into the ConnectorPool class and reading its code, we find that when it hands out an idle connector from the pool, it checks whether the connection is broken.

bool CheckIdleConnector([NotNullWhen(true)] NpgsqlConnector? connector)
        {
            if (connector is null)
                return false;

            // Only decrement when the connector has a value.
            Interlocked.Decrement(ref _idleCount);

            // An connector could be broken because of a keepalive that occurred while it was
            // idling in the pool
            // TODO: Consider removing the pool from the keepalive code. The following branch is simply irrelevant
            // if keepalive isn't turned on.
            if (connector.IsBroken)
            {
                CloseConnector(connector);
                return false;
            }

            if (_connectionLifetime != TimeSpan.Zero && DateTime.UtcNow > connector.OpenTimestamp + _connectionLifetime)
            {
                Log.Debug("Connection has exceeded its maximum lifetime and will be closed.", connector.Id);
                CloseConnector(connector);
                return false;
            }

            return true;
        }

Of course, the comment also points out that this broken-connection check only matters when the KeepAlive parameter is set; without it the branch is irrelevant. That makes sense: without a heartbeat, there is no way to detect that an idle connection has gone down.

The code also shows that when a connection lifetime is configured, connections that have exceeded it are destroyed rather than handed back to the caller.

Moving on to the NpgsqlConnector implementation, we find the secret of the heartbeat check.

_keepAliveTimer = new Timer(PerformKeepAlive, null, Timeout.Infinite, Timeout.Infinite);

If KeepAlive is set, each connection also maintains its own keepalive timer, whose callbacks run on the thread pool.
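Note that the constructor above creates the timer in a stopped state (both the due time and the period are Timeout.Infinite); it only starts firing once Change is called. A small sketch of that pattern (illustrative, not Npgsql's code):

using System;
using System.Threading;

// Created disabled: it will not fire until Change() is called.
var keepAliveTimer = new Timer(_ => Console.WriteLine($"keepalive at {DateTime.UtcNow:O}"),
                               null, Timeout.Infinite, Timeout.Infinite);

// Start it: fire immediately, then every 30 seconds.
keepAliveTimer.Change(TimeSpan.Zero, TimeSpan.FromSeconds(30));

Console.ReadLine(); // keep the process alive so the timer callbacks can run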

The core code of the heartbeat check is as follows:

Log.Trace("Performing keepalive", Id);
 AttemptPostgresCancellation = false;
   var timeout = InternalCommandTimeout;
   WriteBuffer.Timeout = TimeSpan.FromSeconds(timeout);
   UserTimeout = timeout;
   WriteSync(async: false);
   Flush();
   SkipUntil(BackendMessageCode.ReadyForQuery);
   Log.Trace("Performed keepalive", Id);

Writing a simple Sync message takes care of both maintaining the heartbeat and checking the connection.

From the code above it is reasonable to infer that without the heartbeat check, the state of connections sitting in the pool really is unknown; if a gateway in between drops one of them, the exception is thrown the next time it is used.

Yeah, so happy!

10. Summary

This troubleshooting spanned two weeks in all, with two rounds of fault analysis at roughly 2-3 days each. Not easy at all, so please show some support!

👓 You've read this far, so how about a like?

👓 You've already liked it, so how about a bookmark?

👓 You've already bookmarked it, so how about a comment?