First, let’s talk about the problem phenomenon: the API of the sandbox environment on the Intranet is stuck for one week, and all apis are unresponsive

At first, when the test complained about the slow response of the environment, we restarted the application, and the application returned to normal, so no processing was done. But then the problems appeared more and more frequently, and more and more colleagues began to complain, so they felt that there might be something wrong with the code and began to troubleshoot.

First of all, I found that the development of the local IDE did not find the problem, the application freezes when the database, Redis is normal, and no special error log. I began to suspect that the sandbox environment machine was a problem (the test environment is inherently brittle! _!)

SSH runs the following command on the server

top

The machine was fine, so I decided to take a look at the JVM stack information

Let’s start with the problem of using resource-consuming threads

Run top-H-P 12798

Find the first three relatively expensive threads

Jstack looks at the heap memory

Jstack 12798 | grep hexadecimal ff 31 12799

I don’t see anything wrong, so I look at the next 10 lines and I do it, okay

See that some threads are locked. But no business-related code appears and is ignored. There was no clue at this point. Think about it. Decided to abandon the stuck machine

To protect the accident site, dump all the heap memory of the faulty process. Then restart the test environment application in debug mode, and plan to remotely debug the faulty machine when the problem appears again

The next day the problem reappeared, so I informed o&M nGINx forward to remove the problem application and remotely debug Tomcat myself.

They randomly find an interface, breakpoint in the interface entry place, the tragedy began, nothing happened! The API waited for the service response and did not enter the breakpoint. At this time a little confused, calm down for a while, before the entrance of the AOP place under a breakpoint, debug again, this time into the breakpoint, f8 N times later found that the execution of redis command when the main block. Keep following and finally find a problem at a place in Jedis:

/** * Returns a Jedis instance to be used as a Redis connection. The instance can be newly created or retrieved from a *  pool. * * @return Jedis instance ready for wrapping into a {@link RedisConnection}.
 */
protected Jedis fetchJedisConnector() {
   try {
      if(usePool && pool ! = null) {return pool.getResource();
      }
      Jedis jedis = new Jedis(getShardInfo());
      // force initialization (see Jedis issue # 82)
      jedis.connect();
      return jedis;
   } catch (Exception ex) {
      throw new RedisConnectionFailureException("Cannot get Jedis connection", ex); }}Copy the code

The thread starts wait after pool.getResource()

public T getResource() {
  try {
    return internalPool.borrowObject();
  } catch (Exception e) {
    throw new JedisConnectionException("Could not get a resource from the pool", e); }}Copy the code

return internalPool.borrowObject(); This code should be a rental code, and follow

public T borrowObject(long borrowMaxWaitMillis) throws Exception {
    this.assertOpen();
    AbandonedConfig ac = this.abandonedConfig;
    if(ac ! = null && ac.getRemoveAbandonedOnBorrow() && this.getNumIdle() < 2 && this.getNumActive() > this.getMaxTotal() - 3) { this.removeAbandoned(ac); } PooledObject<T> p = null; boolean blockWhenExhausted = this.getBlockWhenExhausted(); longwaitTime = 0L;

    while(p == null) {
        boolean create = false;
        if (blockWhenExhausted) {
            p = (PooledObject)this.idleObjects.pollFirst();
            if (p == null) {
                create = true;
                p = this.create();
            }

            if (p == null) {
                if (borrowMaxWaitMillis < 0L) {
                    p = (PooledObject)this.idleObjects.takeFirst();
                } else {
                    waitTime = System.currentTimeMillis();
                    p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
                    waitTime = System.currentTimeMillis() - waitTime; }}if (p == null) {
                throw new NoSuchElementException("Timeout waiting for idle object");
            }
Copy the code

There’s a code in there

if (p == null) {
    if (borrowMaxWaitMillis < 0L) {
        p = (PooledObject)this.idleObjects.takeFirst();
    } else {
        waitTime = System.currentTimeMillis();
        p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
        waitTime = System.currentTimeMillis() - waitTime; }}Copy the code

BorrowMaxWaitMillis <0 would always execute and then it would loop and start wondering if this value wasn’t configured

The redis pool configuration is not configured for MaxWaitMillis, and the else code is an Exception

Continue to F8


public E takeFirst() throws InterruptedException {
    this.lock.lock();

    Object var2;
    try {
        Object x;
        while((x = this.unlinkFirst()) == null) {
            this.notEmpty.await();
        }

        var2 = x;
    } finally {
        this.lock.unlock();
    }

    return var2;
}

Copy the code

When I find the lock word here, I begin to suspect that all request apis are blocked

Arthas is an open source Java diagnostic tool for Alibaba.

Executing the thread command

Http-nio-8083-exec – this thread is actually the Tomcat thread that produces the HTTP request

Pick a random thread to look at the heap

thread -428

This is a confirmation that the API is going around in circles, and this redis code is getting connections,

Read this memory code and all the threads are waiting for @53e5504e to release the lock. So JStack does a global search for 53E5504E and can’t find the thread.

Since then. The cause of the problem can be identified as a redis connection acquisition problem. But what causes the failure to obtain the connection can not be determined

Arthas thread-b again (thread-b, find threads currently blocking other threads)

Nothing. Instead of finding a blocking thread, I looked at the documentation for this command and found the following sentence

Well, we happen to be the latter…

Let me think again. This time modify the Redis Pool configuration to set the connection timeout to 2s, and then wait for the problem to recur to see what the application does when it finally works.

Add some configuration

JedisConnectionFactory jedisConnectionFactory = new JedisConnectionFactory(); . JedisPoolConfig config = new JedisPoolConfig(); config.setMaxWaitMillis(2000); . jedisConnectionFactory.afterPropertiesSet();Copy the code

Restart the service and wait…

Another day, and again

SSH server, check tomcat Accesslog, found a large number of API requests 500,

org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource fr om the pool at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:1 40) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:229) at  org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:57) at  org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:128) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:91) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:78) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:177) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:152) at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:85) at org.springframework.data.redis.core.DefaultHashOperations.get(DefaultHashOperations.java:48)Copy the code

Find out where the source first appeared at 500,

Find the following code

. Cursor c = stringRedisTemplate.getConnectionFactory().getConnection().scan(options);while (c.hasNext()) {
.....,,
   }
Copy the code

Analysis of this code, stringRedisTemplate. GetConnectionFactory () getConnection () to obtain the pool after the redisConnection, no follow-up action, In other words, the redisConnection in the Redis connection pool is not released or returned to the connection pool after the connection is leased. The redisConnection in the redis connection pool is idle, but the redisConnection in the pool has not returned to idle

Normal should be

Since then the problem has been found.

Summary: Spring stringRedisTemplate encapsulates redis general operations, but does not yet support Commands such as Scan SetNx, so you need to get jedis Connection to do special Commands

use

stringRedisTemplate.getConnectionFactory().getConnection()
Copy the code

Is not recommended

We can use

stringRedisTemplate.execute(new RedisCallback<Cursor>() {

     @Override
     public Cursor doInRedis(RedisConnection connection) throws DataAccessException {

       returnconnection.scan(options); }});Copy the code

To perform,

Or after using connection, use the

RedisConnectionUtils.releaseConnection(conn, factory);
Copy the code

To release the connection.

In addition, it is not recommended to use the keys command in redis, and the redis pool configuration should be properly configured. Otherwise, there will be no error log and no error report when problems occur, which is quite difficult to locate.