A: Background

1. Tell a story

Some time ago a friend of mine reached out on WeChat and told me that one of his old projects kept triggering alerts for CPU usage above 90%, which was quite embarrassing.

Since he came to me, WinDbg it is. What else can you do?

B: WinDbg analysis

1. Surveying the scene

Since the claim is that the CPU is above 90%, let me first check whether that's true.

    0:359> !tp
    CPU utilization: 100%
    Worker Thread: Total: 514 Running: 514 Idle: 0 MaxLimit: 2400 MinLimit: 32
    Work Request in Queue: 1
        Unknown Function: 00007ff874d623fc  Context: 0000003261e06e40
    --------------------------------------
    Number of Timers: 2
    --------------------------------------
    Completion Port Thread:Total: 2 Free: 2 MaxFree: 48 CurrentLimit: 2 MaxLimit: 2400 MinLimit: 32

The output is quite a spectacle: the CPU is pegged at 100%, and all 514 thread-pool threads are running flat out. So what exactly are they running? My first suspicion was that these threads were being held up by some kind of lock.

2. Check the synchronization block table

To observe the lock situation, check the synchronization block table first; after all, everyone likes to use lock for multi-threaded synchronization. You can view it with the !syncblk command.

    0:359> !syncblk
    Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
       53 000000324cafdf68          498         0 0000000000000000     none    0000002e1a2949b0 System.Object
    -----------------------------
    Total           1025
    CCW             3
    RCW             4
    ComClassFactory 0
    Free            620

Whoa, these figures look strange. What on earth is 498? The textbook says the owner contributes 1 to MonitorHeld and each waiter contributes 2, so a held lock should always show an odd number. What does an even number mean? Searching StackOverflow didn't turn up a convincing answer either. I could think of two explanations:

  • The dump's memory is corrupted

The odds of that are worse than winning the lottery, and I firmly believe I'm not that lucky…

  • It's a lock convoy

A few days ago I shared a real-world case of a travel agency's web site whose CPU was blown up precisely by a Lock Convoy. What a small world. I'll put up that diagram again to make it easier to understand.

The explanation is that the dump happened to be captured at the instant when the previous owner had released the lock and the next thread had not yet actually acquired it, so no thread shows as owner and the MonitorHeld value of 498 consists entirely of waiters: 498 / 2 = 249 waiting threads. We can verify this by dumping the stacks of all threads and searching for the Monitor.Enter keyword.
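
The deduction above can be sketched in a few lines (a minimal illustration of the documented MonitorHeld counting; the 498 comes from the !syncblk output above):

```csharp
using System;

// !syncblk bookkeeping (as documented): the owner contributes 1 to
// MonitorHeld and every waiting thread contributes 2.
int monitorHeld = 498;
bool hasOwner = monitorHeld % 2 == 1;   // odd => some thread owns the lock right now
int waiters   = monitorHeld / 2;        // the even remainder is all waiters
Console.WriteLine($"owner: {hasOwner}, waiters: {waiters}");  // owner: False, waiters: 249
```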

The search shows 220 threads currently stuck at Monitor.Enter (29 fewer than the expected 249). Either way, a large number of threads are blocked, and from the stacks they appear to get stuck after setting some context in XXX. Next step: export the problem code.

3. Review the problem code

Same old commands: !ip2md + !SaveModule.

    0:359> !ip2md 00007ff81ae98854
    MethodDesc:   00007ff819649fa0
    Method Name:  xxx.Global.PreProcess(xxx.JsonRequest, System.Object)
    Class:        00007ff81966bdf8
    MethodTable:  00007ff81964a078
    mdToken:      0000000006000051
    Module:       00007ff819649768
    IsJitted:     yes
    CodeAddr:     00007ff81ae98430
    Transparency: Critical
    0:359> !savemodule 00007ff819649768 E:\dumps\PreProcess.dll
    3 sections in file
    section 0 - VA=2000, VASize=b6dc, FileAddr=200, FileSize=b800
    section 1 - VA=e000, VASize=3d0, FileAddr=ba00, FileSize=400
    section 2 - VA=10000, VASize=c, FileAddr=be00, FileSize=200

Then open the exported DLL with ILSpy; a screenshot of the problem code is as follows:

Damn, just as expected: every DataContext.SetContextItem() call takes a lock, which perfectly matches the lock contention we just saw.

4. Is this really the end?

I was ready to report back, but since the stacks of 500+ threads had already been dumped, I figured I might as well sweep through them. To my surprise, I found another 134 threads stuck in ReaderWriterLockSlim.TryEnterReadLockCore, as shown in the figure below:

As the name suggests, ReaderWriterLockSlim is a slimmed-down, optimized reader-writer lock. Why were so many threads stuck here? I was genuinely curious, so once again I exported the offending code.

    internal class LocalMemoryCache : ICache
    {
        private string CACHE_LOCKER_PREFIX = "xx_xx_";

        private static readonly NamedReaderWriterLocker _namedRwlocker = new NamedReaderWriterLocker();

        public T GetWithCache<T>(string cacheKey, Func<T> getter, int cacheTimeSecond, bool absoluteExpiration = true) where T : class
        {
            T val = null;
            ReaderWriterLockSlim @lock = _namedRwlocker.GetLock(cacheKey);
            try
            {
                @lock.EnterReadLock();
                val = (MemoryCache.Default.Get(cacheKey) as T);
                if (val != null)
                {
                    return val;
                }
            }
            finally
            {
                @lock.ExitReadLock();
            }
            try
            {
                @lock.EnterWriteLock();
                val = (MemoryCache.Default.Get(cacheKey) as T);
                if (val != null)
                {
                    return val;
                }
                val = getter();
                CacheItemPolicy cacheItemPolicy = new CacheItemPolicy();
                if (absoluteExpiration)
                {
                    cacheItemPolicy.AbsoluteExpiration = new DateTimeOffset(DateTime.Now.AddSeconds(cacheTimeSecond));
                }
                else
                {
                    cacheItemPolicy.SlidingExpiration = TimeSpan.FromSeconds(cacheTimeSecond);
                }
                if (val != null)
                {
                    MemoryCache.Default.Set(cacheKey, val, cacheItemPolicy);
                }
                return val;
            }
            finally
            {
                @lock.ExitWriteLock();
            }
        }
    }

The code above is trying to implement a GetOrAdd operation on top of MemoryCache, and apparently for safety's sake it dedicates one ReaderWriterLockSlim to each cacheKey. That is entirely unnecessary: MemoryCache itself ships with thread-safe methods for exactly this logic, such as:

    public class MemoryCache : ObjectCache, IEnumerable, IDisposable
    {
        public override object AddOrGetExisting(string key, object value, DateTimeOffset absoluteExpiration, string regionName = null)
        {
            if (regionName != null)
            {
                throw new NotSupportedException(R.RegionName_not_supported);
            }
            CacheItemPolicy cacheItemPolicy = new CacheItemPolicy();
            cacheItemPolicy.AbsoluteExpiration = absoluteExpiration;
            return AddOrGetExistingInternal(key, value, cacheItemPolicy);
        }
    }
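
As a quick illustration (the key and value here are made up), AddOrGetExisting needs no external lock because the call itself is atomic:

```csharp
using System;
using System.Runtime.Caching;

// AddOrGetExisting returns the entry already cached under the key,
// or null if our value was just inserted.
var cached = MemoryCache.Default.AddOrGetExisting(
    "xx_xx_demo", "fresh", DateTimeOffset.Now.AddSeconds(60)) as string;
var value = cached ?? "fresh";   // the first caller sees null and uses its own value
```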

5. What’s wrong with using ReaderWriterLockSlim?

Haha, I'm sure many friends are asking exactly that 😅. Indeed, what could be wrong with it? Let's start by checking how many ReaderWriterLockSlim instances the _namedRwlocker collection currently holds. Verifying this is easy: just search the managed heap.

    0:359> !dumpheap -type System.Threading.ReaderWriterLockSlim -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ff8741631e8    70234      6742464 System.Threading.ReaderWriterLockSlim

You can see that the managed heap currently holds 70,000+ ReaderWriterLockSlim instances. So what? Don't forget: the "Slim" in ReaderWriterLockSlim exists because it spins in user mode before falling back to a kernel wait, and that spinning burns a little CPU. Amplify that little bit by the hundred-plus threads stuck there, and how could the CPU not shoot up?
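
The dump doesn't include the source of NamedReaderWriterLocker, but a sketch consistent with the symptom, one ReaderWriterLockSlim created and kept alive per cache key, might look like this:

```csharp
using System.Collections.Concurrent;
using System.Threading;

public sealed class NamedReaderWriterLocker
{
    // One lock per distinct key, never evicted: with enough distinct cache
    // keys this is exactly how 70,000+ ReaderWriterLockSlim instances can
    // pile up on the managed heap.
    private readonly ConcurrentDictionary<string, ReaderWriterLockSlim> _locks =
        new ConcurrentDictionary<string, ReaderWriterLockSlim>();

    public ReaderWriterLockSlim GetLock(string name) =>
        _locks.GetOrAdd(name, _ => new ReaderWriterLockSlim());
}
```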


To sum up, this dump points to two reasons for the maxed-out CPU:

  • The constant lock contention and context switching caused by the Lock Convoy dealt the CPU one critical hit.
  • The user-mode spinning of 100+ threads inside ReaderWriterLockSlim dealt the CPU another critical hit.

Once you know why, the solution is simple.

  • Batch the operations to reduce how often the serializing lock is taken; don't lock item by item.
  • Remove ReaderWriterLockSlim and use the thread-safe methods that come with MemoryCache.
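
As a sketch of what the second fix could look like (the Lazy&lt;T&gt; wrapper is my addition so the getter runs at most once per cold key; the other ICache members from the original class are omitted):

```csharp
using System;
using System.Runtime.Caching;

internal class LocalMemoryCache   // ICache members other than this one omitted
{
    public T GetWithCache<T>(string cacheKey, Func<T> getter, int cacheTimeSecond) where T : class
    {
        var lazy = new Lazy<T>(getter);
        // AddOrGetExisting is thread-safe: it returns the entry already cached
        // under cacheKey, or null if our Lazy<T> was just inserted. No
        // per-key ReaderWriterLockSlim is needed at all.
        var existing = (Lazy<T>)MemoryCache.Default.AddOrGetExisting(
            cacheKey, lazy, new DateTimeOffset(DateTime.Now.AddSeconds(cacheTimeSecond)));
        return (existing ?? lazy).Value;
    }
}
```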

For more quality content, see my GitHub: dotnetfly