
Recently our business has been growing very rapidly, which has been a major challenge for the backend: the core business interfaces hit performance bottlenecks, and not just one problem but several at once — as soon as we fixed one issue online, another bottleneck surfaced. My lack of experience made it hard to pinpoint and resolve everything at once, and out of stubborn pride in our backend team I did not want to ride out the traffic peak simply through massive horizontal scaling. As a result, problems of varying severity appeared in production for several nights in a row, which certainly affected our business growth. This is where I was immature, and I have reflected on it. This series of articles records the general technical optimizations we made to our backend microservice system; the business-process and cache optimizations for this growth are specific to our business and will not be covered here. The series is divided into the following parts:

  1. Improve the client-side load-balancing algorithm
  2. Develop a plug-in to filter exception stacks from log output
  3. Improve the asynchronous-log wait strategy for x86 cloud environments
  4. Add monitoring of HTTP request wait queues for synchronous microservices; cloud deployments must carefully handle slow requests caused by hitting the instance's network-traffic limit
  5. Add necessary intrusive monitoring for key system services

Improved asynchronous log waiting strategies for x86 cloud environments

Because our online traffic is large (hundreds of millions of requests per day, with hundreds of thousands of daily active users) and the business logic involved is complex, the online log level is info, which produces a huge volume of logs. For this reason we adopted Log4j2 asynchronous logging very early on. Log4j2 asynchronous logging is built on the Disruptor, which uses a wait strategy; before this optimization we used the Block strategy.
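As background, Log4j2's all-asynchronous logger mode is enabled through a context-selector system property. A minimal sketch, assuming the classic `Log4jContextSelector` property name (the exact key varies by Log4j2 version, and in production it is usually passed as a `-D` JVM flag rather than set in code):

```java
public class EnableAsyncLogging {
    public static void main(String[] args) {
        // Must run before the first Logger is created; normally passed as
        // -DLog4jContextSelector=... on the JVM command line instead.
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
        System.out.println(System.getProperty("Log4jContextSelector"));
    }
}
```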

The wait strategy of Log4j2 asynchronous logging

A Disruptor consumer keeps checking a status bit to see whether a message has arrived and is ready to be read and consumed. How it keeps checking is the wait strategy. The Disruptor ships with many wait strategies that will be familiar to anyone who knows multiprocessor programming. A wait strategy can be loosely defined as: while no task has arrived, how should the thread wait and release CPU resources, yet start working as quickly as possible once a task does arrive? In Log4j2's Disruptor-based asynchronous logging, the wait strategy is configured through the AsyncLoggerConfig.WaitStrategy system property; in the latest version of Log4j2 the following values can be configured:

switch (strategyUp) {
    case "SLEEP":
        final long sleepTimeNs =
                parseAdditionalLongProperty(propertyName, "SleepTimeNs", 100L);
        final String key = getFullPropertyKey(propertyName, "Retries");
        final int retries =
                PropertiesUtil.getProperties().getIntegerProperty(key, 200);
        return new SleepingWaitStrategy(retries, sleepTimeNs);
    case "YIELD":
        return new YieldingWaitStrategy();
    case "BLOCK":
        return new BlockingWaitStrategy();
    case "BUSYSPIN":
        return new BusySpinWaitStrategy();
    case "TIMEOUT":
        return new TimeoutBlockingWaitStrategy(timeoutMillis, TimeUnit.MILLISECONDS);
    default:
        return new TimeoutBlockingWaitStrategy(timeoutMillis, TimeUnit.MILLISECONDS);
}
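As the switch above suggests, the strategy name is read from the AsyncLoggerConfig.WaitStrategy property. A minimal sketch of selecting the Sleep strategy in code; the property must take effect before Log4j2 initializes, and the exact key may differ across Log4j2 versions:

```java
public class SelectWaitStrategy {
    public static void main(String[] args) {
        // Equivalent to passing -DAsyncLoggerConfig.WaitStrategy=Sleep to the JVM.
        // Must be set before Log4j2 initializes.
        System.setProperty("AsyncLoggerConfig.WaitStrategy", "Sleep");
        System.out.println(System.getProperty("AsyncLoggerConfig.WaitStrategy"));
    }
}
```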

Originally we used BlockingWaitStrategy, which is based on wait/notify. Under this strategy, however, when traffic surged the log-writing thread sometimes could not be woken up for a while, causing log events to back up in the RingBuffer.

Why the log-writing thread was not woken up

Let's start with some hardware basics. CPUs do not read data directly from main memory; there are several levels of CPU cache in between. Most current multi-core architectures use the MESI cache-coherence protocol within a single machine. Sharing — or rather contention — occurs when one processor accesses a main-memory address that another processor has already loaded into its cache. Cache coherence must be maintained because if one processor updates a shared cache line, the other processors' copies must be invalidated so they do not read stale values.

In the MESI cache-coherence protocol, a cache line is in one of four states:

  • Modified: the cache line has been modified and must eventually be written back to main memory; until then, no other processor may cache it.
  • Exclusive: the cache line has not been modified, and no other processor may load it into its cache.
  • Shared: the cache line is unmodified and may be loaded into other processors' caches.
  • Invalid: the cache line holds no meaningful data.

For example, suppose the processors and main memory are connected by a bus, as shown in the figure below:

a) Processor A reads the data at address a and stores it in its cache in the Exclusive state.

b) Processor B reads the data at address a; processor A detects the address conflict and responds with the data for address a from its cache. Afterwards, both A and B hold the data for address a in the Shared state.

c) Processor B writes to address a, changing its copy's state to Modified, and broadcasts to all other processors that have loaded the data into their caches (here, A) to set their copies to Invalid.

d) A then needs to access address a again, so it broadcasts the request; B sends the modified data back to A and to main memory, and both copies are set to Shared.

The Disruptor is a ring buffer with many performance optimizations (see my other series on the Disruptor: Disruptor). Log4j2 uses it as follows:

On the consuming side, a single thread consumes all events. When no log event has arrived, this thread needs to wait; how it waits is the wait strategy described above. So how does it determine readiness? The thread keeps checking a status bit: whether the RingBuffer's production sequence is greater than the sequence it has consumed so far. If it is, there are new messages to consume. Code that repeatedly checks a status bit like this is called a spin-wait loop.
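The spin-wait loop can be sketched independently of the Disruptor: a consumer thread repeatedly compares the producer's published sequence against the next sequence it wants. Everything below (class and variable names included) is an illustrative toy, not Disruptor code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SpinWaitLoopDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong cursor = new AtomicLong(-1); // producer's published sequence
        long nextSequence = 0;                  // sequence the consumer wants next

        Thread producer = new Thread(() -> cursor.set(0)); // publish event 0
        producer.start();

        // The spin-wait loop: keep re-reading the "status bit" (the cursor)
        // until it shows a sequence we have not consumed yet.
        while (cursor.get() < nextSequence) {
            // a real wait strategy would spin, yield, or sleep here
        }
        System.out.println("consumed sequence " + cursor.get());
        producer.join();
    }
}
```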

So why was this single thread slow to wake up at peak hours? The main reason is that when business volume rises, a large number of threads — far more than the number of CPUs — enter the Runnable state, producing heavy CPU computation along with frequent cache-line updates and invalidations, which generates a lot of bus traffic. As a result, the notify signal destined for the single log-consuming thread is delayed, and even once that thread becomes Runnable it must compete for CPU time to run. During this time, the busy business threads occupy the CPU for longer, so many log events pile up in the RingBuffer.

Switching to SleepingWaitStrategy

We switched to SleepingWaitStrategy, the most balanced of these strategies. In most applications today the number of threads far exceeds the number of CPUs, and even the number of RUNNABLE threads can far exceed the CPU count. With the wait/notify-based BlockingWaitStrategy, a sudden traffic spike means the log-consuming thread is not woken up promptly (the CPU having been preempted by a flood of RUNNABLE business threads). If instead we used the more aggressive BusySpinWaitStrategy (which keeps calling Thread.onSpinWait()) or YieldingWaitStrategy (which spins first and then keeps calling Thread.yield()), CPU usage would be high even when idle. What we want is a progressive wait strategy, namely:

  1. When log volume is particularly high, keep spinning for a certain number of iterations to reduce the cost of thread switching.
  2. After that many iterations, repeatedly call Thread.onSpinWait() or Thread.yield() so the current thread gives up CPU resources, to cope with intermittent log spikes.
  3. After step 2 has run a certain number of times, fall back to wait/notify or Thread.sleep()-like calls to block the current thread, keeping CPU consumption low during lulls in logging.

One such strategy is SleepingWaitStrategy, which uses Thread.yield() in step 2 and LockSupport.parkNanos() (a short sleep) in step 3.

public final class SleepingWaitStrategy implements WaitStrategy {

    @Override
    public long waitFor(final long sequence, Sequence cursor, final Sequence dependentSequence,
            final SequenceBarrier barrier) throws AlertException {
        long availableSequence;
        int counter = retries;

        while ((availableSequence = dependentSequence.get()) < sequence) {
            counter = applyWaitMethod(barrier, counter);
        }

        return availableSequence;
    }

    @Override
    public void signalAllWhenBlocking() {
    }

    private int applyWaitMethod(final SequenceBarrier barrier, int counter) throws AlertException {
        barrier.checkAlert();

        // First, busy-spin for a number of iterations
        if (counter > 100) {
            --counter;
        }
        // Then, yield the CPU for a number of iterations
        else if (counter > 0) {
            --counter;
            Thread.yield();
        }
        // Finally, sleep
        else {
            LockSupport.parkNanos(sleepTimeNs);
        }

        return counter;
    }
}

Changing Thread.yield to Thread.onSpinWait

With SleepingWaitStrategy, we observed through our custom JFR events that during the transition from off-peak to peak traffic, the consuming thread is usually at Thread.yield() when log events arrive. However, the interval between consecutive Thread.yield() calls was still a bit too long: after a log event arrived, the thread might still go through several more Thread.yield() calls before noticing it.

Therefore, we changed Thread.yield() to Thread.onSpinWait(), for the following reason: the environment we deploy to is x86 machines, and on x86, after Thread.onSpinWait() has been called a certain number of times, the C1 compiler replaces it with PAUSE, an x86 instruction. See the JVM source:

x86.ad

instruct onspinwait() %{
  match(OnSpinWait);
  ins_cost(200);

  format %{
    $$template
    $$emit$$"pause\t! membar_onspinwait"
  %}
  ins_encode %{
    __ pause();
  %}
  ins_pipe(pipe_slow);
%}

As we know, the CPU does not always operate on memory directly; it reads memory in cache-line units into the CPU cache. Code that repeatedly checks whether a status bit is ready keeps reading from the CPU cache, and it will consider the status unchanged until the current CPU learns via the bus that the cache line has been invalidated. At peak times the bus can be so busy that step 2 of SleepingWaitStrategy keeps failing to see the status-bit update and falls through to step 3.

The PAUSE instruction (see: www.felixcloutier.com/x86/pause) hints to the processor that the code sequence being executed is a spin-wait loop. With this hint, the processor's branch prediction can avoid the memory-order violation that normally occurs on exiting such a loop, and the processor avoids aggressively caching and reordering the loop's memory reads, which improves the execution efficiency of the spin-wait loop.

This instruction makes Thread.onSpinWait() more efficient than Thread.yield() in spin-wait-loop scenarios. So we changed SleepingWaitStrategy's Thread.yield() to Thread.onSpinWait().
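A minimal sketch of Thread.onSpinWait() in a spin-wait loop (requires Java 9+; the class and variable names here are illustrative, not from our production code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OnSpinWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);

        Thread spinner = new Thread(() -> {
            // Spin on the status bit; the hint lets the JIT emit PAUSE on x86.
            while (!ready.get()) {
                Thread.onSpinWait();
            }
            System.out.println("observed ready");
        });
        spinner.start();

        Thread.sleep(10); // let the spinner spin for a moment
        ready.set(true);  // flip the status bit
        spinner.join();
    }
}
```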
