OS Up close: Linux time may not be as reliable as you think!

Original: Little Sister Taste (wechat official ID: Xjjdog), welcome to share, please keep this statement if reprinted by non-official account.

If you’re going to do something periodically, you’re going to have to do it with time. For example, eat breakfast at 7am and fall asleep at 10pm. Of course, if you have a partner, this time of night may not be so fixed.

Computers control time more accurately than humans sense it, but we still struggle to schedule it with absolute precision, which leads to the ultimate philosophical problem. Understanding the causes of a problem is more difficult than the symptoms of the problem, and let’s talk about this problem.

Suppose I have a task that needs to be executed every 60 seconds. How do you design it?

A lot of Javaers naturally think of Timer and ScheduledExecutorService, but that just means you know some of the apis, and when it comes to scheduling with absolute precision, neither of them will suffice.

A case in point

The following code will start an actuator with a 5-second interval, and then record the actual interval time and the desired offset.

public class Main { private static ScheduledExecutorService schedule = Executors.newScheduledThreadPool(1); private static LinkedBlockingDeque<Long> q = new LinkedBlockingDeque<>(); public static void main(String[] args) { final long rateNano = TimeUnit.SECONDS.toNanos(5); final Random r = new Random(); final AtomicLong offset = new AtomicLong(0); final AtomicLong max = new AtomicLong(0); schedule.scheduleAtFixedRate(()->{ try { long eventTime = System.nanoTime(); long nanoOffset = q.size() == 0 ? rateNano : (eventTime - q.pollLast()); offset.addAndGet(nanoOffset); offset.addAndGet(-rateNano); max.set(Math.max(max.get(), Math.abs(offset.get()))); System.out.println(TimeUnit.NANOSECONDS.toSeconds(eventTime)+ "(s) #" + nanoOffset + "(us)," + offset.get() + "(us)," + max.get() + "(us)" ); q.offer(eventTime); Thread.sleep(r.nextInt(500)); } catch (InterruptedException e) { e.printStackTrace(); } }, 0, rateNano, TimeUnit.NANOSECONDS); }}Copy the code

Let’s break it down and print out the number of seconds and nanoseconds between each interval, and you’ll find something unusual.

978048(s) #4996295958(us),-688459(us),57185541(us)
978053(s) #5002982917(us),2294458(us),57185541(us)
978058(s) #5000489208(us),2783666(us),57185541(us)
978063(s) #4997937167(us),720833(us),57185541(us)
978068(s) #5002287042(us),3007875(us),57185541(us)
978073(s) #4999411375(us),2419250(us),57185541(us)

Copy the code

As you can see, the number of seconds is increasing at a rate of 5 seconds 5 seconds, but the actual execution time, if scaled up to nanoseconds, shows a very irregular distribution.

In order to get reliable data, I actually ran the task for a day, and by the end, the total offset was up to 57ms.

Leaving aside the scheduling of Java threads, there are errors on the operating system. Take the operating system itself. Windows and Linux, which we use a lot, are not real-time operating systems because of virtual memory, thread pools, drivers.

The main wait method is ConditionObject’s awaitNanos method in the take method of DelayedWorkQueue.

Further down, LockSupport’s parkNanos method. Moving on, that’s unsafe, which is essentially a native function.

public native void park(boolean isAbsolute, long time);

Copy the code

The first parameter is the absolute time, and the second parameter is the wait time value. If isAbsolute is true, the millisecond timing is implemented. If isAbsolute is false, nanosecond timing is implemented. Nanosecond, that’s pretty accurate.

In the JDK source code, we found specific native functions. In the case of Linux, files lie in./ OS /posix/ os_POSIx. CPP, which ultimately calls pthread_cond_timedwait.

All of the programming is for GliBC, gone.

pthread_cond_timedwait

In general, the platform provides functions such as sleep, pthread_cond_wait, and pthread_cond_timedwait for users to wait and wake up threads.

Pthread_cond_timedwait is the most widely used one. By using perf to log stack calls, we can see the general stack of function calls.

javac Main
java Main
ps -ef| grep java
perf record -g -a  -p 2019961
perf report

Copy the code

For those who understand the internal operating principle of Linux, it can be seen from the above function call that futex mechanism of Linux is mainly used here, and the two main methods of FUTEX are futex_wait and FUTEX_wake.

Regardless of all the synchronization jargon and function version differences, in kernel/futex/waitwake.c (Linux-5.16.12), we can see the related function calls.

The so-called wait counter is set up in the following code.

struct hrtimer_sleeper * futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout, int flags, u64 range_ns) { if (! time) return NULL; hrtimer_init_sleeper_on_stack(timeout, (flags & FLAGS_CLOCKRT) ? CLOCK_REALTIME : CLOCK_MONOTONIC, HRTIMER_MODE_ABS); /* * If range_ns is 0, calling hrtimer_set_expires_range_ns() is * effectively the same as calling hrtimer_set_expires(). */ hrtimer_set_expires_range_ns(&timeout->timer, *time, range_ns); return timeout; }Copy the code

Time round

So the question finally focuses on the topic of our article.

Hrtimers, the next high resolution timer for Linux. The hrtimer structure (also known as &timeout->timer), which has a function argument, accepts a callback function that will be called when the timer is triggered.

Before we talk about high resolution timers, let’s talk about low resolution timers. In earlier Versions of Linux, timers were implemented based on the CPU’s HZ, or tick cycle.

Obviously, the minimum value of the tick cycle is 1/CPU. However, the current CPU frequency is GHz to calculate, so the accuracy is relatively good, can achieve nanosecond level.

1GHz=1000MHz, 1MHz=1000kHz, 1kHz=1000Hz A jiffy =1 /HZCopy the code

Low-resolution timers, as well as a well-known time-wheel algorithm, hash timed tasks in a finite array of rings. Then, on this basis, refer to the way of water meter in daily life, through the low scale of the fast wheel to drive the higher level of the wheel to move, like a gear to drive a higher level of gear, so that you can avoid the problem of too large wheels.

And this is actually pretty easy to understand. If we take the time of a day, every second is engraved on the clock, need 86,400 scales. But in fact, our clocks only need 60 ticks to complete a day’s cycle.

The Linux timer divides the time wheel into 9 layers, which can be said to be very precise.

#define WHEEL_SIZE (LVL_SIZE * LVL_DEPTH)

/* Level depth */
#if HZ > 100
# define LVL_DEPTH 9
# else
# define LVL_DEPTH 8
#endif

Copy the code

This is a lot of detail, so we’ll pick another article to cover it, and don’t forget to focus on XjjDog.

hrtimer

What does Hrtimer do to qualify it as high precision compared to the low (and not so low) time wheel design?

The organization of hrtimers is actually a red-black tree (TimerQueue_node), which is the final choice based on comparing data structures such as hashes, hoptables, heaps, etc. High-precision code is almost entirely rewritten, and most are capable of O(1) time complexity operations.

The main task of the high-precision timer is not to achieve the accuracy of the time slice, but to provide stable and fast functions when performing the increase, deletion, change and check. Even for sorting, time consumption should be reduced as far as possible, because the instability of scheduling code execution time will also affect the stability of the whole scheduling system.

The timerQueue_head structure adds a next field to the red-black tree, which holds the first timer node to expire. These changes are immediately effective in terms of efficiency, just as B+ Tree changes B Tree.

The following structures give a general idea of how a red-black tree is organized.

struct timerqueue_node {
 struct rb_node node;
 ktime_t expires;
};

struct timerqueue_head {
 struct rb_root_cached rb_root;
};

struct rb_root_cached {
 struct rb_root rb_root;
 struct rb_node *rb_leftmost;
};

struct rb_node {
 unsigned long  __rb_parent_color;
 struct rb_node *rb_right;
 struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

Copy the code

If we look at the hrtimer structure again, we see that there is a _softexpires element in it, and that its node member variable also has a variable called Expires.

struct hrtimer {
 struct timerqueue_node  node;
 ktime_t    _softexpires;
 enum hrtimer_restart  (*function)(struct hrtimer *);
 struct hrtimer_clock_base *base;
 u8    state;
 u8    is_rel;
 u8    is_soft;
 u8    is_hard;
};

Copy the code

These are some of the compromises made by Hrtimer to increase the efficiency of scheduling. It says, our task can expire at any point between _softExpires and Expires. Expires is called a hard expiration. It’s the last time a task expires. With this design, it is possible to avoid the process being woken up frequently by hrtimer and reduce contextswitch.

We can get a glimpse of the Settings for both points in time from the hrtimer_set_expires_range_ns function derived from perf. This is essentially a function of time alignment.

static inline void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time, u64 delta)
{
 timer->_softexpires = time;
 timer->node.expires = ktime_add_safe(time, ns_to_ktime(delta));
}

Copy the code

Delta is set up in an interesting way, with different values on different hardware devices, indicating the minimum scheduling accuracy. This is also the root cause of time jitter in our uppermost Java program.

End

As you can probably imagine, there is no such thing as accurate scheduling. But with the increase of the main frequency, we can control the accuracy within a certain range.

Not to mention the accuracy of the time itself, just the subdivision of the time slice makes the time error of the current PC in the micro world become extremely huge, and it is almost impossible to carry out high frequency precision scheduling.

The most accurate clock in the world loses only one second every 15 billion years. But a second is time, and we can still put it into words. Wrestling with quasi-real-time is an answer that will never end until we can manipulate atoms. Add to that the uncertainty of the timing of the task scheduling code itself, and the current scheduler’s nanosecond accuracy is a miracle.

The world itself is a rough observation of man, let alone a machine made by man.

Xjjdog is a public account that doesn’t allow programmers to get sidetracked. Focus on infrastructure and Linux. Ten years architecture, ten billion daily flow, and you discuss the world of high concurrency, give you a different taste. My personal wechat xjjdog0, welcome to add friends, further communication.

Recommended reading:

2. What does the album taste like

3. Bluetooth is like a dream 4. 5. Lost architect, leaving only a script 6. Bugs written by architect, unusual 7. Some programmers, nature is a flock of sheep!

Taste of little sister

Not envy mandarin duck not envy fairy, a line of code half a day

328 original content

OS Up close: Linux time may not be as reliable as you think!

A case in point

pthread_cond_timedwait

Time round

hrtimer

End

Related Posts

Java Series: ThreadLocal principles and Usage scenarios in Interviews

Based on the practice of Docker continuous delivery platform construction

Laravel The scheduled task is executed in Windows