With Node.js in full swing today, we can do all sorts of things with it. We recently took part in the Geek Pine hackathon, where we wanted to build a game that would get phone-addicted players communicating more. The core feature was real-time multiplayer interaction, built around the LAN party concept. Geek Pine is a pitifully short 36 hours, and everything has to be sharp and fast. Under that premise, some initial preparation felt natural. We chose Node-WebKit: simple enough, and it met our requirements.

As required, our development can be divided into modules. This article describes the process of developing Spaceroom, our real-time multiplayer game framework, including a series of explorations and experiments, as well as workarounds for the limitations of Node.js and the WebKit platform itself.

Getting started

Spaceroom at a glance

Spaceroom’s design was demand-driven from the outset. We wanted the framework to provide the following basic functionality:

  • Distinguish groups of users by room (or channel)
  • Collect commands from the users in a group
  • Accurately broadcast game data to the clients in a group at a specified interval
  • Minimize the impact of network latency

Of course, in the later stages of coding we gave Spaceroom more functions, including pausing games and generating random numbers that are consistent across clients (these could also be implemented in the game logic layer if required; they don't have to live in Spaceroom, which works mostly at the communications level).

APIs

Spaceroom is divided into a client side and a server side. The server maintains the room list and provides the ability to create and join rooms. Our client APIs look like this:

  • spaceroom.connect(address, callback): connect to the server
  • spaceroom.createRoom(callback): create a room
  • spaceroom.joinRoom(roomId): join a room
  • spaceroom.on(event, callback): listen for events

When the client connects to the server, it receives various events. For example, a user in a room might receive an event when a new player joins, or when the game starts. We give the client a "life cycle": at any given time it is in exactly one of a set of states.

You can get the current state of the client via spaceroom.state.
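As a quick illustration, a client session might look like the sketch below. The event names and callback shapes here are assumptions for illustration, not guaranteed by Spaceroom; check the source for the real ones.

// Hypothetical usage sketch; event names and callback shapes are assumed.
spaceroom.connect('ws://localhost:8080', function () {
  // Connected; create a room and log its id and our current state.
  spaceroom.createRoom(function (roomId) {
    console.log('room created:', roomId, 'state:', spaceroom.state);
  });
});

// A user in a room might receive events such as these:
spaceroom.on('playerJoined', function (player) {
  console.log('player joined:', player);
});
spaceroom.on('gameStart', function () {
  console.log('game started');
});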

Using the server-side framework is relatively easy; with the default configuration file, you can simply run it as-is. We had one basic requirement: the server code should be able to run directly inside the client, without a separate server. Anyone who has played games over LAN on a PlayStation or PSP will know what I mean. Of course, being able to run on a dedicated server is also excellent.
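Starting a standalone server might then look like the following sketch; the module path and option are hypothetical stand-ins, not Spaceroom's actual API, so consult the source for the real entry point.

// Hypothetical entry point; the module path and listen() call are assumed.
var spaceroom = require('./spaceroom-server');
// Run with the default configuration file, listening on a chosen port.
spaceroom.listen(8080);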

The implementation of the logic code is abbreviated here. In essence, Spaceroom implements a socket server that maintains the room list, including each room's status, and handles in-game communications (command collection, bucket broadcasting, and so on) for each room. See the source code for the specifics.

Synchronization algorithm

So how do you make everything displayed across clients consistent in real time?

That sounds interesting. Think about it: what do we actually need the server to deliver for us? It is natural to ask what could cause logical inconsistencies between clients: user commands. Since the code handling the game logic is the same everywhere, the same inputs produce the same results; the only difference is the player commands received during the game. So we need a way to synchronize these commands: if all clients receive the same commands, all clients should, in theory, compute the same results.

Online games use a wide variety of synchronization algorithms, each suited to different scenarios. The algorithm Spaceroom uses is similar in concept to frame lockstep. We split the timeline into intervals, each called a bucket. Buckets hold commands and are maintained on the server side. At the end of each bucket period, the server broadcasts the bucket to all clients, which take the commands out of the bucket, verify them, and execute them.

To reduce the impact of network latency, each command the server receives from a client is delivered to a bucket according to the following rules:

  1. Let order_start be the occurrence time carried by the command, and t be the start time of the bucket containing order_start
  2. If t + delay_time <= the start time of the bucket currently collecting commands, post the command to the currently collecting bucket; otherwise continue to step 3
  3. Post the command to the bucket containing t + delay_time

delay_time is the agreed server delay, which can be set to the average latency between clients; the default in Spaceroom is 80ms, and the default bucket length is 48ms. At the end of each bucket period, the server broadcasts that bucket to all clients and begins collecting commands for the next one. The client paces its logic by the bucket interval it receives, keeping the timing error within an acceptable range. The rules condense into the sketch below.
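This is a minimal sketch of the assignment rules under the stated defaults; the function and field names are illustrative, not Spaceroom's actual code.

var DELAY_TIME = 80;  // agreed server delay, ms
var BUCKET_SIZE = 48; // bucket length, ms

// currentBucketStart: start time of the bucket currently collecting commands.
// order.start: the occurrence time carried by the command.
function assignToBucket(order, currentBucketStart) {
  // t = start time of the bucket containing the command's timestamp.
  var t = Math.floor(order.start / BUCKET_SIZE) * BUCKET_SIZE;
  if (t + DELAY_TIME <= currentBucketStart) {
    // Too old to honor the agreed delay: post to the current bucket.
    return currentBucketStart;
  }
  // Otherwise post to the bucket containing t + DELAY_TIME.
  return Math.floor((t + DELAY_TIME) / BUCKET_SIZE) * BUCKET_SIZE;
}

// A command stamped at 100ms, while the bucket starting at 96ms is
// collecting, lands in the bucket starting at 144ms (broadcast at 192ms).
console.log(assignToBucket({ start: 100 }, 96)); // 144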

This means that, under normal conditions, the client receives a bucket from the server every 48ms, and when the time comes to process that bucket, the client does so. Assuming the client runs at 60 FPS, it receives a bucket roughly every 3 frames and updates its logic from it. If a bucket has not arrived by the time it is due (because of network fluctuations), the client pauses the game logic and waits. Within a bucket interval, rendering can be smoothed with linear interpolation (lerp).
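For instance, lerp lets the renderer place an entity partway between its last and next logic states. A tiny self-contained example:

// Linear interpolation between two logic states; t is the fraction of
// the current bucket interval that has elapsed.
function lerp(a, b, t) {
  return a + (b - a) * t;
}

// An entity at x=100 after the last logic step, heading for x=148 at
// the next one, renders at x=124 when 24ms of a 48ms bucket have passed.
var BUCKET_SIZE = 48;
console.log(lerp(100, 148, 24 / BUCKET_SIZE)); // 124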

With delay_time = 80 and bucket_size = 48, any command is delayed by at least 96ms: it lands in the bucket containing t + delay_time, which is only broadcast when that bucket ends, i.e. after ceil(delay_time / bucket_size) whole buckets. Changing the two parameters to, say, delay_time = 60 and bucket_size = 32 gives a minimum delay of 64ms.
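The arithmetic is easy to check (a quick sketch; minDelay is our name, not Spaceroom's):

// Minimum latency measured from the start of the command's bucket: the
// command lands in the bucket containing t + delayTime, which is only
// broadcast when that bucket ends.
function minDelay(delayTime, bucketSize) {
  return Math.ceil(delayTime / bucketSize) * bucketSize;
}

console.log(minDelay(80, 48)); // 96
console.log(minDelay(60, 32)); // 64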

The death of a timer

Overall, our framework needs a precise timer: bucket broadcasts must happen at a fixed interval. The first thing we thought of was setInterval(), but the next moment we realized how unreliable that idea was: the naughty setInterval() can drift seriously, and crucially, the errors accumulate into more and more serious consequences.

We immediately thought of using setTimeout() instead, keeping our logic roughly stable at the specified interval by dynamically correcting the next delay. For example, if setTimeout() fired 5ms later than expected, we schedule the next one 5ms earlier. But the results were disappointing, and the approach is not elegant in any way.
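The drift-correcting loop we tried looked roughly like this sketch:

// Self-correcting setTimeout loop: measure how late each tick fired and
// shorten the next delay accordingly, so errors do not accumulate.
var INTERVAL = 48;
var expected = Date.now() + INTERVAL;

function tick() {
  var drift = Date.now() - expected; // positive if we fired late
  // ... periodic work goes here ...
  expected += INTERVAL;
  setTimeout(tick, Math.max(0, INTERVAL - drift));
}

setTimeout(tick, INTERVAL);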

So we had to think about it differently. What if we let setTimeout() fire as soon as possible, and then checked whether the current time had reached the target time? In other words, looping with setTimeout(callback, 1) and constantly checking the clock seemed like a good idea.

Disappointing timer

We immediately wrote a piece of code to test the idea, and the results were disappointing. On the then-latest stable version of Node.js (v0.10.32) on Windows, run this code:

var sum = 0, count = 0;
function test() {
  var now = Date.now();
  setTimeout(function () {
    // Accumulate the actual interval between scheduling and firing.
    var diff = Date.now() - now;
    sum += diff;
    count++;
    test();
  }, 1);
}

test();

After a while, type sum/count in the console and you’ll see something like this:

> sum/count
15.624555160142348

What?! I asked for a 1ms interval and you tell me the actual average was 15.625ms! What a beautiful picture. We ran the same test on a Mac and got about 1.4ms. So we wondered: what on earth is this? If I were an Apple fan I would have concluded that Windows is rubbish and given up, but as a serious front-end engineer, I kept thinking about that number.

Wait, why does that number look familiar? Isn't 15.625ms suspiciously like the maximum timer interval on Windows? I immediately downloaded ClockRes (a Sysinternals utility), ran it from the console, and got the following results:

Maximum timer interval: 15.625 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.001 ms

Sure enough! If you look at the Node.js manual, you'll find the following description of setTimeout:

The actual delay depends on external factors like OS timer granularity and system load.

However, the actual delay turned out to be the maximum timer interval (note that the system's current timer interval was only 1.001ms at the time), which is unacceptable either way. Intrigued, we dug into the Node.js source code to see what was going on.

A BUG in Node.js

I believe most of you, like me, have some understanding of the event loop mechanism in Node.js. We can get a rough picture of how timers are implemented by reading the timer source. Let's start with the main loop of the event loop:

while (r != 0 && loop->stop_flag == 0) {
  uv_update_time(loop);
  uv_process_timers(loop);

  if (loop->pending_reqs_tail == NULL &&
      loop->endgame_handles == NULL) {
    uv_idle_invoke(loop);
  }

  uv_process_reqs(loop);
  uv_process_endgames(loop);

  uv_prepare_invoke(loop);

  (*poll)(loop, loop->idle_handles == NULL &&
                loop->pending_reqs_tail == NULL &&
                loop->endgame_handles == NULL &&
                !loop->stop_flag &&
                (loop->active_handles > 0 ||
                 !ngx_queue_empty(&loop->active_reqs)) &&
                !(mode & UV_RUN_NOWAIT));

  uv_check_invoke(loop);
  r = uv__loop_alive(loop);

  if (mode & (UV_RUN_ONCE | UV_RUN_NOWAIT))
    break;
}

The source of uv_update_time is as follows (github.com/joyent/libu…):

void uv_update_time(uv_loop_t* loop) {
  /* Get the current system time */
  DWORD ticks = GetTickCount();

  /* The assumption is made that LARGE_INTEGER.QuadPart has the same type */
  /* as loop->time, which happens to be. Is there any way to assert this? */
  LARGE_INTEGER* time = (LARGE_INTEGER*) &loop->time;

  /* If the timer has wrapped, add 1 to its high-order part. */
  /* uv_poll must make sure that the timer can never overflow more than */
  /* once between two subsequent uv_update_time calls. */
  if (ticks < time->LowPart) {
    time->HighPart += 1;
  }
  time->LowPart = ticks;
}

Internally, this function uses the Windows GetTickCount() function to set the current time of the loop. Simply put, after setTimeout is called, following a series of struggles, the internal timer->due is set to the loop's current time plus the timeout. In the event loop, uv_update_time first refreshes the loop's current time, then uv_process_timers checks whether any timers are due and, if so, enters the JavaScript world. Reading through, the event loop goes something like this:

  1. Update the global time
  2. Check timers; if any timer is due, execute its callback
  3. Check the reqs queue and execute pending requests
  4. Enter the poll function to collect IO events. If an IO event arrives, its handler is added to the reqs queue for execution in the next iteration. Inside poll, a system call is made to collect IO events; it blocks the process until an IO event arrives or its timeout is reached, and that timeout is set to the expiry time of the nearest timer. In other words, the block while collecting IO events lasts at most until the next timer is due.

The poll function on Windows looks like this:

static void uv_poll(uv_loop_t* loop, int block) {
  DWORD bytes, timeout;
  ULONG_PTR key;
  OVERLAPPED* overlapped;
  uv_req_t* req;

  if (block) {
    /* Block at most until the nearest timer is due. */
    timeout = uv_get_poll_timeout(loop);
  } else {
    timeout = 0;
  }

  GetQueuedCompletionStatus(loop->iocp,
                            &bytes,
                            &key,
                            &overlapped,
                            timeout);

  if (overlapped) {
    /* An IO event was dequeued. */
    req = uv_overlapped_to_req(overlapped);
    /* Queue its handler for the next loop iteration. */
    uv_insert_pending_req(loop, req);
  } else if (GetLastError() != WAIT_TIMEOUT) {
    /* Serious error. */
    uv_fatal_error(GetLastError(), "GetQueuedCompletionStatus");
  }
}

Following the steps above, suppose we set a timer with timeout = 1ms: the poll function should block for at most 1ms before resuming (if no IO events arrive in that window). Continuing through the event loop, uv_update_time refreshes the time and uv_process_timers should notice that our timer is due and execute the callback. So the initial analysis was: either uv_update_time was wrong (it didn't update the current time correctly), or the poll function's 1ms wait was wrong.

Looking at MSDN, we were surprised to find this description of the GetTickCount function:

The resolution of the GetTickCount function is limited to the resolution of the system timer, which is typically in the range of 10 milliseconds to 16 milliseconds.

GetTickCount's resolution is that crude! Suppose the poll function correctly blocks for 1ms; the subsequent uv_update_time still fails to update the loop's current time, so our timer is not judged to be due, poll waits another 1ms, and the loop goes around again. Only when GetTickCount finally ticks over (once every 15.625ms, as measured) and the loop's time is updated does uv_process_timers judge our timer expired.

Seeking help from Node-WebKit

The Node.js source left us feeling pretty helpless: it uses an imprecise timing function and does nothing to compensate. But we quickly realized that since we're using Node-WebKit, besides Node.js's setTimeout we also have Chromium's setTimeout. We wrote a test page you can run in a browser or in Node-WebKit: marks.lrednight.com/test.html#1 (the number after # is the interval to measure). The results are as follows:

According to the HTML5 specification, the theoretical result should be 1ms for the first 5 iterations and 4ms thereafter. The results shown come from the third run of the test case, so in theory the table should read 1ms for the first three entries and 4ms from then on. There is some error in the results, and per the spec, the smallest theoretical interval we can get is 4ms. We weren't satisfied, but this was clearly much better than the Node.js result. Strong curiosity drove us into the Chromium source to see how it's done (chromium.googlesource.com/chromium/sr…).

First, Chromium uses the timeGetTime() function to determine the loop's current time. MSDN shows that this function's precision is affected by the system's current timer interval, which on our test machine was theoretically the 1.001ms mentioned above. But by default, the Windows timer interval is the maximum (15.625ms on the test machine), unless some application changes the global timer interval.

If you follow IT news, you must have seen this story. It seemed Chromium set the system timer interval very small, so we wouldn't have to worry about it anymore? Don't get too excited: that fix hit us right on the head, because the behavior was changed in Chrome 38. Should we stick with an old Node-WebKit? That was clearly not elegant, and it would have kept us from the higher-performance Chromium versions.

A further look at the Chromium source shows that when a timer exists whose timeout is < 32ms, Chromium raises the system's global timer resolution to get timer precision finer than 15.625ms (see the source). When such a timer starts, HighResolutionTimerManager is enabled; based on the device's current power source, this class calls the EnableHighResolutionTimer function: on battery it calls EnableHighResolutionTimer(false), and on external power it passes true. The implementation of EnableHighResolutionTimer is as follows:

void Time::EnableHighResolutionTimer(bool enable) {
  base::AutoLock lock(g_high_res_lock.Get());
  if (g_high_res_timer_enabled == enable)
    return;
  g_high_res_timer_enabled = enable;
  if (!g_high_res_timer_count)
    return;
  // A high-resolution timer is active, so a timeBeginPeriod call with the
  // opposite setting is currently in effect; end that period and begin a
  // new one at the requested resolution.
  if (enable) {
    timeEndPeriod(kMinTimerIntervalLowResMs);
    timeBeginPeriod(kMinTimerIntervalHighResMs);
  } else {
    timeEndPeriod(kMinTimerIntervalHighResMs);
    timeBeginPeriod(kMinTimerIntervalLowResMs);
  }
}

Here kMinTimerIntervalLowResMs = 4 and kMinTimerIntervalHighResMs = 1; timeBeginPeriod and timeEndPeriod are Windows functions for changing the timer interval. In other words, the smallest timer interval we can get is 1ms on external power and 4ms on battery. Since our loop calls setTimeout continuously, and the W3C specification clamps the minimum interval to 4ms anyway, we could relax: this doesn't affect us much.

Another precision problem

Going back to the beginning: the test results showed that the setTimeout interval was not stable at 4ms but fluctuated constantly, and marks.lrednight.com/test.html#4… showed the interval jumping between 48ms and 49ms. The reason is that the Windows call used to wait for IO events in both the Chromium and Node.js event loops is itself limited by the current system timer interval. The game logic needs requestAnimationFrame (to keep updating the canvas), and because that requires a roughly 16ms timer, it triggers the high-resolution timer requirement and pins the system timer interval at no more than kMinTimerIntervalLowResMs. With the test machine on external power, the system timer interval was 1ms, so the test results carry an error of about ±1ms. If your computer's system timer interval has not been lowered, running test #48 above may show a max of 48 + 16 = 64ms.

Using Chromium's setTimeout implementation, we can keep the error of setTimeout(fn, 1) at about 4ms and the error of setTimeout(fn, 48) at about 1ms. So we had a new blueprint in mind, which makes our code look like this:

var deviation = getMaxIntervalDeviation(bucketSize); // worst-case timer error

function gameLoop() {
  var now = Date.now();
  if (previousBucket + bucketSize <= now) {
    previousBucket = now;
    doLogic();
  }
  if (previousBucket + bucketSize - Date.now() > deviation) {
    // Far from the next bucket boundary: sleep most of the way there.
    setTimeout(gameLoop, bucketSize - deviation);
  } else {
    // Close to the boundary: busy-wait via setImmediate for precision.
    setImmediate(gameLoop);
  }
}

Instead of waiting a full bucket_size, the code above waits bucket_size - deviation, a delay whose error we can bound: with bucket_size = 48 that is a 46ms sleep which, even with the maximum error, still completes in under 48ms. For the remaining time we busy-wait (via setImmediate) to ensure gameLoop runs at sufficiently precise intervals.

We solved the problem to some extent with Chromium, but it was definitely not elegant.

Remember our original requirement? The server code should be able to run directly under Node.js, without the Node-WebKit client. If we run the code above there, the deviation is at least 16ms; in other words, out of every 48ms we busy-wait for 16ms, and CPU usage creeps up.

An unexpected surprise

It seemed odd that nobody had noticed such a big BUG in Node.js. The answer was a surprise: the BUG had already been fixed in v0.11.3. You can also look at the master branch of the libuv code to see the fix: after the poll wait, the timeout that was waited is added to the loop's current time. So even if GetTickCount has not ticked over, the elapsed wait is still accounted for after poll returns, and the timer can expire on schedule.

In other words, the problem we spent half a day on had already been solved in v0.11.3. But our effort was not wasted: even with GetTickCount out of the picture, the poll function itself is still limited by the system timer. One solution would be to write a Node.js addon that changes the system timer interval.

But our game was designed from the start to need no dedicated server: once a client creates a room, it becomes the server. The server code runs inside the Node-WebKit environment, so the timer issue under plain Windows Node.js was not the highest priority, and the solution above was good enough for us.

Wrapping up

With the timer out of the way, our framework implementation went pretty smoothly. We provide WebSocket support (in a pure HTML5 environment) and, for higher performance, socket support with a custom communication protocol (in a Node-WebKit environment). Spaceroom's features were rudimentary at first, but we have gradually improved the framework as new requirements arrived and time allowed.

For example, when we realized we needed to generate consistent random numbers across clients in our game, we added that to Spaceroom. The server distributes a random seed at the start of the game, and the client-side Spaceroom provides a method that generates random numbers from the seed, taking advantage of the randomness of MD5 digests. A sketch of the idea follows.
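This is a minimal sketch of the idea, not Spaceroom's actual API (the class and method names are illustrative): hash the state with MD5, use part of the digest as the next value, and feed the digest back in as the new state, so two clients with the same seed produce identical sequences.

var crypto = require('crypto');

function SeededRandom(seed) {
  this.state = String(seed);
}

// Hash the current state, take 8 hex chars as a 32-bit integer scaled
// to [0, 1); the full digest becomes the next state.
SeededRandom.prototype.next = function () {
  var digest = crypto.createHash('md5').update(this.state).digest('hex');
  this.state = digest;
  return parseInt(digest.slice(0, 8), 16) / 0x100000000;
};

// Two clients constructed with the same seed stay in lockstep.
var rngA = new SeededRandom(42);
var rngB = new SeededRandom(42);
console.log(rngA.next() === rngB.next()); // true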

So far so good. We learned a great deal in the process of writing this framework. If you're interested in Spaceroom, you're welcome to get involved. We're sure Spaceroom will prove useful in many more places.