The symptom

A process on our server exits every time it has been running for more than a day. Since our logging lives only in the Lua layer, nothing is dumped at the moment of the abnormal exit. In my experience there are usually two causes for this. One is that the process enters an infinite loop, drives CPU usage up, and is killed by the system. The other is that it uses too much memory and is killed by the system. We use the Skynet framework, which detects infinite loops, and the log contained no such reports. So I tentatively attributed the problem to a memory leak. To verify the guess, I started the process and watched its memory usage with top. Right after startup the process used 0.7% of memory, and the figure kept growing as it ran. This had to be a memory leak.
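top reports memory for the whole process. As a complement, the size of the Lua heap itself can be sampled from inside the Lua layer with collectgarbage("count"), which returns the memory in use by Lua in kilobytes. The following is only a minimal sketch: heap_kb and the retained table are hypothetical names used for illustration, not part of our server code.

```lua
-- Hypothetical helper: run a full collection first so that only live
-- objects are counted, then return the Lua heap size in kilobytes.
local function heap_kb()
  collectgarbage("collect")
  return collectgarbage("count")
end

-- Simulate a leak: every small table stays referenced from `retained`.
local retained = {}
local before = heap_kb()
for i = 1, 10000 do
  retained[i] = { i }
end
local after = heap_kb()

print(string.format("Lua heap grew by %.1f KB", after - before))
```

Logging such a figure at a fixed interval shows whether the growth comes from the Lua heap or from somewhere else in the process.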

Principles of Memory Leakage

Lua is a garbage-collected language. Memory leaks in it typically happen because a reference to an object is never removed, and they can be very hard to track down, since an object can be referenced from anywhere in the program. Lua's GC uses a mark-and-sweep algorithm. Internally it maintains a tree of objects that starts at a root. When the GC runs, it marks every object reachable from the root, and the sweep phase then reclaims the objects that were left unmarked. For example:

local tmp = {}
local tmp_ref = tmp

The snippet above produces an object tree in which root can reach {} through both tmp and tmp_ref.

After tmp_ref = nil, the tmp_ref branch no longer leads to {}. Root can still reach {} through tmp, so the GC does not reclaim {}. Now set tmp to nil as well: tmp = nil.

The {} object is now unreachable from root and will be collected by the GC.
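The steps above can be observed directly. The sketch below is an illustration, not part of the original investigation: it uses a weak-valued table (the hypothetical probe) to watch the {} object without keeping it alive, and forces a full collection after each assignment.

```lua
-- A weak-valued table does not keep its values alive, so it can be used
-- to check whether an object has been collected.
local probe = setmetatable({}, { __mode = "v" })

-- Hypothetical helper: true while the watched object still exists.
local function is_alive()
  return probe[1] ~= nil
end

local tmp = {}
local tmp_ref = tmp
probe[1] = tmp

-- Drop one of the two references; {} is still reachable through tmp.
tmp_ref = nil
collectgarbage("collect")
local alive_with_tmp = is_alive()

-- Drop the last reference; {} is now unreachable and can be reclaimed.
tmp = nil
collectgarbage("collect")
local alive_after = is_alive()

print(alive_with_tmp, alive_after)
```

The first check reports that the object is still there, and the second reports that it is gone, matching the two tree states described above.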

A so-called memory leak in Lua is therefore a steady growth in the number of objects reachable from root, in other words, a growing number of nodes in the root tree. The way to detect one is to compare the tree at different points in the program's run: find the objects that keep accumulating, trace them back to the code that creates them, and you have located the problem. For example:

local tmp = {}
---- snapshot 1
local tmp1 = {}
----- snapshot 2

Snapshots of the tree taken at the two marked locations look as follows:

snapshot1: root reaches tmp
snapshot2: root reaches tmp and tmp1

snapshot2 has one more node, tmp1, than snapshot1.
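The comparison can be sketched in pure Lua. This is a simplified illustration under stated assumptions, not how a real snapshot tool works: take_snapshot and diff_snapshots are hypothetical helpers, a plain table named root stands in for Lua's real root set, only table values are followed (keys, metatables, functions, and upvalues are ignored), and the snapshot itself holds strong references to what it records.

```lua
-- Record every table reachable from `root`, keyed by the table itself,
-- with a readable path as the value.
local function take_snapshot(root)
  local seen = {}
  local function walk(obj, path)
    if type(obj) ~= "table" or seen[obj] then
      return
    end
    seen[obj] = path
    for k, v in pairs(obj) do
      walk(v, path .. "." .. tostring(k))
    end
  end
  walk(root, "root")
  return seen
end

-- Return the entries present in `new` but missing from `old`.
local function diff_snapshots(old, new)
  local added = {}
  for obj, path in pairs(new) do
    if old[obj] == nil then
      added[obj] = path
    end
  end
  return added
end

local root = { tmp = {} }
local snap1 = take_snapshot(root)   -- snapshot 1
root.tmp1 = {}
local snap2 = take_snapshot(root)   -- snapshot 2

for _, path in pairs(diff_snapshots(snap1, snap2)) do
  print("new object:", path)
end
```

The diff contains a single entry, the table stored under root.tmp1, which is the extra node in snapshot2.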

Testing process

Take our faulty code as an example. The game is turn-based, and all state is reset when a round starts, so I print the object tree at the beginning of every round. Comparing the trees from the start of two consecutive rounds shows which objects were added during the round, and those are the candidates for the leak. The tool is https://github.com/cloudwu/lua-snapshot.git (see its GitHub page for usage). The relevant code is as follows:

local snapshot = require "snapshot"

local old_snap
local function instance_diff()
  if old_snap then
    -- Second and subsequent rounds: compare this round's object tree with
    -- the previous round's to find the objects added during the round.
    local new_snap = snapshot()
    local diff = {}
    for k, v in pairs(new_snap) do
      if not old_snap[k] then
        diff[k] = v
      end
    end
    -- Print all objects that are new in this round
    -- (luadump is a dump helper not shown here).
    luadump(diff, "snapshot diff")
    -- Keep this snapshot as the baseline for the next round.
    old_snap = new_snap
  else
    -- First round: just take the baseline snapshot.
    old_snap = snapshot()
  end
end

After running it, I saw a large number of tf function objects in the diff, and followed that clue to the following code:

function timer.timeout(ti, f)
  local function tf()
    local f = timer.handles[tf]
    if f then
      f()
    end
  end
  skynet.timeout(ti, tf)
  timer.handles[tf] = f
  return tf
end

function timer.cancel(tf)
  timer.handles[tf] = nil
end

This wrapper exists because Skynet's timer cannot be cancelled. To cancel a timer early, call timer.cancel. In fact, timer.cancel has to be called every time: each call to timer.timeout creates a fresh tf function and executes timer.handles[tf] = f, using tf itself as the key that indexes f. Without a matching timer.cancel, timer.handles gains one entry per call that is never removed, and that entry keeps both tf and f alive. Each call leaks only a little memory, but the leak never stops growing.
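One possible way to close the leak, shown here only as a sketch, is to have tf remove its own entry when it fires, so that callers no longer need to call timer.cancel for timers that ran to completion. To keep the example runnable outside Skynet, skynet.timeout is replaced by a stand-in that queues the callback, and run_pending is a hypothetical helper that fires everything queued. Both are assumptions made for illustration and do not reflect how Skynet schedules timers.

```lua
-- Stand-in for skynet.timeout (assumption): queue `func` instead of
-- scheduling it, so the sketch can run without Skynet.
local pending = {}
local skynet = {
  timeout = function(ti, func)
    pending[#pending + 1] = func
  end,
}

-- Hypothetical helper: fire every queued callback once.
local function run_pending()
  local due = pending
  pending = {}
  for _, func in ipairs(due) do
    func()
  end
end

local timer = { handles = {} }

function timer.timeout(ti, f)
  local function tf()
    local cb = timer.handles[tf]
    if cb then
      -- Remove the entry as soon as the timer fires, so it cannot pile up.
      timer.handles[tf] = nil
      cb()
    end
  end
  skynet.timeout(ti, tf)
  timer.handles[tf] = f
  return tf
end

function timer.cancel(tf)
  timer.handles[tf] = nil
end

-- Usage: one timer fires, the other is cancelled before it fires.
local fired = 0
timer.timeout(100, function() fired = fired + 1 end)
local cancelled = timer.timeout(100, function() fired = fired + 1 end)
timer.cancel(cancelled)
run_pending()

print("fired:", fired, "handles empty:", next(timer.handles) == nil)
```

With this change timer.handles is empty again once every timer has either fired or been cancelled, which is the property the original code lacked.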