Write a new article to commemorate the minefields explored during the creation and debugging of the LuaJIT Bridging PATCH in famine.

This article has been basically updated, if there is any new content will be timely supplemented.

Information about the hunger games, direct look at the answers well: https://www.zhihu.com/question/21282202

Details about this plug-in, download and source code, please click: https://paintdream.github.io/DontStarveLuaJIT/

This article used software: OllyDBG 2.01, Microsoft Visual C++ 2008


— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — I’m line — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

0x00 God ship grounded

The fact that I was going to write a PATCH for Famine was not something that was planned in the beginning. Famine launched on Steam and added a creative workshop. There are lots of good, bad, working, unworking, original, and copycat Mods, most of which are well maintained, with save and MOD Settings synchronized to the cloud when you start the game, almost perfect.

The only fly in the ointment, however, seems to be that the game is surprisingly configuration-hungry. N years ago to buy the god ship will play for a long time will be from time to time, so that later incredibly more and more card. Although i7-3610qm +GT650M is still a bit of a mess, at least dota2 matches are still smooth, and will not be so miserable on famine.

It’s interesting to open task Manager and take a peek. When the CPU is stuck, only two cores are running at full capacity and the rest are idle — by pausing and resuming CPU usage, you can conclude that CPU0 is the rendering thread and CPU6 is the script logic thread.

The stutter comes from two things. On the one hand, if you draw more objects, the machine will obviously start to stutter. CPU0 usage drops significantly, indicating that a large number of draw instructions block the rendering pipeline, causing the drawing thread to be blocked for a long time waiting for the rendering to complete. The user’s input feedback is also significantly blocked, presumably the code that reads user input is also executed in this thread (Famine uses DirectInput to read keyboard actions). Famine itself is a Gothic style, using some prefabricated textures to manipulate the Billboard map to create a pseudo-2D image, rendering itself should not be too much pressure. However, considering that Instanced Draw is not required in the low ES version, it is likely that each object corresponds to a Draw call, and a large amount of data is repeatedly committed when each object is drawn, resulting in bandwidth exhaustion. But I haven’t done any careful research on this, it’s just speculation.

On the other hand, if the logic of the script is too complex, it will also become significantly stagnant. At this point, CPU6 will be fully occupied, and CPU0 will be less occupied due to synchronization of logical and render frames. This is a relatively serious problem, as many MODs are loaded for fun, and the new game features added to Shipwrecked DLC are cpu-eating, causing the lag to be far more pronounced than in the original OR Reign of Giants DLC. So on many forums the only solution is to stop running so many mods and get a better computer.

This article attempts to solve the problem caused by script logic.

0x01 Too much can’t be undone

The good news is that Famine uses lua 5.1.4 as a script engine that I’m familiar with (but actually I would have preferred lua 5.3), and comes with all the Lua scripts (about 180,000 lines, not counting DLC), and all the logic is implemented in the scripts.

After a few hours of research, the authors seem to be using it just as a tool, not as a feature of Lua. To make it easier and more familiar to write, when asynchronous operations are required, most of the requirements are handled through c-level support for clock callbacks (often in conjunction with global task queues). This design also caused me trouble because it would break the entire logic from it, printing the call stack when debugging would only see the logic after the current event was triggered.

So why not coroutine? I’ve searched the source code, and there are three places where Coroutine is packaged, two of which are third-party libraries for the network that do things like email and have nothing to do with game logic. The only thing the scheduler does is create a coroutine for each new task as it adds tasks, and then enumerate all the tasks to execute one by one during scheduling.

However, it seems that the coroutine is simply there to hold a separate stack of execution — the scheduler’s operations on it are simply fetched from the queue and then deleted. Yield() is used in only nine infrequently executed areas of the entire code. Personally, the use of this coroutine is questionable — after all, coroutine itself has a little overhead, and most asynchronous requirements are done with callbacks.

From the lua code architecture, this script itself realized a simple object-oriented framework similar to C++, module packaging is more capricious, coupling degree is also a little high, often appear in the bottom framework code to add some code to control the upper level of the implementation of the logic of the code. The two DLC’s it released were not expansions developed on the original code framework, but directly copied the code for modification, probably not intended to have a DLC when it was originally written…

It can be seen that the game itself should be directly from a simple prototype step by step iteration, although in the process of development there has been abstraction and encapsulation, but the accumulation is difficult to return, the general structure has been fixed, there is no room for simple optimization. Scripts are too much work and not much parallelism, and trying to optimize existing code directly has become a dead end for me.

So replacing the Lua engine with LuaJIT has become one of the few options available to make applications run less sluggish. Although from the truth, this can only be regarded as a palliative, but at least some play ah.

0x02 Branches grow out of branches

But there is more to it than meets the eye. I initially thought I’d just switch to Lua51. DLL for famine, but I couldn’t find the lua51. DLL library in any of the installation directories. The result is that LUA is wrapped in another DLL and loaded at startup by dontstarve_steam.exe. Finding out that the Lua engine was compiled directly into the main program as a static library or source code is the worst possible outcome.

The problem is that exe usually doesn’t need to export functions (look closely at dontstarve_steam.exe), so locating the Lua API in exe is the first problem to solve.

The layout of the LUa API in EXE is completely different from that in DLL: the compiler can cut out functions and constants that are not referenced, and it can reorder functions. It is not possible to locate these functions publicly or by accident. So there are three ways left: directly change EXE, hard coding and feature code search.

It may seem simple to change the EXE directly or hardcode the RVA of the corresponding function, but if the EXE is updated, you have to release a new version and manually search and correct the address using the signature code. On second thought, use the signature code, and try to make it as non-invasive as possible, so that it effectively accommodated famine’s own updates.

0x03 Probe plan

Feature code search requires finding the feature code corresponding to each function first. My idea is to compile an original LUa 5.1.4 DLL, set the Lua API to export, so that I can get the address of each API, and then use the binary instruction at the beginning of the function as the characteristic code to search in the EXE. Then, you need to match the search source address with the name, and load luajit. DLL to pass kernel32.dll! GetProcAddress takes the destination address and uses an inline hook to direct the execution flow from the source address to the destination address. Because these hooks do not need to execute the original function, it is not necessary to count the number of bytes in the original function header that are retained by the springboard.

The algorithms used by different compilers to generate object code are quite different, and even the same family of compilers can change as versions change. So what should be used to compile the original Lua DLL? In keeping with the famine master program, of course. OllyDBG tells us the answer: Microsoft Visual C++ 2008




Before I start coding, however, the question is: How do I get my code to execute?

It was obvious that if we wanted to make a non-invasive Patch, we either had to write an extra launcher, or we had to do some trick to get our code executed when the game started. The official MOD mechanism allows custom mods to execute Lua code when the game is initialized, however the Lua engine is already built and it is too late to replace it.

So I looked at the DLLS that the EXE relied on to see if there was a strategy for shifting gears. Write a puppet DLL with the same name and exported functions as the target DLL, and place it in the EXE directory so that it can be loaded when the EXE starts. Hand draw again after loading kernel32. DllLoadLibraryA/W target DLL loading, and the exported functions through to target the corresponding functions in the DLL. In the early years, some viruses disabled anti-virus software by placing a fake WS2_32.dll in the installation directory of the anti-virus software, so that the anti-virus software would crash or be hijacked upon startup. Note that some system core DLLS are not affected by this, such as kernel32.dll, ntDLl.dll, etc.)

Most of the DLLS in the famine main program directory are written by C++, and the names of the exported functions are name decoration, which will be troublesome to deal with. In addition, the number of derived functions is also very large, which is not cost-effective.

After a long search, the best choice is dinput8.dll, which is required by DirectInput. It was chosen because dontStarve_steam. Exe imports only one function DirectInput8Create from it, which is used to create a DirectInput device object. Since this object is wrapped in COM, the DLL no longer needs to export its member functions. Importing entries is therefore much cleaner. In addition, as it is a DLL in the system directory, replacing it does not need to replace the original DLL, but only needs to be placed in the EXE directory to be loaded, which is a perfect choice for non-invasive PATCH.

As mentioned above, if you take over dinput8.dll, then you need to implement DirectInput8Create as well, otherwise the game will start with no entry. This is easy:

typedef HRESULT(WINAPI *PFNDirectInput8Create) (HINSTANCE hinst. DWORD dwVersion. REFIID riidltf. LPVOID *ppvOut. LPUNKNOWN punkOuter);

static PFNDirectInput8Create pfnDirectInput8Create = NULL;
HRESULT WINAPI DirectInput8Create(HINSTANCE hinst. DWORD dwVersion. REFIID riidltf. LPVOID *ppvOut. LPUNKNOWN punkOuter) {
	return pfnDirectInput8Create(hinst. dwVersion. riidltf. ppvOut. punkOuter);
}

DllMain:
	TCHAR systemPath[MAX_PATH];
	: :GetSystemDirectory(systemPath. MAX_PATH);

	HMODULE hInput = : :LoadLibrary(CString(systemPath) + _T("\ \DINPUT8.DLL"));
	if (hInput ! = NULL) {
		pfnDirectInput8Create = (PFNDirectInput8Create): :GetProcAddress(hInput. "DirectInput8Create");
}
Copy the code

In the above code, you need to load the real dinput8.dll in DllMain and get the DirectInput8Create address before calling in the relay code. It is important to note that it is best not to include any header files about the DirectX SDK, which tend to specify the API as an import and we need to export the relay function of the same name.

If the type is not defined, don’t worry, just write an equivalent length. I’ll show you a simpler way to make a relay function springboard later.

So, here’s the question of how to search. Let’s compile the original lua51. DLL and select the Release version.




If you are using kernel32.dll! After loading the original lua51. DLL, LoadLibraryA/W directly searches the binary code starting with the API function and only finds a small number of functions. This is because the code in the DLL needs to be repositioned based on the information in the.reloc section, where the binary values need to be positioned will change, so pure MEMcMP won’t work.

Here, I directly adopted a relatively simple but time-consuming strategy: in the search, disassembly engine XDE32 was used to parse the instruction format one by one. When finding the instruction that needed to be repositioned, I read the value at the address instead of the address itself as the feature code.

while (target < entry.instr + INSTR_SIZE) {
	xde_disasm((BYTE*)c. &instr);
	int len = instr.len;
	if (len = = 0)
		break;

	if (instr.opcode = = 0x68 || instr.addrsize = = 4) {
		// read memory data
		PVOID addr = instr.opcode = = 0x68 ? *(PVOID*) (c + 1) : (PVOID)instr.addr_l[0];
		char buf[16];
		memset(buf. 0. sizeof(buf));
		if (addr ! = NULL && : :ReadProcessMemory(: :GetCurrentProcess(), addr. buf. 4. NULL)) {
			entry.stringList.push_back(std: :make_pair(addr. std: :string(buf)));
		}

		BYTE temp[16];
		if (instr.opcode = = 0x68) {
			memcpy(temp. c. instr.len);
			*(PVOID*) (temp + 1) = *(PVOID*)buf;
		} else {
			instr.addr_l[0] = *(long*)buf;
			xde_asm(temp. &instr);
		}
		memcpy(target. temp. len + target > entry.instr + INSTR_SIZE ? entry.instr + INSTR_SIZE - target : len);
	} else {
		memcpy(target. c. len + target > entry.instr + INSTR_SIZE ? entry.instr + INSTR_SIZE - target : len);
	}

	c + = len;
	target + = len;
	if (!edge)
		validLength + = len + target > entry.instr + INSTR_SIZE ? entry.instr + INSTR_SIZE - target : len;
	entry.lengthHist[len]++;

	if (*c = = 0xCC) {
		edge = true;
	}
}
Copy the code

As for the matching algorithm, in order to avoid the overall dislocation caused by the slight length difference, I did not directly memCMP, but used the dynamic programming algorithm with the longest common substring, with both time and space complexity O(N * N). For short feature sequences, the performance penalty is acceptable.

static int CommonLength(const BYTE* x. int xlen. const BYTE* y. int ylen) {
	int opt[INSTR_SIZE + 1] [INSTR_SIZE + 1];
	memset(opt. 0. sizeof(opt));

	for (int i = 1; i < = xlen; i++) {
		for (int j = 1; j < = ylen; j++) {
			if (x[i - 1] = = y[j - 1])
				opt[i] [j] = opt[i - 1] [j - 1] + 1;
			else
				opt[i] [j] = opt[i - 1] [j] > opt[i] [j - 1] ? opt[i - 1] [j] : opt[i] [j - 1];
		}
	}

	return opt[xlen] [ylen];
}
Copy the code


The search scope should be set to the.text section of dontstarve_steam. Exe. In fact, after several versions, I found that.text + 0x170000 will have the corresponding lua API function. Text + 0x170000 is the end of the.text. (Of course, there is a risk in doing so, but I have been too lazy to change. In case there is a big update of famine one day, it will probably be wrong.

Once you’ve found the target function, mark it up and connect it to Luajit. Here is the code for the inline hook:

static void Hook(BYTE* from. BYTE* to) {
	// prepare inline hook
	unsigned char code[5] = { 0xe9. 0x00. 0x00. 0x00. 0x00 };
	*(DWORD*) (code + 1) = (DWORD)to - (DWORD)from - 5;
	DWORD oldProtect;
	: :VirtualProtect(from. 5. PAGE_READWRITE. &oldProtect);
	: :memcpy(from. code. 5);
	: :VirtualProtect(from. 5. oldProtect. &oldProtect);
}
Copy the code

Personally, I prefer 0xE8/0xE9 hooks, which require no register changes and are short in length. PUSH+RET, FF 15/25 CALL/JMP DWORD PTR [ADDR]


0x04 Get burned

After writing the code, compile the DLL and rename it dinput8.dll. DLL (lua51. DLL), lua51. DLL (lua51. DLL), lua51. DLL (lua51. DLL), lua51. DLL (lua51. DLL)

Copy these three files to famine’s bin directory and start the game. (Time for a miracle!!)

No surprises, BOOM! The program crashed!!

Guest officer sorry, in front of the yanglian asperse wrote a lot, it seems that there is a great possibility and no what to use.

But if you think about it, it’s not always a tragedy, at least it proves that we’re capable of using DINPUT8 puppets to mess things up.


But why did it crash? There has to be an explanation!! Through careful investigation (in fact, is playing a LOG), found that in fact, some functions did not Hook.

But why didn’t they Hook? After all that effort, why is the result still wrong?


0x05 No secrets under the debugger

I had no choice but to debug it with OllyDBG. Since the exe won’t start the game by clicking it directly, it will only start STEAM and then use STEAM to start the game again, so I had to force a getchar() in dinput8.dll’s DllMain and start the game in STEAM using OllyDBG for additional debugging.

Press ALT+E to open the list of modules, hit dontstarve_steam. Exe and launch Search for => All Inter-modular calls. You’ll see that most of the functions have been successfully hooked.




Check HOOK functions one by one and find that as long as the target is searched, there is basically no big problem. The problem is that many functions are not found, and the paradox is that they can not be found manually by using OD’s powerful Search Command function to Search, relax the Search matching conditions.

So…

Could it be…

Those functions don’t exist… ?

All right. That conclusion turned out to be half right.

There are some APIS that are not available in EXE because exe does not use them and the compiler erases them when connecting. I then deleted their export entries in Lua51.dll.

What about the other part? It’s missing. By looking for string references to progressively locate the relevant instructions (I won’t go into detail), I discovered something strange:

These apis are very different from those in Lua 5.1.4 DLL at the binary level!









0x06 Flywheel with chain

If the author of Famine had changed the implementation of these functions, then things would be pretty stiff — I’d have to make the equivalent changes in Luajit. But a closer look reveals that things are not so bad.

There are two main reasons for the inconsistency:

1. Some API calls to internal functions are inline

2. Famine author removed part of API about luaC_checkGC call





For 1, refer to the following figure:

LuaO_pushfstring is a function called by lua_pushfstring. In the famine main program, this call is inline, causing it to be inconsistent with the lua_PUSHfstring binary in lua51.dll that I compiled.


For 2, refer to the following figure:

The famine authors removed some luaC_checkGC calls from the code so that some functions no longer trigger garbage collection. Is the idea to avoid certain bugs? (for example, objects held at level C are invalidated because they do not maintain references) or are you trying to reduce lag? I don’t know. Strangely, however, the online version of Famine (see below for more on the online version) has changed in different locations than the standalone version. (The luaC_checkGC_ I added in the figure above is an empty macro, equivalent to removing the GC call)

After recompiling Lua51.DLL with these changes, our remade chain finally meshed with the flywheels of the original Famine program.

0x07 Pack Bomb

Some values of luajit constants are too small (such as the maximum number of constants in function string and the maximum number of parameters in parameter list) according to OllyDBG LOG trace. Change the values to larger values, which will not be detailed here. (Famine is a memory killer)

After a long time, our patch was barely running. Successfully through the loading interface, the main interface successfully started! Long lost background music sounded ~




Try the basic features first, click MODS first to see if they hang…

Murphy’s law says that if a place feels wrong, it will. A plugin crashed.

The built-in console can also be opened by pressing the ‘ ‘key in the game, but here the LOG displays a limited number of lines and does not scroll. So open up OllyDBG’s LOG and see what happens:




(These logs are OutputDebugString output and require OllyDBG 2.0 or later to support displaying them in the debugger.)

Mods. Lua error at line 42 when loading the wing language MOD indicates that the variable ‘arg’ is not defined.

So check it out:

local runmodfn = function(fn.mod.modtype)
	return (function(.)
		if fn then
			local status. r = xpcall( function(a) return fn(unpack(arg)) end. debug.traceback) - < < 42
			if not status then
				print("error calling ".modtype." in mod ".ModInfoname(mod.modname).": \n".r)
				ModManager:RemoveBadMod(mod.modname.r)
				ModManager:DisplayBadMods(a)
			else
				return r
			end
		end
	end)
end
Copy the code

The problem is obvious, famine authors use the old syntax for variable argument lists. Before 5.1, you could use arG to represent {… } this table, arg[I] can be used to extract the i-th item of a variable parameter. This syntax has since been deprecated by default, leaving only one macro to enable compatibility. LuaJIT is completely incompatible with this notation, and on a closer look at the code, it even removes the mask bit required to implement ARG compatibility and uses that bit elsewhere.

At this point in the analysis, I thought it would be easy to just ask the user to modify the file, after all, adding the following line would solve the problem:

local arg = {.}
Copy the code

So I wrote a detailed description, posted the program and source code on Github, and posted it on tiebar to collect user feedback.


0x08 Smoking Patch

The first wave of feedback has been mixed. Some users successfully installed the patch and report that it has had a noticeable effect, especially on Shipwrecked DLC with a large number of Mods. However, there are still a number of bugs that don’t boot correctly, and the error reports are quite odd. There are even a lot of problems with the game not even seeing what it starts.

What struck me most was that the requirement to modify the source code file was too difficult for the average user. Many users can’t find the file, don’t know what tool to use to open it, can’t read English and don’t know how to start, and don’t even know how to punctuate Chinese. Many third party mods also use this old ARG syntax. To draw a direct conclusion from my suggestions for modifying mods. Lua, there are very few players to correct the syntax problems in third party mods — after all, Tieba is mainly for entertainment, unlike Zhihu, where there are many programmers. This is difficult.

What to do? I had to make the compatibility patch myself.

After an afternoon of working on luajit Parser, I abandoned the idea of adding compatibility to the same design as in the Lua source code. On the one hand, as mentioned above, there is no available mask bit, and the bit width of the corresponding FLAG variable needs to be increased in order to run. On the other hand, lua source code for this function is implemented in VM interpretation, and Luajit does not have an interpreter, it directly runs native instructions. However, it is not easy to understand the implementation mechanism of JIT COMPILER and insert this function accurately.

But after spending another afternoon with the Parser, I realized that there’s a simpler solution. Local arg = {… local arg = {… }. There are two ways to do this:

1. Detect the source text, do a simple process, locate it where the function definition is, and insert a line.

Insert syntax nodes directly at AST generation time.

Method 1 is less difficult, but requires a bit of parsing, such as passing comments, skipping strings, checking statements defined by functions, and so on. It’s easier to make mistakes without thinking it through.

To think it through, you need a simple Parser. Going straight to Method 2 and borrowing Luajit’s own Parser is the best option. To make exploration more purposeful, I wrote a function like this:

function test(.) local arg = { . } end
Copy the code

Then run luajit to see local arg = {… } which path does this sentence take to generate the AST? After an hour of debugging, I finally got it.

Open lj_parse.c and navigate to:

static void parse_chunk(LexState *ls)
{
  int islast = 0;
  synlevel_begin(ls);
  add_argstmt(ls); // HERE!!!!!
  while (!islast && !parse_isend(ls->tok)) {
    islast = parse_stmt(ls);
    lex_opt(ls. '; ');
    lua_assert(ls->fs->framesize > = ls->fs->freereg &&
	       ls->fs->freereg > = ls->fs->nactvar);
    ls->fs->freereg = ls->fs->nactvar;  /* Free registers after each stmt. */
  }
  synlevel_end(ls);
}
Copy the code

Add a line at the tag that calls add_argstmt and write this function:

static void add_argstmt(LexState* ls)
{
  ExpDesc e;

  if (ls->fs->flags & PROTO_VARARG) {
    var_new_lit(ls. 0. "arg");
// nexps = expr_list(ls, &e);
    {
      synlevel_begin(ls);
  // expr_unop(ls, &e);
      {
    // expr_simple(ls, v);
        {
      // expr_table(ls, v);
          {
            ExpDesc key. val;
            FuncState *fs = ls->fs;
            BCLine line = ls->linenumber;
            BCInsLine *ilp;
            BCIns *ip;
            ExpDesc en;
            BCReg base;

            GCtab *t = NULL;
            int vcall = 0. needarr = 0. fixt = 0;
        uint32_t narr = 1;  /* First array index. */
        uint32_t nhash = 0;  /* Number of hash entries. */
            BCReg freg = fs->freereg;
            BCPos pc = bcemit_AD(fs. BC_TNEW. freg. 0);
            expr_init(&e. VNONRELOC. freg);
            bcreg_reserve(fs. 1);
            freg++;

            vcall = 0;
            expr_init(&key. VKNUM. 0);
            setintV(&key.u.nval. (int)narr);
            narr++;
            needarr = vcall = 1;

            // expr(ls, &val);
            {
              checkcond(ls. fs->flags & PROTO_VARARG. LJ_ERR_XDOTS);
              bcreg_reserve(fs. 1);
              base = fs->freereg-1;
              expr_init(&val. VCALL. bcemit_ABC(fs. BC_VARG. base. 2. fs->numparams));
              val.u.s.aux = base;
            }

            if (expr_isk(&key)) expr_index(fs. &e. &key);
            bcemit_store(fs. &e. &val);
            fs->freereg = freg;

            ilp = &fs->bcbase[fs->pc-1];
            expr_init(&en. VKNUM. 0);
            en.u.nval.u32.lo = narr-1;
        en.u.nval.u32.hi = 0x43300000;  /* Biased integer to avoid denormals. */
            if (narr > 256) { fs->pc--; ilp--; }
            ilp->ins = BCINS_AD(BC_TSETM. freg. const_num(fs. &en));
            setbc_b(&ilp[-1].ins. 0);

        e.k = VNONRELOC;  /* May have been changed by expr_index. */


            ip = &fs->bcbase[pc].ins;
            if (!needarr) narr = 0;
            else if (narr < 3) narr = 3;
            else if (narr > 0x7ff) narr = 0x7ff;
            setbc_d(ip. narr|(hsize2hbits(nhash)<<11));
          }
        }
      }
      synlevel_end(ls);
    }
    assign_adjust(ls. 1. 1. &e);
    var_add(ls. 1);
  }  
}
Copy the code

The code above is based on a rough execution path, and some judgments may not trigger, or it can be shorter. But it doesn’t have much impact.

Recompile Luajit, this time successfully, the MOD can be enabled.


Fun fact: there aren’t many places in famine code where arGS are used without being declared. Most of it is directly used… In order to extract the specified parameter, it is obvious that he did not know that the select function can be used :(modutil.lua)

env.AddLevel = function(.)
	arg = {.}
	initprint("AddLevel". arg[1]. arg[2].id)
	require("map/levels")
	AddLevel(.)
end
env.AddTask = function(.)
	arg = {.}
	initprint("AddTask". arg[1])
	require("map/tasks")
	AddTask(.)
end
env.AddRoom = function(.)
	arg = {.}
	initprint("AddRoom". arg[1])
	require("map/rooms")
	AddRoom(.)
end
Copy the code

(Then the sentence arg = {… } a global variable arg was actually declared/modified.

So why not use mods. Lua… To deliver? The reason is probably that… You cannot cross the closure boundary (that is, an anonymous function as an argument to an XpCall) to become a upvalue, only compatible versions of ARGs can. Developers had no choice but to use arg, but forgot to declare local arg = {… }.

Since compatibility mode is on, this minor issue does not cause an exception, so the code remains. (I guess)





0x09 Please write normal words

When we look at the crashes reported by players, we find that many of them seem to be string errors, with some cases showing the famine crash screen, while others simply flash back.

The crash screen looks like this:

Although it appears that the error was made in the original Famine script, it was actually caused by an old Chinese package. Its Chinese character string looks like this:

#: STRINGS.MODS.VERSIONING.OUT_OF_DATE
msgid "Mod \"%s\" is out of date. The server needs to get the latest version from the Steam Workshop so other users can join."
msgstr "Mod\ '% \' has been updated and the server needs to get the latest version from creative Workshop for other users to join."
Copy the code

Although the string escape scheme in Lua is basically the same as in C, MODs and Chinese writers often have no experience with C and are prone to writing errors. Chinese quotes in the above code that should not have been escaped were backslashes, and the S in the format placeholder %s was forgotten.

There are also mods that heavily use single backslashes to represent the backslash itself without proper escape. The original Lua 5.1.4 compiler ignores this error. The official Lua documentation does not specify what the user will do if they provide an unsyntactically inclined box combination.

So LuaJIT freezes: for reasons of rigor, LuaJIT doesn’t allow ungrammatical backslashes, otherwise an error will be reported. And this error just hung on the error screen.

To make matters worse, not all of the code in the mod is executed with the correct Sandbox protection (see an example below). The info file in Lua format, which is a data file, is simply a Lua table with no functions. However, if it contains such a string, execution will simply hang. The built-in error screen will not pop up but will flash back.

This is a serious design problem. Even more, it doesn’t require the offending mod to be activated (because the offending code is in info and hangs when the mod cache is loaded). Since the mod itself is updated within the game by syncing with steam Creative Workshop, once a mod makes such an error, there is no opportunity to unsubscribe from the Workshop to resolve the issue, only to manually delete the mod file. This is a serious challenge for the average user.

However, if I were to explain so many reasons to test users, the content might not be of much use. Because the player will think: run fine, use your patch and blink back! You must have broken it!

Indeed, others do not understand the principle, you say these words are actually the equivalent of shaking the pot.

No matter how heavy the pot is, you have to carry it. Change the code to accommodate this problem… Open the lj_cparse. C: 244

if (lj_char_isdigit(c)) {
  c - = '0';
  if (lj_char_isdigit(cp_get(cp))) {
    c = c*8 + (cp->c - '0');
  } else {
	  c = '\ \';
  }
  cp_save(cp. (c & 0xff));
  continue;
}
Copy the code

Only illegal single slashes can be enforced.

LuaJIT also reports an error if there is no valid format tag after %. There is no way to change it to the default %s.

Open the lj_strfmt. C: 70

	/* Parse conversion. */
c = (uint32_t)*p - 'A';
if (LJ_LIKELY(c < = (uint32_t) ('x' - 'A'))) {
	uint32_t sx = strfmt_map[c];
	if (sx) {
		fs->p = p+1;
		return (sf | sx | ((c & 0x20) ? 0 : STRFMT_F_UPPER));
	}
}

c = 's'-'A'; // My lazy nature is highlighted here.
{
	uint32_t sx = strfmt_map[c];
	if (sx) {
		fs->p = p+1;
		return (sf | sx | ((c & 0x20) ? 0 : STRFMT_F_UPPER));
	}
}
Copy the code

It’s been compromised like this! Any questions?!


0x0A Ok

However, things are far more bizarre than I imagined. After fixing this problem for a long time, I found that another mod crashed because of the string.

It crashes because it uses LuaJIT’s extended escape character \u. This symbol in LuaJIT is due to support for direct UNICODE internals. However, the author of this MOD meant \umbrella or something…

Good. I lost. TAT

This time I have just removed LuaJIT’s escape extensions that are different from the original. Locate the lj_lex. C: 216

#if 0
case 'x': /* Hexadecimal escape '\xXX'. */
	c = (lex_next(ls) & 15u) << 4;
if (! lj_char_isdigit(ls->c)) {
if (! lj_char_isxdigit(ls->c)) goto err_xesc;
	  c += 9 << 4;
	}
.
	break;
case 'z': /* Skip whitespace. */
	lex_next(ls);
	while (lj_char_isspace(ls->c))
	  if (lex_iseol(ls)) lex_newline(ls); else lex_next(ls);
	continue;
#endif
Copy the code

Heart good tired.


0x0B Prepare for a fire

After the new version was released again, people expressed satisfaction. Only a few users of the older version (at least a year old) reported that the game wouldn’t start. Out of support for the original, I will not deal with this part of the requirements. (Legitimate games are updated with Steam and will always be up to date unless users intentionally cancel them.)

So I thought about making a famine MOD online. The online version of Famine is based on Reign of Giants DLC, but is a standalone release with a number of changes for the multiplayer experience. As mentioned earlier, the lua engine code embedded in the online Famine main program is slightly different from the standalone version (mainly where GC calls are made). I use the precompiled macro DST to distinguish between the two versions and name the generated Lua DLL lua51ds.dll.

Everything is all right. Copy the generated files to the bin subdirectory of the online installation directory. Run the game.

And then it collapsed.

Flash back. I don’t even have a screen left. OllyDBG load the game again and see what happens:

Circled in the lower right corner, you can see that the program stops on the Assert macro that is still enabled in a Release set by the developer.

The call stack in the upper right indicates that this is an error that occurred when Luajit called back a function related to the original program.

Looking at the LOG window in the lower left corner, you’ll be prompted to see that the second parameter to RunInSandboxSafe does not provide the correct value. This can happen in Lua either if the value of the supplied argument is indeed nil nil, or if the call does not provide enough arguments.

I then searched the code for where RunInSandboxSafe was called:

To my surprise, there was only one place where two parameters were properly used, and all the rest were missing. The missing argument is an error-handling routine by name that is passed into xpCall and is called whenever an error occurs.

But a look at the RunInSandboxSafe source code seems to have some problems:

function RunInSandboxSafe(untrusted_code. error_handler)
	if untrusted_code:byte(1) = = 27 then return nil. "binary bytecode prohibited" end
	local untrusted_function. message = loadstring(untrusted_code)
	if not untrusted_function then return nil. message end
	setfenv(untrusted_function. {} )
	return xpcall(untrusted_function. error_handler )
end
Copy the code

Error_handler is not checked for validity, so where did “got nil, function expected” come from?

The last xpcall is a tail call. Tail calls are essentially equivalent to JMP instructions, so when you jump to xpCall for execution, the current stack still displays RunInSandboxSafe.

Therefore, it is xpCall that strongly complains that the error_handler passed in is empty. Interestingly, in lua5.1.4, there is no check for this parameter, and only attempts to call error_handler will fail if xpcall is executing code that already has an error. LuaJIT, on the other hand, took precautions and died.

The interesting thing here is that the famine author uses ASSERT to ASSERT the result of the lua_call when RunInSandboxSafe is executed. In general, only errors in the program logic (that is, the programmer’s own POTS) should be used for ASSERT assertions, whereas RunInSandboxSafe executes strings from an external MOD/ archive. This is exception handling in the program logic and should go into exception handling (such as display error dialog). So this is where the few mods cause the program to fail to start.


In earlier versions of PATCH, I asked users to modify this file and add a line:

error_handler = error_handler or function (str) print("Klei, you missed this line: " . str) end
Copy the code

Later, for well-known reasons, I integrated a modified version of this function directly into Luajit.dll.





0 x0c YaoYinZi

But the online version has some magical features. By default, players can choose to host themselves and have their friends connect to them. On the console player’s computer, all of the game simulation is done in a single famine session. But if you enable caves, you add a parallel world.

In the standalone version, the player can only exist in one world at a time. So if the player jumps into a cave, just pause and save the above-ground part of the world, and then resume the underground world and synchronize the time.

The online version, however, cannot do this because some players are allowed to be on the ground and some are allowed to be underground at the same time. Since the original framework was designed for the standalone version, it became extremely rigid.

So the game writer thought about it for a long time and came up with a perfect idea. Anyway, we need to set up some independent servers to facilitate players to log in (since most players don’t have public IP, their own hosts can only be seen in the LAN, so Klei official will have a number of independent servers for players to log in. You can also install dedicated Servers to build your own VPS.

Specifically, if you choose to create caves when creating worlds, two additional processes will be created each time the host starts the archive, one as the overworld server and the other as the underworld server. Then the host and all its peers on the LAN are connected to these two servers. Ultimately, for consoles, there are three game emulators running at once. Therefore, when adding caves, the game asks the player if he or she has decided to subject his computer to triple computing.

Of course, with dedicated servers, you can host both servers on another computer (usually a purchased VPS), which is much less stressful locally. However, wan gaming is much more stuck than LAN due to the poor optimization of network transmission.

What does all this have to do with what we’re doing now? Of course it does:

The two enabled servers are not much different from the client, but the author created a separate EXE named dontstarve_steam_nullRenderer.exe.

According to the name, this is a shrunken version of the Famine main program, with no rendering, to take the pressure off the computer.

The other thing was not written in the face, but it was a solid punch.

It also does not need to read user input.

So it doesn’t rely on dinput8.dll.

Jiang.





So what do you do? Is dinput8.dll, our perfect injection point, not working?

Yes. If the user enabled the cave or used a dedicated server on a dedicated server, our PATCH would not load.

What a tragedy. After a long look at the online version, it seems that there is no better choice of injection points, the only remaining are ws2_32.dll and WINMM.

Although the famine main program imports only two functions from ws2_32.dll, it has been my observation that just implementing two functions in the puppet is not enough. The reason is that steamapi.dll itself relies on ws2_32.dll and imports more functions that you have to implement.

Fortunately, the Winmm. DLL dependencies are simpler, the only problem is that there are more functions. But on second thought, it wasn’t that much trouble:

#define ENTRY(name) \
static void* pfn##name = NULL; \
__declspec(naked)void name() { \
	_asm jmp pfn##name \
}

#define INIT(name) \
pfn##name = ::GetProcAddress(hOrg, #name);

ENTRY(mciExecute)
ENTRY(CloseDriver)
ENTRY(DefDriverProc)
ENTRY(DriverCallback)
. // omit the middle
ENTRY(waveOutUnprepareHeader)
ENTRY(waveOutWrite)
ENTRY(wid32Message)
ENTRY(wod32Message)


void Init(a) {
  TCHAR sysPath[MAX_PATH * 2];
  : :GetSystemDirectory(sysPath. MAX_PATH);
  _tcscat(sysPath. _T("\ \WINMM.DLL"));

  HMODULE hOrg = : :LoadLibrary(sysPath);
  if (hOrg ! = NULL) {
    INIT(mciExecute)
    INIT(CloseDriver)
    INIT(DefDriverProc)
    INIT(DriverCallback)
    .
    INIT(waveOutUnprepareHeader)
    INIT(waveOutWrite)
    INIT(wid32Message)
    INIT(wod32Message)
  }

  : :LoadLibrary(_T("DINPUT8.DLL"));
}

BOOL WINAPI DllMain(HINSTANCE hInstance. DWORD ul_reason_for_call. LPVOID reserved) {
  switch (ul_reason_for_call) {
    case DLL_PROCESS_ATTACH:
    Init(a);
  // system("pause");
    break;
  }

  return TRUE;
}
Copy the code

We don’t need to define the springboard functions exactly as we wrote them, since we’re not going to do that, we can just use a JMP instruction. The arguments and stuff are already filled in by the caller. Note that the __declspec(naked) modifier is used to avoid generating C skeleton code.

The new WINMM. DLL is a good introduction to the server. Its only function is to forward calls to WINMM. DLL and load dinput8.dll at initialization.





0x0D Evil spirit

In theory, the online version of famine and standalone should not be much different, the hole is almost filled (one of the three life illusion). I played on my own machine and tried it a few times by myself, and everything was fine.

But when someone else gets in, the problem starts all over again. According to the post bar friends feedback, sometimes the client characters can not move; Exit will leave a dark shadow, re-entry can not enter, unless restart the server.

So I opened the source of famine Lua script, looking for the relevant server logic. Found that most communication is through a simple RPC mechanism to implement (scripts/networkclientrpc. Lua). Each request has an independent RPC code. The client packages the code and parameters and sends them to the server, which unpacks them and executes them. (Incidentally, due to the limitations of Lua 5.1.4, this packaging process is inefficient — lua code for a table definition is generated from the data and then executed again to get the data when unpacking. In Lua 5.3, string.dump can be packaged in binary mode perfectly. For this purpose, I combine the ideas of LZ series compression algorithm, write a simple module, the code is here (core.encode) : github.com/paintdream/…

This doesn’t seem to be a problem, so why can’t the client characters move around? I have no choice but to open two game debugging.

I run games as a server on my work computer and clients on the Surface Pro 3. Look what the hell happened.

The problem mentioned by the bar friend was indeed repeated. In addition, I found that even if I could successfully enter the host computer, all kinds of behaviors were very abnormal, and I could not even walk (it would be bounced back). So this is weird.

So I changed some RPC Handler code and added some LOG statements to see if there were any exceptions. The findings:

Some calls take arguments that are obviously calling another function!

Here’s an example:

    RPC_HANDLERS.DoWidgetButtonAction = function(player. action. target. mod_name)
        if not (checknumber(action) and
                optentity(target) and
                optstring(mod_name)) then
            print("Current RPC CODE = " . CURRENT_RPC_CODE)
            printinvalid("DoWidgetButtonAction". player)
            return
        end
        local playercontroller = player.components.playercontroller
        if playercontroller ~ = nil and playercontroller:IsEnabled(a) and not player.sg:HasStateTag("busy") then
            if mod_name ~ = nil then
                action = ACTION_MOD_IDS[mod_name] ~ = nil and ACTION_MOD_IDS[mod_name] [action] ~ = nil and ACTIONS[ACTION_MOD_IDS[mod_name] [action]] or nil
            else
                action = ACTION_IDS[action] ~ = nil and ACTIONS[ACTION_IDS[action]] or nil
            end
            if action ~ = nil then
                local container = target ~ = nil and target.components.container or nil
                if container = = nil or container.opener = = player then
                    BufferedAction(player. target. action) :Do(a)
                end
            end
        end
    end.
Copy the code

So I’m going to add a little bit of print here, and I’m going to find something amazing. Player is the correct player table, but mod_name is nil, and action and target are two numbers!

Look up and find another function:

    DragWalking = function(player. x. z)
        if not (checknumber(x) and
                checknumber(z)) then
            printinvalid("DragWalking". player)
            return
        end
        local playercontroller = player.components.playercontroller
        if playercontroller ~ = nil then
            playercontroller:OnRemoteDragWalking(x. z)
        end
    end.
Copy the code

Obviously, this call is really crossed. Its arguments are more likely to call the DragWalking function, and the two numbers are actually X and Z, the character coordinates. Of course, it could be another function with arguments similar to DragWalking.

So the problem should be the call code. Take a look at how the call code is generated, and sure enough, there is a problem:

RPC = {}

--Generate RPC codes from table of handlers
local i = 1
for k. v in pairs(RPC_HANDLERS) do
    RPC[k] = i
    i = i + 1
end
i = nil

--Switch handler keys from code name to code value
for k. v in pairs(RPC) do
    RPC_HANDLERS[v] = RPC_HANDLERS[k]
    RPC_HANDLERS[k] = nil
end
Copy the code

Fortunately, I saw the lua string Hash algorithm written by Yunfeng a long time ago to prevent DDOS changes and immediately saw the problem: this method of generating RPC codes is wrong in the new version of Lua.

The reason is that the new Lua string hash algorithm contains a random seed, which avoids DDOS attacks caused by hackers constructing a large number of different strings with the same hash value because the algorithm is fixed. Therefore, the HASH result for the same string is not always the same at startup.

However, internal pairs rely on Next for table traversal, which traverses sequentially based on the hash value of the key. The key is clearly the key of the RPC_HANDLERS, which is the name of the RPC handling routine. Therefore, this method of generating call codes that rely on traversal order results in a complete mismatch between the server and client codes, resulting in a cross wire failure.

So how to change it. Either change the HASH algorithm or change the Lua code.

For world peace we need to protect servers from DDOS!!

So I tactfully (chun) chose to let the user change the Lua code! (Here to say, this article is compiled in BUG order, so the chronological order will be inconsistent. When this version was released at that time, the first few pieces of code still need to be manually modified by users, so it was not a problem to change more.)

Go to line 722 and change it to this:

local temp = {}
for k. v in pairs(RPC_HANDLERS) do
    table.insert(temp. k)
end

table.sort(temp)

for k. v in ipairs(temp) do
    RPC[v] = k
end
--[[
local i = 1
for k, v in pairs(RPC_HANDLERS) do
    RPC[k] = i
    i = i + 1
end
i = nil
]]
Copy the code





0x0E Unpacked Bomb

After the change, the character is able to move, but the client exit there is still a black shadow, enter again can not enter.

It’s weird, and at first glance, it might be that a mission wasn’t completed. But there are no error messages, so where do I find out which task was not completed?

All the way from players disappears the message tracking, found the problem in scripts/components/playerspawner lua: 79

local function PlayerRemove(player. deletesession. migrationdata. readytoremove)
    if readytoremove then
        player:OnDespawn(a)
        if deletesession then
            DeleteUserSession(player)
        else
            player.migration = migrationdata ~ = nil and {
                worldid = TheShard:GetShardId(),
                portalid = migrationdata.portalid.
                sessionid = TheWorld.meta.session_identifier.
            } or nil
            SerializeUserSession(player)
        end
        player:Remove(a)
        if migrationdata ~ = nil then
            TheShard:StartMigration(migrationdata.player.userid. migrationdata.worldid)
        end
    else
        player:DoTaskInTime(0. PlayerRemove. deletesession. migrationdata. true)
    end
end
Copy the code

Here DoTaskInTime is a deferred API that calls back to PlayerRemove itself, thus creating an infinite loop. (Really want to make fun of the code here)

But how can this be a loop? Readytoremove must be true when we delay the call, so we’re not going to branch else next time. How can that be?

However, I print it. It’s an infinite loop, but the readyToRemove value I keep printing is neither true nor false as I expected.

But is nil.

So, what happens is that PlayerRemove keeps going in an infinite loop, subsequent processing doesn’t get done, and the server thinks the player who wants to log out is still online, so the player who wants to log in will be rejected.

Okay, why nil? By looking at the internal implementation of DoTaskInTime, we can see that it has a process for packaging varied-length arguments (Scheduler.lua :286)

function Scheduler:ExecutePeriodic(period. fn. limit. initialdelay. id. .)
    local list. nexttick = self:GetListForTimeFromNow(initialdelay or period)
    local periodic = Periodic(fn. period. limit. id. nexttick. .)
    list[periodic] = true
    periodic.list = list
    return periodic
end
Copy the code

And what does Periodic do? Look at line 37:

Periodic = Class(function(self, fn, period, limit, id, nexttick, ...) self.fn = fn self.id = id self.period = period self.limit = limit self.nexttick = nexttick self.list = nil self.onfinish  = nil self.arg = toarrayornil(...) end)Copy the code

Where toarrayornil is a function implemented by C, equivalent to {… }, but when… Returns nil if null.

So, is toarrayornil the problem? Before exploring this function, it occurred to me that the array part of the table in Lua has a very strange property that the length is from 1 to the first nil (not including nil). So if the arguments in the package contain nil values, will anything after the nil values be ignored?

Congratulations on being right again!

Write a script like this and save it as test.lua:

function foo(...) return {... } end print(unpack(foo(1, 2, 3, nil, 5)))Copy the code

The result of lua 5.1.4 is as follows:

1  2  3  nil  5
Copy the code

The result of luajit 2.1.0 is as follows:

1  2  3
Copy the code

The official result here is interesting, it actually reads the value after nil, but luajit doesn’t. The Lua documentation does not specify lua’s behavior in this case, so the implementation is different.

In this case, when PlayerRemove is called, only the player argument is not nil, everything else is nil, so when DoTaskInTime is called, readyToremove is specified as true, but the first two arguments are nil, So that triggered the problem in LuaJIT.

Open the lib_base. C: 220

LJLIB_CF(unpack)
{
  GCtab *t = lj_lib_checktab(L. 1);
  int32_t n. i = lj_lib_optint(L. 2. 1);
  int32_t e = (L->base+3-1 < L->top && !tvisnil(L->base+3-1)) ?
	      lj_lib_checkint(L. 3) : (int32_t)lj_tab_arraylen(t); // <---
  if (i > e) return 0;
  n = e - i + 1;
  if (n < = 0 || !lua_checkstack(L. n))
    lj_err_caller(L. LJ_ERR_UNPACK);
  do {
    cTValue *tv = lj_tab_getint(t. i);
    if (tv) {
      copyTV(L. L->top++. tv);
    } else {
      setnilV(L->top++);
    }
  } while (i++ < e);
  return n;
}
Copy the code

Note that at <—- I replaced the original lj_tab_len call with a new lj_tab_arraylen, which is implemented as follows:

MSize LJ_FASTCALL lj_tab_arraylen(GCtab *t)
{
  MSize j = (MSize)t->asize;
  while (j > 1 && tvisnil(arrayslot(t. j - 1))) {
    j--;
  }
  if (j) --j;
  return j;
}
Copy the code

Recompile luajit and apply it, problem solved.

0x0F Disappearing pirate parrot

The new version has been used for a long time, and there are few new BUG feedback, which is basically caused by the old patch or the inability to change the code as required.

The good times continued until users started reporting that the animals caught with mouse clicks on trap-like items disappeared and the durability of items did not decrease.

This may seem hard to understand, but the bug is so refined that it looks more like a bug in the Lua code itself than a bug caused by a script engine change. If patches cause bugs, why does the rest of the logic in the game work precisely except for traps? How does this work?

(High energy reminder: This is the weirdest and longest logical chain, but also the most interesting one when making this Patch.)


To do this I opened the game (PATCH enabled), created a new file, and used the console to spawn a bird catcher and a pirate parrot. After picking up the parrot did not, the durability of the bird trap did not drop.

So what went wrong? Through careful comparison of images before and after PATCH is enabled, a strange thing is found:

(Before commissioning)

(After enabled)

So the problem is that what was supposed to be a Check turns into a Pick up, causing the player to Pick up the bird trap instead of harvesting it. But if whitespace is used, the Check operation is performed correctly.

“Pick up” means “Pick up” or “Pick up”.

By searching “Pick up” in Lua source code, it can be found that strings. Lua has its definition:

STRINGS = { --ACTION MOUSEOVER TEXT ACTIONS = { REPAIR = "Repair", REPAIRBOAT = "Repair", PICKUP = "Pick up", CHOP = "Chop", FERTILIZE = "Fertilize", SMOTHER = "Extinguish", ... }}Copy the code

Look again at where strings.actions.PICKUP is referenced, but it doesn’t. Look again at ACTIONS.

In other words, all Actions are encapsulated by creating a BufferedAction, and from the name it’s Buffered, there should be a queue or something to store Actions, Keep looking “BufferedAction” can be found in the scripts/components/playeractionpicker lua in some strange things:

function PlayerActionPicker:SortActionList(actions. target. useitem)
    if #actions > 0 then
        table.sort(actions. function(l. r) return l.priority > r.priority end)
        local ret = {}
        for k.v in ipairs(actions) do
            if not target then
                table.insert(ret. BufferedAction(self.inst. nil. v. useitem))
            elseif target:is_a(EntityScript) then
                table.insert(ret. BufferedAction(self.inst. target. v. useitem))
            elseif target:is_a(Vector3) then
                local quantizedTarget = target 
                local distance = nil 

                --If we're deploying something it might snap to a grid, if so we want to use the quantized position as the target pos 
                if v = = ACTIONS.DEPLOY and useitem.components.deployable then 
                    distance = useitem.components.deployable.deploydistance
                    quantizedTarget = useitem.components.deployable:GetQuantizedPosition(target)
                end

                local ba = BufferedAction(self.inst. nil. v. useitem. quantizedTarget)
                if distance then 
                    ba.action.distance = distance 
                end 
                table.insert(ret. ba)
            end
        end
        return ret
    end
end

function PlayerActionPicker:GetSceneActions(targetobject. right)
    local actions = {}

    for k.v in pairs(targetobject.components) do
        if v.CollectSceneActions then
            v:CollectSceneActions(self.inst. actions. right)
        end
    end

	if targetobject.inherentsceneaction and not right then
		table.insert(actions. targetobject.inherentsceneaction)
	end

	if targetobject.inherentscenealtaction and right then
		table.insert(actions. targetobject.inherentscenealtaction)
	end

    if #actions = = 0 and targetobject.components.inspectable then
        table.insert(actions. ACTIONS.WALKTO)
    end

    return self:SortActionList(actions. targetobject)
end
Copy the code

On closer inspection, this seems to be one of the key points. All bufferedactions are sorted here by priority via table.sort. My guess is that the logic of the game is to collect the various actions that are available, and then rank them in order of priority, with the top being the winner and being selected as the default action.

Therefore, either there is a problem with the actions before sorting, or there is a problem with the sorting itself. We should be able to narrow down the search from here.

0x10 Unstable balance

To test my idea, I’m going to do a LOG in SortSceneActions. Let’s see if it’s fun:

local mapActionToName = {}
for k, v in pairs(STRINGS.ACTIONS) do
    local m = ACTIONS[k]
    if (m) then
        mapActionToName[m] = k
    end
end

local function PrintList(actions)
    for i, v in ipairs(actions) do
        print("ACTION [" .. i .. "] = " .. (mapActionToName[v] or "NULL"))
    end
end

function PlayerActionPicker:SortActionList(actions, target, useitem)
    if #actions > 0 then
        print("-----------------------")
        print("Before sorting ... ")
        PrintList(actions)
        table.sort(actions, function(l, r) return l.priority > r.priority end)
        print("After sorting ... ")
        PrintList(actions)
        
        local ret = {}
        for k,v in ipairs(actions) do
            if not target then
                table.insert(ret, BufferedAction(self.inst, nil, v, useitem))
            elseif target:is_a(EntityScript) then
                table.insert(ret, BufferedAction(self.inst, target, v, useitem))
            elseif target:is_a(Vector3) then
                local quantizedTarget = target 
                local distance = nil 

                --If we're deploying something it might snap to a grid, if so we want to use the quantized position as the target pos 
                if v == ACTIONS.DEPLOY and useitem.components.deployable then 
                    distance = useitem.components.deployable.deploydistance
                    quantizedTarget = useitem.components.deployable:GetQuantizedPosition(target)
                end

                local ba = BufferedAction(self.inst, nil, v, useitem, quantizedTarget)
                if distance then 
                    ba.action.distance = distance 
                end 
                table.insert(ret, ba)
            end
        end
        return ret
    end
end
Copy the code

I print out the values in the Actions array before and after the sort and see what the sort does:

(Before enabling PATCH)

(After PATCH is enabled)

Look at the blue box. The PICKUP is number one. As a direct result, the author does not know that table.sort is unstable. LuaJIT’s fast sorting algorithm is probably not the same as the original one, and the results will be different if the largest element is not unique!

So I open actions.lua and everything seems clear:

Action = Class(function(self. priority. instant. rmb. distance. crosseswaterboundary) 
	self.priority = priority or 0
	self.fn = function(a) return false end
	self.strfn = nil
	self.testfn = nil
	self.instant = instant or false
	self.rmb = rmb or nil
	self.distance = distance or nil
	self.crosseswaterboundary = crosseswaterboundary or false
end)

ACTIONS=
{
	REPAIR = Action(),
	.
	PICKUP = Action(2),
	.
	CHECKTRAP = Action(2),
	BUILD = Action(),
	PLANT = Action(),
	.
}
Copy the code

Obviously, PICKUP and CHECKTRAP are both of priority 2, so it is possible for the sorting to occur before CHECKTRAP.

This can be easily resolved by setting the CHECKTRAP to a higher priority (e.g. 2.5). As it turns out, the bug does go away.

But is that the end of the question?

0x11 depends on the correctness of the error

The amazing thing about this bug is that it’s not just that.

After the patch was released, you said the issue was fixed, but there were other similar issues, such as ships not Inspect, lotus not right-clicking in the MOD character Wing language, oven mods not working properly, etc.

Indeed, these are the result of the original Famine Actions priority setting, so there are two possibilities:

1. The authors modify the sorting algorithm to make it stable (e.g., bubble sort), so that when the priority is the same, the first order in the original sequence is also the first order.

2. The authors have no idea that quicksorting is unstable, and when the results are abnormal, they change the priority, and if the results are as expected, they leave it alone. The quicksort implementation in lua5.1.4 is different from that in LuaJIT causing this problem.

At first I thought it was a 1 problem, so I wrote a stable sort by hand, and when I hooked it, it solved the trap problem (even though the priority was 2). But the rest of the bugs can not be solved, this road is blocked.

So if press 2, quick sorting implementation is different, then I change a lua5.1.4 original DLL to try?

I renamed lua51.dll to Luajit.dll, ran the game, and sure enough, everything worked. I guess it’s the sorting algorithm.

So I opened lua5.1.4 source, compared to the source of Luajit, but found a magical thing: sorting algorithm in addition to the error part is a little different, unexpectedly is exactly the same!!

So this is kind of weird, the sorting algorithm is the same, why is it different? Am I missing something?

A closer examination of these two images finally revealed the problem:

(Before enabling PATCH)

(After PATCH is enabled)

I had been looking at the results after sorting, but did not notice that the order of the data before sorting was also different (shown in the red box). In other words, the problem itself has nothing to do with sorting algorithms, and something to do with wrong priority, but it’s not fatal.

What is fatal is why the order of the data before sorting is different!

Follow the SortActionList and sure enough, I just missed it right in front of my eyes:

function PlayerActionPicker:GetSceneActions(targetobject. right)
    local actions = {}

    for k.v in pairs(targetobject.components) do
        if v.CollectSceneActions then
            v:CollectSceneActions(self.inst. actions. right)
        end
    end

	if targetobject.inherentsceneaction and not right then
		table.insert(actions. targetobject.inherentsceneaction)
	end

	if targetobject.inherentscenealtaction and right then
		table.insert(actions. targetobject.inherentscenealtaction)
	end

    if #actions = = 0 and targetobject.components.inspectable then
        table.insert(actions. ACTIONS.WALKTO)
    end

    return self:SortActionList(actions. targetobject)
end
Copy the code

Believe it or not, the problem is the for loop in this function.

If you’ve seen the previous article, you can see what I mean — the enumeration order of the for loop is related to the String HASH algorithm! V :CollectSceneActions actions are added sequentially to actions, so the different enumeration order will result in the inconsistent order of actions in actions.

LuaJIT’s string HASH algorithm is not the same as lua’s, which is why the online RPC was buggy.

So, to tidy up the logic, the complete bug trigger flow is:

Inconsistent string HASH algorithms => The order of keys traversed in the table is inconsistent => The order of actions is inconsistent + The sorting algorithm is unstable and the array to be sorted has the same keys (priority is equal) => The sorting results are inconsistent => The selected operations are inconsistent.

This is it. All the mysteries are solved. In retrospect, if the author had thought about the stability of the sorting algorithm when he saw that the actions were in a strange order, he would not have adjusted the priority of individual actions, but would have reassigned different priorities to all actions. If they had, the whole problem would have gone away.

Now, it depends entirely on how a particular sorting algorithm sorts a particular piece of data. Imagine having a mod that manually adds an ACTION; Or if the author adds a default component to targetobject.components with a version update, the sorting result will be inconsistent with the expected result, and this inconsistency will result in a large number of logic errors that are extremely difficult to troubleshoot.

To make matters worse, there are already a number of third party mods that use ACTION. Changing the priority value of the default ACTION can cause these mods to fail.

So the bug slowly became a feature and no one dared to touch it.

So what’s the solution? I had no choice but to copy lua5.1.4’s string HASH algorithm and replace luajit’s implementation. This also resolves the previous RPC problem without having to change the code. I don’t really want to change it because it would be a higher security risk. But there’s no way. Just go ahead.

0 x12 bottleneck

Based on my feedback, I found one limitation of LuaJIT itself: When loading an archive, if the archive is too large and the number of constants exceeds 65536, LuaJIT will make an error when initializing the table. The error occurred when the table size reached a staggering 0x65000000 +, after a simple trace, the table structure has been broken. A closer look shows that the constant BCMAX_D in the LUAJIT instruction cannot be arbitrarily increased, otherwise the 32-bit instruction will not fit. Check luajit’s BBS and see Mike answering this question:

http://lua-users.org/lists/lua-l/2010-03/msg00238.html

LuaJIT has different limits and different design constraints. For a single table that consists only of constants there's  effectively no limit (memory is the limit). But your test case has lots of small, nested tables. Every individual table only occupies a single constant slot (and not one for each of its keys/values). But there are too many of these tables to keep them in a single function. The short term solution is to split the table construction into smaller pieces, wrapped into individual functions. The long term solution would be to add a workaround for these limits to LuaJIT. But I'm not sure about the best way, Yet. > [2] The Lua bug it exposed has since had fixed > http://www.lua.org/bugs.html#5.1.4-6 This problem is not The present LuaJIT 2.0 in.Copy the code

(Actually luajit 2.0 still has this problem.)

The reason for this is simply that the famine archive used a poor strategy of serializing tables into Lua code. Then you have a huge number of tables and constants. Thus exceeding LuaJIT’s limit when loading. The simplest solution seems to be to rewrite the load code to block it for the archive. A more radical approach would be to change the archive format, but this would not be compatible with the old archive.

However, after reading Mike’s words carefully, I found that this problem can be alleviated only by layering the data table and wrapping it with function. As long as there are no more than 65536 function constants per layer, it will load correctly.

So open lib_base.c:

void filter(lua_State* L) {
	char ch. check = 0;
	int level = 0;
	int t = lua_type(L. 1);
	int happen = 0;
	int quote = 0;
  int slash = 0;
	const char* p = lua_tostring(L. 1);
	long size = lua_objlen(L. 1);
	char levelMasks[1024];
	memset(levelMasks. 0. sizeof(levelMasks));
	if (t = = LUA_TSTRING) {
		char* target = (char*)malloc(size * 2);
		char* q = target;
		while (size-- > 0) {
			ch = *p++;
			if (ch = = '"' && !slash) {
				quote = !quote;
			}

			if (!quote) {
				if (ch = = '{') {

					level++;
					if (check) {
						const char* ts = "(function () return {";
						--q;
						memcpy(q. ts. strlen(ts));
						q + = strlen(ts);
						levelMasks[level - 2] = 1;
						check = 0;
						happen = 1;
					}

					check = 1;
					*q++ = (char)ch;
				} else {
					*q++ = (char)ch;
					if (level > 0 && ch = = '} ') {
						level--;
						
						if (levelMasks[level] ! = 0) {
							const char* ts = "end)()";
							memcpy(q. ts. strlen(ts));
							q + = strlen(ts);
							levelMasks[level] = 0;
						}
					}
					check = 0;
				}
			} else {
        if (ch = = '\ \') {
          slash = !slash;
        } else {
          slash = 0;
        }
				*q++ = (char)ch;
				check = 0;
			}
		}

		if (happen) {
			*q = 0;
      /* printf("TARGET: %s\n", target); * /
/ *
      FILE* tg = fopen("modified.lua", "wb");
      fwrite(target, q-target, 1, tg);
fclose(tg); * /
			lua_pushlstring(L. target. q - target);
			lua_replace(L. 1);
		}

		free(target);
	}
}

LJLIB_CF(loadstring)
{
  filter(L);
  return lj_cf_load(L);
}
Copy the code

When there are two {in a row detected (there is no way to make this compatible for DS). Insert a function boundary just outside the child table. After recompiling, the problem is resolved.






0 x13 summary

After nearly one and a half months of efforts, my PATCH finally successfully solved most of the bugs and was officially released. At the same time, in order to reduce the pressure on users, I have somehow integrated the changes to the original Famine Lua code so that users can simply copy the published files into the Famine bin directory.

I really didn’t expect to encounter so many problems, but by solving these bugs, I read a fair amount of source code and used OllyDBG+WinDBG to debug and analyze a lot of crash reports. Although most guesses and experiments are not posted because they do not match the final results, who can guarantee to find bugs in a hurry

At the same time, I was thinking about design and coding by reading other people’s code. Is the implementation of this place good? Why? What should the author have been thinking? Why should there be such a set or limit? If I were to write this, how would I design it? Can it be better?

In addition, coding is only a part of the game experience. The world view, element design, numerical design, art style selection, music production and so on all form an integral part of the game. In my opinion, why the famine so popular, and its efforts are inseparable. Famine itself may have a lot of implementation glitches from a programming standpoint, but that doesn’t stop it from being a great sandbox game.


The attached:

At the time of writing, there are still bugs that have not been resolved:

After PATCH is enabled, there is a certain probability that the upper and lower caves will cause seasonal confusion. I played this BUG myself in the earlier version of PATCH, but the latest version failed to reproduce it. According to the feedback of the bar friends, this problem still exists, but of all the bar friends who said the problem exists, only one provided the archive and MODS according to my requirement, but still failed to reproduce on my machine. The other two or three bar friends disappeared after asking questions, never to respond again. I found this problem in the original version after looking at the old post of the bar, but that post was made a long time ago and whether the author tried to “fix” it is unknown.

In fact, the biggest resistance to writing this PATCH is not from the program code itself, but from the numerous feedbacks, few people can effectively describe the specific process, details and how to reproduce the bug as required. Many bug fixes are guessed by simple descriptions, so a lot of time is wasted on inaccurate guesses.





0x14 Volcanic Boundary (EXt.)

(This part addresses a bug that has been around since the volcano opened in the original Famine, namely the date confusion when entering and exiting the volcano in Shipwrecked DLC. I received an official reply from Klei on August 10th, which should be fixed in the next version!)

(Again, the triggering of this BUG has nothing to do with whether my PATCH is enabled or not)

Famine’s author is a bit odd in his date design. Instead of using a uniform time, each world (including caves and volcanoes) has a separate time, and only the current world’s clock moves. This will cause time inconsistency when jumping around the world. Wouldn’t it be easy to replace the new world’s time with the previous one? The author wanted to allow different worlds to have different times, so player_age was used to synchronize the two worlds (except ROG and SW jump). Then, when it detects that the user is jumping from one world to another, it triggers the synchronized code. There are several ways to travel: “Ascend”, “descend”, “shipwrecked”, “ascend_volcano” and “descend_volcano”.

Open data\DLC0002\scripts\gamelogic. Lua and go to:

if travel_direction = = "ascend" or travel_direction = = "descend" then
    print ("Catching up world". catch_up. "(". player_age."/".world_age.")" )
Copy the code

You need to synchronize time up and down caves and in and out of volcanoes (ROG and SW do not), so you need to check if you need to synchronize time when loading the world. So the source of the BUG is that the author has omitted “descend_volcano” and “ascend_volcano” here. Once you spend more than a day inside a volcano, this time should be synchronized, but due to the author’s carelessness, this synchronization will never happen…