An application core is what happens when an application can no longer maintain its normal running state and crashes. When the program crashes, a core-dump file is generated as a backup of the program state. A core-dump file contains state information such as memory, processor state, registers, the program counter, and the stack pointer. This article introduces some methods and techniques for using core-dump files to locate the cause of a program core.

The full text contains 7,023 words and is expected to take 13 minutes to read.

1. Definition and classification of program core

An application core is a crash that occurs when an application fails to maintain its normal running state. The core-dump file is a backup of the program's state data at the moment of the crash, and it is what we analyze to locate the cause of the core.

Here we classify program cores by three kinds of cause: machine, resource, and program bug. The following table breaks down the common causes of a core:

2. Function stack introduction

When we open the core file, we first look at the state of the function call stack at the time of the crash. To understand the techniques for locating a core described later, let's first take a look at the function stack.

2.1 Register Introduction

At present, all machines in the production environment are 64-bit, so only the registers of 64-bit machines are introduced here, as follows:

The x86-64 architecture has 16 64-bit registers, and a register is not limited to a single purpose. For example, %rax usually holds the value returned by a function, but it is also used by the imul and idiv instructions. The registers to focus on here are %rsp (stack pointer, top of the stack), %rbp (base pointer, bottom of the current stack frame), and %rdi, %rsi, %rdx, %rcx, %r8, %r9 (the first to sixth function parameters, respectively).

Callee Save indicates whether the called function (the callee) must save the register's value and restore it before returning.
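The parameter-register mapping described above can be written down as a small lookup. This is only a sketch: `integer_arg_location` is a hypothetical helper name, and it assumes the System V AMD64 calling convention used on 64-bit Linux, where integer and pointer arguments beyond the sixth go on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: where the index-th (0-based) integer or pointer
// argument is passed, assuming the System V AMD64 calling convention.
std::string integer_arg_location(std::size_t index) {
    static const char* const kRegisters[] = {
        "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"};
    const std::size_t count = sizeof(kRegisters) / sizeof(kRegisters[0]);
    assert(count == 6);  // six registers carry integer/pointer arguments
    // Arguments that do not fit in the six registers are passed on the stack.
    return index < count ? kRegisters[index] : "stack";
}
```

When reading a core, this is the order in which to check registers for a function's arguments, provided the crash happened before the function overwrote them.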

2.2 Function Calls

2.2.1 Stack frame of the calling function:

When a function is called, the arguments are pushed first, in the reverse order of their declaration. Note that arguments are not necessarily pushed onto the stack: on x86-64, arguments that fit in registers, such as numbers, pointers, and references, are passed directly through registers.

Next, the return address is pushed onto the stack. The return address is the address of the next instruction the calling function executes after the called function returns. Keep the location of the return address in mind; it will be used in later sections.

Here is an example of the above description:

As shown in the figure above, main calls foo. First the parameters are passed: all three can be passed directly in registers (%edi, %esi, %edx). Then the call instruction pushes the address of the next instruction, which is the return address.

2.2.2 Stack frame of the called function:

The called function first pushes the caller's stack base pointer (%rbp) and then points %rbp at its own frame. It then saves the registers that must be preserved, that is, the registers for which Callee Save is True. Finally, it reserves stack space for temporaries and local variables.

For the called function, here’s an example:

As shown in the figure above, foo first pushes main's %rbp and then stores the parameters from the registers into local variables (a, b, c).

2.3 Summary

This brief introduction to function calls shows that the function stack is a rigorous but fragile structure: its memory must be accessed in a strictly defined way, and a small mistake can crash the program.
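The two values that every frame records, the frame base and the return address, can also be observed from inside a running function. The sketch below assumes GCC or Clang, since `__builtin_frame_address`, `__builtin_return_address`, and `__attribute__((noinline))` are compiler extensions; `FrameInfo` and `inspect_current_frame` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical record of the two values discussed above.
struct FrameInfo {
    void* frame_base;      // frame pointer of the called function
    void* return_address;  // where execution resumes in the caller
};

// noinline keeps this a real call, so the function has its own stack frame.
__attribute__((noinline)) FrameInfo inspect_current_frame() {
    FrameInfo info;
    info.frame_base = __builtin_frame_address(0);
    info.return_address = __builtin_return_address(0);
    assert(info.return_address != nullptr);
    return info;
}
```

On x86-64 with frame pointers enabled, one would expect the return address to sit 8 bytes above the address reported as the frame base, which matches the layout described in this section.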

3. Locating a core with GDB

This section covers the problems you may encounter along the way, from opening a core file to locating the cause, and how to solve them.

3.1 The core file

Where is the core file?

Check /proc/sys/kernel/core_pattern to determine the core file generation rule.
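For a service that wants to log where its core files will land, the same check can be done in code. This is a minimal sketch: `read_core_pattern` is a hypothetical helper, and it returns an empty string when the file cannot be read, for example on a non-Linux system.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: returns the first line of the kernel's core_pattern
// file, or an empty string if the file cannot be opened or read.
std::string read_core_pattern(
    const std::string& path = "/proc/sys/kernel/core_pattern") {
    assert(!path.empty());
    std::ifstream in(path);
    std::string pattern;
    if (!in || !std::getline(in, pattern)) {
        return std::string();
    }
    return pattern;
}
```

A pattern that starts with `|` means the kernel pipes the core to a program instead of writing a file, in which case the core has to be looked up through that program.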

3.2 Variable printing

During debugging, it is often necessary to check whether the values of various variables (memory, registers, function tables, etc.) are correct. This section describes the common ways to print variables, along with a few small tricks.

3.2.1 print command

print [Expression]
print $[Previous value number]
print {[Type]}[Address]
print [First element]@[Element count]
print /[Format] [Expression]

Format:
    o - octal
    x - hexadecimal
    u - unsigned decimal
    t - binary
    f - floating point
    a - address
    c - character
    s - string

3.2.2 x command

x/<n/f/u> <addr>

n: a positive integer giving the number of memory units to display, starting at the given address and moving toward higher addresses. The size of one unit is set by the third parameter, u.
f: the output format of the memory contents at addr. s outputs a string. For integer data:
    x - hexadecimal
    d - decimal
    u - unsigned decimal
    o - octal
    t - binary
    a - address (hexadecimal)
    c - character
    f - floating point
u: the number of bytes in one memory unit. The default is 4. It is given as a letter: b = 1 byte, h = 2 bytes, w = 4 bytes, g = 8 bytes.
<addr>: the memory address.
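To make the n/f/u parameters concrete, the sketch below mimics what a request like `x/3xb <addr>` asks for: 3 units, hexadecimal format, 1 byte per unit. It only illustrates the idea and does not reproduce GDB's exact output layout; `format_bytes_hex` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats `count` one-byte units starting at `addr`
// in hexadecimal, separated by spaces.
std::string format_bytes_hex(const void* addr, std::size_t count) {
    assert(addr != nullptr || count == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(addr);
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        char unit[8];
        std::snprintf(unit, sizeof(unit), "0x%02x",
                      static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += unit;
    }
    return out;
}
```

Changing the unit size from b to g in GDB corresponds to reading 8 bytes per unit instead of 1, which is the form used later for walking saved frame pointers.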

3.2.3 Container object printing

Using the above print and x commands, combined with the data structure of the container, we can know the details of the container. Here’s an example of printing a complete binary string. The string structure is as follows:

When the string is empty, _M_dataplus._M_p refers to nullptr. The first half is the meta information (std::string::_Rep), such as length, capacity, and refcount. The second half is the data area, and _M_p points to it.

If a string is not binary, print displays it directly; if it is binary (it contains embedded '\0' bytes), the output is truncated. So the first step in printing a binary string is to determine its size.
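The truncation comes from treating the data as a NUL-terminated C string. The sketch below shows the gap between the real size of a std::string and what a NUL-terminated view can see; `bytes_hidden_by_nul` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: how many bytes of `s` a NUL-terminated view misses.
std::size_t bytes_hidden_by_nul(const std::string& s) {
    // strlen stops at the first '\0', just like a plain string print does.
    const std::size_t visible = std::strlen(s.c_str());
    assert(visible <= s.size());
    return s.size() - visible;
}
```

For a string built as `std::string("ab\0cd", 5)`, the size is 5 but only 2 bytes are visible, which is why the length has to be read from the string's metadata before dumping the data area.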

The size is stored in std::string::_Rep, which sits immediately before the data area that _M_p points to, so it can be printed as follows:

p *((std::string::_Rep*)(s._M_dataplus._M_p) - 1)

After finding the size (_M_length) of string, run the x command to print the related memory area. The command is:

# N is the value of _Rep._M_length
x/Ncb s._M_dataplus._M_p

The running effect is as follows:

For convenience, here is a recommended script: stl-views.gdb (link: sourceware.org/gdb/wiki/ST…). It supports printing common containers such as vector, map, list, and string.

3.2.4 Printing static variables

Static variables are often used in programs. Sometimes we need to check whether the value of a static object is correct, which involves printing the static object. Here’s an example:

void foo() {
    static std::string s_foo("foo");
}

Here you can use nm -C ./bin | grep xx to find the memory address of the static variable, and then print it with GDB's print command.

3.2.5 memory dump

dump [format] memory filename start_addr end_addr
dump [format] value filename expr

For format, binary is generally used. For example, we can dump the entire string from the example above into a file:

dump binary memory file1 s._M_dataplus._M_p s._M_dataplus._M_p + length

Dump a string to a file

3.3 Locate lines of code

To locate a core, you first need to locate the line of code that was executing at the time of the crash. This section focuses on some methods for locating lines of code. Usually you can see the entire function stack directly with GDB's backtrace, but sometimes the stack information is not so clear, and a few tricks are needed to recover it.

3.3.1 Disable compilation optimization

Sometimes you will find that the core's function stack does not match the actual lines of code. If you are in an offline environment, you can try setting the optimization level to -O0 and then reproducing the core.

3.3.2 Program counter + addr2line

For cores in the online environment, there is generally no way to recompile the program without optimization; the code has to be located from the existing core file. In this section, we use an example to show how to use the program counter + addr2line to locate lines of code.

From the screenshot, we can see that the line of code indicated by Frame 20 does not match the actual line of code. The locating steps are as follows:

shell /opt/compiler/gcc-8.2/bin/addr2line -e bin address

3.3.3 Function stack repair

Sometimes we find that the function call stack has a lot of … This often occurs when the stack has been overwritten, and in some cases it can be repaired manually. Function stack repair relies on knowing how the function stack is laid out in memory, as described in section 2.

----------------------------------- Low addresses
-----------------------------------
  0(%rsp) | top of the stack frame (this is the same as -n(%rbp))
----------|--------------------------
 -n(%rbp) | variable sized stack frame
 -8(%rbp) | varied
  0(%rbp) | previous stack frame address
  8(%rbp) | return address
----------------------------------- High addresses

From the stack layout above, you can use the %rbp register to find the caller's return address and frame base pointer, and then use the addr2line command to find the corresponding line of code. Here's an example:

# Print two 8-byte units at $rbp as addresses:
# the previous frame's %rbp and the return address
x/2ag $rbp
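Repeating that command on each saved %rbp value is a walk along a linked list of frames. The sketch below runs the same walk over simulated stack memory rather than a real process, so the addresses and the `StackMemory` and `walk_frames` names are hypothetical; it assumes the layout shown above, with the saved %rbp at 0(%rbp) and the return address at 8(%rbp).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simulated 8-byte words of stack memory, keyed by address (hypothetical).
using StackMemory = std::map<std::uint64_t, std::uint64_t>;

// Follows the chain of saved %rbp values and collects the return addresses.
std::vector<std::uint64_t> walk_frames(const StackMemory& memory,
                                       std::uint64_t rbp,
                                       std::size_t max_frames) {
    assert(max_frames > 0);
    std::vector<std::uint64_t> return_addresses;
    while (return_addresses.size() < max_frames) {
        auto saved_rbp = memory.find(rbp);        // 0(%rbp)
        auto return_addr = memory.find(rbp + 8);  // 8(%rbp)
        if (saved_rbp == memory.end() || return_addr == memory.end()) break;
        return_addresses.push_back(return_addr->second);
        // The stack grows downward, so a caller's frame must be at a higher
        // address; anything else means the chain ended or was overwritten.
        if (saved_rbp->second <= rbp) break;
        rbp = saved_rbp->second;
    }
    return return_addresses;
}
```

Each collected return address can then be passed to addr2line. The walk stops at the first saved %rbp that does not point to a higher address, which is also where a corrupted stack usually shows up.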

3.3.4 Irregular core stack

Irregular core stacks usually occur when heap memory has been corrupted. A function call is such a delicate process that an unexpected read or write at any location can crash the program. Here's a small example:

#include <cstdint>
#include <string>

int main(int argc, char* argv[]) {
    std::string s("abcd");
    // Overwrite the first 8 bytes of the string object (its data pointer).
    *reinterpret_cast<uint64_t*>(&s) = 0x11;
    return 0;  // the destructor of s runs here
}

In the example above, the core occurs in the string destructor: the string's _M_p has been overwritten with 0x11, so the destructor performs an illegal memory operation.

Similarly, because the process heap is shared, an illegal heap operation in one thread may affect the normal operation of another thread. Because heap allocation is effectively random, the symptom is a core stack that differs from crash to crash.

One of the best ways to track down irregular core stacks is AddressSanitizer.

# Set compilation parameters
CXXFLAGS="-fPIC -fsanitize=address -fno-omit-frame-pointer"
# Set link parameters
LDFLAGS="-lasan"
# Set environment variables at startup
export ASAN_OPTIONS=halt_on_error=0:abort_on_error=1:disable_coredump=0
# Start the program with libasan preloaded
LD_PRELOAD=/opt/compiler/gcc-8.2/lib/libasan.so ./bin/xxx

3.3.5 Summary

The methods mentioned above are all aimed at finding specific lines of problematic code that provide clues to the specific cause of core later on.

3.4 Locating the cause of Core

This section focuses on ways to locate the cause of Core and some common causes.

3.4.1 Confirm the signal


From the core classification above, some cores are caused by machine faults and show up as, for example, SIGBUS, so checking the signal first lets us rule out some causes.
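A first-pass triage by signal can be sketched as a lookup. Only the SIGBUS entry comes from the classification above; the other categories are rough assumptions for illustration and not a definitive mapping. `likely_core_category` is a hypothetical helper, and the code assumes a POSIX system, where `<signal.h>` defines SIGBUS.

```cpp
#include <cassert>
#include <signal.h>
#include <string>

// Hypothetical triage helper. Only the SIGBUS entry is taken from the
// article; the remaining categories are illustrative assumptions.
std::string likely_core_category(int signal_number) {
    assert(signal_number > 0);
    switch (signal_number) {
        case SIGBUS:  return "possible machine fault";
        case SIGSEGV: return "invalid memory access";
        case SIGABRT: return "abort called by the program";
        case SIGFPE:  return "arithmetic error";
        case SIGILL:  return "illegal instruction";
        default:      return "unclassified";
    }
}
```

GDB reports the terminating signal when the core file is opened, so this check costs nothing and narrows down where to look next.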

3.4.2 Locate the abnormal assembly instruction


With the code line located as above, we know roughly which line the program cored on. A simple core can be diagnosed by directly printing the program context.

But in some scenarios the context shows nothing abnormal. In that case we need to pinpoint the exact assembly instruction that faulted and work out the cause from it.

An easy way to view assembly instructions is layout asm: use frame to select a stack frame, and the assembly for that frame is displayed. Here's an example of a core:

The program cores in the start function, and the relevant context variables show nothing abnormal. Use layout asm to open the assembly instruction being executed, as follows:

Checking the assembly shows that the program cores at a mov instruction. The instruction before it is a sub that reserves 3M of stack space, so insufficient stack space is suspected. This can be confirmed by comparing %rsp of frame 0 with %rbp of frame N to compute how much stack is in use.
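The comparison is plain address arithmetic. The sketch below assumes the register values have been copied by hand from GDB; the function names are hypothetical, and the stack limit is passed in because it depends on the environment (8 MB is a common default for the main thread on Linux, but that is an assumption here).

```cpp
#include <cassert>
#include <cstdint>

// Bytes of stack between the outermost frame's %rbp (frame N) and the
// innermost frame's %rsp (frame 0). The stack grows toward lower addresses.
std::uint64_t stack_bytes_in_use(std::uint64_t outermost_rbp,
                                 std::uint64_t innermost_rsp) {
    assert(outermost_rbp >= innermost_rsp);
    return outermost_rbp - innermost_rsp;
}

// True when the stack in use is larger than the given limit in bytes.
bool exceeds_stack_limit(std::uint64_t outermost_rbp,
                         std::uint64_t innermost_rsp,
                         std::uint64_t limit_bytes) {
    return stack_bytes_in_use(outermost_rbp, innermost_rsp) > limit_bytes;
}
```

If the result is close to or above the thread's stack limit, the faulting mov is simply the first access to the unmapped area below the stack.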

From the above example, we can see that once the abnormal assembly instruction is located, we can further narrow down which instruction, variable, or address caused the core.

3.4.3 Troubleshooting Abnormal Variables

Through the above operations, we can accurately locate which instruction in which line of code has a problem. According to the abnormal instruction, we can check related variables and determine whether the variable value meets the expectation.

Here’s a classic example of a null pointer:

int main(int argc, char* argv[]) {
    int* a = nullptr;
    *a = 1;
    return *a;
}

Here %rax is loaded from -0x8(%rbp). Printing the memory at that address with the x command shows that it holds 0, so this is a null pointer error.

3.4.4 Viewing optimized variables

When a program is compiled with optimization, some variables often cannot be printed, and GDB reports that the variable is optimized out. Sometimes the optimized-out variable can still be viewed through assembly + registers.

Here’s an example:

#include <cstring>

void foo(char const* str) {
    char buf[1024] = {'\0'};
    memcpy(buf, str, sizeof(buf));
}

int main(int argc, char* argv[]) {
    foo("abcd");
    return 0;
}

Normally, within foo, the str variable is optimized out because the parameter can be passed directly in the %rdi register. To print the value of str, we can find it through assembly + registers, as follows:

The instruction mov $0x402011, %edi shows that 0x402011 is the memory address of str. The x command can then display the value of str.

In complex scenarios where it is not possible to find the optimized variable directly, you can use assembly backtracking to find the variable.

3.4.5 Abnormal Function Address

Sometimes the core is caused by abnormal data, and sometimes by an abnormal function address, such as a wrong virtual function address, a wrong function return address, or a wrong function pointer value.

Troubleshooting an abnormal function address is the same as troubleshooting an abnormal variable: use the assembly instructions to confirm whether the call is abnormal. Here's an example of an abnormal virtual function address:

class A {
public:
    virtual ~A() = default;
    virtual void foo() = 0;
};

class B : public A {
public:
    void foo() {}
};

int main(int argc, char* argv[]) {
    A* a = new B;
    a->foo();
    A* b = new B;
    *reinterpret_cast<void**>(b) = 0x0;
    b->foo();

    return 0;
}

According to the assembly, the core occurs at mov (%rax), %rax. From the surrounding instructions, this is the step that looks up the virtual function's address. Comparing the virtual function tables of the two variables shows that the core is caused by loading a wrong function address.

3.4.6 Summary
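The comparison of the two objects' virtual table pointers can be illustrated in code. The sketch below relies on an implementation detail rather than anything the C++ standard guarantees: with GCC and Clang, the virtual table pointer of a class like this is stored in the first pointer-sized word of the object. `vtable_pointer_of` is a hypothetical helper, and the classes mirror the example above.

```cpp
#include <cassert>
#include <cstring>

class A {
public:
    virtual ~A() = default;
    virtual void foo() = 0;
};

class B : public A {
public:
    void foo() override {}
};

// Hypothetical helper: reads the first pointer-sized word of the object.
// Assumes the compiler stores the virtual table pointer there.
const void* vtable_pointer_of(const A* object) {
    assert(object != nullptr);
    const void* vptr = nullptr;
    std::memcpy(&vptr, static_cast<const void*>(object), sizeof(vptr));
    return vptr;
}
```

Two healthy objects of the same dynamic type report the same value. In the crashing example, the second object's first word has been overwritten with 0, so the two values differ and the virtual call dereferences a null table.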

The basic process of locating core can be summarized as the following steps:

  1. Determine roughly what triggered the core. A machine problem? A problem in the program itself?

  2. Locate the line of code. Which line of code is the problem?

  3. Locate the executing instruction. Which instruction was running, and what does it do?

  4. Locate the abnormal variable. The instruction itself is fine, but the variable it operates on is not as expected.

Making good use of assembly instructions and the print commands (x, print, display) helps locate a core more effectively.

