Every new project runs into a few bottleneck problems that hold it back. For the Tencent Mobile Game Assistant, the startup hang at 98% was one of them. Fortunately, the team solved it, and now is a good time to review what we learned and improved. Tencent Mobile Game Assistant is a product built on top of VirtualBox: it adds a UI layer over VirtualBox, wraps some common operations, and ships default virtual key mappings for games, so that players can enjoy mobile games on a PC without fiddling with tedious settings.

(Figure 1) Emulator module structure

Early in the project, some users reported that the emulator got stuck at 98% while loading. The interface looked like this:

(Figure 2) The assistant stuck at 98% during startup

The feedback posts on the forum show that this problem had been reported since November 2015:

(Figure 3) Forum feedback about the 98% hang

Reading the corresponding UI code, the startup process can be summarized as follows:

(Figure 4) Main startup flow of the emulator

1) CheckEnvironment() checks the environment

  1. Check whether a crash occurred during the last run
  2. Check whether COM and the driver are working. If not, try to repair them
  3. Check whether the CPU supports VT and whether VT is enabled
  4. Check whether OpenGL rendering works
  5. Set the display color depth to 32-bit

2) StartVM() prepares the VM

  1. Check the OpenGL version and decide whether to force DX mode
  2. Adjust the VM memory size
  3. Adjust the number of CPU cores on the VM

3) StartVMInternal() starts the VM

  1. Set the VM resolution
  2. Set the DPI of the VM
  3. Enable hardware OpenGL on the VM
  4. Set the IMEI
  5. Set the VM proxy
  6. Set up port forwarding
  7. Invoke the command that starts the emulator

4) Init_devices() initializes the devices

This step creates several communication threads that talk to Android inside the VM. As soon as any one of these threads communicates successfully, the emulator counts as started and can be controlled normally.

  1. Start local OpenGL rendering and create a render window
  2. Start the input communication thread
  3. Start the control communication thread
  4. Start the sensor communication thread

In the normal flow, the UI invokes several Tbox (a modified version of VirtualBox) command lines to apply settings and then boots the ROM. Once the ROM has booted, the Launch process inside Android sends a “Connected” message, and the UI treats startup as successful when it receives that message. The UI communicates with Tbox over sockets, and Tbox communicates with the ROM through a virtual PCI device. In the abnormal flow, the UI never receives the “Connected” message after the ROM is started. So there are only two possible causes:

1. The ROM did not start at all.

2. The ROM started, but communication failed.
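
The handshake described above can be sketched as follows. This is a minimal illustration and not the assistant's actual code: the function name `wait_for_connected`, the raw-bytes framing of the message, and the timeout value are all assumptions. The point is only that a bounded wait turns "never received Connected" into an explicit result the UI can act on, instead of a progress bar that stays at 98%.

```python
import socket

# The message name comes from the text; sending it as raw ASCII bytes is an assumption.
CONNECTED = b"Connected"


def wait_for_connected(sock: socket.socket, timeout_s: float) -> bool:
    """Return True if the peer sends the Connected message, False otherwise.

    The timeout applies to each recv() call, not to the whole handshake.
    """
    sock.settimeout(timeout_s)
    buf = b""
    try:
        while len(buf) < len(CONNECTED):
            chunk = sock.recv(len(CONNECTED) - len(buf))
            if not chunk:  # the peer closed the connection
                return False
            buf += chunk
    except OSError:  # includes socket.timeout, which is a subclass of OSError
        return False
    return buf == CONNECTED
```

A caller would treat a `False` result as the starting point for diagnosis: either the guest never booted, or it booted but the channel is broken.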

With the possible causes narrowed down, troubleshooting looked easy, but the follow-up did not go so smoothly.

1) The machine configuration is too low

November 2015. We found that some users could not start the ROM, and what they had in common was a low-end machine. We later found that memory is the main factor affecting VM startup, so the fix was to add a memory check to the installer: installation is refused if the machine has less than 2 GB of memory.

2) The Tbox processes are stuck

November 2015. Following up with several users who hit the 98% hang, we found that tboxManage.exe, tboxSvc.exe, and tboxHeadless.exe (the Tbox processes) may be left frozen if the emulator exits abnormally. The fix is simple: force-terminate these three processes before starting.
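
A minimal sketch of that cleanup step, assuming the Windows `taskkill` command is used (`/F` forces termination, `/IM` selects by image name). The source does not say how the team terminates the processes, so `taskkill` and the helper names `build_kill_commands` and `kill_stale_processes` are assumptions made for illustration.

```python
import subprocess

# Process names as listed in the text.
TBOX_PROCESSES = ("tboxManage.exe", "tboxSvc.exe", "tboxHeadless.exe")


def build_kill_commands(names=TBOX_PROCESSES):
    """Build one taskkill command line per stale process name."""
    return [["taskkill", "/F", "/IM", name] for name in names]


def kill_stale_processes(run=subprocess.run):
    """Run every kill command and return {process name: exit code}.

    A non-zero exit code is expected when a process was not running,
    so the caller should not treat it as a startup failure.
    """
    results = {}
    for cmd in build_kill_commands():
        completed = run(cmd, capture_output=True)
        results[cmd[-1]] = completed.returncode
    return results
```

The `run` parameter exists only so the function can be exercised without actually killing anything, for example by passing a stub in a unit test.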

3) Third-party injection

December 2015. We found that the few users hitting the 98% hang all had the Xunlei online game accelerator installed. The software's xlacclsp.dll is injected into every process, including the emulator's TBoxHeadless.exe process, which causes socket creation to fail. The fix was to block injection of this module.

4) Sockets unavailable because of LSP and similar problems

First half of 2016. We still received a lot of feedback. After following up with a number of users, we found that their 98% hangs were all caused by socket creation failures, for reasons including:

A) A broken LSP chain.

B) VPN problems.

C) Firewall problems.

We decided to replace sockets with pipes for communication. Because the change touches the lower layers, is large, and had to compete with many other feature requests, the release was scheduled for July.
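
As background for why this helps: LSPs hook into the Winsock stack, while local pipes do not go through Winsock, so a broken LSP chain or a socket-level firewall rule should not affect them. The sketch below shows the same "Connected" handshake carried over a pipe. It uses an anonymous `os.pipe()` as a stand-in (a Windows product would more likely use a named pipe), and the 4-byte length prefix and the function names are assumptions, not the team's protocol.

```python
import os
import struct


def send_message(fd: int, payload: bytes) -> None:
    """Write one message: 4-byte little-endian length, then the payload."""
    data = struct.pack("<I", len(payload)) + payload
    while data:
        written = os.write(fd, data)
        data = data[written:]


def read_exact(fd: int, n: int) -> bytes:
    """Read exactly n bytes, raising EOFError if the writer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = os.read(fd, n - len(buf))
        if not chunk:
            raise EOFError("pipe closed before the full message arrived")
        buf += chunk
    return buf


def recv_message(fd: int) -> bytes:
    """Read one length-prefixed message written by send_message."""
    (length,) = struct.unpack("<I", read_exact(fd, 4))
    return read_exact(fd, length)
```

For example, after `r, w = os.pipe()`, calling `send_message(w, b"Connected")` on one side lets the other side get the same bytes back from `recv_message(r)`.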

5) A bug in the new ROM

August 2016. We expected the pipe-based version to fix the 98% hang completely, but the new version still drew a lot of user feedback, so we kept following up with users. In every case, starting Tbox on its own also failed to reach the desktop. Digging further, we found that the VDI file (that is, the ROM image) was corrupted. We then learned on the official forum that this was caused by a bug in 4.4.2 in which the system damages an XML file while parsing it. We had upgraded the system from 4.2.2 to 4.4.2 in the first half of the year. After the official patch was applied, the 98% hang was finally resolved.

Looking back on this process, we took away three lessons.

1) Don’t put too much faith in third-party components

Relying on and trusting third-party components too much often leads to pitfalls. When using a third-party component, study it in depth on the one hand, and build in the necessary fault tolerance or workarounds on the other.

2) Report key information

Although this problem was serious, it only occurred in a small number of user environments and could not be reproduced in the test environment, so progress was slow. More importantly, there was no way to assess its scope. Relying solely on user feedback to drive problem solving is highly unreliable: users are likely to try the product once and then leave for good. Critical-path data should therefore be identified early in the product's life and reported to the backend, so that the impact of failures on the critical path can be assessed promptly and the problems fixed in time.
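
One way to picture critical-path reporting is a small trace object that records each startup stage and serializes the result for upload. The stage names below reuse the function names from the startup flow described earlier; the class `StartupTrace`, its methods, and the JSON layout are hypothetical and only illustrate the idea.

```python
import json
import time

# Stages taken from the startup flow in the text, plus the final handshake.
STARTUP_STAGES = (
    "CheckEnvironment",
    "StartVM",
    "StartVMInternal",
    "Init_devices",
    "Connected",
)


class StartupTrace:
    """Collects per-stage results so the backend can see where startup stopped."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.events = []

    def mark(self, stage, ok=True, detail=""):
        """Record that a stage finished, successfully or not."""
        if stage not in STARTUP_STAGES:
            raise ValueError(f"unknown stage: {stage}")
        elapsed = round(self._clock() - self._start, 3)
        self.events.append(
            {"stage": stage, "ok": ok, "detail": detail, "elapsed_s": elapsed}
        )

    def last_ok_stage(self):
        """Return the most recent successful stage, or None if there is none."""
        ok_stages = [e["stage"] for e in self.events if e["ok"]]
        return ok_stages[-1] if ok_stages else None

    def to_report(self):
        """Serialize the trace as JSON, ready to be sent to the backend."""
        return json.dumps(
            {"last_ok_stage": self.last_ok_stage(), "events": self.events},
            sort_keys=True,
        )
```

Aggregating `last_ok_stage` across all users would show what share of startups stop at each stage, which is exactly the scope information that forum feedback alone could not provide.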

3) Make failure causes as specific as possible

First, the way the product presented the failure was too generic, which raised the cost of locating the problem: any failure to communicate with the virtual machine showed up as the same hang at 98%. Beyond that, several questions are worth thinking about:

A) Can the reason for a communication failure be pinpointed by technical means?

B) Is it possible to detect in advance whether it is the VM’s own problem or a communication problem?

C) Can the cause of a failed socket connection be identified in advance?

D)…

During troubleshooting, break failures down as finely as possible, both in what the product shows and in the logs and data it reports, so that problems can be located quickly and accurately.
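
As an illustration of refining the failure cause, the sketch below maps three observable facts to a distinct failure category instead of one generic hang. The enum `StartupFailure`, the function `classify_startup`, and the three input flags are hypothetical; how each flag would actually be measured is left open.

```python
from enum import Enum
from typing import Optional


class StartupFailure(Enum):
    VM_PROCESS_EXITED = "the VM process is not running"
    CHANNEL_OPEN_FAILED = "could not open the UI-to-Tbox channel"
    NO_CONNECTED_MESSAGE = "channel is open but the guest never sent Connected"


def classify_startup(
    vm_process_alive: bool,
    channel_opened: bool,
    connected_received: bool,
) -> Optional[StartupFailure]:
    """Return None on success, otherwise the most specific failure known."""
    if connected_received:
        return None
    if not vm_process_alive:
        return StartupFailure.VM_PROCESS_EXITED
    if not channel_opened:
        return StartupFailure.CHANNEL_OPEN_FAILED
    return StartupFailure.NO_CONNECTED_MESSAGE
```

Even this coarse split would have separated the socket failures from the cases where the ROM itself did not boot. The last category still covers several causes, such as a corrupted image, and would need further probes to narrow down.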
