What are the common server faults?

Hardware faults. Common server hardware faults include disk damage and battery failure.

Software problems. For example, the operating system crashes or an unidentified program malfunctions.

Virus damage. For example, ransomware encrypts or deletes service data.

Force majeure. Equipment damage and data loss caused by flooding, fire, building collapse, and similar events.

Operator error. Data loss caused by mistaken operations such as formatting, deleting, or overwriting.

How can server failures be reduced or avoided?

Regular inspection and maintenance. The hardware performance of a server degrades over its service life. Check and maintain the equipment periodically so that possible faults are discovered early. For example, slow disk reads and writes, abnormal noise, and disks dropping out of the storage array all indicate an impending fault.

Prepare a server contingency plan. Tailor an emergency plan to your environment, for example a backup server, an emergency power supply, and redundant memory. When the server stops running, you can activate the plan immediately so that services are not affected.

Update software regularly. Periodically update the operating system and the software on the server to reduce exposure to virus attacks.

Keep event logs. Strictly monitor operators and the operations they perform, and automate operations as far as possible.
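The event logging advice above can be sketched with Python's standard logging module. This is a minimal illustration, not a complete audit system: the logger name, the field layout, and the set of actions treated as destructive are all assumptions made for this example.

```python
import io
import logging

# Hypothetical list of actions this sketch treats as destructive.
DESTRUCTIVE_ACTIONS = {"format", "delete", "overwrite"}


def make_audit_logger(stream):
    """Return a logger that writes one timestamped line per operation to `stream`."""
    logger = logging.getLogger("server_audit")  # logger name is an assumption
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep audit lines out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def record_operation(logger, operator, action, target):
    """Log who did what to which target; destructive actions are logged as warnings."""
    level = logging.WARNING if action in DESTRUCTIVE_ACTIONS else logging.INFO
    logger.log(level, "operator=%s action=%s target=%s", operator, action, target)
    return level


# Usage: write to an in-memory buffer here; a real deployment would use a file
# or a central log collector instead.
buffer = io.StringIO()
audit = make_audit_logger(buffer)
record_operation(audit, "alice", "read", "/data/report.doc")
record_operation(audit, "bob", "delete", "/data/report.doc")
print(buffer.getvalue())
```

Logging destructive actions at a higher level makes them easy to filter when reviewing what happened before a data loss.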

How do you recover from a fault?

Server failures can be made less likely, but they cannot be fully controlled, and some will still occur. When a failure does happen, recover as follows.

1. If a fault occurs, activate the emergency mechanism and replace the faulty server with the backup server.

2. Troubleshoot and repair the fault.

3. If data on the server is damaged, shut down the server, back up the server data in its current state, and only then restore the data.
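The "back up before restoring" step above can be sketched as a chunked copy that is verified with a checksum. This is a minimal sketch for ordinary files: the function names and the chunk size are assumptions for illustration, and imaging a failing disk would also need read-error handling that is not shown here.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def copy_with_checksum(src_path, dst_path):
    """Copy src to dst in chunks and return the SHA-256 hex digest of the bytes read."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(src_path, dst_path):
    """Copy src to dst, then re-read dst and confirm it matches what was read from src."""
    return copy_with_checksum(src_path, dst_path) == file_checksum(dst_path)
```

Restoration work should then be done against the verified copy, so that a failed attempt does not make the original data worse.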

HP DL380 Server RAID Information Loss Case

The server in this case is an HP DL380 series machine. Its storage is a RAID5 array of six 73 GB SAS disks. The operating system is Windows Server 2003, and the machine is used as a departmental file server in an enterprise. After the RAID was restarted, an error message reported that the storage device could not be found. Operations attempted after logging in to the RAID management module failed, and the fault persisted after the RAID management module was restarted.

It is not uncommon for an unexpected host power failure to damage a RAID module, whether through loss of the RAID management information or through damage to the RAID hardware. In general, the RAID management information does not change once the RAID has been created. It is still writable, however, so a power accident can easily corrupt or even erase it. Repeated power outages may even damage components on the RAID controller card, leaving the host without the middle layer that manages multiple physical disks as a RAID. In this case, the RAID module crash was most likely caused by RAID card hardware damage (verified by HP after-sales engineers). The data on the six hard disks could not be obtained through normal channels, so a third-party data recovery service was required.

What is the data recovery process like?

1. Perform a thorough physical check of the six SAS disks provided by the user. In this case, all six disks could be read normally.

2. Image the six disks in the faulty RAID group. To keep the data safe, store the images on array storage that has redundancy.

3. After the images are created, analyze the RAID structure of the six image files. Based on the file system's storage rules, determine the disk order, data block size, and parity mode of the RAID5 array, and then reconstruct the RAID group in a virtual environment.

4. Verify the logical data in the reconstructed RAID array to confirm that the parameters used for the reconstruction are correct. Then verify the data that matters most to the user.

5. After confirming that the recovery result is as expected (data is restored to its pre-fault state), migrate all user service data to the user's storage.
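The parity analysis in steps 3 and 4 relies on one property of RAID5: within a stripe, the parity block is the XOR of the data blocks, so the XOR of all blocks in the stripe is zero. The sketch below shows only that property, using hypothetical helper names. It does not determine disk order, block size, or parity rotation, which depend on the controller and must be worked out from the images.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together and return the result."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def stripe_is_consistent(blocks):
    """True if the blocks of one stripe (data plus parity) XOR to all zeros."""
    return not any(xor_blocks(blocks))


def rebuild_block(surviving_blocks):
    """Recompute the one missing block of a stripe from all the other blocks."""
    return xor_blocks(surviving_blocks)


# Usage: a six-disk stripe made of five data blocks and one parity block.
data = [bytes([i] * 8) for i in range(1, 6)]
stripe = data + [xor_blocks(data)]
print(stripe_is_consistent(stripe))
print(rebuild_block(stripe[:2] + stripe[3:]) == stripe[2])
```

Checking many stripes for consistency is one way to test whether a candidate block size lines up across the six images. Disk order and parity rotation still have to be confirmed against file system structures.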

North Asia Tips

1. Ensure that the power supply in the equipment room is stable to minimize the impact of abnormal power supply on hosts and storage devices.

2. Configure a UPS for important servers and storage systems so that core business systems keep running for a period of time after an unexpected power failure in the equipment room. This gives the enterprise time to put emergency measures in place.

3. Periodically check the security status of servers that have been in service for a long time, and evaluate their overall operating status to decide whether a comprehensive hardware and system upgrade is needed. Also prepare emergency handling plans for sudden data disasters in advance to reduce the service losses they cause.

Servers run at high speed for long periods, so failures are relatively frequent. Careful operation can minimize or avoid many of them, and when a failure does occur, data recovery can protect the data on the server and reduce losses.