I have worked in operations and maintenance for a year and a half and have run into all kinds of problems: data loss, website outages, database files deleted by mistake, hacker attacks, and more.

Today I would like to share these lessons with you.

I. Production operation guidelines

1. Test before you operate

When I was learning Linux, everything from the basics through services to clusters was done in virtual machines. The teacher kept telling us a real machine is no different, but the longing for a real environment kept growing, and the habit of leaning on virtual machine snapshots made my hands itch the moment I got operating rights on a real server. I remember that on my first day at work the boss handed me the root password. Since I was only given PuTTY but wanted to use Xshell, I quietly logged in to the server and tried to switch to Xshell with key-based login. Because I did not test it and did not keep a spare SSH connection open, I locked myself out of the server after restarting sshd. Fortunately I had backed up the sshd_config file beforehand, so the machine-room staff could cp it back for me. Luckily this was a small company; otherwise I would have been finished on the spot. I was lucky that year.
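
For reference, the safer workflow would have looked roughly like this (a minimal sketch assuming a stock OpenSSH install; the key is to test the config and keep the old session open until the new login method is confirmed):

    cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak.$(date +%F)   # back up before touching anything
    vi /etc/ssh/sshd_config                                        # make the change
    sshd -t                                                        # syntax-check the new config first
    service sshd restart                                           # restart, but keep this session open
    # then open a NEW connection and confirm the key login works before closing the old one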

The second example is about file synchronization. rsync is known for synchronizing quickly, but it deletes files even faster than rm -rf. rsync has an option that makes one directory mirror another, so if the source directory is empty, the destination directory (full of data) gets wiped. Through misoperation and a lack of testing I wrote the two directories the wrong way around, and the key point was that there was no backup... The data in the production environment was deleted.
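
A dry run would have caught it. The sketch below uses hypothetical paths; the point is the argument order and the -n flag:

    rsync -avn --delete /data/www/ /backup/www/    # -n (--dry-run): only lists what would be copied or deleted
    rsync -av  --delete /data/www/ /backup/www/    # run for real only after reviewing that list
    # with --delete, swapping the two directories makes rsync "mirror" an empty source
    # and wipe the populated destination, which is exactly the accident described above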

With no backup, just think about the consequences; the importance of backups is self-evident.

2. Confirm again before pressing Enter

Typing rm -rf /var by mistake: for people with fast hands, or when the connection is laggy, the odds of this happening are quite high.

At the very least, your heart sinks the moment you realize what you have done.

You might say: I have done this so many times and never made a mistake, don't worry. I just want to say one thing:

once it happens to you, you will understand. Don't assume operational accidents only happen to other people; if you are not careful, you will be the next one.

3. Avoid multiple operators

At my last company, operations management was quite chaotic. A typical example: ops staff who had already left the company still had the root passwords of the servers.

When we received an ops ticket, it was usually a quick check; if we could not solve it, we asked for help. But when a real problem came up, the customer-service supervisor (who knew a bit of Linux), the network admin, and the boss would all be debugging the same server at once. While you are busy searching Baidu and comparing everything, you finish only to discover that the server's configuration file no longer matches what you just changed. So you change it back, Google some more, excitedly find the cause and fix it, and then someone tells you he has also fixed it, by modifying a different parameter... Now nobody really knows which change was the actual fix. Fine, the problem is solved and everyone is happy, but then the file you just changed fails a test, and when you go back to modify it again you find your modification is gone. Truly infuriating. Avoid multi-person operation at all costs.

4. Back up first, operate later

Make it a habit to back up anything before you change it, for example a .conf configuration file.

In addition, when modifying a configuration file, comment out the original option, then copy it and edit the copy.
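
As a small illustration of the habit (nginx.conf here is just an example file):

    cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak.$(date +%F_%H%M)   # timestamped backup first
    # inside the file, keep the old value commented out next to the new one:
    #   #worker_processes 1;     <- original, kept for quick rollback
    #   worker_processes  4;     <- modified copy
    nginx -t    # validate the syntax before reloading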

Furthermore, if there had been a database backup in the earlier rsync example, that misoperation would have been dealt with in no time.

So losing a database is no small matter; even a quick, casual backup would have spared all that misery.

II. Anything involving data

1. Use rm -rf with caution

There are plenty of examples on the Internet: all kinds of rm -rf /, all kinds of deleted master databases, all kinds of operations accidents...

A small slip can cause huge damage. If you really must delete something, do it carefully.
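
One defensive habit, purely as an illustration (the path is hypothetical): look before you delete, and prefer a reversible move when possible.

    ls -ld /data/app_logs_old      # eyeball the exact path you are about to remove
    rm -rf /data/app_logs_old      # reuse the same path only after the ls shows what you expect
    # or give yourself an undo window by moving first and deleting later:
    mv /data/app_logs_old /tmp/app_logs_old.trash.$(date +%F)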

2. Backup is more important than anything else

In the end it all comes back to backups, but I want to list it again under the data section to emphasize once more how important they are.

I remember one of my teachers saying that you can never be too careful when it comes to data

The company I work for has third-party payment websites and online loan platforms

The third-party payment system gets a full backup every two hours, while the online loan platform is backed up every 20 minutes.

I’ll leave it to you to decide
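
How you implement that is up to you; as one possible sketch, a cron-driven mysqldump could look like this (database name, credentials, and paths are placeholders, not the setup described above):

    # /etc/cron.d/db-backup  -- run the dump script every 20 minutes
    # */20 * * * * root /usr/local/sbin/db_backup.sh

    # /usr/local/sbin/db_backup.sh
    #!/bin/bash
    mysqldump --single-transaction -u backup -p'secret' appdb \
      | gzip > /backup/appdb.$(date +%F_%H%M).sql.gz
    find /backup -name 'appdb.*.sql.gz' -mtime +7 -delete   # keep roughly a week of dumps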

3. Stability trumps everything else

In fact, this goes beyond data: across the entire server environment, stability trumps everything. The goal is not the fastest system but the most stable one, because what matters is availability.

So don't put new software such as Nginx + php-fpm on a production server without testing it first.

Otherwise you end up restarting it all the time, or simply switching back to Apache.

4. Confidentiality trumps everything else

These days photo-leak scandals and router backdoors are everywhere, so wherever data is involved, you cannot afford to be careless about confidentiality.

III. Anything involving security

1. ssh

Change the default port (of course, a professional who is determined to hack you will still scan it out).

Disable root login.

Use an ordinary user + key authentication + sudo rules + IP address and user restrictions.

Use anti-brute-force software such as Hostdeny (after a few failed attempts, block the IP outright).

Restrict which of the users in /etc/passwd are actually allowed to log in (a sample sshd_config fragment covering several of these points is sketched below).
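
A purely illustrative sshd_config fragment; the port, user name, and network are examples, not recommendations for any specific environment:

    Port 2222                      # non-default port
    PermitRootLogin no             # no direct root login
    PasswordAuthentication no      # key authentication only
    PubkeyAuthentication yes
    AllowUsers deploy@10.0.0.*     # restrict both the user and the source network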

2. The firewall

In production the firewall must be enabled and must follow a least-privilege ruleset: drop everything by default and open only the ports the services actually need.
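
A minimal default-drop sketch with iptables (the open ports are examples; open only what your services actually use, and set the DROP policy last so you do not cut off your own session):

    iptables -A INPUT -i lo -j ACCEPT                                # allow loopback
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # allow replies to existing connections
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT                    # ssh (or your custom port)
    iptables -A INPUT -p tcp --dport 80 -j ACCEPT                    # web
    iptables -P INPUT DROP                                           # then drop everything else by default
    service iptables save                                            # persist (CentOS 6 style)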

3. Fine-grained permissions and control

If a service can be started by an ordinary user, do not run it as root. Keep every service's permissions to the minimum, and keep the granularity of control fine.
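
As an illustration of the principle (the user name, paths, and sudo rule are hypothetical):

    useradd -r -s /sbin/nologin appuser          # service account with no login shell
    chown -R appuser:appuser /opt/app            # the service owns only its own directory
    # /etc/sudoers.d/appadmin -- the operator gets exactly one command, nothing more:
    #   appadmin ALL=(root) NOPASSWD: /sbin/service app restart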

4. Intrusion detection and log monitoring

Use third-party software to continuously detect changes to key system files and service configuration files,

for example /etc/passwd, /etc/my.cnf, and /etc/httpd/conf/httpd.conf.
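
Real deployments use tools such as Tripwire or AIDE for this; the core idea can be shown with a bare checksum check (the paths and alert address are illustrative):

    md5sum /etc/passwd /etc/my.cnf /etc/httpd/conf/httpd.conf > /var/tmp/baseline.md5   # record a baseline
    # later, from cron:
    md5sum -c --quiet /var/tmp/baseline.md5 || echo "baseline mismatch on $(hostname)" | mail -s "config file changed" ops@example.com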

Use a centralized log monitoring system to watch alarm and error logs such as /var/log/secure and /var/log/messages, as well as FTP upload and download activity.
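
Even without a centralized system, a quick manual pass over /var/log/secure shows the kind of pattern such monitoring alerts on:

    # top source addresses of failed SSH logins
    grep "Failed password" /var/log/secure | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head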

In addition, third-party software can watch for port scans and, when a scan is detected, add the offending address straight into hosts.deny. All of this information is very helpful once a system has actually been compromised. It has been said that what a company invests in security is proportional to what it would lose to a security breach. Security is a big topic,

but it is also very basic work: getting the fundamentals right already improves system security considerably, and the rest is for the security specialists.

IV. Daily monitoring

1. Monitor system operation

Many people start their ops careers in monitoring; large companies generally have dedicated 24-hour monitoring staff. System monitoring mostly covers hardware usage:

commonly memory, disk, CPU, and network interfaces, plus OS-level items such as login monitoring and monitoring of key system files.

Regular monitoring helps predict the likelihood of hardware failure and provides useful data for tuning.
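
The raw numbers behind most of those checks come from a few basic commands; a crude sketch (the 85% threshold is an arbitrary example):

    df -h  | awk '$5+0 > 85 {print "disk usage high on " $6 ": " $5}'   # any filesystem above 85%
    free -m | awk '/Mem:/ {printf "memory used: %d of %d MB\n", $3, $2}'
    uptime                                                              # load averages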

2. Monitor service operation

Service monitoring covers the various applications: web servers, databases, LVS, and so on, usually by watching a handful of metrics,

so that performance bottlenecks in the system can be quickly identified and resolved.
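
A couple of liveness probes of the kind a service monitor runs (the URL and credentials are placeholders):

    curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1/     # expect 200 from the web tier
    mysqladmin -u monitor -p'secret' ping                          # expect "mysqld is alive"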

3. Monitor logs

Log monitoring here is similar to the security log monitoring above, except that the focus is on error and alert messages from the hardware, the OS, and the applications.

Monitoring feels useless while the system is running smoothly, but when something goes wrong and you have no monitoring, you end up in a very passive position.

V. Performance tuning

1. Understand the underlying mechanism in depth

To be honest, with only a bit more than a year of ops experience, my views on tuning count for little; I just want to summarize briefly and will update this as my understanding deepens. Before optimizing a piece of software, understand how it works. Take Nginx and Apache: everyone says Nginx is fast, but you must know why it is fast, what mechanism it uses, and how it handles requests differently from Apache, and you should be able to explain it to others in plain language. If necessary you should be able to read the source code; otherwise, every document about tuning parameters is just nonsense.

2. Tuning framework and order

Once you are familiar with the underlying mechanism, you can work out a tuning framework and an order of attack. Take a database bottleneck: many people jump straight to editing the database configuration file. My advice is to analyze the bottleneck first, check the logs, write down the tuning direction, and only then act. Tuning the database server itself should be the last step; the first things to look at are the hardware and the operating system. Today's database servers are released only after extensive testing

and run well on all mainstream operating systems, so the database software itself is not where you should start.

3. Adjust one parameter at a time

One parameter at a time. Everyone knows this, but it bears repeating: change several at once and you will no longer know which one mattered.

4. Benchmark

Benchmarking is necessary to determine whether a tuning change actually helps, and to test the stability and performance of a new software version. Benchmarking involves many factors;

whether the test comes close to the real requirements of the business depends on the tester's experience. For background, High Performance MySQL, 3rd Edition is quite good.
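
For a web tier, even something as simple as ApacheBench makes the "measure before and after" point (the request counts and URL are arbitrary examples):

    ab -n 1000 -c 10 http://127.0.0.1/index.html   # 1000 requests, 10 concurrent: run before and after the change and compare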

As my teacher once said, there is no one-size-fits-all parameter, and any parameter change or tuning must fit the business scenario

So stop blindly applying tweaks you googled; changes that ignore your business environment will not bring a lasting improvement.

VI. The operations mindset

1. Control your emotions

A lot of rm -rf /data accidents happen in the last few minutes before the end of the workday, so learn to control your mood.

Of course you will sometimes be upset at work, but when you are upset you can at least avoid touching critical data environments.

The more stressed you are, the more calm you must be, or you will lose more.

For example, if rm -rf /data/mysql really does happen, stay calm: cut off the business traffic, but do not shut down the MySQL database (this matters a great deal for recovery), then use dd to copy the hard drive, and you may be able to restore the data.

Of course, most of the time you have to go to a data recovery company.

Imagine instead that after the data is deleted you panic, run all kinds of commands, shut down the database, and then try to repair it: not only might you overwrite the deleted files, you also lose whatever the running process still held in memory.
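
This is why keeping mysqld running matters: a deleted file that the process still holds open can often be copied back out of /proc, and dd can image the disk for offline recovery work (the fd number and paths below are hypothetical):

    ls -l /proc/$(pidof mysqld)/fd | grep deleted         # deleted-but-open files are marked "(deleted)"
    cp /proc/$(pidof mysqld)/fd/5 /root/recovered_ibdata  # copy a still-open file out via its descriptor
    dd if=/dev/sda of=/backup/sda.img bs=4M conv=noerror,sync   # raw image of the disk for recovery tools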

2. Take responsibility for your data

The production environment is not a game, the database is not a game, you must be responsible for the data. The consequences of not backing up can be very serious.

3. Get to the bottom of it

Many ops people are busy, so once a problem looks solved they stop digging into it. I remember last year a client's website kept failing to open; the error reported by the PHP code

showed that the session and WHOS_Online tables were corrupted. The previous ops person had fixed this with a repair, and I repaired it the same way, but a few hours later it happened again.

After it repeated three or four times, I googled the reasons a table can become corrupted for no obvious reason: one, a MyISAM bug; two, a MySQL bug; three, MySQL being interrupted in the middle of a write.

In our case the mysqld process was being killed by the OOM killer because memory ran out,

and there was no swap partition. The monitoring dashboards had always shown memory as sufficient, but in the end upgrading the physical memory solved it.
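
An OOM kill like that leaves clear traces in the logs, which is the quickest way to confirm this class of corruption:

    grep -i "out of memory" /var/log/messages    # the kernel logs the OOM event
    dmesg | grep -i "killed process"             # shows which process the OOM killer chose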

4. Test and production environments

Be sure to check which machine you are on before any important operation, and try to avoid having many terminal windows open at once.
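
One small guard that helps: make the shell prompt show the host clearly, and glance at it before anything destructive (the PS1 below is just an example):

    export PS1='[\u@\h \W]\$ '   # \h puts the hostname in every prompt
    hostname                     # or check explicitly before an important operation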

Original: http://server.51cto.com/0S-582314.htm