This article is based on a talk given at the GOPS 2017 Beijing conference and is published by the Efficient Operations community, which aims to accompany you as your career grows.

About the author:

Zhao Shundong, founder of the China SaltStack user group, known in the community as "Monitor Zhao". He was formerly responsible for command automation architecture and operations work in a department of the Armed Police. After leaving the service in 2008 he has worked in Internet operations, serving successively as operations engineer, operations manager, operations architect, and operations director. He is the author of "SaltStack Technology Introduction and Practice" and of the "Operations Knowledge System", a GOPS gold medal lecturer, and an EXIN DevOps Master certified lecturer.

What are operations doing?

Here are some common scenarios from the daily work of an operations engineer:

Hi, buddy, I just released a new feature. Can you check whether there is anything abnormal in the run log? … Ops: Sure, right away.

Tester: Hi, the traffic on interface A has dropped. Can you check the ERROR log for me? … Ops: OK, hold on.

Hi, good buddy, can you pull that log file out of the log directory for me? … Ops: OK, I'll get it to you in a bit.

In addition, my previous operations teams were busy with these things all day long: deploying, running scripts, fetching files, analyzing logs with awk, and so on, while being run completely off their feet.

I asked a friend of mine, a systems operations engineer, what he was busy with every day. He said that logs account for about a third of his daily workload; the second biggest item is deployment, uploading code to servers and so on.

So what is operations busy with every day? A large part of it is dealing with log-related issues.

Log requirements

From an operations perspective, logs need to be collected, and the requirements fall into five areas.

  • Requirement 1: System logs

The first requirement is understanding the running state of the operating system, and system logs are very useful here. For example, how do you tell that a memory module has gone bad? Usually a failing module leaves a message in the system log, or you can alert when available memory falls below a threshold. But sometimes a machine has 8 GB of RAM, one module has failed, and nobody knows; you may not even see it in the hardware logs. System logs let you find this kind of problem.

  • Requirement 2: Access logs

Back when I worked in advertising, access logs were our lifeblood; we relied on them for a lot of data. They are used for statistical analysis of traffic sources, URL request frequency, response time, success rate, and so on (see the Logstash sketch after this list).

  • Requirement 3: Run logs

Run logs help you understand the details of how a service is running, its runtime exceptions, and whatever else the service writes out.

  • Requirement 4: Error logs

You need to be able to find error logs by keyword. For example, error logs often correlate with traffic fluctuations; you can use this correlation to trace a problem across the whole chain.

  • Requirement 5: Correlated business logs

Finally, logs correlated with the business. Here is an example I ran into. In the early days of our e-commerce business, the boss came over and asked: normally we only get a few dozen orders per minute, so why, for a period recently, were we suddenly getting hundreds of orders per minute?

At first I thought it was strange: why ask me? Later I realized it was because operations had not done its job. A flood of orders suddenly came in and operations knew nothing about it, yet operations is in the best position to know; with the logs you can figure out where those orders came from by one means or another.

Therefore, I think the goal of operations is to "know what is happening and why it is happening". For example, if your boss asks, "Why are there so many orders in the last five minutes?", you can tell him, "Because we ran a two-for-one promotion and a lot of people showed up."
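As a concrete illustration of the access-log requirement above, here is a minimal Logstash sketch (my own illustration, not from the talk): it reads an Nginx access log written in the combined format and parses it with grok, so that traffic sources, URLs, status codes, and so on become structured fields. The file path, Elasticsearch address, and index name are assumptions.

```
input {
  file {
    path => "/var/log/nginx/access.log"   # assumed access-log location
    type => "nginx-access"
  }
}

filter {
  grok {
    # COMBINEDAPACHELOG matches the combined format that nginx can be configured to emit
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]              # assumed Elasticsearch node
    index => "nginx-access-%{+YYYY.MM.dd}"   # hypothetical daily index name
  }
}
```

With the fields split out like this, request frequency per URL, status-code ratios, and traffic sources can be charted directly in Kibana.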

Pain points in the log collection environment

To start with, why collect logs centrally? Developers need to query logs, but developers should not have login access to the servers. Many large companies now do not even allow operations staff to SSH into production machines, because ad-hoc changes made over an SSH login can leave that node different from the other nodes, and once nodes drift apart, problems follow.

Every system produces logs, but the log data is scattered and hard to find. This pain point has two aspects that need to be addressed: one is how to collect the logs, and the other is standardization.

For an application, you need to know what it logs; identify the logging requirements while you are gathering the overall requirements. We once had a user service that was under heavy load, so we needed to split it out into microservices. At that point there were two candidate approaches: the first was RPC,

and the second was HTTP. At the time, R&D took the needs of operations into account: the first was access control, and the second was monitoring. The monitoring requirements included cluster health checks and call/response statistics.

Once these requirements were on the table, RPC was ruled out unless we built a framework for it; otherwise we would use HTTP. A system has to be operated, so non-functional requirements are a large part of the picture, and operations requirements must be considered when the requirements are drawn up.

Origin of the ELK architecture, the tool that solves these pain points

The old ELK architecture

The old ELK architecture was Logstash, Elasticsearch, and Kibana. What do you do in the age of containers? Our earlier approach was to run one Logstash container on each physical machine. Once all the business containers were started, their log directories could be picked up: we standardized the directories, mounted them into the Logstash container, and had it ship the logs away. This approach costs almost nothing and has no technical bottlenecks.
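A minimal sketch of that per-host collection container, under the assumption that every business container writes its logs beneath one standardized directory that is mounted into the Logstash container (the paths and hostnames here are made up for illustration):

```
input {
  file {
    path => "/data/container-logs/*/*.log"   # standardized log directory mounted from the host
    start_position => "beginning"
  }
}

output {
  elasticsearch {
    hosts => ["es1:9200"]   # assumed Elasticsearch address
  }
}
```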

Elasticsearch is a distributed full-text search engine built on Lucene. It provides a native Java API as well as a RESTful API.

Logstash serves as the data collection side and also has stream-processing capabilities. Kibana is the front end for displaying the data; it can present a variety of complex data structures and is highly flexible.

The Elastic Stack today

ELK is now called the Elastic Stack, because Elastic acquired the Beats tools. Beats covers log collection, network data collection, metrics collection, and more, which expands the whole ecosystem.
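For example, if Filebeat is used on the servers to do the actual log shipping, the Logstash side only needs a beats input listening on a port. A minimal sketch (the port number is just the conventional example):

```
input {
  beats {
    port => 5044   # Filebeat instances are pointed at this host:port
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]   # assumed Elasticsearch node
  }
}
```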

Elastic now also offers its stack as SaaS, and SaaS is a big part of where the industry is heading.

ELK: getting started with Elasticsearch

This section introduces the basics of Elasticsearch. An Elasticsearch cluster has three states: green, yellow, and red. Green means all primary and replica shards are working properly.

An Elasticsearch shard is actually a Lucene instance on a node. By default an index is divided into five primary shards, each with one replica.

Suppose you have two machines and one of them dies; the replica shards on the surviving machine are promoted to primary shards, which is what gives you high availability. Cluster yellow means all primary shards are working properly but some replica shards are not: replicas may be missing or not yet allocated, but no data is lost.

Red means some primary shards in the cluster are not working properly, and there is a risk of data loss.

Elasticsearch cluster discovery was originally multicast, but because multicast discovery has problems it was changed to unicast in 2.x; the current 5.1 release uses unicast discovery. In the green state, when a node joins the cluster, the cluster automatically rebalances shards and migrates some of them to the new node.

The number of replicas of an index can be adjusted. With a fixed number of nodes, adding replicas does not improve performance, only redundancy: with three replicas, for example, you will not lose data even if two machines go down at the same time. The number of primary shards, by contrast, is tuned according to data size.

What happens when a node goes down? Elasticsearch handles it. If the master node fails, the cluster elects a new master; after the election completes, the shards in the cluster are rebalanced automatically, and if a replica shard is lost a new one is created automatically.

Any Elasticsearch node can accept a query request; the request is distributed to the nodes that hold the relevant shards, each node returns its results, and the node that received the request merges them into the final response. The following figure shows the query process.

ELK: introduction to Logstash

Here is how to use Logstash for log collection, with some examples. The first is a Logstash "hello world", printed directly on the command line. Logstash has three stages: input, filter, and output. In this example the input is stdin and the output is stdout: type "Hello world" and you see the event come back, with your data in the message field and a timestamp added by Logstash (a minimal sketch follows).
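The "hello world" pipeline described here is roughly the following sketch; the config file name is arbitrary:

```
# Read events from standard input and write them to standard output
# using the rubydebug codec so every field is printed.
input {
  stdin { }
}

output {
  stdout { codec => rubydebug }
}
```

You can run it with `bin/logstash -f hello.conf`, or pass the same pipeline inline with the `-e` flag; type a line such as "Hello world" and Logstash prints the event back with its `message` field and `@timestamp`.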

For example, I can write to Elasticsearch directly from standard input. I put a config file here: the input is stdin, and the output section writes to Elasticsearch through the elasticsearch plugin; there is also a stdout output that displays events through rubydebug. Learning Logstash is largely about learning how to use the various plugins. For instance, to collect system logs you read the log files with an input plugin, and the data is written to Elasticsearch through an output plugin (sketched below).
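A sketch of that kind of pipeline for system logs: the file input reads the syslog file, and the output section writes both to Elasticsearch and to the console. Paths and addresses are assumptions:

```
input {
  file {
    path => "/var/log/messages"       # system log on a typical Linux host (assumed path)
    start_position => "beginning"
    type => "syslog"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]       # assumed Elasticsearch node
  }
  stdout { codec => rubydebug }       # also print each event for debugging
}
```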

In the Elastic Stack the Logstash, Elasticsearch, and Kibana version numbers are now unified; to learn them you can go straight to the official documentation.

Let's look at a demo. In the demo I type something in and it goes into Elasticsearch, while also being printed through rubydebug. The output plugin is elasticsearch, and at this point there are two indices in Elasticsearch. If you do not specify an index name, Logstash generates one automatically with a date suffix. The simplest way to collect logs is to read a file and write it straight to Elasticsearch.
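The index naming behaviour mentioned here is controlled in the elasticsearch output: when `index` is omitted, events go to a date-suffixed default index (logstash-YYYY.MM.dd); setting it explicitly gives you your own daily indices. A sketch with a made-up index name:

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]           # assumed Elasticsearch node
    index => "demo-logs-%{+YYYY.MM.dd}"   # hypothetical name; one index per day
  }
  stdout { codec => rubydebug }
}
```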

Next comes decoupling log collection with a message queue; this is the classic architecture today. The data source is defined in the input section, where a plugin collects the data, and the output section writes the data to the message queue through a queue plugin. Logstash has many message queue plugins: Kafka, RabbitMQ, ZeroMQ, and so on (see the shipper sketch below).
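A sketch of the shipper side of that architecture: collect a log file and push the raw events into Kafka without any processing. The broker address and topic name are assumptions, and the option names follow the Logstash Kafka output plugin:

```
input {
  file {
    path => "/var/log/nginx/access.log"   # assumed log to ship
    type => "nginx-access"
  }
}

output {
  kafka {
    bootstrap_servers => "kafka1:9092"   # assumed Kafka broker
    topic_id          => "app-logs"      # assumed topic
    codec             => json            # keep the whole event as JSON
  }
}
```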

What does decoupling buy you? After decoupling, even if everything on the back end goes down, no logs are lost as long as the message queue stays up, because everything is written to the queue first. And the queue itself can be kept up by running it as a cluster.

What else does decoupling give you? The data lands in the message queue intact, and I do no processing on the machine that collects it; if I did, then under heavy data volume the CPU might not keep up and would affect the normal business.

Behind the message queue we attach Logstash instances that consume from it and write to Elasticsearch; that is where the data is processed, in Logstash's filter stage. Multiple consumer instances can be run and scaled with the consumption load (a consumer sketch follows).
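And a sketch of the consumer side: read from Kafka, do the parsing in the filter stage, and write to Elasticsearch. The broker, topic, consumer group, and index names are assumptions; the option names follow the Logstash 5.x Kafka input plugin. Running several copies of this pipeline with the same group_id spreads the consumption load:

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics            => ["app-logs"]
    group_id          => "logstash-indexer"   # shared by all consumer instances
    codec             => json
  }
}

filter {
  grok {
    # example: parse combined-format access lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["es1:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```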

ELK: introduction to Kibana and enterprise practice

Introduction to Kibana

A quick introduction to Kibana. You cannot draw charts without data, but the charting itself is relatively simple, so do not worry about learning it; as long as you have the data, I am sure you can do it.

Take a quick look at the Kibana interface. It has a time range selector, for example "today". People often say, "I just wrote a log line, why can't I see it?"; usually it is just the time range. There are both relative and absolute time ranges. All searches can be saved, and saved searches can be reused for other things: when you build a visualization, for example, a simple count of matching rows can be built directly on top of a saved search.

You can put multiple visualizations onto a dashboard, and dashboards can also be saved and reused. With Kibana you can build very attractive dashboards. For example, we previously built a business dashboard and an operations dashboard: on one side the on-call roster, and on the other the number of user registrations per minute, a pie chart of daily active users, core database logs, and so on.

ELK enterprise practice

This is our Elasticsearch-based architecture; the main pieces are collection, the message queue, and the various stores. Because Logstash has so many plugins, everything is handled through Logstash. The current ELK-related book is the "ELK Authoritative Guide", available in print or as a free PDF.

With ELK, remember one thing I said: the tooling itself is not the point; the point is how to push log standardization inside the company. If the logs are not standardized, collecting a pile of random logs is useless, and none of the valuable things you want to do afterwards can be done. Only then does it make sense to do unified collection with the tools.

Logstash can also do alerting, for example sending a notification somewhere when a keyword is matched. You can also write all logs to Elasticsearch and query Elasticsearch for alerting. Another way is to alert directly in the pipeline when a keyword shows up (a sketch follows).
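A sketch of that kind of in-pipeline keyword alerting, assuming the logstash-output-email plugin is installed; the addresses are placeholders and the keyword is just an example:

```
filter {
  if "ERROR" in [message] {
    mutate { add_tag => ["alert"] }   # mark events that contain the keyword
  }
}

output {
  if "alert" in [tags] {
    email {
      to      => "oncall@example.com"          # placeholder recipient
      subject => "ERROR matched on %{host}"
      body    => "%{message}"
    }
  }
  elasticsearch { hosts => ["es1:9200"] }      # everything still goes to Elasticsearch as usual
}
```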

For example, suppose traffic suddenly dips. You check the Nginx error log and find something there; you can also check the PHP logs, and then the MySQL slow query log. Parsing the MySQL slow query log needs regular-expression matching, and then a stream of people come to ask me whether their regex is right; that is where things break down. So establishing log standardization is critical.