
Introduction

Now that we’ve covered the basic use of the Linux three musketeers (grep, sed, and awk) in detail, let’s look at how they apply to performance testing. This article focuses on statistical analysis of Tomcat and Nginx access logs.

Tomcat: counting request and response times

In server.xml, configure the access log valve as follows; %D records the request processing time and %F the time taken to commit the response:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
prefix="localhost_access_log." suffix=".txt"
pattern="%h %l %u [%{yyyy-MM-dd HH:mm:ss}t] %{X-Real_IP}i &quot;%r&quot; %s %b %D %F" />

The fields are described as follows:

  • %h – The IP address of the client that initiates the request. The IP address recorded here is not necessarily the IP address of a real user client. It may be the public mapped IP address of a private client or the IP address of a proxy server.

  • %l – the RFC 1413 identity of the client. Only clients that implement the RFC 1413 (identd) specification can provide this information.

  • %u – the user name of the remote client, recording the name the user provided for authentication (for example, zuozewei); blank if the user is not logged in.

  • %t – the time the request was received (time and time zone of the access, e.g. 18/Jul/2018:17:00:01 +0800; the trailing "+0800" means the server’s time zone is 8 hours ahead of UTC)

  • %{X-Real_IP}i – the real IP address of the client (taken from the X-Real_IP request header, typically set by a reverse proxy)

  • %r – The request line from the client (the request URI and HTTP protocol, which is the most useful information in the entire PV logging of what a request was received by the server)

  • %s – the status code the server returns to the client; for example, 200 for success.

  • %b – Size of the body content of the file sent to the client, excluding the size of the response header (this value can be added up from each record in the log to give a rough estimate of server throughput)

  • %{Referer} I – record which page link was accessed from (the content of the request header Referer)

  • %D – The time in milliseconds to process the request

  • %F – the time taken to commit the response, in milliseconds

Log example:

47.203.89.212 - - [19/Apr/2017:03:06:53 +0000] "GET / HTTP/1.1" 200 10599 50 49
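As a quick sketch of what these fields enable, the snippet below computes the average and maximum request time (%D). It assumes the field layout of the example line above, where %D is the 11th whitespace-separated field, and the file name tomcat_sample.log is illustrative; adjust the index if your AccessLogValve pattern differs.

```shell
# Sample lines in the layout of the example above (field 11 = %D, in ms).
cat > tomcat_sample.log <<'EOF'
47.203.89.212 - - [19/Apr/2017:03:06:53 +0000] "GET / HTTP/1.1" 200 10599 50 49
47.203.89.212 - - [19/Apr/2017:03:06:54 +0000] "GET /a HTTP/1.1" 200 512 150 149
47.203.89.212 - - [19/Apr/2017:03:06:55 +0000] "GET /b HTTP/1.1" 200 256 100 99
EOF

# Average and maximum request processing time (%D), in milliseconds
awk '{sum += $11; if ($11 > max) max = $11} END {printf "avg=%.1fms max=%dms\n", sum/NR, max}' tomcat_sample.log
```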

Nginx: counting request and upstream response times

We extend the classic default combined format with $request_time and $upstream_response_time.

nginx.conf uses the following configuration:

log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent $request_time $upstream_response_time "$http_referer" "$http_user_agent" "$http_x_forwarded_for"';

The fields are described as follows:

  • $remote_addr – The IP address of the client that initiates the request. The IP address recorded here is not necessarily the IP address of a real user client. It may be the public mapped IP address of a private client or the IP address of a proxy server.
  • $remote_user – user name of the remote client, recording the name the user provided for authentication (for example, zuozewei); blank if the user is not logged in.
  • [$time_local] – the time the request was received (time and time zone, e.g. 18/Jul/2018:17:00:01 +0800; the trailing "+0800" means the server’s time zone is 8 hours ahead of UTC).
  • "$request" – the request line from the client (request URI and HTTP protocol; the most useful part of the record, showing exactly what the server was asked for).
  • $status – The server returns the client status code, such as 200 for success.
  • $body_bytes_sent – size of the response body sent to the client, excluding response headers (summing this field across log records gives a rough estimate of server throughput).
  • $request_time – Total time of the entire request, in seconds (including the time to receive the request data from the client, the time to respond from the backend program, and the time to send the response data to the client (excluding the time to write the log))
  • $upstream_response_time – response time from upstream during a request (in seconds)
  • “$http_referer” – records which page link was visited from (the content of the request header Referer)
  • "$http_user_agent" – client browser information (the User-Agent request header).
  • "$http_x_forwarded_for" – the real IP address of the client. A web server placed behind a reverse proxy sees the proxy’s address in $remote_addr; the reverse proxy can add an X-Forwarded-For header to forwarded requests, recording the original client IP and the server address the client originally requested.

Log example:

218.56.42.148 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 0 0.023 - "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36" "-"

Introduction to AWK

1. Basic concepts

To understand AWK programs, let’s first go over the basics.

AWK programs can consist of one or more lines of text, the core of which is a combination of patterns and actions.

pattern { action }

The pattern matches lines of the input text, and awk performs the action for every line the pattern matches; the action is enclosed in curly braces after the pattern. Awk scans the input sequentially, using the record separator (typically a newline) to delimit each record it reads, and the field separator (typically spaces or tabs) to split a record into fields. The fields can be referenced as $1, $2, …, $n: $1 is the first field, $2 the second, and $n the nth, while $0 denotes the entire record. Either the pattern or the action may be omitted. With no pattern, every line matches; with no action, the default action {print} is executed, printing the entire record.
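A minimal illustration of the pattern { action } form (the sample input is made up for demonstration):

```shell
# The pattern selects lines whose 2nd field exceeds 100;
# the action prints the 1st field of each matching line.
printf 'alpha 50\nbeta 200\ngamma 300\n' | awk '$2 > 100 { print $1 }'
# Omitting the pattern matches every line; omitting the action prints $0.
```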

Nginx access.log is used as the example here; Tomcat logs can be handled the same way. Let’s use AWK to break an Nginx access log line into fields:

218.56.42.148 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 0 0.023 - "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36" "-"
  • $0 is the whole line
  • $1 is the client IP, "218.56.42.148"
  • $4 is the first half of the request time, "[19/Apr/2017:01:58:04"
  • $5 is the second half of the request time, "+0000]"

And so on… With the default field separator, we can parse the following kinds of information out of the log:

awk '{print $1}' access.log       # client IP ($remote_addr)
awk '{print $3}' access.log       # remote user ($remote_user)
awk '{print $4,$5}' access.log    # date and time ([$time_local])
awk '{print $7}' access.log       # URI ($request)
awk '{print $9}' access.log       # status code ($status)
awk '{print $10}' access.log      # response size ($body_bytes_sent)
awk '{print $11}' access.log      # request time ($request_time)
awk '{print $12}' access.log      # upstream response time ($upstream_response_time)

It’s easy to see that with only the default field separator it is hard to parse out the remaining information, such as the request line, referer, and browser type, because those fields contain a variable number of spaces. We therefore change the field separator to the double quote (") so this information can be read easily.

awk -F\" '{print $2}' access.log        # request line ($request)
awk -F\" '{print $4}' access.log        # referer ($http_referer)
awk -F\" '{print $6}' access.log        # browser ($http_user_agent)
awk -F\" '{print $8}' access.log        # real IP ($http_x_forwarded_for)

Note: the backslash in -F\" escapes the double quote so the shell does not misread it as the start of a quoted string. We now know the basics of AWK and how it parses logs.
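A quick self-contained check of this splitting, using a shortened sample line (the field positions assume the combined format shown earlier; the user agent is abbreviated for readability):

```shell
# With " as the separator: $2 = request line, $6 = user agent.
line='1.2.3.4 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 0 0.023 - "-" "Mozilla/5.0"'
echo "$line" | awk -F\" '{print $2}'   # request line
echo "$line" | awk -F\" '{print $6}'   # user agent
```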

2. Examples of application scenarios

The examples below use an Nginx access.log; Tomcat logs can be handled the same way.

2.1 Browser Type Statistics

If we want to know which types of browsers have visited the site, in descending order of occurrences, we can use the following command:

awk -F\" '{print $6}' access.log | sort | uniq -c | sort -nr

This command line first parses out the browser field, then pipes it into the first sort, which groups identical lines so that uniq -c can count the occurrences of each browser. The final sort outputs those counts in descending order.

2.2 Discovering problems in the system

We can use the following command line to measure the status code returned by the server and discover possible problems with the system.

awk '{print $9}' access.log | sort | uniq -c | sort

Normally, status codes 200 or 30x are the most frequent. 40x indicates a client-side access problem; 50x indicates a server-side problem. Here are some common status codes:

  • 200 – The request was successful and the desired response header or data body for the request is returned with this response.
  • 206 – The server has successfully processed some of the GET requests
  • 301 – The requested resource has been permanently moved to a new location
  • 302 – The requested resource now temporarily responds to requests from a different URI
  • 400 – Incorrect request. The current request was not understood by the server
  • 401 – The request is not authorized and the current request requires user authentication.
  • 403 – Access denied. The server understood the request, but refused to perform it.
  • 404 – File not found, resource not found on server.
  • 500 – The server encountered an unexpected situation that caused it to be unable to complete processing the request.
  • 503 – The server is currently unable to process requests due to temporary server maintenance or overload.

The HTTP status codes are defined at www.w3.org/Protocols/r…
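Building on the command above, a sketch that also shows each status code’s share of all requests (this assumes $9 is the status field, as in the format above; the sample file status_sample.log is illustrative):

```shell
# Build a small sample log where field 9 is the status code.
cat > status_sample.log <<'EOF'
1.1.1.1 - - [19/Apr/2017:01:58:04 +0000] "GET / HTTP/1.1" 200 10 0.1 -
1.1.1.2 - - [19/Apr/2017:01:58:05 +0000] "GET /x HTTP/1.1" 404 10 0.1 -
1.1.1.3 - - [19/Apr/2017:01:58:06 +0000] "GET / HTTP/1.1" 200 10 0.1 -
1.1.1.4 - - [19/Apr/2017:01:58:07 +0000] "GET / HTTP/1.1" 500 10 0.1 -
EOF

# Count each status code and print its percentage of all requests.
awk '{count[$9]++}
     END {for (c in count) printf "%s %d %.1f%%\n", c, count[c], 100*count[c]/NR}' \
    status_sample.log | sort
```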

2.3 Statistics on status codes

Find and display all requests with status code 404

awk '($9 ~ /404/)' access.log

Statistics for all requests with status code 404

awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort

Now let’s assume that a request (for example: URI: /path/to/notfound) generates a large number of 404 errors. We can find out which referenced page the request came from and which browser it came from by using the following command.

awk -F\" '($2 ~ "^GET /path/to/notfound "){print $4,$6}' access.log

2.4 Tracking down who is hotlinking images from the site

Sometimes you find that other websites are embedding images hosted on your site in their own pages. If you want to know who is using your images without authorization, you can use the following command:

awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) {print $4}' access.log | sort | uniq -c | sort

Note: Before use, change www.example.com to the domain name of your own website.

  • split each line on the double-quote character;
  • the request line must contain .jpg, .gif, or .png;
  • the referer must not start with your own site’s domain name (here, www.example.com);
  • display all referring pages and count their occurrences.

2.5 IP statistics

Count the number of distinct client IPs:

awk '{print $1}' access.log | sort | uniq | wc -l

Count the number of pages visited per IP:

awk '{++S[$1]} END {for (a in S) print a,S[a]}' log_file

Order the number of pages visited by each IP from smallest to largest:

awk '{++S[$1]} END {for (a in S) print S[a],a}' log_file | sort -n
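The ++S[$1] construct builds an awk associative array keyed by the first field; the END block then prints the accumulated counts. A self-contained check of the idiom with made-up sample data:

```shell
# Each line's first field (an IP) keys the array; END prints count and key.
printf '10.0.0.1 /a\n10.0.0.2 /b\n10.0.0.1 /c\n10.0.0.1 /d\n' \
  | awk '{++S[$1]} END {for (a in S) print S[a], a}' | sort -n
```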

Count the number of distinct IPs accessing during the 14:00 hour on 31 August 2018:

awk '{print $4,$1}' access.log | grep 31/Aug/2018:14 | awk '{print $2}'| sort | uniq | wc -l

Count the top ten most frequently seen client IPs:

awk '{print $1}' access.log |sort|uniq -c|sort -nr |head -10

To view which pages are visited by a given IP:

grep '^202.106.19.100' access.log | awk '{print $1,$7}'

List the pages accessed by a given IP, ordered by access frequency:

grep '202.106.19.100' access.log |awk '{print $7}'|sort |uniq -c |sort -rn |head -n 100

2.6 Statistics on response page size

List the files with the largest transfer sizes

cat access.log |awk '{print $10 " " $1 " " $4 " " $7}'|sort -nr|head -100

List pages whose response body exceeds 200000 bytes (about 200KB), with the number of times each occurred

cat access.log |awk '($10 > 200000){print $7}'|sort -n|uniq -c|sort -nr|head -100
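A quick check of this size filter against sample data (field positions as in the combined format above; the threshold and the file name size_sample.log are illustrative):

```shell
# Sample lines where field 10 is $body_bytes_sent.
cat > size_sample.log <<'EOF'
1.1.1.1 - - [19/Apr/2017:01:58:04 +0000] "GET /big.zip HTTP/1.1" 200 500000 0.1 -
1.1.1.2 - - [19/Apr/2017:01:58:05 +0000] "GET /small.txt HTTP/1.1" 200 120 0.1 -
1.1.1.3 - - [19/Apr/2017:01:58:06 +0000] "GET /big.zip HTTP/1.1" 200 500000 0.1 -
EOF

# Pages larger than 200000 bytes, with occurrence counts
awk '($10 > 200000){print $7}' size_sample.log | sort | uniq -c | sort -nr
```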

List the most frequently visited pages (TOP100)

awk '{print $7}' access.log | sort |uniq -c | sort -rn | head -n 100

List the most frequently visited pages, excluding PHP pages (TOP 100)

grep -v ".php"  access.log | awk '{print $7}' | sort |uniq -c | sort -rn | head -n 100          

List pages that have been visited more than 100 times

cat access.log | cut -d ' ' -f 7 | sort |uniq -c | awk '{if ($1 > 100) print $0}' | less

Among the last 1000 records, list the most visited pages

tail -1000 access.log |awk '{print $7}'|sort|uniq -c|sort -nr|less

2.7 PV statistics

Count requests per minute; top 100 time points (minute precision)

awk '{print $4}' access.log |cut -c 14-18|sort|uniq -c|sort -nr|head -n 100

Count requests per hour; top 100 time points (hour precision)

awk '{print $4}' access.log |cut -c 14-15|sort|uniq -c|sort -nr|head -n 100

Count requests per second; top 100 time points (second precision)

awk '{print $4}' access.log |cut -c 14-21|sort|uniq -c|sort -nr|head -n 100
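These cut -c ranges rely on the fixed width of the [dd/Mon/yyyy:HH:MM:SS timestamp in $4. A quick check of the minute-level slice with sample timestamps (characters 14-18 are the HH:MM part, so identical minutes collapse under uniq -c):

```shell
# Two timestamps in minute 01:58, one in 01:59.
printf '[19/Apr/2017:01:58:04\n[19/Apr/2017:01:58:59\n[19/Apr/2017:01:59:01\n' \
  | cut -c 14-18 | sort | uniq -c | sort -nr
```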

Count the PV for a given day

grep "10/May/2018" access.log | wc -l

Description:

  • awk '{print $1}': take the 1st field (the client IP);
  • sort: sort the IPs so identical lines are adjacent;
  • uniq -c: collapse duplicate lines and prefix each with its occurrence count;
  • sort -nr: sort by count in descending order (-k1 sorts on the first column explicitly);
  • head -n 10: show the top 10 IPs.

2.8 Statistics on page response time

You can use commands like the following to pick out slow requests: the first prints the response time of every request slower than 1 second, the second prints the URI of every request slower than 3 seconds (both assume the response time is the last field of the log line).

awk '($NF > 1){print $11}' access.log
cat access.log|awk '($NF > 3){print $7}'

Note: NF is the number of fields in the current record. $NF is the last field.
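A one-liner makes the point (the sample input is made up for demonstration):

```shell
# NF is the number of fields in the current record; $NF is its last field.
echo "GET /index.html 0.042" | awk '{print NF, $NF}'
```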

List pages that take more than 3 seconds for PHP page requests and count the number of times they occur, showing the first 100

cat access.log|awk '($NF > 3 && $7~/\.php/){print $7}'|sort -n|uniq -c|sort -nr|head -100

List requests that take longer than 5 seconds and display the first 20

awk '($NF > 5){print $11}' access.log | awk -F\" '{print $2}' |sort -n| uniq -c|sort -nr|head -20

2.9 Spider crawl statistics

Count the number of Baiduspider crawls

grep 'Baiduspider' access.log |wc -l

Count the number of Baiduspider crawls that hit a 404

grep 'Baiduspider' access.log |grep '404' | wc -l

Summary

After this introduction, I believe you will agree that the three musketeers of Linux are powerful. On the command line, awk can also load and run external program files, and it can perform very complex processing of text; the only real limit is your imagination.