As a big data developer, let me start with an interview question that I like.

Given the nginx access log below, access.log, use a script to find the top 10 client IP addresses by number of requests. The question is not difficult, but it exercises several commonly used shell commands: awk, uniq, sort, and head. I think these are essential for anyone working in big data development, operations, or data warehousing.


2018-11-20T23:37:40+08:00 119.1590.30 - "GET /free.php?proxy=out_hp&sort=&page=1 HTTP/1.1" "/free.php" - 200 0.156 362 6849/7213 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 n/A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)"
2018-11-20T23:37:44+08:00 117.3095.62 - "GET /partner.php HTTP/1.1" "/partner.php" - 200 0.016 457 6534/6956 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 - https://blog.csdn.net/ithomer/article/details/6566739 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
2018-11-20T23:37:44+08:00 117.3095.62 - "GET /css/bootstrap.min.css HTTP/1.1" "/css/bootstrap.min.css" - 200 0.045 398 19402/19757 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 - https://proxy.mimvp.com/partner.php - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
2018-11-20T23:37:44+08:00 117.3095.62 - "GET /css/hint.min.css HTTP/1.1" "/css/hint.min.css" - 200 0.000 393 1635/1989 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 - https://proxy.mimvp.com/partner.php - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
 

Here is the answer:

awk '{print $2}' access.log | sort | uniq -c | sort -k1,1nr | head -10
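One detail in this pipeline is easy to get wrong: uniq -c only merges adjacent identical lines, so the IPs should be sorted before they are counted. A minimal sketch with made-up data (the temp file and the 192.0.2.x addresses are placeholders, not taken from the log above):

```shell
# Hypothetical sample data: timestamp in column 1, client IP in column 2.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
t1 192.0.2.1 GET /a
t2 192.0.2.2 GET /b
t3 192.0.2.1 GET /c
t4 192.0.2.3 GET /d
t5 192.0.2.1 GET /e
t6 192.0.2.2 GET /f
EOF

# Sort so identical IPs become adjacent, count them, then order by count (descending).
# The final awk only strips the padding that uniq -c puts in front of each count.
top_ips=$(awk '{print $2}' "$demo_log" | sort | uniq -c | sort -k1,1nr | head -10 | awk '{print $1, $2}')
echo "$top_ips"
rm -f "$demo_log"
```

With this sample the first line printed is "3 192.0.2.1". Without the first sort, the three requests from 192.0.2.1 would be reported as three separate groups of one.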

awk is quite powerful, but here we will focus on the uses that come up most often in day-to-day work. The basic syntax is:


awk '[pattern] {action}' {filenames}


Splitting fields

-F specifies the field separator. By default, awk splits on runs of spaces and tabs. For example, to get the IP address in the second column of the log above:

awk -F ' ' '{print $2}'  access.log 

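Whitespace is already the default separator, so -F ' ' could be omitted; it is written out here only as a demonstration.

In Hive, the default field delimiter is 0x01 (Ctrl-A, written \001 in octal), so a file exported from Hive can be split like this: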

awk -F '\001' '{ print $1 }' abcd.txt
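To try the 0x01 separator without a real Hive export, a small file can be generated with printf. This is only a sketch: the temp file and its two rows are invented for illustration.

```shell
# Hypothetical two-line file whose fields are separated by the 0x01 byte.
hive_file=$(mktemp)
printf 'alice\001beijing\nbob\001shanghai\n' > "$hive_file"

# -F '\001' makes awk split on 0x01; print the first field of each line.
names=$(awk -F '\001' '{ print $1 }' "$hive_file")
echo "$names"
rm -f "$hive_file"
```

This prints alice and bob on separate lines.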

Use of built-in variables

  • $0 is the entire current line.
  • $n is the n-th field after the line is split by the separator given with -F; the index starts at 1.
  • NF is the number of fields in the current line. For example, print $NF prints the last column.
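The three variables above can be checked on a single line of input. The line used here is made up, with four space-separated fields:

```shell
# One made-up line with four space-separated fields.
line='2018-11-20 192.0.2.7 GET /index.html'

whole=$(echo "$line" | awk '{ print $0 }')    # the entire line
second=$(echo "$line" | awk '{ print $2 }')   # the second field
count=$(echo "$line" | awk '{ print NF }')    # the number of fields
last=$(echo "$line" | awk '{ print $NF }')    # the last field

echo "$second | $count | $last"
```

The echo prints "192.0.2.7 | 4 | /index.html".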

Sometimes we can use awk to pick out a few fields and splice them into the statements we need. For example, suppose we want to take the IP field from access.log above, generate SQL insert statements, and load them into a database.

awk '{print "insert into mytable(ip) values('\''"$2"'\'');" }' access.log > /tmp/ip.sql
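The '\'' sequences in that command are only there to get a literal single quote through the shell. A sketch of the same statement on one invented row, next to an alternative that passes the quote in as an awk variable (the variable name q is my own choice, not from the original):

```shell
# A made-up log row: timestamp, client IP, then the rest.
row='2018-11-20T23:37:40+08:00 192.0.2.7 -'

# Variant 1: close the shell quote, emit an escaped quote, reopen ('\'').
sql_escaped=$(echo "$row" | awk '{print "insert into mytable(ip) values('\''"$2"'\'');" }')

# Variant 2: hand the single quote to awk as a variable, which is easier to read.
sql_var=$(echo "$row" | awk -v q="'" '{print "insert into mytable(ip) values(" q $2 q ");" }')

echo "$sql_escaped"
echo "$sql_var"
```

Both variants print insert into mytable(ip) values('192.0.2.7');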

Regular expression matching

Sometimes we only want to print the rows that meet a condition, and we can do that with a regular expression match.

For example, to print the IPs that start with 117 in access.log above:

awk '$2 ~ /^117/ {print $2}' access.log 
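A quick check of the ~ operator on a few invented rows. Only the second column is tested, and ^ anchors the match at the start of that field, so an address that merely contains 117 is skipped:

```shell
# Four made-up rows; only two have an IP that starts with 117.
ips=$(printf 't1 117.0.0.1\nt2 119.0.0.2\nt3 117.0.0.3\nt4 10.117.0.4\n' \
  | awk '$2 ~ /^117/ {print $2}')
echo "$ips"
```

This prints 117.0.0.1 and 117.0.0.3. Writing the pattern as /^117\./ is slightly stricter, since it also requires the dot after the first octet.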

SQL-like functions

awk can also implement some simple SQL-like operations. Let's look at one briefly.

Suppose we have the following student table:

id  class   name
1   Class1  Zhang San
2   Class2  Li Si
3   Class1  Wang Wu
4   Class3  Zhao Liu

For example, to count the number of students in each class (with the rows saved in student.txt, without the header line), we can use the following command:

awk '{a[$2]++} END {for (i in a) {print i, a[i]}}' student.txt

We define a map-like variable a (an awk associative array), where the key is the class name, i.e. the second column, and the value is the number of students in that class. The END block then prints each key and its count in a for loop.
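In SQL terms this is roughly select class, count(*) from student group by class. A runnable sketch, using sample rows modeled on the student table above and assuming the class is written as a single token so that it lands in $2 (the temp file stands in for student.txt):

```shell
# Sample rows: id, class, name. No header line, so every row is a student.
students=$(mktemp)
cat > "$students" <<'EOF'
1 Class1 Zhang San
2 Class2 Li Si
3 Class1 Wang Wu
4 Class3 Zhao Liu
EOF

# Count rows per class. The order of "for (i in a)" is unspecified in awk,
# so the result is piped through sort to make the output stable.
counts=$(awk '{a[$2]++} END {for (i in a) print i, a[i]}' "$students" | sort)
echo "$counts"
rm -f "$students"
```

This prints "Class1 2", "Class2 1" and "Class3 1", one per line.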
