Hello, I am xindoo. It has been nearly a month and a half since I last posted a technical article, because I have been very, very busy: on weekdays, apart from eating and sleeping, I am either at work or on my way to work. Today is 1024 Programmers' Day, so I am taking time out of that schedule to finish an article I have been meaning to write for a long time and join in the fun: a simple tutorial on the awk command-line tool. Those of you who know me know that I started out in operations. I did not learn all that much there, but I did get very fluent with a few command-line tools, awk being one of them. Later, when I moved to development, that fluency let me solve many small problems quickly, and the convenience and efficiency of the command line amazed my colleagues more than once.

Combining command-line tools with pipes can solve a lot of problems very quickly. I won't go into that here; if you are interested, take a look at an earlier blog post of mine about the Linux commands I use most often. Today's hero is awk, a powerful text-processing tool. I use it for data cleaning, filtering, inspection and even simple statistics. It is no exaggeration to say that awk has let me finish in minutes tasks that would otherwise take hours, or that would not be feasible at all.

That may sound abstract, so here is a concrete example. A colleague needed to split a text file of tens of millions of lines (over 500MB) evenly into two files. What he really wanted was to divide tens of millions of users evenly and randomly into two groups for a comparative experiment. What would you do? I did it with a pair of awk one-liners: 20 seconds of typing and half a minute of execution.

cat users.txt |awk 'NR%2==0 {print $1}' > 0.txt
cat users.txt |awk 'NR%2==1 {print $1}' > 1.txt
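The two commands above read users.txt twice. As a hedged alternative, a single awk pass can write both halves by redirecting print to a file name computed from the line number. This is only a sketch: the sample user IDs and the temporary directory are made up for illustration.

```shell
# Hypothetical sample data: six user IDs, one per line.
workdir=$(mktemp -d)
printf '%s\n' u1 u2 u3 u4 u5 u6 > "$workdir/users.txt"

# One pass: lines with an odd NR go to 1.txt, lines with an even NR go to 0.txt.
(cd "$workdir" && awk '{ out = (NR % 2) ".txt"; print $1 > out }' users.txt)

even_lines=$(cat "$workdir/0.txt")
odd_lines=$(cat "$workdir/1.txt")
printf 'even:\n%s\nodd:\n%s\n' "$even_lines" "$odd_lines"
```

Note that splitting by line parity is only as random as the order of the lines in the file; if the file is sorted in some meaningful way, it would need to be shuffled first.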

All of this is just to draw attention to how powerful awk is. So what is awk? Many beginners think of awk as a text-processing tool that, together with grep and sed, forms the "three musketeers" of Linux text processing. But awk is more than a text-processing tool: it is also a programming language. It provides many built-in variables and functions for working with text (more on those later), which is what makes it so convenient for this kind of job.

Basic usage

An awk invocation consists of awk, the program to execute, and a text file; awk can also read its input from a pipe. The two forms are as follows.

awk program textfile 
cat textfile | awk program  

Awk is line-oriented: the program is executed once for each line of input, as in the following example.

cat a.txt| awk '{print $1, $3}'

The command above prints the first and third columns of every line in the file. Column numbers start at 1, and $0 is special: it refers to the entire line. By default awk splits columns on spaces or tabs. Sometimes a text file is delimited by another character instead (such as -), and awk provides the -F option to specify the separator.

cat a.txt | awk -F'-' '{print $1, $3}'
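As a small self-contained sketch of column splitting, the following runs awk on made-up sample lines (the names and values are assumptions) to show $1, $3 and $0 with the default whitespace separator and with -F'-'.

```shell
# Hypothetical input: one space-separated line and one dash-separated line.
by_space=$(printf 'alice 30 beijing\n' | awk '{print $1, $3}')
by_dash=$(printf 'alice-30-beijing\n' | awk -F'-' '{print $1, $3}')

# $0 is always the whole line, whatever the separator is.
whole=$(printf 'alice-30-beijing\n' | awk -F'-' '{print $0}')

echo "$by_space"
echo "$by_dash"
echo "$whole"
```

Both of the first two commands pick out the same two columns; only the separator differs.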

Built-in variables

One of the reasons awk is so good at processing text is that it provides a large number of built-in variables that make it easy to get information about the input, such as the current line number (NR), the number of columns in the current line (NF), or the name of the file being processed (FILENAME). To list just a few:

Variable     Meaning
$0           The entire current line
$1 ~ $n      Columns 1 through n of the current line
NF           Number of columns in the current line
NR           Number of the current line, starting at 1
RS           Input record separator, a newline by default
OFS          Output field separator, a space by default
ORS          Output record separator, a newline by default
ARGC         Number of command-line arguments
ARGV         Array of command-line arguments
FILENAME     Name of the current input file
IGNORECASE   If true, matching is case-insensitive (gawk extension)
ARGIND       Index in ARGV of the file currently being processed (gawk extension)

For example, to print the line number and the number of columns for each line of a.txt, a text file delimited by |, I can write:

cat a.txt | awk -F'|' '{print NR, NF}' 
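To make the variables in the table more concrete, here is a minimal sketch on made-up input. It prints NR and NF for each line, then uses OFS to change the separator that print puts between its arguments. Setting a variable with -v is standard awk; the sample data is an assumption.

```shell
# Hypothetical pipe-delimited input with 3 and 2 columns.
counts=$(printf 'a|b|c\nd|e\n' | awk -F'|' '{print NR, NF}')

# OFS is inserted between the comma-separated arguments of print;
# $NF is the last column of the current line.
joined=$(printf 'a|b|c\nd|e\n' | awk -F'|' -v OFS=',' '{print NR, $1, $NF}')

echo "$counts"
echo "$joined"
```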

In my blog post “Implementing SQL-like join operations with awk” I used several built-in variables to do fairly complex processing across multiple text files; have a look if you are interested. Operations such as the intersection or difference of multiple files are just as easy to implement.
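Intersection and difference of two files are commonly done with the NR == FNR idiom. The sketch below is not necessarily how the post mentioned above does it. It relies on FNR (the line number within the current file, a standard built-in that is not in the table above) and on two made-up files.

```shell
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/a.txt"
printf '%s\n' banana cherry durian > "$workdir/b.txt"

# While the first file is being read NR equals FNR, so remember its keys and move on.
# For the second file, print the lines whose key was seen (intersection).
common=$(awk 'NR == FNR { seen[$1] = 1; next } ($1 in seen)' \
    "$workdir/a.txt" "$workdir/b.txt")

# Lines of b.txt whose key does not appear in a.txt (difference).
only_b=$(awk 'NR == FNR { seen[$1] = 1; next } !($1 in seen)' \
    "$workdir/a.txt" "$workdir/b.txt")

echo "$common"
echo "$only_b"
```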

Built-in functions

In addition to built-in variables, awk also has many commonly used functions built in. I will not go through them here; the details can be found at www.runoob.com/w3cnote/awk… . The built-in functions fall into the following categories:

  • Arithmetic functions
  • String functions
  • Time functions
  • Bit manipulation functions
  • Other functions

These built-in functions cover most common operations. If they are not enough, remember that awk is a programming language, so you can implement whatever function you need yourself.
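As a quick taste of the arithmetic and string categories, here is a small sketch using only functions that standard awk provides (length, toupper, substr, index, sqrt, split). The input line is made up, and the time and bit manipulation functions are left out.

```shell
# Hypothetical one-line input with three columns.
result=$(printf 'hello world 16\n' | awk '{
    print length($1)           # string length: 5
    print toupper($2)          # upper-case copy: WORLD
    print substr($1, 2, 3)     # 3 characters starting at position 2: ell
    print index($1, "l")       # position of the first "l": 3
    print sqrt($3)             # square root: 4
    n = split("a,b,c", parts, ",")
    print n, parts[3]          # number of pieces and the third piece: 3 c
}')
echo "$result"
```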

Syntax

Let's look at the basics of awk as a programming language, beyond one-off command lines.

Variables

Start with variables. In addition to the built-in variables mentioned above, you can define your own. As in Python, variables in awk are not declared and carry no fixed type; you simply start using them, and an unset variable behaves as 0 or an empty string. For example, to get the sum and average of column 2 of a text file, you can write:

cat a.txt |awk '{sum += $2; cnt += 1} END {print sum, sum/cnt}'

Here sum and cnt are our own variables; they can be named and used however we like, which is very convenient. Besides simple variables, awk also supports more complex data structures such as maps (associative arrays, in awk terminology). Here is another example: suppose we have daily weight records for a group of people over the past month and want to know each person's average weight for the month. The data looks like this, with three columns: name, date and weight.

ZhangSan 2021-10-01 67.7
LiSi 2021-10-01 83.9
ZhangSan 2021-10-02 68.1
LiSi 2021-10-02 85.0
ZhangSan 2021-10-03 68.3
LiSi 2021-10-01 67.9
LiSi 2021-10-03 84.0
...

By using maps in awk, each person's weight total (sum) and record count (cnt) can be stored separately, and the results printed together once all the data has been processed. The code is as follows:

cat a.txt|awk '{sum[$1] += $3; cnt[$1] += 1} END {for (key in sum) {print key, sum[key]/cnt[key]}}'  
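The order in which for (key in sum) visits keys is unspecified, and print formats averages with awk's default number format. If a stable, readable report matters, one option is to format with printf and sort the result. The sketch below feeds in a few rows in the same name/date/weight layout, with names written as single words so that the weight stays in column 3.

```shell
# Sample rows in name/date/weight layout (names must not contain spaces).
averages=$(printf '%s\n' \
    'ZhangSan 2021-10-01 67.7' \
    'LiSi 2021-10-01 83.9' \
    'ZhangSan 2021-10-02 68.1' \
    'LiSi 2021-10-02 85.0' |
  awk '{ sum[$1] += $3; cnt[$1] += 1 }
       END { for (key in sum) printf "%s %.2f\n", key, sum[key] / cnt[key] }' |
  sort)
echo "$averages"
```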

Conditions

As the examples above show, some logic needs conditions. In the file-splitting example at the start, I split the file in two by the parity of the line number, which means running different logic for different line numbers. Expressing such conditions in awk is easy.

awk 'expr { statement }'   # the block in braces runs only when expr is true

END, which has already appeared several times above, is a special pattern: its block runs only after all lines have been processed. Its counterpart is BEGIN, whose block runs before any input is processed, so it is usually used for initialization. Any other condition can be written in the same pattern-then-block form. Awk also supports if/else, written as follows:

# If the line number is odd, print the line number and column 1; otherwise print the line number and column 2.
cat a.txt | awk '{if (NR%2==1) print NR, $1; else print NR, $2}'
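To see BEGIN, an ordinary condition and END working together, here is a minimal sketch on made-up numbers. The threshold of 6 and the counter name n are arbitrary choices for illustration.

```shell
# Hypothetical input: one number per line.
report=$(printf '%s\n' 5 12 7 20 | awk '
    BEGIN   { print "values greater than 6:"; n = 0 }   # runs once, before any input
    $1 > 6  { print $1; n++ }                           # runs only for matching lines
    END     { print "matched", n, "of", NR, "lines" }   # runs once, after all input
')
echo "$report"
```

In the END block NR still holds the number of the last line read, which is why it can serve as the total line count.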

Loops

Awk also supports for and while loops, as in C:

for (initialisation; condition; increment/decrement)
    action

while (condition) 
    action
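In day-to-day text processing, the most common use of a for loop is probably walking over the columns of a line. A small sketch on made-up input, followed by a while loop that needs no input at all:

```shell
# Hypothetical input: a few numbers per line.
totals=$(printf '1 2 3\n10 20\n' | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {   # visit every column of the current line
        total += $i
    }
    print NR, total
}')

# A while loop inside BEGIN: powers of 2 below 20.
powers=$(awk 'BEGIN { p = 1; while (p < 20) { print p; p *= 2 } }')

echo "$totals"
echo "$powers"
```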

As an example, here is awk printing all primes between 0 and 100. It ties together the loops and conditions described above and, apart from how variables are defined, is basically the same as C.

BEGIN {
   i = 2;
   while (i < 100) {
      isPrime = 1;
      for (j = 2; j < i; j++) {
          if (i % j == 0) {
              isPrime = 0;
          }
      }
      if (isPrime == 1) {
          print i;
      }
      i += 1;
   }
}

If the program is too long to fit comfortably on the command line, you can save it to a file and run it with awk -f, for example:

awk -f getPrime.awk 
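A program file can process input just like an inline program. The sketch below writes a tiny program to a file and then runs it on piped input; the file name count_fields.awk and the sample data are made up for illustration.

```shell
workdir=$(mktemp -d)

# Save a small program to a file (the file name is only an example).
cat > "$workdir/count_fields.awk" <<'EOF'
# Print the number of columns on each line, then the total line count.
{ print NF }
END { print "lines:", NR }
EOF

# -f reads the program from the file; the input still comes from a file or a pipe.
summary=$(printf 'a b c\nd e\n' | awk -f "$workdir/count_fields.awk")
echo "$summary"
```

Keep in mind that lower-case -f names a program file, while upper-case -F sets the column separator.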

Functions

Defining a function in awk is also very simple and looks just like JavaScript; for details see www.runoob.com/w3cnote/awk…

function isPrime(n) {
   for (j = 2; j < n; j++) {
      if (n % j == 0) {   # n is divisible by j, so it is not prime
         return 0;
      }
   }
   return 1;
}

BEGIN {
   i = 2;
   while (i < 100) {
      if (isPrime(i)) {
          print i;
      }
      i += 1;
   }
}
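One thing that does differ from JavaScript: awk has no keyword for local variables. A variable used inside a function is global unless it appears in the parameter list, so a common convention is to list locals as extra parameters, separated from the real ones by extra spaces. A sketch under that convention (the name is_prime and the extra parameter d are illustrative, and the loop stops at the square root as a small optimization):

```shell
primes=$(awk '
# d is listed as an extra parameter so that it is local to the function.
function is_prime(n,    d) {
    if (n < 2) return 0
    for (d = 2; d * d <= n; d++) {
        if (n % d == 0) return 0
    }
    return 1
}
BEGIN {
    d = "untouched"                 # a global with the same name as the local
    for (i = 2; i < 30; i++) {
        if (is_prime(i)) printf "%d ", i
    }
    print d                         # the global was not modified by the calls
}')
echo "$primes"
```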

None of the syntax above will feel unfamiliar to anyone who has used another programming language; it is all very simple.

Conclusion

Awk may look like a niche language with no advantage over mature general-purpose programming languages, but it focuses exclusively on text processing, and in that area it is among the best. However, with the rise of distributed text search tools such as Elasticsearch, awk is used less and less, and command-line tools like it may well be forgotten by the new generation of programmers… So I hope this article helps more people get to know awk.

In addition, this article has only scratched the surface of awk's basic functionality. If you want to become proficient, you will need to consult other resources and do plenty of practice. Today I also listened to the CSDN 1024 online live broadcast and happened to hear some top programmers' advice for ordinary programmers. Most of it was familiar: we all know the principles, yet most of us stay mediocre, and the core reason is a lack of practice and accumulation. Without accumulating small steps you cannot travel a thousand miles; without gathering small streams you cannot form a river.