This article is notes from MIT's "missing semester" of computer education, which covers the command line, the use of a powerful text editor, the many features of version control systems, and more. Chinese course homepage: missing-semester-cn.github.io/

This is the fourth lecture of the course, on data wrangling.

In most cases, data wrangling requires that you be able to identify which tools solve a particular problem and how to combine them.

  • Get the server logs: ssh myserver journalctl > journal
  • sed: a stream editor. s/REGEX/SUBSTITUTION/ replaces text matching REGEX; for example, sed 's/.*Disconnected from //' replaces that prefix with the empty string.
    • Capture groups: if you want to keep part of the match, wrap it in parentheses and refer to it in SUBSTITUTION as \1, \2, and so on.
  • Some uses of regular expressions
    • . matches any single character (except a newline)
    • * matches the preceding character zero or more times
    • + matches the preceding character one or more times
    • [abc] matches any one of a, b, or c
    • (RX1|RX2) matches anything that matches RX1 or RX2
    • ^ matches the beginning of a line
    • $ matches the end of a line
  • Regular expressions are greedy by default; add ? after * or + to make them non-greedy
  • To test whether a regular expression is correct, use an online regex debugger
  • Match any word: [^ ]+ matches any non-empty sequence of characters that contains no spaces.
    • ^ placed at the beginning of an expression anchors the match to the start of the line; inside square brackets it means "not".
  • Sort the input data: sort
    • sort -n sorts the input numerically (the default is lexicographic). -k1,1 means "sort based only on the first whitespace-separated column"; the ,n part means "sort only up to the nth field", and the default is to the end of the line.
    • sort -r reverses the order
  • Collapse consecutive identical lines into a single line prefixed by the number of occurrences: uniq -c
ssh myserver journalctl              # read the logs
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'  # keep only the user name
 | sort | uniq -c                    # sort and count user names
 | sort -nk1,1 | tail -n10           # the ten most common
 | awk '{print $2}'
 | paste -sd,                        # merge lines, using , as the delimiter
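The sort/uniq idiom in the pipeline above can be tried on inline data. A minimal sketch with made-up names (printf stands in for the real log pipeline):

```shell
# uniq only collapses *adjacent* duplicates, so sort first;
# then rank by count and keep the most frequent names.
printf 'bob\nalice\nbob\ncarol\nbob\nalice\n' \
  | sort | uniq -c \
  | sort -nk1,1 | tail -n2 \
  | awk '{print $2}' \
  | paste -sd,
# -> alice,bob
```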
  • awk: a programming language that is very good at processing text
    • $0 is the contents of the whole line; $1 through $n are its n fields, split on awk's field separator (whitespace by default; change it with -F).
    • Count the users whose names start with c, end with e, and who tried to log in only once: | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
      • This match requires that the first field equal 1 (which is exactly the count produced by uniq -c)
      • and that the second field match the given regular expression
BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }
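The same awk program can be tested on made-up uniq -c-style input (the counts and user names below are invented for illustration):

```shell
# $1 is the count from uniq -c, $2 the user name; sum the counts of
# users whose name starts with c, ends with e, and who logged in once.
printf '1 clare\n2 cole\n1 dave\n1 chloe\n' \
  | awk 'BEGIN { rows = 0 }
         $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
         END { print rows }'
# -> 2
```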
  • Command-line calculator: bc

    • bc -l loads the math library, which includes sine and cosine

    Exercises

    1. Work through this short interactive regular expression tutorial.
      • \d matches a digit; \D matches a non-digit
      • . matches any character; \. matches a literal period
      • Match specific characters: [abc] matches a, b, or c
      • Exclude specific characters: [^abc] matches any character except a, b, or c
      • Character shorthands: \w is equivalent to [A-Za-z0-9_]; \W matches anything else
      • Match a character a specific number of times
        • a{3}: exactly 3 times
        • a{1,3}: at least 1 and at most 3 times
      • * matches any number of occurrences; + matches at least one
      • ? makes the preceding match optional: zero or one occurrence
      • Whitespace: \s matches space, tab (\t), newline (\n), and carriage return (\r), which is very useful; \S matches any non-whitespace character
      • ^ matches the beginning of the line
      • $ matches the end of the line
      • Capture group
        • Regular expressions can be used to extract information for further processing
        • Use () to group the results
        • Nested group: Multiple brackets nested together
      • Conditional expressions: combine () with |
      • Back references
        • \0: the full matched text
        • \1 The first group
        • \2 The second group
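Most of these constructs work directly in grep -E (on GNU grep, use grep -P for the \d and \w shorthands). A small sketch of repetition counts and anchors:

```shell
# ^ and $ anchor the whole line; a{1,2} allows one or two a's,
# so only "cart" and "caart" match.
printf 'cat\ncart\ncaart\ncaaart\n' | grep -E '^ca{1,2}rt$'
# -> cart
#    caart
```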
    2. Count the words in the words file (/usr/share/dict/words) that contain at least three a's and do not end in 's. What are the three most common last-two-letter combinations among those words? sed's y command, or the tr program, may help you with case conversion. How many distinct last-two-letter combinations are there? And, the challenging part: which combinations never occur? A good approach uses a capture group: .*(..) with \1 captures the last two letters.
    # Count words
    cat /usr/share/dict/words | rg "\w*a\w*a\w*a\w*[^'s]" | tr "[:upper:]" "[:lower:]" | uniq -c | wc -l
    # 5345
    
    #!/bin/bash
    words="./words.txt"
    cat /usr/share/dict/words | rg "\w*a\w*a\w*a\w*[^'s]" | tr "[:upper:]" "[:lower:]" >words.txt
    
    run() {
        for i in {a..z}; do
            for j in {a..z}; do
                echo -n "$i$j "
                rg ".*$i$j$" $words | wc -l | awk '{ print $1 }'
            done
        done
    }
    
    run >occurance.txt
    
    echo "the most frequent 3 combinations"
    cat occurance.txt | sort -nk2,2 -r | head -n3
    echo -n "there are total "
    cat occurance.txt | awk ' BEGIN { num = 0 }
    $2 == 0 { num += 1 }
    END { printf "%d", num } '
    echo " combinations"
    
    echo "never appeared combinations"
    cat occurance.txt | awk ' {if ($2 == "0") print $1} ' >nevershowed.txt
    paste -s -d , nevershowed.txt
    3. It sounds tempting to do an in-place substitution, such as sed s/REGEX/SUBSTITUTION/ input.txt > input.txt, but that is not a wise thing to do. Why? Is this limited to sed? Read man sed to find out how to do it properly.
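A sketch of what goes wrong, using a throwaway file in /tmp (the -i flag shown is GNU sed's in-place option; on macOS it needs a suffix argument, e.g. sed -i ''):

```shell
# The shell opens and truncates the redirection target *before* sed
# ever runs, so sed reads an already-empty file and the data is lost.
printf 'hello\n' > /tmp/demo.txt
sed 's/hello/bye/' /tmp/demo.txt > /tmp/demo.txt   # wrong: demo.txt is now empty
wc -c < /tmp/demo.txt
# -> 0

# GNU sed's -i writes to a temporary file first, then renames it:
printf 'hello\n' > /tmp/demo.txt
sed -i 's/hello/bye/' /tmp/demo.txt
cat /tmp/demo.txt
# -> bye
```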

    4. Find the average, median, and maximum boot time of your last ten boots. Use journalctl on Linux and log show on macOS. Find the timestamps that mark the start and the end of each boot. On Linux they look like this:

    Logs begin at ...

    and

    systemd[577]: Startup finished in ...

    On macOS, find:

    === system boot:

    and

    Previous shutdown cause: 5
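However you extract the ten boot times, once they are one number per line the statistics reduce to a short awk program. A sketch with invented times in seconds:

```shell
# Sort numerically, then take the last element (max), the running
# sum (mean), and the middle element (median) of the sorted array.
printf '3\n1\n4\n1\n5\n' \
  | sort -n \
  | awk '{ a[NR] = $1; sum += $1 }
         END { printf "max=%s mean=%s median=%s\n",
                      a[NR], sum / NR, a[int((NR + 1) / 2)] }'
# -> max=5 mean=2.8 median=3
```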

    5. View the parts of the boot messages from the previous three reboots that differ from one another (see journalctl's -b flag). Break the task into steps: first, get the boot logs of the last three boots; the command that fetches the boot logs may have a flag for selecting a particular boot, or you can use sed '0,/STRING/d' to delete everything before the line matching STRING. Next, filter out the parts that change on every boot, such as timestamps. Then, count the repeated input lines (uniq can do this). Finally, delete anything that appears three times, since it is shared by all three boot logs.
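The sed '0,/STRING/d' trick mentioned above (the 0,/RE/ address form is a GNU sed extension) deletes everything up to and including the first matching line:

```shell
# Everything before and including the first line matching BOOT is removed.
printf 'old noise\nBOOT marker\nline1\nline2\n' | sed '0,/BOOT/d'
# -> line1
#    line2
```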

    6. Find a data set like this or this on the Internet, or get some from here. Use curl to fetch the data set and extract two columns of numeric data. For HTML data, pup (https://github.com/EricChiang/pup) may be helpful; for JSON data, try jq (https://stedolan.github.io/jq/). Use one command to find the minimum and maximum values in one column, and another to sum the differences between the two columns.
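A hedged sketch with inline JSON standing in for the downloaded data set (the field names a and b are invented for illustration): jq flattens the two columns into plain rows, then awk finds the min/max of the first column and sums the per-row differences.

```shell
printf '[{"a":3,"b":1},{"a":7,"b":2},{"a":5,"b":4}]' \
  | jq -r '.[] | "\(.a) \(.b)"' \
  | awk 'NR == 1 { min = max = $1 }
         { if ($1 < min) min = $1
           if ($1 > max) max = $1
           diff += $1 - $2 }
         END { print "min=" min, "max=" max, "sumdiff=" diff }'
# -> min=3 max=7 sumdiff=8
```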