This article is based on MIT's "The Missing Semester of Your CS Education" course, which covers the command line, using a powerful text editor, and the many features of version control systems. Chinese course homepage: https://missing-semester-cn.github.io/

This is the fourth lecture of the course, on data wrangling.

In most cases, data wrangling requires that you know which tools can be used for a particular task and how to combine them.

  • Get the server log: ssh myserver journalctl > journal
  • sed: sed 's/.*Disconnected from //' replaces everything up to and including "Disconnected from " with nothing. The s command is written in the form s/REGEX/SUBSTITUTION/.

    • Capture groups. To keep part of the matched text, wrap it in parentheses in the regex and refer to it as \1, \2, and so on in the substitution.
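
As a small sketch of the two sed ideas above, the snippet below applies both substitutions to a single made-up log line. The host name, user name, and IP address are invented for illustration, and -E is used so that the groups can be written without backslashes.

```shell
# A made-up sshd log line; host, user, and IP address are hypothetical.
line='Jan 17 03:13:00 myhost sshd[2631]: Disconnected from invalid user alice 203.0.113.7 port 51234 [preauth]'

# Delete everything up to and including "Disconnected from ".
rest=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')
echo "$rest"

# Replace the whole line with the second capture group (the user name).
user=$(printf '%s\n' "$line" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')
echo "$user"
```

The first command leaves the tail of the line; the second keeps only what the second pair of parentheses matched.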
  • Some uses of regular expressions

    • . matches any single character except newline
    • * matches the preceding item zero or more times
    • + matches the preceding item one or more times
    • [abc] matches any one of a, b, and c
    • (RX1|RX2) matches anything that matches either RX1 or RX2
    • ^ matches the beginning of a line
    • $ matches the end of a line
  • Regular expressions are greedy by default. In some implementations, adding ? after * or + makes the match non-greedy, but sed does not support this.
  • To check that a regular expression is correct, use a regex debugger.
  • Matching any word: [^ ]+ matches any non-empty sequence of characters that contains no spaces.

    • ^ at the beginning of an expression anchors the match to the start of the line; as the first character inside square brackets it means "none of the following characters".
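
The difference between greedy matching, [^ ]+, and the two meanings of ^ is easier to see on a tiny input. This is only an illustrative sketch; the sample strings are made up.

```shell
# Hypothetical input: three space-separated fields.
text='alpha beta gamma'

# Greedy: ".* " grabs as much as it can, so everything up to the LAST space is removed.
greedy=$(printf '%s\n' "$text" | sed -E 's/.* //')
echo "$greedy"          # gamma

# "[^ ]+" cannot cross a space, so only the first field (and its space) is removed.
first_removed=$(printf '%s\n' "$text" | sed -E 's/^[^ ]+ //')
echo "$first_removed"   # beta gamma

# Outside brackets, ^ anchors to the start of the line:
# "banana" contains an a but does not start with one, so one line matches.
starts_with_a=$(printf 'apple\nbanana\n' | grep -c '^a')
echo "$starts_with_a"   # 1
```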
  • Sort the input data: sort

    • sort -n sorts numerically (the default is lexicographic). -k1,1 sorts by the first whitespace-separated column only; the ,n part means "sort up to the nth field", and the default is the end of the line.
    • sort -r sorts in reverse order
  • Collapse consecutive identical lines into one, prefixed with the number of occurrences: uniq -c

    ssh myserver journalctl |        # read the logs
      grep sshd |
      grep "Disconnected from" |
      sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |   # extract the user name
      sort | uniq -c |               # sort the user names and merge duplicates
      sort -nk1,1 | tail -n10 |      # keep the ten most frequent
      awk '{print $2}' | paste -sd,  # join the names with commas

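
The counting half of the pipeline above can be tried without a server. The sketch below feeds it a made-up list of user names (one per failed login) and keeps the two most frequent instead of ten.

```shell
# Made-up user names, one line per login attempt.
names='root
admin
root
guest
root
admin'

# sort groups identical names, uniq -c prefixes each group with its count,
# sort -nk1,1 orders by that count, tail keeps the most frequent,
# awk drops the count, and paste joins the remaining lines with commas.
top=$(printf '%s\n' "$names" |
  sort | uniq -c |
  sort -nk1,1 | tail -n2 |
  awk '{print $2}' | paste -sd, -)
echo "$top"   # admin,root
```

The trailing - tells paste to read standard input, which some implementations require.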
  • awk: awk is a programming language that is very good at processing text streams

    • $0 is the contents of the entire line, and $1 through $n are the n fields of that line, split on the awk field separator (whitespace by default; change it with -F).
    • Count all users whose name starts with c and ends with e and who tried to log in only once: | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l

      • The pattern requires the first field to equal 1 (which is the count from uniq -c).
      • The second field must match the given regular expression.
    • The counting can also be done in awk itself, without wc -l:
BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }
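
To see what these awk programs do, here is a sketch that runs them on made-up "count name" pairs shaped like the output of uniq -c. The names are invented for illustration.

```shell
# Made-up "count name" pairs, in the shape produced by uniq -c.
counts='1 charlie
3 root
1 claire
1 chuck
2 cate'

# Print the names that were seen once, start with c, and end with e.
matches=$(printf '%s\n' "$counts" | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }')
echo "$matches"

# Count them inside awk instead of piping to wc -l.
total=$(printf '%s\n' "$counts" | awk '
  BEGIN { rows = 0 }
  $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
  END { print rows }')
echo "$total"   # 2

# -F changes the field separator: here fields are split on ":".
second=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')
echo "$second"  # b
```

Only charlie and claire pass both conditions: chuck does not end with e, and cate was seen twice.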
  • Command-line calculator: bc

    • bc -l loads the standard math library, which includes sine and cosine
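
A minimal sketch of using bc in a pipeline, assuming bc is installed. The numbers are arbitrary examples.

```shell
# bc reads expressions from standard input.
sum=$(echo '1 + 2' | bc)
echo "$sum"     # 3

# Sum a column of numbers: join the lines with "+" and let bc evaluate the result.
total=$(printf '1\n2\n3\n' | paste -sd+ - | bc)
echo "$total"   # 6

# With -l the math library is loaded: s() is sine, c() is cosine, a() is arctangent.
cosine=$(echo 'c(0)' | bc -l)
echo "$cosine"
```

With -l, results are printed with many decimal places by default, so the cosine of 0 appears as 1 followed by a run of zeros.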

Exercises

  1. Check out this short interactive regular expression tutorial.

    - \d matches a digit; \D matches a non-digit
    - . matches any character; \. matches a literal period
    - Matching specific characters: [abc] matches a, b, or c
    - Excluding specific characters: [^abc] matches any character except a, b, or c
    - Character classes: \w is equivalent to [A-Za-z0-9_]; \W matches anything that is not one of those characters
    - Matching a given number of repetitions:
      - a{3}: exactly 3 times
      - a{1,3}: at least 1 and at most 3 times
      - *: any number of times; +: at least once
      - ?: optional, zero or one time
    - Whitespace: \s matches a space, tab (\t), newline (\n), or carriage return (\r), which is very useful; \S matches any non-whitespace character
    - ^ matches the start of a line; $ matches the end of a line
    - Capture groups: a regular expression can extract information for further processing
      - Wrap part of the pattern in () to capture it as a group
      - Groups can be nested with multiple levels of parentheses
      - Alternation with |, combined with () and back-references
      - \0 refers to the full matched text, \1 to the first group, \2 to the second group
  2. Count the number of words in your words file (/usr/share/dict/words) that contain at least three a's and do not end with 's. What are the three most common last two letters of those words? The sed y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And the challenge question: which combinations never occur? (A better solution is to use a capture group: .*(..) where \1 captures the last two letters.)

    # Count the words
    cat /usr/share/dict/words | rg "\w*a\w*a\w*a\w*[^'s]" | tr "[:upper:]" "[:lower:]" | uniq -c | wc -l
    # 5345

    #!/bin/bash
    words="./words.txt"
    cat /usr/share/dict/words | rg "\w*a\w*a\w*a\w*[^'s]" | tr "[:upper:]" "[:lower:]" >words.txt

    # Print "<two letters> <number of words ending in them>" for every pair aa..zz
    run() {
      for i in {a..z}; do
        for j in {a..z}; do
          echo -n "$i$j "
          rg ".*$i$j$" $words | wc -l | awk '{ print $1 }'
        done
      done
    }
    run >occurance.txt

    echo "the three most frequent combinations"
    cat occurance.txt | sort -nk2,2 -r | head -n3

    # Count the combinations whose occurrence count is exactly 0
    echo -n "there are in total "
    cat occurance.txt | awk 'BEGIN { num = 0 } $2 == 0 { num += 1 } END { printf "%d", num }'
    echo " combinations"

    echo "combinations that never appeared"
    cat occurance.txt | awk '{ if ($2 == "0") print $1 }' >nevershowed.txt
    paste -s -d , nevershowed.txt
  3. It's tempting to do an in-place substitution such as sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. But this is a bad idea. Why? Is the problem specific to sed? Read man sed to find out how to do this properly.
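
The effect can be reproduced safely in a scratch directory. The sketch below uses a made-up one-line file; the temporary-file workaround is one portable option, not the only one (GNU sed and BSD sed also offer an -i option, with slightly different syntax).

```shell
# Work in a scratch directory so nothing real is overwritten.
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/input.txt"

# The shell truncates input.txt for the redirection before sed starts reading,
# so sed sees an empty file and the content is lost.
sed 's/hello/goodbye/' "$dir/input.txt" > "$dir/input.txt"
lost_size=$(wc -c < "$dir/input.txt" | tr -d ' ')
echo "bytes left after redirecting onto the input: $lost_size"

# Portable alternative: write to a temporary file, then replace the original.
printf 'hello world\n' > "$dir/input.txt"
sed 's/hello/goodbye/' "$dir/input.txt" > "$dir/input.txt.tmp" &&
  mv "$dir/input.txt.tmp" "$dir/input.txt"
result=$(cat "$dir/input.txt")
echo "$result"

rm -r "$dir"
```

Because the truncation is done by the shell, the same thing happens with any command, not just sed.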
  4. Find the average, median, and maximum system boot time over your last ten boots. Use journalctl on Linux and log show on macOS, and look for the log timestamps near the beginning and end of each boot. On Linux, they may look something like:

    Logs begin at ...

    and

    systemd[577]: Startup finished in ...

    On macOS, look for:

    === system boot:

    and

    Previous shutdown cause: 5
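
Once the boot times have been extracted, the statistics themselves need only sort and awk. The sketch below assumes the times are already available as one number per line; the ten values are made up, and extracting real ones from the logs is left to the exercise.

```shell
# Ten made-up boot times in seconds, one per line.
boot_times='10
12
11
15
9
14
13
10
16
10'

# Average: sum the values and divide by the number of lines.
avg=$(printf '%s\n' "$boot_times" | awk '{ sum += $1 } END { print sum / NR }')

# Maximum: the last line after a numeric sort.
max=$(printf '%s\n' "$boot_times" | sort -n | tail -n1)

# Median: the middle value of the sorted list, or the mean of the two middle values.
median=$(printf '%s\n' "$boot_times" | sort -n | awk '
  { v[NR] = $1 }
  END {
    if (NR % 2) print v[(NR + 1) / 2]
    else print (v[NR / 2] + v[NR / 2 + 1]) / 2
  }')

echo "average=$avg median=$median max=$max"   # average=12 median=11.5 max=16
```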

  5. Look for boot messages that are not shared between your past three reboots (see the -b option of journalctl). Break this task down into several steps. First, get the boot logs for the past three boots. The command you use to get the boot log may have an option that extracts the logs for the past three boots, or you can use sed '0,/STRING/d' to delete all lines before the one that matches STRING. Then, remove the parts of each line that always vary, such as the timestamp. Next, de-duplicate the input lines and keep a count of each one (uniq can help). Finally, delete any line whose count is 3, since it appeared in all three boot logs.
  6. Find an online data set like this one or this one, or maybe one from here. Fetch it using curl and extract just two columns of data. If you are fetching HTML data, [pup](https://github.com/EricChiang/pup) might be helpful. For JSON data, try [jq](https://stedolan.github.io/jq/). Use one command to find the maximum and minimum of one column, and another command to compute the sum of the differences between the two columns.
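
The last step of exercise 6 can be sketched without downloading anything. The two-column data below is made up and stands in for whatever curl, pup, or jq would produce.

```shell
# Made-up two-column data (hypothetical values).
data='3 1
10 4
7 7
2 5'

# One command for the minimum and maximum of column 1:
# after a numeric sort they are the first and last lines.
min_max=$(printf '%s\n' "$data" | awk '{ print $1 }' | sort -n | sed -n '1p;$p' | paste -sd' ' -)
echo "$min_max"    # 2 10

# Another command for the sum of (column 1 - column 2) over all rows.
diff_sum=$(printf '%s\n' "$data" | awk '{ s += $1 - $2 } END { print s }')
echo "$diff_sum"   # 5
```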