Original: Coding Diary (WeChat official account ID: Codelogs). Feel free to share; please keep this attribution when reposting.

Introduction

Sometimes we need to process a batch of data. A while loop is an obvious choice, but the commands inside it run one at a time, so if there is a lot of data the whole run can take a very long time. Below, we use ncat to simulate a data-processing interface.

Mock interface

ncat -lk 8088 -c 'sleep 1; printf "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 3\r\n\r\nok\n"'

The interface sleeps for one second and then returns ok, indicating the data was processed successfully.

Calling the interface

curl -X POST http://localhost:8088/user/add -d '{"user_id": 1, "user_name":"u1"}'

Test data

Suppose data.txt contains 10 rows of data, as follows:

1 u1
2 u2
3 u3
...

Use a while loop

$ time while read -r -a line; do
      curl -X POST http://localhost:8088/user/add -d '{"user_id": '${line[0]}', "user_name":"'${line[1]}'"}'
  done < data.txt
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok

real    0m10.276s
user    0m0.094s
sys     0m0.096s

Here time measures the run, and real is the wall-clock duration of the while loop: processing the 10 rows takes about 10 seconds. Next, let's use parallel to run the requests concurrently.

Using parallel for concurrent execution

$ time cat data.txt | parallel -j10 -C '\s+' curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok

real    0m1.205s
user    0m0.203s
sys     0m0.060s

Here parallel runs curl concurrently. -j10 allows at most 10 jobs at a time, and -C '\s+' splits each input line on whitespace (\s+ is a whitespace regular expression), so {1} refers to column 1 and {2} to column 2. As expected, the 10 rows are processed by 10 concurrent jobs in about 1 second.
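If you want to check how parallel expands the placeholders before actually calling the interface, GNU parallel's --dry-run flag prints the constructed commands without running them. A minimal sketch using the same data:

# Print the commands parallel would run, without executing them
cat data.txt | parallel --dry-run -j10 -C '\s+' curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'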

The useful --tag option

parallel provides the --tag option, which prefixes each result with the input data that produced it, as follows:

$ cat data.txt | parallel -j10 -C '\s+' --tag curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'
1 u1    ok
2 u2    ok
4 u4    ok
3 u3    ok
5 u5    ok
6 u6    ok
7 u7    ok
8 u8    ok
9 u9    ok
10 u10  ok

This way, which result belongs to which piece of data is clear at a glance.
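Note that the results above arrive in completion order, not input order. If you prefer them in input order, GNU parallel's -k (--keep-order) option can be combined with --tag; a minimal sketch:

# -k keeps the output in the same order as the input lines
cat data.txt | parallel -j10 -k -C '\s+' --tag curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'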

Viewing progress

parallel provides three options for viewing progress: --bar, --progress, and --eta. --bar suits cases where the total amount of data is known, because parallel has to read all the input before it can draw a progress bar against the total. --progress suits cases where the total is unknown and you only want to see how many items have been processed so far:

# Use --bar to display a progress bar
cat data.txt | parallel -j10 -C '\s+' --tag --bar curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'

# Use --progress to display progress counts
cat data.txt | parallel -j10 -C '\s+' --tag --progress curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'
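The third option, --eta, additionally prints an estimated time remaining. A minimal sketch along the same lines:

# Use --eta to display an estimate of the remaining time
cat data.txt | parallel -j10 -C '\s+' --tag --eta curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {1}, \"user_name\":\"{2}\"}\'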

The --joblog and --resume-failed options

When you process a reasonably large batch of data with a script, some items will occasionally fail (due to network instability, for example). You then have to find the failed items and process them again, which is tedious. Fortunately, parallel already takes this scenario into account and provides the --joblog and --resume-failed options: when failures occur, you simply run the same command line again.

# Simulate an interface that randomly succeeds or fails: it returns true on success and fail on failure
ncat -lk 8088 -c 'sleep 1; r=$(head /dev/urandom | tr -dc 0-9 | head -c 1); printf "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\n"; [ $r -lt 5 ] && printf "true\n" || printf "fail\n"'

# Add the --joblog job.log --resume-failed options to parallel. This time we wrap the processing logic in a function and export it with export -f, so that parallel can call the function directly
function deal_data(){
    res=$(curl -s -X POST http://localhost:8088/user/add -d '{"user_id": '$1', "user_name":"'$2'"}')
    echo "$res"
    [[ "$res" == "true" ]] && return 0 || return 1
}
export -f deal_data

$ cat data.txt | parallel -j10 -C '\s+' --tag --joblog job.log --resume-failed deal_data
 1 u1        true
 2 u2        true
 4 u4        true
 5 u5        true
 3 u3        fail
 6 u6        true
 8 u8        true
 7 u7        true
 9 u9        true
 10 u10      fail

# Two items failed above. Run exactly the same command again: this time only the two failed items are re-executed. Perfect!
$ cat data.txt | parallel -j10 -C '\s+' --tag --joblog job.log --resume-failed deal_data
 3 u3        true
 10 u10      true
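The job log itself is a plain tab-separated file. Assuming GNU parallel's standard joblog layout, where Exitval is the 7th column, a quick way to list the jobs that failed is:

# List failed jobs from the job log (header line skipped); Exitval is assumed to be column 7
awk -F'\t' 'NR > 1 && $7 != 0 {print}' job.log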

The --semaphore option

Concurrency brings problems of its own; for example, sed does not support concurrent modification of the same file. parallel offers the --semaphore option to deal with this, and sem is an alias for parallel --semaphore; the two are equivalent.
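As a standalone illustration (not from the original example): sem -j1 makes queued commands run one at a time, and sem --wait blocks until everything queued has finished. A minimal sketch:

# Serialize appends to the same file with sem -j1
for i in 1 2 3; do
    sem -j1 "echo line $i >> out.txt"
done
sem --wait   # wait for all queued sem jobs to finish

Below, the same idea is applied to the data-processing example.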

function deal_data(){
    res=$(curl -s -X POST http://localhost:8088/user/add -d '{"user_id": '$1', "user_name":"'$2'"}')
    echo "$res"
    [[ "$res" == "true" ]] && return 0 || return 1
}
export -f deal_data

$ grep -vnP 'ok$' data.txt |parallel -C ':|\s+' --tag 'deal_data {2} {3}; [[ $? -eq 0 ]] && sem -j1 sed -i \"{1} s/$/ ok/\" data.txt'

Whenever a row is processed successfully, sed appends ok to the end of that row in data.txt. grep -vnP 'ok$' selects only the rows that do not yet end in ok and prefixes each with its line number, which sed then uses as the address ({1}). So every run of the command processes only data that is still unprocessed or previously failed. sem -j1 keeps sed from modifying the same file concurrently.
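For illustration (assuming all ten requests succeeded on this run), data.txt would end up looking like this, so the next grep -v 'ok$' finds nothing left to do:

$ cat data.txt
1 u1 ok
2 u2 ok
3 u3 ok
...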

Used with MySQL

parallel can also work together with MySQL: tasks are first imported into a MySQL table and then executed from it, as follows:

# 1. Import the task data into the paralleljobs table of the pardb database (the database must be created in advance)
cat data.txt |parallel --sqlmaster 'sql:mysql://user:pass@localhost:3306/pardb/paralleljobs'

# 2. Execute the tasks in the paralleljobs table; rows with Exitval = -1000 are the tasks waiting to be processed
function deal_data(){
    p=($*)
    res=$(curl -s -X POST http://localhost:8088/user/add -d '{"user_id": '${p[0]}', "user_name":"'${p[1]}'"}')
    echo "$res"
    [[ "$res" == "true" ]] && return 0 || return 1
}
export -f deal_data

parallel --sqlworker 'sql:mysql://user:pass@localhost:3306/pardb/paralleljobs' --tag deal_data
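To check which tasks are still waiting, you can query the table directly; the sketch below relies only on the Exitval = -1000 convention mentioned above (credentials, database and table name taken from the DBURL):

# 3. Inspect pending tasks (Exitval = -1000 means not yet processed)
mysql -u user -ppass pardb -e 'SELECT * FROM paralleljobs WHERE Exitval = -1000;'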

Processing CSV data

parallel can also process CSV files conveniently. For example, turn data.txt into data.csv and call the interface from it, as follows:
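One way to produce data.csv from data.txt is a quick shell one-liner (a sketch; the header row is added by hand):

# Turn space-separated rows into comma-separated rows and prepend a header
{ echo "user_id,user_name"; tr -s ' ' ',' < data.txt; } > data.csv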

$ cat data.csv
user_id,user_name
1,u1
2,u2
3,u3
...
# Use the --header : option to read the CSV header, then use {user_id} and {user_name} as placeholders in the command
$ cat data.csv | parallel --header : -C ',' --tag curl -s -X POST http://localhost:8088/user/add -d \'{\"user_id\": {user_id}, \"user_name\":\"{user_name}\"}\'

The --pipe option

Many text-processing commands take their data from standard input rather than from arguments, paste for example. With the --pipe option, parallel feeds the input data into the standard input of the command being executed.

# For example, convert data.csv to data.json, aggregating every 3 records into one JSON array, as follows:
$ cat data.csv | parallel --header : -C ',' echo \'{\"user_id\": {user_id}, \"user_name\":\"{user_name}\"}\' | parallel -N3 --pipe paste -s -d, | sed -e 's/^/\[/' -e 's/$/]/'
[{"user_id": 7, "user_name":"u7"}, {"user_id": 8, "user_name":"u8"}, {"user_id": 9, "user_name":"u9"}]
[{"user_id": 1, "user_name":"u1"}, {"user_id": 2."user_name":"u2"}, {"user_id": 3."user_name":"u3"}]
[{"user_id": 4."user_name":"u4"}, {"user_id": 5, "user_name":"u5"}, {"user_id": 6, "user_name":"u6"}]
[{"user_id": 10, "user_name":"u10"}]

The first parallel converts each record into a JSON object like {"user_id": 1, "user_name":"u1"}. The second parallel feeds every three JSON objects into paste's standard input, and paste joins them with commas. Finally, sed wraps each group in [ and ].

Used in combination with tmux

parallel provides the --tmuxpane option, which runs each command in its own tmux pane. This is ideal for watching the output of monitoring commands, such as checking the network status of several hosts.

# Use ping to monitor the network status of Jianshu, Baidu and Zhihu at the same time; note that the --delay option must be added
$ printf "www.jianshu.com\nwww.jianshu.com\nwww.baidu.com\nwww.zhihu.com"|parallel -j0 --delay 1 --tmuxpane ping {0}
See output with: tmux -S /tmp/tmsOHGXM attach
# Attach to the tmux session to view the panes
$ tmux -S /tmp/tmsOHGXM attach

In the tmux panes you can then watch the real-time ping results of the 4 hosts side by side.
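When you are done watching, the whole monitoring session can be shut down through the same socket that parallel reported (socket path taken from the message above):

# Stop all the ping panes and the tmux server behind this socket
tmux -S /tmp/tmsOHGXM kill-server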

Conclusion

If you use the shell a lot to solve all kinds of problems, the parallel command is well worth learning. It is genuinely powerful and convenient.

Previous articles

Still fiddling with connection idle timeouts?
Use socat to operate multiple machines in batches and improve efficiency
The jq command to the rescue (4)