As for Hadoop, you are advised to set up a single-node or multi-node Hadoop platform on Linux by following an online tutorial, or refer to the Hadoop installation tutorial: single-node/pseudo-distributed configuration. I am new to MapReduce and can only think of it in terms of "divide and conquer": map is the "divide" step, which splits up the data, and reduce then aggregates the results of the map phase. The example given here is the classic Hadoop starter program, WordCount: a map program splits the input text into individual words, and a reduce program counts the identical words among them and outputs each distinct word together with its frequency.
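Before bringing Hadoop into the picture, the same idea can be sketched in plain Python (a minimal illustration of the divide/count split, not part of the Hadoop job itself):

from collections import Counter

text = "foo foo quux labs foo bar zoo zoo hying"
words = text.split()     # "map"/divide: break the text into individual words
counts = Counter(words)  # "reduce": count how often each distinct word appears
print(counts)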

Note: Data input and output go through sys.stdin (standard input) and sys.stdout (standard output). All scripts must be made executable before they are run, for example with chmod +x mapper.py.
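For example, a single command (assuming both scripts are in the current directory) makes them executable:

  chmod +x mapper.py reducer.py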

mapper.py

#!/usr/bin/env python
# ($HADOOP_HOME/mapper.py)
import sys

for line in sys.stdin:  # Iterate over each line read from standard input
    line = line.strip()  # Remove whitespace at the beginning and end of the line
    words = line.split()  # Break the line into individual words on whitespace
    for word in words:
        print('%s\t%s' % (word, 1))  # Emit each word with a count of 1, tab-separated

reducer.py

#!/usr/bin/env python
# ($HADOOP_HOME/reducer.py)
import sys

current_word = None  # The word currently being counted
current_count = 0    # Running total for the current word
word = None

for line in sys.stdin:
    line = line.strip()  # Remove whitespace at the beginning and end of the line
    word, count = line.split('\t', 1)  # Separate the word and its count on the tab

    try:
        count = int(count)  # Convert the count string '1' to the integer 1
    except ValueError:
        continue  # Skip lines whose count is not a number

    if current_word == word:  # Same word as on the previous line
        current_count += count  # Add its count to the running total
    else:
        if current_word:  # Print the previous word and its total, if there is one
            print('%s\t%s' % (current_word, current_count))
        current_count = count  # Start a new running total for the word just read
        current_word = word

if current_word == word:  # Print the total for the last word
    print('%s\t%s' % (current_word, current_count))

Viewing the output

  cd $HADOOP_HOME
  echo "foo foo quux labs foo bar zoo zoo hying" | ./mapper.py | sort -k1,1 | ./reducer.py

Here echo writes the string "foo foo quux labs foo bar zoo zoo hying" to standard output, and the pipe "|" feeds it to mapper.py as input; mapper.py's output is then sorted and piped into reducer.py. The sort -k1,1 step sorts the mapper output by its first field (the word) so that identical words end up on adjacent lines, which is exactly what reducer.py relies on; on a real cluster, Hadoop's shuffle-and-sort phase performs this step between the map and reduce stages.

Running the pipeline above should print each distinct word with its count, in alphabetical order:
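  bar	1
  foo	3
  hying	1
  labs	1
  quux	1
  zoo	2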

Get Python code running on Hadoop!

1. Prepare input data

  • Next, download three books from Project Gutenberg:

     mkdir -p tmp/gutenberg
     cd tmp/gutenberg
     wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
     wget http://www.gutenberg.org/files/5000/5000-8.txt
     wget http://www.gutenberg.org/ebooks/4300.txt.utf-8
  • Then upload these three books to HDFS:

    hdfs dfs -mkdir ./input  # Create a folder for input files under this user's home directory on HDFS
    hdfs dfs -put ./tmp/gutenberg/* ./input  # Upload the downloaded files into the input folder on HDFS
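    # Optional sanity check (an extra step, not in the original walkthrough):
    hdfs dfs -ls ./input  # List the uploaded files to confirm the put succeeded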
  • Find the jar file that contains the Hadoop Streaming library. Note that in version 2.6 it is under the share directory. Go to the Hadoop installation directory and search for it:

    cd $HADOOP_HOME
    find ./ -name "*streaming*"
  • Since this file has a long path, we can store it in an environment variable:

    vi ~/.bashrc  # Open the environment variable configuration file
    # Add the streaming jar path inside:
    export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
    source ~/.bashrc  # Reload the configuration so the variable takes effect
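    echo $STREAM  # Optional: verify that the variable now points to the streaming jar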
  • Run the script through the Streaming interface

    hadoop jar $STREAM -files ./mapper.py,./reducer.py -mapper ./mapper.py -reducer ./reducer.py -input ./input -output ./output
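    # What the options mean: -files ships mapper.py and reducer.py to the cluster nodes,
    # -mapper/-reducer name the commands to run in each phase, and -input/-output are
    # HDFS paths (the output directory must not exist before the job runs)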
    • When the job completes, you can view the word counts as shown below (assuming Hadoop Streaming's default output location in HDFS):
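      hdfs dfs -cat ./output/*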

Once again, thanks to the following documentation: Getting Python to run streaming on Hadoop