C topped IEEE Spectrum's third annual "Top Programming Languages" ranking, but raw popularity is only part of the picture. This article explores big data development practice using PHP. Big data is the umbrella term for the tools and techniques used to process collections of data too large and complex for traditional methods, and the programming model most commonly used for such processing is MapReduce.

When to use MapReduce

MapReduce is particularly suited to problems involving large amounts of data. It works by dividing the work into smaller chunks that can be handled by multiple systems. Because MapReduce processes these chunks in parallel, it can solve such problems faster than traditional single-machine approaches.
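To make this concrete, here is a minimal in-process sketch of the three MapReduce phases (map, shuffle, reduce) as a word count over a few hard-coded lines. This is purely illustrative, with made-up input; in a real cluster each phase runs distributed across many nodes:

```php
<?php
// Illustrative only: the whole MapReduce flow in one process.
$lines = ["the water on", "on water", "on"];

// map phase: emit a (word, 1) pair for every word
$pairs = [];
foreach ($lines as $line) {
    foreach (preg_split('/\s+/', trim($line)) as $word) {
        $pairs[] = [$word, 1];
    }
}

// shuffle phase: group the values by key
// (Hadoop does this between the map and reduce steps)
$groups = [];
foreach ($pairs as $pair) {
    $groups[$pair[0]][] = $pair[1];
}

// reduce phase: collapse each group of values to a single value
$counts = [];
foreach ($groups as $word => $ones) {
    $counts[$word] = array_sum($ones);
}

print_r($counts); // the => 1, water => 2, on => 3
```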

The following scenarios are a good fit for MapReduce:

1. Counting and statistics
2. Sorting
3. Filtering

Apache Hadoop

In this article, we will use Apache Hadoop.

We will develop our MapReduce solutions with Hadoop, which is the de facto standard and is open source and free.

In addition, you can rent or build Hadoop clusters through cloud providers such as Amazon, Google, and Microsoft.

There are several other advantages:

Extensible: You can easily add new processing nodes without changing a single line of code

Cost-effective: No specialized or fancy hardware is needed, because the software runs well on ordinary hardware

Flexible: It is schemaless. You can handle any data structure, and even combine multiple data sources, without many problems.

Fault-tolerant: If a node has a problem, the other nodes can take over its work and the whole cluster continues processing.

In addition, Hadoop supports a utility called "streaming", which gives users the freedom to choose any scripting language to develop the mapper and the reducer.

In this article we will use PHP as the main development language.

Hadoop installation

Installing and configuring Apache Hadoop is beyond the scope of this article; you can easily find many platform-specific guides online. To keep things simple, let's focus on the MapReduce side.

Mapper

The mapper’s job is to convert input into a series of key-value pairs. In the case of a word counter, for example, the input is a series of lines. We split each line into words and turn them into key-value pairs (such as key: word, value: 1) that look like this:

the      1
water    1
on       1
on       1
water    1
on       1
...

These pairs are then sent to the Reducer for the next step.
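Note that the reducer relies on identical keys arriving one after another. A small sketch (with hypothetical mapper output) of why the sort step, performed by Hadoop's shuffle or by the `sort` command when testing locally, guarantees that property:

```php
<?php
// Illustrative only: sorting the mapper's output lines groups
// identical keys adjacently, which is the property the reducer's
// "compare with the previous key" logic depends on.
$mapperOutput = ["the\t1", "water\t1", "on\t1", "on\t1", "water\t1", "on\t1"];
sort($mapperOutput);
// after sorting: on, on, on, the, water, water
print_r($mapperOutput);
```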

Reducer

The reducer’s job is to retrieve the (sorted) pairs, iterate over them, and transform them into the desired output. In the word-counter example, we add up the counts (values) to get each word (key) and its final count, as follows:

water 2
the   1
on    3

The entire map and reduce process looks a bit like the following diagram:


Use PHP as a word counter

We’ll start with the “Hello World” of the MapReduce world: the implementation of a simple word counter. We’ll need some data to work with, and for that we’ll use the public-domain text of Moby Dick.

Execute the following command to download the book:

wget http://www.gutenberg.org/cache ... 1.txt

Create a working directory in HDFS (Hadoop distributed file system)

hadoop dfs -mkdir wordcount

Our PHP code starts with the mapper:

#!/usr/bin/php
<?php
// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split the line into words
    $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
    // iterate through words
    foreach ($words as $key) {
        // print word (key) to standard output;
        // the output will be used in the reduce (reducer.php) step:
        // word (key), tab, wordcount (1)
        printf("%s\t%d\n", $key, 1);
    }
}
?>

Here is the Reducer code.

#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;

// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split line into key and count
    list($key, $count) = explode("\t", $line);
    // this if/else structure works because hadoop sorts the
    // mapper output by key before sending it to the reducer
    // if the last key retrieved is the same
    // as the current key that has been received
    if ($last_key === $key) {
        // increase running total of the key
        $running_total += $count;
    } else {
        if ($last_key != NULL) {
            // output previous key and its running total
            printf("%s\t%d\n", $last_key, $running_total);
        }
        // reset last key and running total
        // by assigning the new key and its value
        $last_key = $key;
        $running_total = $count;
    }
}
if ($last_key != NULL) {
    // output the final key and its running total
    printf("%s\t%d\n", $last_key, $running_total);
}
?>

You can easily test the scripts locally by piping a few commands together:

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

We run it on an Apache Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
    -mapper "./mapper.php" \
    -reducer "./reducer.php" \
    -input "hello/mobydick.txt" \
    -output "hello/result"

The output is stored in the folder hello/result and can be viewed by executing the following command:

hdfs dfs -cat hello/result/part-00000


Calculate the average annual gold price

The next example is a more practical one. Although the data set is relatively small, the same logic can easily be applied to much larger data sets. We will attempt to calculate the average annual price of gold over the past 50 years.
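Before diving into the code, the core arithmetic the reducer will perform can be sketched as follows. The prices here are hypothetical placeholders, not real gold data: we keep a running total and an item count per key, and recompute the average on each step.

```php
<?php
// Illustrative only: incremental running average over one key's values.
$prices = [35.2, 38.7, 41.1, 43.0]; // hypothetical yearly prices
$running_total = 0;
$number_of_items = 0;
foreach ($prices as $price) {
    $running_total += $price;
    $number_of_items++;
    // recomputed on each iteration, as the reducer does
    $running_average = $running_total / $number_of_items;
}
printf("%.3f\n", $running_average); // 39.500
```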

We download the data set:

wget https://raw.githubusercontent. ... a.csv


Create a working directory in HDFS (Hadoop distributed file system)

hadoop dfs -mkdir goldprice

Copy the downloaded dataset to HDFS

hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv

The mapper looks something like this:

#!/usr/bin/php
<?php
// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // regular expression to capture year and gold value
    preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
    if ($matches) {
        // key: year, value: gold price
        printf("%s\t%.3f\n", $matches[1], $matches[2]);
    }
}
?>

The reducer also needs a slight modification, because we now need to track the number of items as well as a running average.

#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;
$running_average = 0;
$number_of_items = 0;

// iterate through lines
while ($line = fgets(STDIN)) {
    // remove leading and trailing whitespace
    $line = ltrim($line);
    $line = rtrim($line);
    // split line into key and count
    list($key, $count) = explode("\t", $line);
    // if the last key retrieved is the same
    // as the current key that has been received
    if ($last_key === $key) {
        // increase number of items
        $number_of_items++;
        // increase running total of the key
        $running_total += $count;
        // (re)calculate average for that key
        $running_average = $running_total / $number_of_items;
    } else {
        if ($last_key != NULL) {
            // output previous key and its running average
            printf("%s\t%.4f\n", $last_key, $running_average);
        }
        // reset key, running total, running average
        // and number of items
        $last_key = $key;
        $number_of_items = 1;
        $running_total = $count;
        $running_average = $count;
    }
}
if ($last_key != NULL) {
    // output the final key and its running average
    printf("%s\t%.3f\n", $last_key, $running_average);
}
?>

As with the word-count example, we can also test locally:

head -n1000 data.csv | ./mapper.php | sort | ./reducer.php

Finally, run it on a Hadoop cluster

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
    -mapper "./mapper.php" \
    -reducer "./reducer.php" \
    -input "goldprice/data.csv" \
    -output "goldprice/result"

Check the averages:

hdfs dfs -cat goldprice/result/part-00000

Bonus: Generate charts

We often want to turn the results into a chart. For this demo we'll use gnuplot, but feel free to use any other charting tool you prefer.

First, retrieve the results locally:

hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot configuration file (gold.plot) and copy the following into it:

# Gnuplot script file for generating gold prices
set terminal png
set output "chart.png"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"

Generate a chart:

gnuplot gold.plot

This generates a file called chart.png. It looks something like this: