A word before we start: before learning big data, you first need the idea of divide and conquer. And before you can understand distributed computing, you need to understand how to handle big data on a single machine, because the essence of distribution is squeezing the most performance out of every individual machine.

The idea of divide and conquer

Now suppose you have a requirement: 10,000 elements to store, and you want to look one of them up. What is the simplest way? And what if you want the lookup cost to be roughly O(4), i.e. about four comparisons?

The simplest way is obviously a linear scan, O(n): check the elements one by one. To get a lookup down to about four comparisons, divide the 10,000 elements into 2,500 buckets of roughly 4 elements each, and only ever search the one bucket an element can be in. There are many ways to partition the data; here you can compute each element's hashCode and take it modulo 2,500. With hashCode % 2500, every element lands in one of 2,500 buckets, so a lookup hashes the key, jumps straight to its bucket, and scans about 4 elements.
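The bucketing described above can be sketched as a tiny class. This is a minimal illustration, not a production hash table; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the bucketing idea: 10,000 elements spread across
// 2,500 buckets by hashCode, so each bucket holds about 4 elements and
// a lookup only scans its own bucket instead of all 10,000.
public class BucketStore {
    private static final int BUCKETS = 2500;
    private final List<List<String>> buckets = new ArrayList<>();

    public BucketStore() {
        for (int i = 0; i < BUCKETS; i++) buckets.add(new ArrayList<>());
    }

    private int bucketOf(String s) {
        // Math.floorMod keeps the index non-negative even for negative hashCodes
        return Math.floorMod(s.hashCode(), BUCKETS);
    }

    public void add(String s) {
        buckets.get(bucketOf(s)).add(s);
    }

    public boolean contains(String s) {
        // Only the matching bucket (~4 elements on average) is scanned
        for (String e : buckets.get(bucketOf(s))) {
            if (e.equals(s)) return true;
        }
        return false;
    }
}
```

Because the same element always produces the same hashCode, a lookup is guaranteed to probe the same bucket that `add` wrote to. This is, of course, exactly how `HashMap` works internally.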

Handling big data on a single machine

Requirement: a 1 TB text file in which exactly two lines are identical, at unknown positions. How do you find those two lines?

If you traverse the file directly, then assuming a disk I/O speed of 500 MB per second, one pass over the file takes roughly half an hour (1 TB ÷ 500 MB/s ≈ 2,000 seconds), and N passes take N times as long.

Since memory access is on the order of 100,000 times faster than a disk seek, the idea is to bring the data into memory for computation, but a server's memory is limited. So read the file line by line and append each line to one of 2,000 part files chosen by readLine().hashCode() % 2000; each part file is then about 500 MB and fits in memory. Because two identical lines must have the same hashCode, they must land in the same part file, so you only ever need to compare lines within one file, never across files.
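The two phases can be sketched as follows. This is an illustrative skeleton under the stated assumptions (2,000 parts, each part small enough to hold in memory); the class and method names are invented for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two-phase, single-machine approach. Phase 1 streams the
// 1 TB file once and appends each line to one of 2,000 part files chosen
// by hashCode; phase 2 loads one ~500 MB part file's lines into memory
// and looks for a line that appears twice.
public class DuplicateFinder {
    static final int PARTS = 2000;

    // Phase 1: which part file a line is appended to.
    // Math.floorMod keeps the index non-negative for negative hashCodes.
    static int partOf(String line) {
        return Math.floorMod(line.hashCode(), PARTS);
    }

    // Phase 2: the two identical lines share a hashCode, hence a part file,
    // so a per-file HashSet is enough to catch the repeat.
    static String findDuplicate(List<String> partFileLines) {
        Set<String> seen = new HashSet<>();
        for (String line : partFileLines) {
            if (!seen.add(line)) return line;  // add() returns false on a repeat
        }
        return null;  // no duplicate in this part file
    }
}
```

Phase 1 costs one sequential pass over the 1 TB file; phase 2 is 2,000 independent in-memory scans, each of which could also run in parallel.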

The above is the single-machine approach: speeding up the computation through divide and conquer.

Requirement: change the above requirement to sorting the lines of the file.

At this point, partitioning by hashCode no longer helps, because hashing scatters any ordering. Instead, split the file into chunks small enough to load into memory, sort each chunk in memory, write it back out, and then merge the sorted chunks into the final ordering — an external merge sort.
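The merge step of an external sort can be sketched in memory with lists standing in for the sorted chunk files. A priority queue repeatedly takes the smallest current head element across all chunks (a k-way merge); the class names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the k-way merge used by an external merge sort: each chunk
// was already sorted (in memory) and written out; here, lists stand in
// for those sorted chunk files.
public class KWayMerge {
    public static List<String> merge(List<List<String>> sortedChunks) {
        // The queue is ordered by each chunk's current head element
        PriorityQueue<ChunkCursor> pq = new PriorityQueue<>(
            Comparator.comparing((ChunkCursor c) -> c.head));
        for (List<String> chunk : sortedChunks) {
            Iterator<String> it = chunk.iterator();
            if (it.hasNext()) pq.add(new ChunkCursor(it));
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            ChunkCursor c = pq.poll();   // chunk with the smallest head
            out.add(c.head);
            if (c.advance()) pq.add(c);  // re-insert with its next head
        }
        return out;
    }

    // Small helper that keeps a chunk's current head element visible
    static class ChunkCursor {
        final Iterator<String> inner;
        String head;
        ChunkCursor(Iterator<String> inner) { this.inner = inner; head = inner.next(); }
        boolean advance() {
            if (!inner.hasNext()) return false;
            head = inner.next();
            return true;
        }
    }
}
```

In a real external sort, the chunk cursors would read buffered lines from the sorted temp files rather than from in-memory lists, so only one buffer per chunk needs to be resident at a time.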

Handling big data with a cluster

Requirement: a 1 TB text file in which exactly two lines are identical, at unknown positions. How do you find those two lines?

With a cluster, you can spread the 1 TB of data across 2,000 machines, about 500 MB per machine. In parallel, each machine splits its local 500 MB into 2,000 parts by readLine().hashCode() % 2000 — the same per-machine logic as in the single-machine solution. Then the parts with the same hash value are pulled from every machine onto one machine and compared there, with each machine doing its comparisons in parallel.
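The routing behind this shuffle can be sketched in a few lines. This is only an illustration of the partition/owner mapping; the names and the modulo assignment of parts to machines are assumptions for the example, not a description of any particular framework.

```java
// Sketch of the cluster routing: every machine splits its local lines
// into 2,000 logical parts with the same hash function, then part i is
// pulled from all machines to a single owner machine, so two identical
// lines are guaranteed to meet on that owner.
public class ClusterRouting {
    static final int PARTS = 2000;

    // Which of the 2,000 logical parts a line belongs to.
    // Identical lines get identical hashCodes, hence the same part,
    // no matter which machine originally held them.
    static int partOf(String line) {
        return Math.floorMod(line.hashCode(), PARTS);
    }

    // Which machine owns a given part during the shuffle phase
    // (a simple round-robin assignment for the sketch).
    static int ownerOf(int part, int machineCount) {
        return part % machineCount;
    }
}
```

This meet-on-one-machine guarantee is the same idea as the shuffle between the map and reduce phases of MapReduce: records with the same key are routed to the same reducer.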

Big data technology focus

  • Divide and conquer
  • Parallel computing
  • Computation moves to the data
  • Data-local reads (data locality)

Whether you are learning Hadoop or another distributed framework, these four points are what to focus on.