Introduction to Hadoop

  • Hadoop is an open-source distributed computing platform from Apache. It is built on the HDFS distributed file system and the MapReduce distributed computing framework, and provides users with a transparent distributed infrastructure
  • The core design of the Hadoop framework is HDFS and MapReduce: HDFS stores massive amounts of data, and MapReduce computes over it
    • HDFS is the Hadoop Distributed File System, offering high fault tolerance and scalability. It allows users to build distributed storage on inexpensive hardware and provides the underlying support for distributed computing and storage (a small usage sketch follows this list)
    • MapReduce provides a simple API that lets users develop distributed parallel programs without knowing the underlying details, using large-scale cluster resources to solve big-data processing problems that a single machine cannot handle
    • The design idea originated from Google's GFS and MapReduce papers
  • Doug Cutting developed Hadoop at Yahoo; it was contributed to the Apache Foundation in 2008
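
To make the storage side concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The path and file contents are made-up examples; the sketch assumes a cluster configuration whose fs.defaultFS points at HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS from the cluster configuration (e.g. hdfs://namenode:8020)
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");  // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");                // blocks are replicated across DataNodes
            }
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());          // read the data back
            }
        }
    }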

Hadoop Core Project

HDFS

HDFS: Hadoop Distributed File System

MapReduce

MapReduce: Programming model and parallel computing framework

MRv1

First-generation MapReduce. It consists of two parts:

  • Programming model
  • Runtime environment (Computing framework)

Design goals:

  • It mainly addresses the poor scalability of massive-data processing faced by search engines
  • Ease of programming: it simplifies distributed programming so that users only need to focus on implementing their own application logic

MRv1 programming model

Programming model: Multithreaded programming model

  • Parallel processing
  • Data sharing
  • Need to coordinate through locks
  • Complex write operations
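
For comparison, a small Java sketch (added for illustration, not part of the original notes) of what the multithreaded model implies: several threads update one shared counter and must coordinate through a lock.

    public class SharedCounter {
        private long count = 0;                        // shared mutable state
        private final Object lock = new Object();

        void increment() {
            synchronized (lock) {                      // coordination through a lock
                count++;
            }
        }

        long get() {
            synchronized (lock) { return count; }
        }

        public static void main(String[] args) throws InterruptedException {
            SharedCounter c = new SharedCounter();
            Thread[] workers = new Thread[4];
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(() -> {
                    for (int j = 0; j < 1_000_000; j++) c.increment();
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join();
            System.out.println(c.get());               // 4000000, but only because of the lock
        }
    }

Without the synchronized blocks the increments would race and the final count would be wrong; avoiding exactly this coordination burden is the point of the data-driven model described next.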

Programming model: Data-driven programming model

  • Triggered by incoming data
  • Data sharing is prohibited between processing units
  • There is no need for coordination through locks

Programming model: MapReduce programming model

  • A specialized data-driven model
  • There are two phases: Map and Reduce
  • Concurrency only occurs within the same job
  • Data access for different jobs does not need to be coordinated
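
The classic word count illustrates these two phases. The sketch below uses the newer org.apache.hadoop.mapreduce Java API; the original MRv1 org.apache.hadoop.mapred API differs in class names but has the same Map/Reduce structure. Input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit <word, 1> for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }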

Data segmentation

The data source can be, for example, a company's daily logs.

Function:

  1. Split the data into several splits according to a certain policy, which determines the number of Map tasks.
  2. Given a split, parse it into key/value pairs.
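
These two functions correspond to the two methods of the classic MRv1 InputFormat interface (org.apache.hadoop.mapred), reproduced here in simplified form:

    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Simplified shape of org.apache.hadoop.mapred.InputFormat
    public interface InputFormat<K, V> {
        // 1. split the input into InputSplits; one Map task is launched per split
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

        // 2. given a split, return a RecordReader that parses it into key/value pairs
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
                throws IOException;
    }

For instance, TextInputFormat splits plain-text files such as log files, and its RecordReader emits <byte offset, line> pairs to the Map tasks.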

How is the data split?

  • File segmentation algorithm
  • Host selection algorithm

Where should each Map task be started? (task locality)

File segmentation algorithm

The file splitting algorithm determines the number of InputSplits and the data segment covered by each InputSplit.

  • A large file is split into several InputSplits
  • Files are split by a “fixed” size, called the split size
  • splitSize = max{ minSize, min{ totalSize / numSplits, blockSize } } (see the sketch after this list)
    • numSplits is the number of Map tasks requested by the user; the default value is 1
    • minSize is the minimum split size, set by a configuration parameter; the default is 1
    • blockSize is the block size in HDFS; the default value is 64 MB
      • Newer deployments typically set the block size to 128 MB or 256 MB to reduce NameNode pressure
      • A 32 MB file still occupies one block
      • A block only occupies as much physical space as the data it holds: a block containing 1 KB of data takes 1 KB on disk
  • Once splitSize is determined, the file is divided into InputSplits of splitSize bytes each
  • Finally, any remaining data smaller than splitSize becomes a separate InputSplit
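
A minimal sketch of how the formula behaves, with made-up numbers (a 1000 MB input file and 5 requested Map tasks):

    public class SplitSizeDemo {
        // splitSize = max(minSize, min(totalSize / numSplits, blockSize))
        static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
            long goalSize = totalSize / Math.max(1, numSplits);
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            long totalSize = 1000 * mb;  // hypothetical input file size
            int numSplits = 5;           // Map tasks requested by the user
            long minSize = 1;            // default minimum split size
            long blockSize = 64 * mb;    // default HDFS block size in MRv1

            long split = splitSize(totalSize, numSplits, minSize, blockSize);
            System.out.println("splitSize = " + split / mb + " MB");  // 64 MB, capped by blockSize
            // The file is cut into 15 full 64 MB InputSplits plus one final 40 MB InputSplit.
        }
    }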

Host selection algorithm

The InputSplit object contains four properties: file name, start position, split length, and node list, forming the quadruple <file, start, length, hosts>. The node list is the key to task locality. Hadoop divides data locality, by cost, into three levels: Node, Rack, and Any.

  • Node: the task runs on the machine where the data resides
  • Rack: if the node holding the data has no free resources, the task runs on another node in the same rack/network
  • Any: in the worst case, the task runs on a node in a different rack and reads the data across racks

Task locality means that idle resources are preferentially used to process data on the same node; if no such data is available on the node, data on the same rack is processed; in the worst case, data on other racks is processed.
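
As an illustration of the <file, start, length, hosts> quadruple, the sketch below builds a FileSplit by hand with a made-up log path and host names; in a real job the InputFormat fills these fields in from the block locations reported by the NameNode, and the scheduler uses them to prefer node locality, then rack locality, then any node.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;

    public class SplitLocality {
        public static void main(String[] args) throws Exception {
            FileSplit split = new FileSplit(
                    new Path("/logs/2024-01-01.log"),      // file (hypothetical)
                    0L,                                    // start offset
                    64L * 1024 * 1024,                     // length: 64 MB
                    new String[] {"node-17", "node-42"});  // nodes holding this block (hypothetical)

            // The scheduler prefers a free slot on one of these hosts.
            for (String host : split.getLocations()) {
                System.out.println("preferred host: " + host);
            }
        }
    }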

The main points of