What is Hadoop?

The Apache™ Hadoop® project develops open source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Because any individual machine may fail, the library does not rely on hardware to provide high availability; instead, it is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of machines.
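
The "simple programming model" referred to here is MapReduce. As a minimal, single-process sketch of that idea (this is not Hadoop's actual API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration), a word count can be written as a map step that emits key-value pairs, a shuffle step that groups them by key, and a reduce step that combines each group:

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different machines would map different parts of the
    # input; here everything runs in one process to keep the idea visible.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The point of the model is that the map and reduce functions only ever see one record or one key at a time, which is what lets a framework spread the work across many machines.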

The origin of Hadoop

  • In 2003 and 2004, Google published details of its GFS and MapReduce designs. Inspired by them, Doug Cutting and his collaborators spent about two years of spare time implementing a distributed file system (DFS) and a MapReduce mechanism, which made Nutch's performance soar. Yahoo then hired Doug Cutting and took on his project.
  • Hadoop was officially brought into the Apache Foundation in 2005 as part of Nutch, a sub-project of Lucene.
  • In February 2006, it was spun off into a complete, independent software package named Hadoop.

To sum up, Hadoop originated in three papers published by Google:

  • GFS: the Google File System, Google's distributed file system
  • MapReduce: Google's distributed parallel computing framework (Hadoop MapReduce is an open-source implementation of it)
  • BigTable: a large-scale distributed database
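
To make the distributed file system idea in the list above more concrete, here is a rough sketch of how a GFS-style system might split a file into fixed-size blocks and keep each block on several machines, so that losing one machine does not lose data. The block size, replica count, node names, and round-robin placement are illustrative assumptions, not the actual GFS or HDFS placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replicas: int):
    """Assign each block to `replicas` distinct nodes (hypothetical round-robin policy)."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replicas)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """Every block stays readable if at least one of its replicas is on a healthy node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    nodes = ["node1", "node2", "node3", "node4"]  # made-up node names
    placement = place_replicas(len(blocks), nodes, replicas=3)
    print(placement)
    print(readable_after_failure(placement, "node1"))
```

With a single replica per block, the same check fails as soon as the node holding any block goes down, which is why replication is done in software rather than left to the hardware.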

The history of Hadoop

Hadoop is developed on several branches in parallel. So far, the major versions are 1.x, 2.x and 3.x. The following mind maps show the features and shortcomings of each version.

Hadoop 1.x

Hadoop 2.x, Hadoop 3.x

So far, the author has presented the features and shortcomings of the three major Hadoop versions as mind maps. As for choosing a version, it is recommended to learn with a release from Hadoop 2.0 onward. For a production deployment, choose a stable release that suits the actual service environment.


I am still a beginner in this field and learning as I go. If there are mistakes in this article, please point them out.