Hadoop is an open source software framework that can be installed on a cluster of commodity machines, allowing the machines to communicate and work together to store and process large amounts of data in a highly distributed manner. Initially, Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS) and a distributed computing engine for implementing and running programs as MapReduce jobs. Hadoop provides the software infrastructure to run a MapReduce job as a series of Map and Reduce tasks. A Map task calls the Map function on a subset of the input data. After these calls, a Reduce task calls the Reduce function on the intermediate data produced by the Map function to generate the final output. Map and Reduce tasks run independently of each other, which enables parallel and fault-tolerant computation. Most importantly, the Hadoop infrastructure handles all the complex aspects of distributed processing: parallelization, scheduling, resource management, inter-machine communication, handling of software and hardware failures, and so on. Thanks to this clean abstraction, implementing distributed applications that process terabytes of data on hundreds (or even thousands) of machines has never been easier, even for developers with no prior experience with distributed systems.
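To make the Map/Reduce division of labor concrete, here is the standard word-count example written against the org.apache.hadoop.mapreduce API; it is a minimal sketch rather than anything taken from this article. The Map function emits (word, 1) pairs from its input split, and the Reduce function sums the intermediate counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each Map task runs this on its own split of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Each Reduce task receives the intermediate (word, counts) groups and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the application code only defines the two functions and the job configuration; splitting the input, scheduling the tasks, moving the intermediate data, and re-running failed tasks are all handled by the framework, as described above.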

In the MapReduce process, Shuffle and Combine work as follows. The overall Shuffle process consists of the Shuffle on the Map side, the Sort phase, and the Shuffle on the Reduce side; that is, Shuffle spans both Map and Reduce. It covers the path data takes from the output of a Map task to the input of a Reduce task, and it includes the Sort phase. Sort and Combine happen on the Map side, and Combine is an early reduce that only runs if we configure it ourselves (see the shuffle-tuning sketch below). In a Hadoop cluster, most Map and Reduce tasks execute on different nodes, so in many cases a Reduce task must pull Map outputs across the network from other nodes. When many jobs run in a cluster, even normal task execution consumes significant network resources; that consumption cannot be avoided entirely, so the goal is to minimize the unnecessary part of it. In addition, within a node, disk I/O has a far greater impact on job completion time than memory access. The basic requirements for tuning the Shuffle process of a MapReduce job are therefore: pull the data from the Map tasks to the Reduce side completely and correctly; minimize unnecessary bandwidth consumption when data is pulled across nodes; and reduce the impact of disk I/O on task execution. In general, the Shuffle process can be optimized by reducing the amount of data pulled and by using memory rather than disk wherever possible.

In YARN, the ResourceManager takes over the cluster-management role of the old JobTracker, a dedicated and transient ApplicationMaster replaces the per-job part of the JobTracker, the NodeManager replaces the TaskTracker, and a distributed application replaces a MapReduce job. The global ResourceManager runs as the main background process, typically on a dedicated machine, and mediates the available cluster resources among competing applications. When a user submits an application, a lightweight process instance called the ApplicationMaster is started to coordinate the execution of all tasks within that application. This includes monitoring tasks, restarting failed tasks, speculatively re-running tasks that appear to be slow, and aggregating the values of the application's counters. Notably, the ApplicationMaster can run any type of task inside a container. The NodeManager is a more generic and efficient version of the TaskTracker: instead of a fixed number of Map and Reduce slots, it manages a set of dynamically created resource containers.
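As a rough sketch of these shuffle optimizations, the job driver below enables the combiner (the map-side early reduce) and two commonly used shuffle-related settings. It reuses the WordCount classes from the earlier sketch; the class name ShuffleTunedWordCount and the 256 MB buffer size are made up for illustration, and the property names (mapreduce.task.io.sort.mb, mapreduce.map.output.compress) follow Hadoop 2.x/3.x naming and should be checked against the distribution in use.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShuffleTunedWordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // A larger map-side sort buffer means fewer spill files and less disk I/O in the shuffle.
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    // Compressing intermediate map output shrinks the data pulled across the network
    // (assumes the Snappy native libraries are available on the cluster).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "shuffle-tuned word count");
    job.setJarByClass(ShuffleTunedWordCount.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner is the map-side "early reduce"; it runs only because we set it here.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Pre-aggregating on the map side with the combiner and compressing the intermediate output both target the same goal named above: reduce the amount of data that has to be pulled across nodes during the shuffle.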
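As a small illustration of how client code interacts with these YARN components, the sketch below (not from the original text) uses the YarnClient API to ask the ResourceManager for reports on the running NodeManagers. It assumes a reachable cluster whose yarn-site.xml is on the classpath; the class name ListNodeManagers is made up for this example.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
  public static void main(String[] args) throws Exception {
    // Reads yarn-site.xml from the classpath to locate the ResourceManager.
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for a report on every running NodeManager.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId()
          + " running containers: " + node.getNumContainers()
          + ", capability: " + node.getCapability());
    }

    yarnClient.stop();
  }
}
```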

Big data Hadoop vendors include Amazon Web Services, Cloudera, Hortonworks, IBM, MapR Technologies, Huawei, and Dakuaisearch. These vendors build on the Apache open source project and then add features of their own, such as packaging, support, integration, and other innovations. DKH, a fast general-purpose computing platform for big data, integrates version-matched components of the development framework, including Hadoop, Spark, Hive, Sqoop, Flume, and Kafka. Its modules break down as follows: data collection is handled by dk.Hadoop; data processing by Hadoop, Spark, Storm, and Hive; machine learning and AI by dk.Hadoop and Spark; the NLP module is used by uploading JAR packages to the server, which it supports directly; and the search engine module is not released as a standalone component.