Big Data Hadoop: An Introduction to the Hadoop Family

The term big data may have sounded unfamiliar a few years ago, but by now you have almost certainly heard of it, and of Hadoop along with it. More and more people are working with Hadoop or learning it. What does a beginner find difficult? Setting up the runtime environment alone is probably enough to give a novice a headache. It would help beginners a great deal if every Hadoop distribution were as integrated as DKHadoop and could be set up in a single installation.

Enough digression; back to the topic. This article shares some basic knowledge for anyone who is new to Hadoop: the products that make up the Hadoop family. Understanding these products should make it easier to learn Hadoop well. Your suggestions are welcome.

Hadoop is a big family: an open source ecosystem and a distributed computing framework built on Java. Its core technologies are HDFS and MapReduce, which allow it to process massive amounts of data in a distributed manner.

Hadoop family products

HDFS (distributed file system): HDFS differs from existing file systems in a number of ways: high fault tolerance (it keeps running even when errors occur), support for multimedia data and streaming data access, efficient access to large data sets, strict data consistency, lower deployment cost, and higher deployment efficiency. The figure shows the basic HDFS architecture.
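
To make the fault tolerance point more concrete, here is a minimal sketch of the general idea of splitting a file into blocks and keeping several copies of each block on different nodes. It is a toy model, not the HDFS implementation: the names `split_into_blocks`, `place_replicas` and `read_file`, the round-robin placement, and the tiny block size are all assumptions made for illustration.

```python
BLOCK_SIZE = 4    # bytes per block; toy value, real HDFS blocks are far larger
REPLICATION = 3   # copies kept of each block; assumed value for this sketch


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This placement rule is hypothetical; it only guarantees that the
    copies of one block never share a node.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(len(blocks))
    }


def read_file(blocks, placement, live_nodes):
    """Reassemble the file, requiring a live replica for every block."""
    out = b""
    for idx, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[idx]):
            raise IOError(f"block {idx} has no live replica")
        out += block
    return out


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    data = b"hello hadoop family"
    blocks = split_into_blocks(data)
    placement = place_replicas(blocks, nodes)
    # Lose one node: every block still has a live copy, so the read succeeds.
    survivors = set(nodes) - {"node-b"}
    print(read_file(blocks, placement, survivors) == data)
```

With three copies per block, the file in this model stays readable after any two nodes are lost; it only becomes unreadable once every node holding some block is gone.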

MapReduce/Spark/Storm (parallel computing architecture):

1. By data processing method, offline computing versus online computing:

MapReduce: used for complex offline big data computation.
Storm: used for online, real-time big data computation.
Spark: can be used for either offline or online real-time big data computation. For real-time work, Spark mainly processes the data that falls within a time window, which is what makes it flexible.

2. By where the data is held, disk-based computing versus in-memory computing: MapReduce keeps its data on disk, while Spark and Storm keep their data in memory.
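
For readers who have not seen the MapReduce programming model before, the classic word count can be sketched in a few lines of plain Python. This runs in a single process and does not use Hadoop at all; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are chosen here only to mirror the three stages of the model.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce calls run on many machines in parallel and the shuffle moves data between them; the sketch keeps only the logical structure.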

Pig/Hive (Hadoop programming):

Pig: a high-level programming language that handles semi-structured data very efficiently and helps shorten development cycles.
Hive: a data analysis and query tool, particularly suited to analyzing data with SQL queries. Being able to do in minutes what an ETL job would have needed a whole night to do is a real advantage.
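
As a rough illustration of what "analyzing data with SQL queries" looks like, the sketch below runs a small aggregate query. Note the assumptions: it uses Python's built-in `sqlite3` module as a stand-in, not Hive, and the `page_views` table, its columns, and the `views_per_page` helper are hypothetical. Hive's query language is SQL-like, but it is not identical to SQLite's dialect.

```python
import sqlite3


def views_per_page(events):
    """Count views per page with one declarative SQL query.

    `events` is an iterable of (user_id, page) tuples; the table and
    column names are made up for this example.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", events)
        query = """
            SELECT page, COUNT(*) AS views
            FROM page_views
            GROUP BY page
            ORDER BY views DESC, page ASC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("u1", "/home"), ("u2", "/home"), ("u1", "/docs"),
        ("u3", "/home"), ("u2", "/docs"), ("u3", "/faq"),
    ]
    for page, views in views_per_page(sample):
        print(page, views)
```

The point of the example is the style of work: the analyst states what result is wanted, and the engine decides how to compute it.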

HBase/Sqoop/Flume (data import and export):

HBase: a column-oriented database that runs on top of the Hadoop Distributed File System (HDFS) and integrates well with Pig and Hive. Through its Java API, applications can work with HBase almost seamlessly.
Sqoop: designed to make it easy to import data from traditional databases into Hadoop data stores (HDFS/Hive).
Flume: designed to make it easy to import data directly from log files into Hadoop data stores (HDFS).

These data transfer tools let users focus on business analysis and improve productivity.
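
The idea of column-oriented storage can be sketched with a toy in-memory model in which values are grouped by column family first and by row second, and a column is addressed as `family:qualifier`. This is not the HBase client API: the class `ToyColumnStore` and its `put`, `get` and `scan_family` methods are hypothetical, and the sketch leaves out features such as cell versions and persistence.

```python
from collections import defaultdict


class ToyColumnStore:
    """Toy column-family store: family -> row key -> {qualifier: value}."""

    def __init__(self, families):
        self.families = set(families)
        self._data = {family: defaultdict(dict) for family in self.families}

    def put(self, row, column, value):
        """Store a value under a 'family:qualifier' column for one row."""
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._data[family][row][qualifier] = value

    def get(self, row, column):
        """Return the stored value, or None if the cell is empty."""
        family, qualifier = column.split(":", 1)
        return self._data[family].get(row, {}).get(qualifier)

    def scan_family(self, family):
        """Read one family for all rows, in row-key order."""
        return {row: dict(cols) for row, cols in sorted(self._data[family].items())}


if __name__ == "__main__":
    store = ToyColumnStore(["info", "stats"])
    store.put("user1", "info:name", "Ann")
    store.put("user1", "stats:visits", 3)
    store.put("user2", "info:name", "Bob")
    print(store.get("user1", "info:name"))
    print(store.scan_family("info"))
```

Because each family is kept separately, scanning `info` in this model never touches the `stats` values, which is the main attraction of a column-oriented layout for analytical reads.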

ZooKeeper/Oozie (system management architecture):

ZooKeeper: a coordination service used to manage the basic configuration of distributed architectures. It provides many interfaces that simplify configuration management tasks.
Oozie: a service used to manage workflows. It schedules different workflows so that every job runs through to completion.

These tools help us manage distributed big data computing architectures in a lightweight way.
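
Workflow scheduling of the kind described for Oozie comes down to running jobs in an order that respects their dependencies. Here is a minimal sketch of that idea using Python's standard `graphlib` module (Python 3.9 or later). It does not use Oozie or its workflow definition format; the job names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, actions):
    """Run every job after all of the jobs it depends on.

    `dependencies` maps a job name to the set of jobs that must finish
    first; `actions` maps a job name to a zero-argument callable.
    Returns the job names in the order they were executed.
    """
    executed = []
    for job in TopologicalSorter(dependencies).static_order():
        actions[job]()
        executed.append(job)
    return executed


if __name__ == "__main__":
    # A simple linear pipeline: import -> clean -> aggregate -> export.
    dependencies = {
        "clean": {"import"},
        "aggregate": {"clean"},
        "export": {"aggregate"},
    }
    actions = {name: (lambda n=name: print("running", n))
               for name in ("import", "clean", "aggregate", "export")}
    print(run_workflow(dependencies, actions))
```

`TopologicalSorter` raises an error if the dependencies contain a cycle, which matches the requirement that every job must be able to reach its finish line.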

Ambari/Whirr (system deployment management):

Ambari: helps people quickly deploy an entire big data analysis architecture and monitor system health in real time.
Whirr: designed to help speed up cloud computing development.

Mahout (machine learning): Mahout is designed to help us build intelligent systems quickly. Part of the machine learning logic is already implemented in it, so we can quickly integrate more machine learning capabilities into our systems.

