With demand for big data development positions rising, many programmers choose to switch to big data when they hit a career bottleneck.


Do you realize that this is a major turning point in your life? Whether you can seize the opportunity of this era depends on how well you acquire and apply big data knowledge. The key to becoming a trendsetter in the era of big data is mastering the skills that are most urgently needed. Google, Alibaba, Baidu, and JD.com are all in urgent need of big data talent who have mastered Hadoop! Whatever part of big data you are proficient in, you will stand out in the future workplace.

01 Single-choice questions


1. Which of the following programs is responsible for HDFS data storage?


a)NameNode

b)JobTracker

c)DataNode

d)SecondaryNameNode

e)TaskTracker


Answer: C (DataNode)


2. How many copies of each Block does HDFS save by default?


a)3 b)2 c)1 d)not sure


Answer: A (three copies by default)


3. Who is the author of Hadoop?


a)Martin Fowler

b)Kent Beck

c)Doug Cutting


Answer: C (Doug Cutting)


4. Which of the following programs usually starts on the same node as NameNode?


a)SecondaryNameNode

b)DataNode

c)TaskTracker

d)Jobtracker


Answer: D


Analysis of this problem:


A Hadoop cluster runs in master/slave mode. The NameNode and JobTracker belong to the master; the DataNodes and TaskTrackers belong to the slaves. There is only one master, while there are multiple slaves. The SecondaryNameNode's memory requirement is on the same order of magnitude as the NameNode's, so the SecondaryNameNode and the NameNode usually run on different physical machines.


As for JobTracker and TaskTracker: the JobTracker corresponds to the NameNode, and the TaskTracker corresponds to the DataNode. The DataNode and the NameNode handle data storage, while the JobTracker and the TaskTracker handle the execution of MapReduce. MapReduce involves several main execution roles: JobClient, JobTracker, and TaskTracker.


The JobClient uses the JobClient class to package the application's configuration parameters into a jar file, store it in HDFS, and submit the path to the JobTracker. The JobTracker then creates each Task (i.e., MapTask and ReduceTask) and distributes them to TaskTracker services for execution. The JobTracker is a master service: after Hadoop starts, it receives Jobs, schedules each sub-task of a Job to run on a TaskTracker, monitors them, and re-runs any task that fails. In general, the JobTracker should be deployed on a separate machine. The TaskTracker is a slave service that runs on multiple nodes; it actively communicates with the JobTracker, receives tasks, and is responsible for directly executing each one. TaskTrackers need to run on the DataNodes of HDFS.
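
For reference, here is a minimal sketch of this submission path, assuming the classic Hadoop 1.x org.apache.hadoop.mapred API and the built-in identity mapper/reducer; it is an illustration, not code from the exam questions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitJob.class);
            conf.setJobName("example");
            // Identity mapper/reducer simply pass records through.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // JobClient packages the job configuration and jar into HDFS and
            // submits the path to the JobTracker, which creates the MapTasks
            // and ReduceTasks and hands them to TaskTrackers to execute.
            JobClient.runJob(conf);
        }
    }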


5. Which of the following is usually the most important bottleneck in a cluster?


a)CPU b)network c)disk I/O d)memory


Answer: C


Analysis of this problem:


The first goal of clustering is to save cost, replacing minicomputers and mainframes with cheap PCs. What are the features of minicomputers and mainframes?


  1. Strong CPU processing capability.

  2. Memory large enough. So the bottleneck of the cluster cannot be a) or d).

  3. The network is a scarce resource, but it is not the bottleneck.

  4. Because big data means massive amounts of data, reading and writing it requires heavy I/O, and the data is then stored redundantly (Hadoop keeps three copies by default), so disk I/O bears the heaviest load and becomes the bottleneck.


6. What is the default HDFS Block Size?


a)32MB

b)64MB

c)128MB


Answer: B


7. Which of the following is true about SecondaryNameNode?


a)It is a hot standby for the NameNode

b)It has no memory requirements

c)It is designed to help the NameNode merge edit logs and reduce NameNode startup time

d)It should be deployed on the same node as the NameNode


Answer: C


02 Multiple-choice questions


1. Which of the following can be used to manage a cluster?


a)Puppet

b)Pdsh

c)Cloudera Manager

d)Zookeeper


Answer: ABD


2. Which of the following are correct when rack awareness is configured?


a)If a rack fails, data reads and writes are not affected

b)Data is written to DataNodes on different racks

c)MapReduce obtains data over the network from racks close to it


Answer: ABC


3. Which of the following is correct when the Client uploads files?


a)Data is transferred to the DataNodes through the NameNode

b)The Client splits the file into Blocks and uploads them in sequence

c)The Client uploads data to only one DataNode, and the NameNode copies the Blocks to other DataNodes


Answer: B


Analysis of this problem:


The Client initiates a file write request to the NameNode.


The NameNode returns information about the DataNodes it manages, based on the file size and the file-block configuration. The Client divides the file into multiple Blocks and, using the DataNode address information, writes them in sequence to each DataNode.
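
To make the write path concrete, here is a minimal client-side sketch using the standard org.apache.hadoop.fs.FileSystem API; the path and content are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // create() asks the NameNode for target DataNodes; the stream
            // then splits the written bytes into Blocks and pipelines them
            // to the DataNodes directly -- they never pass through the
            // NameNode.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
                out.writeUTF("hello hdfs");
            }
        }
    }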


4. Which of the following are running modes of Hadoop?


a)standalone b)pseudo-distributed c)fully distributed


Answer: ABC


5. What methods does Cloudera offer to install CDH?


a)Cloudera manager

b)Tarball

c)Yum

d)Rpm


Answer: ABCD



03 True or false questions


1. Ganglia can not only monitor, but also alert.


Correct


Strictly speaking, yes. Ganglia, one of the most commonly used monitoring tools in Linux environments, excels at collecting data from nodes at a low cost to the user. But Ganglia is not very good at warning users and notifying them of events, although the latest versions have added some of this functionality. Nagios is better at alerting: it is a tool that specializes in early warning and notification. By combining the two, using Ganglia as the data source for Nagios and Nagios as the alerting layer, you can build a complete monitoring and management system.


2. Block Size cannot be modified.


Error


Analysis of this problem: Hadoop's base configuration file is hadoop-default.xml. When a JobConf is created, the configuration in hadoop-default.xml is read first, and then the configuration in hadoop-site.xml (a file that is initially empty) is read. Settings in hadoop-site.xml override the system-level defaults in hadoop-default.xml, and the block size is one of these configurable settings, so it can be modified.
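
As an illustration, here is a minimal sketch of overriding the block size from client code. It assumes the Hadoop 1.x property name dfs.block.size (later versions use dfs.blocksize); the path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the default block size (64 MB in Hadoop 1.x) for
            // files created through this client: here, 128 MB.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/tmp/bigblocks"))) {
                out.writeUTF("written with a 128 MB block size");
            }
        }
    }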


3. Nagios cannot monitor a Hadoop cluster because it does not provide Hadoop support.


Error


Analysis of this problem: Nagios is a cluster monitoring tool and one of the three most widely used monitoring tools in cloud computing, and it can be used to monitor a Hadoop cluster.


4. The SecondaryNameNode takes over if the NameNode terminates unexpectedly.


Error


Analysis of this problem: The SecondaryNameNode is an aid to recovery, not a replacement for the NameNode.


5. Cloudera CDH costs a fee to use.


Error


Analysis of this problem: CDH itself is free to use. What costs money is Cloudera Enterprise, unveiled at the Hadoop Summit in California, which enhances Hadoop with several proprietary management, monitoring, and operational tools. The fee is a contract subscription, and the price varies with the size of the Hadoop cluster.


6. Hadoop is developed in Java, so MapReduce programs can only be written in Java.


Error


Analysis of this problem: RHadoop, for example, is developed in the R language. MapReduce is a framework and can be understood as a programming model, so MapReduce programs can also be developed in other languages, for example through Hadoop Streaming.


7. Hadoop supports random data reads and writes.


Error


Analysis of this problem: Lucene supports random reads and writes, whereas HDFS supports only random reads (writes are append-only). But HBase can help. HBase provides a random read/write service on top of HDFS, solving a problem Hadoop itself cannot handle. HBase has focused on scalability from its underlying design: tables can be “tall”, with billions of rows, or “wide”, with millions of columns, and they are horizontally partitioned and automatically replicated across thousands of commodity nodes. The table schema directly reflects the physical storage, which lets the system serialize, store, and retrieve data structures efficiently.
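
To make the contrast concrete, here is a minimal sketch of random read/write through the classic HBase client API (roughly the HTable interface of the 0.9x era); the table and column names are made up, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "demo");
            // Random write: update a single cell addressed by row key.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v"));
            table.put(put);
            // Random read: fetch that cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
            table.close();
        }
    }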


8. The NameNode manages metadata: for each read/write request, it reads the metadata from disk and returns it to the client.


Error


Analysis of this problem:


The NameNode does not need to read metadata from disk for each request. All metadata is held in memory; the serialized image is read from disk only when the NameNode starts.


1) File writing


The Client initiates a file write request to the NameNode.


The NameNode returns information about datanodes managed by the Client based on the file size and file block configuration.


The Client divides the file into multiple blocks and writes them to each DataNode Block in sequence based on the DataNode address information.


2) File reading


The Client initiates a file read request to the NameNode. The NameNode returns the DataNode information for the file, and the Client then reads the file data from those DataNodes.
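
For symmetry with the write path, here is a minimal client-side read sketch using the standard org.apache.hadoop.fs.FileSystem API; the path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode for the Block locations; the stream
            // then pulls the bytes straight from the DataNodes.
            try (FSDataInputStream in = fs.open(new Path("/tmp/demo.txt"))) {
                System.out.println(in.readUTF());
            }
        }
    }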


9. NameNode Local disk saves Block location information.


I believe this statement is correct; other opinions are welcome.


DataNodes are the basic units of file storage. They store Blocks in the local file system, keep the metadata of those Blocks, and periodically report all existing Block information to the NameNode.


10. DataNode communicates with NameNode through long connection.


Opinions differ on this point; authoritative material on it is still being sought. The following information is provided for reference.


First, to clarify the concepts:


(1) Long connection

A communication connection is established between the Client and the Server and kept open; once established, packets are sent and received over it continuously. Because the connection always exists, this mode is often used for point-to-point communication.


(2) Short connection

The Client and Server establish a connection when there is data to exchange and disconnect as soon as the transaction is complete. This mode is commonly used for point-to-multipoint communication, for example, multiple Clients connecting to one Server.
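
As a toy illustration only (plain java.net sockets; the host, port, and messages are made up), a long connection keeps one socket open for many messages, the way a DataNode keeps sending heartbeats to the NameNode:

    import java.io.DataOutputStream;
    import java.net.Socket;

    public class LongConnection {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("namenode.example", 9000);
                 DataOutputStream out =
                         new DataOutputStream(socket.getOutputStream())) {
                // Many messages travel over the same socket.
                for (int i = 0; i < 3; i++) {
                    out.writeUTF("heartbeat " + i);
                    out.flush();
                    Thread.sleep(3000);
                }
            }
            // A short connection would instead open a new Socket per
            // message and close it immediately after the reply.
        }
    }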


11. Hadoop has strict permission management and security measures to ensure the normal operation of clusters.


Error


Hadoop prevents good people from doing bad things, but not bad people from doing bad things.


12. The Slave node stores data, so the larger the disk, the better.


Error


Once a Slave node goes down, recovering its data is a problem: the larger the disk, the more data must be re-replicated.


13. The hadoop dfsadmin -report command is used to detect damaged HDFS blocks.


Error


Analysis of this problem: hadoop dfsadmin -report shows the cluster's basic statistics and DataNode status; detecting missing or corrupt blocks is the job of hadoop fsck.


14. Hadoop's default scheduler policy is FIFO.


Correct
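
A minimal sketch, assuming the Hadoop 1.x scheduler property mapred.jobtracker.taskScheduler, showing that FIFO (JobQueueTaskScheduler) is simply the default value and can be swapped out, for example for the Fair Scheduler:

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The default is org.apache.hadoop.mapred.JobQueueTaskScheduler,
            // i.e. FIFO; here we select the Fair Scheduler instead.
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");
            System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
        }
    }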


15. RAID should be configured on each node in the cluster, so that a single damaged disk does not affect the operation of the entire node.


Error


To understand what RAID is, refer to disk arrays. What is wrong with this statement is that it is too absolute. The question itself is not important; the knowledge is. Because Hadoop provides redundancy natively, RAID is unnecessary unless the requirements are very strict. For details, refer to question 2.


16. The NameNode does not have a single-point problem because HDFS has multiple copies of the data.


Error


Analysis of this problem: Replication protects the data blocks, not the NameNode's metadata; in Hadoop 1.x the NameNode is still a single point of failure.


17. Each map slot is a thread.


Error


Analysis of this problem: First, understand what a map slot is. A map slot is a logical value (org.apache.hadoop.mapred.TaskTracker.TaskLauncher.numFreeSlots); it is not a thread or a process.
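
As a purely hypothetical illustration (this is not Hadoop's actual code), the idea of a slot reduces to a bookkeeping counter that gates how many tasks may launch:

    public class SlotCounter {
        private int numFreeSlots; // a logical value, not a thread

        public SlotCounter(int maxSlots) {
            this.numFreeSlots = maxSlots;
        }

        // A task may launch only while a free slot is available.
        public synchronized boolean tryAcquire() {
            if (numFreeSlots == 0) {
                return false;
            }
            numFreeSlots--;
            return true;
        }

        public synchronized void release() {
            numFreeSlots++;
        }
    }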


18. A MapReduce input split is a block.


Error


Analysis of this problem: An input split is a logical division of the input. By default its size often equals the block size, but it is configurable, and a split is not the same thing as a block.
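
A minimal sketch, assuming the Hadoop 1.x property mapred.min.split.size, showing that the split size is tunable independently of the block size:

    import org.apache.hadoop.conf.Configuration;

    public class SplitSize {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Force splits of at least 128 MB even if Blocks are 64 MB.
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
            System.out.println(conf.getLong("mapred.min.split.size", 0));
        }
    }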


19. The NameNode's Web UI port is 50030, and it launches the Web service through Jetty.


Error


Analysis of this problem: 50030 is the JobTracker's Web UI port; the NameNode's Web UI defaults to port 50070. Both are served through Jetty.


20. The Hadoop environment variable HADOOP_HEAPSIZE is used to set the memory of all Hadoop daemons. It defaults to 200 GB.


Error


Analysis of this problem: The memory Hadoop allocates to each daemon (NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker) is set in hadoop-env.sh with the parameter HADOOP_HEAPSIZE. The default value is 1000 MB, not 200 GB; for example, export HADOOP_HEAPSIZE=2000 in hadoop-env.sh raises it to 2000 MB.


21. When a DataNode joins the cluster for the first time, if an incompatible file version is reported in the log, the NameNode needs to run hadoop namenode -format to format the disk.


Error


Analysis of this problem:


First of all, what is a ClusterID?


ClusterID


A new identifier, ClusterID, was added to identify all the nodes in a cluster. When a NameNode is formatted, this identifier is either provided or generated automatically. The same ID can then be used to format other NameNodes joining the cluster.


Further notes:


Some readers felt the analysis above did not address the point of the question; a more direct answer follows:


This error indicates that the Hadoop version installed on the DataNode is inconsistent with that of the other nodes. What should be checked is the Hadoop version on the DataNode.
