🚀 Author: “Big Data Zen”

🚀 Column introduction: This column shares big data interview questions covering Hadoop, Spark, Flink, ZooKeeper, Flume, Kafka, Hive, HBase, and other big data technologies.

🚀 Personal homepage: Big data little Zen

🚀 Welcome to like 👍, bookmark ⭐, and leave a comment 💬

1. The most important bottleneck of a cluster

Disk I/O

2. Hadoop operating modes

Standalone (local) mode, pseudo-distributed mode, and fully distributed mode

3. HDFS write process

1) The client sends an upload request and establishes communication with the NameNode through RPC. The NameNode checks whether the user has upload permission and whether a file with the same name already exists in the target HDFS directory. If either check fails, an error is returned directly; otherwise the NameNode tells the client that it may upload

2) The client splits the file into blocks according to its size (128 MB per block by default). After splitting, the client asks the NameNode which servers the first block should be uploaded to

3) After receiving the request, the NameNode allocates storage according to the network topology, rack awareness, and the replica policy, and returns the addresses of available DataNodes

4) After receiving the addresses, the client communicates with one node in the server address list, say A; this is essentially an RPC call to establish a pipeline. After receiving the request, A calls B, and B calls C, completing the whole pipeline, and an acknowledgment is returned to the client step by step

5) The client sends the first block to A in packets of 64 KB (data is read from disk into a local memory cache first). When A receives a packet it forwards it to B, and B forwards it to C. After sending a packet, A places it in a reply queue to wait for the acknowledgment

6) The data is divided into packets and transmitted along the pipeline one after another, while acknowledgments (ACKs) travel back in the reverse direction. Finally, the first DataNode in the pipeline, A, sends the pipeline ACK to the client

7) When a block transfer is complete, the client asks the NameNode to upload the next block, and the NameNode again selects three DataNodes for the client
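To make the write path concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API (the NameNode URI and file path are placeholders, not from the original text); the block splitting, pipeline setup, and packet/ACK handling described above all happen inside the client library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            // The client library splits the stream into 128 MB blocks and 64 KB
            // packets internally; application code only sees an output stream.
            out.writeUTF("hello hdfs");
        }
    }
}
```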

4. Explain the concepts of “Hadoop” and the “Hadoop ecosystem”

Hadoop refers to the Hadoop framework itself. The Hadoop ecosystem includes not only Hadoop but also the other frameworks that keep the Hadoop framework running normally and efficiently, such as ZooKeeper, Flume, HBase, Hive, Sqoop, and other auxiliary frameworks.

5. Which processes need to be started in a normal Hadoop cluster, and what are their functions?

1) NameNode: the master server in Hadoop. It manages the file system namespace and access to the files stored in the cluster, and holds the metadata.

2) SecondaryNameNode: not a redundant standby of the NameNode, but a daemon that provides periodic checkpoints and cleanup tasks. It helps the NameNode merge edit logs, reducing NameNode startup time.

3) DataNode: responsible for managing the storage attached to its node (a cluster can have many such nodes). Each node that stores data runs a DataNode daemon.

4) ResourceManager (JobTracker in Hadoop 1.x): schedules jobs across the cluster; in Hadoop 1.x, each DataNode ran a TaskTracker that performed the actual work.

5) NodeManager (TaskTracker in Hadoop 1.x): executes tasks.

6) DFSZKFailoverController: in a high-availability setup, it monitors the state of the NameNode and writes the state information to ZooKeeper in time. It obtains the NameNode's health by periodically calling a specific interface on the NameNode from an independent thread. The failover controller also has the right to choose which node becomes the Active NameNode; since there are at most two nodes, the current selection strategy is relatively simple (first come, first served, with rotation).

7) JournalNode: stores the NameNode's edit log files in a high-availability setup.
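On a running cluster you can check which of these daemons are up with jps. A listing on an HA node might look like the following (the process IDs are illustrative, and the exact set of daemons depends on the node's role and configuration):

```
$ jps
2130 NameNode
2275 DataNode
2418 ResourceManager
2562 NodeManager
2701 DFSZKFailoverController
2843 JournalNode
```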

6. How many copies of each block does HDFS save by default?

Three replicas are saved by default
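The default comes from the dfs.replication property in hdfs-site.xml. As a small sketch (the file path is a placeholder), the replication factor can also be read from the configuration and adjusted per file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set via dfs.replication in hdfs-site.xml.
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

        try (FileSystem fs = FileSystem.get(conf)) {
            // Replication can also be changed per file after it is written.
            fs.setReplication(new Path("/demo/hello.txt"), (short) 2);
        }
    }
}
```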

7. HDFS read process

1) The client sends an RPC request to the NameNode to obtain the locations of the file's blocks

2) After receiving the request, the NameNode checks the user's permissions and whether the file exists. If both checks pass, it returns a partial or complete block list as appropriate. For each block, the NameNode returns the addresses of the DataNodes holding a copy of that block. The returned DataNode addresses are sorted by their distance from the client according to the cluster topology, using two rules: in the network topology, the nodes closest to the client are ranked first; under the heartbeat mechanism, DataNodes whose reports have timed out are marked STALE and ranked last

3) The client selects the top-ranked DataNode to read the block. If the client is itself a DataNode that holds the data, it reads directly from local storage (the short-circuit read feature).

4) At the bottom layer this essentially creates a socket stream (FSDataInputStream) and repeatedly calls the read method of the underlying DataInputStream until the block has been read completely

5) After finishing the current list, if the file has not been read completely, the client continues to request the next batch of block locations from the NameNode

6) Every block that is read is verified with a checksum; if an error occurs while reading from a DataNode, the client notifies the NameNode and continues reading from the next DataNode that holds a copy of the block

7) The read method fetches block data in parallel rather than block by block; the NameNode returns only the addresses of the DataNodes holding the requested blocks, not the block data itself

8) When reading finishes, all the blocks are merged into one complete file
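Mirroring the write sketch above, here is a minimal read example with the same placeholder URI and path; the RPC of step 1 happens inside fs.open(), and checksum verification (see question 11 below) is performed transparently as bytes are consumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
            // The stream reads from the nearest DataNode holding each block,
            // moving to the next block (and batch of locations) as needed.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```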

8. Which component of HDFS is responsible for data storage?

The DataNode is responsible for data storage

9. What is the purpose of SecondaryNameNode?

Its purpose is to help the NameNode merge edit logs and reduce NameNode startup time

10. Hadoop block size: since which version is it 128 MB?

Hadoop 1.x uses 64 MB; starting from Hadoop 2.x the default is 128 MB.
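A small sketch for checking the effective block size on a cluster; in Hadoop 2.x it is controlled by the dfs.blocksize property (134217728 bytes = 128 MB by default):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Governed by dfs.blocksize in hdfs-site.xml (default 128 MB in 2.x).
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + blockSize + " bytes");
        }
    }
}
```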

11. What happens if an HDFS block is corrupted while reading a file?

After reading a block from a DataNode, the client performs checksum verification: it compares the checksum of the block it just read with the checksum stored in HDFS. If they do not match, the client notifies the NameNode and then continues reading from the next DataNode that holds a copy of the block
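Client-side checksum verification is on by default and surfaces as a ChecksumException when a corrupt replica is hit. A minimal sketch (the path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.setVerifyChecksum(true); // the default; shown here for clarity
            try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
                // A corrupt block would surface as a ChecksumException here;
                // the client then retries another replica as described above.
                in.read();
            }
        }
    }
}
```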

12. How the Secondary NameNode works

Phase 1: NameNode operation

(1) When the NameNode is started for the first time after formatting, it creates the fsimage and edits files; if it is not the first startup, it loads the edit log and image file directly into memory.
(2) The client issues requests to add, delete, or modify metadata.
(3) The NameNode records the operations in the edit log and rolls the log.
(4) The NameNode applies the add, delete, modify, and query operations to the metadata in memory.

Phase 2: Secondary NameNode checkpoint

(1) The Secondary NameNode asks the NameNode whether a checkpoint is needed, and the NameNode returns the result directly.
(2) The Secondary NameNode requests that a checkpoint be executed.
(3) The NameNode rolls the edits log it is currently writing.
(4) The pre-roll edit log and the image file are copied to the Secondary NameNode.
(5) The Secondary NameNode loads the edit log and the image file into memory and merges them.
(6) It generates a new image file, fsimage.chkpoint.
(7) fsimage.chkpoint is copied back to the NameNode.
(8) The NameNode renames fsimage.chkpoint to fsimage.

The difference and connection between the NameNode and the SecondaryNameNode: 1) Difference: the NameNode manages the metadata of the entire file system and the block information for each path (file), while the SecondaryNameNode periodically merges the namespace image with the edit log. 2) Connection: (1) the SecondaryNameNode keeps an image file (fsimage) and edit log (edits) consistent with the NameNode's; (2) when the primary NameNode fails (assuming data was not backed up in time), data can be recovered from the SecondaryNameNode.
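The checkpoint trigger in phase 2 is configurable. Assuming Hadoop 2.x property names, a sketch of the relevant hdfs-site.xml settings (the values shown are the usual defaults):

```xml
<!-- hdfs-site.xml: checkpoint tuning -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at most every hour (seconds) -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after this many edit-log transactions -->
</property>
```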

13. HDFS architecture

The architecture consists mainly of four parts: the HDFS Client, the NameNode, the DataNode, and the Secondary NameNode. Let's look at each of these four components.

1) Client:
(1) File splitting: when uploading a file to HDFS, the Client splits the file into blocks and then stores them;
(2) interacts with the NameNode to obtain file location information;
(3) interacts with DataNodes to read or write data;
(4) provides commands to manage HDFS, such as starting or stopping it;
(5) provides commands to access HDFS.

2) NameNode: the Master.
(1) Manages the HDFS namespace;
(2) manages the block mapping information;
(3) configures the replica policy;
(4) handles client read and write requests.

3) DataNode: the Slave. The NameNode issues commands, and the DataNode performs the actual operations.
(1) Stores the actual data blocks;
(2) performs read/write operations on the blocks.

4) Secondary NameNode: not a hot standby for the NameNode. When the NameNode dies, it cannot immediately replace it and serve requests.
(1) Assists the NameNode and shares part of its workload;
(2) periodically merges the fsimage and edits files and pushes the result to the NameNode;
(3) in an emergency, can assist in recovering the NameNode.
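The Client/NameNode interaction in point (2) above can be observed directly: asking for a file's block locations is a pure metadata query answered by the NameNode. A minimal sketch (the path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));
            // Answered by the NameNode from its block map; no block data moves.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println(b); // offset, length, and DataNode hosts
            }
        }
    }
}
```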

Conclusion


The Hadoop interview questions are split into two parts, so there is quite a lot of content; feel free to jump to the part you need. More big data material and installation packages can be obtained by following the author.