1. Foreword

In the last article, we talked about the edits log write mechanism of the NameNode in Hadoop.

When the edits log is written to disk and over the network, segmented locking and double buffering greatly improve its throughput, which is what allows the NameNode to support highly concurrent access.

If you missed that article, take a look back at the earlier piece on how the Hadoop NameNode handles high concurrency in large clusters.

In this article, we will look at how file uploads are optimized for performance in HDFS, Hadoop's distributed file system.

First, let’s review the general principle of file uploading with a diagram.

As shown above, the principle of file uploading is simple to state.

For example, take a file that is very large, say 1 TB. The HDFS client will split it into many blocks, each 128 MB by default.

The HDFS client here could be any system that writes to HDFS: a cloud drive service, a log collection system, and so on.

For example, someone uploads a large 1 TB file to a cloud drive, or a 1 TB log file is collected and uploaded.

Then, the HDFS client uploads the blocks one by one to the first DataNode.

The first DataNode makes a copy of the block and sends a copy to the second DataNode.

The second DataNode sends a copy of the block to the third DataNode.

So each block ends up with three replicas spread across three machines, and no data is lost if any single machine goes down.

In the end, the terabyte file is split into N blocks of 128 MB each, spread across many machines.
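As a rough worked example: 1 TB is 1,048,576 MB, so splitting it into 128 MB blocks yields 1,048,576 / 128 = 8,192 blocks, and with three-way replication about 24,576 block replicas end up spread across the cluster.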

2. The primitive file upload scheme

The question we are going to discuss today is: how does the HDFS client upload huge, terabyte-scale files?

So let's think about it. How would you implement the upload in the most primitive way?

You can probably imagine what this picture looks like.

Many Java beginners probably know this approach to uploading files: constantly read data from the local disk file through an input stream, then immediately write that data to the DataNode through a network output stream.

Code following the flow chart above is something a fresh graduate could write right away, because the input stream for the file is nothing more than a FileInputStream.

The output stream to the DataNode is nothing more than the OutputStream returned by a Socket.

Then you put a small byte[] array in the middle and copy the stream across: read some data from the local file, send that bit of data to the DataNode.
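To make this concrete, here is a minimal sketch of that primitive upload loop. The file path, DataNode host, and port are made up for illustration, and a real DataNode speaks its own transfer protocol rather than accepting a raw stream copy like this.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class NaiveUpload {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("/data/huge-file.log");      // local file
             Socket socket = new Socket("datanode-1.example.com", 50010);      // hypothetical DataNode address
             OutputStream out = socket.getOutputStream()) {
            byte[] buf = new byte[512];   // tiny buffer: every few hundred bytes triggers a network write
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);     // one small network write per small disk read
            }
            out.flush();
        }
    }
}
```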

But if you do it that way, performance is extremely poor. Network communication has to happen at the right granularity: in each batch you should read a large amount of data, then send that whole batch in one network transfer.

You cannot read a tiny bit of data and immediately make a network round trip just to send that tiny bit.

Following the primitive approach above inevitably makes network communication extremely inefficient, and large file upload performance becomes very poor.

Why do I say that?

You read a few hundred bytes of data, then spend a few hundred milliseconds writing them over the network.

Then you read the next few hundred bytes and spend another few hundred milliseconds on the network. That level of performance would never be tolerated in an industrial-grade, large-scale distributed system.

3. HDFS performance optimization for large file uploads

OK, having looked at the primitive upload approach, let's see how Hadoop optimizes large file uploads for performance. Take a look at the picture below.

First, you create an input stream for the huge local file on disk.

As you read data from it, you write that data to the FSDataOutputStream output stream provided by HDFS.
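In client code this typically looks something like the sketch below, using the standard Hadoop FileSystem API. The NameNode URI and file paths are placeholders, and IOUtils.copyBytes is just one convenient way to drive the copy loop.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS in practice.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        try (InputStream in = new BufferedInputStream(new FileInputStream("/data/huge-file.log"));
             FSDataOutputStream out = fs.create(new Path("/user/demo/huge-file.log"))) {
            // The writes land in the client's in-memory buffers first,
            // not directly on the network (as explained below).
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```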

What is this FSDataOutputStream doing?

Do you think it naively transfers the data over the network to the DataNode right away?

The answer, of course, is no! If it did, this would be no better than the primitive approach.

1. Chunk buffering

First, the data is written into a chunk buffer array. You can think of a chunk as a 512-byte piece of data.

The buffer array can hold multiple chunks of data at once.

This buffer alone lets the client write data quickly into memory first, instead of making a network transfer for every few hundred bytes. Think about it, doesn't that make sense?
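Here is a highly simplified sketch of the chunk-buffering idea. It is not the real DFSOutputStream internals; the buffer capacity and method names are invented purely for illustration.

```java
// Simplified illustration of chunk buffering: writes go into memory first,
// and only a full buffer gets cut into 512-byte chunks later.
public class ChunkBuffer {
    private static final int CHUNK_SIZE = 512;            // one chunk = 512 bytes of data
    private final byte[] buf = new byte[CHUNK_SIZE * 9];  // room for several chunks (illustrative size)
    private int pos = 0;

    /** Copies as much as fits into the buffer; returns the number of bytes accepted. */
    public int write(byte[] data, int off, int len) {
        int n = Math.min(len, buf.length - pos);
        System.arraycopy(data, off, buf, pos, n);
        pos += n;
        return n;
    }

    /** When this returns true, the buffer is cut into chunks and flushed into a Packet. */
    public boolean isFull() {
        return pos == buf.length;
    }
}
```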

2. Packet mechanism

Then, when the chunk buffer array is full, it is cut into individual chunks, each chunk being one small piece of data.

Next, multiple chunks are written at once into another in-memory buffer data structure: the Packet.

A Packet is designed to hold 127 chunks, which works out to roughly 64 KB of data. So a large number of chunks are continuously written into the in-memory buffer of the current Packet.

Through the design of the Packet mechanism, a larger amount of data can be accumulated in memory, further reducing the performance impact of frequent network transfers.
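The Packet idea can be pictured roughly like the sketch below. Again, this is a simplified illustration rather than the real client class; the constants simply mirror the figures above (127 chunks of 512 bytes, roughly 64 KB per packet).

```java
import java.nio.ByteBuffer;

// Simplified Packet: accumulates many 512-byte chunks in memory so that each
// eventual network write carries ~64 KB instead of a few hundred bytes.
public class Packet {
    static final int CHUNK_SIZE = 512;
    static final int CHUNKS_PER_PACKET = 127;   // 127 * 512 B ≈ 64 KB of data

    private final ByteBuffer data = ByteBuffer.allocate(CHUNK_SIZE * CHUNKS_PER_PACKET);

    /** Appends one chunk; returns true once the packet is full and should be queued for sending. */
    public boolean addChunk(byte[] chunk) {
        data.put(chunk, 0, Math.min(chunk.length, data.remaining()));
        return !data.hasRemaining();
    }

    /** A view of the accumulated bytes, flipped and ready to hand to the sender thread. */
    public ByteBuffer payload() {
        ByteBuffer view = data.duplicate();
        view.flip();
        return view;
    }
}
```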

3. Memory queue asynchronous sending mechanism

When a Packet is filled with chunks, the Packet is queued in a memory queue.

Then a DataStreamer thread continuously takes Packets off the queue and writes each Packet to the DataNode over the network.

A block is 128 MB by default and each Packet carries roughly 64 KB of data, so a block corresponds to roughly 2,048 Packets.

Once all the Packets of a block have been sent to the DataNode, a final notification is sent indicating that all of the block's data has been transferred.

In this way, the DataNode knows it has received the complete block sent by the client.
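The asynchronous hand-off can be pictured roughly as in the sketch below: a BlockingQueue decouples the writer thread from the network. The sendToDataNode method is a placeholder for the real pipeline write, not an actual HDFS API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the asynchronous send path: the writer queues full
// packets, and a dedicated streamer thread drains the queue and pushes them out.
public class DataStreamerSketch {
    private final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();

    /** Called by the writer thread whenever a packet is full; never blocks on the network. */
    public void enqueuePacket(byte[] packet) throws InterruptedException {
        dataQueue.put(packet);
    }

    /** Starts the streamer thread, the only place that does network I/O. */
    public void startStreamer() {
        Thread streamer = new Thread(() -> {
            try {
                while (true) {
                    byte[] packet = dataQueue.take();  // blocks until a packet is available
                    sendToDataNode(packet);            // placeholder for the real pipeline write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "DataStreamer");
        streamer.setDaemon(true);
        streamer.start();
    }

    private void sendToDataNode(byte[] packet) {
        // In the real client this writes the packet to the first DataNode in the pipeline.
    }
}
```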

4. Summary

OK, now that you have seen the diagram above and the large file upload mechanism Hadoop adopts, doesn't it feel clever?

To put it bluntly, industrial-grade distributed systems cannot get away with overly simple code and patterns, which would suffer from poor performance.

They contain a great deal of architectural design and production-grade solutions for concurrency optimization, network I/O optimization, memory optimization, and disk read/write optimization.

The HDFS client can quickly read terabytes of data from large local files, and just as quickly write that data into the in-memory buffers behind the HDFS output stream.

Thanks to the in-memory chunk buffer mechanism, the Packet mechanism, and the asynchronous memory-queue sending mechanism, network transfer delays never become the bottleneck that slows down large file uploads.

On the contrary, with the mechanisms above, the upload performance of a terabyte-scale file can be improved a hundredfold.


I hope this article helps you, and I would love to hear your point of view. Feel free to comment, follow, and share your thoughts. I will keep the articles coming!
