Welcome to follow our WeChat official account: Shishan100

My new course, **"C2C E-commerce System Microservice Architecture 120-Day Practical Training Camp"**, is now online in the public account **Ruxihu Technology Nest**. Interested readers can click the link below for details:

120-Day Training Camp of C2C E-commerce System Microservice Architecture

Contents

1. Foreword

2. The Origin of the Problem

3. HDFS's Elegant Solution

(1) Segmented locking + in-memory double buffering

(2) Hundredfold throughput improvement through multithreaded concurrency

(3) Batched flushing of buffered data to disk + network optimization

1. Foreword

In the previous article, we gave a preliminary explanation of the overall architecture of Hadoop HDFS, and I believe you now have a basic understanding of it.

If you haven't read that article yet, take a look at it first: Hadoop Architecture Principles in Plain English.

In this article, let's look at how the NameNode withstands a large number of clients concurrently (say, thousands of times per second) asking it to modify metadata.

2. The Origin of the Problem

Let's first examine the problem the NameNode runs into under highly concurrent requests.

Every request that asks the NameNode to modify a piece of metadata produces an edits log entry, and writing that entry involves two steps:

  • Write data to the local disk.

  • Transmit it over the network to the JournalNodes cluster.

But if you know a little bit about Java, you should know about the problem of multi-threaded concurrency, right?

NameNode’s first rule for writing edits logs:

Each edits log entry must have a globally ordered, monotonically increasing transactionId (txid for short), so that the ordering of edits logs can be determined.

To guarantee that the txid of each edits log entry keeps increasing, a lock must be used.

After modifying the metadata, each thread that wants to write an edits log must queue up to acquire the lock; only then can it generate the next increasing txid, which becomes the sequence number of the edits log entry it is about to write.
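To make this ordering requirement concrete, here is a minimal sketch (illustrative code, not the HDFS source; the class and method names are invented) of handing out txids under a lock:

```java
// Minimal sketch: txids are handed out under a lock so that they come out
// strictly increasing and match the order in which edits logs are written.
public class TxIdAllocator {
    private long lastTxId = 0;

    // Every thread that wants to log an edit passes through this synchronized
    // method, so no two threads can ever get the same id or ids out of order.
    public synchronized long nextTxId() {
        return ++lastTxId;
    }
}
```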

OK, so here comes the problem. Take a look at the picture below.

What if, inside that locked block of code, each thread not only generated its txid but also wrote the edits log to the local disk file and sent the edits log over the network to the JournalNodes?

Needless to say, this design would be hopeless!

The NameNode itself uses multiple threads to receive concurrent requests from multiple clients. As a result, multiple threads queue up to write edits logs after modifying the metadata in memory.

And you should know that writing to a local disk plus a network transfer to JournalNodes is time-consuming. Two performance killers: disk write + network write!

If HDFS were designed this way, the NameNode could handle only a tiny number of requests per second, perhaps a few dozen.
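For contrast, here is a sketch of that naive design (this is not how HDFS works; the helper methods are hypothetical placeholders). If each edit costs, say, 10-20 ms of disk plus network latency while the lock is held, total throughput is capped at well under a hundred edits per second, no matter how many threads are calling in:

```java
import java.io.IOException;

// Naive design, sketched only for contrast (NOT the HDFS implementation):
// both slow operations happen while the lock is held, so every other thread
// queues behind the full disk + network latency of every single edit.
public class NaiveEditLog {
    private long lastTxId = 0;

    public synchronized void logEdit(byte[] edit) throws IOException {
        long txid = ++lastTxId;          // step 1: assign the increasing txid
        writeToLocalDisk(txid, edit);    // step 2: slow local disk I/O, lock still held
        sendToJournalNodes(txid, edit);  // step 3: slow network call, lock still held
    }

    // Hypothetical placeholders for the two expensive operations.
    private void writeToLocalDisk(long txid, byte[] edit) throws IOException { }
    private void sendToJournalNodes(long txid, byte[] edit) throws IOException { }
}
```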

3. HDFS's Elegant Solution

To address this problem, HDFS applies a series of clever optimizations!

The key idea: when each thread writes an edits log, we do not want the whole sequence of "generate txid + write disk + write JournalNodes" to be serialized inside one big lock.

Instead, each thread should quickly acquire the lock, generate its txid, and quickly write its edits log into an in-memory buffer.

Then it quickly releases the lock, and the next thread acquires the lock, generates its txid, and writes its edits log into the memory buffer.

Meanwhile, one thread can flush the edits logs from memory to disk, while other threads keep writing new edits logs into the memory buffer.

But if some threads are writing into a buffer at the same moment another thread is reading it and flushing it to disk, that is also a problem: you cannot safely read and write the same shared memory region simultaneously!

So HDFS uses a double-buffering mechanism here! The memory buffer is divided into two regions:

  • One region receives the incoming writes.

  • The other region holds data that is being read and flushed to disk and to the JournalNodes.

If this prose description feels abstract, let's use a picture, as usual, to explain it.
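Alongside the picture, here is a minimal sketch of the double-buffer idea under simplifying assumptions (it is not HDFS's real buffer class, and the method names are illustrative): one region accepts writes while the other is sealed and flushed, and a cheap swap exchanges their roles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified double buffer: writers append to bufCurrent; a flusher swaps the
// two regions and then drains bufReady without blocking new writers.
public class DoubleBuffer {
    private List<String> bufCurrent = new ArrayList<>(); // region being written
    private List<String> bufReady   = new ArrayList<>(); // region being flushed

    // Called while holding the edit-log lock: cheap in-memory append only.
    public void append(long txid, String edit) {
        bufCurrent.add(txid + ":" + edit);
    }

    // Called while holding the lock: swap the roles of the two regions.
    public void setReadyToFlush() {
        List<String> tmp = bufCurrent;
        bufCurrent = bufReady;
        bufReady = tmp;
    }

    // Called WITHOUT the lock by the single flushing thread; this is safe only
    // because an "is someone flushing" flag (see section (3) below) guarantees
    // that no other thread swaps the regions while the flush is in progress.
    public void flushReady(Consumer<String> sink) {
        for (String entry : bufReady) {
            sink.accept(entry);
        }
        bufReady.clear();
    }
}
```

The crucial property is that the swap is just an exchange of two references, so the lock that protects it only needs to be held for microseconds.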

(1) Segmented locking + in-memory double buffering

First, each thread acquires the lock for the first time, generates its increasing txid, writes its edits log into region 1 of the in-memory double buffer, and then immediately releases the lock.

In that gap, the following threads can immediately make their own first acquisition of the lock and write their own edits logs into the memory buffer.

Writing to memory is extremely fast, perhaps a few dozen microseconds, and then the lock is immediately released again. So this concurrency optimization is definitely effective. Can you feel it?

Next, the threads compete to acquire the lock a second time. When a thread gets it, it checks: is anyone currently writing to disk or to the network?

If not, then this thread becomes the lucky one! It directly swaps regions 1 and 2 of the double buffer and then releases the lock for the second time. This is very fast: it is just an in-memory exchange of two references, taking no more than a few microseconds.

OK, at this point the buffers have been swapped: subsequent threads acquire the lock in turn and write their edits logs into region 2 of the buffer, while region 1 is sealed and no longer accepts writes.

How about that? Do you feel a bit more of the concurrency optimization?

(2) Hundredfold throughput improvement through multithreaded concurrency

Next, the lucky thread reads the data in region 1 of the memory buffer (nobody is writing to region 1 any more; everyone is now writing to region 2), writes those edits logs to the disk file, and sends them over the network to the JournalNodes cluster.

This step is time-consuming! But that is fine, because the optimization is precisely here: while writing to disk and network, the lucky thread does not hold the lock!

So subsequent threads can still quickly grab the lock for their first acquisition, write their edits logs into buffer region 2, and release the lock.

At this point a large number of threads can write to memory quickly, without blocking or stalling!

How about that? Can you feel the concurrency optimization?

(3) Batched flushing of buffered data to disk + network optimization

So, while the lucky thread is huffing and puffing away at the disk and network writes, a large number of subsequent threads quickly acquire the lock for the first time, write into buffer region 2, and release the lock. Then what do they do when they acquire the lock for the second time?

They will find that someone is already writing to disk and network! So each of them immediately goes to sleep for up to 1 second, releasing the lock while it waits.

One after another, these concurrent threads quickly acquire the lock for the second time, discover that someone is writing to disk and network, and quickly release the lock and go to sleep.

Notice that this process never blocks anyone for long! Because the lock is released quickly, later threads can still quickly make their first lock acquisition and write into the memory buffer!

Once again!!! Can you feel the concurrency optimization?

Moreover, at this point many threads, on their second acquisition of the lock, will discover that their own txid is already covered by what the lucky thread has flushed to disk and network, that is, their edits logs have already been persisted for them.

Those threads do not even need to sleep and wait; they simply return and carry on, without getting stuck here at all. Can you feel the concurrency optimization here as well?

After the lucky thread finishes writing to disk and network, it wakes up the threads that were sleeping earlier.

Those threads queue up, acquire the lock for the second time again, and check: phew! Nobody is writing to disk and network any more!

Each of them then checks whether its own edits log has already been written to disk and network by the previous lucky thread.

  • If it has, the thread simply returns.

  • If not, it becomes the second lucky thread and swaps the two buffer regions, region 1 and region 2, again.

  • Then it releases the lock and starts huffing and puffing away, flushing the data in region 2 to disk and network.

But again, that does not matter: any later thread that wants to write an edits log can still acquire the lock for the first time, write into the buffer, and release the lock right away. And so the cycle continues.
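Putting sections (1) through (3) together, here is a simplified, self-contained sketch of the whole pattern: segmented locking, an in-memory double buffer, batched flushing performed outside the lock, and waking up the waiters afterwards. It is only an illustrative model under simplifying assumptions; the class, field, and helper names are hypothetical, and the real edit-log code in HDFS is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the batched edit-log pattern described above.
public class BatchedEditLog {
    private final Object lock = new Object();
    private List<String> bufCurrent = new ArrayList<>(); // region receiving writes
    private List<String> bufReady   = new ArrayList<>(); // region being flushed
    private long lastTxId = 0;        // last txid handed out
    private long syncedTxId = 0;      // highest txid already flushed to disk/network
    private boolean isSyncRunning = false;

    // Lock acquisition #1: assign a txid and buffer the edit in memory. Fast.
    public long logEdit(String edit) {
        synchronized (lock) {
            long txid = ++lastTxId;
            bufCurrent.add(txid + ":" + edit);
            return txid;
        }
    }

    // Lock acquisition #2: make sure our edit reaches disk and the JournalNodes.
    public void logSync(long myTxId) throws InterruptedException {
        synchronized (lock) {
            // Someone else is flushing and our edit is not yet durable: release
            // the lock and sleep for up to 1 second, then re-check when woken.
            while (myTxId > syncedTxId && isSyncRunning) {
                lock.wait(1000);
            }
            // A previous lucky thread already flushed a batch containing our
            // txid, so there is nothing left for us to do.
            if (myTxId <= syncedTxId) {
                return;
            }
            // We are the lucky thread: seal the current region by swapping.
            isSyncRunning = true;
            List<String> tmp = bufCurrent;
            bufCurrent = bufReady;
            bufReady = tmp;
        }

        // Outside the lock: flush the sealed batch to disk and the JournalNodes.
        // Other threads keep appending to bufCurrent via logEdit in the meantime.
        long flushedUpTo = flushToDiskAndJournalNodes(bufReady);

        synchronized (lock) {
            bufReady.clear();
            syncedTxId = Math.max(syncedTxId, flushedUpTo); // batch is now durable
            isSyncRunning = false;
            lock.notifyAll();            // wake the threads waiting in logSync
        }
    }

    // Hypothetical placeholder for the slow disk write + network replication.
    private long flushToDiskAndJournalNodes(List<String> batch) {
        long maxTxId = 0;
        for (String entry : batch) {
            maxTxId = Math.max(maxTxId, Long.parseLong(entry.split(":", 2)[0]));
        }
        return maxTxId;
    }
}
```

A thread returns from logSync only once its own txid is covered by a completed flush, either because it performed the flush itself as the lucky thread, or because some lucky thread flushed its edit as part of a batch.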

4. Summary

As you can see, this mechanism is actually quite intricate, combining two techniques: segmented locking and an in-memory double buffer.

When a large number of threads concurrently modify metadata on the NameNode, they are not each forced to serially write their own edits log to disk and network while holding a lock.

Instead, through this elaborate mechanism, HDFS does its best to let a single thread batch many edits logs from the buffer to disk and network in one go.

Meanwhile, other threads can quickly write their edits logs into the memory buffer without blocking one another.

It is precisely this mechanism that maximizes the NameNode's ability to handle highly concurrent metadata modifications!
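To see the batching effect in action, a hypothetical driver could hammer the BatchedEditLog sketch above from many threads: each call is durable by the time logSync returns, yet far fewer disk/network flushes occur than edits, because every lucky thread flushes a whole batch on behalf of the others.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical driver for the BatchedEditLog sketch above: 32 worker threads
// log 10,000 edits concurrently; the writes get grouped into far fewer flushes.
public class BatchedEditLogDemo {
    public static void main(String[] args) throws Exception {
        BatchedEditLog log = new BatchedEditLog();
        ExecutorService pool = Executors.newFixedThreadPool(32);
        for (int i = 0; i < 10_000; i++) {
            final int n = i;
            pool.submit(() -> {
                try {
                    long txid = log.logEdit("mkdir /dir" + n); // fast in-memory append
                    log.logSync(txid);                         // durable once this returns
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```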

                                        END

Next up: [Secrets of Performance Optimization] How does Hadoop improve the upload performance of terabyte-scale large files a hundredfold? Stay tuned.

If you found this useful, please help share it; your encouragement is the author's greatest motivation. Thank you!

**A large wave of original articles on microservices, distributed systems, high concurrency, and high availability is on the way.** Please scan the QR code below and stay tuned:

Shishan's Architecture Notes (ID: Shishan100)

More than ten years of architecture experience at BAT