Preface:

I’m Han Shu. It has been 25 days since my beginner Nginx tutorial series for backend developers, with two blockchain-related articles in between. In fact, during those twenty-odd days I’d been holding back something big: the newest beginner series for backend developers, an introductory Hadoop tutorial. Hadoop itself involves a great many technical details: setting up the basic environment, pseudo-distributed and fully distributed deployment, preparing the cluster for startup, the HDFS file system, the MR programming model, final tuning, and so on, so writing the whole set is quite a workload. Fortunately, winter vacation is almost here (hooray), so I’ll have enough time and energy to write it. My goals are twofold: first, to deepen my own understanding of this material by writing it down; second, to help those just preparing to get into big data understand and use Hadoop. After all, we all know the quality of technical tutorials found online is so uneven that one careless step lands you in a pit:

A computer, a cigarette, following a tutorial step by step, debugging along the way and it’s still wrong, until you want to send the author straight to heaven.

What is big data, and how does it affect our lives? What is Hadoop, and how does it compare with other big data technologies? Once I know the answers to these questions, then if I set out to learn big data again, I can’t exactly claim a buff bonus, but at least I’ll know what to learn next.

Cut the crap and let’s get right to it.

What is Big Data?

Big data: it mainly refers to data sets that cannot be captured, managed, and processed with conventional software tools within a reasonable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to yield stronger decision-making power, insight and discovery, and process optimization.

In a word: big data means a lot of data, more than traditional solutions can handle.

Of course, the volume itself is not what matters most; what matters is the information hidden inside the data, which has great value in both business and research. E-commerce platforms, for example, mine that data to profile each user and recommend the right products to drive purchases, and of course they can also adjust prices along the way, maybe squeezing their most loyal customers a little while they’re at it.

Units of big data:

But we are serious science and engineering students, after all. Everyone keeps saying “big data, big data”, but exactly how big is big? To settle that question and cut down on arguments, a series of data units was defined, from smallest to largest:

Bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB

Of course, listing the units by themselves means little; how are you supposed to know how much data each one holds? To give you a more direct feel for the power of these units, here are a few examples:

  • All the printed material in the world amounts to roughly 200 PB.
  • Everything humans have ever said totals roughly 5 EB.
  • In 2017, the well-known foreign website “P site” generated a total of 3732 PB of data.
  • One million Chinese characters take about 2 MB of storage (at 2 bytes per character).

I feel like something strange got mixed into that list.
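
Since we are serious science students, here is a quick back-of-the-envelope sketch in plain Java (my own throwaway code, nothing to do with Hadoop) that climbs the unit ladder and double-checks that 2 MB claim:

```java
// A quick sanity check on data units: each rung is 1024x the previous one.
public class UnitLadder {
    public static void main(String[] args) {
        String[] units = {"B", "KB", "MB", "GB", "TB", "PB", "EB"};
        long bytes = 1;
        for (String unit : units) {
            System.out.printf("1 %s = %,d bytes%n", unit, bytes);
            bytes *= 1024; // a plain long overflows past EB, so we stop here
        }
        // One million Chinese characters at 2 bytes each (GBK encoding):
        long chars = 1_000_000L * 2;
        System.out.printf("1M Chinese characters = %.2f MB%n",
                chars / 1024.0 / 1024.0);
    }
}
```

Run it and a million characters come out to about 1.91 MB, which is where the “about 2 MB” figure above comes from.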

Characteristics of big data:

  • Large volume: lots of data, obviously; most datasets would be too embarrassed to even call themselves big data.
  • High speed: this much data has to be digested quickly; nobody can wait decades for processing. This year’s Double 11 sales total can’t wait until next year’s Double 11 to be announced.
  • Variety: different scenarios produce different kinds of data; for Youku it’s user browsing data and video data, for QQ Music it’s music data.
  • Low value density: even when the volume of data is huge, we usually care only about specific parts, never the whole. Think of police surveillance: footage from a year or a month ago is usually of no use; what the officer needs is the footage from a few key moments.

I won’t bother listing application scenarios; they are everywhere.

What is Hadoop?

Now that we know what big data is, we have to face another question: where do we put all of this data?

On a hard disk, of course; what else, write it all out on paper? Me: hard disks I get, but what if the disk breaks? Is the data just gone?

Passerby: are you being silly on purpose? Just buy a few more disks and keep a copy on each one. Problem solved, no?

Then Hadoop shows up: step aside, everyone, that approach is far too clumsy; let me handle this for you.

Hadoop is a distributed system infrastructure developed under the Apache Foundation, used mainly to store and analyze massive amounts of data.

Of course, like Spring, the name Hadoop no longer refers to a single piece of technology. If you tell someone your new e-commerce project is built on Spring, they certainly won’t assume you use only the core Spring framework; they’ll assume you may be using Spring MVC, Boot, JPA, and a whole series of Spring ecosystem technologies. The same goes for Hadoop: the name covers not only the technology itself but the entire ecosystem that has grown up around it.

And don’t overcomplicate things in your head, treating concepts like distributed storage as something esoteric. Granted, the official definitions are a bit abstract, but I believe every theory is ultimately derived from life, because life is what inspired it in the first place, and life is not that complicated. So any abstruse, complex theory can surely find an easy-to-understand explanation in everyday life.

What is distributed storage? No bragging: I was already doing it in junior high. Back then fantasy novels were all the rage, the brick-sized kind, you know, thick. Usually a class had only one copy, and if the dean of students confiscated it, that was the end: nobody got to read it. So we would take a novel apart page by page, and each student kept a few pages. Everyone swapped around, and even if a teacher caught someone, only a small part got confiscated, never the whole thing. Look: “distributed”, check; “storage”, check. Isn’t that distributed storage? And what if one set of pages gets confiscated and the book is no longer complete? Buy three copies and keep the pages in several different places. That’s just multiple replicas. It really isn’t that complicated, so don’t get hung up on definitions written by scholars for scholars.
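
If you prefer seeing the idea as code, here is a toy sketch of that classroom scheme in plain Java. Every name in it is made up by me for illustration; this is not how HDFS actually works, just the block-plus-replica idea:

```java
import java.util.*;

// A toy model of the classroom trick: pages (blocks) spread across
// students (nodes), each page held by several students (replicas).
// All names are invented for illustration; this is not HDFS.
public class ClassroomStorage {
    static final int REPLICAS = 3; // buy three copies of the novel

    public static void main(String[] args) {
        List<String> students = List.of("A", "B", "C", "D", "E");
        String[] pages = {"page1", "page2", "page3", "page4"};

        // For each page, pick REPLICAS distinct students to hold a copy.
        Map<String, List<String>> holders = new HashMap<>();
        for (int i = 0; i < pages.length; i++) {
            List<String> picked = new ArrayList<>();
            for (int r = 0; r < REPLICAS; r++) {
                picked.add(students.get((i + r) % students.size()));
            }
            holders.put(pages[i], picked);
        }

        // Even if one student is "caught", every page still has holders.
        String caught = "B";
        for (String page : pages) {
            List<String> left = new ArrayList<>(holders.get(page));
            left.remove(caught);
            System.out.println(page + " survives with holders " + left);
        }
    }
}
```

Knock out any one “student” and every page still survives somewhere else; that is the whole trick behind multiple replicas.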

History of Hadoop:

There isn’t much worth lecturing about here, so I’ll just list a few key points. Interested readers can search for the details themselves; there is plenty online.

  • A guy named Doug Cutting wrote a full-text search framework in Java: Lucene.
  • Once the data got big, Lucene’s performance couldn’t keep up.
  • Coincidentally, Google was doing full-text search too, and somehow their performance was top-notch. Why?
  • By studying and imitating Google, he built Nutch.
  • Google later published some of the details of GFS and MapReduce.
  • Doug Cutting read them and realized the answers had practically been handed to him, so over two years of spare time he implemented DFS and MapReduce himself and put them into Nutch. Performance went straight up. One word: impressive.
  • Later, Hadoop, as part of Nutch (itself a subproject of Lucene), was formally brought into the Apache Foundation.
  • Then MapReduce and NDFS were merged into the Hadoop project, and Hadoop was born.

How do other people pull off something this great in their spare time, when I can’t even climb ranks in Honor of Kings in mine? Is some middleman pocketing the difference?

Hadoop distributions:

Much like Linux, different companies have built their own customized distributions on top of it. There are three major Hadoop distributions:

  • Apache: the original (most basic) version, best for beginners; it is the birthplace, after all, with the purest pedigree.
  • Cloudera: It’s mostly used by large Internet companies.
  • Hortonworks: Well documented.

Of course, we will go with Apache, for no reason other than that it’s basic, simple, and free.

What are Hadoop’s advantages?

What is it about Hadoop that makes us think of it the moment big data development comes up?

After all, writing programs is not like being in love; there is no “even if you’re no good, I still love you”. We are thoroughly pragmatic: whichever tool works best is the one we use.

Hadoop won its place in the world on the strength of the following four points:

  • High reliability: Hadoop keeps multiple copies of data at the bottom layer, so the failure of a compute element or a storage node does not lead to data loss, exactly like the distributed storage example above.
  • High scalability: tasks and data are distributed across the cluster, which scales easily to thousands of nodes. Say ops walks in one morning and finds, oh no, the cluster is running out of capacity; no big deal, adding a new node to the cluster (or removing one) takes barely a minute.
  • High efficiency: following the MapReduce idea, Hadoop works in parallel to speed up task processing (see the little sketch after this list).
  • High fault tolerance: failed tasks can be automatically reassigned.
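
To give the “high efficiency” point some flavor, here is a minimal word count written with plain Java parallel streams. It only illustrates the map-then-reduce idea; it is not Hadoop’s actual API, which we’ll meet in a later article:

```java
import java.util.*;
import java.util.stream.*;

// Word count, MapReduce-style: "map" each line to words in parallel,
// then "reduce" by merging the counts per word. Plain Java, not Hadoop.
public class MiniWordCount {
    public static void main(String[] args) {
        List<String> lines = List.of(
                "big data big hadoop",
                "hadoop stores big data");

        Map<String, Long> counts = lines.parallelStream()         // split the work
                .flatMap(line -> Arrays.stream(line.split(" ")))  // map: line -> words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce

        System.out.println(counts); // e.g. {big=3, data=2, hadoop=2, stores=1}
    }
}
```

The map step handles each line independently, so it parallelizes for free, and the reduce step merges the per-word counts; Hadoop applies the same two-phase idea across whole machines instead of CPU cores.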

You’ve said plenty of nice things, but doesn’t Hadoop have any downsides? Of course it does, but I can’t get into them until we’ve covered HDFS and MR later on; right now you don’t even know what HDFS is, and it’s hardly fair to badmouth someone you haven’t even been introduced to.

Finally, the technical summary:

As the first article in this Hadoop tutorial series, this post, true to my blogging habits, mainly walks through basic concepts. I hope that after reading it you have a basic understanding of big data and Hadoop. My technical articles tend to be colloquial, with plenty of rambling; suggestions are welcome, though rest assured I won’t change (the novels I write are still very serious), and honestly you’ll read a trashy article like this faster than those deep, concept-juggling pieces (just kidding). The next article stays conceptual too, focusing on HDFS, YARN, and MR, the three core concepts of Hadoop; we still won’t touch code for now.

Thank you very much for reading this far; your support and attention are my motivation to keep up the quality sharing.

Relevant code has been uploaded to my GitHub. Make sure you hit “STAR” ahhhhhhh

The long march is full of feeling; how about leaving a star before you go?

Han Shu’s development notes

Feel free to like and follow me; good things will come your way (just kidding)