Preface

It is that in-between time of year again, warm one day and cold the next; the spring wind carries a little warmth inside the chill. You know the feeling: it is spring. It takes me back to my internship, to a graduation season of farewells, frozen in the July of that year.

People feel nostalgia: for their younger selves, naive but full of effort. People feel loss: for the friends they saw every day, until the day they waved goodbye at the corner. And people change: the tongue-tied kid eventually finds his footing in the world.

It has been three years since graduation, and counting my internship I have worked in big data for nearly four. Riding the big data wave, I have gained a foothold in a mid-sized city. When relatives and friends ask about my job during the holidays, I just say I develop mobile apps, to spare everyone an explanation that would go on forever and still leave them confused. But I keep running into questions about big data elsewhere, so I am using my spare time to write down these years of big data work.

Concept

What is big data

My understanding of big data is this: using technical means to process massive amounts of data and realize its value. First comes the massive data; without data to back it up, big data is just talk. Then comes the technology, which processes that data offline or in real time, and Hadoop is the piece you have most likely heard of. At present, big data is widely applied in e-commerce, telecom operators, finance, healthcare, and other industries.

Why big data

Take e-commerce as an example. Have you ever wondered why, after you browse a product, it shows up on the home page or in advertisements inside other apps? That is one of the applications of big data.

When you browse products in the app, the backend collects your browsing data, with fields such as user account and product category. Now, if you were the engineer, how would you store this browsing data? In the traditional development mindset, many people would choose MySQL.

But with tens of billions of product-browsing records arriving every day, how many disks would a single host need to retain the data? Could MySQL even handle that volume? And how would you analyze users' browsing preferences efficiently and in real time? These are the questions that force developers to think about technology selection.

The emergence of big data solves these problems.

Is big data hard to learn

What you learn on paper always feels shallow; to truly understand something, you must practice it yourself. Big data is in fact not hard to learn, but it does require breadth: programming, networking, hosts, and an accumulation of knowledge across many areas. Studying big data in depth means practicing on top of the theory. When learning a technical framework, it is best to set up a cluster on Alibaba Cloud or on virtual machines. On one hand this builds your Linux skills and your understanding of how clusters operate; on the other, it gives you a cluster to practice on.

Also, big data technology behaves differently in production than in a test environment. Production brings real business scenarios and all kinds of real problems, so if you get the chance to touch a production big data environment, you will get twice the learning with half the effort.

The main technologies

In big data, different business scenarios call for different technology choices, and big data applications fall mainly into offline computing and real-time computing. Before getting to those, let's take a look at Hadoop.

Hadoop

Most people have heard of Hadoop. As the most foundational big data framework, it occupies the core position. Apache Hadoop is the community open-source version, but what production environments mostly run are third-party distributions built on Apache Hadoop, such as HDP and CDH, which are free; we currently use HDP. There are also paid distributions, such as those from Huawei and Intel.

So what does Hadoop do?

In the traditional model, a running program uses only the computing resources of its own host, such as CPU and memory, and a file occupies only that host's disk. Hadoop, by contrast, uses a cluster of machines to provide distributed computing and distributed storage.

HDFS (Hadoop Distributed File System)

HDFS consists of a master node, the NameNode, and slave nodes, the DataNodes. Master-slave is the most common architecture in big data.

The NameNode manages the metadata of the entire file system, such as which machines a file is stored on. If the NameNode fails, HDFS becomes unavailable. The usual solution is HA (high availability): the cluster runs two NameNodes, one Active and one Standby. If the Active NameNode stops working, the Standby one becomes Active and takes over.

DataNodes store the data files. Each file is stored on multiple machines according to the configured number of replicas: with the replica count set to 3, each file exists as three copies in total. The copies are placed on different nodes according to the rack-awareness policy:

  1. Replica 1 is placed on the same node as the Client (or the nearest node, if the Client is outside the cluster)
  2. Replica 2 is placed on any node in a rack different from the first replica's rack
  3. Replica 3 is placed on a different node in the same rack as the second replica

In this way, if a node fails or a file is damaged, the copies on other nodes keep everything running. This is the disaster-recovery strategy: trading storage space and data redundancy for data availability, much like RAID. Kafka likewise relies on replicas to keep data available.
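As a small illustration, the HDFS Java API exposes this replica setting per file. The sketch below assumes a reachable cluster whose configuration is on the classpath; the file path is made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // dfs.replication there sets the cluster-wide default (commonly 3).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file, purely for illustration.
        Path file = new Path("/user/demo/orders.txt");

        // Ask HDFS to keep 3 copies of each block of this file; the NameNode
        // places the copies according to the rack-awareness policy above.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```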

MapReduce

MapReduce is a distributed computing model that splits a job into a Map phase and a Reduce phase. Each phase is divided into multiple tasks that execute concurrently, much like the divide-and-conquer idea in algorithms.

The divide-and-conquer idea is to split a task into several subtasks, compute them simultaneously, and combine the results. MapReduce likewise splits a job and distributes the pieces to Hadoop nodes for computation, so the computing resources of many hosts can be used at once. The underlying implementation details of MapReduce are worth exploring if you are interested.
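To make the two phases concrete, here is the classic word-count job as a minimal sketch (the class names and the input/output paths passed in args are my own choices, not from this article):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper sees one slice of the input and emits (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together and are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```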

Offline computing

Offline data is data that has already been persisted to disk, such as files and database records. I think of offline computing as bounded computing, because the data in a file or database is known and usually does not change. In a narrower sense it can be understood as SQL computation over a database: using big data technology to analyze massive offline data for marketing decisions or report presentation.

The technical architecture

Offline computing generally uses Hive. Hive is a data warehouse tool: its data files are stored in HDFS, and you insert, delete, update, and query them with HiveSQL. Hive thus offers database-style operations, but its execution engine parses HiveSQL into MapReduce jobs and distributes them to the Hadoop nodes for execution. Hive itself is therefore not a database; the underlying computation relies on MapReduce.

Other commonly used technologies include SparkSQL, Kylin, HBase, and Druid.

An example application

For example: find the Top 100 products by transaction volume in a month and turn the result into visual reports.
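Such a query might look roughly like the following HiveSQL; the database, table, and column names are invented for the illustration:

```sql
-- Hypothetical orders table, partitioned by day (dt).
SELECT product_id,
       SUM(amount) AS total_amount
FROM dw.orders
WHERE dt BETWEEN '2020-06-01' AND '2020-06-30'
GROUP BY product_id
ORDER BY total_amount DESC
LIMIT 100;
```

Behind the scenes, Hive turns this into MapReduce jobs, with the GROUP BY key playing the role of the shuffle key, much like the word count above.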

Real-time computing

The counterpart of offline computing is real-time computing, which can be understood as unbounded stream computing. Data flows into the program like a river, and the program keeps running until an exception occurs or it is stopped manually.

The technical architecture

At present, Flink and Spark Streaming are the real-time computing frameworks that enterprises use most, with Kafka as the message queue tying the pipeline together. Here is a simple simulation of a stream-processing flow:

A collection program acts as the producer, generating data in real time and writing it to Kafka; a Flink program acts as the consumer, reading from Kafka in real time, processing the data, and finally writing the results back to Kafka or into HDFS.
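A minimal Flink job along those lines might look like the sketch below. The topic names, broker address, and filter logic are all invented for the example, and it uses the classic FlinkKafkaConsumer/FlinkKafkaProducer connector pair:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class BrowseEventJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka01:9092"); // made-up broker address
        props.setProperty("group.id", "browse-demo");

        // Source: raw browsing events written to Kafka by the collection program.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("browse-events", new SimpleStringSchema(), props));

        // Processing: a toy filter standing in for real business logic.
        DataStream<String> bookEvents = events.filter(e -> e.contains("book"));

        // Sink: write results to another Kafka topic (an HDFS sink would also work).
        bookEvents.addSink(
                new FlinkKafkaProducer<>("book-browse-results", new SimpleStringSchema(), props));

        env.execute("browse-events-demo");
    }
}
```

The same job shape fits the coupon example below: filter for users browsing a given book, then push to a topic that the marketing system consumes.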

Other commonly used stream-processing technologies include Storm and RabbitMQ, while Redis usually serves as a cache in stream computing.

An example application

For example: spot the users who are currently browsing a certain book and push book coupons to them.

Job categories

Many people want to get into the big data industry, and the question they ask most is: what big data positions are there? They fall mainly into three categories: big data analysis, big data development, and big data operations and maintenance (O&M). In finer detail there are also roles such as platform architect, mainly responsible for cluster construction, but I will not go into those here.

During my internship I did big data analysis for one month and big data O&M for half a year; after graduation I took charge of big data development. Having gone around the full circle, I have some first-hand impressions of each role.

Big data analysis

Big data analysis is mainly oriented toward offline computing. It covers data analysis and report statistics, which focus on bringing out the value of the data, and ETL scheduling of data, that is Extract, Transform, Load, which focuses on the flow of offline data. The form of the work is relatively simple, but daily demand is heavy, and data analysis requests around the holidays are especially urgent.

On Zhihu, people ask why most positions in the big data industry revolve around the offline data warehouse and writing HiveSQL.

My own first internship role in big data was analysis, and many interns are still assigned analysis work today. Because the work leans toward the business side, the technical bar is not especially high and it is relatively easy to get started: most of it is database SQL development, and with some guidance you can pick it up quickly.

Second, the volume of offline data is large, so data cleaning, layered aggregation, and accuracy verification all take manpower and time. At the same time, business demand is heavy: through statistical analysis and year-on-year and month-on-month comparisons, offline data can directly support customers' marketing decisions and external monetization, creating revenue for the company quickly. The company's business structure and operating system therefore dictate a large number of big data analysis positions.

Technology stack

  1. Programming languages: icing on the cake rather than a hard requirement, but learning a little Python and Java is recommended.

  2. Big data technologies: Hadoop, HDFS, Hive, HBase, ETL scheduling, etc.

  3. Others: Shell, Linux operation, SQL.

Big data development

Big data development is mainly oriented toward real-time computing, building Flink and Spark applications in Java and Scala. Compared with big data analysis, the scope of work is wider, the technical requirements are higher, and the work is more flexible: different technology choices open up a variety of solutions, and the tasks are less repetitive.

In general, unless a company builds big data processing engines or big data platform products, it has few big data development positions. In our big data team of a dozen or so people, most hold analysis positions; I am the only one doing development.

At present, my main work contents are as follows:

  1. Data ingestion: parsing 1 trillion records of binary data per day into plaintext according to the specification and writing it to Kafka. Mainly an application of Java multithreading, the JVM, and NIO.

  2. Stream-processing development: application development on Flink, Spark, and IBM Streams. Development languages: Scala and SPL.

  3. Data retention: landing 300 TB of data per day into HDFS and loading it into Hive. Technology selection: Flume.

  4. Crawler development: data collection for marketing scenarios, millions of records per day. Technology selection: Scrapy.

So big data development is, above all, programming. The difference from traditional Java development is that traditional projects are engineering-oriented, with large and complex module structures that require many people to collaborate, whereas big data development builds a solution for a single application scenario: usually a few hundred lines of code, usually done by one person.

Technology stack

  1. Programming languages: mainly Java and Scala; strong programming ability is required.

  2. Big data technologies: Flink, Spark, Kafka, Redis, Hadoop, HDFS, and Yarn.

  3. Others: Shell and Linux.

Big data operation and maintenance

Big data O&M mainly monitors the health of the big data platform and its applications and must respond to incidents promptly. The work is hard and often means staying up late on duty. O&M engineers must be familiar with the cluster and its hosts, and be able to analyze logs and track down and resolve problems.

When I was responsible for big data O&M, I was never far from my computer: either sitting in front of it or carrying it on my back on the road, all the while being bombarded with alert messages.

Technology stack

Using the big data platform, Linux operations, hosts, networking, scheduling, and so on.

The above is my hands-on understanding of the various big data positions, and also my answer to the question of whether you must learn Java to work in big data.

Me and big data

In the summer of 2017, after many twists and turns, I started my big data internship. At first I mainly did HiveSQL data-verification work, along with some documentation. Back then I still held on to the dream of Java development: every evening I worked through my unfinished Java courses in the rented room with my college roommates, built clusters at the company by following big data videos, and read big data articles on the bus every day.

I was thin-skinned in those days. At work I kept my head down behind the monitor, too embarrassed to ask questions; I saved them all up, and when a colleague came over to check on my progress I would fire off the whole batch at once. The repetitive work, and the drift away from my Java development ideals, left me wondering at night: is this the job I want?

Two months later, the junior O&M guy resigned, and I became the big data O&M guy, beginning a life of man-machine integration. I wrote my first Shell script because some applications needed monitoring. Later, since I knew Java, I also took on some development work while still in the O&M role, and taught myself Spark, Kafka, and other big data development technologies. In the months of overtime that followed, I took the chance to dig deep into the platform architecture and untangle the data flow through the entire big data platform, and suddenly everything became clear.

Those days were genuinely hard, but also genuinely happy and fulfilling, and they brought me ever closer to a big data development role. I had inexhaustible energy, a thirst for knowledge, and a passion for the work. As time went on, the ignorance and nervousness in my heart faded, and one day, without noticing it, I had melted into the big data team.

Later, after a role change, partings, and personal growth, I became a big data developer.

Opportunity paired with hard work gave me memorable experiences in every position. Without those repeated opportunities, ability would have had nowhere to show itself; without the backing of technical ability, opportunities come and go. Once you choose a road, you must strive to walk it to the end.

Three hurried years have not washed away all the naivety. But the young man who, three years ago, first whispered nervously, “I can, I can do O&M well,” could hardly have expected that today he can say with confidence: “I can, I can do development well.”

Be true to the dreams of your youth.

Conclusion

I hope that after reading this article you have a somewhat deeper understanding of big data, and that one day you can speak with confidence when others bring it up. Or perhaps it simply leaves you with a flash of recognition: oh, so this is big data. That would be enough for me.

Times change and technology changes; the big data fever has receded, and artificial intelligence and machine learning now claim the spotlight. But big data is still big data, and I am still a young man with a dream.



What I write comes from the daily work of my own practice, told from my own perspective from 0 to 1, so that readers can truly understand it.

This article is published on my WeChat official account [entry to the road to give up]; I look forward to your follow.