Many people know that big data is a hot field with strong job prospects and high salaries, and they want to move in that direction. But which technologies should you learn, and what route should you follow? Do you need to attend a big data training course? If you are confused and want to move into big data for these reasons, that is fine. Then the teacher would like to ask: what is your major, and where does your interest in computers and software lie? Are you a computer science major interested in operating systems, hardware, networks, and servers? A software major interested in software development, programming, and writing code? Or a mathematics or statistics major with a particular interest in data and figures?

In fact, these questions point to the three career directions in big data: platform building/optimization/operations/monitoring, big data development/design/architecture, and data analysis/mining. Please don't ask me which is easiest, which has the best prospects, or which pays the most.

11 required skills

1. Advanced Java (JVM, concurrency)

2. Basic Linux operations

3. Hadoop (in the narrow sense: HDFS + MapReduce + YARN)

4. HBase (Java API operations + Phoenix)

5. Hive (basic HQL operations and principles)

6. Kafka

7. Storm

8. Scala

9. Python

10. Spark (Core + Spark SQL + Spark Streaming)

11. Auxiliary tools (Sqoop, etc.)

6 advanced skills

1. Machine learning algorithms, plus the Mahout and MLlib libraries

2. The R language

3. The Lambda architecture

4. The Kappa architecture

5. Kylin

6. Alluxio

Chapter 1: Getting to know Hadoop

1.1 Learn to use Baidu and Google

1.2 For references, prefer the official documentation

1.3 Let Hadoop run first

1.4 Try using Hadoop

1.5 It’s time you understood how they work

1.6 Write a MapReduce program

Chapter 2: A more efficient WordCount

2.1 Learn some SQL

2.2 The SQL version of WordCount

2.3 Hive: SQL on Hadoop

2.4 Installing and Configuring Hive

2.5 Try Using Hive

2.6 How Does Hive work

2.7 Learn Basic Hive Commands

Chapter 3: Getting data from elsewhere onto Hadoop

3.1 HDFS PUT Command

3.2 HDFS API

3.3 Sqoop

3.4 Flume

3.5 Alibaba's open-source DataX

Chapter 4: Getting data from Hadoop to somewhere else

4.1 HDFS GET Command

4.2 HDFS API

4.3 Sqoop

4.4 DataX

Chapter 5: Hurry up, my SQL

5.1 About Spark and SparkSQL

5.2 How Do I Deploy and Run SparkSQL

Chapter 6: Polygamy

6.1 About Kafka

6.2 How To Deploy and Use Kafka

Chapter 7: More and more Analysis tasks

7.1 Apache Oozie

7.2 Other Open-source Task scheduling systems

Chapter 8: My data in real time

8.1 Storm

8.2 Spark Streaming

Chapter 9: Making my data available to the outside world

A data platform usually provides data access to external (business) systems, which generally covers the following aspects:

Offline: for example, delivering the previous day's data to a specified destination (a database, files, or an FTP server) every day. Offline data can be delivered with Sqoop, DataX, and other offline data exchange tools.
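To make this concrete, here is a minimal PySpark sketch of such a daily export job. The table dw.orders, its dt partition column, and the output path are made-up names for illustration, not part of any particular platform.

```python
# A minimal sketch of a daily offline export job (table, column, and path names are assumed).
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-offline-export")
         .enableHiveSupport()   # needed to read tables managed by Hive
         .getOrCreate())

# Yesterday's partition, e.g. '2024-05-01'.
yesterday = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")

# dw.orders and its dt partition column are made-up names for illustration.
df = spark.sql(f"SELECT * FROM dw.orders WHERE dt = '{yesterday}'")

# Write one CSV directory per day; a separate step can push it on to a DB or FTP server.
(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv(f"/data/export/orders/dt={yesterday}"))

spark.stop()
```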

Real time: for example, the recommendation system of an online website needs to fetch per-user recommendation data from the data platform in real time, which requires very low latency (under 50 milliseconds).

Depending on the latency requirements and the real-time query patterns, possible solutions include HBase, Redis, MongoDB, and Elasticsearch.
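A common pattern is to precompute the recommendations offline (or in a streaming job) and store them in Redis, so the online service only needs one key lookup per request. The key layout (rec:&lt;user_id&gt;), the JSON value format, and the one-hour TTL below are assumptions for illustration.

```python
# A minimal sketch of low-latency serving with Redis (key layout and TTL are assumed).
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_recommendations(user_id, item_ids):
    """Called by the offline/streaming job that precomputes recommendations."""
    r.set(f"rec:{user_id}", json.dumps(item_ids), ex=3600)  # expire after one hour

def get_recommendations(user_id):
    """Called by the online service; a single GET keeps latency in the millisecond range."""
    raw = r.get(f"rec:{user_id}")
    return json.loads(raw) if raw else []

save_recommendations("user_42", ["item_1", "item_7", "item_9"])
print(get_recommendations("user_42"))
```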

OLAP analysis: besides requiring a well-designed underlying data model, OLAP places ever higher demands on query response time. Possible solutions include Impala, Presto, SparkSQL, and Kylin. If your data model operates at a fairly large scale, Kylin is the best choice.

Ad hoc query: the data involved in ad hoc queries is arbitrary, and it is generally difficult to establish a general data model in advance, so possible solutions include Impala, Presto, and SparkSQL.
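As a small illustration of the SparkSQL option mentioned for both OLAP and ad hoc queries, here is a minimal sketch; the dw.orders table and its columns are assumed purely for the example.

```python
# A minimal SparkSQL sketch; the dw.orders table and its columns are assumed for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("olap-and-adhoc")
         .enableHiveSupport()
         .getOrCreate())

# OLAP-style aggregation over a fixed model: revenue by region and day.
spark.sql("""
    SELECT region, dt, SUM(amount) AS revenue
    FROM dw.orders
    GROUP BY region, dt
""").show()

# Ad hoc query: an arbitrary question with no predefined model behind it.
spark.sql("""
    SELECT user_id, COUNT(*) AS order_cnt
    FROM dw.orders
    WHERE amount > 100
    GROUP BY user_id
    ORDER BY order_cnt DESC
    LIMIT 10
""").show()

spark.stop()
```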

With so many mature frameworks and solutions available, you need to choose the appropriate ones based on your own business requirements and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.

Chapter 10: Machine learning, the high-end stuff

On this topic, the teacher can only give a brief introduction here.

In our business, there are three types of problems that machine learning can solve:

Classification problems: these include binary and multi-class classification. Binary classification solves prediction problems such as predicting whether an email is spam; multi-class classification solves problems such as categorizing text.
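For instance, a spam filter can be sketched in a few lines with scikit-learn; the tiny training set below is made up purely for illustration.

```python
# A minimal binary text-classification sketch with scikit-learn; the training data is a toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",
    "cheap pills limited offer",
    "meeting agenda for monday",
    "please review the quarterly report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# TF-IDF features + logistic regression in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["free prize offer just for you"]))   # likely [1] (spam)
print(model.predict(["please see the attached report"]))  # likely [0] (not spam)
```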

Clustering problems: for example, roughly grouping users based on the keywords they have searched for.
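A minimal sketch of that idea with scikit-learn follows: each user's search keywords are treated as one document and clustered with K-means. The keyword data and the number of clusters are arbitrary toy choices.

```python
# A minimal sketch of grouping users by their search keywords (toy data, k chosen arbitrarily).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per user: the keywords that user has searched for.
user_keywords = [
    "hadoop hive spark hdfs",
    "spark kafka streaming flink",
    "recipe baking cake dessert",
    "cooking dinner recipe pasta",
]

X = TfidfVectorizer().fit_transform(user_keywords)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(kmeans.labels_)  # users with similar keywords end up with the same cluster id
```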

Recommendation problems: making recommendations based on users' browsing history and click behavior.
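One common approach here is collaborative filtering. Below is a minimal sketch using Spark MLlib's ALS, treating click counts as implicit feedback; the click data and the ALS parameters are toy choices for illustration.

```python
# A minimal collaborative-filtering sketch with Spark MLlib's ALS (toy click data and parameters).
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-demo").getOrCreate()

# (user, item, clicks): click counts treated as implicit feedback.
clicks = spark.createDataFrame(
    [(0, 10, 3.0), (0, 11, 1.0), (1, 10, 2.0), (1, 12, 5.0), (2, 12, 4.0)],
    ["user", "item", "clicks"],
)

als = ALS(userCol="user", itemCol="item", ratingCol="clicks",
          implicitPrefs=True, rank=8, maxIter=5, seed=42)
model = als.fit(clicks)

# Top 3 recommended items for every user.
model.recommendForAllUsers(3).show(truncate=False)

spark.stop()
```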

Most industries use machine learning to solve these kinds of problems.
