1 Introduction

1.1 Open source big data technology

1.2 A necessity for staying competitive

1.3 Tutorial Planning

1.7 Preliminary Knowledge

  • Basic knowledge of big data
  • Familiarity with basic Linux commands
  • Familiarity with Scala programming
  • Some background in mathematics

1.8 Environment Parameters

  • Spark: 2.3.0
  • JDK: 1.8
  • IDE: IntelliJ IDEA

2 Overview of machine learning

2.1 Machine learning concepts

2.2 History of machine learning

2.3 Machine Learning (ML) & Artificial Intelligence (AI)

2.4 Common tasks in machine learning

◆ Classification: recognize whether a face in an image is male or female

◆ Clustering: find the type of girlfriend you like best

◆ Regression: predict stock prices

The difference between classification and regression

◆ The output of classification is a discrete category; the output of regression is a continuous value

◆ Examples

  • The result of gender classification can only be one element of the set {male, female}
  • The output of regression can be any number in a range, such as the price of a stock
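The contrast between a discrete label and a continuous value can be sketched in a few lines of Python. This is only an illustration: the 0.5 threshold, the `score` input, and the straight-line price formula are assumptions made up for the example, not trained models.

```python
def classify_gender(score: float) -> str:
    """Classification: map a score to one label from a fixed, discrete set."""
    # The 0.5 threshold and the meaning of `score` are illustrative assumptions.
    return "male" if score >= 0.5 else "female"


def predict_price(day: float, slope: float = 1.5, intercept: float = 100.0) -> float:
    """Regression: return a real number; any value in a range is possible."""
    # A straight line stands in for a trained regression model.
    return slope * day + intercept


# Every classification result is drawn from the same two-element set.
labels = {classify_gender(s) for s in (0.1, 0.4, 0.6, 0.9)}
# Regression results vary continuously with the input.
prices = [predict_price(d) for d in (0, 1, 2.5)]

print(labels)
print(prices)
```

However many inputs are classified, `labels` never grows beyond the two categories, while `prices` can take any value on the line.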

2.5 Applications of machine learning

Natural language processing, data mining, biometric recognition (such as face recognition), computer vision, etc.

2.6 Current state of machine learning

◆ The application fields are very broad, for example DNA sequencing and securities analysis

◆ A national strategy: machine learning has appeared many times in government work reports

◆ Talent shortage

  • The barrier to entry for newcomers is relatively high
  • There is a huge talent gap

3 Core ideas of machine learning

3.1 Methods of machine learning

  • Statistical machine learning (the main content of this tutorial)
  • Backpropagation (BP) neural networks
  • Deep learning

3.2 Types of machine learning

◆ Supervised learning

◆ Unsupervised learning (with semi-supervised learning in between the two)

◆ Reinforcement learning

3.2.1 Supervised Learning

◆ Learn a model that can make a corresponding prediction for any given input

  • The training data takes the form of (X, Y) pairs: an input X together with its known answer Y
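As a minimal sketch of learning from (X, Y) pairs, the following uses a one-nearest-neighbor rule, chosen here only because it is short; it is not a method this tutorial prescribes. The function names and the toy data are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[float, str]  # one (X, Y) training example


def train_nearest_neighbor(pairs: List[Pair]) -> List[Pair]:
    """'Training' for 1-nearest-neighbor simply stores the labeled pairs."""
    return list(pairs)


def predict(model: List[Pair], x: float) -> str:
    """Predict the label Y of the stored example whose X is closest to x."""
    nearest_x, nearest_y = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest_y


# Labeled data: each input X comes with its known answer Y.
training_data = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
model = train_nearest_neighbor(training_data)

print(predict(model, 1.5))  # an input that was not in the training data
print(predict(model, 7.0))
```

The model answers for inputs it never saw, which is the point of supervised learning: the labels Y in the training pairs guide every later prediction.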

3.2.2 Unsupervised learning

◆ Learn a model from unlabeled data, discovering hidden features, patterns, and rules without guidance. The input data consists of X only. Clustering is a typical example
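Clustering with X-only data can be sketched with a tiny two-cluster k-means on one-dimensional values. This is a simplified illustration under stated assumptions: the data is invented, the number of clusters is fixed at two, and the centroids start at the minimum and maximum so that the run is deterministic.

```python
from typing import List


def kmeans_1d(xs: List[float], iterations: int = 10) -> List[float]:
    """Two-cluster k-means on 1-D data; returns the two centroids."""
    # Deterministic starting points (an assumption made for this sketch).
    centroids = [min(xs), max(xs)]
    for _ in range(iterations):
        clusters: List[List[float]] = [[], []]
        for x in xs:
            # Assign each point to its nearest centroid.
            idx = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
            clusters[idx].append(x)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Unlabeled data: only X, no Y.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(data)
print(centers)
```

No labels were supplied, yet the two centroids settle near the two natural groups in the data.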

3.2.3 Reinforcement learning

◆ Without explicit instructions, the algorithm evaluates its own predictions, so that the computer can still generalize well to problems it has not learned

3.3 Summary of machine learning ideas

◆ Train a model on existing data, then use that model to fit other data and make predictions for unknown data

◆ An analogy with human learning: the teacher works through math problems, the students generalize from those examples to new problems, and the exam result measures how well they learned
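The train-then-predict idea can be sketched as fitting a straight line to existing data and then scoring it on held-out data, the "exam". The numbers are invented for illustration, and an ordinary least-squares line is an assumed stand-in for whatever model is actually used.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mean_abs_error(model: Tuple[float, float], xs: List[float], ys: List[float]) -> float:
    """Average absolute gap between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Existing data: the problems the teacher worked through.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]
# Unknown data: the exam, never seen during training.
test_x, test_y = [5.0, 6.0], [11.0, 13.1]

model = fit_line(train_x, train_y)
exam_error = mean_abs_error(model, test_x, test_y)
print(model, exam_error)
```

A small `exam_error` on data the model never saw corresponds to a student who generalized from the worked examples rather than memorizing them.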

3.4 A little more mathematics

◆ Find a mathematical function that measures the deviation between the predicted result and the actual result

◆ Make that deviation as small as possible (not necessarily zero) by repeatedly (iteratively) training the model and learning features

  • For example, drilling practice problems does not guarantee full marks every time; expecting that is unrealistic
  • Getting close is the realistic goal

◆ The function that measures the prediction deviation is called the loss function

◆ Machine learning is the process of solving an optimization problem

3.5 Two situations to avoid when training a model

◆ Overfitting: the model is trained too much and its assumptions are too strict

  • Deciding whether an image shows a leaf: the model insists that a leaf must have serrated edges

◆ Underfitting: the model needs further training and its fitting ability is weak

  • Deciding whether an image shows a leaf: the model says that anything green is a leaf
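The two leaf examples can be written out as hand-coded rules to show where each one fails. These are not trained models, only a sketch of the behavior described above; the `Picture` type and the four sample pictures are invented for illustration.

```python
from typing import Callable, List, NamedTuple


class Picture(NamedTuple):
    name: str
    is_green: bool
    has_serrations: bool
    is_leaf: bool  # ground truth


def overfit_model(p: Picture) -> bool:
    """Too strict: only serrated things count as leaves."""
    return p.has_serrations


def underfit_model(p: Picture) -> bool:
    """Too loose: anything green counts as a leaf."""
    return p.is_green


# Hypothetical test pictures.
pictures = [
    Picture("serrated leaf", True, True, True),
    Picture("smooth-edged leaf", True, False, True),
    Picture("green frog", True, False, False),
    Picture("red car", False, False, False),
]


def mistakes(model: Callable[[Picture], bool]) -> List[str]:
    """Names of the pictures the model gets wrong."""
    return [p.name for p in pictures if model(p) != p.is_leaf]


print(mistakes(overfit_model))
print(mistakes(underfit_model))
```

The overly strict rule rejects a real leaf that lacks serrations, while the overly loose rule accepts a green object that is not a leaf.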

4 Machine learning frameworks and how to choose one

4.1 Common programming languages for machine learning

◆ Python

◆ C++

◆ Scala

4.2 Commonly used frameworks for machine learning

◆ Spark (ML/MLlib), scikit-learn, Mahout

4.3 Benefits of Using Spark

◆ A unified technology stack: easy integration with Spark's four main modules

◆ Training a machine learning model is an iterative process, and in-memory computation makes it more efficient

◆ Naturally distributed: compensates for the limited computing power of a single machine and scales elastically

◆ A prototype built on Spark can be applied directly in the production environment

◆ Supports running mainstream deep learning frameworks

◆ Ships with its own matrix computation and machine learning libraries, with comprehensive algorithm coverage

4.4 Key points of machine learning project selection

◆ Fully consider the production environment and business scenarios

◆ Prefer open source projects with detailed documentation, complete reference material, and an active community

◆ Take the R&D team's situation into account; strive to simplify and unify the technology stack and avoid a jumble of tools
