
Fundamentals of machine learning

What is machine learning

Machine learning is a technique that learns rules and discovers patterns from large amounts of data to help us predict, judge, group and solve problems. (In other words, machine learning produces functions from data, rather than having programmers write them directly.)

Speaking of functions, independent and dependent variables are involved. In machine learning, the independent variables are called features, and multiple independent variables can be written as X1, X2, …, Xn. The dependent variable is called the label and can be written as Y. A set of records containing both features and labels is a machine learning data set.

The learning process of machine learning is to repeatedly compute, on the basis of the known data set, the mathematical function that most accurately describes the relationship between the independent variables X1, X2, …, Xn and the dependent variable Y. This process is called training, or fitting, the model.
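As a rough illustration (the column names and values below are made up for the example), a data set of features and a label, viewed through pandas, might look like this:

    import pandas as pd

    # Toy data set: two feature columns (X1, X2) and one label column (Y)
    dataset = pd.DataFrame({
        'X1': [1.0, 2.0, 3.0, 4.0],
        'X2': [10.0, 20.0, 30.0, 40.0],
        'Y':  [12.0, 24.0, 36.0, 48.0],
    })
    X = dataset[['X1', 'X2']]  # feature matrix
    y = dataset['Y']           # label vector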

There are several concepts that need to be clarified: training set, validation set and test set

  • Training set: the data used to train the model in the first place.
  • Validation set: used to verify whether the model generalizes and to evaluate whether it is over-fitting.
  • Test set: used to evaluate the generalization ability of the final model on unseen data, i.e., whether it can draw inferences beyond the examples it was trained on.

Machine learning classification

The main classification criterion is whether the training data has labels.

  • Supervised learning: all the training data has labels. Depending on the type of label, supervised learning is divided into two main problem types: regression and classification. When the label is a continuous value it is a regression problem, such as predicting house prices or stock prices; when the label is a discrete value it is a classification problem, such as face recognition, judging whether something is correct, or deciding which of two operation strategies is more effective.

    The specific classification algorithms include logistic regression, decision tree classification, SVM classification, Bayesian classification, random forest, XGBoost, KNN…

    Specific regression algorithms include: linear regression, decision tree regression, SVM regression, Bayesian regression…

  • Unsupervised learning: the training data has no labels. It is used in a more limited set of scenarios such as clustering and dimensionality reduction, for example building group portraits of users, and often serves as a sub-step of data preprocessing. Typical algorithms include dimensionality-reduction algorithms and clustering algorithms…

  • Semi-supervised learning: some data is labeled and some is not, usually because labels are difficult or expensive to obtain. Semi-supervised learning is very similar to supervised learning; the main difference is an extra pseudo-label generation step, that is, producing labels for the unlabeled data.

    Semi-supervised learning is divided into semi-supervised classification, semi-supervised regression, semi-supervised clustering and semi-supervised dimensionality reduction.

  • Reinforcement learning: for problems that supervised, semi-supervised or unsupervised learning cannot solve, reinforcement learning comes into play. It deals with how an agent (a machine learning model) should act in an environment in order to maximize cumulative reward. The difference from supervised learning is that supervised learning learns from data, while reinforcement learning learns from the rewards and punishments given by the environment.

    Examples: Q-learning, SARSA, deep reinforcement learning networks, Monte Carlo learning… (a minimal Q-learning update sketch follows below the list).
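As a rough illustration of the reward-driven update at the heart of Q-learning (the states, actions and transition below are hypothetical, and the learning rate and discount factor are arbitrary):

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))   # tabular Q-values, one per (state, action)
    alpha, gamma = 0.1, 0.9               # learning rate and discount factor

    def q_update(state, action, reward, next_state):
        # Move Q(s, a) toward the observed reward plus the discounted best future value
        best_next = Q[next_state].max()
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

    # Example transition: in state 0, taking action 1 yields reward 1.0 and leads to state 2
    q_update(state=0, action=1, reward=1.0, next_state=2)
    print(Q)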

Specific classification can be seen in the following mind map:

How to understand deep learning

Deep learning, as it is commonly called, refers to models built on deep neural networks, and it can be applied to all four types of machine learning above. Deep learning is good at processing unstructured input and excels at visual processing and natural language processing. Its advantage is that it can automatically extract complex features from unstructured data sets without manual intervention.

Machine learning approach

When doing machine learning in a project, we first need to clarify the problem to be solved, then choose a class of algorithms suited to that problem, then train on the data to find the function that fits best and forms the final model, then deploy the model online, collect online data, and keep iterating and optimizing.

Preparing an environment for getting started with machine learning

Background:

This environment is prepared mainly for quick learning. Most Internet companies provide Notebook-like products for interactive data analysis, data modeling and data visualization. They are mostly customized developments based on Jupyter and Zeppelin, focusing on big data computing, storage and underlying resource management, and supporting common machine learning and deep learning frameworks. The most common tool for algorithm analysis and modeling is Jupyter Notebook, which runs in the browser: you write a Python script, run it, and the results are displayed below the script block.

Jupyter Notebook supports interactive development, with rich text formats and graphical presentation of results, letting data analysts quickly show what they are thinking.

The best way to get started is to install and run Jupyter Notebook on your local machine.

Install Jupyter Notebook

Install and manage Jupyter Notebook with Anaconda

Anaconda is a free development environment that lets you manage numerous Python libraries and deploy multiple Python environments, and it supports Jupyter Notebook, Spyder and many science packages. You can download and install Anaconda from its website. Starting Jupyter after installing Anaconda is easy: by default, Anaconda installs Jupyter and its science libraries.

Install using the pip command

With this approach, if you need other science packages and their dependencies, you have to install them manually.

    pip3 install --upgrade pip
    pip3 install jupyter
    jupyter notebook --port <port_number>
    jupyter notebook -h

Install using Docker

With Docker, it is relatively easy to install and start Jupyter.

For example:

    docker run -it -d --name=test -p 8082:8888 tensorflow/tensorflow:2.2.0-jupyter
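Once the container is running, the notebook's access token can usually be found in the container logs (a generic Docker usage sketch; the container name follows the example above):

    docker logs test    # the Jupyter startup log includes the access URL with its token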

Practice with Jupyter Notebook

Use Plotly for plotting in Jupyter

  • Introduction

Plotly is a very powerful open-source data visualization framework that builds HTML-based interactive charts to display information and can create beautiful charts in many forms. In this article, Plotly refers to its Python wrapper. Plotly itself is an ecosystem of plotting tools that provides interfaces for many programming languages.

  • Funnel plot

In an e-commerce scenario, buying goods involves several steps: downloading the app, registering, searching for goods and purchasing. Users can drop off at every step, and a funnel plot can show how many users remain after each stage. Once the data for each stage has been collected, we can draw the funnel with Plotly.

Drawing process

  • Install Plotly package
    pip install plotly
  • Detailed code

Here is a slightly more involved example that plots a combined funnel for male and female users:

    import plotly.express as px   # Plotly Express plotting interface
    import pandas as pd

    stages = ["visits", "downloads", "registrations", "searches", "purchases"]
    data = pd.DataFrame(dict(number=[59, 32, 18, 9, 2], stage=stages))
    data['gender'] = 'male'
    print(data)
    data2 = pd.DataFrame(dict(number=[40, 30, 22, 10, 5], stage=stages))
    data2['gender'] = 'female'
    df = pd.concat([data, data2], axis=0)   # stack the two funnels into one frame
    print(df)
    fig = px.funnel(df, x='number', y='stage', color='gender')
    fig.show()

  • Results display

  • Results analysis

From this funnel chart, we can see the user drop-off at each stage of the whole in-app purchase flow, and that the proportion of female users who end up purchasing is noticeably larger. Such observations can prompt product and operations colleagues to focus on a particular step in order to reduce the churn rate there.

Five steps of machine learning engineering practice

Define the problem

We need to dissect the business scenario, set clear goals, and identify the type of machine learning problem at hand.

Scenario: suppose we want to analyze how well the promotional copy of a WeChat official account performs. We have collected a large amount of advertorial data, including the number of likes, shares and page views. Because the official account stops showing the exact read count once it exceeds 100,000, the goal is to build a machine learning model that estimates how many page views an article can reach based on indicators such as likes and shares.

To estimate page views, the data set includes likes, shares, a heat index, and article ratings. These fields are the features; the page view count is the label.

Data collection and preprocessing

Data is the fuel of a machine learning model; good data makes the model run better.

The complete steps of data collection and pre-processing are as follows:

  1. Collect data

There are many ways to collect data. In practice, this means instrumenting the product with tracking events to capture user behavior such as purchases and interest preferences, or crawling data online. For details, refer to the Geek Time column Data Analysis in Action (45 lectures).

  2. Data visualization

The purpose is to observe the data intuitively: to see possible relationships between features and the label, to spot dirty data and outliers, and to get a feel for which machine learning model to choose.

    import pandas as pd

    df_ads = pd.read_csv('test.csv')   # load the collected data
    df_ads.head(10)                    # preview the first 10 rows

Two common Python data visualization tools are the plotting library Matplotlib and the statistical visualization tool Seaborn.

    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.plot(df_ads['likes'], df_ads['views'], 'r.', label='Training data')
    plt.xlabel('likes')
    plt.ylabel('views')
    plt.legend()
    plt.show()

As shown below:

You can almost see the linear correlation.

Next, take a look at the boxplot:

    # 'heat' stands in for the heat-index feature mentioned above; the original column names were garbled
    data = pd.concat([df_ads['views'], df_ads['heat']], axis=1)
    fig = sns.boxplot(x='heat', y='views', data=data)   # boxplot of views grouped by heat index
    fig.axis(ymin=0, ymax=800000)                       # set the y-axis range
  3. Data cleaning

Just as cleaner ingredients make a better dish, cleaner data makes a better model. Data cleaning mainly covers four cases (a small pandas sketch after the list illustrates them):

  • First, handle missing data. If missing values can be recovered from a backup system, fill them back in; otherwise, you can either remove the incomplete records or fill the gaps with the mean, a random value, or 0 computed from other records. This filling process is called data imputation.
  • Second, handle duplicate data. Rows that are complete duplicates can simply be deleted. If two different rows share the same primary key, check whether other auxiliary information (such as a timestamp) helps decide which one to keep; if not, you can only delete one at random or keep both.
  • Third, handle erroneous data: for example, when the sales volume or sales amount of a product is negative, the record needs to be deleted or transformed into a meaningful value. Likewise, a value greater than 1 in a field representing a percentage or probability is logically erroneous data.
  • Fourth, handle unusable data, which means tidying up the data format. For example, if some prices are in RMB and some in US dollars, the units need to be unified. Another example is converting yes/no values into 1/0 before feeding them into a machine learning model.
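A minimal, self-contained pandas sketch of the four cases; the toy frame and column names ('price', 'is_member') are made up for illustration:

    import numpy as np
    import pandas as pd

    # Toy frame standing in for collected data
    df_raw = pd.DataFrame({
        'price': [25.0, np.nan, -3.0, 25.0],
        'is_member': ['yes', 'no', 'yes', 'yes'],
    })

    df_raw['price'] = df_raw['price'].fillna(df_raw['price'].mean())   # 1. impute missing values with the mean
    df_raw = df_raw.drop_duplicates()                                   # 2. drop fully duplicated rows
    df_raw = df_raw[df_raw['price'] >= 0]                               # 3. remove logically erroneous records (negative price)
    df_raw['is_member'] = df_raw['is_member'].map({'yes': 1, 'no': 0})  # 4. convert yes/no to 1/0
    print(df_raw)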

How do we find the data in a data set that needs cleaning?

All NaN values can be counted using the DataFrame's isna().sum() method. NaN means Not A Number; in Python it stands for a value that cannot be represented or processed, which is typically dirty data.

    df_ads.isna().sum()   # count NaN occurrences per column

You can use the dropna() API to delete rows where NaN appears:

    df_ads = df_ads.dropna()   # delete rows in which NaN occurs

There are other methods of data cleansing that need to be addressed for specific projects and data sets.

  4. Feature engineering

Feature engineering is a specialized sub-field of machine learning and the most creative part of data processing. How well feature engineering is done greatly affects the effectiveness of a machine learning model.

What is feature engineering? Take BMI, a measure of physical fitness, as an example: it is weight divided by height squared, and computing it is itself a piece of feature engineering. Through this process, the BMI index replaces the original two features, weight and height, and still gives an objective description of body shape.

What are the benefits? The BMI feature reduces the dimensionality of the feature data set. Every additional feature enlarges the feature space the model has to fit and increases the amount of computation, so eliminating redundant features and reducing feature dimensionality makes machine learning models train faster.
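A minimal sketch of this BMI example, assuming hypothetical 'weight_kg' and 'height_m' columns:

    import pandas as pd

    people = pd.DataFrame({'weight_kg': [60.0, 85.0], 'height_m': [1.65, 1.80]})
    # Derive the BMI feature, then drop the two raw features it replaces
    people['bmi'] = people['weight_kg'] / people['height_m'] ** 2
    people = people.drop(['weight_kg', 'height_m'], axis=1)
    print(people)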

  5. Construct the feature set and the label set

Features are the collected data points, the variables to be fed into the machine learning model, while the label is the thing to be predicted, judged or classified. For all supervised learning, we need to feed both a "feature set" and a "label set" into the model.

Typically, the feature data set and the label data set are constructed from a data set that contains both, simply by removing the unwanted columns from the original data.

Such as:

    X = df_ads.drop(['views'], axis=1)   # feature set: all columns except the label
    y = df_ads['views']                  # label set

Unsupervised learning does not require such steps

  6. Split the training set, validation set, and test set

After the original data set has been split vertically (by columns) into the feature set and label set, it is further split horizontally (by rows). The main reason is that machine learning does not end once a model has been trained on the training set: we use the validation set to check whether the model is good, and then the test set to see whether the model still works on new data.

The split ratio depends on the amount of data, typically holding out 20% or 30%. The split is usually done with the train_test_split tool from scikit-learn:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
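train_test_split only produces two parts; if a separate validation set is also needed, one common approach is simply to apply it twice (the 0.25 below is an illustrative ratio that leaves roughly a 60/20/20 split overall):

    # First hold out 20% as the test set, then carve a validation set out of the remainder
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)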

Data collection and pre-processing summary:

Select an algorithm and train the model

Selection basis

The main task is to select an appropriate algorithm based on the relationship between the features and the label, find the corresponding algorithm package, and then build the model by calling it. From the previous step we saw that some features in this data set have an approximately linear relationship with the label, and the label is a continuous variable, so regression analysis is suitable for finding the prediction function from features to label.

Regression analysis is a statistical method for determining the quantitative interdependence between two or more variables. In other words, it studies how the dependent variable changes when the independent variables change, and it can be used to predict passenger flow, rainfall, sales volume and so on.

There are many regression algorithms, such as linear regression, polynomial regression and Bayesian regression; which to use depends on the relationship between the features and the label. To start with, if the features and the label appear to have a linear relationship, we can model it with the simplest and most basic machine learning algorithm, linear regression, which is the process of finding a parameter for each feature variable.

For example, the linear regression formula in mathematics is y = a*x + b. In machine learning, the slope a is called the weight and written w, and the intercept b is called the bias and written b, so the linear regression formula in machine learning is expressed as:

y = w*x + b
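For a quick worked example (with made-up numbers): if the learned weight is w = 2 and the bias is b = 1, then an input x = 3 is predicted as y = 2*3 + 1 = 7.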

Machine learning algorithm package

Sklearn (scikit-learn) is the most widely used open-source Python machine learning library. It provides a large number of tools for data mining, covering data preprocessing, visualization, cross-validation, and a variety of machine learning algorithms.

Build a model

Calling LinearRegression to build the model is very simple, as shown below:

    from sklearn.linear_model import LinearRegression

    linereg_model = LinearRegression()   # create the model with LinearRegression

A model has two kinds of parameters: internal and external. Internal parameters belong to the algorithm itself and are determined during training rather than set manually; for example, the weight w and intercept b of linear regression are internal parameters. External parameters are also called hyperparameters, and we set their values ourselves when creating the model. The external parameters of the LinearRegression model mainly consist of two Boolean values:

fit_intercept, which defaults to True and indicates whether to calculate the model intercept;

normalize, which defaults to False and indicates whether feature X is normalized before the regression.
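As a sketch, the same model with its external parameters spelled out explicitly (note that normalize was accepted by older scikit-learn versions and has been removed from recent ones, so it is shown only as a comment):

    linereg_model = LinearRegression(fit_intercept=True)   # explicitly request an intercept
    # In older scikit-learn versions: LinearRegression(fit_intercept=True, normalize=False)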

Train (fit) the model

Training the model means using the feature variables and known labels in the training set to gradually fit the function according to the loss over the samples, determine the optimal internal parameters, and finally complete the model.

    linereg_model.fit(X_train, y_train)   # use the training set to fit the function and determine the internal parameters

Thanks to the machine learning library, training is completed with a single call to fit. The core of fit is optimizing the internal parameters to reduce the loss, so that the function models the relationship between features and labels better and better, until it finds a set of model parameters with a small average loss over all samples. The key is to optimize the model parameters step by step through gradient descent so that the error on the training set is minimized.

Gradient descent: take the derivative to find the direction of each step, ensuring you always move in the direction of a smaller loss.
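A minimal, self-contained sketch of gradient descent for one-feature linear regression (the toy data and learning rate are made up for illustration; in practice the fit call above does this for us):

    import numpy as np

    def gradient_descent(X, y, lr=0.01, epochs=1000):
        w, b = 0.0, 0.0
        n = len(X)
        for _ in range(epochs):
            error = (w * X + b) - y                # prediction error for every sample
            dw = (2.0 / n) * np.dot(error, X)      # derivative of the MSE loss w.r.t. w
            db = (2.0 / n) * error.sum()           # derivative of the MSE loss w.r.t. b
            w -= lr * dw                           # step in the direction of smaller loss
            b -= lr * db
        return w, b

    # Toy data roughly following y = 2x + 1
    x_toy = np.array([1.0, 2.0, 3.0, 4.0])
    y_toy = np.array([3.1, 4.9, 7.2, 8.8])
    print(gradient_descent(x_toy, y_toy))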

Evaluate and optimize model performance

When evaluating the model on the validation set and test set, we optimize the hyperparameters (the model's external parameters) by minimizing the error. Machine learning packages such as scikit-learn provide common tools and metrics for evaluating validation and test sets and calculating the current error. For example, the R-squared score or the MSE (mean squared error) metric can be used to judge how good a regression model is.

Prediction method:

Usually we call the model's predict method directly:

    y_pred = linereg_model.predict(X_test)   # predict labels for the test set

To compare, combine the original feature data of the test set, the true label values and the model's predicted label values into one table for display:

    df_ads_pred = X_test.copy()                  # feature data of the test set
    df_ads_pred['true views'] = y_test           # true label values
    df_ads_pred['predicted views'] = y_pred      # predicted label values

What does the trained model look like? The LinearRegression attributes coef_ and intercept_ give the weight of each feature and the model's bias, which are the model's internal parameters:

    linereg_model.coef_
    linereg_model.intercept_

Model evaluation scores: two metrics are commonly used to evaluate regression models, the R-squared score and MSE, and most machine learning toolkits provide tools for both. Here is how to evaluate the model using the R-squared score:

    linereg_model.score(X_test, y_test)   # R-squared score on the test set
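The same two metrics can also be computed explicitly with scikit-learn's metrics module; a small sketch using the y_pred computed above:

    from sklearn.metrics import mean_squared_error, r2_score

    print('R-squared:', r2_score(y_test, y_pred))
    print('MSE:', mean_squared_error(y_test, y_pred))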

A machine learning project is an iterative process; excellent models are the product of repeated iteration. Model evaluation also has to be repeated to find the optimal hyperparameters and determine the final model.

Serving the model online

Mainstream model service methods:

There are several deployment modes for model services, such as pre-storing prediction results, PMML-based model conversion and serving, and TensorFlow Serving. The first two are not end-to-end training-plus-deployment solutions, and PMML's expressive power is limited for complex deep learning models, so deep learning models generally require TensorFlow Serving.

  • Pre-store the prediction results

The offline prediction results are stored in an online database such as Redis, and the online environment simply looks up the stored data and returns it to the application.

  • Transform and deploy models using PMML

PMML (Predictive Model Markup Language) is an XML-based standard for describing models; JPMML is a library for serializing models to PMML and parsing PMML files.

  • TensorFlow Serving

The TensorFlow model goes online roughly as follows: first, the model is serialized offline and saved to the file system; TensorFlow Serving then loads the model file onto the model server, restores the inference process, and exposes the model service externally through an HTTP or gRPC interface.
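A minimal sketch of that flow using the official tensorflow/serving Docker image; the model name and paths are placeholders for illustration:

    # Assume a SavedModel has been exported to /models/my_model/1 on the host
    docker run -d -p 8501:8501 \
      --mount type=bind,source=/models/my_model,target=/models/my_model \
      -e MODEL_NAME=my_model tensorflow/serving
    # Query the REST (HTTP) prediction endpoint
    curl -d '{"instances": [[1.0, 2.0, 3.0]]}' \
      http://localhost:8501/v1/models/my_model:predict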

Conclusion

This article first introduced machine learning, explained that machine learning is a technology that helps us predict, judge, group and solve problems by learning rules and discovering patterns from data, and summarized the classification of machine learning and deep learning.

Next, it showed how to prepare an environment for getting started with machine learning using Jupyter and how to install the relevant machine learning packages to process, analyze and visualize data.

Then it walked through the entire machine learning process in practical work: from data collection and preprocessing, to selecting an algorithm and building the model, to training and evaluating the model and optimizing its performance, and finally to offline batch prediction and online model serving after training is complete.

That is roughly the whole process of machine learning. To apply it to a business, you need to start from the business itself, determine the business problem to be solved, collect the relevant data for that problem, then experiment with different algorithms and evaluate the models' effectiveness for the business. Each step involves different tools and services, from big data offline batch processing and real-time stream processing to machine learning and deep learning training frameworks such as Spark, Flink, TensorFlow and PyTorch.

This article is based on studying the Machine Learning from Scratch course, with my own understanding and related content added, and is at best an introductory summary. Machine learning as a whole covers a great deal: not just algorithms but also large amounts of AI data engineering and the back-end technology stack. Becoming proficient requires going deeper into both back-end technology and AI algorithms, and combining them with the business context in real projects.

References

zhuanlan.zhihu.com/p/74874291

Logistic Regression – Radical Snail, CSDN blog

zhuanlan.zhihu.com/p/33794257

Geek Time: Machine Learning from Scratch column