With the popularity of artificial intelligence, many enterprises and even individual data mining enthusiasts are beginning to mine the value in their own data. The machine learning infrastructure consists of data, algorithms, and tools. Having covered data and algorithms, this article focuses on machine learning tools.


In terms of computing power, machine learning tools fall into two types: single-machine computing and cluster computing. This article introduces stand-alone machine learning tools, open source distributed machine learning tools, and enterprise cloud machine learning tools, as shown in Figure 1-1.

Figure 1-1 Machine learning tools

First, the stand-alone machine learning tools are introduced. If you have any experience in data mining, you will be familiar with tools such as SPSS and R. These are representative stand-alone machine learning tools, and each has its own strengths: SPSS is convenient to operate, while R makes plotting relatively simple.

The standalone tool is easy to install because it does not depend on the configuration of the underlying computing cluster.

However, stand-alone tools cannot compare with distributed machine learning tools in computing power. They are suited to small data experiments, plotting, and display, and are relatively weak for enterprise-level data processing and business services.

After introducing the stand-alone machine learning tools, let's move on to the distributed machine learning tools. The author believes that a truly intelligent computing platform must be able to process large-scale data and provide rich algorithms. Generally speaking, a complete machine learning tool architecture consists of four layers, as shown in Figure 1-2.

Figure 1-2 Architecture of an intelligent machine learning tool

From the top down, the first layer holds business requirements, such as building an advertising DSP system or a product recommendation engine. These business scenarios are abstracted into problems that the underlying machine learning algorithms can solve.

These underlying machine learning algorithms, such as K-means, LR (logistic regression), and RF (random forest), must be mapped onto distributed computing architectures and implemented through distributed programming models such as MPI and MapReduce. Finally, the distributed code deploys tasks to the underlying computing engine.
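To illustrate this mapping, one iteration of K-means can be phrased in map/reduce terms. The sketch below is a pure-Python stand-in for a real distributed framework; all function names and data are hypothetical and only show how the assignment step maps over points while the reduce step recomputes centroids.

```python
# Hypothetical sketch: one K-means iteration expressed in the
# map/reduce pattern described above. Pure Python, not a real
# distributed framework; names and data are illustrative only.

def kmeans_map(point, centroids):
    """Map step: assign a point to its nearest centroid."""
    best = min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))
    return best, point  # key = centroid id, value = point

def kmeans_reduce(assigned_points):
    """Reduce step: average the points assigned to one centroid."""
    n = len(assigned_points)
    dims = len(assigned_points[0])
    return [sum(p[d] for p in assigned_points) / n for d in range(dims)]

points = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]]
centroids = [[0.0, 0.0], [10.0, 10.0]]

# Shuffle: group the mapped pairs by centroid id.
groups = {}
for pt in points:
    cid, val = kmeans_map(pt, centroids)
    groups.setdefault(cid, []).append(val)

new_centroids = {cid: kmeans_reduce(vals) for cid, vals in groups.items()}
print(new_centroids)  # two updated cluster centers
```

In a real MapReduce or Spark job, the map calls would run in parallel across partitions of the data, and the shuffle would move values with the same key to the same reducer.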

At present, with the development of cloud computing and intelligent algorithms, there are many choices for building an intelligent machine learning platform, including the open source combination of a cluster plus Spark and MLlib, and the enterprise-level machine learning platform services of some cloud providers. Mature offerings include Amazon's AWS Machine Learning, Microsoft Azure Machine Learning Studio, and Alibaba Cloud Machine Learning PAI.

Building an algorithm platform on the open source stack may improve flexibility in data flow and algorithm customization, but it also costs a great deal in cluster operation and maintenance and in algorithm development.

Next, stand-alone machine learning tools, open source distributed machine learning tools, and enterprise-level cloud machine learning tools will be introduced, mainly in terms of dependencies, ease of operation, and algorithm richness (note: the experimental environment in this article is the Mac OS system).

Stand-alone machine learning tools

For ordinary users, especially data mining beginners with weaker algorithm skills, stand-alone machine learning tools are quicker to pick up. This section focuses on two tools, SPSS and RStudio.

1.1.1 SPSS

(1) Introduction. Statistical Product and Service Solutions (SPSS) is the world's earliest statistical analysis software. It was developed in 1968 by Norman H. Nie, C. Hadlai (Tex) Hull, and Dale H. Bent of Stanford University. The SPSS company was founded at the same time, incorporated in 1975, and headquartered in Chicago.

On July 28, 2009, IBM announced a $1.2 billion cash acquisition of SPSS, a provider of statistical analysis software. The product has since been released up to version 22.0 and renamed IBM SPSS. By now, SPSS has been developing for more than 40 years.

The main characteristic of SPSS is its extremely friendly interface. Almost all functions are presented in a unified, standardized interface: windows display the various data management and analysis functions, and dialog boxes expose the functional options. As long as users have basic Windows skills and are familiar with the principles of statistical analysis, they can use the software for research.

(2) Installation. SPSS is paid software and is easy to install: log in to the official website to download SPSS, purchase a license, then install and register. SPSS 21.0 is used in this demonstration; the product interface is shown in Figure 1-3.

Figure 1-3 SPSS page

(3) Run experiments. Open SPSS, which prompts you to import a data source. SPSS supports multiple data source inputs, as shown in Figure 1-4.

Figure 1-4 SPSS data source

The data imported here is a set from the UCI open source dataset repository for a binary classification scenario, and it is used to train a logistic regression binary classification model. After the data is imported into SPSS, the dioxide_A field is the target column (taking the values 0 and 1), and the other fields are feature columns, as shown in Figure 1-5.

Figure 1-5 Data import

Open the Analyze menu on the menu bar, as shown in Figure 1-6, and select Binary Logistic Regression. Here, "Dependent variable" represents the target column, and "Covariates" represent the feature fields. Click "OK" to start model training.

Figure 1-6 Logistic regression settings

The final output results can be displayed through the output viewer. The evaluation module of the model is shown in Figure 1-7.

Figure 1-7 Model evaluation

The model evaluation module of logistic regression reports multiple statistics for each feature. Among these statistical indicators, the following four are key.

"B": the partial regression coefficient, that is, the coefficient of the generated linear model.

"S.E.": the standard error.

"Wals": the Wald statistic.

"Exp(B)": the odds ratio of the variable.
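The last two indicators are directly related: Exp(B) is simply e raised to the coefficient B, so it can be checked with one line of arithmetic. The coefficient value below is an example, not taken from any particular SPSS output.

```python
import math

# Exp(B) is the odds ratio: for a one-unit increase in the feature,
# the odds of the positive class are multiplied by exp(B).
B = 0.399                    # example partial regression coefficient
odds_ratio = math.exp(B)
print(round(odds_ratio, 3))  # ≈ 1.49
```

A coefficient of 0 therefore corresponds to an odds ratio of 1, meaning the feature has no effect on the odds.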

Finally, the logistic regression model generated by this experiment can be expressed as:

Logit = 3.056 + 0.399 × fixed_acidity − 3.895 × volatile_acidity − 2.884 × citric_acid − 0.006 × residual_sugar − 2.473 × chlorides + 0.039 × free_sulfur_dioxide − 0.026 × total_sulfur_dioxide − 15.696 × density + 0.847 × pH + 0.879 × sulphates + 1.925 × alcohol

From the overall design perspective, SPSS is above all a statistical package. Data is manipulated in an Excel-like form, which greatly lowers the threshold for data operators. However, customizing the data processing, whether through scripts or data conversion tools, is difficult.

For large-scale data mining, SPSS has certain limitations in both algorithmic freedom and efficiency.

1.1.2 R language

(1) Introduction. If you're in the data mining business, you've probably heard of R; familiarity with R is one of the basic requirements for data mining engineers.

What are the features of R?

Let's look at it briefly. By way of background, R is a piece of software that combines statistical computing and graphics. Its predecessor is the S language, which was developed at the famous AT&T Bell Labs for data analysis and plotting.

Later, Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand extended the S language, giving birth to the early form of the R language.

The R language has the following advantages.

Open source. R is a fully open source tool. Because of this, data developers are free to read the R source code and extend it, which is why R has been able to develop so rapidly in a short period of time.

Open source enthusiasts around the world contribute packages to R every day, and users can easily install these extension algorithms with the install.packages command. Unlike SPSS and similar software, R lets users freely modify existing algorithms to better fit their own business scenarios.

Cross-platform. The cross-platform nature of R has greatly accelerated the spread of the technology, and there are currently stable versions available on Mac OS, Windows, and Linux. Users need only one set of code to run business logic on different platforms.

Relatively complete materials. R currently has a large number of open source contributors and is widely used in both academia and industry. Its many users have contributed a wealth of reference materials and example code, and books already cover many applications of R.

Visualization. R is also distinctive in data visualization, providing many plotting packages and rich plotting functions that make the generated data easy to visualize. For example, consider drawing a Sigmoid function curve over the domain [-3, 3]. The Sigmoid function formula is f(x) = 1 / (1 + e^(-x)).

You only need to enter the following commands to get the plot shown in Figure 1-8.

Figure 1-8 R drawing

> x <- seq(-3, 3, by = 0.01)

> y<-1/(1+exp(-x))

> plot(x,y)

R is used through the command line, and its syntax is relatively simple and easy to understand. With its rich algorithm packages, a beginner can basically run through a complete data mining experiment in half a day. RStudio is an IDE for R. The following case study explains how to run the logistic regression algorithm through RStudio.

(2) Installation. The experimental environment of this book is Mac OS X 10.11.1 El Capitan. To use RStudio, you first need to install the R language package, which you can download from the official website at https://www.r-project.org/. After installation succeeds, opening R presents a command line interface, as shown in Figure 1-9.

Figure 1-9 R terminal

After R is installed, you can install RStudio from https://www.rstudio.com/, as shown in Figure 1-10.

Figure 1-10 RStudio page

(3) Run experiments. After installing RStudio, this experiment trains a logistic regression model on an open source dataset. First, import the data. RStudio supports data import in a variety of formats, some of which may require installing the corresponding package.

In this experiment, a CSV file is imported. There are two import methods: use the Import Dataset button provided by RStudio, or call the following function.

> data <- read.csv("~/Documents/work/book/data/data.csv", sep = ";")
> View(data)

After the data is imported, you can view the data visually, as shown in Figure 1-11.

Figure 1-11 R data import

Here is how to perform logistic regression on the data. In RStudio, only the following line of code is needed.

mylogit <- glm(label ~ ., data = data, family = binomial(link = "logit"))

mylogit is the name of the logistic regression object.

glm is the generalized linear model function.

label is the target column, and ~ . represents all fields other than the target column.

data is the dataset.

binomial indicates binary classification.

link = "logit" indicates logistic regression.

Users can view the generated model through the summary function.

summary(mylogit)

Figure 1-12 shows the results.

Figure 1-12 Logistic regression results

The Estimate column in the result (see Figure 1-12) contains the generated logistic regression model coefficients, from which we can get the final logistic regression model as follows.

Logit = 3.05 + 0.39 × fixed_acidity − 3.89 × volatile_acidity − 2.88 × citric_acid − 0.006 × residual_sugar − 2.47 × chlorides + 0.03 × free_sulfur_dioxide − 0.02 × total_sulfur_dioxide − 15.69 × density + 0.84 × pH + 0.87 × sulphates + 1.92 × alcohol
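To make the formula concrete, the sketch below scores one hypothetical sample with these coefficients: the linear term gives the logit, and the sigmoid turns it into a probability. The feature values are invented for illustration; only the coefficient pattern comes from the model above.

```python
import math

# Hypothetical sketch: scoring one sample with fitted logistic
# regression coefficients. The sample values are invented for
# illustration and do not come from the experiment's dataset.
intercept = 3.05
coefs = {
    "fixed_acidity": 0.39, "volatile_acidity": -3.89, "citric_acid": -2.88,
    "residual_sugar": -0.006, "chlorides": -2.47,
    "free_sulfur_dioxide": 0.03, "total_sulfur_dioxide": -0.02,
    "density": -15.69, "pH": 0.84, "sulphates": 0.87, "alcohol": 1.92,
}
sample = {
    "fixed_acidity": 7.0, "volatile_acidity": 0.27, "citric_acid": 0.36,
    "residual_sugar": 20.7, "chlorides": 0.045,
    "free_sulfur_dioxide": 45.0, "total_sulfur_dioxide": 170.0,
    "density": 1.001, "pH": 3.0, "sulphates": 0.45, "alcohol": 8.8,
}
logit = intercept + sum(coefs[k] * sample[k] for k in coefs)
probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the logit
print(round(probability, 4))
```

A positive logit maps to a probability above 0.5, so this hypothetical sample would be classified as the positive class.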

In addition, users can also check the model fit through R language’s powerful graphical display capabilities.

> plot(mylogit)

Figure 1-13 shows the result.

Figure 1-13 Graph fitting curve

Through the above experiments, readers can gain a brief understanding of the syntax and operation of RStudio and the R language. As can be seen, R's syntax is very easy to understand, and the graphical display of results lets data operators observe them more intuitively.

In terms of data source support, RStudio supports importing local data sources as well as server connections. The supported formats can be extended by installing the corresponding plug-ins; they basically cover all the formats SPSS supports, with good support for various database files.

In terms of algorithms, R's support comes from the open source community, so there are many algorithm packages to choose from, covering basic feature engineering as well as conventional machine learning algorithms such as classification, clustering, regression, and neural networks. The algorithms are also extensible, supporting further custom modification.

Because R has so many excellent features, more and more distributed systems have recently been adapting R, in the hope that R can also achieve distributed computing and break through its current bottleneck in computing resources; a future cloud RStudio will be even more exciting.

In conclusion, R is a nearly ideal experimental environment for data mining engineers, especially for the visual presentation of computational results.

Open source distributed machine learning tools

SPSS and RStudio are stand-alone machine learning tools. As stand-alone tools, they require no attention to cluster configuration, operation, or maintenance, so they are easy to install and get started with.

However, in actual use, especially when the amount of data is relatively large, they run into efficiency problems. Large-scale machine learning computation needs to be handled by a distributed architecture. This section focuses on two popular distributed machine learning tools, Spark MLlib and TensorFlow.

1.2.1 Spark MLlib

1. Introduction

MLlib is Spark's machine learning algorithm library and is completely open source. Machine learning systems based on the Spark framework are widely used in various fields.

Since Spark and Hadoop's MapReduce framework are two of the most mainstream open source distributed architectures in the industry, it is natural to compare them from the perspective of machine learning algorithm support, as follows.

(1) Support for multi-step iteration. As the algorithm chapters showed, most algorithms can only be realized through multi-step iterative computation. Gradient descent, for example, must evaluate the loss function over many iterations to gradually approach the optimal solution.

Hadoop's traditional MapReduce computing framework reads and writes to disk in every iteration, which causes high I/O consumption and reduces efficiency.

The Spark distributed computing framework performs iterative computation in memory. By keeping most of the computation in memory, it greatly reduces reads and writes to disk and improves the computing efficiency of iterative algorithms.
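The contrast above comes down to where the state of a multi-step iteration lives. The toy gradient descent below, in plain Python, shows the iteration pattern itself: the parameter is updated hundreds of times, and in Spark both the parameter and the data would stay in memory between steps rather than round-tripping through disk as in MapReduce.

```python
# Minimal sketch of the multi-step iteration pattern: gradient descent
# on a one-dimensional least-squares loss. In MapReduce each iteration
# would round-trip through disk; Spark keeps `w` and the data in memory.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs

w = 0.0      # model parameter, kept in memory across iterations
lr = 0.01    # learning rate
for step in range(500):
    # gradient of mean squared error for the model y ≈ w * x
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 2))  # converges close to the true slope of ~2
```

Each pass over `data` corresponds to one distributed job; the speedup from Spark comes from caching `data` as an in-memory RDD so that 500 such passes avoid 500 rounds of disk I/O.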

(2) Cluster communication. Spark's Akka- and Netty-based communication systems are far more efficient than Hadoop's JobTracker communication mechanism in both message passing and data transfer.

The above analyzes Spark's advantages over MapReduce from the perspective of distributed computing architecture. The following describes some properties of the Spark MLlib library.

MLlib is a distributed machine learning algorithm library designed to make machine learning algorithms easier to use and extend.

In terms of dataset support, Spark MLlib supports local vectors and matrices, as well as the underlying Resilient Distributed Dataset (RDD). An RDD is an abstraction over distributed memory that provides a highly constrained shared-memory model; it can be thought of as an object in Spark MLlib that lives in memory.

This concludes the basic introduction to Spark MLlib. Here is how to set up the Spark MLlib machine learning system.

2. Installation and Configuration Environment

(1) Download Spark first. The experimental environment is Mac OS, and the JDK needs to be installed. Spark can be downloaded from http://spark.apache.org/downloads.html. After the download completes, decompress it, enter the Spark directory in a terminal, and execute the following command to start Spark.

./sbin/start-master.sh

After Spark starts, you can log in to localhost:8080, as shown in Figure 1-14.

Figure 1-14 Logging in to Spark

(2) We find that Workers and Running Applications are empty at this point. Because Spark is a computing framework for distributed systems, Workers need to be added for the system to run; otherwise it cannot be used.

For convenience of explanation, the local machine is added as a Worker here; adding other cluster machines follows the same principle. To add a Worker, run the following command:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

If the local host is added, the IP:PORT value can be read from the boxed area in Figure 1-15.

Figure 1-15 IP figure

After adding the machine as a Worker, refresh localhost:8080 to see the Worker in the list, as shown in Figure 1-16.

Figure 1-16 Adding a Worker

1.2.2 TensorFlow

1. Introduction

TensorFlow is an open source machine learning framework based on the well-known DistBelief system. It was originally developed by the Google Brain team as a tool for studying deep neural networks, but as the architecture evolved, the system was adapted to many different scenarios.

When Google open-sourced TensorFlow in 2015, the IT industry's response was swift. Android, Google's flagship open source product, had taken over the mobile market, and people speculated that TensorFlow could be Google's next big move, this time into artificial intelligence.

TensorFlow currently has excellent features, and distributed computing is supported in newer versions. TensorFlow looks set to lead a trend in machine learning for some time to come.

So what is TensorFlow, taken literally? A Tensor is a multidimensional array of data. In TensorFlow, data flows through algorithm nodes in the form of data streams. We illustrate this with an architectural flow chart for deep learning (see Figure 1-17).

Figure 1-17 Deep learning

In this deep learning architecture diagram, each vertical unit in Figure 1-17 represents a layer of the algorithm, including the input layer, hidden layers, and output layer; each circular unit is a computing node.

Data in TensorFlow flows through the compute nodes as data streams; Flow refers to this flow of data from front to back. Taken literally, then, the name TensorFlow already describes its computational form.

Here is a brief introduction to some of the features of TensorFlow.

(1) Flexibility. TensorFlow's flexibility lies not only in its algorithm support but also in its architecture. TensorFlow supports both single-machine and distributed computing, and can switch between CPUs and GPUs.

In terms of algorithm support, TensorFlow is not only a neural network library but also a programming architecture for machine learning. Developers can express their algorithm logic as flow graphs and then run their own algorithms on the TensorFlow architecture.

(2) Ease of use. With TensorFlow, you set up the architecture and the objective function by hand and feed data into the system; TensorFlow then calculates the gradients automatically. The calculations and weight updates are done automatically, and the system provides ways for users to monitor the process.
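To give a feel for what "calculates gradients automatically" means, here is a toy forward-mode automatic differentiation sketch using dual numbers. This is not how TensorFlow is implemented (it uses reverse mode over the computation graph); it only illustrates that derivatives can be propagated mechanically alongside values, with no hand-derived formulas.

```python
# Toy illustration of automatic differentiation (forward mode with
# dual numbers). TensorFlow actually uses reverse mode over the graph;
# this only shows that gradients can be derived mechanically.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def loss(w):
    # f(w) = (3w + 1)^2, so f'(w) = 6 * (3w + 1)
    t = w * Dual(3.0) + Dual(1.0)
    return t * t

w = Dual(2.0, 1.0)           # seed derivative dw/dw = 1
out = loss(w)
print(out.value, out.deriv)  # f(2) = 49.0, f'(2) = 42.0
```

The user only wrote the forward computation `loss`; the derivative 42.0 falls out of the overloaded arithmetic, which is the same convenience TensorFlow provides at graph scale.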

In terms of usage, although TensorFlow's underlying code is written in C++, computational flow graphs can be created through the Python interface, which makes it easier for users to write the logic of their computations.

(3) Good resource scheduling. TensorFlow helps developers make the most of their computing resources. The scheduling of computing resources can be highly customized; CPUs and GPUs can be freely assigned, and threads, queues, and asynchronous computation are supported.

TensorFlow allows developers to leverage their hardware resources and allow data to flow freely across different machines.

2. Experimental environment construction

The basic features of TensorFlow have been introduced above. Now we set up an experimental environment for TensorFlow and run the program most familiar to programmers: Hello World.

(1) Install pip. pip is the Python package installer; on Mac systems, TensorFlow can be installed automatically through pip.

sudo easy_install pip

sudo easy_install --upgrade six

If pip is already installed, you can skip this step.

(2) Install Virtualenv. Virtualenv is a tool for isolating Python environments. TensorFlow requires adjusting environment parameters, so it is recommended to install Virtualenv to isolate the Python environment.

To install Virtualenv, run the following command:

sudo pip install --upgrade virtualenv

Then create a tensorflow directory in Virtualenv using the following command.

virtualenv --system-site-packages ~/tensorflow

To activate the environment, run the activate command that matches your shell.

source ~/tensorflow/bin/activate      # if using bash
source ~/tensorflow/bin/activate.csh  # if using csh

(3) Install TensorFlow. TensorFlow can now be installed into this environment using pip, depending on the Python version.

# Python 2

(tensorflow)$ pip install --upgrade $TF_BINARY_URL

# Python 3

(tensorflow)$ pip3 install --upgrade $TF_BINARY_URL

The TF_BINARY_URL environment variable is selected according to the system version, the Python version, and whether GPU support is required.

# Mac OS X, CPU only, Python 2.7:
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.10.0-py2-none-any.whl

# Mac OS X, GPU enabled, Python 2.7:
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/gpu/tensorflow-0.10.0-py2-none-any.whl

# Mac OS X, CPU only, Python 3.4 or 3.5:
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.10.0-py3-none-any.whl

# Mac OS X, GPU enabled, Python 3.4 or 3.5:
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/gpu/tensorflow-0.10.0-py3-none-any.whl

(4) Examples. After activation, the shell prompt is prefixed with (tensorflow), indicating that the Virtualenv environment is active. Here is the code to execute Hello World.

import tensorflow as tf

hello = tf.constant('Hello world!')
sess = tf.Session()
print(sess.run(hello))

tf.constant creates a constant tensor in TensorFlow, which will not be discussed in detail here. The main concept to explain is the Session. In the TensorFlow system, users interact with TensorFlow through sessions: the general pattern is to build a graph of nodes and edges, establish a Session, and then use the Session to run the graph. Execute the code file above to see the returned result, as shown in Figure 1-18.
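The build-then-run pattern described above can be mimicked in a few lines of plain Python, where constructing a node only records a deferred computation and nothing executes until run is called. This is a toy analogy, not TensorFlow code; all names are invented.

```python
# Toy sketch of the "build first, run later" Session pattern.
# Not TensorFlow: node construction just records a thunk, and
# evaluation is deferred until Session.run is called.
class Node:
    def __init__(self, fn):
        self.fn = fn  # deferred computation

def constant(value):
    return Node(lambda: value)

def add(a, b):
    return Node(lambda: a.fn() + b.fn())

class Session:
    def run(self, node):
        return node.fn()   # evaluation happens only here

hello = constant("Hello world!")   # nothing computed yet
total = add(constant(40), constant(2))

sess = Session()
print(sess.run(hello))  # Hello world!
print(sess.run(total))  # 42
```

Deferring execution like this is what lets a framework inspect and optimize the whole graph (and place pieces on CPUs, GPUs, or remote workers) before any computation runs.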

Figure 1-18 Hello World result

To sum up, a standalone version of TensorFlow has been installed successfully and the Hello World experiment has passed.

Enterprise cloud machine learning tools

The preceding sections introduced stand-alone machine learning tools and open source distributed machine learning tools. Although most of these tools have friendly operation modes and rich algorithms, they still fall short for enterprise-level services. Next, the Amazon machine learning platform and the Alibaba Cloud machine learning platform PAI are introduced in detail.

1.3.1 Amazon AWS ML

Amazon Web Services (AWS) is a cloud computing service launched by Amazon in 2006. Its main advantage is replacing upfront infrastructure capital costs with low costs that scale with business growth. According to Amazon, AWS already supports businesses in 190 countries and regions around the world.

AWS is the current leader in cloud computing, having beaten IBM to a large CIA cloud service contract. Amazon Machine Learning, launched in April 2015, is a service that helps developers develop and deploy predictive models using historical data.

These models have a wide range of uses, including fraud detection, preventing user churn, and improving user support. Amazon Machine Learning uses wizards to guide developers through creating and debugging machine learning models, and through deploying and scaling models to support billions of predictions.

Amazon also sets up experiments with guides, and integrates its machine learning service with Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS), letting customers use data already stored in AWS cloud services and completing the whole cloud service ecosystem.

1.3.2 Aliyun Machine learning PAI

Amazon's enterprise-level machine learning service AWS ML was introduced above. Next, we introduce a relatively mature machine learning platform in China: Machine Learning PAI from Alibaba Cloud.

Alibaba Cloud Machine Learning PAI is a machine learning platform covering almost all kinds of machine learning algorithms. Its underlying engine is the Feitian (Apsara) distributed computing engine developed by Alibaba Cloud, which can process EB-scale data.

The algorithm platform itself covers a whole machine learning pipeline from data preprocessing, feature engineering, and machine learning algorithms to model evaluation, prediction, and deployment. Because it runs through the entire data mining workflow, the cloud machine learning platform can serve not only as a research tool but also as an enterprise algorithm solution.

Let’s take a look at the functional architecture of the product, as shown in Figure 1-19.

Figure 1-19 Machine learning platform architecture

Analyzing the architecture diagram from the bottom up, at the bottom of Alibaba Cloud machine learning is a distributed computing engine that supports heterogeneous scheduling (that is, mixed deployment of CPUs and GPUs). GPUs are supported mainly to better serve the deep learning algorithms in the upper layers.

The computing architecture layer above the computing infrastructure supports several mainstream distributed architectures such as MR and PS (parameter server). In actual operation, users are unaware of these two layers; they only need to consider which algorithms fit their own scenarios, thus decoupling algorithms from the computing architecture.

Compared with AWS ML's wizard-style model building process, the Alibaba Cloud machine learning platform builds experiment workflows by drag and drop, as shown in Figure 1-20. The wizard style is easier to get started with, but Alibaba Cloud's drag-and-drop operation better guarantees the extensibility and customizability of experiments.

Figure 1-20 Interface of Aliyun machine learning platform

Entering the operation interface, we can see draggable algorithm components on the left. Drag components onto the middle canvas, connect them along the algorithm logic, and define the parameters of each component in the settings box on the right.

The experience feels like building blocks: without having to consider underlying computing resources or operation and maintenance, algorithm engineers only need to focus on business scenarios and algorithm composition.

When experimenting with the Alibaba Cloud machine learning platform, note that there are many components to choose from for data preprocessing, feature engineering, machine learning algorithms, and other steps (AWS ML has relatively few options), so users need a certain foundation in machine learning.

                                                     

Machine Learning Practice and Application

By Li Bo

This book comprehensively introduces the theoretical basis and practical application of machine learning by explaining the background knowledge, algorithm flow, relevant tools, practical cases and knowledge graph of machine learning. The book involves a number of typical algorithms in the field of machine learning, and gives the algorithm flow of machine learning in detail.

This book is suitable for anyone with some background in data and programming. By reading this book, readers can not only understand the theoretical basis of machine learning, but also reference some typical application cases to expand their professional skills. The book is also suitable for students majoring in computer science and those interested in artificial intelligence and machine learning.

                                                   
