Original link: tecdat.cn/?p=7923

Original source: the Tuoduan Data Tribe WeChat public account

 

One of the main benefits of using R and Python for analysis is that new, freely available tools are constantly emerging from their vibrant open source ecosystems. Today, more and more data scientists can work with data across R, Python, and other platforms at the same time, because vendors are introducing high-performance products with APIs for R and Python, and often Java, Scala, and Spark as well.

H2O brands itself as "AI for business," aiming to "make it easy for anyone to apply math and predictive analytics to solve today's most challenging business problems." What sets H2O apart is a comprehensive, open source, cross-platform machine learning infrastructure built from the ground up for scalability and speed.

 

In this exercise, I used R's data management capabilities to build the model data set and then imported it into H2O data structures to run the models, calling the H2O functions easily from R.

The sequence of tasks starts with data loading and construction of the training/test data sets. The H2O server is then started, followed by computing and rendering results for a GLM, a GLM with cubic splines, gradient boosting, random forest, and deep learning models. Timings are reported for H2O data set construction and model training.

First, load the R libraries and set the working directory.
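The original code is not shown; a minimal sketch of this setup step might look like the following (the specific library set and directory path are assumptions, not from the original post):

```r
# Load libraries for fast data handling and for the H2O interface
# (this particular set of libraries is illustrative)
suppressMessages({
  library(data.table)  # fast loading and manipulation of large tables
  library(h2o)         # R interface to the H2O machine learning server
})

# Point R at the directory holding the ACS extract (path is hypothetical)
setwd("c:/data/acs")
```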

 

Now load and subset the data for the modeling exercise. There are 8,644,171 cases and 7 attributes.
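A sketch of the loading step, assuming a CSV extract read with `data.table::fread`; the file name and the two columns beyond the five named later in the post (`state`, `weight`) are hypothetical placeholders:

```r
library(data.table)

# Read the ACS 2014 extract (file name is an assumption)
acs2014 <- fread("acs2014.csv")

# Keep the 7 attributes used in the exercise; the last two column
# names are placeholders, since the post only names the first five
acs2014 <- acs2014[, .(logincome, age, sex, race, education, state, weight)]

dim(acs2014)  # the original data has 8,644,171 rows and 7 columns
```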

 

 

The next step is to split acs2014 into training and test data sets in R. For our analysis, the dependent variable is log income (logincome), while the features include age, sex, race, and education level.
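A simple random split along these lines would work; the 75/25 fraction and the seed are assumptions, since the post does not state them:

```r
set.seed(42)  # seed value is illustrative

# Randomly assign ~75% of rows to training, the remainder to test
idx   <- sample(nrow(acs2014), size = floor(0.75 * nrow(acs2014)))
train <- acs2014[idx]
test  <- acs2014[-idx]
```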

 

Start the H2O server, allocating 16 GB of RAM and using all 8 cores.
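With the `h2o` package, this is a single call to `h2o.init`, whose `nthreads` and `max_mem_size` arguments control the cores and heap size described above:

```r
library(h2o)

# Launch (or connect to) a local H2O instance:
# 8 worker threads and a 16 GB Java heap
h2o.init(nthreads = 8, max_mem_size = "16G")
```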

 

Now create the H2O data structures from the R data.tables. We can do the data processing with data.frames/data.tables, or work directly with H2O data structures and functions.
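The standard way to push R tables into the H2O cluster is `as.h2o`; the object and frame names below carry over from the sketches above and are assumptions:

```r
# Copy the R data.tables into the H2O cluster as H2OFrames
train.h2o <- as.h2o(train, destination_frame = "train")
test.h2o  <- as.h2o(test,  destination_frame = "test")
```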

 

 

A generalized linear model (GLM) was run on the training data, regressing log income on age, sex, race, and education.
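A sketch of this step with `h2o.glm`, assuming the frame and column names introduced above and a Gaussian family for the continuous log-income response:

```r
# Gaussian GLM: log income regressed on age, sex, race, and education
glm1 <- h2o.glm(x = c("age", "sex", "race", "education"),
                y = "logincome",
                training_frame = train.h2o,
                family = "gaussian")

# Evaluate fit on the held-out test frame
h2o.performance(glm1, newdata = test.h2o)
```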

 

 

Run the GLM model again, this time using a cubic spline of age to capture a curvilinear relationship between age and log income.
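H2O's GLM does not expand splines itself, so one plausible approach (an assumption about the author's method) is to build a natural cubic spline basis for age in R with `splines::ns`, bind the basis columns onto the training table, and upload the result:

```r
library(splines)
library(data.table)

# Expand age into a natural cubic spline basis (df = 4 is illustrative)
sp <- as.data.table(ns(train$age, df = 4))
setnames(sp, paste0("age_ns", seq_len(ncol(sp))))

train.sp     <- cbind(train, sp)
train.sp.h2o <- as.h2o(train.sp, destination_frame = "train_sp")

# GLM with the spline columns replacing raw age
glm2 <- h2o.glm(x = c(names(sp), "sex", "race", "education"),
                y = "logincome",
                training_frame = train.sp.h2o,
                family = "gaussian")
```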



Next, a gradient boosting model is fit: a more nonparametric, resampled, black-box approach. Execution is much slower, reflecting the much larger amount of computation.
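A sketch with `h2o.gbm`, reusing the assumed frame and column names; the tree count and depth are illustrative defaults, not values from the post:

```r
# Gradient boosting machine on the same features and response
gbm1 <- h2o.gbm(x = c("age", "sex", "race", "education"),
                y = "logincome",
                training_frame = train.h2o,
                ntrees = 100,    # illustrative
                max_depth = 5)   # illustrative

h2o.performance(gbm1, newdata = test.h2o)
```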


Now let’s try a random forest.
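The analogous call uses `h2o.randomForest`; again, the tree count is an illustrative choice:

```r
# Distributed random forest on the same features and response
rf1 <- h2o.randomForest(x = c("age", "sex", "race", "education"),
                        y = "logincome",
                        training_frame = train.h2o,
                        ntrees = 100)  # illustrative

h2o.performance(rf1, newdata = test.h2o)
```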


 

 

Finally, deep learning.
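A sketch with `h2o.deeplearning`; the network architecture and epoch count are assumptions, since the post does not show them:

```r
# Feed-forward neural network on the same features and response
dl1 <- h2o.deeplearning(x = c("age", "sex", "race", "education"),
                        y = "logincome",
                        training_frame = train.h2o,
                        hidden = c(64, 64),  # layer sizes are illustrative
                        epochs = 10)         # illustrative

h2o.performance(dl1, newdata = test.h2o)
```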


 

 

A cursory examination of model performance suggests that, with these data and models, gradient boosting is likely to yield the best results. Of course, different training and test data sets would produce different performance.