Why I started writing

I have been a Java developer for 10 years. I used to only read other people's blogs, but after thinking it over today, I decided to start my own. The purpose is twofold. First, to write down the projects I have worked on so that I can review them in my spare time (I am getting forgetful (^_^)). Second, to practice my writing: having good ideas is not enough, and it is even better to be able to express them in words.

Preparing the environment

My environment is spark-2.1.1-bin-hadoop2.7 + hadoop2.7 + win10 + hadoop_dll2.6.0_64bit. Building it from scratch took a few days, because my regular work still had to get done. I downloaded Spark and Hadoop from their official websites and hadoop_dll2.6.0_64bit from CSDN. For the installation and configuration steps, a search engine will turn up instructions.
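On Windows, Hadoop's client code looks for bin\winutils.exe under the directory named by the HADOOP_HOME environment variable or the hadoop.home.dir system property. The sketch below checks such a directory before Spark starts. It assumes the hadoop_dll package was unpacked so that winutils.exe and hadoop.dll sit in the bin folder; the path C:\hadoop and the class name are hypothetical, not taken from my setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch: verify that a Hadoop home directory contains the Windows native helpers.
class HadoopHomeCheck {

    // Returns the names of the expected files that are missing under <hadoopHome>/bin.
    static List<String> missingNativeFiles(Path hadoopHome) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"winutils.exe", "hadoop.dll"}) {
            if (!Files.isRegularFile(hadoopHome.resolve("bin").resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "C:\\hadoop" is a hypothetical default; pass your own directory as the first argument.
        Path home = Paths.get(args.length > 0 ? args[0] : "C:\\hadoop");
        List<String> missing = missingNativeFiles(home);
        if (missing.isEmpty()) {
            // Set this before creating the SparkSession so Hadoop can locate winutils.exe.
            System.setProperty("hadoop.home.dir", home.toString());
            System.out.println("hadoop.home.dir = " + home);
        } else {
            System.out.println("Missing under " + home.resolve("bin") + ": " + missing);
        }
    }
}
```

Setting the system property in code is only one option; defining HADOOP_HOME as a Windows environment variable has the same effect.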

Step 1: Build the Spark project with Maven

The core dependencies in the pom.xml are listed below. This step did not go smoothly: some jars failed to download, which led to errors at runtime because the classes could not be found. The cause was network related. Solution: find the broken artifact in the local Maven repository, delete it, then right-click the project -> Maven -> Update Project so that it is downloaded again.

<dependencies>
	<dependency>
		<groupId>org.apache.spark</groupId>
		<artifactId>spark-core_2.11</artifactId>
		<version>2.3.1</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-client</artifactId>
		<version>2.7.7</version>
	</dependency>
	<dependency>
		<groupId>commons-configuration</groupId>
		<artifactId>commons-configuration</artifactId>
		<version>1.10</version>
	</dependency>
	<dependency>
		<groupId>org.apache.spark</groupId>
		<artifactId>spark-streaming_2.11</artifactId>
		<version>2.3.1</version>
		<scope>provided</scope>
	</dependency>
	<dependency>
		<groupId>org.apache.spark</groupId>
		<artifactId>spark-mllib_2.11</artifactId>
		<version>2.3.1</version>
		<scope>runtime</scope>
	</dependency>
</dependencies>

Step 2: Copy the sample code into the Spark project

The example code is under spark-2.3.1-bin-hadoop2.7\examples\src\main. I use the Java implementation.

import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JavaLinearRegressionWithElasticNetExample {
	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder().appName("JavaLinearRegressionWithElasticNetExample").master("local")
				.getOrCreate();
		// $example on$
		// Load training data.
		Dataset<Row> training = spark.read().format("libsvm").load("data/sample_linear_regression_data.txt");
		LinearRegression lr = new LinearRegression().setMaxIter(100).setRegParam(0.8).setElasticNetParam(0.8);
		// Fit the model.
		LinearRegressionModel lrModel = lr.fit(training);
		// Print the coefficients and intercept for linear regression.
		System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept());
		// Summarize the model over the training set and print out some metrics.
		LinearRegressionTrainingSummary trainingSummary = lrModel.summary();
		System.out.println("numIterations: " + trainingSummary.totalIterations());
		System.out.println("objectiveHistory: " + Vectors.dense(trainingSummary.objectiveHistory()));
		trainingSummary.residuals().show();
		System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
		System.out.println("r2: " + trainingSummary.r2());
		// $example off$
		// Added line: show features, label, and prediction for every training row.
		lrModel.transform(training).select("features", "label", "prediction").show();
		spark.stop();
	}
}

For an explanation of the code, please refer to material on Spark MLlib machine learning. So how does this example differ from the one on the official website? The difference is this one line: lrModel.transform(training).select("features", "label", "prediction").show(); It prints the predicted value next to each label. For a beginner, being able to see the predictions in the output matters a lot: it builds confidence in learning!
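For a linear regression model, the value in the prediction column is the dot product of the coefficients and the feature vector, plus the intercept. A minimal plain-Java sketch of that arithmetic follows. The coefficients and intercept in main are made-up placeholders, not the values fitted in this article; only the feature values come from the first row of the data.

```java
// Sketch of the per-row arithmetic behind the "prediction" column of a linear model.
class LinearPredictionSketch {

    // prediction = coefficients . features + intercept
    static double predict(double[] coefficients, double intercept, double[] features) {
        if (coefficients.length != features.length) {
            throw new IllegalArgumentException("coefficients and features must have the same length");
        }
        double sum = intercept;
        for (int i = 0; i < features.length; i++) {
            sum += coefficients[i] * features[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical model parameters, for illustration only.
        double[] coefficients = {1.5, 2.0, 0.5, 1.0};
        double intercept = 10.0;
        // Feature values of the first data row in this article.
        double[] features = {263455.75, 275799, 709413, 629950};
        System.out.println("prediction: " + predict(coefficients, intercept, features));
    }
}
```

To reproduce the real predictions, substitute the values printed by lrModel.coefficients() and lrModel.intercept().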

Output without this line of code:

+-------------------+
|          residuals|
+-------------------+
| -1020073.172416118|
|  1779184.442462788|
| 1313336.9211495668|
| 4165697.1427883618|
|-1404048.5767556876|
|  4972683.339662604|
|  -4435613.32077004|
|  373631.6067474559|
| -5744798.382868968|
+-------------------+

RMSE: 3379675.1928499704
r2: 0.9703976727070023

Output with this line of code:

+--------------------+-------------+--------------------+
|            features|        label|          prediction|
+--------------------+-------------+--------------------+
|(4,[0,1,2,3],[263...|   2528527.75|   3548600.922416118|
|(4,[0,1,2,3],[750...|    4130186.0|   2351001.557537212|
|(4,[0,1,2,3],[297...| 3.64669408E7| 3.515360387885043E7|
|(4,[0,1,2,3],[820...|3.384003015E7|2.9674333007211637E7|
|(4,[0,1,2,3],[661...|6.117732849E7| 6.258137706675569E7|
|(4,[0,1,2,3],[1.2...| 5.63341144E7|5.1361431060337394E7|
|(4,[0,1,2,3],[1.3...|4.407602894E7| 4.851164226077004E7|
|(4,[0,1,2,3],[968...|3.016595874E7|2.9792327133252542E7|
|(4,[0,1,2,3],[695...|  1.7447519E7|2.3192317382868968E7|
+--------------------+-------------+--------------------+

The columns are the features, the label (true value), and the predicted value.

Data

2528527.75 1:263455.75 2:275799 3:709413 4:629950
4130186 1:750437 2:201574.5 3:443560 4:409805
36466940.8 1:2974719.5 2:7677371.75 3:7301996.95 4:8867920.5
33840030.15 1:8207536.9 2:9843736.6 3:8679155.35 4:5086838
61177328.49 1:615422.05 2:15269585.4 3:15642690.7 4:13381847.16
56334114.4 1:12991518.43 2:13798081.9 3:12502771.4 4:12899599.5
44076028.94 1:13081200.65 2:13837499.9 3:12257050.8 4:11567330.24
30165958.74 1:9683931.6 2:9116163.8 3:5350176.6 4:9081991.59
17447519 1:6958805.1 2:7015432.75 3:5833268.4 4:5251674.55
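Each record in this data is a label followed by index:value pairs. The sketch below parses one such record in plain Java to make the layout concrete. It is a simplified illustration, not Spark's own loader, and the class name LibSvmRow is hypothetical. Note that the file uses 1-based feature indices, while Spark's output above shows them 0-based as [0,1,2,3].

```java
// Sketch: parse one libsvm-style record of the form "label index:value index:value ...".
class LibSvmRow {
    final double label;
    final int[] indices;   // feature indices exactly as written in the file (1-based)
    final double[] values; // feature values, in the same order as indices

    LibSvmRow(double label, int[] indices, double[] values) {
        this.label = label;
        this.indices = indices;
        this.values = values;
    }

    static LibSvmRow parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected index:value but got: " + tokens[i]);
            }
            indices[i - 1] = Integer.parseInt(tokens[i].substring(0, colon));
            values[i - 1] = Double.parseDouble(tokens[i].substring(colon + 1));
        }
        return new LibSvmRow(label, indices, values);
    }

    public static void main(String[] args) {
        // First record of the data in this article.
        LibSvmRow row = parse("2528527.75 1:263455.75 2:275799 3:709413 4:629950");
        System.out.println("label = " + row.label + ", feature count = " + row.values.length);
    }
}
```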

Summary: The parameters in new LinearRegression().setMaxIter(100).setRegParam(0.8).setElasticNetParam(0.8) are tunable. The data must be in libsvm format, that is, each line is a label followed by index:value pairs for the features. RMSE is the root mean squared error: the square root of the mean of the squared differences between the predicted and true values. R2 compares the model's predictions with simply predicting the mean of the labels. It usually lies between 0 and 1: 0 means the model does no better than always taking the mean, and 1 means every prediction matches the true result perfectly.
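Both metrics can be recomputed by hand from the label and prediction columns printed earlier. The sketch below does this in plain Java using the nine rows from the output table. The class and method names are hypothetical, and the formulas are the standard definitions (residual = label - prediction), which I assume match what Spark reports.

```java
// Sketch: recompute RMSE and R2 from the label/prediction pairs shown in this article.
class RegressionMetricsSketch {
    static final double[] LABELS = {
        2528527.75, 4130186.0, 3.64669408E7, 3.384003015E7, 6.117732849E7,
        5.63341144E7, 4.407602894E7, 3.016595874E7, 1.7447519E7
    };
    static final double[] PREDICTIONS = {
        3548600.922416118, 2351001.557537212, 3.515360387885043E7,
        2.9674333007211637E7, 6.258137706675569E7, 5.1361431060337394E7,
        4.851164226077004E7, 2.9792327133252542E7, 2.3192317382868968E7
    };

    // Sum of squared residuals, where residual = label - prediction.
    static double sumSquaredResiduals(double[] labels, double[] predictions) {
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double residual = labels[i] - predictions[i];
            sum += residual * residual;
        }
        return sum;
    }

    // Root mean squared error: sqrt(mean of squared residuals).
    static double rmse(double[] labels, double[] predictions) {
        return Math.sqrt(sumSquaredResiduals(labels, predictions) / labels.length);
    }

    // R2 = 1 - (squared error of the model) / (squared error of always predicting the mean).
    static double r2(double[] labels, double[] predictions) {
        double mean = 0.0;
        for (double label : labels) {
            mean += label;
        }
        mean /= labels.length;
        double totalSumOfSquares = 0.0;
        for (double label : labels) {
            totalSumOfSquares += (label - mean) * (label - mean);
        }
        return 1.0 - sumSquaredResiduals(labels, predictions) / totalSumOfSquares;
    }

    public static void main(String[] args) {
        System.out.println("RMSE: " + rmse(LABELS, PREDICTIONS));
        System.out.println("r2: " + r2(LABELS, PREDICTIONS));
    }
}
```

The printed values can be compared with the RMSE and r2 lines in the program output above. Small differences in the last digits are possible because of floating-point summation order.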

Closing words

I wrote this article to build up my confidence in learning, and in the end I got the predicted results to print. The road of learning is never smooth; the only way forward is to persevere.