Microsoft has opened source MMLSpark, a deep learning library for Apache Spark. MMLSpark is perfectly integrated with Microsoft Cognition Toolkit and OpenCV.

Microsoft found that while SparkML could build scalable machine learning platforms, the vast majority of developers’ efforts were spent calling the underlying apis. MMLSpark is designed to simplify repetitive work in PySpark.

Using UCI’s adult Income Census data set as an example, income is projected using other items:

If you use SparkML directly, each column needs to be processed separately and sorted into the correct data type; Only two lines of code are required in MMLSpark:

__Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__model = mmlspark.TrainClassifier(model=LogisticRegression(), LabelCol = "income").fit(trainData) Predictions = model.transform(testData) __Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__Copy the code

Deep neural network (DNN) is not inferior to human in image recognition and speech recognition, but the training of DNN model requires professionals, and the integration with SparkML is also very difficult. MMLSpark provides a convenient Python API for training DNN algorithms. MMLSpark makes it easy to use existing models for classification tasks, train on distributed GPU nodes, and build scalable image processing pipelines using OpenCV.

The following three lines of code initialize a DNN model from the Microsoft Cognition Tool set to extract features from the image:

__Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__cntkModel = CNTKModel (.) setInputCol (" images "). SetOutputCol (" features "). SetModelLocation (resnetModel). SetOutputNode (" z.x ") FeaturizedImages = cntkModel.transform(imagesWithLabels).select([' labels', 'features']) model = TrainClassifier(Model =LogisticRegression(),labelCol= "labels").fit(featurizedImages) __Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__Copy the code

MMLSpark has been published to Docker Hub. You can use the following command to deploy it on a single machine:

__Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark__Tue Oct 24 2017 14:59:05 GMT+0800 (CST)____Tue Oct 24 2017 14:59:05 GMT+0800 (CST)__Copy the code

MMLSpark is licensed under the MIT protocol.

英文原文 :

Github.com/Azure/mmlsp…

Blogs.technet.microsoft.com/machinelear…


Thanks to CAI Fangfang for proofreading this article.

To contribute or translate InfoQ Chinese, please email [email protected]. You are also welcome to follow us on Sina Weibo (@InfoQ, @Ding Xiaoyun) and wechat (wechat id: InfoQChina).