Author: GUEST | Translator: VK | Source: Analytics Vidhya

Overview

  • Set up John Snow Labs' Spark NLP on AWS EMR and use the library for simple text classification of BBC articles.

Introduction

Natural language processing is one of the most important tasks for data science teams around the world. As data keeps growing, most organizations have moved to big data platforms such as Apache Hadoop and cloud offerings such as AWS, Azure, and GCP.

These platforms are not only capable of handling big data, but also enable organizations to perform analysis at scale on unstructured data such as text. However, when it comes to machine learning, there is still a gap between big data systems and machine learning tools.

Popular Python machine learning libraries such as scikit-learn and Gensim are highly optimized to run on single-node machines and were not designed for distributed environments.

Apache Spark MLlib is one of the tools that helps bridge this gap, providing most of the common machine learning models, such as linear regression, logistic regression, support vector machines, random forests, K-means, and LDA, to perform the most common machine learning tasks.

In addition to machine learning algorithms, Spark MLlib also provides a large number of feature transformers such as Tokenizer, StopWordsRemover, NGram, CountVectorizer, TF-IDF, and Word2Vec.
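To make this concrete, here is a minimal sketch (not from the original article) that chains two of these MLlib transformers on a toy DataFrame; the session name and column names are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("mllib-transformers-demo").getOrCreate()

df = spark.createDataFrame(
    [(0, "Spark MLlib provides basic text feature transformers")],
    ["id", "text"])

# split the text into words, then drop common English stopwords
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cleaned = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
cleaned.select("words", "filtered").show(truncate=False)
```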

While these transformers and extractors are sufficient to build a basic NLP pipeline, a more comprehensive, production-grade pipeline requires more advanced techniques such as stemming, lemmatization, part-of-speech tagging, and named entity recognition.

Spark NLP provides various annotators to perform such advanced NLP tasks (a short illustrative sketch follows the link below). For the full list of annotators and their usage, see the documentation:

Nlp.johnsnowlabs.com/docs/en/ann…
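As a taste of what these annotators look like in practice, here is a minimal sketch using one of the library's pretrained English pipelines. It is an illustration added to this write-up, not code from the original article, and it assumes Spark NLP is already installed in your environment.

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# download and run a pretrained pipeline (tokenizer, lemmatizer, POS tagger, ...)
pipeline = PretrainedPipeline("explain_document_ml", lang="en")
result = pipeline.annotate("John Snow Labs builds NLP libraries for Spark.")

print(result["pos"])     # part-of-speech tags
print(result["lemmas"])  # lemmatized tokens
```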

Set up the environment

Let's walk through how to set up Spark NLP on AWS EMR.

1. Before launching the EMR cluster, we need to create a bootstrap action. Bootstrap actions are used to install additional software or customize the configuration of the cluster nodes. The following bootstrap action can be used to set up Spark NLP on an EMR cluster:

```bash
#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv
sudo python36 -m pip install --upgrade pip
sudo python36 -m pip install pandas
sudo python36 -m pip install boto3
sudo python36 -m pip install spark-nlp==2.4.5
```

Once the shell script is created, copy it to a location in AWS S3. You can also install other Python packages as needed.
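If you prefer to stay in Python, a small sketch for uploading the script to S3 with boto3 follows. This is an addition for convenience, not part of the original article; the file name, bucket, and key are placeholders to replace with your own values.

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="install_spark_nlp.sh",          # local path to the shell script above
    Bucket="your-bucket-name",                 # replace with your S3 bucket
    Key="bootstrap/install_spark_nlp.sh")      # S3 key referenced by the bootstrap action
```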

2. We can launch the EMR cluster using the AWS console, the API, or the boto3 library in Python. The advantage of using Python is that the code can be reused whenever you need to instantiate a cluster or add this step to a workflow.

Here is the Python code to instantiate the EMR cluster.

```python
import boto3

region_name = 'region_name'

def get_security_group_id(group_name, region_name):
    ec2 = boto3.client('ec2', region_name=region_name)
    response = ec2.describe_security_groups(GroupNames=[group_name])
    return response['SecurityGroups'][0]['GroupId']

emr = boto3.client('emr', region_name=region_name)

cluster_response = emr.run_job_flow(
    Name='cluster_name',              # update the value
    ReleaseLabel='emr-5.27.0',
    LogUri='s3_path_for_logs',        # update the value
    Instances={
        'InstanceGroups': [
            {
                'Name': "Master nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.2xlarge',  # change as required
                'InstanceCount': 1             # for high availability of master nodes, set count greater than 1
            },
            {
                'Name': "Slave nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.2xlarge',  # change as required
                'InstanceCount': 2
            }
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'key_pair_name',  # update the value
        'EmrManagedMasterSecurityGroup': get_security_group_id('ElasticMapReduce-master', region_name=region_name),
        'EmrManagedSlaveSecurityGroup': get_security_group_id('ElasticMapReduce-slave', region_name=region_name)
    },
    BootstrapActions=[
        {
            'Name': 'install_dependencies',
            'ScriptBootstrapAction': {
                'Args': [],
                'Path': 'path_to_bootstrapaction_on_s3'  # update the value
            }
        }
    ],
    Steps=[],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Applications=[
        {'Name': 'hadoop'},
        {'Name': 'spark'},
        {'Name': 'hive'},
        {'Name': 'zeppelin'},
        {'Name': 'presto'}
    ],
    Configurations=[
        # YARN
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.nodemanager.vmem-pmem-ratio": "4",
                "yarn.nodemanager.pmem-check-enabled": "false",
                "yarn.nodemanager.vmem-check-enabled": "false"
            }
        },
        # HADOOP
        {
            "Classification": "hadoop-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Configurations": [],
                    "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}
                }
            ],
            "Properties": {}
        },
        # SPARK
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Configurations": [],
                    "Properties": {
                        "PYSPARK_PYTHON": "/usr/bin/python3",
                        "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                    }
                }
            ],
            "Properties": {}
        },
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
            "Configurations": []
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.dynamicAllocation.enabled": "true"  # default is also true
            }
        }
    ]
)
```

Note: Make sure you have the correct permissions on the S3 buckets used for logging and for storing the bootstrap action script.
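After calling run_job_flow, you typically want to wait until the cluster (including its bootstrap actions) is ready before submitting work. The following sketch is an addition, not part of the original article; it reuses the `cluster_response` and `region_name` variables defined above.

```python
import boto3

emr = boto3.client("emr", region_name=region_name)
cluster_id = cluster_response["JobFlowId"]

# Block until the cluster is up and idle (WAITING); bootstrap actions run before this.
waiter = emr.get_waiter("cluster_running")
waiter.wait(ClusterId=cluster_id)

state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print(cluster_id, state)
```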

Text classification of BBC articles with Spark NLP

Now that the cluster is ready, let's build a simple text classification example on BBC data using Spark NLP and Spark MLlib.

1. Initialize Spark

We will import the required libraries and initialize a Spark session with various configuration parameters. The configuration values shown here are based on my local environment; adjust them to match yours.

```python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
# spark = sparknlp.start()

# Adjust the memory-related settings below according to your environment.
spark = SparkSession.builder \
    .appName("BBC Text Categorization") \
    .config("spark.driver.memory", "8G") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.network.timeout", "3600s") \
    .getOrCreate()
```

2. Load text data

We will use the BBC news dataset, which you can download from the link below. After downloading it, load the data with the following Spark code.

www.kaggle.com/yufengdev/b…

```python
file_location = r'path\to\bbc-text.csv'
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

df.count()
```
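Before splitting the data, a quick sanity check helps. This snippet is an addition, not from the original article; it inspects the schema and the distribution of the "category" column, which is used later as the label.

```python
df.printSchema()
df.groupBy("category").count().show()   # class distribution of the BBC articles
df.show(5, truncate=80)                  # a few sample rows
```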

3. Split the dataset into training and test sets

Unlike in Python, where we would use scikit-learn to split the data, the Spark DataFrame has a built-in randomSplit() function to perform the same operation.

```python
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed=100)
```

The randomSplit() function takes two arguments: an array of weights and a seed. In our example, we use a 70/30 split, with 70% of the data for training and 30% for testing.

4. NLP pipeline with Spark NLP

Let's build the NLP pipeline using Spark NLP. One of the biggest benefits of Spark NLP is its native integration with Spark MLlib, which helps build a comprehensive ML pipeline of Transformers and Estimators.

This pipeline can include feature extraction modules such as CountVectorizer or HashingTF and IDF. We can also include a machine learning model in this pipeline.

The following is an example of an NLP pipeline with feature extraction and a machine learning model:

```python
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, SQLTransformer, IndexToString
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# convert the text column into a Spark NLP document
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# split the document into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# clean the tokens
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

# remove stopwords
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols("normalized") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# reduce tokens to their root form
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

# convert the custom document structure into an array of tokens
finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputCols(["token_features"]) \
    .setOutputAsArray(True)

# generate term-frequency features
hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures")

# weight the term frequencies by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

# convert the labels (strings) to integers
label_stringIdx = StringIndexer(inputCol="category", outputCol="label")

# define a simple multinomial logistic regression model.
# Experiment with different combinations of hyperparameters to see which fits your data better.
# You can also try different algorithms and compare the scores.
lr = LogisticRegression(maxIter=10, regParam=0.3)

# convert the indexed label back to the corresponding string
# (the output column name here is a placeholder)
label_to_stringIdx = IndexToString(inputCol="label", outputCol="article_class")

# define the NLP pipeline
nlp_pipeline = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher,
            hashingTF,
            idf,
            label_stringIdx,
            lr,
            label_to_stringIdx])
```

5. Train the model

Now that our NLP pipeline is ready, let's train the model on the training data.

```python
pipeline_model = nlp_pipeline.fit(trainingData)
```

6. Make predictions

Once the training is complete, we can predict the class labels on the test data.

```python
# predictions on the test data
predictions = pipeline_model.transform(testData)
```
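As a quick check (an addition, not in the original article), you can peek at a few predicted rows. The "prediction" column holds the numeric class index produced by the logistic regression stage.

```python
predictions.select("text", "category", "prediction").show(5, truncate=60)
```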

7. Evaluate the model

Evaluating the trained model is important for understanding how it performs on unseen data. We will look at three popular metrics: accuracy, precision, and recall (a short weighted-F1 sketch follows these three snippets).

  1. Accuracy

```python
# import the evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
```

  2. Precision

```python
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
precision = evaluator.evaluate(predictions)
print("Weighted Precision = %g" % precision)
```

  3. Recall

```python
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="weightedRecall")
recall = evaluator.evaluate(predictions)
print("Weighted Recall = %g" % recall)
```
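If you want a single number that balances precision and recall, the same evaluator also supports a weighted F1 score. This snippet is an extra sketch added here, not part of the original article.

```python
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1")
f1 = evaluator.evaluate(predictions)
print("Weighted F1 = %g" % f1)
```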

Depending on the business use case, you can decide which metrics to use to evaluate the model.

For example, if a machine learning model is designed to detect cancer from a set of parameters, it is better to use recall, because the company cannot afford false negatives (a person who has cancer but the model fails to detect it). If instead the model is designed to generate user recommendations, the company can afford false positives (say, 8 out of 10 recommendations matching the user's profile), so precision can be used as the evaluation metric.

8. Save the pipeline model

After you have successfully trained, tested, and evaluated the model, you can save it to disk and use it in other Spark applications. To save the model to disk, use the following code:

```python
pipeline_model.save('/path/to/storage_location')
```
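Loading the saved pipeline back is equally simple. The sketch below is an addition for completeness, with the storage path as a placeholder; the loaded model can then be applied to new data in any Spark application.

```python
from pyspark.ml import PipelineModel

loaded_model = PipelineModel.load('/path/to/storage_location')
new_predictions = loaded_model.transform(testData)
```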

Conclusion

Spark NLP provides a large number of annotators and transformers to build data preprocessing pipelines. Spark NLP integrates seamlessly with Spark MLlib, enabling us to build end-to-end natural language processing projects in a distributed environment.

In this article, we looked at how to set up Spark NLP on AWS EMR and implement text classification of BBC data. We also looked at different evaluation metrics in Spark MLlib and saw how to store the model for later use.

Hope you enjoyed this article.

Original link: www.analyticsvidhya.com/blog/2020/0…
