The goal of this blog series is to compare two leading production-grade natural language processing (NLP) libraries, Apache Spark NLP from John Snow Labs and spaCy from Explosion AI, for real-world NLP scenarios. Both libraries are open source and licensed for commercial use (Apache 2.0 and MIT, respectively). Both are actively developed, release frequently, and have growing communities.

I wanted to analyze and identify the strengths of each library, find out how they differ for data scientists and developers, and determine when it is more convenient to use one or the other. This analysis is intended to be an objective exploration, though, as in every application of natural language understanding, it involves a certain amount of subjective judgment at several stages.

As simple as it sounds, comparing two different libraries and benchmarking them in a comparable way is very challenging. Keep in mind that your application will have different scenarios, data pipelines, text features, hardware setups, and non-functional requirements than what is done here.

I will assume the reader is already familiar with NLP concepts and programming. You may not know either tool, but my goal is to make the code as self-explanatory and readable as possible so that you don’t get bogged down in too much detail. Both libraries are publicly documented and fully open source, so I suggest you first take a look at the spaCy 101 and Spark NLP Quick Start documentation.

About these two libraries

Spark NLP was open sourced in October 2017. As a Spark library, it is a native extension of Apache Spark. It introduces a set of Spark ML pipeline stages, in the form of estimators and transformers, for processing distributed datasets. Spark NLP annotators include not only basic features such as tokenization, normalization, and part-of-speech tagging, but also advanced features such as sentiment analysis, spell checking, and assertion status detection. All of this works within the Spark ML framework. Spark NLP is written in Scala, runs in the JVM, and takes advantage of Spark’s optimizations and execution planning. The library currently provides Scala and Python APIs.

spaCy is a popular and easy-to-use Python library for natural language processing. It recently released version 2.0, which includes neural network models, entity recognition models, and more. It offers industry-leading accuracy and speed, and has an active open source community. spaCy has been around for at least three years, with its first version on GitHub dating back to early 2015.

Spark NLP does not currently ship with pre-trained models. spaCy provides pre-trained models for seven (European) languages, so users can feed in target sentences and get back results, including tokens, lemmas, parts of speech (POS), similarity, entity recognition, and more, without needing to train a model themselves.

Both libraries provide some level of customization through parameters, allow trained pipelines to be saved to disk, and require developers to write programs around them for specific use cases. Spark NLP makes it easy to embed an NLP pipeline as part of a Spark ML machine learning pipeline that spans data loading, NLP, feature engineering, model training, hyperparameter tuning, and evaluation; a sketch of that pattern follows below. In addition, Spark can optimize the execution of the whole pipeline, making Spark NLP run faster.
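To make the “estimators and transformers” idea concrete, here is a minimal sketch of the Spark ML pipeline pattern that Spark NLP’s annotators plug into. Note that this uses only standard pyspark.ml stages, not Spark NLP annotators; it just shows how a pipeline chains transformers and an estimator and is fitted in one call:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

train_df = spark.createDataFrame(
    [("spark is great", 1.0), ("boring text here", 0.0)],
    ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),      # transformer
    HashingTF(inputCol="words", outputCol="features"),  # transformer
    LogisticRegression(maxIter=10)                      # estimator
])

model = pipeline.fit(train_df)     # fits the whole pipeline at once
model.transform(train_df).show()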

The benchmark application

The program I’ve written here will predict POS tags in raw .txt files. Much of the data cleaning and preparation is done along the way. Both applications will train on the same data and make predictions against the same data, to achieve the maximum possible comparability.

My goal is to validate the two pillars of any statistical program:

1. Accuracy, a measure of how well a program can correctly predict language features

2. Performance, which means how long I have to wait to achieve that accuracy, and how much input data I can feed the program before it either crashes or my grandchildren grow up.

To compare these metrics, I need to make sure that the two libraries are as comparable as possible. I used the following configuration:

1. A desktop computer running Linux Mint, with an SSD, 16 GB of RAM, and a quad-core 3.5 GHz Intel i5-6600K processor.

2. Training, test, and gold-standard data in NLTK POS format (see below).

3. A Jupyter Python 3 notebook with spaCy 2.0.5 installed.

4. An Apache Zeppelin 0.7.3 notebook with Spark NLP 1.3.0 and Apache Spark 2.1.1 installed.

Data

The data used for training, testing, and comparison comes from the American National Corpus. I used the newspaper section of the MASC 3.0.2 written corpus.

I used the ANCtool provided with the corpus to prepare the data. Although I could have used the CoNLL data format, which contains a lot of tagged information such as lemmas, indexes, and entity recognition, I preferred the NLTK data format with Penn POS tags, which is sufficient for my purpose. The data looks like this:

Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|.

As you can see, the training data contains:

  • Sentence boundary detection results (new line, new sentence)
  • Tokenization results (tokens separated by spaces)
  • POS tagging results (tags delimited by “|”)

In the raw text files, everything comes jumbled together, without any standard boundaries.
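As a quick illustration (my addition, not part of the original benchmark code), a line in this format can be parsed into parallel token and tag lists with a few lines of Python; the helper name is hypothetical:

# A hypothetical helper: split one NLTK-format line
# ("token|tag token|tag ...") into parallel token and tag lists.
def parse_nltk_line(line):
    tokens, tags = [], []
    for pair in line.strip().split():
        token, _, tag = pair.rpartition("|")
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = parse_nltk_line(
    "Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP "
    "opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|.")
print(list(zip(tokens, tags))[:3])  # [('Neither', 'DT'), ('Davison', 'NNP'), ('nor', 'CC')]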

Here are the key metrics for the benchmark we will run.

Benchmark datasets

In this article, we will use two benchmark datasets. The first is very small and is used for interactive debugging and experimentation:

  • Training data: 36 .txt files, 77 KB in total
  • Test data: 14 .txt files, 114 KB in total
  • 21,362 words to predict

The second dataset is still not “big data,” but is a relatively large dataset that represents a typical standalone application scenario:

  • Training data: 72 .txt files, 150 KB in total
  • Two test datasets: 9,225 .txt files totaling 75 MB, and 1,125 files totaling 15 MB
  • 13 million words to predict

It is important to note that we are not evaluating “big data” datasets here. This is because while spaCy can use multi-core CPUs, it cannot natively use a cluster the way Spark NLP does. As a result, Spark NLP is orders of magnitude faster than spaCy on terabyte-scale datasets on a cluster, in the same way a massively parallel database cluster will outperform a locally installed MySQL server. My goal was to evaluate both libraries on a single machine, using the multi-core capabilities of both. This is a common development setup, and applies to applications that do not need to handle large datasets.

Let’s start

So let’s get to it. First, we must import the relevant libraries and initialize them.

spaCy

import os
import io
import time
import re
import random
import pandas as pd
import spacy

nlp_model = spacy.load('en', disable=['parser', 'ner'])
nlp_blank = spacy.blank('en', disable=['parser', 'ner'])

I disabled some of the pipeline components in spaCy to avoid bloating it with unnecessary parsers. I also loaded nlp_model, a pre-trained model provided by spaCy, as a reference, but I will mainly use nlp_blank, which is more representative since we will train it ourselves.
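As a quick sanity check (my addition, not from the benchmark itself), you can confirm which components remain active on each object:

# Verify which pipeline components remain after disabling 'parser' and 'ner'.
print(nlp_model.pipe_names)  # expected: ['tagger']
print(nlp_blank.pipe_names)  # expected: [] -- a truly empty pipeline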

Spark-NLP

import org.apache.spark.sql.expressions.Window
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron._
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic._
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import com.johnsnowlabs.util.Benchmark

The first challenge I faced was that I had to deal with three completely different tokenization outputs, which makes it hard to determine whether a word matches both the token boundaries and the POS tag:

1. spaCy’s tokenizer takes a rule-based approach and already includes a vocabulary that holds many common abbreviations used in tokenization.

2. Spark NLP’s tokenizer has its own rules for tokenization.

3. My training and test data, which are tokenized according to the ANC standard. In many cases this splits words quite differently from either library’s tokenizer, as the sketch after this list illustrates.
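To see the mismatch concretely, here is a small illustration of my own (the example string is hypothetical): spaCy’s default infix rules split hyphenated words, while the ANC-tokenized gold data keeps tokens like “large-screen” whole:

# With spaCy's default rules, hyphenated words are split on the hyphen,
# while the ANC gold data treats "large-screen" as a single token.
doc = nlp_model("a large-screen TV")
print([t.text for t in doc])  # ['a', 'large', '-', 'screen', 'TV']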

So, to overcome this, I needed to decide how to compare POS tags across completely different sets of tokens. For Spark NLP, I left things as they are: its default rules roughly match the ANC’s open tokenization standard. For spaCy, I needed to relax the infix rules so that words are no longer split on “-”, which increases tokenization accuracy against this corpus.

spaCy

class DummyTokenMatch:
    def __init__(self, content):
        self.start = lambda: 0
        self.end = lambda: len(content)

def do_nothing(content):
    return [DummyTokenMatch(content)]

model_tokenizer = nlp_model.tokenizer
nlp_blank.tokenizer = spacy.tokenizer.Tokenizer(nlp_blank.vocab,
                                prefix_search=model_tokenizer.prefix_search,
                                suffix_search=model_tokenizer.suffix_search,
                                infix_finditer=do_nothing,
                                token_match=model_tokenizer.token_match)

Note: I passed the vocab object to nlp_blank, so nlp_blank is not truly empty. This vocab object contains English language rules and strategies that help our blank model tokenize and POS-tag English words. Thus, spaCy starts with a slight advantage, whereas Spark NLP doesn’t “know” any English beforehand.
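With infix splitting disabled, the earlier hypothetical example now tokenizes the way the ANC gold data expects:

# After replacing the tokenizer, hyphenated words stay intact,
# matching the ANC-style tokens in the training data.
doc = nlp_blank("a large-screen TV")
print([t.text for t in doc])  # ['a', 'large-screen', 'TV']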

Training pipeline

Now comes the training step. In spaCy, I need to provide the training data in a specific format, which looks like this:

TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]

In Spark NLP, I must provide a folder of .txt files in the same “token|tag” delimited format as the ANC training data, so I just pass the folder path to the POS tagger, PerceptronApproach.

Let’s load spaCy’s training data. In the code below, I have to add some hand-made exceptions, rules, and character fixes, because spaCy’s training requires clean data.

spaCy

start = time.time()
train_path = "./target/training/"
train_files = sorted([train_path + f for f in os.listdir(train_path) if os.path.isfile(os.path.join(train_path, f))])
TRAIN_DATA = []
for file in train_files:
    fo = io.open(file, mode='r', encoding='utf-8')
    for line in fo.readlines():
        line = line.strip()
        if line == '':
            continue
        line_words = []
        line_tags = []
        for pair in re.split("\\s+", line):
            tag = pair.strip().split("|")
            # strip characters that spaCy's trainer chokes on
            line_words.append(re.sub(r'(\w+)\.', r'\1', tag[0].replace('$', '').replace('-', '').replace('\'', '')))
            line_tags.append(tag[-1])
        TRAIN_DATA.append((' '.join(line_words), {'tags': line_tags}))
    fo.close()

# manual fix for one problematic line in the corpus
TRAIN_DATA[240] = ('The company said the one time provision would substantially eliminate all future losses at the unit .', {'tags': ['DT', 'NN', 'VBD', 'DT', 'JJ', '-', 'NN', 'NN', 'MD', 'RB', 'VB', 'DT', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']})

n_iter = 5
tagger = nlp_blank.create_pipe('tagger')
tagger.add_label('-')
tagger.add_label('(')
tagger.add_label(')')
tagger.add_label('#')
tagger.add_label('...')
tagger.add_label('one-time')
nlp_blank.add_pipe(tagger)
optimizer = nlp_blank.begin_training()
for i in range(n_iter):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp_blank.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)
print(time.time() - start)

The elapsed time

{'tagger': 5.773235303101046}
{'tagger': 1.138113870966123}
{'tagger': 0.46656132966405683}
{'tagger': 0.5513760568314119}
{'tagger': 0.2541630900934435}

Time to run: 122.11359786987305 seconds

I had to do some extra work to get around a few potholes. spaCy wouldn’t let me train using my tokenizer’s words as-is because they contain some ugly characters; for example, it refuses to train sentences containing tags like “large-screen” or “No” unless those tags exist in the vocab’s tag list. I had to add these tags to the list so that spaCy could find them during training.

Now, let’s look at how the pipeline is built in Spark NLP.

Spark-NLP

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")
    .setPrefixPattern("\\A([^\\s\\p{L}\\d\\$\\.#]*)")
    .addInfixPattern("(\\$?\\d+(?:[^\\s\\d]{1}\\d+)*)")

val posTagger = new PerceptronApproach()
    .setInputCols("document", "token")
    .setOutputCol("pos")
    .setCorpusPath("/home/saif/nlp/comparison/target/training")
    .setNIterations(5)

val finisher = new Finisher()
    .setInputCols("token", "pos")
    .setOutputAsArray(true)

val pipeline = new Pipeline()
    .setStages(Array(
        documentAssembler,
        tokenizer,
        posTagger,
        finisher
    ))

val model = Benchmark.time("Time to train model") {
    pipeline.fit(data)
}

As you can see, building the pipeline is a very linear process: the document assembler makes the target text column a document for the subsequent annotator, the tokenizer; PerceptronApproach, in turn, is the POS model, which takes both the document text and the token forms as inputs.

I had to update the prefix pattern and add a new infix pattern so that dates and numbers are matched the same way as in the ANC data (this may become the default in the next Spark NLP release). As you can see, every component of the pipeline is under the user’s control; unlike spaCy, there is no implicit vocab or prior knowledge of English.

The corpusPath of PerceptronApproach points to the folder of pipe-delimited text files, and the Finisher annotator wraps up the POS and tokenization results for the next step. As its name implies, setOutputAsArray() makes it return an array rather than a concatenated string, although processing that array has some computational cost.

The data passed to fit() does not really matter, because the only NLP annotator being trained is the PerceptronApproach, and it is trained from the external POS corpora.
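For completeness, here is a rough sketch of what that means in practice. Spark NLP also ships a Python API (noted in the intro); assuming its builders mirror the Scala ones above, the DataFrame passed to fit() can be a trivial placeholder, and the fitted pipeline can be persisted like any Spark ML model. The column name "text" matches the document assembler’s input column; everything else here is illustrative, not the post’s actual code:

# A minimal sketch (assumptions noted above): a placeholder one-column
# DataFrame is enough, since training reads from the external corpusPath.
data = spark.createDataFrame([("placeholder text",)], ["text"])

model = pipeline.fit(data)                            # trains only PerceptronApproach
model.write().overwrite().save("pos_pipeline_model")  # persist the fitted pipeline to disk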

The elapsed time

Time to train model: 3.167619593 sec

As a side note, you can add a SentenceDetector or a SpellChecker to the pipeline. In some cases this helps POS accuracy by letting the model know where sentences end.

What’s next?

So far, we’ve initialized the libraries, loaded the data, and trained a POS tagger model with each of them. Note that spaCy comes with a pre-trained tokenizer, so this step may not be necessary if your text data comes from a language (such as English) and domain (such as news reporting) spaCy was trained on, although the tokenizer infix modifications were important for making the generated tokens match our ANC corpus. Training with Spark NLP was more than 38 times faster than with spaCy for five iterations (roughly 3.2 seconds versus 122.1 seconds).

In the next article in this series, we’ll cover code, accuracy, and performance by running the NLP pipeline using the just-trained model.

Related information:

  • Comparing two production-grade NLP libraries: running Spark-NLP and spaCy pipelines
  • Comparing two production-grade NLP libraries: accuracy, performance, and scalability