Original link: http://tecdat.cn/?p=23151

This example shows how to classify text data using a deep learning long short-term memory (LSTM) network.

Text data is naturally sequential. A piece of text is a sequence of words that may have dependencies on each other. To learn and use long-term dependencies to classify sequence data, use an LSTM neural network. An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between the time steps of sequence data.

To input text into an LSTM network, first convert the text data into numeric sequences. You can do this using a word encoding, which maps documents to sequences of numeric indices. For better results, also include a word embedding layer in the network. Word embeddings map words to numeric vectors rather than scalar indices. These embeddings capture semantic details of the words, so that words with similar meanings have similar vectors. They also model relationships between words through vector arithmetic. For example, the relationship "Rome is to Italy as Paris is to France" is described by the equation Italy - Rome + Paris = France.
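To make the vector-arithmetic idea concrete, here is a minimal sketch using a pretrained embedding. It assumes the fastTextWordEmbedding support package for Text Analytics Toolbox is installed; word2vec and vec2word are functions from that toolbox.

emb = fastTextWordEmbedding;   % pretrained 300-dimensional English embedding (support package)
v = word2vec(emb,"Italy") - word2vec(emb,"Rome") + word2vec(emb,"Paris");
vec2word(emb,v)                % closest word to the resulting vector, ideally "France"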

In this example, there are four steps to training and using the LSTM network.

  • Import and preprocess the data.
  • Convert the words to numeric sequences using a word encoding.
  • Create and train an LSTM network with a word embedding layer.
  • Classify new text data using the trained LSTM network.

Import data

Import the factory report data. This data contains labeled text descriptions of factory events. To import the text data as strings, specify the text type as "string".
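The import call itself is not shown in the source; a minimal sketch using readtable, assuming the data is stored in a CSV file (the file name factoryReports.csv is an assumption):

data = readtable("factoryReports.csv",'TextType','string'); % read text columns as string arrays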

head(data)

The goal of this example is to classify events by the label in the Category column. To divide the data into classes, convert these labels to categorical.

data.Category = categorical(data.Category);

Use a histogram to see the distribution of classes in the data.

figure
histogram(data.Category);

The next step is to partition the data into training and validation sets. Partition the data into a training partition and a held-out partition for validation and testing. Specify a holdout percentage of 20%.

cvp = cvpartition(data.Category,'Holdout',0.2);

Extract text data and labels from the partitioned table.

dataTrain = data(training(cvp),:);
dataValidation = data(test(cvp),:);
textDataTrain = dataTrain.Description;
textDataValidation = dataValidation.Description;

To check that you have imported the data correctly, visualize the training text data using a word cloud.

wordcloud(textDataTrain);

Preprocess text data

Create a function that tokenizes and preprocesses the text data. The function preprocessText, sketched after this list, performs these steps:

  • Tokenize the text using tokenizedDocument.
  • Convert the text to lowercase using lower.
  • Erase punctuation using erasePunctuation.
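The listing of preprocessText does not survive in the source; a minimal sketch that implements exactly the three steps above:

function documents = preprocessText(textData)
    % Tokenize the text.
    documents = tokenizedDocument(textData);
    % Convert the text to lowercase.
    documents = lower(documents);
    % Erase punctuation.
    documents = erasePunctuation(documents);
end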

Preprocess the training data and the validation data.

documentsTrain = preprocessText(textDataTrain);
documentsValidation = preprocessText(textDataValidation);

View the first few preprocessed training documents.

documentsTrain(1:5)

Convert the documents to sequences

To input the documents into an LSTM network, use a word encoding to convert the documents into sequences of numeric indices.

Create a word encoding
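The encoding construction is not shown in the source; the variable enc used in the doc2sequence calls below presumably comes from wordEncoding:

enc = wordEncoding(documentsTrain); % map each word in the training documents to a numeric index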

The next conversion step is to pad and truncate the documents so they are all the same length.

To pad and truncate the documents, first choose a target length, then truncate documents that are longer than it and left-pad documents that are shorter than it. For best results, the target length should be short without discarding large amounts of data. To find a suitable target length, view a histogram of the training document lengths.

documentLengths = doclength(documentsTrain);
histogram(documentLengths)

Most of the training documents have fewer than 10 tokens. Use this as the target length for truncation and padding.

Convert the documents to sequences of numeric indices using doc2sequence. To truncate or left-pad the sequences to have length 10, set the 'Length' option to 10.

sequenceLength = 10;
XTrain = doc2sequence(enc,documentsTrain,'Length',sequenceLength);

Convert the validation documents to sequences using the same options.

XValidation = doc2sequence(enc,documentsValidation,'Length',sequenceLength);

Create and train the LSTM network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to 1. Next, include a word embedding layer of dimension 50 with the same number of words as the word encoding. Next, include an LSTM layer and set the number of hidden units to 80. Finally, add a fully connected layer with the same number of outputs as there are classes, a softmax layer, and a classification layer.

inputSize = 1;
embeddingDimension = 50;
numHiddenUnits = 80;
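The layer array itself does not appear in the source; a minimal sketch that follows the description above, reusing enc and dataTrain from the earlier steps:

numWords = enc.NumWords;                              % vocabulary size of the word encoding
numClasses = numel(categories(dataTrain.Category));   % number of event categories

layers = [ ...
    sequenceInputLayer(inputSize)                     % scalar word indices
    wordEmbeddingLayer(embeddingDimension,numWords)   % 50-dimensional word embedding
    lstmLayer(numHiddenUnits,'OutputMode','last')     % output only the last time step
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];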

Specify Training Options

  • Train using the Adam optimizer.
  • Specify a mini-batch size of 16.
  • Shuffle the data every epoch.
  • Monitor the training progress by setting the 'Plots' option to 'training-progress'.
  • Specify the validation data using the 'ValidationData' option.
  • Suppress verbose output by setting the 'Verbose' option to false.

By default, training uses a GPU if one is available (this requires Parallel Computing Toolbox™ and a CUDA® enabled GPU with compute capability 3.0 or higher); otherwise it uses the CPU. Training on a CPU can take significantly longer than training on a GPU.

options = trainingOptions('adam', ...
    'MiniBatchSize',16, ...
    'Shuffle','every-epoch', ...
    'ValidationData',{XValidation,dataValidation.Category}, ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the LSTM network.
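The training call is not shown in the source; a minimal sketch using trainNetwork with the variables defined above:

net = trainNetwork(XTrain,dataTrain.Category,layers,options); % train on the padded index sequences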

Use new data to make predictions

Classify the event type of three new reports. Create a string array containing the new reports.
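The report strings themselves are not reproduced in the source; illustrative examples (the exact wording is an assumption):

reportsNew = [ ...
    "Coolant is pooling underneath sorter."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];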

Preprocess the text data using the same preprocessing steps as for the training documents.

documentsNew = preprocessText(reportsNew);

Convert the text data to sequences using the same options as when creating the training sequences.

XNew = doc2sequence(enc,documentsNew,'Length',sequenceLength);

Classify the new sequences using the trained LSTM network.

labelsNew = classify(net,XNew)


