0. Foreword

AI has been hugely popular over the last two years, and people from all walks of life want to learn it. This article shares an introduction to English text classification using a neural network. It was originally a college assignment of mine, and I am publishing it here for everyone to study and discuss.

Repository address: github.com/guanjiangta… If you like it, please give it a little star.

1. Steps to implement an AI

Next I will dig into how to implement an AI. Please excuse me if anything offends, and point out any inadequacies in the comments.

1.1 Determine what needs to be done

This time, take my earlier cat-and-dog classifier as an example. The first step in doing anything is to clarify the goal. The goal here is to identify and classify cats and dogs, on the assumption that only cats or dogs appear; pigs, cows, and sheep will not suddenly show up. If they do, we categorize them as "other".

With that, the goal is set: cats, dogs, and "other".

1.2 Choosing a framework

From the bewildering variety of frameworks, pick one that suits you. I chose PaddlePaddle.

1.3 Practical operation – data preprocessing

Different scenarios call for different data processing. For cat-and-dog classification, we extract features for cats, dogs, and any other stray animals, and add basic annotations to the initial data set. This is commonly known as teaching the machine to read.

1.4 Practical operation – generating training data set

Split the data set according to your own circumstances; the split is up to you, but I recommend no fewer than 5K images ~
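For instance, a minimal split might look like this (a sketch; the 80/20 ratio and file names are illustrative assumptions, not from the project):

import random

# illustrative: split a list of labeled samples into training and validation sets
with open("all_samples.txt") as f:       # hypothetical list, one sample per line
    samples = f.read().splitlines()

random.shuffle(samples)
cut = int(len(samples) * 0.8)            # assumed 80/20 split

with open("train_list.txt", "w") as f:
    f.write("\n".join(samples[:cut]))
with open("valid_list.txt", "w") as f:
    f.write("\n".join(samples[cut:]))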

1.5 Practical operation – The rest

1) Network model selection: choose a model according to your requirements. Since this is an image task, I simply picked an image-oriented model, a CNN (experts can configure their own).

2) Configure the algorithm, train, and watch the loss. These three steps are commonly known as parameter tuning.

And, truth be told, that is where the nickname "parameter-tuning warrior" comes from.

Through our painstaking efforts, we finally got the result.

4) Finally, run tests to verify.

5) If deployment is needed, some programming skills are required.

1.6 About this article

This time the task is text classification, in English. All the data comes from a foreign social media site, and we need to sort it into a set of categories.

Because the data comes from a foreign social media site, users speak casually, and native speakers often use shortened words when chatting. Therefore, in order to improve accuracy, I applied some light processing.

2. File introduction

create_data_start.py: text processing methods

create_data_utils.py: text processing utilities

model.py: model file 1

A multimodal-learning file, a processing method for irregular data sets. The algorithm is not perfect and needs further optimization.

net.py: main model file

Mainly uses a CNN; the previous version was a BiLSTM, now changed to CNN. Learning relies mainly on a large sparse matrix.

read_config.py: reads the configuration file

pre_create_data.py: data preprocessing file

The main operations include:

1. Word segmentation. I found that some words in the data set segmented poorly, so I used the wordninja library to segment such words a second time, and the result was better than the earlier files.

2. Word correction. Because the text is conversational, shorthand such as u == you appears, and such words are very poor for training, so I use wordcheck to correct them; the idea is regular-expression replacement. wordcheck.txt is a data set containing 120K lines of dialogue from Friends. (In fact the correction effect is not very obvious, but it is much better than before.)

text_reader.py: data set reading file

train.py: training file

utils.py: the utility file from the previous version

valid_model_acc.py: model accuracy verification file

valid.py: runs prediction and outputs the accuracy file

wordcheck.py: word correction algorithm file

3. Core algorithm

3.1 The CNN algorithm

Code:

import paddle.fluid as fluid

def cnn_net(data, label, dict_dim, emb_dim=128, hid_dim=128, hid_dim2=96,
            class_dim=10, win_size=3, is_infer=False):
    """Conv net"""
    # embedding layer
    emb = fluid.layers.embedding(input=data, size=[dict_dim, emb_dim])
    # convolution layer
    conv_3 = fluid.nets.sequence_conv_pool(
        input=emb,
        num_filters=hid_dim,
        filter_size=win_size,
        act="tanh",
        pool_type="max")
    # fully connected layer
    fc_1 = fluid.layers.fc(input=[conv_3], size=hid_dim2)
    # dropout layer
    dropout_1 = fluid.layers.dropout(x=fc_1, dropout_prob=0.5, name="dropout")
    # softmax layer
    prediction = fluid.layers.fc(input=[dropout_1], size=class_dim, act="softmax")
    return prediction
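For context, here is a sketch of how a network like this is typically wired into a fluid training program (the vocabulary size and optimizer are assumptions, not taken from the project's train.py):

# minimal sketch, assuming the fluid 1.x static-graph style
dict_dim = 10000                                  # assumed vocabulary size
words = fluid.layers.data(name="words", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(name="label", shape=[1], dtype="int64")

prediction = cnn_net(words, label, dict_dim)      # the network defined above
cost = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(cost)
acc = fluid.layers.accuracy(input=prediction, label=label)

optimizer = fluid.optimizer.Adagrad(learning_rate=0.002)   # assumed optimizer
optimizer.minimize(avg_cost)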

4. Procedure

4.1 Preprocessing data

① : Word segmentation

Because some words in the documents were hard to recognize, I used the wordninja library to segment words a second time. Some stuck-together words are clearly split apart again; for example, 2019years is split into 2019 years. This is much better than before and easier for us to read.
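For example, a quick wordninja session looks roughly like this (the sample strings are illustrative):

import wordninja   # pip install wordninja

# splits concatenated tokens into probable English words
print(wordninja.split("2019years"))     # ['2019', 'years'], as described above
print(wordninja.split("thisisatest"))   # ['this', 'is', 'a', 'test']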

② : Spelling correction

Because the data set is collected from posts and chats on foreign social networks, there is a lot of slang and casual speech; words are omitted, or a word is replaced with another token that sounds the same. For example, u == you, c == see, etc.
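To illustrate the idea (the mapping table and function below are my own sketch, not the actual contents of wordcheck.py):

import re

# illustrative slang -> standard-word table; the real rule set is much larger
SLANG = {"u": "you", "c": "see", "r": "are"}

def correct(text):
    # replace whole words only, case-insensitively
    pattern = r"\b(" + "|".join(SLANG) + r")\b"
    return re.sub(pattern, lambda m: SLANG[m.group(0).lower()], text,
                  flags=re.IGNORECASE)

print(correct("c u tomorrow"))   # -> "see you tomorrow"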

③ : How to run

python pre_create_data.py

④ : Code

data = wordninja.split(data)       # second-pass word segmentation
data1 = []
for data2 in data:
    data1.append(data2)            # collect the segmented tokens
strs = " ".join(list(data1))       # reassemble into a space-separated sentence
x_train_list.append(strs)
x_train_file.write(strs + "\n")

4.2 Generating data sets

① : Generation process

Generate the data dictionary -> build a data dictionary from the words in the training and validation sets. Its size is determined by the number of distinct words in the content. File name: dict_txt_all.txt.

Generate the training set and validation set -> file names: train_list_end.txt / valid_list_end.txt.
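As a rough sketch, the dictionary-generation step usually amounts to something like the following (the exact logic in create_data_start.py may differ, and the <unk> token is my assumption):

# build a word -> id dictionary from the generated text files
words = set()
for path in ("train.txt", "valid.txt"):
    with open(path) as f:
        for line in f:
            words.update(line.split())

word_dict = {w: i for i, w in enumerate(sorted(words))}
word_dict["<unk>"] = len(word_dict)      # assumed token for unseen words

with open("dict_txt_all.txt", "w") as f:
    f.write(str(word_dict))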

In this process, lemmatization (WordNetLemmatizer) is used to restore plurals, gerunds, perfect tenses, and so on to their base forms, and stop words are removed.
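For reference, lemmatization plus stop-word removal with NLTK looks roughly like this (a minimal sketch; the project's actual calls may differ):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# one-time setup: nltk.download('wordnet'); nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def normalize(tokens):
    out = []
    for t in tokens:
        t = lemmatizer.lemmatize(t)            # plural -> singular (noun pass)
        t = lemmatizer.lemmatize(t, pos="v")   # gerund/past tense -> base verb
        if t not in stop_words:
            out.append(t)
    return out

print(normalize(["cats", "are", "running", "quickly"]))   # ['cat', 'run', 'quickly']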

② : How to run

Run python pre_create_data.py to generate the train.txt and valid.txt files.

Then run: python create_data_start.py

Select 1 (training mode).

After the training set is generated, select 1 again to continue generating, and you get the result.

4.3 Running the training

① : Use GPU for training

To improve computation speed, I use the GPU for calculation and training.

The code is as follows:

# 1: CPU training is slow, so use the GPU for training
# 2: To ensure acc, please use the GPU
place = fluid.CUDAPlace(0)
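If no GPU is available, the usual CPU fallback plus executor setup looks like this (a sketch of the standard fluid 1.x pattern, not code from train.py):

use_gpu = False                                    # flip to True on a GPU machine
place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())           # initialize parameters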

② : Save the model

# delete the old model directory
shutil.rmtree(save_path, ignore_errors=True)
# create the directory that holds the model files
os.makedirs(save_path)
# save the inference model
fluid.io.save_inference_model(save_path,
                              feeded_var_names=[words.name],
                              target_vars=[model],
                              executor=exe)

The saved model is named infer_model.

③ : How to run

After steps ① and ②, you can run:

python train.py

4.4 Prediction

① : Use the GPU for prediction output

The output file is result.txt

The output results include the accuracy (acc).
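A minimal sketch of what the prediction side typically does with the saved infer_model (the word ids below are illustrative, not from valid.py):

import paddle.fluid as fluid

place = fluid.CUDAPlace(0)                # or fluid.CPUPlace()
exe = fluid.Executor(place)

# load the model saved by save_inference_model
[infer_program, feed_names, fetch_targets] = fluid.io.load_inference_model(
    dirname="infer_model", executor=exe)

# illustrative input: one sentence already converted to word ids
ids = [[8, 15, 42]]                       # hypothetical word ids
tensor_words = fluid.create_lod_tensor(ids, [[3]], place)

probs = exe.run(infer_program,
                feed={feed_names[0]: tensor_words},
                fetch_list=fetch_targets)
# probs[0] holds the softmax probabilities for each class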

4.5 Installing the framework

① : Visit the website paddlepaddle.org/paddle/Gett…

② : Installation

Installation information:

pip3 install paddlepaddle-gpu

Verification information:

Use python3 to enter the Python interpreter, type import paddle.fluid, then paddle.fluid.install_check.run_check(). If "Your Paddle Fluid is installed succesfully!" is printed, the installation succeeded.
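In interpreter form, that check is:

$ python3
>>> import paddle.fluid
>>> paddle.fluid.install_check.run_check()
Your Paddle Fluid is installed succesfully!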

Description information:

Python 3.5.0 or higher and pip 9.0.1 or higher are required. For more help, see "Installing with pip on Ubuntu".

Note: the pip3 install paddlepaddle-gpu command installs a PaddlePaddle build that supports CUDA 9.0 and cuDNN v7. If your CUDA or cuDNN version differs, see the installation commands applicable to other CUDA/cuDNN versions.

4.6 Code Testing

Environment:

GPU: NVIDIA Tesla V100
CPU cores: 2
Memory: 8 GB
System: Ubuntu
Video memory: 16 GB
Framework: PaddlePaddle 1.4.1 (Python 3.6)

Software: Jupyter Notebook

Address: aistudio.baidu.com

Model: infer_model

Training iteration log output:

Pass:5, Batch:40, Cost:1.07034, Acc:0.67969  Test:5, Cost:1.30199, Acc:0.58453
Pass:6, Batch:0, Cost:1.01823, Acc:0.73438   Test:6, Cost:1.29810, Acc:0.57989
Pass:6, Batch:40, Cost:1.02507, Acc:0.67969  Test:6, Cost:1.25205, Acc:0.58379
Pass:7, Batch:0, Cost:0.92054, Acc:0.76562   Test:7, Cost:1.24948, Acc:0.58575
Pass:7, Batch:40, Cost:1.03106, Acc:0.74219  Test:7, Cost:1.21637, Acc:0.59475
Pass:8, Batch:0, Cost:1.00770, Acc:0.75000   Test:8, Cost:1.21415, Acc:0.59182
Pass:8, Batch:40, Cost:0.84253, Acc:0.83594  Test:8, Cost:1.19011, Acc:0.60232
Pass:9, Batch:0, Cost:0.87655, Acc:0.77344   Test:9, Cost:1.18894, Acc:0.59915
Pass:9, Batch:40, Cost:0.88950, Acc:0.74219  Test:9, Cost:1.17149, Acc:0.60451
Pass:10, Batch:0, Cost:0.81905, Acc:0.77344  Test:10, Cost:1.16931, Acc:0.60939
Pass:10, Batch:40, Cost:0.70325, Acc:0.84375 Test:10, Cost:1.15738, Acc:0.61428
Pass:11, Batch:0, Cost:0.71462, Acc:0.81250  Test:11, Cost:1.15619, Acc:0.61525
Pass:11, Batch:40, Cost:0.68220, Acc:0.84375 Test:11, Cost:1.14702, Acc:0.61795
Pass:12, Batch:0, Cost:0.43453, Acc:0.96094  Test:12, Cost:1.14599, Acc:0.61428
Pass:12, Batch:40, Cost:0.66188, Acc:0.84375 Test:12, Cost:1.14097, Acc:0.61697
Pass:13, Batch:0, Cost:0.58125, Acc:0.85156  Test:13, Cost:1.14213, Acc:0.61916
Pass:13, Batch:40, Cost:0.64012, Acc:0.85938 Test:13, Cost:1.13701, Acc:0.62721
Pass:14, Batch:0, Cost:0.63643, Acc:0.82031  Test:14, Cost:1.13649, Acc:0.62697
Pass:14, Batch:40, Cost:0.50114, Acc:0.87500 Test:14, Cost:1.13436, Acc:0.62819
Pass:15, Batch:0, Cost:0.49957, Acc:0.90625  Test:15, Cost:1.13572, Acc:0.63014
Pass:15, Batch:40, Cost:0.47065, Acc:0.85938 Test:15, Cost:1.13555, Acc:0.62990
Pass:16, Batch:0, Cost:0.39780, Acc:0.91406  Test:16, Cost:1.13521, Acc:0.62869
Pass:16, Batch:40, Cost:0.37418, Acc:0.94531 Test:16, Cost:1.13725, Acc:0.63038
Pass:17, Batch:0, Cost:0.42288, Acc:0.92969
(some output omitted)

Prediction output (part):

Prediction result label: 3, name: 3, probability: 0.999515
Prediction result label: 8, name: 8, probability: 0.999912
Prediction result label: 3, name: 3, probability: 0.999656
Prediction result label: 5, name: 5, probability: 0.999305
Prediction result label: 3, name: 3, probability: 0.999532
Prediction result label: 9, name: 9, probability: 0.999810
Prediction result label: 9, name: 9, probability: 0.993418
Prediction result label: 9, name: 9, probability: 0.999761
Prediction result label: 8, name: 8, probability: 0.999687
Prediction result label: 1, name: 1, probability: 0.995791
Prediction result label: 8, name: 8, probability: 0.999731
Prediction result label: 5, name: 5, probability: 0.995905
Prediction result label: 5, name: 5, probability: 0.997542
Prediction result label: 9, name: 9, probability: 0.999930
Prediction result label: 2, name: 2, probability: 0.998152
Prediction result label: 9, name: 9, probability: 0.999630
Prediction result label: 5, name: 5, probability: 0.999668
Prediction result label: 5, name: 5, probability: 0.999181
Prediction result label: 8, name: 8, probability: 0.999770
Prediction result label: 9, name: 9, probability: 0.998988
Prediction result label: 9, name: 9, probability: 0.999110
Prediction result label: 3, name: 3, probability: 0.999672
Prediction result label: 3, name: 3, probability: 0.999675
(some output omitted)
The accuracy is: 0.981

The final accuracy was 0.981.

5. To summarize

In fact, most people work at the level of algorithm models and frameworks, so all we really need to know is how to use a framework correctly and produce the target results. There is no need for anything fancy. That's about it; typing all these words was exhausting ~~~~~