Preface

This article uses simple machine learning algorithms to build a spam email classifier.

Let's get started.

Development tools

**Python version:** 3.6.4

Related modules:

scikit-learn;

jieba;

NumPy;

and some modules that ship with Python.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required modules.

Step-by-step implementation

(1) Divide the data set

Most of the spam detection data sets available online are in English, so as a show of good faith I spent some time finding a Chinese one. The data set is divided as follows:

Training data set:

7063 normal emails (in the data/normal folder);

7775 spam emails (in the data/spam folder).

Test data set:

392 emails in total (in the data/test folder).

(2) Create a dictionary

The content of the emails in the data set typically looks like this:

First, we use a regular expression to filter out non-Chinese characters, then use jieba to segment the text into words and remove stop words. Finally, we use the result to build a dictionary in the following format:

{"word 1": frequency of word 1, "word 2": frequency of word 2, …}

The implementation is in the **"utils.py"** file, which is called from the main program (train.py):

The final results are saved in the **”results.pkl”** file.
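The dictionary-building step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contents of utils.py: the function name build_word_counts, the tokenize parameter, and the Unicode range used to decide what counts as a Chinese character are all hypothetical choices. By default it segments text with jieba.lcut.

```python
import re
from collections import Counter

# Assumption: "Chinese characters" means the commonly used range U+4E00 to U+9FA5.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")


def build_word_counts(texts, stop_words=frozenset(), tokenize=None):
    """Return a {word: frequency} Counter over all given email texts."""
    if tokenize is None:
        import jieba  # imported lazily so a custom tokenizer can be passed instead
        tokenize = jieba.lcut
    counts = Counter()
    for text in texts:
        cleaned = NON_CHINESE.sub("", text)  # drop everything that is not Chinese
        for word in tokenize(cleaned):
            if word and word not in stop_words:
                counts[word] += 1
    return counts
```

Keeping the tokenizer as a parameter makes it easy to try the function on a few strings before running it over the whole training set.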

Are we done? Not yet!

The current dictionary contains 52,113 words, which is clearly too many. Some words appear only once or twice, so it makes little sense for each of them to take up a dimension in the subsequent feature extraction. We therefore keep only the 4,000 most frequent words as the final dictionary:

The final result is saved in the **"wordsdict.pkl"** file.
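Trimming the dictionary to its most frequent words might look like the sketch below. The helper names keep_top_words and save_pickle are hypothetical, and the article does not say how ties at the cutoff are handled, so this simply relies on Counter.most_common.

```python
import pickle
from collections import Counter


def keep_top_words(word_counts, top_n=4000):
    """Keep only the top_n most frequent words as a {word: frequency} dict."""
    return dict(Counter(word_counts).most_common(top_n))


def save_pickle(obj, path):
    """Serialize obj to path with pickle, for example "wordsdict.pkl"."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
```

For example, keep_top_words({"a": 5, "b": 1, "c": 3}, top_n=2) keeps "a" and "c" and drops "b".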

(3) Feature extraction

Once the dictionary is ready, we can convert the content of each email into a word vector. Its dimension is 4000, and each dimension holds the number of times one of the high-frequency words appears in that email. Finally, we stack these word vectors into one large feature matrix of size:

(7063 + 7775) × 4000

That is, the first 7063 rows are the feature vectors of the normal emails, and the remaining 7775 rows are the feature vectors of the spam emails.
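A minimal sketch of this conversion, assuming each email has already been split into a list of words and the dictionary is given as an ordered list of words. The function name build_feature_matrix and the label convention (0 for normal, 1 for spam) are assumptions made for illustration.

```python
import numpy as np


def build_feature_matrix(normal_emails, spam_emails, vocabulary):
    """Build a count matrix with normal-email rows first, then spam rows."""
    index = {word: i for i, word in enumerate(vocabulary)}
    emails = list(normal_emails) + list(spam_emails)
    features = np.zeros((len(emails), len(index)), dtype=np.int64)
    for row, words in enumerate(emails):
        for word in words:
            col = index.get(word)
            if col is not None:  # words outside the dictionary are ignored
                features[row, col] += 1
    # Assumed label convention: 0 = normal email, 1 = spam.
    labels = np.array([0] * len(normal_emails) + [1] * len(spam_emails))
    return features, labels
```

With 7063 normal emails, 7775 spam emails, and a 4000-word dictionary, features would have shape (14838, 4000).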

The implementation of the above is also in the **"utils.py"** file and is called in the main program as follows:

The final result is saved in the **"fvs_%d_%d.npy"** file, where the first %d is the number of normal emails and the second %d is the number of spam emails.
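As a small illustration of this naming scheme, the sketch below fills in the two placeholders and saves the matrix with np.save. The helper names feature_file_name and save_features, and the directory parameter, are hypothetical.

```python
import os

import numpy as np


def feature_file_name(num_normal, num_spam):
    """Fill the two %d placeholders: normal count first, spam count second."""
    return "fvs_%d_%d.npy" % (num_normal, num_spam)


def save_features(features, num_normal, num_spam, directory="."):
    """Save the feature matrix under the formatted name and return its path."""
    path = os.path.join(directory, feature_file_name(num_normal, num_spam))
    np.save(path, features)
    return path
```

For the training set used here, feature_file_name(7063, 7775) gives fvs_7063_7775.npy, and np.load can read the matrix back from that file.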

(4) Train the classifiers

We use the scikit-learn machine learning library to train the classifiers. Two models are used: a naive Bayes classifier and an SVM (support vector machine):
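A minimal sketch of this training step is given here. The article only names the two model families, so the specific scikit-learn classes (MultinomialNB and LinearSVC), their default parameters, and the function name train_classifiers are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def train_classifiers(features, labels):
    """Fit a naive Bayes model and a linear SVM on the same feature matrix."""
    nb_model = MultinomialNB()  # suits non-negative word-count features
    nb_model.fit(features, labels)
    svm_model = LinearSVC()  # assumed variant; a kernel SVC would also be possible
    svm_model.fit(features, labels)
    return nb_model, svm_model
```

Both models expose predict, so the same test matrix can be passed to each of them afterwards.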

(5) Performance test

Test the model with the test data set:

The results are as follows:

The two models perform about the same (the SVM is slightly better than naive Bayes), but the SVM is more inclined to classify an email as spam.
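One way to check an observation like this is to look beyond accuracy and count the two kinds of errors separately. The sketch below is an illustration only: the function name summarize_predictions and the label convention (0 for normal, 1 for spam) are assumptions, and it does not reproduce the numbers from the article.

```python
import numpy as np


def summarize_predictions(y_true, y_pred):
    """Return accuracy plus error counts, with 1 = spam and 0 = normal."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        # normal emails wrongly flagged as spam
        "false_spam": int(np.sum((y_true == 0) & (y_pred == 1))),
        # spam emails that slipped through as normal
        "missed_spam": int(np.sum((y_true == 1) & (y_pred == 0))),
        # how many emails the model labeled as spam in total
        "predicted_spam": int(np.sum(y_pred == 1)),
    }
```

A model that leans toward spam would show a higher false_spam count than the other model at a similar accuracy.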

That concludes this article. Thanks for reading. The Python mini-series is on hold for now; the next installment will start the Python gadget series.
