Original link: tecdat.cn/?p=4027

Business background

E-mail is now used very widely and has made everyday life much more convenient. Spam, however, is a by-product of that growth, and it causes a great deal of trouble for users, network administrators, and Internet service providers (ISPs). The problem keeps getting worse and has drawn wide attention from researchers. Spam usually refers to e-mail that is pushed into users' mailboxes without their permission. Because spam is sent in bulk, countering it requires technical measures. At present, anti-spam technology mainly covers spam filtering, security management of mail servers, and improvements to the Simple Mail Transfer Protocol (SMTP).

WEKA text segmentation preprocessing

First, the two classes of mail documents in the training set folder are analyzed. Their characteristics can be examined automatically from different angles, and an algorithm can then be written to build the classification model.

First, set the working directory and read in the classified text files.

This gives the frequency histogram of spam and non-spam.

The word frequency matrix file is then obtained by segmenting the original corpus into words.
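As a rough illustration of what a word frequency matrix contains, the sketch below builds one in plain Python. It is not the WEKA workflow used in the article (in WEKA this step is commonly done with the StringToWordVector filter, though the source does not say which tool was used). The function names and the two sample messages are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequency_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical two-message corpus, for illustration only.
docs = ["Win money now, win big", "Meeting notes for the project"]
vocabulary, rows = word_frequency_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one message and each column is one word, which is the shape a classifier expects as input.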

The class histogram for each word frequency is obtained.

After the word frequency matrix is obtained, the classifier is modeled

2. Analyze the attributes in the corpus and find those that contribute to classification (that is, which words appear only in positive, which appear only in negative, and which appear in both categories).

3. Find the classification rules that distinguish positive from negative (i.e., which words together lead to positive and which words together lead to negative).
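The attribute analysis in step 2 can be sketched with set operations. This is a minimal plain-Python illustration, not the WEKA procedure from the article. The function name, the whitespace tokenization, and the sample documents are assumptions made for the example.

```python
def partition_vocabulary(labeled_docs):
    """Split words into those seen only in positive, only in negative, or in both.

    labeled_docs is a list of (label, text) pairs with label "positive" or "negative".
    """
    positive, negative = set(), set()
    for label, text in labeled_docs:
        words = set(text.lower().split())
        if label == "positive":
            positive |= words
        else:
            negative |= words
    return {
        "positive_only": positive - negative,
        "negative_only": negative - positive,
        "both": positive & negative,
    }


# Hypothetical labeled documents, for illustration only.
sample = [
    ("positive", "free prize offer"),
    ("negative", "project offer review"),
]
groups = partition_vocabulary(sample)
print(sorted(groups["positive_only"]))
print(sorted(groups["negative_only"]))
print(sorted(groups["both"]))
```

Words that land in only one group are the strongest candidates for the classification rules in step 3. Words in the "both" group need frequency information to be useful.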

The result shows that the words "cell", "efficiengcy", "however", "breast", and "rates" have a strong influence on the final classification. For example, "however" is generally a negative word.
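One simple way to check a claim about a single word is to count the classes of the documents that contain it. The sketch below does this in plain Python. The function name and the sample documents are hypothetical and do not reproduce the article's data.

```python
from collections import Counter


def class_counts_for_word(word, labeled_docs):
    """Count, per class label, the documents whose text contains the given word."""
    counts = Counter()
    for label, text in labeled_docs:
        if word in text.lower().split():
            counts[label] += 1
    return counts


# Hypothetical labeled documents, for illustration only.
sample_docs = [
    ("negative", "however the results were poor"),
    ("negative", "it failed however"),
    ("positive", "however it worked well"),
    ("positive", "great results"),
]
however_counts = class_counts_for_word("however", sample_docs)
majority_label, majority_count = however_counts.most_common(1)[0]
print(dict(however_counts), majority_label)
```

In this toy data "however" appears in two negative documents and one positive document, so a single-word rule would predict negative whenever the word is present.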

WEKA text word segmentation results comparison

The accuracy and confusion matrix of each classifier are obtained below:

NaiveBayes

Logistic
J48
RandomForest
SVM
OneR
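For reference, accuracy and a confusion matrix can be computed from true and predicted labels as in the sketch below. The labels and predictions are made up for illustration and are not the article's results. The function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix where entry [i][j] counts items of class labels[i] predicted as labels[j]."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels and predictions, for illustration only.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(actual, predicted, ["spam", "ham"])
score = accuracy(actual, predicted)
print(matrix)
print(score)
```

Rows are the true classes and columns are the predicted classes, so the diagonal holds the correctly classified messages.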

Conclusion

Spam filtering based on discriminant methods has attracted little attention in modern research. The results clearly show that classification based on the random forest and SVM models can effectively improve the accuracy of spam filtering compared with traditional methods.
