This article mainly covers network security, anomaly detection, online public opinion monitoring, network traffic analysis, threat intelligence, sentiment analysis, and related areas, so the introduction below is organized around these topics. The datasets discussed are mainly open source; customer data involving privacy and data security is not covered here.

1. Network security

1.1 Intrusion detection and abnormal traffic detection

Research on intrusion detection requires a large amount of effective experimental data. Data can be collected with packet capture tools such as Tcpdump on Unix and WinDump on Windows, or with dedicated software such as Snort, Zeek, Argus, and Wireshark, which capture packets and generate connection records to serve as data sources.
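As a small illustration of what these tools produce: tcpdump's `-w` output is a libpcap file whose 24-byte global header can be parsed directly. The sketch below is a minimal pure-Python reader over synthetic bytes, assuming the classic little-endian, microsecond-timestamp pcap format; it is an illustration, not a full parser.

```python
import struct

PCAP_MAGIC_LE = 0xa1b2c3d4  # classic pcap magic (little-endian, microsecond timestamps)

def parse_pcap_header(data: bytes):
    """Parse the 24-byte libpcap global header (as written by `tcpdump -w`)."""
    magic, = struct.unpack("<I", data[:4])
    if magic != PCAP_MAGIC_LE:
        raise ValueError("not a little-endian classic pcap file")
    major, minor, _tz, _sigfigs, snaplen, linktype = struct.unpack("<HHiIII", data[4:24])
    return {"version": (major, minor), "snaplen": snaplen, "linktype": linktype}

# Build a synthetic header for demonstration: version 2.4, snaplen 65535, linktype 1 (Ethernet)
hdr = struct.pack("<IHHiIII", PCAP_MAGIC_LE, 2, 4, 0, 0, 65535, 1)
print(parse_pcap_header(hdr))  # {'version': (2, 4), 'snaplen': 65535, 'linktype': 1}
```

Per-packet records (16-byte record headers followed by packet bytes) follow the global header and can be walked the same way with `struct`.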

This article introduces the KDDCup99 network intrusion detection dataset, the UNSW-NB15 dataset, and the IDS2017 dataset, all of which are used in data-mining-based intrusion detection research.

1.1.1 KDDCup99

The KDDCup99 dataset is relatively old and rarely updated, yet domestic abnormal traffic research, master's theses, and similar work still rely mainly on it. Reported accuracy, false positive, and false negative rates vary across the relevant papers; the dataset is mainly used in academic research to validate algorithm optimizations and improvements.

The dataset contains four attack categories, in addition to normal traffic: DoS, R2L, U2R, and Probing.

Each connection record has 41 features, covering basic TCP connection features, content features, time-based traffic statistics (over a 2-second window), and host-based traffic statistics (over the preceding 100 connections). They are not analyzed one by one here.

Based on the papers surveyed, feature selection methods include information gain, PCA, correlation coefficients, and machine learning methods.
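Information gain, the first of those feature selection methods, scores a feature by how much knowing its value reduces the entropy of the class label. A self-contained sketch on toy protocol/label data (the values are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v p(X = v) * H(Y | X = v)."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy example: protocol type perfectly predicts the label,
# so the gain equals the full label entropy.
proto = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "dos", "dos", "dos", "dos"]
print(information_gain(proto, label))
```

Ranking all 41 features by this score and keeping the top ones is the usual filter-style selection seen in the KDD literature.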

Training models: both machine learning and deep learning methods are applied; neural networks generally give the best results.

Optimization: mainly improvements to the algorithm or model, such as using global optimization algorithms to escape the local optima of machine learning methods and to optimize the loss function.

There are many papers on this dataset; reading the English literature is recommended, as domestic approaches are largely the same.

1.1.2 UNSW-NB15 dataset

This dataset uses Zeek and Argus to collect network traffic data and label each connection.

The dataset contains nine attack types: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.

Each connection has 49 features: primarily flow features extracted by Argus, matched with Zeek features for protocols such as HTTP and FTP, plus statistical features computed over the preceding 100 connections.
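The 100-connection statistics are sliding-window counts, e.g. how many of the last 100 connections shared the same destination address (in the spirit of UNSW-NB15's `ct_dst_*` columns; the exact definitions are in the dataset's feature documentation). A minimal sketch of computing such a feature incrementally:

```python
from collections import Counter, deque

def dst_count_features(destinations, window=100):
    """For each connection, count how many of the previous `window`
    connections targeted the same destination (a ct_dst-style feature)."""
    recent = deque(maxlen=window)
    counts = Counter()
    out = []
    for dst in destinations:
        out.append(counts[dst])
        if len(recent) == recent.maxlen:     # about to evict the oldest entry
            counts[recent[0]] -= 1
        recent.append(dst)
        counts[dst] += 1
    return out

dsts = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]
print(dst_count_features(dsts, window=100))  # [0, 0, 1, 2]
```

The same deque-plus-counter pattern covers the other windowed counts (same service, same source, etc.).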

The dataset is relatively good, but the publicly released version is coarsely processed: many of the anomalous records are duplicates.
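Those duplicates inflate evaluation scores, so dropping exact repeats is a sensible first preprocessing step. A minimal order-preserving deduplication sketch (toy records for illustration):

```python
def drop_duplicates(rows):
    """Remove exact duplicate records while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)          # rows must be hashable to act as set keys
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [("dos", 1, 2), ("dos", 1, 2), ("normal", 3, 4)]
print(drop_duplicates(rows))  # [('dos', 1, 2), ('normal', 3, 4)]
```

With pandas the equivalent is `df.drop_duplicates()`; the important point is doing it before the train/test split so duplicates cannot leak across the split.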

Analysis of experimental results shows good performance when training and testing on the provided splits, but when models are applied to a real network environment to detect real attacks, their generalization ability is weak.

Reference: mathpretty.com/11062.html

1.1.3 IDS2017 Dataset

Attacks include brute-force FTP, brute-force SSH, DoS, Heartbleed, web attacks, infiltration, botnet, and DDoS.

More than 80 flow features were extracted using CICFlowMeter.
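CICFlowMeter's features are per-flow aggregates over packet timestamps and sizes (duration, byte rates, inter-arrival-time statistics, and so on). The sketch below computes a few such aggregates from (timestamp, size) pairs; the field names only informally mirror the tool's output and this is not its actual code.

```python
def flow_stats(packets):
    """Compute a few CICFlowMeter-style aggregates from (timestamp, size) pairs."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = max(times) - min(times)
    total = sum(sizes)
    iats = [b - a for a, b in zip(times, times[1:])]  # packet inter-arrival times
    return {
        "flow_duration": duration,
        "total_bytes": total,
        "bytes_per_s": total / duration if duration else 0.0,
        "mean_iat": sum(iats) / len(iats) if iats else 0.0,
    }

pkts = [(0.0, 100), (0.5, 200), (2.0, 300)]
print(flow_stats(pkts))
```

Grouping packets into flows first (by the 5-tuple of addresses, ports, and protocol) and applying this per group yields a feature table in the same spirit as the IDS2017 CSVs.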

Detection on this dataset suffers from the same weak-generalization problem.

Other datasets for reference: blog.csdn.net/jmh1996/art… and blog.csdn.net/answer3lin/…

2. Online public opinion

This area mostly involves collected customer privacy data, which is not introduced here. Open source datasets mainly cover sentiment, movie reviews, and other NLP data.

The main sources are datasets downloaded from GitHub, while data from the company's own threat intelligence center or big data platform is used mostly for general experiments and product development. This data is relatively raw, and much of the feature engineering is delicate and time-consuming, but it yields better results for the company's own business.

The data mainly includes: domestic and foreign public opinion texts, QQ and WeChat data, WeChat Moments, Weibo, publicly released real-time intelligence data, etc.

3. Webshell

Webshell detection is mainly at the experiment and demo validation stage; a detection engine approach is being used to incubate a product.

In the initial validation stage, datasets mainly come from collected and shared open source data. At present, the company determines attack vulnerabilities mainly by rule matching over syntax trees and semantics.
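Syntax-tree rule matching means parsing the script and matching rules against tree nodes rather than raw text, so obfuscated spacing or string tricks do not hide a dangerous call. As a hedged illustration of the idea only, the sketch below uses Python's stdlib `ast` module on Python source; a real webshell engine would parse PHP/JSP/ASP, and the rule set here is invented.

```python
import ast

DANGEROUS = {"eval", "exec", "system"}  # illustrative rule set, not a real ruleset

def suspicious_calls(source: str):
    """Walk the syntax tree and flag direct calls to dangerous functions,
    a toy version of tree-based rule matching."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS:
                hits.append((node.func.id, node.lineno))
    return hits

code = "x = input()\neval(x)\nprint('ok')"
print(suspicious_calls(code))  # [('eval', 2)]
```

The same traversal naturally extends to semantic rules, e.g. flagging `eval` only when its argument is tainted by user input.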

Data set: github.com/tennc/websh…

Test data: PCAP packets collected with capture tools are parsed to extract the text data in HTTP requests for realistic detection.
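Once the TCP payloads are reassembled from the PCAP, extracting the HTTP text is a matter of splitting on the header/body boundary. A minimal sketch over a hand-built request (the example bytes are invented for illustration):

```python
def extract_http_text(raw: bytes):
    """Split a raw HTTP request into request line, headers, and body text."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    request_line = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)
    return request_line, headers, body.decode("latin-1")

raw = (b"POST /upload.php HTTP/1.1\r\n"
       b"Host: example.com\r\n"
       b"Content-Length: 9\r\n"
       b"\r\n"
       b"<?php ?>\n")
req, hdrs, body = extract_http_text(raw)
print(req, hdrs["Host"], repr(body))
```

`latin-1` is used because it maps every byte to a character, so malformed or binary payloads never raise a decode error before the detector sees them.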

Internal testing: the engine was opened up internally for a period of time so that everyone could test it and report results.

4. WAF and URL data

WAF has evolved from static rules to semantic/syntactic analysis and finally to an AI engine. For your own experiments and demos, consider using open source datasets together with some feature processing methods (involving URL encoding, URL structure parsing, etc.).
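The URL preprocessing mentioned above (percent-decoding plus structural parsing) is covered by the stdlib `urllib.parse` module. A minimal sketch that surfaces an SQL-injection-looking payload hidden in an encoded query string (the URL is invented for illustration):

```python
from urllib.parse import parse_qs, unquote, urlsplit

def url_features(url: str):
    """Decode and structurally parse a URL into WAF-style inputs."""
    parts = urlsplit(url)
    decoded_path = unquote(parts.path)
    return {
        "path": decoded_path,
        "depth": decoded_path.count("/"),   # crude path-depth feature
        "params": parse_qs(parts.query),    # parse_qs percent-decodes values
    }

print(url_features("http://x.test/a/b.php?id=1%27%20OR%201%3D1"))
# params['id'] decodes to ["1' OR 1=1"]
```

Note that attackers sometimes double-encode payloads, so production pipelines typically decode repeatedly until the string stops changing.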

Open source datasets: these can be searched for on GitHub; because this was some time ago, the specific links were not saved (sorry).

Engine stage: the results of the second-generation semantic WAF product are used as a dataset (involving Base64 encoding, URL encoding, and label data containing noise and errors), and the model is strengthened according to specific needs.
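Because the labels are noisy and some fields arrive Base64-encoded, the preprocessing must decode what it can without ever crashing on malformed input. A minimal sketch of that defensive decoding step (the heuristics are assumptions for illustration, not the product's logic):

```python
import base64
import binascii
import re

B64_RE = re.compile(rb"^[A-Za-z0-9+/]+={0,2}$")

def try_b64_decode(field: bytes) -> bytes:
    """Decode a Base64-looking field so the model sees plaintext;
    fall back to the original bytes when decoding fails."""
    if len(field) % 4 == 0 and B64_RE.match(field):
        try:
            return base64.b64decode(field, validate=True)
        except binascii.Error:
            pass
    return field

payload = base64.b64encode(b"select * from users")
print(try_b64_decode(payload))        # b'select * from users'
print(try_b64_decode(b"not base64!")) # returned unchanged
```

The length/charset prefilter keeps ordinary text from being misinterpreted, though short plain words that happen to be valid Base64 can still decode; real pipelines add entropy or printability checks on the decoded result.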

Product incubation stage: mainly to empower WAF products.

5. User behavior detection

This area mainly involves internal systems, user systems, operations and maintenance systems, etc. The data used includes users' Internet access characteristics, O&M characteristics, consumption characteristics, and so on.

Dataset: all internal private data, which is difficult to share.

Experiments can be run on desensitized data from the United States and some domestic sources.

There is no specific data to share here. Other machine learning data, such as Kaggle and competition datasets, also has some value; some other kinds of data are not covered here (they are mainly for getting familiar with machine learning).