
1. Competition data

The dataset can be viewed and downloaded after registration. The data are news texts drawn from 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. The competition data consist of the following parts: a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples. To prevent participants from manually annotating the test sets, the competition text has been anonymized at the character level. A processed training sample looks like this:

label  text
6      57 44 66 56 23 3 37 5 41 9 57 44 47 45 33 13 63 58 31 17 47 0 1 69 26 60 62 15 21 12 49 18 38 20 50 23 57 44 45 33 25 28 47 22 52 35 30 14 24 69 54 7 48 19 11 51 16 43 26 34 53 27 64 8 4 42 36 46 65 69 29 39 15 37 57 44 45 33 69 54 7 25 40 35 30 66 56 47 55 69 61 10 60 42 36 46 65 37 5 41 32 67 6 59 47 0 1 68

The mapping between category names and numeric labels in the dataset is as follows:

{'technology': 0, 'stocks': 1, 'sports': 2, 'entertainment': 3, 'politics': 4, 'society': 5, 'education': 6, 'finance': 7, 'home': 8, 'games': 9, 'real estate': 10, 'fashion': 11, 'lottery': 12, 'horoscope': 13}
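Since both the training labels and model predictions are numeric, it is convenient to invert this mapping when inspecting results. A minimal sketch (the variable names here are my own):

name2id = {'technology': 0, 'stocks': 1, 'sports': 2, 'entertainment': 3,
           'politics': 4, 'society': 5, 'education': 6, 'finance': 7,
           'home': 8, 'games': 9, 'real estate': 10, 'fashion': 11,
           'lottery': 12, 'horoscope': 13}

# Invert the mapping to decode numeric labels back to category names
id2name = {v: k for k, v in name2id.items()}

print(id2name[6])  # education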

The data were collected from news articles on the Internet and anonymized, so participants can carry out their own data analysis and play to their strengths in feature engineering; there is no restriction on the use of external data or models. Columns are separated by tabs ('\t'), and the data can be read with pandas as follows:

import pandas as pd

train_df = pd.read_csv('../input/train_set.csv', sep='\t')
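As a starting point for that analysis, two quick checks are the class distribution and the document lengths. A minimal sketch, assuming the columns are named label and text as in the sample above:

# How balanced are the 14 classes?
print(train_df['label'].value_counts())

# Document length in tokens (each token is an anonymized character id)
train_df['text_len'] = train_df['text'].str.split().str.len()
print(train_df['text_len'].describe())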

2. Evaluation metric

The evaluation metric is the mean of the per-category F1 scores (macro F1). The predictions submitted by contestants are compared against the true categories of the test set; the higher the score, the better.

Formula: $F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The macro F1 score can be computed with sklearn:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# average='macro': compute F1 per class, then take the unweighted mean
f1_score(y_true, y_pred, average='macro')
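To make the "mean of per-category F1" definition concrete, the macro score can be reproduced by averaging the per-class F1 values:

import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# F1 for each class separately: array([0.8, 0., 0.])
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1 is their unweighted mean: 0.2666..., matching average='macro'
print(np.mean(per_class))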

3. Submission of results

Before submitting, make sure the prediction results have the same format as sample_submit.csv and that the submission file extension is .csv.
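A minimal sketch of writing such a file with pandas. The single label column and the file name are assumptions on my part; the authoritative format is whatever sample_submit.csv shows after registration:

import pandas as pd

# Placeholder predictions: one label id per test-set sample (test set A has 50,000)
predictions = [0] * 50000

# The column name 'label' is an assumption -- mirror sample_submit.csv exactly
submission = pd.DataFrame({'label': predictions})
submission.to_csv('submission.csv', index=False)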

4. Summary

The competition is a simple multi-class classification problem on text data: the category of each document is predicted from the features of its input text, and the final score is the F1 score between the predicted and correct classifications.

5. References

Datawhale NLP event – Task1 competition problem analysis

Tianchi_competition/Zero Basics Introduction NLP – News text classification at main · RxxxxR/Tianchi_competition · GitHub
