
1. Competition data

The dataset can be viewed and downloaded after registration. The data are news texts drawn from 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. The competition data consist of the following parts: a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples. To prevent participants from manually annotating the test sets, the competition text has been anonymized at the character level. A processed training sample looks like this:

label  text
6      57 44 66 56 23 3 37 5 41 9 57 44 47 45 33 13 63 58 31 17 47 0 1 69 26 60 62 15 21 12 49 18 38 20 50 23 57 44 45 33 25 28 47 22 52 35 30 14 24 69 54 7 48 19 11 51 16 43 26 34 53 27 64 8 4 42 36 46 65 69 29 39 15 37 57 44 45 33 69 54 7 25 40 35 30 66 56 47 55 69 61 10 60 42 36 46 65 37 5 41 32 67 6 59 47 0 1 68

The mapping between category names and numeric labels in the dataset is as follows:

{'technology': 0, 'stocks': 1, 'sports': 2, 'entertainment': 3, 'politics': 4, 'society': 5, 'education': 6, 'finance': 7, 'home': 8, 'games': 9, 'real estate': 10, 'fashion': 11, 'lottery': 12, 'horoscope': 13}
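Since both the training labels and model predictions are numeric, it is convenient to invert this mapping when inspecting results. A minimal sketch (the variable names here are my own):

name2id = {'technology': 0, 'stocks': 1, 'sports': 2, 'entertainment': 3,
           'politics': 4, 'society': 5, 'education': 6, 'finance': 7,
           'home': 8, 'games': 9, 'real estate': 10, 'fashion': 11,
           'lottery': 12, 'horoscope': 13}

# Invert the mapping to decode numeric labels back to category names
id2name = {v: k for k, v in name2id.items()}

print(id2name[6])  # education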

The data were collected from news articles on the Internet and anonymized, so participants can carry out their own data analysis and play to their strengths in feature engineering; there is no restriction on the use of external data or models. Columns are separated by tabs ('\t'), and the data can be read with pandas as follows:

import pandas as pd

train_df = pd.read_csv('../input/train_set.csv', sep='\t')
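As a starting point for that analysis, two quick checks are the class distribution and the document lengths. A minimal sketch, assuming the columns are named label and text as in the sample above:

# How balanced are the 14 classes?
print(train_df['label'].value_counts())

# Document length in tokens (each token is an anonymized character id)
train_df['text_len'] = train_df['text'].str.split().str.len()
print(train_df['text_len'].describe())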

2. Evaluation metric

The evaluation metric is the mean of the per-category F1 scores (macro F1). The predictions submitted by contestants are compared against the true categories of the test set; the higher the score, the better.

Formula: $F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The macro F1 score can be computed with sklearn:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# average='macro': compute F1 per class, then take the unweighted mean
f1_score(y_true, y_pred, average='macro')
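To make the "mean of per-category F1" definition concrete, the macro score can be reproduced by averaging the per-class F1 values:

import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# F1 for each class separately: array([0.8, 0., 0.])
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1 is their unweighted mean: 0.2666..., matching average='macro'
print(np.mean(per_class))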

3. Submission of results

Before submitting, make sure the prediction results have the same format as sample_submit.csv and that the submission file extension is .csv.
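A minimal sketch of writing such a file with pandas. The single label column and the file name are assumptions on my part; the authoritative format is whatever sample_submit.csv shows after registration:

import pandas as pd

# Placeholder predictions: one label id per test-set sample (test set A has 50,000)
predictions = [0] * 50000

# The column name 'label' is an assumption -- mirror sample_submit.csv exactly
submission = pd.DataFrame({'label': predictions})
submission.to_csv('submission.csv', index=False)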

4. Summary

The competition is a simple multi-class classification problem on text data: the category of each document is predicted from the features of its input text, and the final score is the F1 score between the predicted and correct classifications.

5. References

Datawhale NLP event – Task1 competition problem analysis

Tianchi_competition/Zero Basics Introduction NLP – News text classification at main · RxxxxR/Tianchi_competition · GitHub
