I. Purpose and requirements of the experiment

1) Purpose of the experiment

  • Understand the basic principles of training a classification model;
  • Master the process of interpreting and improving a model;
  • Become familiar with logistic regression for sarcastic text detection.

2) Experimental requirements

  • Write the source program for the experimental topic;
  • Anticipate the problems that may arise when running the program, and decide on the debugging steps and testing methods in advance;
  • Run the program on a sufficient amount of test data and analyze the results;
  • After the experiment, write the report carefully, analyzing and summarizing the problems encountered.

II. Experimental environment (tools, configuration, etc.)

  • Hardware: one computer;
  • Software: macOS. The experiment was developed in Jupyter Notebook.

III. Experimental content (experimental scheme, experimental steps, design ideas, etc.)

1) Experimental scheme

  • Follow the Lanqiao (Blue Bridge) Cloud Course to complete the experiment;
  • Extract features from word frequencies and weights (TF-IDF), then use a logistic regression classifier to detect sarcastic content;
  • Build a training set and a test set from the existing data, visualize the data with simple plots, then train the classification model and improve it to achieve higher accuracy.
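
To make the "word frequency and weight" step concrete, here is a minimal sketch of TF-IDF on a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents as its default, and it skips the final L2 normalization. The corpus and the function name tfidf are made up for illustration.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document (no L2 normalization)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    # Weight = raw term count in the document times the term's idf
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

toy_corpus = ["yeah right great idea", "great idea", "oh great"]
weights = tfidf(toy_corpus)
```

A word such as "great", which occurs in every toy document, gets the lowest possible idf, while a rarer word such as "yeah" is weighted more heavily.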

2) Experimental steps

  • Model definition;
  • Data processing and loading;
  • Training model;
  • Visualization of training process;
  • Test, and modify many times.

These steps reflect working habits worth adopting in deep learning projects and experiments. Each project usually takes many attempts and modifications before the model gives its best results.


3) Design ideas

  • The model should be highly configurable, so that parameters and the model itself are easy to modify and experiments are easy to repeat;
  • The code should be well organized and easy to read;
  • The code should be well commented so that others can understand it.
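
One possible way to meet the configurability goal is to collect the tunable settings in a single object. The sketch below is only an illustration: ExperimentConfig is a hypothetical name, and the default values simply mirror the parameters used later in this report.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the settings this experiment varies."""
    ngram_range: tuple = (1, 2)
    max_features: int = 50000
    min_df: int = 2
    C: float = 1.0
    random_state: int = 17

base = ExperimentConfig()
# Derive a variant for a repeated run without touching the baseline
unigram_only = replace(base, ngram_range=(1, 1))
```

Because the object is frozen, every run is described by its own immutable configuration, which makes repeated experiments easier to compare.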

IV. Experimental results

  1. Load the corpus and preview it:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()

Figure 1: Dataset preview


  1. Inspect the column information of the data set. The comment column has fewer non-null entries than the other features, which indicates missing values; those rows are simply dropped:
train_df.info()

Figure 2: Data set info

train_df.dropna(subset=['comment'], inplace=True)
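
As a small illustration of this cleaning step, the sketch below applies the same dropna call to a made-up three-row frame (not the real corpus). Only rows whose comment is missing are removed.

```python
import pandas as pd

# Toy stand-in for the corpus; the second row has no comment text
toy = pd.DataFrame({
    "label": [0, 1, 1],
    "comment": ["nice", None, "sure, great plan"],
    "subreddit": ["a", "b", "c"],
})

# Drop rows only when the 'comment' column is missing
cleaned = toy.dropna(subset=["comment"])
n_removed = len(toy) - len(cleaned)
```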

  1. Output the label counts to check whether the classes are balanced:
train_df['label'].value_counts()

Figure 3: Data label information
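
Checking the balance matters because accuracy is only meaningful relative to what a trivial classifier would score. The sketch below uses a hypothetical helper, majority_baseline, on made-up label lists to show the accuracy obtained by always predicting the most common class.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

balanced = [0, 1] * 50          # 50 normal, 50 sarcastic
skewed = [0] * 90 + [1] * 10    # 90 normal, 10 sarcastic
```

On a balanced set the baseline is 0.5, so any accuracy well above that reflects real learning; on the skewed list a useless model would already score 0.9.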


  1. Visualize the log-scaled text lengths of sarcastic and normal comments:
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
    np.log1p).hist(label='normal', alpha=0.5)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
    np.log1p).hist(label='sarcastic', alpha=0.5)
plt.legend()

Figure 4: Text lengths of sarcastic and normal comments
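
The histogram is drawn on log1p-transformed lengths. A minimal sketch of what that transform does, on made-up lengths: log1p(x) = ln(1 + x) compresses large values and stays defined for a length of 0.

```python
import numpy as np

# Hypothetical comment lengths spanning several orders of magnitude
lengths = np.array([0, 9, 99, 9999])

# ln(1 + x): an empty comment maps to 0, long comments are compressed
log_lengths = np.log1p(lengths)
```

After the transform, the four values are evenly easier to compare on one axis, which is why the two histograms can be overlaid.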


  1. Use groupby to rank the subreddits by their number of sarcastic comments:
sub_df = train_df.groupby('subreddit')['label'].agg(['size', 'mean', 'sum'])
sub_df.sort_values(by='sum', ascending=False).head(10)

Figure 5: Subreddits ranked by number of sarcastic comments
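
The aggregation can be checked by hand on a tiny made-up frame. Since label is 0 or 1, size is the number of comments, sum is the number of sarcastic comments, and mean is the sarcastic proportion. The subreddit names and labels below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "subreddit": ["politics", "politics", "politics", "aww", "aww"],
    "label": [1, 1, 0, 0, 1],
})

# One row per subreddit: comment count, sarcastic share, sarcastic count
stats = toy.groupby("subreddit")["label"].agg(["size", "mean", "sum"])
top = stats.sort_values(by="sum", ascending=False)
```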


  1. Draw word clouds with the wordcloud module:
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color = 'black', stopwords = STOPWORDS, max_words = 200, max_font_size = 100, random_state = 17, width = 800, height = 400)
plt.figure(figsize = (16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)

Figure 6: Word cloud of sarcastic comments

plt.figure(figsize = (16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)

Figure 7: Word cloud of normal comments


  1. For subreddits with more than 1000 comments, output the 10 with the highest proportion of sarcastic comments:
sub_df[sub_df['size'] > 1000].sort_values(by = 'mean', ascending = False).head(10)

Figure 8: Subreddits with the highest proportion of sarcastic comments


  1. For users with more than 300 comments, output the 10 with the highest proportion of sarcastic comments:
sub_df = train_df.groupby('author')['label'].agg(['size', 'mean', 'sum'])
sub_df[sub_df['size'] > 300].sort_values(by = 'mean', ascending = False).head(10)

Figure 9: Users with the highest proportion of sarcastic comments


  1. Train the sarcastic text classification model and evaluate its accuracy on the test set:
# Hold-out split, as in the attached source program
train_texts, valid_texts, y_train, y_valid = train_test_split(
    train_df['comment'], train_df['label'], random_state=17)
tf_idf = TfidfVectorizer(ngram_range = (1, 2), max_features = 50000, min_df = 2)
logit = LogisticRegression(C = 1, n_jobs = 4, solver = 'lbfgs', random_state = 17, verbose = 1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)

Figure 10: Prediction results

  1. Define a function, plot_confusion_matrix, that plots the confusion matrix:

Figure 11: The plot_confusion_matrix function
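
The counts behind such a plot can be sketched in a few lines. The helper confusion_counts and the label lists below are hypothetical; rows are true classes and columns are predicted classes, whereas the report's plot_confusion_matrix transposes the matrix so that rows are predictions.

```python
import numpy as np

def confusion_counts(actual, predicted, n_classes=2):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
```

The off-diagonal cells separate the two kinds of error (normal text flagged as sarcastic, and sarcasm missed), which a single accuracy number hides.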


  1. Use eli5 to display the weights that the classifier assigns to text features when making predictions:
import eli5
eli5.show_weights(estimator = tfidf_logit_pipeline.named_steps['logit'], vec = tfidf_logit_pipeline.named_steps['tf_idf'])

Figure 12: Weight of text features
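
A similar ranking can be read directly from the coefficients of a linear model. The sketch below uses made-up arrays as stand-ins for the vectorizer's feature names and the classifier's coefficient vector; the feature strings and weights are invented, not results from this experiment. In a binary logistic regression, a positive weight pushes the prediction toward class 1 (sarcastic) and a negative weight toward class 0.

```python
import numpy as np

# Hypothetical stand-ins for the fitted vectorizer's vocabulary
# and the fitted classifier's coefficient vector
feature_names = np.array(["yes because", "thanks", "obviously", "clearly", "the"])
coef = np.array([1.8, -1.2, 2.5, 2.1, 0.01])

order = np.argsort(coef)                          # ascending by weight
most_sarcastic = feature_names[order[::-1][:2]]   # two largest positive weights
most_normal = feature_names[order[:1]]            # most negative weight
```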


  1. Next, improve the model by adding the subreddit feature. Split it in the same way, and make sure to use the same random_state so that the split aligns with the comment data above:
subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
    subreddits, random_state=17)
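
The alignment requirement can be checked on toy data. The sketch below splits two made-up lists of equal length with the same random_state and verifies that each position in the two training sets refers to the same original record. It also shows an alternative that avoids the issue by splitting both arrays in a single call.

```python
from sklearn.model_selection import train_test_split

# Toy records: item i of each list belongs to the same original row
comments = [f"comment {i}" for i in range(20)]
subreddits = [f"sub {i}" for i in range(20)]

# Two separate calls with the same seed and the same number of rows
train_c, valid_c = train_test_split(comments, random_state=17)
train_s, valid_s = train_test_split(subreddits, random_state=17)

# Row k of each split should carry the same original index
aligned = all(c.split()[1] == s.split()[1] for c, s in zip(train_c, train_s))

# Alternative: one call keeps the arrays aligned by construction
tc, vc, ts, vs = train_test_split(comments, subreddits, random_state=17)
aligned_single_call = all(c.split()[1] == s.split()[1] for c, s in zip(tc, ts))
```

The single-call form is less fragile, because the alignment no longer depends on remembering to repeat the seed.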

  1. Next, build two TfidfVectorizer instances, again based on TF-IDF, to extract features from the comment and subreddit data respectively:
tf_idf_texts = TfidfVectorizer(
    ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))

  1. Use the TfidfVectorizer instances to extract the features:
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape

Figure 13: Feature extraction


  1. Concatenate the extracted features:
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
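
To show what the concatenation does, the sketch below stacks two tiny made-up sparse matrices. The rows (samples) stay the same, and the columns of the second matrix are appended after those of the first.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy feature blocks for two samples: 3 text features, 2 subreddit features
text_feats = csr_matrix(np.array([[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]))
sub_feats = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))

# Column-wise concatenation; convert to CSR for efficient row access
combined = hstack([text_feats, sub_feats]).tocsr()
```

Both blocks must have the same number of rows in the same order, which is why the two splits above need to be aligned.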

  1. Train logistic regression again on the combined features and predict:
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)

Figure 14: Accuracy after improvement

Problems encountered and solutions

  • Problem: The final result did not match my expectation; the accuracy was too low.
  • Solution: Logistic regression was reimplemented from scratch, the target value of the cost function was lowered, and the learning parameter was set to 0.1, which improved the accuracy to some extent. The code is as follows:
class LogisticRegression:
    def __init__(self, alpha=0.3):
        self.coef_ = 0.0
        self.intercept = 0.
        self.theta = None
        self.alpha = alpha
        self.cost_list = []
        # number of iterations
        self.iter_count = 0

    def fit(self, x, y, threshold):
        if x.shape[0] != y.shape[0]:
            raise ValueError('Input format error')
        self.m = x.shape[0]
        self.n_feature = x.shape[1]
        self.theta = np.zeros(self.n_feature)
        # Data normalization
        self.y = y
        self.x = self.normalize_(x)
        self.gradient_descent(threshold)
        self.coef_ = self.theta

    def gradient_descent(self, threshold):
        cost = 100000.0
        # threshold = 0.1

        # Trade running time for accuracy: iterate until the cost is small enough
        while abs(cost) > threshold:
            self.theta = self.theta - self.alpha * self.partialDerivative()
            self.intercept = self.intercept - self.alpha * self.iteratedFunctionForIntersect()
            cost = -self.Jfunction()
            # Keep the cost history for later visualization
            self.cost_list.append(cost)
            self.iter_count += 1

    # Data normalization: min-max scaling maps each feature to the interval [0, 1]
    def normalize_(self, x):
        offset = np.zeros(self.n_feature)
        scalar = np.ones(self.n_feature)
        for feature_idx in range(0, self.n_feature):
            col = x[:, feature_idx]
            col_min = col.min()
            col_max = col.max()
            if col_min != col_max:
                scalar[feature_idx] = 1.0 / (col_max - col_min)
            else:
                # Constant column: guard against dividing by zero
                scalar[feature_idx] = 1.0 / col_max if col_max != 0 else 1.0

            offset[feature_idx] = col_min

        x = (x - offset) * scalar
        return x

    # activation function
    def sigmoid(self, z):
        e_part = np.exp(-z)
        return 1 / (1 + e_part)

    def hypotheticFun(self, x):
        z = np.dot(self.theta, x) + self.intercept
        return self.sigmoid(z)

    def error_dist(self, x, y):
        return self.hypotheticFun(x) - y

    # cost function (mean log-likelihood; gradient_descent negates it)
    def Jfunction(self):
        sum = 0
        for i in range(0, self.m):
            h = self.hypotheticFun(self.x[i])
            sum += self.y[i] * np.log(h) + (1 - self.y[i]) * np.log(1 - h)
        return 1 / self.m * sum

    # Partial derivative of the cost with respect to theta, used by gradient descent
    def partialDerivative(self):
        h = np.zeros(self.m)
        for i in range(0, self.m):
            h[i] = self.hypotheticFun(self.x[i])

        dist = h - self.y
        # 1-D result with one entry per feature, the same shape as theta
        result = np.dot(dist, self.x) / self.m
        return result

    # Partial derivative with respect to the intercept: mean error between hypothesis and sample
    def iteratedFunctionForIntersect(self):
        sum = 0

        for i in range(0, self.m):
            err = self.error_dist(self.x[i], self.y[i])
            sum += err

        return 1 / self.m * sum

    def predict(self, x):
        # Data normalization (uses the min/max of the data passed in, not of the training data)
        x = self.normalize_(x)
        y_pred = []
        for element in x:
            y_pred.append(self.hypotheticFun(element))
        y_pred = np.array(y_pred)
        return np.array(y_pred >= 0.5, dtype='float')

    def plantCostDec(self):
        # Plot the recorded cost against the iteration number
        plt.plot(range(0, self.iter_count), self.cost_list, color="red", label="costFunNum")
        plt.legend()
        plt.show()

Figure 15: Comparison of model accuracy
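
For comparison, the same update rule can be written in vectorized form. This is a minimal sketch rather than the implementation used in the experiment: fit_logistic is a hypothetical name, the data are a made-up one-feature toy set already scaled to [0, 1], and a fixed number of iterations replaces the cost threshold so that the loop always terminates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, alpha=0.1, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy cost."""
    m, n = x.shape
    theta = np.zeros(n)
    intercept = 0.0
    costs = []
    for _ in range(n_iter):
        h = sigmoid(x @ theta + intercept)
        error = h - y
        # Gradient step for the weights and for the intercept
        theta -= alpha * (x.T @ error) / m
        intercept -= alpha * error.mean()
        # Clip to keep log() finite when predictions saturate
        h = np.clip(h, 1e-12, 1 - 1e-12)
        costs.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))
    return theta, intercept, costs

# Toy one-feature data: small values are class 0, large values are class 1
x = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta, intercept, costs = fit_logistic(x, y)
pred = (sigmoid(x @ theta + intercept) >= 0.5).astype(int)
```

Replacing the per-sample Python loops with matrix products gives the same gradients with far fewer interpreter steps, which matters once the data set grows beyond a toy example.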

V. Appendix: source program

import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()
train_df.info()
train_df.dropna(subset=['comment'], inplace=True)
train_texts, valid_texts, y_train, y_valid = \
  train_test_split(train_df['comment'], train_df['label'], random_state=17)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
  np.log1p).hist(label='sarcastic', alpha=0.5)
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
  np.log1p).hist(label='normal', alpha=0.5)
plt.legend()
!pip install wordcloud  # Install the necessary module
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                    max_words=200, max_font_size=100,
                    random_state=17, width=800, height=400)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)
sub_df = train_df.groupby('subreddit')['label'].agg(['size', 'mean', 'sum'])
sub_df.sort_values(by='sum', ascending=False).head(10)
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)
sub_df = train_df.groupby('author')['label'].agg(['size', 'mean', 'sum'])
sub_df[sub_df['size'] > 300].sort_values(by='mean', ascending=False).head(10)
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                         random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
                               ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)
def plot_confusion_matrix(actual, predicted, classes,
                        normalize=False,
                        title='Confusion matrix', figsize=(7, 7),
                        cmap=plt.cm.Blues, path_to_save_fig=None):
  """
  This function prints and plots the confusion matrix.
  Normalization can be applied by setting `normalize=True`.
  """
  import itertools
  from sklearn.metrics import confusion_matrix
  cm = confusion_matrix(actual, predicted).T
  if normalize:
      cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

  plt.figure(figsize=figsize)
  plt.imshow(cm, interpolation='nearest', cmap=cmap)
  plt.title(title)
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation=90)
  plt.yticks(tick_marks, classes)

  fmt = '.2f' if normalize else 'd'
  thresh = cm.max() / 2.
  for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
      plt.text(j, i, format(cm[i, j], fmt),
               horizontalalignment="center",
               color="white" if cm[i, j] > thresh else "black")

  plt.tight_layout()
  plt.ylabel('Predicted label')
  plt.xlabel('True label')

  if path_to_save_fig:
      plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')
plot_confusion_matrix(y_valid, valid_pred,
                    tfidf_logit_pipeline.named_steps['logit'].classes_, figsize=(8, 8))

!pip install eli5  # Install the necessary module
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                vec=tfidf_logit_pipeline.named_steps['tf_idf'])

subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
  subreddits, random_state=17)

tf_idf_texts = TfidfVectorizer(
  ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))

X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
X_train.shape, X_valid.shape
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)