Introduction

Fake news can use multimedia content to mislead readers, spread quickly, cause negative impact, and even manipulate public events. How to identify whether a newly emerged event on social media is fake news is a unique challenge. In this paper, an end-to-end framework called Event Adversarial Neural Networks (EANN) is proposed to detect fake news based on multimodal features. Inspired by adversarial networks, EANN incorporates an event discriminator that predicts auxiliary event labels during the training phase; the corresponding loss is used to estimate the dissimilarity of feature representations across different events.

EANN consists of three main components: a multimodal feature extractor, a fake news detector, and an event discriminator. The multimodal feature extractor cooperates with the fake news detector to complete the main task of fake news detection, while at the same time trying to deceive the event discriminator in order to learn an event-invariant representation. Convolutional neural networks (CNNs) are used to automatically extract features from the textual and visual content of posts.

Research content

Model overview

The goal of the model is to learn transferable and discriminable feature representations for fake news detection. To achieve this, the EANN model integrates three main components: a multimodal feature extractor, a fake news detector, and an event discriminator, as shown in Figure 1:

  • Because posts on social media often contain different forms of information (such as text and attached images), the multimodal feature extractor (comprising a text feature extractor and a visual feature extractor) handles the different types of input.
  • After the latent textual and visual feature representations are learned, they are concatenated to form the final multimodal feature representation, on which both the fake news detector and the event discriminator are built.
  • The fake news detector takes the learned feature representation as input to predict whether a post is real or fake. The event discriminator predicts the event label of each post based on the same latent representation.

Multimodal feature extractor

Text feature extraction

The input of the text feature extractor is the sequential list of words in the text, and a convolutional neural network (CNN) is adopted as the core module of the text feature extractor.

As shown in Figure 1, a modified CNN model, called Text-CNN, is used; its architecture is shown in Figure 2. It uses multiple filters with different window sizes to capture features of different granularities for identifying fake news.

The text feature extractor first represents each word in the text as a word embedding vector. The $k$-dimensional embedding vector corresponding to the $i$-th word in the sentence is denoted $T_i \in \mathbb{R}^k$, so a sentence containing $n$ words can be expressed as:


$$T_{1:n} = T_1 \oplus T_2 \oplus \cdots \oplus T_n$$

$\oplus$ denotes the concatenation operator for vectors. A convolution filter with window size $h$ takes a sequence of $h$ consecutive words in the sentence as input and outputs one feature. Taking the sequence of $h$ consecutive words starting at the $i$-th word as an example, the filtering operation can be expressed as follows:


$$t_i = \sigma(W_c \cdot T_{i:i+h-1})$$

$\sigma(\cdot)$ is the ReLU activation function, and $W_c$ is the weight of the filter. Applying this operation to every window of the sentence yields the feature vector of the sentence:


$$t = [t_1, t_2, \dots, t_{n-h+1}]$$

A max-pooling operation is then applied to $t$ to extract the most important information.

To extract text features at different granularities, different window sizes are applied. For a particular window size there are $n_h$ different filters, so given $c$ possible window sizes there are $c \times n_h$ filters in total. The text features obtained after max-pooling can be expressed as $R_{T_c} \in \mathbb{R}^{c \times n_h}$. Finally, a fully connected layer is used to obtain the final text feature representation $R_T \in \mathbb{R}^p$, which is given the same dimension $p$ as the visual feature representation by the following operation:


$$R_T = \sigma(W_{tf} \cdot R_{T_c})$$

$W_{tf}$ is the weight matrix of the fully connected layer.
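
As a concrete illustration, here is a minimal PyTorch sketch of such a Text-CNN extractor. The hyperparameters (embedding size $k$, the window sizes, the number of filters $n_h$, and the output dimension $p$) are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, k=32, window_sizes=(1, 2, 3, 4), n_h=20, p=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, k)
        # One convolution per window size h: n_h filters, each spanning h words x k dims.
        self.convs = nn.ModuleList([nn.Conv2d(1, n_h, (h, k)) for h in window_sizes])
        self.fc = nn.Linear(len(window_sizes) * n_h, p)

    def forward(self, tokens):                   # tokens: (batch, n) word indices
        x = self.embedding(tokens).unsqueeze(1)  # (batch, 1, n, k)
        pooled = []
        for conv in self.convs:
            t = F.relu(conv(x)).squeeze(3)             # (batch, n_h, n-h+1)
            t = F.max_pool1d(t, t.size(2)).squeeze(2)  # max over time: (batch, n_h)
            pooled.append(t)
        r_tc = torch.cat(pooled, dim=1)          # R_{T_c}: (batch, c * n_h)
        return F.relu(self.fc(r_tc))             # R_T: (batch, p)
```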

Visual feature extraction

The input image of the visual feature extractor is denoted $V$, and a pre-trained VGG19 network is used to extract visual features. On top of the last layer of the VGG19 network, a fully connected layer is added to adjust the dimension of the final visual feature representation to $p$. The $p$-dimensional visual feature is defined as $R_V \in \mathbb{R}^p$, and the operation of the last layer in the visual feature extractor can be expressed as:


$$R_V = \sigma(W_{vf} \cdot R_{V_{vgg}})$$

$R_{V_{vgg}}$ is the visual feature representation obtained from the pre-trained VGG19, and $W_{vf}$ is the weight of the fully connected layer in the visual feature extractor.
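
A minimal PyTorch sketch of this visual extractor, assuming torchvision's pre-trained VGG19 (whose parameters stay frozen during training, as in the paper) and an illustrative output dimension $p = 32$:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class VisualExtractor(nn.Module):
    def __init__(self, p=32):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        # Keep VGG19 up to its penultimate layer (4096-d output) and freeze it,
        # so only the added fully connected layer below is trained.
        vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
        for param in vgg.parameters():
            param.requires_grad = False
        self.vgg = vgg
        self.fc = nn.Linear(4096, p)

    def forward(self, images):          # images: (batch, 3, 224, 224)
        r_vgg = self.vgg(images)        # R_{V_vgg}: (batch, 4096)
        return F.relu(self.fc(r_vgg))   # R_V: (batch, p)
```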

The text feature $R_T$ and the visual feature $R_V$ are concatenated to form the multimodal feature representation, written as:


$$R_F = R_T \oplus R_V \in \mathbb{R}^{2p}$$

The multimodal feature extractor is denoted $G_f(M;\theta_f)$, where $M$ is a set of multimodal (text and image) post samples serving as its input, and $\theta_f$ represents the parameters to be learned.
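
Combining the two extractors, a sketch of $G_f$ that produces $R_F$ by concatenation, reusing the TextCNN and VisualExtractor sketches above:

```python
import torch
import torch.nn as nn

class MultimodalExtractor(nn.Module):
    """G_f: concatenates text and visual representations into R_F (dim 2p)."""
    def __init__(self, vocab_size, p=32):
        super().__init__()
        self.text = TextCNN(vocab_size, p=p)
        self.visual = VisualExtractor(p=p)

    def forward(self, tokens, images):
        r_t = self.text(tokens)              # R_T: (batch, p)
        r_v = self.visual(images)            # R_V: (batch, p)
        return torch.cat([r_t, r_v], dim=1)  # R_F: (batch, 2p)
```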

Fake news detector

The fake news detector deploys a fully connected layer with softmax to predict whether a post's content is real or fake, taking the output $R_F$ of the multimodal feature extractor as input. The fake news detector is denoted $G_d(\cdot;\theta_d)$, where $\theta_d$ denotes all of its parameters. Letting $m_i$ denote the $i$-th post, the probability of the post being fake news is:


$$P_{\theta}(m_i) = G_d(G_f(m_i;\theta_f);\theta_d)$$

$Y_d$ denotes the set of ground-truth labels, and the detection loss is defined by cross entropy:


$$L_d(\theta_f,\theta_d) = -\mathbb{E}_{(m,y)\sim(M,Y_d)}\big[y\log P_{\theta}(m) + (1-y)\log(1-P_{\theta}(m))\big]$$

The optimal parameters $\hat{\theta_f}$ and $\hat{\theta_d}$ are found by minimizing this loss function.
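
As a sketch, the detector can be a single fully connected layer producing two-class logits; PyTorch's `nn.CrossEntropyLoss` (log-softmax plus negative log-likelihood) is then equivalent to the binary cross-entropy loss $L_d$ above:

```python
import torch.nn as nn

class FakeNewsDetector(nn.Module):
    """G_d: one fully connected layer over R_F producing real/fake logits."""
    def __init__(self, p=32):
        super().__init__()
        self.fc = nn.Linear(2 * p, 2)  # input is R_F, of dimension 2p

    def forward(self, r_f):
        return self.fc(r_f)            # softmax is applied inside the loss

# nn.CrossEntropyLoss = log-softmax + NLL, matching the detection loss L_d.
detection_loss = nn.CrossEntropyLoss()
# Usage: loss_d = detection_loss(detector(extractor(tokens, images)), labels)
```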

A major challenge in fake news detection comes from events not covered by the training dataset, which requires the model to learn transferable feature representations for newly emerging events. However, directly minimizing the detection loss only helps detect fake news on the events contained in the training set, so the model learns only event-specific knowledge, such as keywords or patterns. Instead, the model needs to learn more general features that capture what is common across all events; such a representation should be event-invariant and contain no event-specific features.

To achieve this goal, the uniqueness of each event needs to be removed: the dissimilarity of feature representations between different events is measured and minimized in order to capture event-invariant feature representations.

Event discriminator

The event discriminator is a neural network composed of two fully connected layers with corresponding activation functions. Its purpose is to correctly classify a post into one of $K$ events based on the multimodal feature representation. The event discriminator is denoted $G_e(R_F;\theta_e)$, where $\theta_e$ denotes its parameters. The loss of the event discriminator is defined by cross entropy:


$$L_e(\theta_f,\theta_e) = -\mathbb{E}_{(m,y)\sim(M,Y_e)}\Big[\sum_{k=1}^{K} \mathbb{1}_{[k=y]} \log G_e(G_f(m;\theta_f);\theta_e)\Big]$$

The goal of the event discriminator is to find the parameters $\hat{\theta_e}$ that minimize this loss function.

The loss $L_e(\theta_f,\hat{\theta_e})$ can be used to measure the dissimilarity of feature representations across different events: the larger the loss, the lower the dissimilarity. Therefore, in order to eliminate the uniqueness of each event, we need to find the parameters $\hat{\theta_f}$ that maximize $L_e(\theta_f,\hat{\theta_e})$.

This reflects the adversarial nature of the network: on one hand, the multimodal feature extractor tries to fool the event discriminator in order to maximize the discrimination loss; on the other hand, the event discriminator aims to discover the event-specific information contained in the feature representation so as to identify the event.
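
A minimal sketch of the event discriminator, assuming an illustrative hidden size of 64:

```python
import torch.nn as nn

class EventDiscriminator(nn.Module):
    """G_e: two fully connected layers classifying R_F into one of K events."""
    def __init__(self, num_events, p=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * p, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_events),
        )

    def forward(self, r_f):
        return self.net(r_f)  # event logits

event_loss = nn.CrossEntropyLoss()  # implements the event loss L_e
```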

Model integration

During the training phase:

  • The multimodal feature extractor $G_f(\cdot;\theta_f)$ cooperates with the fake news detector $G_d(\cdot;\theta_d)$ to minimize the detection loss $L_d(\theta_f,\theta_d)$.
  • The multimodal feature extractor $G_f(\cdot;\theta_f)$ tries to fool the event discriminator $G_e(\cdot;\hat{\theta_e})$ by maximizing the event discrimination loss $L_e(\theta_f,\theta_e)$.
  • The event discriminator $G_e(R_F;\theta_e)$ identifies the event of each post based on the multimodal feature representation, minimizing the event discrimination loss.

To sum up, the final loss of this adversarial process is defined as:


$$L_{final}(\theta_f,\theta_d,\theta_e) = L_d(\theta_f,\theta_d) - \lambda L_e(\theta_f,\theta_e)$$

$\lambda$ controls the trade-off between the fake news detection objective and the event discrimination objective (here $\lambda = 1$).

To optimize the parameters, EANN seeks a saddle point of the final objective function, i.e., the equilibrium of a minimax game:


$$(\hat{\theta_f},\hat{\theta_d}) = \arg\min_{\theta_f,\theta_d} L_{final}(\theta_f,\theta_d,\hat{\theta_e})$$

$$\hat{\theta_e} = \arg\max_{\theta_e} L_{final}(\hat{\theta_f},\hat{\theta_d},\theta_e)$$

Stochastic gradient descent is used to solve this optimization problem.

To implement this, a gradient reversal layer (GRL) is introduced: it acts as an identity function in the forward pass, while in the backward pass it multiplies the gradient by $-\lambda$ before passing the result to the preceding layer. The GRL can easily be inserted between the multimodal feature extractor and the event discriminator, as shown in Figure 1. The parameters of the feature extractor are then updated as:


$$\theta_f \leftarrow \theta_f - \eta\Big(\frac{\partial L_d}{\partial \theta_f} - \lambda\frac{\partial L_e}{\partial \theta_f}\Big)$$
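
A standard PyTorch implementation of the GRL uses a custom autograd function that is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back to the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: event_logits = discriminator(grad_reverse(r_f, lambd))
```

With the GRL inserted before the event discriminator, minimizing the single objective $L_d + L_e$ by SGD reproduces exactly the update rule above for $\theta_f$, while the discriminator itself still descends on $L_e$.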

To stabilize the training process, the learning rate $\eta$ is annealed as follows:


$$\eta' = \frac{\eta}{(1+\alpha \cdot p)^{\beta}}, \quad \alpha = 10, \quad \beta = 0.75$$

$p$ increases linearly from 0 to 1 as training progresses.
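
As a sketch, this annealing schedule can be implemented as a simple function of the training progress $p$ (the base rate and the example below are illustrative):

```python
def annealed_lr(eta0, progress, alpha=10.0, beta=0.75):
    """Decay the base learning rate eta0 as progress runs linearly from 0 to 1."""
    return eta0 / (1.0 + alpha * progress) ** beta

# Example: at the midpoint of training (progress = 0.5),
# annealed_lr(0.01, 0.5) ≈ 0.01 / 6**0.75 ≈ 0.0026
```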

The detailed steps of EANN are summarized in Algorithm 1.