HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness


Background

With the advancement of data acquisition technology, multimodal data is growing rapidly. Multimodal data fusion effectively improves performance in various application scenarios, such as object detection, sentiment analysis, emotion recognition, and medical diagnosis.

Early studies learned joint representations or performed predictions by combining multiple modalities, including traditional approaches such as early fusion and late fusion, graph-based late fusion methods, and deep fusion methods that focus on exploring multimodal interactions.

In the real world, multimodal data often suffers from missing modalities for various reasons, such as sensor failure, data corruption, and human recording errors. Effectively integrating and analyzing incomplete multimodal data is a challenging problem.

Missing modalities pose three main problems for multimodal fusion:

  1. Data combinations with different missing modalities have feature sets of different dimensions and sizes, which makes it difficult to apply a fusion model designed for complete multimodal data.
  2. Effective multimodal fusion requires learning complementary information, modality-specific information, and multimodal interactions, but because of the missing modalities, such information cannot be obtained directly from an incomplete individual sample.
  3. A large amount of missing data may greatly reduce the usable dataset size, making it difficult to learn high-dimensional interaction features from a small number of samples.

Previous studies usually handle missing modalities by either deleting incomplete samples or imputing the missing modalities. However, directly deleting incomplete samples significantly reduces the sample size, which may lead to overfitting of subsequent deep learning models, especially when a large number of samples have different missing modalities. Imputation methods try to generate the missing modalities from the observed ones, for example zero-filling, mean-filling, matrix completion, and deep learning-based completion; however, such methods may instead introduce new noise into the data and negatively impact model performance, and the imputation sometimes requires complex auxiliary models, such as deep generative models.

Main Research Contents

This paper studies incomplete multimodal data fusion based on heterogeneous graphs and is devoted to processing incomplete data without imputation. Several prior works have attempted this, for example:

  • Multi-source feature learning: divides the incomplete data into multiple sub-combinations and learns an integrated representation over them, turning the task into a sparse multi-task learning problem.
  • Multi-hypergraph learning: merges higher-order subgroup relationships and learns predictions from them directly.

Although the above methods provide solutions, they ignore inter-modal/intra-modal interactions and fail to learn the relationships between incomplete samples. The authors of this paper developed a new framework to extract such complex information and fuse multimodal data with missing modalities without deleting or imputing data.

This method is called Heterogeneous Graph-based Multimodal Fusion (HGMF). It first models the incomplete multimodal data in a heterogeneous graph structure. Then a graph neural network-based transductive learning framework is used to extract complementary information from the highly interactive incomplete multimodal data and fuse the information from different subspaces into a unified space. The specific contributions are as follows:

  • A heterogeneous hypernode graph (HHG) is proposed to model highly interactive multimodal data with different incomplete forms.
  • A graph neural network-based transductive learning framework is proposed to perform multimodal fusion of incomplete data on the constructed HHG.
  • Experiments at multiple levels of data missingness show that the method can handle real scenarios with high missing-data ratios.

Problem formalization

Incomplete forms of multimodal data

For an incomplete dataset with $M$ modalities, there are $2^M - 1$ different combinations of available modalities, so an incomplete multimodal dataset has at most $2^M - 1$ incomplete forms. The figure below shows a block-based structure diagram for a three-modality dataset with seven incomplete forms ($M = 3$); colored blocks are present modalities and X marks a missing modality. The figure also shows that instances can be divided into groups such that all instances in a group share the same missing form, and each instance belongs to exactly one form.
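As a quick illustration, the incomplete forms are exactly the non-empty subsets of the modality set, which can be enumerated directly (the modality names here are made up for the example):

```python
from itertools import combinations

# The incomplete forms are the non-empty modality subsets: 2^M - 1 of them.
modalities = ["image", "audio", "text"]  # M = 3
forms = [set(c)
         for r in range(1, len(modalities) + 1)
         for c in combinations(modalities, r)]
print(len(forms))  # 7
```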

Problem 2.1 Multimodal fusion with incomplete data

Assume $M$ is the number of modalities in the dataset, $N$ is the number of samples, $\psi$ is a function that maps each sample to a form, and $\phi(q) \subseteq \{1, \dots, M\}$ denotes the set of available modalities for form $q$. Given a set of incomplete multimodal data samples $D = \{\tilde{x}_i\}_{i=1}^N$ as input, where each sample contains only its available modalities, $\tilde{x}_i = \{x_{i,m}\}_{m \in \phi(\psi(i))}$, the goal is to fuse each sample's available modalities into a unified representation for downstream prediction.

Transductive learning

This paper proposes a transductive learning framework, which differs from inductive learning: transductive (instance-based) learning directly integrates the feature information contained in other samples. The key point is that an incomplete data sample can obtain missing information from other observed samples through the transductive framework. Instances with different missing modalities can effectively exchange their modality-specific and interaction information, achieving multimodal fusion in the process.

Research method

This paper proposes the HGMF method, which is built on a transductive graph neural network (GNN) learning framework and consists of three steps:

  1. Model the incomplete multimodal data in a heterogeneous hypernode graph structure.
  2. Encode the highly interactive multimodal data with missing modalities into more explicit modality-specific and cross-modal interaction information.
  3. Integrate interaction information across the different missing forms of the multimodal instances, fusing all data into the same embedding space.

The figure below shows an example of the three-stage HGMF workflow processing data samples with four missing forms.

Modeling incomplete multimodal data with a heterogeneous hypernode graph

An incomplete multimodal dataset containing multiple missing-data forms can be modeled as a k-NN graph structure, with each node representing an instance.

To implement the modeling, first define the HHG. An HHG can be defined as $G = (V, E, \psi, \phi)$, where:

  • In general, the nodes of a simple graph are associated with features of the same dimension, while hypernodes contain features of different numbers and dimensions, and the features within each hypernode may interact implicitly/explicitly. $V = \{v_i\}_{i=1}^N$ is defined as the set of hypernodes; in this paper, a hypernode is equivalent to a multimodal instance. The graph's feature set is defined as $X = \{\{x_{i,m} \mid \forall m \in \phi(\psi(i))\}, 1 \leq i \leq N\}$.

  • The edge set of the graph is defined as $E = \{e_j\}_{j=1}^{|E|}$. When constructing the k-NN graph over instances, hyperedges (as in hypergraphs) are used in place of pairwise edges to better represent higher-order relations between nodes. A hyperedge is a subset of (hyper)nodes that joins $k$ instances sharing similar information and represents the k-NN relationship among those nodes. $E$ can be expressed as an incidence matrix $H \in \{0,1\}^{|V| \times |E|}$, with each row corresponding to a hypernode $v_i$ and each column to a hyperedge $e_j$. For each entry of the matrix, $H(v_i, e_j) = 1$ means that hypernode $v_i$ is connected to some other hypernodes through hyperedge $e_j$. Each hyperedge $e_j$ is associated with a scalar weight $w_j$; all edge weights in this paper are 1.

  • $\psi: V \longmapsto T$ is the node-to-form mapping function, where $T = \{1, 2, \dots, \bar{M}\}$ is the set of forms and $\bar{M} = 2^M - 1$ is the number of incomplete forms observed in the dataset. The constructed graph is heterogeneous because the hypernodes (instances) have different incomplete forms, so $\psi$ is defined to distinguish the multimodal instances.

  • $\phi: T \longmapsto P(M) \backslash \emptyset$ is a function that maps a form to a combination of modalities, where $P(M)$ denotes the power set of $M = \{1, 2, \dots, M\}$.
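To make the two mapping functions concrete, here is a minimal sketch of how $\psi$ and $\phi$ could be stored for a three-modality dataset (the indices and form numbering are hypothetical):

```python
# psi: hypernode index -> form id in T = {1, ..., 7}
psi = {0: 7, 1: 3, 2: 1}

# phi: form id -> set of available modalities (modalities are 1, 2, 3)
phi = {1: {1}, 2: {2}, 3: {1, 2}, 4: {3},
       5: {1, 3}, 6: {2, 3}, 7: {1, 2, 3}}

available = phi[psi[0]]  # modalities observed for hypernode 0
```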

Construction of heterogeneous hypernode graph

The following figure shows the construction of an HHG. The left side of the figure shows a three-modality incomplete dataset $D$, with rows representing instances and columns representing modalities. From $D$ it is easy to obtain the functions $\phi(\cdot)$ and $\psi(\cdot)$; the feature set $X$ follows from the data availability and the corresponding features. On the right side of the figure is the seven-form HHG constructed from this input. Different colors indicate different forms, and each hypernode contains multiple modalities.

First, according to the different combinations of available modalities, the block-incomplete dataset is reorganized into $B$ blocks, and then the set of hyperedges among all related instances in each block is computed. Define $V_b$ and $M_b$ as the hypernode set and modality set associated with block $b$; the normalized distance between any two instances in the block can then be calculated (see the reconstruction below).
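The original displayed distance formula is not preserved in this copy; a plausible reconstruction, consistent with the normalizer $z_m$ defined in the next paragraph, is:

$$d(\tilde{x}_i, \tilde{x}_j) = \sum_{m \in M_b} \frac{\big\|u_m(x_{i,m}) - u_m(x_{j,m})\big\|_2^2}{z_m}$$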

Here $z_m = \sum_{i,j \in V_b} \|u_m(x_{i,m}) - u_m(x_{j,m})\|_2^2$ (with $\|\cdot\|_2$ denoting the $\ell_2$ norm), and $u_m(\cdot), m = 1, \dots, M$ are pre-trained single-modality representation learning models used to extract shallow single-modality features. After the distances between all instances are calculated, the k-nearest-neighbor algorithm is used to build a hyperedge centered on each node. As shown in figure (b), the $B$ hyperedge sets are computed independently for the predefined blocks, each using its own feature-extraction model parameters. Supposing their incidence matrices are $\{H_1, H_2, \dots, H_B\}$, the final incidence matrix of the HHG is the concatenation of these matrices, $H = [H_1; H_2; \dots; H_B]$. In this way, instances missing a modality can be associated with other instances that are not missing it, alleviating the data-missing problem.
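The following is a minimal NumPy sketch of the per-block hyperedge construction, assuming features have already been extracted by the pre-trained $u_m$ models and normalized; the function name and row/column layout are illustrative:

```python
import numpy as np

def build_block_incidence(feats: np.ndarray, k: int) -> np.ndarray:
    """Build one hyperedge per node from its k nearest neighbors.

    feats: (n, d) normalized per-block instance features.
    Returns an (n, n) 0/1 incidence matrix H_b:
    rows are hypernodes, columns are hyperedges.
    """
    # Pairwise squared Euclidean distances.
    diff = feats[:, None, :] - feats[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    n = feats.shape[0]
    H = np.zeros((n, n), dtype=np.int8)
    for j in range(n):
        # Hyperedge e_j joins node j and its k nearest neighbors.
        nearest = np.argsort(dist[j])[: k + 1]  # includes j itself
        H[nearest, j] = 1
    return H

# The final HHG incidence matrix stacks each block's hyperedge columns,
# with rows aligned to the full instance set V.
```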

Intra-hypernode encoder

After the input graph $G$ and feature set $X$ are constructed, an intra-hypernode encoder is used to capture complementary information from the highly interactive modalities with missing data. The encoder structure is shown in the figure below and consists of two components.

  • Single-modality embedding network: the input is single-modality data, and the output is a single-modality embedding.

Since the raw information in $X$ is extremely high-dimensional, sparse, and structurally inconsistent across modalities, it is difficult to compute interaction information between the raw modalities directly. Therefore, this paper sets up a series of modality-specific deep neural networks to learn compact and rich feature representations from the single-modality raw data.

Three architectures are mainly used to build the single-modality embedding networks (a sketch follows the list):

  1. Convolutional neural networks (CNN) for image modality embedding.
  2. Bidirectional long short-term memory networks (Bi-LSTM) for sequential modality embedding, such as video, free text (clinical records), and spoken language.
  3. Stacked fully connected layers with nonlinear activation functions for high-dimensional or sparse feature-based modalities, such as gene expression.
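Below is a minimal PyTorch sketch of two of these modality-specific embedding networks (PyTorch, the layer sizes, and the class names are assumptions for illustration, not the authors' released code):

```python
import torch
import torch.nn as nn

class MLPEmbed(nn.Module):
    """Stacked fully connected layers for high-dimensional sparse modalities."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class BiLSTMEmbed(nn.Module):
    """Bi-LSTM for sequential modalities; concatenates final hidden states."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, embed_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, x):              # x: (batch, seq_len, in_dim)
        _, (h, _) = self.lstm(x)       # h: (2, batch, embed_dim // 2)
        return torch.cat([h[0], h[1]], dim=-1)

# One network f_m per modality, e.g.
# f = {"text": BiLSTMEmbed(300, 32), "genes": MLPEmbed(20000, 32)}
```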

Assume $f_m(\cdot; \Theta_m)$ is the single-modality embedding network for modality $m$, with parameters $\Theta_m$. For each hypernode $v_i$ with feature set $\tilde{X}_i = \{x_{i,m}\}_{m \in \phi(\psi(i))}$, the modality-$m$ embedding is:


$$h_i^m = f_m(x_{i,m}; \Theta_m)$$

where $h_i^m \in \mathbb{R}^{F_m}$, $\mathbb{R}$ is the set of real numbers, and $F_m$ is the embedding dimension of modality $m$.

  • Feature interaction network: used to capture the interaction information among the modality embeddings and extract complementary information (modality-specific and cross-modal interaction information).

Let $P(\cdot)$ be the power-set operation and $M = \{1, 2, \dots, M\}$. Each subset $\forall S \in P(M) \backslash \emptyset$ is treated as a modality combination, and the network learns one piece of multimodal interaction information from each $S$, called a factor. The complementary information of a hypernode is composed of multiple factors, each of which can be expressed as an $F'$-dimensional vector. The factors are calculated as follows:

If $S$ contains only one element ($|S| = 1$), the computed factor carries modality-specific information, defined as $h_i^{m,m}$ (the original formula is an image; a reconstruction follows):
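One plausible form of this formula, consistent with the parameter shapes listed in the next paragraph (the summation of the two $F'$-dimensional terms is an assumption):

$$h_i^{m,m} = g_m(h_i^m) + g_{m,m}\big(\mathrm{vec}(G_i^m)\big), \qquad g_m(h) = \sigma(U_m h + b_m), \quad g_{m,m}(v) = \sigma(U_{m,m} v + b_{m,m})$$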

Here $U_m \in \mathbb{R}^{F' \times F_m}$, $b_m \in \mathbb{R}^{F'}$, $U_{m,m} \in \mathbb{R}^{F' \times (F_m)^2}$, and $b_{m,m} \in \mathbb{R}^{F'}$ are the parameters of the neural networks $g_m(\cdot)$ and $g_{m,m}(\cdot)$. $G_i^m \in \mathbb{R}^{F_m \times F_m}$ is the Gram matrix of the single-modality embedding $h_i^m$; it represents the covariance, i.e., the feature self-interaction information. $\bar{h}_i^m$ can be seen as the low-dimensional, modality-specific summary of modality $m$.

If $S$ contains more than one element ($|S| > 1$), the factor computes the cross-modal interaction information among the single-modality embeddings $\{h_i^m \mid \forall m \in S\}$, and $h_i^S$ is (again reconstructed below, since the original formula is an image):
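A plausible reconstruction, consistent with the description of $C_i^S$ and the weight shapes in the next paragraph:

$$h_i^S = g_S\big(\mathrm{vec}(C_i^S)\big), \qquad C_i^S = \bigotimes_{m \in S} h_i^m$$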

Here $C_i^S$ denotes the outer product of the $|S|$ single-modality embedding vectors, and $U_S \in \mathbb{R}^{F' \times \prod_{m \in S} F_m}$ and $b_S \in \mathbb{R}^{F'}$ are the learnable weights of the neural network $g_S(\cdot)$.
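A short PyTorch sketch of this cross-modal factor (the ReLU activation and the dimensions are assumptions):

```python
import torch
import torch.nn as nn

def cross_modal_factor(embeds, g_S):
    """h_i^S for |S| > 1: flattened outer product of embeddings, then g_S.

    embeds: list of 1-D tensors [h_i^m for m in S]
    g_S:    nn.Linear(product of their dims, F_prime)
    """
    c = embeds[0]
    for h in embeds[1:]:
        c = torch.einsum('i,j->ij', c, h).flatten()  # builds C_i^S pairwise
    return torch.relu(g_S(c))

# Example: two modalities with dims 32 and 16, F' = 24 (illustrative sizes).
g_S = nn.Linear(32 * 16, 24)
h1, h2 = torch.randn(32), torch.randn(16)
factor = cross_modal_factor([h1, h2], g_S)  # shape: (24,)
```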

Multiple bilevel attention layers

As can be seen from figure (a) above, the intra-hypernode encoder outputs a new HHG $G_{enc} = (V_{enc}, E, \psi, \phi)$ and a new feature set $X_{enc} = \{\tilde{h}_i\}_{i=1}^N$. This section mainly addresses the following sub-problem:

  • $\bar{M}$-heterogeneous graph embedding

Given the heterogeneous hypernode graph $G_{enc} = (V_{enc}, E, \psi, \phi)$, the node set can be divided into $\bar{M} = |T|$ disjoint subsets $V_{enc} = \{V_p \mid \forall p \in T\}$ based on $\psi(\cdot)$, and each node $v_i \in V_p$ is associated with a set of $F'$-dimensional vectors $\tilde{h}_i = \{h_i^S \mid \forall S \in P(\phi(p)) \backslash \emptyset\}$. The learning task is to map the hypernodes from the $\bar{M}$ embedding spaces into a uniform embedding space $Z \in \mathbb{R}^{N \times d}$, that is, to map hypernodes with different incomplete forms into the same feature space.

To solve this sub-problem, this paper proposes multi-fold Bilevel Graph Attention Networks (MBGAT). As shown in the figure below, the objective of each MBGAT layer is to map the existing features in the $\bar{M}$ spaces into new $\bar{M}$ spaces in which the features are closer to each other.

In the context of multimodal fusion, the difficulty lies in the unknown relationships between different incomplete forms. To solve this problem, the paper designs a bilevel attention strategy to aggregate neighborhood information of different forms. For each MBGAT layer, the input is the multi-space feature set $z = \{\{z_i^p \mid \forall v_i \in V_p\} \mid \forall p \in T\}$, where $z_i^p \in \mathbb{R}^{d_p}$ is the $d_p$-dimensional feature associated with node $v_i$ of form $p$. The output is the multi-space set $z' = \{\{{z'}_i^p \mid \forall v_i \in V_p\} \mid \forall p \in T\}$, where ${z'}_i^p \in \mathbb{R}^{d'_p}$ and $d'_p$ is the dimension of the new feature space of form $p$.

An MBGAT layer has two components:

  • Multiplex intra-form aggregation: independent aggregation of nodes in the same space; as the lower-level aggregation, it mainly aggregates neighbors from the same feature space (i.e., instances missing the same modalities).

Multiplex mapping: in preparation for aggregation, each node needs to be mapped into a new low-dimensional vector space. Define $\{W_{pq} \mid \forall p, q \in T\}$ as the $|T|$ mapping schemes, where $W_{pq} \in \mathbb{R}^{d'_q \times d_p}$ is a learnable matrix that maps nodes from the feature space of form $p$ to the new feature space of form $q$. The figure below illustrates multiplex mapping, in which nodes of different colors (different forms) are projected into different spaces.

Intra-form aggregation: define $N_q(i) = \{v_j \mid \forall v_j \in V_q \wedge (HH^T)_{ij} > 0\}$ as the set of form-$q$ neighbor nodes of $v_i$, where $H$ is the incidence matrix. The importance of each node in $N_q(i)$ to the target node $v_i$ can be measured by an attention mechanism with parameter $\vec{a}_q \in \mathbb{R}^{2d'_q \times 1}$, and can be calculated as follows (the original formula is an image; a GAT-style reconstruction is given below):
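A plausible GAT-style form of the attention coefficient, assuming standard softmax normalization over the neighborhood and a LeakyReLU as in graph attention networks (the exact form in the paper may differ):

$$\alpha_{ij}^q = \underset{v_j \in N_q(i)}{\mathrm{softmax}}\Big(\mathrm{LeakyReLU}\big(\vec{a}_q^{\,T}\,[\,W_{pq} z_i^p \,\|\, W_{qq} z_j^q\,]\big)\Big)$$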

Finally, aggregating over all form-$q$ neighbors of $v_i$, the intra-form aggregation result is (reconstructed below):
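A plausible reconstruction of the aggregation, consistent with the attention coefficients above and the activation noted next:

$$s_i^q = \sigma\Big(\sum_{v_j \in N_q(i)} \alpha_{ij}^q\, W_{qq}\, z_j^q\Big)$$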

$\sigma(\cdot)$ is the sigmoid function.

In the figure below, nodes of the same color (the same form) are aggregated into a particular double-circled feature point in the new space.

  • Inter-form aggregation: learns the relationships between forms and aggregates all the neighboring forms' results into a target space.

Given the intra-form aggregation results $\{s_i^1, s_i^2, \dots, s_i^{\bar{M}}\}$, $s_i^q \in \mathbb{R}^{d'_q}$, the importance of the form-$q$ neighbors to the form-$p$ target can be computed by an attention mechanism with parameter $b_p \in \mathbb{R}^{2d'_p \times 1}$; the attention coefficient is (reconstructed below):
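A plausible reconstruction of the inter-form attention coefficient, using the space transformation $V_{qp}$ defined in the next paragraph (the concatenation form is an assumption matching the $2d'_p$ shape of $b_p$):

$$\beta_{q \to p} = \underset{q \in T}{\mathrm{softmax}}\Big(\mathrm{LeakyReLU}\big(b_p^{\,T}\,[\, s_i^p \,\|\, V_{qp} s_i^q \,]\big)\Big)$$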

$V_{qp} \in \mathbb{R}^{d'_p \times d'_q}$ is the space-to-space transformation from form $q$ to form $p$. Finally, the embedding of the target node $v_i$ can be obtained as follows (reconstructed below):
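A plausible reconstruction of the final aggregation across forms:

$${z'}_i^p = \sigma\Big(\sum_{q \in T} \beta_{q \to p}\, V_{qp}\, s_i^q\Big)$$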

The pseudo-code of HGMF is as follows (the original figure is not reproduced here; a high-level sketch is given below):
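A high-level Python sketch of the three-step pipeline described above; all helper names (`group_by_available_modalities`, `knn_hyperedges`, `interaction_factor`, `mbgat_layers`, etc.) are hypothetical stand-ins, not the authors' code:

```python
def hgmf_forward(D, k):
    # Step 1: model the incomplete data as a heterogeneous hypernode graph.
    blocks = group_by_available_modalities(D)                 # B blocks
    H = concat_incidence([knn_hyperedges(b, k) for b in blocks])

    # Step 2: intra-hypernode encoding.
    factors = {}
    for i, sample in enumerate(D):
        embeds = {m: f[m](x) for m, x in sample.items()}      # h_i^m
        factors[i] = {S: interaction_factor(S, embeds)        # h_i^S
                      for S in nonempty_subsets(sample.keys())}

    # Step 3: stacked MBGAT layers fuse all forms into one space.
    z = factors
    for layer in mbgat_layers:
        z = layer(z, H)   # bilevel (intra-form + inter-form) attention
    return z              # unified embeddings Z in R^{N x d}
```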

Experiments

This section focuses on the datasets and the experimental environment.

Datasets

  1. ModelNet40: 3D CAD model dataset
  2. NTU: 3D shape dataset
  3. IEMOCAP: recorded conversation video dataset

Baseline models

Inductive multimodal fusion models

  1. Concat, Tensor Fusion Network (TFN)
  2. Low-rank Multimodal Fusion (LMF)
  3. Multitask Multimodal Learning (MTL)

Transductive models

  1. Hypergraph Neural Network (HGNN)

Experimental setup

All baselines and the proposed HGMF are implemented in Python 3.