Original link: tecdat.cn/?p=22956

Original source: Tuoduan Data Tribe (WeChat public account)

A Bayesian network (BN) is a probabilistic model based on a directed acyclic graph, describing a set of variables and their conditional dependencies on each other. It is a graphical model in which we can easily check the conditional dependencies of the variables and their directions in the graph. In this article, I will briefly show how to use Bayesian networks with R.

This tutorial is designed to introduce the basics of Bayesian network learning and inference, using real-world data to explore typical data analysis workflows for graphical modeling. Key points include:

  1. Preprocessing the data;
  2. Learning the structure and parameters of a Bayesian network;
  3. Using the network as a predictive model;
  4. Using the network for inference;
  5. Validating the network by comparing it with external information.

Quick introduction

Bayesian network

Definition

Bayesian networks (BNs) are defined by:

  • A network structure, a directed acyclic graph, in which each node corresponds to a random variable X_i;
  • A global probability distribution over the X_i (with parameters Θ), which can be factorized into smaller local probability distributions according to the arcs present in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation, thereby specifying the factorization of the global distribution.

Each local distribution has its own set of parameters Θ_{X_i}; and ⋃_{X_i} Θ_{X_i} is much smaller than Θ, because many parameters are fixed by the fact that the variables they belong to are independent.
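In symbols, the factorization implied by the graph is the standard one, with each variable depending only on its parents \( \Pi_{X_i} \):

\[ \Pr(X_1, \ldots, X_p) = \prod_{i=1}^{p} \Pr\!\left(X_i \mid \Pi_{X_i}\right). \]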

The R package bnlearn implements the following structure learning algorithms:

  • Constraint-based: PC, GS, IAMB, MMPC, Hiton-PC
  • Score-based: Hill-Climbing, Tabu Search
  • Pairwise: ARACNE, Chow-Liu
  • Hybrid: MMHC, RSMAX2

We will use a score-based learning algorithm, the hill-climbing algorithm. First, we generate a simple data set for this tutorial.

In this data set, 'state' is related to the 'element' and 'accept' columns, and 'type' is related to the 'color' column. When you create a data frame with categorical data, the columns should be of factor type; otherwise, the data frame cannot be used to create the BN structure.
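As a sketch of how such a data set might be generated (the column names follow the description above; the levels, probabilities, and sample size are illustrative assumptions):

> set.seed(1)
> n = 200
> state = sample(c("Outlier", "Target"), n, replace = TRUE)
> # 'element' and 'accept' depend on 'state'
> element = ifelse(state == "Outlier",
+                  sample(c("high", "low"), n, replace = TRUE, prob = c(0.8, 0.2)),
+                  sample(c("high", "low"), n, replace = TRUE, prob = c(0.2, 0.8)))
> accept = ifelse(state == "Outlier",
+                 sample(c("no", "yes"), n, replace = TRUE, prob = c(0.7, 0.3)),
+                 sample(c("no", "yes"), n, replace = TRUE, prob = c(0.3, 0.7)))
> # 'color' depends on 'type'
> type = sample(c("T1", "T2"), n, replace = TRUE)
> color = ifelse(type == "T1",
+                sample(c("red", "blue"), n, replace = TRUE, prob = c(0.9, 0.1)),
+                sample(c("red", "blue"), n, replace = TRUE, prob = c(0.1, 0.9)))
> # all columns must be factors
> simd = data.frame(state, element, accept, type, color, stringsAsFactors = TRUE)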

 

Next, we learn the network structure.
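Assuming the data frame is called simd, as in the sketch above, hill-climbing is a one-liner:

> library(bnlearn)
> hc_simd = hc(simd)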

We can see the structure in a diagram.

> plot(hc_simd)

In this diagram, state, element, accept, type, and color are called nodes. The directions between the nodes are described by arcs, stored as a matrix of from–to pairs.

As the arcs above show, there is a 'type' → 'color' relationship in our data, as well as 'state' → 'accept' and 'state' → 'element'. 'type' and 'state' form two separate groups that do not depend on each other. Next, we fit the model to the data.

> simd_fitted = bn.fit(hc_simd, simd)

Based on the fitted network, we can run conditional probability queries.

We check the probabilities of the state being "Outlier" or "Target".
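A sketch of such queries with cpquery() (the evidence shown here is an illustrative assumption; the original evidence was not preserved):

> cpquery(simd_fitted, event = (state == "Outlier"),
+         evidence = (element == "high") & (accept == "no"))
> cpquery(simd_fitted, event = (state == "Target"),
+         evidence = (element == "high") & (accept == "no"))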





The sample has a 51 percent chance of being an outlier.

The probability of the state being the target is 0%.

Bayesian network analysis of malocclusion data

Problem: In patients affected by Class III malocclusion (characterized by protrusion of the lower dental arch), skeletal imbalances develop early in life and become more pronounced during adolescence, before the bones mature. Early prediction of treatment success or failure in an individual Class III patient makes correction easier, but it is difficult to predict from only a small number of morphological determinants. The reason is that Class III malocclusion is rarely the result of a single craniofacial anomaly, so individual clinical and radiological measurements may be less indicative than the interactions among the measurements themselves.

Task:

  1. We learn a BN and use it to identify and visualize the interactions among the various Class III craniofacial features during growth and treatment.
  2. We test its consistency by verifying some commonly accepted hypotheses about the evolution of these skeletal imbalances.
  3. We show that untreated subjects develop different Class III craniofacial growth patterns compared with patients treated with rapid maxillary expansion and a face mask.
  4. In treated patients, the CoA segment (maxillary length) and the ANB angle (anteroposterior relation of the maxilla to the mandible) appear to be the primary skeletal subspaces affected by treatment.

Data

The data set we will use consists of 143 patients, with two sets of measurements taken at ages T1 and T2 (in years) for the following variables:

  • Treatment: untreated (NT), treated with a bad outcome (TB), treated with a good outcome (TG).
  • Growth: a binary variable with values Good or Bad.
  • ANB: angle between Down's point A and point B (degrees).
  • IMPA: incisor–mandibular plane angle (degrees).
  • PPPM: palatal plane – mandibular plane angle (degrees).
  • CoA: total maxillary length, from condylion to Down's point A (mm).
  • GoPg: length of the mandibular body, from gonion to pogonion (mm).
  • CoGo: length of the mandibular ramus, from condylion to gonion (mm).

All measurements are taken from X-ray scans, using a set of reference points set up as shown below.

> str(data)

Preprocessing and exploratory data analysis

First, we create a data frame containing the differences of all the clinical variables between the two measurements, as well as Growth and Treatment.

The Growth and Treatment variables carry redundant information about patient outcomes, as can be seen from the difference in the proportion of patients growing well between TB and TG.

To avoid the confounding caused by including both variables in the model, we recode Treatment as a binary variable, with 0 for NT and 1 for TB or TG. Similarly, we recode Growth with 0 for bad and 1 for good.
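A minimal sketch of the recoding, assuming the factor levels listed above:

> diff$Treatment = as.numeric(diff$Treatment != "NT")
> diff$Growth = as.numeric(diff$Growth == "Good")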

Since we will use a Gaussian BN (GBN) for the analysis, we check whether these variables are normally distributed; from the plots below, this does not appear to be true for all of them.


> for (var in c("dANB", "dPPPM", "dIMPA", "dCoA", "dGoPg", "dCoGo")) {
+   hist(diff[, var], prob = TRUE, main = var)
+   lines(density(diff[, var]), lwd = 2)
+ }

Are these variables related by a linear relationship? Some of them are, but not all.

> pairs(diff[, setdiff(names(diff), c("Treatment", "Growth"))])

Finally, we can see if these variables are clustered in any way, because clustered variables are more likely to be related in the BN.


> rho = cor(diff)
> heatmap(rho)

We can see two clusters in the heat map: one includes dCoGo, dGoPg, and dCoA, and the other includes Treatment, dANB, and dCoA. The latter cluster is clinically interesting because it includes Treatment together with two variables that are both associated with Down's point A, which gives some clues about the main effect of treatment.

> plot(ug)

Model #1: Static Bayesian network as differential model

Here we model the differences stored in diff rather than the raw values; we use a GBN because all the variables are numeric. Modeling the differences leads to local distributions in the form of regression models:

\[ \Delta X_i = \mu_{\Delta X_i} + \beta_j \Delta X_j + \beta_k \Delta X_k + \cdots + \varepsilon, \]

where \( \Delta X_j, \Delta X_k, \ldots \) are the other regressors. We can rewrite this regression in terms of rates of change:

\[ \frac{\Delta X_i}{\Delta T} = \tilde{\mu} + \tilde{\beta}_j \frac{\Delta X_j}{\Delta T} + \tilde{\beta}_k \frac{\Delta X_k}{\Delta T} + \cdots + \tilde{\varepsilon}. \]

This is a set of difference equations modeling rates of change, under the assumption that a linear relationship is a good approximation. However, this formulation still implies that the raw values vary linearly with time, because each rate of change depends on the rates of change of the other variables but not on time itself. To obtain a non-linear trend, we would need the rate of change to depend on time as well, for example

\[ \frac{\Delta X_i}{\Delta T} = \tilde{\mu} + \tilde{\beta}_T T + \cdots + \tilde{\varepsilon}. \]

Furthermore, including the Growth and Treatment variables means that we can have regression models of the form

\[ \Delta X_i = \mu_{\Delta X_i} + \beta_j \Delta X_j + \cdots + \beta_G \,\mathrm{Growth} + \beta_T \,\mathrm{Treatment} + \varepsilon, \]

thus allowing different rates of change depending on whether the patient shows favorable development of the malocclusion and on whether he is receiving treatment.

Learning Bayesian networks

Learning structure

The first step in learning a BN is learning its structure, i.e., the DAG. We can do this using the data (from the diff data frame) combined with prior knowledge; combining the two reduces the model space we have to explore and yields a more robust BN. A straightforward approach is to blacklist arcs encoding relationships we know to be impossible, and to whitelist arcs encoding relationships we know to exist.

A blacklist is just a matrix (or data frame) with columns from and to, listing the arcs we do not want in the BN:

  • We blacklist any arc from the orthodontic variables pointing to dT, Treatment, and Growth.
  • We blacklist the arc from dT to Treatment, meaning that whether a patient is treated does not depend on time.
  • We blacklist the arcs from Growth to dT and to Treatment, meaning that neither time nor treatment depends on the prognosis. (A sketch of the blacklist follows.)
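A sketch of how this blacklist can be assembled (the variable names follow the diff data frame; the arc set is inferred from the three rules above):

> ortho.vars = c("dANB", "dPPPM", "dIMPA", "dCoA", "dGoPg", "dCoGo")
> bl = rbind(
+   expand.grid(from = ortho.vars, to = c("dT", "Treatment", "Growth"),
+               stringsAsFactors = FALSE),
+   data.frame(from = "dT", to = "Treatment"),
+   data.frame(from = "Growth", to = c("dT", "Treatment")))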

The whitelist has the same structure as the blacklist:

  • We whitelist the dependence structure dANB → dIMPA ← dPPPM.
  • We whitelist the arc from dT to Growth, which allows the prognosis to change over time. (A sketch of the whitelist follows.)
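And a matching sketch of the whitelist:

> wl = rbind(
+   data.frame(from = c("dANB", "dPPPM"), to = "dIMPA"),
+   data.frame(from = "dT", to = "Growth"))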

A simple learning approach is to find the network structure that fits the whole data set best, for example using hc() with the default score (BIC) on the entire diff data frame, as sketched below.
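With the two lists in place, the call is simply:

> dag = hc(diff, whitelist = wl, blacklist = bl)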

For plotting, the key function is plot(); graphviz.plot() produces a cleaner layout and can highlight the whitelisted arcs:

> graphviz.plot(dag, shape = "ellipse", highlight = list(arcs = wl))

However, the quality of the DAG depends crucially on whether the variables are normally distributed and whether the relationships connecting them are linear; from the exploratory analysis, it is not clear that this is true for all the variables. Nor do we know which arcs represent strong relationships, that is, which ones are robust to perturbations of the data. We can address both problems with the bootstrap:

  1. Resample the data using the bootstrap.
  2. Learn a separate network from each bootstrap sample.
  3. Check how frequently each possible arc appears in those networks.
  4. Construct a consensus network from the arcs with high frequency.

> str.diff = boot.strength(diff, R = 200, algorithm = "hc",
+                          algorithm.args = list(whitelist = wl, blacklist = bl))

The return value of boot.Strength () includes, for each pair of nodes, the strength of the arc connecting them (for example, the frequency we observed dANB → dPPPM or dPPPM → dANB) and the strength of its direction (for example, when we observed an arc between dANB and dPPPM, We observed the frequency of dANB → dPPPM).

> attr(str.diff, "threshold")
[1] 0.585

Therefore, unless a different threshold is specified, averaged.network() takes all arcs with a strength of at least 0.585 and returns the averaged consensus network.

> avg.diff = averaged.network(str.diff)

We can incorporate the information we now have about arc strengths into the plot:

> strength.plot(avg.diff, str.diff, shape = "ellipse", highlight = list(arcs = wl))

 

How do we compare the averaged network (avg.diff) with the network we learned initially from all the data (dag)? The most qualitative approach is to plot the two networks side by side, with the same node positions, and highlight the arcs that appear in one network but not the other, or that appear with different directions.

> par(mfrow = c(1, 2))
> graphviz.compare(avg.diff, dag, shape = "ellipse", main = c("averaged DAG", "single DAG"))

We can see that arcs such as Treatment → dIMPA, dANB → dGoPg, and dCoGo → dPPPM appear only in the averaged network, while dPPPM → dANB appears only in the network learned from all the data. We can conjecture that the first three arcs are hidden by the noise in the data, the small sample size, and the deviations from normality. compare() can return the number of true positives (arcs present in both networks) and false positives/negatives (arcs present in only one of the two networks).

> compare(avg.diff, dag)

Or, called with arcs = TRUE, it returns the arcs in each category rather than their counts.

However, given that the networks were learned with BIC, which is score-equivalent, are all arc directions well determined? Looking at the CPDAGs of dag and avg.diff (and accounting for the whitelist and blacklist), we see that there are no undirected arcs: all arcs have unique directions.

Finally, we can combine the CPDAGs with compare() to make a principled comparison, counting two arcs as different only if their directions are uniquely determined and actually differ.

We should also look at the distribution of arc strengths relative to the threshold: the averaged network is fairly dense (17 arcs for 9 nodes) and hard to read.

> plot(str.diff)
> abline(v = 0.75, col = "tomato", lty = 2, lwd = 2)
> abline(v = 0.85, col = "steelblue", lty = 2, lwd = 2)

So it would be good to raise the threshold a little and drop a few more arcs. Looking at the figure above, the two natural choices for a higher threshold are 0.75 (red dashed line) and 0.85 (blue dashed line), where there are gaps in the arc strength distribution.

> nrow(str.diff[(str.diff$strength > attr(str.diff, "threshold")) & (str.diff$direction > 0.5), ])
[1] 18
> nrow(str.diff[(str.diff$strength > 0.75) & (str.diff$direction > 0.5), ])
[1] 15
> nrow(str.diff[(str.diff$strength > 0.85) & (str.diff$direction > 0.5), ])
[1] 12

The simpler network obtained by setting threshold = 0.85 in averaged.network() is shown below; qualitatively, it is certainly easier to reason about.

> avg.simpler = averaged.network(str.diff, threshold = 0.85)
> strength.plot(avg.simpler, str.diff, shape = "ellipse", highlight = list(arcs = wl))

Learning parameters

Having learned the structure, we can now learn the parameters. Since we are dealing with continuous variables, we model them with a GBN; if we fit the network's parameters by maximum likelihood, each local distribution is a classical linear regression.

> fitted.simpler = bn.fit(avg.simpler, diff)

We can easily confirm that this is the case by comparing the models produced by bn.fit() and lm(), for example for dANB:

 
> fitted.simpler$dANB

> summary(lm(dANB ~ Growth + Treatment, data = diff))

Could we run into collinearity? In theory yes, but in practice it is rarely a problem for network structures learned from data. The reason is that if two variables X_j and X_k are collinear, then after adding (say) the arc X_i ← X_j, the arc X_i ← X_k no longer improves BIC significantly, because X_j and X_k provide (to some extent) the same information about X_i.

> # gradually increase the correlation between the explanatory variables
> library(MASS)
> for (rho in seq(0, 0.9, length.out = 5)) {
+   # update the correlation matrix and generate the data (a sketch; the
+   # rest of the loop body was not preserved in the original)
+   R = diag(3)
+   R[upper.tri(R)] = R[lower.tri(R)] = rho
+   data = as.data.frame(mvrnorm(1000, mu = rep(0, 3), Sigma = R))
+   # ...
+ }

 

Comparing linear models

If the parameter estimates go wrong for any reason, we can replace them with a new set of estimates from a different method.


> fitted.simpler$dANB

> # a sketch: re-estimate the local distribution of dANB with a ridge penalty
> # (penalized package) and plug it back into the fitted BN; lambda2 is an
> # illustrative value
> library(penalized)
> dANB.new = penalized(response = diff$dANB,
+                      penalized = diff[, parents(avg.simpler, "dANB")],
+                      lambda2 = 20, trace = FALSE)
> fitted.simpler$dANB = dANB.new

Model validation

There are two main ways to verify a BN.

  1. Just look at network structure: If the primary goal of learning BN is to identify arcs and paths, as is often the case when BN is interpreted as a causal model, we can do essentially path analysis and study the strength of arcs.

  2. Considering the BN as a whole, including the parameters: if the main goal of learning the BN is to use it as an expert model, then we may want to:

    • predict the values of one or more variables for new individuals, given the values of some other variables; and
    • compare the results of CP queries with expert knowledge, to confirm that the BN reflects the best available knowledge about the phenomenon we are modeling.

Accuracy of prediction

We can measure the predictive accuracy of our chosen learning strategy in the usual way, with cross-validation. bn.cv() implements:

  • k-fold cross-validation;
  • cross-validation with user-specified folds;
  • hold-out cross-validation;

for:

  • structure learning algorithms (both the structure and the parameters are learned from the data);
  • parameter learning algorithms (the structure is provided by the user, and the parameters are learned from the data).

First, we examine Growth, which encodes the evolution of the malocclusion (0 for bad, 1 for good). We predict it, turn the predictions back into a discrete variable, and compute the classification error.

> cv.growth = bn.cv(diff, bn = "hc",
+   algorithm.args = list(blacklist = bl, whitelist = wl),
+   loss = "cor-lw", loss.args = list(target = "Growth", n = 200), runs = 10)
> err = numeric(10)
> for (i in 1:10) {
+   # tt: confusion table of observed vs. predicted Growth, both thresholded
+   # at 0.5 (the extraction of observed/predicted values is a sketch)
+   tt = table(unlist(lapply(cv.growth[[i]], `[[`, "observed")) > 0.5,
+              unlist(lapply(cv.growth[[i]], `[[`, "predicted")) > 0.5)
+   err[i] = (sum(tt) - sum(diag(tt))) / sum(tt)
+ }

The other variables are continuous, so we can estimate their predictive correlations instead.


> # the six orthodontic differences (order is an assumption)
> predcor = structure(numeric(6),
+   names = c("dCoGo", "dGoPg", "dIMPA", "dCoA", "dPPPM", "dANB"))
> for (var in names(predcor)) {
+   xval = bn.cv(diff, bn = "hc",
+     algorithm.args = list(blacklist = bl, whitelist = wl),
+     loss = "cor-lw", loss.args = list(target = var, n = 200), runs = 10)
+   predcor[var] = mean(sapply(xval, function(x) attr(x, "mean")))
+ }

In both cases we use the likelihood-weighting variants of the loss functions, which predict each target from the posterior expectation computed from all the other variables. The basic loss functions (cor, mse, pred) predict the value of each node from its parents only, which makes little sense for nodes with few or no parents.

Confirm with expert knowledge

Another way to check whether the BN makes sense is to treat it as a working model and see whether it expresses key facts about the phenomenon that were not used as prior knowledge during learning (otherwise we would simply be retrieving the information we put in a priori). Some examples:

  1. "Excessive CoGo growth should lead to a reduction in PPPM." We test this hypothesis by generating samples of dCoGo and dPPPM from the BN stored in fitted.simpler, conditional on no treatment taking place. As dCoGo increases (indicating faster and faster growth), dPPPM becomes more and more negative (indicating a decreasing angle, assuming the angle is initially positive).

    > # sketch: simulate dCoGo and dPPPM for untreated patients
    > sim = cpdist(fitted.simpler, nodes = c("dCoGo", "dPPPM"),
    +              evidence = (Treatment < 0.5))
    > plot(sim)
    > abline(v = 0, col = 2, lty = 2, lwd = 2)

  2. "A small increase in CoGo should lead to an increase in PPPM." From the figure above, negative or null growth of CoGo (dCoGo ≤ 0) corresponds to positive growth of PPPM with probability ≈ 0.60. For a small increase in CoGo (dCoGo ∈ [0, 2]), however, dPPPM ≤ 0 with probability ≈ 0.50, so the BN does not support this hypothesis.

    > nrow(sim[(sim$dCoGo <= 0) & (sim$dPPPM > 0), ]) / nrow(sim[(sim$dCoGo <= 0), ])
    [1] 0.6112532
    > nrow(sim[(sim$dCoGo > 0) & (sim$dCoGo < 2) & (sim$dPPPM > 0), ]) /
    +   nrow(sim[(sim$dCoGo > 0) & (sim$dCoGo < 2), ])
    [1] 0.4781784
  3. "If ANB decreases, IMPA decreases to compensate." Testing by simulation as before, we look for negative values of dANB paired with negative values of dIMPA (again indicating a decreasing angle, assuming it is initially positive). As can be seen from the figure below, dANB is positively related to dIMPA, so a decrease in one suggests a decrease in the other; the average trend (black line) bears this out.

    
    > # sketch: regenerate sim with dANB and dIMPA
    > sim = cpdist(fitted.simpler, nodes = c("dANB", "dIMPA"),
    +              evidence = (Treatment < 0.5))
    > plot(sim)
    > abline(coef(lm(dIMPA ~ dANB, data = sim)), lwd = 2)

  4. "If GoPg increases strongly, then both ANB and IMPA decrease." If we simulate dGoPg, dANB, and dIMPA from the BN, conditional on dGoPg > 5 (i.e., GoPg is increasing), we estimate the probability of dANB < 0 (i.e., ANB decreasing) at ≈ 0.70, and the probability of dIMPA < 0 at only ≈ 0.58.

    
    > # sketch: regenerate sim with the three variables of interest
    > sim = cpdist(fitted.simpler, nodes = c("dGoPg", "dANB", "dIMPA"),
    +              evidence = TRUE)
    > nrow(sim[(sim$dGoPg > 5) & (sim$dANB < 0), ]) / nrow(sim[(sim$dGoPg > 5), ])
    [1] 0.695416
    > nrow(sim[(sim$dGoPg > 5) & (sim$dIMPA < 0), ]) / nrow(sim[(sim$dGoPg > 5), ])
    [1] 0.5756936
  5. "Treatment attempts to prevent the reduction of ANB. If we fix ANB, is there a difference between treated and untreated patients?"

    First, we can examine the relationship between Treatment and Growth in patients with dANB ≈ 0, in the absence of any intervention (i.e., using the BN as we learned it from the data).

    > # sketch: simulate Treatment and Growth conditional on dANB being (nearly) fixed
    > sim = cpdist(fitted.simpler, nodes = c("Treatment", "Growth"),
    +              evidence = abs(dANB) < 0.1)
    > table(TREATMENT = sim$Treatment > 0.5, GOOD.GROWTH = sim$Growth > 0.5)

  6. The estimated P(GOOD.GROWTH | TREATMENT) differs between patients who receive treatment and those who do not (≈ 0.65 versus ≈ 0.52). If we model a formal intervention (in the sense of Judea Pearl) and set dANB = 0 externally (thus making it independent of its parents and removing the corresponding arcs), we find that GOOD.GROWTH in fact has the same distribution for treated and untreated patients, becoming independent of TREATMENT. This suggests that a favorable prognosis is indeed determined by the prevention of ANB changes, and that other components of the treatment (if any) become unimportant.

    > # sketch: the same table, recomputed after intervening to fix dANB = 0
    > table(TREATMENT = sim$Treatment > 0.5, GOOD.GROWTH = sim$Growth > 0.5)

  7. "Treatment attempts to prevent the reduction of ANB. If we fix GoPg instead, is there a difference in ANB between treated and untreated patients?"

    One way to assess this is to examine whether the angle between Down's point A and point B (ANB) evolves differently in treated and untreated patients while GoPg is kept fixed.

Assuming no change in GoPg, the angle between point A and point B increases for treated patients (a strongly negative value indicates a severe imbalance, so a positive rate of change indicates a reduction of the imbalance) and decreases for untreated patients (the imbalance slowly worsens over time).

> # sketch: sim is assumed to hold dANB and Treatment simulated with dGoPg ≈ 0
> treated = c("UNTREATED", "TREATED")[(sim$Treatment > 0.5) + 1L]
> boxplot(sim$dANB ~ treated)

Model #2: Dynamic Bayesian networks

The predictive performance of the dynamic Bayesian network is not as good as that of Model 1, and it is more complex. This complexity is inherent to dynamic Bayesian networks, i.e., Bayesian networks that model stochastic processes: each variable is associated with a different node for each time point being modeled. (Typically, we assume the process is first-order Markov, so we have two time points in the BN: t and t − 1.) Nevertheless, we explore it to show that such a BN can be learned and used with bnlearn.
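Formally, the first-order Markov assumption states that the current time slice depends on the past only through the previous slice,

\[ \Pr\!\left(X^{(t)} \mid X^{(t-1)}, X^{(t-2)}, \ldots\right) = \Pr\!\left(X^{(t)} \mid X^{(t-1)}\right), \]

which is why two time points (t and t − 1) suffice in the network.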

The data we use for this model are the raw values stored in the ortho data frame at the beginning of the analysis. However, we include the Treatment variable rather than the Growth variable to express the fact that a subject may be receiving medical treatment. The reason is that Growth measures the outcome at the second measurement, so its value is unknown at the first measurement, whereas Treatment is the same at both measurements.

Learning structure

First, we divide the variables into three groups: the variables at time T2, the variables at time T1 = T2 − 1, and the variables that do not depend on time because they have the same value at T1 and T2.


> # sketch: variables measured at T2 carry a "2" suffix in ortho (assumption)
> t2.variables = grep("2", names(ortho), value = TRUE)
> t1.variables = setdiff(names(ortho), c(t2.variables, "Treatment", "Growth"))
> const = "Treatment"

Then we construct a blacklist in which:

  1. We blacklist all arcs from the clinical variables to T1, T2, and Treatment, because we know that age and treatment are not determined by the clinical measurements.
  2. We blacklist all arcs pointing into Treatment and into the T1 variables, because we assume that the arcs between the T1 variables are the same as the corresponding arcs between the T2 variables, and there is no point in learning them twice.
  3. We blacklist all arcs from the T2 variables to the T1 variables.

> # sketch: the first piece of the blacklist, no arcs into the age variables
> bl = expand.grid(from = setdiff(names(ortho), c("T1", "T2")),
+                  to = c("T1", "T2"), stringsAsFactors = FALSE)

In contrast, we whitelist only the arc from T1 to T2, since the age at the second measurement obviously depends on the age at the first measurement.

> wl = data.frame(from = "T1", to = "T2")

Finally, we can learn the structure of the BN using bl and wl.

> # sketch: Growth is excluded from the model, as discussed above
> dyn.dag = hc(ortho[, setdiff(names(ortho), "Growth")],
+              blacklist = bl, whitelist = wl)

Obviously, this BN is more complex than the previous one: it has more nodes (16 versus 9), more arcs (27 versus 19), and therefore more parameters (218 versus 37).

The best way to plot this new model is to start from graphviz.plot() with render = FALSE, which returns the underlying graph object without drawing it.

> gR = graphviz.plot(dyn.dag, render = FALSE)

We then group the variables to make it easier to distinguish const, t1.variables, and t2.variables, and we choose to draw the network from left to right rather than top to bottom:


> # Rgraphviz: lay the graph out left-to-right and render it
> gR = layoutGraph(gR, attrs = list(graph = list(rankdir = "LR")))
> renderGraph(gR)

As in the previous model, Treatment acts on ANB: the only arcs leaving Treatment are Treatment → ANB2 and Treatment → CoA2. Once again, both of these children are associated with Down's point A.

Model averaging in structure learning

We want to assess the stability of the structure of this dynamic BN, just as we did for the static BN; we can do this with model averaging again.

> dyn.str = boot.strength(ortho[, setdiff(names(ortho), "Growth")], R = 200,
+                         algorithm = "hc",
+                         algorithm.args = list(blacklist = bl, whitelist = wl))
> plot(dyn.str)

> dyn.avg = averaged.network(dyn.str)

 

The averaged network (dyn.avg) and the original one (dyn.dag) are almost identical: they differ by only two arcs. This suggests that structure learning produces a stable output.

> compare(dyn.dag, dyn.avg)
tp fp fn
26  1  1

Learning parameters

Since Treatment is a discrete variable, this BN is a conditional linear Gaussian BN (CLGBN). This means that the continuous nodes that have Treatment as a parent are parameterized differently from the other nodes.

> dyn.fitted = bn.fit(dyn.avg, ortho[, setdiff(names(ortho), "Growth")])

 

We can see that ANB2 depends on ANB (that is, the same variable at the previous time point) and on Treatment. ANB is continuous, so it enters as a regressor for ANB2; Treatment is discrete and determines which component of the mixture of linear regressions applies, as sketched below.
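In other words, the local distribution of ANB2 is a mixture of linear regressions on ANB, with one component per treatment level:

\[ \mathrm{ANB2} \mid \mathrm{Treatment} = t \;\sim\; N\!\left(\mu_t + \beta_t \,\mathrm{ANB},\, \sigma_t^2\right), \qquad t \in \{\mathrm{NT}, \mathrm{TB}, \mathrm{TG}\}. \]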

Model validation and reasoning

We can put another set of questions to this new model:

  1. "How much does ANB change between the first and second measurement under the different treatments?" We can generate pairs (ANB, ANB2) with cpdist(), conditioning Treatment on NT, TB, and TG in turn, and look at their distributions:

    > # sketch: draw (ANB, ANB2) pairs under each treatment level with cpdist()
    > nt = cpdist(dyn.fitted, nodes = c("ANB", "ANB2"), evidence = (Treatment == "NT"))
    > tb = cpdist(dyn.fitted, nodes = c("ANB", "ANB2"), evidence = (Treatment == "TB"))
    > tg = cpdist(dyn.fitted, nodes = c("ANB", "ANB2"), evidence = (Treatment == "TG"))
    > effect = data.frame(
    +   diff = c(nt[, 2] - nt[, 1], tb[, 2] - tb[, 1], tg[, 2] - tg[, 1]),
    +   treatment = rep(c("NT", "TB", "TG"), c(nrow(nt), nrow(tb), nrow(tg))))
    > by(effect$diff, effect$treatment, FUN = mean)

     

    > library(lattice)
    > densityplot(~ diff, groups = treatment, data = effect, auto.key = TRUE)

    We know that treatment tries to stop the decrease of ANB; this is consistent with the fact that the NT distribution lies to the left of TB, and TB to the left of TG. Untreated patients keep deteriorating; patients who do not respond to treatment do not really improve, but do not get worse either; patients who respond to treatment improve.

By contrast, below is a simulated trajectory of an untreated patient under the same initial conditions.

The simulated trajectory for CoA is realistic: growth slows down with age. This differs from ANB, and happens because CoA2 depends on both T1 and T2, while ANB2 depends on neither.

> # sketch: predict ANB2 and CoA2 over a grid of initial conditions stored in
> # `intervals` (its construction was not preserved), using likelihood weighting
> for (i in seq(nrow(intervals))) {
+   sim = cpdist(dyn.fitted, nodes = c("ANB2", "CoA2"),
+                evidence = as.list(intervals[i, c("T1", "T2", "ANB", "CoA")]),
+                method = "lw")
+   w = attr(sim, "weights")
+   intervals[i, "ANB2"] = weighted.mean(sim$ANB2, w)
+   intervals[i, "CoA2"] = weighted.mean(sim$CoA2, w)
+ }

 

