Original link: http://tecdat.cn/?p=22956

A Bayesian network (BN) is a probabilistic model based on a directed acyclic graph that describes a set of variables and their conditional dependencies. It is a graphical model in which we can easily examine the conditional dependencies among variables and their directions in the graph. In this article, I will briefly show how to work with Bayesian networks in R.

This tutorial aims to introduce the basics of learning and reasoning in Bayesian networks, using real world data to explore typical data analysis workflows in graphical modeling. Key points will include:

  1. Preprocessing the data;
  2. Learning the structure and parameters of a Bayesian network;
  3. Using the network as a prediction model;
  4. Using the network for inference;
  5. Validating the network against external information.

Quick introduction

Bayesian network


Bayesian networks (BNs) are defined by:

  • A network structure, a directed acyclic graph, in which each node corresponds to a random variable;
  • A global probability distribution (with parameters) that can be decomposed into smaller local probability distributions based on the presence of arcs in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation, thereby specifying the factorization of the global distribution:

P(X) = ∏i P(Xi | ΠXi ; ΘXi)

Each local distribution has its own set of parameters ΘXi; and ⋃i ΘXi is much smaller than Θ, because many parameters are fixed by the conditional independencies among the variables.

The bnlearn package in R implements the following structure learning algorithms:

  • Constraint-based: PC, GS, IAMB, MMPC, Hiton-PC
  • Score-based: Hill-Climbing, Tabu Search
  • Pairwise: ARACNE, Chow-Liu
  • Hybrid: MMHC, RSMAX2

We use a score-based learning algorithm, Hill-Climbing (HC). First, we will generate a simple dataset for this tutorial.

In this dataset, the 'status' column is associated with the 'element' and 'accept' columns, and the 'type' column is related to the 'color' column. When you create a data frame with categorical data, the columns should be of factor type; otherwise, the data frame cannot be used to build the BN structure.
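As a sketch, such a data frame might be generated like this (the column names match the description above, but the dependence structure and value labels are illustrative assumptions):

```r
# Toy data: 'status' depends on 'element' and 'accept'; 'type' depends on
# 'color'. stringsAsFactors = TRUE makes every column a factor, as bnlearn
# requires for discrete networks.
set.seed(1)
n <- 200
element <- sample(c("A", "B"), n, replace = TRUE)
accept  <- sample(c("yes", "no"), n, replace = TRUE)
status  <- ifelse(element == "A" & accept == "yes", "target", "outlier")
color   <- sample(c("red", "blue"), n, replace = TRUE)
type    <- ifelse(color == "red", "t1", "t2")
simd <- data.frame(status, element, accept, type, color,
                   stringsAsFactors = TRUE)
```

After this, `sapply(simd, is.factor)` should return TRUE for every column.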


Next, we learn the network structure.
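With bnlearn, score-based learning is a single call to hc(); a minimal sketch on bnlearn's built-in learning.test data set (the tutorial's own call would be hc(simd) on the simulated data):

```r
library(bnlearn)

# Hill-Climbing with the default score (BIC) on a discrete data set.
hc_dag <- hc(learning.test)
arcs(hc_dag)    # the learned arcs, as a two-column from/to matrix
```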

We can see the structure in a diagram.

> plot(hc_simd)

In this graph, status, element, accept, type, and color are called nodes. The directions between the nodes are described by arcs, stored as a from/to matrix.

As the arcs above show, there are relationships between 'type' and 'color' in our data, and among 'status', 'accept', and 'element'. 'Type' and 'status' form two separate groups, with no interdependence between them. Next, we fit the model to the data.


Based on the fitted model, we can now run conditional probability queries.

We check the probabilities of the status values "Outlier" and "Target".

The sample has a 51% chance of being an "outlier".

The status has a 0% chance of being a "target".
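Such conditional probability queries are issued with cpquery() on a fitted network; a hedged sketch on bnlearn's built-in learning.test data (the event and evidence below are illustrative, not the tutorial's actual columns):

```r
library(bnlearn)

# Fit the parameters on the structure learned by hc(), then estimate
# P(A == "a" | B == "b") by sampling.
fitted <- bn.fit(hc(learning.test), learning.test)
p <- cpquery(fitted, event = (A == "a"), evidence = (B == "b"))
p
```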

Bayesian network analysis of malocclusion data

Problem: in patients affected by Class III malocclusion (characterized by protrusion of the dental arch), the skeletal imbalance develops early in life and becomes more pronounced during adolescence, before bone maturation. Early prediction of treatment success or failure in an individual Class III patient makes correction easier, but it is difficult to predict from only a small number of morphological determinants. The reason is that Class III malocclusions are rarely the result of a single abnormal craniofacial component, so individual clinical and radiological measurements may not be as indicative as the interactions among the measurements themselves.


  1. We learn a BN and use it to identify and visualize the interactions among the maxillofacial features of Class III malocclusion during growth and treatment.
  2. We test its consistency against some commonly accepted hypotheses about the evolution of these skeletal imbalances.
  3. We show that untreated subjects develop a different Class III craniofacial growth pattern than orthodontic patients treated with rapid maxillary expansion and face mask.
  4. In treated patients, the CoA segment (maxillary length) and the ANB angle (anteroposterior relationship of maxilla to mandible) appear to be the primary skeletal subspaces affected by treatment.


The dataset we will use contains 143 patients, with two sets of measurements taken at ages T1 and T2 (in years) for the following variables.

  • Treatment: untreated (NT), treated with bad outcome (TB), treated with good outcome (TG).
  • Growth: a binary variable with values Good or Bad.
  • ANB: angle between Down's points A and B (degrees).
  • IMPA: incisor-mandibular plane angle (degrees).
  • PPPM: palatal plane - mandibular plane angle (degrees).
  • CoA: total maxillary length, from condylion to Down's point A (mm).
  • GoPg: length of the mandibular body, from gonion to pogonion (mm).
  • CoGo: length of the mandibular ramus, from condylion to gonion (mm).

All measurements were taken from X-ray scans, using a set of reference landmarks, as shown below.

> str(data)

Preprocessing and exploratory data analysis

First, we create a data frame, diff, that contains the differences of all the variables, together with Growth and Treatment.

Growth and treatment variables carry redundant information about patient outcomes, as seen in the difference in the proportion of patients with good growth between TB and TG.

To avoid the confounding caused by including both variables in the model, we recode Treatment into a binary variable: 0 for NT and 1 for TB or TG. Similarly, we recode Growth so that 0 means Bad and 1 means Good.
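A base-R sketch of this recoding (assuming the original levels are "NT"/"TB"/"TG" and "Bad"/"Good", shown here on a small hypothetical data frame; the real diff has 143 rows):

```r
# Hypothetical example rows standing in for the real 'diff' data frame.
diff <- data.frame(Treatment = c("NT", "TB", "TG"),
                   Growth = c("Bad", "Good", "Good"))

# Treatment: 0 = NT, 1 = TB or TG; Growth: 0 = Bad, 1 = Good.
diff$Treatment <- as.numeric(diff$Treatment != "NT")
diff$Growth <- as.numeric(diff$Growth == "Good")
```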

Since we are going to use a Gaussian BN (GBN) for the analysis, we check whether these variables are normally distributed; from the figure below, this does not seem to be true of all of them.

> for (var in setdiff(names(diff), c("Growth", "Treatment"))) {
+   hist(diff[, var], prob = TRUE, main = var)
+   lines(density(diff[, var]), lwd = 2)
+ }

Are these variables related in a linear way? Some of them are, but not all.

> pairs(diff[, setdiff(names(diff), c("Growth", "Treatment"))])

And finally, we can see if these variables are clustered together in any way, because clustered variables are more likely to be related in BN.

> rho = cor(diff)
> heatmap(rho)

We can see two clusters in the heatmap: the first includes dCoGo, dGoPg, and dCoA, and the second includes Treatment, dANB, and dCoA. The second cluster is clinically interesting because it includes Treatment together with two variables that are both related to Down's point A, which provides some clues about the main effects of treatment.

plot(ug )

Model #1: Static Bayesian networks as differential models

Here, we model the data using the differences stored in diff instead of the raw values; we use a GBN because all the variables are numeric. Modeling differences results in local distributions in the form of regression models:

ΔXi = μ + βj ΔXj + βk ΔXk + … + ε

where ΔXj, ΔXk, … are the other regressors. Dividing through by the time interval ΔT, we can rewrite this regression as:

ΔXi/ΔT = μ̃ + β̃j (ΔXj/ΔT) + β̃k (ΔXk/ΔT) + … + ε̃

This is a set of differential equations modeling rates of change, under the assumption that the relationships are well approximated by linear ones. Note that this still implies that the raw values change linearly over time, because each rate of change depends on the rates of change of the other variables but not on time itself. To obtain a nonlinear trend, we would need the rates of change to depend on time as well (for instance, by including T1 or ΔT among the regressors).

In addition, including the Growth and Treatment variables means that we can have regression models of the form:

ΔXi = μ + βG Growth + βT Treatment + βj ΔXj + … + ε

This allows for different rates of change, depending on whether the patient is showing positive development in the deformity and whether he is receiving treatment.

Learn about Bayesian networks

Learning structure

The first step in learning a BN is learning its structure, i.e., the DAG. We can do this using the data (from the diff data frame) combined with prior knowledge; incorporating the latter reduces the model space we have to explore and produces more robust BNs. One straightforward approach is to blacklist arcs that encode relationships we know to be impossible, and to whitelist arcs that encode relationships we know to exist.

The blacklist is simply a matrix (or a data frame) with columns from and to, listing the arcs we do not want in the BN.

  • We blacklist any arc from the orthodontic variables pointing to dT, Treatment, and Growth.
  • We blacklist the arc from dT to Treatment. This means that whether a patient is treated does not change over time.
  • We blacklist the arcs from Growth to dT and to Treatment. This means that whether a patient receives treatment does not depend on elapsed time, and is not decided based on the prognosis.

The whitelist has the same structure as the blacklist.

  • We whitelist the dependence structure dANB → dIMPA ← dPPPM.
  • We whitelist the arc from dT to Growth, which allows the prognosis to vary over time.
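Putting the two lists together, a sketch of bl and wl (the variable names follow the tutorial; the exact set of orthodontic columns is an assumption):

```r
# Orthodontic difference variables (assumed names).
ortho.vars <- c("dANB", "dPPPM", "dIMPA", "dCoA", "dGoPg", "dCoGo")

# Blacklist: no arcs into dT/Treatment/Growth from the clinical variables,
# no dT -> Treatment, no Growth -> dT or Growth -> Treatment.
bl <- rbind(
  expand.grid(from = ortho.vars, to = c("dT", "Treatment", "Growth"),
              stringsAsFactors = FALSE),
  data.frame(from = "dT", to = "Treatment"),
  data.frame(from = c("Growth", "Growth"), to = c("dT", "Treatment")))

# Whitelist: dANB -> dIMPA <- dPPPM, plus dT -> Growth.
wl <- data.frame(from = c("dANB", "dPPPM", "dT"),
                 to   = c("dIMPA", "dIMPA", "Growth"))
```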

A simple way to learn the DAG is to find the network structure that best fits the whole data set, for example with hc() and its default score (BIC) on the entire diff data frame.

As for plotting, the key function is plot().

plot(dag, highlight = list(arcs = wl))

However, the quality of the DAG depends crucially on whether the variables are normally distributed and whether the relationships linking them are linear; from the exploratory analysis, it is not clear that this is true for all of them. We also do not know which arcs represent strong relationships, i.e., which ones are robust to perturbations of the data. We can address both problems with bootstrap resampling:

  1. Use Bootstrap to resample the data.
  2. Learn a separate network from each bootstrap sample.
  3. Check the frequency at which each possible arc occurs in the network.
  4. Construct a consensus network with arcs with high frequency.
str.diff = boot.strength(diff, R = 200, algorithm = "hc")

The return value of boot.strength() includes, for each pair of nodes, the strength of the arc connecting them (for example, how often we observe dANB → dPPPM or dPPPM → dANB) and the strength of its direction (for example, given that we observe an arc between dANB and dPPPM, how often we observe dANB → dPPPM).

attr(str.diff, "threshold")

Therefore, averaged.network() takes all arcs with a strength of at least 0.585 and returns the averaged consensus network, unless a different threshold is specified.

> avg.diff = averaged.network(str.diff)

Include the information we now have about the strength of the arc.

> strength.plot(avg.diff, str.diff, shape = "ellipse", highlight = list(arcs = wl))


How do we compare the averaged network (avg.diff) with the network (dag) we learned initially from the whole data set? The most qualitative approach is to plot the two networks side by side, with the nodes in the same positions, and highlight the arcs that appear in one network but not in the other, or that appear with different directions.

> par(mfrow = c(1, 2))
> graphviz.compare(avg.diff, dag, shape = "ellipse", main = c("averaged DAG", "single DAG"))

We can see that Treatment → dIMPA, dANB → dGoPg, and dCoGo → dPPPM appear only in the averaged network, while dPPPM → dANB appears only in the network learned from the whole data. We can speculate that the first three arcs were hidden by the noise in the data, the small sample size, and departures from normality. compare() can return the number of true positives (arcs appearing in both networks) and false positives/negatives (arcs appearing in only one of the two).

> compare(avg.diff, dag)

Or, with arcs = TRUE, it returns the arcs themselves.

But, given that the networks were learned with BIC, which is score-equivalent, are the directions of all the arcs well determined? Looking at the CPDAGs of dag and avg.diff (taking the whitelist and blacklist into account), we see that there are no undirected arcs: every arc has a unique direction.

Finally, we can combine the two to make a principled comparison: two networks differ only in arcs whose directions are uniquely identified and disagree.

We should also look at the threshold against the distribution of the arc strengths: the averaged network is fairly dense (17 arcs for 9 nodes) and hard to read.

> plot(str.diff)
> abline(v = 0.75, col = "tomato", lty = 2, lwd = 2)
> abline(v = 0.85, col = "steelblue", lty = 2, lwd = 2)

So we can simply raise the threshold a little and remove a few more arcs. Looking at the figure above, given the gaps in the arc strength distribution, two natural choices for a higher threshold are 0.75 (red dashed line) and 0.85 (blue dashed line).

> nrow(str.diff[(str.diff$strength > attr(str.diff, "threshold")) & (str.diff$direction > 0.5), ])
[1] 18
> nrow(str.diff[(str.diff$strength > 0.75) & (str.diff$direction > 0.5), ])
[1] 15
> nrow(str.diff[(str.diff$strength > 0.85) & (str.diff$direction > 0.5), ])
[1] 12

The simpler network obtained by setting threshold = 0.85 in averaged.network() is shown below; qualitatively, it is certainly easier to reason about.


Learning parameters

Having learned the structure, we can now learn the parameters. Since we are dealing with continuous variables, we model them with a GBN; if we fit the network parameters by maximum likelihood, each local distribution is an ordinary linear regression.

> avg.simpler = averaged.network(str.diff, threshold = 0.85)
> fitted.simpler = bn.fit(avg.simpler, diff)

We can easily confirm this by comparing the models produced by bn.fit() and lm(), for example for dANB.

 > summary(lm(dANB ~ Growth + Treatment, data = diff))

Will collinearity be a problem? In theory it could be, but in practice it rarely is when learning the network structure from data. The reason: if two variables Xj and Xk are collinear, then after adding (say) Xj → Xi, adding Xk → Xi no longer improves BIC significantly, because Xj and Xk provide (to some extent) the same information about Xi.

> # gradually increase the correlation between the explanatory variables.
> for (rho in seq(0, 0.8, by = 0.2)) {
+   # update the correlation matrix and generate the data.
+   R[1, 2] = R[2, 1] = rho
+   data = as.data.frame(mvrnorm(1000, mu = rep(0, 3), Sigma = R))
+   # compare the BICs of the competing linear models.
+   cat("rho:", rho, "BIC:", ...)
+ }



If the parameter estimates go wrong for any reason, we can replace them with a new set of estimates from a different method.


> library(penalized)
> # e.g. ridge-regression estimates for dANB (lambda2 is the ridge penalty).
> fitted.simpler$dANB = penalized(diff$dANB,
+   penalized = diff[, parents(fitted.simpler, "dANB")], lambda2 = 20)

Model validation

There are two main ways to validate a BN:

  1. Looking only at the network structure: if the main goal of learning the BN is to identify arcs and paths, as is usually the case when the BN is interpreted as a causal model, we can perform a path analysis and study the strengths of the arcs.
  2. Considering the BN as a whole, including the parameters: if the main goal is to use the BN as an expert system, we may want to:
  • predict the values of one or more variables for new individuals, given the values of some other variables; and
  • compare the results of conditional probability (CP) queries with expert knowledge, to confirm that the BN reflects the best available knowledge about the phenomenon we are modeling.

Prediction accuracy

We can measure the predictive accuracy of our chosen learning strategy with cross-validation, which is implemented in three flavors:

  • _k-fold_ cross-validation;
  • cross-validation with user-specified folds;
  • _hold-out_ cross-validation;


These can be combined with:

  • structure learning algorithms (both the structure and the parameters are learned from the data);
  • parameter learning algorithms (the structure is provided by the user, and the parameters are learned from the data).
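All three schemes are available through bn.cv(); a sketch on bnlearn's built-in gaussian.test data (our own calls would use diff instead):

```r
library(bnlearn)

# k-fold CV, learning structure (hc) and parameters in every fold.
cv.kfold <- bn.cv(gaussian.test, bn = "hc", k = 10)

# hold-out CV with a fixed, user-supplied structure.
dag <- hc(gaussian.test)
cv.hold <- bn.cv(gaussian.test, bn = dag, method = "hold-out")
```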

First, we examine Growth, which encodes the evolution of the malocclusion (0 for Bad, 1 for Good). We predict it, discretize the prediction back into a binary variable, and compute the classification error.


> for (i in 1:10) {
+   # tt is the confusion matrix between predicted and observed Growth
+   # in the i-th cross-validation run.
+   err[i] = (sum(tt) - sum(diag(tt))) / sum(tt)
+ }

The other variables are continuous, so we can estimate their predictive correlations instead.

> for (var in names(predcor)) {
+   xval = bn.cv(diff, bn = "hc", loss = "cor-lw",
+                loss.args = list(target = var))
+   predcor[var] = mean(sapply(xval, function(x) attr(x, "mean")))
+ }

In both cases, we use the likelihood-weighting variants of the loss functions, which make predictions from the posterior expected values computed from all the other variables. The basic loss functions (cor, mse, pred) predict the value of each node from its parents only, which is not meaningful for nodes with few or no parents.
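The posterior-expectation variants are selected with the loss argument (e.g. "mse-lw" instead of "mse"), passing the target node through loss.args; a sketch on the built-in gaussian.test data:

```r
library(bnlearn)

# "mse-lw" predicts node "F" from all the other variables via likelihood
# weighting, instead of from its parents only as "mse" would.
xval <- bn.cv(gaussian.test, bn = "hc", loss = "mse-lw",
              loss.args = list(target = "F"))
loss(xval)    # the loss observed in each fold
```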

Confirm with expert knowledge

Another way to check whether the BN makes sense is to treat it as a working model and see whether it expresses key facts that were not used as prior knowledge during learning (otherwise we would just be getting back the information we put in as priors). Some examples:

  1. “Excessive growth of CoGo should lead to a reduction in PPPM.” We test this hypothesis by generating samples of dCoGo and dPPPM from the BN stored in fitted.simpler, conditional on no treatment taking place. As dCoGo increases (indicating faster and faster growth), dPPPM becomes more and more negative (indicating that, if the angle is positive initially, it decreases).

    > sim = cpdist(fitted.simpler, nodes = nodes(fitted.simpler), evidence = TRUE)
    > plot(sim )
    > abline(v = 0, col = 2, lty = 2, lwd = 2)
![](https://img-blog.csdnimg.cn/20210705160433148.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE5NjAwMjkx,size_16,color_FFFFFF,t_70)
  1. “A small increase in COGO should lead to an increase in PPPM.”

    As can be seen from the figure above, negative or zero growth of CoGo (dCoGo ≤ 0) corresponds to positive growth of PPPM with probability ≈ 0.60. For a small increase in CoGo (dCoGo ∈ (0, 2)), unfortunately, dPPPM ≤ 0 with probability ≈ 0.50, so the BN does not support this hypothesis.

    > nrow(sim[(sim$dCoGo <= 0) & (sim$dPPPM > 0), ]) / nrow(sim[(sim$dCoGo <= 0), ])
    [1] 0.6112532
    > nrow(sim[(sim$dCoGo > 0) & (sim$dCoGo < 2) & (sim$dPPPM > 0), ]) /
    +   nrow(sim[(sim$dCoGo > 0) & (sim$dCoGo < 2), ])
    [1] 0.4781784
  1. “If ANB decreases, IMPA decreases to compensate.”

    Testing by simulation as before, we look for negative values of dANB associated with negative values of dIMPA (indicating that, if the angles are positive initially, they decrease). As can be seen from the figure below, dANB is positively related to dIMPA, so a decrease in one suggests a decrease in the other. The average trend (black line) is negative for both.

    > plot(sim )
    > abline(coef(lm(dIMPA ~ dANB, data = sim)))
![](https://img-blog.csdnimg.cn/20210705160647789.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE5NjAwMjkx,size_16,color_FFFFFF,t_70)
  1. “If GoPg strongly increases, both ANB and IMPA decrease.” If we simulate dGoPg, dANB, and dIMPA from the BN assuming dGoPg > 5 (i.e., GoPg is increasing), we estimate the probability that dANB < 0 (i.e., that ANB decreases) at ≈ 0.70, and the probability that dIMPA < 0 at only ≈ 0.58.

    > nrow(sim[(sim$dGoPg > 5) & (sim$dANB < 0), ]) / nrow(sim[(sim$dGoPg > 5), ])
    [1] 0.695416
    > nrow(sim[(sim$dGoPg > 5) & (sim$dIMPA < 0), ]) / nrow(sim[(sim$dGoPg > 5), ])
    [1] 0.5756936
  1. “The treatment tries to stop the decrease of ANB. If we fix ANB, is there a difference between treated and untreated patients?”

    First, we can examine the relationship between Treatment and Growth in patients with dANB ≈ 0, in the absence of any intervention (that is, using the BN learned from the data):

    > table(TREATMENT = sim$Treatment < 0.5, GOOD.GROWTH = sim$Growth > 0.5)
![](https://img-blog.csdnimg.cn/20210705160909425.png)
  1. The estimated P(GOOD.GROWTH | TREATMENT) differs between treated and untreated patients (≈ 0.65 versus ≈ 0.52).

    If we model a formal intervention (in the sense of Judea Pearl) and set dANB = 0 from the outside (thus making it independent of its parents and removing the corresponding arcs), we find that GOOD.GROWTH has practically the same distribution for treated and untreated patients, becoming independent of TREATMENT. This suggests that a favorable prognosis is indeed determined by preventing the decrease of ANB, and that other components of the treatment, if any, become irrelevant.

    > table(TREATMENT = sim$Treatment < 0.5, GOOD.GROWTH = sim$Growth > 0.5)
![](https://img-blog.csdnimg.cn/20210705161057852.png)
  1. “The treatment tries to stop the decrease of ANB. If we fix GoPg, is there a difference between treated and untreated patients?”

    One way to assess this is to examine whether the angle between points A and B (ANB) changes differently for treated and untreated patients while keeping GoPg fixed.

Assuming no change in GoPg, the angle between points A and B increases for treated patients (a strongly negative value indicates a skeletal imbalance, so a positive rate of change indicates that the imbalance is decreasing) and decreases for untreated patients (the imbalance slowly worsens over time).

Treatment = c("UNTREATED", "TREATED")[(sim$Treatment > 0.5) + 1L]
boxplot(sim$dANB ~ Treatment)

Model #2: Dynamic Bayesian Networks

The predictive performance of the dynamic Bayesian network is not as good as Model #1, and it is more complex. This is inherent to dynamic Bayesian networks, which model stochastic processes: each variable is associated with a different node for each time point being modeled. (We typically assume the process is first-order Markov, so we have two time points in the BN: t and t - 1.) Nevertheless, we explore it to show that such a BN can be learned and used with bnlearn.

The data we use for this model are the raw values we stored in ortho at the beginning of the analysis. However, we use the Treatment variable rather than Growth to express the fact that a subject may be undergoing medical treatment. The reason is that Growth measures the outcome at the second measurement and its value is unknown at the first, whereas Treatment is the same at both measurements.

Learning structure

First, we divide the variables into three groups: variables at time T2, variables at time T1 = T2 - 1, and variables that are time-independent because they take the same value at T1 and T2.

> t2.variables
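A sketch of how the three groups can be built from the column names (it assumes the second-measurement columns are the ones ending in "2"):

```r
# Column names of 'ortho' as described above (assumed).
ortho.names <- c("Treatment", "T1", "T2",
                 "ANB", "IMPA", "PPPM", "CoA", "GoPg", "CoGo",
                 "ANB2", "IMPA2", "PPPM2", "CoA2", "GoPg2", "CoGo2")

t2.variables <- grep("2$", ortho.names, value = TRUE)
t1.variables <- setdiff(ortho.names, c(t2.variables, "Treatment"))
constant.variables <- "Treatment"
```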

Then we introduce a blacklist in which.

  1. We blacklist all arcs from the clinical variables to T1, T2, and Treatment, because we know that age and treatment are not determined by the clinical measurements.
  2. We blacklist all arcs pointing into the T1 variables and into Treatment, because we assume that the arcs among the T1 variables mirror those among the corresponding T2 variables, and there is no point in learning them twice.
  3. We blacklist all arcs from the T2 variables to the T1 variables.
expand.grid(from = setdiff(names(ortho), c("T1", "T2")),
            to = c("T1", "T2"))

In contrast, we whitelist only the arc from T1 to T2, since the age at the second measurement obviously depends on the age at the first.

>  data.frame(from = c("T1"), to = c("T2"))

Finally, we can use bl and wl to learn the structure of the BN.

> dyn.dag = hc(ortho, whitelist = wl, blacklist = bl)

Clearly, this BN is more complex than the previous one: it has more nodes (16 vs 9), more arcs (27 vs 19), and therefore more parameters (218 vs 37).
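These sizes can be read off directly with bnlearn; a sketch on the built-in learning.test data:

```r
library(bnlearn)

dag <- hc(learning.test)
nnodes(dag)                   # number of nodes
narcs(dag)                    # number of arcs
nparams(dag, learning.test)   # number of free parameters, given the data
```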

The best way to plot this new model is to start from graphviz.plot() without rendering:

gR = graphviz.plot(dyn.dag, render = FALSE)

Then, we group variables to make it easy to distinguish between const, t1.variables, and t2.variables; We chose to draw the network from left to right instead of from top to bottom.

+        attrs = list(graph = list(rankdir = "LR")))

> renderGraph(gR)

As in the previous model, Treatment acts on ANB: the only arcs leaving Treatment are Treatment → ANB2 and Treatment → CoA2. Once again, both child nodes are related to Down's point A.

Model Averaging in Structural Learning

We want to evaluate the stability of this dynamic BN structure, just like we did for static BN before, and we can do it again.

> dyn.str = boot.strength(ortho, R = 200, algorithm = "hc")
> plot(dyn.str)



The averaged network (avg) and the DAG learned from all the data are almost identical: they differ by only two arcs. This indicates that structure learning produces a stable output.

compare(dag, avg)
tp fp fn
26  1  1

Learning parameters

Since Treatment is a discrete variable, the BN is a conditional linear Gaussian BN (CLGBN). This means that continuous nodes with Treatment as a parent are parameterized differently from the other nodes.



As we can see, ANB2 depends on ANB (the same variable at the previous time point) and on Treatment. ANB is continuous, so it enters as a regressor for ANB2; Treatment is discrete, and determines which component of the linear regression applies.

Model validation and reasoning

We can ask another set of questions about this new model

  1. “How much does ANB change between the first and second measurement under the different treatments?” We can use cpdist() to generate pairs (ANB, ANB2) conditional on Treatment being equal to NT, TB, and TG, and look at their distributions.

     diff = c(nt[, 2] - nt[, 1], tb[, 2] - tb[, 1], tg[, 2] - tg[, 1]),
    > by(effect$diff, effect$treatment, FUN = mean)
![](https://img-blog.csdnimg.cn/20210705163326465.png)

    > densityplot(~ diff, groups = treatment, data = effect)

![](https://img-blog.csdnimg.cn/20210705163441945.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE5NjAwMjkx,size_16,color_FFFFFF,t_70)

We know that the treatment tries to stop the decrease of ANB; this is consistent with the fact that the distribution for NT lies to the left of that for TB, and TB to the left of TG. Untreated patients keep deteriorating; patients who did not respond to treatment did not really improve, but did not get worse either; patients who responded to treatment improved.

In contrast, this is a simulated trajectory of an untreated patient under the same initial conditions.

The simulated trajectory of CoA is realistic: it slows down with age. This differs from ANB, and happens because CoA2 depends on both T1 and T2 (ANB2 depends on neither).

> for (i in seq(nrow(interv))) {
+   # joint prediction by simulation, which is currently impossible with predict().
+   dist = cpdist(dyn.fitted, nodes = c("ANB2", "CoA2"), ...)
+   interv[i, "ANB2"] = weighted.mean(dist$ANB2, attr(dist, "weights"))
+   interv[i, "CoA2"] = weighted.mean(dist$CoA2, attr(dist, "weights"))
+ }

