
One of the most challenging tasks when adopting Bayesian networks (BNs) is learning their structure from data. This task is complicated by the huge search space of possible solutions and by the fact that the problem is known to be NP-hard.

Bayesian networks (BNs) have been applied to several different fields, ranging from water resource management [

As discussed in [

Although constraint-based learning is an interesting approach, as it is close to the semantics of BNs, most of the developed structure learning algorithms fall into the score-based category, given the possibility of formulating such a task in terms of an optimization problem. As the name implies, these methods have two major components: (1) a scoring metric that measures the quality of every candidate BN with respect to a dataset and (2) a search procedure to intelligently move through the space of possible networks, as this space is enormous. In more detail, as shown in [

Hence, regardless of the structure learning strategy one wishes to pursue, greedy local search techniques and heuristic search approaches need to be adopted to tackle the problem of inferring the structure of BNs. However, to the best of our knowledge, a quantitative analysis of the performance of these different techniques has never been carried out. For this reason, we here provide a detailed assessment of the performance of different state-of-the-art methods for structural learning on simulated data, considering BNs with both discrete and continuous variables and with different rates of noise in the data. More precisely, we investigate the characteristics of different scores proposed for the inference and the statistical pitfalls within both the constraint-based and score-based techniques. Furthermore, we study the impact of various heuristic search approaches, namely, hill climbing, tabu search, and genetic algorithms.

We notice that, although we here aim to cover some of the main ideas in the area of structure learning of BNs, several interesting topics are beyond the scope of this work [

This work extends previous investigation performed in the same area [

BNs are graphical models representing the

However, even if we consider the simplest case of binary variables, the number of parameters in each conditional probability table is still exponential in the number of parents. For example, in the case of binary variables, the total number of parameters needed to compute the full joint distribution is of size 2^n − 1, with n being the number of variables.
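As a concrete illustration (a minimal sketch, not taken from the paper), the following snippet compares the free parameters of the full joint distribution over n binary variables with those of a BN factorization, here for a hypothetical 10-node chain:

```python
# Parameter counting for binary variables: full joint vs. BN factorization.

def joint_params(n):
    """Free parameters of the full joint distribution over n binary variables."""
    return 2 ** n - 1

def bn_params(parent_counts):
    """Free parameters of a binary BN: a node with k parents has 2**k parent
    configurations, each with one free parameter."""
    return sum(2 ** k for k in parent_counts)

# A chain X1 -> X2 -> ... -> X10: every node but the first has one parent.
n = 10
chain = [0] + [1] * (n - 1)
print(joint_params(n))   # 1023
print(bn_params(chain))  # 1 + 9*2 = 19
```

The gap widens quickly: the factorized form grows with the largest parent set, not with the total number of variables.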

Moreover, the usage of the symmetrical notion of conditional dependence poses further limitations to the task of learning the structure of a BN: two arcs

In the literature, there are two families of methods to learn the structure of a BN from data. The idea behind the first group of methods is the one of

We now briefly describe the main idea behind this class of approaches. For a detailed discussion of this topic, we refer to [

This class of methods is aimed at building a graph structure that reflects the dependence and independence relations in the data, matching the empirical distribution. However, the number of conditional independence tests that such algorithms would have to perform among pairs of nodes to check all possible relations is exponential; because of this, the introduction of some approximations is required.

We now provide some details on two constraint-based algorithms that have been proven to be efficient and of widespread usage: the

This algorithm [

Another constraint-based learning algorithm uses the Markov blankets [

These approaches aim at maximizing the likelihood

Practically, however, for any arbitrary set of data, the most likely graph is always the fully connected one, since adding an edge can only increase the likelihood of the data, i.e., this approach overfits the data. To overcome this limitation, the likelihood score is almost always combined with a

As already mentioned, such an optimization problem leads to intractability, due to the enormous search space of the valid solutions. Because of this, the optimization task is often solved with heuristic techniques. Before moving on to describe the main heuristic methods employed to face such complexity (see Section

BIC uses a score that consists of a log-likelihood term and a regularization term that depends on a model

Notice that the likelihood is implicitly weighted by the number of data points, since each point contributes to the score. As the sample size increases, both the regularization term and the likelihood term grow, but the likelihood grows faster. This means that, with more data, the likelihood contributes more to the score: we may trust our observations more and have less need for regularization. Statistically speaking, BIC is a
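This behavior can be made concrete with a small sketch. The code below implements BIC under one common convention (log-likelihood minus a (d/2)·log N penalty; other formulations differ by sign or constant factors) and contrasts the logarithmic growth of the penalty with the linear growth of a hypothetical likelihood term:

```python
import math

def bic_score(log_lik, num_params, n_samples):
    """BIC under one common convention: log-likelihood minus a complexity
    penalty of (d/2) * log(N), where d is the number of free parameters."""
    return log_lik - (num_params / 2.0) * math.log(n_samples)

# The likelihood term scales linearly with N, the penalty only like log(N),
# so with more data the likelihood dominates the score.
per_sample_ll = -0.7  # hypothetical average log-likelihood per observation
d = 5                 # hypothetical number of free parameters
for n in (10, 100, 1000):
    penalty = (d / 2.0) * math.log(n)
    print(n, per_sample_ll * n, -penalty)
```

With N = 1000 the penalty is roughly 17 while the likelihood term is roughly −700, illustrating why regularization matters most in small samples.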

We now describe some of the main state-of-the-art search strategies that we took into account in this work. In particular, as stated in Section

Hill climbing (HC) is one of the simplest iterative techniques that have been proposed for solving optimization problems. While HC consists of a simple and intuitive sequence of steps, it is a good search scheme to be used as a baseline for comparing the performance of more advanced optimization techniques.

Hill climbing shares with other techniques (like simulated annealing [

The sequence of steps of the hill climbing algorithm, for a maximization problem w.r.t. a given objective function

Choose an initial solution

Find the best solution

If
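The steps above can be sketched as a generic maximization loop (an illustrative implementation, using a toy one-dimensional objective rather than a BN score):

```python
def hill_climb(initial, neighbors, f):
    """Generic hill climbing for maximization: repeatedly move to the best
    neighbor until no neighbor improves on the current solution."""
    current = initial
    while True:
        best = max(neighbors(current), key=f, default=None)
        if best is None or f(best) <= f(current):
            return current  # local maximum reached
        current = best

# Toy usage: maximize -(x - 3)^2 over the integers, stepping by +/-1.
result = hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
print(result)  # 3
```

For BN learning, `neighbors` would enumerate the DAGs reachable by one edge move and `f` would be the chosen network score.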

As is clear from the aforementioned algorithm, hill climbing returns a solution that is a local maximum for the problem at hand. This local maximum does not generally correspond to the global maximum for the problem under examination, that is, hill climbing does not guarantee to return the best possible solution for a given problem. To counteract this limitation, more advanced neighborhood search methods have been defined. One of these methods is tabu search, an optimization technique that uses the concept of “memory.”

Tabu search (TS) is a metaheuristic that guides a local heuristic search procedure to explore the solution space beyond local optimality. One of the main components of this method is the use of an adaptive memory, which creates a more flexible search behavior. Memory-based strategies are therefore the main feature of TS approaches, founded on a quest for “integrating principles,” by which alternative forms of memory are appropriately combined with effective strategies for exploiting them.

As reported in [

While tabus represent the main distinguishing feature of TS, they can also introduce issues into the search process. In particular, the use of tabus can prohibit attractive moves, or it may lead to an overall stagnation of the search process, with many moves not allowed. Hence, several methods for revoking a tabu have been defined [

The basic TS algorithm, considering the maximization of the objective function

Randomly select an initial solution

Set

Choose the best

If

Update the

If a stopping condition is met, then stop; else, go to step 2.
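A minimal sketch of this loop (illustrative only): a short-term memory stores recently visited solutions, which are skipped unless a simple aspiration criterion revokes the tabu because the move would beat the best solution found so far:

```python
from collections import deque

def tabu_search(initial, neighbors, f, tenure=5, max_iters=100):
    """Basic tabu search for maximization with a fixed-length tabu list
    (short-term memory) and an aspiration criterion."""
    current = best = initial
    tabu = deque([initial], maxlen=tenure)
    for _ in range(max_iters):
        # Tabu solutions are admissible only if they improve on the best so far.
        candidates = [s for s in neighbors(current)
                      if s not in tabu or f(s) > f(best)]
        if not candidates:
            break
        current = max(candidates, key=f)
        tabu.append(current)
        if f(current) > f(best):
            best = current
    return best

# Toy usage: the memory lets the search leave just-visited solutions while
# remembering the best one encountered.
print(tabu_search(0, lambda x: [x - 1, x + 1], lambda x: -(x - 4) ** 2))  # 4
```

Note that, unlike plain hill climbing, the loop keeps moving even when no neighbor improves the current solution; only the stopping condition ends the search.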

The most commonly adopted conditions to end the algorithm are when the number of iterations (

Specifically, in our experiments, for both HC and TS we modeled the valid solutions in the search space as binary adjacency matrices describing directed acyclic graphs. The starting point of the search is the empty graph (i.e., the graph without any edge), and the search is stopped when the current best solution cannot be improved by any move in its neighborhood.

The algorithms then consider a set of possible solutions (i.e., all the directed acyclic graphs) and navigate among them by means of 2 moves: insertion of a new edge or removal of an edge currently in the structure. We also recall that in the literature, many alternatives have been proposed to navigate the search space when learning the structure of a Bayesian network (see [
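A sketch of how such a neighborhood can be generated (illustrative code, not the paper's implementation): every single-edge insertion or removal is proposed, and candidates that would introduce a directed cycle are discarded:

```python
def has_cycle(adj):
    """Detect a directed cycle via depth-first search over an adjacency matrix."""
    n = len(adj)
    state = [0] * n  # 0 = unvisited, 1 = on the DFS stack, 2 = done
    def visit(u):
        state[u] = 1
        for v in range(n):
            if adj[u][v]:
                if state[v] == 1 or (state[v] == 0 and visit(v)):
                    return True
        state[u] = 2
        return False
    return any(state[u] == 0 and visit(u) for u in range(n))

def neighbors(adj):
    """All DAGs reachable from adj by adding or removing a single edge."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            flipped = [row[:] for row in adj]
            flipped[i][j] = 1 - flipped[i][j]  # add or remove edge i -> j
            if not has_cycle(flipped):
                yield flipped

# Toy usage: from the empty 3-node graph, every single-edge addition is valid.
empty = [[0] * 3 for _ in range(3)]
print(len(list(neighbors(empty))))  # 6
```

Flipping a matrix entry covers both moves: a 0 becomes an insertion and a 1 becomes a removal, with the acyclicity check filtering out invalid structures.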

Genetic algorithms (GAs) [

Typical workflow of a GA.

To transform a population of candidate solutions into a new one, GAs make use of particular operators, called genetic operators, that transform the candidate solutions:

GAs have been widely used to learn the BN structure considering the search space of the DAGs. In the large majority of the works [

Generate the initial population

Repeat the following operations until the total number of generations is reached:

Select a population of parents by repeatedly applying the parent selection method

Create a population of offspring by applying the crossover operator to each set of parents

For each offspring, apply one of the mutation operators with a given mutation probability

Select the new population by applying the survivor selection method

Return the best performing individual
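The workflow above can be sketched as follows (an illustrative GA on bit strings with a toy OneMax objective; the population size and generation count are kept small here for speed and do not reflect the paper's settings):

```python
import random

def genetic_algorithm(f, length, pop_size=20, generations=50,
                      tournament=3, mutation_prob=0.1, seed=0):
    """Compact GA mirroring the steps above: tournament parent selection,
    one-point crossover, bit-flip mutation, and elitist survivor selection
    (the best individual always survives to the next generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():
            return max(rng.sample(pop, tournament), key=f)
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = pick(), pick()              # parent selection
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]          # one-point crossover
            if rng.random() < mutation_prob:
                i = rng.randrange(length)
                child[i] = 1 - child[i]          # bit-flip mutation
            offspring.append(child)
        elite = max(pop, key=f)
        pop = offspring
        pop[0] = elite                           # elitist survivor selection
    return max(pop, key=f)

# Toy usage: OneMax (maximize the number of ones in the string).
best = genetic_algorithm(sum, length=12)
print(sum(best))
```

For BN learning, the individuals would instead encode acyclic graphs and the variation operators would have to preserve validity, as discussed below.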

Unless stated otherwise, the following description is valid for both GA variants (discrete and continuous). The initialization method and the variation operators used in our GA ensure that every individual is valid, i.e., each individual is guaranteed to be an acyclic graph. The initialization method creates individuals with exactly

Three mutation operators are considered in our GA implementation: add connection mutation, remove connection mutation, and perturbation mutation. The first two are applied in both GA variants, while the perturbation mutation is only applied in the continuous variant. The offspring resulting from the add connection mutation operator differs from the parent in having an additional connection. The newly connected nodes are selected with uniform probability. In the continuous variant, the value associated with the new connection is randomly generated between 0

In terms of parameters, a population of size 100 is used, with the evolutionary process being able to perform 100 generations. Parents are selected by applying a tournament of size 3. The survivor selection is elitist in the sense that it ensures that the best individual survives to the next generation.

We made use of a large set of simulations on randomly generated datasets with the aim of assessing the characteristics of the state-of-the-art techniques for structure learning of BNs.

We generated data for both the case of discrete (2 values, i.e., 0 or 1) and continuous (we used 4 levels, i.e., 1, 2, 3, and 4) random variables. Notice that, for computational reasons, we discretized our continuous variables using only 4 categories. For each of them, we randomly generated both the structure (i.e., 100 weakly connected DAGs having 10 and 15 nodes) and the related parameters (we generated random values in (0, 1) for each entry of the conditional probability tables attached to the network structure) to build the simulated BNs. We also considered 3 levels of density of the networks, namely, 0.4, 0.6, and 0.8.

For all of them, we considered both the constraint-based and the score-based approaches to structural learning. From the former category of methods, we considered the PC algorithm [

Moreover, among the score-based approaches, we considered 3 maximum likelihood scores, namely, log-likelihood [

This led us to 11 different methods to be compared, for a total of 158400 independent experiments. To evaluate the obtained results, we considered both false positives (FP) (i.e., arcs that are not in the generative BN but that are inferred by the algorithm) and false negatives (FN) (i.e., arcs that are in the generative BN but that are not inferred by the algorithm). Also, with TP (true positives) being the arcs of the generative BN that are correctly inferred and TN (true negatives) being the arcs absent from the generative BN and correctly not inferred, for all the experiments we considered the following metrics:
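These metrics can be computed directly from the edge sets, as in the following minimal sketch over a hypothetical 3-node example:

```python
def edge_metrics(true_edges, inferred_edges, all_possible):
    """Confusion-matrix metrics over directed edges, comparing an inferred
    network against the generative one."""
    tp = len(true_edges & inferred_edges)
    fp = len(inferred_edges - true_edges)
    fn = len(true_edges - inferred_edges)
    tn = len(all_possible - true_edges - inferred_edges)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Toy usage on a 3-node graph (6 possible directed edges): one edge is
# recovered, one is missed, and one spurious (reversed) edge is inferred.
possible = {(i, j) for i in range(3) for j in range(3) if i != j}
true = {(0, 1), (1, 2)}
inferred = {(0, 1), (2, 1)}
print(edge_metrics(true, inferred, possible))
```

Note that recall and sensitivity coincide, which is why the two terms are used interchangeably in the result plots.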

In this section, we comment on the results of our simulations. As anticipated, we computed precision, recall (sensitivity), and specificity, as well as accuracy and

With a more careful look, the first evidence we obtained from the simulations is that the two parameters with the highest impact on the inferred networks are the density and the number of nodes (i.e., network size). For this reason, we first focused our attention on these two parameters, and we analyzed how the different tested combinations of methods and scores behave.

As shown in Figure

Boxplots showing the accuracy of the results obtained by the 11 tested approaches on the discrete and continuous datasets. In the figure, the conditional boxplots in each subpanel report results for structures of density 0.4, 0.6, and 0.8.

In addition to the accuracy, we also computed the

Bar plots showing the

A further analysis we performed was devoted to assessing the impact of both overfit and underfit. As it is possible to observe from the plots in Figures

Violin plots showing precision and recall of the solutions obtained by the 11 tested approaches, grouped by the type of dataset (discrete and continuous) and density values (0.4, 0.6, and 0.8).

Violin plots showing specificity and sensitivity of the solutions obtained by the 11 tested approaches, grouped by the type of dataset (discrete and continuous) and density values (0.4, 0.6, and 0.8).

The other two scores, i.e., AIC and BIC, present a better trade-off between overfit and underfit, with BIC being less affected by overfit and AIC reconstructing slightly denser networks. Once again, this trade-off between the two regularizers is well known in the literature and points to AIC being more suited for prediction and BIC for descriptive purposes.

Another relevant result of our analysis is the characterization of the performance of genetic algorithms in terms of

In summary, we can draw the following conclusions.

Dense networks are harder to learn (see Figure

The number of nodes affects the complexity of the solutions, leading to higher

Networks with continuous variables are harder to learn compared to the binary ones (see Figure

Genetic algorithms tend to reduce underfit (see Figures

We conclude the section by providing

Summary of the major findings. The results of the Mann–Whitney

Comparison | Metric | Test | p value | Mean 1 | Mean 2
---|---|---|---|---|---
Density 0.4/0.8 | Accuracy | Greater | <2 | 0 | 0
Nodes 10/15 | Accuracy | Less | <2 | 0 | 0
Nodes 10/15 | Hamming | Less | <2 | 17 | 39
Discr./contin. | Accuracy | Greater | <2 | 0 | 0
iamb/hc loglik | Sensitivity | Less | <2 | 0 | 0
iamb/hc loglik | Specificity | Greater | <2 | 0 | 0
ga/hc | Sensitivity | Less | <2 | 0 | 0
ga/hc | Specificity | Greater | <2 | 0 | 0

The tests are performed on all the configurations of our simulations.

Bayesian networks are a widespread technique to model dependencies among random variables. A challenging problem when dealing with BNs is learning their structure, i.e., the statistical dependencies in the data, which may sometimes pose a serious limit to the reliability of the results.

Despite their extensive use in a vast set of fields, to the best of our knowledge, a quantitative assessment of the performance of different state-of-the-art methods to learn the structure of BNs has never been performed. In this work, we aimed to move toward filling this gap by presenting a study of different state-of-the-art approaches for structural learning of Bayesian networks on simulated data, considering both discrete and continuous variables.

To this end, we investigated the characteristics of 3 different likelihood scores, combined with 3 commonly used search strategies (namely, genetic algorithms, hill climbing, and tabu search), as well as 2 constraint-based techniques for BN inference (i.e., the iamb and pc algorithms).

Our analysis identified the factors having the highest impact on the performance, that is, density, number of variables, and variable type (discrete vs continuous). In fact, as shown here, these settings affect the number of parameters to be learned, hence complicating the optimization task. Furthermore, we also discussed the overfit/underfit trade-off of the different tested techniques, with the constraint-based approaches showing trends toward underfitting and the loglik score showing high overfit. Interestingly, in all the configurations, genetic algorithms showed evidence of reducing the underfit, leading to denser structures.

Overall, we place our work as a starting effort to better characterize the task of learning the structure of Bayesian networks from data, which may lead in the future to a more effective application of this approach. In particular, we focused on the more general task of learning the structure of a BN [

The datasets used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This work was also financed through the Regional Operational Programme CENTRO2020 within the scope of the project CENTRO-01-0145-FEDER-000006.