Four tree-based structure learning methods are implemented with graph and data-driven algorithms.
SEMtree(
graph,
data,
seed,
type = "ST",
eweight = NULL,
alpha = 0.05,
verbose = FALSE,
...
)
An igraph object.
A matrix or data.frame. Rows correspond to subjects, and columns to graph nodes (variables).
A vector of seed nodes.
Tree-based structure learning method. Four algorithms are available:
"ST"(default). Steiner Tree (ST) identification via fast Kou's algorithm
(Kou et al, 1981) connecting a set of seed nodes (called Terminal vertices)
with connector nodes (called Steiner vertices) from input graph as defined
in graph
with minimal total distance on its edges. By default the edge
weights are based on the pairwise correlation, 1-abs(cor(j,k)). If input
graph has E(graph)$weight=1, and eweight = "custom"
, ST seeks a minimum
subtree (i.e., the subtree with minimal number of edges).
"CAT". Causal additive trees (CAT) algorithm as in Jakobsen et al.
(2022). The argument graph
is set to NULL (i.e., no input graph is needed).
In the first step, a (univariate) generalized additive model (GAM) is employed
to estimate the residual variances, var(X(j) - [X(j)|X(k)]) for all j != k,
then use these to construct edge weights as inputs to the Chu–Liu–Edmonds’
algorithm (Chow and Liu, 1968) to recover the arborescence. Argument seed
must be specified to analyse a subset of nodes (variables) of interest.
"CPDAG". CLE algorithm for Skeleton Recovery and CPDAG
estimation as in Lou et al. (2021). Together with "CAT" algorithm, "CPDAG" is
data-driven and the argument graph
is set to NULL.
The key idea is to first recover the skeleton of the polytree by applying
the CLE algorithm to the pairwise sample correlations of the data matrix.
After the skeleton is recovered, the set of all v-structures can be correctly
identified via a simple thresholding approach to pairwise sample correlations.
CPDAG can be found applying iteratively only Rule 1 of Meek (1995).
Argument seed
must be specified to analyse a subset of nodes
(variables) of interest.
"MST". Minimum Spanning Tree (MST) identification via Prim's algorithm
(Prim, 1957). The latter finds the subset of edges that includes every vertex
of the graph (as defined in graph
) such that the sum of the weights
of the edges can be minimized. The argument seed
is set to NULL (i.e.,
no seed nodes are needed).
Edge weight type for igraph object can be externally derived
using weightGraph
or from user-defined distances.
This option determines the weight-to-distance transform. If set to:
"NULL" (default), edge weights will be internally computed equal to 1 - abs(pairwise Pearson's correlation).
"kegg", repressing(-1), neutral(0) and activating(+1) kegg interactions will be multiplied by "zsign" attributes, and positive (i.e., concordant) values will be set to 1 (minimum distance), while negative (i.e., discordant) values will be set to 2.
"zsign", all significant interactions (abs(zsign) > 0) will be set to 1 (minimum distance), while non-significant (zsign=0) ones will be set to 2.
"pvalue", edge p-value atributes will be transformed to the inverse of negative base-10 logarithm, 1/(-log(E(graph)$pv)).
"custom", the algorithm will use the distance measure specified by the user as "weight" edge attribute in the input graph.
Threshold for rejecting a pair of node being independent in
"CPDAG" algorithm. The latter implements a natural v-structure identification
procedure by thresholding the pairwise sample correlations over all adjacent
pairs of edges with some appropriate threshold. By default,
alpha = 0.05
.
If TRUE, it shows the output tree (not recommended for large graphs).
Currently ignored.
An igraph
object. If type = "ST"
, seed nodes are
colored in "aquamarine" and connectors in "white". If type = "ST"
and
type = "MST"
, edges are colored in "green" if not present in the input,
graph. If type = "CPDAG"
, bidirected edges are colored in "black"
(if the algorithm is not able to establish the direction of the relationship
between x and y).
A tree ia an acyclic graph with p vertices and p-1 edges. The graph method
refers to the Steiner Tree (ST), a tree from an undirected graph that connect "seed"
with additional nodes in the "most compact" way possible. The data-driven methods
propose fast and scalable procedures based on Chu-Liu–Edmonds’ algorithm (CLE) to
recover a tree from a full graph. The first method, called Causal Additive Trees (CAT)
uses pairwise mutual weights as input for CLE algorithm to recover a directed tree
(an "arborescence"). The second one applies CLE algorithm for skeleton recovery and
extends the skeleton to a tree (a "polytree") represented by a Completed Partially
Directed Acyclic Graph (CPDAG). Finally, the Minimum Spanning Tree (MST) connecting
an undirected graph with minimal edge weights can be identified.
To note, if the input graph is a directed graph, ST and MST undirected trees are
converted in directed trees using the orientEdges
function.
Grassi M, Tarantino B (2023). SEMtree: tree-based structure learning methods with structural equation models. Bioinformatics, 39 (6), 4829–4830 <https://doi.org/10.1093/bioinformatics/btad377>
Kou, L., Markowsky, G., Berman, L. (1981). A fast algorithm for Steiner trees. Acta Informatica 15, 141–145. <https://doi.org/10.1007/BF00288961>
Prim, R.C. (1957). Shortest connection networks and some generalizations Bell System Technical Journal, 37 1389–1401.
Chow, C.K. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467.
Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, 403–410.
Jakobsen, M, Shah, R., Bühlmann, P., Peters, J. (2022). Structure Learning for Directed Trees. arXiv: <https://doi.org/10.48550/arxiv.2108.08871>.
Lou, X., Hu, Y., Li, X. (2022). Linear Polytree Structural Equation Models: Structural Learning and Inverse Correlation Estimation. arXiv: <https://doi.org/10.48550/arxiv.2107.10955>
# \donttest{
# Nonparanormal(npn) transformation
als.npn <- transformData(alsData$exprs)$data
#> Conducting the nonparanormal transformation via shrunkun ECDF...done.
# graph-based trees
graph <- alsData$graph
seed <- V(graph)$name[sample(1:vcount(graph), 10)]
tree1 <- SEMtree(graph, als.npn, seed=seed, type="ST", verbose=TRUE)
#> RICF solver ended normally after 2 iterations
#>
#> deviance/df: 15.65363 srmr: 0.2923645
#>
tree2 <- SEMtree(graph, als.npn, seed=NULL, type="MST", verbose=TRUE)
#> RICF solver ended normally after 2 iterations
#>
#> deviance/df: 10.64067 srmr: 0.2815872
#>
# data-driven trees
V <- colnames(als.npn)[colnames(als.npn) %in% V(graph)$name]
tree3 <- SEMtree(NULL, als.npn, seed=V, type="CAT", verbose=TRUE)
#> Score weighting of 930 edges...
#>
/ 0 % elapsed=00s
- 1 % elapsed=00s, remaining~08s
\ 2 % elapsed=00s, remaining~07s
| 3 % elapsed=00s, remaining~07s
/ 4 % elapsed=00s, remaining~07s
- 5 % elapsed=00s, remaining~07s
\ 6 % elapsed=00s, remaining~06s
| 8 % elapsed=00s, remaining~06s
/ 9 % elapsed=01s, remaining~06s
- 10% elapsed=01s, remaining~06s
\ 11% elapsed=01s, remaining~06s
| 12% elapsed=01s, remaining~06s
/ 13% elapsed=01s, remaining~06s
- 14% elapsed=01s, remaining~06s
\ 15% elapsed=01s, remaining~06s
| 16% elapsed=01s, remaining~06s
/ 17% elapsed=01s, remaining~05s
- 18% elapsed=01s, remaining~05s
\ 19% elapsed=01s, remaining~05s
| 20% elapsed=01s, remaining~05s
/ 22% elapsed=01s, remaining~05s
- 23% elapsed=02s, remaining~05s
\ 24% elapsed=02s, remaining~05s
| 25% elapsed=02s, remaining~05s
/ 26% elapsed=02s, remaining~05s
- 27% elapsed=02s, remaining~05s
\ 28% elapsed=02s, remaining~05s
| 29% elapsed=02s, remaining~05s
/ 30% elapsed=02s, remaining~05s
- 31% elapsed=02s, remaining~05s
\ 32% elapsed=02s, remaining~04s
| 33% elapsed=02s, remaining~04s
/ 34% elapsed=02s, remaining~04s
- 35% elapsed=02s, remaining~04s
\ 37% elapsed=02s, remaining~04s
| 38% elapsed=02s, remaining~04s
/ 39% elapsed=03s, remaining~04s
- 40% elapsed=03s, remaining~04s
\ 41% elapsed=03s, remaining~04s
| 42% elapsed=03s, remaining~04s
/ 43% elapsed=03s, remaining~04s
- 44% elapsed=03s, remaining~04s
\ 45% elapsed=03s, remaining~04s
| 46% elapsed=03s, remaining~04s
/ 47% elapsed=03s, remaining~03s
- 48% elapsed=03s, remaining~03s
\ 49% elapsed=03s, remaining~03s
| 51% elapsed=03s, remaining~03s
/ 52% elapsed=03s, remaining~03s
- 53% elapsed=03s, remaining~03s
\ 54% elapsed=04s, remaining~03s
| 55% elapsed=04s, remaining~03s
/ 56% elapsed=04s, remaining~03s
- 57% elapsed=04s, remaining~03s
\ 58% elapsed=04s, remaining~03s
| 59% elapsed=04s, remaining~03s
/ 60% elapsed=04s, remaining~03s
- 61% elapsed=04s, remaining~03s
\ 62% elapsed=04s, remaining~02s
| 63% elapsed=04s, remaining~02s
/ 65% elapsed=04s, remaining~02s
- 66% elapsed=04s, remaining~02s
\ 67% elapsed=04s, remaining~02s
| 68% elapsed=04s, remaining~02s
/ 69% elapsed=05s, remaining~02s
- 70% elapsed=05s, remaining~02s
\ 71% elapsed=05s, remaining~02s
| 72% elapsed=05s, remaining~02s
/ 73% elapsed=05s, remaining~02s
- 74% elapsed=05s, remaining~02s
\ 75% elapsed=05s, remaining~02s
| 76% elapsed=05s, remaining~02s
/ 77% elapsed=05s, remaining~01s
- 78% elapsed=05s, remaining~01s
\ 80% elapsed=05s, remaining~01s
| 81% elapsed=05s, remaining~01s
/ 82% elapsed=05s, remaining~01s
- 83% elapsed=05s, remaining~01s
\ 84% elapsed=06s, remaining~01s
| 85% elapsed=06s, remaining~01s
/ 86% elapsed=06s, remaining~01s
- 87% elapsed=06s, remaining~01s
\ 88% elapsed=06s, remaining~01s
| 89% elapsed=06s, remaining~01s
/ 90% elapsed=06s, remaining~01s
- 91% elapsed=06s, remaining~01s
\ 92% elapsed=06s, remaining~00s
| 94% elapsed=06s, remaining~00s
/ 95% elapsed=06s, remaining~00s
- 96% elapsed=06s, remaining~00s
\ 97% elapsed=06s, remaining~00s
| 98% elapsed=06s, remaining~00s
/ 99% elapsed=07s, remaining~00s
- 100% elapsed=07s, remaining~00s
#> RICF solver ended normally after 2 iterations
#>
#> deviance/df: 5.696462 srmr: 0.2143798
#>
tree4 <- SEMtree(NULL, als.npn, seed=V, type="CPDAG", alpha=0.05, verbose=TRUE)
#>
#> 9 -> 3 <- 29
#>
#> 29 -> 3 <- 9
#>
#> 4 -> 30 <- 23
#>
#> 23 -> 30 <- 4
#> RICF solver ended normally after 8 iterations
#>
#> deviance/df: 5.759128 srmr: 0.209403
#>
# }