Covariate-varying networks (CVNs)
are high-dimensional dynamic graphical models that vary with multiple
external covariates. The methodology is explained in detail in the
pre-print by Dijkstra, Godt
and Foraita (2024). In this vignette, we describe how to apply the
method to data with the associated R package CVN
.
A graphical model is a powerful tool for exploring complex dependency structures in high-throughput datasets, such as genomics, epigenomics, and proteomics data. It allows the investigation of, e.g., biologically meaningful patterns that aid in understanding biological processes and in generating new hypotheses. Despite the widespread application of Gaussian graphical models, methods connecting graph structures to external covariates have been lacking.
A CVN is a Gaussian graphical model for high-dimensional data that can change with multiple external discrete covariates. For each combination of the covariates' categories, a graph is estimated. The graphs are not estimated individually, since we allow for similarities between graphs belonging to different covariate combinations. This smoothness between the graphs in the CVN model is introduced by a meta-graph W. In this meta-graph, each node corresponds to a graph in the CVN model, and an edge between two nodes enforces smoothing between the corresponding graphs.
More formally, a CVN is represented as a graphical model by the quintuple
CVN = {X, U, 𝒰, f(X ∣ U), {G(u) = (V, E(u))}u ∈ 𝒰}, where X = (X1, X2, …, Xp)⊤ is a p-dimensional random vector and U = (U1, U2, …, Uq)⊤ is a random vector representing q external discrete covariates with K1, …, Kq categories, respectively. The external discrete covariates are not included in the graph, but they determine the smoothing between the graphs for all possible combinations of their categories.
The vector U lies in the discrete space 𝒰 with cardinality $m \leq \prod_{k=1}^q K_k$. The joint density function of X conditioned on U is f(X ∣ U). The fifth element of the CVN is a set of m graphs, one for each value u in 𝒰. The vertices of G(u), V = {1, 2, …, p}, correspond to the variables X1, X2, …, Xp and do not change with U.
We assume that X ∣ U = u follows a multivariate normal distribution with mean μ(u) = 0 and covariance matrix Σ(u). To estimate the structure of the graph, we focus on the precision matrix Θ(u) = Σ(u)−1, which has the following property under the normality assumption:
$$ \theta_{ij}(u) = 0 \Leftrightarrow X_i \perp\!\!\!\perp X_j \mid \mathbf{X}_{V \setminus \{i,j\}} \land U = u \Leftrightarrow \text{edge } \{i,j\} \notin E(u) $$ Hence, the (in)dependence structure of the CVN can be estimated by determining the zero entries of the precision matrices. Our goal is to enable a CVN to handle high-dimensional data and to identify structural similarities between graphs. We therefore introduce regularization and smoothness in the definition of the CVN estimator $\mathbf{\Theta}_{\text{CVN}} = \{ \widehat{\mathbf{\Theta}}_i \}_{i = 1}^m$ as follows:
$$ \mathbf{\Theta}_{\text{CVN}} = \text{argmin}_{\{ \mathbf{\Theta}_{i}\}_{i = 1}^m} \Bigg[ \sum_{i = 1}^m \ell\left(\mathbf{\Theta}_i\right) + \lambda_1 \sum_{i = 1}^m \left\lVert \mathbf{\Theta}_i \right\rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left\lVert \mathbf{\Theta}_i - \mathbf{\Theta}_j \right\rVert_1 \Bigg],$$
where Θi is the i-th precision matrix, ℓ(Θi) is the negative log-likelihood for the i-th precision matrix, the wij are the entries of the symmetric weighted adjacency matrix of the meta-graph (see below), and λ1, λ2 are two tuning parameters.
The CVN model differs from other graphical model approaches especially through the concept of the (undirected) meta-graph. The adjacency matrix of the meta-graph is the symmetric m × m weight matrix W = (wij). It encodes the smoothing structure between the graphs in a CVN and must be specified a priori. A weight of 0 means that no smoothing is applied between the two corresponding graphs; weights must be chosen between 0 and 1.
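For intuition, a weight matrix can also be specified by hand. The following is a minimal sketch (not from the package) for two hypothetical covariates A and B with two categories each, giving m = 4 graphs arranged in a 2 × 2 grid:
# Graphs ordered as g1 = (A1,B1), g2 = (A1,B2), g3 = (A2,B1), g4 = (A2,B2);
# a weight of 1 connects graphs that differ in exactly one covariate category
W <- matrix(c(0, 1, 1, 0,
              1, 0, 0, 1,
              1, 0, 0, 1,
              0, 1, 1, 0),
            nrow = 4, byrow = TRUE)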
The table below shows some examples of how W and the corresponding meta-graph can look for a CVN with two covariates:
Meta-graph for a CVN with 2 covariates | W |
---|---|
Grid structure | Each covariate has 3 categories |
Weighted grid structure | Each covariate has 3 categories |
Arbitrary structure | Each covariate has 3 categories |
GLASSO-type structure (no smoothing) | Each covariate has 3 categories |
Full/saturated structure (complete smoothing) | Covariates with 2 and 3 categories |
In this fictitious example, we are interested in how the independence structure of collected biomarkers differs by the two external variables body mass index (BMI) and income level. The biomarker values were transformed to normally distributed z-scores. BMI has the three categories underweight, normal weight and overweight; income level has the categories low, middle and high.
For this example, we use randomly simulated data where the variables in the graph are z-scores of various biomarkers, each following a normal distribution. We are interested in whether the graphs differ by the two external variables, each of them having K = L = 3 categories.
The CVN function requires as input data:

- A list with K ⋅ L = m data sets.
- Each list element needs to be a matrix or a data object of Gaussian data, corresponding to one combination of the categories of the two external variables.
- The number of variables p must be equal in each list element; the number of observations may differ.

The input data is then a list of 3 × 3 = 9 list elements corresponding to the combinations (low income, underweight), (low income, normal weight), …, (high income, overweight).
# Load required libraries
library(CVN)
library(dplyr)

# Simulate the dataset
set.seed(2024)
n <- 300

# Create 10 normally distributed variables for the graph
data <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
colnames(data) <- paste0("X", 1:10)

# Add two discrete external covariates as factors, so that the split
# below follows the intended category order
data$income <- factor(sample(c("low", "middle", "high"), n, replace = TRUE),
                      levels = c("low", "middle", "high"))
data$bmi    <- factor(sample(c("underweight", "normal weight", "overweight"),
                             n, replace = TRUE),
                      levels = c("underweight", "normal weight", "overweight"))

# Split the dataset into subsets based on income and bmi
data_list <- data %>%
  group_by(income, bmi) %>%
  group_split() %>%
  lapply(function(df) df %>% select(-income, -bmi))

# Name each list element; group_split() varies the last grouping
# variable (bmi) fastest, so expand.grid() must do the same
names(data_list) <-
  apply(expand.grid(bmi = levels(data$bmi),
                    income = levels(data$income)),
        1,
        function(x) paste0(x["income"], "_", x["bmi"]))
We set the meta-graph W in our example to the following:
In the case of two external covariates, the function
create_weight_matrix
can be used to generate simple weight matrices of the types
full (equals the saturated graph), grid,
glasso (no smoothing) and uniform-random
(random weights).
# plots can be turned on by setting plot = TRUE. Requires the igraph package
W_grid <- create_weight_matrix(type = "grid", k = 3, l = 3, plot = FALSE)
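The other structure types mentioned above can be generated analogously (the type names follow the prose above; see ?create_weight_matrix for the full list):
# Saturated meta-graph (complete smoothing) and GLASSO-type (no smoothing)
W_full   <- create_weight_matrix(type = "full", k = 3, l = 3, plot = FALSE)
W_glasso <- create_weight_matrix(type = "glasso", k = 3, l = 3, plot = FALSE)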
Weight matrices can be plotted as a meta-graph or as a heatmap. This can be done in several ways:

- A meta-graph plot can be generated when creating the weight matrix with create_weight_matrix by setting the parameter plot = TRUE, or
- by using the plot_weight_matrix function (see below).
- A heatmap plot is generated with the function hd_weight_matrix.

The first two options require the igraph package.
A heatmap plot looks like the following. Note that it has dimension m × m. The axis labeling follows the order of your data_list containing the m data sets.
W_random <- round(create_weight_matrix(type = "uniform-random", k = 3, l = 2), 2)
hd_weight_matrix(W_random) # randomly chosen weights
As known from other methods, e.g., the GLASSO, we need to tune our model. A CVN requires the selection of two tuning parameters that control the regularization applied to the CVN. The choice of the tuning parameters affects how dense the graphs are and how strongly the edges between the graphs are smoothed. These parameters are usually searched for along a predefined regularization path.
- λ1 governs the sparsity of the CVN: the graphs become sparser the larger λ1 is chosen.
- λ2 regulates the smoothness, or similarity, between the graphs by penalizing the differences between precision matrices that are connected in the meta-graph. It encourages similar values in the precision matrices and hence also agreement on whether an edge is included. The graphs in the CVN become more and more similar the larger λ2 is selected.
The CVN
function fits all combinations for a given
selection of λ1 and
λ2 values.
A CVN graphical model is fitted using an alternating direction method
of multipliers (ADMM). By default, the CVN
function parallelizes the computation when multiple values of λ1 or λ2 are provided.
The number of cores used can be adjusted by setting, e.g.,
n_cores = 1
.
The CVN is then estimated as follows:
# Define the regularization paths for both tuning parameters
lambda1 <- c(0.5, 1.25, 2)
lambda2 <- c(1, 1.5)

cvn <- CVN(data = data_list,
           W = W_grid,
           lambda1 = lambda1,
           lambda2 = lambda2,
           eps = 1e-2,       # makes it faster but less precise; default = 1e-4
           maxiter = 500,
           n_cores = 1,      # no parallelization
           warmstart = TRUE, # uses the GLASSO for a warm start
           verbose = FALSE)
# Print results
print(cvn)
#> Covariate-varying Network (CVN)
#>
#> ✓ all converged
#>
#> Number of graphs (m) : 9
#> Number of variables (p) : 10
#> Number of lambda pairs : 6
#>
#> Weight matrix (W):
#> 9 x 9 sparse Matrix of class "dsCMatrix"
#>
#> [1,] . 1 . 1 . . . . .
#> [2,] 1 . 1 . 1 . . . .
#> [3,] . 1 . . . 1 . . .
#> [4,] 1 . . . 1 . 1 . .
#> [5,] . 1 . 1 . 1 . 1 .
#> [6,] . . 1 . 1 . . . 1
#> [7,] . . . 1 . . . 1 .
#> [8,] . . . . 1 . 1 . 1
#> [9,] . . . . . 1 . 1 .
#>
#> id lambda1 lambda2 gamma1 gamma2 converged value
#> 1 1 0.50 1.0 0.001234568 0.0006172840 TRUE 0.008176557
#> 2 2 1.25 1.0 0.003086420 0.0006172840 TRUE 0.008932883
#> 3 3 2.00 1.0 0.004938272 0.0006172840 TRUE 0.009769567
#> 4 4 0.50 1.5 0.001234568 0.0009259259 TRUE 0.009720175
#> 5 5 1.25 1.5 0.003086420 0.0009259259 TRUE 0.009410001
#> 6 6 2.00 1.5 0.004938272 0.0009259259 TRUE 0.009319307
#> n_iterations aic bic ebic edges_median edges_iqr
#> 1 16 5255.323 6071.667 8586.090 29 6
#> 2 19 4763.677 5258.571 6778.278 18 3
#> 3 22 4476.400 4769.034 5662.437 10 2
#> 4 19 5724.758 6922.427 10606.563 44 1
#> 5 22 4682.392 5122.325 6476.245 16 3
#> 6 26 4404.656 4651.844 5407.092 8 1
In our manuscript, we also introduce an alternative tuning parameterization. Due to the algorithm’s computational complexity, performing an exhaustive naive search across a broad range of potential values is usually not feasible. We realized that introducing a different external covariate, thereby altering the number of graphs m, or considering different variables, which changes p, impacts the interpretation and meaning of the tuning parameters λ1 and λ2. In other words, if one selects optimal values for λ1 and λ2 and then slightly alters the dataset, these values may no longer be informative for the new dataset.
To address this issue, we propose a new parameterization, denoted as γ1 and γ2, replacing λ1 and λ2. Unlike the previous tuning parameters that penalize the entire precision matrix and differences between precision matrices, γ1 and γ2 are used to penalize individual edges.
This alternative tuning parameterization remains robust when changing the number of variables or graphs in the CVN model, which might be beneficial for other applications as well.
The results table of the CVN model includes the gamma tuning
parameters by default, which
are much smaller than the lambda values (see our manuscript for more
information).
id | lambda1 | lambda2 | gamma1 | gamma2 | converged | value | n_iterations | aic | bic | ebic | edges_median | edges_iqr |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.50 | 1.0 | 0.0012346 | 0.0006173 | TRUE | 0.0081766 | 16 | 5255.323 | 6071.667 | 8586.090 | 29 | 6 |
2 | 1.25 | 1.0 | 0.0030864 | 0.0006173 | TRUE | 0.0089329 | 19 | 4763.677 | 5258.571 | 6778.278 | 18 | 3 |
3 | 2.00 | 1.0 | 0.0049383 | 0.0006173 | TRUE | 0.0097696 | 22 | 4476.400 | 4769.034 | 5662.437 | 10 | 2 |
4 | 0.50 | 1.5 | 0.0012346 | 0.0009259 | TRUE | 0.0097202 | 19 | 5724.758 | 6922.427 | 10606.563 | 44 | 1 |
5 | 1.25 | 1.5 | 0.0030864 | 0.0009259 | TRUE | 0.0094100 | 22 | 4682.392 | 5122.325 | 6476.245 | 16 | 3 |
6 | 2.00 | 1.5 | 0.0049383 | 0.0009259 | TRUE | 0.0093193 | 26 | 4404.656 | 4651.844 | 5407.092 | 8 | 1 |
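As an informal plausibility check (a back-calculation from the table above, not the manuscript's formal definition), the gamma values correspond to rescaling the lambdas by the number of penalized edge parameters:
# Back-calculation for the first row of the table; m = 9 graphs, p = 10 variables
m <- 9; p <- 10
0.50 / (m * p * (p - 1) / 2)                  # = 0.0012346 (gamma1)
1.0  / (m * (m - 1) / 2 * p * (p - 1) / 2)    # = 0.0006173 (gamma2)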
After fitting the CVN models, the next step is selecting suitable values for the tuning parameters (λ1, λ2) or (γ1, γ2). The choice of the tuning parameters determines the network structure and is thus an important and challenging step.
The CVN package provides the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the extended BIC (eBIC), which can be used to select suitable values for the tuning parameters.
The calculation of the eBIC requires setting the parameter γeBIC ∈ [0, 1], which is by default set to γeBIC = 0.5. All three information criteria are returned in the CVN results table.
# print results
cvn$results
#> id lambda1 lambda2 gamma1 gamma2 converged value
#> 1 1 0.50 1.0 0.001234568 0.0006172840 TRUE 0.008176557
#> 2 2 1.25 1.0 0.003086420 0.0006172840 TRUE 0.008932883
#> 3 3 2.00 1.0 0.004938272 0.0006172840 TRUE 0.009769567
#> 4 4 0.50 1.5 0.001234568 0.0009259259 TRUE 0.009720175
#> 5 5 1.25 1.5 0.003086420 0.0009259259 TRUE 0.009410001
#> 6 6 2.00 1.5 0.004938272 0.0009259259 TRUE 0.009319307
#> n_iterations aic bic ebic edges_median edges_iqr
#> 1 16 5255.323 6071.667 8586.090 29 6
#> 2 19 4763.677 5258.571 6778.278 18 3
#> 3 22 4476.400 4769.034 5662.437 10 2
#> 4 19 5724.758 6922.427 10606.563 44 1
#> 5 22 4682.392 5122.325 6476.245 16 3
#> 6 26 4404.656 4651.844 5407.092 8 1
# which combination of lambda values shows the best AIC?
best.aic <- cvn$results[which.min(cvn$results$aic), "id"]
cvn$results[best.aic,]
#> id lambda1 lambda2 gamma1 gamma2 converged value
#> 6 6 2 1.5 0.004938272 0.0009259259 TRUE 0.009319307
#> n_iterations aic bic ebic edges_median edges_iqr
#> 6 26 4404.656 4651.844 5407.092 8 1
👍 Rule of thumb: The smaller the IC value, the better the fit.
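The same which.min pattern can be used for the other criteria:
# Which combinations of lambda values show the best BIC and eBIC?
best.bic  <- cvn$results$id[which.min(cvn$results$bic)]
best.ebic <- cvn$results$id[which.min(cvn$results$ebic)]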
The package also contains an option for plotting the information criteria. The plot is a map of all IC values for every tuning parameter combination fitted by CVN. The yellow dot marks the tuning parameter combination with the smallest IC value.
It is also possible to change the γ-value for the eBIC. The default is γ = 0.5.
(dic <- determine_information_criterion_cvn(cvn, gamma = 0.9))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> aic 5255.323 4763.677 4476.4 5724.758 4682.392 4404.656
#> bic 6071.667 5258.571 4769.034 6922.427 5122.325 4651.844
#> ebic 10597.63 7994.042 6377.159 13553.87 7559.381 6011.29
IC | selected id |
---|---|
minimal AIC: | 6 |
minimal BIC: | 6 |
minimal eBIC with gamma = 0.5: | 6 |
minimal eBIC with gamma = 0.9: | 6 |
Based on the information criteria, we select CVN 6. Here, all IC measures decide in favour of ID 6; however, this is not always the case. In general, the AIC selects slightly denser models than the BIC. For tuning graphs, many prefer the eBIC, although a value for γeBIC has to be specified as well.
Our simulations showed better results in detecting the true underlying graph when using the AIC rather than the BIC (the eBIC was not included in the comparison). In general, however, both perform rather poorly in practice.
Although ICs are objective measures, they sometimes yield unusable graphs (too sparse, too dense, …).
There are other approaches that can be used to find a suitable graph structure, such as stability approaches. These, however, are based on resampling strategies that demand a lot of computational power.
Others prefer to assume a sparsity or density value and select the graph that comes closest to that threshold, as sketched below. We therefore also output the median number of edges and the interquartile range (IQR) across the m estimated graphs in the CVN results output (see above).
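For example, a minimal sketch (the target of 15 edges per graph is an arbitrary assumption for illustration):
# Pick the tuning parameter pair whose median edge count across the
# m graphs is closest to an assumed target density
target_edges <- 15
best.density <- cvn$results$id[which.min(abs(cvn$results$edges_median - target_edges))]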
Once a graph has been selected, there is usually an interest in investigating its structure further. This becomes easier if the CVN model of interest is extracted as a single object from the list of fitted CVNs. In our example, we examine the CVN with the lowest AIC value, which corresponds to the model with ID 6 and which we extract below.
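The following sketch assumes an extraction helper named extract_cvn(); please check the package index for the exact function name and signature. The extracted object fit6 is used in the remainder of this vignette.
# Extract the CVN with id = 6 as a single object for further analysis
# (extract_cvn() and its 'id' argument are assumed here; consult the
# package documentation for the exact helper)
fit6 <- extract_cvn(cvn, id = 6)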
All subgraphs of a CVN model can be easily plotted. The function
requires the visNetwork
package to visualize the network
structure of each CVN subgraph; a minimal call follows the list below.
The plot shows edges with different color coding:

- Red edges are present in all m subgraphs and are referred to as core edges.
- Blue edges are absent from at least one of the m subgraphs.
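A minimal call for the extracted model (visnetwork_cvn() is the same function used in the interpolation example at the end of this vignette):
# Create interactive plots for all m subgraphs of fit6
plots6 <- visnetwork_cvn(fit6, verbose = FALSE)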
It is also possible to generate all plots for every tuning parameter constellation over all m fitted graphs with a single function call.
In order to examine the differences between the subgraphs of a CVN model, it is possible to determine the number of edges in each CVN subgraph and to analyze the Hamming distances between the subgraphs.
The Hamming distance between two graphs with the same set of nodes is the number of edges that differ between them. It is calculated by counting the edges that need to be added or removed to transform one graph into the other.
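A toy example illustrating the definition:
# Two graphs on 3 nodes: A1 has edges {1,2} and {2,3}; A2 has {1,2} and {1,3}
A1 <- matrix(c(0, 1, 0,
               1, 0, 1,
               0, 1, 0), nrow = 3)
A2 <- matrix(c(0, 1, 1,
               1, 0, 0,
               1, 0, 0), nrow = 3)
sum(abs(A1 - A2)) / 2  # Hamming distance = 2 (remove {2,3}, add {1,3})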
The following code shows the number of edges in each subgraph (E(g1), …, E(g9)), the number of edges that are shared by all graphs (E(core)) and the number of edges that are unique for each subgraph (E(g1_u),…, E(g9_u)).
cvn_edge_summary(cvn)
#> E(g1) E(g2) E(g3) E(g4) E(g5) E(g6) E(g7) E(g8) E(g9) E(core) E(g1_u) E(g2_u)
#> 1 68 52 56 64 56 56 68 68 58 12 6 0
#> 2 44 32 36 38 32 30 44 38 36 1 14 0
#> 3 28 22 20 22 18 16 36 18 14 0 12 2
#> 4 90 88 88 88 88 88 90 90 90 44 0 0
#> 5 36 28 30 32 32 28 38 38 32 2 12 0
#> 6 28 18 16 18 16 16 22 16 14 1 12 0
#> E(g3_u) E(g4_u) E(g5_u) E(g6_u) E(g7_u) E(g8_u) E(g9_u)
#> 1 0 0 0 0 0 0 0
#> 2 0 0 2 0 4 0 0
#> 3 4 0 2 0 10 0 2
#> 4 0 0 0 0 0 0 0
#> 5 4 0 2 0 4 2 2
#> 6 2 0 2 4 4 2 2
You can also print the summary for one extracted CVN (fit6 in the example). The output shows the number of edges in each subgraph (edges), the number of edges that are shared by all subgraphs (core_edges), and the number of edges that are unique to the particular subgraph (unique_edges).
cvn_edge_summary(fit6)
#> edges core_edges unique_edges
#> 1 28 1 12
#> 2 18 1 0
#> 3 16 1 2
#> 4 18 1 0
#> 5 16 1 2
#> 6 16 1 4
#> 7 22 1 4
#> 8 16 1 2
#> 9 14 1 2
The result can be returned as a matrix or as a heatmap plot. The number in each cell reflects the number of differing edges between two graphs.
# Calculate the Hamming distance matrix and plot it as a heat map
# plot_hamming_distances_cvn(fit6)
hd_matrix <- hamming_distance_adj_matrices(fit6$adj_matrices[[1]])
plot_hamming_distances(hd_matrix)
As expected, the heatmap shows that the graph structures change randomly along the two covariates; no pattern can be seen.
We have also proposed a method to interpolate a subgraph for which no data were collected. This can be done using an estimated CVN model based on m graphs. Let us assume that Gm + 1 represents the graph we want to interpolate.
For this, we need to specify a vector of smoothing coefficients that determines the amount of smoothing between the new graph Gm + 1 and the m original graphs.
For example, suppose we want to interpolate a graph that is smoothed toward graphs 8 and 9, each weighted by 0.5.
# interpolate generates objects of class 'cvn_interpolate'
interpolate6 <- interpolate(fit6, c(0,0,0,0,0,0,0,0.5,0.5), truncate = 0.05)
cvn6 <- combine_cvn_interpolated(fit6, interpolate6)
# plot with title
plot10 <- visnetwork_cvn(cvn6, verbose = FALSE,
title = lapply(1:cvn6$n_lambda_values,
function(i){sapply(1:cvn6$m, function(j) paste("Graph", j))}
))
Note that this function is still work in progress and we have not yet investigated its performance in a simulation study. It should therefore be used with caution.