Title: | Multiple Imputation in Causal Graph Discovery |
---|---|
Description: | Modified functions of the package 'pcalg' and some additional functions to run the PC and the FCI (Fast Causal Inference) algorithm for constraint-based causal discovery in incomplete and multiply imputed datasets. Foraita R, Friemel J, Günther K, Behrens T, Bullerdiek J, Nimzyk R, Ahrens W, Didelez V (2020) <doi:10.1111/rssa.12565>; Andrews RM, Foraita R, Didelez V, Witte J (2021) <arXiv:2108.13395>; Witte J, Foraita R, Didelez V (2022) <doi:10.1002/sim.9535>. |
Authors: | Ronja Foraita [aut, cph, cre] , Janine Witte [aut] |
Maintainer: | Ronja Foraita <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.1 |
Built: | 2025-01-07 03:45:27 UTC |
Source: | https://github.com/bips-hb/micd |
Generate R bootstrap replicates for the PC or FCI algorithm for data with missing values.
boot.graph( data, select = NULL, method = c("pcMI", "fciMI"), method.mice = NULL, args, R, m = 10, args.residuals = NULL, seed = NA, quickpred = FALSE, ... )
boot.graph( data, select = NULL, method = c("pcMI", "fciMI"), method.mice = NULL, args, R, m = 10, args.residuals = NULL, seed = NA, quickpred = FALSE, ... )
data |
Data.frame with missing values |
select |
Variable of integers, indicating columns to select from a data frame; only continuous variables can be included in the model selection |
method |
Character string specifying the algorithm for causal discovery from the package 'pcalg'. |
method.mice |
Character string specifying imputation method; see |
args |
Arguments passed to |
R |
A positive integer number of bootstrap replications. |
m |
Number of chains included in mice()'. |
args.residuals |
(Optional) list containing vertices and confounders.
May be specified when residuals for vertices should be calculated in each bootstrap
data set. See |
seed |
A positive integer that is used as argument for set.seed(). |
quickpred |
If true, mice uses quickpred to select predictors. |
... |
Further arguments passed to the imputation function |
List of objects of class pcalgo
(see pcalg::pcAlgo)
or of fcmialgo
(see pcalg::fciAlgo).
data(windspeed) daten <- mice::ampute(windspeed)$amp bgraph <- boot.graph(data = daten, method = "pcMI", args = "solve.confl = TRUE, alpha = 0.05", R = 5)
data(windspeed) daten <- mice::ampute(windspeed)$amp bgraph <- boot.graph(data = daten, method = "pcMI", args = "solve.confl = TRUE, alpha = 0.05", R = 5)
A wrapper for pcalg::disCItest
, to be used within
pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when the data contain missing values.
Observations where at least one of the variables involved in the test is
missing are deleted prior to performing the test (test-wise deletion).
disCItwd(x, y, S = NULL, suffStat)
disCItwd(x, y, S = NULL, suffStat)
x , y , S
|
(Integer) position of variable X, Y and set of variables S,
respectively, in |
suffStat |
A list with three elements, |
See disCItest
for details on the G square test. Test-wise deletion
is valid if missingness does not jointly depend on X and Y.
A p-value.
pcalg::disCItest
for complete data, disMItest
for multiply imputed data
## load data (200 observations) data(gmD) dat <- gmD$x[1:1000,] ## delete some observations of X2 and X3 set.seed(123) dat[sample(1:1000, 50), 2] <- NA dat[sample(1:1000, 50), 3] <- NA ## analyse incomplete data # test-wise deletion ========== sufftwd <- getSuff(dat, test = "disCItwd") disCItwd(1, 3, NULL, suffStat = sufftwd) # list-wise deletion ========== dat2 <- dat[complete.cases(dat), ] suffStat2 <- getSuff(dat2, test = "disCItest", adaptDF = FALSE) disCItest(1, 3, NULL, suffStat = suffStat2) ## use disCItwd within pcalg::pc ========== pc.fit <- pc(suffStat = sufftwd, indepTest = disCItwd, alpha = 0.1, p = 5) pc.fit if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(pc.fit)
## load data (200 observations) data(gmD) dat <- gmD$x[1:1000,] ## delete some observations of X2 and X3 set.seed(123) dat[sample(1:1000, 50), 2] <- NA dat[sample(1:1000, 50), 3] <- NA ## analyse incomplete data # test-wise deletion ========== sufftwd <- getSuff(dat, test = "disCItwd") disCItwd(1, 3, NULL, suffStat = sufftwd) # list-wise deletion ========== dat2 <- dat[complete.cases(dat), ] suffStat2 <- getSuff(dat2, test = "disCItest", adaptDF = FALSE) disCItest(1, 3, NULL, suffStat = suffStat2) ## use disCItwd within pcalg::pc ========== pc.fit <- pc(suffStat = sufftwd, indepTest = disCItwd, alpha = 0.1, p = 5) pc.fit if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(pc.fit)
A modified version of pcalg::disCItest
, to be used within
pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when multiply imputed data sets are available.
Note that in contrast to pcalg::disCItest
, the variables must
here be coded as factors.
disMItest(x, y, S = NULL, suffStat)
disMItest(x, y, S = NULL, suffStat)
x , y , S
|
(Integer) position of variable X, Y and set of variables S,
respectively, in |
suffStat |
A list of |
See pcalg::disCItest
for details on the G square test. disMItest applies this test to each
data.frame
in suffStat
, then combines the results using the rules
in Meng & Rubin (1992). Degrees of freedom are never adapted, and there is no
minimum required sample size, while pcalg::disCItest
requires
10*df
observations and otherwise returns a p-value of 1.
A p-value.
Janine Witte
Meng X.-L., Rubin D.B. (1992): Performing likelihood ratio tests with multiply imputed data sets. Biometrika 79(1):103-111.
pcalg::disCItest
for complete data,
disCItwd
for test-wise deletion
## load data (200 observations) and factorise data(gmD) dat <- gmD$x[1:1000, ] dat[] <- lapply(dat, as.factor) ## delete some observations of X2 and X3 set.seed(123) dat[sample(1:1000, 40), 2] <- NA dat[sample(1:1000, 40), 3] <- NA ## impute missing values under model with two-way interactions form <- make.formulas.saturated(dat, d = 2) imp <- mice::mice(dat, formulas = form, printFlag = FALSE) imp <- mice::complete(imp, action = "all") ## analyse imputed data disMItest(1, 3, NULL, suffStat = imp) ## use disMItest within pcalg::pc pc.fit <- pc(suffStat = imp, indepTest = disMItest, alpha = 0.01, p = 5) pc.fit if(require("Rgraphviz", character.only = TRUE, quietly = TRUE)){ plot(pc.fit) }
## load data (200 observations) and factorise data(gmD) dat <- gmD$x[1:1000, ] dat[] <- lapply(dat, as.factor) ## delete some observations of X2 and X3 set.seed(123) dat[sample(1:1000, 40), 2] <- NA dat[sample(1:1000, 40), 3] <- NA ## impute missing values under model with two-way interactions form <- make.formulas.saturated(dat, d = 2) imp <- mice::mice(dat, formulas = form, printFlag = FALSE) imp <- mice::complete(imp, action = "all") ## analyse imputed data disMItest(1, 3, NULL, suffStat = imp) ## use disMItest within pcalg::pc pc.fit <- pc(suffStat = imp, indepTest = disMItest, alpha = 0.01, p = 5) pc.fit if(require("Rgraphviz", character.only = TRUE, quietly = TRUE)){ plot(pc.fit) }
This function is a modification of pcalg::fci()
to be used for multiple imputation.
fciMI( data, alpha, labels, p, skel.method = c("stable", "original"), type = c("normal", "anytime", "adaptive"), fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf, pdsep.max = Inf, rules = rep(TRUE, 10), doPdsep = TRUE, biCC = FALSE, conservative = FALSE, maj.rule = FALSE, verbose = FALSE )
fciMI( data, alpha, labels, p, skel.method = c("stable", "original"), type = c("normal", "anytime", "adaptive"), fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf, pdsep.max = Inf, rules = rep(TRUE, 10), doPdsep = TRUE, biCC = FALSE, conservative = FALSE, maj.rule = FALSE, verbose = FALSE )
data |
An object of type mids, which stands for 'multiply imputed data set', typically created by a call to function mice() |
alpha |
Significance level (number in (0,1) for the conditional independence tests |
labels |
(Optional) character vector of variable (or "node") names. Typically preferred to specifying p. |
p |
(Optional) number of variables (or nodes). May be specified if labels are not, in which case labels is set to 1:p. |
skel.method |
Character string specifying method; the default, "stable"
provides an order-independent skeleton, see |
type |
Character string specifying the version of the FCI algorithm to be used.
See |
fixedGaps |
See |
fixedEdges |
See |
NAdelete |
See |
m.max |
Maximum size of the conditioning sets that are considered in the conditional independence tests. |
pdsep.max |
See |
rules |
Logical vector of length 10 indicating which rules should be used when directing edges. The order of the rules is taken from Zhang (2008). |
doPdsep |
See |
biCC |
See |
conservative |
See |
maj.rule |
See |
verbose |
If true, more detailed output is provided. |
See pcalg::fci()
for details.
Original code by Diego Colombo, Markus Kalisch, and Joris Mooij. Modifications by Ronja Foraita.
daten <- windspeed[,1] for(i in 2:ncol(windspeed)) daten <- c(daten, windspeed[,i]) daten[sample(1:length(daten), 260)] <- NA daten <- matrix(daten, ncol = 6) ## Impute missing values imp <- mice(daten, printFlag = FALSE) fc.res <- fciMI(data = imp, label = colnames(imp$data), alpha = 0.01) if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(fc.res)
daten <- windspeed[,1] for(i in 2:ncol(windspeed)) daten <- c(daten, windspeed[,i]) daten[sample(1:length(daten), 260)] <- NA daten <- matrix(daten, ncol = 6) ## Impute missing values imp <- mice(daten, printFlag = FALSE) fc.res <- fciMI(data = imp, label = colnames(imp$data), alpha = 0.01) if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(fc.res)
A plug-in conditional independence test for pcalg::skeleton()
, pcalg::pc()
or
pcalg::fci()
when multiply imputed data sets are available. flexCItest()
detects whether
variables are continuous, discrete or mixed, and automatically switches between gaussCItest()
(continuous only),
disCItest()
(discrete only) and mixCItest()
(mixed variables).
flexCItest(x, y, S = NULL, suffStat)
flexCItest(x, y, S = NULL, suffStat)
x , y , S
|
(integer) position of variable X, Y and set of variables S, respectively, in the dataset. It is tested whether X and Y are conditionally independent given the subset S of the remaining variables. |
suffStat |
a list generated using |
suffStat
needs to be a list with four elements named datlist
, corlist
,
conpos
and dispos
. datlist
is the list of imputed datasets. corlist
is a list with M+1 elements, where M is the number of imputed datasets. For i=1,...,M, the
the i-th element of corlist
is the correlation matrix of the continuous variables in the i-th imputed dataset;
the (M+1)-the element is the number of rows in each imputed dataset.
conpos
is a vector containing the integer positions of the continuous variables in the original dataset.
dispos
is a vector containing the integer positions of the discrete variables in the original dataset.
A p-value.
gaussCItest()
, disCItest()
and mixCItest()
.
# load data (numeric and factor variables) dat <- toenail2[1:400, ] # obtain correct input 'suffStat' for 'flexCItest' suff <- getSuff(dat, test="flexCItest") flexCItest(2,3,NULL, suffStat = suff)
# load data (numeric and factor variables) dat <- toenail2[1:400, ] # obtain correct input 'suffStat' for 'flexCItest' suff <- getSuff(dat, test="flexCItest") flexCItest(2,3,NULL, suffStat = suff)
A plug-in conditional independence test for pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when the data contain missing values. Observations
where at least one of the variables involved in the test is missing are
deleted prior to performing the test (test-wise deletion). The function flexCItwd
detects whether
variables are continuous, discrete or mixed, and automatically switches between gaussCItwd
(continuous only),
link{disCItwd}
(discrete only) and mixCItwd
(mixed).
flexCItwd(x, y, S = NULL, data)
flexCItwd(x, y, S = NULL, data)
x , y , S
|
(Integer) position of variable X, Y and set of variables S,
respectively, in each correlation matrix in |
data |
A data frame |
A p-value
## load data (numeric and factor variables) dat <- toenail2[1:400, ] ## delete some observations set.seed(123) dat[sample(400, 20), 2] <- NA dat[sample(400, 30), 4] <- NA ## obtain correct input 'suffStat' for 'flexMItest' suff <- getSuff(imp, test="flexCItwd") ## analyse data # continuous variables only flexCItwd(4, 5, NULL, dat) # discrete variables only flexCItwd(2, 3, NULL, dat) # mixed variables flexCItwd(2, 3, 4, dat)
## load data (numeric and factor variables) dat <- toenail2[1:400, ] ## delete some observations set.seed(123) dat[sample(400, 20), 2] <- NA dat[sample(400, 30), 4] <- NA ## obtain correct input 'suffStat' for 'flexMItest' suff <- getSuff(imp, test="flexCItwd") ## analyse data # continuous variables only flexCItwd(4, 5, NULL, dat) # discrete variables only flexCItwd(2, 3, NULL, dat) # mixed variables flexCItwd(2, 3, 4, dat)
A plug-in conditional independence test for pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when multiply imputed data sets are available. flexMItest
detects whether
variables are continuous, discrete or mixed, and automatically switches between gaussMItest
(continuous only),
link{disMItest}
(discrete only) and mixMItest
(mixed).
flexMItest(x, y, S = NULL, suffStat)
flexMItest(x, y, S = NULL, suffStat)
x , y , S
|
(integer) position of variable X, Y and set of variables S, respectively, in the dataset. It is tested whether X and Y are conditionally independent given the subset S of the remaining variables. |
suffStat |
a list generated using |
suffStat
needs to be a list with four elements named datlist
, corlist
,
conpos
and dispos
. datlist
is the list of imputed datasets. corlist
is a list with M+1 elements, where M is the number of imputed datasets. For i=1,...,M, the
the i-th element of corlist
is the correlation matrix of the continuous variables in the i-th imputed dataset;
the (M+1)-the element is the number of rows in each imputed dataset.
conpos
is a vector containing the integer positions of the continuous variables in the original dataset.
dispos
is a vector containing the integer positions of the discrete variables in the original dataset.
A p-value.
gaussMItest
, disMItest
and mixMItest
## load data (numeric and factor variables) library(ranger) dat <- toenail2[1:400, ] ## delete some observations set.seed(123) dat[sample(400, 20), 2] <- NA dat[sample(400, 30), 4] <- NA ## impute missing values using random forests imp <- mice::mice(dat, method = "rf", m = 3, printFlag = FALSE) ## obtain correct input 'suffStat' for 'flexMItest' suff <- getSuff(imp, test="flexMItest") ## analyse data # continuous variables only flexMItest(4,5,NULL, suffStat = suff) implist <- complete(imp, action="all") gaussSuff <- c(lapply(implist, function(i){cor(i[ ,c(4,5)])}), n = 400) gaussMItest(1,2,NULL, suffStat = gaussSuff) flexCItwd(4, 5, NULL, dat) # discrete variables only flexMItest(2,3,NULL, suffStat = suff) disMItest(2,3,NULL, suffStat = complete(imp, action="all")) flexCItwd(2,3,NULL, dat) # mixed variables flexMItest(2,3,4, suffStat = suff) mixMItest(2,3,4, suffStat = complete(imp, action="all")) flexCItwd(2,3,4, dat)
## load data (numeric and factor variables) library(ranger) dat <- toenail2[1:400, ] ## delete some observations set.seed(123) dat[sample(400, 20), 2] <- NA dat[sample(400, 30), 4] <- NA ## impute missing values using random forests imp <- mice::mice(dat, method = "rf", m = 3, printFlag = FALSE) ## obtain correct input 'suffStat' for 'flexMItest' suff <- getSuff(imp, test="flexMItest") ## analyse data # continuous variables only flexMItest(4,5,NULL, suffStat = suff) implist <- complete(imp, action="all") gaussSuff <- c(lapply(implist, function(i){cor(i[ ,c(4,5)])}), n = 400) gaussMItest(1,2,NULL, suffStat = gaussSuff) flexCItwd(4, 5, NULL, dat) # discrete variables only flexMItest(2,3,NULL, suffStat = suff) disMItest(2,3,NULL, suffStat = complete(imp, action="all")) flexCItwd(2,3,NULL, dat) # mixed variables flexMItest(2,3,4, suffStat = suff) mixMItest(2,3,4, suffStat = complete(imp, action="all")) flexCItwd(2,3,4, dat)
A wrapper for pcalg::gaussCItest
,
to be used within pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when the data contain missing values. Observations
where at least one of the variables involved in the test is missing are
deleted prior to performing the test (test-wise deletion).
gaussCItwd(x, y, S = NULL, suffStat)
gaussCItwd(x, y, S = NULL, suffStat)
x , y , S
|
(integer) position of variable X, Y and set of variables S,
respectively, in each correlation matrix in |
suffStat |
|
See pcalg::gaussCItest
for details on
Fisher's z-test. Test-wise deletion is valid if missingness does not jointly
depend on X and Y.
A p-value.
pcalg::condIndFisherZ()
for complete data, gaussCItestMI()
for multiply imputed data
## load data (numeric variables) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## analyse data # complete data: suffcomplete <- getSuff(windspeed, test="gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = suffcomplete) # test-wise deletion: ========== gaussCItwd(1, 2, c(4,5), suffStat = dat) # list-wise deletion: ========== sufflwd <- getSuff(dat[complete.cases(dat), ], test="gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = sufflwd) ## use gaussCItwd within pcalg::pc pc.fit <- pc(suffStat = dat, indepTest = gaussCItwd, alpha = 0.01, p = 6) pc.fit
## load data (numeric variables) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## analyse data # complete data: suffcomplete <- getSuff(windspeed, test="gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = suffcomplete) # test-wise deletion: ========== gaussCItwd(1, 2, c(4,5), suffStat = dat) # list-wise deletion: ========== sufflwd <- getSuff(dat[complete.cases(dat), ], test="gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = sufflwd) ## use gaussCItwd within pcalg::pc pc.fit <- pc(suffStat = dat, indepTest = gaussCItwd, alpha = 0.01, p = 6) pc.fit
A modified version of pcalg::gaussCItest
,
to be used within
pcalg::skeleton
, pcalg::pc
or
pcalg::fci
when multiply imputated data sets are available.
gaussMItest(x, y, S, suffStat) gaussCItestMI(x, y, S = NULL, data)
gaussMItest(x, y, S, suffStat) gaussCItestMI(x, y, S = NULL, data)
x , y , S
|
(Integer) position of variable X, Y and set of variables S, respectively, in the adjacency matrix. It is tested, whether X and Y are conditionally independent given the subset S of the remaining nodes. |
suffStat |
A list of length m+1, where m is the number of imputations; the first m elements are the covariance matrices of the m imputed data sets, the m-th element is the sample size. Can be obtained from a mids object by getSuff(mids, test="gaussMItest") |
data |
An object of type mids, which stands for 'multiply imputed data set', typically created by a call to function mice() |
gaussMItest
is faster, as it uses pre-calculated covariance matrices.
A p-value.
## load data (numeric variables) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## Impute missing values under normal model imp <- mice(dat, method = "norm", printFlag = FALSE) ## analyse data # complete data: suffcomplete <- getSuff(windspeed, test = "gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = suffcomplete) # multiple imputation: suffMI <- getSuff(imp, test = "gaussMItest") gaussMItest(1, 2, c(4,5), suffStat = suffMI) gaussCItestMI(1, 2, c(4,5), data = imp) # test-wise deletion: gaussCItwd(1, 2, c(4,5), suffStat = dat) # list-wise deletion: dat2 <- dat[complete.cases(dat), ] sufflwd <- getSuff(dat2, test = "gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = sufflwd) ## use gaussMItest or gaussCItestMI within pcalg::pc (pc.fit <- pc(suffStat = suffMI, indepTest = gaussMItest, alpha = 0.01, p = 6)) (pc.fit <- pc(suffStat = imp, indepTest = gaussCItestMI, alpha = 0.01, p = 6))
## load data (numeric variables) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## Impute missing values under normal model imp <- mice(dat, method = "norm", printFlag = FALSE) ## analyse data # complete data: suffcomplete <- getSuff(windspeed, test = "gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = suffcomplete) # multiple imputation: suffMI <- getSuff(imp, test = "gaussMItest") gaussMItest(1, 2, c(4,5), suffStat = suffMI) gaussCItestMI(1, 2, c(4,5), data = imp) # test-wise deletion: gaussCItwd(1, 2, c(4,5), suffStat = dat) # list-wise deletion: dat2 <- dat[complete.cases(dat), ] sufflwd <- getSuff(dat2, test = "gaussCItest") gaussCItest(1, 2, c(4,5), suffStat = sufflwd) ## use gaussMItest or gaussCItestMI within pcalg::pc (pc.fit <- pc(suffStat = suffMI, indepTest = gaussMItest, alpha = 0.01, p = 6)) (pc.fit <- pc(suffStat = imp, indepTest = gaussCItestMI, alpha = 0.01, p = 6))
A convenience function for transforming a multiply imputed data set into the 'suffStat' required
by pcalg::gaussCItest()
, pcalg::disCItest()
, mixCItest()
, flexCItest()
, gaussMItest()
,
disMItest()
, mixMItest()
and flexMItest()
.
getSuff( X, test = c("gaussCItest", "gaussMItest", "disCItest", "disMItest", "disCItwd", "mixCItest", "mixMItest", "flexMItest", "flexCItest"), adaptDF = NULL, nlev = NULL )
getSuff( X, test = c("gaussCItest", "gaussMItest", "disCItest", "disMItest", "disCItwd", "mixCItest", "mixMItest", "flexMItest", "flexCItest"), adaptDF = NULL, nlev = NULL )
X |
For 'test=xxxCItest': a data.frame or matrix;
for 'test=xxxMItest': an object of class |
test |
one of |
adaptDF |
for discrete variables: logical specifying if the degrees of freedom should be lowered by one for each zero count. The value for the degrees of freedom cannot go below 1. |
nlev |
(Optional) for discrete variables: vector with numbers of levels for each variable in the data. |
An R object that can be used as input to the specified conditional independence test:
# Example 1: continuous variables, no missing values ===================== data(windspeed) dat1 <- as.matrix(windspeed) ## analyse data gaussCItest(1, 2, NULL, suffStat = getSuff(windspeed, test = "gaussCItest")) mixCItest(1, 2, NULL, suffStat = windspeed) ## Example 2: continuous variables, multiple imputation =================== dat2 <- mice::ampute(windspeed)$amp ## delete some observations set.seed(123) ## Impute missing values under normal model imp2 <- mice(dat2, method = "norm", printFlag = FALSE) ## analyse imputed data gaussMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="gaussMItest")) mixMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="mixMItest")) mixMItest(1, 2, c(4,5), suffStat = mice::complete(imp2, action="all")) flexMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="flexMItest")) ## Example 3: discrete variables, multiple imputation ===================== ## simulate factor variables n <- 200 set.seed(789) x <- factor(sample(0:2, n, TRUE)) # factor, 3 levels y <- factor(sample(0:3, n, TRUE)) # factor, 4 levels z <- factor(sample(0:1, n, TRUE)) # factor, 2 levels dat3 <- data.frame(x,y,z) ## delete some observations of z dat3[sample(1:n, 40), 3] <- NA ## impute missing values under saturated model form <- make.formulas.saturated(dat3) imp3 <- mice::mice(dat3, method = "logreg", formulas = form, printFlag = FALSE) ## analyse imputed data disMItest(1, 3, 2, suffStat = getSuff(imp3, test="disMItest")) disMItest(1, 3, 2, suffStat = mice::complete(imp3, action = "all")) mixMItest(1, 3, 2, suffStat = getSuff(imp3, test="mixMItest")) mixMItest(1, 3, 2, suffStat = mice::complete(imp3, action = "all")) flexMItest(1, 3, 2, suffStat = getSuff(imp3, test="flexMItest")) # Example 4: mixed variables, multiple imputation ========================= dat4 <- toenail2[1:400, ] set.seed(123) dat4[sample(400, 20), 2] <- NA dat4[sample(400, 30), 4] <- NA ## impute missing values using random forests imp4 <- mice(dat4, method="rf", m = 3, printFlag = FALSE) mixMItest(2, 3, 5, suffStat = getSuff(imp4, test="mixMItest")) mixMItest(2, 3, 5, suffStat = mice::complete(imp4, action="all")) flexMItest(2, 3, 5, suffStat = getSuff(imp4, test="flexMItest"))
# Example 1: continuous variables, no missing values ===================== data(windspeed) dat1 <- as.matrix(windspeed) ## analyse data gaussCItest(1, 2, NULL, suffStat = getSuff(windspeed, test = "gaussCItest")) mixCItest(1, 2, NULL, suffStat = windspeed) ## Example 2: continuous variables, multiple imputation =================== dat2 <- mice::ampute(windspeed)$amp ## delete some observations set.seed(123) ## Impute missing values under normal model imp2 <- mice(dat2, method = "norm", printFlag = FALSE) ## analyse imputed data gaussMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="gaussMItest")) mixMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="mixMItest")) mixMItest(1, 2, c(4,5), suffStat = mice::complete(imp2, action="all")) flexMItest(1, 2, c(4,5), suffStat = getSuff(imp2, test="flexMItest")) ## Example 3: discrete variables, multiple imputation ===================== ## simulate factor variables n <- 200 set.seed(789) x <- factor(sample(0:2, n, TRUE)) # factor, 3 levels y <- factor(sample(0:3, n, TRUE)) # factor, 4 levels z <- factor(sample(0:1, n, TRUE)) # factor, 2 levels dat3 <- data.frame(x,y,z) ## delete some observations of z dat3[sample(1:n, 40), 3] <- NA ## impute missing values under saturated model form <- make.formulas.saturated(dat3) imp3 <- mice::mice(dat3, method = "logreg", formulas = form, printFlag = FALSE) ## analyse imputed data disMItest(1, 3, 2, suffStat = getSuff(imp3, test="disMItest")) disMItest(1, 3, 2, suffStat = mice::complete(imp3, action = "all")) mixMItest(1, 3, 2, suffStat = getSuff(imp3, test="mixMItest")) mixMItest(1, 3, 2, suffStat = mice::complete(imp3, action = "all")) flexMItest(1, 3, 2, suffStat = getSuff(imp3, test="flexMItest")) # Example 4: mixed variables, multiple imputation ========================= dat4 <- toenail2[1:400, ] set.seed(123) dat4[sample(400, 20), 2] <- NA dat4[sample(400, 30), 4] <- NA ## impute missing values using random forests imp4 <- mice(dat4, method="rf", m = 3, printFlag = FALSE) mixMItest(2, 3, 5, suffStat = getSuff(imp4, test="mixMItest")) mixMItest(2, 3, 5, suffStat = mice::complete(imp4, action="all")) flexMItest(2, 3, 5, suffStat = getSuff(imp4, test="flexMItest"))
formulas
ArgumentThis helper function creates a valid formulas
object.
The formulas
object is an argument to the mice::mice
function.
It is a list of formulas that specifies the target variables and the predictors
by means of the standard ~ operator. In contrast to mice::make.formulas
,
which creates main effects formulas, make.formulas.saturated
creates formulas including interaction effects.
make.formulas.saturated( data, blocks = mice::make.blocks(data), predictorMatrix = NULL, d = NULL )
make.formulas.saturated( data, blocks = mice::make.blocks(data), predictorMatrix = NULL, d = NULL )
data |
A |
blocks |
An optional specification for blocks of variables in the rows. The default assigns each variable in its own block. |
predictorMatrix |
A |
d |
maximum depth of interactions to be considered (1=no interactions, 2=two-way interactions, etc.) |
A list of formulas.
A modification of mice::make.formulas
by Stef van Buuren et al.
mice::make.formulas
## main effects model: data(nhanes) f1 <- make.formulas(nhanes) f1 ## saturated model: f2 <- make.formulas.saturated(nhanes) f2
## main effects model: data(nhanes) f1 <- make.formulas(nhanes) f1 ## saturated model: f2 <- make.formulas.saturated(nhanes) f2
Generate residuals based on variables in imputed data sets
makeResiduals(data, v, confounder, method = c("res", "cc", "pd"))
makeResiduals(data, v, confounder, method = c("res", "cc", "pd"))
data |
A data.frame. |
v |
Vector of integers referring to the location of the variable(s) in the data set |
confounder |
Vector of integers referring to the location of the variable(s) in the data set (confounders are not included in the network!) |
method |
Default method 'res' uses residuals, 'cc' uses complete cases and 'pd' uses pairwise deletion |
A data matrix of residuals.
data(windspeed) daten <- mice::ampute(windspeed)$amp # Impute missing values imp <- mice(daten, m = 5) # Build residuals knoten <- 1:4 confounder <- 5:6 # Residuals based on dataset with missing values res.pd <- makeResiduals(daten, v = knoten, confounder = confounder, method = "pd") # Residuals based in multiple imputed data residuals <- list(data = list(), m = 5) imp_c <- mice::complete(imp, "all") for (i in 1:imp$m){ residuals$data[[i]] <- makeResiduals(imp_c[[i]], v = knoten, confounder = confounder) } pc.res <- pcMI(data = residuals, p = length(knoten), alpha = 0.05) fci.res <- fciMI(data = imp, p = length(knoten), alpha = 0.05) if (requireNamespace("Rgraphviz", quietly = TRUE)){ oldpar <- par(mfrow = c(1,2)) plot(pc.res) plot(fci.res) par(oldpar) }
data(windspeed) daten <- mice::ampute(windspeed)$amp # Impute missing values imp <- mice(daten, m = 5) # Build residuals knoten <- 1:4 confounder <- 5:6 # Residuals based on dataset with missing values res.pd <- makeResiduals(daten, v = knoten, confounder = confounder, method = "pd") # Residuals based in multiple imputed data residuals <- list(data = list(), m = 5) imp_c <- mice::complete(imp, "all") for (i in 1:imp$m){ residuals$data[[i]] <- makeResiduals(imp_c[[i]], v = knoten, confounder = confounder) } pc.res <- pcMI(data = residuals, p = length(knoten), alpha = 0.05) fci.res <- fciMI(data = imp, p = length(knoten), alpha = 0.05) if (requireNamespace("Rgraphviz", quietly = TRUE)){ oldpar <- par(mfrow = c(1,2)) plot(pc.res) plot(fci.res) par(oldpar) }
A likelihood ratio test for (conditional) independence between mixed
(continuous and unordered categorical) variables, to be used within
pcalg::skeleton
, pcalg::pc
or
pcalg::fci
. It assumes that the variables in the test
follow a Conditional Gaussian distribution, i.e. conditional on each
combination of values of the discrete variables, the continuous variables
are multivariate Gaussian. Each multivariate Gaussian distribution is
allowed to have its own mean vector and covariance matrix.
mixCItest(x, y, S = NULL, suffStat, moreOutput = FALSE)
mixCItest(x, y, S = NULL, suffStat, moreOutput = FALSE)
x , y , S
|
(Integer) position of variable X, Y and set of variables S,
respectively, in |
suffStat |
A |
moreOutput |
If |
The implementation follows Andrews et al. (2018). The same test is also implemented in TETRAD and in the R-package rcausal, a wrapper for the TETRAD Java library. Small differences in the p-values returned by CGtest and the TETRAD/rcausal equivalent are due to differences in handling sparse or empty cells.
A p-value. If moreOutput=TRUE
, the test statistic and the
degrees of freedom are returned as well.
Janine Witte
Andrews B., Ramsey J., Cooper G.F. (2018): Scoring Bayesian networks of mixed variables. International Journal of Data Science and Analytics 6:3-18.
Lauritzen S.L., Wermuth N. (1989): Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics 17(1):31-57.
Scheines R., Spirtes P., Glymour C., Meek C., Richardson T. (1998): The TETRAD project: Constraint based aids to causal model specification. Multivariate Behavioral Research 33(1):65-117. http://www.phil.cmu.edu/tetrad/index.html
# load data (numeric and factor variables) dat <- toenail2[,-1] # analyse data mixCItest(4, 1, NULL, suffStat = dat) mixCItest(1, 2, 3, suffStat = dat) ## use mixCItest within pcalg::fci fci.fit <- fci(suffStat = dat, indepTest = mixCItest, alpha = 0.01, p = 4) if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(fci.fit)
# load data (numeric and factor variables) dat <- toenail2[,-1] # analyse data mixCItest(4, 1, NULL, suffStat = dat) mixCItest(1, 2, 3, suffStat = dat) ## use mixCItest within pcalg::fci fci.fit <- fci(suffStat = dat, indepTest = mixCItest, alpha = 0.01, p = 4) if (requireNamespace("Rgraphviz", quietly = TRUE)) plot(fci.fit)
A version of mixCItest
, to be used within pcalg::skeleton
,
pcalg::pc
or pcalg::fci
when the data contain missing values.
Observations where at least one of the variables involved in the test is missing
are deleted prior to performing the test (test-wise deletion).
mixCItwd(x, y, S = NULL, suffStat)
mixCItwd(x, y, S = NULL, suffStat)
x , y , S
|
(Integer) position of variable X, Y and set of variables S,
respectively, in |
suffStat |
|
See mixCItest
for details on the assumptions of the
Conditional Gaussian likelihood ratio test. Test-wise deletion is valid if
missingness does not jointly depend on X and Y.
A p-value.
mixCItest()
for complete data, mixMItest()
for multiply imputed data
## load data (numeric and factor variables) data(toenail2) dat <- toenail2[, -1] ## delete some observations set.seed(123) dat[sample(2000, 20), 1] <- NA dat[sample(2000, 30), 3] <- NA ## analyse data # complete data: ========== mixCItest(1, 2, 4, suffStat=toenail2) # test-wise deletion: ========== mixCItwd(1, 2, 4, suffStat = dat) # list-wise deletion: ========== dat2 <- dat[complete.cases(dat), ] mixCItest(1, 2, 4, suffStat = dat2) ## use mixCItwd within pcalg::pc pc.fit <- pc(suffStat = dat, indepTest = mixCItwd, alpha = 0.01, p = 4)
## load data (numeric and factor variables) data(toenail2) dat <- toenail2[, -1] ## delete some observations set.seed(123) dat[sample(2000, 20), 1] <- NA dat[sample(2000, 30), 3] <- NA ## analyse data # complete data: ========== mixCItest(1, 2, 4, suffStat=toenail2) # test-wise deletion: ========== mixCItwd(1, 2, 4, suffStat = dat) # list-wise deletion: ========== dat2 <- dat[complete.cases(dat), ] mixCItest(1, 2, 4, suffStat = dat2) ## use mixCItwd within pcalg::pc pc.fit <- pc(suffStat = dat, indepTest = mixCItwd, alpha = 0.01, p = 4)
A modified version of mixCItest
, to be used within pcalg::skeleton
,
pcalg::pc
or pcalg::fci
when multiply imputed data sets are available.
mixMItest(x, y, S = NULL, suffStat, moreOutput = FALSE)
mixMItest(x, y, S = NULL, suffStat, moreOutput = FALSE)
x , y , S
|
(integer) position of variable X, Y and set of variables S,
respectively, in |
suffStat |
A list of |
moreOutput |
(only for mixed of discrete variables) If |
See mixCItest
for details on the assumptions of the
Conditional Gaussian likelihood ratio test. CGtestMI
applies this test
to each data.frame
in suffStat
, then combines the results using
the rules in Meng & Rubin (1992).
A p-value. If moreOutput=TRUE
, the test statistic, its main
components and the degrees of freedom are returned as well.
Janine Witte
Meng X.-L., Rubin D.B. (1992): Performing likelihood ratio tests with multiply imputed data sets. Biometrika 79(1):103-111.
## load data (numeric and factor variables) data(toenail2) dat <- toenail2[1:1000, ] ## delete some observations set.seed(123) dat[sample(1000, 20), 2] <- NA dat[sample(1000, 30), 4] <- NA ## impute missing values using random forests (because of run time we just impute 2 chains) imp <- mice(dat, method = "rf", m = 2, printFlag = FALSE) ## analyse data # complete data: mixCItest(2, 3, 5, suffStat = toenail2[1:1000, ]) # multiple imputation: suffMI <- complete(imp, action = "all") mixMItest(2, 3, 5, suffStat = suffMI) # test-wise deletion: mixCItwd(2, 3, 5, suffStat = dat) # list-wise deletion: sufflwd <- dat[complete.cases(dat), ] mixCItest(2, 3, 5, suffStat = sufflwd) ## use mixMItest within pcalg::pc pc.fit <- pc(suffStat = suffMI, indepTest = mixMItest, alpha = 0.01, p = 5) pc.fit
## load data (numeric and factor variables) data(toenail2) dat <- toenail2[1:1000, ] ## delete some observations set.seed(123) dat[sample(1000, 20), 2] <- NA dat[sample(1000, 30), 4] <- NA ## impute missing values using random forests (because of run time we just impute 2 chains) imp <- mice(dat, method = "rf", m = 2, printFlag = FALSE) ## analyse data # complete data: mixCItest(2, 3, 5, suffStat = toenail2[1:1000, ]) # multiple imputation: suffMI <- complete(imp, action = "all") mixMItest(2, 3, 5, suffStat = suffMI) # test-wise deletion: mixCItwd(2, 3, 5, suffStat = dat) # list-wise deletion: sufflwd <- dat[complete.cases(dat), ] mixCItest(2, 3, 5, suffStat = sufflwd) ## use mixMItest within pcalg::pc pc.fit <- pc(suffStat = suffMI, indepTest = mixMItest, alpha = 0.01, p = 5) pc.fit
This function is a modification of pcalg::pc()
to be used for multiple imputation.
pcMI( data, alpha, labels, p, fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf, u2pd = c("relaxed", "rand", "retry"), skel.method = c("stable", "original"), conservative = FALSE, maj.rule = FALSE, solve.confl = FALSE, verbose = FALSE )
pcMI( data, alpha, labels, p, fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf, u2pd = c("relaxed", "rand", "retry"), skel.method = c("stable", "original"), conservative = FALSE, maj.rule = FALSE, solve.confl = FALSE, verbose = FALSE )
data |
An object of type mids, which stands for 'multiply imputed data set', typically created by a call to function mice() |
alpha |
Significance level (number in (0,1) for the conditional independence tests |
labels |
(Optional) character vector of variable (or "node") names. Typically preferred to specifying p. |
p |
(Optional) number of variables (or nodes). May be specified if labels are not, in which case labels is set to 1:p. |
fixedGaps |
A logical matrix of dimension p*p. If entry |
fixedEdges |
A logical matrix of dimension p*p. If entry |
NAdelete |
If indepTest returns NA and this option is TRUE, the corresponding edge is deleted. If this option is FALSE, the edge is not deleted. |
m.max |
Maximal size of the conditioning sets that are considered in the conditional independence tests. |
u2pd |
String specifying the method for dealing with conflicting information when trying to orient edges (see details below). |
skel.method |
Character string specifying method; the default, "stable"
provides an order-independent skeleton, see
|
conservative |
Logical indicating if the conservative PC is used. See
|
maj.rule |
Logical indicating that the triples shall be checked for
ambiguity using a majority rule idea, which is less strict
than the conservative PC algorithm. For more information, see
|
solve.confl |
See |
verbose |
If TRUE, detailed output is provided. |
An object of class "pcAlgo" (see pcAlgo) containing an estimate of the equivalence class of the underlying DAG.
See pcalg::pc()
for more details.
This is a modified function of pcalg::pc()
from the package 'pcalg' (Kalisch et al., 2012;
http://www.jstatsoft.org/v47/i11/).
Original code by Markus Kalisch, Martin Maechler, and Diego Colombo. Modifications by Ronja Foraita.
daten <- mice::ampute(windspeed)$amp ## Impute missing values imp <- mice(daten) pcMI(data = imp, label = colnames(imp$data), alpha = 0.01)
daten <- mice::ampute(windspeed)$amp ## Impute missing values imp <- mice(daten) pcMI(data = imp, label = colnames(imp$data), alpha = 0.01)
This function is a modification of pcalg::skeleton()
to be used for multiple imputation.
skeletonMI( data, alpha, labels, p, method = c("stable", "original"), m.max = Inf, fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, verbose = FALSE )
skeletonMI( data, alpha, labels, p, method = c("stable", "original"), m.max = Inf, fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, verbose = FALSE )
data |
An object of type mids, which stands for 'multiply imputed data set', typically created by a call to function mice() |
alpha |
Significance level |
labels |
(Optional) character vector of variable (or "node") names. Typically preferred to specifying p |
p |
(Optional) number of variables (or nodes). May be specified if labels are not, in which case labels is set to 1:p. |
method |
Character string specifying method; the default, "stable"
provides an order-independent skeleton, see
|
m.max |
Maximal size of the conditioning sets that are considered in the conditional independence tests. |
fixedGaps |
Logical symmetric matrix of dimension p*p. If entry [i,j] is true, the edge i-j is removed before starting the algorithm. Therefore, this edge is guaranteed to be absent in the resulting graph. |
fixedEdges |
A logical symmetric matrix of dimension p*p. If entry [i,j] is true, the edge i-j is never considered for removal. Therefore, this edge is guaranteed to be present in the resulting graph. |
NAdelete |
Logical needed for the case indepTest(*) returns NA. If it is true, the corresponding edge is deleted, otherwise not. |
verbose |
If TRUE, detailed output is provided. |
See pcalg::skeleton()
for more details.
This is a modified function of pcalg::skeleton()
from the package 'pcalg' (Kalisch et al., 2012;
http://www.jstatsoft.org/v47/i11/).
Original code by Markus Kalisch, Martin Maechler, Alain Hauser, and Diego Colombo. Modifications by Ronja Foraita.
data(gmG) n <- nrow(gmG8$x) V <- colnames(gmG8$x) # labels aka node names ## estimate Skeleton data_mids <- mice(gmG8$x, printFlag = FALSE) (skel.fit <- skeletonMI(data = data_mids, alpha = 0.01, labels = V, verbose = FALSE))
data(gmG) n <- nrow(gmG8$x) V <- colnames(gmG8$x) # labels aka node names ## estimate Skeleton data_mids <- mice(gmG8$x, printFlag = FALSE) (skel.fit <- skeletonMI(data = data_mids, alpha = 0.01, labels = V, verbose = FALSE))
Evaluate Causal Graph Discovery Algorithm in Multiple Imputed Data sets
with_graph(data, algo = c("pc", "fci", "fciPlus", "ges"), args, score = FALSE)
with_graph(data, algo = c("pc", "fci", "fciPlus", "ges"), args, score = FALSE)
data |
An object of type mids, which stands for 'multiply imputed data set', typically created by a call to function mice() |
algo |
An algorithm for causal discovery from the package 'pcalg' (see details). |
args |
Additional arguments passed to the algo. Must be a string vector starting with comma, i.e. ", ..." |
score |
Logical indicating whether a score-based or a constrained-based algorithm is applied. |
A list object of S3 class mice::mira-class
.
data(windspeed) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## Impute missing values under normal model imp <- mice(dat, method = "norm", printFlag = FALSE) mylabels <- names(imp$imp) out.fci <- with_graph(data = imp, algo = "fciPlus", args = ", indepTest = gaussCItest, verbose = FALSE, labels = mylabels, alpha = 0.01") out.ges <- with_graph(data = imp, algo = "ges", arg = NULL, score = TRUE) if (requireNamespace("Rgraphviz", quietly = TRUE)){ oldpar <- par(mfrow = c(1,2)) plot(out.fci$res[[1]]) plot(out.ges$res[[1]]$essgraph) par(oldpar) }
data(windspeed) dat <- as.matrix(windspeed) ## delete some observations set.seed(123) dat[sample(1:length(dat), 260)] <- NA ## Impute missing values under normal model imp <- mice(dat, method = "norm", printFlag = FALSE) mylabels <- names(imp$imp) out.fci <- with_graph(data = imp, algo = "fciPlus", args = ", indepTest = gaussCItest, verbose = FALSE, labels = mylabels, alpha = 0.01") out.ges <- with_graph(data = imp, algo = "ges", arg = NULL, score = TRUE) if (requireNamespace("Rgraphviz", quietly = TRUE)){ oldpar <- par(mfrow = c(1,2)) plot(out.fci$res[[1]]) plot(out.ges$res[[1]]$essgraph) par(oldpar) }