Title: | Tiered PC Algorithm |
---|---|
Description: | Constraint-based causal discovery using the PC algorithm while accounting for a partial node ordering, for example a partial temporal ordering when the data were collected in different waves of a cohort study. Andrews RM, Foraita R, Didelez V, Witte J (2021) <arXiv:2108.13395> provide a guide on how to use tpc to analyse cohort data. |
Authors: | Janine Witte [aut], Ronja Foraita [cre, ctb], DFG [fnd] |
Maintainer: | Ronja Foraita <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0 |
Built: | 2025-01-25 05:36:08 UTC |
Source: | https://github.com/bips-hb/tpc |
Simulated data based on 'true_sim' of a European child-and-youth cohort study with three waves (t0, t1 and t2). See Andrews et al. (2021) <https://arxiv.org/abs/2108.13395> for more information on how the data were generated.
dat_cohort
A data frame with 5000 observations and 34 variables (10 variables were measured at three time points each, denoted as "_t0", "_t1" and "_t2").
Sex. Factor variable with levels "male" and "female".
Country of residence. Factor variable with levels "ITA", "EST", "CYP", "BEL", "SWE", "GER", "HUN" and "ESP".
Genotype of one SNP located in the FTO gene. Factor variable with levels "TT", "AT", "AA".
Birth weight in grams (numeric).
Age in years at survey 't0' (numeric).
Age in years at survey 't1' (numeric).
Age in years at survey 't2' (numeric).
Body mass index z-score adjusted for sex and age at survey 't0' (numeric).
Body mass index z-score adjusted for sex and age at survey 't1' (numeric).
Body mass index z-score adjusted for sex and age at survey 't2' (numeric).
Per cent body fat measured at survey 't0' (numeric).
Per cent body fat measured at survey 't1' (numeric).
Per cent body fat measured at survey 't2' (numeric).
Educational level at survey 't0'. Factor variable with levels "low education", "medium education" and "high education".
Educational level at survey 't1'. Factor variable with levels "low education", "medium education" and "high education".
Educational level at survey 't2'. Factor variable with levels "low education", "medium education" and "high education".
Fiber intake in log(mg/kcal) at survey 't0' (numeric).
Fiber intake in log(mg/kcal) at survey 't1' (numeric).
Fiber intake in log(mg/kcal) at survey 't2' (numeric).
Number of audiovisual media in the child's bedroom at survey 't0' (numeric).
Number of audiovisual media in the child's bedroom at survey 't1' (numeric).
Number of audiovisual media in the child's bedroom at survey 't2' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't0' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't1' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't2' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't0' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't1' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't2' (numeric).
Square root of sugar intake score at survey 't0' (numeric).
Square root of sugar intake score at survey 't1' (numeric).
Square root of sugar intake score at survey 't2' (numeric).
Box-Cox-transformed well-being score at survey 't0' (numeric).
Box-Cox-transformed well-being score at survey 't1' (numeric).
Box-Cox-transformed well-being score at survey 't2' (numeric).
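The wave suffixes "_t0", "_t1" and "_t2" can be turned into a tier vector for the `tiers` argument of `tpc()`. The following is a minimal base-R sketch, not code from the package. The column names are hypothetical placeholders, and placing the variables without a wave suffix in the earliest tier is an assumption made for illustration.

```r
# Hypothetical column names: four variables without a wave suffix
# plus one repeated measure per wave
vars <- c("sex", "country", "fto", "birth_weight",
          "bmi_t0", "bmi_t1", "bmi_t2")

# Map the wave suffix to a tier number.
# Assumption: variables without a suffix precede all waves (tier 1).
tier_of <- function(nm) {
  ifelse(grepl("_t0$", nm), 2,
  ifelse(grepl("_t1$", nm), 3,
  ifelse(grepl("_t2$", nm), 4, 1)))
}

tiers <- tier_of(vars)
names(tiers) <- vars
print(tiers)
```

A smaller number denotes an earlier tier, so the resulting vector has the form expected by the `tiers` argument.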
Andrews RM, Foraita R, Didelez V, Witte J (2021). A practical guide to causal discovery with cohort data. <https://doi.org/10.48550/arXiv.2108.13395>
[tpc::dat_cohort_dis()], [tpc::dat_cohort_mis()]
Data from dat_cohort for which all continuous variables have been categorized into three categories.
dat_cohort_dis
A data frame with 5000 observations and 34 variables (10 variables were measured at three time points each, denoted as "_t0", "_t1" and "_t2").
Sex. Factor variable with levels "male" and "female".
Country of residence. Factor variable with levels "ITA", "EST", "CYP", "BEL", "SWE", "GER", "HUN" and "ESP".
Genotype of one SNP located in the FTO gene. Factor variable with levels "TT", "AT", "AA".
Birth weight in grams (numeric).
Age in years at survey 't0' (numeric).
Age in years at survey 't1' (numeric).
Age in years at survey 't2' (numeric).
Body mass index z-score adjusted for sex and age at survey 't0' (numeric).
Body mass index z-score adjusted for sex and age at survey 't1' (numeric).
Body mass index z-score adjusted for sex and age at survey 't2' (numeric).
Per cent body fat measured at survey 't0' (numeric).
Per cent body fat measured at survey 't1' (numeric).
Per cent body fat measured at survey 't2' (numeric).
Educational level at survey 't0'. Factor variable with levels "low education", "medium education" and "high education".
Educational level at survey 't1'. Factor variable with levels "low education", "medium education" and "high education".
Educational level at survey 't2'. Factor variable with levels "low education", "medium education" and "high education".
Fiber intake in log(mg/kcal) at survey 't0' (numeric).
Fiber intake in log(mg/kcal) at survey 't1' (numeric).
Fiber intake in log(mg/kcal) at survey 't2' (numeric).
Number of audiovisual media in the child's bedroom at survey 't0' (numeric).
Number of audiovisual media in the child's bedroom at survey 't1' (numeric).
Number of audiovisual media in the child's bedroom at survey 't2' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't0' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't1' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't2' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't0' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't1' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't2' (numeric).
Square root of sugar intake score at survey 't0' (numeric).
Square root of sugar intake score at survey 't1' (numeric).
Square root of sugar intake score at survey 't2' (numeric).
Box-Cox-transformed well-being score at survey 't0' (numeric).
Box-Cox-transformed well-being score at survey 't1' (numeric).
Box-Cox-transformed well-being score at survey 't2' (numeric).
Andrews RM, Foraita R, Didelez V, Witte J (2021). A practical guide to causal discovery with cohort data. <https://doi.org/10.48550/arXiv.2108.13395>
[tpc::dat_cohort()], [tpc::dat_cohort_mis()]
Data from dat_cohort with missing values.
dat_cohort_mis
A data frame with 5000 observations and 34 variables (10 variables were measured at three time points each, denoted as "_t0", "_t1" and "_t2").
Sex. Factor variable with levels "male" and "female".
Country of residence. Factor variable with levels "ITA", "EST", "CYP", "BEL", "SWE", "GER", "HUN" and "ESP".
Genotype of one SNP located in the FTO gene. Ordinal variable with levels "TT", "AT", "AA".
Birth weight in grams (numeric).
Age in years at survey 't0' (numeric).
Age in years at survey 't1' (numeric).
Age in years at survey 't2' (numeric).
Body mass index z-score adjusted for sex and age at survey 't0' (numeric).
Body mass index z-score adjusted for sex and age at survey 't1' (numeric).
Body mass index z-score adjusted for sex and age at survey 't2' (numeric).
Per cent body fat measured at survey 't0' (numeric).
Per cent body fat measured at survey 't1' (numeric).
Per cent body fat measured at survey 't2' (numeric).
Educational level at survey 't0'. Ordinal variable with levels "low education", "medium education" and "high education".
Educational level at survey 't1'. Ordinal variable with levels "low education", "medium education" and "high education".
Educational level at survey 't2'. Ordinal variable with levels "low education", "medium education" and "high education".
Fiber intake in log(mg/kcal) at survey 't0' (numeric).
Fiber intake in log(mg/kcal) at survey 't1' (numeric).
Fiber intake in log(mg/kcal) at survey 't2' (numeric).
Number of audiovisual media in the child's bedroom at survey 't0' (numeric).
Number of audiovisual media in the child's bedroom at survey 't1' (numeric).
Number of audiovisual media in the child's bedroom at survey 't2' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't0' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't1' (numeric).
Use of audiovisual media in log(h/week+1) at survey 't2' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't0' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't1' (numeric).
Moderate to vigorous physical activity in sqrt(min/day) at survey 't2' (numeric).
Square root of sugar intake score at survey 't0' (numeric).
Square root of sugar intake score at survey 't1' (numeric).
Square root of sugar intake score at survey 't2' (numeric).
Box-Cox-transformed well-being score at survey 't0' (numeric).
Box-Cox-transformed well-being score at survey 't1' (numeric).
Box-Cox-transformed well-being score at survey 't2' (numeric).
Andrews RM, Foraita R, Didelez V, Witte J (2021). A practical guide to causal discovery with cohort data. <https://doi.org/10.48550/arXiv.2108.13395>
[tpc::dat_cohort()], [tpc::dat_cohort_dis()]
A simple graph and corresponding dataset used in the examples illustrating tpc.
dat_sim
A data frame with 1000 observations and 9 numerical variables simulated by drawing from a multivariate distribution according to the DAG true_sim.
numeric
numeric
numeric
numeric
numeric
numeric
numeric
numeric
numeric
This is a modified version of pcalg::udag2pdagRelaxed. It applies Meek's rules to the partially oriented graph obtained after orienting edges between time points / tiers.
MeekRules( gInput, verbose = FALSE, unfVect = NULL, solve.confl = FALSE, rules = rep(TRUE, 4) )
gInput |
'pcAlgo'-object containing skeleton and conditional independence information. |
verbose |
FALSE: No output; TRUE: Details |
unfVect |
Vector containing numbers that encode ambiguous triples (as returned by [tpc.cons.intern()]). This is needed in the conservative and majority rule PC algorithms. |
solve.confl |
If |
rules |
A vector of length 4 containing |
If unfVect = NULL (no ambiguous triples), the four orientation rules are applied to each eligible structure until no more edges can be oriented. Otherwise, unfVect contains the numbers of all ambiguous triples in the graph as determined by [tpc.cons.intern()]. Then the orientation rules take this information into account. For example, if a -> b - c and <a,b,c> is an unambiguous triple and a non-v-structure, then rule 1 implies b -> c. On the other hand, if a -> b - c but <a,b,c> is an ambiguous triple, then the edge b - c is not oriented.
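The effect of ambiguous triples on rule 1 can be illustrated with a small base-R sketch. This is not the implementation used by MeekRules. It assumes an adjacency matrix in which amat[i, j] = 1 and amat[j, i] = 0 encode i -> j, and it passes ambiguous triples as a plain list of index vectors (the hypothetical argument `ambiguous`) rather than as the number-encoded unfVect.

```r
# Sketch of Meek's rule 1: if x -> y - z and x, z are non-adjacent,
# orient y -> z, unless <x, y, z> is marked as ambiguous.
rule1 <- function(amat, ambiguous = list()) {
  p <- nrow(amat)
  # TRUE if <x, y, z> (or <z, y, x>) is listed as ambiguous
  is_ambiguous <- function(x, y, z) {
    any(vapply(ambiguous, function(tr) {
      tr[2] == y && setequal(tr[c(1, 3)], c(x, z))
    }, logical(1)))
  }
  repeat {
    changed <- FALSE
    for (x in seq_len(p)) for (y in seq_len(p)) {
      if (!(amat[x, y] == 1 && amat[y, x] == 0)) next        # need x -> y
      for (z in seq_len(p)) {
        undirected  <- amat[y, z] == 1 && amat[z, y] == 1    # y - z
        nonadjacent <- amat[x, z] == 0 && amat[z, x] == 0    # x, z not adjacent
        if (z != x && undirected && nonadjacent && !is_ambiguous(x, y, z)) {
          amat[z, y] <- 0                                    # orient y -> z
          changed <- TRUE
        }
      }
    }
    if (!changed) break
  }
  amat
}

# a -> b - c with nodes a = 1, b = 2, c = 3
amat <- matrix(0, 3, 3)
amat[1, 2] <- 1
amat[2, 3] <- amat[3, 2] <- 1

oriented  <- rule1(amat)                                # b -> c is oriented
unchanged <- rule1(amat, ambiguous = list(c(1, 2, 3)))  # b - c stays undirected
```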
If solve.confl = FALSE, earlier edge orientations are overwritten by later ones.
If solve.confl = TRUE, both the v-structures and the orientation rules work with lists for the candidate edges and allow bi-directed edges if there are conflicting orientations. For example, two v-structures a -> b <- c and b -> c <- d then yield a -> b <-> c <- d. This option can be used to get an order-independent version of the PC algorithm (see Colombo and Maathuis (2014)).
We denote bi-directed edges, for example between two variables i and j, in the adjacency matrix M of the graph as M[i,j]=2 and M[j,i]=2. Such edges should be interpreted as indications of conflicts in the algorithm, for example due to errors in the conditional independence tests or violations of the faithfulness assumption.
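The encoding of bi-directed edges can be made concrete with a hand-built adjacency matrix for a -> b <-> c <- d. This is only a sketch in base R. The matrix is constructed manually rather than extracted from a fitted object, and the coding of directed edges as M[i, j] = 1 with M[j, i] = 0 is an assumption made for illustration.

```r
nodes <- c("a", "b", "c", "d")
M <- matrix(0, 4, 4, dimnames = list(nodes, nodes))

M["a", "b"] <- 1                 # a -> b (assumed coding of a directed edge)
M["d", "c"] <- 1                 # d -> c
M["b", "c"] <- M["c", "b"] <- 2  # b <-> c, the conflict marker described above

# List each bi-directed edge once by looking at the upper triangle only
conflicts <- which(M == 2 & t(M) == 2 & upper.tri(M), arr.ind = TRUE)
conflict_edges <- data.frame(from = nodes[conflicts[, "row"]],
                             to   = nodes[conflicts[, "col"]])
print(conflict_edges)
```

Listing these edges after a fit is one way to see where conditional independence tests disagreed.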
An object of class pcAlgo-class.
Original code by Markus Kalisch, modifications by Janine Witte.
C. Meek (1995). Causal inference and causal explanation with background knowledge. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95), pp. 403-411. Morgan Kaufmann Publishers.
D. Colombo and M.H. Maathuis (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research 15:3741-3782.
library(tpc)    # provides dat_sim and MeekRules()
library(pcalg)  # provides skeleton() and gaussCItest()
data(dat_sim)
sk.fit <- skeleton(suffStat = list(C = cor(dat_sim), n = nrow(dat_sim)),
                   indepTest = gaussCItest, labels = names(dat_sim), alpha = 0.05)
MeekRules(sk.fit)
Like [pcalg::pc()], but takes into account a user-specified partial ordering of the nodes/variables. This has two effects:
1) The conditional independence between x and y given S is not tested if any variable in S lies in the future of both x and y;
2) edges cannot be oriented from a higher-order to a lower-order node.
In addition, the user may specify individual forbidden edges and context variables.
tpc( suffStat, indepTest, alpha, labels, p, skel.method = c("stable", "stable.parallel"), forbEdges = NULL, m.max = Inf, conservative = FALSE, maj.rule = TRUE, tiers = NULL, context.all = NULL, context.tier = NULL, verbose = FALSE, numCores = NULL, cl.type = "PSOCK", clusterexport = NULL )
suffStat |
A [base::list()] of sufficient statistics, containing all necessary elements for the conditional independence decisions in the function [indepTest()]. |
indepTest |
A function for testing conditional independence. It is internally
called as |
alpha |
significance level (number in (0,1)) for the individual conditional independence tests. |
labels |
(optional) character vector of variable (or "node") names.
Typically preferred to specifying |
p |
(optional) number of variables (or nodes). May be specified if |
skel.method |
Character string specifying method; the default, "stable" provides an order-independent skeleton, see [tpc::tskeleton()]. |
forbEdges |
A logical matrix of dimension p*p. If |
m.max |
Maximal size of the conditioning sets that are considered in the conditional independence tests. |
conservative |
Logical indicating if conservative PC should be used. Defaults to FALSE. See [pcalg::pc()] for details. |
maj.rule |
Logical indicating if the majority rule should be used. Defaults to TRUE. See [pcalg::pc()] for details. |
tiers |
Numeric vector specifying the tier / time point for each variable. Must be of length 'p', if specified, or have the same length as 'labels', if specified. A smaller number corresponds to an earlier tier / time point. |
context.all |
Numeric or character vector. Specifies the positions or names of global context variables. Global context variables have no incoming edges, i.e. no parents, and are themselves parents of all non-context variables in the graph. |
context.tier |
Numeric or character vector. Specifies the positions or names of tier-specific context variables. Tier-specific context variables have no incoming edges, i.e. no parents, and are themselves parents of all non-context variables in the same tier. |
verbose |
if |
numCores |
The number of CPU cores to be used. |
cl.type |
The cluster type. Default value is |
clusterexport |
Character vector. Lists functions to be exported to nodes if numCores > 1. |
See pcalg::pc for further information on the PC algorithm. The PC algorithm is named after its developers Peter Spirtes and Clark Glymour (Spirtes et al., 2000).
Specifying a tier for each variable using the tiers argument has the following effects:
1) In the skeleton and v-structure learning phases, conditional independence testing is restricted such that if x is in tier t(x) and y is in t(y), only those variables are allowed in the conditioning set whose tier is not larger than t(x).
2) Following the v-structure phase, all edges that were found between two tiers are directed towards the higher-order tier. If context variables are specified using context.all and/or context.tier, the corresponding orientations are added in this step.
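Both effects of a tier specification can be sketched in a few lines of base R. This is an illustration, not the package's internal code. The helper names are hypothetical, the conditioning-set restriction is read as "no variable in S lies in the future of both x and y", and directed edges are assumed to be coded as amat[i, j] = 1 with amat[j, i] = 0.

```r
tiers <- c(A1 = 1, B1 = 1, A2 = 2, B2 = 2, A3 = 3)
nodes <- names(tiers)

# Effect 1: S is admissible if none of its members is later than
# the later of x and y (assumed reading of the restriction)
cond_set_allowed <- function(x, y, S, tiers) {
  all(tiers[S] <= max(tiers[c(x, y)]))
}

# Effect 2: edges between tiers point from the earlier to the later tier
orient_across_tiers <- function(amat, tiers) {
  later <- outer(tiers, tiers, ">")  # later[i, j]: i is in a later tier than j
  amat[later] <- 0                   # drop the mark for later -> earlier
  amat
}

# Undirected skeleton with edges A1 - A2, A2 - B2 and B2 - A3
amat <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
for (e in list(c("A1", "A2"), c("A2", "B2"), c("B2", "A3"))) {
  amat[e[1], e[2]] <- amat[e[2], e[1]] <- 1
}
oriented <- orient_across_tiers(amat, tiers)

ok_same_tier <- cond_set_allowed("A1", "A2", "B2", tiers)  # B2 is not later than A2
ok_future    <- cond_set_allowed("A1", "A2", "A3", tiers)  # A3 is later than both
```

In this sketch the edge A2 - B2 stays undirected because both nodes share a tier, while A1 - A2 and B2 - A3 become A1 -> A2 and B2 -> A3.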
An object of class "pcAlgo" (see [pcalg::pcAlgo]) containing an estimate of the equivalence class of the underlying DAG.
Original code by Markus Kalisch, Martin Maechler, and Diego Colombo. Modifications by Janine Witte (Kalisch et al., 2012).
M. Kalisch, M. Maechler, D. Colombo, M.H. Maathuis and P. Buehlmann (2012). Causal Inference Using Graphical Models with the R Package pcalg. Journal of Statistical Software 47(11): 1–26.
P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition. The MIT Press. https://philarchive.org/archive/SPICPA-2.
# load simulated cohort data
data(dat_sim)
n <- nrow(dat_sim)
lab <- colnames(dat_sim)

# estimate skeleton without taking background information into account
tpc.fit <- tpc(suffStat = list(C = cor(dat_sim), n = n),
               indepTest = gaussCItest, alpha = 0.01, labels = lab)
pc.fit <- pcalg::pc(suffStat = list(C = cor(dat_sim), n = n),
                    indepTest = gaussCItest, alpha = 0.01, labels = lab,
                    maj.rule = TRUE, solve.confl = TRUE)
identical(pc.fit@graph, tpc.fit@graph) # TRUE

# estimate skeleton with temporal ordering as background information
tiers <- rep(c(1,2,3), times=c(3,3,3))
tpc.fit2 <- tpc(suffStat = list(C = cor(dat_sim), n = n),
                indepTest = gaussCItest, alpha = 0.01, labels = lab, tiers = tiers)
tpc.fit3 <- tpc(suffStat = list(C = cor(dat_sim), n = n),
                indepTest = gaussCItest, alpha = 0.01, labels = lab, tiers = tiers,
                skel.method = "stable.parallel", numCores = 2,
                clusterexport = c("cor", "ecdf"))

if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  data("true_sim")
  oldpar <- par(mfrow = c(1,3))
  plot(true_sim, main = "True DAG")
  plot(tpc.fit, main = "PC estimate")
  plot(tpc.fit2, main = "tPC estimate")
  par(oldpar)
}

# require that there is no edge between A1 and A3, and that any edge
# between A2 and B2 or A2 and C2 is directed away from A2
forb <- matrix(FALSE, nrow=9, ncol=9)
rownames(forb) <- colnames(forb) <- lab
forb["A1","A3"] <- forb["A3","A1"] <- TRUE
forb["B2","A2"] <- TRUE
forb["C2","A2"] <- TRUE
tpc.fit3 <- tpc(suffStat = list(C = cor(dat_sim), n = n),
                indepTest = gaussCItest, alpha = 0.01, labels = lab,
                forbEdges = forb, tiers = tiers)

if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  # compare estimated CPDAGs
  data("true_sim")
  oldpar <- par(mfrow = c(1,2))
  plot(tpc.fit2, main = "old tPC estimate")
  plot(tpc.fit3, main = "new tPC estimate")
  par(oldpar)
}

# force edge from A1 to all other nodes measured at time 1
# into the graph (note that the edge from A1 to A2 is then
# forbidden)
tpc.fit4 <- tpc(suffStat = list(C = cor(dat_sim), n = n),
                indepTest = gaussCItest, alpha = 0.01, labels = lab,
                tiers = tiers, context.tier = "A1")

if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  # compare estimated CPDAGs
  data("true_sim")
  plot(tpc.fit4, main = "alternative tPC estimate")
}

# force edge from A1 to all other nodes into the graph
tpc.fit5 <- tpc(suffStat = list(C = cor(dat_sim), n = n),
                indepTest = gaussCItest, alpha = 0.01, labels = lab,
                tiers = tiers, context.all = "A1")

if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  # compare estimated CPDAGs
  data("true_sim")
  plot(tpc.fit5, main = "alternative tPC estimate")
}
Like pcalg::pc.cons.intern, but takes into account the user-specified partial node/variable ordering.
tpc.cons.intern( sk, suffStat, indepTest, alpha, version.unf = c(NA, NA), maj.rule = FALSE, forbEdges = NULL, tiers = NULL, context.all = NULL, context.tier = NULL, verbose = FALSE )
sk |
A skeleton object as returned from |
suffStat |
Sufficient statistic: List containing all relevant elements for the conditional independence decisions. |
indepTest |
Pre-defined |
alpha |
Significance level for the individual conditional independence tests. |
version.unf |
Vector of length two. If |
maj.rule |
Logical indicating if the triples are checked for ambiguity using the majority rule idea, which is less strict than the standard conservative method. |
forbEdges |
A logical matrix of dimension |
tiers |
Numeric vector specifying the tier / time point for each variable. A smaller number corresponds to an earlier tier / time point. |
context.all |
Numeric or character vector. Specifies the positions or names of global context variables. Global context variables have no incoming edges, i.e. no parents, and are themselves parents of all non-context variables in the graph. |
context.tier |
Numeric or character vector. Specifies the positions or names of tier-specific context variables. Tier-specific context variables have no incoming edges, i.e. no parents, and are themselves parents of all non-context variables in the same tier. |
verbose |
Logical asking for detailed output. |
See pcalg::pc.cons.intern for further information on the majority and conservative approaches to learning v-structures.
Specifying a tier for each variable using the tiers argument has the following effects:
1) Only those triples x-y-z are considered as potential v-structures that satisfy t(y)=max(t(x),t(z)). This allows for three constellations: either y is in the same tier as x and both are later than z, or y is in the same tier as z and both are later than x, or all three are in the same tier. Triples where y is earlier than one or both of x and z need not be considered, as y being a collider would be against the partial ordering. Triples where y is later than both x and z will be oriented later in the PC algorithm and are left out here to minimize the number of conditional independence tests.
2) Conditional independence testing is restricted such that if x is in tier t(x) and y is in t(y), only those variables are allowed in the conditioning set whose tier is not larger than t(x).
Context variables specified via context.all or context.tier are not considered as candidate colliders or candidate parents of colliders.
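The tier condition for candidate v-structures, t(y) = max(t(x), t(z)), can be checked with a one-line helper. The function name and the node labels below are hypothetical and chosen only for illustration. The sketch does not reproduce how tpc.cons.intern enumerates triples.

```r
tiers <- c(A1 = 1, B1 = 1, C1 = 1, A2 = 2, B2 = 2, A3 = 3)

# A triple x - y - z is a candidate v-structure only if the middle node y
# sits in the same tier as the later of x and z
triple_eligible <- function(x, y, z, tiers) {
  tiers[[y]] == max(tiers[[x]], tiers[[z]])
}

same_tier    <- triple_eligible("A1", "B1", "C1", tiers)  # all three in tier 1
y_with_later <- triple_eligible("A1", "A2", "B2", tiers)  # y shares the tier of z
y_too_early  <- triple_eligible("A1", "B1", "A2", tiers)  # y is earlier than z
y_too_late   <- triple_eligible("A1", "A3", "B1", tiers)  # y is later than both
```

The first two triples are eligible. The third is excluded because y as a collider would contradict the ordering, and the fourth is left to the later orientation of edges between tiers.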
numeric vector of triples coded as numbers (via pcalg::triple2numb) that were marked as ambiguous.
The updated skeleton-object (separating sets might have been updated).
Original code by Markus Kalisch and Diego Colombo. Modifications by Janine Witte.
The DAG from which the data 'dat_cohort' were simulated. See Andrews et al. (2021) <https://arxiv.org/abs/2108.13395> for more information on how the data were generated.
true_cohort
A DAG (graphNEL object) with 34 nodes and 128 edges.
Andrews RM, Foraita R, Didelez V, Witte J (2021). A practical guide to causal discovery with cohort data. <https://doi.org/10.48550/arXiv.2108.13395>
See [graph::graphNEL()] for the class 'graphNEL'.
An example DAG from which the data 'dat_sim' were simulated.
true_sim
A DAG (graphNEL object) with 9 nodes and 7 edges.
See [graph::graphNEL()] for the class 'graphNEL'.
Like pcalg::skeleton, but takes a user-specified partial node ordering into account. The conditional independence between x and y given S is not tested if any variable in S lies in the future of both x and y.
tskeleton( suffStat, indepTest, alpha, labels, p, method = c("stable", "original"), m.max = Inf, fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, tiers = NULL, verbose = FALSE )
suffStat |
A list of sufficient statistics, containing all necessary elements for
the conditional independence decisions in the function |
indepTest |
Predefined |
alpha |
Significance level (number in (0,1)) for the individual conditional independence tests. |
labels |
(optional) character vector of variable (or "node") names.
Typically preferred to specifying |
p |
(optional) number of variables (or nodes). May be specified if |
method |
Character string specifying method; the default, "stable" provides an order-independent skeleton, see 'Details' below. |
m.max |
Maximal size of the conditioning sets that are considered in the conditional independence tests. |
fixedGaps |
logical symmetric matrix of dimension |
fixedEdges |
a logical symmetric matrix of dimension |
NAdelete |
logical needed for the case |
tiers |
Numeric vector specifying the tier / time point for each variable.
Must be of length 'p', if specified, or have the same length as 'labels', if specified.
A smaller number corresponds to an earlier tier / time point. Conditional independence
testing is restricted such that if |
verbose |
if |
See pcalg::skeleton for further information on the skeleton algorithm.
An object of class "pcAlgo" (see pcalg::pcAlgo) containing an estimate of the skeleton of the underlying DAG, the conditioning sets (sepset) that led to edge removals and several other parameters.
Original code by Markus Kalisch, Martin Maechler, Alain Hauser and Diego Colombo. Modifications by Janine Witte.
# load simulated cohort data
data("dat_sim")
n <- nrow(dat_sim)
lab <- colnames(dat_sim)

# estimate skeleton without taking background information into account
tskel.fit <- tskeleton(suffStat = list(C = cor(dat_sim), n = n),
                       indepTest = gaussCItest, alpha = 0.01, labels = lab)
skel.fit <- pcalg::skeleton(suffStat = list(C = cor(dat_sim), n = n),
                            indepTest = gaussCItest, alpha = 0.01, labels = lab)
identical(skel.fit@graph, tskel.fit@graph) # TRUE

# estimate skeleton with temporal ordering as background information
tiers <- rep(c(1,2,3), times=c(3,3,3))
tskel.fit2 <- tskeleton(suffStat = list(C = cor(dat_sim), n = n),
                        indepTest = gaussCItest, alpha = 0.01, labels = lab,
                        tiers = tiers)

# in this case, the skeletons estimated with and without
# background knowledge are identical, but fewer conditional
# independence tests were performed when background
# knowledge was taken into account
identical(tskel.fit@graph, tskel.fit2@graph) # TRUE
tskel.fit@n.edgetests
tskel.fit2@n.edgetests