Title: | Interface to Diverse Estimation Methods of Causal Networks |
---|---|
Description: | Unified interface for the estimation of causal networks, including the methods 'backShift' (from package 'backShift'), 'bivariateANM' (bivariate additive noise model), 'bivariateCAM' (bivariate causal additive model), 'CAM' (causal additive model) (from package 'CAM'; the package is temporarily unavailable on the CRAN repository; formerly available versions can be obtained from the archive), 'hiddenICP' (invariant causal prediction with hidden variables), 'ICP' (invariant causal prediction) (from package 'InvariantCausalPrediction'), 'GES' (greedy equivalence search), 'GIES' (greedy interventional equivalence search), 'LINGAM', 'PC' (PC Algorithm), 'FCI' (fast causal inference), 'RFCI' (really fast causal inference) (all from package 'pcalg') and regression. |
Authors: | Christina Heinze-Deml <[email protected]>, Nicolai Meinshausen <[email protected]> |
Maintainer: | Christina Heinze-Deml <[email protected]> |
License: | GPL |
Version: | 0.2.6.2 |
Built: | 2025-02-20 03:45:48 UTC |
Source: | https://github.com/christinaheinze/comparecausalnetworks |
Provides a unified interface to various causal graph estimation methods.
Package: | CompareCausalNetworks |
Type: | Package |
Version: | 0.2.2 |
Date: | 2018-05-18 |
License: | GPL |
The causal graphs can be estimated with function getParents and a stability-selection version is available at getParentsStable.
The supported methods are provided through the packages listed in Suggests. Thus, to use a particular method the corresponding package needs to be installed on your machine. To run the examples, most of these packages need to be installed.
Christina Heinze-Deml <[email protected]>, Nicolai Meinshausen <[email protected]>
Estimates the connectivity matrix of a directed causal graph, using various possible methods. Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC.
getParents( X, environment = NULL, interventions = NULL, parentsOf = 1:ncol(X), method = c("arges", "backShift", "bivariateANM", "bivariateCAM", "CAM", "fci", "fciplus", "ges", "gies", "hiddenICP", "ICP", "LINGAM", "mmhc", "rankArges", "rankFci", "rankGes", "rankGies", "rankPc", "rfci", "pc", "regression")[12], alpha = 0.1, mode = c("raw", "parental", "ancestral")[1], variableSelMat = NULL, excludeTargetInterventions = TRUE, onlyObservationalData = FALSE, indexObservationalData = 1, returnAsList = FALSE, sparse = FALSE, directed = FALSE, pointConf = FALSE, setOptions = list(), assumeNoSelectionVars = TRUE, verbose = FALSE, ... )
X |
A |
environment |
An optional vector of length |
interventions |
An optional list of length |
parentsOf |
The variables for which we would like to estimate the
parents. Default are all variables. Currently only used with |
method |
A string that specifies the method to use. The methods
|
alpha |
The level at which tests are done. This leads to confidence
intervals for |
mode |
Determines output type - can be "raw" or one of the queries "isParent",
"isMaybeParent", "isNoParent", "isAncestor","isMaybeAncestor", "isNoAncestor".
If "raw", |
variableSelMat |
An optional logical matrix of dimension |
excludeTargetInterventions |
When looking for parents of variable |
onlyObservationalData |
If set to |
indexObservationalData |
Index in |
returnAsList |
If set to |
sparse |
If set to |
directed |
If |
pointConf |
If |
setOptions |
A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values. |
assumeNoSelectionVars |
Set to |
verbose |
If |
... |
Parameters to be passed to underlying method's function. |
If option returnAsList is FALSE, a sparse matrix, where a 0 entry in position (j,k) corresponds to an estimate of "no edge" j -> k, while an entry 1 corresponds to an estimated edge. If option pointConf is TRUE, the 1 entries will be replaced by numerical values that are either point estimates of the causal coefficients or confidence bounds (see above).
If option returnAsList is TRUE, a list will be returned. The k-th entry in the list is the numeric vector with the indices of the estimated parents of node k.
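The relation between the two return formats can be illustrated with a minimal base R sketch. The matrix `Ahat` below is a hypothetical, hand-written estimate (not output of getParents), and the conversion is only an illustration of the convention that entry (j,k) encodes an edge j -> k.

```r
# Hypothetical 3-node estimate: entry (j, k) = 1 encodes an edge j -> k.
Ahat <- matrix(0, nrow = 3, ncol = 3)
Ahat[1, 2] <- 1
Ahat[2, 3] <- 1
Ahat[1, 3] <- 1

# The parents of node k are the rows with a non-zero entry in column k.
parents_list <- lapply(seq_len(ncol(Ahat)), function(k) which(Ahat[, k] != 0))
print(parents_list)
```

In this sketch node 1 has no parents, node 2 has parent 1, and node 3 has parents 1 and 2.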
Christina Heinze-Deml [email protected], Nicolai Meinshausen [email protected]
Naftali Harris and Mathias Drton: PC Algorithm for Nonparanormal Graphical Models. J. Mach. Learn. Res. 14(1) 2013.
getParentsStable for stability selection-based estimation of the causal graph.
    ## load the backShift package for data generation and plotting functionality
    if(require(backShift) & require(pcalg)){
      # Simulate data with connectivity matrix A with assumptions
      # 1) hidden variables present
      # 2) precise location of interventions is assumed unknown
      # 3) different environments can be distinguished

      ## simulate data
      myseed <- 1

      # sample size n
      n <- 10000

      # p=3 predictor variables and connectivity matrix A
      p <- 3
      labels <- c("1", "2", "3")
      A <- diag(p)*0
      A[1,2] <- 0.8
      A[2,3] <- 0.8
      A[3,1] <- -0.4

      # divide data in 10 different environments
      G <- 10

      # simulate
      simResult <- backShift::simulateInterventions(n, p, A, G, intervMultiplier = 3,
                     noiseMult = 1, nonGauss = TRUE, hiddenVars = TRUE,
                     knownInterventions = FALSE, fracVarInt = NULL,
                     simulateObs = TRUE, seed = myseed)
      X <- simResult$X
      environment <- simResult$environment

      ## apply all methods given in vector 'methods'
      ## (using all data pooled for pc/LINGAM/rfci/ges -- can be changed with option
      ## 'onlyObservationalData=TRUE')
      methods <- c("backShift", "LINGAM") #c("pc", "rfci", "ges")

      # select whether you want to run stability selection
      stability <- FALSE

      # arrange graphical output into a rectangular grid
      sq <- ceiling(sqrt(length(methods)+1))
      par(mfrow=c(ceiling((length(methods)+1)/sq),sq))

      ## plot and print true graph
      cat("\n true graph is ------ \n" )
      print(A)
      plotGraphEdgeAttr(A, plotStabSelec = FALSE, labels = labels, thres.point = 0,
                        main = "TRUE GRAPH")

      ## loop over all methods and compute and print/plot estimate
      for (method in methods){
        cat("\n result for method", method," ------ \n" )

        if(!stability){
          # Option 1): use this estimator as a point estimate
          Ahat <- getParents(X, environment, method=method, alpha=0.1, pointConf = TRUE)
        }else{
          # Option 2): use a stability selection based estimator
          # with expected number of false positives bounded by EV=2
          Ahat <- getParentsStable(X, environment, EV=2, method=method, alpha=0.1)
        }

        # print and plot estimate (point estimate thresholded if numerical estimates
        # are returned)
        print(Ahat)
        if(!stability)
          plotGraphEdgeAttr(Ahat, plotStabSelec = FALSE, labels = labels,
                            thres.point = 0.05,
                            main=paste("POINT ESTIMATE FOR METHOD\n", toupper(method)))
        else
          plotGraphEdgeAttr(Ahat, plotStabSelec = TRUE, labels = labels,
                            thres.point = 0,
                            main = paste("STABILITY SELECTION ESTIMATE\n FOR METHOD",
                                         toupper(method)))
      }
    }else{
      cat("\nThe packages 'backShift' and 'pcalg' are needed for the examples to work. Please install them.")
    }
Estimates the connectivity matrix of a directed causal graph, using various possible methods. Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC. Uses stability selection to select an appropriate sparseness.
getParentsStable( X, environment, interventions = NULL, EV = 1, nodewise = TRUE, threshold = 0.75, nsim = 100, sampleSettings = 1/sqrt(2), sampleObservations = 1/sqrt(2), parentsOf = 1:ncol(X), method = c("ICP", "hiddenICP", "backShift", "pc", "LINGAM", "ges", "gies", "CAM", "fci", "rfci", "regression", "bivariateANM", "bivariateCAM")[1], alpha = 0.1, mode = c("raw", "parental", "ancestral")[1], variableSelMat = NULL, excludeTargetInterventions = TRUE, onlyObservationalData = FALSE, indexObservationalData = NULL, setOptions = list(), verbose = FALSE )
X |
A (nxp)-data matrix with n observations of p variables. |
environment |
A vector of length n, where the entry for
observation i is an index for the environment in which observation i took
place (simplest case entries |
interventions |
An optional list of length n. The entry for observation
i is a numeric vector that specifies the variables on which interventions
happened for observation i (a scalar if an intervention happened on just
one variable and |
EV |
A bound on the expected number of falsely selected edges. |
nodewise |
If |
threshold |
The empirical selection frequency in (0.5,1) under subsampling that needs to be surpassed for an edge to be selected. |
nsim |
The number of resamples for stability selection. |
sampleSettings |
The fraction of different environments to resample in each resampling (at least two different environments will be selected so the argument is without effect if there are just two different environments in total). |
sampleObservations |
The fraction of samples to resample in each environment. |
parentsOf |
The variables for which we would like to estimate the parents. Default are all variables. |
method |
A string that specifies the method to use. The methods
|
alpha |
The level at which tests are done. This leads to confidence
intervals for |
mode |
Output type: can be "raw", "parental" or "ancestral". If "raw", the output is that of the underlying method, without modifications. If "parental", the output describes parental relations; if "ancestral", the output is cast to ancestral relations. |
variableSelMat |
An optional logical matrix of dimension (pxp). An
entry |
excludeTargetInterventions |
When looking for parents of variable k
in 1,...,p, set to |
onlyObservationalData |
If set to |
indexObservationalData |
Index in |
setOptions |
A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values. |
verbose |
If |
A sparse matrix, where a 0 entry in (j,k) corresponds to an estimate of 'no edge' j -> parentsOf[k]. Entries between 0 and 100 give the selection percentage of this edge over all resamples (set to 0 if below critical threshold) and all non-zero values are considered as selected edges.
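How selection percentages and the threshold interact can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the array `selected` is a hypothetical stand-in for the per-resample edge estimates, and an edge is kept only if its selection frequency strictly exceeds `threshold`.

```r
p <- 3
nsim <- 100
threshold <- 0.75

# Hypothetical per-resample estimates: selected[j, k, s] = 1 if edge j -> k
# was found in resample s. Edge 1 -> 2 appears in 90 resamples, 2 -> 3 in 40.
selected <- array(0, dim = c(p, p, nsim))
selected[1, 2, 1:90] <- 1
selected[2, 3, 1:40] <- 1

# Selection percentage of each edge over all resamples.
counts <- apply(selected, c(1, 2), sum)
freq <- 100 * counts / nsim

# Edges that do not surpass the threshold are set to 0.
freq[freq <= 100 * threshold] <- 0
print(freq)
```

Only the edge 1 -> 2 survives in this example, with a selection percentage of 90.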
Nicolai Meinshausen [email protected], Christina Heinze-Deml [email protected]
Stability selection (2010): N. Meinshausen and P. Buhlmann, Journal of the Royal Statistical Society: Series B, 72, 417-473
getParents for the underlying point-estimate of the causal graph.
Estimates a ranking of edges for a given query, e.g. for parental relations in the underlying causal graph structure, using various possible methods.
Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC.
getRanking( X, environment, interventions = NULL, queries = c("isParent", "isMaybeParent", "isNoParent", "isAncestor", "isMaybeAncestor", "isNoAncestor"), method = c("ICP", "hiddenICP", "backShift", "pc", "LINGAM", "ges", "gies", "CAM", "fci", "rfci", "regression", "bivariateANM", "bivariateCAM")[1], alpha = 0.1, variableSelMat = NULL, excludeTargetInterventions = TRUE, onlyObservationalData = FALSE, indexObservationalData = NULL, setOptions = list(), assumeNoSelectionVars = TRUE, nsim = 100, sampleSettings = 1/sqrt(2), sampleObservations = 1/sqrt(2), verbose = FALSE, ... )
X |
A |
environment |
A vector of length |
interventions |
An optional list of length n. The entry for observation
i is a numeric vector that specifies the variables on which interventions
happened for observation i (a scalar if an intervention happened on just
one variable and |
queries |
One (or more of) "isParent", "isMaybeParent", "isNoParent", "isAncestor","isMaybeAncestor", "isNoAncestor" |
method |
A string that specifies the method to use. The methods
|
alpha |
The level at which tests are done. This leads to confidence
intervals for |
variableSelMat |
An optional logical matrix of dimension (pxp). An
entry |
excludeTargetInterventions |
When looking for parents of variable k
in 1,...,p, set to |
onlyObservationalData |
If set to |
indexObservationalData |
Index in |
setOptions |
A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values. |
assumeNoSelectionVars |
Set to |
nsim |
The number of resamples for stability selection. |
sampleSettings |
The fraction of different environments to resample in each resampling (at least two different environments will be selected so the argument is without effect if there are just two different environments in total). |
sampleObservations |
The fraction of samples to resample in each environment. |
verbose |
If |
... |
Parameters to be passed to underlying method's function. |
For both parental and ancestral relations, three queries are supported. The existence of a relation is assessed by the queries isParent and isAncestor; the absence of a relation is assessed by the queries isNoParent and isNoAncestor; the potential existence of a relation is addressed by the queries isMaybeParent and isMaybeAncestor.
All queries return a connectivity matrix which we denote by $A$. The interpretation of the entries of $A$ differs according to the considered query:
Parental relations: Queries concerning parental relations can only be answered by those methods under consideration that return a DAG, a CPDAG or a directed cyclic graph. When we say that a particular method cannot answer a given query, then the method's output with respect to this query will be the zero matrix. However, the eventual ranking for such a query will not necessarily be random due to the tie breaking scheme that is applied when ranking pairs of variables (see below).
isParent
In the connectivity matrix returned by this query, the entry $A_{i,j} = 1$ means that there is a directed edge from node $i$ to node $j$ in the graph structure estimated by the method under consideration. Otherwise, $A_{i,j} = 0$.
isMaybeParent
$A_{i,j} = 1$ means that there is a directed or an undirected edge from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
isNoParent
$A_{i,j} = 1$ means that there is neither a directed nor an undirected edge from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
Ancestral relations: Queries concerning ancestral relations can be answered by all methods under consideration.
isAncestor
$A_{i,j} = 1$ means that there is a directed path from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$. In case of PAGs, directed paths can
contain the edge types
and
. Including the latter
edge type in this category implies that we exclude the existence of selection
variables.
isMaybeAncestor
$A_{i,j} = 1$ means that there is a path from node $i$ to node $j$ that contains directed and/or undirected edges. Otherwise, $A_{i,j} = 0$. For PAGs, such paths can contain the edge
types
,
,
and/or
. Otherwise,
.
isNoAncestor
$A_{i,j} = 1$ means that there is neither a directed path nor a partially directed path from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
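For an estimated DAG, the isAncestor matrix is the transitive closure of the adjacency matrix. The following base R sketch illustrates this on a hypothetical 4-node DAG. It assumes the convention that entry (i, j) = 1 encodes an edge (or directed path) from node i to node j, and it is not the package's own code.

```r
# Hypothetical DAG on 4 nodes with edges 1 -> 2, 2 -> 3 and 1 -> 4.
p <- 4
A <- matrix(0, p, p)
A[1, 2] <- 1
A[2, 3] <- 1
A[1, 4] <- 1

# Transitive closure: accumulate directed paths of length 1, 2, ..., p - 1.
# (The loop as written assumes p >= 3.)
isAncestor <- A
pathsOfLen <- A
for (len in 2:(p - 1)) {
  pathsOfLen <- (pathsOfLen %*% A > 0) * 1
  isAncestor <- ((isAncestor + pathsOfLen) > 0) * 1
}
print(isAncestor)
```

Besides the three direct edges, the result contains the entry (1, 3), since node 1 reaches node 3 via node 2.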
Stability ranking: To obtain a ranking of edges for a given set of queries, we run the method under consideration on nsim random subsamples of the data. In each round, we draw samples from a fraction of settings, where the size of the fraction is specified by sampleSettings. In each chosen setting, we sample a fraction of observations uniformly at random without replacement, where the size of the fraction is specified by sampleObservations.
For each subsample we randomly permute the order of the variables in the input. Methods that are order-dependent can therefore not exploit any potential advantage stemming from a data matrix with columns ordered according to the causal ordering or a similar one. We then run the method on each subsample.
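One round of this subsampling scheme can be sketched in base R as follows. The data, the variable name `env` and the rounding of the fractions are assumptions made for illustration; the package may round or select differently.

```r
set.seed(1)
n <- 200
p <- 4
X <- matrix(rnorm(n * p), n, p)
env <- rep(1:5, each = n / 5)          # 5 settings with 40 observations each
sampleSettings <- 1 / sqrt(2)
sampleObservations <- 1 / sqrt(2)

# Choose a fraction of the settings, but keep at least two of them.
settings <- unique(env)
nSet <- max(2, round(sampleSettings * length(settings)))
chosen <- sample(settings, nSet)

# Within each chosen setting, draw observations without replacement.
rows <- unlist(lapply(chosen, function(s) {
  idx <- which(env == s)
  sample(idx, round(sampleObservations * length(idx)))
}))

# Randomly permute the order of the variables.
perm <- sample(p)
Xsub <- X[rows, perm, drop = FALSE]
envSub <- env[rows]
dim(Xsub)
```

The permutation `perm` would have to be undone on the estimated graph before results from different subsamples are aggregated.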
For each subsample and a particular query, we obtain the corresponding connectivity matrix $A$. We can then rank all pairs of nodes $i,j$ according to the frequency of the occurrence of $A_{i,j} = 1$ across subsamples. Ties between pairs of variables can be broken with the results of the other queries if they are also computed as specified by queries; otherwise ties are broken at random:
If the query is isParent, ties are broken with counts for isMaybeParent.
For the query isMaybeParent ties are broken with counts for isParent, i.e. in case of equal counts we give a preference to the edge that was considered more often to be a 'certain' parent. For methods returning DAGs this scheme makes the ranking for isMaybeParent equal to the result for isParent, up to the random tie breaking that is applied for isParent.
If the query is isNoParent, ties are broken according to which edge was selected less often in the query isMaybeParent.
If the query is isAncestor, ties are broken with counts for isMaybeAncestor.
For the query isMaybeAncestor ties are broken with counts for isAncestor, i.e. in case of equal counts we give a preference to the edge that was considered more often to be a 'certain' ancestor. For methods returning DAGs this scheme makes the ranking for isMaybeAncestor equal to the result for isAncestor, up to the random tie breaking that is applied for isAncestor.
If the query is isNoAncestor, ties are broken according to which one was selected less often in the query isMaybeAncestor.
If the tie breaking matrix defined according to these rules is 0, a matrix with standard normal random entries is used to break ties. Similarly, if there are remaining ties after applying the tie breaking rules described above, ties are broken randomly.
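The ranking with tie breaking described above can be sketched in base R for the query isParent. The count matrices are hypothetical, and the use of `order()` with three keys is an illustrative choice rather than the package's implementation.

```r
set.seed(1)
p <- 3

# Hypothetical counts over subsamples; entry (i, j) refers to the edge i -> j.
countsParent <- matrix(c(0, 8, 8,
                         0, 0, 3,
                         0, 0, 0), p, p, byrow = TRUE)
countsMaybeParent <- matrix(c(0, 9, 10,
                              0, 0, 3,
                              0, 0, 0), p, p, byrow = TRUE)

# All ordered pairs (i, j) with i != j.
pairs <- which(row(countsParent) != col(countsParent), arr.ind = TRUE)

# Rank by isParent counts, break ties with isMaybeParent counts,
# and break any remaining ties with standard normal random values.
randomTie <- rnorm(nrow(pairs))
ord <- order(-countsParent[pairs], -countsMaybeParent[pairs], -randomTie)
ranking <- pairs[ord, , drop = FALSE]
colnames(ranking) <- c("from", "to")
print(ranking)
```

The edges 1 -> 2 and 1 -> 3 are tied on the isParent counts, so the larger isMaybeParent count places 1 -> 3 first. The three pairs with zero counts in both matrices are ordered at random.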
A list with the following entries:
ranking
A list of length length(queries). For each query,
the corresponding list entry contains a matrix of dimension
with the ranking of edges. E.g. the first row indicates that the edge from
ranking$isParent[1,1] to ranking$isParent[1,2] is the most likely edge according
to the method under consideration.
resList
A list of length length(queries). For each query,
the corresponding list entry contains a matrix of dimension with the counts for
across the
nsim
subsamples.
simEstimates
A list of length nsim with the method's output for each of the nsim subsamples.
Christina Heinze-Deml [email protected]
getParents for the underlying point-estimate of the causal graph.
    data("simDataInv")
    X <- simDataInv$X
    set.seed(1)

    if(require(pcalg)){
      rank <- getRanking(X, environment = simDataInv$environment,
                         queries = c("isParent","isMaybeParent"),
                         method = c("LINGAM"), verbose = FALSE)

      # estimated ranking
      print(rank$ranking$isParent)

      # true adjacency matrix
      print(simDataInv$configs$trueA)
    }else{
      cat("\nThe package 'pcalg' is needed for the example to work. Please install it.")
    }
A dataset to run the tests.
simDataInv
A list created by simulateInterventions. All inputs are contained in the list element configs. For details see the help page of simulateInterventions.
Simulate data from a causal (possibly cyclic) model under interventions.
simulateInterventions( n, p, df, rhoNoise, snrPar, sparse, doInterv, numberInt, strengthInt, cyclic, strengthCycle, modelMis = FALSE, modelMisPar = 1, seed = 1 )
n |
Number of observations. |
p |
Number of variables. |
df |
Degrees of freedom in t-distribution of noise and interventions. |
rhoNoise |
Correlation between noise terms to model hidden variables. Set to 0 for independent noise. |
snrPar |
Signal-to-noise parameter: steers what proportion of the variance stems from
the signal and what proportion from the noise. The SNR is given by $SNR = (1- |
sparse |
Probability that an entry |
doInterv |
Set to TRUE if interventions should be do-interventions; otherwise noise interventions (also called shift interventions) are generated. |
numberInt |
Total number of settings. |
strengthInt |
Regulates the strength of the interventions, see details. |
cyclic |
Set to TRUE if the resulting graph should contain a cycle. |
strengthCycle |
Steers strength of feedback, see details. |
modelMis |
Add a model misspecification that applies |
modelMisPar |
Parameter steering the strength of the model misspecification. |
seed |
Random seed. |
The adjacency matrix $A$ is generated as follows. Assume the variables with indices $1, \ldots, p$ are causally ordered. For each edge from node $i$ to node $j$ where $i$ precedes $j$ in the causal ordering, we draw a sample from Bin(sparse) to determine whether to add an edge from node $i$ to node $j$. After having sampled the non-zero entries of $A$ in this fashion, we sample the coefficients from Unif(-1,1). As described below, the edge weights are later rescaled to achieve a specified signal-to-noise ratio. We exclude the possibility of $A = 0$, i.e. we resample until $A$ contains at least one non-zero entry.
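The sampling of the adjacency matrix can be sketched in base R as follows. This is a minimal illustration, not the code of simulateInterventions. It assumes that the variables are already in causal order, that entry (i, j) holds the weight of the edge from node i to node j, and that the binomial draw is a single trial with success probability sparse. The later rescaling of the weights is omitted.

```r
set.seed(1)
p <- 5
sparse <- 0.3

repeat {
  A <- matrix(0, p, p)
  upper <- upper.tri(A)                 # pairs (i, j) with i before j
  nPairs <- sum(upper)

  # One Bernoulli draw per candidate edge decides whether it is present.
  hasEdge <- rbinom(nPairs, size = 1, prob = sparse) == 1

  # Present edges get a coefficient from Unif(-1, 1).
  A[upper] <- ifelse(hasEdge, runif(nPairs, min = -1, max = 1), 0)

  # Resample if no edge was drawn at all.
  if (any(A != 0)) break
}
print(round(A, 2))
```

Because only the upper triangle is filled, the resulting graph is acyclic by construction.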
Second, the interventions are generated as follows. numberInt denotes the total number of (interventional and observational) settings that are generated. For each variable, we sample uniformly at random with replacement one setting in which this variable is intervened on. In other words, each variable is intervened on in exactly one setting. Hence it is possible that there are settings where no interventions take place which then correspond to the observational case. Similarly, there may be settings where interventions are performed on multiple variables at once. After defining the settings, we sample (uniformly at random with replacement) what setting each data point belongs to. So for each setting we generate approximately the same number of samples. In one generated data set, the interventions are all of the same type, i.e. they are either all shift interventions (when doInterv = FALSE) or do-interventions (when doInterv = TRUE). In both cases, an intervention
on is modelled by generating
as
strengthInt
(
dfNoise
).
If strengthInt = 0, all interventional settings correspond to purely observational data.
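The assignment of variables and observations to settings can be sketched in base R as follows. The names `whereInt` and `interventions` mirror the entries of the returned list, but the code is an illustration under these assumptions, not the function's implementation. The generation of the intervention values themselves is left out.

```r
set.seed(1)
n <- 100
p <- 4
numberInt <- 3

# For each variable, draw the one setting in which it is intervened on.
settingOfVar <- sample(seq_len(numberInt), p, replace = TRUE)

# Variables intervened on in each setting (possibly none, possibly several).
whereInt <- lapply(seq_len(numberInt), function(s) which(settingOfVar == s))

# Assign each observation to a setting uniformly at random.
env <- sample(seq_len(numberInt), n, replace = TRUE)

# Intervention targets for each observation.
interventions <- whereInt[env]
table(env)
```

A setting with an empty entry in `whereInt` corresponds to the observational case.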
Third, the noise terms are generated by first sampling from
where
and
rhoNoise
. To steer the signal-to-noise ratio,
we set the variance of the noise terms of all nodes except source nodes
to snrPar
where snrPar
. Stepping through the
variables in causal order, for each variable
that has parents, we
uniformly rescale the edge weights
for
in the structural equation of variable
such that the variance of
the sum
is approximately
1 in the observational setting. In other words, the parameter
snrPar
steers what proportion of the variance stems from the signal given by
and what proportion stems from the
noise
. The signal-to-noise ratio can then be computed as SNR = (1-snrPar)/snrPar.
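As a quick check of this relation, a small base R helper (the name `snr_from_par` is hypothetical and not part of the package) maps snrPar to the implied signal-to-noise ratio:

```r
# Hypothetical helper: signal-to-noise ratio implied by snrPar.
snr_from_par <- function(snrPar) {
  stopifnot(all(snrPar > 0), all(snrPar <= 1))
  (1 - snrPar) / snrPar
}

# Smaller values of snrPar correspond to a larger share of signal variance.
snr_from_par(c(0.1, 0.5, 1))
```

For example, snrPar = 0.5 splits the variance evenly between signal and noise (SNR of 1), while snrPar = 1 leaves no signal at all (SNR of 0).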
Fourth, a cycle is added to the causal graph if cyclic = TRUE. If the
causal graph shall contain a cycle, we sample two nodes and
such that adding an edge between them creates a cycle in the causal graph.
We then compute the largest possible coefficient for this edge such that the
cycle product is smaller than 1. Subsequently, we sample the sign of the
coefficient and set the magnitude by scaling the largest possible coefficient
by
strengthCycle
where strengthCycle
.
Fifth, we rescale the noise variables to obtain a -distribution with
dfNoise
degrees of freedom. is then generated as
in the observational case; under a shift
interventions
can be generated as
where the coordinates of
are only non-zero for the variables
that are intervened on. Under a do-intervention on
,
for
are set to 0 to yield
and
is set to
to yield
. We then obtain
as
.
Lastly, if modelMis = TRUE a model misspecification is added to the data by marginally transforming all variables as tanh(modelMisPar*x)/modelMisPar.
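The effect of this marginal transformation can be illustrated with a short base R sketch. The function name `misspecify` is hypothetical; only the formula is taken from the description above.

```r
# Hypothetical helper applying the marginal transformation from the text.
misspecify <- function(x, modelMisPar = 1) {
  tanh(modelMisPar * x) / modelMisPar
}

x <- c(-3, -0.1, 0, 0.1, 3)
round(misspecify(x, modelMisPar = 1), 3)    # large values are compressed
round(misspecify(x, modelMisPar = 0.1), 3)  # close to the identity
```

Values near zero are left almost unchanged, while large values are squashed towards plus or minus 1/modelMisPar, so a larger modelMisPar gives a stronger misspecification.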
A list with the following elements:
X
(nxp)-dimensional data matrix
environment
Indicator of the experiment or the intervention type an observation belongs to. A numeric vector of length n.
interventions
A list of length n. Indicates location of interventions for each data point.
whereInt
A list of length numberInt. Indicates location of interventions in each setting.
noise
configs
A list with the generated adjacency matrix (trueA) as well as all input arguments.