Title: | Approximate Bayesian Computation via Random Forests |
---|---|
Description: | Performs Approximate Bayesian Computation (ABC) model choice and parameter inference via random forests. Pudlo P., Marin J.-M., Estoup A., Cornuet J.-M., Gautier M. and Robert C. P. (2016) <doi:10.1093/bioinformatics/btv684>. Estoup A., Raynal L., Verdu P. and Marin J.-M. <http://journal-sfds.fr/article/view/709>. Raynal L., Marin J.-M., Pudlo P., Ribatet M., Robert C. P. and Estoup A. (2019) <doi:10.1093/bioinformatics/bty867>. |
Authors: | Jean-Michel Marin [aut, cre], Louis Raynal [aut], Pierre Pudlo [aut], Christian P. Robert [ctb], Arnaud Estoup [ctb] |
Maintainer: | Jean-Michel Marin <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.9 |
Built: | 2025-01-08 04:58:27 UTC |
Source: | https://github.com/jmm34/abcrf |
abcrf
constructs a random forest from a reference table towards performing
an ABC model choice. Basically, the reference table (i.e. the dataset that will
be treated with the present package) includes a column with the index
of the models to be compared and additional columns corresponding
to the values of the simulated summary statistics.
## S3 method for class 'formula' abcrf(formula, data, group=list(), lda=TRUE, ntree=500, sampsize=min(1e5, nrow(data)), paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, ...)
## S3 method for class 'formula' abcrf(formula, data, group=list(), lda=TRUE, ntree=500, sampsize=min(1e5, nrow(data)), paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, ...)
formula |
a formula: left of ~, variable representing the model index; right of ~, summary statistics of the reference table. |
data |
a data frame containing the reference table. |
group |
a list containing groups (at least 2) of model(s) on which the model choice will be performed. This is not necessarily a partition, one or more models can be excluded from the elements of the list and by default no grouping is done. |
lda |
should LDA scores be added to the list of summary statistics? |
ntree |
number of trees to grow in the forest, by default 500 trees. |
sampsize |
size of the sample from the reference table to grow a tree of the classification forest, by default the minimum between the number of elements of the reference table and 100,000. |
paral |
a boolean that indicates if the calculations of the classification random forest (forest used to assign a model to the observed dataset) should be parallelized. |
ncores |
the number of CPU cores to use. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
... |
additional arguments to be passed on to |
An object of class abcrf
, which is a list with the
following components:
call |
the original call to |
lda |
a boolean indicating if LDA scores have been added to the list of summary statistics, |
formula |
the formula used to construct the classification random forest, |
group |
a list contining the groups of model(s) used. This list is empty if no grouping has been performed, |
model.rf |
an object of class |
model.lda |
an object of class |
prior.err |
prior error rates of model selection on the reference table, estimated with the "out-of-bag" error of the forest. |
Pudlo P., Marin J.-M., Estoup A., Cornuet J.-M., Gautier M. and Robert, C. P. (2016) Reliable ABC model choice via random forests Bioinformatics doi:10.1093/bioinformatics/btv684
Estoup A., Raynal L., Verdu P. and Marin J.-M. (2018) Model choice using Approximate Bayesian Computation and Random Forests: analyses based on model grouping to make inferences about the genetic history of Pygmy human populations Jounal de la Société Française de Statistique http://journal-sfds.fr/article/view/709
plot.abcrf
,
predict.abcrf
,
err.abcrf
,
ranger
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf1 <- abcrf(modindex~., data = data1, ntree=100) model.rf1 model.rf2 <- abcrf(modindex~., data = data1, group = list(c("1","2"),"3"), ntree=100) model.rf2
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf1 <- abcrf(modindex~., data = data1, ntree=100) model.rf1 model.rf2 <- abcrf(modindex~., data = data1, group = list(c("1","2"),"3"), ntree=100) model.rf2
Using two reg-ABC-RF objects constructed on the same reference table for two different response variables, this function predicts the posterior covariance between those two response variables, given a new dataset of summaries.
## S3 method for class 'regAbcrf' covRegAbcrf(regForest1, regForest2, obs, training1, training2, ntree=500, mtry=max(floor((dim(training1)[2]-1)/3), 1), sampsize=min(1e5, dim(training1)[1]), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, paral.predict = FALSE, ncores.predict = if(paral.predict) max(detectCores()-1,1) else 1, ... )
## S3 method for class 'regAbcrf' covRegAbcrf(regForest1, regForest2, obs, training1, training2, ntree=500, mtry=max(floor((dim(training1)[2]-1)/3), 1), sampsize=min(1e5, dim(training1)[1]), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, paral.predict = FALSE, ncores.predict = if(paral.predict) max(detectCores()-1,1) else 1, ... )
regForest1 , regForest2
|
|
obs |
a data frame containing the summary statistics of the observed data sets. |
training1 , training2
|
data frames containing the reference table respectively used to train the |
ntree |
number of trees to grow in the forest, by default equal to 500 trees. |
mtry |
Number of variables to possibly split at in each node for the regression random forest. Default is the minimum between 1 and the number of variables divided by 3. |
sampsize |
size of the sample from the reference table used to grow a tree of the regression forest, by default the minimum between the number of elements of the reference table and 100,000. |
paral |
a boolean that indicates whether or not the calculations of the regression random forest (forest used to predict a response from the observed dataset) should be parallelized. |
ncores |
the number of CPU cores to use. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
paral.predict |
a boolean that indicates if random forests predictions should be parallelized. |
ncores.predict |
the number of CPU cores to use for the regression random forest predictions. If paral.predict=TRUE, it is used the number of CPU cores minus 1. If ncores.predict is not specified and |
... |
additional arguments to be passed on to |
covRegAbcrf
returns predicted posterior covariances between response variables of two reg-ABC-RF objects, for a new data set.
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
regAbcrf
,
predict.regAbcrf
,
err.regAbcrf
,
plot.regAbcrf
,
ranger
,
densityPlot
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) N1 <- snp$param$N1[modindex == "3"] N1 <- N1[1:500] data3 <- data.frame(N1, sumsta) model.rf.N1 <- regAbcrf(N1~., data3, ntree=100) data(snp.obs) covRegAbcrf(model.rf.r, model.rf.N1, snp.obs, data2, data3, ntree=100)
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) N1 <- snp$param$N1[modindex == "3"] N1 <- N1[1:500] data3 <- data.frame(N1, sumsta) model.rf.N1 <- regAbcrf(N1~., data3, ntree=100) data(snp.obs) covRegAbcrf(model.rf.r, model.rf.N1, snp.obs, data2, data3, ntree=100)
Given a reg-ABC-RF object and a new value of the summary statistics,
densityPlot
gives the corresponding posterior density plot of the parameter, as well as the prior (in grey).
## S3 method for class 'regAbcrf' densityPlot(object, obs, training, add=TRUE, main="Posterior density", log="", xlim=NULL, ylim=NULL, xlab=NULL, ylab=NULL, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, ...)
## S3 method for class 'regAbcrf' densityPlot(object, obs, training, add=TRUE, main="Posterior density", log="", xlim=NULL, ylim=NULL, xlab=NULL, ylab=NULL, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, ...)
object |
a |
obs |
a data frame containing the summary statistics of the observed data sets. |
training |
the data frame containing the reference table used to train the |
add |
a boolean that indicates if the posterior distributions should be ploted on the same graph or not, when more than one observed summary statistics is given. |
main |
main title to be used for the posterior density plot. |
log |
a character string which contains "x" if the x axis is to be logarithmic, "y" if the y axis is to be logarithmic and "xy" or "yx" if both axes are to be logarithmic. The default value "" implies no logarithmic transfomation. |
xlim |
range of the abscissa. |
ylim |
range of the ordinate. |
xlab |
label of the abscissa. |
ylab |
label of the ordinate. |
paral |
a boolean that indicates if random forests predictions should be parallelized. |
ncores |
the number of CPU cores to use for the regression random forest predictions. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
... |
additional arguments to be passed on to |
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
regAbcrf
,
predict.regAbcrf
,
err.regAbcrf
,
covRegAbcrf
,
ranger
,
plot.regAbcrf
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) data(snp.obs) densityPlot(model.rf.r, snp.obs, data2, ylab="density", main = "Posterior density of r")
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) data(snp.obs) densityPlot(model.rf.r, snp.obs, data2, ylab="density", main = "Posterior density of r")
err.abcrf
returns out-of-bag errors and plot them.
err.abcrf(object, training, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1)
err.abcrf(object, training, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1)
object |
an |
training |
the data frame containing the reference table used to train the |
paral |
a boolean that indicates if random forests predictions should be parallelized. |
ncores |
the number of CPU cores to use for the random forest predictions. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
A matrix with 2 columns: the number of trees and the out-of-bag errors. Errors are computed from 40 trees to the total number.
Pudlo P., Marin J.-M., Estoup A., Cornuet J.-M., Gautier M. and Robert, C. P. (2016) Reliable ABC model choice via random forests Bioinformatics doi:10.1093/bioinformatics/btv684
abcrf
,
predict.abcrf
,
plot.abcrf
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) err.rf <- err.abcrf(model.rf, data1)
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) err.rf <- err.abcrf(model.rf, data1)
err.regAbcrf
returns out-of-bag mean squared errors and plot them.
err.regAbcrf(object, training, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, what="mean")
err.regAbcrf(object, training, paral=FALSE, ncores= if(paral) max(detectCores()-1,1) else 1, what="mean")
object |
a |
training |
the data frame containing the reference table used to train the |
paral |
a boolean that indicates if random forests predictions should be parallelized. |
ncores |
the number of CPU cores to use for the random forest predictions. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
what |
a string caracter "mean" or "median" indicating if the predictions are computed with mean or median of the response variable. |
A matrix with 2 columns: the number of trees and the out-of-bag mean squared errors. NAs might be returned if the number of trees is too low. Errors are computed from 40 trees to the total number.
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
regAbcrf
,
predict.regAbcrf
,
plot.regAbcrf
,
densityPlot
,
covRegAbcrf
,
ranger
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) err.regAbcrf(model.rf.r, data2)
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) err.regAbcrf(model.rf.r, data2)
plot.abcrf
provides both a variable importance plot of a model choice ABC-RF object
and the projection of the reference table on the LDA axes.
## S3 method for class 'abcrf' plot(x, training, obs=NULL, n.var=20, pdf=FALSE, xlim=NULL, ...)
## S3 method for class 'abcrf' plot(x, training, obs=NULL, n.var=20, pdf=FALSE, xlim=NULL, ...)
x |
an |
training |
the data frame containing the reference table used to train the |
obs |
a vector containing the summary statistics of an observed dataset that will be added to the graph of the projected reference table (black star or vertical line). |
n.var |
number of variables in the variable importance representation. |
pdf |
a boolean that indicates if a pdf version of the graph(s) should be saved in the current directory. |
xlim |
range of the abscissa for the variable importance plot. |
... |
not used. |
The graph of the reference table projected
on the LD axes is shown only if LD axes has
been added to the set of summary statistics
in the call of abcrf
.
Pudlo P., Marin J.-M., Estoup A., Cornuet J.-M., Gautier M. and Robert, C. P. (2016) Reliable ABC model choice via random forests Bioinformatics doi:10.1093/bioinformatics/btv684
abcrf
,
predict.abcrf
,
err.abcrf
,
variableImpPlot
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) plot(model.rf, data1) data(snp.obs) plot(model.rf, data1, obs=snp.obs[1,])
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) plot(model.rf, data1) data(snp.obs) plot(model.rf, data1, obs=snp.obs[1,])
plot.regAbcrf
provides a variable importance plot used to construct the reg-ABC-RF object, as measured by ranger
with the argument importance='impurity'.
## S3 method for class 'regAbcrf' plot(x, n.var=min(30, length(x$model.rf$variable.importance)), xlim=NULL, main=NULL, ...)
## S3 method for class 'regAbcrf' plot(x, n.var=min(30, length(x$model.rf$variable.importance)), xlim=NULL, main=NULL, ...)
x |
a |
n.var |
number of variables in the variable importance representation. The default value is equal to the minimum between 30 and the number of summary statistics. |
xlim |
range of the abscissa for the variable importance plot. |
main |
an overall title for the variable importance plot. |
... |
not used. |
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:bioinformatics/bty867
regAbcrf
,
predict.regAbcrf
,
err.regAbcrf
,
covRegAbcrf
,
ranger
,
densityPlot
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) plot(model.rf.r)
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) plot(model.rf.r)
Based on an ABC-RF object this function predicts the best model for new data and evaluate the posterior probability of the MAP.
## S3 method for class 'abcrf' predict(object, obs, training, ntree = 1000, sampsize = min(1e5, object$model.rf$num.samples ), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, paral.predict = FALSE, ncores.predict = if(paral.predict) max(detectCores()-1,1) else 1, ...)
## S3 method for class 'abcrf' predict(object, obs, training, ntree = 1000, sampsize = min(1e5, object$model.rf$num.samples ), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, paral.predict = FALSE, ncores.predict = if(paral.predict) max(detectCores()-1,1) else 1, ...)
object |
an |
obs |
a data frame containing the summary statistics of the observed data sets. |
training |
the data frame containing the reference table used to train the |
ntree |
number of trees to grow in the regression forest, by default 1,000 trees. |
sampsize |
size of the sample from the reference table used to grow a tree of the forest, by default the minimum between the number of elements of the reference table and 100,000. |
paral |
a boolean that indicates if the calculations of the regression random forest (forest that returns the posterior probability of the selected model) should be parallelized. |
ncores |
the number of CPU cores to use for the regression random forest construction. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
paral.predict |
a boolean that indicates if random forests predictions should be parallelized. |
ncores.predict |
the number of CPU cores to use for random forest predictions (classification and regression). If paral.predict=TRUE, it is used the number of CPU cores minus 1. If ncores.predict is not specified and |
... |
additional arguments to be passed on to |
An object of class abcrfpredict
, which is a list with the
following components:
allocation |
indices of the selected models for each observed data set, |
vote |
votes for each observed dataset, |
post.prob |
ABC-RF approximations of the posterior probability of the selected model for each observed dataset. |
Pudlo P., Marin J.-M., Estoup A., Cornuet J.-M., Gautier M. and Robert, C. P. (2016) Reliable ABC model choice via random forests Bioinformatics doi:10.1093/bioinformatics/btv684
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) data(snp.obs) predict(model.rf, snp.obs, data1, ntree=100)
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) data(snp.obs) predict(model.rf, snp.obs, data1, ntree=100)
Based on a reg-ABC-RF object this function predicts the posterior expectation, median, variance, quantiles for the corresponding parameter given new dataset. Somes posterior errors can be computed at an higher computational price.
## S3 method for class 'regAbcrf' predict(object, obs, training, quantiles=c(0.025,0.975), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, rf.weights = FALSE, post.err.med = FALSE, ...)
## S3 method for class 'regAbcrf' predict(object, obs, training, quantiles=c(0.025,0.975), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1, rf.weights = FALSE, post.err.med = FALSE, ...)
object |
a |
obs |
a data frame containing the summary statistics of the observed data sets. |
training |
the data frame containing the reference table used to train the |
quantiles |
numeric vector of probabilities with values in [0,1]. The default value is equal to |
paral |
a boolean that indicates if random forests predictions should be parallelized. |
ncores |
the number of CPU cores to use for the regression random forest predictions. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
rf.weights |
a boolean that indicates if the random forest weights used to predict quantities of interest should we returned. The default value is FALSE. |
post.err.med |
a boolean that indicates if posterior errors based on posterior medians should be computed. The default value is FALSE. If computed, this function might take a much more time. |
... |
optional arguments to be passed on to the function |
An object of class regAbcrfpredict
, which is a list with the following components:
expectation |
predicted posterior expectation for each oberved data set, |
med |
predicted posterior median for each oberved data set, |
variance |
predicted posterior variance for each observed data set, computed by reusing weights, this quantity is also the posterior mean squared error, |
variance.cdf |
predicted posterior variance for each observed data set, computed by approximation of the cumulative distribution function, |
quantiles |
predicted posterior quantiles for each observed data set, |
weights |
a matrix composed of the weights used to predict quantities of interest. Returned if |
post.NMAE.mean |
posterior normalized mean absolute error obtained using the out-of-bag posterior expectation (mean) and previously computed random forest weights, for each observed data set, |
post.MSE.med |
posterior mean squared error obtained using the out-of-bag posterior median and previously computed random forest weights, for each observed data set, |
post.NMAE.med |
posterior normalized mean absolute error obtained using the out-of-bag posterior expectation (mean) and previously computed random forest weights, for each observed data set, |
prior.MSE |
prior mean squared error computed with training out-of-bag prediction based on mean of response variable, |
prior.NMAE |
prior normalized mean absolute error computed with training out-of-bag predictions based on mean of response variable, |
prior.MSE.med |
prior mean squared error computed with training out-of-bag predictions based on median of response variable, |
prior.NMAE.med |
prior normalized mean absolute error with training out-of-bag predictions based on median of response variable, |
prior.coverage |
prior credible inteval coverage computed for training instances, if only two quantiles are of interest, NULL otherwise. |
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
regAbcrf
,
predictOOB
,
plot.regAbcrf
,
err.regAbcrf
,
covRegAbcrf
,
ranger
,
densityPlot
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) data(snp.obs) predict(model.rf.r, snp.obs, data2)
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) data(snp.obs) predict(model.rf.r, snp.obs, data2)
Based on a reg-ABC-RF object this function predicts the out-of-bag posterior expectation, median, variance, quantiles, mean squared errors, normalized mean absolute errors, credible interval and coverage, for the corresponding parameter using the out-of-bag observations of the training data set.
Mean squared errors and normalized mean absolute errors are computed both with mean and median of the response variable.
Memory allocation issues might be encountered when the size of the training data set is large.
## S3 method for class 'regAbcrf' predictOOB(object, training, quantiles=c(0.025,0.975), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1,...)
## S3 method for class 'regAbcrf' predictOOB(object, training, quantiles=c(0.025,0.975), paral = FALSE, ncores = if(paral) max(detectCores()-1,1) else 1,...)
object |
a |
training |
the data frame containing the reference table used to train the |
quantiles |
numeric vector of probabilities with values in [0,1]. The default value is equal to |
paral |
a boolean that indicates if training data predictions should be parallelized or not. |
ncores |
the number of CPU cores to use for the regression random forest predictions. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
... |
optional arguments to be passed on to the function |
An object of class regAbcrfOOBpredict
, which is a list with the following components:
expectation |
predicted posterior expectation for each oberved data set, |
med |
predicted posterior median for each oberved data set, |
variance |
predicted posterior variance for each observed data set, computed by reusing weights, |
variance.cdf |
predicted posterior variance for each observed data set, computed by approximation of the cumulative distribution function, |
quantiles |
predicted posterior quantiles for each observed data set, |
MSE |
mean squared error computed with prediction based on mean of response variable, |
NMAE |
normalized mean absolute error computed with predictions based on mean of response variable, |
MSE.med |
mean squared error computed with predictions based on median of response variable, |
NMAE.med |
normalized mean absolute error with predictions based on median of response variable, |
coverage |
credible inteval coverage if only two quantiles are of interest, NULL otherwise. |
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
regAbcrf
,
predict.regAbcrf
,
plot.regAbcrf
,
err.regAbcrf
,
covRegAbcrf
,
ranger
,
densityPlot
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) res <- predictOOB(model.rf.r, data2)
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) res <- predictOOB(model.rf.r, data2)
readRefTable
reads a reference table simulated from DIYABC thanks to a .bin and a .txt file, respectively containing the reference table and its header.
readRefTable(filename = "reftable.bin", header = "header.txt", N = 0)
readRefTable(filename = "reftable.bin", header = "header.txt", N = 0)
filename |
a .bin file from DIYABC containing the reference table. |
header |
a .txt file containing the header of the reference table in |
N |
an integer indicating the number of observations to extract from the reference table. The default is 0 indicating that the whole reference table is used. Warning: if N is specified, nrecscen returns is obsolete. |
A list with the following components:
nrec |
number of individuals of the reference table, |
nscen |
number of scenarios in the reference table, |
nrecscen |
number of individuals by scenario, |
nparam |
number of parameters by scenario, |
scenarios |
a vector of factor containing the scenario indices, |
params |
a matrix with the parameters, |
stats |
a matrix with the summary statistics. |
Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.-M. and Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data Bioinformatics doi:10.1093/bioinformatics/btn514
regAbcrf
constructs a regression random forest from a reference table towards predicting posterior expectations, variances and quantiles of a parameter.
## S3 method for class 'formula' regAbcrf(formula, data, ntree=500, mtry=max(floor((dim(data)[2]-1)/3), 1), sampsize=min(1e5, nrow(data)), paral=FALSE, ncores=if(paral) max(detectCores()-1,1) else 1, ...)
## S3 method for class 'formula' regAbcrf(formula, data, ntree=500, mtry=max(floor((dim(data)[2]-1)/3), 1), sampsize=min(1e5, nrow(data)), paral=FALSE, ncores=if(paral) max(detectCores()-1,1) else 1, ...)
formula |
a formula: left of ~, variable representing the response variable; right of ~, summary statistics of the reference table. |
data |
a data frame containing the reference table, composed of response variable (parameter) and summary statistics. |
ntree |
number of trees to grow in the forest, by default 500 trees. |
mtry |
Number of variables to possibly split at in each node. Default is the minimum between 1 and the number of variables divided by 3. |
sampsize |
size of the sample from the reference table used to grow a tree of the regression forest, by default the minimum between the number of elements of the reference table and 100,000. |
paral |
a boolean that indicates if the calculations of the regression random forest should be parallelized. |
ncores |
the number of CPU cores to use. If paral=TRUE, it is used the number of CPU cores minus 1. If ncores is not specified and |
... |
additional arguments to be passed on to |
An object of class regAbcrf
, which is a list with the
following components:
call |
the original call to |
formula |
the formula used to construct the regression random forest, |
model.rf |
an object of class |
Raynal L., Marin J.-M. Pudlo P., Ribatet M., Robert C. P. and Estoup, A. (2019) ABC random forests for Bayesian parameter inference Bioinformatics doi:10.1093/bioinformatics/bty867
plot.regAbcrf
,
err.regAbcrf
,
predict.regAbcrf
,
covRegAbcrf
,
ranger
,
densityPlot
,
predictOOB
.
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) model.rf.r
data(snp) modindex <- snp$modindex sumsta <- snp$sumsta[modindex == "3",] r <- snp$param$r[modindex == "3"] r <- r[1:500] sumsta <- sumsta[1:500,] data2 <- data.frame(r, sumsta) model.rf.r <- regAbcrf(r~., data2, ntree=100) model.rf.r
The simulated example of population genetics with SNP loci used in Pudlo et al.
(2016): snp
contains the reference table on which to perform ABC model choice, it also contains the simulated parameters to perform regression random forest.
snp.obs
contains two pseudo-observed data sets. The first one
(favorable
) should be easily allocated to a model, while that is not the case for the second one (unfavorable
).
data(snp) data(snp.obs)
data(snp) data(snp.obs)
snp
is a list containing an ABC reference table of 10,000
simulations from a Bayesian prior predictive model (see Pudlo et al., 2016, for a description of the model choice issue). The first element, named modindex
is a factor
containing the model indices, the second element, param
, is a data frame with seven simulated parameters. The last element of this list, named sumsta
, contains the reference table on which to perform ABC model choice and parameter estimation.
snp.obs
is a data frame containing the summary statistics of two pseudo-observed data sets.
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M. and Robert, C.P. (2016) Reliable ABC model choice via random forests Bioinformatics doi:10.1093/bioinformatics/btv684
data(snp) data(snp.obs)
data(snp) data(snp.obs)
variableImpPlot
provides a dotchart of variable importance as measured by ranger
with the argument importance='impurity'.
variableImpPlot(object, n.var=min(30, length(object$model.rf$variable.importance)), xlim=NULL, main=NULL)
variableImpPlot(object, n.var=min(30, length(object$model.rf$variable.importance)), xlim=NULL, main=NULL)
object |
an |
n.var |
number of variables in the variable importance representation. |
xlim |
range of the abscissa. |
main |
an overall title for the variable importance plot. |
Invisibly, the importance of the variables that were plotted.
abcrf
,
plot.abcrf
,
plot.regAbcrf
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) variableImpPlot(model.rf)
data(snp) modindex <- snp$modindex[1:500] sumsta <- snp$sumsta[1:500,] data1 <- data.frame(modindex, sumsta) model.rf <- abcrf(modindex~., data1, ntree=100) variableImpPlot(model.rf)