This function builds and trains a DGP emulator.
Usage
dgp(
X,
Y,
struc = NULL,
depth = 2,
node = ncol(X),
name = "sexp",
lengthscale = 1,
bounds = NULL,
prior = "ga",
share = TRUE,
nugget_est = FALSE,
nugget = ifelse(all(nugget_est), 0.01, 1e-06),
scale_est = TRUE,
scale = 1,
connect = TRUE,
likelihood = NULL,
training = TRUE,
verb = TRUE,
check_rep = TRUE,
rff = FALSE,
M = NULL,
N = 500,
cores = 1,
blocked_gibbs = TRUE,
ess_burn = 10,
burnin = NULL,
B = 10,
internal_input_idx = NULL,
linked_idx = NULL,
id = NULL
)
Arguments
- X
a matrix where each row is an input training data point and each column is an input dimension.
- Y
a matrix containing observed training output data. Its rows are output data points and its columns are output dimensions. When likelihood (see below) is not NULL, Y must be a matrix with only one column.
- struc
a list that specifies a user-defined DGP structure. It should contain L (the number of DGP layers) sub-lists, each of which represents a layer and contains a number of GP nodes (defined by kernel()) in the corresponding layer. The final layer of the DGP structure (i.e., the final sub-list in struc) can be a likelihood layer that contains a likelihood function (e.g., Poisson()). When struc = NULL, the DGP structure is generated automatically and can be checked by applying summary() to the output from dgp() with training = FALSE. If this argument is used (i.e., the user provides a customized DGP structure), the arguments depth, node, name, lengthscale, bounds, prior, share, nugget_est, nugget, scale_est, scale, connect, likelihood, and internal_input_idx will NOT be used. Defaults to NULL.
- depth
number of layers (including the likelihood layer) for a DGP structure. depth must be at least 2. Defaults to 2. This argument is only used when struc = NULL.
- node
number of GP nodes in each layer (except for the final layer or the layer feeding the likelihood node) of the DGP. Defaults to ncol(X). This argument is only used when struc = NULL.
- name
a character string or a vector of character strings that indicates the kernel functions (either "sexp" for the squared exponential kernel or "matern2.5" for the Matérn-2.5 kernel) used in the DGP emulator:
1. if a single character string is supplied, the corresponding kernel function will be used for all GP nodes in the DGP hierarchy.
2. if a vector of character strings is supplied, each element of the vector specifies the kernel function that will be applied to all GP nodes in the corresponding layer.
Defaults to "sexp". This argument is only used when struc = NULL.
- lengthscale
initial lengthscales for GP nodes in the DGP emulator. It can be a single numeric value or a vector:
if it is a single numeric value, the value will be applied as the initial lengthscales for all GP nodes in the DGP hierarchy.
if it is a vector, each element of the vector specifies the initial lengthscales that will be applied to all GP nodes in the corresponding layer. The vector should have a length of depth if likelihood = NULL or a length of depth - 1 if likelihood is not NULL.
Defaults to a numeric value of 1.0. This argument is only used when struc = NULL.
- bounds
the lower and upper bounds of lengthscales in GP nodes. It can be a vector or a matrix:
if it is a vector, the lower bound (the first element of the vector) and upper bound (the second element of the vector) will be applied to lengthscales for all GP nodes in the DGP hierarchy.
if it is a matrix, each row of the matrix specifies the lower and upper bounds of lengthscales for all GP nodes in the corresponding layer. The number of rows should equal depth if likelihood = NULL or depth - 1 if likelihood is not NULL.
Defaults to NULL, in which case no bounds are specified for the lengthscales. This argument is only used when struc = NULL.
- prior
prior to be used for maximum a posteriori (MAP) estimation of the lengthscales and nuggets of all GP nodes in the DGP hierarchy:
gamma prior ("ga"),
inverse gamma prior ("inv_ga"), or
jointly robust prior ("ref").
Defaults to "ga". This argument is only used when struc = NULL.
- share
a bool indicating if all input dimensions of a GP node share a common lengthscale. Defaults to TRUE. This argument is only used when struc = NULL.
- nugget_est
a bool or a bool vector that indicates if the nuggets of GP nodes (if any) in the final layer are to be estimated. If a single bool is provided, it will be applied to all GP nodes (if any) in the final layer. If a bool vector (which must have a length of ncol(Y)) is provided, each bool element in the vector will be applied to the corresponding GP node (if any) in the final layer. The value of a bool has the following effects:
FALSE: the nugget of the corresponding GP in the final layer is fixed to the corresponding value defined in nugget (see below).
TRUE: the nugget of the corresponding GP in the final layer will be estimated, with the initial value given by the corresponding value in nugget (see below).
Defaults to FALSE. This argument is only used when struc = NULL.
- nugget
the initial nugget value(s) of GP nodes (if any) in each layer:
if it is a single numeric value, the value will be applied as the initial nugget for all GP nodes in the DGP hierarchy.
if it is a vector, each element of the vector specifies the initial nugget that will be applied to all GP nodes in the corresponding layer. The vector should have a length of depth if likelihood = NULL or a length of depth - 1 if likelihood is not NULL.
Set nugget to a small value and the bools in nugget_est to FALSE for deterministic emulations where the emulator interpolates the training data points. Set nugget to a reasonably larger value and the bools in nugget_est to TRUE for stochastic emulations where the computer model outputs are assumed to follow a homogeneous Gaussian distribution. Defaults to 1e-6 if nugget_est = FALSE and 0.01 if nugget_est = TRUE. This argument is only used when struc = NULL.
- scale_est
a bool or a bool vector that indicates if the variances of GP nodes (if any) in the final layer are to be estimated. If a single bool is provided, it will be applied to all GP nodes (if any) in the final layer. If a bool vector (which must have a length of ncol(Y)) is provided, each bool element in the vector will be applied to the corresponding GP node (if any) in the final layer. The value of a bool has the following effects:
FALSE: the variance of the corresponding GP in the final layer is fixed to the corresponding value defined in scale (see below).
TRUE: the variance of the corresponding GP in the final layer will be estimated, with the initial value given by the corresponding value in scale (see below).
Defaults to TRUE. This argument is only used when struc = NULL.
- scale
the initial variance value(s) of GP nodes (if any) in the final layer. If it is a single numeric value, it will be applied to all GP nodes (if any) in the final layer. If it is a vector (which must have a length of ncol(Y)), each numeric in the vector will be applied to the corresponding GP node (if any) in the final layer. Defaults to 1. This argument is only used when struc = NULL.
- connect
a bool indicating whether to implement global input connection to the DGP structure. Setting it to FALSE may produce a better emulator in some cases at the cost of slower training. Defaults to TRUE. This argument is only used when struc = NULL.
- likelihood
the likelihood type of a DGP emulator:
NULL: no likelihood layer is included in the emulator.
"Hetero": a heteroskedastic Gaussian likelihood layer is added for stochastic emulation where the computer model outputs are assumed to follow a heteroskedastic Gaussian distribution (i.e., the computer model outputs have varying noise).
"Poisson": a Poisson likelihood layer is added for stochastic emulation where the computer model outputs are assumed to follow a Poisson distribution.
"NegBin": a negative binomial likelihood layer is added for stochastic emulation where the computer model outputs are assumed to follow a negative binomial distribution.
When likelihood is not NULL, the value of nugget_est is overridden by FALSE. Defaults to NULL. This argument is only used when struc = NULL.
- training
a bool indicating if the initialized DGP emulator will be trained. When set to FALSE, dgp() returns an untrained DGP emulator, to which one can apply summary() to inspect its specifications (especially when a customized struc is provided) or apply predict() to check its emulation performance before training. Defaults to TRUE.
- verb
a bool indicating if the trace information on DGP emulator construction and training will be printed during the function execution. Defaults to TRUE.
- check_rep
a bool indicating whether to check for repetitions in the dataset, i.e., whether one input position has multiple outputs. Defaults to TRUE.
- rff
a bool indicating whether to use random Fourier features to approximate the correlation matrices in training. Turning on this option could help accelerate the training when the training data set is relatively large, but may reduce the quality of the resulting emulator. Defaults to FALSE.
- M
the number of features to be used by the random Fourier approximation. It is only used when rff is set to TRUE. Defaults to NULL. If it is NULL, M is automatically set to max(100, ceiling(sqrt(nrow(X))*log(nrow(X)))).
- N
number of iterations for the training. Defaults to 500. This argument is only used when training = TRUE.
- cores
the number of cores/workers to be used to optimize GP components (in the same layer) at each M-step of the training. If set to NULL, the number of cores is set to (max physical cores available - 1). Only use multiple cores when there is a large number of GP components in different layers and optimization of GP components is computationally expensive. Defaults to 1.
- blocked_gibbs
a bool indicating if the latent variables are imputed layer-wise using ESS-within-Blocked-Gibbs. ESS-within-Blocked-Gibbs is expected to be faster and more efficient than ESS-within-Gibbs, which imputes latent variables node-wise, because it reduces the number of components to be sampled during Gibbs sampling, especially when there is a large number of GP nodes in layers due to higher input dimensions. Defaults to TRUE.
- ess_burn
number of burn-in steps for the ESS-within-Gibbs at each I-step of the training. Defaults to 10. This argument is only used when training = TRUE.
- burnin
the number of training iterations to be discarded for point estimates of model parameters. Must be smaller than the number of training iterations N. If this is not specified, only the last 25% of iterations are used. Defaults to NULL. This argument is only used when training = TRUE.
- B
the number of imputations used to produce later predictions. Increase the value to account for more imputation uncertainty at the cost of slower predictions. Decrease the value for faster predictions that account for less imputation uncertainty. Defaults to 10.
- internal_input_idx
column indices of X that are generated by the linked emulators in the preceding layers. Set internal_input_idx = NULL if the DGP emulator is in the first layer of a system or all columns in X are generated by the linked emulators in the preceding layers. Defaults to NULL. This argument is only used when struc = NULL.
- linked_idx
either a vector or a list of vectors:
If linked_idx is a vector, it gives the indices of columns in the pooled output matrix (formed by column-combining the outputs of all emulators in the feeding layer) that feed into the DGP emulator. The length of the vector should equal the length of internal_input_idx when internal_input_idx is not NULL. If the DGP emulator is in the first layer of a linked emulator system, the vector gives the column indices of the global input (formed by column-combining all input matrices of emulators in the first layer) that the DGP emulator will use. If the DGP emulator is to be used in both the first and subsequent layers, one should initially set linked_idx to the appropriate values for the situation where the emulator is not in the first layer. Then, use the function set_linked_idx() to reset the linking information when the emulator is in the first layer.
When the DGP emulator is not in the first layer of a linked emulator system, linked_idx can be a list that gives the information on connections between the DGP emulator and emulators in all preceding layers. The length of the list should equal the number of layers before the DGP emulator. Each element of the list is a vector that gives the indices of columns in the pooled output matrix (formed by column-combining the outputs of all emulators) in the corresponding layer that feed into the DGP emulator. If the DGP emulator has no connections to any emulator in a certain layer, set NULL in the corresponding position of the list. The order of input dimensions in X[,internal_input_idx] should be consistent with linked_idx. For example, a DGP emulator in the 4th layer that is fed by output dimensions 2 and 4 of emulators in layer 2 and output dimensions 1 to 3 of emulators in layer 3 should have linked_idx = list( NULL, c(2,4), c(1,2,3) ). In addition, the first and second columns of X[,internal_input_idx] should correspond to output dimensions 2 and 4 from layer 2, and the third to fifth columns of X[,internal_input_idx] should correspond to output dimensions 1 to 3 from layer 3.
Set linked_idx = NULL if the DGP emulator will not be used for linked emulations. However, if this is no longer the case, one can use set_linked_idx() to add linking information to the DGP emulator. Defaults to NULL.
- id
an ID to be assigned to the DGP emulator. If an ID is not provided (i.e., id = NULL), a UUID (Universally Unique Identifier) will be automatically generated and assigned to the emulator. Defaults to NULL.
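Several of the defaulting and sizing rules stated in the argument list above can be mirrored in a few lines of base R. The sketch below is illustrative only: the helper names default_M(), default_nugget(), and layer_vector_length() are hypothetical and are not part of the package. Only the formulas are taken from the descriptions of M, nugget, lengthscale, and linked_idx.

```r
# Hypothetical helpers (NOT part of dgpsi) that mirror rules from the argument list.

# Number of random Fourier features used when rff = TRUE and M = NULL.
default_M <- function(X) {
  n <- nrow(X)
  max(100, ceiling(sqrt(n) * log(n)))
}

# Default nugget, as written in the Usage section.
default_nugget <- function(nugget_est) {
  ifelse(all(nugget_est), 0.01, 1e-06)
}

# Expected length of a layer-wise vector such as lengthscale or nugget.
layer_vector_length <- function(depth, likelihood = NULL) {
  if (is.null(likelihood)) depth else depth - 1
}

X <- matrix(seq(0, 1, length.out = 400), ncol = 1)
default_M(X)                          # 120, since ceiling(20 * log(400)) = 120
default_nugget(FALSE)                 # 1e-06
default_nugget(c(TRUE, TRUE))         # 0.01
layer_vector_length(3)                # 3
layer_vector_length(3, "Poisson")     # 2

# The linked_idx example from the text: a 4th-layer emulator with three
# preceding layers, fed by 2 columns from layer 2 and 3 columns from layer 3.
linked_idx <- list(NULL, c(2, 4), c(1, 2, 3))
length(linked_idx)                    # 3 preceding layers
length(unlist(linked_idx))            # 5 columns expected in X[, internal_input_idx]
```

With fewer than a few hundred training points the floor of 100 features applies, so default_M() only grows beyond 100 for larger data sets.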
Value
An S3 class named dgp that contains the following slots:
- id: a number or character string assigned through the id argument.
- data: a list that contains two elements: X and Y, which are the training input and output data respectively.
- specs: a list that contains
  - L (i.e., the number of layers in the DGP hierarchy) sub-lists named layer1, layer2,..., layerL. Each sub-list contains D (i.e., the number of GP/likelihood nodes in the corresponding layer) sub-lists named node1, node2,..., nodeD. If a sub-list corresponds to a likelihood node, it contains one element called type that gives the name (Hetero, Poisson, or NegBin) of the likelihood node. If a sub-list corresponds to a GP node, it contains four elements:
    - kernel: the type of the kernel function used for the GP node.
    - lengthscales: a vector of lengthscales in the kernel function.
    - scale: the variance value in the kernel function.
    - nugget: the nugget value in the kernel function.
  - internal_dims: the column indices of X that correspond to the linked emulators in the preceding layers of a linked system.
  - external_dims: the column indices of X that correspond to global inputs to the linked system of emulators. It is shown as FALSE if internal_input_idx = NULL.
  - linked_idx: the value passed to the argument linked_idx. It is shown as FALSE if the argument linked_idx is NULL.
  - seed: the random seed generated to produce the imputations. This information is stored for reproducibility when the DGP emulator (that was saved by write() with the light option light = TRUE) is loaded back into R by read().
  - B: the number of imputations used to generate the emulator.
  internal_dims and external_dims are generated only when struc = NULL.
- constructor_obj: a 'python' object that stores the information of the constructed DGP emulator.
- container_obj: a 'python' object that stores the information for the linked emulation.
- emulator_obj: a 'python' object that stores the information for the predictions from the DGP emulator.
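The nested layout of specs described above can be walked with ordinary list operations. The following is a minimal sketch that uses a hand-built list named mock, which only imitates the documented layout. It is not an object returned by dgp(), its values are made up, and the 'python' slots are omitted. The helper node_label() is likewise hypothetical.

```r
# Mock list imitating the documented layout of a dgp object (values are made up).
gp_node <- list(kernel = "sexp", lengthscales = 1, scale = 1, nugget = 1e-06)
mock <- list(
  id = "demo",
  data = list(
    X = matrix(seq(0, 1, length.out = 5), ncol = 1),
    Y = matrix(c(-1, -1, 1, 1, 1), ncol = 1)
  ),
  specs = list(
    layer1 = list(node1 = gp_node),
    layer2 = list(node1 = gp_node),
    layer3 = list(node1 = list(type = "Hetero")),
    seed = 999L,
    B = 10
  )
)

# Layer entries are the elements of specs whose names start with "layer".
layers <- mock$specs[grepl("^layer", names(mock$specs))]

# Hypothetical helper: a likelihood node carries `type`, a GP node carries `kernel`.
node_label <- function(node) {
  if (!is.null(node$type)) node$type else node$kernel
}

# One character vector of node labels per layer.
labels <- lapply(layers, function(layer) vapply(layer, node_label, character(1)))

length(layers)     # 3
labels$layer1      # node1 = "sexp"
labels$layer3      # node1 = "Hetero"
```

Filtering on the "layer" name prefix keeps the other entries of specs (such as seed and B) out of the traversal.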
The returned dgp object can be used by
- predict() for DGP predictions.
- continue() for additional DGP training iterations.
- validate() for LOO and OOS validations.
- plot() for validation plots.
- lgp() for linked (D)GP emulator constructions.
- window() for model parameter trimming.
- summary() to summarize the trained DGP emulator.
- write() to save the DGP emulator to a .pkl file.
- set_imp() to change the number of imputations.
- set_linked_idx() to add the linking information to the DGP emulator for linked emulations.
- design() for sequential designs.
- update() to update the DGP emulator with new inputs and outputs.
- alm(), mice(), pei(), and vigf() to locate next design points.
Details
See further examples and tutorials at https://mingdeyu.github.io/dgpsi-R/ and learn how to customize a DGP structure.
Note
Any R vector detected in X and Y will be treated as a column vector and automatically converted into a single-column R matrix. Thus, if X is a single data point with multiple dimensions, it must be given as a matrix.
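The difference the note describes can be seen with base R alone. In this sketch, matrix(x, ncol = 1) stands in for the automatic conversion. That is an assumption about its effect as described in the note, not the package's actual internal code.

```r
x_vec <- c(0.1, 0.5, 0.9)

# What a plain vector is treated as: three data points with one input dimension.
as_points <- matrix(x_vec, ncol = 1)
dim(as_points)   # 3 1

# A single data point with three input dimensions must be built as a one-row matrix.
one_point <- matrix(x_vec, nrow = 1)
dim(one_point)   # 1 3
```

Passing x_vec directly when a single three-dimensional point is intended would therefore be read as three one-dimensional points.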
Examples
if (FALSE) {
  # load the package and the Python env
  library(dgpsi)

  # construct a step function
  f <- function(x) {
    if (x < 0.5) return(-1)
    if (x >= 0.5) return(1)
  }

  # generate training data
  X <- seq(0, 1, length = 10)
  Y <- sapply(X, f)

  # set a random seed
  set_seed(999)

  # train a DGP emulator
  m <- dgp(X, Y)

  # continue for further training iterations
  m <- continue(m)

  # summarize the emulator
  summary(m)

  # trace plot
  trace_plot(m)

  # trim the traces of model parameters
  m <- window(m, 800)

  # LOO cross validation
  m <- validate(m)
  plot(m)

  # prediction
  test_x <- seq(0, 1, length = 200)
  m <- predict(m, x = test_x)

  # OOS validation
  validate_x <- sample(test_x, 10)
  validate_y <- sapply(validate_x, f)
  plot(m, validate_x, validate_y)

  # write and read the constructed emulator
  write(m, 'step_dgp')
  m <- read('step_dgp')
}