Recent work in survey domain estimation has shown that incorporating a priori assumptions about orderings of population domain means reduces the variance of the estimators. The new R package csurvey allows users to implement order constraints using a design specified in the well-known survey package. The order constraints not only give estimates that satisfy a priori assumptions, but they also provide smaller confidence intervals with good coverage, and allow for design-based estimation in small-sample or empty domains. A test for constant versus increasing domain means is available in the package, with generalizations to other one-sided tests. A cone information criterion may be used to provide evidence that the order constraints are valid. Examples with well-known survey data sets show the utility of the methods. This package is now available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=csurvey.
We assume that a finite population is partitioned into a number of domains, and the goal is to estimate the population domain means for the study variable and to provide valid inference, such as confidence intervals and hypothesis tests.
Order constraints are a common type of a priori knowledge in survey data analysis. For example, we might know that salary increases with job rank within job type and location, or average cholesterol in a population increases with age category, or test scores decrease as poverty increases, or that the amount of pollution decreases with distance from the source. There might be a “block ordering” of salaries by location, where the locations are partitioned into blocks where salaries for all locations in one block (such as major metropolitan areas) are assumed to be higher, on average, than salaries in another block (such as rural areas), without imposing orderings within the blocks.
The order constraints may be imposed on domain mean estimates, or systematic component estimates if the response variable is not Gaussian but belongs to an exponential family of distributions. For example, if we are estimating the proportion of people with diabetes in a population, the study variable might be binary (whether or not the subject has diabetes), and perhaps we want to assume that diabetes prevalence increases with age, within ethnicity and socio-economic categories. The csurvey package extends the survey package in that it allows the user to impose linear inequality constraints on the domain means. If the inequality constraints are valid, this leads to improved inference, as well as the ability to estimate means for empty or small-sample domains without additional assumptions. The csurvey package provides constrained estimates of means or proportions, estimated variances of the estimates, and confidence intervals where the upper and lower bounds of the intervals also satisfy the order constraints.
The csurvey package implements one-sided tests. Suppose the study variable is salary, and interest is in whether the subject’s salary is affected by the level of education of the subject’s father, controlling for the subject’s education level and field. The null hypothesis is that there is no difference in salary by the level of father’s education. The one-sided test with alternative hypothesis that the salary increases with father’s education has higher power than the two-sided test with the larger alternative “salary is not the same across the levels of father’s education.”
Finally, the csurvey package includes graphical functions for visualizing the order-constrained estimator and comparing it to the unconstrained estimator. Confidence bands can also be displayed to illustrate estimation uncertainty.
This package relies on coneproj and survey for its core computations and handling of complex sampling designs. The domain grid is constructed using the data.table package. Additional functionality, such as data filtering, variable transformation, and visualization of model results, leverages functions from igraph, dplyr, and ggplot2. When simulating the mixture covariance matrix and the sampling distribution of the one-sided test statistic, selected functions from MASS and Matrix are employed.
The csurvey package provides estimation and inference for population domain means under order constraints, using recently developed methodology. The earliest of this work considered a complete ordering on a sequence of domain means, applying the pooled adjacent violators algorithm for domain mean estimation and deriving asymptotic confidence intervals that have smaller width, without sacrificing coverage, compared to estimators that ignore the ordering. Later work developed methodology for partial orderings and more general constraints on domains, including block orderings and orderings on domains arranged in grids by multiple variables of interest. These methods were refined with variance estimators based on a mixture of covariance matrices; the mixture covariance estimator improves the coverage of confidence intervals while retaining the smaller interval lengths. Subsequent work showed how to use the order constraints to provide conservative design-based inference in domains with small sample sizes, or even in empty domains, without additional model assumptions; developed a test for constant versus increasing domain means, extended to one-sided tests for more general orderings; and proposed a cone information criterion (CIC) for survey data as a diagnostic to measure possible departures from the assumed ordering. The CIC is similar to Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC), and can be used for model selection: for example, to choose between an unconstrained estimator and an order-constrained estimator, selecting the one with the smaller CIC value.
All of the above papers include extensive simulations showing that the methods substantially improve inference when the order restrictions are valid. These methods are easy for users to implement in csurvey, with commands to impose common orderings.
Statisticians and practitioners working with survey data are familiar with the R package survey for the analysis of complex survey data. Commands in the package allow users to specify the survey design with which their data were collected, and then obtain estimation and inference for population variables of interest. The new csurvey package extends the utility of survey by allowing users to implement order constraints on domains.
The csurvey package relies on functions in the survey package such as svydesign and svrepdesign, which allow users to specify the survey design. These functions produce an object that contains information about the sampling design, allowing the user to specify strata, sampling weights, etc. This object is used in the statistical functions of the csurvey package in the same manner as in the survey package. In addition, the mixture covariance matrix is constructed from an initial estimate of covariance obtained from survey.
Consider a finite population with labels \(U = \{1,\ldots,N\}\), partitioned into domains \(U_d\), \(d=1,\ldots,D\), where \(U_d\) has \(N_d\) elements. For a study variable \(y\), suppose interest is in estimating the population domain means
\[
\bar{y}_{U_d} = \frac{\sum_{k\in U_d} y_k}{N_d}
\]
for each \(d\), and providing inference such as confidence intervals for each \(\bar{y}_{U_d}\). Given a survey design, a sample \(s\subset U\) is chosen, and the unconstrained estimator of the population domain means is a weighted average of the sample observations in each domain. This estimator \(\tilde{\mathbf{y}}_s=(\tilde{y}_{s_1},\ldots, \tilde{y}_{s_D})\) is provided by the survey package.
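For a design with sampling weights \(w_k\) for the sampled units, this unconstrained estimator takes the familiar Hájek form
\[
\tilde{y}_{s_d} = \frac{\sum_{k\in s_d} w_k\, y_k}{\sum_{k\in s_d} w_k},
\]
where \(s_d = s\cap U_d\) denotes the part of the sample falling in domain \(d\); this is the standard form to which the survey package’s domain estimators reduce for basic designs.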
The desired orderings are imposed as linear inequality constraints on the domain means, in the form of an \(m\times D\) constraint matrix \(\mathbf{A}\). The csurvey package finds the constrained estimator \(\tilde{\boldsymbol{\theta}}\) by solving
\[
\min_{\boldsymbol{\theta}}\;(\tilde{\mathbf{y}}_s - \boldsymbol{\theta})^{\top}\mathbf{W}_s(\tilde{\mathbf{y}}_s-\boldsymbol{\theta}) \quad \mbox{such that} \hspace{2mm} \mathbf{A}\boldsymbol{\theta}\geq \mathbf{0},
\]
where the weights \(\mathbf{W}_s\) are provided by the survey design. For a simple example of a constraint matrix, consider five domains with a simple ordering, where we assume \(\bar{y}_{U_1}\leq \bar{y}_{U_2}\leq \bar{y}_{U_3}\leq \bar{y}_{U_4}\leq \bar{y}_{U_5}\); perhaps these are average salaries over five job levels. Then the constraint matrix is
\[ \mathbf{A} = \left(\begin{array}{ccccc} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{array}\right).
\]
For simple orderings on \(D\) domains, the constraint matrix is \((D-1) \times D\). For a block ordering example, suppose we again have five domains, and we know that each of the population means in the first two domains must be smaller than each of the means of the last three domains. The constraint matrix is
\[ \mathbf{A} = \left(\begin{array}{ccccc} -1 & 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 1 & 0 \\ -1 & 0 & 0 & 0 & 1 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 & 1 \end{array}\right).
\]
The number of rows for a block ordering with two blocks is \(D_1 \times D_2\), where \(D_1\) is the number of domains in the first block and \(D_2\) is the number in the second. The package also allows users to specify grids of domains with various order constraints along the dimensions of the grid; the constraint matrices are generated automatically for standard types of constraints.
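For a simple ordering, the constraint matrix above has an obvious pattern and can be built in a couple of lines of base R. This is only an illustration of the structure; csvy generates the constraint matrices internally:

```r
# Build the (D-1) x D constraint matrix for theta_1 <= ... <= theta_D.
# Row d encodes the constraint theta_{d+1} - theta_d >= 0.
D <- 5
A <- cbind(-diag(D - 1), 0) + cbind(0, diag(D - 1))
print(A)
```

Each row has a \(-1\) and a \(+1\) in adjacent columns, matching the first matrix displayed above.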
The csurvey package allows users to specify a simple ordering with the symbolic function incr(). For example, suppose y is the survey variable of interest (say, cholesterol level), and the variable x takes values \(1\) through \(D\) (age groups, for example). Suppose we assume that average cholesterol level is increasing with age in this population, and that the design is specified by the object ds. Then

ans <- csvy(y ~ incr(x), design = ds)

creates an object containing the estimated population means and confidence intervals. The function decr() is used similarly, to specify decreasing means.
Next, suppose x1 takes values \(1\) through \(D_1\) and x2 takes values \(1\) through \(D_2\) (say, age group and ethnicity), and we wish to estimate the population means over the \(D_1\times D_2\) grid of values of x1 and x2. If we assume that the population domain means are ordered in x1 but there is no ordering in x2, then the command

ans <- csvy(y ~ incr(x1) * x2, design = ds)

will provide domain mean estimates where the means are non-decreasing in age, within each ethnicity. Note that “+” is not allowed when defining a formula in csvy(), because all \(D_1\times D_2\) combinations of x1 and x2 are considered.
For an example of a block ordering with three blocks, the command

ans <- csvy(y ~ block.Ord(x, order = c(1,1,1,2,2,2,3,3,3)), design = ds)

specifies that the variable x takes values \(1\) through \(9\), and that the domains with values 1, 2, and 3 each have population means not greater than each of the population means in the domains with values 4, 5, and 6; these in turn each have population means not greater than each of the population means in the domains with values 7, 8, and 9. More examples of implementing constraints are given below.
Implementing order constraints leads to “pooling” of domains where the order constraints are binding. This naturally leads to smaller confidence intervals, as the averaging is over a larger number of observations. The mixture covariance estimator for \(\tilde{\boldsymbol{\theta}}\) is provided by csurvey. This covariance estimator is constructed by recognizing that for different samples, different constraints are binding, so that different sets of domains are “pooled” to obtain the constrained estimator. The covariance estimator is then a mixture of pooled covariance matrix estimators, with the mixture distribution approximated via simulation. Using this mixture, rather than the covariance matrix for the observed pooling, provides confidence intervals with coverage closer to the target, while retaining the shorter lengths. A further refinement pools information across domains to provide upper and lower confidence interval bounds that also satisfy the constraints, effectively reducing the confidence interval length for domains with small sample sizes and allowing for estimation and inference in empty domains.
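Schematically, if \(\hat{p}_J\) denotes the simulated probability that the set \(J\) of constraints is binding, and \(\widehat{\boldsymbol{\Sigma}}_J\) denotes the covariance estimate under the corresponding pooling, the mixture covariance estimator has the form (our schematic notation, not the package’s):
\[
\widehat{\mathrm{Cov}}(\tilde{\boldsymbol{\theta}}) = \sum_{J} \hat{p}_J\, \widehat{\boldsymbol{\Sigma}}_J.
\]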
The test of \(H_0:\mathbf{A}\bar{\mathbf{y}}_U=\mathbf{0}\) versus the one-sided alternative \(H_1:\mathbf{A}\bar{\mathbf{y}}_U\geq\mathbf{0}\) in csurvey has improved power over the corresponding \(F\)-test available in the survey package, which uses the two-sided alternative \(H_2:\mathbf{A}\bar{\mathbf{y}}_U\neq\mathbf{0}\). For example, suppose we have measures of the amounts of pollutants in samples of water from small lakes in a certain region, together with measurements of distances from sources, such as factories or waste dumps, that are suspected of contributing to the pollution. The test of the null hypothesis that the amount of pollution does not depend on distance will have greater power if the alternative is one-sided, that is, that the pollution amount is, on average, larger for smaller distances.
In the following sections, we show only the main code used to generate the results. Supplementary or non-essential code is available in the accompanying R script submitted with this manuscript.
The National Health and Nutrition Examination Survey (NHANES) combines in-person interviews and physical examinations to produce a comprehensive data set from a probability sample of residents of the U.S. The data are made available to the public at the CDC NHANES website. The subset used in this example is derived from the 2009–2010 NHANES cycle and is provided in the csurvey package as the nhdat2 data set. We consider the task of estimating the average total cholesterol level (mg/dL), originally recorded as LBXTC in the raw NHANES data, by age (in years), corresponding to the original variable RIDAGEYR. Focusing on young adults aged 21 to 45, we analyze a sample of \(n = 1,933\) participants and specify the survey design using the associated weights and strata available in the data.
Then, to obtain the proposed constrained domain mean estimates, we use the csvy function in the csurvey package. In this call, incr is a symbolic function used to specify that the population domain means of chol are increasing with respect to the predictor age.
ans <- csvy(chol ~ incr(age), design = dstrat, n.mix = 100)
The n.mix argument controls the number of simulations used to estimate the mixture covariance matrix, with a default of n.mix = 100. To speed up computation, users can set n.mix to a smaller number, e.g., 10.
We can extract the CIC value for the constrained estimator from ans as
cat("CIC (constrained):", ans$CIC, "\n")
CIC (constrained): 32.99313

and the CIC value for the unconstrained estimator as

cat("CIC (unconstrained):", ans$CIC.un, "\n")
CIC (unconstrained): 51.65159

We see that for this example the constrained estimator has the smaller CIC value, indicating a better fit.
If we want to construct a contrast of domain means and obtain its standard error, we can use the svycontrast function, which is inherited from the survey package. Note that in the survey package it is impossible to obtain a contrast estimate when any domain is empty; the csurvey package inherits this behavior. For example, to compare the average cholesterol for the first thirteen age groups with that for the last twelve age groups, we can code it as:
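A sketch of such a contrast follows, assuming the csvy fit ans can be passed to svycontrast with a numeric contrast vector over the 25 age domains (the vector averages the first thirteen domains and subtracts the average of the last twelve; the exact call appears in the accompanying script):

```r
# Hypothetical sketch: average cholesterol for the first 13 age groups
# minus the average for the last 12 age groups.
con <- c(rep(1/13, 13), rep(-1/12, 12))
svycontrast(ans, con)
```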
The csvy function produces both the constrained fit and the corresponding unconstrained fit, using methods from the survey package. A visual comparison between the two fits can be obtained by applying the plot method to the resulting object with the argument type = "both". For illustration, Figure 1 displays the estimated domain means along with 95% confidence intervals, generated by the following code:
plot(ans, type = "both")
Figure 1: Estimates of average cholesterol level for 25 ages, with 95% confidence intervals, for a stratified sample in the R dataset nhdat2, \(n=1933\).
The confint function can be used to extract confidence intervals for the domain means. When the response is not Gaussian, the confidence interval can be produced either for the average systematic component over domains or for the domain mean itself.
For this data set, the sample sizes for each of the twenty-five ages range from 54 to 99, so none of the domains is “small.” To demonstrate the case of small domains, we next provide domain mean estimates and confidence intervals for \(400\) domains, arranged in a grid with 25 ages, four waist-size categories, and four income categories. We divide the waist measurement by height to obtain a relative girth, and split the observations into four groups with variable name wcat, a 4-level ordinal categorical variable representing waist-to-height-ratio categories, computed from BMXWAIST (waist circumference in cm) and BMXHT (height in cm) in the body measures file BMX_F.XPT from NHANES. We have income information for the subjects, in terms of a multiple of the federal poverty level. The first income category includes those with income at or below 75% of the poverty line; the second category is .75 through 1.38 times the poverty level (1.38 determines Medicaid eligibility); the third goes from above 1.38 to 3.5; and above 3.5 times the poverty level is the fourth category. These indicators are contained in the variable icat (categorized income), created from the INDFMPIR variable in the file DEMO_F.XPT from NHANES.
The domain sample sizes average only 4.8 observations, and there are 16 empty domains. The sample size for each domain can be checked by ans$nd. To estimate the population means we assume average cholesterol level is increasing in both age and waist size, but no ordering is imposed on income categories:
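Following the formula syntax introduced earlier, a call consistent with these assumptions would be the following sketch (the exact call appears in the accompanying script):

```r
# Sketch: means increasing in age and waist category; income unordered.
ans <- csvy(chol ~ incr(age) * incr(wcat) * icat, design = dstrat, n.mix = 100)
```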
To extract estimates and confidence intervals for specific domains defined by the model, the user can use the predict function as follows:
domains <- data.frame(age = c(24, 35), wcat = c(2, 4), icat = c(2, 3))
pans <- predict(ans, newdata = domains, se.fit = TRUE)
cat("Predicted values, confidence intervals and standard errors for specified domains:\n")
Predicted values, confidence intervals and standard errors for specified domains:
print(pans)
$fit
[1] 162.9599 202.0061

$lwr
[1] 148.6083 183.6379

$upp
[1] 177.8976 219.7764

$se.fit
[1] 7.471908 9.219167
Figure 2 displays the domain mean estimates along with 95% confidence intervals, highlighting in red the two domains specified in the predict call. The code to create this plot is below; the control argument lets the user adjust the aesthetics of the plot.
Figure 2: Constrained estimates of population domain means for 400 domains in a 25x4x4 grid. The increasing population domain estimates for the 25 ages are shown within the waist size and income categories. The blue bands indicate 95% confidence intervals for the population domain means, with two specific domains, namely (age, waist, income) = (24, 2, 2) and (35, 4, 3), marked in red. Empty domains are marked with a red ‘x’ sign.
ctl <- list(x1lab = "waist", x2lab = "income", subtitle.size = 8)
plot(ans, x1 = "wcat", x2 = "icat", control = ctl, domains = domains)

The user can visualize the unconstrained fit by specifying type = "unconstrained" in the plot call. The corresponding output is presented in Figure 3.

plot(ans, x1 = "wcat", x2 = "icat", control = ctl, type = "unconstrained")
Figure 3: Unconstrained estimates of population domain means for 400 domains in a 25x4x4 grid. The population domain estimates for the 25 ages are shown within the waist size and income categories. The green bands indicate 95% confidence intervals for the population domain means. Empty domains are marked with a red ‘x’ sign.
Without the order constraints, the sample sizes are too small to provide valid estimates and confidence intervals, unless further model assumptions are used, as in some small area estimation methods. With order constraints, design-based estimation and inference are possible for substantially smaller domain sample sizes than with unconstrained design-based estimation.
We consider the 2019 National Survey of College Graduates (NSCG), conducted by the U.S. Census Bureau and sponsored by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation. The NSCG provides data on the characteristics of the nation’s college graduates, with a focus on those in the science and engineering workforce. The datasets and documentation are available to the public on the National Survey of College Graduates (NSCG) website; replicate weights are available separately from NCSES upon request. Because the size of the subsets of the NSCG survey used in this paper exceeds the size limit allowed for an R package on CRAN, the subsets are not included in the csurvey package. Instead, we provide the link to access the subsets nscg19.rda and nscg19_2.rda used in this section at this website.
The study variable of interest is annual salary (denoted as SALARY in the dataset), which exhibits substantial right-skewness in its raw form. To reduce the influence of outliers and improve model stability, we restricted the analysis to observations with annual salaries between $30,000 and $400,000. A logarithmic transformation was applied to the salary variable to address skewness. Four predictors of salary were considered:
Field of study (field, denoted by NDGMEMG in the raw dataset): This nominal variable defines the field of study for the highest degree. There are five levels: (1) Computer and mathematical sciences; (2) Biological, agricultural and environmental life sciences; (3) Physical and related sciences; (4) Social and related sciences; (5) Engineering. Block ordering constraint: given the other predictors, the average annual salary for each of the fields (2) and (4) is less than for the STEM fields (1), (3) and (5).
Grouped year of award of highest degree (hd_year_grouped, denoted by HDAY5): This ordinal variable has five levels: (1) 1995 to 1999; (2) 2000 to 2004; (3) 2005 to 2009; (4) 2010 to 2014; (5) 2015 or later. Isotonic constraint: given the other predictors, the average annual salary decreases with the year of award of highest degree; i.e., the more experience respondents have, on average, the higher the annual salary.
Highest degree type (hd_type, denoted by DGRDG): The three levels are: (1) Bachelor’s; (2) Master’s; (3) Doctorate and Professional. Isotonic constraint: given the other predictors, the average annual salary increases with respect to the highest degree type.
Region code for employer (region, denoted by EMRG): This nominal variable defines the regions in which the respondents worked within the U.S. Nine levels are: (1) New England; (2) Middle Atlantic; (3) East North Central; (4) West North Central; (5) South Atlantic; (6) East South Central; (7) West South Central; (8) Mountain; (9) Pacific and US Territories. There is no constraint for this predictor.
This data set contains \(n=30,368\) observations in a four-dimensional grid of 675 domains, where the sample size in the domains ranges from one to 491. Here, we specify the shape and order constraints in a similar fashion as in the previous examples. The symbolic routine block.Ord is used to impose a block ordering on field, with the ordering specified in the order argument. The svrepdesign command creates a survey design with replicate weights and no clusters, where the columns named “RW0001”, “RW0002”, ..., “RW0320” are the 320 NSCG replicate weights and w denotes the sampling weight. The variance is computed as a weighted sum of squared deviations of the replicate estimates from the full-sample estimate. The general formula for computing a variance estimate using replicate weights is:
\[v_{REP}(\hat{\theta})=\sum_{r=1}^{R}c_r(\hat{\theta}_r-\hat{\theta})^2\]
where the estimate \(\hat{\theta}\) is computed from the final full-sample survey weights and each replicate estimate \(\hat{\theta}_r\) is computed according to the \(r\)th set of replicate weights (\(r=1,\ldots,R\)). The replication adjustment factor \(c_r\) is a multiplier for the \(r\)th squared difference. In svrepdesign, the scale argument is an overall multiplier and rscale is a vector of replicate-specific multipliers, corresponding to the values of \(c_r\) in the formula above.
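As a toy numeric illustration of the formula (made-up replicate estimates, not NSCG data):

```r
# Toy illustration of v_REP with R = 4 replicates.
theta_hat <- 10.2                       # full-sample estimate
theta_r   <- c(10.0, 10.5, 10.1, 10.4)  # replicate estimates
c_r       <- rep(0.05, 4)               # replicate multipliers (rscale)
v_rep <- sum(c_r * (theta_r - theta_hat)^2)
v_rep
# [1] 0.009
```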
load("./nscg19.rda")
rds <- svrepdesign(data = nscg, repweights = dplyr::select(nscg, "RW0001":"RW0320"),
    weights = ~w, combined.weights = TRUE, mse = TRUE, type = "other", scale = 1, rscale = 0.05)

Estimates of domain means for the 225 domains for which the highest degree is a Bachelor’s are shown in Figure 4, as surfaces connecting the estimates over the grid of field indicators and year of award of highest degree. The block-ordering constraints can be seen in the surfaces, where the fields labeled 2 and 4 have lower means than those labeled 1, 3, and 5. The surfaces are also constrained to be decreasing in year of award of highest degree.
To improve computational efficiency, the simulations used to estimate the mixture covariance matrix and the sampling distribution of the test statistic can be parallelized. This is enabled by setting the following R option:

options(csurvey.multicore = TRUE)

The model is fitted using the following code:
ans <- csvy(logSalary ~ decr(hd_year_grouped) * incr(hd_type) * block.Ord(field,
    order = c(2, 1, 2, 1, 2)) * region, design = rds)

The plotpersp function is used to create Figure 4; it produces a three-dimensional estimated surface plot when there are at least two predictors. In this example, we use it to generate a 3D perspective plot of the estimated average log-transformed salary with respect to field and year of award of highest degree. Among the three hd_type categories, the Bachelor’s degree group has the highest number of observations, and the function visualizes the 225 domains corresponding to this most frequently observed level. Plot aesthetics are customized via the control argument:
ctl <- list(categ = "region", categnm = c("New England", "Middle Atlantic", "East North Central",
    "West North Central", "South Atlantic", "East South Central", "West South Central", "Mountain",
    "Pacific and US Territories"), NCOL = 3, th = 60, xlab = "years since degree",
    ylab = "field of study", zlab = "log(salary)")

plotpersp(ans, x1 = "hd_year_grouped", x2 = "field", control = ctl)
Figure 4: Estimates of average log(salary) by field of study and year of degree, for observations whose highest degree is a Bachelor’s, for each of the nine regions.
Estimates and confidence intervals for the 75 domain means associated with three of the regions are shown in Figure 5. The constrained estimates are shown as blue dots, and the unconstrained means are depicted as grey dots. The confidence intervals are indicated with blue bands for the constrained estimates and grey bands for the unconstrained estimates.
For the Northeast Region, the average length of the constrained confidence intervals is .353, while the average length for the unconstrained confidence intervals is .477. In the domain for those in the math and computer science field, with PhDs obtained in 2000-2004, the unconstrained log-salary estimate is well below the corresponding constrained estimate, because the latter is forced to be at least that of lower degrees, and that of newer PhDs. If the constraints hold in the population, then the unconstrained confidence interval is unlikely to capture the population value. As seen in the NHANES example in Section 4, the unconstrained estimators are unreliable when the domain sample size is small. The Pacific region has the largest sample sizes, ranging from 12 to 435 observations. With these larger sample sizes, most of the unconstrained estimates already satisfy the constraints, but the average length for the constrained estimator is .301, while the average length for the unconstrained estimator is .338, showing that the mixture covariance matrix leads to more precise confidence intervals. Also shown is the region with the smallest sample sizes: the East South Central region. Here the average length for the unconstrained confidence intervals is .488, while the average length for the constrained confidence intervals is .444.
Figure 5: Estimates of average log(salary) for the 75 domains in each of three regions. The blue dots represent the constrained domain mean estimates, while the grey dots represent the unconstrained domain mean estimates. The blue band is the 95% confidence interval for the domains, using the constraints; the grey band is the 95% unconstrained domain mean confidence interval.
For this example we again use the NSCG data set, but with some new variables and variable groupings. First, instead of the grouped year of award of highest degree (hd_year_grouped), we use the actual calendar year in which the highest degree was awarded (hd_year, denoted as HDACYR in the raw data set), and we conduct the one-sided test for six pairs of consecutive calendar years. In addition, we choose father’s education level (daded, denoted as EDDAD) as the main predictor; the interest is in determining whether salaries are higher for people whose fathers have more education. In the raw data set EDDAD has seven categories, which we group into five for this example: 1 = no high school degree, 2 = high school degree but no college degree, 3 = bachelor’s degree, 4 = master’s degree, and 5 = PhD or professional degree. We further group region (EMRG) as: 1 = Northeast, 2 = North Central, 3 = Southeast, 4 = West, 5 = Pacific and Territories; and group field (NDGMEMG) as: 1 = Computer and Mathematical Sciences, Physical and Related Sciences, and Engineering (core STEM fields); 2 = Biological, Agricultural, and Environmental Life Sciences, and Social and Related Sciences (life and social sciences); 3 = Science and Engineering-Related Fields; 4 = Non-Science and Engineering Fields (non-STEM). Finally, the highest degree type (denoted as DGRDG in the raw data) is used to limit the study sample to people whose highest degree is a Master’s degree. The sample size is \(n=25,177\).
It seems reasonable that people whose parents are more educated are more educated themselves, but if we control for the person’s level of education, as well as field, region, and year of degree, will the salary still be increasing with level of father’s education? For this, the null hypothesis is that within the region, field, and year of degree, the salary is constant over levels of father’s education. The one-sided alternative is that salaries increase with father’s education level.
The constrained and unconstrained fits to the data under the alternative hypotheses are shown in Figure 6 for five regions and four fields, for degree years 2016-2017. The dots connected by solid blue lines represent the fit under the one-sided alternative, i.e., constrained to be increasing in father’s education. The dots connected by solid red lines represent the fit under the two-sided alternative, i.e., with unconstrained domain means.
Figure 6: Estimates of average log(salary) by father’s education level, for each of five regions and four fields, for subjects whose degree was attained in 2016-2017. The solid blue lines connect the estimates where the average salary is constrained to be increasing in father’s education, and the solid red lines connect unconstrained estimates of average salary.
The \(p\)-values for the tests of the null hypothesis are given in Table 1, where n/a is shown for the two-sided \(p\)-value in the case of an empty domain. The test is conducted for six pairs of consecutive calendar years, with sample sizes of 2,129, 2,795, 3,069, 2,895, 2,423, and 1,368, respectively. The one-sided test consistently has a small \(p\)-value, indicating that father’s education level is positively associated with salary, even after controlling for region, degree type, and field.
Table 1: One-sided and two-sided \(p\)-values for the six pairs of consecutive calendar years.

| Year pair | One-sided | Two-sided |
|---|---|---|
| 1 | .008 | n/a |
| 2 | <.001 | .018 |
| 3 | <.001 | <.001 |
| 4 | <.001 | n/a |
| 5 | .003 | .417 |
| 6 | <.001 | n/a |
The \(p\)-value of the one-sided test is included in the fitted object. For example, to check the \(p\)-value for the fit for the years 2008 and 2009, we can fit the model and print its summary table as:
load("./nscg19_2.rda")
data <- nscg2 |>
  dplyr::filter(hd_year %in% c(2008, 2009))

rds <- svrepdesign(data = data,
                   repweights = dplyr::select(data, "RW0001":"RW0320"),
                   weights = ~w, combined.weights = TRUE, mse = TRUE,
                   type = "other", scale = 1, rscale = 0.05)

set.seed(1)
ans <- csvy(logSalary ~ incr(daded) * field * region, design = rds, test = TRUE)
summary(ans)
Call:
csvy(formula = logSalary ~ incr(daded) * field * region, design = rds,
    test = TRUE)

Null deviance: 460.6182 on 97 degrees of freedom
Residual deviance: 391.3722 on 51 degrees of freedom

Approximate significance of constrained fit:
                         edf mixture.of.Beta p.value
incr(daded):field:region  47          0.5536  0.0068 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CIC (constrained estimator): 0.0275
CIC (unconstrained estimator): 0.0354
Finally, we use another subset of the NHANES 2009–2010 data to demonstrate how our method applies when the outcome is binary. This subset is included in the csurvey package as nhdat. The construction of variables, sampling weights, and strata in this subset closely follows the approach described in Section 4. It contains \(n = 1,680\) observations with complete records on total cholesterol, age, height, and waist circumference for adults aged 21–40. The binary outcome indicates whether an individual has high total cholesterol, coded as 1 if total cholesterol exceeds 200 mg/dL, and 0 otherwise. We estimate the population proportion with high cholesterol by age, waist, and gender (1 = male, 2 = female). The waist variable, denoted as wcat, is a 4-level categorized ordinal variable representing waist-to-height ratios.
It is reasonable to assume that, on average, the proportion of individuals with high cholesterol increases with both age and waist. The model is specified using the following code:
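A minimal sketch of the model specification, reconstructed from the Call line of the summary output shown below. The design object dstrat is assumed to be the survey design built from nhdat (weights and strata constructed as in Section 4), so this fragment is not runnable on its own.

```r
# Reconstructed from the Call line of the summary output below; `dstrat`
# (a survey design object built from nhdat) is assumed to already exist.
# Proportions are modeled on the logit scale, increasing in both age and
# waist category within gender.
set.seed(1)
ans <- csvy(chol ~ incr(age) * incr(wcat) * gender,
            design = dstrat,
            family = binomial(link = "logit"),
            test = TRUE)
```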
As shown in the summary output below, the CIC of the constrained estimator is smaller than that of the unconstrained estimator, and the one-sided hypothesis test has a \(p\)-value close to zero.
summary(ans)
Call:
csvy(formula = chol ~ incr(age) * incr(wcat) * gender, design = dstrat,
    family = binomial(link = "logit"), test = TRUE)

Null deviance: 2054.947 on 159 degrees of freedom
Residual deviance: 1906.762 on 126 degrees of freedom

Approximate significance of constrained fit:
                            edf mixture.of.Beta                p.value
incr(age):incr(wcat):gender  34          0.5174 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CIC (constrained estimator): 0.3199
CIC (unconstrained estimator): 1.2778
The combination of age, waist, and gender gives 160 domains, which implies that the average sample size per domain is only around 10. Due to the small sample sizes, the unconstrained estimator shows unlikely jumps as age increases within each waist category. In contrast, the constrained estimator is more stable and tends to have narrower confidence intervals than the unconstrained Hájek estimator.
Figure 7: Estimates of the probability of high cholesterol for each combination of age, waist, and gender. The blue dots represent the constrained domain mean estimates, while the green dots represent the unconstrained domain mean estimates. The blue band is the 95% confidence interval for the domains, using the constraints; the green band is the 95% unconstrained domain mean confidence interval.
While model-based small area estimators, such as those implemented in the sae and emdi packages, are powerful tools for borrowing strength across domains, they rely on parametric assumptions that may be violated in practice. Design-based methods remain essential for official statistical agencies, as they provide transparent, model-free inference that is directly tied to the survey design. Estimation and inference for population domain means with survey data can be substantially improved if constraints based on natural orderings are imposed. The csurvey package (version 1.15) allows users to specify orderings on grids of domains and obtain estimates of, and confidence intervals for, population domain means. The package also implements a design-based small area estimation method, which allows inference for population domain means when the sample domain is empty and further improves estimates for domains with small sample sizes. The one-sided testing procedure available in csurvey has higher power than the standard two-sided test and can be applied in grids with some empty domains. Constrained confidence intervals for domain means have better coverage rates and smaller widths than those produced by unconstrained estimation. Finally, the package provides functions that allow the user to easily visualize the data and the fits. The utility of the package has been demonstrated with well-known survey data sets.
Acknowledgment: This work was partially funded by NSF MMS-1533804.
Supplementary materials are available in addition to this article; they can be downloaded at RJ-2025-032.zip.
For attribution, please cite this work as:
Liao & Meyer, "csurvey: Implementing Order Constraints in Survey Data Analysis", The R Journal, 2026
BibTeX citation
@article{RJ-2025-032,
  author = {Liao, Xiyue and Meyer, Mary C.},
  title = {csurvey: Implementing Order Constraints in Survey Data Analysis},
  journal = {The R Journal},
  year = {2026},
  note = {https://doi.org/10.32614/RJ-2025-032},
  doi = {10.32614/RJ-2025-032},
  volume = {17},
  issue = {4},
  issn = {2073-4859},
  pages = {4-19}
}
Optimal Latin hypercube designs (LHDs), including maximin distance LHDs, maximum projection LHDs, and orthogonal LHDs, are widely used in computer experiments. It is challenging to construct such designs with flexible sizes, especially large ones, for two main reasons. One reason is that theoretical results, such as algebraic constructions ensuring the maximin distance property or orthogonality, are only available for certain design sizes. For design sizes where theoretical results are unavailable, search algorithms can generate designs, but their numerical performance is not guaranteed to be optimal. Another reason is that when design sizes increase, the number of permutations grows exponentially. Constructing optimal LHDs is a discrete optimization process, and enumeration is nearly impossible for moderate or large design sizes. Various search algorithms and algebraic constructions have been proposed to identify optimal LHDs, each with its own pros and cons. We develop the R package LHD, which implements various search algorithms and algebraic constructions. We embedded different optimality criteria into each of the search algorithms, so they are capable of constructing different types of optimal LHDs even though they were originally invented to construct maximin distance LHDs only. An input argument that controls the maximum CPU time is also added to each of the search algorithms to let users flexibly allocate their computational resources. We demonstrate the functionalities of the package using various examples, and we provide guidance for experimenters on finding suitable optimal designs. The LHD package is easy to use for practitioners and may serve as a benchmark for future developments in LHDs.
Computer experiments are widely used in scientific research and industrial production, where complex computer codes, commonly high-fidelity simulators, generate data instead of real physical systems (Sacks et al. 1989; Fang et al. 2005). The outputs from computer experiments are deterministic (that is, free of random errors), and therefore replications are not needed (Butler 2001; Joseph and Hung 2008; Ba et al. 2015). Latin hypercube designs (LHDs, (McKay et al. 1979)) may be the most popular type of experimental design for computer experiments (Fang et al. 2005; Xiao and Xu 2018); they avoid replications in every dimension and have uniform one-dimensional projections. According to practical needs, there are various types of optimal LHDs, including space-filling LHDs, maximum projection LHDs, and orthogonal LHDs. There is a rich literature on the construction of such designs, but it is still very challenging to find good ones for moderate to large design sizes (Ye 1998; Fang et al. 2005; Joseph et al. 2015; Xiao and Xu 2018). One key reason is that theoretical results, such as algebraic constructions which guarantee the maximin distance property or orthogonality, are only established for specific design sizes. These constructions provide theoretical guarantees on the design quality but are limited in their applicability. For design sizes where such theoretical guarantees do not exist, search algorithms can generate designs. However, the performance of search-based designs depends on the algorithm employed, the search space explored, and the computational resources allocated, meaning they cannot be guaranteed to be optimal. Constructing optimal LHDs is a discrete optimization process, where enumerating all possible solutions guarantees the optimal design for a given size. However, this approach becomes computationally infeasible as the number of permutations grows exponentially with increasing design sizes, making it another key reason that adds to the challenge.

An LHD with \(n\) runs and \(k\) factors is an \(n \times k\) matrix with each column being a random permutation of the numbers \(1, \ldots, n\). Throughout this paper, \(n\) denotes the run size and \(k\) denotes the factor size. A space-filling LHD has its sampled region as scattered as possible, minimizing the unsampled region, thus accounting for the uniformity of all dimensions. Different criteria have been proposed to measure designs' space-filling properties, including the maximin and minimax distance criteria (Johnson et al. 1990; Morris and Mitchell 1995), the discrepancy criteria (Hickernell 1998; Fang et al. 2002, 2005), and the entropy criterion (Shewry and Wynn 1987). Since there are as many as \((n!)^{k}\) candidate LHDs for a given design size, it is nearly impossible to find the space-filling one by enumeration when \(n\) and \(k\) are moderate or large. In the current literature, both search algorithms (Morris and Mitchell 1995; Ye et al. 2000; Leary et al. 2003; Jin et al. 2005; Liefvendahl and Stocki 2006; Joseph and Hung 2008; Grosso et al. 2009; Chen et al. 2013; Ba et al. 2015) and algebraic constructions (Zhou and Xu 2015; Xiao and Xu 2017; Wang et al. 2018) are used to construct space-filling LHDs.
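The definition of an LHD is easy to state in code. A minimal sketch, mirroring what the package's rLHD() function returns (though this is not the package implementation):

```r
# An n x k LHD by definition: each column is an independent random
# permutation of 1, ..., n.
set.seed(1)
n <- 5; k <- 3
X <- sapply(1:k, function(j) sample(n))  # sample(n) permutes 1..n
X  # a 5 x 3 Latin hypercube design
```

Because each column is a permutation, every factor takes each of the \(n\) levels exactly once, which is what gives LHDs their uniform one-dimensional projections.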
Space-filling designs often focus on the full-dimensional space. To further improve the space-filling properties of all possible sub-spaces, (Joseph et al. 2015) proposed maximum projection designs. Considering the two- to \(k-1\)-dimensional sub-spaces, maximum projection LHDs (MaxPro LHDs) are generally more space-filling than the classic maximin distance LHDs. The construction of MaxPro LHDs is also challenging, especially for large ones, and (Joseph et al. 2015) proposed a simulated annealing (SA) based algorithm. In the LHD package, we incorporated the MaxPro criterion into other algorithms such as the particle swarm optimization (PSO) and genetic algorithm (GA) frameworks, leading to many better MaxPro LHDs; see Section 3 for examples.
Unlike space-filling LHDs, which minimize the similarities among rows, orthogonal LHDs (OLHDs) are another popular type of optimal design that consider similarities among columns; for example, OLHDs have zero column-wise correlations. Algebraic constructions are available for certain design sizes (Tang 1993; Ye 1998; Butler 2001; Steinberg and Lin 2006; Cioppa and Lucas 2007; Lin et al. 2009; Sun et al. 2009, 2010; Yang and Liu 2012; Georgiou and Efthimiou 2014), but there are many design sizes for which theoretical results are not available. In the LHD package, we implemented the average absolute correlation criterion and the maximum absolute correlation criterion (Georgiou 2009) with SA, PSO, and GA to identify both OLHDs and nearly orthogonal LHDs (NOLHDs) for almost all design sizes.
This paper introduces the R package LHD, available on the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/LHD/index.html), which implements some currently popular search algorithms and algebraic constructions for constructing maximin distance LHDs, MaxPro LHDs, OLHDs, and NOLHDs. We embedded different optimality criteria, including the maximin distance criterion, the MaxPro criterion, the average absolute correlation criterion, and the maximum absolute correlation criterion, into each of the search algorithms, which were originally invented to construct maximin distance LHDs only (Morris and Mitchell 1995; Leary et al. 2003; Liefvendahl and Stocki 2006; Joseph and Hung 2008; Chen et al. 2013), so each of them is capable of constructing different types of optimal LHDs through the package. To let users flexibly allocate their computational resources, we also added an input argument that limits the maximum CPU time for each of the algorithms, so users can easily define how and when they want the algorithms to stop. An algorithm can stop in one of two ways: either when the user-defined maximum number of iterations is reached or when the user-defined maximum CPU time is exceeded. For example, users can either allow the algorithm to run for a specified number of iterations without restricting the maximum CPU time, or set a maximum CPU time limit to stop the algorithm regardless of the number of iterations completed. After an algorithm completes or is stopped, the number of iterations completed, along with the average CPU time per iteration, is reported to users for their information. The R package LHD is an integrated tool for users with little or no background in design theory, who can easily find optimal LHDs with desired sizes. Many new designs that are better than existing ones have been discovered; see Section 3.
The remainder of the paper is organized as follows. Section 2 illustrates different optimality criteria for LHDs. Section 3 demonstrates some popular search algorithms and their implementation details in the LHD package, along with examples. Section 4 discusses some useful algebraic constructions, as well as examples of how to implement them via the developed package. Section 5 reviews other R packages for Latin hypercubes and provides a comparative discussion. Section 6 concludes with a summary.
Various criteria have been proposed to measure designs' space-filling properties (Johnson et al. 1990; Hickernell 1998; Fang et al. 2002). In this paper, we focus on the currently popular maximin distance criterion (Johnson et al. 1990), which seeks to scatter design points over the experimental domain so that the minimum distances between points are maximized. Let \(\textbf{X}\) denote an LHD matrix throughout this paper. Define the \(L_q\)-distance between two runs \(x_i\) and \(x_j\) of \(\textbf{X}\) as \(d_q(x_i, x_j) = \left\{ \sum_{m=1}^{k} \vert x_{im}-x_{jm}\vert ^q \right\}^{1/q}\), where \(q\) is an integer. Define the \(L_q\)-distance of the design \(\textbf{X}\) as \(d_q(\textbf{X}) = \text{min} \{d_q(x_i, x_j), 1 \leq i<j \leq n \}\). In this paper, we consider \(q=1\) and \(q=2\), i.e., the Manhattan (\(L_1\)) and Euclidean (\(L_2\)) distances. A design \(\textbf{X}\) is called a maximin \(L_q\)-distance design if it has the unique largest \(d_q(\textbf{X})\) value among all designs of the same size. When more than one design attains the same largest \(d_q(\textbf{X})\), the maximin distance design sequentially maximizes the next minimum inter-site distances. To evaluate the maximin distance criterion more conveniently, (Morris and Mitchell 1995) and (Jin et al. 2005) proposed to minimize the scalar value
\[\label{E2}
 \phi_{p}= \bigg\{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}d_q(x_i, x_j)^{-p} \bigg\} ^{1/p}, \tag{1}\]
where \(p\) is a tuning parameter. The \(\phi_{p}\) criterion in Equation (1) is asymptotically equivalent to the maximin distance criterion as \(p \to \infty\). In practice, \(p=15\) often suffices (Morris and Mitchell 1995). In the LHD package, the function phi_p() implements this criterion.
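To make Equation (1) concrete, the following from-scratch sketch (not the package implementation) computes \(\phi_p\) directly from the pairwise distances; with the defaults \(p=15\) and \(q=1\) it reproduces the phi_p() value reported for the 5 × 3 example LHD used later in this section.

```r
# phi_p from Equation (1): { sum over run pairs of d^(-p) }^(1/p).
# A small phi_p corresponds to a large minimum inter-site distance.
phi_p_manual <- function(X, p = 15, q = 1) {
  d <- dist(X, method = if (q == 1) "manhattan" else "euclidean")
  sum(d^(-p))^(1 / p)
}

# The 5 x 3 LHD used in the package demonstration later in this section:
X <- matrix(c(2, 4, 3, 1, 5,
              1, 3, 2, 4, 5,
              4, 3, 2, 5, 1), nrow = 5)
phi_p_manual(X)  # matches phi_p(X) from the LHD package: 0.3336608
```

Note how the \(d^{-p}\) weighting makes the smallest distance dominate the sum for large \(p\), which is why minimizing \(\phi_p\) approximates maximizing \(d_q(\textbf{X})\).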
Maximin distance LHDs focus on space-filling properties in the full-dimensional space, but their space-filling properties in subspaces are not guaranteed. (Joseph et al. 2015) proposed the maximum projection criterion, which considers a design's space-filling properties in all possible dimensional spaces. An LHD \(\textbf{X}\) is called a maximum projection LHD (MaxPro LHD) if it minimizes the maximum projection criterion
\[\label{E3}
 \mathop{\mathrm{min}}\limits_{\textbf{X}} \psi (\textbf{X}) = \Bigg\{ \frac{1}{{n \choose 2}} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{\Pi_{l=1}^{k}(x_{il}-x_{jl})^2} \Bigg\}^{1/k}. \tag{2}\]
From Equation (2), we can see that any two design points should be far apart from each other in any projection to minimize the value of \(\psi (\textbf{X})\). Thus, maximum projection LHDs consider the space-filling properties in all possible subspaces. Note that this criterion was originally defined using design points scaled to the unit hypercube \([0,1]^{k}\) in (Joseph et al. 2015), whereas our design points are represented as integer levels. A simple transformation, \(\textbf{X}_{Scaled}*n+0.5\), can be applied to revert the scaling: each element of every design point in the scaled unit hypercube is multiplied by the run size \(n\), and then \(0.5\) is added. The illustrative example at the end of Section 3 applies this transformation to ensure a fair comparison of performance. In the LHD package, the function MaxProCriterion() implements this criterion.
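Equation (2) can likewise be computed from scratch; this sketch (again, not the package implementation) reproduces the MaxProCriterion() value reported for the same 5 × 3 example LHD.

```r
# psi(X) from Equation (2): average over all run pairs of the reciprocal
# product of squared coordinate differences, raised to the power 1/k.
maxpro_manual <- function(X) {
  n <- nrow(X); k <- ncol(X)
  total <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      total <- total + 1 / prod((X[i, ] - X[j, ])^2)
    }
  }
  (total / choose(n, 2))^(1 / k)
}

X <- matrix(c(2, 4, 3, 1, 5,
              1, 3, 2, 4, 5,
              4, 3, 2, 5, 1), nrow = 5)
maxpro_manual(X)  # matches MaxProCriterion(X): 0.5375482
```

A pair that coincides in even one coordinate of a projection makes one factor \((x_{il}-x_{jl})^2\) small and the reciprocal product large, so minimizing \(\psi\) pushes points apart in every projection.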
Orthogonal and nearly orthogonal designs, which aim to minimize the correlations between factors, are widely used in experiments (Steinberg and Lin 2006; Georgiou 2009; Sun and Tang 2017). Two major correlation-based criteria that measure a design's orthogonality are the average absolute correlation criterion and the maximum absolute correlation criterion (Georgiou 2009), denoted ave\((|q|)\) and max\(|q|\), respectively:
\[\label{E4}
 \mathop{\mathrm{ave}}\limits(|q|) = \frac{2 \sum_{i=1}^{k-1} \sum_{j=i+1}^{k}|q_{ij}|}{k(k-1)} \text{ and } \mathop{\mathrm{max}}\limits|q| = \mathop{\mathrm{max}}\limits_{i,j} |q_{ij}|, \tag{3}\]
where \(q_{ij}\) is the correlation between the \(i\)th and \(j\)th columns of the design matrix \(\textbf{X}\). Orthogonal designs have ave\((|q|)=0\) and max\(|q|=0\), but they may not exist for all design sizes. Designs with a smaller ave\((|q|)\) or max\(|q|\) are generally preferred in practice. In the LHD package, the functions AvgAbsCor() and MaxAbsCor() implement the criteria ave\((|q|)\) and max\(|q|\), respectively.
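Equation (3) reduces to a few lines of R on top of cor(); this sketch (not the package implementation) reproduces the AvgAbsCor() and MaxAbsCor() values reported below for the 5 × 3 example LHD.

```r
# ave(|q|) and max|q| from Equation (3): the mean and the maximum of the
# absolute pairwise column correlations of the design matrix X.
X <- matrix(c(2, 4, 3, 1, 5,
              1, 3, 2, 4, 5,
              4, 3, 2, 5, 1), nrow = 5)
q <- cor(X)                      # k x k column correlation matrix
offdiag <- abs(q[upper.tri(q)])  # the k(k-1)/2 pairwise correlations
mean(offdiag)  # ave(|q|): 0.5333333
max(offdiag)   # max|q|:   0.9
```

Taking the upper triangle of cor(X) collects each pair \((i,j)\), \(i<j\), exactly once, which is why the factor \(2/(k(k-1))\) in Equation (3) becomes a plain mean here.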
This subsection demonstrates how to use the optimality criteria introduced above from the developed LHD package. To generate a random LHD matrix, the function rLHD can be used. For example,
> X = rLHD(n = 5, k = 3); X #This generates a 5 by 3 random LHD, denoted as X
     [,1] [,2] [,3]
[1,]    2    1    4
[2,]    4    3    3
[3,]    3    2    2
[4,]    1    4    5
[5,]    5    5    1

The input arguments for the function rLHD are the run size n and the factor size k. Continuing with the above randomly generated LHD X, we evaluate it with respect to the different optimality criteria. For example,
> phi_p(X) #The maximin L1-distance criterion.
[1] 0.3336608
> phi_p(X, p = 10, q = 2) #The maximin L2-distance criterion.
[1] 0.5797347
> MaxProCriterion(X) #The maximum projection criterion.
[1] 0.5375482
> AvgAbsCor(X) #The average absolute correlation criterion.
[1] 0.5333333
> MaxAbsCor(X) #The maximum absolute correlation criterion.
[1] 0.9

The input arguments of the function phi_p are an LHD matrix X, p, and q, where p and q come directly from Equation (1). Note that the default settings within phi_p are \(p=15\) and \(q=1\) (the Manhattan distance), and the user can change them. For the functions MaxProCriterion, AvgAbsCor, and MaxAbsCor, there is only one input argument: an LHD matrix X.
Simulated annealing (SA, (Kirkpatrick et al. 1983)) is a probabilistic optimization algorithm whose name comes from the annealing process in metallurgy. (Morris and Mitchell 1995) proposed a modified SA that randomly exchanges elements in an LHD to seek potential improvements. If such an exchange leads to a better LHD under a given optimality criterion, the exchange is kept. Otherwise, it is kept with probability \(\hbox{exp}[-(\Phi(\textbf{X}_{new})-\Phi(\textbf{X}))/T]\), where \(\Phi\) is the given optimality criterion, \(\textbf{X}\) is the original LHD, \(\textbf{X}_{new}\) is the LHD after the exchange, and \(T\) is the current temperature. In this article, we focus on minimizing the optimality criteria outlined in Section 2, so only minimization problems are considered. This acceptance probability ensures that an exchange leading to a slightly worse LHD has a higher chance of being kept than one leading to a significantly worse LHD, because a slightly worse LHD has a smaller value of \(\Phi(\textbf{X}_{new})-\Phi(\textbf{X})\). The exchange procedure is applied iteratively to improve the LHD. When there are no improvements after a certain number of attempts, the current temperature \(T\) is annealed. Note that an exchange with a large value of \(\Phi(\textbf{X}_{new})-\Phi(\textbf{X})\) (one leading to a significantly worse LHD) is more likely to be accepted during the early phase of the search, when \(T\) is relatively high, and less likely to be accepted later, when \(T\) has decreased (annealed). The best LHD is identified after the algorithm converges or the budget constraint is reached. In the LHD package, the function SA() implements this algorithm:
SA(n, k, N = 10, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5, OC = "phi_p",
   p = 15, q = 1, maxtime = 5)

Table 1 provides an overview of all the input arguments in SA(). n and k are the desired run size and factor size. T0 is the initial temperature, rate is the temperature decrease rate, and Tmin is the minimum temperature. If the current temperature drops below Tmin, the current loop in the algorithm stops and the current number of iterations increases by one. There are two stopping criteria for the entire function: when the current number of iterations reaches the maximum (denoted N in the function) or when the cumulative CPU time reaches the maximum (denoted maxtime in the function). Whichever occurs first triggers the stop of the function. For the input argument OC (optimality criterion), "phi_p" returns maximin distance LHDs, "MaxProCriterion" returns MaxPro LHDs, and "AvgAbsCor" or "MaxAbsCor" returns orthogonal LHDs.
Table 1: Input arguments of the function SA().

| Argument | Description |
|---|---|
| n | A positive integer that defines the number of rows (run size) of the output LHD. |
| k | A positive integer that defines the number of columns (factor size) of the output LHD. |
| N | A positive integer that defines the maximum number of iterations in the algorithm. A large value of N will result in a high CPU time, and it is recommended to be no greater than 500. The default is 10. |
| T0 | A positive number that defines the initial temperature. The default is 10, which means the temperature anneals from 10 in the algorithm. |
| rate | A positive percentage that defines the temperature decrease rate; it should be in (0, 1). For example, rate = 0.25 means the temperature decreases by 25% each time. The default is 10%. |
| Tmin | A positive number that defines the minimum temperature allowed. When the current temperature becomes smaller than or equal to Tmin, the stopping criterion for the current loop is met. The default is 1. |
| Imax | A positive integer that defines the maximum number of perturbations the algorithm will try without improvement before the temperature is reduced. The default is 5. For CPU time considerations, Imax is recommended to be no greater than 5. |
| OC | An optimality criterion. The default is "phi_p"; it can be one of "phi_p", "AvgAbsCor", "MaxAbsCor", or "MaxProCriterion". |
| p | A positive integer, a parameter in the \(\phi_{p}\) formula; p is preferred to be large. The default is 15. |
| q | A positive integer, a parameter in the \(\phi_{p}\) formula; q can be either 1 or 2. If q is 1, the Manhattan (rectangular) distance is calculated; if q is 2, the Euclidean distance is calculated. |
| maxtime | A positive number indicating the maximum CPU time allowed, measured in minutes. For example, maxtime = 3.5 means the CPU time will be no greater than three and a half minutes. The default is 5. |
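The SA acceptance rule described above can be sketched in a few lines; sa_accept is a hypothetical helper (not a function exported by the LHD package), written for the minimization setting used throughout this section.

```r
# SA acceptance rule: a better (or equal) exchange is always kept; a
# worse one is kept with probability exp(-(new - old)/Temp), so large
# deteriorations rarely survive once the temperature has annealed.
sa_accept <- function(old, new, Temp) {
  new <= old || runif(1) < exp(-(new - old) / Temp)
}

sa_accept(1.0, 0.8, Temp = 10)    # an improvement is always accepted
sa_accept(1.0, 1.2, Temp = 1e-9)  # a worse move at a cold temperature
```

At a high temperature the exponential is close to 1 and most worse moves pass; as Temp is annealed the exponential collapses toward 0 and the search becomes greedy.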
(Leary et al. 2003) modified the SA algorithm in (Morris and Mitchell 1995) to search for optimal orthogonal array-based LHDs (OALHDs). (Tang 1993) showed that OALHDs tend to have better space-filling properties than random LHDs. The SA in (Leary et al. 2003) starts with a random OALHD rather than a random LHD; the remaining steps are the same as the SA in (Morris and Mitchell 1995). Note that the existence of OALHDs is determined by the existence of the corresponding initial OAs. In the LHD package, the function OASA() implements this modified SA algorithm:

OASA(OA, N = 10, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5, OC = "phi_p",
     p = 15, q = 1, maxtime = 5)

where all the input arguments are the same as in SA, except that OA must be an orthogonal array.
(Joseph and Hung 2008) proposed another modified SA to identify orthogonal-maximin LHDs, which considers both the orthogonality and the maximin distance criteria. The algorithm starts by generating a random LHD and then chooses the column that has the largest average pairwise correlation with all other columns. Next, the algorithm selects the row that has the largest total row-wise distance to all other rows. Then, the element at the selected row and column is exchanged with a random element from the same column. The remaining steps are the same as the SA in (Morris and Mitchell 1995). In the LHD package, the function SA2008() implements this algorithm:

SA2008(n, k, N = 10, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5, OC = "phi_p",
       p = 15, q = 1, maxtime = 5)

where all the input arguments are the same as in SA.
Particle swarm optimization (PSO, (Kennedy and Eberhart 1995)) is a metaheuristic optimization algorithm inspired by the social behaviors of animals. (Chen et al. 2013) adapted the classic PSO algorithm and proposed LaPSO to identify maximin distance LHDs. Since this is a discrete optimization task, LaPSO redefines the steps in which each particle updates its velocity and position in the general PSO framework. In the LHD package, the function LaPSO() implements this algorithm:

LaPSO(n, k, m = 10, N = 10, SameNumP = 0, SameNumG = n/4, p0 = 1/(k - 1),
      OC = "phi_p", p = 15, q = 1, maxtime = 5)

Table 2 provides an overview of all the input arguments in LaPSO(), where n, k, N, OC, p, q, and maxtime are exactly the same as the input arguments in the function SA(). m is the number of particles, which represent candidate solutions in the PSO framework. SameNumP and SameNumG are tuning parameters that denote how many exchanges are performed to reduce the Hamming distance toward the personal best and the global best, respectively. p0 is a tuning parameter that denotes the probability of randomly swapping two elements in the current column of the current particle, which prevents the algorithm from getting stuck at a local optimum. (Chen et al. 2013) provided the following suggestions: SameNumP should be approximately \(n/2\) when SameNumG is \(0\), SameNumG should be approximately \(n/4\) when SameNumP is \(0\), and p0 should be between \(1/(k-1)\) and \(2/(k-1)\). The stopping criterion of the function is the same as that of the function SA.
Table 2: Input arguments of the function LaPSO().

| Argument | Description |
|---|---|
| n | A positive integer that defines the number of rows (run size) of the output LHD. |
| k | A positive integer that defines the number of columns (factor size) of the output LHD. |
| m | A positive integer that defines the number of particles, where each particle is a candidate solution. A large value of m will result in a high CPU time, and it is recommended to be no greater than 100. The default is 10. |
| N | A positive integer that defines the maximum number of iterations in the algorithm. A large value of N will result in a high CPU time, and it is recommended to be no greater than 500. The default is 10. |
| SameNumP | A non-negative integer that defines how many elements in the current column of the current particle should be the same as in the corresponding personal best. SameNumP can be 0, 1, 2, …, n, where 0 means to skip the element exchange, which is the default. |
| SameNumG | A non-negative integer that defines how many elements in the current column of the current particle should be the same as in the corresponding global best. SameNumG can be 0, 1, 2, …, n, where 0 means to skip the element exchange. The default is n/4. Note that SameNumP and SameNumG cannot both be 0 at the same time. |
| p0 | The probability of exchanging two randomly selected elements in the current column of the current particle LHD. The default is 1/(k - 1). |
| OC | An optimality criterion. The default is "phi_p"; it can be one of "phi_p", "AvgAbsCor", "MaxAbsCor", or "MaxProCriterion". |
| p | A positive integer, a parameter in the \(\phi_{p}\) formula; p is preferred to be large. The default is 15. |
| q | A positive integer, a parameter in the \(\phi_{p}\) formula; q can be either 1 or 2. If q is 1, the Manhattan (rectangular) distance is calculated; if q is 2, the Euclidean distance is calculated. |
| maxtime | A positive number indicating the maximum CPU time allowed, measured in minutes. For example, maxtime = 3.5 means the CPU time will be no greater than three and a half minutes. The default is 5. |
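The SameNumP/SameNumG moves can be illustrated with a from-scratch sketch; move_toward is a hypothetical helper (not part of the LHD package) that swaps elements of one particle column so that up to SameNum additional positions agree with a best-so-far column.

```r
# Reduce the Hamming distance between a particle column `x` and a best
# column `best` (both permutations of 1..n) with at most SameNum swaps;
# each swap makes at least one more position agree with `best`.
move_toward <- function(x, best, SameNum) {
  for (s in seq_len(SameNum)) {
    mismatch <- which(x != best)
    if (length(mismatch) == 0) break   # already identical to `best`
    i <- mismatch[1]
    j <- which(x == best[i])  # position currently holding best[i]
    x[c(i, j)] <- x[c(j, i)]  # swap: position i now matches `best`
  }
  x
}

move_toward(c(2, 1, 3, 4, 5), best = 1:5, SameNum = 2)
```

Because the move is a swap rather than an overwrite, the column stays a valid permutation, which is exactly why discrete PSO variants like LaPSO redefine the velocity/position update this way.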
The genetic algorithm (GA) is a nature-inspired metaheuristic optimization algorithm that mimics Charles Darwin's idea of natural selection (Goldberg 1989; Holland et al. 1992). Liefvendahl and Stocki (2006) proposed a version of the GA for identifying maximin distance LHDs, implementing the column exchange technique to solve the discrete optimization task. In the LHD package, the function GA() implements this algorithm:

GA(n, k, m = 10, N = 10, pmut = 1/(k - 1), OC = "phi_p", p = 15, q = 1,
   maxtime = 5)

Table 3 provides an overview of all the input arguments of GA(), where n, k, N, OC, p, q, and maxtime are exactly the same as the input arguments of the function SA(). m is the population size, that is, the number of candidate solutions kept in each iteration, and it must be an even number. pmut is the tuning parameter that controls how likely a mutation is to happen. When a mutation occurs, two randomly selected elements are exchanged in the current column of the current LHD. pmut serves the same purpose as p0 in LaPSO(), preventing the algorithm from getting stuck at a local optimum, and it is recommended to be \(1/(k-1)\). The stopping criterion of the function is the same as that of the function SA().
| Argument | Description |
|---|---|
| n | A positive integer that defines the number of rows (or run size) of the output LHD. |
| k | A positive integer that defines the number of columns (or factor size) of the output LHD. |
| m | A positive even integer, which stands for the population size. The default is set to be 10. A large value of m will result in a high CPU time, and it is recommended to be no greater than 100. |
| N | A positive integer that defines the maximum number of iterations in the algorithm. A large value of N will result in a high CPU time, and it is recommended to be no greater than 500. The default is set to be 10. |
| pmut | The probability of mutation. When a mutation happens, two randomly selected elements in the current column of the current LHD are exchanged. The default is set to be 1/(k - 1). |
| OC | The optimality criterion. The default setting is "phi_p", and it can be one of the following: "phi_p", "AvgAbsCor", "MaxAbsCor", "MaxProCriterion". |
| p | A positive integer, which is a parameter in the \(\phi_{p}\) formula; p is preferred to be large. The default is set to be 15. |
| q | A positive integer, which is a parameter in the \(\phi_{p}\) formula; q can be either 1 or 2. If q is 1, the Manhattan (rectangular) distance will be calculated; if q is 2, the Euclidean distance. |
| maxtime | A positive number indicating the expected maximum CPU time in minutes. For example, maxtime = 3.5 means the CPU time will be no greater than three and a half minutes. The default is set to be 5. |
This subsection demonstrates, through examples, how to use the search algorithms in the LHD package. Table 4 summarizes the R functions for the algorithms discussed in the previous subsections, which can be used to identify different types of optimal LHDs. Users who seek fast solutions can use the default settings of the input arguments after specifying the design sizes, as in the following examples.
| Function | Description |
|---|---|
| SA | Returns an LHD via the simulated annealing algorithm (Morris and Mitchell 1995). |
| OASA | Returns an LHD via the orthogonal-array-based simulated annealing algorithm (Leary et al. 2003), where an OA of the required design size must exist. |
| SA2008 | Returns an LHD via the simulated annealing algorithm with the multi-objective optimization approach (Joseph and Hung 2008). |
| LaPSO | Returns an LHD via the particle swarm optimization (Chen et al. 2013). |
| GA | Returns an LHD via the genetic algorithm (Liefvendahl and Stocki 2006). |
#Generate a 5 by 3 maximin distance LHD by the SA function.
> try.SA = SA(n = 5, k = 3); try.SA
     [,1] [,2] [,3]
[1,]    2    2    1
[2,]    5    3    2
[3,]    4    5    5
[4,]    3    1    4
[5,]    1    4    3
> phi_p(try.SA) #\phi_p is smaller than that of a random LHD (0.3336608).
[1] 0.2169567

#Similarly, generate 5 by 3 maximin distance LHDs by the SA2008, LaPSO and GA functions.
> try.SA2008 = SA2008(n = 5, k = 3)
> try.LaPSO = LaPSO(n = 5, k = 3)
> try.GA = GA(n = 5, k = 3)

#Generate an OA(9,2,3,2), an orthogonal array with 9 runs, 2 factors, 3 levels, and strength 2.
> OA = matrix(c(rep(1:3, each = 3), rep(1:3, times = 3)),
+ ncol = 2, nrow = 9, byrow = FALSE)
#Generate a maximin distance LHD with the same design size as the input OA
#by the orthogonal-array-based simulated annealing algorithm.
> try.OASA = OASA(OA)
> OA; try.OASA
     [,1] [,2]      [,1] [,2]
[1,]    1    1 [1,]    1    2
[2,]    1    2 [2,]    2    6
[3,]    1    3 [3,]    3    9
[4,]    2    1 [4,]    4    3
[5,]    2    2 [5,]    6    5
[6,]    2    3 [6,]    5    7
[7,]    3    1 [7,]    7    1
[8,]    3    2 [8,]    9    4
[9,]    3    3      [9,]    8    8

Note that the default optimality criterion embedded in all search algorithms is "phi_p" (that is, the maximin distance criterion), leading to maximin \(L_2\)-distance LHDs. For other optimality criteria, users should change the setting of the input argument OC (with options "phi_p", "AvgAbsCor", "MaxAbsCor" and "MaxProCriterion"). The following examples illustrate some details of different argument settings.
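The two correlation-based criteria follow directly from their definitions: MaxAbsCor is the maximum, and AvgAbsCor the average, of the absolute pairwise column correlations. The following is an illustrative stand-alone version, not the package's MaxAbsCor()/AvgAbsCor() functions:

```r
# Absolute pairwise column correlations of a design matrix X;
# an exactly orthogonal LHD has max_abs_cor(X) equal to 0.
abs_cors <- function(X) {
  r <- cor(X)
  abs(r[upper.tri(r)])
}
max_abs_cor <- function(X) max(abs_cors(X))
avg_abs_cor <- function(X) mean(abs_cors(X))

X <- cbind(1:5, c(2, 1, 5, 3, 4))
max_abs_cor(X)  # 0.6 for this 5 x 2 design
```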
#Below try.SA is a 5 by 3 maximin distance LHD generated by the SA with 30 iterations (N = 30).
#The temperature starts at 10 (T0 = 10) and decreases 10% (rate = 0.1) each time.
#The minimum temperature allowed is 1 (Tmin = 1) and the maximum perturbations that
#the algorithm will try without improvements is 5 (Imax = 5). The optimality criterion
#used is the maximin distance criterion (OC = "phi_p") with p = 15 and q = 1, and the
#maximum CPU time is 5 minutes (maxtime = 5).
> try.SA = SA(n = 5, k = 3, N = 30, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5, OC = "phi_p",
+ p = 15, q = 1, maxtime = 5); try.SA
     [,1] [,2] [,3]
[1,]    1    3    4
[2,]    2    5    2
[3,]    5    4    3
[4,]    4    1    5
[5,]    3    2    1
> phi_p(try.SA)
[1] 0.2169567

#Below try.SA2008 is a 5 by 3 maximin distance LHD generated by the SA with
#the multi-objective optimization approach. The input arguments are interpreted
#the same as for the design try.SA above.
> try.SA2008 = SA2008(n = 5, k = 3, N = 30, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5,
+ OC = "phi_p", p = 15, q = 1, maxtime = 5)

#Below try.OASA is a 9 by 2 maximin distance LHD generated by the
#orthogonal-array-based simulated annealing algorithm with the input
#OA (defined previously); the remaining input arguments are interpreted the
#same as for the design try.SA above.
> try.OASA = OASA(OA, N = 30, T0 = 10, rate = 0.1, Tmin = 1, Imax = 5,
+ OC = "phi_p", p = 15, q = 1, maxtime = 5)

#Below try.LaPSO is a 5 by 3 maximum projection LHD generated by the particle swarm
#optimization algorithm with 20 particles (m = 20) and 30 iterations (N = 30).
#Zero (or two) elements in any column of the current particle should be the same as
#the elements of the corresponding column from the personal best (or global best),
#because of SameNumP = 0 (or SameNumG = 2).
#The probability of exchanging two randomly selected elements is 0.5 (p0 = 0.5).
#The optimality criterion is the maximum projection criterion (OC = "MaxProCriterion").
#The maximum CPU time is 5 minutes (maxtime = 5).
> try.LaPSO = LaPSO(n = 5, k = 3, m = 20, N = 30, SameNumP = 0, SameNumG = 2,
+ p0 = 0.5, OC = "MaxProCriterion", maxtime = 5); try.LaPSO
     [,1] [,2] [,3]
[1,]    4    5    4
[2,]    3    1    3
[3,]    5    2    1
[4,]    2    3    5
[5,]    1    4    2
#Recall the value is 0.5375482 from the random LHD in Section 2.
> MaxProCriterion(try.LaPSO)
[1] 0.3561056

#Below try.GA is a 5 by 3 OLHD generated by the genetic algorithm with
#population size 20 (m = 20), number of iterations 30 (N = 30), mutation
#probability 0.5 (pmut = 0.5), maximum absolute correlation criterion
#(OC = "MaxAbsCor"), and maximum CPU time 5 minutes (maxtime = 5).
> try.GA = GA(n = 5, k = 3, m = 20, N = 30, pmut = 0.5, OC = "MaxAbsCor",
+ maxtime = 5); try.GA
     [,1] [,2] [,3]
[1,]    2    1    2
[2,]    4    4    5
[3,]    3    5    1
[4,]    5    2    3
[5,]    1    3    4
#Recall the value is 0.9 from the random LHD in Section 2.
> MaxAbsCor(try.GA)
[1] 0.1 #The maximum absolute correlation between columns is 0.1

Next, we discuss some details of the implementation. In the SA-based algorithms (SA, SA2008, and OASA), the number of iterations N is recommended to be no greater than 500 for computing-time considerations. The input rate determines the percentage decrease of the current temperature (for example, \(0.1\) means a decrease of \(10\%\) each time). A high rate makes the temperature drop rapidly, which leads to an early stop of the algorithm. It is recommended to set rate between \(0.1\) and \(0.15\). Imax indicates the maximum number of perturbations that the algorithm will attempt without improvement before the temperature is reduced, and it is recommended to be no greater than 5 for computing-time considerations. OC chooses the optimality criterion, and the "phi_p" criterion in (1) is set as the default. OC has other options, including "MaxProCriterion", "AvgAbsCor" and "MaxAbsCor". Our algorithms support both the \(L_1\) and \(L_2\) distances.
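To see how rate and Tmin interact, the geometric cooling schedule described above can be sketched in a few lines. This is an illustration of the schedule only, not the package's internal loop:

```r
# Geometric cooling sketch: the temperature drops by `rate` each step,
# so the number of cooling steps from T0 down to Tmin grows only
# logarithmically in T0/Tmin.
cooling_steps <- function(T0 = 10, Tmin = 1, rate = 0.1) {
  steps <- 0
  Temp <- T0
  while (Temp > Tmin) {
    Temp <- Temp * (1 - rate)  # e.g., a 10% decrease when rate = 0.1
    steps <- steps + 1
  }
  steps
}

cooling_steps(T0 = 10, Tmin = 1, rate = 0.1)  # 22 cooling steps
```

Doubling rate roughly halves the number of cooling steps, which is why a high rate makes the algorithm stop early.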
For every algorithm, we incorporate a progress bar to visualize the computing time used. After an algorithm completes, information on "average CPU time per iteration" and "number of iterations completed" is presented. Users can limit the CPU time used by each algorithm with the argument maxtime, according to their practical needs.
We also provide some illustrative code to demonstrate that the designs found by the LHD package are better than existing ones; the code below can easily be modified for other design sizes or other LHD types. With 500 iterations set for both algorithms, the code shows that out of 100 trials the GA in the LHD package constructed better MaxPro LHDs than the algorithm in the MaxPro package 99 times. We did not compare the CPU time between the two packages, since one is written in the R environment and the other in C++, but with the same number of iterations, the GA in the LHD package almost always constructs better MaxPro LHDs.
#Make sure both packages are properly installed before loading them.
> library(LHD)
> library(MaxPro)

> count = 0 #Define a variable for counting purposes

> k = 5 #Factor size 5
> n = 10*k #Run size = 10*factor size

#Setting 500 iterations for both algorithms, the loop below counts
#how many times the GA from the LHD package outperforms the algorithm
#from the MaxPro package out of 100 trials.
> for (i in 1:100) {

  LHD = LHD::GA(n = n, k = k, m = 100, N = 500)
  MaxPro = MaxPro::MaxProLHD(n = n, p = k, total_iter = 500)$Design

  #MaxPro * n + 0.5 applies the transformation mentioned in Section 2
  #to revert the scaling.
  Result.LHD = LHD::MaxProCriterion(LHD)
  Result.MaxPro = LHD::MaxProCriterion(MaxPro * n + 0.5)

  if (Result.LHD < Result.MaxPro) {count = count + 1}

}

> count
[1] 99

There are algebraic constructions available for certain design sizes, with theoretical results developed to guarantee the efficiency of such designs. Algebraic constructions require almost no searching, which makes them especially attractive for large designs. In this section, we present the algebraic constructions available in the LHD package for maximin distance LHDs and orthogonal LHDs.
Wang et al. (2018) proposed generating maximin distance LHDs via good lattice point (GLP) sets (Zhou and Xu 2015) and the Williams transformation (Williams 1949). In practice, their method can lead to space-filling designs with relatively flexible sizes, where the run size \(n\) is flexible but the factor size \(k\) must be no greater than the number of positive integers that are co-prime to \(n\). They proved that the resulting designs of sizes \(n \times (n-1)\) (with \(n\) being any odd prime) and \(n \times n\) (with \(2n+1\) or \(n+1\) being an odd prime) are optimal under the maximin \(L_1\) distance criterion. This construction method is very attractive for constructing large maximin distance LHDs. In the LHD package, the function FastMmLHD() implements this method:
FastMmLHD(n, k, method = "manhattan", t1 = 10)

where n and k are the desired run size and factor size. method is the distance measure, which can be one of the following: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"; any unambiguous substring can be given. t1 is a tuning parameter, which determines how many repeats will be performed to search for the optimal design. The default is set to be 10.
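The two building blocks of this construction can be sketched directly. The sketch below is an illustrative reimplementation under the standard formulas — a GLP set with entries \((i h_j) \bmod n\) for generators \(h_j\) co-prime to \(n\), followed by the Williams transformation \(W(x) = 2x\) if \(2x < n\) and \(W(x) = 2(n-x)-1\) otherwise — and is not the internals of FastMmLHD():

```r
# Sketch of a GLP set followed by the Williams transformation (n odd).
# Each column of the result is a permutation of 0, ..., n-1, so the
# output is an n x k LHD, with k = number of integers co-prime to n.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

glp_williams <- function(n) {
  h <- Filter(function(x) gcd(x, n) == 1, seq_len(n - 1))  # generators co-prime to n
  D <- outer(seq_len(n), h, function(i, hj) (i * hj) %% n) # GLP set
  ifelse(2 * D < n, 2 * D, 2 * (n - D) - 1)                # Williams transformation
}

glp_williams(5)  # a 5 x 4 LHD with levels 0, ..., 4
```

For \(n = 5\) (an odd prime) all of 1, 2, 3, 4 are co-prime to 5, giving the \(n \times (n-1)\) case mentioned above.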
Tang (1993) proposed constructing orthogonal-array-based LHDs (OALHDs) from existing orthogonal arrays (OAs), and showed that OALHDs can have better space-filling properties than general LHDs. In the LHD package, the function OA2LHD() implements this method:

OA2LHD(OA)

where OA is an orthogonal array matrix. Users only need to input an OA, and the function will return an OALHD with the same design size as the input OA.
Orthogonal LHDs (OLHDs) have zero pairwise correlation between any two columns and are widely used by practitioners. There is a rich literature on the construction of OLHDs with various design sizes, but the constructions are often too hard for practitioners to replicate in practice. The LHD package implements several currently popular methods (Tang 1993; Ye 1998; Butler 2001; Cioppa and Lucas 2007; Lin et al. 2009; Sun et al. 2010) for practitioners, and the functions are easy to use.
Ye (1998) proposed a construction for OLHDs with run sizes \(n=2^m+1\) and factor sizes \(k=2m-2\), where \(m\) is any integer greater than or equal to 2. In the LHD package, the function OLHD.Y1998() implements this algebraic construction:

OLHD.Y1998(m)

where the input argument m is the \(m\) in the construction of Ye (1998). Cioppa and Lucas (2007) extended Ye (1998)'s method to construct OLHDs with run size \(n=2^m+1\) and factor size \(k=m+ {m-1 \choose 2}\), where \(m\) is any integer greater than or equal to 2. In the LHD package, the function OLHD.C2007() implements this algebraic construction, with the input argument m remaining the same:

OLHD.C2007(m)

Sun et al. (2010) extended their earlier work (Sun et al. 2009) to construct OLHDs with \(n=r2^{c+1}+1\) or \(n=r2^{c+1}\) and \(k=2^c\), where \(r\) and \(c\) are positive integers. In the LHD package, the function OLHD.S2010() implements this algebraic construction:

OLHD.S2010(C, r, type = "odd")

where the input arguments C and r are \(c\) and \(r\) in the construction. When the input argument type is "odd", the output design size is \(n=r2^{c+1}+1\) by \(k=2^c\); when type is "even", the output design size is \(n=r2^{c+1}\) by \(k=2^c\).
Lin et al. (2009) constructed OLHDs or NOLHDs with \(n^2\) runs and \(2fp\) factors by coupling an OLHD(\(n\), \(p\)) or NOLHD(\(n\), \(p\)) with an OA(\(n^2,2f,n,2\)). For example, an OLHD(11, 7) coupled with an OA(121, 12, 11, 2) would yield an OLHD(121, 84). The design size of the output OLHD or NOLHD highly depends on the existence of the OAs. In the LHD package, the function OLHD.L2009() implements this algebraic construction:

OLHD.L2009(OLHD, OA)

where the input arguments OLHD and OA are the OLHD and OA to be coupled, and their design sizes need to be aligned with the designated pattern of the construction.
Butler (2001) proposed a method to construct OLHDs with the run size \(n\) being an odd prime and the factor size \(k\) being less than or equal to \(n-1\), via the Williams transformation (Williams 1949). In the LHD package, the function OLHD.B2001() implements this algebraic construction, with input arguments n and k exactly matching those in the construction:

OLHD.B2001(n, k)

In Table 5, we summarize the algebraic constructions implemented by the LHD package, where FastMmLHD and OA2LHD are for maximin distance LHDs and OLHD.Y1998, OLHD.C2007, OLHD.S2010, OLHD.L2009 and OLHD.B2001 are for orthogonal LHDs. The following examples illustrate how to use them.
| Function | Description |
|---|---|
| FastMmLHD | Returns a maximin distance LHD matrix (Wang et al. 2018). |
| OA2LHD | Expands an orthogonal array to an LHD (Tang 1993). |
| OLHD.Y1998 | Returns a \(2^m+1\) by \(2m-2\) orthogonal LHD matrix (Ye 1998), where \(m\) is an integer and \(m \geq 2\). |
| OLHD.C2007 | Returns a \(2^m+1\) by \(m+{m-1 \choose 2}\) orthogonal LHD matrix (Cioppa and Lucas 2007), where \(m\) is an integer and \(m \geq 2\). |
| OLHD.S2010 | Returns an \(r2^{c+1}+1\) or \(r2^{c+1}\) by \(2^c\) orthogonal LHD matrix (Sun et al. 2010), where \(r\) and \(c\) are positive integers. |
| OLHD.L2009 | Couples an \(n\) by \(p\) orthogonal LHD with an \(n^2\) by \(2f\) strength-\(2\), level-\(n\) orthogonal array to generate an \(n^2\) by \(2fp\) orthogonal LHD (Lin et al. 2009). |
| OLHD.B2001 | Returns an orthogonal LHD (Butler 2001) with the run size \(n\) being an odd prime and factor size \(k\) less than or equal to \(n-1\). |
#FastMmLHD(8, 8) generates an optimal 8 by 8 maximin L_1 distance LHD.
> try.FastMm = FastMmLHD(n = 8, k = 8); try.FastMm
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    1    2    3    4    5    6    7
[2,]    1    3    5    7    6    4    2    0
[3,]    2    5    7    4    1    0    3    6
[4,]    3    7    4    0    2    6    5    1
[5,]    4    6    1    2    7    3    0    5
[6,]    5    4    0    6    3    1    7    2
[7,]    6    2    3    5    0    7    1    4
[8,]    7    0    6    1    5    2    4    3

#OA2LHD(OA) expands an input OA to an LHD of the same run size.
> try.OA2LHD = OA2LHD(OA)
> OA; try.OA2LHD
     [,1] [,2]      [,1] [,2]
[1,]    1    1 [1,]    1    2
[2,]    1    2 [2,]    2    4
[3,]    1    3 [3,]    3    9
[4,]    2    1 [4,]    4    3
[5,]    2    2 [5,]    5    5
[6,]    2    3 [6,]    6    7
[7,]    3    1 [7,]    9    1
[8,]    3    2 [8,]    8    6
[9,]    3    3      [9,]    7    8

#OLHD.Y1998(m = 3) generates a 9 by 4 orthogonal LHD.
#Note that 2^m+1 = 9 and 2*m-2 = 4.
> try.Y1998 = OLHD.Y1998(m = 3); try.Y1998
     [,1] [,2] [,3] [,4]
[1,]    4   -3   -2    1
[2,]    3    4   -1   -2
[3,]    1   -2    3   -4
[4,]    2    1    4    3
[5,]    0    0    0    0
[6,]   -4    3    2   -1
[7,]   -3   -4    1    2
[8,]   -1    2   -3    4
[9,]   -2   -1   -4   -3
> MaxAbsCor(try.Y1998) #column-wise correlations are 0
[1] 0

#OLHD.C2007(m = 4) generates a 17 by 7 orthogonal LHD.
#Note that 2^m+1 = 17 and m + (m-1 choose 2) = 4 + 3 = 7.
> try.C2007 = OLHD.C2007(m = 4); dim(try.C2007)
[1] 17 7
> MaxAbsCor(try.C2007) #column-wise correlations are 0
[1] 0

#OLHD.S2010(C = 3, r = 3, type = "odd") generates a 49 by 8 orthogonal LHD.
#Note that 3*2^4+1 = 49 and 2^3 = 8.
> dim(OLHD.S2010(C = 3, r = 3, type = "odd"))
[1] 49 8
> MaxAbsCor(OLHD.S2010(C = 3, r = 3, type = "odd")) #column-wise correlations are 0
[1] 0

#OLHD.S2010(C = 3, r = 3, type = "even") generates a 48 by 8 orthogonal LHD.
#Note that 3*2^4 = 48 and 2^3 = 8.
> dim(OLHD.S2010(C = 3, r = 3, type = "even"))
[1] 48 8
> MaxAbsCor(OLHD.S2010(C = 3, r = 3, type = "even")) #column-wise correlations are 0
[1] 0

#Create a 5 by 2 OLHD.
> OLHD = OLHD.C2007(m = 2)

#Create an OA(25, 6, 5, 2).
> OA = matrix(c(2,2,2,2,2,1,2,1,5,4,3,5,3,2,1,5,4,5,1,5,4,3,2,5,4,1,3,5,2,3,
+ 1,2,3,4,5,2,1,3,5,2,4,3,1,1,1,1,1,1,4,3,2,1,5,5,5,5,5,5,5,1,4,4,4,4,4,1,
+ 3,1,4,2,5,4,3,3,3,3,3,1,3,5,2,4,1,3,3,4,5,1,2,2,5,4,3,2,1,5,2,3,4,5,1,2,
+ 2,5,3,1,4,4,1,4,2,5,3,4,4,2,5,3,1,4,2,4,1,3,5,3,5,3,1,4,2,4,5,2,4,1,3,3,
+ 5,1,2,3,4,2,4,5,1,2,3,2), ncol = 6, nrow = 25, byrow = TRUE)

#OLHD.L2009(OLHD, OA) generates a 25 by 12 orthogonal LHD.
#Note that n = 5 so n^2 = 25. p = 2 and f = 3 so 2fp = 12.
> dim(OLHD.L2009(OLHD, OA))
[1] 25 12
> MaxAbsCor(OLHD.L2009(OLHD, OA)) #column-wise correlations are 0
[1] 0

#OLHD.B2001(n = 11, k = 5) generates an 11 by 5 orthogonal LHD.
> dim(OLHD.B2001(n = 11, k = 5))
[1] 11 5

Several R packages have been developed to facilitate Latin hypercube sampling and design construction for computer experiments. Among these, the lhs package (Carnell 2024) is widely recognized for its utility. It provides functions for generating both random and optimized Latin hypercube samples (but not designs), and its methods are particularly useful for simulation studies where space-filling properties are desired but design optimality is not the primary focus. The SLHD package (Ba 2015) was originally developed for generating sliced LHDs (Ba et al. 2015), but practitioners can set the number of slices to one to use the package for generating maximin LHDs. The MaxPro package (Ba and Joseph 2018) focuses on constructing designs that maximize projection properties. One of its functions, MaxProLHD, generates MaxPro LHDs using a simulated annealing algorithm (Joseph et al. 2015).
While we acknowledge the contributions of other relevant R packages, we emphasize the distinguishing features of our developed package. The LHD package embeds multiple optimality criteria, enabling the construction of various types of optimal LHDs. In contrast, lhs and SLHD primarily focus on space-filling Latin hypercube samples and designs, while MaxPro primarily focuses on maximum projection LHDs. The LHD package implements various search algorithms and algebraic constructions, whereas the other three packages do not implement algebraic constructions, and both SLHD and MaxPro only implement one algorithm to construct LHDs. The primary application of LHD is in the design of computer experiments, whereas lhs is mainly used for sampling and simulation studies. Therefore, LHD emphasizes design optimality, while lhs emphasizes the space-filling properties of samples.
The LHD package implements popular search algorithms, including the SA (Morris and Mitchell 1995), OASA (Leary et al. 2003), SA2008 (Joseph and Hung 2008), LaPSO (Chen et al. 2013) and GA (Liefvendahl and Stocki 2006), along with widely used algebraic constructions (Tang 1993; Ye 1998; Butler 2001; Cioppa and Lucas 2007; Lin et al. 2009; Sun et al. 2010; Wang et al. 2018), for constructing three types of commonly used optimal LHDs: maximin distance LHDs, maximum projection LHDs and (nearly) orthogonal LHDs. We aim to provide guidance and an easy-to-use tool for practitioners to find appropriate experimental designs. Algebraic constructions are preferred when available, especially for large designs; search algorithms are used to generate optimal LHDs with flexible sizes.
Among the few R packages dedicated to LHDs, LHD is comprehensive and self-contained: it offers not only search algorithms and algebraic constructions, but also other useful functions for LHD research and development, such as calculating different optimality criteria, generating random LHDs, exchanging two random elements in a matrix, and calculating intersite distances between matrix rows. The help manual in the package documentation contains further details and illustrative examples for users who want to explore more of the functions in the package.
This research was partially supported by the National Science Foundation (NSF) grant DMS-2311186 and the National Key R&D Program of China 2024YFA1016200. The authors appreciate the reviewers' constructive comments and suggestions.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2025-033.zip.
For attribution, please cite this work as

Wang, et al., "LHD: An All-encompassing R Package for Constructing Optimal Latin Hypercube Designs", The R Journal, 2026

BibTeX citation

@article{RJ-2025-033,
  author = {Wang, Hongzhi and Xiao, Qian and Mandal, Abhyuday},
  title = {LHD: An All-encompassing R Package for Constructing Optimal Latin Hypercube Designs},
  journal = {The R Journal},
  year = {2026},
  note = {https://doi.org/10.32614/RJ-2025-033},
  doi = {10.32614/RJ-2025-033},
  volume = {17},
  issue = {4},
  issn = {2073-4859},
  pages = {20-36}
}
The longevity R package provides a maximum likelihood estimation routine for modelling survival data that are subject to non-informative censoring or truncation. It includes a selection of 12 parametric models of varying complexity, with a focus on tools for extreme value analysis and, more specifically, univariate peaks over threshold modelling. The package provides utilities for univariate threshold selection, parametric and nonparametric maximum likelihood estimation, goodness of fit diagnostics and model comparison tools. These different methods are illustrated using individual Dutch records and aggregated Japanese human lifetime data.
Many data sets collected by demographers for the analysis of human longevity have unusual features for which only limited software implementations exist. The longevity package was initially built for dealing with human longevity records and data from the International Database on Longevity (IDL), which provides ages at death of supercentenarians, i.e., people who died above age 110. Data for the statistical analysis of (human) longevity can take the form of aggregated counts per age at death, or, most commonly, the life trajectories of individuals with both birth and death dates. Such lifetimes are often interval truncated (only ages at death of individuals dying between two calendar dates are recorded) or left truncated and right censored (when data of individuals still alive at the end of the collection period are also included). Another frequent format is death counts, aggregated per age band. Censoring and truncation are typically of an administrative nature and thus non-informative about death.
Supercentenarians are extremely rare and records are sparse. The most popular parametric models used by practitioners are justified by asymptotic arguments and have their roots in extreme value theory. Univariate extreme value distributions are well implemented in software, and Belzile et al. (2023b) provide a recent review of existing implementations. While there are many standard R packages for the analysis of univariate extremes using likelihood-based inference, such as evd (Stephenson 2002), mev and extRemes (Gilleland and Katz 2016), only the evgam package includes functionalities to fit threshold exceedance models with censoring, as showcased in Youngman (2022) with rounded rainfall measurements. Support for survival data in extreme value models is in general largely lacking, which motivated the development of longevity.
The longevity package also includes several parametric models commonly used in demography. Many existing packages that focus on tools for modelling mortality rates, typically through life tables, are listed in the Life insurance section of the CRAN Task View ActuarialScience. They do not, however, allow for truncation or more general survival mechanisms, as the aggregated data used are typically complete except for potential right censoring of the oldest age group. The survival package (Therneau and Grambsch 2000) includes utilities for accelerated failure time models with 10 parametric models. The MortCast package can be used to estimate age-specific mortality rates using the Kannisto and Lee–Carter approaches, among others (Ševčíková et al. 2016). The demography package provides forecasting methods for death rates based on constructed life tables using the Lee–Carter or ARIMA models. MortalityLaws includes utilities to download data from the Human Mortality Database (HMD), and fits a total of 27 parametric models for life table data (death counts and population at risk per age group) using Poisson, binomial or alternative loss functions. The vitality package fits the family of vitality models (Anderson 2000; Li and Anderson 2009) via maximum likelihood based on empirical survival data. The fitdistrplus package (Delignette-Muller and Dutang 2015) allows generic parametric distributions to be fitted to interval censored data via maximum likelihood, with various S3 methods for model assessment. The package also allows user-specified models, thereby permitting custom definitions of truncated distributions whose truncation bounds are passed as fixed vector parameters. Parameter uncertainty can be obtained via additional functions using the nonparametric bootstrap. Many parametric distributions also appear in the VGAM package (Yee and Wild 1996; Yee 2015), which allows for vector generalized linear modelling.
The longevity package is less general and offers support only for selected parametric distributions, but contrary to the aforementioned packages allows for truncation and general patterns. One strength of longevity is that it also includes model comparison tools that account for non-regular asymptotics and goodness of fit diagnostics.
Nonparametric methods are popular tools for the analysis of survival data with large samples, owing to their limited set of assumptions. They also serve for the validation of parametric models. Without explanatory variables, the nonparametric maximum likelihood estimator of the survival function has a closed form in particular instances, including the product limit estimator (Kaplan and Meier 1958) for the case of random or non-informative right censoring and an extension allowing for left truncation (Tsai et al. 1987). In general, the nonparametric maximum likelihood estimator of the survival function needs to be computed using an expectation-maximization (EM) algorithm (Turnbull 1976). Nonparametric estimators only assign probability mass to observed failure times and intervals and so cannot be used for extrapolation beyond the range of the data, limiting their utility in extreme value analysis.
The CRAN Task View on Survival Analysis lists various implementations of nonparametric maximum likelihood estimators of the survival or hazard functions: survival implements the Kaplan–Meier and Nelson–Aalen estimators. Many packages focus on the case of interval censoring (Groeneboom and Wellner 1992, Section 3.2), including prodlim; Anderson-Bergman (2017a) reviews the performance of the implementations in icenReg (Anderson-Bergman 2017b) and Icens. The latter uses the incidence matrix as input data. Routines for doubly censored data are provided by dblcens. The interval package (Fay and Shaw 2010) implements Turnbull's EM algorithm for interval censored data. The case of left truncated and right censored data is handled by survival, and tranSurv provides transformation models with the potential to account for truncation dependent on survival times. For interval truncated data, dedicated algorithms that use gradient-based steps (Efron and Petrosian 1999) or inverse probability weighting (Shen 2010) exist and can be more efficient than the EM algorithm of Turnbull (1976); many of these are implemented in the DTDA package. The longevity package includes a C\(^{++}\) implementation of the corrected Turnbull algorithm (Turnbull 1976), returning the nonparametric maximum likelihood estimator for arbitrary censoring and truncation patterns, as opposed to the aforementioned implementations, which focus on specific subcases. With a small number of observations, it is also relatively straightforward to maximize the log likelihood for the concave program subject to linear constraints using constrained optimization algorithms; longevity relies for this on Rsolnp, which uses augmented Lagrangian methods and sequential quadratic programming (Ye 1987).
To showcase the functionality of the package, particularly for the modelling of threshold exceedances, we consider Dutch and Japanese life lengths. The dutch database contains the ages at death (in days) of Dutch people who died above age 92 between 1986 and 2015; these data were obtained from Statistics Netherlands and analyzed in Einmahl et al. (2019) and Belzile et al. (2022). Records are interval truncated, as people are included in the database only if they died during the collection period. In addition, there are 226 interval censored and interval truncated records for which only the month and year of birth and death are known, as opposed to exact dates.
The second database we consider is drawn from Maier et al. (2021). The data frame japanese2 consists of counts of Japanese aged above 100 by age band, stratified by both birth cohort and sex. To illustrate the format of the data, counts for female Japanese are reproduced in Table 1. The data were constructed using the extinct cohort method and are interval censored between \(\texttt{age}\) and \(\texttt{age} + 1\) and right truncated at the age reached by the oldest individuals of their birth cohort in 2020. The count variable lists the number of instances in the contingency table, and serves as a weight for likelihood contributions.
| age | 1874-1878 | 1879-1883 | 1884-1888 | 1889-1893 | 1894-1898 | 1899-1900 |
|---|---|---|---|---|---|---|
| 100 | 1648 | 2513 | 4413 | 8079 | 16036 | 9858 |
| 101 | 975 | 1596 | 2921 | 5376 | 11047 | 7091 |
| 102 | 597 | 987 | 1864 | 3446 | 7487 | 4869 |
| 103 | 345 | 597 | 1153 | 2230 | 5014 | 3293 |
| 104 | 191 | 351 | 662 | 1403 | 3242 | 2133 |
| 105 | 121 | 197 | 381 | 855 | 2084 | 1357 |
| 106 | 64 | 122 | 210 | 495 | 1284 | 836 |
| 107 | 34 | 74 | 120 | 274 | 774 | 521 |
| 108 | 16 | 41 | 66 | 152 | 433 | 297 |
| 109 | 12 | 30 | 39 | 83 | 252 | 167 |
| 110 | 6 | 17 | 21 | 49 | 130 | 92 |
| 111 | 4 | 10 | 15 | 26 | 69 | 47 |
| 112 | 3 | 3 | 11 | 15 | 29 | 22 |
| 113 | 2 | 2 | 8 | 5 | 15 | 9 |
| 114 | 1 | 2 | 4 | 3 | 7 | 2 |
| 115 | 0 | 1 | 1 | 0 | 3 | 2 |
| 116 | 0 | 1 | 1 | 0 | 1 | 1 |
| 117 | 0 | 0 | 0 | 0 | 1 | 1 |
The longevity package uses the S3 object oriented system and provides a series of functions with common arguments. The syntax purposely mimics that of the survival package (Therneau and Grambsch 2000), except that models are not specified via a formula. Users must provide vectors for the time or age (or bounds in the case of intervals) via the arguments time and time2, as well as lower and upper truncation bounds (ltrunc and rtrunc) where applicable. The integer vector event indicates the type of event, where, following survival, 0 indicates right censoring, 1 an observed event, 2 left censoring and 3 interval censoring. Together, these five vectors characterize the data and the survival mechanisms at play. Depending on the sampling scheme, not all arguments are required or relevant, and they need not be of the same length; they are, however, common to most functions. Users can also pass arguments through a named list args: as illustrated below, this is convenient to avoid repeatedly specifying the common arguments in each function call. Default values are overridden by elements of args, with the exception of those that the user passes directly in the call. Relative to survival, functions have the additional arguments ltrunc and rtrunc for the left and right truncation limits; these may also be matrices in the case of double interval truncation (Belzile et al. 2022), since censoring and truncation can be present simultaneously.
We can manipulate the data set to build the time vectors and truncation bounds for the Dutch data. We rescale observations to years for interpretability and keep only records above age 98 for simplicity. We split the data to first handle the observed ages at death: these are treated as observed (uncensored) whenever time and time2 coincide. When exact dates are not available, we compute the range of possible ages at which individuals may have died, given their birth and death years and months. The truncation bounds for each individual are obtained by subtracting the birth dates from the endpoints of the sampling frame, with left and right truncation bounds
\[\begin{align*}
\texttt{ltrunc}=\min\{92 \text{ years}, 1986.01.01 - \texttt{bdate}\}, \qquad \texttt{rtrunc} = 2015.12.31- \texttt{bdate}.
\end{align*}\]
Table 2 shows a sample of five individuals, two of whom are interval-censored, and the corresponding vectors of arguments along with two covariates, gender (gender) and birth year (byear).
| time | time2 | ltrunc | rtrunc | event | gender | byear |
|---|---|---|---|---|---|---|
| 104.67 | 105.74 | 80.00 | 111.00 | 3 | female | 1905 |
| 103.50 | 104.58 | 78.00 | 109.00 | 3 | female | 1907 |
| 100.28 | 100.28 | 92.01 | 104.50 | 1 | female | 1911 |
| 100.46 | 100.46 | 92.01 | 102.00 | 1 | male | 1913 |
| 100.51 | 100.51 | 92.01 | 104.35 | 1 | female | 1911 |
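As a concrete sketch of the bound construction above (using hypothetical decimal-year birth dates rather than the exact date arithmetic applied to the Dutch records), the formulas translate to:

```r
# Illustrative only: birth dates expressed as decimal years (made-up values)
bdate <- c(1905.3, 1911.6)
frame_start <- 1986.000  # 1986-01-01 in decimal years
frame_end   <- 2015.997  # 2015-12-31 in decimal years
ltrunc <- pmin(92, frame_start - bdate)  # left truncation bound
rtrunc <- frame_end - bdate              # right truncation bound
```

In practice, one would compute the differences from exact dates (e.g., with the Date class) and convert them to years.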
We can proceed similarly for the Japanese data. Ages of centenarians are rounded down to the nearest year, so all observations are interval censored within one-year intervals. Assuming that the ages at death are independent and identically distributed with distribution function \(F(\cdot; \boldsymbol{\theta})\), the log likelihood for exceedances \(y_i = \texttt{age}_i - u\) above age \(u\) is
\[\begin{align*}
\ell(\boldsymbol{\theta}) = \sum_{i: \texttt{age}_i > u}n_i \left[\log \{F(y_i+1; \boldsymbol{\theta}) - F(y_i; \boldsymbol{\theta})\} - \log F(r_i - u; \boldsymbol{\theta})\right],
\end{align*}\]
where \(n_i\) is the count of the number of individuals in cell \(i\) and \(r_i > \texttt{age}_i+1\) is the right truncation limit for that cell, i.e., the maximum age that could have been achieved for that birth cohort by the end of the data collection period.
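For intuition, the log likelihood above can be coded directly; the sketch below (not part of the package API, with illustrative variable names) assumes an exponential model with distribution function \(F(y) = 1 - \exp(-y/\sigma)\):

```r
# Sketch: interval censored, right truncated log likelihood for an
# exponential model; y = age - u are excess ages, count = cell counts,
# rtrunc_excess = r - u (right truncation limit minus threshold)
ll_exp <- function(scale, y, count, rtrunc_excess) {
  Fexp <- function(x) pexp(x, rate = 1 / scale)
  sum(count * (log(Fexp(y + 1) - Fexp(y)) - log(Fexp(rtrunc_excess))))
}
```

Maximizing this function over the scale parameter yields the estimator fitted by the package for the exponential family.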
data(japanese2, package = "longevity")
# Keep only non-empty cells
japanese2 <- japanese2[japanese2$count > 0, ]
# Right truncation limit: 2020 minus the earliest year of the birth cohort
japanese2$rtrunc <- 2020 -
  as.integer(substr(japanese2$bcohort, 1, 4))
# Create a list with all arguments common to package functions
args_japan <- with(japanese2,
  list(
    time = age,        # lower censoring bound
    time2 = age + 1L,  # upper censoring bound
    event = 3,         # define interval censoring
    type = "interval2",
    rtrunc = rtrunc,   # right truncation limit
    weights = count))  # counts as weights
Various models are implemented in longevity: their hazard functions are reported in Table 3. Two of those models, labelled perks and beard, are logistic-type hazard functions proposed by Perks (1932), used by Beard (1963), and popularized in the work of Kannisto and Thatcher; we use the parametrization of Richards (2012), from which we also adopt the nomenclature. Users can compare the models with those available in MortalityLaws; see ?availableLaws for the list of hazard functions and parametrizations.
| model | hazard function | constraints |
|---|---|---|
| exp | \(\sigma^{-1}\) | \(\sigma > 0\) |
| gomp | \(\sigma^{-1}\exp(\beta t/\sigma)\) | \(\sigma > 0, \beta \ge 0\) |
| gp | \((\sigma + \xi t)_{+}^{-1}\) | \(\sigma > 0, \xi \in \mathbb{R}\) |
| weibull | \(\sigma^{-\alpha} \alpha t^{\alpha-1}\) | \(\sigma > 0, \alpha > 0\) |
| extgp | \(\beta\sigma^{-1}\exp(\beta t/\sigma)[\beta+\xi\{\exp(\beta t/\sigma) -1\}]^{-1}\) | \(\sigma > 0, \beta \ge 0, \xi \in \mathbb{R}\) |
| extweibull | \(\alpha\sigma^{-\alpha}t^{\alpha-1}\{1+\xi(t/\sigma)^{\alpha}\}_{+}\) | \(\sigma > 0, \alpha > 0, \xi \in \mathbb{R}\) |
| perks | \(\alpha\exp(\nu x)/\{1+\alpha\exp(\nu x)\}\) | \(\nu \ge 0, \alpha > 0\) |
| beard | \(\alpha\exp(\nu x)/\{1+\alpha\beta\exp(\nu x)\}\) | \(\nu \ge 0, \alpha > 0, \beta \ge 0\) |
| gompmake | \(\lambda + \sigma^{-1}\exp(\beta t/\sigma)\) | \(\lambda \ge 0, \sigma > 0, \beta \ge 0\) |
| perksmake | \(\lambda + \alpha\exp(\nu x)/\{1+\alpha\exp(\nu x)\}\) | \(\lambda \ge 0, \nu \ge 0, \alpha > 0\) |
| beardmake | \(\lambda + \alpha\exp(\nu x)/\{1+\alpha\beta\exp(\nu x)\}\) | \(\lambda \ge 0, \nu \ge 0, \alpha > 0, \beta \ge 0\) |
Many of the models are nested, and Figure 1 shows the logical relations between the families. The function fit_elife allows users to fit all of the parametric models of Table 3: the print method returns a summary of the sampling mechanism, the number of observations, the maximum log likelihood, and parameter estimates with standard errors. Depending on the data, some models may be overparametrized and parameters need not be numerically identifiable. To mitigate such issues, the optimization routine, which uses Rsolnp, can try multiple starting values or fit various sub-models to ensure that the parameter values returned are indeed the maximum likelihood estimates. If one compares nested models and the fit of the simpler model is better than that of the alternative, the anova function returns an error message.
The fit_elife function handles arbitrary censoring patterns over single intervals, along with single interval truncation and interval censoring. To accommodate the sampling scheme of the International Database on Longevity (IDL), an option also allows for double interval truncation (Belzile et al. 2022), whereby observations are included only if the person died within given, potentially overlapping, time intervals that define the observation window over which deaths are recorded.
thresh <- 108
model0 <- fit_elife(arguments = args_japan,
                    thresh = thresh,
                    family = "exp")
(model1 <- fit_elife(arguments = args_japan,
                     thresh = thresh,
                     family = "gomp"))
#> Model: Gompertz distribution.
#> Sampling: interval censored, right truncated
#> Log-likelihood: -3599.037
#>
#> Threshold: 108
#> Number of exceedances: 2489
#>
#> Estimates
#>  scale  shape
#> 1.6855 0.0991
#>
#> Standard Errors
#>  scale  shape
#> 0.0523 0.0273
#>
#> Optimization Information
#>   Convergence: TRUE
Goodness of fit of nested models can be compared using likelihood ratio tests via the anova method. Most of the interrelations between models yield non-regular model comparisons since, to recover the simpler model, one must often fix parameters at values that lie on the boundary of the parameter space. For example, if we compare the Gompertz model with the exponential, the limiting null distribution is a mixture of a point mass at zero and a \(\chi^2_1\) variable, each with probability one half (Chernoff 1954). Many authors (e.g., Camarda 2022) fail to recognize this fact. The situation becomes more complicated with more than one boundary constraint: for example, the deviance statistic comparing the Beard–Makeham and the Gompertz models, which constrains two parameters to the boundary of the parameter space, has the mixture null distribution \(\chi^2_2/4 + \chi^2_1/2 + \chi^2_0/4\) (Self and Liang 1987).
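Such boundary mixture \(p\)-values are straightforward to compute by hand; the sketch below is not a package function but illustrates the Self and Liang (1987) mixture for the Beard–Makeham versus Gompertz comparison:

```r
# p-value under the mixture chi^2_2/4 + chi^2_1/2 + chi^2_0/4
# for an observed deviance statistic 'dev' (dev > 0)
pval_mix <- function(dev) {
  0.25 * pchisq(dev, df = 2, lower.tail = FALSE) +
  0.50 * pchisq(dev, df = 1, lower.tail = FALSE)
  # the chi^2_0 component (point mass at zero) contributes nothing
}
```

Using the naive \(\chi^2_2\) reference distribution instead would overstate the \(p\)-value and thus favor the simpler model too often.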
Nonidentifiability impacts testing: for example, as the rate parameter \(\nu\) of the Perks–Makeham model (perksmake) tends to zero, the limiting hazard, \(\lambda + \exp(\alpha)/\{1+\exp(\alpha)\}\), is constant (the exponential model), but neither \(\alpha\) nor \(\lambda\) is identifiable. The usual asymptotics for the likelihood ratio test break down, as the information matrix is singular (Rotnitzky et al. 2000). As such, the three families that include a Makeham component cannot be directly compared to the exponential in longevity, and the call to anova returns an error message.
Users can also access information criteria (AIC and BIC). The correction factors implemented depend on the number of parameters of the distribution, but do not account for singular fits or non-identifiable parameters, for which the usual corrections \(2p\) and \(\ln(n)p\) are inadequate (Watanabe 2010).
Figure 1: Relationship between parametric models showing nested relations. Dashed arrows represent restrictions that lead to a nonregular asymptotic null distribution for comparisons of nested models. Comparisons between models with Makeham components and the exponential are not permitted by the software because of nonidentifiability issues.
To showcase how hypothesis testing is performed, we consider a simple example with two nested models. We test whether the exponential model is an adequate simplification of the Gompertz model for exceedances above 108 years, an irregular testing problem since \(\beta=0\) is a restriction on the boundary of the parameter space. The drop in log likelihood is quite large, indicating that the exponential model is not an adequate simplification of the Gompertz fit. This is also suggested by the Bayesian information criterion, which is much lower for the Gompertz model than for the exponential.
| | npar | Deviance | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|---|
| gomp | 2 | 7198.07 | | | |
| exp | 1 | 7213.25 | 1 | 15.17 | 0 |
#> exponential Gompertz
#> 7221.065 7213.713
Given the poor finite sample properties of the aforementioned tests, it may be preferable to rely on a parametric bootstrap rather than on the asymptotic distribution of the test statistic (Belzile et al. 2022) for model comparison.

Simulation-based inference requires capabilities for drawing new data sets whose features match those of the original one. For example, the International Database on Longevity (IDL) (Jdanov et al. 2021) features data that are interval truncated above 110 years, but doubly interval truncated since the sampling periods for semisupercentenarians (who died aged 105 to 110) and supercentenarians (who died above 110) are not always the same (Belzile et al. 2022). The 2018 Istat semisupercentenarian database of Italians analyzed by Barbi et al. (2018) includes left truncated, right censored records.
To mimic the postulated data generating mechanism while accounting for the sampling scheme, we could use the observed birth dates, or simulate new birth dates (possibly through a kernel estimator of the empirical distribution of birth dates), while keeping the sampling frame with the first and last dates of data collection to define the truncation interval. In other settings, one could obtain the nonparametric maximum likelihood estimator of the distribution of the upper truncation bound (Shen 2010) using an inverse probability weighted estimator, which for fixed data collection windows is equivalent to setting the birth date.
The samp_elife function includes multiple type2 arguments to handle these schemes. For interval truncated data (type2 = "ltrt"), it uses the inversion method (Devroye 1986, Section 2): for \(F\) an absolutely continuous distribution function with quantile function \(F^{-1}\), a random variable distributed according to \(F\) truncated to \([a,b]\) is generated as \(X = F^{-1}[F(a) + U\{F(b)-F(a)\}]\), where \(U \sim \mathsf{U}(0,1)\) is standard uniform.
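The inversion step is easy to reproduce for a specific family; the following sketch (not part of the package) draws from an exponential distribution truncated to \([a, b]\):

```r
# Inversion method: X = F^{-1}[F(a) + U {F(b) - F(a)}] for truncated exponential
rexp_trunc <- function(n, scale, a, b) {
  Fa <- pexp(a, rate = 1 / scale)
  Fb <- pexp(b, rate = 1 / scale)
  qexp(Fa + runif(n) * (Fb - Fa), rate = 1 / scale)
}
x <- rexp_trunc(1000, scale = 1.5, a = 0, b = 5)
stopifnot(all(x >= 0 & x <= 5))  # all draws lie in the truncation interval
```

samp_elife generalizes this construction to the other parametric families and sampling schemes.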
The function samp_elife also has an argument upper, which serves for both right truncation and right censoring. For the latter, any simulated record that exceeds upper is capped at that bound and declared partially observed. This is useful for simulating administrative censoring, whereby the birth date and the upper bound of the collection window fully determine whether an observation is right censored. An illustrative example is provided in the next section.
The anova method uses the asymptotic null distribution for comparisons of nested parametric distributions \(\mathcal{F}_0 \subseteq \mathcal{F}_1\). We can use the bootstrap to assess how good this approximation to the null distribution is. To mimic the data generating mechanism as closely as possible (a mechanism that is specific to each application), we condition on the sampling frame and the number of individuals in each birth cohort. The number dying at each age is random, but the right truncation limits are the same for everyone in a given cohort. We simulate excess lifetimes, then interval censor observations by keeping only the corresponding age bracket. Under the null hypothesis, the data are drawn from \(\widehat{F}_0 \in \mathcal{F}_0\) and we generate observations from this right truncated distribution using the samp_elife utility, which also supports double interval truncation and left truncation with right censoring. This must be done within a for loop since there is a count attached to each upper bound, but the function is vectorized should we use a single vector containing all of the right truncation limits.
The bootstrap \(p\)-value for comparing models \(M_0 \subset M_1\) would be obtained by repeating the following steps \(B\) times and calculating the rank of the observed test statistic among the bootstrap replicates:
1. use samp_elife to simulate new observations from the parametric interval truncated distribution under the null model \(M_0\);
2. use fit_elife to fit both \(M_0\) and \(M_1\) to the simulated data and calculate their deviances, and from them the likelihood ratio statistic.

The algorithm is implemented below for comparing the Gompertz and the exponential models. Since the procedure is computationally intensive, users must trade off between the precision of the bootstrap \(p\)-value estimate and the number of replications, \(B\).
set.seed(2022)
# Counts per unique right truncation limit
db_rtrunc <- aggregate(count ~ rtrunc,
                       FUN = "sum",
                       data = japanese2,
                       subset = age >= thresh)
B <- 1000L # Number of bootstrap replications
boot_anova <- numeric(length = B)
boot_gof <- numeric(length = B)
for(b in seq_len(B - 1L)){
  boot_samp <- # Generate bootstrap sample
    do.call(rbind, # merge data frames
      apply(db_rtrunc, 1, function(x){ # for each rtrunc and count
        count <- table( # tabulate counts
          floor( # round down to age band
            samp_elife( # sample right truncated exponential
              n = x["count"],
              scale = model0$par,
              family = "exp", # null model
              upper = x["rtrunc"] - thresh,
              type2 = "ltrt")))
        data.frame( # return data frame
          count = as.integer(count),
          rtrunc = as.numeric(x["rtrunc"]) - thresh,
          eage = as.integer(names(count)))
      }))
  boot_mod0 <- # Fit null model to bootstrap sample
    with(boot_samp,
         fit_elife(time = eage,
                   time2 = eage + 1L,
                   rtrunc = rtrunc,
                   type = "interval",
                   event = 3,
                   family = "exp",
                   weights = count))
  boot_mod1 <- # Fit alternative model to bootstrap sample
    with(boot_samp,
         fit_elife(time = eage,
                   time2 = eage + 1L,
                   rtrunc = rtrunc,
                   type = "interval",
                   event = 3,
                   family = "gomp",
                   weights = count))
  boot_anova[b] <- deviance(boot_mod0) -
    deviance(boot_mod1)
}
# Add the original statistic (null minus alternative deviance)
boot_anova[B] <- deviance(model0) - deviance(model1)
# Bootstrap p-value: proportion of statistics at least as extreme
(pval <- mean(boot_anova >= boot_anova[B]))
#> [1] 0.001
The asymptotic approximation is of similar magnitude to the bootstrap \(p\)-value. Both suggest that the more complex Gompertz model provides a significantly better fit.
Extreme value theory suggests that, in many instances, the limiting conditional distribution of exceedances of a random variable \(Y\) with distribution function \(F\) is generalized Pareto, meaning
\[\begin{align}
\lim_{u \to x^*}\Pr(Y-u > y \mid Y > u)= \begin{cases}
\left(1+\xi y/\sigma\right)_{+}^{-1/\xi}, & \xi \neq 0;\\
\exp(-y/\sigma), & \xi = 0;
\end{cases}
\tag{1}
\end{align}\]
with \(x_{+} = \max\{x, 0\}\) and \(x^*=\sup\{x: F(x) < 1\}\). This justifies the use of Equation (1) for the survival function of threshold exceedances when dealing with rare events. The model has two parameters: a scale \(\sigma\) and a shape \(\xi\), which determines the behavior of the upper tail. Negative shape parameters correspond to bounded upper tails and a finite right endpoint for the support.
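Equation (1) translates directly into a survival function, treating \(\xi = 0\) as the exponential limit; the sketch below illustrates this (it is not the package's internal implementation):

```r
# Generalized Pareto survival function Pr(Y - u > y | Y > u), y >= 0
sgp <- function(y, scale, shape) {
  if (abs(shape) < 1e-8) {
    exp(-y / scale)                             # exponential limit, xi = 0
  } else {
    pmax(1 + shape * y / scale, 0)^(-1 / shape) # zero beyond endpoint if shape < 0
  }
}
```

For shape \(< 0\), the pmax term vanishes beyond the endpoint \(-\sigma/\xi\), so the survival probability is exactly zero there.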
Study of population dynamics and mortality generally requires knowledge of the total population from which observations are drawn in order to derive rates. By contrast, the peaks over threshold method, by which one models the \(k\) largest observations of a sample, is a conditional analysis (e.g., given survival until a certain age), and is therefore free of denominator specification, since we only model exceedances above a high threshold \(u\). For modelling purposes, we need to pick a threshold \(u\) smaller than the upper endpoint \(x^*\) in order to have a sufficient number of observations to estimate the parameters. The threshold selection problem is a classical instance of the bias-variance trade-off: the parameter estimators are possibly biased if the threshold is too low, because the generalized Pareto approximation is not good enough, whereas choosing a larger threshold, to ensure we are closer to the asymptotic regime, leads to a reduced sample size and increased parameter uncertainty.
To aid threshold selection, users commonly resort to parameter stability plots. These are visual diagnostics consisting of a plot of estimates of the shape parameter \(\widehat{\xi}\) (with confidence or credible intervals) based on sample exceedances over a range of thresholds \(u_1, \ldots, u_K\). If the data were drawn from a generalized Pareto distribution, the conditional distribution above a higher threshold \(v > u\) is also generalized Pareto with the same shape: this threshold stability property is the basis for extrapolation beyond the range of observed records. Thus, if the estimate of \(\xi\) is nearly constant across thresholds, this provides reassurance that the approximation can be used for extrapolation. The only difference with survival data, relative to the classical setting, is that the likelihood must account for censoring and truncation. Note that, with a nonzero threshold (argument thresh), it is not always possible to unambiguously determine whether left censored observations are exceedances: such cases yield errors in the functions.
Theory on penultimate extremes suggests that, for finite levels and a general distribution function \(F\) for which (1) holds, the shape parameter varies as a function of the threshold \(u\), behaving like the derivative of the reciprocal hazard \(r(x) = \{1-F(x)\}/f(x)\). We can thus model the shape as piece-wise constant by fitting the piece-wise generalized Pareto model of Northrop and Coleman (2014), adapted in Belzile et al. (2022) for survival data. The model can be viewed as a mixture of generalized Pareto distributions over \(K\) disjoint intervals with continuity constraints to ensure a smooth hazard, and it reduces to the generalized Pareto if we force the \(K+1\) shape parameters to be equal. We can use a likelihood ratio test to compare the models, or a score test if the likelihood ratio test is too computationally intensive, and plot the \(p\)-values for each of the \(K\) thresholds, corresponding to the null hypotheses \(\mathrm{H}_k: \xi_k = \cdots = \xi_{K}\) (\(k=1, \ldots, K-1\)). As the model quickly becomes overparametrized, optimization is difficult and the score test may be a safer option, as it only requires estimation of the null model of a single generalized Pareto over the whole range.
To illustrate these diagnostic tools, Figure 2 shows a threshold stability plot, which features a small increase in the shape estimates as the threshold increases, corresponding to a stabilization or even a slight decrease of the hazard at higher ages. A threshold of 108 years appears reasonable: the Northrop–Coleman diagnostic plot suggests that lower thresholds are compatible with a constant shape above 100. Additional goodness-of-fit diagnostics are needed to determine whether the generalized Pareto model fits well.
par(mfrow = c(1, 2), mar = c(4, 4, 1, 1))
# Threshold sequence
u <- 100:110
# Threshold stability plot
tstab(arguments = args_japan,
      family = "gp",
      method = "profile",
      which.plot = "shape",
      thresh = u)
# Northrop-Coleman diagnostic based on score tests
nu <- length(u) - 1L
nc_score <- nc_test(arguments = c(args_japan, list(thresh = u)))
score_plot <- plot(nc_score)
graphics.off()
Figure 2: Threshold diagnostic tools: parameter stability plot for the generalized Pareto model (left) and Northrop–Coleman \(p\)-value path (right) for the Japanese centenarian data. Both suggest that a threshold as low as 100 may be suitable for peaks-over-threshold analysis.
Each plot in the package can be produced using base R graphics or ggplot2 (Wickham 2016), which implements the grammar of graphics. To keep the list of package dependencies lean and adhere to the tinyverse principle, ggplot2 versions are obtained by setting the argument plot.type in the generic S3 plot method, or via autoplot, provided the ggplot2 package is installed.
Determining whether a parametric model fits survival data well is no easy task, due to the difficulty of specifying the null distribution of many goodness of fit statistics, such as the Cramér–von Mises statistic, which differ for survival data. As such, the longevity package relies mostly on visual diagnostic tools. Waller and Turnbull (1992) discuss how classical visual diagnostics can be adapted in the presence of censoring. Most notably, only observed failure times are displayed on the \(y\)-axis against their empirical plotting position on the \(x\)-axis. Contrary to the independent and identically distributed case, the uniform plotting positions \(F_n(y_i)\) are based on the nonparametric maximum likelihood estimator discussed in Section 3.
The situation is more complicated with truncated data (Belzile et al. 2022), since the data are not identically distributed: indeed, the distribution function of observation \(Y_i\) truncated to the interval \([a_i, b_i]\) is \(F_i(y_i) = \{F(y_i)-F(a_i)\}/\{F(b_i) - F(a_i)\}\), so the data arise from different distributions even if these share common parameters. One way out of this conundrum is to use the probability integral transform and the quantile transform to map observations to the uniform scale and back onto the data scale. Writing \(\widetilde{F}(y_i) = F_n(y_i)=\mathrm{rank}(y_i)/(n+1)\) for the empirical distribution function estimator, a probability-probability plot displays \(\widetilde{F}(y_i)\) against \(F_i(y_i)\), leading to approximately uniform samples if the parametric distribution \(F\) is suitable. Another option is to standardize the observations, taking the collection \(\widetilde{y}_i=F^{-1}\{F_i(y_i)\}\) of rescaled exceedances and comparing them to the usual plotting positions \(x_{(i)} = i/(n+1)\). The drawback of the latter approach is that the quantities displayed on the \(y\)-axis are not raw observations, and the ranking of the empirical quantiles may change, a somewhat counter-intuitive feature. However, the sample \(\{F_i(y_i)\}\) should be uniform under the null hypothesis, which allows one to use the methods of Säilynoja et al. (2022) to obtain point-wise and simultaneous confidence intervals.
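The probability integral transform above is a one-liner once truncation bounds and a fitted \(F\) are available; the sketch below (illustrative names, assuming an exponential fit for concreteness) maps each truncated observation to the uniform scale:

```r
# F_i(y) = {F(y) - F(a)} / {F(b) - F(a)} for y truncated to [a, b]
pit_trunc <- function(y, a, b, scale) {
  Fexp <- function(x) pexp(x, rate = 1 / scale)
  (Fexp(y) - Fexp(a)) / (Fexp(b) - Fexp(a))
}
```

If the fitted model is correct, the values returned by pit_trunc should look like a standard uniform sample.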
longevity offers users the choice between three types of quantile-quantile plots: regular (Q-Q, "qq"), Tukey’s mean detrended Q-Q plots ("tmd") and exponential Q-Q plots ("exp"). Other options on the uniform scale are probability-probability (P-P, "pp") plots and empirically rescaled plots (ERP, "erp") (Waller and Turnbull 1992), designed to ease interpretation with censored observations by rescaling the axes. We illustrate the graphical tools with the dutch data above age 105 in Figure 3. The fit is adequate above 110, but there is a notable dip due to excess mortality around 109.
fit_dutch <- fit_elife(
  arguments = dutch_data,
  event = 3,
  type = "interval2",
  family = "gp",
  thresh = 105,
  export = TRUE)
par(mfrow = c(1, 2))
plot(fit_dutch,
     which.plot = c("pp", "qq"))
Figure 3: Probability-probability and quantile-quantile plots for the generalized Pareto model fitted above age 105 to the Dutch data. The plots indicate broadly good agreement with the observations, except for individuals who died at age 109, too many of whom died close to their birthdates.
Censored observations are used to compute the plotting positions, but are not displayed. As such, we cannot use graphical goodness of fit diagnostics for the interval censored Japanese data. An alternative, given that the data are tabulated in a contingency table, is a chi-squared test for independence, conditioning on the number of individuals per birth cohort. The expected number in each cell (birth cohort and age band) can be obtained by computing the conditional probability of falling in that age band. The asymptotic null distribution should be \(\chi^2\) with \((k-1)(p-1)\) degrees of freedom, where \(k\) is the number of age bands and \(p\) the number of birth cohorts. In finite samples, the expected counts for large excess lifetimes are very low, so one can expect the \(\chi^2\) approximation to be poor. To mitigate this, we can pool observations and resort to simulation to approximate the null distribution of the test statistic. The bootstrap \(p\)-value for the exponential model above 108 years, pooling observations with excess lifetimes of 5 years and above, is 0.872, indicating no evidence that the model is inadequate, although the test may have low power.
Demographers may suspect differences between individuals of different sexes, from different countries or geographic areas, or across birth cohorts. All of these are instances of categorical covariates. One possibility is to incorporate these covariates through parameters with a suitable link function, but we consider instead stratification. We can split the data by the levels of a covariate (encoded as a factor) into \(K\) subsets and compare the goodness of fit of the \(K\) stratum-specific models relative to the model that pools all observations. The test_elife function performs likelihood ratio tests for these comparisons. The test statistic is \(-2\{\ell(\widehat{\boldsymbol{\theta}}_0) - \ell(\widehat{\boldsymbol{\theta}})\}\), where \(\widehat{\boldsymbol{\theta}}_0\) is the maximum likelihood estimator under the null model with common parameters, and \(\widehat{\boldsymbol{\theta}}\) is the unrestricted maximum likelihood estimator for the alternative model with the same distribution, but which allows for stratum-specific parameters. We illustrate this with a generalized Pareto model for the excess lifetime. The null hypothesis is \(\mathrm{H}_0: \sigma_{\texttt{f}} = \sigma_{\texttt{m}}, \xi_{\texttt{f}}=\xi_{\texttt{m}}\) against the alternative that at least one equality does not hold, so that the hazards and endpoints differ.
print(
  test_elife(
    arguments = args_japan,
    thresh = 110,
    family = "gp",
    covariate = japanese2$gender)
)
#> Model: generalized Pareto distribution.
#> Threshold: 110
#> Number of exceedances per covariate level:
#> female   male
#>    642     61
#>
#> Likelihood ratio statistic: 0.364
#> Null distribution: chi-square (2)
#> Asymptotic p-value: 0.833
In the present example, there is no evidence of any difference in the lifetime distribution between males and females; this is unsurprising given the large imbalance in counts between covariate levels, with far fewer males than females.
If the maximum likelihood estimate of the shape \(\xi\) for the generalized Pareto model is negative, then the distribution has a finite upper endpoint; otherwise, the endpoint is infinite. With \(\xi < 0\), we can examine the profile log likelihood for the endpoint \(\eta = -\sigma/\xi\), using the function prof_gp_endpt to draw the curve and obtain confidence intervals. The argument psi gives a grid of values over which the profile log likelihood is computed. The bounds of the \((1-\alpha)\) confidence interval are obtained by fitting a cubic smoothing spline for \(y=\eta\) as a function of the shifted profile curve \(x = 2\{\ell_p(\eta)-\ell_p(\widehat{\eta})\}\) on each side of the maximum likelihood estimate, and predicting the value of \(y\) at \(x = -\chi^2_1(1-\alpha)\). This technique works well unless the profile is nearly flat or the bounds lie beyond the range of psi values provided; the user may then wish to extend the grid. If \(\widehat{\xi} \approx 0\), the upper bound of the confidence interval may be infinite and the profile log likelihood may never reach the cutoff value of the asymptotic \(\chi^2_1\) distribution.
The profile log likelihood curve for the endpoint, shifted vertically so that its value is zero at the maximum likelihood estimate, highlights the marked asymmetry of the distribution of \(\eta\), shown in Figure 4, with the horizontal dashed lines showing the limits for the 95% profile likelihood confidence intervals. These suggest that the endpoint, or a potential finite lifespan, could lie very much beyond observed records. The routine used to calculate the upper bound computes the cutoff value by fitting a smoothing spline with the roles of the \(y\) and \(x\) axes reversed and predicting the value of \(\eta\) at \(y=0\). In this example, the upper confidence limit is extrapolated from the model: more accurate measures can be obtained by specifying a longer and finer sequence of values of psi such that the profile log likelihood drops below the \(\chi^2_1\) quantile cutoff.
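The spline inversion can be mimicked outside the package; in the sketch below (illustrative, assuming a grid psi and the corresponding shifted profile values prof), the roles of the axes are reversed so that the bound is read off at the \(\chi^2_1\) cutoff:

```r
# Upper confidence bound for the endpoint by smoothing-spline inversion:
# prof = 2 * {l_p(psi) - l_p(psi_hat)} evaluated on the grid psi
upper_bound <- function(psi, prof, level = 0.95) {
  cutoff <- -qchisq(level, df = 1)    # e.g., -3.84 for a 95% interval
  upper <- psi > psi[which.max(prof)] # branch to the right of the MLE
  fit <- smooth.spline(x = prof[upper], y = psi[upper])
  predict(fit, x = cutoff)$y          # endpoint value at the cutoff
}
```

The lower bound is obtained analogously from the branch to the left of the maximum likelihood estimate.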
# Create grid of threshold values
thresholds <- 105:110
# Grid of values at which to evaluate the profile
psi <- seq(120, 200, length.out = 101)
# Calculate the profile for the endpoint
# of the generalized Pareto at each threshold
endpt_tstab <- do.call(
  endpoint.tstab,
  args = c(
    args_japan,
    list(psi = psi,
         thresh = thresholds,
         plot = FALSE)))
# Compute corresponding confidence intervals
profile <- endpoint.profile(
  arguments = c(args_japan, list(thresh = 110, psi = psi)))
# Plot point estimates and confidence intervals
g1 <- autoplot(endpt_tstab, plot = FALSE, ylab = "lifespan (in years)")
# Plot the profile curve with cutoffs for conf. int. at threshold 110
g2 <- autoplot(profile, plot = FALSE)
patchwork::wrap_plots(g1, g2)
Figure 4: Maximum likelihood estimates with 95% confidence intervals as a function of threshold (left) and profile likelihood for exceedances above 110 years (right) for the Japanese centenarian data. As the threshold increases, the number of exceedances decreases and the intervals for the upper bound become wider. At 110, the right endpoint of the interval extends to infinity.
Depending on the model, the conclusions about the risk of mortality change drastically: the Gompertz model implies an ever-increasing hazard but no finite endpoint for the distribution of exceedances, and the exponential model implies a constant hazard and no endpoint. By contrast, the generalized Pareto can accommodate both finite and infinite endpoints. The marked asymmetry of the distribution of lifespan implied by the generalized Pareto shows that inference based on symmetric (i.e., Wald-type) confidence intervals is likely very misleading: the drop in fit from a zero or positive shape parameter \(\xi\) is smaller than the cutoff for a 95% confidence interval, so while the best point estimate is around 128 years, the upper bound is so large (and extrapolated) that no upper limit can be ruled out. The model nevertheless suggests a very high probability of dying in any given year, regardless of whether the hazard is constant, decreasing or increasing.
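The contrast can be made concrete: the generalized Pareto hazard is \(h(t) = 1/(\sigma + \xi t)\) on its support, so it is increasing with finite endpoint \(\eta = -\sigma/\xi\) when \(\xi < 0\), reduces to the constant exponential hazard \(1/\sigma\) when \(\xi = 0\), and decreases when \(\xi > 0\). A minimal sketch (illustrative parameter values, not fitted estimates):

```python
# Hazard and endpoint of the generalized Pareto distribution:
# h(t) = 1 / (sigma + xi * t) while sigma + xi * t > 0.
# xi < 0: increasing hazard, finite endpoint eta = -sigma / xi;
# xi = 0: constant exponential hazard 1 / sigma; xi > 0: decreasing hazard.

def gp_hazard(t, sigma, xi):
    denom = sigma + xi * t
    if denom <= 0:
        raise ValueError("t lies beyond the support of the distribution")
    return 1.0 / denom


def gp_endpoint(sigma, xi):
    return -sigma / xi if xi < 0 else float("inf")


# Illustrative values (not estimates from the Japanese data)
sigma, xi = 1.5, -0.1
eta = gp_endpoint(sigma, xi)  # finite endpoint, approximately 15 years above the threshold
```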
The parameters of the models are seldom of interest in themselves: rather, we may be interested in a summary such as the hazard function. At present, longevity does not allow general linear modelling of model parameters or time-varying covariates, but other software implementations can tackle this task. For example, casebase (Bhatnagar et al. 2022) fits flexible hazard models using logistic or multinomial regression, with optional penalties on the parameters associated with covariate and spline effects. Another alternative is flexsurv (Jackson 2016), which offers 10 parametric models and allows for user-specified models. The bshazard package (Rebora et al. 2014) provides nonparametric smoothing via \(B\)-splines, whereas muhaz handles kernel-based hazard estimation for right-censored data; both could be used to validate the parametric models in the case of right censoring. The rstpm2 package (Liu et al. 2018) handles generalized modelling for censoring with the Royston–Parmar model built from natural cubic splines (Royston and Parmar 2002). In contrast with all of the aforementioned approaches, we focus on parametric models: this is partly because there are few observations for the use case we consider and covariates, except perhaps gender and birth year, are not available.
The hazard changes over time; the only notable exception is the exponential model, whose hazard is constant. longevity includes utilities for computing the hazard function from a fitted model object and for computing point-wise confidence intervals using symmetric Wald intervals or the profile likelihood.
Specifically, the hazard_elife function calculates the hazard \(h(t; \boldsymbol{\theta})\) point-wise at the times \(t\) given by the argument x; Wald-based confidence intervals are obtained using the delta method, whereas profile likelihood intervals are obtained by reparametrizing the model in terms of \(h(t)\) for each time \(t\). Perhaps more naturally, we can consider a Bayesian analysis of the Japanese excess lifetimes above 108 years. Using the likelihood and the log posterior encoded in logpost_elife, we obtained independent samples from the posterior of the generalized Pareto parameters \((\sigma, \xi)\) with the maximal data information prior using the rust package. Each parameter combination was then fed into helife and the hazard evaluated over a range of values. Figure 5 shows the posterior samples and functional boxplots (Sun and Genton 2011) of the hazard curves, obtained using the fda package. The risk of dying increases with age, but comes with substantial uncertainty, as evidenced by the increasing width of the boxes and interquartile range.
Figure 5: Left: scatterplot of 1000 independent posterior samples from the generalized Pareto model with maximal data information prior; the contour curves give the percentiles of credible intervals and show approximate normality of the posterior. Right: functional boxplots for the corresponding hazard curves, with increasing width at higher ages.
The nonparametric maximum likelihood estimator is unique only up to equivalence classes. The data for individual \(i\) consist of the tuple \(\{L_i, R_i, V_i, U_i\}\), where the censoring interval is \([L_i, R_i]\) and the truncation interval is \([V_i, U_i]\), with \(0 \leq V_i \leq L_i \leq R_i \leq U_i \leq \infty\). Turnbull (1976) shows how one can build disjoint intervals \(C = \bigsqcup_{j=1}^m [a_j, b_j]\), where \(a_j \in \mathcal{L} = \{L_1, \ldots, L_n\}\) and \(b_j \in \mathcal{R} = \{R_1, \ldots, R_n\}\) satisfy \(a_1 \leq b_1 < \cdots < a_m \leq b_m\) and the intervals \([a_j, b_j]\) contain no other members of \(\mathcal{L}\) or \(\mathcal{R}\) except at the endpoints. This last condition notably ensures that, in the absence of truncation, the intervals created include all observed failure times as singleton sets. Other authors (Lindsey and Ryan 1998) have taken interval-censored data as semi-open intervals \((L_i, R_i]\), a convention we adopt here for numerical reasons. For interval-censored and truncated data, Frydman (1994) shows that this construction must be amended by taking instead \(a_j \in \mathcal{L} \cup \{U_1, \ldots, U_n\}\) and \(b_j \in \mathcal{R} \cup \{V_1, \ldots, V_n\}\).
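For the censoring-only case, Turnbull's construction can be sketched in a few lines: sort all endpoints with their labels, and emit an interval whenever a left endpoint is immediately followed by a right endpoint, so that no other endpoint falls in its interior. This is an illustrative sketch (in Python, not the package's implementation):

```python
import math


def turnbull_intervals(left, right):
    """Candidate equivalence classes [a_j, b_j] for interval-censored data
    (no truncation): a_j is a left endpoint immediately followed, in the
    sorted list of all endpoints, by a right endpoint b_j, so the interval
    contains no other endpoint in its interior."""
    # Label 0 for left bounds, 1 for right bounds; ties sort left-before-right
    pts = sorted([(l, 0) for l in left] + [(r, 1) for r in right])
    out = []
    for (v1, lab1), (v2, lab2) in zip(pts, pts[1:]):
        if lab1 == 0 and lab2 == 1:  # an L point directly followed by an R point
            out.append((v1, v2))
    return out


# Toy example from below: T = (1, 1, 2), delta = (0, 1, 1), so the censoring
# sets have left bounds {1, 1, 2} and right bounds {inf, 1, 2}
classes = turnbull_intervals([1, 1, 2], [math.inf, 1, 2])
# classes == [(1, 1), (2, 2)], i.e. C1 = {1} and C2 = {2}
```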
We assign probability \(p_j = F(b_j^{+}) - F(a_j^{-}) \ge 0\) to each of the resulting \(m\) intervals, under the constraints \(\sum_{j=1}^m p_j = 1\) and \(p_j \ge 0\) \((j=1, \ldots, m)\). The nonparametric maximum likelihood estimator of the distribution function \(F\) is then
\[\begin{align*}
\widehat{F}(t) = \begin{cases} 0, & t < a_1;\\
\widehat{p}_1 + \cdots + \widehat{p}_j, & b_j < t < a_{j+1} \quad (1 \leq j \leq m-1);\\
1, & t > b_m;
\end{cases}
\end{align*}\]
and is undefined for \(t \in [a_j, b_j]\) \((j=1, \ldots, m)\).
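A minimal sketch of the resulting step function (not the stepfun object returned by the package), which accumulates the masses of the equivalence classes below \(t\) and returns None where \(\widehat{F}\) is undefined:

```python
def npmle_cdf(intervals, probs):
    """Step CDF F(t) from Turnbull intervals [a_j, b_j] with masses p_j.
    F is undefined inside each [a_j, b_j]; we signal this with None."""
    def F(t):
        total = 0.0
        for (a, b), p in zip(intervals, probs):
            if t > b:        # past this class: all of its mass is accumulated
                total += p
            elif t >= a:     # inside an equivalence class: F is undefined
                return None
        return total
    return F


# Illustrative masses on the classes {1} and {2} (hypothetical values)
F = npmle_cdf([(1, 1), (2, 2)], [1 / 3, 2 / 3])
```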
Figure 6: Illustration of the truncation (pale grey) and censoring (dark grey) intervals and the equivalence classes based on Turnbull’s algorithm. Observations must fall within equivalence classes defined by the former.
The procedure of Turnbull can be encoded using \(m \times n\) matrices. For censoring, we build \(\mathbf{A}\) whose \((i,j)\)th entry is \(\alpha_{ij}=1\) if \([a_j, b_j] \subseteq A_i\) and zero otherwise. Since the intervals forming the set \(C\) are disjoint and in increasing order, a more storage-efficient way of keeping track of the intervals is to find, for each observation, the smallest integer \(j\) such that \(L_i \leq a_j\) and the largest integer \(j\) such that \(b_j \leq R_i\). The same idea applies to the truncation sets \(B_i = (V_i, U_i)\) and the matrix \(\mathbf{B}\) with \((i,j)\)th element \(\beta_{ij}\).
The log likelihood function is
\[\begin{align*}
\ell(\boldsymbol{p}) = \sum_{i=1}^n w_i \left\{ \log \left( \sum_{j=1}^m \alpha_{ij}p_j\right) - \log \left( \sum_{j=1}^m \beta_{ij}p_j\right)\right\}.
\end{align*}\]
The numerical implementation of the EM algorithm is in principle straightforward: first identify the equivalence classes \(C\), next calculate the entries of \(A_i\) and \(B_i\) (or the vectors of ranges), and finally run the EM algorithm. In the second step, we need to account for potential ties in the presence of (interval) censoring and treat the intervals as open on the left for censored data. For concreteness, consider the toy example \(\boldsymbol{T} =(1,1,2)\) and \(\boldsymbol{\delta} = (0,1,1)\), where \(\delta_i = 1\) if the observation is a failure time and \(\delta_i=0\) in case of right censoring. The left and right censoring bounds are \(\mathcal{L} = \{1, 2\}\) and \(\mathcal{R} = \{1, 2, \infty\}\), with \(A_1 = (1, \infty)\), \(A_2 = \{1\}\), \(A_3 = \{2\}\), and equivalence classes \(C_1=\{1\}\) and \(C_2=\{2\}\). If we were to treat \(A_1\) instead as a semi-closed interval \([1, \infty)\), direct maximization of the log likelihood in eq. 2.2 of Turnbull (1976) would assign probability one half to each observed failure time. By contrast, the Kaplan–Meier estimator, under the convention that right-censored observations at time \(t\) were at risk up to and including \(t\), assigns probability 1/3 to the first failure. To retrieve this solution with Turnbull’s EM estimator, we need the convention that \(C_1 \notin A_1\), but this requires comparing the bound with itself. The numerical tolerance in the implementation is taken to be the square root of the machine epsilon for doubles.
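The EM update for the censoring-only case (truncation matrix \(\mathbf{B}\) omitted) redistributes each observation's weight over the classes its censoring set contains, proportionally to the current masses. The sketch below (not the package's code) reproduces the toy example above under the convention \(C_1 \notin A_1\):

```python
def turnbull_em(alpha, weights=None, n_iter=500):
    """EM iterations for the masses p_j of Turnbull's equivalence classes,
    censoring only. alpha[i][j] = 1 if class j lies in observation i's
    censoring set A_i, else 0."""
    n, m = len(alpha), len(alpha[0])
    w = weights or [1.0] * n
    W = sum(w)
    p = [1.0 / m] * m  # uniform starting value
    for _ in range(n_iter):
        new = [0.0] * m
        for i in range(n):
            # E step: expected class membership of observation i
            denom = sum(alpha[i][k] * p[k] for k in range(m))
            for j in range(m):
                new[j] += w[i] * alpha[i][j] * p[j] / denom
        # M step: renormalize the redistributed weights
        p = [v / W for v in new]
    return p


# Toy data T=(1,1,2), delta=(0,1,1): classes C1={1}, C2={2};
# treating A1=(1, inf) as open at 1 gives alpha rows (0,1), (1,0), (0,1).
p = turnbull_em([[0, 1], [1, 0], [0, 1]])
# p converges to (1/3, 2/3), matching the Kaplan-Meier weights
```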
The maximum likelihood estimator (MLE) need not be unique, and the EM algorithm is only guaranteed to converge to a local maximum. For interval-censored data, Gentleman and Geyer (1994) consider using the Karush–Kuhn–Tucker conditions to determine whether the probability in some intervals is exactly zero and whether the returned value is indeed the MLE.
Due to data scarcity, statistical inference for the human lifespan is best conducted using parametric models supported by asymptotic theory, reserving nonparametric estimators for assessing goodness of fit. The empirical cumulative hazard for the Japanese data is very close to linear from early ages, suggesting that the hazard may not be very far from exponential, even if more complex models are likely to be favored given the large sample size.
The function np_elife returns a list with Turnbull’s intervals \([a_j, b_j]\) and the probability weights assigned to each, provided the latter are positive. It also contains an object of class stepfun with a weighting argument that defines a cumulative distribution function.
We can use the nonparametric maximum likelihood estimator of the distribution function to assess a fitted parametric model, by comparing the density with a binned distribution or by comparing cumulative distribution functions. The distribution function is defined over equivalence classes, which may be isolated observations or intervals. The data interval over which there is non-zero probability of events is broken into equispaced bins, and the probability of failing within each bin is estimated nonparametrically from the distribution function. Alternatively, users can provide a set of breaks, or the number of bins.
ecdf <- np_elife(arguments = args_japan, thresh = 108)
# Summary statistics, accounting for censoring
round(summary(ecdf), digits = 2)
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 109.00  110.00  111.00  110.01  111.00  118.00
# Plots of fitted parametric model and nonparametric CDF
model_gp <- fit_elife(
  arguments = args_japan,
  thresh = 108,
  family = "gp",
  export = TRUE)
# ggplot2 plots, wrapped to display side by side
patchwork::wrap_plots(
  autoplot(
    model_gp,      # fitted model
    plot = FALSE,  # return list of ggplots
    which.plot = c("dens", "cdf"),
    breaks = seq(0L, 8L, by = 1L)  # set bins for histogram
  )
)
Figure 7: Nonparametric maximum likelihood estimate of the density (bar plot, left) and distribution function (staircase function, right), with superimposed generalized Pareto fit for excess lifetimes above 108 years. Except for the discreteness inherent to the nonparametric estimates, the two representations broadly agree at year marks.
This paper describes the salient features of longevity, explaining the theoretical underpinnings of the methods and the design considerations followed when writing the package. While longevity was conceived for modelling lifetimes, the package could be used for applications outside of demography. Survival data are infrequent in extreme value theory, yet hardly absent. For example, rainfall observations can be viewed as rounded due to the limited instrumental precision of rain gauges and treated as interval-censored. Some historical records, which are often lower bounds on the real magnitude of natural catastrophes, can be added as right-censored observations in a peaks-over-threshold analysis. In insurance, losses incurred by a company due to liability claims may be right-censored if they exceed the policy cap and are covered by a reinsurance company, or rounded (Belzile and Nešlehová 2025). In climate science, attribution studies often focus on data just after a record-breaking event, and the stopping rule leads to truncation (Miralles and Davison 2023), which biases results if ignored.
The package has features that are interesting in their own right, including adapted quantile-quantile plots and other visual goodness-of-fit diagnostics for model validation. The testing procedures correctly handle tests for restrictions lying on the boundary of the parameter space. Parametric bootstrap procedures for such tests are not straightforward to implement, given the heavy reliance on the data-generating mechanism and the diversity of possible scenarios; this paper, however, shows how the utilities of the package can be combined to ease such estimation, and the code in the supplementary material illustrates how this can be extended to goodness-of-fit testing.
The author thanks four anonymous reviewers for valuable feedback. This research was supported financially by the Natural Sciences and Engineering Research Council of Canada (NSERC) via Discovery Grant RGPIN-2022-05001.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2025-034.zip
longevity, evd, mev, extRemes, evgam, survival, MortCast, demography, MortalityLaws, vitality, fitdistrplus, VGAM, prodlim, icenReg, dblcens, interval, tranSurv, DTDA, Rsolnp, ggplot2, casebase, flexsurv, bshazard, muhaz, rstpm2, rust, fda
ActuarialScience, ChemPhys, ClinicalTrials, Distributions, Econometrics, Environmetrics, ExtremeValue, FunctionalData, MissingData, NetworkAnalysis, Optimization, Phylogenetics, Psychometrics, Spatial, Survival, TeachingStatistics
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Belzile, "longevity: An R Package for Modelling Excess Lifetimes", The R Journal, 2026
BibTeX citation
@article{RJ-2025-034,
  author = {Belzile, Léo R.},
  title = {longevity: An R Package for Modelling Excess Lifetimes},
  journal = {The R Journal},
  year = {2026},
  note = {https://doi.org/10.32614/RJ-2025-034},
  doi = {10.32614/RJ-2025-034},
  volume = {17},
  issue = {4},
  issn = {2073-4859},
  pages = {37-58}
}
The receiver operating characteristic (ROC) curve is a graphical tool commonly used to depict the binary classification accuracy of a continuous marker in terms of its sensitivity and specificity. The standard ROC curve assumes a monotone relationship between the marker and the response, inducing classification subsets of the form \((c,\infty)\) with \(c \in \mathbb{R}\). However, in non-standard cases, the involved classification regions are not so clear, highlighting the importance of tracking the decision rules. This paper introduces the R package movieROC, which provides visualization tools for understanding the ability of markers to identify a characteristic of interest, complementing the ROC curve representation. This tool accommodates multivariate scenarios and generalizations involving different decision rules. The main contribution of this package is the visualization of the underlying classification regions, with the associated gain in interpretability. By adding time (videos) as a third dimension, the package facilitates the visualization of binary classification in multivariate problems. It constitutes a good tool for generating graphical material for presentations.
The use of data to detect a characteristic of interest is a cornerstone of many disciplines, such as medicine (to diagnose a pathology or to predict a patient outcome), finance (to detect fraud) or machine learning (to evaluate a classification algorithm), among others. Continuous markers are surrogate measures for the characteristic under study, or predictors of a potential subsequent event. They are measured in subjects, some of whom have the characteristic (positive) and some of whom do not (negative). In addition to reliability and feasibility, a good marker must have two relevant properties: interpretability and accuracy (Mayeux 2004). High binary classification accuracy can be achieved if there exists a strong relationship between the marker and the response. The latter is assessed by a gold standard for the presence or absence of the characteristic of interest. Interpretability refers to the decision rules or subsets considered in the classification process. This piece of research seeks to elucidate both desirable properties for a marker through the implementation of a graphical tool in the R language. We propose a novel approach involving the generation of videos as a solution to effectively capture the classification procedure for univariate and multivariate markers. Graphical analysis plays a pivotal role in data exploration, interpretation, and communication. Its burgeoning potential is underscored by the fast pace of technological advances, which empower the creation of insightful graphical representations.
A usual practice when the binary classification accuracy of a marker is of interest involves the representation of the Receiver Operating Characteristic (ROC) curve, summarized by the Area Under the Curve (AUC) (Hanley and McNeil 1982). The resulting plot reflects the trade-off between the sensitivity and the complement of the specificity; sensitivity and specificity are the probabilities of correctly classifying positive and negative subjects, respectively. Mathematically, let \(\xi\) and \(\chi\) be the random variables modeling the marker values in the positive and the negative population, respectively, with \(F_\xi(\cdot)\) and \(F_\chi(\cdot)\) their associated cumulative distribution functions. Assuming that the expected value of the marker is larger in the positive than in the negative population, the standard ROC curve is based on classification subsets of the form \(s = (c, \infty)\), where \(c\) is the so-called cut-off value or threshold in the support of the marker \(X\), \(\mathcal{S}(X)\). A subject is classified as positive if its marker value falls within this region, and as negative otherwise. This type of subset has two important advantages: first, its interpretability is clear; second, for each specificity \(1-t \in [0,1]\), the corresponding \(s_t = (c_t, \infty)\) is univocally defined by \(c_t = F_\chi^{-1}(1-t)\) for absolutely continuous markers.
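As an illustration of how the ROC curve arises from these subsets, the empirical version simply sweeps the cut-off \(c\) over the observed marker values and records \((\mathrm{FPR}, \mathrm{TPR})\) pairs. This is a language-agnostic sketch (in Python for testability), not part of movieROC:

```python
def empirical_roc(neg, pos):
    """Empirical ROC points from classification subsets s = (c, inf):
    for each observed cut-off c, FPR = P(neg > c) and TPR = P(pos > c)."""
    cuts = sorted(set(neg) | set(pos))
    points = [(1.0, 1.0)]  # c = -infinity classifies everyone as positive
    for c in cuts:
        fpr = sum(x > c for x in neg) / len(neg)
        tpr = sum(y > c for y in pos) / len(pos)
        points.append((fpr, tpr))
    return points


# Toy marker values (illustrative, not from any dataset in the paper)
roc = empirical_roc([1, 2, 3], [2, 3, 4])
```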
When differences in marker distribution between the negative and the positive population are only in location but not in shape, then \(F_\xi(\cdot) < F_\chi(\cdot)\), and classification is direct using these decision rules. However, when this is not the case, the standard ROC curve may cross the main diagonal, resulting in an improper curve (Dorfman et al. 1997). This may be due to three different scenarios:
the behavior of the marker in the two studied populations is different, but it is not possible to determine the decision rules. Notice that the binary classification problem goes further than distinguishing between the two populations: the classification subsets should be highly likely in one population and highly unlikely in the other (Martı́nez-Camblor 2018);
there exists a relationship between the marker and the response with +a potential classification use, but this is not monotone;
there is no relationship between the marker and the response at all +(main diagonal ROC curve).
In the second case, we have to define classification subsets different from the standard \(s_t=(c_t,\infty)\), and the use of the marker therefore becomes more complex. With the aim of accommodating scenarios where both higher and lower values of the marker are associated with a higher risk of having the characteristic, Martı́nez-Camblor et al. (2017) proposed the so-called generalized ROC (gROC) curve. This curve tracks the highest sensitivity for every specificity in the unit interval resulting from subsets of the form \(s_t=(-\infty, x_t^L] \cup (x_t^U, \infty)\) with \(x_t^L \leq x_t^U \in \mathcal{S}(X)\).
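The gROC construction can be mimicked by brute force: for a fixed false-positive rate, search cut-off pairs \((x^L, x^U)\) over the observed values and keep the pair with the highest sensitivity. A Python sketch (not the estimator implemented in the package):

```python
def groc_best_subset(neg, pos, t):
    """For subsets s = (-inf, xL] u (xU, inf), search cut-off pairs whose
    empirical false-positive rate is at most t and return the pair with the
    highest sensitivity (brute force over observed marker values)."""
    cuts = sorted(set(neg) | set(pos))
    grid = [float("-inf")] + cuts
    best = (0.0, None)
    for xl in grid:
        for xu in (c for c in grid + [float("inf")] if c >= xl):
            fpr = sum(x <= xl or x > xu for x in neg) / len(neg)
            if fpr <= t:
                se = sum(y <= xl or y > xu for y in pos) / len(pos)
                if se > best[0]:
                    best = (se, (xl, xu))
    return best


# Toy data where positives sit in both tails (illustrative values)
se, (xl, xu) = groc_best_subset([4, 5, 6], [1, 2, 8, 9], t=0.0)
```

For the toy data, the central negatives are excluded by the pair of cut-offs, whereas no single threshold rule \((c, \infty)\) could separate the two tails.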
Although final decisions are based on the underlying classification subsets, these are typically not depicted. This omission is not a shortcoming in standard cases, since for each specificity \(1-t \in [0,1]\) there is only one rule of the form \(s_t = (c_t, \infty)\) with such specificity. In particular, \(s_t\) is univocally defined by \(c_t = F_\chi^{-1}(1-t)\), and the same applies if we fix a sensitivity. Nevertheless, if the gROC curve is taken, there are infinitely many subsets of the form \(s_t=(-\infty, x_t^L] \cup (x_t^U, \infty)\) resulting in \(\mathbb{P}(\chi \in s_t) = t\). This loss of univocity underlines the importance of reporting (numerically and/or graphically) the decision rules actually proposed for classification. This gap is covered by the presented package.
An alternative approach to assess the classification performance of a marker involves considering a transformation of it. This transformation \(h(\cdot)\) aims to capture differences in distribution between the two populations in the ROC sense. Once \(h(\cdot)\) is identified, the standard ROC curve for \(h(X)\) is represented, resulting in the efficient ROC (eROC) curve (Kauppi 2016). Arguing as before, for a fixed specificity, the classification subsets \(s_t=(c_t, \infty)\) in the transformed space are univocally defined, where a subject is classified as positive if \(h(x) \in s_t\) and negative otherwise (with \(x\) representing its marker value). However, they may have any shape in the original space, depending on the monotonicity of the functional transformation \(h(\cdot)\) (Martı́nez-Camblor et al. 2019). Emphasizing the importance of tracking the decision rules underlying the eROC curve, this monitoring process enables an assessment of whether the improved accuracy of the marker justifies the potential loss in interpretability.
The ROC curve is defined for classification accuracy evaluation of univariate markers. To deal with multivariate markers, the usual practice is to consider a transformation \(\boldsymbol{h}(\cdot)\) to reduce them to a univariate one, and then to construct the standard ROC curve. The same considerations as before apply when a functional transformation is taken. In the proposed R library, we consider methods from the literature to define and estimate \(\boldsymbol{h}(\cdot)\) in the multivariate case (Kang et al. 2016; Meisner et al. 2021).
Focusing on the classification subsets underlying the decision rules, the movieROC package incorporates methods to visualize the construction process of ROC curves by presenting the classification accuracy of these subsets. For univariate markers, the library includes both the classical (standard ROC curve) and the generalized (gROC curve) approach. In addition, it enables the display of decision rules for various transformations of the marker, seeking to maximize performance and allowing for flexibility in the final shape of the subsets (eROC curve). For multidimensional markers, the proposed tool visualizes the evolution of decision subsets when different objective functions are employed for optimization, even when imposing restrictions on the underlying regions. In this case, displaying the decision rules associated with every specificity in a single static image is no longer feasible. Therefore, dynamic representations (videos) are implemented, drawing on time as an extra dimension to capture the variation in specificity.
Much software available in R could be discussed here covering diverse topics related to ROC curves: the pROC package is a main reference, including tools to visualize, estimate and compare ROC curves (Robin et al. 2011); ROCnReg explicitly considers covariate information to estimate the covariate-specific and the covariate-adjusted ROC curves (Rodrı́guez-Álvarez and Inácio 2021); smoothROCtime implements smooth estimation of time-dependent ROC curves based on the bivariate kernel density estimator for \((X, \textit{time-to-event})\) (Dı́az-Coto 2020); OptimalCutpoints includes point and interval estimation methods for optimal thresholds (López-Ratón et al. 2014); and nsROC performs non-standard analysis such as gROC estimation (Pérez-Fernández et al. 2018); among others.
This paper introduces and elucidates the diverse functionalities of the newly developed movieROC package, aimed at facilitating the visualization and comprehension of the decision rules underlying the binary classification process, encompassing various generalizations. Despite the availability of numerous R packages implementing related analyses, we have identified the main gaps covered by this library: tracking the decision rules underlying the ROC curve, including multivariate markers and non-standard (i.e., non-monotonic) scenarios. The rest of the paper is structured as follows. In Section 2, we introduce the main R functions and objects implemented, and briefly explain the dataset employed throughout this manuscript to demonstrate the utility of the R library. Section 3 is devoted to reconsidering the definition of the standard ROC curve from the perspective of classification subsets, including an extension to multivariate scenarios. Sections 4 and 5 revisit the gROC curve and the eROC curve, respectively, covering various methods to capture the potential classification accuracy of the marker under study. Each of these sections begins with a state-of-the-art overview, followed by the main syntax of the corresponding R functions. In addition, examples of implementation using the dataset presented in Section 2.3 are provided. Finally, the paper concludes with a concise summary and computational details regarding the implemented tool.
Sections 2.1 and 2.2 provide a detailed description of the main objectives of the implemented R functions. To reflect the practical usage of the developed R package, we employ a real dataset throughout this manuscript, which is introduced in Section 2.3.
A graphical tool was developed to showcase static and dynamic graphics displaying the classification subsets derived from maximizing diagnostic accuracy under certain assumptions, ensuring the preservation of interpretability. The R package facilitates the construction of the ROC curve across various specificities, providing visualizations of the resulting classification regions. The proposed tool comprises multiple R functions that generate objects with distinct class attributes (see the function names where red arrows depart from, and the red nodes, in Figure 1, respectively). Once the object of interest is created, different methods may be used to plot the underlying regions (plot_regions(), plot_funregions()), to track the resulting ROC curve (plot_buildROC(), plot()), to predict decision rules for a particular specificity, and to print relevant information, among others. The main function of the package, movieROC(), produces videos to exhibit the classification procedure.

Figure 1: R functions of the movieROC package. The blue nodes include the names of the R functions and the red nodes indicate the different R objects that can be created and worked with. The red arrows depart from those R functions engaged in creating R objects and the black arrows indicate which R functions can be applied to which R objects. The grey dashed arrows show internal dependencies.
The package includes algorithms to visualize the regions that underlie the binary classification problem, considering different approaches:
make the classification subsets flexible in order to cover non-standard scenarios, by considering two cut-off values (gROC() function); explained in Section 4;
transform the marker by a proper function \(h(\cdot)\) (hROC() function); introduced in Section 5;
when dealing with multivariate markers, consider a functional transformation with some fixed or dynamic parameters resulting from different methods available in the literature (multiROC() function); covered in Section 3.1.
By using the gROC(), the multiROC() or the hROC() function, the user obtains an R object of class ‘groc’, ‘multiroc’ or ‘hroc’, respectively. These will be called movieROC objects.
Once the object of interest is created, the implemented package includes many functions (methods) that can be applied to it. Some of them are generic methods (print(), plot() and predict()), commonly used in the R language on different objects according to their class attributes. The rest of the functions are specific to this library and therefore only applicable to movieROC objects.
The following outline summarizes all these functions and provides their purpose and main syntax (with default input parameters).
print(): Print some relevant information.
plot(): Plot the ROC curve estimate.
predict(): Print the classification subsets corresponding to a particular false-positive rate (FPR) introduced by the user. For a ‘groc’ object, the user may specify a cut-off value C (for the standard ROC curve) or two cut-off values XL and XU (for the gROC curve).
plot_regions(): Applicable to a ‘groc’ or a ‘hroc’ object. Plot two graphics in the same figure: left, classification subsets for each false-positive rate (grey color by default); right, \(90^\circ\) rotated ROC curve. Main syntax:
plot_regions(obj, plot.roc = TRUE, plot.auc = FALSE, FPR = 0.15, ...)
If the input parameter FPR is specified, the classification region reporting such a false-positive rate and the corresponding point on the ROC curve are highlighted in blue.
plot_funregions(): Applicable to a ‘groc’ or a ‘hroc’ object. Plot the transforming function and the classification subsets reporting the false-positive rate(s) indicated in the input parameter(s) FPR and FPR2. Main syntax:
plot_funregions(obj, FPR = 0.15, FPR2 = NULL, plot.subsets = TRUE, ...)
plot_buildROC(): Applicable to a ‘groc’ or a ‘multiroc’ object.
‘groc’ object: Plot four (if input reduce is FALSE) or two (if reduce is TRUE, only those on top) graphics in the same figure: top-left, density function estimates for the marker in both populations, with the areas corresponding to FPR and TPR colored (blue and red, respectively) for the optional input parameter FPR, C or XL, XU; top-right, the empirical ROC curve estimate; bottom-left, boxplots for both groups; bottom-right, classification subsets for every FPR (grey color). Main syntax:
plot_buildROC(obj, FPR = 0.15, C, XL, XU, h = c(1,1),
              histogram = FALSE, breaks = 15, reduce = TRUE,
              build.process = FALSE, completeROC = FALSE, ...)
If build.process is FALSE, the whole ROC curve is displayed; otherwise, if completeROC is TRUE, the portion of the ROC curve up to the fixed FPR is highlighted in black and the rest is shown in gray, while if completeROC is FALSE, only the first portion of the curve is displayed.
‘multiroc’ object: Plot two graphics in the same figure: right, the ROC curve highlighting the point and the threshold for the resulting univariate marker; left, scatterplot of the marker values for both positive (red color) and negative (blue color) subjects. About the left graphic: for \(p=2\), over the original/feature bivariate space; for \(p>2\), projected over two selected components of the marker (if display.method = "OV", with component selection in displayOV, c(1,2) by default) or the first two principal components from PCA (if display.method = "PCA", the default). The classification subset reporting the FPR selected by the user (FPR \(\neq\) NULL) is displayed in gold color. Main syntax for \(p=2\):
plot_buildROC(obj, FPR = 0.15,
              build.process = FALSE, completeROC = TRUE, ...)
If build.process is FALSE, the whole ROC curve is displayed; otherwise, if completeROC is TRUE, the portion of the ROC curve up to the fixed FPR is highlighted in black and the rest is shown in gray, while if completeROC is FALSE, only the first portion of the curve is shown.
movieROC(): Applicable to a ‘groc’ or a ‘multiroc’ object. Save a video as a GIF illustrating the construction of the ROC curve.

‘groc’ object:

movieROC(obj, fpr = NULL,
+ h = c(1,1), histogram = FALSE, breaks = 15,
+ reduce = TRUE, completeROC = FALSE, videobar = TRUE,
+ file = "animation1.gif", ...)

For each element of the vector fpr (an optional input parameter), the function executed is plot_buildROC(obj, FPR = fpr[i], build.process = TRUE, ...). The vector of false-positive rates illustrated in the video is NULL by default: if the length of the output parameter t of the gROC() function is lower than 150, that vector is taken as fpr; otherwise, an equally spaced vector of length 100 covering the range of the marker values is used.

‘multiroc’ object: Main syntax for \(p=2\):

movieROC(obj, fpr = NULL,
+ file = "animation1.gif", save = TRUE,
+ border = TRUE, completeROC = FALSE, ...)

The video is saved by default as a GIF with the name indicated in the argument file (the extension .gif should be included). A border for the classification subsets is drawn by default. For each element of the vector fpr (an optional input parameter), the function executed is, for \(p=2\),

plot_buildROC(obj, FPR = fpr[i], build.process = TRUE, completeROC, ...)

and, for \(p>2\),

plot_buildROC(obj, FPR = fpr[i], build.process = TRUE, completeROC,
+ display.method, displayOV, ...)

The same considerations about the input fpr apply as for movieROC() over a ‘groc’ object.
In order to illustrate the functionality of our R package, we consider the HCC data. This dataset derives from gene expression arrays of tumor and adjacent non-tumor tissues of 62 Taiwanese cases of hepatocellular carcinoma (HCC). The goal of the original study (Shen et al. 2012) was to identify, with a genome-wide approach, additional genes hypermethylated in HCC that could be used for more accurate analysis of plasma DNA for early diagnosis, using Illumina methylation arrays (Illumina, Inc., San Diego, CA) that screen 27,578 autosomal CpG sites. The complete dataset was deposited in NCBI’s Gene Expression Omnibus (GEO) and is available through series accession number GSE37988 (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37988). It is included in the presented package (the HCC dataset), selecting the 948 genes with complete information.
The following code loads the R package and the HCC dataset (see the vignette for its main structure).

R> library(movieROC)
R> data(HCC)

We selected the genes 20202438, 18384097, and 03515901. On the one hand, we chose gene 03515901 as an example of a monotone relationship between the marker and the response, reporting a good ROC curve. On the other hand, relative gene expression intensities of genes 20202438 and 18384097 tend to be more extreme in tissues with tumor than in those without it. These are non-standard cases, so if we limited ourselves to detecting “appropriate” genes on the basis of the standard ROC curve, they would not be chosen. However, extending the decision rules by means of the gROC curve, those genes may be considered as potential biomarkers (locations) to discriminate between the two groups. The R code estimating and displaying the probability density function for gene expression intensities of the selected genes in each group (Figure 2) is included in the vignette.
Figure 2: Density histograms and kernel density estimates (lighter) for gene expression intensities of genes 20202438, 18384097 and 03515901 in negative (non-tumor) and positive (tumor) tissues.
Assuming that there exists a monotone relationship between the marker and the response, the regular, right-sided or standard ROC curve associated with the marker \(X\) considers classification subsets of the form \(s_t=(c_t,\infty)\). For each specificity \(1-t=\mathbb{P}(\chi \notin s_t) \in [0,1]\), also called the true-negative rate, there exists only one subset \(s_t\) reporting such specificity and thus a particular sensitivity, also called the true-positive rate, \(\mathbb{P}(\xi \in s_t)\). This results in a simple correspondence between each point of the ROC curve \(\mathcal{R}_r(t) = 1-F_\xi \big(F_\chi^{-1}(1-t)\big)\) and its associated classification region \(s_t \in \mathcal{I}_r(t)\), where
\[\mathcal{I}_r(t) = \Big\{ s_t = (c_t, \infty) : c_t \in \mathcal{S}(X) , \mathbb{P}(\chi \in s_t) = t \Big\}\]
is the right-sided family of eligible classification subsets. The definition of this family captures the shape of the decision rules and the target specificity.
If higher values of the marker are associated with a higher probability of not having the characteristic (see gene 03515901 in Figure 2), the ROC curve would be defined by the left-sided family of eligible classification subsets (Martı́nez-Camblor et al. 2017), \(\mathcal{I}_l(t)\), analogous to \(\mathcal{I}_r(t)\) but with subsets of the form \(s_t = (-\infty, c_t]\). This results in \(\mathcal{R}_l(t) = F_\xi \big(F_\chi^{-1}(t) \big)\), \(t\in[0,1]\), and the decision rules are also univocally defined in this case.
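The two one-sided constructions above can be mimicked in a few lines of base R. The sketch below is only an illustration of the threshold-to-\((FPR, TPR)\) correspondence; the function name emp_roc and the trapezoidal AUC computation are ours, not part of movieROC.

```r
# Empirical one-sided ROC: R_r(t) = 1 - F_xi(F_chi^{-1}(1 - t)) (right-sided)
# chi: marker in the negative group; xi: marker in the positive group
emp_roc <- function(chi, xi, side = "right", t = seq(0, 1, by = 0.01)) {
  if (side == "right") {
    c_t <- quantile(chi, probs = 1 - t, type = 1)  # thresholds F_chi^{-1}(1 - t)
    sapply(c_t, function(cc) mean(xi > cc))        # TPR = P(xi in (c_t, Inf))
  } else {
    c_t <- quantile(chi, probs = t, type = 1)      # thresholds F_chi^{-1}(t)
    sapply(c_t, function(cc) mean(xi <= cc))       # TPR = P(xi in (-Inf, c_t])
  }
}

set.seed(1)
chi <- rnorm(100)              # negative sample
xi  <- rnorm(100, mean = 1)    # positive sample
t   <- seq(0, 1, by = 0.01)
roc <- emp_roc(chi, xi, t = t)
# area under the curve by the trapezoidal rule over the FPR grid
auc <- sum(diff(t) * (head(roc, -1) + tail(roc, -1)) / 2)
```

Each grid value of \(t\) yields one threshold and hence one point of the curve, which is exactly the one-to-one correspondence described above.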
The ROC curve and related problems have been widely studied in the literature; interested readers are referred to the monographs of Zhou et al. (2002), Pepe (2003), and Nakas et al. (2023), as well as the review by Inácio et al. (2021). By definition, the ROC curve is confined within the unit square, with optimal performance achieved when it approaches the upper-left corner (AUC close to 1). Conversely, proximity to the main diagonal (AUC close to 0.5) indicates diminished discriminatory ability, resembling a random classifier.
In practice, let \((\xi_1, \xi_2, \dots, \xi_n)\) and \((\chi_1, \chi_2, \dots, \chi_m)\) be two independent and identically distributed (i.i.d.) samples from the positive and the negative population, respectively. Different estimation procedures are implemented in the movieROC package, such as the empirical estimator (Hsieh and Turnbull 1996) (the default in the gROC() function), accompanied by its summary indices: the AUC and the Youden index (Youden 1950). Alternatively, semiparametric approaches based on kernel density estimation of the involved distributions may be considered (Zou et al. 1997). The plot_densityROC() function provides plots for both right- and left-sided ROC curves estimated by this method. On the other hand, assuming that the marker follows a Gaussian distribution in both populations, that is, \(\xi \sim \mathcal{N}(\mu_\xi, \sigma_\xi)\) and \(\chi \sim \mathcal{N}(\mu_\chi, \sigma_\chi)\), parametric approaches provide plug-in estimators by estimating the unknown parameters of the known distributions (Hanley 1988). This parametric estimation is included in the gROC_param() function, which works similarly to gROC().
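Under the Gaussian assumption just mentioned, the plug-in estimator has a closed form, \(\mathcal{R}(t) = \Phi\big(a + b\,\Phi^{-1}(t)\big)\) with \(a=(\hat\mu_\xi-\hat\mu_\chi)/\hat\sigma_\xi\) and \(b=\hat\sigma_\chi/\hat\sigma_\xi\). The sketch below illustrates this binormal formula; the name binormal_roc is ours and is not the gROC_param() implementation.

```r
# Plug-in binormal ROC: R(t) = pnorm(a + b * qnorm(t)), with
# a = (mean(xi) - mean(chi)) / sd(xi) and b = sd(chi) / sd(xi)
binormal_roc <- function(chi, xi, t = seq(0, 1, length.out = 101)) {
  a <- (mean(xi) - mean(chi)) / sd(xi)
  b <- sd(chi) / sd(xi)
  pnorm(a + b * qnorm(t))
}

set.seed(2)
chi <- rnorm(50)               # negative sample
xi  <- rnorm(50, mean = 1)     # positive sample
roc <- binormal_roc(chi, xi)
# closed-form AUC under binormality: pnorm(a / sqrt(1 + b^2))
a   <- (mean(xi) - mean(chi)) / sd(xi)
b   <- sd(chi) / sd(xi)
auc <- pnorm(a / sqrt(1 + b^2))
```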
Main syntax:

gROC(X, D, side = "right", ...)
gROC_param(X, D, side = "right", ...)

Table 1 in the vignette provides the main input and output parameters of these R functions, which estimate the regular ROC curve (right-sided or left-sided with side = "right" or "left", respectively) and the associated decision rules. Their output is an R object of class ‘groc’, to which the functions listed in Section 2.2 can be applied. Most of them are visualization tools, but the user may also print() summary information and predict() classification regions for a particular specificity.
Figure 3 graphically represents the empirical estimation of the standard (gray line) and generalized (black line) ROC curves for each gene in Figure 2. To construct the standard ROC curve, the right-sided curve is considered for the first two genes (20202438 and 18384097) and the left-sided curve for the third one (03515901). As expected from the discussion of Figure 2, the standard and gROC curves are similar for the third gene because there exists a monotone relationship between the marker and the response. However, these curves differ for the first two genes due to the lack of monotonicity in those scenarios. The empirical gROC curve estimator is explained in detail in Section 4.

The next chunk of code generates the figure, providing an example of the use of the gROC() function and plot(), and of how to access the AUC.
R> for(gene in c("20202438", "18384097", "03515901")){
+ roc <- gROC(X = HCC[,paste0("cg",gene)], D = HCC$tumor,
+ side = ifelse(gene == "03515901", "left", "right"))
+ plot(roc, col = "gray50", main = paste("Gene", gene), lwd = 3)
+ groc <- gROC(X = HCC[,paste0("cg",gene)], D = HCC$tumor, side = "both")
+ plot(groc, new = FALSE, lwd = 3)
+ legend("bottomright", paste(c("AUC =", "gAUC ="), format(c(roc$auc, groc$auc),
+ digits = 3)), col = c("gray50", "black"), lwd = 3, bty = "n", inset = .01)}
Figure 3: Empirical estimates of the standard ROC curve (in gray) and the gROC curve (in black) for the capacity of genes 20202438, 18384097 and 03515901 to discriminate between the tumor and non-tumor groups.
The following code snippet estimates the standard ROC curve for gene 20202438, prints its basic information, and predicts the classification region and the sensitivity resulting in a specificity of 0.9, providing an illustrative example of the print() and predict() functions.
R> roc_selg1 <- gROC(X = HCC$cg20202438, D = HCC$tumor, side = "right")
R> roc_selg1

Data was encoded with nontumor (controls) and tumor (cases).
It is assumed that larger values of the marker indicate larger confidence that a
 given subject is a case.
There are 62 controls and 62 cases.
The specificity and sensitivity reported by the Youden index are 0.855 and 0.403,
 respectively, corresponding to the following classification subset: (0.799, Inf).
The area under the right-sided ROC curve (AUC) is 0.547.

R> predict(roc_selg1, FPR = .1)

$ClassSubsets
[1] 0.8063487       Inf

$Specificity
[1] 0.9032258

$Sensitivity
[1] 0.3064516

The following line of code displays the whole construction of the empirical standard ROC curve for gene 20202438. The video is saved by default as a GIF with the name provided.

R> movieROC(roc_selg1, reduce = FALSE, file = "StandardROC_gene20202438.gif")
In practice, many cases may benefit from combining information from different markers to enhance classification accuracy. Rather than assessing univariate markers separately, taking the multivariate marker resulting from merging them can yield a relevant gain. However, note that the ROC curve and related indices are defined only for univariate markers, as they require the existence of a total order. To address this limitation, a common approach involves transforming the \(p\)-dimensional multivariate marker \(\boldsymbol{X}\) into a univariate one through a functional transformation \(\boldsymbol{h}: \mathbb{R}^p \longrightarrow \mathbb{R}\). This transformation \(\boldsymbol{h}(\cdot)\) seeks to optimize an objective function related to classification accuracy, usually the AUC (Su and Liu 1993; McIntosh and Pepe 2002; Martı́nez-Camblor et al. 2019).
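As a concrete instance of such a transformation, the Su and Liu (1993) linear combination has a closed form under multivariate normality. The sketch below illustrates the idea only; the helper suliu_coef is ours and independent of the package's multiROC(method = "fixedLinear", methodLinear = "SuLiu") implementation.

```r
# Su & Liu (1993): the AUC-optimal linear combination under multivariate
# normality is beta = (Sigma_pos + Sigma_neg)^{-1} (mu_pos - mu_neg)
suliu_coef <- function(Xpos, Xneg) {
  beta <- solve(cov(Xpos) + cov(Xneg), colMeans(Xpos) - colMeans(Xneg))
  beta / sqrt(sum(beta^2))      # scale is irrelevant for the ROC curve
}

set.seed(3)
Xneg <- matrix(rnorm(200), ncol = 2)              # negatives, p = 2
Xpos <- matrix(rnorm(200, mean = 0.8), ncol = 2)  # positives, shifted mean
beta <- suliu_coef(Xpos, Xneg)
score_pos <- as.vector(Xpos %*% beta)             # combined univariate marker
score_neg <- as.vector(Xneg %*% beta)
# empirical AUC of the combined marker (Mann-Whitney form)
auc <- mean(outer(score_pos, score_neg, ">"))
```

The resulting univariate scores can then be analyzed with the same one-sided ROC machinery as any original marker.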
We enumerate the methods included in the presented R tool via the multiROC() function (through the input parameter method), listed according to the objective function to be optimized. Recall that the output of multiROC() is an object of class ‘multiroc’, containing information about the estimation of the ROC curve and the classification subsets in multivariate scenarios. Table 2 in the vignette describes the usage of this function.
Main syntax:

multiROC(X, D, method = "lrm",
+ formula = 'D ~ X.1 + I(X.1^2) + X.2 + I(X.2^2) + I(X.1*X.2)', ...)

The AUC: Linear combinations with fixed parameters are obtained with the multiROC() function by fixing the input parameters method = "fixedLinear" and methodLinear to one of "SuLiu" (Su and Liu 1993), "PepeThompson" (Pepe and Thompson 2000), or "minmax" (Liu et al. 2011). The R function also admits quadratic combinations when \(p=2\), i.e. \(\mathcal{Q}_{\boldsymbol{\beta}}(\boldsymbol{X}) = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \beta_4 X_1^2 + \beta_5 X_2^2\), by fixing method = "fixedQuadratic" for a particular coefQuadratic \(= \boldsymbol{\beta} = (\beta_1, \dots, \beta_5)\).

The risk score function logit\(\left\{ \mathbb{P}(D = 1 \, | \, \boldsymbol{X}) \right\}\): Our package allows the user to fit a logistic regression model (method = "lrm") considering any family of functions (linear or quadratic, with or without interactions, ...) by means of the input parameter formula.lrm. A stepwise regression model is fitted if stepModel = TRUE. Details are given in Section 5.
The sensitivity for a particular specificity: The multiROC() function integrates the estimation procedures for the bandwidth matrix developed by Duong (2007), by fixing method = "kernelOptimal" and choosing a proper method to estimate the bandwidth (kernelOptimal.H). A fully empirical search for dynamic frontiers is also available (method = "dynamicEmpirical", only implemented for \(p=2\)), but it may result in overfitting; instead, the method of Meisner et al. (2021) is recommended (method = "dynamicMeisner").

Once the classification subsets for a multivariate marker have been constructed by the multiROC() function, several R methods may be applied to the output object (see Section 2.2). They include printing relevant information and plotting the resulting ROC curve. The main contribution of the package is to plot the construction of the ROC curve together with the classification subsets, either in a static figure for a particular FPR (the plot_buildROC() function) or in a video tracking the whole process (the movieROC() function).
Figure 4 illustrates the videos resulting from the movieROC() function. In particular, the classification accuracy of the bivariate marker (cg20202438, cg18384097) was studied using four different approaches, indicated in the captions, considering linear combinations (top) and nonlinear transformations (bottom). This figure was generated by the code below, integrating the multiROC() and movieROC() functions. Four videos are saved as GIF files with the names "PepeTh.gif" (a), "Meisner.gif" (b), "LRM.gif" (c), and "KernelDens.gif" (d).
R> X <- HCC[ ,c("cg20202438", "cg18384097")]; D <- HCC$tumor
R> biroc_12_PT <- multiROC(X, D, method = "fixedLinear", methodLinear = "PepeThompson")
R> biroc_12_Meis <- multiROC(X, D, method = "dynamicMeisner", verbose = TRUE)
R> biroc_12_lrm <- multiROC(X, D)
R> biroc_12_kernel <- multiROC(X, D, method = "kernelOptimal")
R> list_biroc <- list(PepeTh = biroc_12_PT, Meisner = biroc_12_Meis,
+ LRM = biroc_12_lrm, KernelDens = biroc_12_kernel)
R> lapply(names(list_biroc), function(x) movieROC(list_biroc[[x]],
+ display.method = "OV", xlab = "Gene 20202438", ylab = "Gene 18384097",
+ cex = 1.2, alpha.points = 1, lwd.curve = 4, file = paste0(x, ".gif")))
(a) Linear combinations with fixed parameters by Pepe and Thompson (2000).
(b) Linear combinations with dynamic parameters by Meisner et al. (2021).
(c) Logistic regression model with the quadratic formula by default (see formula.lrm in Table 2 of the vignette).
(d) Optimal transformation by multivariate kernel density estimation with the "Hbcv" method by default (Martı́nez-Camblor et al. 2021a).

Figure 4: Videos (from the movieROC() function) of the classification procedure and ROC curve for the bivariate marker (cg20202438, cg18384097). Four different methods for classification are displayed.
When the marker has dimension higher than two, it is difficult to visualize the data and the classification regions. Therefore, the movieROC() function offers two options for displaying the results, both on a two-dimensional space. One is to choose two of the components of the multivariate marker and project the classification subsets on the plane they define (Figure 5, middle). The other is to project the classification regions on the plane defined by the first two principal components (Figure 5, left). The R function prcomp() from stats is used to perform Principal Component Analysis (PCA) (Hotelling 1933).
Figure 5 shows the difficulty of displaying the decision rules when \(p>2\) (here, the 3 genes used throughout this manuscript), even with the two options implemented in our package. It was generated using multiROC() and plot_buildROC():
R> multiroc_PT <- multiROC(X = HCC[ ,c("cg20202438", "cg18384097", "cg03515901")],
+ D = HCC$tumor, method = "fixedLinear", methodLinear = "PepeThompson")
R> multiroc_PT

Data was encoded with nontumor (controls) and tumor (cases).
There are 62 controls and 62 cases.
A total of 3 variables have been considered.
A linear combination with fixed parameters estimated by PepeThompson approach has
 been considered.
The specificity and sensitivity reported by the Youden index are 0.855 and 0.742,
 respectively, corresponding to the cut-off point -0.0755 for the transformation
 h(X) = 0.81*cg20202438 - 0.1*cg18384097 - 1*cg03515901.
The area under the ROC curve (AUC) is 0.811.

R> plot_buildROC(multiroc_PT, cex = 1.2, lwd.curve = 4)
R> plot_buildROC(multiroc_PT, display.method = "OV", displayOV = c(1,3), cex = 1.2,
+ xlab = "Gene 20202438", ylab = "Gene 03515901", lwd.curve = 4)
Figure 5: Multivariate ROC curve estimation for the simultaneous diagnostic accuracy of genes 20202438, 18384097 and 03515901. The Pepe and Thompson (2000) approach was used to estimate the linear coefficients, and the classification rules (yellow and gray borders for the positive and negative class, respectively) for an FPR of 0.15 are displayed. Left, projected over the two principal components from PCA; middle, over the 1st and the 3rd selected genes.
There are scenarios whose standard ROC curves are not concave (the first two genes in Figure 3, gray solid line), reflecting that the initial assumption of a monotone relationship between the marker and the response is misleading. In Figure 2, we may see that the difference in the gene 20202438 distribution between the tissues which have the characteristic and those which do not lies mainly in the dispersion. To accommodate this common type of scenario, Martı́nez-Camblor et al. (2017) extended the ROC curve definition to the case where both extremes of the marker values are associated with a higher risk of having the characteristic of interest, by considering the both-sided family of eligible classification subsets:
\[\mathcal{I}_g(t) = \Big\{ s_t = (-\infty,x_t^L] \cup (x_t^U, \infty) : x_t^L \leq x_t^U \in \mathcal{S}(X) , \mathbb{P}(\chi \in s_t) = t \Big\}.\]

It becomes crucial to consider the supremum in the definition of the generalized ROC curve because the decision rule for each \(t \in [0,1]\) is not univocally defined: there exist infinitely many pairs \(x_t^L \leq x_t^U\) reporting a specificity of \(1-t\) (i.e. \(\mathcal{I}_g(t)\) is uncountably infinite). Computationally, this optimization incurs a time-consuming estimation, depending on the number of distinct marker values in the sample.
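A brute-force version of this pointwise supremum illustrates why the estimation is costly: for a single \(t\), every admissible pair \((x_t^L, x_t^U)\) must be scanned. The sketch below is only practical for small samples; the helper g_sens is ours, and the package's estimator uses a far more refined algorithm.

```r
# Brute-force sensitivity at a fixed FPR t over both-sided subsets
# s_t = (-Inf, xL] U (xU, Inf), with xL <= xU and P(chi in s_t) <= t
g_sens <- function(chi, xi, t) {
  cand <- c(-Inf, sort(unique(chi)), Inf)   # candidate cut points
  best <- 0
  for (xL in cand) for (xU in cand[cand >= xL]) {
    fpr <- mean(chi <= xL | chi > xU)       # P(chi in s_t)
    if (fpr <= t) best <- max(best, mean(xi <= xL | xi > xU))
  }
  best
}

set.seed(4)
chi <- rnorm(40)           # negatives concentrate in the center
xi  <- rnorm(40, sd = 3)   # positives are extreme on both tails
s15 <- g_sens(chi, xi, t = 0.15)
s05 <- g_sens(chi, xi, t = 0.05)
```

Even for 40 negatives, roughly \(40^2/2\) pairs are examined per value of \(t\), which is why the package's optimized search matters in practice.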
After the introduction of this extension, several studies followed regarding the estimation of the gROC curve (Martı́nez-Camblor et al. 2017; Martı́nez-Camblor and Pardo-Fernández 2019a) and related measures such as its area (gAUC) (Martı́nez-Camblor et al. 2021b) and the Youden index (Martı́nez-Camblor and Pardo-Fernández 2019b; Bantis et al. 2021). Under this generalization, another property of the classification subsets may be lost: the regions may not be nested as the false-positive rate increases. It may happen that a subject is classified as positive for a particular FPR \(t_1\), but as negative for a higher FPR \(t_2\). Therefore, it is natural to impose a restriction (C) on the classification subsets, ensuring that any subject classified as positive for a fixed specificity (or sensitivity) will also be classified as positive for any classification subset with lower specificity (higher sensitivity). Pérez-Fernández et al. (2021) proposed an algorithm to estimate the gROC curve under restriction (C), included in the gROC() function of the presented R package. See the final Section for computational details about the implementation of this algorithm.
Main syntax:

gROC(X, D, side = "both", ...)
gROC_param(X, D, side = "both", ...)

Table 1 in the vignette collects the input and output parameters of the gROC() function, which estimates the gROC curve both in the direction mentioned above (side = "both") and in the opposite one, i.e. when classification subsets of the form \(s_t=(x_t^L, x_t^U]\) are considered (side = "both2"). In addition, all the methods for a ‘groc’ object collected in Section 2.2 may be used in this general scenario.
In the following, the diagnostic accuracy of the gene 20202438 expression intensity is evaluated by the gROC curve without restrictions (the groc_selg1 object) and under restriction (C) (the groc_selg1_C object). The classification subsets and the sensitivity for a specificity of \(0.9\) are displayed with the predict() function.
R> groc_selg1 <- gROC(X = HCC$cg20202438, D = HCC$tumor, side = "both")
R> predict(groc_selg1, FPR = .1)

$ClassSubsets
          [,1]      [,2]
[1,]      -Inf 0.7180623
[2,] 0.8296072       Inf

$Specificity
[1] 0.9032258

$Sensitivity
[1] 0.4032258

R> groc_selg1_C <- gROC(X = HCC$cg20202438, D = HCC$tumor, side = "both",
+ restric = TRUE, optim = TRUE)

All the classification regions underlying the standard and the generalized ROC curves, without and with restrictions, are represented in Figure 6. The following code was used to generate the figure, illustrating the usage and output of the plot_regions() function. Besides displaying the classification regions underlying every specificity (in gray), the one chosen by the user (FPR = 0.15 by default) is highlighted in blue. Note that the ROC curves are rotated \(90^\circ\) to the right in order to use the vertical axis for the FPR in both plots.
R> plot_regions(roc_selg1, cex.legend = 1.5, plot.auc = TRUE,
+ main = "Standard right-sided assumption [Classification subsets]")
R> plot_regions(groc_selg1, plot.auc = TRUE, legend = F,
+ main.plotroc = "gROC curve",
+ main = "General approach [Classification subsets]")
R> plot_regions(groc_selg1_C, plot.auc = TRUE, legend = F,
+ main.plotroc = "gROC curve",
+ main = "General approach with restriction (C) [Classific. subsets]",
+ xlab = "Gene 20202438 expression intensity")

++Figure 6: Classification regions and the ROC curve (90º rotated) for evaluation of gene 20202438 expression intensity assuming i) standard scenario (top), ii) generalized scenario without restrictions (middle), iii) generalized scenario under restriction (C) over the subsets (bottom). +
The gain achieved by considering the generalized scenario for this marker, which better fits its distribution in each group, is clear. The standard estimated AUC is 0.547, while the gAUC increases to 0.765. The gAUC is not especially affected by imposing restriction (C), resulting in 0.762.
By keeping classification subsets of the form \(s_t = (c_t, \infty)\), an alternative approach can be explored: transforming the univariate marker through a suitable function \(h: \mathbb{R} \longrightarrow \mathbb{R}\) to enhance its accuracy. Henceforth, the transformation \(h^*(\cdot)\) reporting the ROC curve that dominates the one from any other function (i.e. \(\mathcal{R}_{h^*}(\cdot) \geq \mathcal{R}_h(\cdot)\)) will be referred to as the optimal transformation (in the ROC sense), and the resulting ROC curve is called the eROC curve (Kauppi 2016). Following the well-known Neyman–Pearson lemma, McIntosh and Pepe (2002) proved that \(h^*(\cdot)\) is the likelihood ratio.
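The likelihood-ratio transformation can be approximated directly from kernel density estimates. The sketch below is a rough analogue of the idea behind hROC(type = "kernel"); the helper lr_transform is ours, with an arbitrary default bandwidth, and is not the package's implementation.

```r
# Kernel estimate of the likelihood ratio h*(x) = f_xi(x) / f_chi(x),
# the AUC-optimal transformation by the Neyman-Pearson argument
lr_transform <- function(chi, xi, x) {
  dxi  <- density(xi)    # Gaussian kernel, default bandwidth (nrd0)
  dchi <- density(chi)
  f_xi  <- approx(dxi$x,  dxi$y,  xout = x, rule = 2)$y
  f_chi <- approx(dchi$x, dchi$y, xout = x, rule = 2)$y
  f_xi / pmax(f_chi, .Machine$double.eps)   # guard against division by ~0
}

set.seed(5)
chi <- rnorm(60)           # negatives: narrow distribution
xi  <- rnorm(60, sd = 3)   # positives: heavy on both tails
x   <- c(-4, 0, 4)
h   <- lr_transform(chi, xi, x)
# h is large on both tails and small near 0, so thresholding h(X) recovers
# a both-sided classification region for the original marker
```

Applying a right-sided ROC analysis to the transformed marker \(h(X)\) then yields both-sided decision rules on the original scale.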
We enumerate the methods included in the presented R tool via the hROC() function (through the input parameter type), listed according to the procedure considered to estimate \(h^*(\cdot)\). The output of this function is an object of class ‘hroc’. See Table 3 in the vignette for function usage and output details.
Main syntax:

hROC(X, D, type = "lrm", formula.lrm = 'D ~ pol(X,3)', ...)

A logistic regression model is fitted by fixing type = "lrm" and defining the function \(h(\cdot)\) through formula.lrm. For the likelihood-ratio estimation, the user may fix type = "kernel" in the hROC() function and choose a proper bandwidth for the kernel estimation with kernel.h. The overfitted transformation is obtained with type = "overfitting".

The following code and figures study the capacity to improve the classification performance of the gene 18384097 expression intensity via the above functional transformations, and their impact on the final decision rules. The first approach considers an ordinary cubic polynomial formula (hroc_cubic_selg2) and a linear tail-restricted cubic spline (hroc_rcs_selg2) for the right-hand side of the logistic regression model. The second one uses two different bandwidths (\(h=1\) and \(h=3\)) for the density function estimation. For comparative purposes, the last one estimates the gROC curve under restriction (C).
R> X <- HCC$cg18384097; D <- HCC$tumor
R> hroc_cubic_selg2 <- hROC(X, D); hroc_cubic_selg2

Data was encoded with nontumor (controls) and tumor (cases).
There are 62 controls and 62 cases.
A logistic regression model of the form D ~ pol(X,3) has been performed.
The estimated parameters of the model are the following:
 Intercept          X        X^2        X^3
   "1.551"   "32.054" "-120.713"  "100.449"
The specificity and sensitivity reported by the Youden index are 0.935 and 0.532,
 respectively, corresponding to the following classification subset:
 (-Inf, 0.442) U (0.78, Inf).
The area under the ROC curve (AUC) is 0.759.

R> hroc_rcs_selg2 <- hROC(X, D, formula.lrm = "D ~ rcs(X,8)")
R> hroc_lkr1_selg2 <- hROC(X, D, type = "kernel")
R> hroc_lkr3_selg2 <- hROC(X, D, type = "kernel", kernel.h = 3)
R> hroc_overfit_selg2 <- hROC(X, D, type = "overfitting")
R> groc_selg2_C <- gROC(X, D, side = "both", restric = TRUE, optim = TRUE)

The following code snippet compares the AUC achieved by each approach considered above:
R> list_hroc <- list(Cubic = hroc_cubic_selg2, Splines = hroc_rcs_selg2,
+ Overfit = hroc_overfit_selg2, LikRatioEst_h3 = hroc_lkr3_selg2,
+ LikRatioEst_h1 = hroc_lkr1_selg2, gAUC_restC = groc_selg2_C)
R> AUCs <- sapply(list_hroc, function(x) x$auc)
R> round(AUCs, 3)

         Cubic        Splines        Overfit LikRatioEst_h3 LikRatioEst_h1     gAUC_restC
         0.759          0.807          1.000          0.781          0.799          0.836

The shape of the classification regions over the original space \(\mathcal{S}(X)\) depends on the monotonicity of \(h^*(\cdot)\), which may be studied graphically with the plot_funregions() function (see Figure 7). The regions themselves can be visualized with the R function plot_regions() (see Figure 8). Both are explained in Section 2.2 and illustrated below. The next chunk of code produced Figure 7, representing the different functional transformations estimated previously:
R> lapply(list_hroc, function(x) plot_funregions(x, FPR = .15, FPR2 = .5))
Figure 7: Different functional transformations and the resulting classification subsets for gene 18384097. Rules for FPR 0.15 (blue) and 0.50 (red) are highlighted. Top, from left to right: cubic polynomial function, restricted cubic splines (with 8 knots), and overfitted transformation. Bottom: likelihood ratio estimation with bandwidths 3 (left) and 1 (middle), and the transformation resulting in the gROC curve under restriction (C).
Finally, using the plot_regions() function, Figure 8 shows the resulting classification subsets over the original space for the best two of the six methods above. The first method (fitting a logistic regression model with restricted cubic splines with 8 knots) reports an AUC of 0.804 (compared to 0.684 for the standard ROC curve), but the shape of some classification rules is complex, such as \(s_t=(-\infty,a_t] \cup (b_t,c_t] \cup (d_t,\infty)\). This area increases to 0.836 by considering subsets of the form \(s_t=(-\infty,x_t^L] \cup (x_t^U,\infty)\), even when imposing restriction (C) to obtain a functional transformation \(h(\cdot)\).

++Figure 8: Classification regions and the resulting ROC curve (90º rotated) for the gene 18384097. Top, ROC curve for restricted cubic splines transformation with 8 knots; bottom, gROC curve under restriction (C) for the original marker. +
Conducting binary classification with continuous markers requires establishing decision rules. In the standard case, each specificity \(t \in [0,1]\) entails a univocally defined classification subset of the form \(s_t = (c_t,\infty)\). However, in more complex situations, such as when there is a non-monotone relationship between the marker and the response, or in multivariate scenarios, the decision rules are no longer clear-cut, and visualizing them becomes crucial. To address this, the movieROC package incorporates novel visualization tools complementing the ROC curve representation.

This R package offers a user-friendly and easily comprehensible software solution tailored for practical researchers. It implements statistical techniques to estimate, compare, and graphically represent different classification procedures. While several R packages address ROC curve estimation, the one proposed here emphasizes the classification process, tracking the decision rules underlying the binary classification problem under study. The tool incorporates different considerations and transformations which may be useful to capture the potential of a marker to classify in non-standard scenarios. Nevertheless, the library is also useful in standard cases, as well as when the marker itself comes from a classification or regression method (such as support vector machines), because it provides informative visuals and additional information not usually reported with the ROC curve.
The main function of the package, movieROC(), allows users to monitor how the resulting classification subsets change across specificities, thereby building the corresponding ROC curve. Notably, it introduces time as a third dimension to track those specificities, generating informative videos. For interested readers or potential users of movieROC, the manual available on CRAN provides complete information about the implemented functions and their parameters. In addition, a vignette is available, including the mathematical formalism and details about the algorithms implemented.
Some functions of our package depend on other libraries available on CRAN:

gROC(X, D, side = "both", restric = TRUE, optim = TRUE, ...) uses the allShortestPaths() function in the e1071 package (Meyer et al. 2023).

hROC(X, D, type = "lrm", ...) and multiROC(X, D, method = "lrm", ...) use the lrm() function in the rms package (Harrell Jr 2023).

multiROC(X, D, method = "kernelOptimal", ...) uses the kde() function in the ks package (Duong 2023).

multiROC(X, D, method = "dynamicMeisner", ...) uses the maxTPR() function in the maxTPR package (Meisner et al. 2021). That package was removed from the CRAN repository, so we integrated the code of the maxTPR() function into our package. This function uses Rsolnp::solnp() and robustbase::BYlogreg().

multiROC(X, D, method = "fixedLinear", methodLinear, ...) uses the R functions included in the Appendix of Kang et al. (2016). We integrated this code into our package.

movieROC(obj, save = TRUE, ...) uses the saveGIF() function in the animation package (Xie et al. 2021).
Users should be aware of certain limitations when working with this package:

Some methods are potentially time-consuming, especially for medium to large sample sizes:

The estimation of the gROC curve under restriction (C) can be computationally intensive, especially when considering different FPRs to locally optimize the search using gROC(X, D, side = "both", restric = TRUE, optim = TRUE, t0max = TRUE). Note that this method involves a fairly exhaustive search of the self-contained classification subsets leading to the optimal gROC curve estimate. However, even when selecting different false-positive rates \(t_0\) to start from, it may not yield the optimal achievable estimate under restriction (C). The input parameters restric, optim, t0 and t0max of the gROC() function, included in Table 1 of the vignette, serve to control this search.
Similarly, it also occurs for multivariate markers when considering
+linear frontiers with dynamic parameters (by using
+multiROC(X, D, method = "dynamicMeisner" | "dynamicEmpirical")).
Most implemented R functions consider empirical estimation for the
+resulting ROC curve, even if the procedure to estimate the decision
+rules is semi-parametric. An exception is the gROC_param()
+function, which accommodates the binormal scenario.
When visualizing classification regions for multivariate markers
+with high dimension (plot_buildROC() and movieROC() functions
+for a ‘multiroc’ object), our package provides some alternatives,
+but additional improvements could provide further aid in
+interpretation.
The authors acknowledge support from Grants PID2019-104486GB-I00 and PID2020-118101GB-I00 from the Ministerio de Ciencia e Innovación (Spanish Government), and from a financial Grant for Excellence Mobility for lecturers and researchers subsidized by the University of Oviedo in collaboration with Banco Santander.
CRAN packages used: movieROC, pROC, ROCnReg, OptimalCutpoints, nsROC, rms, ks, e1071, animation

CRAN Task Views implied by cited packages: Cluster, Distributions, DynamicVisualizations, Econometrics, Environmetrics, MachineLearning, Psychometrics, ReproducibleResearch, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as

Pérez-Fernández, et al., "movieROC: Visualizing the Decision Rules Underlying Binary Classification", The R Journal, 2026
BibTeX citation

@article{RJ-2025-035,
  author = {Pérez-Fernández, Sonia and Martínez-Camblor, Pablo and Corral-Blanco, Norberto},
  title = {movieROC: Visualizing the Decision Rules Underlying Binary Classification},
  journal = {The R Journal},
  year = {2026},
  note = {https://doi.org/10.32614/RJ-2025-035},
  doi = {10.32614/RJ-2025-035},
  volume = {17},
  issue = {4},
  issn = {2073-4859},
  pages = {59-79}
}