From e4625a08a32f36733dfb3fe028452b4c1e0baaec Mon Sep 17 00:00:00 2001 From: jepusto Date: Thu, 19 Feb 2026 16:43:01 -0600 Subject: [PATCH 1/2] Reviewed LVM edits to Chapter 8. --- 020-Data-generating-models.Rmd | 11 +++---- 030-Estimation-procedures.Rmd | 14 +++++---- 035-running-simulation.Rmd | 52 ++++++++++++++++++---------------- 070-experimental-design.Rmd | 26 +++++++---------- 4 files changed, 51 insertions(+), 52 deletions(-) diff --git a/020-Data-generating-models.Rmd b/020-Data-generating-models.Rmd index 5d05d73..26ee31e 100644 --- a/020-Data-generating-models.Rmd +++ b/020-Data-generating-models.Rmd @@ -254,9 +254,9 @@ ggplot(bvp_dat) + geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + scale_x_continuous(expand = expansion(0,c(0,1)), - breaks = c( 2, 4, 6, 8, 10, 12, 14, 16 )) + + breaks = seq(0,16,2)) + scale_y_continuous(expand = expansion(0,c(0,1)), - breaks = c( 2, 4, 6, 8, 10, 12 )) + + breaks = seq(2,12,2)) + coord_fixed() + geom_point(position = position_jitter(width = 0.075, height = 0.075)) + theme_minimal() @@ -540,9 +540,10 @@ Letting $\delta$ denote the standardized mean difference parameter, $$ \delta = \frac{E(Y | Z_j = 1) - E(Y | Z_j = 0)}{SD( Y | Z_j = 0 )} = \frac{\gamma_1}{\sqrt{ \sigma^2_u + \sigma^2_\epsilon } } $$ Because we have constrained the total variance to 1, $\gamma_1$ is equivalent to $\delta$. This equivalence holds for any value of $\gamma_0$, so we do not have to worry about manipulating $\gamma_0$ in the simulations---we can simply leave it at its default value. -The $\gamma_2$ parameter only impacts outcomes on the treatment side, with larger values induces more variation. Standardizing on something stable (here the control side) rather than something that changes in reaction to various parameters (which would happen if we standardized by overall variation or pooled variation) will lead to more interpretable quantities. 
- - +The $\gamma_2$ parameter only influences outcomes for those in the treatment condition, with larger values inducing more variation. +Standardizing based on the total variance also makes $\gamma_2$ somewhat easier to interpret because this parameter has the same units as $\gamma_1$. +Specifically, $\gamma_2$ is the expected difference in the standardized treatment impact for schools that differ in size by $\bar{n}$ (the overall average school size). +More broadly, standardizing by a stable parameter (the variation among those in the control condition) rather than something that changes in reaction to other parameters (which would be the case if we standardized by overall variation or pooled variation) will lead to more interpretable quantities. ## Sometimes a DGP is all you need {#three-parameter-IRT} diff --git a/030-Estimation-procedures.Rmd b/030-Estimation-procedures.Rmd index b30acaf..e4fb78c 100644 --- a/030-Estimation-procedures.Rmd +++ b/030-Estimation-procedures.Rmd @@ -35,7 +35,7 @@ The easiest approach here is to implement each estimator in turn, and then bundl We describe how to do this in Section \@ref(multiple-estimation-procedures). In Section \@ref(validating-estimation-function), we describe strategies for validating the coded-up estimator before running a full simulation. -We offer a few strategies for validating ones code, including checking against existing implementations, checking theoretical properties, and using simulations to check for bugs. +We offer a few strategies for validating one's code, including checking against existing implementations, checking theoretical properties, and using simulations to check for bugs. For a full simulation, an estimator needs to be reliable, not crashing and giving answers across the wide range of datasets to which it is applied. An estimator will need to be able to handle odd edge cases or pathological datasets that might be generated as we explore a full range of simulation scenarios. 
@@ -120,8 +120,9 @@ Education researchers tend to be more comfortable using multi-level regression m We next develop estimation functions for each of these procedures, focusing on a simple model that does not include any covariates besides the treatment indicator. Each function needs to produce a point estimate, standard error, and $p$-value for the average treatment effect. -To have data to practice on, we generate a sample dataset using [a revised version of `gen_cluster_RCT()`](/case_study_code/gen_cluster_RCT.R), which corrects the bug discussed in Exercise \@ref(cluster-RCT-checks): +To write estimation functions, it is useful to work with an example dataset that has the same structure as the data that will be simulated. To that end, we generate a sample dataset using [a revised version of `gen_cluster_RCT()`](/case_study_code/gen_cluster_RCT.R), which corrects the bug discussed in Exercise \@ref(cluster-RCT-checks): + ```{r} dat <- gen_cluster_RCT( J=16, n_bar = 30, alpha = 0.8, p = 0.5, @@ -398,7 +399,7 @@ We consider both these problems here, and then revisit the conceptual considerat Some estimation functions will require complicated or stochastic calculations that can sometimes produce errors. Intermittent errors can really be annoying and time-consuming if not addressed. -To protect yourself, it is good practice to anticipate potential errors, preventing them from stopping code execution and allowing your simulations to keep running. +To protect yourself, it is good practice to anticipate potential errors, so that you can prevent them from stopping code execution and allow your simulations to keep running. We next demonstrate some techniques for error-handling using tools from the `purrr` package. 
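To show the shape of this protect-the-loop pattern outside of R, here is a hedged Python sketch; the `fragile_estimate` function and its failure mode are invented for illustration, and `safely` is only a rough analogue of `purrr::safely()`, returning a result-or-error record instead of raising:

```python
import math

def fragile_estimate(x):
    # Hypothetical estimator: fails whenever the mean of x is negative,
    # because it takes a square root.
    return math.sqrt(sum(x) / len(x))

def safely(f):
    # Rough analogue of purrr::safely(): capture exceptions as data,
    # so one bad dataset cannot halt a long simulation loop.
    def wrapped(*args, **kwargs):
        try:
            return {"result": f(*args, **kwargs), "error": None}
        except Exception as err:
            return {"result": None, "error": str(err)}
    return wrapped

safe_estimate = safely(fragile_estimate)
ok = safe_estimate([1.0, 4.0])      # result computed, error is None
bad = safe_estimate([-1.0, -1.0])   # error recorded, result is None
```

The wrapped function always returns, so the simulation loop can keep running and the error records can be tallied afterwards.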
For illustrative purposes, consider the following error-prone function that sometimes returns what we want, sometimes returns `NaN` due to taking the square root of a negative number, and sometimes crashes completely because `broken_code()` does not exist: @@ -471,12 +472,12 @@ mod <- analysis_MLM(dat) Wrapping `lmer()` with `quietly()` makes it possible to catch such output and store it along with other results, as in the following: ```{r} -quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise=NULL ) ) +quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise = NULL ) ) M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) M1 ``` -In our analysis method, we then pick apart the pieces and make a dataframe out of them (we also add in a generally preferred optimizer, bobyqa, to try and avoid the warnings and errors in the first place): +In our estimation function, we replace the original `lmer()` call with our quiet-and-safe version, `quiet_safe_lmer()`. We then pick apart the pieces of its output to return a tidy dataset of results. (Separately, we also add in a generally preferred optimizer, bobyqa, to try and avoid the warnings and errors in the first place.) The resulting function is: ```{r} analysis_MLM_safe <- function( dat ) { @@ -494,6 +495,7 @@ analysis_MLM_safe <- function( dat ) { warning = M1$warning, error = TRUE ) } else { + # no error! 
sum <- summary( M1$result ) tibble( @@ -592,7 +594,7 @@ We end up with a 20-entry list, with each element consisting of a pair of the re length( resu ) resu[[3]] ``` -We can massage our data to a more easy to parse format: +We can massage our data into a format that is easier to parse: ```{r} resu <- transpose( resu ) unlist(resu$result) diff --git a/035-running-simulation.Rmd b/035-running-simulation.Rmd index 0e97a7d..1be5a51 100644 --- a/035-running-simulation.Rmd +++ b/035-running-simulation.Rmd @@ -20,7 +20,7 @@ source("case_study_code/analyze_cluster_RCT.R") # Running the Simulation Process {#running-the-simulation-process} In the prior two chapters we saw how to write functions that generate data according to a particular model and functions that implement data-analysis procedures on simulated data. -The next step to build a simulation involves putting these two pieces together, running the DGP function and the data-analysis function repeatedly to obtain results (in the form of point estimates, standard errors, confidence intervals, $p$-values, or other quantities) from many replications of the whole process. +The next step in building a simulation involves putting these two pieces together, running the DGP function and the data-analysis function repeatedly to obtain results (in the form of point estimates, standard errors, confidence intervals, $p$-values, or other quantities) from many replications of the whole process. As with most things R-related, there are many different techniques that can be used to repeat a set of calculations over and over. In this chapter, we demonstrate two key techniques for doing so. @@ -28,7 +28,7 @@ We then explain how to ensure reproducibility of simulation results by setting t ## Repeating oneself {#repeating-oneself} -Suppose that we want to repeatedly generate bivariate Poisson data and then estimate the correlation for those data. 
+Suppose that we want to repeatedly generate bivariate Poisson data and then use the data to estimate the correlation between the variates. We saw a DGP function for the bivariate Poisson in Section \@ref(DGP-functions), and an estimation function in Section \@ref(estimation-functions). To produce a simulated correlation coefficient, we need to run these two functions in turn: ```{r} @@ -55,12 +55,12 @@ The first argument specifies the number of times to repeat the calculation. The second argument is an R expression that will be evaluated. The expression is wrapped in curly braces (`{}`) because it involves more than a single line of code. Including the option `id = "rep"` returns a dataset that includes a variable called `rep` to identify each replication of the process. -Setting the option `stack = TRUE` will stack up the output of each expression into a single tibble, which will facilitate later calculations on the results. +Setting the option `stack = TRUE` will stack up the output of each expression into a single tibble, which will facilitate further calculations on the results. Explicitly setting this option is not necessary because it is `TRUE` by default; setting `stack = FALSE` will return the results in a list rather than a tibble (try this for yourself to see!). There are many other functions that work very much like `repeat_and_stack()`, including the base-R function `replicate()` and the now-deprecated function `rerun()` from `{purrr}`. The functions in the `map()` family from `{purrr}` can also be used to do the same thing as `repeat_and_stack()`. -See Appendix \@ref(more-repeating-oneself) for more discussion of these alternatives. +See Appendix \@ref(more-repeating-oneself) for more discussion of these options. 
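The repeat-then-stack pattern is not specific to R. As a hedged illustration, here is a Python sketch of the same loop using a common-shock (trivariate-reduction) construction of a bivariate Poisson; the function names and parameter values are ours, not the book's:

```python
import math
import random

def r_bivariate_poisson(n, lam1=3.0, lam2=3.0, lam0=1.0):
    # Trivariate reduction: X = A + C and Y = B + C for independent
    # Poisson A, B, C, so the shared term C induces positive correlation.
    def rpois(lam):
        # Knuth's multiplication method; fine for small lambda.
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            k += 1
            p *= random.random()
            if p <= L:
                return k - 1
    pairs = []
    for _ in range(n):
        c = rpois(lam0)
        pairs.append((rpois(lam1) + c, rpois(lam2) + c))
    return pairs

def estimate_correlation(pairs):
    # Plain Pearson correlation, computed by hand.
    x, y = zip(*pairs)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Repeat generate-then-estimate, stacking results into one list.
random.seed(42)
reps = [estimate_correlation(r_bivariate_poisson(n=50)) for _ in range(100)]
```

Each element of `reps` is one simulated correlation coefficient; stacking them makes later summaries (means, histograms) straightforward.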
## One run at a time @@ -128,8 +128,7 @@ Once we have a function to execute a single run, we can produce multiple results R <- 1000 ATE <- 0.30 runs <- repeat_and_stack( R, - one_run( n_bar = 30, J = 20, - gamma_1 = ATE ), + one_run( n_bar = 30, J = 20, gamma_1 = ATE ), id = "runID") ``` @@ -145,7 +144,7 @@ In Chapter \@ref(performance-measures), we will look systematically at ways to q In Section \@ref(DGP-standardization), we discussed how to index the DGP of the cluster-randomized experiment using an intra-class correlation (ICC) instead of using two separate variance components. This type of re-parameterization can be handled as part of writing a wrapper function for executing the DGP and estimation procedures. -Here is a revised version of `one_run()`, which also renames some of the more obscure model parameters using terms that are easier to interpret: +Here is a revised version of `one_run()`[^stopifnot], which also renames some of the more obscure model parameters using terms that are easier to interpret: ```{r revised_CRT, eval=FALSE} one_run <- function( @@ -166,12 +165,15 @@ one_run <- function( bind_rows( MLM = MLM, LR = LR, Agg = Agg, .id = "method" ) } ``` -Note the `stopifnot`: it is wise to ensure our parameter transforms are all reasonable, so we do not get unexplained errors or strange results later on. + +[^stopifnot]: Note the use of `stopifnot` in the body of `one_run()`. +It is wise to ensure our parameter transforms are all reasonable, so we do not get unexplained errors or strange results later on. It is best if your code fails as soon as possible! Otherwise debugging can be quite hard. + -Controlling how we use the foundational elements such as our data generating code is a key technique for making the higher level simulations sensible and more easily interpretable. 
+Controlling how we use the foundational elements such as our data generating code is a key technique for making the higher level simulations sensible and more readily interpretable. In the revised `one_run()` function, we transform the `ICC` input parameter into the parameters used by `gen_cluster_RCT()` so as to maintain the effect size interpretation of our simulation. We have not modified `gen_cluster_RCT()` at all: instead, we specify the parameters of the DGP function in terms of the parameters we want to directly control in the simulation. @@ -226,7 +228,7 @@ run_CRT_sim( reps = 1000, size_coef = 0.5, alpha = 0.5 ) saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) ``` -Notice our simulation takes easy to interpret parameters (`ATE`, `ICC`) rather than the raw variance components. +Notice how our simulation takes easy-to-interpret parameters (`ATE`, `ICC`) rather than the raw variance components. We have put our simulation into effect size units, and are now providing "knobs" to the simulation that are directly interpretable. @@ -236,10 +238,9 @@ The techniques that we have demonstrated for repeating a set of calculations eac The `simhelpers` package provides a function `bundle_sim()` that abstracts this common pattern and allows you to automatically stitch together (or "bundle") a DGP function and an estimation function, so that they can be run once or multiple times. Thus, `bundle_sim()` provides a convenient alternative to writing your own `one_run()` function for each simulation, thereby saving a bit of typing (and avoiding an opportunity for bugs to creep into your code). -`bundle_sim()` takes a DGP function and an estimation function as inputs and gives us back a new function that will run a simulation using whatever parameters we give it. +The `bundle_sim()` function takes a DGP function and an estimation function as inputs and gives us back a new function that will run a simulation using whatever parameters we give it. 
The `bundle_sim()` function is a "factory" in that it creates a new function for you on the fly. -Think of it as a (rather dumb) coding assistant who will take some direction and give you code back. - + Here is a basic example, where we create a function for running our simulation to evaluate how well our Pearson correlation coefficient confidence interval works on a bivariate Poisson distribution: ```{r} sim_r_Poisson <- bundle_sim( f_generate = r_bivariate_Poisson, @@ -263,7 +264,7 @@ Its first argument is `reps`, which controls the number of times that the simula Its last argument is `seed`, which we will discuss in Section \@ref(seeds-and-pseudo-RNGs). The `bundle_sim()` function requires specifying a DGP function and a _single_ estimation function, with the data as the first argument. -For our cluster-randomized experiment example, we would then need to use our `estimate_Tx_Fx()` function that organizes all of the estimators (see Chapter \@ref(multiple-estimation-procedures)). +To apply it in our cluster-randomized experiment example, we will need to use our `estimate_Tx_Fx()` function that organizes all of the estimators (see Chapter \@ref(multiple-estimation-procedures)). We then use `bundle_sim()` to create a function for running an entire simulation: ```{r, messages=FALSE} sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, @@ -287,7 +288,9 @@ runs <- sim_cluster_RCT( reps = R, gamma_1 = ATE, gamma_2 = 0.5, sigma2_u = 0.2, sigma2_e = 0.8 ) ``` -The `bundle_sim()` function is just a convenient way to create a function that pieces together the steps in the simulation process, which is especially useful when the component functions include many input parameters. +The `bundle_sim()` function is just a convenient way to create a function that pieces together the steps in the simulation process. +This is especially useful when the component functions include many input parameters. 
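The factory idea behind `bundle_sim()` is easy to see in miniature. Here is a hedged Python sketch (a minimal illustration of the pattern, not simhelpers' actual implementation) in which a factory takes a generator and an analyzer and returns a brand-new runner function:

```python
import random
import statistics

def bundle_sim(f_generate, f_analyze):
    # Function factory: returns a new function that chains the
    # generate-then-analyze steps and repeats them `reps` times.
    def run_sim(reps, **params):
        return [f_analyze(f_generate(**params)) for _ in range(reps)]
    return run_sim

# Toy stand-ins for a DGP function and an estimation function:
def gen_normal(n, mu):
    return [random.gauss(mu, 1.0) for _ in range(n)]

def est_mean(data):
    return statistics.mean(data)

sim_mean = bundle_sim(gen_normal, est_mean)
random.seed(1)
runs = sim_mean(reps=200, n=30, mu=0.3)
```

The returned `sim_mean` passes any extra keyword arguments straight through to the DGP function, which is what lets a single factory serve component functions with many input parameters.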
+Think of it as a not overly smart coding assistant who will take some basic directions and give you some code back. The function has several further features, which we will demonstrate in subsequent chapters. @@ -313,11 +316,12 @@ sim_B <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14) identical(sim_A, sim_B) ``` Of course, this is the intended behavior of these functions, but it has an important consequence that needs some care and attention. -Using functions like `rchisq()`, `r_bivariate_Poisson()`, or `r_bivariate_Poisson()` in a simulation study means that the results will not be fully reproducible, meaning that if we run our simulation and write up our results, and then someone else tries to run our very same simulation using our very same code, they might not get the very same results. -This can undermine transparency, and make it hard to verify if a set of presented results really came from a given codebase. +Using functions like `rchisq()` or `r_bivariate_Poisson()` in a simulation study means that the results will not be fully reproducible. +If we run the simulation code and write up our results, and then someone else tries to run the same simulation using the very same code, they will not get the very same results. +This undermines transparency and makes it hard to verify if a set of presented results really came from a given codebase. When using DGP functions for simulations, it is useful to be able to exactly control the process of generating random numbers. -This is much more feasible than it sounds: Monte Carlo simulations are random, at least in theory, but computers are deterministic. +This is much more feasible than it sounds: although Monte Carlo simulations are random in theory, computers are deterministic. When we use R to generate what we have been referring to as "random numbers," the functions produce what are actually _pseudo-random_ numbers. 
Pseudo-random numbers are generated from chains of mathematical equations designed to produce sequences of numbers that appear random, but actually follow a deterministic sequence. Each subsequent random number is calculated by starting from the previously generated value (i.e., the current state of the random number generator), applying a complicated function, and storing the result (i.e., updating the state). @@ -327,7 +331,7 @@ The state of the pseudo-random number generator is shared across different funct Each time we ask for a random number from the generator, its state is updated. Functions like `rnorm()` and `rchisq()` all call the low-level generator and then transform the result to be of the correct distribution. -Because the generator is actually deterministic, we can control its output by specify a starting value or initial state, +Because the generator is actually deterministic, we can control its output by specifying a starting value or initial state. In R, the state of the random number generator can be controlled by setting what is known as the seed. The `set.seed()` function allows us to specify a seed value, so that we can exactly reproduce a calculation. For example, @@ -371,7 +375,7 @@ run_CRT_sim <- function(reps, A `seed` argument allows the user to set a seed value before running the full simulation. By default, the `seed` argument is `NULL` and so the current seed is not modified, but if you set the seed, you will get the same run each time. -The `bundle_sim()` function, demonstrated in Section \@ref(bundle-sim-demo), gives back a function with a `seed` seed argument like above. +The `bundle_sim()` function, demonstrated in Section \@ref(bundle-sim-demo), produces a function with a `seed` argument that works just as in the example above. 
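The same seed principle holds in any language with a pseudo-random number generator. As a minimal language-agnostic illustration (the `simulate_once` function is a toy stand-in, not the book's code), here is a Python sketch showing that resetting the seed resets the entire stream of "random" draws:

```python
import random

def simulate_once():
    # Toy stand-in for one full simulation replication.
    return sum(random.gauss(0.0, 1.0) for _ in range(5)) / 5

random.seed(20260219)  # fix the generator's initial state
run_A = [simulate_once() for _ in range(3)]

random.seed(20260219)  # reset to the very same state...
run_B = [simulate_once() for _ in range(3)]
# ...so the "random" results are identical: run_A == run_B
```

Because the generator is deterministic given its state, anyone who re-runs this code with the same seed obtains exactly the same three replications.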
Specifying an integer-valued seed will make the results exactly reproducible: ```{r} sim_A <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14, @@ -394,16 +398,14 @@ If we had not set the seed, we would not know if we were just getting (un)lucky, ## Conclusions -The act of running a simulation is not itself the goal of a simulation study. Rather, it is the mechanism by which we generate many realizations of a data analysis procedure under a specified data-generating process. +The act of running a simulation is not itself the goal of a simulation study. Rather, it is only the procedural mechanism by which we generate many realizations of a data analysis procedure under a specified data-generating process. As we have seen in this chapter, doing so requires carefully stitching together the one-two step of data generation and analysis, repeating that combined process many times, and then organizing the resulting output in a form that can be easily and systematically evaluated. Different coding strategies can be used to accomplish this, but they all serve the same purpose: to make repeated evaluation of a procedure reliable, transparent, and easy to extend. - -Finally, because simulation studies rely on stochastic data generation, careful attention must be paid to reproducibility. +Finally, because simulation studies rely on stochastic data generation, it is critical that we pay careful attention to reproducibility. By controlling the pseudo-random number generator through explicit use of seeds, we can ensure that simulation results can be exactly reproduced, checked, and debugged. This reproducibility is essential for both scientific transparency and practical development of simulation code. -With the tools of this chapter in place, we are now positioned to move from running simulations to evaluating what they tell us about estimator behavior, which is the focus of the next chapter. 
- +With the tools of this chapter in place, we are now positioned to move from running simulations to evaluating what they tell us about estimator behavior---the focus of Chapter \@ref(performance-measures). ## Exercises diff --git a/070-experimental-design.Rmd b/070-experimental-design.Rmd index a3d4669..1c4eea9 100644 --- a/070-experimental-design.Rmd +++ b/070-experimental-design.Rmd @@ -277,30 +277,24 @@ sres %>% mean_rmse = mean( rmse ) ) ``` -Even with this massive simplification, we see tantalizing hints of differences between the methods. +Even with this very crude summary of the results, we see tantalizing hints of differences between the methods. Linear Regression does seem biased, on average, across the scenarios considered. But so does multilevel modeling---we did not see bias before, but apparently for some scenarios it is biased as well. -Overall SE and RMSE seems roughly the same across scenarios, and it is clear that uncertainty, on average, trumps bias. -This could be due, however, to those scenarios with few clusters that are small in size. -In order to understand the more complete store, we need to dig further into the results. +Overall SE and RMSE seem roughly the same across scenarios, and it is clear that uncertainty trumps bias, at least on average. +However, this pattern could be driven by the scenarios with few clusters that are small in size. +In order to understand the full story of what is going on, we need to dig further into the results. ## Conclusions In this chapter, we have moved from isolated simulation scenarios to fully specified multifactor simulation designs. -By systematically varying multiple features of a data-generating process while holding others fixed, we can create a structured way to explore how estimator performance depends on the context in which the estimators are used. 
-By treating simulations as designed experiments, we gain a principled framework for choosing parameters and curating our set of scenarios to explore. +By systematically varying multiple features of a data-generating process while holding others fixed, we can explore how the performance of an estimator depends on the context in which it is used. +By thinking of simulations as designed experiments, we gain a principled framework for choosing parameters and curating a set of scenarios to explore. The result is a rich collection of simulation output that can capture bias, variability, uncertainty estimation, or testing behavior across a wide range of plausible conditions. -However, the volume and complexity of these results also create a new challenge: interpretation. -With many factors, levels, and performance measures, it is no longer possible to understand results by inspecting tables alone. -The next step is therefore to summarize, aggregate, and visualize these outcomes in ways that reveal patterns and trade-offs. -In the next few chapters, we therefore turn to the problem of analyzing and presenting such results, with an emphasis on graphical approaches that clarify how performance varies across conditions. +However, the volume and complexity of systematic simulation results also create new challenges for interpretation. +With many factors, levels, and performance measures, it is no longer possible to understand the results simply by inspecting a table. +In the next few chapters, we therefore turn to the problem of analyzing and presenting simulation results, with an emphasis on graphical approaches that clarify how performance varies across conditions. +Through further analysis, we will seek to identify systematic trends in estimator performance, understand the generality of those patterns, and characterize trade-offs between different methods. 
From 66438e026970414decf4681c255371fcc8c02453 Mon Sep 17 00:00:00 2001 From: jepusto Date: Thu, 26 Feb 2026 16:37:46 -0600 Subject: [PATCH 2/2] Further edits to the chapter on performance measures. --- 040-Performance-criteria.Rmd | 378 ++++++++++++++++------------------- 1 file changed, 170 insertions(+), 208 deletions(-) diff --git a/040-Performance-criteria.Rmd b/040-Performance-criteria.Rmd index 7231415..d72da89 100644 --- a/040-Performance-criteria.Rmd +++ b/040-Performance-criteria.Rmd @@ -28,13 +28,9 @@ For example, Figure \@ref(fig:CRT-ATE-hist) depicts the distribution of average There are three different estimators, each with 1000 replications. Each histogram is an approximation of the _sampling distribution_ of the estimator, meaning its distribution across repetitions of the data-generating process. Visually, we see the three estimators look about the same---they sometimes estimate too low, and sometimes estimate too high. -We also see linear regression is shifted up a bit, and maybe aggregation and MLM have a few more outlines. -These observations are assessing how well each estimator _performs_, and we next dig into quantifying that performance more precisely with what we call _performance measures_. - - +We also see linear regression is shifted up a bit, and aggregation and MLM might have a few more outlying values. +These observations are initial impressions of how well each estimator performs---that is, how closely the estimator approximates the truth. +In this chapter, we look at how to more precisely quantify specific aspects of estimator performance and how to compare the performance of different estimators, using what we will call _performance measures_. 
```{r CRT-ATE-hist} #| fig.width: 8 @@ -69,35 +65,36 @@ Bias measures the central tendency of the sampling distribution, capturing wheth In Figure \@ref(fig:CRT-ATE-hist), the black dashed lines mark the true average treatment effect of 0.3 and the solid lines with the dot mark the means of the estimators. The horizontal distance between the solid and dashed lines corresponds to the bias of the estimator. This distance is nearly zero for the aggregation estimator and the multilevel model estimator, but larger for the linear regression estimator. -Our performance measure of bias would be about zero for Agg and MLM, and around 0.05 to 0.1 for linear regression. +Our performance measure of bias would be near zero for aggregation and multilevel modeling, but around 0.05 to 0.1 for linear regression. Many performance measures exist. For point estimates, conventional performance measures include bias, variance, and root mean squared error. -Our definition of "success" could depend on which measure we find more important: is lower bias good? Is it better than lower variance? - + + If the point estimates come with corresponding standard errors, then we may also want to evaluate how accurately the standard errors represent the true uncertainty of the point estimators; conventional performance measures for capturing this include the relative bias and relative root mean squared error of the variance estimator. For procedures that produce confidence intervals or other types of interval estimates, conventional performance measures include the coverage rate and average interval width. Finally, for inferential procedures that involve hypothesis tests (or more generally, classification tasks), conventional performance measures include Type I error rates and power. We describe each of these measures in Sections \@ref(assessing-point-estimators) through \@ref(assessing-inferential-procedures). 
-One might have other performance measures than these, depending on what kind of estimator or data analysis procedure we are evaluating. -Perhaps it is some measure of classification accuracy such as the F1 rate. -Or perhaps it is something very specific, such as the chance of an error larger than some threshold. + + + + + -Regardless of measure chosen, the true performance of an estimator depends on the data generating process being used. -We might see good performance against some DGPs, and bad performance against other DGPs. -Performance measures are defined with respect to the sampling distribution of our estimator, which is the distribution of values we would see across different datasets that all came from the same DGP. +For any particular measure, the actual performance of an estimator depends on the data generating process under which it is evaluated. +Performance measures are defined with respect to the sampling distribution of an estimator, which is the distribution of values we would see across different datasets that all came from the same DGP. +We will therefore need to use some statistical notation to define properties of this sampling distribution. In the following, we define many of the most common measures in terms of standard statistical quantities such as the mean, variance, and other moments of the sampling distribution. Notation-wise, for a random variable $T$ (e.g., for some estimator $T$), we will use the expectation operator $\E(T)$ to denote the mean, $\M(T)$ to denote the median, $\Var(T)$ to denote the variance, and $\Prob()$ to denote probabilities of specific outcomes, all with respect to the sampling distribution of $T$. We also use $\Q_p(T)$ to denote the $p^{th}$ quantile of a distribution, which is the value $x$ such that $\Prob(T \leq x) = p$. With this notation, the median of a continuous distribution is equivalent to the 0.5 quantile: $\M(T) = \Q_{0.5}(T)$. 
Using this notation we can write, for example, that the bias of $T$ is $\E(T) - \theta$, where $\theta$ is the target parameter. -We develop all of this further, below. For some simple combinations of data-generating processes and data analysis procedures, it may be possible to derive exact mathematical formulas for some performance measures (such as exact mathematical expressions for the bias and variance of the linear regression estimator). But for many problems, the math is difficult or intractable---which is partly why we do simulations in the first place. -Simulations, however, do not produce the _exact_ sampling distribution or give us _exact_ values of performance measures. +However, simulations do not produce the _exact_ sampling distribution or give us _exact_ values of performance measures. Instead, simulations generate _samples_ (usually large samples) from the sampling distribution, and we can use these to _estimate_ the performance measures of interest. In Figure \@ref(fig:CRT-ATE-hist), for example, we estimated the bias of each estimator by taking the mean of 1000 observations from its sampling distribution. If we were to repeat the whole simulation (with a different seed), then our bias results would shift slightly because they are imperfect estimates of the actual bias. @@ -118,7 +115,6 @@ Section \@ref(MCSE) provides details about how to compute MCSEs for conventional The most common performance measures used to assess a point estimator are **bias**, **variance**, and **root mean squared error**. **Bias** compares the mean of the sampling distribution to the target parameter. -It answers, "Are we systematically too high or too low, on average?" Positive bias implies that the estimator tends to systematically over-state the quantity of interest, while negative bias implies that it systematically under-shoots the quantity of interest. 
@@ -133,8 +129,6 @@ If two estimators have comparable RMSE, then the estimator with lower bias would To define these quantities more precisely, consider a generic estimator $T$ that is targeting a parameter $\theta$. We call the target parameter the _estimand_. -In most cases, in running our simulation we set the estimand $\theta$ and then generate a (typically large) series of $R$ datasets, with $\theta$ being the true target parameter for each. -We also analyze each dataset, obtaining a sample of estimates $T_1,...,T_R$. Formally, the bias, variance, and RMSE of $T$ are then defined as $$ \begin{aligned} @@ -155,7 +149,9 @@ $$ (\#eq:RMSE-decomposition) $$ -When conducting a simulation, we do not compute these performance measures directly but rather must estimate them using the replicates $T_1,...,T_R$ generated from the sampling distribution. +When conducting a simulation, we do not compute these performance measures directly but rather must estimate them. +In most cases, in running a simulation we generate a (typically large) series of $R$ datasets, with $\theta$ being the true target parameter in each of them. +We analyze each dataset, obtaining a sample of estimates $T_1,...,T_R$ drawn from the sampling distribution. There is nothing very surprising about how we construct estimates of the performance measures. It is just a matter of substituting sample quantities in place of the expectations and variances. Specifically, we estimate bias by taking @@ -183,17 +179,17 @@ Often, people talk about the MSE (Mean Squared Error), which is just the square Just as the true SE is usually easier to interpret than the sampling variance, units of RMSE are easier to interpret than the units of MSE. -The above performance measures depend on the scale of the parameter. -For example, if our estimators are measuring a treatment impact in dollars, then the bias, SE, and RMSE of the estimators are all in dollars. 
-(The variance and MSE would be in dollars squared, which is why we take their square roots to put them back on the more intepretable scale of dollars.)
+All of these performance measures depend on the scale of the parameter.
+For example, if an estimator is targeting a treatment impact in dollars, then the bias, SE, and RMSE of the estimator are all in dollars.
+(The variance and MSE would be in dollars squared; taking square roots puts these measures back on the more interpretable scale of dollars.)

-In many simulations, the scale of the outcome is an arbitrary feature of the data-generating process, making the absolute magnitude of performance measures less meaningful.
+In many simulations, the scale of the outcome is an arbitrary feature of the data-generating process, so the absolute magnitude of the performance measure is not meaningful.
To ease interpretation of performance measures, it is useful to consider their magnitude relative to some reference amount.
We generally recommend using the baseline level of variation in the outcome.
For example, we might report a bias of $0.2\sigma$, where $\sigma$ is the standard deviation of the outcome in the population.
-One way to automatically get all your performance metrics on such a standardized scale is to generate data so the outcome has unit variance (i.e., we generate outcomes in _standardized units_).
-This automatically puts the bias, empirical standard error, and root mean squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE.
+One way to ensure performance metrics are on such a standardized scale is to generate data so the outcome has unit variance (i.e., we generate outcomes in _standardized units_).
+This automatically puts the bias, true standard error, and root mean squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE. Finally, a non-linear transformation of a parameter will also generally change a performance measure. For instance, suppose that $\theta$ measures the proportion of time that something occurs. @@ -215,7 +211,7 @@ runs <- readRDS( file = "results/cluster_RCT_simulation.rds" ) ``` We now demonstrate the calculation of performance measures for the point estimators of average treatment effects in the cluster-RCT example. -Our three performance measures address the following questions: +We calculate performance measures to address the following questions: - Is the estimator systematically off? (bias) - Is it precise? (standard error) @@ -242,8 +238,8 @@ runs %>% There is no indication of major bias for aggregation or multi-level modeling. Linear regression, with a bias of about 0.08 SDs, appears much more biased than the other estimators. -This is because the linear regression is targeting the person-level average average treatment effect. -Our data-generating process makes larger sites have larger effects, so the person-level average effect is going to be higher because those larger sites will count more. +This is because the linear regression is targeting the person-level average treatment effect. +Our data-generating process makes larger sites have larger effects, so the person-level average effect is going to be higher because those larger sites count more in the overall average. In contrast, our estimand is the school-level average treatment effect, or the simple average of each school's true impact, which we have set to 0.30. The aggregation and multi-level modeling methods target this school-level average effect. 
If we had instead decided that the target estimand should be the person-level average effect, then we would find that linear regression is unbiased whereas aggregation and multi-level modeling are biased.
@@ -267,9 +263,9 @@ true_SE <- true_SE
```
-In a real data analysis, these standard errors are what we would be trying to approximate with a standard error estimator.
Aggregation and multi-level modeling have SEs about 4 to 5% smaller than linear regression, meaning these estimates tend to be a bit less variable across different datasets.
For these data-generating conditions, aggregation and multi-level modeling are preferable to linear regression because they are more precise.
+In a real data analysis, these standard errors are what we would be trying to approximate with a standard error estimator.

#### Which method has the smallest Root Mean Squared Error? {-}

@@ -295,22 +291,22 @@ RMSE takes into account both bias and variance.
For aggregation and multi-level modeling, the RMSE is the same as the standard error, which makes sense because these estimators are not biased.
For linear regression, the combination of bias plus increased variability yields a higher RMSE, with the standard error dominating the bias term (note how RMSE and SE are more similar than RMSE and bias).
The differences between the estimators are pronounced because RMSE is the square root of the _squared_ bias and _squared_ standard error.
-Overall, aggregation and multi-level modeling have RMSEs around 10% smaller than linear regression, meaning they tend to give estimates that are closer to the truth than linear regression.
+Overall, aggregation and multi-level modeling have RMSEs around 10% smaller than linear regression, meaning they tend to produce estimates that are closer to the truth than linear regression.

### Robust Performance Measures {#less-conventional-measures}

-
-For point estimation, we introduced bias, variance and RMSE as three core measures of performance.
+We have introduced bias, variance and RMSE as three core measures of performance for point estimators.
However, all of these measures are sensitive to outliers in the sampling distribution.
Consider an estimator that generally does well, except for an occasional large mistake.
Because conventional measures are based on arithmetic averages, they could indicate that the estimator performs very poorly overall, when in fact it performs very well most of the time and terribly only rarely.
-If we are more curious about typical performance, then we could use other measures, such as the __median bias__ and the __median absolute deviation__ of $T$, that are less sensitive to outliers in the sampling distribution.
-These are called "robust" measures because they are not massively changed by only a few extreme values.
-The overall approach for most robust measures is to simply chop off some percent of the most extreme values, and then look at the center or spread of the remaining values.
+If we are more curious about typical performance, then we might prefer to use other performance measures, such as the __median bias__ and the __median absolute deviation__ of $T$, which are less sensitive to outliers in the sampling distribution.
+These are called "robust" measures because they are not strongly affected by a few extreme values.
+The overall approach for most robust measures is to simply chop off some percent of the most extreme values, and then look at the central tendency or spread of the remaining values.
+Estimating these measures will involve calculating sample quantiles of $T_1,...,T_R$, which are functions of the sample ordered from smallest to largest.

-For example, median bias is an alternative measure of the central tendency of a sampling distribution.
-Here we "chop off" essentially all the values, focusing on the middle one only.
+Median bias is an alternative measure of the central tendency of a sampling distribution.
+Here we "chop off" essentially all the values, focusing on the middle one alone.
Positive median bias implies that more than 50% of the sampling distribution exceeds the quantity of interest, while negative median bias implies that more than 50% of the sampling distribution falls below the quantity of interest.
Formally,
$$
@@ -326,8 +322,6 @@ $$
where, if we define $T_{(r)}$ as the $r^{th}$ order statistic for $r = 1,...,R$, we have $M_T = T_{((R+1)/2)}$ if $R$ is odd or $M_T = \frac{1}{2}\left(T_{(R/2)} + T_{(R/2+1)}\right)$ if $R$ is even.
-
-
Another robust measure of central tendency uses the $p \times 100\%$-trimmed mean, which ignores the estimates in the lowest and highest $p$-quantiles of the sampling distribution.
Formally, the trimmed-mean bias is
$$
@@ -343,7 +337,7 @@ where (assuming $pR$ is a whole number)
$$
\tilde{T}_{\{p\}} = \frac{1}{(1 - 2p)R} \sum_{r=pR + 1}^{(1-p)R} T_{(r)}
$$
-For a symmetric sampling distribution, trimmed-mean bias will be the same as the conventional (mean) bias, but its estimator $\tilde{T}_{\{p\}}$ will be less affected by outlying values (i.e., values of $T$ very far from the center of the distribution) compared to $\bar{T}$.
+For a symmetric sampling distribution, the trimmed-mean bias will be the same as the conventional (mean) bias, but its estimator $\tilde{T}_{\{p\}}$ will be less affected by outlying values (i.e., values of $T$ very far from the center of the distribution) compared to $\bar{T}$.
However, if a sampling distribution is not symmetric, trimmed-mean bias will be a distinct performance measure, different from mean bias, that puts less emphasis on large errors compared to the conventional bias measure.

A further robust measure of central tendency is based on winsorizing the sampling distribution, or truncating all errors larger than a certain maximum size.
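To see the contrast between the conventional and robust bias measures, here is a small base-R sketch. It uses a contaminated toy sampling distribution of our own construction (not from the chapter's simulations), in which an estimator centered on the truth occasionally makes a wild error:

```r
# Toy sketch: an estimator centered on theta that occasionally goes wild.
set.seed(101)
theta <- 0.5
R <- 1000
T_r <- rnorm(R, mean = theta, sd = 0.1)
T_r[1:10] <- T_r[1:10] + 20   # ten extreme outlying replicates

mean_bias    <- mean(T_r) - theta               # dominated by the outliers
median_bias  <- median(T_r) - theta             # barely moved
trimmed_bias <- mean(T_r, trim = 0.05) - theta  # trims 5% from each tail

round(c(mean = mean_bias, median = median_bias, trimmed = trimmed_bias), 3)
```

The conventional (mean) bias comes out near $10 \times 20 / 1000 = 0.2$, driven entirely by the ten outlying replicates, while the median bias and trimmed-mean bias stay near zero, reflecting the typical performance of the estimator.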
@@ -357,17 +351,17 @@ U_w &= \Q_{0.75}(T) + w \times (\Q_{0.75}(T) - \Q_{0.25}(T)),
\end{aligned}
$$
where $\Q_{0.25}(T)$ and $\Q_{0.75}(T)$ are the first and third quartiles of the distribution of $T$, respectively, and $w$ is the number of inter-quartile ranges beyond which an observation will be treated as an outlier.[^fences]
-Once we have our thresholds, we let our truncated estimate be $X = \min\{\max\{T, L_w\}, U_w\}$.
+Once we have calculated the thresholds, we compute a truncated estimate as $X = \min\{\max\{T, L_w\}, U_w\}$.
The winsorized bias, variance, and RMSE are then defined using winsorized values in place of the raw values of $T$, as
$$
\begin{aligned}
\text{Bias}(X) &= \E\left(X\right) - \theta \\
-\Var\left(X\right) &= \E\left[\left(X - \E (T^{(w)})\right)^2 \right], \\
+\Var\left(X\right) &= \E\left[\left(X - \E (X)\right)^2 \right], \\
\RMSE\left(X\right) &= \sqrt{\E\left[\left(X - \theta\right)^2 \right]}.
\end{aligned}
(\#eq:winsorized-variance-RMSE)
$$
-To compute estimates of the winsorized performance criteria, we substitute sample quantiles $T_{(R/4)}$ and $T_{(3R/4)}$ in place of $\Q_{0.25}(T)$ and $\Q_{0.25}(T)$, respectively, to get estimated thresholds, $\hat{L}_w$ and $\hat{U}_w$, find $\hat{X}_r = \min\{\max\{T_r, \hat{L}_w\}, \hat{U}_w\}$, and compute the sample performance measures using Equations \@ref(eq:bias-estimator), \@ref(eq:var-estimator), and \@ref(eq:rmse-estimator), but with $\hat{X}$ in place of $T_r$.
+To compute estimates of the winsorized performance criteria, we substitute sample quantiles $T_{(R/4)}$ and $T_{(3R/4)}$ in place of $\Q_{0.25}(T)$ and $\Q_{0.75}(T)$, respectively, to get estimated thresholds, $\hat{L}_w$ and $\hat{U}_w$, find $\hat{X}_r = \min\{\max\{T_r, \hat{L}_w\}, \hat{U}_w\}$, and compute the sample performance measures using Equations \@ref(eq:bias-estimator), \@ref(eq:var-estimator), and \@ref(eq:rmse-estimator), but with $\hat{X}_r$ in place of $T_r$.
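The winsorizing steps can be sketched in a few lines of base R. This is a toy contaminated sampling distribution of our own (not the chapter's simulation), with $w = 2$:

```r
# Toy sketch: winsorized bias and RMSE using IQR-based fences with w = 2.
set.seed(202)
theta <- 0.5
T_r <- rnorm(1000, mean = theta, sd = 0.1)
T_r[1:10] <- T_r[1:10] + 20              # contaminate with outliers

w  <- 2
q1 <- unname(quantile(T_r, 0.25))
q3 <- unname(quantile(T_r, 0.75))
L_w <- q1 - w * (q3 - q1)                # lower fence
U_w <- q3 + w * (q3 - q1)                # upper fence
X_r <- pmin(pmax(T_r, L_w), U_w)         # clamp each replicate to the fences

round(c(raw_bias  = mean(T_r) - theta,
        wins_bias = mean(X_r) - theta,
        wins_RMSE = sqrt(mean((X_r - theta)^2))), 3)
```

The raw bias is inflated by the outliers, while the winsorized bias and RMSE reflect the estimator's typical behavior because the extreme replicates have been clamped to the fences rather than averaged in at full size.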
[^fences]: For a normally distributed sampling distribution, the interquartile range is `r round(diff(qnorm(c(0.25, 0.75))),2)` SD; with $w = 2$, the lower and upper thresholds would then fall at $\pm `r round(qnorm(0.75) + 2 * diff(qnorm(c(0.25, 0.75))),2)`$ SD, or the $`r round(100 * pnorm(qnorm(0.25) - 2 * diff(qnorm(c(0.25, 0.75)))),2)`^{th}$ and $`r round(100 * pnorm(qnorm(0.75) + 2 * diff(qnorm(c(0.25, 0.75)))),2)`^{th}$ percentiles. Still assuming a normal sampling distribution, taking $w = 2.5$ will mean that the thresholds fall at the $`r round(100 * pnorm(qnorm(0.25) - 2.5 * diff(qnorm(c(0.25, 0.75)))),3)`^{th}$ and $`r round(100 * pnorm(qnorm(0.75) + 2.5 * diff(qnorm(c(0.25, 0.75)))),3)`^{th}$ percentiles. @@ -381,38 +375,41 @@ $$ Letting $E_r = |T_r - \theta|$, the MAE can be estimated by taking the sample median of $E_1,...,E_R$. Many other robust measures of the spread of the sampling distribution are also available, including the Rosseeuw-Croux scale estimator $Q_n$ [@Rousseeuw1993alternatives] and the biweight midvariance [@Wilcox2022introduction]. @Maronna2006robust provide a useful introduction to these measures and robust statistics more broadly. -The `robustbase` package [@robustbase] provides functions for calculating many of these robust statistics. +The `robustbase` package [@robustbase] provides R functions for calculating many of these robust statistics. -## Measures for Variance Estimators +## Measures for Variance Estimators {#measures-for-variance-estimators} Statistics is concerned not only with how to estimate things (e.g., point estimates), but also with understanding how well we have estimated them (e.g., standard errors). Monte Carlo simulation studies can help us evaluate both these concerns. The prior section focused on assessing how well our estimators work. -In this section we focus on assessing how well our uncertainty assessments work. -We focus on the standard error. 
-
-The first order question is whether our estimated standard errors are __calibrated__, meaning they are the right size, on average.
-The second order question is whether our estimated standard errors are __stable__, meaning they do not vary too much from one dataset to another.
-For this question we can think of the standard errors in terms of effective degrees of freedom, or look at the standard error (or RMSE) of the standard error.
+In this section we focus on assessing how well our uncertainty assessments work, focusing in particular on estimation of the standard error.
+A first order question is whether our estimated standard errors are __calibrated__, meaning whether they are the right size, on average.
+For this question we will examine the bias of squared standard error estimators.
+A second order question is whether our estimated standard errors are __stable__, meaning that they do not vary too much from one dataset to another.
+For this question we evaluate estimated standard errors in terms of effective degrees of freedom or the standard error (or RMSE) of the standard error.

-

### Calibration

-Calibration is a relative measure, where we look at the ratio of the expected (average) value of our uncertainty estimator to the true uncertainty.
+Calibration is a relative measure, where we look at the ratio of the expected (average) value of an uncertainty estimator to the true degree of uncertainty.
The relative measure removes scale, which is important because the absolute magnitude of uncertainty will depend on many features of the data-generating process, such as sample size.
-In particular, across simulation scenarios, even those with the same target parameter $\theta$, the true standard error of our estimator $T$ might change considerably.
-With a relative measure, we will, even under such conditions, end up with statements such as "Our standard errors are 10% too large, on average," which are easy to interpret.
+In particular, across different simulation scenarios (even those with the same target parameter $\theta$), the true standard error of an estimator $T$ may change considerably.
+In other words, the ground truth changes in magnitude from one scenario to another.
+With a relative measure, we can make statements such as "Our standard errors are 10% too large, on average," which are easy to interpret.
+The relative measure lets us sidestep some of the complexity of dealing with wide variation in the magnitude of the ground truth.
+
+

Typically, we assess performance for _variance_ estimators ($\widehat{SE^2}$) rather than standard error estimators.
There are a few reasons it is more common to work with variance rather than standard error.
-First, in practice, so-called unbiased standard errors usually are not actually unbiased, while variance estimators might be.[^Jokes-on-us]
+First, in practice, so-called unbiased standard errors are not actually unbiased, whereas variance estimators might be.[^Jokes-on-us]
For linear regression, for example, the classic standard error estimator is an unbiased _variance_ estimator, but the standard error estimator is not exactly unbiased because
$$
\E[ \sqrt{ V } ] \neq \sqrt{ \E[ V ] }.
$$
@@ -422,7 +419,7 @@ Thus, if we are trying to determine whether an overall MSE is due to instability

[^Jokes-on-us]: See the delightfully titled section 11.5, "The Joke Is on Us: The Standard Deviation Estimator is Biased after All," in @westfall2013understanding for further discussion.

-To make assessing calibration concrete, consider a generic standard error estimator $\widehat{SE}$ to go along with our generic estimator $T$ of target parameter $\theta$, and let $V = \widehat{SE^2}$ be the corresponding variance estimator.
+To make the calibration measure concrete, consider a generic standard error estimator $\widehat{SE}$ to go along with our generic estimator $T$ of target parameter $\theta$, and let $V = \widehat{SE^2}$ be the corresponding variance estimator. In our simulation we obtain a large sample of standard errors, $\widehat{SE}_1,...,\widehat{SE}_R$ and corresponding variance estimators $V_r = \widehat{SE}_r^2$ for $r = 1,...,R$. Formally, the __relative bias__ or __calibration__ of $V$ is then defined as @@ -434,28 +431,18 @@ $$ $$ Relative bias describes whether the central tendency of $V$ aligns with the actual degree of uncertainty in the point estimator $T$. We say our variance estimator $V$ is _calibrated_ if the relative bias is close to one. -When calibrated, $V$ accurately reflects how much uncertainty there actually is, on average. -If relative bias is less than one, $V$ is _anti-conservative_, meaning it tends to under-estimate the true variance of $T$. -This is usually considered quite bad. -If it is more than one, it is _conservative_, meaning it tends to over-estimate the true variance of $T$. -This is usually considered non-ideal, but livable. +When calibrated, $V$ reflects how much uncertainty there actually is, on average. +If relative bias is less than one, $V$ is _anti-conservative_, meaning it tends to under-estimate the true uncertainty of $T$, which would usually be considered quite bad. +If relative bias is more than one, $V$ is _conservative_, meaning it tends to over-estimate the true uncertainty of $T$. +This would be considered less than ideal, though perhaps not as severe an issue. Calibration also has implications for the performance of other uncertainty assessments such as hypothesis tests and confidence intervals. A relative bias of less than 1 implies that $V$ tends to under-state the amount of uncertainty, which will lead to confidence intervals that are overly narrow and do not cover the true parameter value at the desired rate. 
It also will lead to rejecting the null more often than desired, which elevates Type I error. -Relative bias greater than 1 implies that $V$ tends to over-state the amount of uncertainty in the point estimator, making confidence intervals too large and null hypotheses hard to reject. - +Relative bias greater than 1 implies that $V$ tends to over-state the amount of uncertainty in the point estimator, making confidence intervals too large and null hypotheses harder than necessary to reject. - - -To estimate relative bias, simply substitute sample quantities in place of the $\E(V)$ and $\Var(T)$. -Unlike with performance measures for $T$, where we know the target quantity $\theta$, here we do not know $\Var(T)$ exactly, +To estimate relative bias of $V$, we simply substitute sample quantities in place of the $\E(V)$ and $\Var(T)$. +Unlike with performance measures for $T$, where we know the target quantity $\theta$, here we do not know $\Var(T)$ exactly. Instead, we must estimate it using $S_T^2$, the sample variance of $T$ across replications. Denoting the arithmetic mean of the variance estimates across simulation trials as $$ @@ -470,12 +457,6 @@ $$ $$ If we then take the square root of our estimate, we can obtain the estimated relative bias of the standard error. - - ### Stability {#performance-stability} @@ -485,9 +466,11 @@ Are the estimates of uncertainty $V$ close to the true uncertainty $Var(T)$, fro We offer two ways of assessing this. The first is to look at the relative SE or RMSE of the variance estimator $V$. 
-The second is to assess the effective "degrees of freedom" of $V$, which takes into account how one would account for uncertainty estimator uncertainty when generating confidence intervals or conducting hypothesis tests (this is what a degrees of freedom correction does---for example, when $V$ is unstable, as represented by a low degree of freedom, we hedge against it being randomly too small by making our confidence intervals systematically larger). +The second is to assess the effective degrees of freedom of $V$, which takes into account how one would account for the uncertainty of a standard error estimator when generating confidence intervals or conducting hypothesis tests.[^df-correction] + +[^df-correction]: A degrees of freedom correction used in a confidence interval or hypothesis test functions in exactly this way. For instance, when $V$ is unstable, as represented by a low degree of freedom, we hedge against it being randomly too small by making our confidence intervals systematically larger. -The relative Variance and MSE of $V$ as an estimate of $Var(T)$ are the following: +The relative Variance and MSE of $V$ as an estimate of $Var(T)$ are defined as $$ \begin{aligned} \text{Relative Var}(V) &= \frac{\Var(V)}{\Var(T)} \\ @@ -513,17 +496,16 @@ S_V^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(V_r - \bar{V}\right)^2, $$ -Another more abstract measure of the stability of a variance estimator is its Satterthwaite degrees of freedom. +Another more abstract measure of the stability of a variance estimator is its effective degrees of freedom. For some simple statistical models such as classical analysis of variance and linear regression with homoskedastic errors, the variance estimator is a sum of squares of normally distributed errors. In such cases, the sampling distribution of the variance estimator is a multiple of a $\chi^2$ distribution, with degrees of freedom corresponding to the number of independent observations used to compute the sum of squares. 
This is what leads to $t$ statistics and so forth.
-In the context of analysis of variance problems, @Satterthwaite1946approximate described a method of extending this idea by approximating the variability of more complex statistics, involving linear combinations of sums of squares, by using a chi-squared distribution with a certain degrees of freedom.
-
+In the context of analysis of variance problems, @Satterthwaite1946approximate described a method of extending this idea by approximating the variability of more complex statistics (those that can be represented as linear combinations of sums of squares) by using a chi-squared distribution with appropriately chosen degrees of freedom.
When applied to an arbitrary variance estimator $V$, these degrees of freedom can be interpreted as the number of independent, normally distributed errors going into a sum of squares that would lead to a variance estimator that is as precise as $V$.
-More succinctly, these degrees of freedom correspond to the amount of independent observations used to estimate $V$.
+More succinctly, these effective degrees of freedom correspond to the number of independent observations used to estimate $V$.

-Following @Satterthwaite1946approximate, we define the degrees of freedom of $V$ as
+Following @Satterthwaite1946approximate, we define the effective degrees of freedom of $V$ as
$$
df = \frac{2 \left[\E(V)\right]^2}{\Var(V)}.
(\#eq:Satterthwaite-df)
$$
@@ -533,7 +515,7 @@ $$
\widehat{df} = \frac{2 \bar{V}^2}{S_V^2}.
(\#eq:Satterthwaite-df-estimator)
$$
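As a sanity check on these formulas, consider a toy case where the answer is known: the usual variance estimator for the mean of $n$ normal observations has $n - 1$ degrees of freedom, and it is unbiased, so its relative bias should be near one. A minimal base-R sketch (our own toy setup, not from the chapter's simulations):

```r
# Toy sketch: relative bias (calibration) and effective degrees of freedom
# of the usual variance estimator for a sample mean of n normal draws.
set.seed(303)
n <- 10
R <- 10000

sims <- replicate(R, {
  y <- rnorm(n)
  c(T = mean(y), V = var(y) / n)  # point estimate and its variance estimate
})

rel_bias <- mean(sims["V", ]) / var(sims["T", ])  # should be near 1
df_hat   <- 2 * mean(sims["V", ])^2 / var(sims["V", ])

round(c(rel_bias = rel_bias, df = df_hat), 2)     # df should land near n - 1 = 9
```

Up to Monte Carlo error, the estimated relative bias lands near 1 (the variance estimator is calibrated) and the estimated effective degrees of freedom land near $n - 1 = 9$, matching the known answer for this simple case.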
+For simple statistical methods in which $V$ is based on a sum-of-squares of normally distributed errors, the effective degrees of freedom will be constant and correspond exactly to the number of independent observations in the sum of squares.
Even with more complex methods, the degrees of freedom are interpretable: higher degrees of freedom imply that $V$ is based on more observations, and thus will be a more precise estimate of the actual degree of uncertainty in $T$.

### Assessing SEs for the Cluster RCT Simulation {#assess-cluster-RCT-SEs}

@@ -569,7 +551,7 @@ Overall, it is a bad day for linear regression.

## Measures for Confidence Intervals

-Some estimation procedures provide confidence intervals (or confidence sets) which are ranges of values, or interval estimators, that should include the true parameter value with a specified confidence level.
+Some estimation procedures provide confidence intervals (or confidence sets) which are ranges of values that should include the true parameter value with a specified confidence level.
For a 95% confidence level, the interval should include the true parameter in 95% of replications of the data-generating process.
However, with the exception of some simple methods and models, methods for constructing confidence intervals usually involve approximations and simplifying assumptions, so their actual coverage rate might deviate from the intended confidence level.
@@ -582,7 +564,9 @@ $$
\text{Coverage}(A,B) = \Prob(A \leq \theta \leq B).
(\#eq:coverage)
$$
-For a well-performing interval estimator, $\text{Coverage}$ will at least $\beta$ and, ideally will not exceed $\beta$ by too much.
+For a well-performing interval estimator, $\text{Coverage}$ will be at least $\beta$ and ideally will not exceed $\beta$ by too much.
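Estimating coverage from a simulation is just a matter of tracking how often the interval captures the truth. Here is a self-contained base-R sketch (our own toy example, not the chapter's simulation): a $t$-interval for a normal mean, a case where nominal 95% coverage should hold essentially exactly.

```r
# Toy sketch: estimated coverage and average width of a nominal 95% t-interval.
set.seed(404)
theta <- 0.3
R <- 2000
n <- 25

ints <- replicate(R, {
  y  <- rnorm(n, mean = theta, sd = 1)
  ci <- t.test(y, conf.level = 0.95)$conf.int
  c(A = ci[1], B = ci[2])       # lower and upper interval endpoints
})

coverage <- mean(ints["A", ] <= theta & theta <= ints["B", ])
width    <- mean(ints["B", ] - ints["A", ])
round(c(coverage = coverage, width = width), 3)
```

The estimated coverage should land near 0.95, up to Monte Carlo error of about $\sqrt{0.95 \times 0.05 / R} \approx 0.005$.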
+Some analysts prefer to look at lower and upper coverage separately, where lower coverage is $\Prob(A \leq \theta)$ and upper coverage is $\Prob(\theta \leq B)$,
+and where it is desirable for each of these rates to be near $(1 + \beta) / 2$, so that each one-sided miss rate is near $(1 - \beta) / 2$.

The expected width of a $\beta$-level interval estimator is the average difference between the upper and lower endpoints, formally defined as
$$
@@ -603,7 +587,7 @@ $$
(\#eq:coverage-width)
$$
Following a strict statistical interpretation, a confidence interval performs acceptably if it has actual coverage rate greater than or equal to $\beta$.
-If multiple methods satisfy this criterion, then the method with the lowest expected width would be preferable. Some analysts prefer to look at lower and upper coverage separately, where lower coverage is $\Prob(A \leq \theta)$ and upper coverage is $\Prob(\theta \leq B)$.
+If multiple methods satisfy this criterion, then the method with the lowest expected width would be preferable.

In many instances, confidence intervals are constructed using point estimators and uncertainty estimators.
@@ -660,7 +644,7 @@ Alternately, testing procedures might be formulated by comparing a test statisti
Hypothesis testing procedures aim to control the level of false positives, corresponding to the probability that the null hypothesis is rejected when it holds in truth.
The level of a testing procedure is often denoted as $\alpha$, and it has become conventional in many fields to conduct tests with a level of $\alpha = .05$.[^significance]
Just as in the case of confidence intervals, hypothesis testing procedures can sometimes be developed that will have false positive rates exactly equal to the intended level $\alpha$.
-However, in many other problems, hypothesis testing procedures involve approximations or assumption violations, so that the actual rate of false positives might deviate from the intended $\alpha$-level.
+However, in many other problems, hypothesis testing procedures involve approximations or simplifying assumptions, so that the actual rate of false positives might deviate from the intended $\alpha$-level. Knowing the risk of this is important. When we evaluate a hypothesis testing procedure, we are concerned with two primary measures of performance: *validity* and *power*. @@ -668,7 +652,7 @@ When we evaluate a hypothesis testing procedure, we are concerned with two prima ### Validity -Validity pertains to whether we erroneously reject a true null more than we should. +Validity pertains to whether we erroneously reject a true null more often than we should. An $\alpha$-level testing procedure is valid if it has no more than an $\alpha$ chance of rejecting the null, when the null is true. If we were using the conventional $\alpha = .05$ level, then a valid testing procedure will reject the null in only about 50 of 1000 replications of a data-generating process where the null hypothesis actually holds true. @@ -684,29 +668,21 @@ Power is concerned with the chance that we notice when an effect or a difference Compared to validity, power is a more nuanced concept because larger effects will clearly be easier to notice than smaller ones, and more blatant violations of a null hypothesis will be easier to identify than subtle ones. Furthermore, the rate at which we can detect violations of a null will depend on the $\alpha$ level of the testing procedure. A lower $\alpha$ level will make for a less sensitive test, requiring stronger evidence to rule out a null hypothesis. Conversely, a higher $\alpha$ level will reject more readily, leading to higher power but at a cost of increased false positives. - In order to evaluate the power of a testing procedure by simulation, we will need to generate data where there is something to detect. 
-In other words, we will need to ensure that the null hypothesis is violated (and that some specific alternative hypothesis of interest holds).
+In other words, we will need to ensure that the null hypothesis is violated and that some specific alternative hypothesis of interest holds.
The process of evaluating the power of a testing procedure is otherwise identical to that for evaluating its validity: generate many datasets, carry out the testing procedure, and track the rate at which the null hypothesis is rejected.
The only difference is the _conditions_ under which the data are generated.

We often find it useful to think of power as a _function_ rather than as a single quantity because its absolute magnitude will generally depend on the sample size of a dataset and the magnitude of the effect of interest.
Because of this, power evaluations will typically involve examining a _sequence_ of data-generating scenarios with varying sample size or varying effect size.
-If we are actually planning a future analysis, we would then look to determine what sample size or effect size would give us 80% power (the traditional target power used for planning in many cases).
+If we are actually planning a future analysis, we would then look to determine what sample size or effect size would give us 80% power (a traditional target power level used for planning in many cases).
See Chapter \@ref(sec:power) for more on power analysis for study planning.

-Separately, if our goal is to compare testing procedures, then absolute power at substantively large effects is often less informative than relative performance near the null.
+Separately, if our goal is to compare testing procedures, then absolute power at substantively large effects is often less informative than relative power at small effects near, but not exactly at, the null.
In this setting, attention naturally shifts to infinitesimal power---that is, how quickly power increases as we move from a null effect to very small alternatives. Examining infinitesimal power helps identify which tests are most sensitive to small departures from the null in a more standardized way. - - - - - - - ### Rejection Rates When evaluating either validity or power, the main performance measure is the __rejection rate__ of the hypothesis test. @@ -736,7 +712,6 @@ Methodologists hold a variety of perspectives on how close the actual Type-I err Among a collection of level-$\alpha$ testing procedures, we would prefer the one with highest power. If looking only at null rejection rates, then the test with Type-I error closest to $\alpha$ would usually be preferred. - @@ -797,9 +772,10 @@ We also estimate the bias of the point estimator, since bias can affect the vali ```{r demo-calc-validity} runs_val %>% group_by( method ) %>% - summarise( bias = mean(ATE_hat), - power = mean( p_value <= 0.05 ) ) %>% - knitr::kable( digits=3 ) + summarise( + bias = mean(ATE_hat), + power = mean( p_value <= 0.05 ) + ) ``` The aggregation and multi-level modeling approaches both seem to reject a bit below 0.05, at 0.04, while LR is here at 0.10. @@ -808,7 +784,7 @@ The elevated rejection rate due to bias might then be part of the reason that th If so, it is not entirely fair to compare the power of the three testing procedures, because one of them has Type-I error in excess of the desired level. As discussed above, linear regression targets the person-level average treatment effect. -In the scenario we just simulated for evaluating validity, the person-level average effect is not zero even though our cluster-average effect is zero, because we have specified a non-zero impact heterogeneity parameter (`size_coef=0.75`) along with school size variation (`alpha = 0.75`), meaning that the school-specific treatment effects vary around 0. 
+In the scenario we just simulated for evaluating validity, the person-level average effect is not zero even though our cluster-average effect is zero, because we have specified a non-zero impact heterogeneity parameter (`size_coef=0.75`) along with positive school size variation (`alpha = 0.75`), so that the school-specific treatment effects vary around 0.
 
To see if this is why the linear regression test has an inflated Type-I error rate, we could re-run the simulation using settings where both the school-level and person-level average effects are truly zero by setting `size_coef = 0`.
@@ -818,15 +794,15 @@ To see if this is why the linear regression test has an inflated Type-I error ra
 
A __relative measure__ compares a performance measure to the target quantity by taking a ratio.
For example, __relative bias__ is the expected value of an estimator divided by the true parameter value.
As we saw when evaluating variance, above, relative measures can be very powerful, giving statements such as "this estimator's standard errors are 10% too big, on average."
-In their best case, they are easy to interpret, since they are unitless.
-That said, they also suffer from some notable drawbacks.
+In the best case, they are easy to interpret because they are unitless.
+However, they also suffer from some notable drawbacks.
 
-In particular, ratios can be deceiving when the denominator is near zero.
+Relative measures can be difficult to interpret under several circumstances. For one, ratios can be deceiving when the denominator is near zero.
This can be a problem when, for example, examining bias relative to a benchmark estimator.
They can also be problematic when the denominator (i.e., the reference value) can take on either negative or positive values.
Finally, if the reference value changes unexpectedly with changes in the simulation context, then relative measures can be misleading.
-We next discuss these issues in more detail, and provide guidance on when to use relative versus absolute performance measures.
+We now discuss these issues in more detail and offer guidance on how to decide whether to use relative or absolute performance measures.
 
### Performance relative to the target parameter
 
@@ -854,29 +830,24 @@
$$
$$
As justification for evaluating bias in relative terms, authors often appeal to @hoogland1998RobustnessStudiesCovariance, who suggested that relative bias of under 5% (i.e., relative bias falling between 0.95 and 1.05) could be considered acceptable for an estimation procedure.
However, @hoogland1998RobustnessStudiesCovariance were writing about a very specific context---robustness studies of structural equation modeling techniques---that have parameters of a particular form.
-In our view, their proposed rule-of-thumb is often generalized far beyond the circumstances where it might be defensible, including to problems where it is clearly arbitrary and inappropriate.
+In our view, their proposed rule-of-thumb is often generalized far beyond the circumstances where it is defensible, including to problems where it is clearly arbitrary and inappropriate.
 
-The problem with relative bias, in particular, is when the target estimand is near zero or zero.
-Say you have a bias of 0.1 when estimating a parameter with true value of 0.2.
-Then the relative bias would be 0.5.
-But if the true value were 0.01, then the relative bias is 10.
+One big problem with relative bias, in particular, arises when the target parameter is equal to or near zero.
+Suppose you have a bias of 0.1 that is consistent across values of the true parameter. For a true parameter value of 0.2, the relative bias would be 0.5.
+But if the true value were 0.01, then the relative bias would be 10!
As the true value gets smaller and smaller, the relative bias will explode.
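To make this concrete, here is a small sketch (with a hypothetical, constant absolute bias of 0.1) of how the same bias translates into wildly different relative bias as the true parameter shrinks:

```{r}
# Hypothetical illustration: a constant absolute bias of 0.1 looks
# dramatically different in relative terms as theta shrinks toward zero.
theta <- c( 1, 0.5, 0.2, 0.1, 0.01 )
bias <- 0.1
data.frame( theta = theta,
            absolute_bias = bias,
            relative_bias = bias / theta )
```

The absolute bias column is constant, but the relative bias column ranges from 10% all the way up to 1000%.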
-Another problem with relative bias is that when estimating a location parameter the location is often fairly arbitrary.
-In our Cluster RCT case, for example, the relative bias of our estimate of the ATE would change if we moved the ATE from 0.2 to 0.5, or from 0.2 to 0 (where relative bias would now be entirely undefined).
-The absolute bias, however, would stay the same regardless of the ATE.
+Another problem with relative bias occurs when the target parameter is on an arbitrary scale, so that the absolute magnitude is not meaningful.
+For example, in our Cluster RCT simulation, the relative bias of the ATE estimators would change if we moved the ATE from 0.2 to 0.5, or from 0.2 to 0 (where relative bias would now be entirely undefined).
+In contrast, the absolute bias would stay the same regardless of the ATE.
 
A more principled approach to choosing between absolute and relative measures is to first consider whether the magnitude of the measure changes across different values of the target parameter $\theta$.
If the estimand of interest is a location parameter, then the bias, variance or RMSE of an estimator will likely not change as $\theta$ changes (see Figure \@ref(fig:absolute-relative), left).
-Do not use relative measures in this case!
-
-
-
-If, however, change $\theta$ would lead to proportionate changes in the magnitude of bias, variance, or RMSE, then relative bias may be useful because it could lead to a simple story such as "the estimator is usually estimating around 12% too high," or similar.
+Focusing on relative measures in this scenario would lead to a much more complicated story because different values of $\theta$ will produce drastically different values, ranging from nearly unbiased to nearly infinite bias (for $\theta$ very close to zero).
+Better to use absolute bias, which is nearly constant and simpler to interpret and explain.
+On the other hand, if changing $\theta$ would lead to proportionate changes in the magnitude of bias, variance, or RMSE, then relative bias may be useful because it could lead to a simple story such as "the estimator is usually estimating around 12% too high," or similar.
This is what is shown at right of Figure \@ref(fig:absolute-relative).
-We usually expect this type of pattern to occur for scale parameters, such as standard deviations or, as we saw earlier, standard errors.
+We usually expect this type of pattern to occur for scale parameters, such as standard deviations or standard errors (as we considered in Section \@ref(measures-for-variance-estimators)).
 
```{r absolute-relative}
#| fig.width: 8
@@ -910,7 +881,7 @@ However, many problems are too complex to be tractable.
 
A generally more feasible route is to evaluate performance for _multiple_ values of the target parameter and then generate a graph such as shown in Figure \@ref(fig:absolute-relative) to understand how performance changes as a function of the target parameter.
If the absolute bias is roughly the same for all values of $\theta$ (as in Scenario A), then use absolute bias.
On the other hand, if the bias grows roughly in proportion to $\theta$ (as in Scenario B), then relative bias might be a better summary criterion.
-Generally, unless you believe bias must be zero if the parameter $\theta$ is zero, absolute bias is usually the better choice.
+Generally, unless you believe bias must be zero if the parameter $\theta$ is zero, absolute bias is usually the safer choice.
Suppose the benchmark completely falls apart under some circumstance, while the method being compared to it worsens only slightly.
The relative measure would then show a marked improvement for that method, which could be deceptive.
-
Ratio comparisons are also less effective for performance measures, such as power, that are on a constrained scale.
If we have a power of 0.05, and we improve it to 0.10, we have doubled our power, but if it is 0.50 and we increase to 0.55, we have only increased by 10%.
Whether this is misleading or not will depend on context.
@@ -939,21 +909,21 @@ Whether this is misleading or not will depend on context.
 
## Estimands Not Represented By a Parameter {#implicit-estimands}
 
-In our Cluster RCT example, we focused on the estimand of the school-level ATE, represented by the model parameter $\gamma_1$.
+In the Cluster RCT example, we focused on the estimand of the school-level ATE, represented by the model parameter $\gamma_1$.
What if we were instead interested in the person-level average effect?
This estimand does not correspond to any input parameter in our data generating process.
Instead, it is defined _implicitly_ by a combination of other parameters.
-In order to compute performance characteristics such as bias and RMSE, we would need to first calculate what the target estimand is based on the inputs of the data-generating processes.
+In order to compute performance characteristics such as bias and RMSE, we would need to first calculate the target estimand based on the inputs to the data-generating process.
There are at least three possible ways to accomplish this: mathematical theory, generating a massive dataset, or tracking the estimand during simulation.
 
-**Mathematical theory.** The first way is to use mathematical distribution theory to compute the implied parameter.
+**Mathematical theory.** One approach is to use mathematical distribution theory to compute the implied parameter.
Our target parameter will be some function of the parameters and random variables in the data-generating process, and it may be possible to evaluate that function algebraically or numerically (i.e., using numerical integration functions such as `integrate()`).
This can be a very worthwhile exercise if it provides insights into the relationship between the target parameter and the inputs of the data-generating process.
However, this approach requires knowledge of distribution theory, and it can get quite complicated and technical.[^cluster-RCT-estimand]
 
[^cluster-RCT-estimand]: In the cluster-RCT example, the distribution theory is tractable. See Exercise \@ref(cluster-RCT-SPATE).
 
-**Massive dataset.** Instead, one can simply generate a massive dataset---one so large that it can stand in for the entire data-generating model---and then simply calculate the target parameter of interest in this massive dataset. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then simply calculating the true person-average effect across all generated clusters.
+**Massive dataset.** Another approach is to generate a massive dataset---one so large that it can stand in for the entire data-generating model---and then calculate the target parameter of interest directly in that dataset. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then calculating the true person-average effect across all generated clusters.
If the dataset is big enough, then the uncertainty in this estimate will be negligible compared to the uncertainty in our simulation.
We implement this approach as follows, generating a dataset with 100,000 clusters:
@@ -969,8 +939,7 @@ ATE_person <- mean( dat$Yobs[dat$Z==1] ) - mean( dat$Yobs[dat$Z==0] )
```
 
Our extremely precise estimate of the person-average effect is `r round( ATE_person, 3)`, which is consistent with what we would expect given the bias we saw earlier for the linear model.
-If we recalculate performance measures for all of our estimators with respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift.
-The standard errors will stay the same, however.
+If we recalculate performance measures for all of our estimators with respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift, although the standard errors will stay the same.
 
```{r}
performance_person_ATE <-
@@ -996,8 +965,8 @@ RMSE_ratio <-
 
For the person-weighted estimand, the aggregation estimator and multilevel model are biased but the linear regression estimator is unbiased.
However, the aggregation estimator and multilevel model estimator still have smaller standard errors than the linear regression estimator.
-RMSE now captures the trade-off between bias and reduced variance.
-Overall, aggregation and multilevel modeling have an RMSE that is around `r round(100 * (RMSE_ratio - 1))`% larger than linear regression.
+RMSE now captures the trade-off between increased bias and reduced variance.
+Overall, aggregation and multilevel modeling have an RMSE that is around `r round(100 * abs(RMSE_ratio - 1))`% smaller than linear regression.
The gain in precision is worth the price paid in bias.
 
**Track during the simulation.** A further approach for calculating `ATE_person` would be to record the true person-average effect of the generated dataset in each simulation iteration, and then average these sample-specific values at the end.
@@ -1033,10 +1002,9 @@ Now when we run our simulation, we will have a column corresponding to the true
We could then take the average of these values across replications to estimate the true person average treatment effect in the population, and then use this as the target parameter for performance calculations.
 
An estimand not represented by any single input parameter is more difficult to work with than one that corresponds directly to an input parameter.
-That said, it might be a more important estimand than ones represented directly by your DGP.
+That said, it might be a more important estimand than one represented directly with an input parameter.
As the above shows, it is quite feasible to examine such estimands with a bit of forethought and careful programming.
-The key is to be clear about what you are trying to estimate because the performance of an estimator depends critically on the estimand against which it is compared.
-
+The key is to be clear about what you are trying to estimate because the performance of an estimator depends on the estimand against which it is compared.
 
## Uncertainty in Performance Estimates (the Monte Carlo Standard Error) {#MCSE}
 
@@ -1045,19 +1013,19 @@ The performance measures we have described are all defined in terms of an infini
Of course, simulations will only involve a finite set of replications, based on which we _estimate_ the measures.
These estimates involve some Monte Carlo error because they are based on a limited number of replications.
If we re-ran the simulation (with a different seed), our estimates would change.
-We next discuss how to quantify how much they might change.
+We can quantify how much they might change by computing Monte Carlo standard errors.
 
To account for Monte Carlo error, we can think of our simulation results as a sample from a population.
Each replication is independently drawn from the population of the sampling distribution.
-Once we frame the problem in these terms, we can easily apply standard statistical techniques to calculate standard errors.
+Once we frame the problem in these terms, we can apply standard statistical techniques to calculate standard errors.
We call these standard errors Monte Carlo Simulation Errors, or MCSEs.
-For most performance measures, we have closed-form expressions for their MCSEs.
-We can estimate MCSEs for the rest via techniques such as the jackknife or bootstrap.
+For most performance measures, closed-form expressions for their MCSEs are available.
+If no closed-form expressions are available, we can still estimate MCSEs using techniques such as the jackknife or bootstrap.
 
### Conventional measures for point estimators
 
-For the measures that we have described for evaluating point estimators, Monte Carlo standard errors can be calculated using conventional formulas.[^estimated-MCSEs]
+For the performance measures that we have described for evaluating point estimators, Monte Carlo standard errors can be calculated using conventional formulas.[^estimated-MCSEs]
Recall that we have a point estimator $T$ of a target parameter $\theta$, and we calculate the mean of the estimator $\bar{T}$ and its sample standard deviation $S_T$ across $R$ replications of the simulation process.
In addition, we will need to calculate the standardized skewness and kurtosis of $T$ as
@@ -1069,9 +1037,10 @@ $$
(\#eq:skewness-kurtosis)
$$
 
-[^estimated-MCSEs]: To be precise, the formulas that we give are _estimators_ for the Monte Carlo standard errors of the performance measure estimators. Our presentation does not emphasize this point because the performance measures will usually be estimated using a large number of replications from an independent and identically distributed process, so the distinction between empirical and estimated Monte Carlo standard errors will not be consequential.
One could write $SE$ and $\widehat{SE}$ to distinguish between the two, but we avoid this notation for clarity and simplicity. See, e.g.,
+[^estimated-MCSEs]: To be precise, the formulas that we give are _estimators_ for the Monte Carlo standard errors of the performance measure estimators. Our presentation does not emphasize this point because the performance measures will usually be estimated using a large number of replications from an independent and identically distributed process, so the distinction between empirical and estimated Monte Carlo standard errors will not be consequential. One could write $SE$ and $\widehat{SE}$ to distinguish between the two, but we avoid this notation for simplicity.
 
-The bias of $T$ is estimated as $\bar{T} - \theta$, so the MCSE for bias, equal to the MCSE of $\bar{T}$ as $\theta$ is fixed, can be estimated as
+The bias of $T$ is estimated as $\bar{T} - \theta$, so the MCSE for bias is equal to the MCSE of $\bar{T}$ because $\theta$ is fixed.
+It can be estimated as
$$
MCSE\left(\widehat{\Bias}(T)\right) = \sqrt{\frac{S_T^2}{R}}.
(\#eq:MCSE-bias)
$$
@@ -1094,13 +1063,13 @@ where $g'(\phi)$ is the derivative of $g(\cdot)$ evaluated at $\phi$.
Following this approximation,
$$ SE( g(X) ) \approx \left| g'(\theta) \right| \times SE(X) .$$
For estimation, we plug in $\hat{\theta}$ and our estimate of $SE(X)$ into the above.
-To find the MCSE for $S_T$, we can apply the delta method approximation to $X = S_T^2$ with $g(x) = \sqrt(x)$ and $g'(x) =\frac{1}{2\sqrt{x}}$. See [ths blog post](https://jepusto.com/posts/Multivariate-delta-method/index.html) for a friendly introduction.
+To find the MCSE for $S_T$, we can apply the delta method approximation to $X = S_T^2$ with $g(x) = \sqrt{x}$ and $g'(x) = \frac{1}{2\sqrt{x}}$. [This blog post](https://jepusto.com/posts/Multivariate-delta-method/index.html) provides a friendly introduction.
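As a small sketch of these calculations (using simulated stand-in replicates, and the standard large-sample approximation $\text{Var}(S_T^2) \approx S_T^4 (k_T - 1) / R$, where $k_T$ is the standardized kurtosis):

```{r}
# Sketch: MCSE of the bias and, via the delta method, of the empirical SE,
# computed from a vector of hypothetical simulation estimates T_r.
set.seed( 42 )
R <- 1000
T_r <- rnorm( R, mean = 0.2, sd = 0.5 )      # stand-in for simulation estimates

S_T <- sd( T_r )
k_T <- mean( (T_r - mean(T_r))^4 ) / S_T^4   # standardized kurtosis

MCSE_bias <- sqrt( S_T^2 / R )
# Delta method with g(x) = sqrt(x): MCSE(S_T) is roughly MCSE(S_T^2) / (2 * S_T)
MCSE_SE <- sqrt( S_T^4 * (k_T - 1) / R ) / (2 * S_T)

c( MCSE_bias = MCSE_bias, MCSE_SE = MCSE_SE )
```

For approximately normal replicates ($k_T \approx 3$), the delta-method line reduces to roughly $S_T / \sqrt{2R}$.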
We estimate RMSE using Equation \@ref(eq:rmse-estimator), which can also be written as
$$
\widehat{\RMSE}(T) = \sqrt{(\bar{T} - \theta)^2 + \frac{R - 1}{R} S_T^2}.
$$
-An MCSE for the estimated mean squared error (the square of RMSE) is
+The MCSE of the estimated mean squared error (the square of RMSE) is
$$
MCSE( \widehat{MSE} ) = \sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3 g_T\left(\bar{T} - \theta\right) + 4 S_T^2 \left(\bar{T} - \theta\right)^2\right]}.
(\#eq:MCSE-MSE)
$$
@@ -1150,23 +1119,21 @@ Performance measures based on winsorization include winsorized bias, winsorized
MCSEs for these measures can be computed using the same formulas as for the conventional measures of bias, empirical standard error, and RMSE, but using sample moments of $\hat{X}_r$ in place of the sample moments of $T_r$.
-
-
### MCSE for Confidence Intervals and Hypothesis Tests
 
Performance measures for confidence intervals and hypothesis tests are simple compared to those we have described for point and variance estimators.
For evaluating hypothesis tests, the main measure is the rejection rate of the test, which is a proportion estimated as $r_\alpha$ (Equation \@ref(eq:rejection-rate-estimate)).
-An estimated MCSE for the estimated rejection rate is
+The MCSE for the estimated rejection rate is
$$
MCSE(r_\alpha) = \sqrt{\frac{r_\alpha ( 1 - r_\alpha)}{R}}.
(\#eq:MCSE-rejection-rate)
$$
This MCSE uses the estimated rejection rate to approximate its Monte Carlo error.
-When evaluating the validity of a test, we may expect the rejection rate to be fairly close to the nominal $\alpha$ level, in which case we could compute a MCSE using $\alpha$ in place of $r_\alpha$, taking $\sqrt{\alpha(1 - \alpha) / R}$.
+When evaluating the validity of a test, we might expect the rejection rate to be fairly close to the nominal $\alpha$ level, in which case we could compute an MCSE using $\alpha$ in place of $r_\alpha$, taking $\sqrt{\alpha(1 - \alpha) / R}$.
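For a sense of scale, we can compute this MCSE at $\alpha = .05$ for a few simulation sizes (a quick sketch):

```{r}
# MCSE of an estimated rejection rate near the nominal alpha = .05 level,
# for several numbers of replications R.
alpha <- 0.05
R <- c( 500, 1000, 5000, 10000 )
data.frame( R = R,
            MCSE = round( sqrt( alpha * (1 - alpha) / R ), 4 ) )
```

With $R = 1000$, the MCSE is about 0.007, so an observed rejection rate of 0.06 would still be within two MCSEs of the nominal .05 level.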
When evaluating power, we will not usually know the neighborhood of the rejection rate in advance of the simulation.
-However, a conservative upper bound on the MCSE can be derived by observing that MCSE is maximized when $\rho_\alpha = \frac{1}{2}$, and so
+However, a conservative upper bound on the MCSE can be derived by observing that MCSE is maximized when $r_\alpha = \frac{1}{2}$, and so
$$
-MCSE(r_\alpha) \leq \sqrt{\frac{1}{4 R}}.
+MCSE(r_\alpha) \leq \frac{1}{2 \sqrt{R}}.
$$
 
When evaluating confidence interval performance, we focus on coverage rates and expected widths.
@@ -1178,7 +1145,7 @@ MCSE(\widehat{\text{Coverage}}(A,B)) = \sqrt{\frac{\beta(1 - \beta)}{R}}.
$$
Alternately, Equation \@ref(eq:MCSE-coverage) could be computed using the estimated coverage rate $\widehat{\text{Coverage}}(A,B)$ in place of $\beta$.
 
-Finally, the expected confidence interval width can be estimated as $\bar{W}$, with MCSE
+Finally, the expected confidence interval width is estimated as $\bar{W}$, with MCSE
$$
MCSE(\bar{W}) = \sqrt{\frac{S_W^2}{R}},
(\#eq:MCSE-width)
$$
@@ -1193,10 +1160,10 @@ For instance, the relative bias of $V$ is estimated as the ratio $\bar{V} / S_T^
 
To properly account for the Monte Carlo uncertainty of the ratio, one possibility is to use formulas for the standard errors of ratio estimators.
Alternately, we can use general uncertainty approximation techniques such as the jackknife or bootstrap [@boos2015Assessing].
The jackknife involves calculating a statistic of interest repeatedly, each time excluding one observation from the calculation.
-The variance of this set of one-left-out statistics then serves as a reasonable approximation to the actual sampling variance of the statistic calculated from the full sample.
+The variance of this set of leave-one-out statistics usually serves as a reasonable approximation to the actual sampling variance of the statistic calculated from the full sample.
To apply the jackknife to assess MCSEs of relative bias or relative RMSE of a variance estimator, we will need to compute several statistics repeatedly.
-Let $\bar{V}_{(j)}$ and $S_{T(j)}^2$ be the average variance estimate and the empirical variance estimate calculated from the set of replicates __*that excludes replicate $j$*__, for $j = 1,...,R$.
+Let $\bar{V}_{(j)}$ and $S_{T(j)}^2$ be the average variance estimate and the empirical variance estimate calculated from the set of replicates *that excludes replicate $j$*, for $j = 1,...,R$.
The relative bias estimate, excluding replicate $j$, would then be $\bar{V}_{(j)} / S_{T(j)}^2$.
Calculating all $R$ versions of this relative bias estimate and taking the variance of these $R$ versions yields a jackknife MCSE:
$$
@@ -1237,14 +1204,14 @@ Instead, all we need to do is calculate the overall mean and overall variance, a
 
Jackknife methods are useful for approximating MCSEs of other performance measures beyond just those for variance estimators.
For instance, the jackknife is a convenient alternative for computing the MCSE of the empirical standard error or (raw) RMSE of a point estimator, which avoids the need to compute skewness or kurtosis.
-However, @boos2015Assessing notes that the jackknife does not work for performance measures involving percentiles, including medians, although bootstrapping (discussed next) usually remains valid.
+However, @boos2015Assessing notes that the jackknife does not work for performance measures involving percentiles, including medians; for such statistics, bootstrapping techniques (discussed next) usually remain valid.
 
-### Using bootstrap for MCSEs for relative performance measures
+### Bootstrapping MCSEs for relative performance measures {#bootstrapping-MCSEs}
 
An alternative to the jackknife for estimating MCSEs is the bootstrap [@efron1994introduction].
The bootstrap involves repeatedly resampling the set of simulation replications with replacement, calculating the performance measure of interest for each resample, and then calculating the standard deviation of the performance measures across the resamples.
 
-To illustrate, we next calculate a bootstrap MCSE for two quantities: the relative improvement in variance for Aggregation vs. Linear Regression, and the relative bias of Aggregation's variance estimator (how well calibrated it is).
+To illustrate, we next calculate a bootstrap MCSE for two quantities: the relative improvement in variance for Aggregation vs. Linear Regression, and the relative bias of Aggregation's variance estimator (i.e., how well calibrated it is).
We first calculate these performance measures for quick reference and for making confidence intervals after we bootstrap:
@@ -1259,7 +1226,7 @@ ests <- runs %>%
imp_var = SE2 / SE2[method=="LR"] )
ests
```
-As before, we see Aggregation has a variance around 9% smaller than LR, and that its variance estimator is inflated by around 13%.
+As before, we see that Aggregation has variance around 9% smaller than linear regression and that its variance estimator is inflated by around 13%.
Let's make bootstrap confidence intervals on these two quantities to account for Monte Carlo error in our simulation.
We first put our data in wide format to ease bootstrapping:
@@ -1287,7 +1254,7 @@ rps <- repeat_and_stack( 1000, {
```
 
For simplicity, we are only looking at the aggregated estimator---we could extend our code to bootstrap all of the estimators.
-Regardless, once we have our 1000 bootstrap iterations, we see how much our performance measures change across iteration:
+Regardless, once we have 1000 bootstrap iterations, we can evaluate how much our performance measures change across iterations:
 
```{r}
rps %>%
@@ -1295,28 +1262,28 @@ rps %>%
group_by( measure ) %>%
summarise(
avg = mean( value ),
-MCSE = sd( value ) ) %>%
-mutate( theta = c( ests$calib[[1]], ests$imp_var[[1]] ),
-CI_l = theta - 2*MCSE,
-CI_u = theta + 2*MCSE
+MCSE = sd( value )
+) %>%
+mutate(
+theta = c( ests$calib[[1]], ests$imp_var[[1]] ),
+CI_l = theta - 2 * MCSE,
+CI_u = theta + 2 * MCSE
) %>%
relocate( measure, theta ) %>%
knitr::kable( digits = 3 )
```
-We see two results:
-First, Aggregation has a lower variance than linear regression, with the upper end of the confidence interval being 97%.
+
+Interpreting the confidence interval in the second row, Aggregation has a lower variance than linear regression, with the upper end of the confidence interval being 97%.
Interpreting the first row, we also have evidence that the Aggregation variance estimator is systematically too high, with the lower end of the confidence interval being above 1 (but only just).
If we had run fewer bootstrap iterations, we may well not have been able to make these claims with much confidence.
-
The bootstrap is a very general tool.
-You can calculate MCSE similarly for nearly any performance measures you can come up with.
-
+It can be applied to assess MCSE for nearly any performance measure you might encounter.
 
## Performance Calculations with the `simhelpers` Package
 
-The `simhelpers` package provides several functions for calculating most of the performance measures that we have reviewed, along with MCSEs for each performance measures.
+The `simhelpers` package provides several functions for calculating most of the performance measures that we have reviewed, along with MCSEs for each of the performance measures.
The functions are easy to use.
Consider this set of simulation runs on the Welch dataset:
@@ -1384,7 +1351,7 @@ all_rejection_rates %>%
 
In Section \@ref(clusterRCTperformance), we computed performance measures for three point estimators of the school-level average treatment effect in a cluster RCT.
We can carry out the same calculations using the `calc_absolute()` function from `simhelpers`, which also provides MCSEs.
Examining the MCSEs is useful to ensure that 1000 replications of the simulation are sufficient to provide reasonably precise estimates of the performance measures.
-In particular, we have:
+We have:
 
```{r cluster-MCSE-calculation}
library( simhelpers )
@@ -1399,11 +1366,11 @@ runs %>%
)
```
 
-We see the MCSEs are quite small relative to the linear regression bias term and all the SEs (`stddev`) and RMSEs.
+The MCSEs are quite small relative to the linear regression bias term and all of the SEs (`stddev`) and RMSEs.
Results based on 1000 replications seem adequate to support our conclusions about the gross trends identified.
We have _not_ simulated enough to rule out the possibility that the aggregation estimator and multilevel modeling estimator could be slightly biased.
Given our MCSEs, they could have true bias of as much as 0.01 (two MCSEs).
-For our variance estimators, we can use the `calc_variance()` function to compute performance measures and MCSEs using the jackknife:
+For our variance estimators, we can use the `calc_relative_var()` function to compute performance measures and MCSEs using the jackknife:
 
```{r cluster-variance-MCSE-calculation}
perf <- runs %>%
@@ -1419,7 +1386,7 @@ perf %>%
dplyr::select( -K_relvar ) %>%
knitr::kable( digits = 2 )
```
-Comparing our calibration results to those we got from the bootstrap, above, we can use our performance table to make confidence intervals as so:
+We can use these calculations to create confidence intervals for the performance measures:
```{r}
perf %>%
dplyr::select( method, rel_bias_var, rel_bias_var_mcse ) %>%
mutate( CI_l = rel_bias_var - 2 * rel_bias_var_mcse,
CI_u = rel_bias_var + 2 * rel_bias_var_mcse ) %>%
knitr::kable( digits = 2 )
```
-We get essentially the same results!
+We get essentially the same results as we found in Section \@ref(bootstrapping-MCSEs) using bootstrapping.
@@ -1427,36 +1394,32 @@
 
## Concluding thoughts
 
-The performance of an estimator is not usually any single number.
-As we saw above, there are many different qualities a given estimator can have, bias and precision being just two of them.
-We list the main ones we discussed on the table below.
+The performance of an estimator cannot usually be captured with any single number.
+As we have seen, there are many different qualities a given estimator can have---bias and precision being just two of them.
+We summarize the main measures for quantifying the performance of point estimators in Table \@ref(tab:performance-measure-summary).
 
Furthermore, many data analysis procedures produce multiple pieces of information---not just point estimates, but also standard errors and confidence intervals and $p$-values from null hypothesis tests---and those pieces are inter-related.
-For instance, a confidence interval is usually computed from a point estimate and its standard error.
-Consequently, the performance of that confidence interval will be strongly affected by whether the point estimator is biased and whether the standard error tends to understates or over-states the true uncertainty.
+For instance, a confidence interval is usually computed from a point estimate and its standard error.
+Consequently, the performance of that confidence interval will be strongly affected by whether the point estimator is biased and whether the standard error tends to understate or overstate the true uncertainty.
Likewise, the performance of a hypothesis testing procedure will often strongly depend on the properties of the point estimator and standard error used to compute the test.
Thus, to understand "performance," most simulations will involve evaluating a data analysis procedure on several measures to arrive at a holistic understanding.
-Moreover, the main aim of many simulations is to compare the performance of several different estimators or to determine which of several data analysis procedures is preferable.
+The main aim of many simulations is to compare the performance of several different estimators or to determine which of several data analysis procedures is preferable.
For such aims, we will need to use a suite of performance measures to understand how different procedures work differently, when and how one is superior to the other, and what factors influence differences in performance.
+In many simulation studies, interest lies in assessing broader patterns of performance, including how performance _varies_ across different conditions of the DGP.
+In the next chapters we will examine how to expand a simulation from a single scenario to multiple scenarios.
+Once we can do this, we will be in a position to evaluate how performance varies as a function of the DGP conditions, which will help us understand when and why different data analysis procedures work well or poorly.
-Finally, we will actually need to assess how performance _changes_ in response to different conditions as given by the DGP. -In the next chapters we talk about how to expand a simulation from a single scenario to multiple scenarios. -Once we do this, we will then begin to graph how performance changes as a function of the DGP conditions, which will help us truly understand when and why different data analysis procedures work well or poorly. + - -### Summary of Peformance Measures - - - +Table: (\#tab:performance-measure-summary) Performance measures for evaluating point estimators | Criterion | Definition | Estimator | Monte Carlo Standard Error | |-------------------------|---------------------------------------------------------|----------------------|----------------------------| -| __Measures for point estimators__ ||| | Bias | $\E(T) - \theta$ | $\bar{T} - \theta$ | \@ref(eq:MCSE-bias) | | Median bias | $\M(T) - \theta$ | $m_T - \theta$ | \@ref(eq:MCSE-median) | | Trimmed bias | @@ -1469,11 +1432,10 @@ Once we do this, we will then begin to graph how performance changes as a functi | Relative median bias | $\M(T) / \theta$ | $m_T / \theta$ | \@ref(eq:MCSE-relative-bias) | | Relative RMSE | $\sqrt{\E\left[\left(T - \theta\right)^2\right]} / \theta$ | $\frac{\sqrt{\left(\bar{T} - \theta\right)^2 + \frac{R - 1}{R}S_T^2}}{\theta}$ | \@ref(eq:MCSE-relative-bias) | - -* Bias and median bias are measures of whether the estimator is systematically higher or lower than the target parameter. -* Variance is a measure of the __precision__ of the estimator---that is, how far it deviates _from its average_. We might look at the square root of this, to assess the precision in the units of the original measure. This is the true SE of the estimator. -* Mean-squared error is a measure of __overall accuracy__, i.e. is a measure how far we typically are from the truth. We more frequently use the root mean squared error, or RMSE, which is just the square root of the MSE. 
-* The median absolute deviation (MAD) is another measure of overall accuracy that is less sensitive to outlier estimates. The RMSE can be driven up by a single bad egg. The MAD is less sensitive to this. + + + +
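As a concrete illustration of the quantities in the summary table, the following R sketch computes several of them from a vector of simulated point estimates. The objects `Tvals` and `theta` are hypothetical stand-ins for simulation output, not objects from the chapter's code:

```r
# Hypothetical example: performance measures computed from R = 1000
# simulated point estimates of a true parameter theta.
set.seed(42)
R <- 1000
theta <- 0.5
Tvals <- rnorm(R, mean = 0.52, sd = 0.10)  # stand-in for simulation output

bias     <- mean(Tvals) - theta            # bias: E(T) - theta
med_bias <- median(Tvals) - theta          # median bias
S2_T     <- var(Tvals)                     # variance (precision)
true_SE  <- sqrt(S2_T)                     # SE of the estimator, in original units
rmse     <- sqrt(mean((Tvals - theta)^2))  # root mean-squared error (accuracy)
mad      <- median(abs(Tvals - theta))     # median absolute deviation from the truth

# MCSE of the estimated bias: S_T / sqrt(R)
bias_mcse <- sd(Tvals) / sqrt(R)
```

Note that the RMSE computed this way agrees exactly with the decomposition in the table, $\sqrt{(\bar{T} - \theta)^2 + \frac{R-1}{R} S_T^2}$, which splits overall accuracy into bias and precision components.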