diff --git a/DESCRIPTION b/DESCRIPTION
index 672156f..54bf8c3 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -18,6 +18,7 @@ Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Suggests:
+    applicable,
    baguette,
    beans,
    bestNormalize,
@@ -40,6 +41,7 @@ Suggests:
    mixOmics,
    multilevelmod,
    nlme,
+    probably,
    ranger,
    roxygen2,
    rsconnect,
@@ -47,6 +49,7 @@ Suggests:
    rules,
    stringr,
    testthat (>= 3.0.0),
+    textrecipes,
    tidymodels,
    tidyposterior,
    tidyverse,
diff --git a/inst/tutorials/02-a-tidyverse-primer/tutorial.Rmd b/inst/tutorials/02-a-tidyverse-primer/tutorial.Rmd
index eab3c50..24fe839 100644
--- a/inst/tutorials/02-a-tidyverse-primer/tutorial.Rmd
+++ b/inst/tutorials/02-a-tidyverse-primer/tutorial.Rmd
@@ -1,6 +1,6 @@
---
title: A Tidyverse Primer
-author: Pratham Kancherla and David Kane
+author: Pratham Kancherla
tutorial:
  id: a-tidyverse-primer
output:
diff --git a/inst/tutorials/04-the-ames-housing-data/tutorial.Rmd b/inst/tutorials/04-the-ames-housing-data/tutorial.Rmd
index d3d2ee0..71f80f9 100644
--- a/inst/tutorials/04-the-ames-housing-data/tutorial.Rmd
+++ b/inst/tutorials/04-the-ames-housing-data/tutorial.Rmd
@@ -1,6 +1,6 @@
---
title: The Ames Housing Data
-author: Pratham Kancherla and David Kane
+author: Pratham Kancherla
tutorial:
  id: the-ames-housing-data
output:
diff --git a/inst/tutorials/07-a-model-workflow/tutorial.Rmd b/inst/tutorials/07-a-model-workflow/tutorial.Rmd
index 3c668dd..6bec576 100644
--- a/inst/tutorials/07-a-model-workflow/tutorial.Rmd
+++ b/inst/tutorials/07-a-model-workflow/tutorial.Rmd
@@ -1,12 +1,12 @@
---
title: A Model Workflow
-author: Pratham Kancherla and David Kane
+author: Pratham Kancherla
tutorial:
  id: a-model-workflow
output:
  learnr::tutorial:
-    progressive: true
-    allow_skip: true
+    progressive: yes
+    allow_skip: yes
runtime: shiny_prerendered
description: 'Tutorial for Chapter 7: A Model Workflow'
---
@@ -73,6 +73,16 @@ multilevel_workflow <-
multilevel_fit <- fit(multilevel_workflow, data = Orthodont)

+parametric_spec <- survival_reg()
+
+parametric_workflow <-
+  workflow() |>
+  add_variables(outcome = c(fustat, futime), predictors = c(age, rx)) |>
+  add_model(parametric_spec,
+            formula = Surv(futime, fustat) ~ age + strata(rx))
+
+parametric_fit <- fit(parametric_workflow, data = ovarian)
+
location <- list(
  longitude = Sale_Price ~ Longitude,
  latitude = Sale_Price ~ Latitude,
@@ -83,7 +93,7 @@ location <- list(
location_models <- workflow_set(preproc = location, models = list(lm = lm_model))
location_models <-
-  location_models %>%
+  location_models |>
  mutate(fit = map(info, ~ fit(.x$workflow[[1]], ames_train)))

final_lm_res <- last_fit(lm_wflow, ames_split)
@@ -840,27 +850,81 @@ fit(multilevel_workflow, data = Orthodont)

### Exercise 15

-
-
-We can even use the previously mentioned `strata()` function from the survival package for survival analysis. Run the following code.
+Type `survival_reg()` and set it equal to `parametric_spec` using `<-`. Then, pipe `workflow()` to `add_variables()`. Add the parameters `outcome`, setting it equal to `c(fustat, futime)`, and `predictors`, setting it equal to `c(age, rx)`.

```{r how-does-a-workflow--15, exercise = TRUE}
-library(censored)
+
+```
+
+
+
+```{r how-does-a-workflow--15-hint-1, eval = FALSE}
+parametric_spec <- survival_reg()
+
+workflow() |>
+  add_variables(outcome = ..., predictors = ...)
+```
+
+```{r include = FALSE}
parametric_spec <- survival_reg()
+workflow() |>
+  add_variables(outcome = c(fustat, futime), predictors = c(age, rx))
+```
+
+###
+
+Outliers can significantly impact analysis; preprocessing involves identifying and handling outliers using techniques like Z-score, IQR, or clustering-based methods.
+
+### Exercise 16
+
+Copy the previous code (delete the `parametric_spec` line) and pipe it to `add_model()`. Add the parameters `parametric_spec` and `formula`, setting `formula` equal to `Surv(futime, fustat) ~ age + strata(rx)`. Then, set the entire expression to `parametric_workflow` using `<-`.
+
+```{r how-does-a-workflow--16, exercise = TRUE}
+
+```
+
+
+
+```{r how-does-a-workflow--16-hint-1, eval = FALSE}
+parametric_workflow <-
+  ... |>
+  add_model(parametric_spec,
+            formula = ...)
+```
+
+```{r include = FALSE}
parametric_workflow <-
-  workflow() %>%
-  add_variables(outcome = c(fustat, futime), predictors = c(age, rx)) %>%
+  workflow() |>
+  add_variables(outcome = c(fustat, futime), predictors = c(age, rx)) |>
  add_model(parametric_spec,
            formula = Surv(futime, fustat) ~ age + strata(rx))
+```
+
+###
+
+Transformation techniques like log-transformations, scaling, and standardization are used to adjust the data distribution or make it suitable for certain algorithms.
+
+### Exercise 17
+
+Type `fit()`. Add the parameters `parametric_workflow` and `data`, setting `data` equal to `ovarian`. Then, set the entire expression to `parametric_fit` using `<-`.
+
+```{r how-does-a-workflow--17, exercise = TRUE}
+
+```
+
+```{r how-does-a-workflow--17-hint-1, eval = FALSE}
+parametric_fit <- fit(..., data = ...)
+```
+
+```{r include = FALSE}
parametric_fit <- fit(parametric_workflow, data = ovarian)
-parametric_fit
```

###

+
+
Great Job! You now know how a workflow uses different sorts of formulas from a data set.

## Creating Multiple Workflows at Once
@@ -1024,12 +1088,12 @@ location_models$fit[[1]]

We use a **purrr** function here to map through our models, but there is an easier, better approach to fit workflow sets that will be introduced in later tutorials.

-### 
+###

Great Job! You now know how to create multiple workflows and put them in a workflow set. You also know how to extract these sets and analyze them based on the model of the chosen workflow set.

## Evaluatin the Test Set

-### 
+###

Let’s say that we’ve concluded our model development and have settled on a final model. There is a convenience function called `last_fit()` that will fit the model to the entire training set and evaluate it with the testing set.

### Exercise 1

Enter `last_fit()` and add the parameter `lm_wflow`. Hit "Run Code." (Note: This code will take a while to run)

```{r evaluatin-the-test-s-1, exercise = TRUE}

```

-```{r evaluatin-the-test-s-1-hint, eval = FALSE}
+```{r evaluatin-the-test-s-1-hint-1, eval = FALSE}
last_fit(...)
```

-```{r, include = FALSE}
+```{r include = FALSE}
#last_fit(lm_wflow)
```

-### 
+###

The `last_fit()` function is used to fit a model on the last split of a resampled data set, typically obtained through cross-validation or bootstrapping. It is useful when you want to use the final model trained on the entire training dataset for making predictions on new, unseen data.

### Exercise 2

We always need to a have split for `last_fit()`. Add the parameter `ames_split`

-```{r evaluatin-the-test-s-2-hint, eval = FALSE}
+```{r evaluatin-the-test-s-2-hint-1, eval = FALSE}
final_lm_res <- last_fit(lm_wflow, ...)
``` -```{r, include = FALSE} +```{r include = FALSE} final_lm_res <- last_fit(lm_wflow, ames_split) ``` -### +### The .workflow column contains the fitted workflow and can be pulled out of the results using `extract_workflow()`. @@ -1083,15 +1147,15 @@ Use `extract_workflow()` and add the parameter `final_lm_res`. Hit "Run Code". ``` -```{r evaluatin-the-test-s-3-hint, eval = FALSE} +```{r evaluatin-the-test-s-3-hint-1, eval = FALSE} extract_workflow(...) ``` -```{r, include = FALSE} +```{r include = FALSE} extract_workflow(final_lm_res) ``` -### +### `collect_metrics()` and `collect_predictions()` provide access to the performance metrics and predictions, respectively. The `collect_metrics()` function is a lovely way to extract model performance metrics with resampling. `collect_predictions()` can summarize the various results over replicate out-of-sample predictions. @@ -1105,17 +1169,17 @@ Run `collect_metrics()` and `collect_predictions()`, on separate lines, with the -```{r evaluatin-the-test-s-4-hint, eval = FALSE} +```{r evaluatin-the-test-s-4-hint-1, eval = FALSE} c_mtrcs <- collect_metrics(...) c_predic <- collect_predictions(...) ``` -```{r, include = FALSE} +```{r include = FALSE} c_mtrcs <- collect_metrics(final_lm_res) c_predic <- collect_predictions(final_lm_res) ``` -### +### Statistical metrics are used to describe the distribution of data, compare groups, assess relationships between variables, and draw conclusions from data.The model takes the predictor variables from the test data and generates predictions for the outcome variable. For example, in linear regression, the model estimates the response variable based on the values of the predictor variables. @@ -1129,19 +1193,19 @@ Finally, lets `slice()` the predictions output, as it is too many unnecessary ro -```{r evaluatin-the-test-s-5-hint, eval = FALSE} +```{r evaluatin-the-test-s-5-hint-1, eval = FALSE} c_predic <- collect_predictions(final_lm_res) |> slice(...) ``` -```{r, include = FALSE} +```{r include = FALSE} c_predic <- collect_predictions(final_lm_res) |> slice(1:5) ``` -### +### Great Job! You now know how to evaluate a testing set by using `last_fit()` and statistical metrics and predictions using the `collect_metrics()` and `collect_predictions()`. diff --git a/inst/tutorials/09-judging-model-effectiveness/tutorial.Rmd b/inst/tutorials/09-judging-model-effectiveness/tutorial.Rmd index c32aeb6..c45a596 100644 --- a/inst/tutorials/09-judging-model-effectiveness/tutorial.Rmd +++ b/inst/tutorials/09-judging-model-effectiveness/tutorial.Rmd @@ -1,6 +1,6 @@ --- title: Judging Model Effectiveness -author: Pratham Kancherla and David Kane +author: Pratham Kancherla tutorial: id: judging-model-effectiveness output: diff --git a/inst/tutorials/11-comparing-models/tutorial.Rmd b/inst/tutorials/11-comparing-models/tutorial.Rmd index 7e0b1f8..38564a8 100644 --- a/inst/tutorials/11-comparing-models/tutorial.Rmd +++ b/inst/tutorials/11-comparing-models/tutorial.Rmd @@ -1,6 +1,6 @@ --- title: Comparing Models with Resampling -author: Pratham Kancherla and David Kane +author: Pratham Kancherla tutorial: id: comparing-models-with-resampling output: @@ -1735,7 +1735,6 @@ How does the number of resamples affect these types of formal Bayesian compariso Great Job! You now know have basic understanding of Bayesian Methods and how to analyze these methods using models and functions to make these models. 
-
## Summary

###

diff --git a/inst/tutorials/15-screening-many-models/tutorial.Rmd b/inst/tutorials/15-screening-many-models/tutorial.Rmd
index 54ce6b1..3a13b6d 100644
--- a/inst/tutorials/15-screening-many-models/tutorial.Rmd
+++ b/inst/tutorials/15-screening-many-models/tutorial.Rmd
@@ -166,6 +166,70 @@ with_features <-
all_workflows <-
  bind_rows(no_pre_proc, normalized, with_features) |>
  mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
+
+grid_ctrl <-
+  control_grid(
+    save_pred = TRUE,
+    parallel_over = "everything",
+    save_workflow = TRUE
+  )
+
+grid_results <-
+  all_workflows |>
+  workflow_map(
+    seed = 1503,
+    resamples = concrete_folds,
+    grid = 25,
+    control = grid_ctrl
+  )
+
+race_ctrl <-
+  control_race(
+    save_pred = TRUE,
+    parallel_over = "everything",
+    save_workflow = TRUE
+  )
+
+race_results <-
+  all_workflows |>
+  workflow_map(
+    "tune_race_anova",
+    seed = 1503,
+    resamples = concrete_folds,
+    grid = 25,
+    control = race_ctrl
+  )
+
+matched_results <-
+  rank_results(race_results, select_best = TRUE) |>
+  select(wflow_id, .metric, race = mean, config_race = .config) |>
+  inner_join(
+    rank_results(grid_results, select_best = TRUE) |>
+      select(wflow_id, .metric, complete = mean,
+             config_complete = .config, model),
+    by = c("wflow_id", ".metric")
+  ) |>
+  filter(.metric == "rmse")
+
+best_results <-
+  race_results |>
+  extract_workflow_set_result("boosting") |>
+  select_best(metric = "rmse")
+
+boosting_test_results <-
+  race_results |>
+  extract_workflow("boosting") |>
+  finalize_workflow(best_results) |>
+  last_fit(split = concrete_split)
```

```{r copy-code-chunk, child = system.file("child_documents/copy_button.Rmd", package = "tutorial.helpers")}
```

@@ -174,13 +238,12 @@
```{r info-section, child = system.file("child_documents/info_section.Rmd", package = "tutorial.helpers")}
```

-
## Introduction

###

-
+This tutorial covers [Chapter 15: Screening Many Models](https://www.tmwr.org/workflow-sets) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. This tutorial will illustrate how to use workflow sets to investigate multiple models or feature engineering strategies simultaneously. Racing methods can more efficiently rank models than fitting every candidate model being considered.

## Modeling Concrete Mixture Strength

###

@@ -442,7 +505,7 @@ normalized_rec <-

###

-
+The `all_predictors()` function is used to select all predictor (feature) variables from a data frame or tibble. It's often used within the context of defining a data preprocessing recipe, where you specify the operations you want to perform on your data.

### Exercise 12

@@ -549,7 +612,7 @@ linear_reg(penalty = tune(), mixture = tune())

###

-
+Techniques like Recursive Feature Elimination (RFE) help iteratively eliminate less important features from the model based on their impact on performance.

### Exercise 17

@@ -575,7 +638,7 @@ linear_reg_spec <-

###

-
+**glmnet** is an R package used for fitting and analyzing regularized linear regression models, including the popular LASSO (Least Absolute Shrinkage and Selection Operator) and elastic net regularization methods.
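
As a brief aside (an illustration of mine, not one of the exercises), here is a minimal sketch of how these ideas map onto the spec above: in parsnip's `linear_reg()`, `penalty` corresponds to glmnet's `lambda` and `mixture` to its `alpha`, where `mixture = 1` gives a pure lasso model and `mixture = 0` pure ridge regression.

```r
library(tidymodels)

# Pure lasso with a fixed amount of regularization; swap in mixture = 0
# for ridge, or a value in between for an elastic net.
lasso_spec <-
  linear_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet")
```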
### Exercise 18

@@ -595,7 +658,7 @@ mlp(hidden_units = tune(), penalty = tune(), epochs = tune())

###

-
+`mlp()`, for multilayer perceptron, is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R or via keras.

### Exercise 19

@@ -646,7 +709,7 @@ nnet_spec <-

###

-
+The **randomForest** package provides a way to assess variable importance using the Gini index or Mean Decrease in Accuracy in a random forest model.

### Exercise 21

@@ -696,7 +759,7 @@ mars_spec <-

###

-
+The `earth` engine refers to the **earth** R package, which fits multivariate adaptive regression splines (MARS) models; it is the engine that parsnip's `mars()` specification uses.

### Exercise 23

@@ -744,7 +807,7 @@ svm_r_spec <-

###

-
+**kernlab** is an R package that provides tools for kernel-based machine learning, including support vector machines (SVMs) and kernel methods.

### Exercise 25

@@ -792,7 +855,7 @@ svm_p_spec <-

###

-
+Kernel methods are powerful techniques used for non-linear classification, regression, and other types of machine learning tasks.

### Exercise 27

@@ -1106,8 +1169,8 @@ nnet_spec |>

###

-
-
+Support Vector Machines (SVM) use kernel functions to map data into higher-dimensional spaces, allowing them to capture complex relationships.
+
### Exercise 40

Copy the previous code and pipe it to `update()`, setting the parameter `hidden_units` to `hidden_units(c(1, 27))`. Then, set the entire expression to `nnet_param` using `<-`.

@@ -1171,7 +1234,7 @@ workflow_set(

###

-
+**kernlab** also provides support for other machine learning algorithms, including Least Squares Support Vector Machines (LSSVM) and Kernel Ridge Regression (KRR).

### Exercise 2

@@ -1254,7 +1317,7 @@ normalized1 <-

###

-
+Techniques like Recursive Feature Elimination (RFE) help iteratively eliminate less important features from the model based on their impact on performance.

### Exercise 5

@@ -1273,12 +1336,12 @@ model_vars <-

```{r include = FALSE}
model_vars <-
  workflow_variables(outcomes = compressive_length,
-          predictors = everything())
+                     predictors = everything())
```

###

-
+LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique often used for feature selection by shrinking less important features to zero.

### Exercise 6

@@ -1300,7 +1363,7 @@ workflow_set(preproc = list(...))

###

-
+Screening models are often categorized into filter methods (based on statistical metrics) and wrapper methods (based on model performance).

### Exercise 7

@@ -1341,7 +1404,7 @@ no_pre_proc <-

###

-
+Forward selection starts with an empty set of features and adds one feature at a time based on its impact on model performance.

### Exercise 8

@@ -1369,7 +1432,7 @@ workflow_set(

###

-
+Information criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) can be used for model selection and feature inclusion.

### Exercise 9

@@ -1420,8 +1483,7 @@ bind_rows(no_pre_proc, normalized, with_features)

###

-
-
+Some machine learning algorithms, like LASSO and Elastic Net, have built-in feature selection mechanisms that perform selection during the model fitting process.

### Exercise 11

@@ -1504,7 +1566,7 @@ all_workflows |>

###

-
+Some screening models provide a "regularization path" showing how the importance of features changes as the strength of regularization varies.
### Exercise 3

```{r tuning-and-evaluatin-3, exercise = TRUE}

```

@@ -1527,8 +1589,6 @@ grid_results <-
  )
```

-
-
```{r include = FALSE}
# grid_results <-
#   all_workflows %>%
#   workflow_map(
#     seed = 1503,
#     resamples = concrete_folds,
#     grid = 25,
#     control = grid_ctrl
#   )
```

The `option` column now contains all of the options that we used in the `workflow_map()` call.

### Exercise 4

+Pipe `grid_results` into `rank_results()`. Then, pipe the result to `filter()`. Add the parameter `.metric == "rmse"`.
+
```{r tuning-and-evaluatin-4, exercise = TRUE}

```

+```{r tuning-and-evaluatin-4-hint-1, eval = FALSE}
+grid_results |>
+  ...() |>
+  filter(.metric == "...")
+```
+
+```{r include = FALSE}
+grid_results |>
+  rank_results() |>
+  filter(.metric == "rmse")
+```
+
+###
+
+There are a few convenience functions for examining results such as `grid_results`. The `rank_results()` function will order the models by some performance metric.
+
+### Exercise 5
+
+Copy the previous code and pipe it to `select()`. Add the parameters `model`, `.config`, `rank`, and `rmse`, setting `rmse` equal to `mean`.
+
+```{r tuning-and-evaluatin-5, exercise = TRUE}
+
+```
+
+```{r tuning-and-evaluatin-5-hint-1, eval = FALSE}
+... |>
+  select(..., .config, ... = mean, rank)
+```
+
+```{r include = FALSE}
+grid_results |>
+  rank_results() |>
+  filter(.metric == "rmse") |>
+  select(model, .config, rmse = mean, rank)
+```
+
+###
+
+Also by default, the function ranks all of the candidate sets; that’s why the same model can show up multiple times in the output. An option, called `select_best`, can be used to rank the models using their best tuning parameter combination.
+
+### Exercise 6
+
+Type `autoplot()`. Add the parameters `grid_results`, `rank_metric = "rmse"`, `metric = "rmse"`, and `select_best = TRUE`.
+
+```{r tuning-and-evaluatin-6, exercise = TRUE}
+
+```
+
+```{r tuning-and-evaluatin-6-hint-1, eval = FALSE}
+autoplot(
+  grid_results,
+  rank_metric = "...",
+  metric = "rmse",
+  ... = TRUE)
+```
+
+```{r include = FALSE}
+autoplot(
+  grid_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE)
+```
+
+###
+
+Screening models often incorporate dimension reduction techniques like Principal Component Analysis (PCA) or Partial Least Squares (PLS) to reduce the number of features while retaining essential information.
+
+### Exercise 7
+
+Copy the previous code and add `geom_text()`. Within `aes()` of `geom_text()`, add the parameters `y`, setting it equal to `mean - 1/2`, and `label`, setting it equal to `wflow_id`.
+
+```{r tuning-and-evaluatin-7, exercise = TRUE}
+
+```
+
+```{r tuning-and-evaluatin-7-hint-1, eval = FALSE}
+... +
+  geom_text(aes(y = ..., label = ...))
+```
+
+```{r include = FALSE}
+autoplot(
+  grid_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE
+  ) +
+  geom_text(aes(y = mean - 1/2, label = wflow_id))
+```
+
+###
+
+Model screening aims to strike a balance between reducing model complexity (variance reduction) and maintaining useful information (avoiding bias).
+
+### Exercise 8
+
+Copy the previous code and add the parameters `angle = 90` and `hjust = 1` outside of the `aes()` in `geom_text()`.
+
+```{r tuning-and-evaluatin-8, exercise = TRUE}
+
+```
+
+```{r tuning-and-evaluatin-8-hint-1, eval = FALSE}
+... +
+  geom_text(aes(y = mean - 1/2, label = wflow_id), angle = ..., hjust = ...)
+``` + +```{r include = FALSE} +autoplot( + grid_results, + rank_metric = "rmse", + metric = "rmse", + select_best = TRUE + ) + + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) +``` + +### + +Some screening techniques assess the stability of feature selection by comparing results across different subsets of data or resampling iterations. + +### Exercise 9 + +Copy the previous code and add `lims()`. Add the parameter `y` and set it equal to a vector consistent of `3.5` and `9.5`. + +```{r tuning-and-evaluatin-9, exercise = TRUE} + +``` + + + +```{r tuning-and-evaluatin-9-hint-1, eval = FALSE} +... + + lims(y = c(..., ...)) +``` + +```{r include = FALSE} +autoplot( + grid_results, + rank_metric = "rmse", + metric = "rmse", + select_best = TRUE + ) + + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) + + lims(y = c(3.5, 9.5)) +``` + +### + +Univariate screening methods evaluate features one at a time, while multivariate methods consider interactions between features during selection. + +### Exercise 10 + +Copy the previous code and add `theme()`. Add the parameter `legend.position` and set it equal to `"none"`. + +```{r tuning-and-evaluatin-10, exercise = TRUE} + +``` + + + +```{r tuning-and-evaluatin-10-hint-1, eval = FALSE} +... + + theme(legend.position = "...") +``` + +```{r include = FALSE} +autoplot( + grid_results, + rank_metric = "rmse", + metric = "rmse", + select_best = TRUE + ) + + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) + + lims(y = c(3.5, 9.5)) + + theme(legend.position = "none") +``` + +### + +In case you want to see the tuning parameter results for a specific model, the `id` argument can take a single value from the `wflow_id` column for which model to plot. + +### Exercise 11 + +Type `autoplot()` and add the parametes `grid_results`, `id`, setting it equal to `"Cubist"`, and `metric`, setting equal `"rmse"`. + +```{r tuning-and-evaluatin-11, exercise = TRUE} + +``` + +```{r tuning-and-evaluatin-11-hint-1, eval = FALSE} +autoplot(grid_results, id = "...", metric = "...") +``` + +```{r include = FALSE} +autoplot(grid_results, id = "Cubist", metric = "rmse") +``` + +### + +There are also methods for `collect_predictions()` and `collect_metrics()`. The example model screening with our concrete mixture data fits a total of 12,600 models. Using 2 workers in parallel, the estimation process took 2 hours to complete. + +Great Job! You have learned to tune and evaluate the models that we have created in the previous section successfully. + +## Efficiently Screening Models +### + +With a workflow set, we can use the `workflow_map()` function for this racing approach. Recall that after we pipe in our workflow set, the argument we use is the function to apply to the workflows; in this case, we can use a value of `"tune_race_anova"`. + +### Exercise 1 + +Type `control_race()`. Add the parameters `save_pred`, setting it equal to `TRUE`, `parallel_over`, setting it equal to `"everything"`, and `save_workflow`, setting it equal to `TRUE`. Set the entire expression to `race_ctrl`. + +```{r efficiently-screenin-1, exercise = TRUE} + +``` + +```{r efficiently-screenin-1-hint-1, eval = FALSE} +race_ctrl <- + control_race( + save_pred = ..., + parallel_over = "...", + save_workflow = ... + ) +``` + +```{r include = FALSE} +race_ctrl <- + control_race( + save_pred = TRUE, + parallel_over = "everything", + save_workflow = TRUE + ) +``` + +### + +`control_race()` control aspects of the grid search racing process. 
+
+### Exercise 2
+
+Type `all_workflows` and pipe it to `workflow_map()`. Add the parameter `"tune_race_anova"`.
+
+```{r efficiently-screenin-2, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-2-hint-1, eval = FALSE}
+all_workflows |>
+  workflow_map(
+    "..."
+  )
+```
+
+```{r include = FALSE}
+all_workflows |>
+  workflow_map(
+    "tune_race_anova"
+  )
+```
+
+###
+
+Apart from selecting features, model screening might also involve creating new features through transformations, interactions, and other engineering techniques.
+
+### Exercise 3
+
+Copy the previous code and add the parameters `seed = 1503`, `resamples = concrete_folds`, `grid = 25`, and `control = race_ctrl`. Set the entire expression to `race_results`. Type `race_results` to see the outcome.
+
+```{r efficiently-screenin-3, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-3-hint-1, eval = FALSE}
+race_results <-
+  ... |>
+  workflow_map(
+    "tune_race_anova",
+    seed = ...,
+    resamples = ...,
+    grid = ...,
+    control = race_ctrl
+  )
+```
+
+```{r include = FALSE}
+race_results <-
+  all_workflows |>
+  workflow_map(
+    "tune_race_anova",
+    seed = 1503,
+    resamples = concrete_folds,
+    grid = 25,
+    control = race_ctrl
+  )
+```
+
+###
+
+The new object looks very similar, although the elements of the result column show a value of "race[+]", indicating a different type of object.
+
+### Exercise 4
+
+Type `autoplot()` and add the parameters `race_results`, `rank_metric`, setting it equal to `"rmse"`, `metric`, setting it equal to `"rmse"`, and `select_best`, setting it equal to `TRUE`.
+
+```{r efficiently-screenin-4, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-4-hint-1, eval = FALSE}
+autoplot(
+  race_results,
+  rank_metric = "...",
+  metric = "...",
+  select_best = ...
+)
+```
+
+```{r include = FALSE}
+autoplot(
+  race_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE
+)
+```
+
+###
+
+Effective model screening often involves thorough data preprocessing steps, including handling missing values, encoding categorical variables, and scaling features.
+
+### Exercise 5
+
+Copy the previous code and add `geom_text()`. Within `aes()` of `geom_text()`, add the parameters `y = mean - 1/2` and `label = wflow_id`.
+
+```{r efficiently-screenin-5, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-5-hint-1, eval = FALSE}
+... +
+  geom_text(aes(y = ..., label = ...))
+```
+
+```{r include = FALSE}
+autoplot(
+  race_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE
+) +
+  geom_text(aes(y = mean - 1/2, label = wflow_id))
+```
+
+###
+
+Visualization tools like heatmaps, correlation plots, and scatter plots can help identify patterns and relationships among features before and after screening.
+
+### Exercise 6
+
+Copy the previous code and add `lims()`. Add the parameter `y` and set it equal to a vector consisting of `3.0` and `9.5`.
+
+```{r efficiently-screenin-6, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-6-hint-1, eval = FALSE}
+... +
+  lims(y = c(..., ...))
+```
+
+```{r include = FALSE}
+autoplot(
+  race_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE
+) +
+  geom_text(aes(y = mean - 1/2, label = wflow_id)) +
+  lims(y = c(3.0, 9.5))
+```
+
+###
+
+Choose appropriate evaluation metrics that align with your problem and goals. For classification, accuracy, precision, recall, and F1-score can be informative; for regression, RMSE or MAE might be suitable.
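
For instance, here is a minimal sketch with the **yardstick** package (an aside of mine; it is loaded with **tidymodels** but not used directly in these exercises): `metric_set()` bundles several metrics so a tuning or resampling function can compute them all at once.

```r
library(yardstick)

# A reusable set of regression metrics: root mean squared error,
# mean absolute error, and R-squared.
reg_metrics <- metric_set(rmse, mae, rsq)
```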
+
+### Exercise 7
+
+Copy the previous code and add `theme()`. Add the parameter `legend.position` and set it equal to `"none"`.
+
+```{r efficiently-screenin-7, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-7-hint-1, eval = FALSE}
+... +
+  theme(... = "none")
+```
+
+```{r include = FALSE}
+autoplot(
+  race_results,
+  rank_metric = "rmse",
+  metric = "rmse",
+  select_best = TRUE
+) +
+  geom_text(aes(y = mean - 1/2, label = wflow_id)) +
+  lims(y = c(3.0, 9.5)) +
+  theme(legend.position = "none")
+```
+
+###
+
+Overall, the racing approach estimated a total of 1,050 models, 8.33% of the full set of 12,600 models in the full grid. As a result, the racing approach was 4.7-fold faster.
+
+### Exercise 8
+
+Did we get similar results? For both objects, we rank the results, merge them, and plot them against one another.
+
+Type `rank_results()` and add the parameters `race_results` and `select_best`, setting it equal to `TRUE`.
+
+```{r efficiently-screenin-8, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-8-hint-1, eval = FALSE}
+rank_results(..., select_best = ...)
+```
+
+```{r include = FALSE}
+rank_results(race_results, select_best = TRUE)
+```
+
+###
+
+Choose algorithms that are computationally efficient for screening, especially when dealing with large datasets. Linear models or tree-based methods are often faster than complex models like deep neural networks.
+
+### Exercise 9
+
+Copy the previous code and pipe it to `select()`. Add the parameters `wflow_id`, `.metric`, `race = mean`, and `config_race = .config`.
+
+```{r efficiently-screenin-9, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-9-hint-1, eval = FALSE}
+... |>
+  select(wflow_id, .metric, race = mean, config_race = .config)
+```
+
+```{r include = FALSE}
+rank_results(race_results, select_best = TRUE) |>
+  select(wflow_id, .metric, race = mean, config_race = .config)
+```
+
+###
+
+If the dataset contains distinct subgroups, consider conducting separate screening processes for each subgroup to capture subgroup-specific features.
+
+### Exercise 10
+
+Copy the previous code and pipe it to `inner_join()`. Within `inner_join()`, call `rank_results()` with the parameters `grid_results` and `select_best = TRUE`.
+
+```{r efficiently-screenin-10, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-10-hint-1, eval = FALSE}
+... |>
+  inner_join(
+    rank_results(..., select_best = ...)
+  )
+```
+
+```{r include = FALSE}
+rank_results(race_results, select_best = TRUE) |>
+  select(wflow_id, .metric, race = mean, config_race = .config) |>
+  inner_join(
+    rank_results(grid_results, select_best = TRUE)
+  )
+```
+
+###
+
+Create automated pipelines that integrate data preprocessing, feature selection, and model training to streamline the screening process.
+
+### Exercise 11
+
+Copy the previous code and pipe the `rank_results()` call inside `inner_join()` to `select()`. Add `wflow_id`, `.metric`, `complete = mean`, `config_complete = .config`, and `model`.
+
+```{r efficiently-screenin-11, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-11-hint-1, eval = FALSE}
+... |>
+  inner_join(
+    ... |>
+      select(..., .metric, ... = mean, ... = .config, model)
+  )
```

```{r include = FALSE}
rank_results(race_results, select_best = TRUE) |>
  select(wflow_id, .metric, race = mean, config_race = .config) |>
  inner_join(
    rank_results(grid_results, select_best = TRUE) |>
      select(wflow_id, .metric, complete = mean, config_complete = .config, model)
  )
```

###

Before screening, handle outliers using methods like winsorization or transformation to ensure they don't disproportionately influence feature selection.

### Exercise 12

Copy the previous code and add the parameter `by` inside `inner_join()`, setting it equal to a vector consisting of `"wflow_id"` and `".metric"`.

```{r efficiently-screenin-12, exercise = TRUE}

```

```{r efficiently-screenin-12-hint-1, eval = FALSE}
... |>
  inner_join(
    ...,
    by = c("...", "...")
  )
```

```{r include = FALSE}
rank_results(race_results, select_best = TRUE) |>
  select(wflow_id, .metric, race = mean, config_race = .config) |>
  inner_join(
    rank_results(grid_results, select_best = TRUE) |>
      select(wflow_id, .metric, complete = mean, config_complete = .config, model),
    by = c("wflow_id", ".metric")
  )
```

###

When working with imbalanced datasets, employ techniques like oversampling, undersampling, or Synthetic Minority Over-sampling Technique (SMOTE) to balance class distributions during screening.

### Exercise 13

Copy the previous code and, outside of `inner_join()`, pipe it to `filter()`. Add the parameter `.metric == "rmse"`. Then set the entire expression to `matched_results`.

```{r efficiently-screenin-13, exercise = TRUE}

```

```{r efficiently-screenin-13-hint-1, eval = FALSE}
matched_results <-
  ... |>
  filter(.metric == "...")
```

```{r include = FALSE}
matched_results <-
  rank_results(race_results, select_best = TRUE) |>
  select(wflow_id, .metric, race = mean, config_race = .config) |>
  inner_join(
    rank_results(grid_results, select_best = TRUE) |>
      select(wflow_id, .metric, complete = mean,
             config_complete = .config, model),
    by = c("wflow_id", ".metric")
  ) |>
  filter(.metric == "rmse")
```

###

When training complex models, like neural networks, implement early stopping criteria to halt the training process if the model's performance stops improving, saving time and resources.

### Exercise 14

Type `matched_results` and pipe it to `ggplot()`. Within `aes()` of `ggplot()`, add the parameters `x`, setting it equal to `complete`, and `y`, setting it equal to `race`.

```{r efficiently-screenin-14, exercise = TRUE}

```

```{r efficiently-screenin-14-hint-1, eval = FALSE}
matched_results |>
  ggplot(aes(x = ..., y = ...))
```

```{r include = FALSE}
matched_results |>
  ggplot(aes(x = complete, y = race))
```

###

Utilize dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE to transform high-dimensional data into a lower-dimensional space while preserving essential information.

### Exercise 15

Copy the previous code and add `geom_abline()`. Add the parameter `lty`, setting it equal to `3`.

```{r efficiently-screenin-15, exercise = TRUE}

```

```{r efficiently-screenin-15-hint-1, eval = FALSE}
... +
  geom_abline(lty = ...)
```

```{r include = FALSE}
matched_results |>
  ggplot(aes(x = complete, y = race)) +
  geom_abline(lty = 3)
```

###

Implement model-agnostic methods like Recursive Feature Elimination (RFE) that can be used with various algorithms without requiring modifications.
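
As a hedged illustration (an assumption on my part, since RFE is not one of the tidymodels recipe steps), the **caret** package implements this idea in its `rfe()` function, which refits a model on shrinking subsets of predictors; `x` and `y` below are placeholders for a predictor data frame and an outcome vector.

```r
library(caret)

# Cross-validated recursive feature elimination with random forest
# importance scores, trying subsets of 2, 4, and 8 predictors.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x, y, sizes = c(2, 4, 8), rfeControl = ctrl)
```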
+
+### Exercise 16
+
+Copy the previous code and add `geom_point()`. Then, add `geom_text_repel()`. Within `aes()` of `geom_text_repel()`, set `label` equal to `model`.
+
+```{r efficiently-screenin-16, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-16-hint-1, eval = FALSE}
+... +
+  ...() +
+  geom_text_repel(aes(label = ...))
+```
+
+```{r include = FALSE}
+matched_results |>
+  ggplot(aes(x = complete, y = race)) +
+  geom_abline(lty = 3) +
+  geom_point() +
+  geom_text_repel(aes(label = model))
+```
+
+###
+
+Consider creating meta-features that aggregate or transform multiple original features before screening. These can capture higher-level patterns or relationships.
+
+### Exercise 17
+
+Copy the previous code and add `coord_obs_pred()`. Then, add `labs()`. Add the parameters `x`, setting it equal to `"Complete Grid RMSE"`, and `y`, setting it equal to `"Racing RMSE"`.
+
+```{r efficiently-screenin-17, exercise = TRUE}
+
+```
+
+```{r efficiently-screenin-17-hint-1, eval = FALSE}
+... +
+  coord_obs_pred() +
+  labs(x = "...", y = "...")
+```
+
+```{r include = FALSE}
+matched_results |>
+  ggplot(aes(x = complete, y = race)) +
+  geom_abline(lty = 3) +
+  geom_point() +
+  geom_text_repel(aes(label = model)) +
+  coord_obs_pred() +
+  labs(x = "Complete Grid RMSE", y = "Racing RMSE")
+```
+
+###
+
+Great Job! You now know how to efficiently screen models by using methods such as `control_race()`, `workflow_map()`, `rank_results()`, etc.
+
+## Finalizing a Model
+###
+
+Similar to what we have shown in previous chapters, the process of choosing the final model and fitting it on the training set is straightforward. The first step is to pick a workflow to finalize. Since the boosted tree model worked well, we’ll extract that from the set, update the parameters with the numerically best settings, and fit to the training set.
+
+### Exercise 1
+
+Type `race_results` and pipe it to `extract_workflow_set_result()`. Add the parameter `"boosting"`.
+
+```{r finalizing-a-model-1, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-1-hint-1, eval = FALSE}
+race_results |>
+  extract_workflow_set_result("...")
+```
+
+```{r include = FALSE}
+race_results |>
+  extract_workflow_set_result("boosting")
+```
+
+###
+
+Perform sensitivity analysis by re-evaluating model performance with slight variations in the selected features to ensure their stability.
+
+### Exercise 2
+
+Copy the previous code and pipe it to `select_best()`. Add the parameter `metric` and set it equal to `"rmse"`. Then, set the entire expression to `best_results`.
+
+```{r finalizing-a-model-2, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-2-hint-1, eval = FALSE}
+best_results <-
+  ... |>
+  select_best(metric = "rmse")
+```
+
+```{r include = FALSE}
+best_results <-
+  race_results |>
+  extract_workflow_set_result("boosting") |>
+  select_best(metric = "rmse")
+```
+
+###
+
+Experiment with multiple screening algorithms and compare their results to choose the one that aligns best with your data and objectives.
+
+### Exercise 3
+
+Type `race_results` and pipe it to `extract_workflow("boosting")`. Then pipe it to `finalize_workflow()`, adding the parameter `best_results`.
+
+```{r finalizing-a-model-3, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-3-hint-1, eval = FALSE}
+race_results |>
+  extract_workflow("...") |>
+  finalize_workflow(...)
+```
+
+```{r include = FALSE}
+race_results |>
+  extract_workflow("boosting") |>
+  finalize_workflow(best_results)
+```
+
+###
+
+Choose algorithms that are specifically designed for speed and efficiency in high-dimensional data, such as Fast Correlation-Based Filter (FCBF).
+
+### Exercise 4
+
+Copy the previous code and pipe it to `last_fit()`. Add the parameter `split`, setting it equal to `concrete_split`. Then, set the entire expression to `boosting_test_results` using `<-`.
+
+```{r finalizing-a-model-4, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-4-hint-1, eval = FALSE}
+boosting_test_results <-
+  ... |>
+  last_fit(split = ...)
+```
+
+```{r include = FALSE}
+boosting_test_results <-
+  race_results |>
+  extract_workflow("boosting") |>
+  finalize_workflow(best_results) |>
+  last_fit(split = concrete_split)
+```
+
+###
+
+We can see the test set metric results, and visualize the predictions in the next exercise.
+
+### Exercise 5
+
+Type `collect_metrics()` and add the parameter `boosting_test_results`.
+
+```{r finalizing-a-model-5, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-5-hint-1, eval = FALSE}
+collect_metrics(...)
+```
+
+```{r include = FALSE}
+collect_metrics(boosting_test_results)
+```
+
+###
+
+In scenarios with continuous data influx, implement online feature screening techniques to adaptively select features over time.
+
+### Exercise 6
+
+Pipe `boosting_test_results` to `collect_predictions()`. Then, pipe it to `ggplot()` and within `aes()`, set `x` to `compressive_strength` and `y` to `.pred`.
+
+```{r finalizing-a-model-6, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-6-hint-1, eval = FALSE}
+boosting_test_results |>
+  ...() |>
+  ggplot(aes(x = ..., y = .pred))
+```
+
+```{r include = FALSE}
+boosting_test_results |>
+  collect_predictions() |>
+  ggplot(aes(x = compressive_strength, y = .pred))
+```
+
+###
+
+When dealing with complex models like deep learning, use regularization techniques like dropout or L1 regularization to implicitly perform feature screening.
+
+### Exercise 7
+
+Copy the previous code and add `geom_abline()`, setting the parameter `color` equal to `"gray50"` and `lty` equal to `2`.
+
+```{r finalizing-a-model-7, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-7-hint-1, eval = FALSE}
+... +
+  geom_abline(color = "...", lty = ...)
+```
+
+```{r include = FALSE}
+boosting_test_results |>
+  collect_predictions() |>
+  ggplot(aes(x = compressive_strength, y = .pred)) +
+  geom_abline(color = "gray50", lty = 2)
+```
+
+###
+
+Utilize Automated Machine Learning (AutoML) tools that incorporate efficient screening as part of their feature engineering process.
+
+### Exercise 8
+
+Copy the previous code and add `geom_point()`. Add the parameter `alpha = 0.5`. Then add `coord_obs_pred()`. Finally, add `labs()`, and set `x` equal to `"observed"` and `y` equal to `"predicted"`.
+
+```{r finalizing-a-model-8, exercise = TRUE}
+
+```
+
+```{r finalizing-a-model-8-hint-1, eval = FALSE}
+... +
+  geom_point(alpha = ...) +
+  coord_obs_pred() +
+  labs(x = "...", y = "...")
+```
+
+```{r include = FALSE}
+boosting_test_results |>
+  collect_predictions() |>
+  ggplot(aes(x = compressive_strength, y = .pred)) +
+  geom_abline(color = "gray50", lty = 2) +
+  geom_point(alpha = 0.5) +
+  coord_obs_pred() +
+  labs(x = "observed", y = "predicted")
+```
+
+###
+
+Great Job! You have now learned how to finalize the model that we have been working on throughout this tutorial.
+
+## Summary

###

-
+This tutorial covers [Chapter 15: Screening Many Models](https://www.tmwr.org/workflow-sets) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. This chapter illustrated how to use workflow sets to investigate multiple models or feature engineering strategies simultaneously. Racing methods can more efficiently rank models than fitting every candidate model being considered.

```{r download-answers, child = system.file("child_documents/download_answers.Rmd", package = "tutorial.helpers")}
```
diff --git a/inst/tutorials/17-encoding-categorical-data/tutorial.Rmd b/inst/tutorials/17-encoding-categorical-data/tutorial.Rmd
new file mode 100644
index 0000000..8d8f204
--- /dev/null
+++ b/inst/tutorials/17-encoding-categorical-data/tutorial.Rmd
@@ -0,0 +1,995 @@
+---
+title: Encoding Categorical Data
+author: Pratham Kancherla
+tutorial:
+  id: encoding-categorical-data
+output:
+  learnr::tutorial:
+    progressive: yes
+    allow_skip: yes
+runtime: shiny_prerendered
+description: 'Tutorial for Chapter 17: Encoding Categorical Data'
+---
+
+```{r setup, include = FALSE}
+library(learnr)
+library(tutorial.helpers)
+library(finetune)
+library(baguette)
+library(tidyverse)
+library(tidymodels)
+library(rlang)
+library(embed)
+library(tune)
+library(ggrepel)
+library(ggforce)
+library(rstanarm)
+library(rules)
+library(textrecipes)
+library(tidyposterior)
+library(lme4)
+library(multilevelmod)
+library(nlme)
+library(usemodels)
+library(workflowsets)
+
+knitr::opts_chunk$set(echo = FALSE)
+options(tutorial.exercise.timelimit = 60,
+        tutorial.storage = "local")
+
+ames <- mutate(ames, Sale_Price = log10(Sale_Price))
+
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+ames_glm <-
+  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+           Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+  step_ns(Latitude, Longitude, deg_free = 20)
+
+glm_estimates <-
+  prep(ames_glm) |>
+  tidy(number = 2)
+
+ames_mixed <-
+  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+           Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_mixed(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+  step_ns(Latitude, Longitude, deg_free = 20)
+
+mixed_estimates <-
+  prep(ames_mixed) |>
+  tidy(number = 2)
+
+ames_hashed <-
+  ames_train |>
+  mutate(Hash = map_chr(Neighborhood, hash))
+
+ames_hash <-
+  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+           Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_dummy_hash(Neighborhood, signed = FALSE, num_terms = 16L) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+  step_ns(Latitude, Longitude, deg_free = 20)
+```
+
+```{r copy-code-chunk, child = system.file("child_documents/copy_button.Rmd", package = "tutorial.helpers")}
+```
+
+```{r info-section, child = system.file("child_documents/info_section.Rmd", package = "tutorial.helpers")}
+```
+
+## Introduction
+###
+
+This tutorial covers [Chapter 17: Encoding Categorical Data](https://www.tmwr.org/categorical) from [*Tidy
Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. In this tutorial, you will learn about using preprocessing recipes for encoding categorical predictors. We will discuss more sophisticated options for encoding categorical predictors that address these issues. These options are available as tidymodels recipe steps in the **embed** and **textrecipes** packages. + +## Using the Outcome for Encoding Predictors +### + +There are multiple options for encodings more complex than dummy or indicator variables. One method called effect or likelihood encodings replaces the original categorical variables with a single numeric column that measures the effect of those data. + +### Exercise 1 + +Load the library **tidymodels** using `library()`. + +```{r using-the-outcome-fo-1, exercise = TRUE} + +``` + +```{r using-the-outcome-fo-1-hint-1, eval = FALSE} +library(...) +``` + +```{r include = FALSE} +library(tidymodels) +``` + +### Exercise 2 + +Next, load the library **rsample** using `library()`. + +```{r using-the-outcome-fo-2, exercise = TRUE} + +``` + +```{r using-the-outcome-fo-2-hint-1, eval = FALSE} +library(...) +``` + +```{r include = FALSE} +library(rsample) +``` + +### + +For statistical modeling in R, the preferred representation for categorical or nominal data is a factor, which is a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels. + +### Exercise 3 + +Type `ames_train` and pipe it to `group_by()`, adding the parameter `Neighborhood`. + +```{r using-the-outcome-fo-3, exercise = TRUE} + +``` + +```{r using-the-outcome-fo-3-hint-1, eval = FALSE} +ames_train |> + group_by(...) +``` + +```{r include = FALSE} +ames_train |> + group_by(Neighborhood) +``` + +### + +For some realistic data sets, straightforward dummy variables are not a good fit. This often happens because there are too many categories or there are new categories at prediction time. + +### Exercise 4 + +Copy the previous code and pipe it to `summarize()`. Add the parameters `mean`, setting it equal to `mean(Sale_Price)`, and `std_err`, setting it equal to `sd(Sale_Price) / sqrt(length(Sale_Price))`. + +```{r using-the-outcome-fo-4, exercise = TRUE} + +``` + + + +```{r using-the-outcome-fo-4-hint-1, eval = FALSE} +... |> + summarize(mean = mean(...), + std_err = sd(...) / sqrt(length(...))) +``` + +```{r include = FALSE} +ames_train |> + group_by(Neighborhood) |> + summarize(mean = mean(Sale_Price), + std_err = sd(Sale_Price) / sqrt(length(Sale_Price))) +``` + +### + +A minority of models, such as those based on trees or rules, can handle categorical data natively and do not require encoding or transformation of these kinds of features. + +### Exercise 5 + +Copy the previous code and pipe it to `ggplot()`. Within `aes()` of `ggplot()`, set `y` equal to `reorder(Neighborhood, mean)`, and `x` equal to `mean`. + +```{r using-the-outcome-fo-5, exercise = TRUE} + +``` + + + +```{r using-the-outcome-fo-5-hint-1, eval = FALSE} +... |> + ggplot(aes(y = ..., x = ...)) +``` + +```{r include = FALSE} +ames_train |> + group_by(Neighborhood) |> + summarize(mean = mean(Sale_Price), + std_err = sd(Sale_Price) / sqrt(length(Sale_Price))) |> + ggplot(aes(y = reorder(Neighborhood, mean), x = mean)) +``` + +### + +A tree-based model can natively partition a variable like Bldg_Type into groups of factor levels when looking at the encodings for the building type predictor in the Ames training set. 
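
To make that concrete, here is a minimal sketch (an illustration of mine, not one of the exercises): a tree-based specification can be fit on the `Bldg_Type` factor directly, with no dummy-variable step needed.

```r
library(tidymodels)

# rpart splits the factor's levels into groups on its own.
tree_fit <-
  decision_tree(mode = "regression") |>
  set_engine("rpart") |>
  fit(Sale_Price ~ Bldg_Type + Gr_Liv_Area, data = ames_train)
```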
+
+### Exercise 6
+
+Copy the previous code and add `geom_point()`. Then, add `geom_errorbar()`. Within `aes()` of `geom_errorbar()`, add the parameters `xmin`, setting it equal to `mean - 1.64 * std_err`, and `xmax`, setting it equal to `mean + 1.64 * std_err`.
+
+```{r using-the-outcome-fo-6, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-6-hint-1, eval = FALSE}
+... +
+  geom_point() +
+  geom_errorbar(aes(xmin = ..., xmax = ...))
+```
+
+```{r include = FALSE}
+ames_train |>
+  group_by(Neighborhood) |>
+  summarize(mean = mean(Sale_Price),
+            std_err = sd(Sale_Price) / sqrt(length(Sale_Price))) |>
+  ggplot(aes(y = reorder(Neighborhood, mean), x = mean)) +
+  geom_point() +
+  geom_errorbar(aes(xmin = mean - 1.64 * std_err, xmax = mean + 1.64 * std_err))
+```
+
+###
+
+`geom_errorbar()` includes various ways of representing a vertical interval defined by `x`, `ymin`, and `ymax`. Each case draws a single graphical object.
+
+### Exercise 7
+
+Copy the previous code and add `labs()`. Add the parameters `y`, setting it to `NULL`, and `x`, setting it to `"Price (mean, log scale)"`.
+
+```{r using-the-outcome-fo-7, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-7-hint-1, eval = FALSE}
+... +
+  labs(y = ..., x = "...")
+```
+
+```{r include = FALSE}
+ames_train |>
+  group_by(Neighborhood) |>
+  summarize(mean = mean(Sale_Price),
+            std_err = sd(Sale_Price) / sqrt(length(Sale_Price))) |>
+  ggplot(aes(y = reorder(Neighborhood, mean), x = mean)) +
+  geom_point() +
+  geom_errorbar(aes(xmin = mean - 1.64 * std_err, xmax = mean + 1.64 * std_err)) +
+  labs(y = NULL, x = "Price (mean, log scale)")
+```
+
+###
+
+In tidymodels, the **embed** package includes several recipe step functions for different kinds of effect encodings, such as `step_lencode_glm()`, `step_lencode_mixed()`, and `step_lencode_bayes()`. These steps use a generalized linear model to estimate the effect of each level in a categorical predictor on the outcome.
+
+### Exercise 8
+
+Load the library **embed** using `library()`.
+
+```{r using-the-outcome-fo-8, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-8-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(embed)
+```
+
+###
+
+Predictors can be converted to one or more numeric representations using a variety of methods. Effect encodings using simple generalized linear models or nonlinear models can be used.
+
+### Exercise 9
+
+Run the following recipe code, which we used earlier. We will create an Ames `glm` recipe.
+
+```{r using-the-outcome-fo-9, exercise = TRUE}
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+         Latitude + Longitude, data = ames_train)
+```
+
+###
+
+The `recipe()` function defines the outcome and predictors for preprocessing; the steps piped onto it in the next exercises specify the operations to perform on those columns.
+
+### Exercise 10
+
+Copy the previous code and pipe it to `step_log()`. Add the parameters `Gr_Liv_Area` and `base`, setting `base` equal to `10`.
+
+```{r using-the-outcome-fo-10, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-10-hint-1, eval = FALSE}
+... |>
+  step_log(..., base = ...)
+```
+
+```{r include = FALSE}
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10)
+```
+
+###
+
+`step_log()` creates a specification of a recipe step that will log transform data.
+
+### Exercise 11
+
+Copy the previous code and pipe it to `step_lencode_glm()`.
Add the parameters `Neighborhood` and `outcome`, setting `outcome` equal to `vars(Sale_Price)`.
+
+```{r using-the-outcome-fo-11, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-11-hint-1, eval = FALSE}
+... |>
+  step_lencode_glm(..., outcome = vars(...))
+```
+
+```{r include = FALSE}
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price))
+```
+
+###
+
+`step_lencode_glm()` creates a specification of a recipe step that will convert a nominal (i.e. factor) predictor into a single set of scores derived from a generalized linear model.
+
+### Exercise 12
+
+Copy the previous code and pipe it to `step_dummy()`. Add the parameter `all_nominal_predictors()`.
+
+```{r using-the-outcome-fo-12, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-12-hint-1, eval = FALSE}
+... |>
+  step_dummy(...())
+```
+
+```{r include = FALSE}
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors())
+```
+
+###
+
+`step_dummy()` creates a specification of a recipe step that will convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data.
+
+### Exercise 13
+
+Copy the previous code and pipe it to `step_interact()`. Add the parameter `~ Gr_Liv_Area:starts_with("Bldg_Type_")`.
+
+```{r using-the-outcome-fo-13, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-13-hint-1, eval = FALSE}
+... |>
+  step_interact(~ ...)
+```
+
+```{r include = FALSE}
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_"))
+```
+
+###
+
+`step_interact()` creates a specification of a recipe step that will create new columns that are interaction terms between two or more variables.
+
+### Exercise 14
+
+Copy the previous code and pipe it to `step_ns()`. Add the parameters `Latitude`, `Longitude`, and `deg_free`, setting `deg_free` equal to `20`. Then, set the entire expression to `ames_glm`. Type `ames_glm` on the next line to see the output.
+
+```{r using-the-outcome-fo-14, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-14-hint-1, eval = FALSE}
+ames_glm <-
+  ... |>
+  step_ns(..., ..., deg_free = ...)
+```
+
+```{r include = FALSE}
+ames_glm <-
+  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+  step_ns(Latitude, Longitude, deg_free = 20)
+
+ames_glm
+```
+
+###
+
+`step_ns()` can create new features from a single variable that enable fitting routines to model this variable in a nonlinear manner. The original variables are removed from the data and new columns are added.
+
+### Exercise 15
+
+Type `prep()` and add the parameter `ames_glm`.
+
+```{r using-the-outcome-fo-15, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-15-hint-1, eval = FALSE}
+prep(...)
+```
+
+```{r include = FALSE}
+prep(ames_glm)
+```
+
+###
+
+For a recipe with at least one preprocessing operation, estimate the required parameters from a training set that can be later applied to other data sets.
+
+### Exercise 16
+
+Copy the previous code and pipe it to `tidy()`. Add the parameter `number = 2`. Then, set the entire expression to `glm_estimates` using `<-`.
+
+```{r using-the-outcome-fo-16, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-16-hint-1, eval = FALSE}
+glm_estimates <-
+  ... |>
+  tidy(number = ...)
+```
+
+```{r include = FALSE}
+glm_estimates <-
+  prep(ames_glm) |>
+  tidy(number = 2)
+```
+
+###
+
+When we use the newly encoded `Neighborhood` numeric variable created via this method, we substitute the original level (such as `"North_Ames"`) with the estimate for `Sale_Price` from the GLM.
+
+### Exercise 17
+
+Type `glm_estimates` and pipe it to `filter()`. Add the parameter `level == "..new"`.
+
+```{r using-the-outcome-fo-17, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-17-hint-1, eval = FALSE}
+glm_estimates |>
+  filter(... == "...")
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  filter(level == "..new")
+```
+
+###
+
+Effect encodings can be powerful but should be used with care. The effects should be computed from the training set, after data splitting.
+
+### Exercise 18
+
+We can use partial pooling to adjust these estimates so that levels with small sample sizes are shrunken toward the overall mean. The effects for each level are modeled all at once using a mixed or hierarchical generalized linear model:
+
+Copy the code from Exercise 14. Switch `step_lencode_glm` to `step_lencode_mixed`. Set the entire expression to `ames_mixed` and run it on the next line.
+
+```{r using-the-outcome-fo-18, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-18-hint-1, eval = FALSE}
+ames_mixed <-
+  ... |>
+  step_lencode_mixed(...) |>
+  ...
+```
+
+```{r include = FALSE}
+ames_mixed <-
+  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude + Longitude, data = ames_train) |>
+  step_log(Gr_Liv_Area, base = 10) |>
+  step_lencode_mixed(Neighborhood, outcome = vars(Sale_Price)) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+  step_ns(Latitude, Longitude, deg_free = 20)
+
+ames_mixed
+```
+
+###
+
+Creating an effect encoding with `step_lencode_glm()` estimates the effect separately for each factor level (in this example, neighborhood).
+
+### Exercise 19
+
+Type `prep()` and add the parameter `ames_mixed`. Then, pipe the code to `tidy()`, adding the parameter `number = 2`. Then, set the entire expression to `mixed_estimates` and run it on the next line.
+
+```{r using-the-outcome-fo-19, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-19-hint-1, eval = FALSE}
+mixed_estimates <-
+  prep(...) |>
+  tidy(number = ...)
+```
+
+```{r include = FALSE}
+mixed_estimates <-
+  prep(ames_mixed) |>
+  tidy(number = 2)
+
+mixed_estimates
+```
+
+###
+
+Naive Bayes models are another example where the structure of the model can deal with categorical variables natively; distributions are computed within each level, for example, for all the different kinds of `Bldg_Type` in the data set.
+
+### Exercise 20
+
+Pipe `mixed_estimates` to `filter()`. Add the parameter `level == "..new"` and hit "Run Code".
+
+### Exercise 20
+
+Pipe `mixed_estimates` to `filter()`. Add the parameter `level == "..new"` and hit "Run Code".
+
+```{r using-the-outcome-fo-20, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-20-hint-1, eval = FALSE}
+mixed_estimates |>
+  filter(...)
+```
+
+```{r include = FALSE}
+mixed_estimates |>
+  filter(level == "..new")
+```
+
+###
+
+The models discussed previously in this tutorial that can handle categorical features natively can also deal with numeric, continuous features, making the transformation or encoding of such variables optional.
+
+### Exercise 21
+
+Let’s visually compare the effects using partial pooling vs. no pooling. Pipe `glm_estimates` to `rename()`, adding the parameter ``no pooling``, setting it equal to `value`.
+
+```{r using-the-outcome-fo-21, exercise = TRUE}
+
+```
+
+```{r using-the-outcome-fo-21-hint-1, eval = FALSE}
+glm_estimates |>
+  rename(`...` = ...)
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value)
+```
+
+###
+
+In short, using dummy encodings did not typically result in better model performance but often required more time to train the models.
+
+### Exercise 22
+
+Copy the previous code and pipe it to `left_join()`. Within `left_join()`, pipe `mixed_estimates` to `rename()`, adding the parameter ``partial pooling``, setting it equal to `value`. Then add the `left_join()` parameter `by`, setting it equal to `"level"`.
+
+```{r using-the-outcome-fo-22, exercise = TRUE}
+
+```
+
+
+```{r using-the-outcome-fo-22-hint-1, eval = FALSE}
+... |>
+  left_join(
+    mixed_estimates |>
+      rename(`...` = ...), by = "..."
+  )
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value) |>
+  left_join(
+    mixed_estimates |>
+      rename(`partial pooling` = value), by = "level"
+  )
+```
+
+###
+
+It is advised to start with un-transformed categorical variables when a model allows it; note that more complex encodings often do not result in better performance for such models.
+
+### Exercise 23
+
+Copy the previous code and pipe it to `left_join()` again. Within the function, pipe `ames_train` to `count()`, adding the parameter `Neighborhood`. Then, pipe `count()` to `mutate()`, adding the parameter `level = as.character(Neighborhood)`.
+
+```{r using-the-outcome-fo-23, exercise = TRUE}
+
+```
+
+
+```{r using-the-outcome-fo-23-hint-1, eval = FALSE}
+... |>
+  left_join(
+    ames_train |>
+      count(...) |>
+      mutate(level = as.character(...))
+  )
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value) |>
+  left_join(
+    mixed_estimates |>
+      rename(`partial pooling` = value), by = "level"
+  ) |>
+  left_join(
+    ames_train |>
+      count(Neighborhood) |>
+      mutate(level = as.character(Neighborhood))
+  )
+```
+
+###
+
+Sometimes qualitative columns can be ordered, such as “low,” “medium,” and “high”. In base R, the default encoding strategy is to make new numeric columns that are polynomial expansions of the data.
+
+### Exercise 24
+
+Copy the previous code and pipe it to `ggplot()`. Within `aes()` of `ggplot()`, add the parameters ``no pooling``, ``partial pooling``, and `size`, setting `size` equal to `sqrt(n)`.
+
+```{r using-the-outcome-fo-24, exercise = TRUE}
+
+```
+
+
+```{r using-the-outcome-fo-24-hint-1, eval = FALSE}
+... |>
+  ggplot(aes(`...`, `...`, size = sqrt(...)))
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value) |>
+  left_join(
+    mixed_estimates |>
+      rename(`partial pooling` = value), by = "level"
+  ) |>
+  left_join(
+    ames_train |>
+      count(Neighborhood) |>
+      mutate(level = as.character(Neighborhood))
+  ) |>
+  ggplot(aes(`no pooling`, `partial pooling`, size = sqrt(n)))
+```
+
+###
+
+An 11-degree polynomial is probably not the most effective way of encoding an ordinal factor for the months of the year. Instead, consider trying recipe steps related to ordered factors, such as `step_unorder()`, to convert to regular factors, and `step_ordinalscore()`, which maps specific numeric values to each factor level.
+
+### Exercise 25
+
+Copy the previous code and add `geom_abline()`. Add the parameters `color`, setting it equal to `"gray50"`, and `lty`, setting it equal to `2`.
+
+```{r using-the-outcome-fo-25, exercise = TRUE}
+
+```
+
+
+```{r using-the-outcome-fo-25-hint-1, eval = FALSE}
+... +
+  geom_abline(color = "...", lty = ...)
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value) |>
+  left_join(
+    mixed_estimates |>
+      rename(`partial pooling` = value), by = "level"
+  ) |>
+  left_join(
+    ames_train |>
+      count(Neighborhood) |>
+      mutate(level = as.character(Neighborhood))
+  ) |>
+  ggplot(aes(`no pooling`, `partial pooling`, size = sqrt(n))) +
+  geom_abline(color = "gray50", lty = 2)
+```
+
+###
+
+These geoms add reference lines (sometimes called rules) to a plot, either horizontal, vertical, or diagonal (specified by slope and intercept). These are useful for annotating plots.
+
+### Exercise 26
+
+Copy the previous code and add `geom_point()`. Add the parameter `alpha`, setting it equal to `0.7`. Then, add `coord_fixed()`.
+
+```{r using-the-outcome-fo-26, exercise = TRUE}
+
+```
+
+
+```{r using-the-outcome-fo-26-hint-1, eval = FALSE}
+... +
+  geom_point(alpha = ...) +
+  ...()
+```
+
+```{r include = FALSE}
+glm_estimates |>
+  rename(`no pooling` = value) |>
+  left_join(
+    mixed_estimates |>
+      rename(`partial pooling` = value), by = "level"
+  ) |>
+  left_join(
+    ames_train |>
+      count(Neighborhood) |>
+      mutate(level = as.character(Neighborhood))
+  ) |>
+  ggplot(aes(`no pooling`, `partial pooling`, size = sqrt(n))) +
+  geom_abline(color = "gray50", lty = 2) +
+  geom_point(alpha = 0.7) +
+  coord_fixed()
+```
+
+###
+
+Excellent Work! You now know how to use the outcome to encode categorical predictors, with no pooling via `step_lencode_glm()` and with partial pooling via `step_lencode_mixed()`.
+
+## Feature Hashing
+###
+
+Feature hashing methods also create dummy variables, but only consider the value of the category to assign it to a predefined pool of dummy variables. Let’s look at the `Neighborhood` values in Ames again and use the `rlang::hash()` function to understand more.
+
+### Exercise 1
+
+Load the library **rlang** using `library()`.
+
+```{r feature-hashing-1, exercise = TRUE}
+
+```
+
+```{r feature-hashing-1-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(rlang)
+```
+
+###
+
+The **rlang** package in R facilitates non-standard evaluation, aiding in the creation of functions that programmatically generate, modify, or interact with expressions and symbols. It's particularly useful for constructing dynamic code and working seamlessly with packages like **dplyr** and **tidyr**.
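+
+Here is a quick, hedged illustration of what `rlang::hash()` does (the exact hash string is omitted; what matters is that it is stable):
+
+```r
+# hash() is deterministic: the same key always yields the same
+# 32-character hash, no matter how many times it is called.
+library(rlang)
+hash("Briardale")
+identical(hash("Briardale"), hash("Briardale"))  # TRUE
+```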
+
+### Exercise 2
+
+Pipe `ames_train` to `mutate()`. Add the parameter `Hash`, setting it equal to `map_chr()`. Add the parameters `Neighborhood` and `hash` in `map_chr()`. Set the entire expression to `ames_hashed`.
+
+```{r feature-hashing-2, exercise = TRUE}
+
+```
+
+```{r feature-hashing-2-hint-1, eval = FALSE}
+ames_hashed <-
+  ames_train |>
+  mutate(Hash = ...)
+```
+
+```{r include = FALSE}
+ames_hashed <-
+  ames_train |>
+  mutate(Hash = map_chr(Neighborhood, hash))
+```
+
+###
+
+Effect encoding methods, like those in the previous section, can also seamlessly handle situations where a novel factor level is encountered in the data.
+
+### Exercise 3
+
+Pipe `ames_hashed` to `select()`. Add the parameters `Neighborhood` and `Hash`.
+
+```{r feature-hashing-3, exercise = TRUE}
+
+```
+
+```{r feature-hashing-3-hint-1, eval = FALSE}
+ames_hashed |>
+  select(..., ...)
+```
+
+```{r include = FALSE}
+ames_hashed |>
+  select(Neighborhood, Hash)
+```
+
+###
+
+If we input `Briardale` to this hashing function, we will always get the same output. The neighborhoods in this case are called the “keys,” while the outputs are the “hashes.”
+
+### Exercise 4
+
+We can get sixteen possible hash values by using `Hash %% 16`.
+
+Pipe `ames_hashed` to `mutate()`. Add the parameter `Hash`, setting it equal to `strtoi()`. Inside `strtoi()`, add the parameters `substr(Hash, 26, 32)` and `base = 16L`.
+
+```{r feature-hashing-4, exercise = TRUE}
+
+```
+
+```{r feature-hashing-4-hint-1, eval = FALSE}
+ames_hashed |>
+  mutate(Hash = strtoi(..., base = ...))
+```
+
+```{r include = FALSE}
+ames_hashed |>
+  mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L))
+```
+
+###
+
+**strtoi()** is a function in the R programming language that converts a character vector representing integers in a given base (base 10 by default, base 16 here) into an integer vector. It stands for "string to integer" and is commonly used when character data needs to be treated as numeric values.
+
+### Exercise 5
+
+Copy the previous code and add the parameter `Hash`, setting it equal to `Hash %% 16`.
+
+```{r feature-hashing-5, exercise = TRUE}
+
+```
+
+
+```{r feature-hashing-5-hint-1, eval = FALSE}
+... |>
+  mutate(...,
+         Hash = ... %% ...)
+```
+
+```{r include = FALSE}
+ames_hashed |>
+  mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L),
+         Hash = Hash %% 16)
+```
+
+###
+
+When we use pooling, we shrink the effect estimates toward the mean because we don’t have as much evidence about the price in those neighborhoods.
+
+### Exercise 6
+
+Copy the previous code and pipe it to `select()`. Add the parameters `Neighborhood` and `Hash`.
+
+```{r feature-hashing-6, exercise = TRUE}
+
+```
+
+
+```{r feature-hashing-6-hint-1, eval = FALSE}
+... |>
+  select(..., Hash)
+```
+
+```{r include = FALSE}
+ames_hashed |>
+  mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L),
+         Hash = Hash %% 16) |>
+  select(Neighborhood, Hash)
+```
+
+###
+
+Now instead of the 28 neighborhoods in our original data or an incredibly huge number of the original hashes, we have sixteen hash values. This method is very fast and memory efficient, and it can be a good strategy when there are a large number of possible categories.
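+
+Putting the two pieces together, here is a hedged sketch of the full key-to-bucket computation for a single value (the intermediate hash string is implementation-defined, so only the shape of the result matters):
+
+```r
+# One key -> 32-character hash -> integer from the last 7 hex digits ->
+# one of 16 buckets.
+library(rlang)
+h <- hash("Briardale")
+strtoi(substr(h, 26, 32), base = 16L) %% 16
+```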
+
+### Exercise 7
+
+Load the library **textrecipes** using `library()`.
+
+```{r feature-hashing-7, exercise = TRUE}
+
+```
+
+```{r feature-hashing-7-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+#library(textrecipes)
+```
+
+###
+
+**textrecipes** is a package in R that is part of the broader **recipes** ecosystem. It provides tools for preprocessing and feature engineering specifically tailored for text data, helping to prepare text data for machine learning tasks. With functions like `step_tokenize()`, `step_tf()`, and `step_tfidf()`, **textrecipes** assists in converting raw text into structured and numeric features that can be used as input for various machine learning algorithms.
+
+### Exercise 8
+
+Copy the code from exercise 14. Change the function `step_lencode_glm()` to `step_dummy_hash()`. Delete the parameters and then add `Neighborhood`, `signed = FALSE`, and `num_terms = 16L`. Set this expression to `ames_hash`.
+
+```{r feature-hashing-8, exercise = TRUE}
+
+```
+
+```{r feature-hashing-8-hint-1, eval = FALSE}
+ames_hash <-
+  ... |>
+  step_dummy_hash(Neighborhood, signed = ..., num_terms = ...)
+```
+
+```{r include = FALSE}
+# ames_hash <-
+#   recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+#            Latitude + Longitude, data = ames_train) |>
+#   step_log(Gr_Liv_Area, base = 10) |>
+#   step_dummy_hash(Neighborhood, signed = FALSE, num_terms = 16L) |>
+#   step_dummy(all_nominal_predictors()) |>
+#   step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) |>
+#   step_ns(Latitude, Longitude, deg_free = 20)
+```
+
+###
+
+The number of hash values is a tuning parameter of this preprocessing technique, and you should try several values to determine what is best for your particular modeling approach. A lower number of hash values results in more collisions, but a high number may not be an improvement over your original high cardinality variable.
+
+###
+
+Great Job! You learned how to use feature hashing to create a fixed pool of dummy variables, a fast and memory-efficient way to encode high cardinality predictors.
+
+## Summary
+###
+
+This tutorial covered [Chapter 17: Encoding Categorical Data](https://www.tmwr.org/categorical) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. In this tutorial, you learned about using preprocessing recipes for encoding categorical predictors. The most straightforward option for transforming a categorical variable to a numeric representation is to create dummy variables from the levels, but this option does not work well when you have a variable with high cardinality (too many levels) or when you may see novel values at prediction time (new levels). One option in such a situation is to create effect encodings, a supervised encoding method that uses the outcome.
+
+```{r download-answers, child = system.file("child_documents/download_answers.Rmd", package = "tutorial.helpers")}
+```
diff --git a/inst/tutorials/19-trust-your-predictions/tutorial.Rmd b/inst/tutorials/19-trust-your-predictions/tutorial.Rmd
new file mode 100644
index 0000000..a3cdb05
--- /dev/null
+++ b/inst/tutorials/19-trust-your-predictions/tutorial.Rmd
@@ -0,0 +1,1727 @@
+---
+title: When Should you Trust Your Predictions?
+author: Pratham Kancherla
+tutorial:
+  id: when-should-you-trust-your-predictions-?
+output:
+  learnr::tutorial:
+    progressive: yes
+    allow_skip: yes
+runtime: shiny_prerendered
+description: 'Tutorial for Chapter 19: When Should you Trust Your Predictions?'
+--- + +```{r setup, include = FALSE} +library(learnr) +library(tutorial.helpers) +library(applicable) +library(finetune) +library(baguette) +library(tidyverse) +library(tidymodels) +library(rlang) +library(embed) +library(tune) +library(ggrepel) +library(ggforce) +library(rstanarm) +library(rules) +library(tidyposterior) +library(lme4) +library(multilevelmod) +library(nlme) +library(probably) +library(usemodels) +library(workflowsets) +knitr::opts_chunk$set(echo = FALSE) +options(tutorial.exercise.timelimit = 60, + tutorial.storage = "local") + +simulate_two_classes <- + function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) { + sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2) + dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma) + colnames(dat) <- c("x", "y") + cls <- paste0("class_", 1:2) + dat <- + as_tibble(dat) %>% + mutate( + linear_pred = !!eqn, + linear_pred = linear_pred + rnorm(n, sd = error), + prob = binomial()$linkinv(linear_pred), + class = ifelse(prob > runif(n), cls[1], cls[2]), + class = factor(class, levels = cls) + ) + dplyr::select(dat, x, y, class) + } + +training_set <- simulate_two_classes(200) +testing_set <- simulate_two_classes(50) + +two_class_mod <- + logistic_reg() |> + set_engine("stan") |> + fit(class ~ . + I(x^2)+ I(y^2), data = training_set) + +test_pred <- augment(two_class_mod, testing_set) + +lvls <- levels(training_set$class) + +test_pred1 <- + test_pred |> + mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = 0.15)) + +eq_zone_results <- + function(buffer){ + test_pred <- + test_pred |> + mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer)) + acc <- test_pred |> accuracy(class, .pred_with_eqz) + rep_rate <- reportable_rate(test_pred$.pred_with_eqz) + tibble(accuracy = acc$.estimate, reportable = rep_rate, buffer = buffer) + } + +test_pred <- + test_pred |> + bind_cols( + predict(two_class_mod, testing_set, type = "pred_int", std_error = TRUE) + ) +Chicago <- Chicago |> select(ridership, date, one_of(stations)) + +n <- nrow(Chicago) + +Chicago_train <- Chicago |> slice(1:(n - 14)) + +Chicago_test <- Chicago |> slice((n - 13):n) + +base_recipe <- + recipe(ridership ~ ., data = Chicago_train) |> + step_date(date) |> + step_holiday(date, keep_original_cols = FALSE) |> + step_dummy(all_nominal()) |> + step_zv(all_predictors()) |> + step_normalize(!!!stations) |> + step_pls(!!!stations, num_comp = 10, outcome = vars(ridership)) + +lm_spec <- + linear_reg() |> + set_engine("lm") + +lm_wflow <- + workflow() |> + add_recipe(base_recipe) |> + add_model(lm_spec) + +lm_fit <- fit(lm_wflow, data = Chicago_train) + +res_test <- + predict(lm_fit, Chicago_test) |> + bind_cols( + predict(lm_fit, Chicago_test, type = "pred_int"), + Chicago_test + ) + +res_2020 <- + predict(lm_fit, Chicago_2020) |> + bind_cols( + predict(lm_fit, Chicago_2020, type = "pred_int"), + Chicago_2020 + ) + +pca_stat <- + apd_pca(~ ., data = Chicago_train |> select(one_of(stations)), threshold = 0.99) +``` + +```{r copy-code-chunk, child = system.file("child_documents/copy_button.Rmd", package = "tutorial.helpers")} +``` + +```{r info-section, child = system.file("child_documents/info_section.Rmd", package = "tutorial.helpers")} +``` + +## Introduction +### + +This tutorial covered [Chapter 19: When Should you Trust Your Predictions?](https://www.tmwr.org/trust) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. 
This tutorial discusses two methods for quantifying the potential quality of a prediction: equivocal zones use the predicted values to alert the user that results may be suspect, and applicability uses the predictors to measure the amount of extrapolation (if any) for new samples.
+
+## Equivocal Results
+###
+
+In some cases, the amount of uncertainty associated with a prediction is too high to be trusted.
+
+### Exercise 1
+
+Load the library **tidymodels** using `library()`.
+
+```{r equicoval-results-1, exercise = TRUE}
+
+```
+
+```{r equicoval-results-1-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(tidymodels)
+```
+
+###
+
+When a new data point is well outside of the range of data used to create the model, making a prediction may be an inappropriate *extrapolation*.
+
+### Exercise 2
+
+Now we will create a function that simulates our data, with a training set of 200 samples and a test set of 50. Type `function()` and add the parameters `n`, `error`, setting it equal to `0.1`, and `eqn`, setting it equal to `quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)`.
+
+```{r equicoval-results-2, exercise = TRUE}
+
+```
+
+```{r equicoval-results-2-hint-1, eval = FALSE}
+function (..., error = ..., eqn = quote(...))
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2))
+```
+
+###
+
+In some cases, the amount of uncertainty associated with a prediction is too high to be trusted.
+
+### Exercise 3
+
+Copy the previous code and add curly brackets (`{}`). Within the brackets, type `matrix()` and add the parameters `c(1, 0.7, 0.7, 1)`, `nrow`, setting it equal to `2`, and `ncol`, setting it equal to `2`. Then, set the `matrix()` expression to `sigma`. (Check Hint for Formatting).
+
+```{r equicoval-results-3, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-3-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(...), nrow = ..., ncol = ...)
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+}
+```
+
+###
+
+An equivocal zone is a range of results in which the prediction should not be reported to patients, for example, some range of COVID-19 test results that are too uncertain to be reported to a patient.
+
+### Exercise 4
+
+Copy the previous code. On the next line, type `MASS::mvrnorm()`. Add the parameters `n`, setting it equal to `n`, `mu`, setting it equal to `c(0, 0)`, and `Sigma`, setting it equal to `sigma`. Set the `mvrnorm()` expression to `dat` using `<-`.
+
+```{r equicoval-results-4, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-4-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = ..., mu = c(..., 0), Sigma = ...)
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+}
+```
+
+###
+
+An example of an inappropriate prediction is when a model is used in a completely different context, such as applying a cell segmentation model built for breast cancer cells to stomach cells.
+
+### Exercise 5
+
+Copy the previous code and type `c()`. Add the parameters `"x"` and `"y"`. Set the `c()` expression to `colnames(dat)`.
+
+```{r equicoval-results-5, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-5-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  colnames(dat) <- c("...", "...")
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+}
+```
+
+###
+
+In R, the **mvrnorm** function is part of the **MASS** package (Modern Applied Statistics with S) and is used to generate random samples from a multivariate normal distribution.
+
+### Exercise 6
+
+Copy the previous code and, on the next line, type `paste0()`. Add the parameters `"class_"` and `1:2`. Set the `paste0()` expression to `cls` using `<-`.
+
+```{r equicoval-results-6, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-6-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  cls <- paste0("class_", 1:2)
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+  cls <- paste0("class_", 1:2)
+}
+```
+
+###
+
+**mvrnorm** parameters explained:
+`n`: This is the number of random samples (observations) you want to generate from the multivariate normal distribution.
+
+### Exercise 7
+
+Copy the previous code and, on the next line, type `as_tibble()`, add the parameter `dat`. Then, pipe the function to `mutate()`. First, add the parameter `linear_pred`, setting it equal to `!!eqn`.
+
+```{r equicoval-results-7, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-7-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  as_tibble(...) |>
+    mutate(
+      linear_pred = !!...
+    )
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+  cls <- paste0("class_", 1:2)
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn
+    )
+}
+```
+
+###
+
+**mvrnorm** parameters explained:
+`mu`: This is a numeric vector representing the mean of the multivariate normal distribution. It should have the same length as the number of variables you want to generate.
+
+### Exercise 8
+
+Copy the previous code and update `linear_pred` to `linear_pred + rnorm(n, sd = error)`.
+
+```{r equicoval-results-8, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-8-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = ... + rnorm(n, sd = ...)
+    )
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+  cls <- paste0("class_", 1:2)
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error)
+    )
+}
+```
+
+###
+
+`linkinv` is the inverse link function component of a family object such as `binomial()`; it maps values on the linear predictor scale back to the probability (response) scale.
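+
+A quick numeric check of that mapping (values rounded):
+
+```r
+# The inverse logit maps the linear predictor back to a probability;
+# 0 on the linear scale corresponds to a probability of 0.5.
+binomial()$linkinv(c(-2, 0, 2))
+#> approximately 0.119 0.500 0.881
+```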
+
+### Exercise 9
+
+Copy the previous code and add the parameter `prob`, setting it equal to `binomial()$linkinv(linear_pred)`.
+
+```{r equicoval-results-9, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-9-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error),
+      prob = ...()$linkinv(...)
+    )
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+  cls <- paste0("class_", 1:2)
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error),
+      prob = binomial()$linkinv(linear_pred)
+    )
+}
+```
+
+###
+
+**mvrnorm** parameters explained:
+`Sigma`: This is a positive definite covariance matrix that defines the relationships and variabilities between the variables. It should have dimensions compatible with the length of mu (p x p, where p is the number of variables).
+
+### Exercise 10
+
+Copy the previous code and, on the next line, set `class` to an `ifelse()` statement. Set the first parameter of `ifelse()` to the check `prob > runif(n)`. Add `cls[1]` as the value if the check is true and `cls[2]` if it is false.
+
+```{r equicoval-results-10, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-10-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error),
+      prob = binomial()$linkinv(linear_pred),
+      class = ifelse(... > runif(n), cls[...], cls[...])
+    )
+}
+```
+
+```{r include = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
+  dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
+  colnames(dat) <- c("x", "y")
+  cls <- paste0("class_", 1:2)
+  as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error),
+      prob = binomial()$linkinv(linear_pred),
+      class = ifelse(prob > runif(n), cls[1], cls[2])
+    )
+}
+```
+
+###
+
+**MASS** package provides a collection of functions for linear/non-linear regression, cluster analysis, etc., and data sets that cover a wide range of statistical techniques.
+
+### Exercise 11
+
+Copy the previous code and add the final parameter to `mutate()`. Set `class` equal to `factor()`. Add the parameters `class` and `levels`, setting it equal to `cls`. Then, set the entire expression, from `as_tibble()`, to `dat` using `<-`.
+
+```{r equicoval-results-11, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-11-hint-1, eval = FALSE}
+function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
+  ...
+  dat <-
+    as_tibble(dat) |>
+    mutate(
+      linear_pred = !!eqn,
+      linear_pred = linear_pred + rnorm(n, sd = error),
+      prob = binomial()$linkinv(linear_pred),
+      class = ifelse(prob > runif(n), cls[1], cls[2]),
+      class = factor(..., levels = ...)
+ ) +} +``` + +```{r include = FALSE} +function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) { + sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2) + dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma) + colnames(dat) <- c("x", "y") + cls <- paste0("class_", 1:2) + dat <- + as_tibble(dat) |> + mutate( + linear_pred = !!eqn, + linear_pred = linear_pred + rnorm(n, sd = error), + prob = binomial()$linkinv(linear_pred), + class = ifelse(prob > runif(n), cls[1], cls[2]), + class = factor(class, levels = cls) + ) +} +``` + +### + +The `paste0()` function in R is used to concatenate (combine) multiple character strings into a single string without any separator. It is a convenient way to join strings together. + +### Exercise 12 + +Copy the previous code and, outside `mutate()` but within the function, type `dplyr::select()`. Add the parameters `dat`, `x`, `y`, and `class`. Then, set the entire function to `simulate_two_classes` using `<-`. + +```{r equicoval-results-12, exercise = TRUE} + +``` + + + +```{r equicoval-results-12-hint-1, eval = FALSE} +simulate_two_classes <- + function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) { + ... + dplyr::select(..., x, y, ...) +} +``` + +```{r include = FALSE} +simulate_two_classes <- + function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) { + sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2) + dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma) + colnames(dat) <- c("x", "y") + cls <- paste0("class_", 1:2) + dat <- + as_tibble(dat) |> + mutate( + linear_pred = !!eqn, + linear_pred = linear_pred + rnorm(n, sd = error), + prob = binomial()$linkinv(linear_pred), + class = ifelse(prob > runif(n), cls[1], cls[2]), + class = factor(class, levels = cls) + ) + dplyr::select(dat, x, y, class) + } +``` + +### + +`paste0()` code creates two strings: "class_1" and "class_2". The 1:2 generates a numeric vector containing the numbers 1 and 2. `paste0()` then concatenates each element of the numeric vector with the string `"class_"`, resulting in the vector of character strings shown in the output. + +### Exercise 13 + +Outside of the function, type `simulate_two_classes(200)` and set it to `training_set`. Then on the next line, type `simulate_two_classes(50)` and set it to `testing_set`. + +```{r equicoval-results-13, exercise = TRUE} + +``` + +```{r equicoval-results-13-hint-1, eval = FALSE} +training_set <- simulate_two_classes(...) +testing_set <- simulate_two_classes(...) +``` + +```{r include = FALSE} +training_set <- simulate_two_classes(200) +testing_set <- simulate_two_classes(50) +``` + +### + +`logistic_reg()` defines a generalized linear model for binary outcomes. A linear combination of the predictors is used to model the log odds of an event. This function can fit classification models. + +### Exercise 14 + +Pipe `logistic_reg()` to `set_engine()`. Add the parameter `"stan"`. + +```{r equicoval-results-14, exercise = TRUE} + +``` + +```{r equicoval-results-14-hint-1, eval = FALSE} +logistic_reg() |> + set_engine("...") +``` + +```{r include = FALSE} +logistic_reg() |> + set_engine("stan") +``` + +### + +Examples of different engines that can be used with `set_engine()`: `glm`, `ranger`, `xgboost`, `keras`, `spark`, `h20`, etc. + +### Exercise 15 + +Copy the previous code and pipe it to `fit()`. Add the parameters `class ~ . + I(x^2)+ I(y^2)` and `data`, setting it equal to `training_set`. Then, set the entire expression to `two_class_mod` using `<-`. 
+ +```{r equicoval-results-15, exercise = TRUE} + +``` + + + +```{r equicoval-results-15-hint-1, eval = FALSE} +two_class_mod <- + ... |> + fit(..., data = ...) +``` + +```{r include = FALSE} +two_class_mod <- + logistic_reg() |> + set_engine("stan") |> + fit(class ~ . + I(x^2)+ I(y^2), data = training_set) +``` + +### + +`quote()` function is used to create a "quotation" of an expression without evaluating it immediately. Useful when working with non-standard evaluation (NSE) or when you want to capture an expression to manipulate it programmatically. + +### Exercise 16 + +Print `two_class_mod` using `print()`. Add the parameters `two_class_mod` and `digits = 3`. + +```{r equicoval-results-16, exercise = TRUE} + +``` + + + +```{r equicoval-results-16-hint-1, eval = FALSE} +print(..., digits = 3) +``` + +```{r include = FALSE} +print(two_class_mod, digits = 3) +``` + +### + +We estimated a logistic regression model using Bayesian methods (using the default Gaussian prior distributions for the parameters). + +### Exercise 17 + +Type in `augment()` and add the parameters `two_class_mod` and `testing_set`. Then, set the entire expression to `test_pred`. + +```{r equicoval-results-17, exercise = TRUE} + +``` + +```{r equicoval-results-17-hint-1, eval = FALSE} +test_pred <- augment(..., testing_set) +``` + +```{r include = FALSE} +test_pred <- augment(two_class_mod, testing_set) +``` + +### + +The `augment()` function is commonly associated with the broom package. It is used to augment statistical model objects with additional information or predictions, making them more suitable for further analysis, visualization, or reporting. + +### Exercise 18 + +Type `test_pred`. Pipe it to `head()` and hit "Run Code". + +```{r equicoval-results-18, exercise = TRUE} + +``` + +```{r equicoval-results-18-hint-1, eval = FALSE} +test_pred |> + ...() +``` + +```{r include = FALSE} +test_pred |> + head() +``` + +### + +With **tidymodels**, the **probably** package contains functions for equivocal zones. For cases with two classes, the `make_two_class_pred()` function creates a factor-like column that has the predicted classes with an equivocal zone. + +### Exercise 19 + +Load the library `probably` using `library()`. + +```{r equicoval-results-19, exercise = TRUE} + +``` + +```{r equicoval-results-19-hint-1, eval = FALSE} +library(...) +``` + +```{r include = FALSE} +#library(probably) +``` + +### + +**probably**: Models can be improved by post-processing class probabilities, by: recalibration, conversion to hard probabilities, assessment of equivocal zones, and other activities. 'probably' contains tools for conducting these operations as well as calibration tools and conformal inference techniques for regression models. + +### Exercise 20 + +Type `levels()` and the parameter `training_set$class`. Set the entire expression to `lvls` using `<-`. + +```{r equicoval-results-20, exercise = TRUE} + +``` + +```{r equicoval-results-20-hint-1, eval = FALSE} +lvls <- levels(...$class) +``` + +```{r include = FALSE} +lvls <- levels(training_set$class) +``` + +### + +One simple method for disqualifying results is to call them “equivocal” if the values are within some range around 50% (or the appropriate probability cutoff for a certain situation). + +### Exercise 21 + +Type `test_pred` and pipe it to `mutate()`. Add the parameter `.pred_with_eqz`, setting it equal to `make_two_class_pred(.pred_class_1, lvls, buffer = 0.15)`. Set the entire expression to `test_pred1`. 
+
+```{r equicoval-results-21, exercise = TRUE}
+
+```
+
+```{r equicoval-results-21-hint-1, eval = FALSE}
+test_pred1 <-
+  test_pred |>
+  mutate(...)
+```
+
+```{r include = FALSE}
+test_pred1 <-
+  test_pred |>
+  mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = 0.15))
+```
+
+###
+
+`make_two_class_pred()` can be used to convert class probability estimates to `class_pred` objects with an optional equivocal zone.
+
+### Exercise 22
+
+Type `test_pred1` and pipe it to `count()`. Add the parameter `.pred_with_eqz`.
+
+```{r equicoval-results-22, exercise = TRUE}
+
+```
+
+```{r equicoval-results-22-hint-1, eval = FALSE}
+test_pred1 |>
+  count(...)
+```
+
+```{r include = FALSE}
+test_pred1 |>
+  count(.pred_with_eqz)
+```
+
+###
+
+Since the factor levels are the same as the original data, confusion matrices and other statistics can be computed without error. When using standard functions from the **yardstick** package, the equivocal results are converted to `NA` and are not used in the calculations that use the hard class predictions.
+
+### Exercise 23
+
+Type `test_pred1` and pipe it to `conf_mat()`. Add the parameters `class` and `.pred_with_eqz`, giving the outcome of only reportable results.
+
+```{r equicoval-results-23, exercise = TRUE}
+
+```
+
+```{r equicoval-results-23-hint-1, eval = FALSE}
+test_pred1 |>
+  conf_mat(class, ...)
+```
+
+```{r include = FALSE}
+test_pred1 |>
+  conf_mat(class, .pred_with_eqz)
+```
+
+###
+
+The `*_equivocal()` functions provide multiple methods of checking for equivocal values, and finding their locations.
+
+### Exercise 24
+
+Create a function by typing `function()`. Add the parameter `buffer` and add curly brackets (`{}`) after `function()`.
+
+```{r equicoval-results-24, exercise = TRUE}
+
+```
+
+```{r equicoval-results-24-hint-1, eval = FALSE}
+function(...) {
+
+}
+```
+
+```{r include = FALSE}
+function(buffer) {
+
+}
+```
+
+###
+
+The term "reportable rate" in statistics generally refers to a proportion or percentage that is calculated from data and is considered significant enough to be reported, especially in research papers, reports, or presentations.
+
+### Exercise 25
+
+Copy the previous code and, inside the brackets, type `test_pred` and pipe it to `mutate()`. Within `mutate()`, add the parameter `.pred_with_eqz`, setting it equal to `make_two_class_pred(.pred_class_1, lvls, buffer = buffer)`. Then, set this expression to `test_pred`.
+
+```{r equicoval-results-25, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-25-hint-1, eval = FALSE}
+function(buffer){
+  test_pred <-
+    test_pred |>
+    mutate(.pred_with_eqz = ...)
+}
+```
+
+```{r include = FALSE}
+function(buffer){
+  test_pred <-
+    test_pred |>
+    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
+}
+```
+
+###
+
+If a `buffer` parameter is provided, predictions falling within the buffer range around the threshold might be considered as an "ambiguous" or "uncertain" class.
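+
+Here is a small, hedged illustration of the buffer in action (assuming **probably** is loaded; equivocal values print as `[EQ]`):
+
+```r
+# With threshold 0.5 and buffer 0.15, probabilities in [0.35, 0.65] are
+# flagged as equivocal rather than being assigned a hard class.
+probs <- c(0.95, 0.55, 0.04)
+make_two_class_pred(probs, levels = c("class_1", "class_2"), buffer = 0.15)
+```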
+
+### Exercise 26
+
+Copy the previous code and, on the next line, pipe `test_pred` to `accuracy()`. Add the parameters `class` and `.pred_with_eqz`. Then, set this expression to `acc` using `<-`.
+
+```{r equicoval-results-26, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-26-hint-1, eval = FALSE}
+function(buffer){
+  ...
+  acc <- test_pred |> accuracy(..., ...)
+}
+```
+
+```{r include = FALSE}
+function(buffer){
+  test_pred <-
+    test_pred |>
+    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
+  acc <- test_pred |> accuracy(class, .pred_with_eqz)
+}
+```
+
+###
+
+The *reportable rate* is defined as the percentage of class predictions that are not equivocal; the `reportable_rate()` function computes it.
+
+### Exercise 27
+
+Copy the previous code and, on the next line, type `reportable_rate()`. Add the parameter `test_pred$.pred_with_eqz`. Then, set this expression to `rep_rate` using `<-`.
+
+```{r equicoval-results-27, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-27-hint-1, eval = FALSE}
+function(buffer){
+  ...
+  rep_rate <- reportable_rate(test_pred$...)
+}
+```
+
+```{r include = FALSE}
+function(buffer){
+  test_pred <-
+    test_pred |>
+    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
+  acc <- test_pred |> accuracy(class, .pred_with_eqz)
+  rep_rate <- reportable_rate(test_pred$.pred_with_eqz)
+}
+```
+
+###
+
+Since we used a Bayesian model, the probability estimates we found are actually the mean of the posterior predictive distribution. In other words, the Bayesian model gives us a distribution for the class probability.
+
+### Exercise 28
+
+Copy the previous code and, on the next line, type `tibble()`. Add the parameters `accuracy`, setting it equal to `acc$.estimate`, `reportable`, setting it equal to `rep_rate`, and `buffer`, setting it equal to `buffer`.
+Then, set the entire function to `eq_zone_results` using `<-`.
+
+```{r equicoval-results-28, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-28-hint-1, eval = FALSE}
+eq_zone_results <-
+  function(buffer){
+    ...
+    tibble(accuracy = ..., reportable = ..., buffer = ...)
+}
+```
+
+```{r include = FALSE}
+eq_zone_results <-
+  function(buffer){
+    test_pred <-
+      test_pred |>
+      mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
+    acc <- test_pred |> accuracy(class, .pred_with_eqz)
+    rep_rate <- reportable_rate(test_pred$.pred_with_eqz)
+    tibble(accuracy = acc$.estimate, reportable = rep_rate, buffer = buffer)
+}
+```
+
+###
+
+In R, the `map()` function is part of the **purrr** package, a component of the tidyverse ecosystem. `map()` is used to apply a function to each element of a list or a vector and returns a new list with the results.
+
+### Exercise 29
+
+Outside of the function, type `map()`. As its first argument, add `seq()` with the parameters `0`, `0.1`, and `length.out`, setting it equal to `40`. As the second argument of `map()`, add `eq_zone_results`.
+
+```{r equicoval-results-29, exercise = TRUE}
+
+```
+
+```{r equicoval-results-29-hint-1, eval = FALSE}
+map(seq(..., ..., length.out = ...), eq_zone_results)
+```
+
+```{r include = FALSE}
+map(seq(0, .1, length.out = 40), eq_zone_results)
+```
+
+###
+
+`list_rbind()` combines elements into a data frame by row-binding them together with `vctrs::vec_rbind()`.
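+
+A minimal, hedged sketch of the `map()` + `list_rbind()` pattern used here (available in **purrr** 1.0 and later), separate from the exercise code:
+
+```r
+# Each call returns a one-row tibble; list_rbind() stacks them by row.
+library(purrr)
+library(tibble)
+map(1:3, \(x) tibble(x = x, x_squared = x^2)) |>
+  list_rbind()
+```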
+
+### Exercise 30
+
+Copy the previous code and pipe it to `list_rbind()`. Then, pipe the code to `pivot_longer()`. Add the parameters `c(-buffer)`, `names_to`, setting it equal to `"statistic"`, and `values_to`, setting it equal to `"value"`.
+
+```{r equicoval-results-30, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-30-hint-1, eval = FALSE}
+... |>
+  list_rbind() |>
+  pivot_longer(c(...), names_to = ..., values_to = ...)
+```
+
+```{r include = FALSE}
+map(seq(0, .1, length.out = 40), eq_zone_results) |>
+  list_rbind() |>
+  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value")
+```
+
+###
+
+Measuring the standard deviation of this distribution gives us a standard error of prediction of the probability. In most cases, this value is directly related to the mean class probability.
+
+### Exercise 31
+
+Copy the previous code and pipe it to `ggplot()`. Within `aes()` of `ggplot()`, add the parameters `x`, setting it equal to `buffer`, `y`, setting it equal to `value`, and `lty`, setting it equal to `statistic`.
+
+```{r equicoval-results-31, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-31-hint-1, eval = FALSE}
+... |>
+  ggplot(aes(x = ..., y = ..., lty = ...))
+```
+
+```{r include = FALSE}
+map(seq(0, .1, length.out = 40), eq_zone_results) |>
+  list_rbind() |>
+  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") |>
+  ggplot(aes(x = buffer, y = value, lty = statistic))
+```
+
+###
+
+You might recall that, for a Bernoulli random variable with probability $p$, the variance is $p(1-p)$. Because of this relationship, the standard error is largest when the probability is 50%.
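+
+A quick numeric check of that claim:
+
+```r
+# sqrt(p * (1 - p)) is the Bernoulli standard deviation; it peaks at p = 0.5.
+p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
+round(sqrt(p * (1 - p)), 3)
+```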
+
+### Exercise 32
+
+Copy the previous code and add `geom_step()`. Add the parameters `linewidth`, setting it equal to `1.2`, and `alpha`, setting it equal to `0.8`.
+
+```{r equicoval-results-32, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-32-hint-1, eval = FALSE}
+... +
+  geom_step(linewidth = ..., alpha = ...)
+```
+
+```{r include = FALSE}
+map(seq(0, .1, length.out = 40), eq_zone_results) |>
+  list_rbind() |>
+  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") |>
+  ggplot(aes(x = buffer, y = value, lty = statistic)) +
+  geom_step(linewidth = 1.2, alpha = 0.8)
+```
+
+###
+
+Instead of assigning an equivocal result using the class probability, we could instead use a cutoff on the standard error of prediction.
+
+### Exercise 33
+
+Copy the previous code and add `labs()`. Add the parameters `y`, setting it equal to `NULL`, and `lty`, setting it equal to `NULL`.
+
+```{r equicoval-results-33, exercise = TRUE}
+
+```
+
+
+```{r equicoval-results-33-hint-1, eval = FALSE}
+... +
+  labs(y = ..., lty = ...)
+```
+
+```{r include = FALSE}
+map(seq(0, .1, length.out = 40), eq_zone_results) |>
+  list_rbind() |>
+  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") |>
+  ggplot(aes(x = buffer, y = value, lty = statistic)) +
+  geom_step(linewidth = 1.2, alpha = 0.8) +
+  labs(y = NULL, lty = NULL)
+```
+
+###
+
+One important aspect of the standard error of prediction is that it takes into account more than just the class probability. In cases where there is significant extrapolation or aberrant predictor values, the standard error might increase.
+
+### Exercise 34
+
+Type `test_pred` and pipe it to `bind_cols()`. Add the parameter `predict()`. Within `predict()`, add the parameters `two_class_mod`, `testing_set`, `type`, setting it equal to `"pred_int"`, and `std_error`, setting it equal to `TRUE`. Set this entire expression to `test_pred`.
+
+```{r equicoval-results-34, exercise = TRUE}
+
+```
+
+```{r equicoval-results-34-hint-1, eval = FALSE}
+test_pred <-
+  test_pred |>
+  bind_cols(
+    predict(..., ..., type = "...", std_error = ...)
+  )
+```
+
+```{r include = FALSE}
+test_pred <-
+  test_pred |>
+  bind_cols(
+    predict(two_class_mod, testing_set, type = "pred_int", std_error = TRUE)
+  )
+```
+
+###
+
+The benefit of using the standard error of prediction is that it might also flag predictions that are problematic (as opposed to simply uncertain).
+
+###
+
+Great Job! You now know how to produce and evaluate equivocal results using functions such as `reportable_rate()`, `levels()`, `augment()`, and `logistic_reg()`.
+
+## Determining Model Applicability
+###
+
+Equivocal zones try to measure the reliability of a prediction based on the model outputs. It may be that model statistics, such as the standard error of prediction, cannot measure the impact of extrapolation, and so we need another way to assess whether to trust a prediction and answer, “Is our model applicable for predicting a specific data point?”
+
+### Exercise 1
+
+Load the data `Chicago` by using `data()`.
+
+```{r determining-model-ap-1, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-1-hint-1, eval = FALSE}
+data(...)
+```
+
+```{r include = FALSE}
+data(Chicago)
+```
+
+###
+
+Let’s take the Chicago train data used extensively in Kuhn and Johnson [2019](https://bookdown.org/max/FES/chicago-intro.html) and first shown in a previous tutorial.
+
+### Exercise 2
+
+Pipe `Chicago` to `select()`. Add the parameters `ridership`, `date`, and `one_of(stations)`. Set this expression to `Chicago` using `<-`.
+
+```{r determining-model-ap-2, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-2-hint-1, eval = FALSE}
+Chicago |> select(..., ..., one_of(...))
+```
+
+```{r include = FALSE}
+Chicago <- Chicago |> select(ridership, date, one_of(stations))
+```
+
+###
+
+The goal is to predict the number of customers entering the Clark and Lake train station each day.
+
+### Exercise 3
+
+Type `nrow()` and add the parameter `Chicago`. Set this expression to `n` using `<-`.
+
+```{r determining-model-ap-3, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-3-hint-1, eval = FALSE}
+n <- nrow(...)
+```
+
+```{r include = FALSE}
+n <- nrow(Chicago)
+```
+
+###
+
+The data set in the **modeldata** package (a tidymodels package with example data sets) has daily values between January 22, 2001 and August 28, 2016.
+
+### Exercise 4
+
+Pipe `Chicago` to `slice()`. Add the parameter `1:(n - 14)` and set the entire expression to `Chicago_train`.
+
+```{r determining-model-ap-4, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-4-hint-1, eval = FALSE}
+Chicago_train <- Chicago |> slice(...:(...))
+```
+
+```{r include = FALSE}
+Chicago_train <- Chicago |> slice(1:(n - 14))
+```
+
+###
+
+Using the standard error as a measure to preclude samples from being predicted can also be applied to models with numeric outcomes.
+
+### Exercise 5
+
+Pipe `Chicago` to `slice()`. Add the parameter `(n - 13):n`. Set this expression to `Chicago_test` using `<-`.
+
+```{r determining-model-ap-5, exercise = TRUE}
+
+```
+
+
+```{r determining-model-ap-5-hint-1, eval = FALSE}
+Chicago_test <- Chicago |> slice((...):...)
+```
+
+```{r include = FALSE}
+Chicago_test <- Chicago |> slice((n - 13):n)
+```
+
+###
+
+The main predictors are lagged ridership data at different train stations, including Clark and Lake, as well as the date. The ridership predictors are highly correlated with one another.
+
+### Exercise 6
+
+Type `recipe()`. Add the parameters `ridership ~ .` and `data`, setting it equal to `Chicago_train`.
+ +```{r determining-model-ap-6, exercise = TRUE} + +``` + +```{r determining-model-ap-6-hint-1, eval = FALSE} +recipe(..., data = ...) +``` + +```{r include = FALSE} +recipe(ridership ~ ., data = Chicago_train) +``` + +### + +In the following recipe, the date column is expanded into several new features, and the ridership predictors are represented using partial least squares (PLS) components. + +### Exercise 7 + +Copy the previous code and pipe it to `step_date()`. Add the parameter `date`. Then, pipe the code to `step_holiday()`. Add the parameters `date` and `keep_original_cols`, setting it equal to `FALSE`. + +```{r determining-model-ap-7, exercise = TRUE} + +``` + +```{r determining-model-ap-7-hint-1, eval = FALSE} +... |> + step_date(...) |> + step_holiday(..., keep_original_cols = ...) +``` + +```{r include = FALSE} +recipe(ridership ~ ., data = Chicago_train) |> + step_date(date) |> + step_holiday(date, keep_original_cols = FALSE) +``` + +### + +PLS [Geladi and Kowalski 1986](https://www.tmwr.org/trust#ref-Geladi:1986) is a supervised version of principal component analysis where the new features have been decorrelated but are predictive of the outcome data. + +### Exercise 8 + +Copy the previous code and pipe it to `step_dummy()`. Add the parameter `all_nominal()`. Then, pipe the code to `step_zv()`, adding the parameter `all_predictors()`. + +```{r determining-model-ap-8, exercise = TRUE} + +``` + + + +```{r determining-model-ap-8-hint-1, eval = FALSE} +... |> + step_dummy(...()) |> + step_zv(...()) +``` + +```{r include = FALSE} +recipe(ridership ~ ., data = Chicago_train) |> + step_date(date) |> + step_holiday(date, keep_original_cols = FALSE) |> + step_dummy(all_nominal()) |> + step_zv(all_predictors()) +``` + +### + +`step_holiday()` creates a specification of a recipe step that will convert date data into one or more binary indicator variables for common holidays. + +### Exercise 9 + +Copy the previous code and pipe it to `step_normalize()`. Add the parameter `!!!stations`. + +```{r determining-model-ap-9, exercise = TRUE} + +``` + + + +```{r determining-model-ap-9-hint-1, eval = FALSE} +... |> + step_normalize(!!!...) +``` + +```{r include = FALSE} +recipe(ridership ~ ., data = Chicago_train) |> + step_date(date) |> + step_holiday(date, keep_original_cols = FALSE) |> + step_dummy(all_nominal()) |> + step_zv(all_predictors()) |> + step_normalize(!!!stations) +``` + +### + +`step_pls()` creates a specification of a recipe step that will convert numeric data into one or more new dimensions. + +### Exercise 10 + +Copy the previous code and pipe it to `step_pls()`. Add the parameters `!!!stations`, `num_comp`, setting it equal to `10`, and `outcome`, setting it equal to `vars(ridership)`. Finally, set the entire expression to `base_recipe` using `<-`. + +```{r determining-model-ap-10, exercise = TRUE} + +``` + + + +```{r determining-model-ap-10-hint-1, eval = FALSE} +base_recipe <- + ... |> + step_pls(!!!..., num_comp = ..., outcome = vars(...)) +``` + +```{r include = FALSE} +base_recipe <- + recipe(ridership ~ ., data = Chicago_train) |> + step_date(date) |> + step_holiday(date, keep_original_cols = FALSE) |> + step_dummy(all_nominal()) |> + step_zv(all_predictors()) |> + step_normalize(!!!stations) |> + step_pls(!!!stations, num_comp = 10, outcome = vars(ridership)) +``` + +### + +Now, lets create a specification for the model fit. + +### Exercise 11 + +Pipe `linear_reg()` to `set_engine()`, adding the parameter `"lm"` to `set_engine()`. 
Then, set the entire expression to `lm_spec` using `<-`.
+
+```{r determining-model-ap-11, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-11-hint-1, eval = FALSE}
+lm_spec <-
+  linear_reg() |>
+  set_engine("...")
+```
+
+```{r include = FALSE}
+lm_spec <-
+  linear_reg() |>
+  set_engine("lm")
+```
+
+###
+
+In cases where there is significant extrapolation or aberrant predictor values, the standard error might increase. The benefit of using the standard error of prediction is that it might also flag predictions that are problematic (as opposed to simply uncertain).
+
+### Exercise 12
+
+Pipe `workflow()` to `add_recipe()`, adding the parameter `base_recipe` to `add_recipe()`. Then, pipe the code to `add_model(lm_spec)` and set the whole expression to `lm_wflow` using `<-`.
+
+```{r determining-model-ap-12, exercise = TRUE}
+
+```
+
+
+```{r determining-model-ap-12-hint-1, eval = FALSE}
+lm_wflow <-
+  workflow() |>
+  ...(base_recipe) |>
+  add_model(...)
+```
+
+```{r include = FALSE}
+lm_wflow <-
+  workflow() |>
+  add_recipe(base_recipe) |>
+  add_model(lm_spec)
+```
+
+###
+
+One reason we used the Bayesian model is that it naturally estimates the standard error of prediction; not many models can calculate this.
+
+### Exercise 13
+
+Now let's fit the model using `fit()`. Type `fit()` and add the parameters `lm_wflow` and `data`, setting it equal to `Chicago_train`. Set this entire expression to `lm_fit` using `<-`.
+
+```{r determining-model-ap-13, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-13-hint-1, eval = FALSE}
+lm_fit <- fit(..., data = ...)
+```
+
+```{r include = FALSE}
+lm_fit <- fit(lm_wflow, data = Chicago_train)
+```
+
+###
+
+How well do the data fit on the test set? We can `predict()` for the test set to find both predictions and prediction intervals.
+
+### Exercise 14
+
+Type `predict()` and add the parameters `lm_fit` and `Chicago_test`.
+
+```{r determining-model-ap-14, exercise = TRUE}
+
+```
+
+
+```{r determining-model-ap-14-hint-1, eval = FALSE}
+predict(..., ...)
+```
+
+```{r include = FALSE}
+predict(lm_fit, Chicago_test)
+```
+
+###
+
+For our test set, using `type = "pred_int"` will produce upper and lower limits and the `std_error` adds a column for that quantity.
+
+### Exercise 15
+
+Copy the previous code and pipe it to `bind_cols()`. Add the parameters `predict()` and `Chicago_test`. Within `predict()`, add the parameters `lm_fit`, `Chicago_test`, and `type`, setting it equal to `"pred_int"`. Then set the entire expression to `res_test` using `<-`.
+
+```{r determining-model-ap-15, exercise = TRUE}
+
+```
+
+
+```{r determining-model-ap-15-hint-1, eval = FALSE}
+res_test <-
+  ... |>
+  bind_cols(
+    predict(..., ..., type = "..."),
+    Chicago_test
+  )
+```
+
+```{r include = FALSE}
+res_test <-
+  predict(lm_fit, Chicago_test) |>
+  bind_cols(
+    predict(lm_fit, Chicago_test, type = "pred_int"),
+    Chicago_test
+  )
+```
+
+###
+
+Confidence and prediction intervals for linear regression expand as the data become more and more removed from the center of the training set. However, that effect is not dramatic enough to flag these predictions as being poor.
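+
+A tiny base-R illustration of that widening effect (toy data, not the Chicago model):
+
+```r
+# Prediction intervals from lm() grow wider for new x-values far from the
+# center of the training data.
+set.seed(1)
+toy <- data.frame(x = 1:10)
+toy$y <- toy$x + rnorm(10)
+fit <- lm(y ~ x, data = toy)
+predict(fit, data.frame(x = c(5, 50)), interval = "prediction")
+```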
+
+### Exercise 16
+
+Pipe `res_test` to `select()` and add the parameters `date`, `ridership`, and `starts_with(".pred")`.
+
+```{r determining-model-ap-16, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-16-hint-1, eval = FALSE}
+res_test |> select(date, ..., starts_with("..."))
+```
+
+```{r include = FALSE}
+res_test |> select(date, ridership, starts_with(".pred"))
+```
+
+###
+
+There are a variety of methods to compute an applicability domain model, such as Jaworska, Nikolova-Jeliazkova, and Aldenberg [2005](https://www.tmwr.org/trust#ref-Jaworska) or Netzeva et al. [2005](https://www.tmwr.org/trust#ref-Netzeva).
+
+### Exercise 17
+
+Pipe `res_test` to `rmse()`. Add the parameters `ridership` and `.pred`.
+
+```{r determining-model-ap-17, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-17-hint-1, eval = FALSE}
+res_test |> rmse(..., ...)
+```
+
+```{r include = FALSE}
+res_test |> rmse(ridership, .pred)
+```
+
+###
+
+One method that computes an applicability domain model well uses principal component analysis (PCA) on the numeric predictor values. We’ll illustrate the process by using only two of the predictors that correspond to ridership at different stations (California and Austin stations).
+
+### Exercise 18
+
+Type `predict()` and add the parameters `lm_fit` and `Chicago_2020`.
+
+```{r determining-model-ap-18, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-18-hint-1, eval = FALSE}
+predict(..., ...)
+```
+
+```{r include = FALSE}
+predict(lm_fit, Chicago_2020)
+```
+
+###
+
+The `bind_cols()` function in R is part of the **dplyr** package, which is a core component of the tidyverse ecosystem. It is used to combine data frames or tibbles by column-wise binding, effectively concatenating them horizontally.
+
+### Exercise 19
+
+Copy the previous code and pipe it to `bind_cols()`. Add the parameters `predict()` and `Chicago_2020`. Within `predict()`, add the parameters `lm_fit`, `Chicago_2020`, and `type`, setting it equal to `"pred_int"`. Then, set the entire expression to `res_2020`.
+
+```{r determining-model-ap-19, exercise = TRUE}
+
+```
+
+
+```{r determining-model-ap-19-hint-1, eval = FALSE}
+res_2020 <-
+  ... |>
+  bind_cols(
+    predict(lm_fit, Chicago_2020, type = "..."),
+    ...
+  )
+```
+
+```{r include = FALSE}
+res_2020 <-
+  predict(lm_fit, Chicago_2020) |>
+  bind_cols(
+    predict(lm_fit, Chicago_2020, type = "pred_int"),
+    Chicago_2020
+  )
+```
+
+###
+
+If this model were deployed, how well would it have done a few years later in June 2020? The model successfully makes a prediction, as a predictive model almost always will when given input data.
+
+### Exercise 20
+
+Pipe `res_2020` to `select()`. Add the parameters `date` and `contains(".pred")`.
+
+```{r determining-model-ap-20, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-20-hint-1, eval = FALSE}
+res_2020 |> select(..., contains("..."))
+```
+
+```{r include = FALSE}
+res_2020 |> select(date, contains(".pred"))
+```
+
+###
+
+The prediction intervals are about the same width, even though these data are well beyond the time period of the original training set. However, given the global pandemic in 2020, the performance on these data is abysmal.
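+
+To put a number on that drop, the same metric from Exercise 17 can be applied to the 2020 predictions (a sketch; the value depends on the fitted model):
+
+```r
+# RMSE on the June 2020 data; expect it to be far worse than the test set.
+res_2020 |> rmse(ridership, .pred)
+```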
+
+### Exercise 21
+
+Pipe `res_2020` to `select()`. Add the parameters `date`, `ridership`, and `starts_with(".pred")`.
+
+```{r determining-model-ap-21, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-21-hint-1, eval = FALSE}
+res_2020 |> select(date, ..., starts_with("..."))
+```
+
+```{r include = FALSE}
+res_2020 |> select(date, ridership, starts_with(".pred"))
+```
+
+###
+
+We'll use the 20 lagged station ridership predictors as inputs into the PCA analysis. For our example, we'll use a large value that indicates we should use enough components to account for 99% of the variation in the ridership predictors.
+
+### Exercise 22
+
+Load the library **applicable** using `library()`.
+
+```{r determining-model-ap-22, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-22-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(applicable)
+```
+
+###
+
+The **applicable** package can develop an applicability domain model using PCA. There is an additional argument called `threshold` that determines how many components are used in the distance calculation.
+
+### Exercise 23
+
+Type `apd_pca()`. Add the parameters `~ .`, `data`, setting that equal to `Chicago_train`, and `threshold`, setting that equal to `0.99`.
+
+```{r determining-model-ap-23, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-23-hint-1, eval = FALSE}
+apd_pca(..., data = ..., threshold = ...)
+```
+
+```{r include = FALSE}
+#apd_pca(~ ., data = Chicago_train, threshold = 0.99)
+```
+
+###
+
+The 2020 sample is farther from the center than any of the training set samples (with a percentile of 100%). This indicates the sample is very extreme and that its corresponding prediction would be a severe extrapolation (and probably should not be reported).
+
+### Exercise 24
+
+Copy the previous code and in the `data` argument, pipe `Chicago_train` to `select()`. Within `select()`, add the parameter `one_of(stations)`. Then, set the entire expression to `pca_stat` using `<-` and run it on the next line.
+
+```{r determining-model-ap-24, exercise = TRUE}
+
+```
+
+
+
+```{r determining-model-ap-24-hint-1, eval = FALSE}
+pca_stat <-
+  apd_pca(~ ., data = Chicago_train |> ...(one_of(...)), threshold = 0.99)
+```
+
+```{r include = FALSE}
+# pca_stat <-
+#   apd_pca(~ ., data = Chicago_train |> select(one_of(stations)), threshold = 0.99)
+#
+# pca_stat
+```
+
+###
+
+For a model created with `apd_pca()`, `score()` computes distance statistics for new samples: how far each sample is from the center of the training set in the PCA space, along with the percentile of that distance relative to the training data.
+
+### Exercise 25
+
+Type `score()` and add the parameters `pca_stat` and `Chicago_test`. Pipe this code to `select()` and add the parameter `starts_with("distance")`.
+
+```{r determining-model-ap-25, exercise = TRUE}
+
+```
+
+```{r determining-model-ap-25-hint-1, eval = FALSE}
+score(..., ...) |> select(starts_with("..."))
+```
+
+```{r include = FALSE}
+score(pca_stat, Chicago_test) |> select(starts_with("distance"))
+```
+
+###
+
+Now let's look at the 2020 data.
+
+### Exercise 26
+
+Copy the previous code and change `Chicago_test` to `Chicago_2020`.
+
+```{r determining-model-ap-26, exercise = TRUE}
+
+```
+
+
+
+```{r determining-model-ap-26-hint-1, eval = FALSE}
+score(pca_stat, ...) |> select(starts_with("distance"))
+
+```
+
+```{r include = FALSE}
+score(pca_stat, Chicago_2020) |> select(starts_with("distance"))
+
+```
+
+###
+
+The **applicable** package also contains specialized methods for data sets where all of the predictors are binary. This method computes similarity scores between training set data points to define the reference distribution.
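+
+As a rough sketch of that idea (the data frames here are hypothetical, and the interface is assumed to mirror `apd_pca()`), a binary-predictor applicability model might look like:
+
+```r
+library(applicable)
+
+# Hypothetical data frames whose columns are all 0/1 predictors.
+sim_mod <- apd_similarity(binary_train)
+
+# Score new samples against the training-set similarity distribution.
+score(sim_mod, binary_new)
+```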
+
+###
+
+Great Job! You now know how to determine model applicability by trying to predict the number of customers entering the Clark and Lake train station each day.
+
+## Summary
+###
+
+This tutorial covered [Chapter 19: When Should you Trust Your Predictions?](https://www.tmwr.org/trust) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. This tutorial showed two methods for evaluating whether predictions should be reported to the consumers of models. Equivocal zones deal with outcomes/predictions and can be helpful when the amount of uncertainty in a prediction is too large. Applicability domain metrics, in contrast, use the predictor values to measure how much a new sample is an extrapolation from the training set.
+
+```{r download-answers, child = system.file("child_documents/download_answers.Rmd", package = "tutorial.helpers")}
+```
diff --git a/inst/tutorials/21-inferential-analysis/tutorial.Rmd b/inst/tutorials/21-inferential-analysis/tutorial.Rmd
new file mode 100644
index 0000000..ee3eb93
--- /dev/null
+++ b/inst/tutorials/21-inferential-analysis/tutorial.Rmd
@@ -0,0 +1,1237 @@
+---
+title: Inferential Analysis
+author: Pratham Kancherla
+tutorial:
+  id: inferential-analysis
+output:
+  learnr::tutorial:
+    progressive: yes
+    allow_skip: yes
+runtime: shiny_prerendered
+description: 'Tutorial for Chapter 21: Inferential Analysis'
+---
+
+```{r setup, include = FALSE}
+library(learnr)
+library(tutorial.helpers)
+library(finetune)
+library(baguette)
+library(tidyverse)
+library(tidymodels)
+library(rlang)
+library(embed)
+library(tune)
+library(ggrepel)
+library(ggforce)
+library(rstanarm)
+library(rules)
+library(tidyposterior)
+library(lme4)
+library(multilevelmod)
+library(nlme)
+library(probably)
+library(usemodels)
+library(workflowsets)
+library(poissonreg)
+
+knitr::opts_chunk$set(echo = FALSE)
+options(tutorial.exercise.timelimit = 60,
+        tutorial.storage = "local")
+
+data("bioChemists", package = "pscl")
+
+observed <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+
+bootstrapped <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  generate(reps = 2000, type = "bootstrap") |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+
+percentile_ci <- get_ci(bootstrapped)
+
+permuted <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  hypothesize(null = "independence") |>
+  generate(reps = 2000, type = "permute") |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+
+log_lin_spec <- poisson_reg()
+
+log_lin_fit <-
+  log_lin_spec |>
+  fit(art ~ ., data = bioChemists)
+
+glm_boot <-
+  reg_intervals(art ~ ., data = bioChemists, model_fn = "glm", family = poisson)
+
+log_lin_reduced <-
+  log_lin_spec |>
+  fit(art ~ ment + kid5 + fem + mar, data = bioChemists)
+
+zero_inflated_spec <- poisson_reg() |> set_engine("zeroinfl")
+
+zero_inflated_fit <-
+  zero_inflated_spec |>
+  fit(art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment,
+      data = bioChemists)
+
+# Model formulas used by the bootstrap fits below (also built in Exercise 6
+# of the "A More Complex Model" section).
+glm_form <- art ~ fem + mar + kid5 + ment
+zip_form <- art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment
+
+bootstrap_models <-
+  bootstraps(bioChemists, times = 2000, apparent = TRUE) |>
+  mutate(
+    glm = map(splits, ~ fit(log_lin_spec, glm_form, data = analysis(.x))),
+    zip = map(splits, ~ fit(zero_inflated_spec, zip_form, data = analysis(.x)))
+  )
+
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(
+    glm_aic = map_dbl(glm, ~ extract_fit_engine(.x) |> AIC()),
+    zip_aic = map_dbl(zip, ~ extract_fit_engine(.x) |> AIC())
+  )
+
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(zero_coefs = map(zip, ~ tidy(.x, type = "zero")))
+
+```
+
+```{r copy-code-chunk, child = system.file("child_documents/copy_button.Rmd", package = "tutorial.helpers")}
+```
+
+```{r info-section, child = system.file("child_documents/info_section.Rmd", package = "tutorial.helpers")}
+```
+
+## Introduction
+###
+
+This tutorial covers [Chapter 21: Inferential Analysis](https://www.tmwr.org/inferential) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. In this tutorial, we describe how to use tidymodels for fitting and assessing inferential models. In some cases, the tidymodels framework can help users work with the objects produced by their models. In others, it can help assess the quality of a given model.
+
+## Inference for Count Data
+###
+
+To understand how tidymodels packages can be used for inferential modeling, let's focus on an example with count data. We'll use biochemistry publication data from the **pscl** package. These data consist of information on 915 Ph.D. biochemistry graduates and are used to explain the factors that impact their academic productivity (measured via the count of articles published within three years). The predictors include the gender of the graduate, their marital status, the number of children of the graduate that are at least five years old, the prestige of their department, and the number of articles produced by their mentor in the same time period. The data reflect biochemistry doctorates who finished their education between 1956 and 1963. The data are a somewhat biased sample of all of the biochemistry doctorates given during this period (based on completeness of information).
+
+### Exercise 1
+
+Load the library **tidymodels** using `library()`.
+
+```{r inference-for-count--1, exercise = TRUE}
+
+```
+
+```{r inference-for-count--1-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(tidymodels)
+```
+
+###
+
+The common assumption is that the residual values are independent and follow a Gaussian distribution with a constant variance. The methods for determining if the model's assumptions have been met are not as simple as looking at holdout predictions, although that can be very useful as well.
+
+### Exercise 2
+
+Load the data `bioChemists` from the **pscl** package. Type `data()` and add the parameters `"bioChemists"` and `package`, setting it equal to `"pscl"`.
+
+```{r inference-for-count--2, exercise = TRUE}
+
+```
+
+```{r inference-for-count--2-hint-1, eval = FALSE}
+data("...", package = "...")
+```
+
+```{r include = FALSE}
+data("bioChemists", package = "pscl")
+```
+
+###
+
+The **pscl** (Political Science Computational Laboratory) package is an R package specifically designed for political science research and other social science applications. It provides a range of tools and functions for analyzing data commonly encountered in political science and related fields.
+
+### Exercise 3
+
+Pipe `bioChemists` to `ggplot()`. Within `aes()` of `ggplot()`, set `x` equal to `art`.
+
+```{r inference-for-count--3, exercise = TRUE}
+
+```
+
+```{r inference-for-count--3-hint-1, eval = FALSE}
+bioChemists |>
+  ggplot(aes(x = ...))
+```
+
+```{r include = FALSE}
+bioChemists |>
+  ggplot(aes(x = art))
+```
+
+###
+
+Inferential models are usually created not only for their predictions, but also to make inferences or judgments about some component of the model, such as a coefficient value or other parameter. These results are often used to answer some (hopefully) pre-defined questions or hypotheses.
+
+### Exercise 4
+
+Copy the previous code and add `geom_histogram()`. Add the parameters `binwidth`, setting that equal to `1`, and `color`, setting that equal to `"white"`.
+
+```{r inference-for-count--4, exercise = TRUE}
+
+```
+
+
+
+```{r inference-for-count--4-hint-1, eval = FALSE}
+... +
+  geom_histogram(binwidth = ..., color = "...")
+```
+
+```{r include = FALSE}
+bioChemists |>
+  ggplot(aes(x = art)) +
+  geom_histogram(binwidth = 1, color = "white")
+```
+
+###
+
+In predictive models, predictions on hold-out data are used to validate or characterize the quality of the model. Inferential methods focus on validating the probabilistic or structural assumptions that are made prior to fitting the model.
+
+### Exercise 5
+
+Copy the previous code and add `labs()`. Add the parameter `x`, setting it equal to the title `"Number of articles within 3y of graduation"`.
+
+```{r inference-for-count--5, exercise = TRUE}
+
+```
+
+
+
+```{r inference-for-count--5-hint-1, eval = FALSE}
+... +
+  labs(x = "...")
+```
+
+```{r include = FALSE}
+bioChemists |>
+  ggplot(aes(x = art)) +
+  geom_histogram(binwidth = 1, color = "white") +
+  labs(x = "Number of articles within 3y of graduation")
+```
+
+###
+
+Since the outcome data are counts, the most common distribution assumption to make is that the outcome has a Poisson distribution, where $\Pr(Y = k) = \lambda^k e^{-\lambda} / k!$ for counts $k = 0, 1, 2, \ldots$. This tutorial will use these data for several types of analyses.
+
+## Comparison with Two-Sample Tests
+###
+
+We can start with hypothesis testing. The original author's goal with this data set on biochemistry publication data was to determine if there is a difference in publications between men and women [Long 1992](https://www.tmwr.org/inferential#ref-Long1992).
+
+### Exercise 1
+
+Pipe `bioChemists` to `group_by()`. Add the parameter `fem`. Then, pipe the code to `summarize()`. Add the parameters `counts`, setting it equal to `sum(art)`, and `n`, setting it equal to `length(art)`.
+
+```{r comparision-with-two-1, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-1-hint-1, eval = FALSE}
+bioChemists |>
+  group_by(...) |>
+  summarize(counts = sum(...), n = length(...))
+```
+
+```{r include = FALSE}
+bioChemists |>
+  group_by(fem) |>
+  summarize(counts = sum(art), n = length(art))
+```
+
+###
+
+There were many more publications by men, although there were also more men in the data. The simplest approach to analyzing these data would be to do a two-sample comparison using the `poisson.test()` function in the **stats** package.
+
+### Exercise 2
+
+Let's create a basic application of a Poisson test. Type `poisson.test()`. Add the vector parameter `c(930, 619)` and add `T`, setting it equal to `3`.
+
+```{r comparision-with-two-2, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-2-hint-1, eval = FALSE}
+poisson.test(c(..., 619), T = ...)
+```
+
+```{r include = FALSE}
+poisson.test(c(930, 619), T = 3)
+```
+
+###
+
+The function reports a p-value as well as a confidence interval for the ratio of the publication rates. The results indicate that the observed difference is greater than the experimental noise and favors $H_a$.
+
+### Exercise 3
+
+Copy the previous code. Delete the parameter `T = 3` and pipe the entire expression to `tidy()`.
+
+```{r comparision-with-two-3, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-3-hint-1, eval = FALSE}
+poisson.test(c(930, 619)) |>
+  ...()
+```
+
+```{r include = FALSE}
+poisson.test(c(930, 619)) |>
+  tidy()
+```
+
+###
+
+While the Poisson distribution is reasonable, we might also want to assess the difference using fewer distributional assumptions. Two methods that might be helpful are the bootstrap and permutation tests [Davison and Hinkley 1997](https://www.tmwr.org/inferential#ref-davison1997bootstrap).
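+
+Before building either with **infer**, here is a toy base-R sketch of the difference between the two resampling schemes:
+
+```r
+x <- c(5, 3, 8, 2, 7)
+set.seed(123)
+
+# Bootstrap: draw the observed values *with* replacement.
+sample(x, replace = TRUE)
+
+# Permutation: shuffle the observed values *without* replacement,
+# mimicking how group labels are reassigned under the null.
+sample(x, replace = FALSE)
+```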
+
+### Exercise 4
+
+Load the library **infer** using `library()`.
+
+```{r comparision-with-two-4, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-4-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(infer)
+```
+
+###
+
+The **infer** package is part of the tidymodels ecosystem and is designed to help with statistical inference, hypothesis testing, and confidence interval estimation in R.
+
+### Exercise 5
+
+Pipe `bioChemists` to `specify()`. Add the parameter `art ~ fem`.
+
+```{r comparision-with-two-5, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-5-hint-1, eval = FALSE}
+bioChemists |>
+  specify(art ~ ...)
+```
+
+```{r include = FALSE}
+bioChemists |>
+  specify(art ~ fem)
+```
+
+###
+
+The **infer** package, part of the tidymodels framework, is a powerful and intuitive tool for hypothesis testing [(Ismay and Kim 2021)](https://www.tmwr.org/inferential#ref-ModernDive). Its syntax is concise and designed for nonstatisticians.
+
+### Exercise 6
+
+Copy the previous code and pipe it to `calculate()`. Add the parameters `stat`, setting it equal to `"diff in means"`, and `order`, setting it equal to a vector (`c()`) consisting of `"Men"` and `"Women"`.
+
+```{r comparision-with-two-6, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-6-hint-1, eval = FALSE}
+... |>
+  calculate(stat = "...", order = c("...", "..."))
+```
+
+```{r include = FALSE}
+bioChemists |>
+  specify(art ~ fem) |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+```
+
+###
+
+Recall that the maximum likelihood estimator for the Poisson mean is the sample mean. The hypotheses tested here are the same as the previous test (but are conducted using a different testing procedure).
+
+### Exercise 7
+
+Finally, copy the previous code and set it to `observed` using `<-`.
+
+```{r comparision-with-two-7, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-7-hint-1, eval = FALSE}
+observed <-
+  ...
+```
+
+```{r include = FALSE}
+observed <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+```
+
+###
+
+The `specify()` function is part of the **infer** package in R, which is designed for statistical inference and hypothesis testing. Here it declares the variables of interest: the article count (`art`) as the response, compared across the levels of `fem`.
+
+### Exercise 8
+
+Pipe `bioChemists` to `specify(art ~ fem)`. Then, pipe the code to `generate()`. Add the parameters `reps`, setting it equal to `2000`, and `type`, setting that equal to `"bootstrap"`.
+
+```{r comparision-with-two-8, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-8-hint-1, eval = FALSE}
+bioChemists |>
+  specify(...) |>
+  generate(reps = ..., type = "...")
+```
+
+```{r include = FALSE}
+bioChemists |>
+  specify(art ~ fem) |>
+  generate(reps = 2000, type = "bootstrap")
+```
+
+###
+
+The `calculate()` function is particularly useful when you want to compute statistics on resampled data to estimate sampling variability and construct confidence intervals. It works in combination with other functions in the **infer** package to create a comprehensive framework for statistical analysis and hypothesis testing.
+
+### Exercise 9
+
+Copy the previous code and pipe it to `calculate()`. Add the parameters `stat`, setting it equal to `"diff in means"`, and `order`, setting it equal to `c("Men", "Women")`. Then, set the entire expression to `bootstrapped` and run it on the next line.
+
+```{r comparision-with-two-9, exercise = TRUE}
+
+```
+
+
+
+```{r comparision-with-two-9-hint-1, eval = FALSE}
+... |>
+  calculate(stat = "...", order = c("...", "..."))
+```
+
+```{r include = FALSE}
+bootstrapped <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  generate(reps = 2000, type = "bootstrap") |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+
+bootstrapped
+```
+
+###
+
+The **tidymodels** framework tends to promote confidence intervals over p-values as a method for quantifying the evidence for an alternative hypothesis.
+
+### Exercise 10
+
+Type `get_ci()` and add the parameter `bootstrapped`. Set the expression to `percentile_ci` and run it on the next line.
+
+```{r comparision-with-two-10, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-10-hint-1, eval = FALSE}
+... <- get_ci(...)
+```
+
+```{r include = FALSE}
+percentile_ci <- get_ci(bootstrapped)
+
+percentile_ci
+```
+
+###
+
+`get_ci()` is shorthand for **infer**'s `get_confidence_interval()`. Given a simulation-based distribution such as our bootstrap distribution, it returns a confidence interval; by default, a 95% percentile interval.
+
+### Exercise 11
+
+Type `visualize()` and add the parameter `bootstrapped`. Then, add `shade_confidence_interval()` and add the parameter `endpoints`, setting it equal to `percentile_ci`.
+
+```{r comparision-with-two-11, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-11-hint-1, eval = FALSE}
+visualize(...) +
+  shade_confidence_interval(endpoints = ...)
+```
+
+```{r include = FALSE}
+visualize(bootstrapped) +
+  shade_confidence_interval(endpoints = percentile_ci)
+```
+
+###
+
+The `visualize()` function is used to create visualizations of the results obtained from an analysis performed using the **infer** package. It's commonly used to visualize the sampling distribution of a statistic, confidence intervals, and other aspects of the inferential process.
+
+### Exercise 12
+
+Copy the code from Exercise 9. After `specify()`, pipe it to `hypothesize()`, adding the parameter `null = "independence"`. Then, change the `type` parameter in `generate()` to `"permute"`. Finally, set the expression to `permuted` using `<-` and run it on the next line.
+
+```{r comparision-with-two-12, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-12-hint-1, eval = FALSE}
+permuted <-
+  ... |>
+  hypothesize(null = "...") |>
+  ...
+```
+
+```{r include = FALSE}
+permuted <-
+  bioChemists |>
+  specify(art ~ fem) |>
+  hypothesize(null = "independence") |>
+  generate(reps = 2000, type = "permute") |>
+  calculate(stat = "diff in means", order = c("Men", "Women"))
+
+permuted
+```
+
+###
+
+`shade_confidence_interval()` plots a confidence interval region on top of `visualize()` output. The output is a **ggplot2** layer that can be added with `+`. The function has a shorter alias, `shade_ci()`.
+
+### Exercise 13
+
+Type `visualize()` and add the parameter `permuted`. Then, add `shade_p_value()`, adding the parameters `obs_stat`, setting it equal to `observed`, and `direction`, setting it equal to `"two-sided"`.
+
+```{r comparision-with-two-13, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-13-hint-1, eval = FALSE}
+visualize(...) +
+  shade_p_value(obs_stat = ..., direction = "...")
+```
+
+```{r include = FALSE}
+visualize(permuted) +
+  shade_p_value(obs_stat = observed, direction = "two-sided")
+```
+
+###
+
+The vertical line representing the observed statistic is far away from the permutation distribution generated under the null hypothesis.
+This means that, if the null hypothesis were in fact true, the likelihood of observing data at least as extreme as what is at hand would be exceedingly small.
+
+### Exercise 14
+
+Pipe `permuted` to `get_p_value()`. Add the parameters `obs_stat`, setting it equal to `observed`, and `direction`, setting it equal to `"two-sided"`.
+
+```{r comparision-with-two-14, exercise = TRUE}
+
+```
+
+```{r comparision-with-two-14-hint-1, eval = FALSE}
+permuted |>
+  get_p_value(obs_stat = ..., direction = "...")
+```
+
+```{r include = FALSE}
+permuted |>
+  get_p_value(obs_stat = observed, direction = "two-sided")
+```
+
+###
+
+The two-sample tests shown in this section are probably suboptimal because they do not account for other factors that might explain the observed relationship between publication rate and sex. Let's move to a more complex model that can consider additional covariates.
+
+## Log-Linear Models
+###
+
+The focus of the rest of this chapter will be on a generalized linear model [(Dobson 1999)](https://www.tmwr.org/inferential#ref-Dobson99) where we assume the counts follow a Poisson distribution. For this model, the covariates/predictors enter the model in a log-linear fashion: $\log(\lambda) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, where $\lambda$ is the expected count.
+
+### Exercise 1
+
+Load the library **poissonreg** using `library()`.
+
+```{r loglinear-models-1, exercise = TRUE}
+
+```
+
+```{r loglinear-models-1-hint-1, eval = FALSE}
+library(...)
+```
+
+```{r include = FALSE}
+library(poissonreg)
+```
+
+###
+
+**poissonreg** provides bindings for Poisson regression models for use with the **parsnip** package. Models include simple generalized linear models, Bayesian models, and zero-inflated Poisson models.
+
+### Exercise 2
+
+Type in `poisson_reg()` and set the expression to `log_lin_spec` using `<-`.
+
+```{r loglinear-models-2, exercise = TRUE}
+
+```
+
+```{r loglinear-models-2-hint-1, eval = FALSE}
+log_lin_spec <- ...()
+```
+
+```{r include = FALSE}
+log_lin_spec <- poisson_reg()
+```
+
+###
+
+Let's fit a simple model that contains all of the predictor columns. The **poissonreg** package, a **parsnip** extension package in **tidymodels**, will fit this model specification.
+
+### Exercise 3
+
+Pipe `log_lin_spec` to `fit()`. Add the parameters `art ~ .` and `data`, setting that equal to `bioChemists`. Set this entire expression to `log_lin_fit` and run it on the next line.
+
+```{r loglinear-models-3, exercise = TRUE}
+
+```
+
+```{r loglinear-models-3-hint-1, eval = FALSE}
+log_lin_fit <-
+  log_lin_spec |>
+  fit(..., data = ...)
+```
+
+```{r include = FALSE}
+log_lin_fit <-
+  log_lin_spec |>
+  fit(art ~ ., data = bioChemists)
+
+log_lin_fit
+```
+
+###
+
+The `tidy()` method succinctly summarizes the coefficients for the model (along with 90% confidence intervals).
+
+### Exercise 4
+
+Type `tidy()` and add the parameters `log_lin_fit`, `conf.int`, setting it equal to `TRUE`, and `conf.level`, setting it equal to `0.90`.
+
+```{r loglinear-models-4, exercise = TRUE}
+
+```
+
+```{r loglinear-models-4-hint-1, eval = FALSE}
+tidy(..., conf.int = ..., conf.level = ...)
+```
+
+```{r include = FALSE}
+tidy(log_lin_fit, conf.int = TRUE, conf.level = 0.90)
+```
+
+###
+
+Looking at these results, `phd` (the prestige of their department) may not have any relationship with the outcome.
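+
+Because the model is log-linear, a common way to read these coefficients is to exponentiate them into multiplicative effects on the publication rate. A brief sketch, assuming `log_lin_fit` from Exercise 3:
+
+```r
+# exp(coefficient) is the multiplicative change in the expected count of
+# articles for a one-unit increase in that predictor.
+log_lin_fit |>
+  extract_fit_engine() |>
+  coef() |>
+  exp()
+```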
+
+### Exercise 5
+
+Type in `reg_intervals()`. Add the parameters `art ~ .`, `data`, setting it equal to `bioChemists`, `model_fn`, setting it equal to `"glm"`, and `family`, setting it equal to `poisson`. Set this entire expression to `glm_boot` using `<-` and run it on the next line.
+
+```{r loglinear-models-5, exercise = TRUE}
+
+```
+
+
+
+```{r loglinear-models-5-hint-1, eval = FALSE}
+glm_boot <-
+  reg_intervals(..., data = ..., model_fn = "...", family = ...)
+```
+
+```{r include = FALSE}
+glm_boot <-
+  reg_intervals(art ~ ., data = bioChemists, model_fn = "glm", family = poisson)
+
+glm_boot
+```
+
+###
+
+`reg_intervals()` is a convenience function for computing bootstrap confidence intervals for linear-ish parametric models.
+
+### Exercise 6
+
+Pipe `log_lin_spec` to `fit()`. Add the parameters `art ~ ment + kid5 + fem + mar` and `data`, setting it equal to `bioChemists`. Set this entire expression to `log_lin_reduced` using `<-`.
+
+```{r loglinear-models-6, exercise = TRUE}
+
+```
+
+```{r loglinear-models-6-hint-1, eval = FALSE}
+log_lin_reduced <-
+  log_lin_spec |>
+  fit(art ~ ..., data = ...)
+```
+
+```{r include = FALSE}
+log_lin_reduced <-
+  log_lin_spec |>
+  fit(art ~ ment + kid5 + fem + mar, data = bioChemists)
+```
+
+###
+
+This hypothesis was previously tested when we showed the tidied results for `log_lin_fit`. That particular approach used results from a single model fit via a Wald statistic (i.e., the parameter divided by its standard error). For that approach, the p-value was 0.63.
+
+### Exercise 7
+
+Type `anova()`. Add two `extract_fit_engine()` calls as parameters, one with `log_lin_reduced` and one with `log_lin_fit`, along with `test`, setting it equal to `"LRT"`.
+
+```{r loglinear-models-7, exercise = TRUE}
+
+```
+
+```{r loglinear-models-7-hint-1, eval = FALSE}
+anova(
+  extract_fit_engine(...),
+  extract_fit_engine(...),
+  test = "..."
+)
+```
+
+```{r include = FALSE}
+anova(
+  extract_fit_engine(log_lin_reduced),
+  extract_fit_engine(log_lin_fit),
+  test = "LRT"
+)
+```
+
+###
+
+The most impactful tool that tidymodels offers for inferential models is the `tidy()` function in the **broom** package. As previously seen, this function makes a well-formed, predictably named tibble from the object.
+
+### Exercise 8
+
+Copy the previous code and pipe it to `tidy()`.
+
+```{r loglinear-models-8, exercise = TRUE}
+
+```
+
+
+
+```{r loglinear-models-8-hint-1, eval = FALSE}
+... |>
+  ...()
+```
+
+```{r include = FALSE}
+anova(
+  extract_fit_engine(log_lin_reduced),
+  extract_fit_engine(log_lin_fit),
+  test = "LRT"
+) |>
+  tidy()
+```
+
+###
+
+The results are the same and, based on these and the confidence interval for this parameter, we'll exclude `phd` from further analyses since it does not appear to be associated with the outcome.
+
+## A More Complex Model
+###
+
+We can move into even more complex models within our tidymodels approach. For count data, there are occasions where the number of zero counts is larger than what a simple Poisson distribution would prescribe. A more complex model appropriate for this situation is the zero-inflated Poisson (ZIP) model; see Mullahy (1986), Lambert (1992), and Zeileis, Kleiber, and Jackman (2008). Here, there are two sets of covariates: one for the count data and others that affect the probability (denoted as $\pi$) of zeros.
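+
+For reference, a standard way to write the ZIP distribution, with $\pi$ as the zero-inflation probability and $\lambda$ as the Poisson mean, is:
+
+$$\Pr(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}, \qquad \Pr(Y = k) = (1 - \pi)\, \frac{\lambda^k e^{-\lambda}}{k!} \quad (k = 1, 2, \ldots)$$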
+
+### Exercise 1
+
+Type `poisson_reg()` and pipe it to `set_engine()`. Add the parameter `"zeroinfl"`. Then, set the entire expression to `zero_inflated_spec` using `<-`.
+
+```{r a-more-complex-model-1, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-1-hint-1, eval = FALSE}
+zero_inflated_spec <- poisson_reg() |> set_engine("...")
+```
+
+```{r include = FALSE}
+zero_inflated_spec <- poisson_reg() |> set_engine("zeroinfl")
+```
+
+###
+
+The `"zeroinfl"` engine comes from the **pscl** package in R. It's used to fit zero-inflated models, which are a type of statistical model used to analyze data with an excess of zeros compared to what would be expected in a standard distribution.
+
+### Exercise 2
+
+Pipe `zero_inflated_spec` to `fit()`. Add the parameters `art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment` and `data`, setting it equal to `bioChemists`. Set the entire expression to `zero_inflated_fit` using `<-` and run it on the next line.
+
+```{r a-more-complex-model-2, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-2-hint-1, eval = FALSE}
+zero_inflated_fit <-
+  zero_inflated_spec |>
+  fit(art ~ ... | ..., data = ...)
+```
+
+```{r include = FALSE}
+zero_inflated_fit <-
+  zero_inflated_spec |>
+  fit(art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment,
+      data = bioChemists)
+
+zero_inflated_fit
+```
+
+###
+
+Since the coefficients for this model are also estimated using maximum likelihood, let's try to use another likelihood ratio test to understand if the new model terms are helpful. We will *simultaneously* test all of the new model terms.
+
+### Exercise 3
+
+Type in `anova()`. Add the parameters `extract_fit_engine(zero_inflated_fit)`, `extract_fit_engine(log_lin_reduced)`, and `test = "LRT"`. Then, pipe it to `tidy()`. An error might occur.
+
+```{r a-more-complex-model-3, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-3-hint-1, eval = FALSE}
+anova(
+  extract_fit_engine(zero_inflated_fit),
+  extract_fit_engine(...),
+  test = "..."
+) |>
+  ...()
+```
+
+```{r include = FALSE}
+# anova(
+#   extract_fit_engine(zero_inflated_fit),
+#   extract_fit_engine(log_lin_reduced),
+#   test = "LRT"
+# ) |>
+#   tidy()
+```
+
+###
+
+An `anova()` method isn't implemented for **zeroinfl** objects! An alternative is to use an information criterion statistic, such as the Akaike information criterion (AIC) [(Claeskens 2016)](https://www.tmwr.org/inferential#ref-claeskens2016statistical).
+
+### Exercise 4
+
+Pipe `zero_inflated_fit` to `extract_fit_engine()`. Then, pipe the code to `AIC()`.
+
+```{r a-more-complex-model-4, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-4-hint-1, eval = FALSE}
+zero_inflated_fit |> ...() |> AIC()
+```
+
+```{r include = FALSE}
+zero_inflated_fit |> extract_fit_engine() |> AIC()
+```
+
+###
+
+In R's parameterization ($\mathrm{AIC} = 2k - 2\log \mathcal{L}$, where $k$ is the number of parameters and $\mathcal{L}$ is the maximized likelihood), smaller AIC values are better. In this case, we are not conducting a formal statistical test but *estimating* the ability of the model to fit the data.
+
+### Exercise 5
+
+Copy the previous code and change `zero_inflated_fit` to `log_lin_reduced`.
+
+```{r a-more-complex-model-5, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-5-hint-1, eval = FALSE}
+log_lin_reduced |> ...
+```
+
+```{r include = FALSE}
+log_lin_reduced |> extract_fit_engine() |> AIC()
+```
+
+###
+
+We will be characterizing the uncertainty of the AIC statistics to gauge their difference relative to the noise in the data. We'll also compute more bootstrap confidence intervals for the parameters in a bit, so we specify the `apparent = TRUE` option when creating the bootstrap samples.
+
+### Exercise 6
+
+Set `zip_form` to `art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment` using `<-`.
+Then, set `glm_form` to `art ~ fem + mar + kid5 + ment`.
+
+```{r a-more-complex-model-6, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-6-hint-1, eval = FALSE}
+zip_form <- ...
+glm_form <- ...
+```
+
+```{r include = FALSE}
+zip_form <- art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment
+glm_form <- art ~ fem + mar + kid5 + ment
+```
+
+###
+
+However, it's hard to contextualize a single pair of AIC values and assess how different they actually are. To solve this problem, we'll resample a large number of each of these two models.
+
+### Exercise 7
+
+Type `bootstraps()` and add the parameters `bioChemists`, `times`, setting it equal to `2000`, and `apparent`, setting it equal to `TRUE`.
+
+```{r a-more-complex-model-7, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-7-hint-1, eval = FALSE}
+bootstraps(bioChemists, times = ..., apparent = ...)
+```
+
+```{r include = FALSE}
+bootstraps(bioChemists, times = 2000, apparent = TRUE)
+```
+
+###
+
+Resampling both models many times lets us characterize the uncertainty of the AIC statistics and compare their full distributions, rather than two single values.
+
+### Exercise 8
+
+Copy the previous code and pipe it to `mutate()`. Add the parameters `glm`, setting it equal to `map(splits, ~ fit(log_lin_spec, glm_form, data = analysis(.x)))`, and `zip`, setting it equal to `map(splits, ~ fit(zero_inflated_spec, zip_form, data = analysis(.x)))`. Then, set the entire expression to `bootstrap_models` using `<-`.
+
+```{r a-more-complex-model-8, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-8-hint-1, eval = FALSE}
+bootstrap_models <-
+  ... |>
+  mutate(
+    glm = ...,
+    zip = ...
+  )
+```
+
+```{r include = FALSE}
+bootstrap_models <-
+  bootstraps(bioChemists, times = 2000, apparent = TRUE) |>
+  mutate(
+    glm = map(splits, ~ fit(log_lin_spec, glm_form, data = analysis(.x))),
+    zip = map(splits, ~ fit(zero_inflated_spec, zip_form, data = analysis(.x)))
+  )
+```
+
+###
+
+Now we can extract the model fits and their corresponding AIC values.
+
+### Exercise 9
+
+Pipe `bootstrap_models` to `mutate()`. Add the parameters `glm_aic`, setting it equal to `map_dbl(glm, ~ extract_fit_engine(.x) |> AIC())`, and `zip_aic`, setting it equal to `map_dbl(zip, ~ extract_fit_engine(.x) |> AIC())`. Finally, set this expression to `bootstrap_models`.
+
+```{r a-more-complex-model-9, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-9-hint-1, eval = FALSE}
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(
+    glm_aic = ...,
+    zip_aic = ...
+  )
+```
+
+```{r include = FALSE}
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(
+    glm_aic = map_dbl(glm, ~ extract_fit_engine(.x) |> AIC()),
+    zip_aic = map_dbl(zip, ~ extract_fit_engine(.x) |> AIC())
+  )
+```
+
+###
+
+We could have used `fit_resamples()` or a workflow set to conduct these computations. In this section, we used `mutate()` and `map()` to compute the models to demonstrate how one might use tidymodels tools for models that are not supported by one of the **parsnip** packages.
+
+### Exercise 10
+
+Type `mean()` and add the parameter `bootstrap_models$zip_aic < bootstrap_models$glm_aic`.
+
+```{r a-more-complex-model-10, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-10-hint-1, eval = FALSE}
+mean(... < ...)
+```
+
+```{r include = FALSE}
+mean(bootstrap_models$zip_aic < bootstrap_models$glm_aic)
+```
+
+###
+
+Since we have computed the resampled model fits, let's create bootstrap intervals for the zero probability model coefficients (i.e., the $\gamma_j$).
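+
+In `zeroinfl()`'s default parameterization, the zero-probability part of the model is a logistic regression on its own covariates, so the $\gamma_j$ enter through a logit link:
+
+$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \gamma_0 + \gamma_1 z_{i1} + \cdots + \gamma_q z_{iq}$$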
+
+### Exercise 11
+
+Pipe `bootstrap_models` to `mutate()`. Add the parameter `zero_coefs`, setting it equal to `map(zip, ~ tidy(.x, type = "zero"))`. Then, set the entire expression to `bootstrap_models` using `<-`.
+
+```{r a-more-complex-model-11, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-11-hint-1, eval = FALSE}
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(zero_coefs = ...)
+```
+
+```{r include = FALSE}
+bootstrap_models <-
+  bootstrap_models |>
+  mutate(zero_coefs = map(zip, ~ tidy(.x, type = "zero")))
+```
+
+###
+
+Let's look at an example of one of these tidied zero-model coefficient sets.
+
+### Exercise 12
+
+Type `bootstrap_models` and extract `zero_coefs[[1]]` using `$`.
+
+```{r a-more-complex-model-12, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-12-hint-1, eval = FALSE}
+bootstrap_models$...
+```
+
+```{r include = FALSE}
+bootstrap_models$zero_coefs[[1]]
+```
+
+###
+
+It's a good idea to visualize the bootstrap distributions of the coefficients, which we will do using `ggplot()`.
+
+### Exercise 13
+
+Pipe `bootstrap_models` to `unnest()`. Add the parameter `zero_coefs`.
+
+```{r a-more-complex-model-13, exercise = TRUE}
+
+```
+
+```{r a-more-complex-model-13-hint-1, eval = FALSE}
+bootstrap_models |>
+  unnest(...)
+```
+
+```{r include = FALSE}
+bootstrap_models |>
+  unnest(zero_coefs)
+```
+
+###
+
+The `unnest()` function is part of the **tidyr** package in R. It's used to "unnest" or expand nested data structures in a column of a data frame or tibble, effectively breaking down a column containing lists or vectors into separate rows.
+
+### Exercise 14
+
+Copy the previous code and pipe it to `ggplot()`. Within `aes()` of `ggplot()`, set `x` to `estimate`.
+
+```{r a-more-complex-model-14, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-14-hint-1, eval = FALSE}
+... |>
+  ggplot(aes(x = ...))
+```
+
+```{r include = FALSE}
+bootstrap_models |>
+  unnest(zero_coefs) |>
+  ggplot(aes(x = estimate))
+```
+
+###
+
+Arguably, Bayesian analysis is a very effective and often superior approach for inference. A variety of Bayesian models are available via **parsnip**.
+
+### Exercise 15
+
+Copy the previous code and add `geom_histogram()`. Add the parameters `bins`, setting it equal to `25`, and `color`, setting it equal to `"white"`.
+
+```{r a-more-complex-model-15, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-15-hint-1, eval = FALSE}
+... +
+  geom_histogram(bins = ..., color = "...")
+```
+
+```{r include = FALSE}
+bootstrap_models |>
+  unnest(zero_coefs) |>
+  ggplot(aes(x = estimate)) +
+  geom_histogram(bins = 25, color = "white")
+```
+
+###
+
+The `facet_wrap()` function is part of the **ggplot2** package in R, which is used for creating data visualizations. It's specifically used to create a grid of smaller plots (facets) based on one or more categorical variables, allowing you to compare multiple subsets of your data in a single plot.
+
+### Exercise 16
+
+Copy the previous code and add the function `facet_wrap()`. Add the parameters `~ term` and `scales`, setting it equal to `"free_x"`.
+
+```{r a-more-complex-model-16, exercise = TRUE}
+
+```
+
+
+
+```{r a-more-complex-model-16-hint-1, eval = FALSE}
+... +
+  facet_wrap(~ term, scales = "...")
+```
+
+```{r include = FALSE}
+bootstrap_models |>
+  unnest(zero_coefs) |>
+  ggplot(aes(x = estimate)) +
+  geom_histogram(bins = 25, color = "white") +
+  facet_wrap(~ term, scales = "free_x")
+```
+
+###
+
+The `geom_vline()` function is part of the **ggplot2** package in R, which is used for creating data visualizations.
It's used to add vertical lines to a plot, helping to highlight specific values or regions on the x-axis. + +### Exercise 17 + +Copy the previous code and add `geom_vline()`. Add the parameters `xintercept`, setting it equal to `0`, `lty`, setting it equal to `2`, and `color`, setting it equal to `"gray70"`. + +```{r a-more-complex-model-17, exercise = TRUE} + +``` + + + +```{r a-more-complex-model-17-hint-1, eval = FALSE} +... + + geom_vline(xintercept = ..., lty = ..., color = "...") +``` + +```{r include = FALSE} +bootstrap_models |> + unnest(zero_coefs) |> + ggplot(aes(x = estimate)) + + geom_histogram(bins = 25, color = "white") + + facet_wrap(~ term, scales = "free_x") + + geom_vline(xintercept = 0, lty = 2, color = "gray70") +``` + +### + +One of the covariates (`ment`) that appears to be important has a very skewed distribution. This might occur when models did not converge; those results probably should be excluded from the resamples. + +### Exercise 18 + +Pipe `bootstrap_models` to `int_pctl()`. Add the parameter `zero_coefs` and hit "Run Code". + +```{r a-more-complex-model-18, exercise = TRUE} + +``` + +```{r a-more-complex-model-18-hint-1, eval = FALSE} +bootstrap_models |> int_pctl(...) +``` + +```{r include = FALSE} +bootstrap_models |> int_pctl(zero_coefs) +``` + +### + +Percentile intervals are the standard method of obtaining confidence intervals but require thousands of resamples to be accurate. T-intervals may need fewer resamples but require a corresponding variance estimate. + +### Exercise 19 + +Copy the previous code and switch the method from `int_pctl()` to `int_t()`. + +```{r a-more-complex-model-19, exercise = TRUE} + +``` + + + +```{r a-more-complex-model-19-hint-1, eval = FALSE} +bootstrap_models |> int_t(...) +``` + +```{r include = FALSE} +bootstrap_models |> int_t(zero_coefs) +``` + +### + +From these results, we can get a good idea of which predictor(s) to include in the zero count probability model. It may be sensible to refit a smaller model to assess if the bootstrap distribution for `ment` is still skewed. + +## Summary +### + +This tutorial covered [Chapter 21: Inferential Analysis](https://www.tmwr.org/inferential) from [*Tidy Modeling with R*](https://www.tmwr.org/) by Max Kuhn and Julia Silge. This chapter demonstrated just a small subset of what is available for inferential analysis in tidymodels and has focused on resampling and frequentist methods. + +```{r download-answers, child = system.file("child_documents/download_answers.Rmd", package = "tutorial.helpers")} +```