Skip to content

LazyModeler is a statistical package for the programming language R that allows for easy performance of regression modeling

License

Notifications You must be signed in to change notification settings

LMKoesters/LazyModeler

Repository files navigation

LazyModeler

LazyModeler is a statistical package for the programming language R that enables users to easily perform regression modeling. It includes removal of autocorrelated variables, choice between several types of (non)linear regression models, standard stepwise model simplification, various model quality checks, plotting of coefficient estimates and relationships, and output generation.

Statement of need

Statistical modeling describes the process of finding a mathematical function with specific statistical assumptions that best fits the observed data (Crawley, 2007, 2015; Henley et al., 2020). This process attempts, in practice, to find a (causal) relationship between a dependent response variable y and an independent predictor variable x for any postulated hypothesis. For statistical inference and graphics in science, the programming environment R (R Core Team, 2024) has become highly popular.

Our R package LazyModeler enables users to automatically remove autocorrelated variables, choose between several types of (non)linear regression models (e.g., LM, GLM, LMER, GLMER, GAM, or NLMER), perform stepwise model simplification, check model quality, plot coefficient estimates and relationships, and generate the output of the final model.

Overview and major functions

LazyModeler automatizes all necessary steps needed for use of (non)linear regression models. It comprises three major functions that are included within the main function optimize_model.

The first major function remove_autocorrelations checks for any autocorrelations (|r| > 0.7) (Dormann et al. 2013) given a list of variables sorted by relevance. Automatic removal of these autocorrelations is possible through the use of a function parameter. Removal will follow the order of the list of variables, ensuring that the user's expertise on the importance of features is respected. A named list is returned with a) a vector containing all removed predictors, and b) a dataframe listing autocorrelations and information on deleted variables.

The main function provides the model formula to the second major function simplify_model. If autocorrelations were detected, the formula is updated accordingly. The regression model is then calculated. Options for the models are: lm, glm, lmer, glmer, gam, or nlmer, with all possible distributions of the response variable being allowed. Stepwise backward simplification or forward model selection takes place using an iterative process where each time the metric(s) specified by the user are applied on the model to check whether further simplification/selection is needed. Main variables are kept when they are involved in interactions. Options for the metrics are: aov, aic, aicc, or bic. The final model is returned to the main function alongside its metadata as well as simplification history if requested by the user.

Using the third major function fancy_plotting, the final model then undergoes multiple visualization steps. Plots to assess model quality are created using the standard plot function available through base R, or model check included in the performance R package (Lüdecke et al. 2021). Furthermore, the script produces regression, box, or violin plots for each numerical or categorical coefficient as well as plots depicting effects sizes and estimates. All generated plots are returned to the user within a named list. The main function additionally returns the output of both the model simplification/selection and autocorrelation functions as well as the summary of the final model.

LazyModeler makes use of the R package corrplot (Wei and Simko 2021) to calculate correlations between variables, lme4 (Bates et al. 2024) and lmerTest (Kuznetsova, Brockhoff, and Christensen 2017) for regression modeling, tidyverse (Wickham et al. 2019) for data handling, and MuMIn (Bartoń 2024) for calculation of AICc scores. For generation of plots visualizing regression, effect size, and estimates, the script further leverages tidyverse and color palettes included in the colorspace (Zeileis et al. 2020) and viridis (Garnier et al. 2024) R packages.

How to install

You have two options to install LazyModeler. You can either install through GitHub using the remotes package.

remotes::install_github("LMKoesters/LazyModeler")

Alternatively, you need to download the tarball from GitHub and then install using install.packages.

install.packages("PATH_TO_TARBALL/LazyModeler-v.0.2.0.tar.gz", repos = NULL, type="source")

Example

To get to know the package and its main function optimize_model(), we provide a testing dataset. This dataset contains information on European plant specimens, including geolocation, habitat, and ploidy (Karbstein et al. 2021). The corresponding data types are numeric (floats and integers) as well as categorical, offering a range of combinations for the user to test.

A common usage requires an input dataframe as well as a starting term for the model. The term needs to be provided as a quote() and can encompass transformed variables and interactions (note: we currently only allow for 2-way interactions). If the user wants to check for autocorrelations, they can provide a list of coefficients (in the form of dataframe columns) that must be sorted by their relevance in descending order. If an autocorrelation is detected and the parameter automatic_removal=TRUE is set, the coefficient further down the list is removed first.

To calculate the model, the user can provide the type of linear model to calculate (default: "glm"), and the family (default: "gaussian"). The user can also decide on the simplification direction ("forward" for forward selection, "backward" for backward simplification, or "both"). If no simplification is desired, setting simplification_direction = "backward" plus backward_simplify_model = FALSE yields the original model without simplification.

The following example generates and optimizes a generalized linear model using the provided plant dataset and a term that includes geographic, ecological and genetic information. The pipeline checks for correlations between the values of the provided dataframe columns and removes autocorrelated variables. The autocorrelation-cleaned term is then used for backward simplification of the model. Finally, the information on the coefficients of the simplified model is plotted. To access the final model, the user can navigate to models_with_info within the result, then either toforward or backward depending on the chosen selection/simplification procedure, and then to final_model. The plots are stored alongside the model within plots - these plots cover the result of performance::check_model(), as well as the regression, estimate, and effect size plots.

# import example data
data(plants)

# check data structure
str(plants)
summary(plants)

results_example <- optimize_model(
	df = plants,
    term = quote(sexual_seed_prop ~
			     altitude +
			     latitude_gps_n +
			     longitude_gps_e +
			     (solar_radiation +
				     annual_mean_temperature +
				     isothermality)^2 +
				 I(isothermality^2) +
				 habitat +
				 ploidy),
    autocorrelation_cols = c(
	    "solar_radiation",
	    "annual_mean_temperature",
	    "isothermality",
	    "altitude",
	    "latitude_gps_n",
	    "longitude_gps_e"
	),
    automatic_removal=TRUE,
    autocorrelation_threshold = 0.8,
    correlation_method="spearman",
    cor_use="complete.obs",
    model_type = "glm",
    model_family = "quasibinomial",
    evaluation_methods=c("anova"),
    simplification_direction="backward",
    backward_simplify_model=TRUE,
    omit_na="overall",
    scale_predictor=TRUE,
    plot_quality_assessment="performance",
    round_p=3,
    plot_relationships=TRUE,
    jitter_plots=TRUE,
    plot_type="violinplot",
    stat_test="wilcox",
    trace=TRUE)

Navigating through the output. For example, (a) simply click on dataframe button highlighted with a red arrow to (b) illustrate the final model output. Navigating through the output. For example, (a) simply click on dataframe button highlighted with a red arrow to (b) illustrate the final model output.

(a) Model quality check and (b,c) exemplary output plots of significant relationships. (a) Model quality check and (b,c) exemplary output plots of significant relationships.

Community guidelines

We welcome contributions, feedback, and suggestions to improve this project.

  • Encountered a bug or unexpected behavior? Open an issue on GitHub. Just make sure that your issue hasn't been reported yet by checking existing issues before opening a new one.
  • Contributions (e.g., code improvements, new features, documentation) are welcome via pull requests. When contributing, describe your changes clearly and provide sufficient context to help us understand your work.

Important note

The model selection procedures implemented in LazyModeler are provided for convenience and exploratory analysis, and reflect practices recommended in widely used applied statistics literature [@Crawley2007; @Crawley2015]. Users should be aware, however, that statistical inference reported from a model chosen in a data-driven way may be anti-conservative (e.g., p-values may appear smaller than they truly are, confidence intervals narrower). This issue is known as post-selection inference (PSI). Specialized methods have been developed to address it, for instance [@Lee2016], but they are not yet broadly applicable across the full range of model classes supported by LazyModeler. We have implemented PSI for (generalized) linear regression models based on the 'selcorr' R package [@Cattaneo2021], but users are free to use the retained model from LazyModeler for more sophisticated PSI analyses.

References

Bartoń, K. (2024). MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.

Bates, D., Maechler, M., Bolker, B., and Walker,S. (2024). “lme4 - Linear mixed-effects models using ’Eigen’ and S4.” https://github.com/lme4/lme4/.

Bauer, M., & Albrecht, H. (2020). Vegetation monitoring in a 100-year-old calcareous grassland reserve in Germany. Basic and Applied Ecology, 42, 15–26. https://doi.org/10.1016/j.baae.2019.11.003.

Cai, L., Kreft, H., Taylor, A., Denelle, P., Schrader, J., Essl, F., Kleunen, M. van, Pergl, J., Pyšek, P., Stein, A., Winter, M., Barcelona, J. F., Fuentes, N., Inderjit, Karger, D. N., Kartesz, J., Kuprijanov, A., Nishino, M., Nickrent, D., … Weigelt, P. (2023). Global models and predictions of plant diversity based on advanced machine learning techniques. New Phytologist, 237(4), 1432–1445. https://doi.org/10.1111/nph.18533.

Cattaneo, M. (2021). selcorr: Post-Selection Inference for Generalized Linear Models. R package version 1.0. https://CRAN.R-project.org/package=selcorr

Crawley, M. J. (2007). The R Book (p. 942). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118448908

Crawley, M. J. (2015). Statistics: an introduction using R (sec. ed., p. 339). John Wiley & Sons. ISBN: 1118448960

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., García Marquéz, J. R, et al. (2013). “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography 36 (1): 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x.

Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner’s curse. Behavioral Ecology and Sociobiology, 65(1), 47–55. https://doi.org/10.1007/s00265-010-1038-5.

Garnier, S., Ross, N., Rudis, B., Filipovic-Pierucci, A., et al. (2024). viridis(Lite) - Colorblind-Friendly Color Maps for r. https://doi.org/10.5281/zenodo.4679423.

Hastie, T. (2023). gam: Generalized Additive Models. https://cran.r-project.org/web/.

Karbstein, K., Tomasello, S., and Hoda{\v c}, L., Lorberg, E., Daubert, M., H{"o}randl, E. (2021). Moving beyond Assumptions: Polyploidy and Environmental Effects Explain a Geographical Parthenogenesis Scenario in European Plants. Molecular Ecology, 30 (11): 2659-2675. https://doi.org/10.1111/mec.15919.

Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). “lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical Software 82 (13): 1–26. https://doi.org/10.18637/jss.v082.i13.

Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Annals of Statistics 44: 907–27. https://doi.org/10.1214/15-AOS1371.

Li, J. (2023). Overview of high dimensional linear regression models. Theoretical and Natural Science, 5(1), 656–661. https://doi.org/10.54254/2753-8818/5/20230427.

Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). “Performance: An r Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.

R Core Team. (2024). R: a language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/

Römermann, C., Bucher, S. F., Hahn, M., & Bernhardt-Römermann, M. (2016). Plant functional traits – fixed facts or variable depending on the season? Folia Geobotanica, 51(2), 143–159. https://doi.org/10.1007/s12224-016-9250-3.

Schielzeth, H. (2010). Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution, 1(2), 103–113. https://doi.org/10.1111/j.2041-210X.2010.00012.x

Schielzeth, H., Dingemanse, N. J., Nakagawa, S., Westneat, D. F., Allegue, H., Teplitsky, C., Réale, D., Dochtermann, N. A., Garamszegi, L. Z., & Araya‐Ajoy, Y. G. (2020). Robustness of linear mixed‐effects models to violations of distributional assumptions. Methods in Ecology and Evolution, 11(9), 1141–1152. https://doi.org/10.1111/2041-210X.13434.

Wei, T., & Viliam, S. (2021). R Package ’Corrplot’: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.

Wicke, S., Müller, K. F., DePamphilis, C. W., Quandt, D., Bellot, S., & Schneeweiss, G. M. (2016). Mechanistic model of evolutionary rate variation en route to a nonphotosynthetic lifestyle in plants. Proceedings of the National Academy of Sciences of the United States of America, 113(32), 9045-9050. https://doi.org/10.1073/pnas.1607576113.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Zeileis, A., Fisher, J. C., Hornik, K., Ihaka,R., McWhite, C. D., Murrell, P., Stauffer, R., & Wilke, C. O. 2020. “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” Journal of Statistical Software 96 (1): 1–49. https://doi.org/10.18637/jss.v096.i01.

About

LazyModeler is a statistical package for the programming language R that allows for easy performance of regression modeling

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •