# Generalized Linear Models (GLMs)

Generalized Linear Models (GLMs) with Python examples, covering the Binomial, Gamma, and Gaussian families, model diagnostics, the formula interface, and alternative estimation approaches using statsmodels.

## Table of Contents

- Overview
- Features
- Datasets
- Installation
- Notebook Structure
- Key Techniques
- Results Highlights
- Applications
- References
- License

## Overview

Generalized Linear Models (GLMs) extend ordinary linear regression to handle non-normal response distributions and non-constant variance structures. This notebook provides a comprehensive guide to GLMs, covering:
- Theoretical foundations of GLMs (link functions, variance functions, exponential families)
- Practical implementations with real datasets
- Model diagnostics and validation techniques
- Advanced extensions including regularization and alternative estimation methods
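In symbols, a GLM combines these three pieces, a random component, a systematic component, and a link:

$$
y_i \sim f(y_i;\theta_i)\ \text{(exponential family)}, \qquad
\eta_i = \mathbf{x}_i^\top \boldsymbol\beta, \qquad
g(\mu_i) = \eta_i,
$$

where $\mu_i = \mathrm{E}[y_i]$ and $g$ is the link function; ordinary linear regression is the special case of a Gaussian family with the identity link.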
## Features

- Multiple GLM families: Binomial, Gamma, and Gaussian distributions
- Real datasets: Star98 educational data and Scottish Parliament voting data
- Comprehensive diagnostics: Residual plots, Q-Q plots, influence measures
- Advanced techniques: Robust standard errors, regularization, cross-validation
- Formula interface: R-style formulas with custom transformations
- Model comparison: AIC, BIC, and cross-validation metrics
- Practical insights: First differences, marginal effects, interpretation guides
## Datasets

### Star98 Educational Data

- Source: Jeff Gill (2000), "Generalized linear models: A unified approach"
- Observations: 303 California counties
- Variables: 13 predictors + 8 interaction terms
- Response: Binomial (students above/below national math median)
- Key predictors: Income levels, ethnic percentages, teacher experience, spending
### Scottish Parliament Voting Data

- Observations: 32 council districts
- Response: Proportion voting "Yes" for taxation powers
- Predictors: Council tax, unemployment, mortality rates, economic indicators
- Family: Gamma with log link
## Installation

```bash
# Clone the repository
git clone https://github.com/esosetrov/generalized_linear_models
cd generalized_linear_models
```

Dependencies (requirements.txt):

```
statsmodels>=0.14.0
numpy>=1.21.0
scipy>=1.7.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
```
## Notebook Structure

### Setup and Configuration
- Library imports and configuration
- Plotting settings

### GLM Theory
- Theoretical background
- Three components: random, systematic, link function
- Exponential family distributions

### Binomial GLM: Star98 Data
- Star98 dataset loading and preparation
- Model fitting with logit link
- Parameter interpretation
- First differences and marginal effects

### Model Diagnostics
- Observed vs. fitted plots
- Residual dependence analysis
- Q-Q plots for deviance residuals
- Standardized residual histograms

### Gamma GLM: Scottish Voting Data
- Scottish voting data analysis
- Log link function implementation
- Interpretation of rate parameters

### Gaussian GLM with Log Link
- Artificial data generation
- Log link with Gaussian family
- Parameter recovery demonstration

### Formula Interface
- R-style formula specification
- Interaction term implementation
- Custom transformation functions
- Comparison with direct specification

### Alternative Estimation Approaches
- Robust standard errors (HC0, HC1, HC2, HC3)
- Bayesian GLM concepts
- Penalized regression (Ridge, Lasso)
- Quasi-likelihood methods

### Model Comparison and Validation
- Information criteria (AIC, BIC)
- Cross-validation implementation
- Model performance metrics
- Overfitting detection

### Extensions
- Generalized Additive Models (GAMs)
- Mixed-effects GLMs (GLMMs)
- Zero-inflated models
- Temporal and spatial extensions

### Conclusions
- Model selection guidelines
- Interpretation caveats
- Reporting standards
- Future directions
## Key Techniques

### Link Functions
- Logit for binary/proportion data
- Log for positive continuous data
- Identity for normal data
- Probit and complementary log-log alternatives
### Diagnostics

```python
# Residual analysis (res is a fitted GLMResults object)
resid_pearson = res.resid_pearson
resid_deviance = res.resid_deviance

# Influence measures
influence = res.get_influence()
cooks_d, cooks_pvals = influence.cooks_distance  # (distances, p-values)
```

### Robust Standard Errors

```python
# Heteroskedasticity-consistent standard errors
res_robust = model.fit(cov_type='HC0')
print(res_robust.bse)  # robust standard errors
```

### Regularization

```python
# Ridge regression for GLMs (L1_wt=0 gives a pure L2 penalty)
res_ridge = model.fit_regularized(alpha=0.5, L1_wt=0)
```

### Cross-Validation

```python
# K-fold cross-validation (cross_val_score expects a scikit-learn
# estimator, e.g. sklearn.linear_model.LogisticRegression, rather
# than a statsmodels GLM)
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_log_loss')
```

## Results Highlights

### Star98 Analysis
- Strong negative effect: Low-income percentage reduces academic success probability
- Demographic patterns: Asian percentage positively correlates with success
- Teacher quality: Experience and minority representation show positive impacts
- Resource paradox: Higher spending associated with lower success (after controlling for other factors)
### Scottish Voting Analysis

- Economic factors: Unemployment and taxes reduce support for taxation powers
- Demographic effects: Older populations more supportive
- Excellent model fit: Pseudo R² = 0.98 for Gamma GLM
### Methodological Insights

- Robust SEs matter: Several "significant" predictors become non-significant with robust errors
- Regularization helps: Ridge regression stabilizes coefficient estimates
- CV reveals overfitting: High in-sample fit doesn't guarantee out-of-sample performance
## Applications

### Education Policy
- Identifying key determinants of student success
- Resource allocation optimization
- Equity and inclusion analysis
### Political Science
- Voting behavior modeling
- Policy preference analysis
- Regional variation studies
### Business and Insurance
- Customer conversion modeling (logistic regression)
- Claim severity modeling (Gamma regression)
- Demand forecasting
### Public Health
- Disease prevalence modeling
- Treatment effect estimation
- Healthcare utilization analysis
## References

- Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized Linear Models"
- McCullagh, P., & Nelder, J. A. (1989). "Generalized Linear Models, 2nd Edition"
- Dobson, A. J., & Barnett, A. G. (2018). "An Introduction to Generalized Linear Models, 4th Edition"
- Gill, J. (2000). "Generalized linear models: A unified approach"
- Hardin, J. W., & Hilbe, J. M. (2018). "Generalized Linear Models and Extensions"
- Agresti, A. (2015). "Foundations of Linear and Generalized Linear Models"
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This notebook is designed for educational and research purposes. Real-world applications may require additional considerations and validation.