generalized_linear_models

Generalized Linear Models (GLMs) with Python examples, covering Binomial, Gamma, and Gaussian families, model diagnostics, formula interface, and alternative estimation approaches using statsmodels.

Table of Contents

  1. Overview
  2. Features
  3. Datasets
  4. Installation
  5. Notebook Structure
  6. Key Techniques Demonstrated
  7. Results Highlights
  8. Applications
  9. References
  10. License

Overview

Generalized Linear Models (GLMs) extend ordinary linear regression to handle non-normal response distributions and non-constant variance structures. This notebook provides a comprehensive guide to GLMs, covering:

  • Theoretical foundations of GLMs (link functions, variance functions, exponential families)
  • Practical implementations with real datasets
  • Model diagnostics and validation techniques
  • Advanced extensions including regularization and alternative estimation methods

Features

  • Multiple GLM families: Binomial, Gamma, and Gaussian distributions
  • Real datasets: Star98 educational data and Scottish Parliament voting data
  • Comprehensive diagnostics: Residual plots, Q-Q plots, influence measures
  • Advanced techniques: Robust standard errors, regularization, cross-validation
  • Formula interface: R-style formulas with custom transformations
  • Model comparison: AIC, BIC, and cross-validation metrics
  • Practical insights: First differences, marginal effects, interpretation guides

Datasets

1. Star98 Dataset

  • Source: Jeff Gill (2000) "Generalized linear models: A unified approach"
  • Observations: 303 California counties
  • Variables: 13 predictors + 8 interaction terms
  • Response: Binomial (students above/below national math median)
  • Key predictors: Income levels, ethnic percentages, teacher experience, spending

2. Scottish Parliament Voting Data

  • Observations: 32 council districts
  • Response: Proportion voting "Yes" for taxation powers
  • Predictors: Council tax, unemployment, mortality rates, economic indicators
  • Family: Gamma with log link

Installation

# Clone repository
git clone https://github.com/esosetrov/generalized_linear_models
cd generalized_linear_models

# Install dependencies (listed under Requirements below)
pip install statsmodels numpy scipy pandas matplotlib scikit-learn

Requirements

statsmodels>=0.14.0
numpy>=1.21.0
scipy>=1.7.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0

Notebook Structure

1. Environment Setup and Imports

  • Library imports and configuration
  • Plotting settings

2. Introduction to GLMs

  • Theoretical background
  • Three components: Random, Systematic, Link Function
  • Exponential family distributions
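
For reference, the three components combine in the standard formulation: each response y_i follows an exponential-family distribution with mean μ_i, the systematic component is the linear predictor η_i = x_iᵀβ, and the link function g connects them through g(μ_i) = η_i, so that E[y_i | x_i] = g⁻¹(x_iᵀβ). Ordinary linear regression is the special case of a Gaussian family with the identity link.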

3. Binomial GLM Example

  • Star98 dataset loading and preparation
  • Model fitting with logit link
  • Parameter interpretation
  • First differences and marginal effects
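
A minimal fitting sketch along these lines, following the statsmodels star98 example (the notebook's exact preprocessing may differ):

# Binomial GLM for the Star98 data: endog holds counts above/below the math median
import statsmodels.api as sm

star = sm.datasets.star98.load_pandas()
exog = sm.add_constant(star.exog, prepend=False)   # add an intercept column

glm_binom = sm.GLM(star.endog, exog, family=sm.families.Binomial())  # logit link by default
res = glm_binom.fit()
print(res.summary())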

4. Model Diagnostics

  • Observed vs. fitted plots
  • Residual dependence analysis
  • Q-Q plots for deviance residuals
  • Standardized residual histograms
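
As a sketch of the Q-Q step, assuming res is a fitted GLM results object (as in the diagnostics snippet under Key Techniques below):

# Q-Q plot of standardized deviance residuals against a standard normal
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

resid_std = stats.zscore(res.resid_deviance)
fig = sm.qqplot(resid_std, line='45')   # 45-degree reference line
plt.show()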

5. Gamma GLM Example

  • Scottish voting data analysis
  • Log link function implementation
  • Interpretation of rate parameters
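
A corresponding sketch using the statsmodels Scottish voting (scotland) dataset, with the log link made explicit:

# Gamma GLM with log link for the proportion voting "Yes"
import statsmodels.api as sm

scot = sm.datasets.scotland.load_pandas()
exog = sm.add_constant(scot.exog, prepend=False)

glm_gamma = sm.GLM(scot.endog, exog,
                   family=sm.families.Gamma(link=sm.families.links.Log()))
res = glm_gamma.fit()
print(res.params)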

6. Gaussian GLM with Non-canonical Link

  • Artificial data generation
  • Log link with Gaussian family
  • Parameter recovery demonstration
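
A compact sketch of this idea; the notebook's simulated data may be generated differently:

# Gaussian family with a non-canonical log link: recover known coefficients
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
X = sm.add_constant(np.column_stack((x, x**2)))   # intercept, x, x^2
beta_true = np.array([0.5, 0.6, -0.3])
y = np.exp(X @ beta_true) + rng.normal(scale=0.05, size=len(x))

res = sm.GLM(y, X, family=sm.families.Gaussian(sm.families.links.Log())).fit()
print(res.params)   # estimates should be close to beta_true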

7. Formula Interface

  • R-style formula specification
  • Interaction term implementation
  • Custom transformation functions
  • Comparison with direct specification
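
For illustration, a self-contained sketch with a toy DataFrame (the column names y, x1, x2 and the transform log1p_pct are placeholders, not the notebook's variables):

# Formula interface: main effects, a user-defined transform, and an interaction
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def log1p_pct(x):
    # user-defined transformation, callable directly inside the formula
    return np.log1p(x)

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.uniform(0, 50, size=100)})
df["y"] = (rng.uniform(size=100) < 1 / (1 + np.exp(-df["x1"]))).astype(int)  # toy binary outcome

mod = smf.glm("y ~ x1 + log1p_pct(x2) + x1:x2", data=df, family=sm.families.Binomial())
res = mod.fit()
print(res.summary())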

8. Alternative Estimation Approaches

  • Robust standard errors (HC0, HC1, HC2, HC3)
  • Bayesian GLM concepts
  • Penalized regression (Ridge, Lasso)
  • Quasi-likelihood methods

9. Model Comparison and Selection

  • Information criteria (AIC, BIC)
  • Cross-validation implementation
  • Model performance metrics
  • Overfitting detection
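
As a small illustration, assuming res_full and res_reduced are two fitted GLM results objects for candidate models:

# Information criteria: lower values indicate a better fit/complexity trade-off
print("full model:    AIC = %.1f, BIC = %.1f" % (res_full.aic, res_full.bic))
print("reduced model: AIC = %.1f, BIC = %.1f" % (res_reduced.aic, res_reduced.bic))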

10. Extensions and Advanced Topics

  • Generalized Additive Models (GAMs)
  • Mixed Effects GLMs (GLMMs)
  • Zero-inflated models
  • Temporal and spatial extensions
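
As one pointer for the GAM extension, statsmodels offers GLMGam with spline smoothers; a rough sketch on toy data (variable names are placeholders):

# Penalized B-spline smooth of a single predictor x0
import numpy as np
import pandas as pd
from statsmodels.gam.api import GLMGam, BSplines

rng = np.random.default_rng(2)
df = pd.DataFrame({"x0": rng.uniform(0, 1, 200)})
df["y"] = np.sin(4 * df["x0"]) + rng.normal(scale=0.2, size=200)

bs = BSplines(df[["x0"]], df=[8], degree=[3])   # cubic B-spline basis for x0
gam = GLMGam.from_formula("y ~ 1", data=df, smoother=bs, alpha=[0.1])
res_gam = gam.fit()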

11. Conclusion and Best Practices

  • Model selection guidelines
  • Interpretation caveats
  • Reporting standards
  • Future directions

Key Techniques Demonstrated

1. Link Function Selection

  • Logit for binary/proportion data
  • Log for positive continuous data
  • Identity for normal data
  • Probit and complementary log-log alternatives
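
In statsmodels these choices map onto family/link combinations such as the following (a non-exhaustive sketch):

# Family objects with explicit link choices
import statsmodels.api as sm

links = sm.families.links
fam_logit   = sm.families.Binomial()                      # canonical logit link
fam_probit  = sm.families.Binomial(link=links.Probit())   # probit alternative
fam_cloglog = sm.families.Binomial(link=links.CLogLog())  # complementary log-log
fam_gamma   = sm.families.Gamma(link=links.Log())         # log link for positive data
fam_gauss   = sm.families.Gaussian()                      # identity link for normal data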

2. Diagnostic Procedures

# Residual analysis (assumes `res` is a fitted GLM results object, e.g. res = sm.GLM(...).fit())
resid_pearson = res.resid_pearson      # Pearson residuals
resid_deviance = res.resid_deviance    # deviance residuals

# Influence measures
influence = res.get_influence()
cooks_distance = influence.cooks_distance  # Cook's distances (statsmodels also returns p-values)

3. Robust Inference

# Heteroskedasticity-consistent standard errors (`model` is an unfitted sm.GLM instance)
res_robust = model.fit(cov_type='HC0')   # HC0 sandwich estimator; HC1-HC3 also available
print(res_robust.bse)                    # robust standard errors

4. Regularization

# Ridge-type penalty for GLMs: L1_wt=0 gives a pure L2 (ridge) penalty, L1_wt=1 gives lasso
res_ridge = model.fit_regularized(alpha=0.5, L1_wt=0)

5. Cross-Validation

# K-fold cross-validation (cross_val_score needs a scikit-learn estimator,
# e.g. LogisticRegression, rather than a statsmodels GLM object)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf, scoring='neg_log_loss')

Results Highlights

Star98 Dataset Findings

  • Strong negative effect: Low-income percentage reduces academic success probability
  • Demographic patterns: Asian percentage positively correlates with success
  • Teacher quality: Experience and minority representation show positive impacts
  • Resource paradox: Higher spending associated with lower success (after controlling for other factors)

Scottish Voting Analysis

  • Economic factors: Unemployment and taxes reduce support for taxation powers
  • Demographic effects: Older populations more supportive
  • Excellent model fit: Pseudo R² = 0.98 for Gamma GLM

Methodological Insights

  • Robust SEs matter: Several "significant" predictors become non-significant with robust errors
  • Regularization helps: Ridge regression stabilizes coefficient estimates
  • CV reveals overfitting: High in-sample fit doesn't guarantee out-of-sample performance

Applications

Education Policy

  • Identifying key determinants of student success
  • Resource allocation optimization
  • Equity and inclusion analysis

Political Science

  • Voting behavior modeling
  • Policy preference analysis
  • Regional variation studies

Business Analytics

  • Customer conversion modeling (logistic regression)
  • Claim severity modeling (Gamma regression)
  • Demand forecasting

Healthcare

  • Disease prevalence modeling
  • Treatment effect estimation
  • Healthcare utilization analysis

References

Primary Sources

  1. Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized Linear Models"
  2. McCullagh, P., & Nelder, J. A. (1989). "Generalized Linear Models, 2nd Edition"
  3. Dobson, A. J., & Barnett, A. G. (2018). "An Introduction to Generalized Linear Models, 4th Edition"

Software Documentation

  • statsmodels GLM documentation: https://www.statsmodels.org/stable/glm.html

Resources

  • Gill, J. (2000). "Generalized linear models: A unified approach"
  • Hardin, J. W., & Hilbe, J. M. (2018). "Generalized Linear Models and Extensions"
  • Agresti, A. (2015). "Foundations of Linear and Generalized Linear Models"

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

Note

This notebook is designed for educational and research purposes. Real-world applications may require additional considerations and validation.
