Live dashboard: https://twhit.shinyapps.io/phasm/
PHASM is a Bayesian projection system for MLB hitters and pitchers. It combines multivariate outcome
modeling, hierarchical player/position effects, and AR(1) year trends to produce
probabilistic forecasts of per-PA/per-IP rates and rate stats. The system also supports total-count
projections when paired with external PA/IP forecasts.
Category projection outputs include posterior mean and quantiles (p05, p50, p95). Count outcomes are also
stored as totals using ATC PA/IP (e.g., H_mean_t, Ks_mean, W_mean_t, SVHLD_mean_t).
Projection and composite outputs also carry 2026 Team from ATC projection files.
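
As a rough illustration of how such summaries could be derived from posterior draws, the sketch below uses simulated draws. `rate_draws`, `pa_proj`, and `summarise_draws` are hypothetical names, and scaling the rate by a fixed PA forecast is an assumption about how the totals are formed, not the repository's code.

```r
set.seed(1)

# Hypothetical posterior draws of a per-PA hit rate for one hitter.
rate_draws <- rbeta(4000, shape1 = 150, shape2 = 450)

# Hypothetical external PA forecast (the README uses ATC PA/IP for this).
pa_proj <- 600

# Posterior mean plus the 5th, 50th, and 95th percentiles.
summarise_draws <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.50, 0.95), names = FALSE)
  c(mean = mean(x), p05 = q[1], p50 = q[2], p95 = q[3])
}

rate_summary  <- summarise_draws(rate_draws)            # per-PA rate
total_summary <- summarise_draws(rate_draws * pa_proj)  # season total, e.g. H
```

Summarising the scaled draws, rather than scaling a point estimate, keeps the interval for the total consistent with the interval for the rate.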
- Fits a joint multivariate Bayesian model (rstan) for H, R, RBI, HR, SB (per-PA rates) plus AVG, OBP, SLG.
- Fits a joint multivariate Bayesian model (rstan) for SP: SO, BB, H, ER, W, QS (per-IP rates).
- Fits a joint multivariate Bayesian model (rstan) for RP: SO, BB, H, ER, W, SVHLD (per-IP rates).
- Uses age/aging curve for hitters and pitchers; position effects for hitters; role leverage for RP SVHLD.
- Player random intercepts and age slopes; position random intercepts and age/age^2 slopes.
- Year random intercepts with AR(1) evolution.
- Hitter Stan model: `models/hitter_model.stan`
- Hitter R driver: `models/fit_hitter_model.R`
- SP Stan model: `models/sp_model.stan`
- SP R driver: `models/fit_sp_model.R`
- RP Stan model: `models/rp_model.stan`
- RP R driver: `models/fit_rp_model.R`
- Hitter inputs: `data/fangraphs_batters_2018_2025.csv`
- Pitcher inputs: `data/fangraphs_pitchers_2018_2025.csv`
- Hitter outputs (after fitting):
  - `models/hitter_model_fit.rds`
  - `models/hitter_model_inputs.rds`
  - `results/projections/batters/category_projections_2026.csv`
- Hitter projection refresh (no refit): `results/scripts/build_batter_category_projections_from_fit.R`
- SP outputs (after fitting):
  - `models/sp_model_fit.rds`
  - `models/sp_model_inputs.rds`
  - `results/projections/pitchers/sp_category_projections_2026.csv`
- SP projection refresh (no refit): `results/scripts/build_sp_category_projections_from_fit.R`
- RP outputs (after fitting):
  - `models/rp_model_fit.rds`
  - `models/rp_model_inputs.rds`
  - `results/projections/pitchers/rp_category_projections_2026.csv`
- RP projection refresh (no refit): `results/scripts/build_rp_category_projections_from_fit.R`
- Trends:
  - Hitters: all modeled outcomes.
  - Starters: W and QS (per-IP), plus derived ERA, K/9, BB/9, and WHIP.
  - Relievers: W and SVHLD (per-IP), plus derived ERA, K/9, BB/9, and WHIP.
- Interval projections:
  - Hitters by position; starters/relievers by role.
- Age (standardized) and age^2
- Hitter position indicators
- RP `role_leverage` indicator for SVHLD (fixed effect)
- Current SP model is SP-only (2018–2025) and models SO, BB, H, ER, W, and QS as per-IP rates.
- SP outcomes are modeled as Poisson with a log(IP) offset.
- Role effects and SV/HLD are omitted in the current SP-only run.
- RP model uses 2018–2025 relievers only and models SO, BB, H, ER, W, and SVHLD as per-IP rates.
- RP outcomes are modeled as Poisson with a log(IP) offset.
- RP fit input excludes any pitcher with ATC-projected GS >= 1 (from `data/atc_ip_projections_2026.csv`).
- RP uses a binary `role_leverage` covariate for SVHLD only:
  - Training years: top 3 in SVHLD on each team-season are flagged. For `Team == "2 Tms"`, top 5 are flagged; `3+ Tms` are excluded. Anyone with SVHLD >= 10 or SVHLD/IP >= 0.3 is also flagged.
  - 2026: players listed as Closer/Co-Closer/Closer Committee/Setup Man on the Fangraphs closer depth chart are flagged and cached to `data/closer_depth_chart_2026.csv`.
- RP priors default to empirical-Bayes summaries loaded from `results/prior_predictive/rp_prior_summary.csv`.
- If that summary file is missing or invalid, RP fitting falls back to legacy priors (including the tighter SVHLD-specific priors).
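
The training-year `role_leverage` rule described above can be sketched in base R as follows. This is an illustration only: `flag_role_leverage` and the column names `Season`, `Team`, `SVHLD`, and `IP` are hypothetical, and "`3+ Tms` are excluded" is read here as "excluded from the rank-based rule" while still being eligible for the volume thresholds.

```r
# Flag high-leverage relievers in training seasons.
# df: one row per reliever-season with columns Season, Team, SVHLD, IP
# (hypothetical column names).
flag_role_leverage <- function(df) {
  # Rank cutoff: top 5 for "2 Tms" rows, top 3 otherwise.
  top_n <- ifelse(df$Team == "2 Tms", 5L, 3L)

  # Rank within each team-season, highest SVHLD first (ties share the best rank).
  rk <- ave(-df$SVHLD, df$Season, df$Team,
            FUN = function(x) rank(x, ties.method = "min"))

  by_rank   <- df$Team != "3+ Tms" & rk <= top_n
  by_volume <- df$SVHLD >= 10 | (df$IP > 0 & df$SVHLD / df$IP >= 0.3)

  as.integer(by_rank | by_volume)
}

# Toy data, not real players.
relievers <- data.frame(
  Season = 2024,
  Team   = c("NYY", "NYY", "NYY", "NYY", "3+ Tms", "3+ Tms"),
  SVHLD  = c(20, 8, 5, 1, 12, 2),
  IP     = c(60, 60, 60, 60, 50, 50)
)
relievers$role_leverage <- flag_role_leverage(relievers)
```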
- EB summaries are fit on 2013-2017 data and used as defaults for all main model fits (hitters, SP, RP).
- EB summary generation scripts:
  - Hitters: `results/scripts/fit_batter_eb_2013_2017.R` -> `results/prior_predictive/batter_prior_summary.csv`
  - SP: `results/scripts/fit_sp_eb_2013_2017.R` -> `results/prior_predictive/sp_prior_summary.csv`
  - RP: `results/scripts/fit_rp_eb_2013_2017.R` -> `results/prior_predictive/rp_prior_summary.csv`
- EB training data:
  - Hitters: `data/fangraphs_batters_2013_2017.csv`
  - Pitchers: `data/fangraphs_pitchers_2013_2017.csv` with model-specific SP/RP filters matching each main fit.
- EB summaries store posterior mean/sd/quantiles for prior hyperparameters used by the corresponding main model.
- Main fit scripts (`models/fit_hitter_model.R`, `models/fit_sp_model.R`, `models/fit_rp_model.R`) use EB posterior means as prior centers and EB posterior sds as prior scales (with small lower bounds for numerical stability), then pass those values into Stan via data.
- For half-normal scale priors (for example `sigma_player`, `sigma_year`), fit scripts convert EB posterior mean sigma to half-normal scale parameters.
- If an EB summary is missing or invalid, the corresponding fit script logs a fallback message and reverts to legacy priors.
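
A minimal sketch of the load-or-fall-back logic and the half-normal conversion might look like the following. It is not the fit scripts' actual code: the helper names, the summary columns (`parameter`, `mean`, `sd`), and the `0.05` floor are assumptions. The conversion shown is moment matching, which uses the fact that a half-normal with scale `s` has mean `s * sqrt(2 / pi)`; the scripts may use a different mapping.

```r
# Convert an EB posterior mean of sigma into a half-normal scale by matching
# the half-normal mean, with a small floor for numerical stability.
half_normal_scale <- function(post_mean, floor = 0.05) {
  pmax(post_mean * sqrt(pi / 2), floor)
}

# Read an EB summary; return NULL (and log a message) if it is missing or invalid.
load_eb_summary <- function(path, required = c("parameter", "mean", "sd")) {
  out <- tryCatch(
    read.csv(path, stringsAsFactors = FALSE),
    error   = function(e) NULL,
    warning = function(w) NULL
  )
  valid <- !is.null(out) && nrow(out) > 0 &&
    all(required %in% names(out)) &&
    all(is.finite(out$mean)) && all(is.finite(out$sd))
  if (!valid) {
    message("EB summary missing or invalid; using legacy priors: ", path)
    return(NULL)
  }
  out
}

eb <- load_eb_summary("results/prior_predictive/rp_prior_summary.csv")

# Legacy fallback scale of 1 corresponds to the N+(0, 1) prior listed in this README.
sigma_player_scale <- if (is.null(eb)) {
  1
} else {
  half_normal_scale(eb$mean[eb$parameter == "sigma_player"])
}
```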
The sections below describe the hitter model. The pitcher models use the same backbone but use IP instead of PA for the offset, and model:
- SP: SO, BB, H, ER, W, QS
- RP: SO, BB, H, ER, W, SVHLD (plus `role_leverage` for SVHLD)
- Shiny app lives in `dashboard/app.R`.
- Run from repo root: `Rscript -e "shiny::runApp('dashboard', port = 3838, launch.browser = TRUE)"`
- Composite ranking tabs support both position/role filters and `Team` filters.
- Projection tabs support `Team` filters (and hitter position filter).
- Players $i = 1..I$, positions $p = 1..P$, years $y = 1..Y$
- Outcomes $k = 1..8$, ordered: $(H, R, RBI, HR, SB, AVG, OBP, SLG)$
- Count outcomes: $k = 1..5$; continuous outcomes: $k = 6..8$
- Observations indexed by $n = 1..N$, each with player $i[n]$, position $p[n]$, year $y[n]$
- $PA_n$: plate appearances for observation $n$
- Count outcomes: $y_{n,k}$ for $k=1..5$
- Continuous outcomes:
  - $AVG_n, OBP_n \in (0,1)$ with logit transform
  - $SLG_n > 0$ with log transform
- Transforms: $a_n = \text{logit}(AVG_n)$, $o_n = \text{logit}(OBP_n)$, $s_n = \log(SLG_n + \varepsilon)$
- $X_n$: fixed effects row (intercept, age, age$^2$)
- $Z^{\text{pos}}_n$: position random effect predictors (intercept, age, age$^2$)
- $Z^{\text{player}}_n$: player random effect predictors (intercept, age)
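
The transforms and design rows listed above could be built in R roughly as follows. This is a sketch on toy data: the value of `eps`, the column names, and squaring the standardized age (rather than standardizing age squared) are assumptions.

```r
eps <- 1e-6  # hypothetical value; the README only names epsilon

# Toy hitter-seasons, not real data.
hitters <- data.frame(
  Age = c(24, 29, 35),
  AVG = c(0.251, 0.302, 0.228),
  OBP = c(0.320, 0.385, 0.301),
  SLG = c(0.410, 0.520, 0.365)
)

# Standardize age, then build the fixed-effect row (intercept, age, age^2).
age_z <- as.numeric(scale(hitters$Age))
X <- cbind(intercept = 1, age = age_z, age2 = age_z^2)

# Continuous outcomes on their modeling scales.
a <- qlogis(hitters$AVG)      # logit(AVG)
o <- qlogis(hitters$OBP)      # logit(OBP)
s <- log(hitters$SLG + eps)   # log(SLG + eps)

# Back-transforms used when reporting projections on the original scale.
avg_back <- plogis(a)
slg_back <- exp(s) - eps
```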
- Count outcomes (per-PA rates via log offset):
  equivalently:
- Continuous outcomes:
- Player random effects use $\{\text{intercept}, \text{age}\}$:
- Position random effects use $\{\text{intercept}, \text{age}, \text{age}^2\}$:
- Each $\Sigma^{\text{group}}_r$ is constructed from scale vector $\sigma^{\text{group}}_r$ and correlation matrix $\Omega^{\text{group}}_r$:
- For each outcome $k$:
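
The relationships this section refers to can be written out as below. This is a sketch consistent with the notation defined above, not a transcription of `models/hitter_model.stan`. The following are assumptions: the Poisson count likelihood (by analogy with the SP/RP models), the symbols $u^{\text{player}}$ and $u^{\text{pos}}$, independent Normal likelihoods for the transformed outcomes, and the reading of $r$ as the random-effect term with covariance taken across the 8 outcomes. The second argument of $\mathcal{N}$ is taken to be a standard deviation, as in Stan.

$$
\begin{aligned}
\eta_{n,k} &= X_n \beta_k + \sum_r Z^{\text{pos}}_{n,r}\, u^{\text{pos}}_{p[n],r,k} + \sum_r Z^{\text{player}}_{n,r}\, u^{\text{player}}_{i[n],r,k} + \gamma_{k,y[n]} \\
y_{n,k} &\sim \text{Poisson}\big(PA_n \exp(\eta_{n,k})\big), \quad k = 1..5 \\
\log \mathbb{E}[y_{n,k}] &= \log PA_n + \eta_{n,k} \\
a_n \sim \mathcal{N}(\eta_{n,6}, \sigma_6), \quad o_n &\sim \mathcal{N}(\eta_{n,7}, \sigma_7), \quad s_n \sim \mathcal{N}(\eta_{n,8}, \sigma_8) \\
u^{\text{player}}_{i,r,\cdot} &\sim \mathcal{N}_8\big(0, \Sigma^{\text{player}}_r\big), \quad r \in \{\text{intercept}, \text{age}\} \\
u^{\text{pos}}_{p,r,\cdot} &\sim \mathcal{N}_8\big(0, \Sigma^{\text{pos}}_r\big), \quad r \in \{\text{intercept}, \text{age}, \text{age}^2\} \\
\Sigma^{\text{group}}_r &= \text{diag}\big(\sigma^{\text{group}}_r\big)\, \Omega^{\text{group}}_r\, \text{diag}\big(\sigma^{\text{group}}_r\big) \\
\gamma_{k,t} &\sim \mathcal{N}\big(\rho_k \gamma_{k,t-1}, \sigma_{\text{year},k}\big), \quad t = 2..Y
\end{aligned}
$$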
- Draw $\gamma_{k,Y+1} \sim \mathcal{N}(\rho_k\gamma_{k,Y}, \sigma_{\text{year},k})$
- Predict $\eta_{n,k}$ for 2026 using age and age$^2$ (with age incremented by +1 from the most recent season), plus the drawn 2026 year effect
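
The two projection steps above can be sketched per posterior draw as follows. All inputs here are simulated stand-ins for posterior draws of a single count outcome, the variable names are hypothetical, and $\sigma_{\text{year},k}$ is treated as a standard deviation.

```r
set.seed(42)
n_draws <- 2000

# Hypothetical posterior draws for one count outcome k.
rho         <- rnorm(n_draws, mean = 0.3,  sd = 0.1)        # AR(1) coefficient
sigma_year  <- abs(rnorm(n_draws, mean = 0.05, sd = 0.01))  # year-effect sd
gamma_last  <- rnorm(n_draws, mean = 0.02, sd = 0.02)       # year effect in final season Y
# Fixed + position + player part of eta, evaluated at next season's age.
eta_no_year <- rnorm(n_draws, mean = log(0.04), sd = 0.1)

# Step 1: draw the 2026 year effect from the AR(1) transition, one per draw.
gamma_next <- rnorm(n_draws, mean = rho * gamma_last, sd = sigma_year)

# Step 2: form the 2026 linear predictor and map it to a per-PA rate.
eta_2026  <- eta_no_year + gamma_next
rate_2026 <- exp(eta_2026)

# Posterior mean and 5/50/95% quantiles of the projected rate.
rate_summary <- c(mean = mean(rate_2026),
                  quantile(rate_2026, probs = c(0.05, 0.50, 0.95)))
```

Drawing the year effect separately for each posterior draw carries the uncertainty in $\rho_k$ and $\sigma_{\text{year},k}$ through to the projected rate.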
- Fixed effects (standardized predictors): $\beta_k \sim \mathcal{N}(0, 2.5^2)$
- Random effect scales (half-normal): $\sigma^{\text{player}}_r, \sigma^{\text{pos}}_r \sim \mathcal{N}^+(0, 1)$
- Non-centered random effects: $z^{\text{player}}_r, z^{\text{pos}}_r \sim \mathcal{N}(0, 2.5^2)$
- Correlations: $\Omega^{\text{group}}_r \sim \text{LKJ}(2)$
- Year AR(1) parameters: $\rho_k \sim \mathcal{N}(0, 0.5)$, $\sigma_{\text{year},k} \sim \mathcal{N}^+(0, 1)$
- Continuous outcome noise: $\sigma_k \sim \mathcal{N}^+(0, 1)$
In production fits, each model (hitters/SP/RP) defaults to EB-informed priors from its 2013-2017 summary file. The distributions above are the legacy fallback priors used if EB summaries are unavailable or invalid.