adding weights to regressions with up to 2 FE dimensions#12

Open
jamesbrandecon wants to merge 5 commits into main from weighted-regression

Conversation

@jamesbrandecon
Collaborator

First step of #11

I am running local tests of every combination of no FEs, 1 FE, weights, and no weights, and the output matches feols exactly.

@jamesbrandecon
Collaborator Author

jamesbrandecon commented Aug 30, 2025

@grantmcdermott if you can, it'd help to confirm that weighted regs match feols on your end. Next task is to figure out if I can refactor the dbplyr stuff to avoid writing temp tables, which I think will be hard but probably necessary.

This is also the second step of #11, at least for TWFE. More work for HDFE.

@grantmcdermott
Owner

Awesome. I think the next step is to set up a test suite and CI so that we can iterate more systematically and securely. I've got a couple of things on today, but will try to do that when I get a sec. Then we can merge into this branch and add weighted tests too before diving into the code.

@jamesbrandecon
Collaborator Author

I agree! It was already unwieldy to make sure I hadn't broken anything. Happy to defer to you on test structure/content.

@grantmcdermott
Owner

Tests and CI added. If you merge from main locally and then push, they should auto propagate.

(You can also add additional tests as part of this PR. They need to be saved as inst/tinytest/test_<suffix>.R. See inst/tinytest/test_trade.R for some examples.)

@jamesbrandecon
Collaborator Author

Thanks! Any attachment to the current test structure? Locally, I'm simulating one small-ish dataset and then running regs with many different options and comparing to feols as the source of truth. So expect_equal(feols_coefs, dbreg_coefs) for weighted, unweighted, FEs, data vs. table, and each strategy.
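
For concreteness, here's a rough sketch of the test file I have in mind (the `dbreg()` weights argument here is hypothetical; I'd match whatever the final API ends up being):

```r
# inst/tinytest/test_weights.R -- illustrative sketch only
library(tinytest)
library(fixest)

# Simulate one small-ish dataset
set.seed(42)
n = 500
dat = data.frame(
  y  = rnorm(n),
  x  = rnorm(n),
  f1 = sample(letters[1:5], n, replace = TRUE),
  f2 = sample(1:10, n, replace = TRUE),
  w  = runif(n)
)

# feols as the source of truth
fe_mod = feols(y ~ x | f1 + f2, data = dat, weights = ~w)

# Hypothetical dbreg call; the real signature may differ
db_mod = dbreg(y ~ x | f1 + f2, data = dat, weights = "w")

expect_equal(coef(fe_mod), coef(db_mod), tolerance = 1e-6)
```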

@grantmcdermott
Owner

grantmcdermott commented Aug 31, 2025

No attachment, but it does need to follow the inst/tinytest location and test_*.R naming convention for the test suite to run. You can add your new test file, including the simulated dataset etc., all in a single script. (It doesn't need to be added to mine.)

@jamesbrandecon jamesbrandecon changed the title adding weights to regression with zero or one FE adding weights to regressions with up to 2 FE dimensions Aug 31, 2025
@grantmcdermott
Owner

Just a quick note: the CI will fail because you've got new undeclared dependencies data.table and tidyr (these would need to be added to the DESCRIPTION file). But I'd prefer to only introduce the former as a dependency if possible, since it has zero recursive deps.

@grantmcdermott
Owner

grantmcdermott commented Aug 31, 2025

To clarify: Just add data.table to the list of packages in Suggests.

(It doesn't need to be in Imports since it's only required for the test suite).

@jamesbrandecon
Collaborator Author

I'll try harder on testing later today -- sorry for the CI-breaking emails. Need to learn local testing in R!

@grantmcdermott
Owner

grantmcdermott commented Aug 31, 2025

No worries! I actually don't get the breaking emails (since it's your PR). The workflow is quite nice once you get used to it. I might add a Makefile when I get a chance, but two tips for local testing in the meantime:

  • To run just the tests locally: tinytest::test_all()

  • To run the full R CMD check locally, including tests (what the CI is doing under the hood): devtools::check()

@jamesbrandecon
Collaborator Author

Thanks! Now I get the flow.

I hadn't realized we weren't importing dplyr yet -- is that something you want to avoid? I was relying heavily on dplyr/dbplyr and pipes to get alternating projections to work.

@grantmcdermott
Owner

grantmcdermott commented Aug 31, 2025

I hadn't realized we weren't importing dplyr yet -- is that something you want to avoid? I was relying heavily on dplyr/dbplyr and pipes to get alternating projections to work.

I'd prefer not to add it as a dependency if we can avoid it. At the same time, I obviously don't want to make our lives harder than they need to be. Regardless, maybe it's a good idea to touch base over a call at this point? The weights functionality sounds very useful wherever we land on HDFE. I don't want to lose that through too much integration with the alternating projections functionality yet....

Pure SQL alternating projections

Speaking of which, here is my largely vibe-coded "pure SQL" implementation of alternating projections. Right now, this is just a standalone function that does the AP (using temp tables) and then calls dbreg::dbreg() on the finalized demeaned table, which is still in our connection memory. So the good news is that everything stays on the database... but we'd obviously have to do more work to integrate it as a direct dbreg(..., strategy = "altproj") option. (I also haven't tested it thoroughly).

dbreg_altproj.R
#' High Dimensional Fixed Effects via Alternating Projections
#'
#' @description
#' Implements the alternating projections method (Gaure, 2013) for handling
#' high dimensional fixed effects (3+) by performing iterative demeaning in SQL,
#' then delegating the final regression to dbreg on the residualized data.
#'
#' @param fml Formula with 3+ fixed effects
#' @param conn Database connection
#' @param table,data,path Data specification (same as dbreg)
#' @param vcov Variance-covariance type
#' @param tol Convergence tolerance for alternating projections
#' @param maxit Maximum iterations
#' @param verbose Print progress messages
#'
#' @return A dbreg result object with alternating projections metadata
#'
#' @references
#' Gaure, S. (2013). OLS with multiple high dimensional category variables.
#' Computational Statistics & Data Analysis, 66, 8-18.
#'
#' @export
dbreg_altproj = function(
  fml,
  conn = NULL,
  table = NULL,
  data = NULL,
  path = NULL,
  vcov = c("iid", "hc1"),
  tol = 1e-10,
  maxit = 1000,
  verbose = TRUE
) {
  vcov = match.arg(vcov)

  # Parse formula to extract variables
  fml_parsed = Formula::Formula(fml)
  yvar = all.vars(formula(fml_parsed, lhs = 1, rhs = 0))
  xvars = all.vars(formula(fml_parsed, lhs = 0, rhs = 1))
  fes = if (length(fml_parsed)[2] > 1) {
    all.vars(formula(fml_parsed, lhs = 0, rhs = 2))
  } else {
    NULL
  }

  # if (length(fes) < 3) {
  #   stop(
  #     "dbreg_altproj requires 3+ fixed effects. Use regular dbreg() for fewer FE."
  #   )
  # }

  # Set up connection if needed
  own_conn = FALSE
  if (is.null(conn)) {
    conn = DBI::dbConnect(duckdb::duckdb())
    own_conn = TRUE
  }

  # Determine data source
  if (!is.null(table)) {
    from_statement = glue::glue("FROM {table}")
  } else if (!is.null(data)) {
    duckdb::duckdb_register(conn, "tmp_altproj_data", data)
    from_statement = "FROM tmp_altproj_data"
  } else if (!is.null(path)) {
    if (!(grepl("^read|^scan", path) && grepl("'", path))) {
      path = gsub('"', "'", path)
      from_statement = glue::glue("FROM '{path}'")
    } else {
      from_statement = glue::glue("FROM {path}")
    }
  } else {
    stop("Provide one of `table`, `data`, or `path`.")
  }

  if (verbose) {
    message(
      "[altproj] Starting alternating projections for ",
      length(fes),
      " fixed effects"
    )
  }

  # Step 1: Create dense FE identifiers
  fe_dense_cols = paste0("fe_", seq_along(fes))
  dense_rank_exprs = character(length(fes))
  for (i in seq_along(fes)) {
    dense_rank_exprs[i] = glue::glue(
      "DENSE_RANK() OVER (ORDER BY {fes[i]}) AS {fe_dense_cols[i]}"
    )
  }

  all_vars = c(yvar, xvars)
  var_list = paste(all_vars, collapse = ", ")
  dense_fe_list = paste(fe_dense_cols, collapse = ", ")

  sql_dense_ids = glue::glue(
    "CREATE TEMP TABLE altproj_base AS
     SELECT
       {var_list},
       {paste(dense_rank_exprs, collapse = ',\n       ')}
     {from_statement}"
  )

  DBI::dbExecute(conn, sql_dense_ids)

  if (verbose) {
    message("[altproj] Created dense FE identifiers")
  }

  # Step 2: Initial multiway demeaning
  window_exprs = character()
  demean_exprs = character()

  for (var in all_vars) {
    # Create window functions for each FE
    for (i in seq_along(fe_dense_cols)) {
      fe_col = fe_dense_cols[i]
      window_exprs = c(
        window_exprs,
        glue::glue("AVG({var}) OVER (PARTITION BY {fe_col}) AS {var}_{i}")
      )
    }

    # Create demeaning expression (subtract all FE means)
    fe_terms = paste0(var, "_", seq_along(fe_dense_cols))
    demean_expr = var
    for (term in fe_terms) {
      demean_expr = paste0("(", demean_expr, " - ", term, ")")
    }
    demean_exprs = c(demean_exprs, glue::glue("{demean_expr} AS {var}_res"))
  }

  sql_initial_demean = glue::glue(
    "CREATE TEMP TABLE ap_work AS
     WITH windowed AS (
       SELECT
         {dense_fe_list},
         {var_list},
         {paste(window_exprs, collapse = ',\n         ')}
       FROM altproj_base
     )
     SELECT
       {dense_fe_list},
       {paste(demean_exprs, collapse = ',\n       ')}
     FROM windowed"
  )

  DBI::dbExecute(conn, sql_initial_demean)

  if (verbose) {
    message("[altproj] Completed initial multiway demeaning")
  }

  # Step 3: Alternating projections loop
  res_vars = paste0(all_vars, "_res")

  # Helper function for one demeaning step
  do_half_step = function(part_col) {
    demean_exprs = character()
    for (var in res_vars) {
      demean_exprs = c(
        demean_exprs,
        glue::glue("{var} - AVG({var}) OVER (PARTITION BY {part_col}) AS {var}")
      )
    }

    sql_step = glue::glue(
      "CREATE TEMP TABLE ap_work_new AS
       SELECT {dense_fe_list},
              {paste(demean_exprs, collapse = ',\n              ')}
       FROM ap_work"
    )

    DBI::dbExecute(conn, "DROP TABLE IF EXISTS ap_work_new")
    DBI::dbExecute(conn, sql_step)
    DBI::dbExecute(conn, "DROP TABLE ap_work")
    DBI::dbExecute(conn, "ALTER TABLE ap_work_new RENAME TO ap_work")
  }

  # Function to compute squared norm for convergence
  get_norm2 = function() {
    norm_terms = paste0(res_vars, "*", res_vars)
    sql_norm = glue::glue(
      "SELECT COALESCE(SUM({paste(norm_terms, collapse = ' + ')}), 0) AS n2 FROM ap_work"
    )
    as.numeric(DBI::dbGetQuery(conn, sql_norm)$n2)
  }

  # Main AP loop
  last_norm = Inf
  iteration = 0L

  if (verbose) {
    message("[altproj] Running alternating projections iterations...")
  }

  repeat {
    iteration = iteration + 1L

    # One full sweep through all fixed effects
    for (fe_col in fe_dense_cols) {
      do_half_step(fe_col)
    }

    current_norm = get_norm2()
    if (abs(current_norm - last_norm) <= tol * max(1, last_norm)) {
      if (verbose) {
        message("[altproj] Converged after ", iteration, " iterations")
      }
      break
    }

    if (iteration >= maxit) {
      if (verbose) {
        message("[altproj] Reached maximum iterations (", maxit, ")")
      }
      break
    }

    last_norm = current_norm
  }

  # Step 4: Run dbreg on the demeaned data
  if (verbose) {
    message("[altproj] Running final regression on demeaned data...")
  }

  # Create formula for demeaned variables
  y_res = paste0(yvar, "_res")
  x_res = paste0(xvars, "_res")
  final_fml = reformulate(x_res, y_res)

  # Call dbreg on the ap_work table (no fixed effects needed)
  result = dbreg::dbreg(
    fml = final_fml,
    conn = conn,
    table = "ap_work",
    vcov = vcov,
    verbose = FALSE # We handle our own messaging
  )

  # Step 5: Adjust result metadata
  # Get original sample size
  nobs_orig = as.numeric(
    DBI::dbGetQuery(conn, "SELECT COUNT(*) AS n FROM altproj_base")$n
  )

  # Count FE degrees of freedom
  fe_df = 0
  for (i in seq_along(fe_dense_cols)) {
    fe_col = fe_dense_cols[i]
    n_levels = as.numeric(
      DBI::dbGetQuery(
        conn,
        glue::glue("SELECT COUNT(DISTINCT {fe_col}) AS n FROM altproj_base")
      )$n
    )
    fe_df = fe_df + n_levels
  }
  fe_df = fe_df - length(fes) # Subtract for identifiability

  # Adjust degrees of freedom
  adjusted_df_res = max(nobs_orig - length(xvars) - fe_df, 1)

  # Update result object (capture the original df before overwriting it)
  df_res_orig = result$df_residual
  result$strategy = "altproj"
  result$fes = fes # Restore original FE names
  result$nobs_orig = nobs_orig
  result$df_residual = adjusted_df_res
  result$n_iterations = iteration
  result$n_fe_levels = fe_df + length(fes)

  # Recompute coefficient table with the corrected degrees of freedom
  if (df_res_orig != adjusted_df_res) {
    betahat = result$coeftable[, "estimate", drop = FALSE]
    # Scale the variance for the change in df
    scale_factor = df_res_orig / adjusted_df_res
    vcov_scaled = result$vcov * scale_factor
    result$vcov = vcov_scaled
    result$coeftable = gen_coeftable(betahat, vcov_scaled, adjusted_df_res)
  }

  # Clean up
  DBI::dbExecute(conn, "DROP TABLE IF EXISTS altproj_base")
  DBI::dbExecute(conn, "DROP TABLE IF EXISTS ap_work")
  if (!is.null(data)) {
    duckdb::duckdb_unregister(conn, "tmp_altproj_data")
  }

  if (own_conn) {
    DBI::dbDisconnect(conn, shutdown = TRUE)
  }

  if (verbose) {
    message("[altproj] Alternating projections completed")
  }

  return(result)
}

And a proof-of-concept example:

pkgload::load_all("~/Documents/Projects/dbreg/")
#> ℹ Loading dbreg
library(fixest)

# Load the trade data (4 FE: Origin, Destination, Product, Year)
data("trade", package = "fixest")

# Create log variables
trade$log_euros = log(trade$Euros)
trade$log_dist = log(trade$dist_km)

# example: gravity equation with four FEs

feols(
  log_euros ~ log_dist | Origin + Destination + Product + Year,
  data = trade,
  vcov = "hc1"
)
#> OLS estimation, Dep. Var.: log_euros
#> Observations: 38,325
#> Fixed-effects: Origin: 15,  Destination: 15,  Product: 20,  Year: 10
#> Standard-errors: Heteroskedasticity-robust 
#>          Estimate Std. Error  t value  Pr(>|t|)    
#> log_dist -2.16988   0.018212 -119.147 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 1.74337     Adj. R2: 0.705139
#>                 Within R2: 0.219322
dbreg_altproj(
  log_euros ~ log_dist | Origin + Destination + Product + Year,
  data = trade,
  vcov = "hc1",
  verbose = FALSE
)
#> Standard Errors: Heteroskedasticity-robust 
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> log_dist_res -2.17176   0.020928 -103.772 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Created on 2025-08-31 with reprex v2.1.1

The results are close to fixest, albeit not perfect. Again, I haven't tested thoroughly, but my assumption is that this can be fixed by tweaking the tolerance and stopping rules.

@jamesbrandecon
Collaborator Author

Ok yeah let's talk live! We can email to schedule. Fwiw, three quick points:

  1. Your code implements HDFE, whereas mine (also vibe-coded) is specifically for 2 dimensions, and I haven't thought about expanding further.
  2. For a dataset with N = 5e5, T = 30, yours takes ~8 seconds and mine is more like 2, the same as feols.
  3. I agree weights are useful alone (and for bootstrapping), but it's important to note that the weighted Mundlak with two dimensions can only (as far as I can tell) be done via AP, so the two are connected.
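
On point 3, the weighted demeaning I have in mind just subtracts weighted group means, e.g. in base R (ignoring the SQL layer entirely):

```r
# Weighted within-transformation for one FE dimension (base R sketch).
# x: numeric vector, w: weights, g: grouping factor.
wt_demean = function(x, w, g) {
  wmeans = tapply(w * x, g, sum) / tapply(w, g, sum) # weighted group means
  x - unname(wmeans[as.character(g)])
}

# With two FE dimensions, this has to be alternated until convergence:
# repeat { x = wt_demean(x, w, f1); x = wt_demean(x, w, f2); ... }
```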

@jamesbrandecon
Collaborator Author

@grantmcdermott for now, can you unblock me on adding dplyr as a dependency and let me build this stuff out (weights and AP)? We can always go back and take other approaches if I keep it modular.

@jamesbrandecon
Collaborator Author

@grantmcdermott I am still making sure the speed is ok, but would be interested in any comments you have here. I needed weights for a couple of the things I want to add next, so I went back and had codex undo all the dplyr stuff. Hopefully that lets us finally merge this.

@grantmcdermott
Owner

Thanks @jamesbrandecon.

Realistically, I won't be able to review over the weekend (at least). QQ, though: what do you mean by "alternating projections for demean strategy" in the commits above? Is this full-blown, FWL-style iterative AP? Or something else?

@jamesbrandecon
Collaborator Author

@grantmcdermott If I'm thinking correctly (I independently came to this conclusion in the previous version above, which I mostly discarded here), the demeaning with num_FEs > 2 is best done via alternating projections. I've never looked deeply into AP algorithms, so maybe I'm missing something, but I just loop over each FE dimension and take out the mean, repeating until convergence.
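
In in-memory terms, the loop is roughly this (base R sketch; the actual implementation does the same thing in SQL):

```r
# Alternating projections demeaning over any number of FE dimensions.
# X: numeric matrix of outcome + regressors; fe: list of grouping vectors.
demean_ap = function(X, fe, tol = 1e-10, maxit = 1000) {
  for (it in seq_len(maxit)) {
    X_old = X
    for (g in fe) {
      # Subtract this dimension's group means, column by column
      X = apply(X, 2, function(col) col - ave(col, g))
    }
    if (max(abs(X - X_old)) < tol) break # converged
  }
  X
}
```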
