Add PiecewiseITS experiment for known interruption dates #614
drbenvincent wants to merge 17 commits into main
PR Summary
Adds a new segmented-regression ITS workflow with explicit level/slope changes at known interruptions.
Written by Cursor Bugbot for commit c14f15c.
Codecov Report
@@ Coverage Diff @@
## main #614 +/- ##
========================================
Coverage 93.74% 93.74%
========================================
Files 41 44 +3
Lines 6827 7676 +849
Branches 458 517 +59
========================================
+ Hits 6400 7196 +796
- Misses 267 300 +33
- Partials 160 180 +20
Added detailed explanations comparing Piecewise ITS to Regression Discontinuity and Regression Kink designs. Introduced new real-world scenarios for level and slope changes, multiple interventions, and level-only models. Enhanced example code and output to illustrate these cases, improving clarity and practical guidance for users.
Improved clarity and conciseness throughout the Piecewise Interrupted Time Series (ITS) notebook. Rewrote several sections for better readability, combined and streamlined example scenarios, and clarified distinctions between level and slope changes, as well as the relationship to regression discontinuity and regression kink designs.
Refactors the PiecewiseITS experiment to use flexible patsy formulas with new stateful step() and ramp() transforms for specifying level and slope changes at interventions. Adds the causalpy.transforms module with robust, datetime-aware step/ramp transforms, updates tests to cover new formula interface and transform behavior, and improves documentation and error handling. This enables more flexible modeling of multiple interventions and supports both numeric and datetime time columns.
bca3699 adds the most amazing patsy-based API for segmented/piecewise regression!
Added a new section describing the formula-based API for PiecewiseITS, including explanations of the custom step() and ramp() transforms, usage examples, and clarification on how the counterfactual is computed. This improves documentation clarity and helps users understand flexible model specification.
Implemented creation of post_impact, datapost, and post_pred attributes in PiecewiseITS for compatibility with effect_summary() from BaseExperiment. Added tests to verify effect_summary works for both OLS and PyMC models and that the new attributes are correctly created.
Added mathematical definitions for step and ramp functions using LaTeX for clarity, and moved import/setup code to the top of the notebook for better organization. Improved explanations of function arguments and removed duplicate import cell.
The introductory markdown in the piecewise_its_pymc.ipynb notebook has been significantly expanded and reorganized. The new content provides clearer explanations of when to use Piecewise ITS, the distinction between level and slope changes, the mathematical model, and its relationship to regression discontinuity and regression kink designs. Redundant sections were removed and a more structured, didactic flow was introduced.
Expanded explanations of level and slope changes in piecewise ITS, referencing a new illustrative figure. Added a code cell to display the figure, and clarified the description of multiple interventions for improved instructional clarity.
Inserted a markdown cell with a table summarizing model formulas for single and two intervention cases, covering level, slope, and combined effects. This provides clearer guidance on specifying models for each panel in the notebook.
Condenses and reorganizes introductory explanations for piecewise interrupted time series (ITS), splitting out key concepts, model details, and comparisons to related methods into clearer, more focused sections. Adds collapsible dropdowns and card formatting for scenario examples, and improves clarity and flow for users learning the model and its API.
Adds a comprehensive suite of tests for the PiecewiseITS class, including class and instance attribute checks, formula parsing, plotting, PyMC integration, counterfactuals, data generation, and error handling. Also updates the interrogate badge to reflect increased coverage.
Added detailed references and in-text citations to the piecewise_its_pymc.ipynb notebook to support methodological explanations. Updated the references.bib file with key literature on segmented regression and interrupted time series analysis. Improved clarity on model parameterization and corrected the references section to use the Sphinx bibliography directive.
Tagging @tomicapretto in case you are interested in the stateful transforms ( |
TODO: does this API work when we have a datetime rather than an integer time index?
JeanVanDyk left a comment
Hi! I’ve done a brief review of the changes, and I must say the notebook is extensive and really interesting—the variety of examples makes the functionality very clear. I’ve also run the notebook and the tests locally: everything works and passes as expected.
However, I noticed a few points that might need addressing before we merge:
Data Structure & Types: There are a few places where date/threshold handling is quite defensive (using multiple try/except and isinstance checks). I suspect we could simplify the entire class by standardizing these to pd.Timestamp or numeric types at the initial extraction point. This would also allow us to remove the redundant _convert_threshold_for_plotting helper.
Missing Method: It looks like the effect_summary method is currently missing. Since the global refactor, this seems to have been dropped or overlooked, but it's quite central to the experiment's output.
Overall, great work on the documentation and examples! Let me know what you think about streamlining the type handling and re-adding the summary method.
    if matches:
        return matches[0]
    # Fallback: try to find a time-like column
    return "t"
I noticed that the current logic merges all thresholds into a single list, regardless of the variable name (for example, step(t, 10) and step(month, 5) would result in thresholds = [10, 5]). This loses the context of which limit applies to which variable.
Is it intended to support multiple tracking variables within a single formula?
If yes: We should consider storing these in a dictionary (e.g., {"t": [10], "month": [5]}) to ensure the thresholds are applied to the correct variables later in the execution.
If no: It might be safer to check the number of unique variables found and raise a ValueError if more than one is detected. This would prevent unexpected behavior if a user provides a complex formula.
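Both options can be sketched in a few lines. This is only an illustration, not the code in this PR: `extract_thresholds`, its `allow_multiple_vars` flag, and the regex are hypothetical, and the pattern only handles simple numeric calls such as `step(t, 10)`.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches calls like step(t, 10) or ramp(month, 5.5).
_CALL = re.compile(r"\b(?:step|ramp)\(\s*(\w+)\s*,\s*([-+]?\d+(?:\.\d+)?)\s*\)")


def extract_thresholds(formula, allow_multiple_vars=False):
    """Group numeric thresholds by the variable they apply to."""
    grouped = defaultdict(list)
    for var, value in _CALL.findall(formula):
        threshold = float(value)
        if threshold not in grouped[var]:
            grouped[var].append(threshold)
    # "If no" branch: refuse formulas that mix several time variables.
    if len(grouped) > 1 and not allow_multiple_vars:
        raise ValueError(f"Expected one time variable, found {sorted(grouped)}")
    # "If yes" branch: thresholds stay attached to their variable.
    return {var: sorted(vals) for var, vals in grouped.items()}


print(extract_thresholds("y ~ 1 + t + step(t, 10) + ramp(t, 10)"))
print(extract_thresholds("y ~ step(t, 10) + step(month, 5)", allow_multiple_vars=True))
```

With the flag left at its default, the second formula raises a `ValueError` instead of silently merging `[10, 5]`.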
    else:
        # Numeric threshold
        post_mask = self.data[time_col] >= first_interruption
I see we handle str to Timestamp conversion with a fallback to direct comparison. Is there a specific case where we need to compare raw strings that aren't timestamps?
If we are primarily dealing with dates or numbers, I wonder if it wouldn't be safer to convert everything to the proper type (using pd.to_datetime) right at the beginning of the pipeline?
My thinking is that it might allow us to "fail fast" if a user provides an invalid date, and it would simplify the final comparison to a single line: self.data[time_col] >= first_interruption.
I might be overlooking a specific scenario where this late-stage conversion is necessary, so I'd love to hear your thoughts on the intent here!
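A rough sketch of what converting up front could look like is below. The helper name `normalize_threshold` and the toy `date` column are assumptions made for illustration; only `pd.to_datetime` and the final one-line comparison come from the comment above.

```python
import numbers

import pandas as pd


def normalize_threshold(threshold, time_values):
    """Coerce a threshold to match the time column's type, or raise early."""
    if pd.api.types.is_datetime64_any_dtype(time_values):
        # pd.to_datetime raises on unparseable input, so bad dates fail here.
        return pd.to_datetime(threshold)
    if isinstance(threshold, numbers.Real) and not isinstance(threshold, bool):
        return threshold
    raise TypeError(f"Numeric time column needs a numeric threshold, got {threshold!r}")


# Toy data: six month-start dates (illustrative only).
df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=6, freq="MS")})
first_interruption = normalize_threshold("2020-04-01", df["date"])

# After normalization the comparison is a single line.
post_mask = df["date"] >= first_interruption
print(post_mask.tolist())
```

Once every threshold has passed through a function like this, later code can assume a `pd.Timestamp` or a number and skip the repeated `try`/`except` and `isinstance` checks.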
            return pd.Timestamp(threshold)
        except Exception:
            return threshold  # type: ignore[return-value]
    return threshold
If we standardize the threshold types to pd.Timestamp or numeric values immediately upon extraction, this method likely becomes redundant as the data would already be in its final, usable form. I might be overlooking a specific edge case, but it seems that ensuring clean types at the entry point would allow us to simplify the class by removing this defensive logic and the repetitive try/except checks.
Thanks @JeanVanDyk. I'll take these comments into account and ping you whenever I've got an improved version.
@drbenvincent, thanks for tagging me here. We have had this issue open for a long time. In Bambi (but also with any formula-based modeling interface), I think one could do:

    def step(x, threshold):
        return 1.0 * (x >= threshold)  # ensure numeric output

    def ramp(x, threshold):
        return (x - threshold) * (x >= threshold)

Then you would write

    formula = "y ~ x + step(x, 10)"  # level (aka intercept) changes when x>=10
    formula = "y ~ x + ramp(x, 10)"  # slope changes when x>=10
    formula = "y ~ x + step(x, 10) + ramp(x, 10)"  # both level and slope change when x>=10

With that said, I'm not sure whether those examples provide enough motivation for a stateful transformation. A stateful transformation is useful when some aspect of the transformation (e.g., a threshold) depends on the initial (training) dataset. If the value is hardcoded in the formula, there is no need for a stateful transformation (although using one would not cause any harm). While adding the examples, I just realized one could achieve the same via the interaction operator in combination with the special identity function:

    f = "y ~ x + I(x >= 10)"  # step
    f = "y ~ x + x:I(x >= 10)"  # ramp
    f = "y ~ x + I(x >= 10) + x:I(x >= 10)"  # step + ramp

@drbenvincent, just let me know if you want to have a further chat about this.
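To make the behaviour of the two plain functions concrete, here is a small check on a NumPy array around the threshold. The input values are made up for illustration; the function bodies are the ones given in the comment.

```python
import numpy as np


def step(x, threshold):
    return 1.0 * (x >= threshold)  # ensure numeric output


def ramp(x, threshold):
    return (x - threshold) * (x >= threshold)


x = np.arange(8, 13)  # 8, 9, 10, 11, 12
print(step(x, 10))  # [0. 0. 1. 1. 1.]
print(ramp(x, 10))  # [0 0 0 1 2]
```

Neither function stores anything learned from the data, which matches the point that a stateful transform is not required when the threshold is written into the formula.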
Thanks for this! I think you are right, there's no need for this to be a stateful transform. It can just be a regular function. I think this specific example might be problematic. Because

    f = "y ~ x + I(x-10):I(x >= 10)"  # ramp

(Typing this at 3am after a toddler wake-up, so we'll see if this makes sense in the morning 🤣)
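One way to see the difference between the two ramp encodings is to build both columns by hand. This is an illustrative NumPy check, not part of the PR; the variable names are made up.

```python
import numpy as np

x = np.arange(8, 13)  # 8, 9, 10, 11, 12
post = x >= 10

interaction = x * post      # column produced by x:I(x >= 10)
centered = (x - 10) * post  # column produced by I(x-10):I(x >= 10)

print(interaction)  # [ 0  0 10 11 12]
print(centered)     # [0 0 0 1 2]
```

The uncentered column jumps from 0 to 10 at the threshold, so on its own it mixes a level shift into the slope term, while the centered column starts from 0 and encodes only the slope change. When a separate step term is also in the model the two versions span the same columns, since `x * post` equals `centered + 10 * post`, and only the interpretation of the step coefficient differs.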
@drbenvincent: yes, you are right about the problem and the solution, I missed it. Anyway, I think having specific keywords such as |
Closes #613
This pull request introduces support for Piecewise Interrupted Time Series (ITS) analysis in the codebase. The main changes include adding a new experiment class, stateful patsy transforms for specifying level and slope changes at multiple intervention points, a simulation utility for generating piecewise ITS data, and updates to the package API and documentation to expose these new features.
Piecewise ITS support and API exposure:
- Added the PiecewiseITS experiment class to the codebase and included it in the main package API (__init__.py) and experiments API (causalpy/experiments/__init__.py). This enables users to import and use PiecewiseITS directly.

Patsy transforms for segmented regression:
- Added causalpy/transforms.py providing stateful patsy transforms: step for level changes and ramp for slope changes at arbitrary intervention points. These can be used in regression formulas for flexible piecewise ITS modeling, supporting both numeric and datetime time variables.
- Exposed the step and ramp transforms in the main package API (__init__.py) to allow easy access.

Data simulation utilities:
- Added the generate_piecewise_its_data function to causalpy/data/simulate_data.py for simulating time series data with multiple interventions, customizable level and slope changes, and ground truth counterfactuals for testing and demonstration.

Documentation and notebook updates:
- Added piecewise_its_pymc.ipynb to the documentation index to demonstrate piecewise ITS analysis.

Pre-commit configuration:
- Excluded piecewise_its_pymc.ipynb from large file checks, ensuring a smoother development workflow.

📚 Documentation preview 📚: https://causalpy--614.org.readthedocs.build/en/614/
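For readers who want the underlying idea without the package, here is a minimal NumPy sketch of a piecewise ITS fit. It does not call generate_piecewise_its_data or PiecewiseITS, whose signatures are not shown on this page; the interruption time, coefficients, and noise level are assumed values chosen for illustration, and ordinary least squares stands in for the OLS/PyMC models.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100.0)
t0 = 60.0  # known interruption time (illustrative value)

post = (t >= t0).astype(float)
step_col = post              # level change indicator
ramp_col = (t - t0) * post   # time elapsed since the interruption

# Assumed ground truth: baseline 5 + 0.1*t, level change +3, slope change +0.2
beta_true = np.array([5.0, 0.1, 3.0, 0.2])
X = np.column_stack([np.ones_like(t), t, step_col, ramp_col])
y = X @ beta_true + rng.normal(scale=0.5, size=t.size)

# Segmented regression via ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Counterfactual: the fitted baseline with the step and ramp terms zeroed
X_cf = X.copy()
X_cf[:, 2:] = 0.0
effect = X @ beta_hat - X_cf @ beta_hat

print(np.round(beta_hat, 2))
```

The `effect` array is zero before the interruption and afterwards equals the fitted level change plus the fitted slope change times the time since `t0`.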