Description
Hi, this is an idea to implement an option for a balanced sampler that would allow correcting biases in the data or improving the learning of rare outcomes.
This is most important for binary outcomes, but it could be generalized to more categories, although the sample size then needs to be large.
A simple option would be a sampler that balances the outcome classes within each batch (see the sketch below).
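As a rough illustration only, here is a minimal sketch of what such a per-batch class-balanced sampler could look like in PyTorch. The class name `BalancedBatchSampler` and the `labels` argument are hypothetical, not existing code in this repository:

```python
import random
from torch.utils.data import DataLoader, Sampler

class BalancedBatchSampler(Sampler):
    """Yield batches containing an equal number of samples per binary outcome class."""

    def __init__(self, labels, batch_size):
        assert batch_size % 2 == 0, "batch_size must be even for a binary outcome"
        self.labels = list(labels)
        self.batch_size = batch_size
        # Indices grouped by outcome class
        self.pos = [i for i, y in enumerate(self.labels) if y == 1]
        self.neg = [i for i, y in enumerate(self.labels) if y == 0]

    def __len__(self):
        return len(self.labels) // self.batch_size

    def __iter__(self):
        half = self.batch_size // 2
        for _ in range(len(self)):
            # Sampling with replacement implicitly oversamples the minority class
            batch = random.choices(self.pos, k=half) + random.choices(self.neg, k=half)
            random.shuffle(batch)
            yield batch

# Usage: pass it as a batch_sampler so the DataLoader receives pre-built index lists
# loader = DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, batch_size=32))
```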
A more sophisticated version would balance on a categorical variable or on a numerical score read from the patient file. For a score, a number of bins can be defined, and the sampler would balance the outcome and the distribution across bins, so that if the score is related to the outcome, that effect is removed. Using a score, you can adjust for multiple variables (age, sex, center, ...) by using the predictions of a logistic regression model of these factors on the outcome (propensity score matching).
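To make the score-based variant concrete, here is a hedged sketch assuming the confounders and a 0/1 outcome are available as numpy arrays; the function name and arguments are hypothetical. It fits a logistic regression of the outcome on the confounders, bins the resulting propensity score into quantiles, and assigns inverse-frequency weights per (bin, outcome) stratum so that every stratum is drawn equally often:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import WeightedRandomSampler

def propensity_balanced_sampler(confounders, outcome, n_bins=5):
    """Build a sampler that removes the association between a score and the outcome."""
    outcome = np.asarray(outcome, dtype=int)
    # Propensity score: predicted probability of the outcome given the confounders
    score = LogisticRegression().fit(confounders, outcome).predict_proba(confounders)[:, 1]
    # Discretise the score into quantile bins
    edges = np.quantile(score, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_id = np.digitize(score, edges)   # values in 0 .. n_bins - 1
    stratum = bin_id * 2 + outcome       # one id per (bin, outcome) pair
    # Inverse-frequency weights: each stratum contributes equally to a batch on average
    counts = np.bincount(stratum, minlength=2 * n_bins)
    weights = 1.0 / counts[stratum]
    return WeightedRandomSampler(weights, num_samples=len(outcome), replacement=True)

# Usage: loader = DataLoader(dataset, batch_size=32,
#                            sampler=propensity_balanced_sampler(X_confounders, y))
```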
Imagine you have data from multiple scanners, and the distribution of cases/controls is imbalanced by scanner. The model will learn that. This can be removed if, in each batch, an equal number of cases and controls from each scanner is sampled.
The cost is that some extreme samples may never be used (if we choose downsampling), or that some samples will be used more often than others (with oversampling). Whether to over- or down-sample could be an option.
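That choice could be exposed as a flag. A minimal sketch, reusing the weighted-sampler idea from above (the function name and arguments are again hypothetical):

```python
from torch.utils.data import WeightedRandomSampler

def make_weighted_sampler(weights, num_samples, oversample=True):
    if oversample:
        # With replacement, rare samples are reused so every stratum contributes equally
        return WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)
    # Without replacement, each index is drawn at most once; when num_samples is smaller
    # than the dataset, some of the most frequent samples are simply skipped in an epoch
    return WeightedRandomSampler(weights, num_samples=num_samples, replacement=False)
```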
I attach an example of code showing how this could be implemented.