Balanced sampler #73

@victor-moreno

Hi, this is a proposal for an optional balanced sampler that would help correct biases in the data or improve learning of rare outcomes.
This matters most for binary outcomes. It could be generalized to more categories, but the sample size then needs to be large.
A simple option would be a sampler that balances the outcome classes within each batch.
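
To make the simple option concrete, here is a minimal sketch of a per-batch class balancer. It is not the code from the attached notebook; the function name `balanced_batches` and its parameters are hypothetical, and it assumes labels are available as a plain list and that drawing with replacement (oversampling the rare class) is acceptable.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size, num_batches, seed=0):
    """Yield lists of sample indices with an equal count per outcome class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by their outcome class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    classes = sorted(by_class)
    per_class = batch_size // len(classes)
    if per_class == 0:
        raise ValueError("batch_size must be at least the number of classes")

    for _ in range(num_batches):
        batch = []
        for c in classes:
            # Drawing with replacement means the rare class is oversampled.
            batch.extend(rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)
        yield batch


# Usage: 90 controls and 10 cases, batches of 8 (4 per class).
labels = [0] * 90 + [1] * 10
for batch in balanced_batches(labels, batch_size=8, num_batches=3):
    print(sorted(labels[i] for i in batch))
```

The yielded index lists could be handed to whatever data loader the project uses; how that wiring would look is left open here.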
A more sophisticated version would balance on a categorical variable or on a numerical score read from the patient file. For a score, a number of bins can be defined, and the sampler would balance the outcome within each bin, so that if the score is related to the outcome, that effect is removed. A single score lets you adjust for multiple variables (age, sex, center, ...) by using the predictions of a logistic regression model of the outcome on these factors (as in propensity score matching).
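
The binning idea can be sketched as follows, assuming the score has already been computed (for example, read from the patient file) and the outcome is coded 0/1. The names `bin_index`, `match_within_bins`, and the `edges` parameter are hypothetical. This variant downsamples: within each bin it keeps as many cases as controls, so the score distribution ends up identical in both groups.

```python
import random
from collections import defaultdict


def bin_index(score, edges):
    """Return how many bin edges are <= score, i.e. a bin id in 0..len(edges)."""
    return sum(score >= e for e in edges)


def match_within_bins(outcomes, scores, edges, seed=0):
    """Select indices so each score bin holds equal numbers of cases and controls.

    Hypothetical helper; assumes outcomes are 0 (control) or 1 (case).
    """
    rng = random.Random(seed)

    # strata[bin][outcome] -> list of sample indices
    strata = defaultdict(lambda: {0: [], 1: []})
    for idx, (y, s) in enumerate(zip(outcomes, scores)):
        strata[bin_index(s, edges)][y].append(idx)

    selected = []
    for b in sorted(strata):
        cases, controls = strata[b][1], strata[b][0]
        n = min(len(cases), len(controls))
        # A bin containing only cases or only controls contributes nothing.
        selected.extend(rng.sample(cases, n))
        selected.extend(rng.sample(controls, n))
    return selected


# Usage: low scores are mostly controls, high scores mostly cases.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
print(sorted(match_within_bins(outcomes, scores, edges=[0.5])))
```

Samples in bins with no counterpart from the other class are dropped, which is the downsampling cost: extreme scores may never be used.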

Imagine you have data from multiple scanners, and the case/control ratio differs between scanners. The model will learn that association. It can be removed if each batch contains an equal number of cases and controls from each scanner.
The cost is that some extreme samples may never be used if we downsample, or that some samples will be used more often if we oversample. The choice between the two could be an option.
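
As an illustration of the scanner example and the downsampling/oversampling trade-off, the sketch below builds batches with the same number of samples from every (scanner, outcome) cell. It is an assumption-laden sketch, not the attached implementation: `stratified_batches`, `per_cell`, and `mode` are hypothetical names, and a cell that is entirely absent from the data (say, a scanner with no cases) is simply not represented.

```python
import random
from collections import defaultdict


def stratified_batches(scanners, outcomes, per_cell=1, mode="down", seed=0):
    """Build batches with `per_cell` samples from each (scanner, outcome) cell.

    mode="down": every cell is cut to the size of the smallest cell.
    mode="over": every cell is topped up, with replacement, to the largest cell.
    """
    if mode not in ("down", "over"):
        raise ValueError("mode must be 'down' or 'over'")
    rng = random.Random(seed)

    # Group sample indices by (scanner, outcome).
    cells = defaultdict(list)
    for idx, key in enumerate(zip(scanners, outcomes)):
        cells[key].append(idx)

    sizes = [len(members) for members in cells.values()]
    target = min(sizes) if mode == "down" else max(sizes)

    pools = {}
    for key, members in cells.items():
        if mode == "down":
            # Some samples of the larger cells are never used.
            pools[key] = rng.sample(members, target)
        else:
            # Every sample appears once; small cells get repeated samples.
            extra = rng.choices(members, k=target - len(members))
            pools[key] = members + extra
            rng.shuffle(pools[key])

    batches = []
    for start in range(0, target - per_cell + 1, per_cell):
        batch = []
        for key in sorted(pools):
            batch.extend(pools[key][start:start + per_cell])
        batches.append(batch)
    return batches


# Usage: scanner A is mostly cases, scanner B mostly controls.
scanners = ["A"] * 10 + ["B"] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
print(len(stratified_batches(scanners, outcomes, mode="down")))
print(len(stratified_batches(scanners, outcomes, mode="over")))
```

With `mode="down"` the epoch is short and discards data from the large cells; with `mode="over"` the epoch is longer and repeats samples from the small cells.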
I attach example code showing how this could be implemented.

balanced.ipynb.zip
