Description
I encounter the following error during Random Forest model training in my testing environment. The error is deterministic with respect to the seed but otherwise unpredictable for identical inputs, and it only occurs for very small datasets.
```
RangeError: column indices are out of range
    at checkColumnIndices (node_modules/ml-matrix/matrix.js:1013:13)
    at new MatrixColumnSelectionView (node_modules/ml-matrix/matrix.js:3375:5)
    at RandomForestRegression.train (node_modules/ml-random-forest/random-forest.js:352:20)
```
Steps to Reproduce
This is not deterministically reproducible, as it depends on the seeded random selection of bags. Running a sufficient number of sequential tests with small datasets (2-4 samples with only a few features each) is likely to reproduce it quickly.
Expected Behaviour
The `RandomForestBase.train` function should succeed for any valid input dataset, regardless of seed.
Actual Behaviour
The function fails a small percentage of the time with the error above for small training datasets.
Cause
There is a bug in `RandomForestBase.train` when `useSampleBagging = true` that mostly shows up for very small training datasets. `examplesBaggingWithReplacement` selects a bag of n samples with replacement from the n training samples, then builds the "out of bag" (OOB) array from the unselected samples. Each bag therefore has an n!/nⁿ chance of containing every sample, in which case the OOB array comes back empty.

This is vanishingly unlikely for n > 10 (about 0.036% at n = 10), which is probably why it has never been caught in a production environment. It does, however, have a reasonably high chance of causing issues in small toy/testing environments (where n ≤ 6): roughly 1.5% of bags are fully covering at n = 6, and 50% at n = 2.

An empty OOB array does not trigger an error inside `examplesBaggingWithReplacement` itself, but it does cause `train` to fail later on in an opaque way, where part of the training dataset appears empty.
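The failure mode is easy to see with a small simulation. The sketch below is a hypothetical stand-in for `examplesBaggingWithReplacement` (not the library's actual code): it draws n indices with replacement and collects the unselected indices as the OOB set, then measures how often that set comes back empty for small n.

```javascript
// Hypothetical mimic of examplesBaggingWithReplacement: draw n sample
// indices with replacement, then collect the unselected indices as the
// out-of-bag (OOB) set. The `random` parameter is injectable for testing.
function bagWithReplacement(n, random = Math.random) {
  const inBag = new Set();
  for (let i = 0; i < n; i++) {
    inBag.add(Math.floor(random() * n));
  }
  const oob = [];
  for (let i = 0; i < n; i++) {
    if (!inBag.has(i)) oob.push(i);
  }
  return { inBag: [...inBag], oob };
}

// Estimate how often the OOB set is empty for small n.
// The exact probability is n!/n^n: 1/2 for n=2, ~0.22 for n=3, ~0.094 for n=4.
for (const n of [2, 3, 4, 6]) {
  const trials = 100000;
  let empty = 0;
  for (let t = 0; t < trials; t++) {
    if (bagWithReplacement(n).oob.length === 0) empty++;
  }
  console.log(`n=${n}: empty OOB in ${((100 * empty) / trials).toFixed(2)}% of bags`);
}
```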
Possible Fix
This could be fixed in a few ways:
1. The bagging process in `examplesBaggingWithReplacement` could restart if there are zero OOB samples remaining after bagging. This does risk failing for datasets of size 1.
2. The bagging process in `examplesBaggingWithReplacement` could pick an OOB sample before bagging to ensure it has a minimum of one element in it. This also risks failing for datasets of size 1.
3. The `train` function could skip the OOB prediction process (treating it as if `this.useSampleBagging === false`) when `Xoob.length === 0`, similar to how `collectOOB` is skipped later in the function.
4. An error check could be added after bagging to ensure the OOB array isn't empty.
I personally suggest option 3, as a similar skip is already implemented later in the function, and it doesn't risk changing the determinism of the training process itself.
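A minimal sketch of that fix is below. The names (`Xoob`, `useSampleBagging`, the `bagger` callback) follow the issue text, but this is an illustration of the guard, not a patch against the real source:

```javascript
// Hypothetical per-tree training step illustrating fix 3: when bagging
// returns an empty OOB set, fall back to the no-bagging path for this tree
// instead of running OOB predictions on an empty matrix.
function trainOneTree(X, y, useSampleBagging, bagger) {
  let Xbag = X;
  let ybag = y;
  let Xoob = [];
  let yoob = [];
  if (useSampleBagging) {
    ({ Xbag, ybag, Xoob, yoob } = bagger(X, y));
  }
  // Guard: skip OOB prediction when the bag covered every sample.
  const doOob = useSampleBagging && Xoob.length > 0;
  // ... fit the tree on (Xbag, ybag) ...
  if (doOob) {
    // ... run OOB predictions on (Xoob, yoob) ...
  }
  return { Xbag, ybag, oobUsed: doOob };
}
```

The guard mirrors the existing skip of `collectOOB` later in `train`, so it changes no random draws and leaves the fitted trees identical for a given seed.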
Environment
Package: ml-random-forest
Version: 2.1.0