
Non-deterministic (deterministic w.r.t. seed) RangeError during model training on small datasets #47

@dorox1

Description


I encountered the following error during Random Forest model training in my testing environment. The error is deterministic with respect to the seed, but otherwise appears stochastic for identical inputs, and it only occurs for very small datasets.

RangeError: column indices are out of range

      at checkColumnIndices (node_modules/ml-matrix/matrix.js:1013:13)
      at new MatrixColumnSelectionView (node_modules/ml-matrix/matrix.js:3375:5)
      at RandomForestRegression.train (node_modules/ml-random-forest/random-forest.js:352:20)

Steps to Reproduce

This is not reliably reproducible from a single run, as it depends on the (seeded) random selection of bags. Running a sufficient number of sequential tests with small datasets (2-4 samples with only a few features) is likely to reproduce it quickly.
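To illustrate the seed dependence without pulling in the library, here is a minimal standalone sketch. It uses mulberry32 as a stand-in seeded PRNG (not the generator ml-random-forest actually uses) and checks, per seed, whether a bag of n draws with replacement covers all n samples, which is exactly the condition that empties the OOB array:

```javascript
// Hypothetical sketch: mulberry32 is a common seedable PRNG, used here
// only to demonstrate that the failure condition is fixed by the seed.
function mulberry32(seed) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns true when a seeded bag of n draws (with replacement) selects
// every one of the n samples, leaving the out-of-bag set empty.
function oobIsEmpty(n, seed) {
  const rng = mulberry32(seed);
  const inBag = new Set();
  for (let i = 0; i < n; i++) inBag.add(Math.floor(rng() * n));
  return inBag.size === n;
}

// Scan a few seeds for a 3-sample dataset: some seeds trigger the
// empty-OOB case and some do not, but each seed's outcome is stable
// across runs.
for (let seed = 0; seed < 10; seed++) {
  console.log(seed, oobIsEmpty(3, seed));
}
```

Re-running the scan always prints the same outcome for each seed, matching the "deterministic w.r.t. seed" behaviour described above.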

Expected Behaviour

The RandomForestBase.train function should succeed deterministically for valid input datasets.

Actual Behaviour

The function fails a small percentage of the time with the error above for small training datasets.

Cause

There is a bug in RandomForestBase.train when useSampleBagging = true that mostly shows up for very small training datasets. examplesBaggingWithReplacement selects a bag of n samples with replacement from the n training samples. It then builds the "out of bag" (OOB) array from the unselected samples. Each bag therefore has an n!/n^n chance of containing every sample, in which case an empty OOB array is returned.

This is vanishingly unlikely for n > 10, which is probably why it has never been caught in production environments, but it has a reasonably high chance of causing issues in small toy/testing environments (where n <= 6).
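The n!/n^n figure can be verified by brute force. The sketch below (standalone, not library code) enumerates every possible bag of n draws with replacement and counts the fraction that cover all n indices, i.e. that leave the OOB set empty:

```javascript
// Exhaustively enumerate all n^n possible bags of n samples drawn with
// replacement from n samples, and count those that leave the OOB set
// empty. The OOB set is empty exactly when the bag covers all n
// indices, which happens with probability n!/n^n.
function emptyOobProbability(n) {
  let emptyOob = 0;
  const total = n ** n;
  for (let bag = 0; bag < total; bag++) {
    // Decode the bag number as n base-n digits, one digit per draw.
    const seen = new Set();
    let code = bag;
    for (let draw = 0; draw < n; draw++) {
      seen.add(code % n);
      code = Math.floor(code / n);
    }
    if (seen.size === n) emptyOob++; // every sample selected -> OOB empty
  }
  return emptyOob / total;
}

console.log(emptyOobProbability(2)); // 0.5      (2!/2^2)
console.log(emptyOobProbability(3)); // ~0.2222  (3!/3^3 = 6/27)
console.log(emptyOobProbability(4)); // 0.09375  (4!/4^4 = 24/256)
```

For n = 10 the probability is already down to about 0.00036, which matches the observation that only tiny datasets hit this in practice.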

This does not trigger an error inside examplesBaggingWithReplacement itself, but it causes train to fail later in an opaque way, because part of the training dataset appears to be empty.

Possible Fix

This could be fixed in a few ways:

  1. The bagging process in examplesBaggingWithReplacement could restart if there are zero OOB samples remaining after bagging. This does risk failing for datasets of size 1.
  2. The bagging process in examplesBaggingWithReplacement could pick an OOB sample before bagging, guaranteeing the OOB array contains at least one element. This also risks failing for datasets of size 1.
  3. The train function could skip the OOB prediction process (treating it as if this.useSampleBagging === false) if Xoob.length === 0, similar to how collectOOB is skipped later in the function.
  4. An error check could be added after bagging to ensure the OOB array isn't empty.

I personally suggest option 3, since a similar skip is already implemented later in the function and it doesn't risk changing the determinism of the training process itself.
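As a rough sketch of what option 3 could look like, here is a self-contained mock of the bagging step and one estimator's training. The names bagWithReplacement and trainOneEstimator are illustrative stand-ins for the library's examplesBaggingWithReplacement and train internals, not its real API:

```javascript
// Standalone sketch of fix 3; not ml-random-forest's actual code.
function bagWithReplacement(n, rng) {
  const inBag = new Set();
  const bag = [];
  for (let i = 0; i < n; i++) {
    const idx = Math.floor(rng() * n);
    bag.push(idx);
    inBag.add(idx);
  }
  // OOB indices are the samples never drawn into the bag.
  const oob = [...Array(n).keys()].filter((i) => !inBag.has(i));
  return { bag, oob };
}

function trainOneEstimator(n, rng) {
  const { bag, oob } = bagWithReplacement(n, rng);
  // Fix 3: only run the OOB selection/prediction when OOB samples
  // exist, mirroring how collectOOB is already skipped in train().
  // Per the stack trace, building a column selection from an empty
  // OOB index list is what ends up throwing the RangeError.
  const oobPredictions =
    oob.length > 0 ? oob.map((i) => i /* placeholder prediction */) : [];
  return { bag, oobPredictions };
}

// Worst case: an rng that selects every index exactly once leaves the
// OOB set empty, and training now completes instead of throwing.
let k = 0;
const coveringRng = () => (k++ % 4) / 4; // yields indices 0,1,2,3 for n = 4
console.log(trainOneEstimator(4, coveringRng).oobPredictions.length); // 0
```

The guard keeps the bagging step itself untouched, so the sequence of random draws (and hence the trained trees) is identical with or without the fix.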

Environment

Package: ml-random-forest
Version: 2.1.0
