
Non-deterministic (deterministic w.r.t. seed) RangeError during model training on small datasets #47

@dorox1

Description


I encountered the following error during Random Forest model training in my testing environment. The error is deterministic with respect to the seed, but otherwise appears stochastic for identical inputs, and it only occurs for very small datasets.

RangeError: column indices are out of range

      at checkColumnIndices (node_modules/ml-matrix/matrix.js:1013:13)
      at new MatrixColumnSelectionView (node_modules/ml-matrix/matrix.js:3375:5)
      at RandomForestRegression.train (node_modules/ml-random-forest/random-forest.js:352:20)

Steps to Reproduce

This is not reliably reproducible from a single run, as it depends on the (seeded) random selection of bags. Running a sufficient number of sequential tests with small datasets (2-4 samples with only a few features) is likely to reproduce it quickly.
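To illustrate the seed dependence without pulling in the library, here is a minimal standalone sketch. It uses mulberry32 as a stand-in seeded PRNG (not the generator ml-random-forest actually uses) and checks, per seed, whether a bag of n draws with replacement covers all n samples, which is exactly the condition that empties the OOB array:

```javascript
// Hypothetical sketch: mulberry32 is a common seedable PRNG, used here
// only to demonstrate that the failure condition is fixed by the seed.
function mulberry32(seed) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns true when a seeded bag of n draws (with replacement) selects
// every one of the n samples, leaving the out-of-bag set empty.
function oobIsEmpty(n, seed) {
  const rng = mulberry32(seed);
  const inBag = new Set();
  for (let i = 0; i < n; i++) inBag.add(Math.floor(rng() * n));
  return inBag.size === n;
}

// Scan a few seeds for a 3-sample dataset: some seeds trigger the
// empty-OOB case and some do not, but each seed's outcome is stable
// across runs.
for (let seed = 0; seed < 10; seed++) {
  console.log(seed, oobIsEmpty(3, seed));
}
```

Re-running the scan always prints the same outcome for each seed, matching the "deterministic w.r.t. seed" behaviour described above.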

Expected Behaviour

The RandomForestBase.train function should succeed deterministically for valid input datasets.

Actual Behaviour

The function fails a small percentage of the time with the error above for small training datasets.

Cause

There is a bug in RandomForestBase.train when useSampleBagging = true that mostly shows up for very small training datasets. examplesBaggingWithReplacement selects a bag of n samples with replacement from the n training samples. It then builds the "out of bag" (OOB) array from the unselected samples. Each bag therefore has an n!/n^n chance of containing every sample, in which case an empty OOB array is returned.

This is vanishingly unlikely for n > 10, which is probably why it has never been caught in production environments, but it has a reasonably high chance of causing issues in small toy/testing environments (where n <= 6).
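The n!/n^n figure can be verified by brute force. The sketch below (standalone, not library code) enumerates every possible bag of n draws with replacement and counts the fraction that cover all n indices, i.e. that leave the OOB set empty:

```javascript
// Exhaustively enumerate all n^n possible bags of n samples drawn with
// replacement from n samples, and count those that leave the OOB set
// empty. The OOB set is empty exactly when the bag covers all n
// indices, which happens with probability n!/n^n.
function emptyOobProbability(n) {
  let emptyOob = 0;
  const total = n ** n;
  for (let bag = 0; bag < total; bag++) {
    // Decode the bag number as n base-n digits, one digit per draw.
    const seen = new Set();
    let code = bag;
    for (let draw = 0; draw < n; draw++) {
      seen.add(code % n);
      code = Math.floor(code / n);
    }
    if (seen.size === n) emptyOob++; // every sample selected -> OOB empty
  }
  return emptyOob / total;
}

console.log(emptyOobProbability(2)); // 0.5      (2!/2^2)
console.log(emptyOobProbability(3)); // ~0.2222  (3!/3^3 = 6/27)
console.log(emptyOobProbability(4)); // 0.09375  (4!/4^4 = 24/256)
```

For n = 10 the probability is already down to about 0.00036, which matches the observation that only tiny datasets hit this in practice.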

This does not trigger an error inside examplesBaggingWithReplacement itself, but it causes train to fail later in an opaque way, because part of the training dataset appears to be empty.

Possible Fix

This could be fixed in a few ways:

  1. The bagging process in examplesBaggingWithReplacement could restart if there are zero OOB samples remaining after bagging. This does risk failing for datasets of size 1.
  2. The bagging process in examplesBaggingWithReplacement could pick an OOB sample before bagging, guaranteeing the OOB array contains at least one element. This also risks failing for datasets of size 1.
  3. The train function could skip the OOB prediction process (treating it as if this.useSampleBagging === false) if Xoob.length === 0, similar to how collectOOB is skipped later in the function.
  4. An error check could be added after bagging to ensure the OOB array isn't empty.

I personally suggest option 3, since a similar skip is already implemented later in the function and it doesn't risk changing the determinism of the training process itself.
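As a rough sketch of what option 3 could look like, here is a self-contained mock of the bagging step and one estimator's training. The names bagWithReplacement and trainOneEstimator are illustrative stand-ins for the library's examplesBaggingWithReplacement and train internals, not its real API:

```javascript
// Standalone sketch of fix 3; not ml-random-forest's actual code.
function bagWithReplacement(n, rng) {
  const inBag = new Set();
  const bag = [];
  for (let i = 0; i < n; i++) {
    const idx = Math.floor(rng() * n);
    bag.push(idx);
    inBag.add(idx);
  }
  // OOB indices are the samples never drawn into the bag.
  const oob = [...Array(n).keys()].filter((i) => !inBag.has(i));
  return { bag, oob };
}

function trainOneEstimator(n, rng) {
  const { bag, oob } = bagWithReplacement(n, rng);
  // Fix 3: only run the OOB selection/prediction when OOB samples
  // exist, mirroring how collectOOB is already skipped in train().
  // Per the stack trace, building a column selection from an empty
  // OOB index list is what ends up throwing the RangeError.
  const oobPredictions =
    oob.length > 0 ? oob.map((i) => i /* placeholder prediction */) : [];
  return { bag, oobPredictions };
}

// Worst case: an rng that selects every index exactly once leaves the
// OOB set empty, and training now completes instead of throwing.
let k = 0;
const coveringRng = () => (k++ % 4) / 4; // yields indices 0,1,2,3 for n = 4
console.log(trainOneEstimator(4, coveringRng).oobPredictions.length); // 0
```

The guard keeps the bagging step itself untouched, so the sequence of random draws (and hence the trained trees) is identical with or without the fix.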

Environment

Package: ml-random-forest
Version: 2.1.0
