ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123
Conversation
Codecov Report
@@             Coverage Diff              @@
##             master     #123      +/-   ##
============================================
+ Coverage     80.14%   80.37%   +0.22%
- Complexity      428      430       +2
============================================
  Files            43       43
  Lines          1516     1513       -3
  Branches        138      138
============================================
+ Hits           1215     1216       +1
+ Misses          224      222       -2
+ Partials         77       75       -2
    throw new LightGBMException();
}

swigTrainData.destroySwigTrainLabelDataArray();
I think we shouldn't remove this line; instead we should do:
swigTrainData.initSwigTrainLabelDataArray(); // Init and copy from chunked data.
try {
(...)
} finally {
swigTrainData.destroySwigTrainLabelDataArray();
}
The reason is that initSwigTrainLabelDataArray() actually creates a new array and clones the chunks, and, if I understood the code correctly, calling this method twice without calling destroySwigTrainLabelDataArray() in between will leak an array of data.
We can think of an improvement by having a closeable class for this case too.
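As a hedged sketch of that improvement (a hypothetical wrapper class; only the SWIGTrainData init/destroy methods come from this discussion):

// Hypothetical AutoCloseable wrapper so try-with-resources guarantees the
// destroy call even if an exception is thrown in between.
final class SwigTrainLabelDataArray implements AutoCloseable {

    private final SWIGTrainData swigTrainData;

    SwigTrainLabelDataArray(final SWIGTrainData swigTrainData) {
        this.swigTrainData = swigTrainData;
        swigTrainData.initSwigTrainLabelDataArray(); // Init and copy from chunked data.
    }

    @Override
    public void close() {
        swigTrainData.destroySwigTrainLabelDataArray();
    }
}

Usage would then collapse the try/finally into:

try (SwigTrainLabelDataArray labels = new SwigTrainLabelDataArray(swigTrainData)) {
    (...)
}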
Yes, this is the same situation as above, @mlobofeedzai, and it should not be removed for the same reasons. I'd say put a comment in the same manner as in the other case so there are no doubts as to why it's there.
I agree with @gandola: changing this to a try/finally is more idiomatic and better behaved in case of prior exceptions 👌.
}

swigTrainData.initSwigDatasetHandle();
swigTrainData.destroySwigTrainFeaturesChunkedDataArray();
@AlbertoEAF is there any reason for calling destroySwigTrainFeaturesChunkedDataArray() here? It looks to be an array that lives for the entire lifecycle of a SWIGTrainData instance, and it is released in the close method.
If that is the case, to simplify and be safer, I would remove this destroySwigTrainFeaturesChunkedDataArray() call, because it may not be necessary and can cause confusion.
Yes, it's wrong to delete this line: it would unnecessarily duplicate RAM usage on the heaviest part of the job. Maybe add an inline comment ("// Critical to save RAM. Do not remove!") along with a comment above explaining why it's here in the first place.
Transferring train data to a LightGBM Dataset has several stages:
- Transfer data from a Pulse (OpenML) Dataset iterator (which doesn't know how large the dataset is), instance by instance, into our ChunkedArrays. A ChunkedArray works as a dynamically growing array of buffers, automatically managed on the C++ side.
- Create the LGBM Dataset data structure from those ChunkedArrays (buffers). In this stage we hit peak RAM usage (2 copies of the dataset => ~2x memory usage).
- Delete the buffers as soon as possible (destroySwigTrainFeaturesChunkedDataArray()). Now we only have the train data in a LightGBM Dataset (we're back to a single copy of the train dataset 👍).

Finally we can train the LGBM model from the LGBM Dataset, which can still use ample memory during training, but will never reach the 2x memory usage factor.
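As a rough illustration of those stages, here is a hedged sketch; only initSwigDatasetHandle() and destroySwigTrainFeaturesChunkedDataArray() come from this PR, while every other class, method, and helper name is hypothetical:

import java.util.Iterator;

// Hypothetical sketch of the staged train-data transfer described above.
final class TrainDataTransferSketch {

    static void transfer(final Iterator<double[]> instances, final SWIGTrainData swigTrainData) {
        // Stage 1: stream instances one by one into the C++-managed ChunkedArrays
        // (the iterator doesn't know the dataset size up front).
        while (instances.hasNext()) {
            swigTrainData.addFeaturesInstance(instances.next()); // hypothetical helper
        }

        // Stage 2: create the LightGBM Dataset from the chunked buffers and grab
        // its handle. Peak RAM here: the data lives both in the ChunkedArrays
        // and in the Dataset (~2x memory usage).
        createDatasetFromChunkedArrays(swigTrainData); // hypothetical helper
        swigTrainData.initSwigDatasetHandle();

        // Stage 3: free the buffers as soon as possible so only the Dataset copy
        // remains. Critical to save RAM. Do not remove!
        swigTrainData.destroySwigTrainFeaturesChunkedDataArray();
    }

    private static void createDatasetFromChunkedArrays(final SWIGTrainData swigTrainData) {
        // Elided: the LGBM_DatasetCreate*-style call made through the SWIG bindings.
    }
}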
    logger.error("Could not create LightGBM dataset.");
    throw new LightGBMException();
}
// FIXME is this init necessary?
Yes, this is necessary. It's here that we go from a void ** to an LGBM DatasetHandle (https://lightgbm.readthedocs.io/en/latest/C-API.html#c.DatasetHandle) so we can pass it around with SWIG.
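For illustration, a hedged sketch of that conversion; the field names are hypothetical, and voidpp_value() is assumed to be one of the SWIG-generated pointer helpers (the actual generated name may differ):

// Hypothetical sketch: dereference the SWIG void ** output parameter into a
// single void * handle that stands in for the C API's DatasetHandle.
void initSwigDatasetHandle() {
    // swigOutDatasetHandlePtr (hypothetical field) is the SWIGTYPE_p_p_void
    // filled in by the LGBM_DatasetCreate* call; voidpp_value() dereferences it.
    this.swigDatasetHandle = lightgbmlib.voidpp_value(this.swigOutDatasetHandlePtr);
}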
The last commit broke the build. Address later.
Because we believe these non-closed/destroyed C++ objects were causing the segmentation faults.
Mostly authored by @gandola