
Optimize scikit-learn pipeline and add --scikit_model_name flag #37

Open

vile319 wants to merge 2 commits into Gleghorn-Lab:main from vile319:scikit-pipeline-optimization

Conversation


vile319 commented on Feb 10, 2026

Optimizes the scikit-learn pipeline to handle large datasets and adds the ability to skip LazyPredict when a model is already known.

Changes
main.py

  • When --scikit_model_name is specified, skips LazyPredict and goes directly to hyperparameter tuning + training
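
As a rough illustration of that branch, here is a minimal sketch assuming hypothetical helper names run_lazy_predict and tune_and_train (the real identifiers in main.py may differ):

# Illustrative control flow only; helper names are placeholders, not the
# actual functions in this repository.
def run_scikit_pipeline(args, X_train, y_train, X_test, y_test):
    if args.scikit_model_name:
        # A model is already known: skip LazyPredict and go straight to
        # hyperparameter tuning + training of the named estimator.
        return tune_and_train(
            args.scikit_model_name, X_train, y_train,
            n_iter=args.scikit_n_iter, n_jobs=args.n_jobs,
        )
    # Otherwise rank candidates with LazyPredict first, then tune the best one.
    best_name = run_lazy_predict(X_train, y_train, X_test, y_test)
    return tune_and_train(best_name, X_train, y_train,
                          n_iter=args.scikit_n_iter, n_jobs=args.n_jobs)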

lazy_predict.py

  • Precompute preprocessing once instead of refitting the StandardScaler/Imputer for every model (see the sketch after this list)
  • Added n_jobs=-1 to parallelizable models (RandomForest, etc.)
  • Removed slow models from LazyPredict: SVC, NuSVC, AdaBoost, KNeighbors, DecisionTree, LDA/QDA, etc.
  • Correctly registered XGBoost/LightGBM in the model dictionaries
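
A rough sketch of the preprocessing refactor, assuming X_train/X_test/y_train/y_test are already in memory; the model dictionary shown is a placeholder, not the actual list in lazy_predict.py. The imputer and scaler are fit once on the training split and the transformed arrays are reused for every candidate, instead of being refit inside each model's pipeline.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Fit preprocessing a single time, then reuse the transformed arrays.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
X_train_pre = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_pre = scaler.transform(imputer.transform(X_test))

# Parallelizable estimators get n_jobs=-1; XGBClassifier/LGBMClassifier would be
# registered here in the same way.
models = {
    "RandomForestClassifier": RandomForestClassifier(n_jobs=-1),
    "LogisticRegression": LogisticRegression(n_jobs=-1, max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_pre, y_train)
    scores[name] = model.score(X_test_pre, y_test)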

scikit_classes.py

  • Fixed --scikit_model_name CLI arg mapping to model_name
  • Added hyperparameter tuning when using --scikit_model_name directly
  • Added verbose logging to RandomizedSearchCV
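
A minimal sketch of that direct tuning path, assuming an XGBClassifier target and placeholder parameter distributions (the real search space and defaults live in scikit_classes.py):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Placeholder search space for illustration only.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 12),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.5, 0.5),
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=10,       # mirrors --scikit_n_iter 10 from the usage example below
    cv=5,
    n_jobs=-1,
    verbose=2,       # verbose logging of each candidate fit
    random_state=42,
)
search.fit(X_train, y_train)  # X_train/y_train assumed to be precomputed embeddings
print(search.best_params_, search.best_score_)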

New Usage
Full pipeline (LazyPredict → best model → hyperparameter tuning)
python main.py --model_names ESMC-600 --data_names gold-ppi --embedding_pooling_types mean var --use_scikit --n_jobs -1

Skip LazyPredict, go straight to XGBoost tuning
python main.py --model_names ESMC-600 --data_names gold-ppi --embedding_pooling_types mean var --use_scikit --n_jobs -1 --scikit_model_name XGBClassifier --scikit_n_iter 10

