HistologyMultiInstanceLearning

Histology Multi-Instance Learning Pipeline

Nextflow pipeline to evaluate MIL architectures in combination with histology foundation models.

Multi-Instance Learning (MIL) pipeline for histopathology to evaluate different MIL architectures (ABMIL, CLAM, DSMIL, etc.) using pre-extracted features from foundation models (for example, uni_v2, virchow2).

The workflow is implemented in Nextflow DSL2 and uses containers (Wave/Singularity) to run both the Python part (MIL training and grid search) and the R part (visualizations).


Pipeline overview

  • main.nf
    Orchestrates the pipeline:

    • Reads the clinical/dataset file (params.dataset).
    • Reads the list of feature extractors from params/feature_extractors.csv (automatically loaded).
    • Reads the list of MIL architectures from params/architectures.csv (automatically loaded).
    • Uses params.features_dir to construct feature directory paths.
    • Launches:
      • split_dataset: splits the dataset into train/val/test folds for cross-validation at the case level.
      • grid_search: runs grid-search for each feature_extractor × MIL architecture combination with cross-validation.
      • concat_results: concatenates all test metrics into a single summary file.
      • boxplot_auc: generates a global performance boxplot (ROC AUC).
      • roc_auc_curve: generates ROC AUC curves for each configuration.
      • heatmap_workflow:
        • select_best_config: selects the best configuration based on validation AUC.
        • predict: generates attention scores and predictions for the best model.
        • heatmap: creates heatmap visualizations for top-k patches.
        • convert_tiff: converts heatmaps to TIFF format.
  • modules/grid_search.nf

    • process split_dataset: runs histomil-splits to create train/val/test splits for cross-validation at the case level.
    • process grid_search: runs histomil-grid for each feature_extractor × MIL architecture combination and publishes:
      • test_results_*.csv (test set metrics per fold)
      • predictions_*.csv (test set predictions per fold)
    • process concat_results: concatenates all test metrics into a single summary.csv file.
  • modules/plots.nf

    • process boxplot_auc: generates a boxplot comparing ROC AUC across all configurations using bin/boxplot_auc.R.
    • process roc_auc_curve: generates ROC AUC curves using bin/roc_auc_curve.R.
  • modules/heatmaps.nf

    • process select_best_config: identifies the best hyperparameter configuration based on validation metrics.
    • process predict: runs histomil-predict to generate predictions and attention scores using the best model.
    • process heatmap: runs histomil-heatmap to visualize attention scores as heatmaps on slide images.
    • process convert_tiff: converts generated heatmap images to tiled BigTIFF format using gdal_translate.
  • bin/

    • boxplot_auc.R: reads the summary.csv file and generates a ROC AUC boxplot.png comparing performance across feature extractors and MIL architectures.
    • roc_auc_curve.R: plots ROC curves for model predictions.
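The orchestration in main.nf amounts to taking the cross product of the two configuration CSVs and launching one grid-search job per combination. A minimal Python sketch (not part of the repo, using hypothetical inline data) illustrates the combinations the pipeline enumerates:

```python
import csv
import io
from itertools import product

# Hypothetical contents mirroring params/feature_extractors.csv
# and params/architectures.csv
feature_extractors_csv = """patch_encoder,patch_size,mag,overlap
uni_v2,256,20,0
virchow2,224,20,0
"""
architectures_csv = """architecture
abmil
clam
"""

extractors = list(csv.DictReader(io.StringIO(feature_extractors_csv)))
architectures = [row["architecture"]
                 for row in csv.DictReader(io.StringIO(architectures_csv))]

# One grid_search job per feature_extractor x MIL architecture combination
jobs = [(fe["patch_encoder"], arch)
        for fe, arch in product(extractors, architectures)]
print(jobs)  # 2 extractors x 2 architectures = 4 jobs
```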

Inputs

  • Dataset file (params.dataset)

    • CSV with at least:
      • A case_id column to identify cases (patients) for case-level splitting.
      • A slide_id column to link samples with feature files.
      • A target column (specified by params.target, e.g., target, ESR1, MKI67).
    • Example structure:
      case_id,slide_id,target
      case_1,slide_1,0
      case_1,slide_2,0
      case_2,slide_3,1
      case_2,slide_4,1
      ...
  • Feature extractors configuration (params/feature_extractors.csv)

    • CSV file automatically loaded by the pipeline (located in params/ directory).
    • Required columns:
      • patch_encoder: patch-level encoder name (e.g. uni_v2, virchow2).
      • patch_size: patch size in pixels (e.g. 256, 224).
      • mag: magnification level (e.g. 20).
      • overlap: overlap in pixels (e.g. 0).
    • Example:
      patch_encoder,patch_size,mag,overlap
      uni_v2,256,20,0
      virchow2,224,20,0
  • MIL architectures configuration (params/architectures.csv)

    • CSV file automatically loaded by the pipeline (located in params/ directory).
    • Required columns:
      • architecture: MIL architecture name (e.g. abmil, clam, dsmil, dftd, ilra, rrt, transformer, transmil, wikg).
    • Example:
      architecture
      abmil
      clam
      dsmil
      dftd
      ilra
      rrt
      transformer
      transmil
      wikg
  • Features directory (params.features_dir)

    • Base directory path where feature directories are located.
    • Feature directories follow the pattern: {features_dir}{mag}x_{patch_size}px_{overlap}px_overlap/features_{patch_encoder}/
    • Each feature directory should contain one .h5 file per slide (named {slide_id}.h5).
    • Each H5 file should contain:
      • features: Array of shape (num_patches, feature_dim)
      • Optionally: coords: Array of patch coordinates
  • Slides directory (params.slides_dir)

    • Base directory where whole-slide image (WSI) directories are located.
  • Pipeline parameters (YAML files in params/)

    • The key parameters are:

      • dataset: path to the CSV with case_id, slide_id, and target columns.
      • features_dir: base directory path where feature directories are located.
      • slides_dir: base directory path where WSIs are located.
      • outdir: output directory for this run (default: ./results/).
      • target: column name of the target variable (e.g., target, ESR1, MKI67).
      • task: "classification" (currently only classification is supported).
    • Example:

      • HRR ER classification (params/params_hrr_er.yml):
        dataset: '/path/to/class_dataset_er.csv'
        features_dir: "/path/to/features/base/directory/"
        slides_dir: "/path/to/slides/base/directory/"
        outdir: "./results_hrr_er/"
        target: "target"
        task: "classification"
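The feature-directory layout described above follows a fixed naming pattern, so for any row of params/feature_extractors.csv the pipeline can locate the matching .h5 files. A small illustrative sketch (not the pipeline's actual code) of that path construction:

```python
# Sketch: building the feature directory path from the documented pattern
# {features_dir}{mag}x_{patch_size}px_{overlap}px_overlap/features_{patch_encoder}/
def feature_dir(features_dir, patch_encoder, patch_size, mag, overlap):
    return (f"{features_dir}{mag}x_{patch_size}px_{overlap}px_overlap/"
            f"features_{patch_encoder}/")

def feature_file(features_dir, patch_encoder, patch_size, mag, overlap, slide_id):
    # One H5 file per slide, named {slide_id}.h5
    return feature_dir(features_dir, patch_encoder,
                       patch_size, mag, overlap) + f"{slide_id}.h5"

path = feature_dir("/data/features/", "uni_v2", 256, 20, 0)
print(path)  # /data/features/20x_256px_0px_overlap/features_uni_v2/
```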

Outputs

All outputs are written under params.outdir (configured in the selected params file):

  • Training results

    • training/
      • summary.csv (concatenated test metrics from all feature extractors and MIL architectures).
      • {feature_extractor}.{mil}/
        • test_results_{feature_extractor}.{mil}.csv with metrics per fold.
      • Classification metrics: test_auc, test_acc, test_f1, test_precision, test_recall.
  • Predictions

    • predictions/
      • {feature_extractor}.{mil}/
        • predictions_{feature_extractor}.{mil}_{fold}.csv with slide_id, y_true, y_pred, y_score (probability for the positive class).
  • Splits

    • splits/
      • {target}/
        • dataset.csv (processed dataset with case_id, slide_id, and label columns).
        • splits_{fold}_bool.csv (boolean splits for each fold with train/val/test columns).
        • splits_{fold}_descriptor.csv (summary statistics for each split).
  • Plots

    • plots/
      • boxplot.png: Distribution of ROC AUC by feature_extractor and mil architecture.
      • *.roc_auc.png: ROC AUC curves for each configuration.
  • Heatmaps

    • heatmaps/{feature_extractor}.{mil}/
      • attention_scores/: H5 files containing attention scores.
      • predictions.csv: Predictions for the best model.
      • topk_patches/:
        • {slide_id}/heatmap_*.png: Attention heatmap overlay.
        • {slide_id}/topk_patches/top_*.png: Highest attention patches.
      • tiff/: Converted BigTIFF heatmaps.
  • Pipeline information

    • pipeline_info/ (timeline, report, trace, DAG HTML) generated automatically by Nextflow.
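Since summary.csv concatenates the per-fold test metrics from every configuration, a typical downstream step is aggregating it per feature_extractor × architecture. A hedged sketch with synthetic rows (the exact column names of summary.csv may differ; the documented metric columns include test_auc):

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the shape of training/summary.csv
summary_csv = """feature_extractor,architecture,fold,test_auc
uni_v2,abmil,0,0.81
uni_v2,abmil,1,0.79
virchow2,clam,0,0.85
virchow2,clam,1,0.87
"""

aucs = defaultdict(list)
for row in csv.DictReader(io.StringIO(summary_csv)):
    aucs[(row["feature_extractor"], row["architecture"])].append(float(row["test_auc"]))

# Mean test AUC per configuration, averaged over folds
means = {cfg: sum(v) / len(v) for cfg, v in aucs.items()}
print(means)
```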

Requirements

  • Nextflow ≥ 22.x
  • Access to Singularity/Wave containers (configured in nextflow.config).
  • Cluster with SLURM if using the kutral profile (default in this repo).

Basic usage

  1. Load the environment where Nextflow and Singularity are available.
  2. Build the Singularity container for HistoMILTrainer: Navigate to the singularity/ directory and build the container image:
    cd singularity/
    singularity build histomil.sif histomil.def
    This will create the histomil.sif image that will be used by Nextflow to run the pipeline processes.
  3. Configure feature extractors: Ensure params/feature_extractors.csv exists and contains the feature extractor configurations you want to evaluate.
  4. Configure MIL architectures: Ensure params/architectures.csv exists and contains the MIL architectures you want to evaluate.
  5. Choose or edit a params file in params/ directory:
    • Set dataset: path to your CSV with case_id, slide_id, and target columns.
    • Set features_dir: base directory where feature directories are located.
    • Set target: column name of the target variable (e.g., target, ESR1, MKI67).
    • Set outdir: output directory for this run.
    • Set task: "classification" (currently only classification is supported).
  6. Run the pipeline:
# HRR ER classification
nextflow run main.nf -profile kutral -params-file params/params_hrr_er.yml

# MKI67 classification
nextflow run main.nf -profile kutral -params-file params/params_mki67_class.yml

For local execution (without SLURM), you can use the local profile defined in nextflow.config:

nextflow run main.nf -profile local -params-file params/params_hrr_er.yml

Supported MIL architectures

The pipeline supports multiple state-of-the-art MIL architectures from MIL-Lab:

  • ABMIL: Attention-based Multiple Instance Learning
  • CLAM: Clustering-constrained Attention Multiple Instance Learning
  • DSMIL: Dual-stream Multiple Instance Learning
  • DFTD: Double-Tier Feature Distillation MIL (DTFD-MIL)
  • ILRA: Instance-Level Representation Aggregation
  • RRT: Re-embedded Regional Transformer (R²T-MIL)
  • Transformer: Transformer-based MIL
  • TransMIL: Transformer-based Correlated Multiple Instance Learning
  • WIKG: Weighted Instance Knowledge Graph

Each architecture can be configured via JSON files in bin/HistoMILTrainer/configs/. The pipeline uses 3-fold cross-validation by default (configurable in grid_search.py).

Note: CLAM automatically sets batch_size to 1 during training. Make sure MIL-Lab is properly installed and accessible in your Python path.


Output directory structure

After running the pipeline, the output directory (params.outdir) will have the following structure:

results/
├── splits/                    # Train/val/test splits
│   ├── target/
│   │   ├── dataset.csv
│   │   ├── splits_0_bool.csv
│   │   ├── splits_0_descriptor.csv
│   │   └── ...
│   └── ...
├── training/                  # Training results
│   ├── summary.csv            # Concatenated summary
│   ├── {feature_extractor}.{mil}/
│   │   └── test_results_{feature_extractor}.{mil}.csv
│   └── ...
├── predictions/               # Test set predictions
│   ├── {feature_extractor}.{mil}/
│   │   ├── predictions_{feature_extractor}.{mil}_0.csv
│   │   ├── predictions_{feature_extractor}.{mil}_1.csv
│   │   └── ...
│   └── ...
├── plots/                     # Generated plots
│   ├── boxplot.png            # ROC AUC comparison boxplot
│   └── *.roc_auc.png          # ROC curves
├── heatmaps/                  # Attention heatmaps and predictions
│   ├── {feature_extractor}.{mil}/
│   │   ├── attention_scores/
│   │   ├── predictions.csv
│   │   ├── topk_patches/
│   │   └── tiff/
│   └── ...
└── pipeline_info/              # Nextflow execution reports
    ├── execution_report_*.html
    ├── execution_timeline_*.html
    ├── execution_trace_*.txt
    └── pipeline_dag_*.html

Tips and best practices

  1. Feature extractor configuration: Make sure the patch_encoder, patch_size, mag, and overlap values in params/feature_extractors.csv match the directory structure in your features_dir.

  2. Case-level splitting: The pipeline splits data at the case level to prevent data leakage. Multiple slides from the same case will always be in the same split (train/val/test).

  3. Cross-validation: The pipeline uses 3-fold cross-validation by default (configurable in grid_search.py). Each fold generates separate test metrics and predictions.

  4. Memory and GPU requirements: Grid search processes can be memory and GPU-intensive. The default configuration allocates 80G memory, 16 CPUs, and 1 GPU for grid search processes. Adjust in nextflow.config if needed.

  5. Resume execution: Nextflow supports resuming failed runs. Use -resume flag:

    nextflow run main.nf -profile kutral -params-file params/params_hrr_er.yml -resume
  6. Feature format: Features should be pre-extracted and stored in H5 format. Each slide should have a corresponding {slide_id}.h5 file containing the features array.
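The case-level splitting described in tip 2 can be illustrated with a short sketch (this is not the pipeline's actual histomil-splits implementation; the round-robin assignment is purely illustrative). The key invariant is that all slides belonging to one case land in the same fold, so no case leaks between train, validation, and test:

```python
# (case_id, slide_id) pairs, as in the dataset CSV
rows = [
    ("case_1", "slide_1"), ("case_1", "slide_2"),
    ("case_2", "slide_3"), ("case_2", "slide_4"),
    ("case_3", "slide_5"),
]

n_folds = 3
cases = sorted({case for case, _ in rows})
# Assign each CASE (not each slide) to a fold, round-robin for illustration
case_fold = {case: i % n_folds for i, case in enumerate(cases)}

# Every slide inherits the fold of its case
slide_fold = {slide: case_fold[case] for case, slide in rows}

# Slides from the same case always share a fold: no case-level leakage
assert slide_fold["slide_1"] == slide_fold["slide_2"]
print(slide_fold)
```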


Citation

If you use this pipeline in your research, please cite:

  • MIL-Lab: The repository containing the MIL architectures used in this pipeline

  • HistoMIL: The library used for training MIL architectures on histology data

  • This pipeline: If you use this Nextflow pipeline, please cite this repository


Contact

Author: Gabriel Cabas
For questions or suggestions, please open an issue or pull request in this repository.
