One of the most important aspects of the Active Learning (AL) and Bayesian Optimization (BO) process is deciding how much new data to label before retraining the model. Classical BO methods typically operate sequentially, selecting and labeling a single data point at a time. However, retraining the model after every newly labeled sample can be inefficient:
- Retraining may be computationally expensive or time-consuming
- A single new sample may not significantly improve performance, especially for deep learning models
- Many real-world experimental workflows run multiple experiments in parallel

To address these limitations, batch active learning has been proposed: instead of selecting a single point, batch AL selects multiple samples to be labeled before retraining the model. However, batch selection introduces its own challenges:
- Uncertainty-only sampling: Can lead to biased selections that are not representative of the unlabeled data distribution.
- Diversity-only sampling: Can increase labeling cost by selecting samples with low information content.
An effective batch AL strategy must therefore balance uncertainty (how informative a sample is) and diversity (how representative and non-redundant the batch is). Hence we use Cluster Margin (CM), a simple and practical batch active learning strategy that explicitly accounts for both. The method was originally developed for image classification, but we adapted it to regression problems following the same principle:
- Build a clustering over the pool of candidates using HAC (Hierarchical Agglomerative Clustering) with average linkage
- Rank all samples by prediction margin (based on the acquisition function)
- Select high-uncertainty points from diverse clusters via round-robin sampling (see the sketch below)
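A minimal sketch of these three steps for a regression setting is shown below, assuming an unlabeled candidate pool `X_pool` and a per-sample `uncertainty` score produced by the acquisition function (e.g. the variance of an ensemble of regressors). The function and parameter names (`select_batch_cm`, `n_clusters`, `margin_factor`) are illustrative only and do not correspond to the code in `library/`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def select_batch_cm(X_pool, uncertainty, batch_size, n_clusters=10, margin_factor=5):
    """Cluster-Margin-style batch selection for regression (illustrative sketch)."""
    # Step 1: HAC with average linkage over the candidate pool
    labels = AgglomerativeClustering(n_clusters=n_clusters,
                                     linkage="average").fit_predict(X_pool)

    # Step 2: keep only the most uncertain candidates (the "margin" set)
    n_candidates = min(margin_factor * batch_size, len(X_pool))
    candidates = np.argsort(uncertainty)[::-1][:n_candidates]

    # Group candidates by cluster, each sorted by descending uncertainty
    clusters = {}
    for idx in candidates:
        clusters.setdefault(labels[idx], []).append(int(idx))
    for members in clusters.values():
        members.sort(key=lambda i: uncertainty[i], reverse=True)

    # Step 3: round-robin over clusters, smallest cluster first,
    # taking the most uncertain remaining candidate from each cluster
    order = sorted(clusters, key=lambda c: len(clusters[c]))
    selected = []
    while len(selected) < batch_size and any(clusters.values()):
        for c in order:
            if clusters[c]:
                selected.append(clusters[c].pop(0))
                if len(selected) == batch_size:
                    break
    return np.array(selected)  # indices into X_pool


# Example usage with synthetic data and a made-up uncertainty score
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pool = rng.normal(size=(200, 12))  # 200 unlabeled candidates
    uncertainty = rng.random(200)        # e.g. ensemble variance per candidate
    batch = select_batch_cm(X_pool, uncertainty, batch_size=20)
    print(batch)
```

Sampling clusters in ascending order of size, as in the original CM procedure, keeps large clusters from dominating the batch; `margin_factor` is an assumed knob controlling how many top-uncertainty candidates are retained before diversity is enforced.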
We applied this method to optimize the yield of Colicin M and E1 in proCFPS, and of Colicin M in euCFPS, with minimal experimental effort. Results are published in *An AI-driven workflow for the accelerated optimization of cell-free protein synthesis*.
├── data/
| ├── data 20/ # example data for the AL loop where 20 new data points are selected each round
| └── data 50/ # example data for the AL loop where 50 new data points are selected each round
├── library/ # core implementation code
├── new_exp/ # Folder where the selected new data points for labeling are saved
├── active_20.ipynb # Example usage to run AL-CM and select 20 new experiments
├── active_50.ipynb # Example usage to run AL-CM and select 50 new experiments
├── active_loop.py # Example usage in python format
├── config.csv # All required parameters for active_loop.py
├── config.tsv # Unused, parameters for active_loop.py
├── environment.yaml # Python environment
├── loop_params.py # Record of parameters used by the Jupyter notebooks
├── params.tsv # Pre-selected concentrations for each compound
└── README.md # This file
Installation:
git clone https://github.com/yourusername/your-repo-name.git
cd your-repo-name
conda env create -f environment.yaml
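Once the environment is created, the AL-CM loop can be run from the notebooks or from the script. A hypothetical session (the environment name is defined in environment.yaml, and active_loop.py is assumed to read its parameters from config.csv, as described above):

```bash
conda activate <env-name>          # name defined in environment.yaml
jupyter notebook active_20.ipynb   # interactive run: select 20 new experiments
python active_loop.py              # script version; parameters taken from config.csv
```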
Original Method: Citovsky, G., et al. (2021). Batch Active Learning at Scale.