One of the most important aspects of the Active Learning (AL) and Bayesian Optimization (BO) process is deciding how much new data to label before retraining the model. Classical BO methods typically operate sequentially, selecting and labeling a single data point at a time. However, retraining the model after every newly labeled sample can be inefficient:
- Retraining may be computationally expensive or time-consuming
- A single new sample may not significantly improve performance, especially for deep learning models
- Many real-world experimental workflows run multiple experiments in parallel

To address these limitations, batch active learning has been proposed: instead of selecting a single point, batch AL selects multiple samples to be labeled before retraining the model. However, batch selection introduces its own challenges:
- Uncertainty-only sampling: Can lead to biased selections that are not representative of the unlabeled data distribution.
- Diversity-only sampling: Can increase labeling cost by selecting samples with low information content.
An effective batch AL strategy must therefore balance uncertainty (how informative a sample is) and diversity (how representative and non-redundant the batch is). Hence we use Cluster Margin (CM), a simple and practical batch active learning strategy that explicitly accounts for both. The method was originally developed for image classification, but we adapted it to regression problems following the same principle:
- Build a clustering over the pool of candidates using HAC (Hierarchical Agglomerative Clustering) with average linkage
- Rank all samples by prediction margin (based on the acquisition function)
- Select high-uncertainty points from diverse clusters via round-robin sampling (see the sketch below)
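A minimal sketch of these three steps for a regression setting is shown below, assuming an unlabeled candidate pool `X_pool` and a per-sample `uncertainty` score produced by the acquisition function (e.g. the variance of an ensemble of regressors). The function and parameter names (`select_batch_cm`, `n_clusters`, `margin_factor`) are illustrative only and do not correspond to the code in `library/`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def select_batch_cm(X_pool, uncertainty, batch_size, n_clusters=10, margin_factor=5):
    """Cluster-Margin-style batch selection for regression (illustrative sketch)."""
    # Step 1: HAC with average linkage over the candidate pool
    labels = AgglomerativeClustering(n_clusters=n_clusters,
                                     linkage="average").fit_predict(X_pool)

    # Step 2: keep only the most uncertain candidates (the "margin" set)
    n_candidates = min(margin_factor * batch_size, len(X_pool))
    candidates = np.argsort(uncertainty)[::-1][:n_candidates]

    # Group candidates by cluster, each sorted by descending uncertainty
    clusters = {}
    for idx in candidates:
        clusters.setdefault(labels[idx], []).append(int(idx))
    for members in clusters.values():
        members.sort(key=lambda i: uncertainty[i], reverse=True)

    # Step 3: round-robin over clusters, smallest cluster first,
    # taking the most uncertain remaining candidate from each cluster
    order = sorted(clusters, key=lambda c: len(clusters[c]))
    selected = []
    while len(selected) < batch_size and any(clusters.values()):
        for c in order:
            if clusters[c]:
                selected.append(clusters[c].pop(0))
                if len(selected) == batch_size:
                    break
    return np.array(selected)  # indices into X_pool


# Example usage with synthetic data and a made-up uncertainty score
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pool = rng.normal(size=(200, 12))  # 200 unlabeled candidates
    uncertainty = rng.random(200)        # e.g. ensemble variance per candidate
    batch = select_batch_cm(X_pool, uncertainty, batch_size=20)
    print(batch)
```

Sampling clusters in ascending order of size, as in the original CM procedure, keeps large clusters from dominating the batch; `margin_factor` is an assumed knob controlling how many top-uncertainty candidates are retained before diversity is enforced.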
We applied this method to optimize the yield of Colicin M and E1 in proCFPS, and of Colicin M in euCFPS, with minimal experimental effort. Results are published in *An AI-driven workflow for the accelerated optimization of cell-free protein synthesis*.
├── data/
| ├── data 20/ # example data for the AL loop where 20 new data points are selected each round
| └── data 50/ # example data for the AL loop where 50 new data points are selected each round
├── library/ # core implementation code
├── new_exp/ # Folder where the selected new data points for labeling are saved
├── active_20.ipynb # Example usage to run AL-CM and select 20 new experiments
├── active_50.ipynb # Example usage to run AL-CM and select 50 new experiments
├── active_loop.py # Example usage in python format
├── config.csv # All required parameters for active_loop.py
├── config.tsv # Unused, parameters for active_loop.py
├── environment.yaml # Python environment
├── loop_params.py # Record of parameters used by the Jupyter notebooks
├── params.tsv # Pre-selected concentrations for each compound
└── README.md # This file
Installation:
git clone https://github.com/yourusername/your-repo-name.git
cd your-repo-name
conda env create -f environment.yaml
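Once the environment is created, the AL-CM loop can be run from the notebooks or from the script. A hypothetical session (the environment name is defined in environment.yaml, and active_loop.py is assumed to read its parameters from config.csv, as described above):

```bash
conda activate <env-name>          # name defined in environment.yaml
jupyter notebook active_20.ipynb   # interactive run: select 20 new experiments
python active_loop.py              # script version; parameters taken from config.csv
```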
Original Method: Citovsky, G., et al. (2021). Batch Active Learning at Scale.