A comprehensive Snakemake pipeline for processing and curating BOLD Systems barcode reference sequence data. This tool implements standardized quality assessment criteria developed by the Biodiversity Genomics Europe (BGE) consortium to evaluate and rank DNA barcode sequences for library curation.
- Comprehensive Quality Assessment: Evaluates specimens against 16 standardized criteria including metadata completeness, voucher information, sequence quality, and phylogenetic analyses
- Advanced Phylogenetic Analysis: Includes haplotype identification and OTU clustering for genetic diversity assessment
- BAGS Species Assessment: Automated species-level quality grading system with subspecies inheritance
- Geographic Representation: Country representative selection for balanced geographic sampling
- Scalable Architecture: Family-level database splitting for efficient analysis of large datasets
- FAIR Compliance: Built with reproducibility and provenance tracking using Snakemake workflows
The classification criteria are actively developed by the BGE consortium and documented in this living document. The pipeline evaluates sequences based on multiple quality dimensions to support evidence-based curation decisions.
The pipeline processes BOLD data through six main phases:
- Data Preparation: Optional pre-filtering by taxa, geography, or genetic markers
- Database Setup: SQLite database creation with taxonomic enrichment
- Quality Assessment: Evaluation against 16 standardized criteria plus phylogenetic analyses
- BAGS Assessment: Species-level quality grading with database optimization
- Data Integration: Ranking system combining all assessments with country representative selection
- Family Splitting: Creation of family-level databases for scalable downstream analysis
- Specimen Metadata: Collection date, collectors, identifier, identification method
- Geographic Data: Country, region, site, sector, coordinates
- Repository Info: Institution, museum ID, public voucher
- Sequence Quality: DNA quality metrics, species ID, type specimen, images
- Phylogenetic: Haplotype identification, OTU clustering
- Operating System: Linux, macOS, or Windows with WSL
- Package Manager: Mamba (recommended) or Conda
- Memory: Minimum 8GB RAM (16GB+ recommended for large datasets)
- Storage: Sufficient space for BOLD data and family databases
- Dependencies: VSEARCH (for OTU clustering), SQLite, Perl, Python
1. Clone the repository:

   ```shell
   git clone https://github.com/bge-barcoding/bold-library-curation.git
   cd bold-library-curation
   ```

2. Set up the environment:

   ```shell
   mamba env create -f environment.yml
   mamba activate bold-curation
   ```

3. Configure your analysis in `config/config.yml`:

   ```yaml
   # Input data
   BOLD_TSV: "resources/your_bold_data.tsv"

   # Optional filtering
   ENABLE_PRESCORING_FILTER: false
   USE_TARGET_LIST: false

   # Analysis parameters
   OTU_CLUSTERING_THRESHOLD: 0.99
   FAMILY_SIZE_THRESHOLD: 10000
   ```

4. Run the pipeline:

   ```shell
   snakemake --cores 4 --use-conda
   ```

5. Check results in `results/result_output.tsv` and `results/family_databases/`.
- BOLD_TSV: Path to BOLD data dump (BCDM TSV format)
- RESULTS_DIR/LOG_DIR: Customizable output directories
- TAXONOMY_CHUNK_SIZE: Memory optimization (default: 10,000)
- Pre-scoring Filter: Early dataset reduction by taxa, countries, markers, or BIN sharing
- Target Lists: Focus on specific species of interest
- Geographic Filtering: Country-based specimen filtering
- OTU_CLUSTERING_THRESHOLD: Genetic similarity for clustering (default: 0.99)
- OTU_CLUSTERING_THREADS: Parallel processing threads (default: 8)
- FAMILY_SIZE_THRESHOLD: Minimum records for individual family databases (default: 10,000)
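Taken together, these options can be set in `config/config.yml`. The fragment below is a sketch using only the keys named in this README, with documented defaults where given; the file paths are placeholders, and the exact key set should be checked against the shipped `config/config.yml`:

```yaml
# Core settings
BOLD_TSV: "resources/your_bold_data.tsv"    # BCDM TSV data dump
RESULTS_DIR: "results"
LOG_DIR: "logs"
TAXONOMY_CHUNK_SIZE: 10000                  # memory optimization

# Optional pre-scoring filter
ENABLE_PRESCORING_FILTER: false
FILTER_TAXA: false
FILTER_TAXA_LIST: "resources/target_taxa.txt"

# Target list
USE_TARGET_LIST: false
TARGET_LIST: "resources/target_species.csv"

# Phylogenetic analysis
OTU_CLUSTERING_THRESHOLD: 0.99
OTU_CLUSTERING_THREADS: 8

# Family splitting
FAMILY_SIZE_THRESHOLD: 10000
```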
```shell
# Standard pipeline execution
snakemake --cores 8 --use-conda
```

```shell
# Configure pre-filtering in config.yml first:
# ENABLE_PRESCORING_FILTER: true
# FILTER_TAXA: true
# FILTER_TAXA_LIST: "resources/target_taxa.txt"
snakemake --cores 16 --use-conda --resources mem_mb=32000
```

```shell
#!/bin/bash
#SBATCH --partition=day
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16
source activate bold-curation
snakemake --cores 16 --use-conda
```

```shell
# Enable target list in config.yml:
# USE_TARGET_LIST: true
# TARGET_LIST: "resources/target_species.csv"
snakemake --cores 8 --use-conda
```

- `results/result_output.tsv`: Final scored and ranked specimens with all assessments
- `results/family_databases/`: Family-level SQLite databases organized by phylum
- `results/pipeline_summary.txt`: Comprehensive execution summary
- Individual criteria files: `assessed_*.tsv` for each quality criterion
- BAGS assessment: Species-level quality grades with BIN sharing analysis
- Phylogenetic analyses: Haplotype IDs and OTU clustering results
- Database: Complete SQLite database with specialized tables for complex queries
- Comprehensive logging: Step-by-step execution logs in the `logs/` directory
- Progress tracking: Real-time monitoring for long-running operations
- Error handling: Detailed debugging information for troubleshooting
- Small (< 10K records): 30-60 minutes, 8GB RAM
- Medium (10K-100K records): 2-6 hours, 16GB RAM
- Large (100K+ records): 6-24 hours, 32GB+ RAM
- Very large (1M+ records): 1-3 days, consider pre-filtering
- Use pre-scoring filter for very large datasets
- Adjust memory settings based on available resources
- Configure threading for OTU clustering based on CPU cores
- Use SSD storage for database operations when possible
```
bold-library-curation/
├── config/                  # Configuration files
├── workflow/                # Snakemake workflow
│   ├── bold-ranker.smk      # Main workflow definition
│   ├── scripts/             # Analysis scripts
│   ├── envs/                # Conda environments
│   └── README.md            # Detailed workflow documentation
├── resources/               # Input data and references
├── results/                 # Output files and databases
└── logs/                    # Execution logs
```
- Haplotype Analysis: Identifies genetic variants within species/BIN groups
- OTU Clustering: VSEARCH-based clustering with configurable similarity thresholds
- Multi-threaded Processing: Parallel execution for computationally intensive steps
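VSEARCH cluster runs (invoked along the lines of `vsearch --cluster_size seqs.fasta --id 0.99 --uc clusters.uc --threads 8`; the exact command is defined in the workflow scripts) can emit a `.uc` cluster table. As an illustration of how OTU membership is recovered from that standard format, here is a small parser; it is a sketch, not the pipeline's own script:

```python
import csv
import io

def parse_uc(uc_text):
    """Map each sequence label to its OTU centroid from a VSEARCH/USEARCH
    .uc cluster table (tab-separated; 'S' rows seed a cluster, 'H' rows
    are hits assigned to an existing centroid)."""
    otu_of = {}
    for row in csv.reader(io.StringIO(uc_text), delimiter="\t"):
        if not row:
            continue
        rec_type, query, target = row[0], row[8], row[9]
        if rec_type == "S":      # centroid: its own cluster representative
            otu_of[query] = query
        elif rec_type == "H":    # hit: assigned to the centroid in column 10
            otu_of[query] = target
    return otu_of

# Toy table: seqB clusters with centroid seqA; seqC seeds its own OTU
uc = (
    "S\t0\t658\t*\t*\t*\t*\t*\tseqA\t*\n"
    "H\t0\t658\t99.2\t+\t0\t0\t658M\tseqB\tseqA\n"
    "S\t1\t658\t*\t*\t*\t*\t*\tseqC\t*\n"
)
print(parse_uc(uc))  # {'seqA': 'seqA', 'seqB': 'seqA', 'seqC': 'seqC'}
```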
- Country Representatives: Systematic selection of optimal specimens per region
- Balanced Sampling: Maintains geographic diversity while optimizing quality
- Hierarchical Organization: Family-level databases organized by phylum
- Specialized Tables: Optimized schema for different data types
- Efficient Querying: Comprehensive indexing for complex analyses
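To illustrate querying a family-level database, the sketch below uses Python's built-in `sqlite3` against an in-memory stand-in. The `specimens` table and its columns are hypothetical, invented for this example; inspect the real schema (e.g. with `.schema` in the `sqlite3` shell) before writing queries against pipeline output:

```python
import sqlite3

# Tiny in-memory stand-in for a family database.
# Table and column names are illustrative, not the pipeline's schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE specimens (
    processid TEXT PRIMARY KEY,
    species   TEXT,
    country   TEXT,
    rank      INTEGER)""")
con.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?, ?)",
    [("BGE001", "Apis mellifera", "Spain", 1),
     ("BGE002", "Apis mellifera", "Greece", 3),
     ("BGE003", "Bombus terrestris", "Spain", 2)],
)

# Best-ranked specimen per species (lower rank = higher quality here);
# SQLite returns the non-aggregated columns from the row achieving MIN().
rows = con.execute("""
    SELECT species, processid, MIN(rank)
    FROM specimens
    GROUP BY species
    ORDER BY species
""").fetchall()
print(rows)  # [('Apis mellifera', 'BGE001', 1), ('Bombus terrestris', 'BGE003', 2)]
```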
We welcome contributions! Please see our contributing guidelines, then:
- Fork the repository
- Create a feature branch
- Make changes with appropriate tests
- Submit a pull request
This project is licensed under the GNU General Public License v3.0. See LICENSE for details.
If you use this tool in your research, please cite:
[Citation information - to be added when published]
- Issues: Report bugs and request features via GitHub Issues
- Documentation: Detailed workflow documentation in `workflow/README.md`
- BGE Consortium: Visit biodiversitygenomics.eu for project updates
This work is supported by the Biodiversity Genomics Europe (BGE) consortium and contributes to the International Barcode of Life (IBOL) initiative. Biodiversity Genomics Europe (Grant no.101059492) is funded by Horizon Europe under the Biodiversity, Circular Economy and Environment call (REA.B.3); co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract numbers 22.00173 and 24.00054; and by the UK Research and Innovation (UKRI) under the Department for Business, Energy and Industrial Strategy's Horizon Europe Guarantee Scheme.


