A comprehensive Snakemake pipeline for processing and curating BOLD Systems barcode reference sequence data. This tool implements standardized quality assessment criteria developed by the Biodiversity Genomics Europe (BGE) consortium to evaluate and rank DNA barcode sequences for library curation.
- Comprehensive Quality Assessment: Evaluates specimens against 16 standardized criteria including metadata completeness, voucher information, sequence quality, and phylogenetic analyses
- Advanced Phylogenetic Analysis: Includes haplotype identification and OTU clustering for genetic diversity assessment
- BAGS Species Assessment: Automated species-level quality grading system with subspecies inheritance
- Geographic Representation: Country representative selection for balanced geographic sampling
- Scalable Architecture: Family-level database splitting for efficient analysis of large datasets
- FAIR Compliance: Built with reproducibility and provenance tracking using Snakemake workflows
The classification criteria are actively developed by the BGE consortium and documented in this living document. The pipeline evaluates sequences based on multiple quality dimensions to support evidence-based curation decisions.
The pipeline processes BOLD data through six main phases:
- Data Preparation: Optional pre-filtering by taxa, geography, or genetic markers
- Database Setup: SQLite database creation with taxonomic enrichment
- Quality Assessment: Evaluation against 16 standardized criteria plus phylogenetic analyses
- BAGS Assessment: Species-level quality grading with database optimization
- Data Integration: Ranking system combining all assessments with country representative selection
- Family Splitting: Creation of family-level databases for scalable downstream analysis
- Specimen Metadata: Collection date, collectors, identifier, identification method
- Geographic Data: Country, region, site, sector, coordinates
- Repository Info: Institution, museum ID, public voucher
- Sequence Quality: DNA quality metrics, species ID, type specimen, images
- Phylogenetic: Haplotype identification, OTU clustering
- Operating System: Linux, macOS, or Windows with WSL
- Package Manager: Mamba (recommended) or Conda
- Memory: Minimum 8GB RAM (16GB+ recommended for large datasets)
- Storage: Sufficient space for BOLD data and family databases
- Dependencies: VSEARCH (for OTU clustering), SQLite, Perl, Python
1. Clone the repository:

   ```shell
   git clone https://github.com/bge-barcoding/bold-library-curation.git
   cd bold-library-curation
   ```

2. Set up the environment:

   ```shell
   mamba env create -f environment.yml
   mamba activate bold-curation
   ```

3. Configure your analysis in `config/config.yml`:

   ```yaml
   # Input data
   BOLD_TSV: "resources/your_bold_data.tsv"

   # Optional filtering
   ENABLE_PRESCORING_FILTER: false
   USE_TARGET_LIST: false

   # Analysis parameters
   OTU_CLUSTERING_THRESHOLD: 0.99
   FAMILY_SIZE_THRESHOLD: 10000
   ```

4. Run the pipeline:

   ```shell
   snakemake --cores 4 --use-conda
   ```

5. Check results in `results/result_output.tsv` and `results/family_databases/`.
- BOLD_TSV: Path to BOLD data dump (BCDM TSV format)
- RESULTS_DIR/LOG_DIR: Customizable output directories
- TAXONOMY_CHUNK_SIZE: Memory optimization (default: 10,000)
- Pre-scoring Filter: Early dataset reduction by taxa, countries, markers, or BIN sharing
- Target Lists: Focus on specific species of interest
- Geographic Filtering: Country-based specimen filtering
- OTU_CLUSTERING_THRESHOLD: Genetic similarity for clustering (default: 0.99)
- OTU_CLUSTERING_THREADS: Parallel processing threads (default: 8)
- FAMILY_SIZE_THRESHOLD: Minimum records for individual family databases (default: 10,000)
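Taken together, these options can be set in `config/config.yml`. The fragment below is a sketch using only the keys named in this README, with documented defaults where given; the file paths are placeholders, and the exact key set should be checked against the shipped `config/config.yml`:

```yaml
# Core settings
BOLD_TSV: "resources/your_bold_data.tsv"    # BCDM TSV data dump
RESULTS_DIR: "results"
LOG_DIR: "logs"
TAXONOMY_CHUNK_SIZE: 10000                  # memory optimization

# Optional pre-scoring filter
ENABLE_PRESCORING_FILTER: false
FILTER_TAXA: false
FILTER_TAXA_LIST: "resources/target_taxa.txt"

# Target list
USE_TARGET_LIST: false
TARGET_LIST: "resources/target_species.csv"

# Phylogenetic analysis
OTU_CLUSTERING_THRESHOLD: 0.99
OTU_CLUSTERING_THREADS: 8

# Family splitting
FAMILY_SIZE_THRESHOLD: 10000
```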
```shell
# Standard pipeline execution
snakemake --cores 8 --use-conda
```

```shell
# Configure pre-filtering in config.yml first:
# ENABLE_PRESCORING_FILTER: true
# FILTER_TAXA: true
# FILTER_TAXA_LIST: "resources/target_taxa.txt"
snakemake --cores 16 --use-conda --resources mem_mb=32000
```

```shell
#!/bin/bash
#SBATCH --partition=day
#SBATCH --mem=32G
#SBATCH --cpus-per-task=16
source activate bold-curation
snakemake --cores 16 --use-conda
```

```shell
# Enable target list in config.yml:
# USE_TARGET_LIST: true
# TARGET_LIST: "resources/target_species.csv"
snakemake --cores 8 --use-conda
```

- `results/result_output.tsv`: Final scored and ranked specimens with all assessments
- `results/family_databases/`: Family-level SQLite databases organized by phylum
- `results/pipeline_summary.txt`: Comprehensive execution summary
- Individual criteria files: `assessed_*.tsv` for each quality criterion
- BAGS assessment: Species-level quality grades with BIN sharing analysis
- Phylogenetic analyses: Haplotype IDs and OTU clustering results
- Database: Complete SQLite database with specialized tables for complex queries
- Comprehensive logging: Step-by-step execution logs in the `logs/` directory
- Progress tracking: Real-time monitoring for long-running operations
- Error handling: Detailed debugging information for troubleshooting
- Small (< 10K records): 30-60 minutes, 8GB RAM
- Medium (10K-100K records): 2-6 hours, 16GB RAM
- Large (100K+ records): 6-24 hours, 32GB+ RAM
- Very large (1M+ records): 1-3 days, consider pre-filtering
- Use pre-scoring filter for very large datasets
- Adjust memory settings based on available resources
- Configure threading for OTU clustering based on CPU cores
- Use SSD storage for database operations when possible
```
bold-library-curation/
├── config/                  # Configuration files
├── workflow/                # Snakemake workflow
│   ├── bold-ranker.smk      # Main workflow definition
│   ├── scripts/             # Analysis scripts
│   ├── envs/                # Conda environments
│   └── README.md            # Detailed workflow documentation
├── resources/               # Input data and references
├── results/                 # Output files and databases
└── logs/                    # Execution logs
```
- Haplotype Analysis: Identifies genetic variants within species/BIN groups
- OTU Clustering: VSEARCH-based clustering with configurable similarity thresholds
- Multi-threaded Processing: Parallel execution for computationally intensive steps
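VSEARCH cluster runs (invoked along the lines of `vsearch --cluster_size seqs.fasta --id 0.99 --uc clusters.uc --threads 8`; the exact command is defined in the workflow scripts) can emit a `.uc` cluster table. As an illustration of how OTU membership is recovered from that standard format, here is a small parser; it is a sketch, not the pipeline's own script:

```python
import csv
import io

def parse_uc(uc_text):
    """Map each sequence label to its OTU centroid from a VSEARCH/USEARCH
    .uc cluster table (tab-separated; 'S' rows seed a cluster, 'H' rows
    are hits assigned to an existing centroid)."""
    otu_of = {}
    for row in csv.reader(io.StringIO(uc_text), delimiter="\t"):
        if not row:
            continue
        rec_type, query, target = row[0], row[8], row[9]
        if rec_type == "S":      # centroid: its own cluster representative
            otu_of[query] = query
        elif rec_type == "H":    # hit: assigned to the centroid in column 10
            otu_of[query] = target
    return otu_of

# Toy table: seqB clusters with centroid seqA; seqC seeds its own OTU
uc = (
    "S\t0\t658\t*\t*\t*\t*\t*\tseqA\t*\n"
    "H\t0\t658\t99.2\t+\t0\t0\t658M\tseqB\tseqA\n"
    "S\t1\t658\t*\t*\t*\t*\t*\tseqC\t*\n"
)
print(parse_uc(uc))  # {'seqA': 'seqA', 'seqB': 'seqA', 'seqC': 'seqC'}
```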
- Country Representatives: Systematic selection of optimal specimens per region
- Balanced Sampling: Maintains geographic diversity while optimizing quality
- Hierarchical Organization: Family-level databases organized by phylum
- Specialized Tables: Optimized schema for different data types
- Efficient Querying: Comprehensive indexing for complex analyses
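To illustrate querying a family-level database, the sketch below uses Python's built-in `sqlite3` against an in-memory stand-in. The `specimens` table and its columns are hypothetical, invented for this example; inspect the real schema (e.g. with `.schema` in the `sqlite3` shell) before writing queries against pipeline output:

```python
import sqlite3

# Tiny in-memory stand-in for a family database.
# Table and column names are illustrative, not the pipeline's schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE specimens (
    processid TEXT PRIMARY KEY,
    species   TEXT,
    country   TEXT,
    rank      INTEGER)""")
con.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?, ?)",
    [("BGE001", "Apis mellifera", "Spain", 1),
     ("BGE002", "Apis mellifera", "Greece", 3),
     ("BGE003", "Bombus terrestris", "Spain", 2)],
)

# Best-ranked specimen per species (lower rank = higher quality here);
# SQLite returns the non-aggregated columns from the row achieving MIN().
rows = con.execute("""
    SELECT species, processid, MIN(rank)
    FROM specimens
    GROUP BY species
    ORDER BY species
""").fetchall()
print(rows)  # [('Apis mellifera', 'BGE001', 1), ('Bombus terrestris', 'BGE003', 2)]
```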
We welcome contributions! Please see our contributing guidelines, then:
- Fork the repository
- Create a feature branch
- Make changes with appropriate tests
- Submit a pull request
This project is licensed under the GNU General Public License v3.0. See LICENSE for details.
If you use this tool in your research, please cite:
[Citation information - to be added when published]
- Issues: Report bugs and request features via GitHub Issues
- Documentation: Detailed workflow documentation in `workflow/README.md`
- BGE Consortium: Visit biodiversitygenomics.eu for project updates
This work is supported by the Biodiversity Genomics Europe (BGE) consortium and contributes to the International Barcode of Life (IBOL) initiative. Biodiversity Genomics Europe (Grant no.101059492) is funded by Horizon Europe under the Biodiversity, Circular Economy and Environment call (REA.B.3); co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract numbers 22.00173 and 24.00054; and by the UK Research and Innovation (UKRI) under the Department for Business, Energy and Industrial Strategy's Horizon Europe Guarantee Scheme.


