Skip to content

Snakemake workflow for recovering barcode sequences from genome skim data generated from museum (hDNA) specimens.

License

Notifications You must be signed in to change notification settings

bge-barcoding/BeeGees

 
 

Repository files navigation

Barcode Extraction and Evaluation from Genome Skims (BeeGees) Snakemake workflow

Snakemake workflow for recovering high-quality barcode sequences at scale, built around MitoGeneExtractor and adapted for genome skims of museum specimens.

Snakemake

Contents


Requirements

  • MitoGeneExtractor version 1.9.6 installed.
  • Paired-end reads in .fastq.gz or .fastq format.
  • samples.csv (generated manually, or as outlined below if working from BOLD sample metadata).
  • sequence_references.csv (generated manually, or using Gene Fetch within the workflow).
  • Activated conda env (see BeeGees_env.yaml).

Workflow

  1. Preprocessing modes (both modes run in parallel to optimise barcode recovery from degraded hDNA):
    • concat: Adapter trimming, quality filtering, poly-G trimming, deduplication of paired-end reads using fastp, optional BBDuk-based contaminant screening, followed by concatenation of R1+R2 reads, a secondary quality trimming with TrimGalore, and optional read downsampling. Empty or fully-filtered samples are handled gracefully via placeholder file creation to prevent pipeline failure.
    • merge: Quality control and merging of overlapping paired-end reads using fastp, optional BBDuk-based contaminant screening, with header cleaning for MitoGeneExtractor compatibility, and optional read downsampling. Empty or fully-filtered samples are handled gracefully via placeholder file creation to prevent pipeline failure.
  2. Sample-specific reference retrieval: Automated retrieval of taxonomically-appropriate protein reference sequences from GenBank using Gene-Fetch.
  3. Barcode recovery: Protein reference-guided extraction of barcode sequences from preprocessed reads using MitoGeneExtractor, producing initial consensus sequences for both preprocessing modes.
  4. Consensus sequence preparation: Header standardisation and concatenation of raw consensus sequences into multi-FASTA format for downstream processing.
  5. Consensus cleaning and filtering pipeline (fasta_cleaner): Sequential quality filters applied to MGE read alignments to remove contaminants and outliers before generating cleaned consensus sequences:
  6. Barcode validation and selection (see Validation Process section for more detail):
    • Structural validation: HMM-based barcode extraction, reading frame analysis, stop codon detection, and quality ranking of all generated barcode consensus sequences (structural_validation.py)
    • Local BLASTn search: Parallel BLASTn searches of structurally validated barcodes against local reference database (tv_local_blast.py)
    • Taxonomic validation: Hierarchical matching of BLAST results against expected taxonomy, selecting the best sequence per sample based on taxonomic match quality and alignment metrics (tv_blast2taxonomy.py)
  7. Statistics compilation: Aggregation of QC, recovery, cleaning, filtering, and validation metrics into comprehensive CSV reports for both preprocessing modes (compile_barcoding_stats.py).
  8. Final integration: Merging of all pipeline metrics (read QC, MGE, fasta_cleaner, structural validation, taxonomic validation) into a unified output CSV (val_csv_merger.py).
  9. Evaluate barcoding outcome: Take unified CSV file and determine barcoding success (PASS/PARTIAL/FAIL) for each sample (barcoding_outcome.py).
  10. Cleanup: Removal of temporary files and redundant sample-specific logs.

Installation and set up:

Install MitoGeneExtractor

Clone the BeeGees github repository and set up conda environment

git clone https://github.com/bge-barcoding/BeeGees.git [path/to/desired/install/location/]
cd installation/dir/BeeGees
conda env create -f BeeGees_env.yaml
git status

Generate input sample CSV file

  • This can be created manually, or via sample-processing workflow.
  • The file must contian the following headers: 'ID', 'forward', 'reverse', and 'taxid' OR 'phylum->species' (see below for more information).
    • ID: Unique sample identifier. Due to regex matching and statistics aggregation, the sample ID will be considered as the string before the first underscore. It is therefore imperative that sample names do not use '_' characters. E.g. BSNHM002-24 instead of BSNHM002_24, or P3-1-A10-2-G1 instead of P3_1_A10_2_G1.
    • forward & reverse: Absolute paths to forward (R1) and reverse (R2) PE read files. In fastq/fq fomrat, either gzipped not.
    • taxid OR heirarchical taxonomy: Unique taxonomic identifier or taxonomic lineage for sample. Taxid's can be found manually by searching the expected species/genus/family of each sample in the NCBI taxonomy database. If you provide the heirarchical taxonomic lineages of each sample (with the headers phylum, class, order, family, genus, species - see below), the corresponding taxid of the lowest identified taxonomic rank will be retrieved.

Note: The following alternative column name spellings are also accepted for samples.csv:

  • ID: id, sample_id, SampleID
  • forward: fwd, R1, read1
  • reverse: rev, R2, read2

samples.csv example (taxid)

ID forward reverse taxid
BSNHM002-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz 177658
BSNHM038-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz 177627
BSNHM046-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz 3084599

samples.csv example (hierarchical taxonomy)

ID forward reverse phylum class order family genus species
BSNHM002-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz Arthropoda Insecta Hemiptera Cicadidae Tibicina Tibicina tomentosa
BSNHM038-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz Tracheophyta Pinopsida Pinales Pinaceae Abies
BSNHM046-24 abs/path/to/R1.fq.gz abs/path/to/R2.fq.gz Annelida Polychaeta Terebellida Ampharetidae Samytha Samytha sexcirrata

Gathering sample-specific pseudo-references

  • The sample_references.csv file can be created manually, or using Gene-fetch integrated into the workflow (highly recommended). If enabled in the config.yaml by setting run_gene_fetch to 'true', Gene-fetch will retrieve the necessary protein pseudo-references for each sample from NCBI GenBank using the sample's taxonomic identifier (taxid) or taxonomic hierarchy. A sequence target (e.g. COI) must be specified in the config.yaml, as well as your NCBI API credentials (email address & API key - see guidance on getting a key).
  • The file must contain the following header: 'ID', 'reference'name' and 'protein_reference_path'.
    • ID: Unique sample identifier. This must be the same string as used in the 'ID' column in the input samples.csv file.
    • protein_reference_path: Absolute path to the protein pseudo-reference sequence used for sample-specific protein-guided read alignment.

sample_references.csv example

ID protein_reference_path
BSNHM002-24 path/to/BSNHM002-24.fasta
BSNHM038-24 path/to/BSNHM038-24.fasta
BSNHM046-24 path/toBSNHM046-24.fasta
  • Currently, it is crucial that the sample ID (ID), the reference sequence FASTA filename, and corresponding reference sequence FASTA header are all identical for correct sample-reference file mapping. Gene-fetch will handle this for you.

Contaminant reference files

The resources/contaminants/ directory contains reference files used in contaminant screening and filtering:

  • human_mitogenome.fasta: Human mitochondrial genome reference (NC_012920.1), used by the fasta_cleaner human mitogenome filter step. Included in the repository. An alterative human mitochondrial genome can be supplied, with the path specified in the config/config.yaml (see config customisation section below).
  • contaminants.fasta: Collection of reference sequences used for optional pre-processing (BBDuk) contaminant screening. A set of 31 common museum specimen contaminants are included in the repository as a default set (see contaminants.fasta for details). A custom contaminant FASTA can be specified by declaring an alternative path in contam_seqs in the config/config.yaml (see config customisation section below).

Customising snakemake configuration file

## General BeeGees pipeline parameters and paths
run_name: BeeGees run identifier
mge_path: Path to MGE install (MitoGeneExtractor-vX.X.X file)
samples_file: Path to samples.csv (see above for formatting)
sequence_reference_file: Path to sequence_references.csv. Only required when run_gene_fetch == false. Leave blank if run_gene_fetch == true (see above for formatting)
output_dir: Path to output directory. If any directories in the path do not already exist, then they wil be created

## Gene Fetch parameters (https://github.com/bge-barcoding/gene_fetch)
run_gene_fetch: Set to true to use gene-fetch to generate reference sequences (default: true)
email: Email for NCBI API. Required if run_gene_fetch == true. 
api_key: NCBI API key. Required if run_gene_fetch == true.
gene: Target gene (cox1 or rbcl)
type: Sequence type to retrieve - 'protein' for protein references only, 'both' for protein and nucleotide sequences (default: protein)
minimum_length: Minimum length (in amino acids) of protein pseudo-reference(s) to fetch (default: 500)
input_type: Taxonomic identification column(s) (taxid/hierarchical). i.e. Does the 'samples.csv' contain a 'taxid' column or 'hierarchical' taxonomic information columns (default: taxid)? (see above for formatting)
genbank: Download complete GenBank records of retrieved protein pseudo-references

## Contaminant screening parameters (BBDuk)
enabled: Set to true to enable BBDuk-based contaminant screening prior to read merging/concatenation (default: false)
contam_seqs: Path to FASTA file containing contaminant reference sequences. Default (if left empty): resources/contaminants/contaminants.fasta
k: k-mer size for contaminant matching (default: 31)

## Downsampling parameters
enabled: Set to true to enable downsampling (default: false)
max_reads: Maximum number of read PAIRS to process (e.g. 25M read pairs = 25M fwd + 25M rev reads, = 50M reads in total). Setting this to zero is equivilent to 'enabled: false'

## MitoGeneExtractor parameters (https://github.com/cmayer/MitoGeneExtractor/tree/main?tab=readme-ov-file#command-line-options)
r: Exonerate relative score threshold parameter. Accepts a list of values (e.g. [1, 1.3, 1.5]) - each value generates a separate MGE job per sample. More values increase barcode recovery sensitivity but multiply runtime and output volume proportionally.
s: Exonerate minimum score threshold parameter. Accepts a list of values (e.g. [50, 100]) - combined with r values, every r×s combination is run per sample. For example, 3 r values × 2 s values = 6 MGE jobs per sample.
n: Number of base pairs to extend beyond the Exonerate alignment
C: Genetic code to use for Exonerate (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) 
t: Consensus threshold (e.g. 0.5 = 50%)

## Post-processing of aligned reads for cleaning and filtering
# human mitogenome filtering -> at content filtering -> statistical outlier filtering -> (optional) reference-base filtering -> 'cleaned' consensus generation
human_genome: Path to human mitogenome FASTA reference for contamination filtering. Default (if left empty): resources/contaminants/human_mitogenome.fasta
aligner: Alignment tool to use for human mitogenome filtering: minimap2, bwa-mem, or bwa-aln (default: bwa-aln)
save_removed: Set to true to save reads removed by human mitogenome filtering to a removed/ subdirectory (default: false)
at_difference: Threshold at which reads are removed due to AT content variation (e.g. 0.1 = reads with AT% differing by 10% from the consensus are removed)
at_mode: AT content filtering mode (Absolute/Higher/Lower). Absolute = Remove sequences if AT content differs from consensus by more than threshold in either direction. Higher = Only remove sequences with AT content above at_difference threshold. Lower = Only remove sequences with AT content below at_difference threshold
outlier_percentile: Threshold at which reads are flagged as statistical outliers compared to the consensus and removed (e.g. 90.0 = reads < 90% 'similar' to the consensus are removed)
reference_dir: Path to directory containing at least one [ID].fasta file with known contaminant or target species genome(s) to be filtered or retained (see reference_filter_mode below)
reference_filter_mode: Either keep sequences that map to the supplied reference sequence (reference-based retention) or remove sequences that map to the supplied reference sequence (contaminant removal) ("keep_similar"/"remove_similar")
consensus_threshold: Threshold at which bases at each positional 'column' must 'agree' to be incldued in the consensus (e.g. 0.5 = ≥50% of bases at each position must agree)

## Structural validation
run_structural_validation: Set false to skip structural validation step (default: true)
target: Barcode marker to extract. Corresponds to HMM files in 'resources/hmm' (cox1/coi or rbcl)
verbose: Enable verbose logging (default: false)
genetic_code: Genetic code for translation table (must be the same as 'C' MitoGeneExtractor parameter

## Taxonomic validation
run_taxonomic_validation: Set false to skip structural validation step (default: true). If run_structural_validation == false, run_taxonomic_validation MUST also == false
database: Path to directory containing BLASTn database, or to a specific FASTA file to make a BLASTn database from (using makeblastdb)
database_taxonomy: Path to TSV file containing taxonomic mappings corresponding to records in the BLASTn 'database'
taxval_rank: Taxonomic rank to stop validating at (e.g. family, genus, species) (default & recommended: family)
expected_taxonomy: Expected taxonomy file must contain the following columns: Process ID,phylum,class,order,family,genus,speces. Process ID must equal 'ID' from the samples_file above. If heirarchical taxonomy information was provided in the `samples.csv` file, this file can be reused as the expected_taxonomy CSV file required for taxonomic validation of barcode consensus sequences.
verbose: Enable verbose logging (default: true)
min_pident: Minimum percent identity (pident) threshold to be considered for returned BLAST hits. Any hit with a pident below this value is removed
min_length: Minimum length of alignment to be considered for returned BLAST hits. Any hit with a length below this value is removed

## Resource allocation for each rule
rules: Each of the main rules in the config.yaml can specify the number of requested threads and memory resources (in Mb) for every job (e.g. specifying 4 threads and 4G memory for fastp_pe_merge would allocate those resources for every 'fastp_pe_merge' job). Rules have dynamic memory scaling upon retry (mem_mb * retry #). Make sure to change 'PARTITION' in the config/config.yaml for relevant rules.

human_mitogenome_filter threads note: threads = parallel_files × 4 threads-per-file (e.g., 32 threads = 8 parallel files × 4 aligner threads each)
taxonomic_validation threads note: minimum 4 threads (4 threads allocated to each parallelised BLASTn instance)

Currently, BeeGees-integrated barcode validation only works for COI-5P and rbcL barcodes due to HMM and BLAST database availability (to be expanded with future updates).


Cluster configuration using Snakemake profiles

  • See profiles/ directory for config.yaml files for 'SLURM' or 'local' cluster submission parameters. Other than the default slurm_partition and jobs parameters, all other parameters can likely stay as they are unless you experience issues.
    • The default slurm_partition determines the SLURM cluster partition for each snakemake job, unless otherwise specified (in the config/config.yaml). It is recommended to set this to a partition with at least 12-24 hour time limits.
    • The jobs parameter dictates the maximium number of workflow jobs that can be run concurrently. The value to set jobs to depends on your specific cluster. If the value is too low, it will create a bottleneck and reduce run speed/efficiency. If the value is too high, you may hit filesystem limits, job submission limits, user resource quotas, and fairshare policies, resulting in many pending or idle jobs. For example, if your cluster had a per-user memory limit of 256G, setting jobs to 20 and allocating 32G memory to each MitoGeneExtractor job would result in only 8 MitoGeneExtractor jobs running in parallel and the remaining 12 jobs to be pending until memroy is available.

The profile (profiles/local or profiles/slurm) will need to be changed in snakemake_run.sh (for local runs) or snakemake_run-sbatch.sh (for SLURM runs) depending on your system (see the $PROFILE variable in each script).

Cluster submission

  • Depending on your system and whether you are using the 'SLURM' or 'local' snakemake profile, there are two ways to run the BeeGees pipeline:
    • SLURM: Use snakemake_run-sbatch.sh. Run sbatch snakemake_run-sbatch.sh on the head/login node of your cluster. Submits the main snakemake coordinating job to the SLURM cluster using SBATCH, and will 'farm out' each job in the workflow to a new SBATCH job for increased parallelisation. Please change --partition in the SBATCH header section of the snakemake_run-sbatch.sh script to an appropriate parition for your cluster - The main snakemake coordinating job needs to run throughout the entire BeeGees run. It is therefore recommended to set this to a partition with at least 1 day->1 week time limits.
    • local: Use snakemake_run.sh. Simply run ./snakemake_run.sh on your desired cluster compute node. This node will handle all job scheduling and job computation.

Results structure

output_dir/
├── **01_preprocessing/**
│   ├── merge_mode/
│   │   ├── trimmed_data/
│   │   │   ├── {sample}_merged.fq                             # Merged paired-end reads
│   │   │   ├── {sample}_merged_clean.fq                       # Header-cleaned merged reads
│   │   │   ├── {sample}_fastp_report.html                     # FastP HTML report
│   │   │   ├── {sample}_fastp_report.json                     # FastP JSON report
│   │   │   └── unpaired/                                      # Unpaired reads from merging
│   │   ├── contaminant_screen/                            # Only if contaminant_screen.enabled = true
│   │   │   └── {sample}/
│   │   │       ├── {sample}_contaminant_reads.fastq.gz    # Reads matching contaminant references
│   │   │       ├── {sample}_contaminant_stats.txt         # BBDuk summary statistics
│   │   │       └── {sample}_refstats.txt                  # Per-reference match statistics
│   │   └── logs/
│   │       ├── clean_headers/
│   │       │   └── clean_headers.log                          # Aggregated header cleaning logs
│   │       ├── fastp/                                         # Individual FastP logs per sample
│   │       └── final_cleanup_complete.txt
│   └── concat_mode/
│       ├── trimmed_data/
│       │   └── {sample}/
│       │       ├── {sample}_R1_trimmed.fastq.gz               # Trimmed forward reads
│       │       ├── {sample}_R2_trimmed.fastq.gz               # Trimmed reverse reads
│       │       ├── {sample}_concat_trimmed.fq                 # Quality-trimmed concatenated reads
│       │       ├── {sample}_fastp_report.html                 # FastP HTML report
│       │       ├── {sample}_fastp_report.json                 # FastP JSON report
│       │       └── {sample}_concat.fastq_trimming_report.txt  # Trim Galore report
│       ├── contaminant_screen/                            # Only if contaminant_screen.enabled = true
│       │   └── {sample}/
│       │       ├── {sample}_contaminant_R1.fastq.gz       # R1 reads matching contaminant references
│       │       ├── {sample}_contaminant_R2.fastq.gz       # R2 reads matching contaminant references
│       │       ├── {sample}_contaminant_stats.txt         # BBDuk summary statistics
│       │       └── {sample}_refstats.txt                  # Per-reference match statistics
│       └── logs/
│           ├── concat/
│           │   └── concat_reads.log                           # Aggregated concatenation logs
│           ├── trim_galore/
│           │   └── trim_galore.log                            # Aggregated Trim Galore logs
│           ├── fastp/                                         # Individual FastP logs per sample
│           ├── gzip/                                          # Compression logs per sample
│           └── final_cleanup_complete.txt
│
├── **02_references/**                                             # Only if run_gene_fetch = true
│   ├── protein/
│   │   └── {sample}.fasta                                     # Protein references for each sample
│   ├── genbank/                                               # GenBank records (if genbank: true)
│   └── sequence_references.csv                                # Reference metadata
│
├── **03_barcode_recovery/**
│   ├── merge_mode/
│   │   ├── alignment/
│   │   │   └── {sample}_r_{r}_s_{s}_align_{sample}.fas        # MGE alignment files
│   │   ├── consensus/
│   │   │   ├── {sample}_r_{r}_s_{s}_con_{sample}.fas          # Individual consensus files
│   │   │   └── {run_name}_cons_combined-merge.fasta           # Combined consensus sequences
│   │   ├── fasta_cleaner/
│   │   │   ├── 01_human_filtered/
│   │   │   │   ├── human_filtered.txt                         # List of filtered files
│   │   │   │   └── human_filter_metrics.csv                   # Human filtering metrics
│   │   │   ├── 02_at_filtered/
│   │   │   │   ├── at_filtered_sequences/                     # Individual filtered files
│   │   │   │   ├── at_filtered.txt                            # List of filtered files
│   │   │   │   └── at_filter_summary.csv                      # AT filtering summary
│   │   │   ├── 03_outlier_filtered/
│   │   │   │   ├── outlier_filtered.txt                       # List of filtered files
│   │   │   │   ├── outlier_filter_summary_metrics.csv         # Summary metrics
│   │   │   │   └── outlier_filter_individual_metrics.csv      # Individual metrics
│   │   │   ├── 04_reference_filtered/                         # Optional - if reference filtering enabled
│   │   │   │   ├── reference_filtered.txt                     # List of filtered files
│   │   │   │   └── reference_filter_metrics.csv               # Reference filtering metrics
│   │   │   ├── 05_cleaned_consensus/
│   │   │   │   └── cleaned_cons_metrics-merge.csv             # Consensus generation metrics
│   │   │   ├── combined_statistics.csv                        # Aggregated cleaning statistics
│   │   │   └── cleaned_cons_combined.fasta                    # Final cleaned consensus sequences
│   │   ├── logs/
│   │   │   ├── mge/
│   │   │   │   ├── alignment_files.log                        # List of alignment files
│   │   │   │   ├── compile_barcoding_stats.log                              # MGE statistics log
│   │   │   │   └── {sample}_r_{r}_s_{s}/                      # MGE vulgar files per sample
│   │   │   ├── fasta_cleaner/
│   │   │   │   └── fasta_cleaner_complete.txt                 # Cleaner completion flag
│   │   │   ├── rename_consensus/
│   │   │   │   └── rename_fasta.log                           # Header renaming logs
│   │   │   ├── fasta_cleaner_complete.txt                     # Main cleaner completion
│   │   │   └── exonerate_int_cleanup_complete.txt             # Intermediate cleanup completion
│   │   ├── out/                                               # MGE output files per sample/parameter
│   │   ├── err/                                               # MGE error logs per sample/parameter
│   │   └── {run_name}_merge-stats.csv                         # Mode-specific statistics
│   └── concat_mode/
│       ├── alignment/
│       │   └── {sample}_r_{r}_s_{s}_align_{sample}.fas        # MGE alignment files
│       ├── consensus/
│       │   ├── {sample}_r_{r}_s_{s}_con_{sample}.fas          # Individual consensus files
│       │   └── {run_name}_cons_combined-concat.fasta          # Combined consensus sequences
│       ├── fasta_cleaner/
│       │   ├── 01_human_filtered/
│       │   │   ├── human_filtered.txt                         # List of filtered files
│       │   │   └── human_filter_metrics.csv                   # Human filtering metrics
│       │   ├── 02_at_filtered/
│       │   │   ├── at_filtered_sequences/                     # Individual filtered files
│       │   │   ├── at_filtered.txt                            # List of filtered files
│       │   │   └── at_filter_summary.csv                      # AT filtering summary
│       │   ├── 03_outlier_filtered/
│       │   │   ├── outlier_filtered.txt                       # List of filtered files
│       │   │   ├── outlier_filter_summary_metrics.csv         # Summary metrics
│       │   │   └── outlier_filter_individual_metrics.csv      # Individual metrics
│       │   ├── 04_reference_filtered/                         # Optional - if reference filtering enabled
│       │   │   ├── reference_filtered.txt                     # List of filtered files
│       │   │   └── reference_filter_metrics.csv               # Reference filtering metrics
│       │   ├── 05_cleaned_consensus/
│       │   │   └── cleaned_cons_metrics-concat.csv            # Consensus generation metrics
│       │   ├── combined_statistics.csv                        # Aggregated cleaning statistics
│       │   └── cleaned_cons_combined.fasta                    # Final cleaned consensus sequences
│       ├── logs/
│       │   ├── mge/
│       │   │   ├── alignment_files.log                        # List of alignment files
│       │   │   ├── compile_barcoding_stats.log                              # MGE statistics log
│       │   │   └── {sample}_r_{r}_s_{s}/                      # MGE vulgar files per sample
│       │   ├── fasta_cleaner/
│       │   │   └── fasta_cleaner_complete.txt                 # Cleaner completion flag
│       │   ├── rename_consensus/
│       │   │   └── rename_fasta.log                           # Header renaming logs
│       │   ├── fasta_cleaner_complete.txt                     # Main cleaner completion
│       │   └── exonerate_int_cleanup_complete.txt             # Intermediate cleanup completion
│       ├── out/                                               # MGE output files per sample/parameter
│       ├── err/                                               # MGE error logs per sample/parameter
│       └── {run_name}_concat-stats.csv                        # Mode-specific statistics
│   ├── {run_name}_BeeGees_stats.csv                              # Combined statistics from both modes
│   └── {run_name}_all_cons_combined.fasta                     # All consensus sequences from both modes
│
├── **04_barcode_validation/**
│   ├── structural/                                            # Only if run_structural_validation = true
│   │   ├── structural_validation.csv                          # Structural validation results
│   │   ├── {run_name}_full_sequences.fasta                    # Full sequences passing validation
│   │   └── {run_name}_barcode_sequences.fasta                 # Barcode sequences passing validation
│   └── taxonomic/                                             # Only if run_taxonomic_validation = true (& run_structural_validation = true)
│       ├── 01_local_blast_output.csv                          # BLAST results
│       ├── 02_taxonomic_validation.csv                        # Taxonomic validation results
│       └── {run_name}_barcode_sequences.fasta                 # Final validated barcode sequences
│
├── **05_barcoding_outcome/**
│   ├── baroding_outcome.log                                   # Summary and logging of barcoding outcome analysis
│   └── barcoding-outcome.tsv                                  # Overview of barcoding outcomes (PASS/FAILS)
|
├── {run_name}_final_validated_barcodes.fasta                  # Only produced if both validations run
└── {run_name}_final_stats.csv                                 # Master run metrics overview. Only produced if both validations run

Validation process

The BeeGees pipeline contains an optional barcode validation process (see Workflow section and config.yaml) to ensure output barcode quality is maximised through sequential structural and taxonomic validation steos, selecting the best barcode consensus sequences for downstream analyses.

  • The BeeGees pipeline has the capacity to validate the following barcode markers:
    • COI-5P: Requires the BOLDistilled BLASTn COI database and corresponding taxonomy mapping TSV file (*_SEQUENCES.fasta & *_TAXONOMY.tsv files) downloaded via the 'Download Source Data' button. Utilised the COI-5p.hmm for structural validation.
    • rbcL: Requires the custom reference BLASTn rbcL database and corresponding taxonomy mapping TSV file (*_dereplicated_*.fasta & *_dereplicated_*.tsv), downloaded here. Utilised the rbcL.hmm for structural validation.

Structural validation

Structural validation (via structural_validation.py) evaluates all generated barcode consensus sequences (from both pre-processing mode and all fasta_cleaner variants) through structural and functional analysis to identify high-quality, protein-coding sequences suitable for taxonomic assignment and species identification. Outputs a validation CSV containing comprehensive metrics for all sequences, including structural features, translation analysis, and quality ranks, and 'output_barcode_all_passing.fasta' containing ALL barcode sequences that pass the five quality criteria (multiple barcode sequences per process_id may pass)

Process:

  1. Barcode region extraction: Remove tilde characters (~) representing missing gene regions, replace gap ('-') characters with ambiguous bases (N's), use nhmmer to align sequences against marker-specific HMM profiles, constructs barcode sequences in HMM coordinate space, and trims leading/trailing N's while preserving internal ambiguous bases.
  2. Structural analysis: Calculates sequence length, gap distribution (leading/trailing/internal), N base count, and distinguishes 'original' N's (barcode_ambiguous_bases_original, representing quality issues) from processing-introduced N's (barcode_ambiguous_bases, representing all N's in final sequence).
  3. Translation analysis: Evaluates all three reading frames (0, 1 2), translates sequences using specified genetic code, counts stop codons in each frame, and selects the optimal frame with the fewest stop codons.
  4. Quality ranking: Assigns barcode ranks (1-6) based on original N's, stop codons, reading frame validity, and base count (lower = better):
    • Rank 1: Perfect sequences (no original N's, no stop codons, valid frame, ≥500bp)
    • Rank 2: High quality (no original N's, no stop codons, valid frame, 400-499bp)
    • Rank 3: Good quality (no original N's, no stop codons, valid frame, 300-399bp)
    • Rank 4: Acceptable (no original N's, no stop codons, valid frame, 200-299bp)
    • Rank 5: Minimal (no original N's, no stop codons, valid frame, 1-199bp)
    • Rank 6: Problematic (contains original N's or translation issues)
  5. Sequence selection: To be considered structurally validated and proceed to taxonomic validation, sequences mut pass ALL of the following criteria:
    • No original N's ( barcode_ambiguous_bases_original == 0)
    • No stop codons (stop_codons == 0)
    • Sequence is in a valid reading frame (reading_frame >= 0)
    • Sufficient informative nucleotide base content (barcode_base_count > 300bp)
    • Acceptable post-processing sequence 'quality' (barcode_ambiguous_bases < 30% of barcode_base_count)

Taxonomic validation

Taxonomic validation is a two-step process (via tv_local_blast.py and tv_blast2taxonomy.py) for verifying barcode identity through local BLAST searches and hierarchical taxonomic matching.

Process:

  1. Local BLASTn search: Perform parallel BLASTn searches against a local database, either created from a multi-FASTA file (using makeblastdb), or using a pre-constructed BLAST database. The e-value threshold is hardcoded to 1e-5. In sequence-specific TSV output files (output format 6), the top 500 BLAST hits are then ordered by percent identity (in descending order). These are then filtered to the top 100 hits and are output to the summary CSV.
  2. Taxonomic assignment validation: Validates BLASTn results against expected taxonomy using hierarchical matching and quality-based filtering to confirm barcode identity.
  3. Parses local BLASTn summary CSV, expected taxonomic lineage for each sample, BLAST database taxonomy mappings, and structurally validated sequences in FASTA format.
  4. Filter top 100 BLASTn hits to remove those with percent identity values below the specified threshold (< min_pident), as well as hits below a specified minimum aignment length (< min_length).
  5. Assess taxonomy of remaining BLAST hits for each sequence via hierarchical taxonomic (exact string) matching between the BLAST database taxonomy mapping and expected taxonomic lineage. Looks for matches between the expected taxonomy and database taxonomy mapping at family, genus, or species-level (highest rank to consider set with taxval_rank). The first (i.e. top) hit with a taxonomy match at any of the allowed ranks is accepted.
  6. Sequence selection for each sample (Process ID): Among the structurally validated consensus sequences with taxonomy matches, the 'best' sequence is selected based on the following criteria:
    • Lowest matched_rank (species > genus > family - more specific preferred)
    • Lowest gaps (alignment quality)
    • Lowest mismatches (sequence variability)
    • Highest percent identity (overall sequence similarity)
    • Lowest e-value (statistical significance)
    • Highest alignment length (matching hit confidence)
    • Highest s value (MitoGeneExtractor parameter)
    • Highest r value (MitoGeneExtractor parameter)
    • Has "fcleaner" in seq_id (prioritises cleaned consensus sequences)
  7. Generation of taxonomic validation CSV file

Final metric integration

The barcode validation outputs are merged with pre-processing and barcode recovery statistics (via val_csv_merger.py) to create the final comprehensive BeeGees output ({run_name}_final_metrics.csv), consolidating:

  • Read QC metrics (fastp, TrimGalore)
  • Reference retrieval results (Gene Fetch)
  • Barcode recovery statistics (MGE, fasta_cleaner)
  • Structural validation metrics
  • Taxonomic validation results

Contributing

  • Please feel free to submit issues, fork the repository, and create pull requests for any improvements.
  • This snakemake pipeline was produced by Dan Parsons @ NHMUK for the Biodiversity Genomics Europe (BGE) consortium. If you use BeeGees in your work, please cite our paper at ...
  • Since BeeGees uses MitogeneExtractor at its core, please cite: Brasseur, M.V., Astrin, J.J., Geiger, M.F., Mayer, C., 2023. MitoGeneExtractor: Efficient extraction of mitochondrial genes from next-generation sequencing libraries. Methods in Ecology and Evolution.

Future developments

  • Expand supported markers beyond COI-5P and rbcL. Will require marker-specific HMMs, BLAST databases and associated taxonomy files for barcode validation. Next likely maker to be added = Matk.
  • Increase flexibility of input sequence_references CSV headers, so that ID/id/Process ID/PROCESS ID/process_id/sample/sample_id/SAMPLE ID/etc are accepted. Currently sequence_references only accepts 'ID'.
  • Output simple plots.

About

Snakemake workflow for recovering barcode sequences from genome skim data generated from museum (hDNA) specimens.

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 99.4%
  • Shell 0.6%