Gap List Data Repository

This repository contains species data files and a data cleaning pipeline that normalizes taxonomic information and synonyms.

Overview

The pipeline cleans and normalizes species data across two primary datasets:

Species and Synonyms: Valid species names with their synonyms
Species and Taxonomy: Species with taxonomic classifications

Data Files

Input Data (`/data/`)

all_specs_and_syn.csv - Species and their synonyms
Gap_list_all_updated.csv - Species with taxonomic hierarchy
targetlist.csv - Additional target species list

Processed Data (`/cleaned_data/`)

Cleaned and normalized versions of the input files
Records removed during cleaning, kept as an audit trail
Output file names carry a mode-specific suffix (e.g., _hybrid, _majority_rule)

Data Cleaning Pipeline

SMART Enhanced Pipeline (`/data_cleaning_pipeline/`)

The SMART Enhanced Data Cleaning Pipeline normalizes taxonomy while cutting API calls by more than 95%: it queries GBIF only for species whose taxonomy records actually conflict.

Main Script

smart_pipeline.py - Enhanced pipeline with SMART taxonomy normalization

Key Features

Three Processing Modes: Majority rule, GBIF validation, and hybrid approach
Intelligent GBIF Usage: Only queries species with actual taxonomy conflicts
Comprehensive Cleaning: Unicode handling, gender variant merging, duplicate resolution
Complete Audit Trail: Detailed logging of all modifications
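The "Intelligent GBIF Usage" feature above can be pictured as a pre-filter: group records by species and flag only those that appear with more than one distinct classification. The following is a minimal sketch under that assumption, not the pipeline's actual code. The function name `find_taxonomy_conflicts`, the record layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def find_taxonomy_conflicts(records):
    """Return species that appear with more than one distinct taxonomy.

    `records` is an iterable of (species_name, taxonomy_tuple) pairs.
    This layout is an assumption made for illustration only.
    """
    seen = defaultdict(set)
    for species, taxonomy in records:
        seen[species].add(taxonomy)
    # Only species with disagreeing classifications would need an external lookup.
    return {species: taxa for species, taxa in seen.items() if len(taxa) > 1}


# Made-up sample rows: (species, (class, order, family))
records = [
    ("Apis mellifera", ("Insecta", "Hymenoptera", "Apidae")),
    ("Apis mellifera", ("Insecta", "Hymenoptera", "Apidae")),
    ("Bombus terrestris", ("Insecta", "Hymenoptera", "Apidae")),
    ("Bombus terrestris", ("Insecta", "Hymenoptera", "Bombidae")),
]

conflicts = find_taxonomy_conflicts(records)
print(sorted(conflicts))  # ['Bombus terrestris']
```

Species whose records all agree never reach the external service in this sketch, which is how restricting queries to real conflicts would reduce API traffic.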

Quick Start

# Default hybrid mode (recommended)
python data_cleaning_pipeline/smart_pipeline.py

# Fast processing with majority rule
python data_cleaning_pipeline/smart_pipeline.py --mode majority_rule

# External validation against GBIF (SMART optimized)
python data_cleaning_pipeline/smart_pipeline.py --mode gbif_only

Documentation

Complete Pipeline Documentation: processing phases, examples, and troubleshooting

Core Pipeline Components

data_cleaning_pipeline.py - Base cleaning functionality
taxonomy_normalization/ - SMART taxonomy processing modules
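The Unicode handling listed under Key Features is not described in detail in this README. As a rough illustration of what such a cleaning step often looks like, the sketch below normalizes a species name with Python's standard `unicodedata` module. The function name `normalize_name` and the choice of NFKC are assumptions, not the behavior of `data_cleaning_pipeline.py`.

```python
import unicodedata


def normalize_name(name):
    """Normalize Unicode and whitespace in a species name (illustrative only)."""
    # NFKC composes combining sequences (e + U+0301 -> U+00E9) and folds
    # compatibility characters such as the non-breaking space.
    text = unicodedata.normalize("NFKC", name)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


cleaned = normalize_name("Apis\u00a0 mellifera ")
print(cleaned)  # Apis mellifera
```

Normalizing before comparison means that two visually identical names with different underlying code points are treated as the same string when duplicates are resolved.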

Processing Results (`/log/`)

Detailed modification logs (TSV format)
Summary reports (Markdown format)
Taxonomy normalization reports (JSON format)
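Because the modification logs are tab-separated, they can be summarized with the standard `csv` module. The sketch below counts rows per value of one column. The column names `species` and `action` and the sample rows are hypothetical, since the actual log schema is not given here. Check the header of a real log file before reusing this.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count rows per distinct value of `column` in a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample standing in for a log file opened with open(path, newline="").
sample = io.StringIO(
    "species\taction\n"
    "Apis mellifera\tmerged_synonym\n"
    "Bombus terrestris\tremoved_duplicate\n"
    "Apis cerana\tmerged_synonym\n"
)

counts = count_by_column(sample, "action")
print(counts["merged_synonym"])  # 2
```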

Usage

Command Line

# Process with default settings
python data_cleaning_pipeline/smart_pipeline.py

# Custom confidence threshold
python data_cleaning_pipeline/smart_pipeline.py --mode hybrid --confidence 0.9
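This README does not spell out how `--confidence` is applied. One plausible reading is that majority rule is accepted only when the most common value reaches the threshold, with anything below it deferred to external validation in hybrid mode. The sketch below illustrates that reading. The function `majority_rule` and its return shape are hypothetical.

```python
from collections import Counter


def majority_rule(values, confidence=0.9):
    """Return (winner, share); winner is None when share is below `confidence`."""
    if not values:
        raise ValueError("values must not be empty")
    winner, count = Counter(values).most_common(1)[0]
    share = count / len(values)
    # Below the threshold, the caller would fall back to another source.
    return (winner if share >= confidence else None), share


# Nine of ten records agree: accepted at a 0.8 threshold.
accepted, share_a = majority_rule(["Apidae"] * 9 + ["Bombidae"], confidence=0.8)

# Two of three records agree: rejected at a 0.9 threshold.
rejected, share_r = majority_rule(["Apidae", "Apidae", "Bombidae"], confidence=0.9)

print(accepted, rejected)  # Apidae None
```

Under this reading, a higher threshold sends more conflicts to external validation, and a lower one resolves more of them locally.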

Programmatic

from data_cleaning_pipeline.smart_pipeline import SmartSpeciesDataCleaner

cleaner = SmartSpeciesDataCleaner(".", taxonomy_mode="hybrid")
cleaner.run_pipeline()

Output Files

The pipeline generates:

Cleaned CSV files with normalized taxonomy and merged synonyms
Removed records for audit and review
Comprehensive logs with all modifications tracked
Summary reports with processing statistics and SMART efficiency metrics

Support

Issues: Report bugs and request features via GitHub Issues
Documentation: See the Pipeline Documentation
BGE Consortium: Visit biodiversitygenomics.eu for project updates

Acknowledgments

This work is supported by the Biodiversity Genomics Europe (BGE) consortium and contributes to the International Barcode of Life (IBOL) initiative. Biodiversity Genomics Europe (Grant no. 101059492) is funded by Horizon Europe under the Biodiversity, Circular Economy and Environment call (REA.B.3); co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract numbers 22.00173 and 24.00054; and by UK Research and Innovation (UKRI) under the Department for Business, Energy and Industrial Strategy’s Horizon Europe Guarantee Scheme.
