From 8b738e4d8cb021e3b5c09284a7dfe670476697c9 Mon Sep 17 00:00:00 2001
From: Abhishek Tiwari
Date: Sat, 9 Aug 2025 09:31:58 -0400
Subject: [PATCH] Update paper.md

Fixing typo
---
 paper/paper.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/paper/paper.md b/paper/paper.md
index 7c3cc5a..dc19c3d 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -38,9 +38,9 @@ Comparative genetic analyses within or between taxonomic groups often requires r
 
 While several existing tools enable sequence retrieval from NCBI sequence repositories, they often require considerable bioinformatics expertise, are limited in functionality or data scope, or are suited for slightly different applications. E-utilities like Entrez Direct [@Kans:2024] offer broad NCBI database access via several APIs but rely on manual NCBI search term construction and significant scripting expertise, burdening the user with navigating database-specific syntax and structure. Similarly, other tools like CRABS: Creating Reference databases for Amplicon-Based Sequencing [@Jeunen:2023] and RESCRIPt: REference Sequence annotation and CuRatIon Pipeline [@Robeson:2021] offer bulk, programmatic retrieval of sequences from several databases (e.g. NCBI, BOLD, etc.), yet they also require manual search terms construction, operate on a single taxon, and often require substantial post-processing by the user. NCBI Datasets [@Oleary2024], while facilitating web- and programmatic-access to NCBI sequence data, is restricted to the curated RefSeq database (a small subset of sequences available on GenBank), limited to species-level queries, and lacks sequence filtering and batch processing capabilities.
 
-In contrast, `Gene Fetch` offers an accessible, high-throughput solution that automates and simplifies sequence retrieval from GenBank by matching sought after targets to feature table annotations in GenBank records, and requires no prior NCBI syntax knowledge. The tool integrates robust logging, error handling, checkpointing, and a standardised output format, making it suited for reproducible, efficient, and biologically-informed sequence retrieval at scale. `Gene Fetch` directly addresses the challenge of variable sequence representation by systematically traversing taxonomic hierarchies when target sequences are unavailable at the initially specified taxonomic rank (e.g., species → genus → family, etc), documenting the matched rank. This is especially valuable for researchers working with non-model organisms or taxonomic groups with limited sequence data, facilitating retrieval of the taxonomically closest available sequence. The integrated 'batch' mode processes multiple input taxa and retrieve the single ‘best’ sequence per taxon, whilst 'single' mode exhaustively searches for all target sequences for a specified taxon. Collectively, these modes enable efficient retrieval of sequence data for genomic and phylogenetic studies across diverse taxa.
+In contrast, `Gene Fetch` offers an accessible, high-throughput solution that automates and simplifies sequence retrieval from GenBank by matching sought after targets to feature table annotations in GenBank records, and requires no prior NCBI syntax knowledge. The tool integrates robust logging, error handling, checkpointing, and a standardised output format, making it suited for reproducible, efficient, and biologically-informed sequence retrieval at scale. `Gene Fetch` directly addresses the challenge of variable sequence representation by systematically traversing taxonomic hierarchies when target sequences are unavailable at the initially specified taxonomic rank (e.g., species → genus → family, etc), documenting the matched rank. This is especially valuable for researchers working with non-model organisms or taxonomic groups with limited sequence data, facilitating retrieval of the taxonomically closest available sequence. The integrated 'batch' mode processes multiple input taxa and retrieves the single ‘best’ sequence per taxon, whilst 'single' mode exhaustively searches for all target sequences for a specified taxon. Collectively, these modes enable efficient retrieval of sequence data for genomic and phylogenetic studies across diverse taxa.
 
-`Gene Fetch` supports 'batch' and 'single' query modes across both protein and nucleotide sequences, with automated CDS extraction, customisable length filtering, and fallback mechanisms for atypical GenBank annotations. The integrated 'batch' mode processes multiple input taxa and retrieve the single ‘best’ sequence per taxon, whilst 'single' mode exhaustively searches for all target sequences for a specified taxon. Collectively, these modes enable efficient retrieval of sequence data for genomic and phylogenetic studies across diverse taxa. It can also process variable GenBank features, including complementary strands, joined sequences, and whole genome shotgun entries, enabling extraction regions of interest from variable feature annotations (e.g., COI from mitogenome records). Cross-validation of retrieved NCBI taxonomy against the input taxonomy prevents taxonomic homonyms matches (identical names referring to different organisms across the tree of life). At release, the tool is optimised for 18 common targets, including “barcoding” genes, with curated synonyms for improved search specificity. Users can also specify additional markers to those 18 targets, and optionally retrieve corresponding GenBank records for each fetched sequence.
+`Gene Fetch` supports 'batch' and 'single' query modes across both protein and nucleotide sequences, with automated CDS extraction, customisable length filtering, and fallback mechanisms for atypical GenBank annotations. The integrated 'batch' mode processes multiple input taxa and retrieves the single ‘best’ sequence per taxon, whilst 'single' mode exhaustively searches for all target sequences for a specified taxon. Collectively, these modes enable efficient retrieval of sequence data for genomic and phylogenetic studies across diverse taxa. It can also process variable GenBank features, including complementary strands, joined sequences, and whole genome shotgun entries, enabling extraction regions of interest from variable feature annotations (e.g., COI from mitogenome records). Cross-validation of retrieved NCBI taxonomy against the input taxonomy prevents taxonomic homonyms matches (identical names referring to different organisms across the tree of life). At release, the tool is optimised for 18 common targets, including “barcoding” genes, with curated synonyms for improved search specificity. Users can also specify additional markers to those 18 targets, and optionally retrieve corresponding GenBank records for each fetched sequence.