██ ██ █████ ██████ ████████ ██████ █████ ██████ ██ ██ ███████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ██ ███████ ██████ ██ ██████ ███████ ██ █████ █████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
████ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██████ ██ ██ ███████ ██ ██
A bioinformatics pipeline to summarise variants called against a reference in a longitudinal study design. Written to investigate longitudinal sequencing data from long-term passaging of SARS-CoV-2. In theory, it could be expanded for other organisms too.
Author: Dr Charles Foster
- Features
- Installation
- Quick Start
- Output
- What does vartracker do?
- Citation
- License
- Contributing
- Support
## Features

- Track mutation persistence across longitudinal samples
- Comprehensive variant analysis including amino acid consequences
- Built-in SARS-CoV-2 reference data and annotations
- Integration with functional mutation databases (literature)
- Automated plotting and statistical analysis
- Support for both SNPs and indels
- Quality control metrics for variants
## Installation

Requires Python 3.11 or newer.
Create the fully pinned, reproducible environment (uses strict channel priority as configured
in environment.yml), then install vartracker from PyPI:
```bash
mamba env create -n vartracker -f environment.yml
mamba activate vartracker
pip install vartracker
```

If you prefer conda:

```bash
conda env create -n vartracker -f environment.yml
conda activate vartracker
pip install vartracker
```

On Apple Silicon (macOS ARM), the environment may not solve with native packages. Use the x86_64 subdir or Docker instead:

```bash
CONDA_SUBDIR=osx-64 mamba env create -n vartracker -f environment.yml
mamba activate vartracker
pip install vartracker
```

vartracker shells out to a handful of bioinformatics tools. Make sure they are discoverable on PATH before running the CLI.
Minimum tested versions are tracked in docs/DEPENDENCIES.md.
- `bcftools` and `tabix` – required for all modes
- `samtools`, `lofreq`, `fastp`, `bwa`, and `snakemake` – required for the `bam` and `end-to-end` Snakemake workflows

If you only plan to run `vartracker vcf` against pre-generated VCFs, the first pair is sufficient. The additional tools are needed whenever you ask vartracker to align reads or call variants for you.
Note: the pinned micromamba environment installs tabix/bgzip via htslib.
On macOS:

```bash
# Using Homebrew
brew install bcftools htslib samtools fastp bwa

# lofreq is available via bioconda (requires conda/mamba)
conda install -c bioconda lofreq

# Using MacPorts
sudo port install bcftools htslib samtools fastp bwa
```

On Linux (Ubuntu/Debian):
```bash
sudo apt-get update
sudo apt-get install bcftools tabix samtools fastp bwa

# lofreq is easiest to install via bioconda on Debian-based systems:
conda install -c bioconda lofreq
```

On Linux (CentOS/RHEL/Fedora):
```bash
# CentOS/RHEL with EPEL
sudo yum install epel-release
sudo yum install bcftools htslib samtools fastp bwa

# Fedora
sudo dnf install bcftools htslib samtools fastp bwa

# Install lofreq via bioconda on RPM-based systems:
conda install -c bioconda lofreq
```

Using conda:
```bash
conda install -c bioconda bcftools samtools tabix fastp bwa lofreq
```

For development or to get the latest version (requires Python 3.11+):
```bash
git clone https://github.com/charlesfoster/vartracker.git
cd vartracker
pip install -e .[dev]
pre-commit install
```

Build a container image that bundles Python, vartracker, and all external bioinformatics tools:

```bash
docker build -t vartracker:latest .
```

Docker is a self-contained, reproducible option. If you publish the image, record the digest and set it when running so it is included in the run manifest:
```bash
export VARTRACKER_CONTAINER_IMAGE=ghcr.io/your-org/vartracker:2.0.0
export VARTRACKER_CONTAINER_DIGEST=sha256:...
```

Run workflows by mounting your data directory into the container. The command below analyses an input CSV located in the current directory and writes results beside it:

```bash
docker run --rm -v "$(pwd)":/workspace vartracker \
    vcf /workspace/inputs/vcf_inputs.csv \
    --outdir /workspace/results
```

## Quick Start

After installation, vartracker will be available as a command-line tool:
```bash
vartracker --help

# Analyse pre-called VCFs plus coverage files
vartracker vcf path/to/vcf_inputs.csv --outdir results/vcf_run

# Run BAMs through the Snakemake workflow, then summarise variants
vartracker bam path/to/bam_inputs.csv \
    --snakemake-outdir work/bam_pipeline \
    --outdir results/bam_summary

# Start from raw reads (FASTQ) and run the full pipeline
vartracker end-to-end path/to/read_inputs.csv \
    --cores 12 \
    --outdir results/e2e_summary

# Generate a template spreadsheet for a directory of files
vartracker prepare spreadsheet --mode e2e --dir data/passaging --out inputs.csv

# Build a reference FASTA+GFF3 bundle from GenBank accessions
vartracker prepare reference --accessions CY114381,CY114382 --outdir refs/flu --prefix flu_ref

# Exercise the bundled smoke-test dataset
vartracker vcf --test
vartracker bam --test
vartracker end-to-end --test
```

All modes understand `--test`, which copies the example dataset from `vartracker/test_data` into a temporary directory, resolves relative paths, and runs the appropriate workflow.
Every CLI mode reads the same canonical columns:

- `sample_name` (required) – display name for the sample
- `sample_number` (required) – passage/order index used in longitudinal plots
- `reads1`, `reads2` – FASTQ paths (required for `end-to-end`, optional elsewhere). The pipeline can run in single-end mode (leave the `reads2` column empty), but those results are less well tested.
- `bam` – BAM file aligned against the SARS-CoV-2 reference
- `vcf` – bgzipped VCF containing variant calls with depth (DP) and allele-frequency tags
- `coverage` – per-base coverage TSV with columns `reference<TAB>position<TAB>depth`
Mode-specific expectations:
- VCF mode requires `vcf` and `coverage`, while leaving `reads*`/`bam` empty.
- BAM mode requires `bam` and will fill `vcf` + `coverage` during the workflow.
- End-to-end mode requires `reads1` (and optionally `reads2`); the remaining fields are generated.
Relative paths are resolved with respect to the CSV location, so you can store the sheet alongside
your sequencing artefacts. The prepare spreadsheet subcommand can scaffold a CSV and highlight missing files.
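For illustration, a minimal sheet for `vartracker vcf` might look like the following (sample names and file paths are invented; the relative paths resolve against the CSV's own directory):

```csv
sample_name,sample_number,reads1,reads2,bam,vcf,coverage
P0,0,,,,vcfs/P0.vcf.gz,coverage/P0_depth.txt
P5,5,,,,vcfs/P5.vcf.gz,coverage/P5_depth.txt
P10,10,,,,vcfs/P10.vcf.gz,coverage/P10_depth.txt
```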
Coverage files can be produced with `samtools depth -aa sample.bam > sample_depth.txt` or `bedtools genomecov -ibam sample.bam -d`. The file name suffix does not matter; vartracker checks for both `.depth.txt` and `_depth.txt` patterns when preparing its internal test dataset.
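The expected layout is easy to verify by hand. The sketch below writes a toy coverage file (invented depth values) and checks its shape with awk:

```shell
# Fabricate a tiny per-base coverage file in the expected
# reference<TAB>position<TAB>depth layout (toy values only).
printf 'NC_045512.2\t1\t1500\nNC_045512.2\t2\t1498\nNC_045512.2\t3\t1502\n' > sample1_depth.txt

# Sanity-check: every row has exactly three tab-separated fields
# and a numeric depth column.
awk -F'\t' 'NF != 3 || $3 !~ /^[0-9]+$/ { bad = 1 }
            END { exit bad }' sample1_depth.txt && echo "coverage file looks OK"
```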
- `vartracker vcf` – accepts plotting and filtering options such as `--min-snv-freq`, `--min-indel-freq`, `--allele-frequency-tag`, `--name`, `--outdir`, `--passage-cap`, `--manifest-level`, and literature controls (`--search-pokay`, `--literature-csv`). Use `--test` to run the bundled smoke test.
- `vartracker bam` – everything from `vcf`, plus Snakemake options: `--snakemake-outdir`, `--cores`, `--snakemake-dryrun`, `--verbose`, `--redo`, `--rulegraph`.
- `vartracker end-to-end` – similar to `bam`, with an optional `--primer-bed` for amplicon clipping.
- `vartracker prepare spreadsheet` – specify `--mode` (`vcf`, `bam`, or `e2e`), `--dir` to scan, `--out` for the CSV, and `--dry-run` to preview without writing a file.
- `vartracker prepare reference` – build a merged FASTA/GFF3 bundle from GenBank nucleotide accessions. Use `--accessions` or `--accession-file`, plus `--outdir`. Optional flags: `--prefix`, `--force`, `--keep-intermediates`, `--skip-csq-validation`.
To search mutations against functional databases:

1. Set up a literature database (optional):

   ```bash
   parse_pokay pokay_database.csv
   ```

   This command automatically downloads the required literature files from the pokay repository into `pokay_literature/NC_045512` (override with `--download-dir`) and writes the processed CSV for downstream analysis.

2. Run vartracker with literature search:

   ```bash
   vartracker [mode] input_data.csv --literature-csv pokay_database.csv -o results/
   ```

Alternatively, pass `--search-pokay` to automatically download and search against the Pokay SARS-CoV-2 literature database.
```text
usage: main.py [-h] [-V] {vcf,bam,end-to-end,e2e,prepare,schema} ...

positional arguments:
  {vcf,bam,end-to-end,e2e,prepare,schema}
    vcf               Analyse VCF inputs
    bam               Run the BAM preprocessing workflow
    end-to-end (e2e)  Run the end-to-end workflow (Snakemake + vartracker)
    prepare           Prepare inputs and references for vartracker
    schema            Print schemas for results tables or literature CSV input

options:
  -h, --help          show this help message and exit
  -V, --version       show program's version number and exit
```

Use `vartracker <subcommand> --help` to inspect the full list of mode-specific arguments.
Use this workflow to build a bcftools csq-ready reference bundle from nucleotide accessions:

```bash
# Comma-separated accessions
# Example with influenza A segments
vartracker prepare reference \
    --accessions CY114381,CY114382,CY114383,CY114384,CY114385,CY114386,CY114387,CY114388 \
    --outdir refs/influenza_a \
    --prefix influenza_a_ref

# One accession per line in a file
vartracker prepare reference \
    --accession-file accessions.txt \
    --outdir refs/
```

Required external tools:

- `bcftools` for csq smoke validation

Outputs:

- `<outdir>/<prefix>.fa`
- `<outdir>/<prefix>.gff3`
- `<outdir>/<prefix>.fa.fai`
- `<outdir>/prepare_metadata.json`
Validation notes:

- Unless `--skip-csq-validation` is supplied, vartracker writes a dummy coding-region VCF variant and runs `bcftools csq` against the generated FASTA/GFF3.
- Validation fails fast if `bcftools csq` exits non-zero or if the output VCF does not contain `BCSQ`.

Troubleshooting:

- Accession fetch failures: verify accession spelling and network access to NCBI efetch.
- SeqID mismatch errors: confirm FASTA headers and GFF3 seqids match exactly.
- csq validation failure: inspect the stderr snippet in the error output and confirm the `bcftools` version and annotation structure.
After installation you can verify the workflows using the bundled demonstration dataset:

```bash
vartracker vcf --test --outdir vartracker_vcf_test_results
vartracker bam --test --outdir vartracker_bam_test_results
vartracker end-to-end --test --outdir vartracker_e2e_test_results
```

Each command copies the example dataset, resolves relative paths, checks for the required external tools, and writes a self-contained set of results.

## Output

vartracker produces several output files:
- results.csv: Comprehensive variant analysis with all metrics
- results_metadata.json: Output schema version and results metadata
- new_mutations.csv: Mutations not present in the first sample
- persistent_new_mutations.csv: New mutations that persist to the final sample
- cumulative_mutations.pdf: Plot showing mutation accumulation over time
- mutations_per_gene.pdf: Gene-wise mutation statistics
- variant_allele_frequency_heatmap.html: Interactive heatmap with optional literature annotations
- variant_allele_frequency_heatmap.pdf: Heatmap of variant allele frequencies across passages
- literature_database_hits.*.csv: Functional annotation results (if literature search used)
- run_metadata.json: Provenance manifest capturing inputs, tool versions, and run status
By default the manifest is lightweight. Use --manifest-level deep to checksum all referenced
input files (FASTQ/BAM/VCF/coverage) and include file sizes.
The results table schema is documented in docs/OUTPUT_SCHEMA.md. You can also print it from the CLI:

```bash
vartracker schema results
```

To write the schema to a file instead, use:

```bash
vartracker schema results --out docs/output_schema.csv
vartracker schema results --out docs/output_schema.json --format json
```

To print the expected literature CSV structure for `--literature-csv`, use:

```bash
vartracker schema literature
```

## What does vartracker do?

The pipeline performs the following analysis:
1. **VCF Standardization**: Normalizes and standardizes input VCF files
2. **Annotation**: Adds amino acid consequences using `bcftools csq`
3. **Variant Merging**: Combines all longitudinal samples
4. **Comprehensive Analysis**: For each variant, determines:
   - Gene location and amino acid consequences
   - Variant type (SNP/indel) and change type (synonymous/missense/etc.)
   - Persistence across samples (new/original, persistent/transient)
   - Quality control metrics
   - Amino acid property changes
   - Allele frequency dynamics
5. **Visualization**: Generates plots for mutation accumulation and gene-wise statistics
6. **Functional Annotation** (optional): Searches against literature databases for known functional impacts
## Citation

When using vartracker, please cite the software release you used. Citation metadata is provided in CITATION.cff, and GitHub releases are archived on Zenodo:

- Foster, C. (2026). vartracker (Version x.y.z). Zenodo. https://doi.org/10.5281/zenodo.XXXXX
- Concept DOI (all versions): https://doi.org/10.5281/zenodo.18452274

Note: a version-specific DOI is minted by Zenodo after each GitHub release.
Also cite relevant methods or data sources, for example:
- Foster CSP, et al. Long-term serial passaging of SARS-CoV-2 reveals signatures of convergent evolution. Journal of Virology. 2025;99: e00363-25. doi:10.1128/jvi.00363-25
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10. doi:10.1093/gigascience/giab008
- Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017;33: 2037–2039. doi:10.1093/bioinformatics/btx100
- Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40: 11189–11201. doi:10.1093/nar/gks918
- Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34: i884–i890. doi:10.1093/bioinformatics/bty560
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013 [cited 13 Apr 2021]. Available: https://arxiv.org/abs/1303.3997v2
- Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10: 33. doi:10.12688/f1000research.29032.2
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## Support

If you encounter any issues or have questions:
- Check the documentation
- Search existing issues
- Create a new issue with detailed information about your problem