extracTR

Introduction

extracTR is a tool for identifying and analyzing tandem repeats (satellite DNA) in genomic sequences. It works with raw sequencing data (FASTQ) or assembled genomes (FASTA) and uses k-mer-based approaches to detect repetitive patterns efficiently. extracTR can also design FISH probes for detected satellites and enrich monomer variant sequences directly from the de Bruijn graph.

Features

Efficient tandem repeat detection from raw sequencing data
Support for single-end and paired-end FASTQ files
Support for genome assemblies in FASTA format
Support for precomputed aindex (--aindex)
FISH probe design for detected satellite monomers
Monomer variant enrichment via de Bruijn graph cycle search
IUPAC degenerate consensus generation from variants
Customizable parameters for fine-tuning repeat detection
Multi-threaded processing for improved performance

Requirements

Python 3.7 or later
Jellyfish 2.3.0 or later
Conda (for easy environment management)

Installation

We recommend installing extracTR in a separate Conda environment to manage dependencies effectively.

Create a new Conda environment:

conda create -n extractr_env python=3.9

Activate the environment:

conda activate extractr_env

Install Jellyfish:

conda install -c bioconda jellyfish

Install extracTR using pip:

pip install extracTR

To deactivate the environment when you're done:

conda deactivate

Usage

Before running extracTR, ensure that you have removed adapters from your sequencing reads and activated the Conda environment:

conda activate extractr_env

Basic usage

For paired-end FASTQ files:

extracTR -1 reads_1.fastq -2 reads_2.fastq -o output_prefix -c 30

For a single-end FASTQ file:

extracTR -1 reads.fastq -o output_prefix -c 30

For a genome assembly in FASTA format:

extracTR -f genome.fasta -o output_prefix -c 1

For a precomputed aindex:

extracTR --aindex /path/to/index_prefix -o output_prefix -c 30
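
When many samples need processing, the invocations above can be assembled programmatically. The following is a minimal sketch, not part of extracTR: `build_extractr_command` is a hypothetical helper that only combines the flags documented in this README and does not run anything. It assumes that one input type is given per run.

```python
from typing import List, Optional


def build_extractr_command(
    output_prefix: str,
    coverage: int,
    fastq1: Optional[str] = None,
    fastq2: Optional[str] = None,
    fasta: Optional[str] = None,
    aindex: Optional[str] = None,
) -> List[str]:
    """Assemble an extracTR argument list (hypothetical helper)."""
    if fastq2 and not fastq1:
        raise ValueError("-2 requires -1")
    # Assumption: exactly one input type is supplied per run.
    sources = [s for s in (fastq1, fasta, aindex) if s]
    if len(sources) != 1:
        raise ValueError("provide FASTQ file(s), a FASTA file, or an aindex prefix")

    cmd = ["extracTR"]
    if fastq1:
        cmd += ["-1", fastq1]
        if fastq2:
            cmd += ["-2", fastq2]
    elif fasta:
        cmd += ["-f", fasta]
    else:
        cmd += ["--aindex", aindex]
    cmd += ["-o", output_prefix, "-c", str(coverage)]
    return cmd


cmd = build_extractr_command(
    "output_prefix", 30, fastq1="reads_1.fastq", fastq2="reads_2.fastq"
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the activated Conda environment.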

Advanced usage

Custom k-mer size and probe design parameters:

extracTR -1 reads_1.fastq -2 reads_2.fastq -o output_prefix -t 64 -c 30 -k 25 \
    --probe-length 45 --top-probes 5 --min-gc 0.40 --max-gc 0.60

Skip variant enrichment and/or probe design:

extracTR -1 reads.fastq -o output_prefix -c 30 --skip-variants --skip-probes

Options

Input / output

Option	Description
`-1, --fastq1`	Input file with forward reads in FASTQ format
`-2, --fastq2`	Input file with reverse reads in FASTQ format (optional)
`-f, --fasta`	Input genome assembly in FASTA format
`--aindex`	Prefix for a precomputed aindex (skips index computation)
`-o, --output`	Prefix for output files (required)

Indexing parameters

Option	Default	Description
`-t, --threads`	32	Number of threads for index computation
`-c, --coverage`	none	Data coverage (required; set to 1 for a genome assembly)
`-k, --k`	23	K-mer size for aindex
`--lu`	100 * coverage	Minimum k-mer frequency cutoff
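
To make the `--lu` default concrete, the sketch below computes `100 * coverage` and applies a cutoff to a toy k-mer count table. It is an illustration only: the function names are hypothetical, the counting is done in plain Python rather than with Jellyfish, reverse complements are ignored, and it assumes that k-mers below the cutoff are simply discarded.

```python
from collections import Counter


def default_lu(coverage: int) -> int:
    """Default cutoff as documented: 100 * coverage."""
    return 100 * coverage


def count_kmers(reads, k):
    """Count every k-mer in a list of reads (toy stand-in for the real index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(counts, lu):
    """Keep k-mers whose count is at least lu (assumed meaning of the cutoff)."""
    return {kmer: n for kmer, n in counts.items() if n >= lu}


print(default_lu(30))  # 3000
print(default_lu(1))   # 100

counts = count_kmers(["ACGTACGTACGT", "TTTTGGGG"], k=4)
print(frequent_kmers(counts, lu=3))  # {'ACGT': 3}
```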

FISH probe design

Option	Default	Description
`--probe-length`	40	Probe length in bp
`--top-probes`	3	Number of top probes to report per monomer
`--min-gc`	0.35	Minimum GC content for probes
`--max-gc`	0.65	Maximum GC content for probes
`--skip-probes`	false	Skip the FISH probe design step
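
The effect of `--probe-length`, `--min-gc` and `--max-gc` can be sketched as a sliding window over a circular monomer followed by a GC filter. This is a simplified illustration, not the extracTR implementation: the function names are hypothetical, the wrap-around is handled by unrolling the monomer into tandem copies, and the Tm filter, scoring and overlap removal are left out.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0-1)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def circular_windows(monomer: str, probe_length: int):
    """Yield (start, window) for each start position, wrapping past the end."""
    # Unroll enough tandem copies that every window fits, even when the
    # probe is longer than the monomer.
    copies = probe_length // len(monomer) + 2
    unrolled = monomer * copies
    for start in range(len(monomer)):
        yield start, unrolled[start:start + probe_length]


def gc_filtered_candidates(monomer, probe_length=40, min_gc=0.35, max_gc=0.65):
    """Return (start, window, gc) for windows inside the GC range."""
    candidates = []
    for start, window in circular_windows(monomer, probe_length):
        gc = gc_fraction(window)
        if min_gc <= gc <= max_gc:
            candidates.append((start, window, gc))
    return candidates


# Toy monomer and short probe length, chosen only to keep the output small.
for start, window, gc in gc_filtered_candidates("ACGTA", 3, min_gc=0.3, max_gc=0.7):
    print(start, window, round(gc, 2))
```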

Variant enrichment

Option	Default	Description
`--skip-variants`	false	Skip the variant enrichment step

Note: extracTR needs an input source: FASTQ file(s) (`-1`, optionally with `-2`), a FASTA file (`-f`), or a precomputed aindex (`--aindex`).

Pipeline

extracTR runs the following steps:

1. Index computation: build or load a k-mer frequency index (aindex).
2. Tandem repeat detection: a bidirectional greedy walk in the de Bruijn graph finds circular paths (tandem repeats) and linear elements.
3. Save results: write detected monomers and dispersed elements to FASTA.
4. Analyze repeat borders: placeholder for future development.
5. Variant enrichment: for each monomer, search alternative cycles in the de Bruijn graph to find sequence variants, then generate an IUPAC degenerate consensus from same-length variants.
6. FISH probe design: slide a window across each circular monomer, score candidates by k-mer frequency strength and specificity (CV-based), filter by GC% and Tm, and remove overlapping probes.
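
The idea behind the tandem repeat detection step can be illustrated with a toy greedy walk. The sketch below is a deliberate simplification and should not be read as the extracTR algorithm: it walks in one direction only (the real step is bidirectional), uses an in-memory `Counter` instead of the aindex, and all function names are hypothetical. A walk that returns to its starting k-mer is reported as a circular path (a monomer); a walk that runs out of successors is treated as a linear element.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all k-mers of a sequence (toy replacement for the index)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def greedy_cycle(start, counts, max_steps=10000):
    """Walk right from `start`, always taking the most frequent successor.

    Returns the monomer if the walk closes back on `start`, otherwise None.
    """
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        # Score the four possible one-base extensions by k-mer frequency.
        best_count, best_base = max(
            (counts.get(current[1:] + base, 0), base) for base in "ACGT"
        )
        if best_count == 0:
            return None  # dead end: linear element, not a tandem repeat
        nxt = current[1:] + best_base
        if nxt == start:
            # Each k-mer on the cycle contributes its first base.
            return "".join(kmer[0] for kmer in path)
        if nxt in seen:
            return None  # closed a loop that does not include the start
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return None


array = "ACGTT" * 20  # toy satellite array with a 5 bp monomer
print(greedy_cycle("ACGT", kmer_counts(array, 4)))       # ACGTT
print(greedy_cycle("ACGT", kmer_counts("ACGTTGCA", 4)))  # None
```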

Output

extracTR generates the following output files:

File	Description
`{prefix}.fa`	Predicted tandem repeat monomers (FASTA)
`{prefix}_te.fa`	Predicted dispersed / non-circular elements (FASTA)
`{prefix}_variants.fa`	Monomer sequence variants from de Bruijn graph
`{prefix}_consensus.fa`	IUPAC degenerate consensus sequences
`{prefix}_probes.fa`	FISH probe candidates (FASTA with metrics in headers)
`{prefix}_probes.tsv`	FISH probe candidates with full metrics (TSV)
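
The relationship between `{prefix}_variants.fa` and `{prefix}_consensus.fa` can be illustrated with a column-wise IUPAC merge. This is a minimal sketch under stated assumptions: `iupac_consensus` is a hypothetical function, the variants are assumed to have the same length and the same rotation, every observed base contributes regardless of its frequency, and only A, C, G and T are accepted.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(variants):
    """Merge same-length variants into one degenerate sequence."""
    if not variants:
        raise ValueError("no variants given")
    length = len(variants[0])
    if any(len(v) != length for v in variants):
        raise ValueError("variants must have the same length")
    # zip(*...) iterates over alignment columns; each column maps to one code.
    columns = zip(*(v.upper() for v in variants))
    return "".join(IUPAC[frozenset(col)] for col in columns)


print(iupac_consensus(["ACGT", "ACAT", "TCGT"]))  # WCRT
```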

Probe TSV columns

Column	Description
`probe_id`	Unique probe identifier
`source_monomer`	Monomer the probe was designed from
`position`	Start position within the monomer
`length`	Probe length in bp
`sequence`	Probe nucleotide sequence
`gc_content`	GC fraction (0-1)
`melting_temp`	Estimated melting temperature (°C)
`frequency_score`	Normalized mean k-mer frequency (signal strength)
`specificity_score`	CV-based specificity (1 = highly specific)
`composite_score`	frequency_score * specificity_score
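
A probe table with these columns can be loaded with the standard `csv` module. The sketch below assumes the TSV starts with a header row that uses the column names listed above; the helper names are hypothetical, and the two rows are invented toy values used only to make the example self-contained.

```python
import csv
import io

INT_COLUMNS = ("position", "length")
FLOAT_COLUMNS = ("gc_content", "melting_temp", "frequency_score",
                 "specificity_score", "composite_score")


def read_probes(handle):
    """Parse a probe TSV into a list of dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        for col in FLOAT_COLUMNS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def best_probe_per_monomer(rows):
    """Pick the row with the highest composite_score for each source monomer."""
    best = {}
    for row in rows:
        name = row["source_monomer"]
        if name not in best or row["composite_score"] > best[name]["composite_score"]:
            best[name] = row
    return best


# Invented toy rows, not real extracTR output.
header = ["probe_id", "source_monomer", "position", "length", "sequence",
          "gc_content", "melting_temp", "frequency_score",
          "specificity_score", "composite_score"]
toy = "\n".join("\t".join(fields) for fields in [
    header,
    ["p1", "mono_1", "0", "8", "ACGTACGT", "0.5", "24.0", "0.8", "0.5", "0.4"],
    ["p2", "mono_1", "4", "8", "ACGTGGCC", "0.75", "28.0", "0.9", "0.9", "0.81"],
])

rows = read_probes(io.StringIO(toy))
for row in rows:
    # composite_score is documented as frequency_score * specificity_score.
    expected = row["frequency_score"] * row["specificity_score"]
    assert abs(expected - row["composite_score"]) < 1e-6
print(best_probe_per_monomer(rows)["mono_1"]["probe_id"])  # p2
```

For a real run, replace the `io.StringIO` object with `open("output_prefix_probes.tsv", newline="")`.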

Benchmarking

extracTR includes a benchmark module for evaluating repeat detection against TRF ground truth on reference genomes:

python -m extractr.benchmark \
    --fastq1 reads_1.fastq --fastq2 reads_2.fastq \
    -o benchmark_results -c 30 \
    --ref-index /path/to/genome.23 \
    --ref-header /path/to/genome.header \
    --ref-sdat /path/to/genome.23.sdat \
    --trf /path/to/genome.trf \
    --hl /path/to/srf.fa

License

BSD License
