nuc_process
The nuc_process command is used to create Hi-C chromatin contact data in NCC format given input FASTQ paired read files and a genome index/sequence.
nuc_process [-h] [-g GENOME_FILE] [-g2 GENOME_FILE_2] [-cn CHROM_NAME_FILE] [-cn2 CHROM_NAME_FILE_2] [-re1 ENZYME] [-re2 ENZYME] [-s SIZE_RANGE] [-n CPU_COUNT] [-r COUNT] [-o NCC_FILE] [-pdf PDF_FILE] [-b EXE_FILE] [-q SCHEME] [-qm MIN_QUALITY] [-m] [-p] [-pt PAIRED_READ_TAGS PAIRED_READ_TAGS] [-x] [-f FASTA_FILES [FASTA_FILES ...]] [-f2 FASTA_FILES_2 [FASTA_FILES_2 ...]] [-a] [-k] [-sam] [-l SEQUENCE] [-z] [-v] [-u] [-cc GENOME_COPIES] [-lim MAX_READS] [-5 CLIP_BP] [-3 CLIP_BP] [-ad [ADAPTER_SEQ [ADAPTER_SEQ ...]]] FASTQ_FILE [FASTQ_FILE ...]
-h, --help
Show command line options and exit
FASTQ_FILE [FASTQ_FILE ...]
Input paired-read FASTQ files to process. Accepts wildcards that match paired files. If more than two files are input, processing will be run in batch mode using the same parameters.
-g GENOME_FILE
Location of genome index files to map sequence reads to, given without any file extensions like ".1.bt2" etc. A new index will be created with this name if the index is missing and genome FASTA files are specified
-g2 GENOME_FILE_2
Location of secondary genome index files for hybrid genomes. A new index will be created with this name if the index is missing and genome FASTA files are specified
-cn CHROM_NAME_FILE
Location of a file containing chromosome names for the genome build: tab-separated lines mapping sequence/contig names (as appear at the start of genome FASTA headers) to desired (human readable) chromosome names. This file is not mandatory if the primary restriction enzyme (-re1) is specified (i.e. not "None") and a corresponding RE1 mapping file has already been created for the genome. The naming file may be built automatically from NCBI genome FASTA files using the supplied "nuc_sequence_names" program
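The tab-separated layout described for the -cn file can be sketched as follows. This is only an illustration of the two-column format; the contig names and the helper functions are hypothetical and are not part of nuc_process or nuc_sequence_names.

```python
import csv
import io

# Hypothetical contig-to-chromosome mapping; these contig names are
# placeholders, not values taken from any real genome build.
EXAMPLE_NAMES = {
    "contig_A.1": "chr1",
    "contig_B.1": "chr2",
    "contig_C.1": "chrX",
}

def write_chrom_names(mapping, handle):
    """Write one 'contig<TAB>chromosome' line per entry."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for contig, chromo in mapping.items():
        writer.writerow([contig, chromo])

def read_chrom_names(handle):
    """Read tab-separated lines back into a dict, skipping short or blank lines."""
    names = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            names[row[0]] = row[1]
    return names

# Round trip through an in-memory file to show the on-disk layout.
buffer = io.StringIO()
write_chrom_names(EXAMPLE_NAMES, buffer)
buffer.seek(0)
round_trip = read_chrom_names(buffer)
```

Each line of the resulting file holds the name found at the start of a FASTA header, a tab, then the chromosome name to report.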
-cn2 CHROM_NAME_FILE_2
Location of a file containing chromosome names for a second hybrid genome build. This file is only mandatory if an RE1 mapping file has not already been created for the genome. The names in this file must exactly match those for the other genome build where chromosomes are homologous. The file may be built automatically from NCBI genome FASTA files using the supplied "nuc_sequence_names" program
-re1 ENZYME
Primary restriction enzyme (for ligation junctions). May be set to "None" for MicroC etc., where digestion is not sequence specific. Options with "*" denote promiscuous/secondary cleavage activity at star sites. Default: MboI. Available: AluI, BglII, DpnII, DpnII*, HindIII, HindIII*, MboI, None
-re2 ENZYME
Secondary restriction enzyme (if used). Available: AluI, BglII, DpnII, DpnII*, HindIII, HindIII*, MboI, None
-s SIZE_RANGE
Allowed range of sequenced molecule sizes, e.g. "150-1000", "100,800" or "200" (no maximum)
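To make the three accepted SIZE_RANGE spellings concrete, here is a minimal parsing sketch. The function name and the choice to represent "no maximum" as None are assumptions made for illustration; this is not the parser used inside nuc_process.

```python
import re

def parse_size_range(text):
    """Return (min_size, max_size); max_size is None when no maximum is given."""
    # Accept a dash, comma or whitespace between the two numbers.
    parts = [p for p in re.split(r"[-,\s]+", text.strip()) if p]
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"Expected one or two sizes, got: {text!r}")
    sizes = [int(p) for p in parts]
    if len(sizes) == 1:
        return sizes[0], None
    if sizes[0] > sizes[1]:
        raise ValueError(f"Minimum exceeds maximum in: {text!r}")
    return sizes[0], sizes[1]
```

For instance, `parse_size_range("150-1000")` gives a minimum of 150 and a maximum of 1000, while `parse_size_range("200")` gives a minimum of 200 with no maximum.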
-n CPU_COUNT
Number of CPU cores to use in parallel
-r COUNT
Minimum number of sequencing repeats required to support a contact
-o NCC_FILE
Optional output name for NCC format chromosome contact file. This option will be ignored if more than two paired FASTQ files are input (i.e. for batch mode); automated naming will be used instead. If the -a option is used this file will contain ambiguous contacts as well as unambiguous ones.
-pdf PDF_FILE
Optional output name for PDF format report file. This option will be ignored if more than two paired FASTQ files are input (i.e. for batch mode); automated naming will be used instead.
-b EXE_FILE
Path to bowtie2 (read aligner) executable (will be searched for if not specified)
-q SCHEME
Use a specific FASTQ quality scheme (normally not set and deduced automatically). Available: phred33, phred64, solexa
-qm MIN_QUALITY
Minimum acceptable FASTQ quality score in range 0-40 for clipping end of reads. Default: 10
-m
Force a re-mapping of genome restriction enzyme sites (otherwise cached values will be used if present)
-p
The input data is multi-cell/population Hi-C; single-cell processing steps are avoided
-pt PAIRED_READ_TAGS PAIRED_READ_TAGS
When more than two FASTQ files are input (batch mode), the substrings/tags which differ between paired FASTQ file paths. Default: r_1 r_2
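The idea of pairing batch-mode files by a differing tag can be sketched as below. This is a hedged illustration of the concept using the default tags r_1 and r_2; the function name, the file names and the matching rule (swap the last occurrence of the first tag) are assumptions, not the logic nuc_process actually uses.

```python
def pair_fastq_paths(paths, tags=("r_1", "r_2")):
    """Pair paths that are identical apart from the first/second read tag."""
    tag_1, tag_2 = tags
    available = set(paths)
    pairs = []
    for path in sorted(paths):
        if tag_1 not in path:
            continue
        # Swap only the last occurrence of the first tag for the second tag.
        head, _, tail = path.rpartition(tag_1)
        mate = head + tag_2 + tail
        # Files without a matching mate are left unpaired.
        if mate in available:
            pairs.append((path, mate))
    return pairs

# Hypothetical file names for illustration only.
example_paths = ["b_r_2.fq", "a_r_1.fq", "a_r_2.fq", "b_r_1.fq", "orphan_r_1.fq"]
example_pairs = pair_fastq_paths(example_paths)
```

With files named differently, e.g. sample_R1.fq and sample_R2.fq, the corresponding command line setting would be `-pt R1 R2`.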
-x, --reindex
Force a re-indexing of the genome (given appropriate FASTA files)
-f FASTA_FILES [FASTA_FILES ...]
Specify genome FASTA files for genome index building (accepts wildcards)
-f2 FASTA_FILES_2 [FASTA_FILES_2 ...]
A second set of genome FASTA files for building a second genome index when using hybrid strain cells (accepts wildcards)
-a
Whether to report ambiguously mapped contacts
-k
Keep any intermediate files (e.g. clipped FASTQ etc)
-sam
Write paired contacts files to SAM format
-l SEQUENCE
Seek a specific ligation junction sequence (otherwise this is guessed from the primary restriction enzyme)
-z
GZIP compress any output FASTQ files
-v, --verbose
Display verbose messages to report progress
-u
Whether to only accept uniquely mapping genome positions and not attempt to resolve certain classes of ambiguous mapping where a single perfect match is found.
-cc GENOME_COPIES
Number of whole-genome, and hence chromosome, copies, e.g. for G2 phase. Default: 1 for a single genome index, or 2 if a second genome index is specified for hybrid samples
-lim MAX_READS
Limit the number of input reads considered: useful for testing population Hi-C data prior to a lengthy full run
-5 CLIP_BP
Number of base pairs to trim from the 5' start of all input reads. Default: 0
-3 CLIP_BP
Number of base pairs to trim from the 3' end of all input reads. Default: 0
-ad [ADAPTER_SEQ [ADAPTER_SEQ ...]]
Adapter sequences to truncate reads at (or blank for none). E.g. Nextera:CTGTCTCTTATA, Illumina universal:AGATCGGAAGAGC. Default: AGATCGGAAGAGC (Illumina universal)
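Putting several of the options above together, a single-pair run might be assembled as in the following sketch. Only flags documented on this page are used; the helper function, the file names and the genome index path are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex

def build_nuc_process_command(fastq_1, fastq_2, genome_index, ncc_out,
                              enzyme="MboI", cpu_count=4, verbose=True):
    """Assemble an argument list for nuc_process from documented options."""
    command = [
        "nuc_process",
        fastq_1, fastq_2,        # paired-read FASTQ inputs
        "-g", genome_index,      # genome index location, without extensions
        "-re1", enzyme,          # primary restriction enzyme
        "-n", str(cpu_count),    # number of CPU cores
        "-o", ncc_out,           # output NCC file name
    ]
    if verbose:
        command.append("-v")
    return command

# Hypothetical paths; substitute real files before running.
command = build_nuc_process_command(
    "cell_r_1.fq", "cell_r_2.fq", "/data/genome/mm10_index", "cell.ncc"
)
print(shlex.join(command))
```

The list form can be handed to `subprocess.run(command, check=True)` once real input files and an installed nuc_process are available.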