streamcount

This is a program that counts occurrences of k-mers (strings of length k) in an arbitrarily large input.

The program first takes a set of pattern strings, breaks the strings into k-mers, and builds from this set of k-mers a keyword tree with suffix links (see Aho-Corasick algorithm).

In the second part, each line of an input file is streamed through the keyword tree and the counters of the corresponding k-mers for this file are collected.
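The two phases described above can be sketched as follows. This is a minimal Python illustration of a keyword tree with suffix (failure) links in the Aho-Corasick style, not the C implementation in keyword_tree.c; the function names `build_automaton` and `count_line` are hypothetical.

```python
from collections import deque

def build_automaton(kmers):
    """Build a keyword tree over the k-mers and add suffix (failure) links."""
    goto = [{}]   # goto[node][char] -> child node; node 0 is the root
    fail = [0]    # suffix link of each node
    out = [[]]    # indices of the k-mers that end at each node
    for idx, kmer in enumerate(kmers):
        node = 0
        for ch in kmer:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    # Breadth-first pass: children of the root keep suffix link 0,
    # deeper nodes derive their link from the link of their parent.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def count_line(line, automaton, counts):
    """Stream one input line through the tree and update the counters in place."""
    goto, fail, out = automaton
    node = 0
    for ch in line:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            counts[idx] += 1

kmers = ["ACG", "CGT", "GTT"]
automaton = build_automaton(kmers)
counts = [0] * len(kmers)
for line in ["ACGTT", "TACGA"]:
    count_line(line, automaton, counts)
print(counts)  # [2, 1, 1]
```

Each input character is consumed once, so the input never has to be held in memory as a whole; only the tree and the counters do.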

The number of k-mers that can be counted simultaneously is limited by the amount of available RAM. It is also limited by the signed integer type SC_INT, defined as int32_t on line 18 of common.h. With this definition, an index can be built for at most INT32_MAX/k input k-mers, where INT32_MAX = 2^31 - 1. To raise this limit, redefine SC_INT as int64_t and recompile.
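As a worked example of this limit, the arithmetic can be written out as below. The helper name `max_indexable_kmers` is hypothetical, and the sketch only reproduces the division stated above; it does not model the RAM limit.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer
INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer

def max_indexable_kmers(k, sc_int_max=INT32_MAX):
    """Upper bound on the number of input k-mers for a given SC_INT maximum."""
    return sc_int_max // k

print(max_indexable_kmers(31))  # 69273666
print(max_indexable_kmers(31, INT64_MAX) > max_indexable_kmers(31))  # True
```

For k = 31 the int32_t definition therefore allows roughly 69 million input k-mers.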

Dependencies:

 zlib

To install: apt-get install zlib1g-dev

To compile:

 make

To run:

If you add the directory containing the compiled streamcount binary to your PATH variable, it can be run as a standard Unix command: streamcount

Program arguments

Required:

 --kmers 'kmers_file'

where 'kmers_file' is the full path and file name of the file from which to extract the k-mers.
NOTE: The file with k-mers should contain only characters from a valid DNA alphabet. This should be dealt with prior to running the program.

 -i --input 'input_file'

where 'input_file' is the full path and file name of the file in which to count the k-mers. If the input option is not specified, the program tries to read the input text from stdin. In this case, the following commands are valid:

 cat 'input_file' | ./streamcount --kmers 'kmers_file'

 ./streamcount --kmers 'kmers_file' < 'input_file'

By specifying only these two parameters, we accept the following default program behaviour:

'input_file' is of type FASTA. It can be compressed.
Each line of 'kmers_file' is treated as a separate k-mer.
The final count for each k-mer includes a count for its reverse complement string.
The final counts for each k-mer are written to stdout, one count per line.
If some k-mers in 'kmers_file' are not unique, this information is suppressed.
Counting is performed with DEFAULT_NUMBER_OF_THREADS defined on line 24 in common.h.
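The first default above says the input may be compressed. The program itself depends on zlib for this; the following is only a Python sketch of the same idea, assuming gzip compression and detecting it by the two-byte gzip magic number. The helper name `open_maybe_gzip` is hypothetical.

```python
import gzip

def open_maybe_gzip(path):
    """Open a text file for reading, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # every gzip stream starts with these two bytes
        return gzip.open(path, "rt")
    return open(path, "rt")
```

The caller then iterates over lines in the same way for both kinds of file.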

Optional:

Input options:

length of each k-mer

 -k='k'

If a line of 'kmers_file' contains more than one k-mer of length 'k', all of them are considered. In this case, the output for each line is a single line of comma-separated counts.
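A small sketch of this per-line behaviour follows. It assumes that the k-mers of a line are its consecutive overlapping substrings of length k (a sliding window with step 1), which the README does not state explicitly; the names `kmers_in_line` and `format_counts` are hypothetical, and `count_of` stands in for the counters collected during streaming.

```python
def kmers_in_line(line, k):
    """Return all overlapping substrings of length k, left to right."""
    line = line.strip()
    return [line[i:i + k] for i in range(len(line) - k + 1)]

def format_counts(line, k, count_of):
    """Render one output line: the counts of the line's k-mers, comma-separated."""
    return ",".join(str(count_of.get(kmer, 0)) for kmer in kmers_in_line(line, k))

print(kmers_in_line("ACGTA", 3))  # ['ACG', 'CGT', 'GTA']
print(format_counts("ACGTA", 3, {"ACG": 2, "CGT": 0, "GTA": 5}))  # 2,0,5
```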

type of k-mers input

 --kmers-multiline

This will extract k-mers from 'kmers_file' treating the entire file as one string
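The difference between the default line-by-line mode and --kmers-multiline can be illustrated as below. The sketch assumes that line breaks are simply dropped when the file is treated as one string and that k-mers are overlapping windows of length k; the function name `kmers_from_text` is hypothetical.

```python
def kmers_from_text(text, k, multiline=False):
    """Return one list of k-mers per logical line of the k-mers file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if multiline:
        lines = ["".join(lines)]  # the whole file becomes a single string
    return [[ln[i:i + k] for i in range(len(ln) - k + 1)] for ln in lines]

text = "ACG\nTAC\n"
print(kmers_from_text(text, 3))                  # [['ACG'], ['TAC']]
print(kmers_from_text(text, 3, multiline=True))  # [['ACG', 'CGT', 'GTA', 'TAC']]
```

In multiline mode, k-mers that span a line break (here CGT and GTA) are extracted as well.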

type of input file

 --input-plain-text

This will treat input as text lines, rather than FASTA.
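A rough sketch of what the two input types mean for line streaming is given below. It assumes that in FASTA mode the header lines (those starting with '>') are skipped and every other line is streamed on its own; the program itself parses FASTA in C, and the generator name `sequence_lines` is hypothetical.

```python
def sequence_lines(lines, fasta=True):
    """Yield the lines that should be streamed through the keyword tree."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if fasta and line.startswith(">"):
            continue  # FASTA header, not sequence data
        yield line

records = [">read1\n", "ACGT\n", ">read2\n", "TTGA\n"]
print(list(sequence_lines(records)))               # ['ACGT', 'TTGA']
print(list(sequence_lines(records, fasta=False)))  # ['>read1', 'ACGT', '>read2', 'TTGA']
```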

number of threads

-t

It is optimal to set the number of threads to the number of cores. The maximum number of threads is 8; it can be redefined on line 23 of common.h.

Counting options:

reverse complement

 --no-rc

With this option, the count of the reverse complement is not included in the final count of each k-mer. This can be useful when counting k-mers in a genomic sequence rather than in a set of reads.
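The effect of --no-rc can be sketched with a naive counter. This is an illustration only: the real program counts with the keyword tree rather than by scanning, the names `reverse_complement` and `count_with_rc` are hypothetical, and counting a k-mer that equals its own reverse complement only once is an assumption of this sketch, not documented behaviour.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_with_rc(kmer, text, include_rc=True):
    """Count overlapping occurrences of kmer, optionally adding its reverse complement."""
    def occurrences(pattern):
        last_start = len(text) - len(pattern)
        return sum(1 for i in range(last_start + 1) if text.startswith(pattern, i))
    total = occurrences(kmer)
    rc = reverse_complement(kmer)
    if include_rc and rc != kmer:  # assumption: palindromic k-mers are not counted twice
        total += occurrences(rc)
    return total

print(reverse_complement("AACG"))                                 # CGTT
print(count_with_rc("AACG", "AACGTTCGTT"))                        # 3
print(count_with_rc("AACG", "AACGTTCGTT", include_rc=False))      # 1
```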

memory in MB

 -m, --mem='MEMORY_MB'

Specify the amount of memory (in MB) that you are willing to dedicate to the k-mer index. This value is used to estimate, before processing starts, whether the index fits in memory. Default: 4000 MB.

Output options:

print options

 --printseq

Prints each original line of 'kmers_file' before its count(s).

mark repeats

 --repeat-mask-tofile='repeat-mask-file'

For each k-mer, prints 0 or 1 to 'repeat-mask-file'; 1 is printed if the k-mer is not unique (is repeated) in 'kmers_file'. Use this if you need precise counts for all k-mers extracted from the same line: when the same k-mer also occurs on a different line, the counts of consecutive k-mers can be distorted.
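The repeat mask itself can be sketched as follows. The function name `repeat_mask` is hypothetical, and the sketch treats two k-mers as repeats only when they are identical strings; whether the program also treats a reverse complement as a repeat is not stated here.

```python
from collections import Counter

def repeat_mask(kmers):
    """Return 1 for every k-mer that occurs more than once in the list, else 0."""
    seen = Counter(kmers)
    return [1 if seen[kmer] > 1 else 0 for kmer in kmers]

print(repeat_mask(["ACG", "CGT", "ACG", "GTT"]))  # [1, 0, 1, 0]
```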

Sample usage:

The archive 'sample_data.zip' contains one sample input file and one k-mers file. It also contains SAMPLE_RUNS.txt with examples of running streamcount.
