The program first takes a set of pattern strings, breaks them into k-mers, and builds a keyword tree with suffix links from this set of k-mers (see the Aho-Corasick algorithm).
In the second part, each line of an input file is streamed through the keyword tree and the counters of the corresponding k-mers for this file are collected.
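The two phases described above can be sketched in C as follows. This is a minimal illustration of the Aho-Corasick idea, not the program's actual code: the names Automaton, ac_init, ac_add, ac_build and ac_count, and the fixed MAX_NODES capacity, are hypothetical. It assumes all patterns have the same length k, so no k-mer is a proper suffix of another and only the current state has to be checked for a match.

```c
#include <assert.h>

#define SIGMA 4          /* DNA alphabet: A, C, G, T */
#define MAX_NODES 1024   /* illustrative fixed capacity (assumption) */

typedef struct {
    int next[SIGMA];  /* transitions; completed to a full automaton by ac_build */
    int fail;         /* suffix link */
    int kmer_id;      /* index of the k-mer ending at this node, or -1 */
} Node;

typedef struct {
    Node nodes[MAX_NODES];
    int n_nodes;
} Automaton;

/* Map a base to 0..3, or -1 for any other character. */
static int code(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

static int new_node(Automaton *a) {
    assert(a->n_nodes < MAX_NODES);
    Node *n = &a->nodes[a->n_nodes];
    for (int i = 0; i < SIGMA; i++) n->next[i] = -1;
    n->fail = 0;
    n->kmer_id = -1;
    return a->n_nodes++;
}

/* Create an automaton that contains only the root (node 0). */
void ac_init(Automaton *a) {
    a->n_nodes = 0;
    new_node(a);
}

/* Phase 1a: insert one k-mer (characters A, C, G, T only) into the keyword tree. */
void ac_add(Automaton *a, const char *kmer, int id) {
    int cur = 0;
    for (; *kmer; kmer++) {
        int c = code(*kmer);
        assert(c >= 0);
        if (a->nodes[cur].next[c] < 0) {
            int created = new_node(a);
            a->nodes[cur].next[c] = created;
        }
        cur = a->nodes[cur].next[c];
    }
    a->nodes[cur].kmer_id = id;
}

/* Phase 1b: compute suffix links breadth-first and fill in missing transitions. */
void ac_build(Automaton *a) {
    int queue[MAX_NODES], head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++) {
        int v = a->nodes[0].next[c];
        if (v < 0) a->nodes[0].next[c] = 0;
        else { a->nodes[v].fail = 0; queue[tail++] = v; }
    }
    while (head < tail) {
        int u = queue[head++];
        for (int c = 0; c < SIGMA; c++) {
            int v = a->nodes[u].next[c];
            int f = a->nodes[a->nodes[u].fail].next[c];
            if (v < 0) a->nodes[u].next[c] = f;
            else { a->nodes[v].fail = f; queue[tail++] = v; }
        }
    }
}

/* Phase 2: stream one line of text through the automaton and update the counters. */
void ac_count(const Automaton *a, const char *text, long *counts) {
    int state = 0;
    for (; *text; text++) {
        int c = code(*text);
        if (c < 0) { state = 0; continue; }  /* a non-ACGT character restarts matching */
        state = a->nodes[state].next[c];
        int id = a->nodes[state].kmer_id;
        if (id >= 0) counts[id]++;
    }
}
```

A caller would run ac_init once, ac_add for each k-mer, ac_build once, and then ac_count for every input line, reusing the same counts array across lines.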
The number of k-mers that can be counted simultaneously is limited by the amount of available RAM. It is also limited by the signed integer type SC_INT, defined as int32_t on line 18 of common.h. With this definition, an index can be built for at most INT32_MAX/k input k-mers, where INT32_MAX = 2,147,483,647. To raise this limit, redefine SC_INT as int64_t and recompile.
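To get a feel for the limit, the bound stated above (the maximum value of the index type divided by k) can be evaluated with a small helper. The function name max_indexable_kmers is hypothetical and is not part of the program; the value k = 32 used below is only an example.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of k-mers of length k that can be indexed
 * when the index type can hold values up to max_value. */
int64_t max_indexable_kmers(int64_t max_value, int k) {
    assert(k > 0);
    return max_value / k;  /* integer division, rounds down */
}
```

For example, max_indexable_kmers(INT32_MAX, 32) gives 67,108,863 k-mers of length 32 with the default int32_t definition; passing INT64_MAX instead shows the bound after switching SC_INT to int64_t.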
zlib
make
--kmers 'kmers_file'
NOTE: The file with k-mers should contain only characters from a valid DNA alphabet. This should be dealt with prior to running the program.
-i --input 'input_file'
where 'input_file' is the full path and file name of the file in which the k-mers are counted. If the input option is not specified, the program tries to read the input text from stdin. In this case, the following commands are valid:
cat 'input_file' | ./streamcount --kmers 'kmers_file'
./streamcount --kmers 'kmers_file' < 'input_file'
By specifying only these two parameters, we accept the following default program behaviour:
- 'input_file' is of type FASTA. It can be compressed.
- Each line of 'kmers_file' is treated as a separate k-mer.
- The final count for each k-mer includes a count for its reverse complement string.
- The final counts for each k-mer are written to stdout, one count per line.
- If some k-mers in 'kmers_file' are not unique, this information is suppressed.
- Counting is performed with DEFAULT_NUMBER_OF_THREADS defined on line 24 in common.h.
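Since the default counts include the reverse complement of each k-mer, it may help to see how a reverse complement is formed. The sketch below is illustrative only: the function names are hypothetical, it assumes upper-case A, C, G, T input, and it does not show how the program itself combines the two counts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement of a single upper-case DNA base; other characters are returned unchanged. */
static char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return base;
    }
}

/* Write the reverse complement of seq into out.
 * out must have room for strlen(seq) + 1 bytes and must not overlap seq. */
void reverse_complement(const char *seq, char *out) {
    assert(seq != NULL && out != NULL);
    size_t n = strlen(seq);
    for (size_t i = 0; i < n; i++) {
        out[i] = complement(seq[n - 1 - i]);
    }
    out[n] = '\0';
}
```

For the k-mer AACG the reverse complement is CGTT, so under the default behaviour occurrences of both strings contribute to the count reported for AACG. A k-mer such as ACGT is its own reverse complement.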
-k='k'
--kmers-multiline
--input-plain-text
-t
--no-rc
-m, --mem='MEMORY_MB'
--printseq
--repeat-mask-tofile='repeat-mask-file'