Feature request: Keep index in memory while mapping multiple samples in a row #483

@shiraz-shah

For metagenomics with --aemb, the index often reaches several hundred gigabytes. At that size, loading the index into memory takes longer than the actual mapping.

Could strobealign be given a list of samples as input and iterate over them, mapping each sample's reads and saving its output before moving on to the next sample without exiting? The index would then only need to be loaded into memory once.

This would make strobealign three to four times faster for our use case.

An input TSV file could look like:

sample1 fq/sample1_1.fq.gz fq/sample1_2.fq.gz map/sample1.tsv
sample2 fq/sample2_1.fq.gz fq/sample2_2.fq.gz map/sample2.tsv
sample3 fq/sample3_1.fq.gz fq/sample3_2.fq.gz map/sample3.tsv
sample4 fq/sample4_1.fq.gz fq/sample4_2.fq.gz map/sample4.tsv
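
To make the proposed format concrete, here is a minimal Python sketch of how such a file could be read. It is only an illustration of the format, not how strobealign would implement it. The `Sample` type and the `parse_manifest` function are hypothetical names, and the sketch assumes the four columns (sample name, read 1, read 2, output) are separated by tabs.

```python
import csv
import io
from typing import NamedTuple


class Sample(NamedTuple):
    """One row of the proposed manifest (hypothetical type, for illustration)."""
    name: str
    reads1: str
    reads2: str
    output: str


def parse_manifest(handle):
    """Yield one Sample per non-empty row of a four-column, tab-separated file."""
    for row_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"line {row_number}: expected 4 columns, got {len(row)}")
        yield Sample(*(field.strip() for field in row))


# Assumption: columns are tab-separated, as the "tsv" in the request suggests.
manifest_text = (
    "sample1\tfq/sample1_1.fq.gz\tfq/sample1_2.fq.gz\tmap/sample1.tsv\n"
    "sample2\tfq/sample2_1.fq.gz\tfq/sample2_2.fq.gz\tmap/sample2.tsv\n"
)
samples = list(parse_manifest(io.StringIO(manifest_text)))
for sample in samples:
    print(sample.name, sample.reads1, sample.reads2, sample.output)
```

Rejecting rows that do not have exactly four columns means a malformed line is reported with its line number before any mapping starts, rather than after the index has already been loaded.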

Or just a list of samples:

sample1
sample2
sample3
sample4

where strobealign is then told to look for the input files sampleX_1.fq.gz and sampleX_2.fq.gz in the specified input folder and to write the --aemb output TSV file to the specified output folder.
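
As a rough Python sketch of this second variant, sample names could be expanded into paths as shown below. The function names `read_sample_names` and `expand_sample` are hypothetical. The output file name `<sample>.tsv` is an assumption borrowed from the TSV example earlier in this request.

```python
from pathlib import Path


def read_sample_names(text):
    """Return the non-empty lines of a sample list, one sample name per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def expand_sample(name, input_dir, output_dir):
    """Derive the two read files and the output file for one sample name.

    Assumes the naming scheme <name>_1.fq.gz / <name>_2.fq.gz for input and
    <name>.tsv for output (the output name is an assumption, not a given).
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    return (
        input_dir / f"{name}_1.fq.gz",
        input_dir / f"{name}_2.fq.gz",
        output_dir / f"{name}.tsv",
    )


names = read_sample_names("sample1\nsample2\n")
for name in names:
    reads1, reads2, output = expand_sample(name, "fq", "map")
    print(reads1, reads2, output)
```

This variant keeps the input file short, but it only works when every sample follows the same file naming scheme. The four-column format is more verbose and does not have that restriction.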
