calculate coverage using run report instead of fastq processing #4
mrmonkington wants to merge 6 commits into main from …
Conversation
…summary-only support
| parser.add_argument("--summary", help="Path to the sequencing summary TXT file.") | ||
| parser.add_argument("--below", type=int, help="Only output lines where coverage is below this value.") | ||
| parser.add_argument("--bin-threshold", type=int, default=7000, help="Read length threshold for binned stats (default: 7000).") | ||
| parser.add_argument("--csv", action="store_true", help="Output in CSV format.") |
help="Prints stdout in CSV format" - maybe?
```python
thresh_kb = f"{bin_threshold/1000:g}kb"
if output_csv:
    writer = csv.writer(sys.stdout)
    csv_cols = [c if c not in ['short_total_mb', 'short_avg_len', 'long_total_mb', 'long_avg_len'] else
```
Would change all 'avgs' to 'means' to be more clear - most people in the FASTQ space would use medians or N50/N90s, as they are less susceptible to skew / wide distributions of read lengths, which ONT generally has.
Median is more annoying to calculate - so I think that's why we have historically used the mean (and it would be impossible when using JSON as input? - so mean is fine for us, I think).
Would also be good to get short and long read counts, like long_read_n or short_read_n. This will help us tell if a sample is bad - e.g. "50% of reads are short" is easier to interpret.
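A minimal sketch of what those counts could look like, assuming the per-read lengths are already collected in a list (the names `short_read_n` / `long_read_n` are the suggestion above, not existing fields in the script):

```python
def binned_counts(read_lengths, bin_threshold=7000):
    """Count reads either side of the bin threshold (default 7 kb)."""
    short_read_n = sum(1 for length in read_lengths if length < bin_threshold)
    long_read_n = len(read_lengths) - short_read_n
    short_fraction = short_read_n / len(read_lengths) if read_lengths else 0.0
    return short_read_n, long_read_n, short_fraction
```

A `short_fraction` of 0.5 then reads directly as "50% of reads are short".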
You're right, I should be explicit - I was going from the run template the lab use, which expects means, though it just says 'ave':
File | Total_Yield_(bp) | Ave_length_(bp) | Yield_<7kbp_(bp) | Ave_length_<7kbp_(bp) | Yield_>7kbp_(bp) | Ave_length_>7kbp_(bp) | N50
But maybe they haven't really given any thought to it and have just inherited this approach.
@AndrewMNG would median be more useful for QCing the read distribution? Do the lab use this info, or is it just you? The old script produced means because it was just bodged with awk, but I can do whatever is most useful.
Here's the SOP:
https://docs.google.com/document/d/1ZC_R_JkTnHM5Nm80IkZ5OI9OKszJfHQJl847PmWxkMs/edit?tab=t.0
and an example of where the lab put this info.
Of course, ultimately we want to do away with all of these manual steps - the Run Manager should just report this info where it's needed (or something derived from this info).
@mrmonkington Median or N50 is more useful than mean, since the read lengths are not normally distributed and the mean will be skewed by longer reads. Measuring yield <7kb and >7kb is good to get a measure of how fragmented the DNA is, although a low median will imply this anyway. At the moment it's not something we use; we do run Nanoplot for LR, which computes a lot of stats about the reads, but I only check it if the assembly is very bad - and typically it's down to long reads that are too short. It's something we could use in future though, e.g. databasing - it could be a way of tracking how certain species perform, or the performance of native vs. rapid.
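For reference, both are cheap to compute once the per-read lengths are in hand - a sketch, not part of this PR, and (as noted above) only possible from the sequencing summary, not the JSON report:

```python
import statistics

def n50(read_lengths):
    """Smallest length L such that reads of length >= L cover half the total yield."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def median_length(read_lengths):
    return statistics.median(read_lengths) if read_lengths else 0
```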
R-Cardenas left a comment
Just some small changes to some names, no functional changes. It's really nice!
What does this Pull Request do and why?
Provides a coverage analysis tool which uses MinKNOW's own run stats output, rather than scraping FASTQs.
Will allow faster run summaries, including mid-run, without hammering the data disks.
Can use either of:
- `report_xxxxxxx.json` for instant coverage stats.
- `sequencing_summary_xxxx.txt` (TSV) for read size binning (see the sketch below).

Can optionally output only samples below a target coverage, and the default bin threshold of 7kb can be changed.
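For instance, read lengths for the binned stats can be pulled from the summary with something like this (a sketch; it assumes the standard `sequence_length_template` and `passes_filtering` columns that MinKNOW writes, which may differ between versions):

```python
import csv

def read_lengths_from_summary(path):
    """Yield lengths of pass reads from a sequencing summary TSV."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # Treat rows without a passes_filtering column as pass reads.
            if row.get("passes_filtering", "TRUE").upper() == "TRUE":
                yield int(row["sequence_length_template"])
```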
What are the relevant GitHub Issues or support tickets?
None.
Where should the reviewer start?
See the README!
How can the reviewer see this in action and/or test?
Experiment with the files in `examples/`.
Compliance
None