calculate coverage using run report instead of fastq processing #4

Open

mrmonkington wants to merge 6 commits into main from mark-new-style-coverage

Conversation

@mrmonkington (Member) commented Feb 17, 2026

  • This PR is ready for review!
    • There are no new testing gaps (coverage has not gone down).
    • This change has had sufficient user-acceptance-testing.
    • All new work is documented (inc the README).
    • This code will deploy safely with all build and test steps automated and documented (static compilations, migrations, config changes, version bumps, etc).
    • There are no debug statements or settings, or critical TODOs left in.
    • This change introduces no security gaps.
    • This change introduces no privacy or other compliance gaps.
    • This change will cause no unplanned performance regressions.
    • This change introduces no unplanned technical debt or unfinished/unrelated work.
    • This change satisfies the change management policy and where needed has been signed off by the relevant project stakeholders.
    • No secrets or other sensitive data (passwords, access keys, PII, etc) are present in any of the constituent commits (not just in the final PR).

What does this Pull Request do and why?

Provides a coverage analysis tool that uses MinKNOW's own run stats output, rather than scraping FASTQs.

Will allow faster run summaries, including mid-run, without hammering the data disks.

Can use either of:

  • the run report report_xxxxxxx.json for instant coverage stats.
  • the run summary sequencing_summary_xxxx.txt (TSV) for read-size binning.

Can optionally output only samples below a target coverage, and the default bin threshold of 7 kb can be changed.
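
As a rough sketch of the summary-based path (the `sequence_length_template` column is the standard MinKNOW/Guppy summary column; the function name and return shape here are illustrative, not the PR's actual code):

```python
import csv

def bin_read_lengths(summary_path, bin_threshold=7000):
    # Split read lengths from a sequencing summary TSV at bin_threshold bp.
    short_lens, long_lens = [], []
    with open(summary_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            length = int(row["sequence_length_template"])
            (short_lens if length < bin_threshold else long_lens).append(length)
    return {
        "short_total_mb": sum(short_lens) / 1e6,
        "short_avg_len": sum(short_lens) / len(short_lens) if short_lens else 0,
        "long_total_mb": sum(long_lens) / 1e6,
        "long_avg_len": sum(long_lens) / len(long_lens) if long_lens else 0,
    }
```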

What are the relevant GitHub Issues or support tickets?

None.

Where should the reviewer start?

See the README!

How can the reviewer see this in action and/or test?

Experiment with the files in examples/
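
For example, something along these lines (the script name and the target coverage of 30 are illustrative; see the README for the real invocation):

```
python coverage.py --summary examples/sequencing_summary_xxxx.txt --bin-threshold 7000 --below 30 --csv
```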

Compliance

None

@mrmonkington marked this pull request as ready for review, February 18, 2026 02:09
parser.add_argument("--summary", help="Path to the sequencing summary TXT file.")
parser.add_argument("--below", type=int, help="Only output lines where coverage is below this value.")
parser.add_argument("--bin-threshold", type=int, default=7000, help="Read length threshold for binned stats (default: 7000).")
parser.add_argument("--csv", action="store_true", help="Output in CSV format.")


help="Prints stdout in CSV format" - maybe?

thresh_kb = f"{bin_threshold/1000:g}kb"
if output_csv:
    writer = csv.writer(sys.stdout)
    csv_cols = [c if c not in ['short_total_mb', 'short_avg_len', 'long_total_mb', 'long_avg_len'] else


Would change all 'avgs' to 'means' to be clearer - most people in the FASTQ space would use medians or N50/N90s, as they are less susceptible to skew from the wide distribution of read lengths that ONT data generally has.

Median is more annoying to calculate - I think that's why we have historically used the mean (and it would be impossible when using the JSON as input? - so mean is fine for us, I think).

Would also be good to get short and long read counts, like long_read_n or short_read_n. This will help us tell if a sample is bad - e.g. '50% of reads are short' is easier to interpret.
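
(For reference, a quick sketch of how mean, median, and N50 behave on a skewed set of read lengths; the n50 helper and the sample values are illustrative, not part of this PR:)

```python
import statistics

def n50(lengths):
    # N50: the read length L such that reads of length >= L
    # account for at least half of the total yield.
    half_yield = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_yield:
            return length
    return 0

lengths = [500, 800, 1200, 3000, 9000, 15000, 42000]
print(statistics.mean(lengths))    # ~10214: pulled up by the single 42kb read
print(statistics.median(lengths))  # 3000: robust to the long tail
print(n50(lengths))                # 42000: half the yield sits in that one read
```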

@mrmonkington (Member, Author) commented Feb 20, 2026


You're right, I should be explicit - I was going from the run template the lab use, which expects means, though it just says 'ave':

File | Total_Yield_(bp) | Ave_length_(bp) | Yield_<7kbp_(bp) | Ave_length_<7kbp_(bp) | Yield_>7kbp_(bp) | Ave_length_>7kbp_(bp) | N50

But maybe they haven't really given any thought to it and have just inherited this approach.

@AndrewMNG would median be more useful for QCing the read distribution? Do the lab use this info, or is it just you? The old script produced means because it was just bodged with awk, but I can do whatever is most useful.

@mrmonkington (Member, Author) commented:

Here's the SOP:

https://docs.google.com/document/d/1ZC_R_JkTnHM5Nm80IkZ5OI9OKszJfHQJl847PmWxkMs/edit?tab=t.0

And an example of where the lab put this info:

https://docs.google.com/spreadsheets/d/1lTPI1VpJRYt9lyOSYe1Scsm005OnBx2j/edit?gid=1178709320#gid=1178709320

Of course, ultimately we want to do away with all of these manual steps - the Run Manager should just report this info (or something derived from it) where it's needed.

@AndrewMNG replied:

@mrmonkington Median or N50 is more useful than mean, since the read lengths are not normally distributed and the mean will be skewed by longer reads. Measuring yield <7kb and >7kb is good for getting a measure of how fragmented the DNA is, although a low median will imply this anyway. At the moment it's not something we use; we do run NanoPlot for LR, which computes a lot of stats about the reads, but I only check it if the assembly is very bad - and typically that's down to long reads that are too short. It's something we could use in future though, e.g. databasing - it could be a way of tracking how certain species perform, or the performance of native vs. rapid.

@R-Cardenas left a comment:

Just some small changes to some names, no functional changes. It's really nice!

@mrmonkington changed the base branch from mark-run-manager to main, February 20, 2026 11:27