From 632890e31bd34ae39dde45941c06c29b9abffce1 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 18 Feb 2020 16:46:37 -0600 Subject: [PATCH 01/68] ci: replace Travis with Github Actions --- bin/ci-build.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bin/ci-build.sh b/bin/ci-build.sh index 9ab6bcf..ded30c2 100755 --- a/bin/ci-build.sh +++ b/bin/ci-build.sh @@ -66,4 +66,4 @@ for CURRENT_BRANCH in "${BRANCHES[@]}"; do done rm ${DRAFTS_FILE} -git checkout "${STARTING_BRANCH_NAME}" \ No newline at end of file +git checkout "${STARTING_BRANCH_NAME}" From 2a78e0b089e6332d7748751aea4113956a8d9905 Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 09:06:03 -0500 Subject: [PATCH 02/68] Add rudimentary draft for quality assurance pipeline RFC --- text/0000-quality-assurance-pipeline.md | 69 +++++++++++++++++++++++++ 1 file changed, 69 insertions(+) create mode 100644 text/0000-quality-assurance-pipeline.md diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md new file mode 100644 index 0000000..4b289b4 --- /dev/null +++ b/text/0000-quality-assurance-pipeline.md @@ -0,0 +1,69 @@ +# Table of Contents + +- [Introduction](#Introduction) +- [Motivation](#Motivation) +- [Current Process](#Current-Process) +- [Proposals](#Proposals) +- [Workflow Description](#Workflow-Description) +- [Outstanding Questions](#Outstanding-Questions) + +# Introduction + +This RFC seeks to establish an automated pipeline workflow around how genomic data on St. Jude Cloud is vetted, covering both existing data and new uploads to the platform. The end goal for this would be to publish results from various tools, but it hopes to draw discussion around what metrics and statistics are important to the community as a whole. + +# Motivation + +With the introduction of Real-Time Clinical Genomics, there exists a need for an automated quality assurance pipeline guaranteeing any uploaded data meets predefined standards. 
By guaranteeing the integrity of our data and the reproducibility of these results, it would allow St. Jude to publish statistics about the genomics data hosted on our platform that might be of interest to other scientists and researchers. + +# Current Process + +Because St. Jude Cloud currently provides three-platform whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) sequencing data, it is important to differentiate how we currently run our current quality control workflow on each type of sequencing. + +Our current process to vet and screen data consists of the following tools: + +| Tool | Version | +| ------------------------ | --------- | +| `samtools flagstat` | [v1.9] | +| `fastqc` | [v0.11.8] | +| `qualimap bamqc` | [v2.2.2] | +| `qualimap rnaseq` | [v2.2.2] | +| `picard ValidateSamFile` | [v2.20.2] | +| `multiqc` | [v1.7] | + +[v1.9]: http://www.htslib.org/doc/samtools.html +[v0.11.8]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ +[v2.2.2]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html +[v2.20.2]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php +[v1.7]: https://multiqc.info/ + +# Proposals + +- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`], [`junction_annotation`], and [`junction_saturation`]. + +[`infer_experiment`]: http://rseqc.sourceforge.net/#infer-experiment-py +[`junction_annotation`]: http://rseqc.sourceforge.net/#junction-annotation-py +[`junction_saturation`]: http://rseqc.sourceforge.net/#junction-saturation-py + +- Include md5 hash as an annotation property for vended files. 
+ +# Workflow Description + +The end workflow (covering both our current process and the addition of the new tool) would be as following: + +| Command | Purpose | +| - | - | +| `samtools quickcheck $INPUT_BAM` | Validate BAM headers and EOF block existence | +| `md5sum $INPUT_BAM > $INPUT_BAM.md5` | For comparison to md5 vended file property | +| `picard ValidateSamFile I=$INPUT_BAM MODE=SUMMARY` | Ensure validity of file | +| `samtools flagstat $INPUT_BAM > $INPUT_BAM.flagstat.txt` | Generate flag statistics | +| `fastqc $INPUT_BAM -o $OUTDIR` | Screen for GC content and adapter contamination | +| `qualimap bamqc -bam $INPUT_BAM -outdir $OUTDIR` | Screen for mapping quality, coverage, and duplication rate | +| `qualimap rnaseq -bam $INPUT_BAM -gtf $GTF_FILE -outdir $OUTDIR` | Screen for RNA-Seq bias and junction analysis | +| `multiqc` | Report aggregation | + +Note: Specific options such as memory size thresholds and thread count have been left out. + +# Outstanding Questions + +- What thresholds or metrics differentiate a poor-quality sample from a high-quality one? +- What other metrics or properties would be valuable? 
\ No newline at end of file From 91bd217ee9381755e7e88f20e0c5cebdcbaedd0e Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 09:11:21 -0500 Subject: [PATCH 03/68] Remove tool options from workflow table --- text/0000-quality-assurance-pipeline.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 4b289b4..ec2e179 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -50,20 +50,20 @@ Our current process to vet and screen data consists of the following tools: The end workflow (covering both our current process and the addition of the new tool) would be as following: -| Command | Purpose | -| - | - | -| `samtools quickcheck $INPUT_BAM` | Validate BAM headers and EOF block existence | -| `md5sum $INPUT_BAM > $INPUT_BAM.md5` | For comparison to md5 vended file property | -| `picard ValidateSamFile I=$INPUT_BAM MODE=SUMMARY` | Ensure validity of file | -| `samtools flagstat $INPUT_BAM > $INPUT_BAM.flagstat.txt` | Generate flag statistics | -| `fastqc $INPUT_BAM -o $OUTDIR` | Screen for GC content and adapter contamination | -| `qualimap bamqc -bam $INPUT_BAM -outdir $OUTDIR` | Screen for mapping quality, coverage, and duplication rate | -| `qualimap rnaseq -bam $INPUT_BAM -gtf $GTF_FILE -outdir $OUTDIR` | Screen for RNA-Seq bias and junction analysis | -| `multiqc` | Report aggregation | +| Command | Purpose | +| -------------------------| ---------------------------------------------------------- | +| `samtools quickcheck` | Validate BAM headers and EOF block existence | +| `md5sum` | For comparison to md5 vended file property | +| `picard ValidateSamFile` | Ensure validity of file | +| `samtools flagstat` | Generate flag statistics | +| `fastqc` | Screen for GC content and adapter contamination | +| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | +| 
`qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | +| `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. # Outstanding Questions - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? -- What other metrics or properties would be valuable? \ No newline at end of file +- What other metrics or properties would be valuable? From 45432e7a4404e31e609b7c99316625f812c6141d Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 09:27:32 -0500 Subject: [PATCH 04/68] Add table row for infer_experiment --- text/0000-quality-assurance-pipeline.md | 1 + 1 file changed, 1 insertion(+) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index ec2e179..83e47d2 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -59,6 +59,7 @@ The end workflow (covering both our current process and the addition of the new | `fastqc` | Screen for GC content and adapter contamination | | `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | | `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | +| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | | `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. 
From f62098f2f340a578fb62c487ccb3ac0cbdc49e32 Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 09:37:24 -0500 Subject: [PATCH 05/68] Add table rows for junction_annotation and junction_saturation --- text/0000-quality-assurance-pipeline.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 83e47d2..320fe7f 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -50,17 +50,19 @@ Our current process to vet and screen data consists of the following tools: The end workflow (covering both our current process and the addition of the new tool) would be as following: -| Command | Purpose | -| -------------------------| ---------------------------------------------------------- | -| `samtools quickcheck` | Validate BAM headers and EOF block existence | -| `md5sum` | For comparison to md5 vended file property | -| `picard ValidateSamFile` | Ensure validity of file | -| `samtools flagstat` | Generate flag statistics | -| `fastqc` | Screen for GC content and adapter contamination | -| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | -| `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | -| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `multiqc` | Report aggregation | +| Command | Purpose | +| --------------------------- | ---------------------------------------------------------- | +| `samtools quickcheck` | Validate BAM headers and EOF block existence | +| `md5sum` | For comparison to md5 vended file property | +| `picard ValidateSamFile` | Ensure validity of file | +| `samtools flagstat` | Generate flag statistics | +| `fastqc` | Screen for GC content and adapter contamination | +| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | +| `qualimap rnaseq` | Screen for RNA-Seq bias and 
junction analysis | +| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | +| `rseqc junction_annotation` | Compare detected splice junctions to reference gene model | +| `rseqc junction_saturation` | Verify sequencing depth saturation | +| `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. From fcd0c37e3f86491873b8a72553c98ff871d3d88b Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 11:23:34 -0500 Subject: [PATCH 06/68] Remove reference to rseqc junction_annotation --- text/0000-quality-assurance-pipeline.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 320fe7f..f3f4d90 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -38,10 +38,9 @@ Our current process to vet and screen data consists of the following tools: # Proposals -- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`], [`junction_annotation`], and [`junction_saturation`]. +- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`], and [`junction_saturation`]. [`infer_experiment`]: http://rseqc.sourceforge.net/#infer-experiment-py -[`junction_annotation`]: http://rseqc.sourceforge.net/#junction-annotation-py [`junction_saturation`]: http://rseqc.sourceforge.net/#junction-saturation-py - Include md5 hash as an annotation property for vended files. 
@@ -60,7 +59,6 @@ The end workflow (covering both our current process and the addition of the new | `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | | `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | | `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `rseqc junction_annotation` | Compare detected splice junctions to reference gene model | | `rseqc junction_saturation` | Verify sequencing depth saturation | | `multiqc` | Report aggregation | From d152b1e92c7ef609f9d68cd045a1501a42f729e5 Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Fri, 28 Jun 2019 12:58:15 -0500 Subject: [PATCH 07/68] Add in-progress section and update goals --- text/0000-quality-assurance-pipeline.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index f3f4d90..90999b7 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -5,6 +5,7 @@ - [Current Process](#Current-Process) - [Proposals](#Proposals) - [Workflow Description](#Workflow-Description) +- [Items Still In-Progress](#Items-Still-In-Progress) - [Outstanding Questions](#Outstanding-Questions) # Introduction @@ -15,6 +16,8 @@ This RFC seeks to establish an automated pipeline workflow around how genomic da With the introduction of Real-Time Clinical Genomics, there exists a need for an automated quality assurance pipeline guaranteeing any uploaded data meets predefined standards. By guaranteeing the integrity of our data and the reproducibility of these results, it would allow St. Jude to publish statistics about the genomics data hosted on our platform that might be of interest to other scientists and researchers. 
+The ultimate goal is to be able to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) separated by dataset and sequencing type (and ideally, also on a sample level). This would aid in visibility into the quality and type of data hosted to researchers and scientist. This RFC hopes to present a form for open discussion to the community regarding what type of other properties/attributes would be helpful and practical. + # Current Process Because St. Jude Cloud currently provides three-platform whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) sequencing data, it is important to differentiate how we currently run our current quality control workflow on each type of sequencing. @@ -38,10 +41,10 @@ Our current process to vet and screen data consists of the following tools: # Proposals -- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`], and [`junction_saturation`]. +- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`] and [`junction_annotation`]. [`infer_experiment`]: http://rseqc.sourceforge.net/#infer-experiment-py -[`junction_saturation`]: http://rseqc.sourceforge.net/#junction-saturation-py +[`junction_annotation`]: http://rseqc.sourceforge.net/#junction-annotation-py - Include md5 hash as an annotation property for vended files. 
@@ -59,11 +62,16 @@ The end workflow (covering both our current process and the addition of the new | `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | | `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | | `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `rseqc junction_saturation` | Verify sequencing depth saturation | +| `rseqc junction_annotation` | Compare detected splice junctions to reference gene model | | `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. +# Items Still In-Progress + +- [ ] Analysis tools for other types of sequencing +- [ ] Useful metadata from various stages (sample collection, laboratory, pre-sequencing, sequencing, post-sequencing) + # Outstanding Questions - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? From adf6cce2956ac036351a74ef2ed77648444f1c8e Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Wed, 3 Jul 2019 21:42:53 -0500 Subject: [PATCH 08/68] Add important metrics section with some brief justifications --- text/0000-quality-assurance-pipeline.md | 34 +++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 90999b7..b320d31 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -39,6 +39,40 @@ Our current process to vet and screen data consists of the following tools: [v2.20.2]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php [v1.7]: https://multiqc.info/ +# Important Metrics + +- Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) + +The "Per Base Sequence Quality" module from FastQC will show the distribution of quality scores 
across all bases at each position in the reads. It will automatically determine the encoding method used, but this should be cross-referenced with the actual encoding method. + +- Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) + +The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify any sort of contamination (vector, adapter sequences, etc.). + +- Reads Genomic Origin ([Qualimap](http://qualimap.bioinfo.cipf.es/)) + +The "Reads Genomic Origin" from Qualimap is able to determine how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped to. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Any abnormal results could indicate issues such as DNA contamination. + +- rRNA Content (?) + +Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. + +- Transcript Coverage and 5’-3’ Bias ([Qualimap](http://qualimap.bioinfo.cipf.es/)) + +Libraries prepared with polyA selection have the possibility to lead to high expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting material was of low RNA quality. + +- Junction Analysis ([Qualimap](http://qualimap.bioinfo.cipf.es/)) + +Analysis of known, partly known, and novel junction positions in spliced alignments. + +- Strand Specificity ([RSeQC](http://rseqc.sourceforge.net/)) + +Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). + +- GC Content Bias (?) 
+ +GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). + # Proposals - Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`] and [`junction_annotation`]. From ab952c575775fccc79c4d8ac41d81e10e66513f3 Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Tue, 9 Jul 2019 02:54:34 -0500 Subject: [PATCH 09/68] Update text/0000-quality-assurance-pipeline.md Co-Authored-By: Clay McLeod <3411613+claymcleod@users.noreply.github.com> --- text/0000-quality-assurance-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index b320d31..5c4b330 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -43,7 +43,7 @@ Our current process to vet and screen data consists of the following tools: - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) -The "Per Base Sequence Quality" module from FastQC will show the distribution of quality scores across all bases at each position in the reads. It will automatically determine the encoding method used, but this should be cross-referenced with the actual encoding method. +The "Per Base Sequence Quality" module from FastQC will show the distribution of quality scores across all bases at each position in the reads. In our case, this is just for informational purposes to our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. 
- Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) From 68db49645d378ac31a7eb4bc331f9400b12ffc01 Mon Sep 17 00:00:00 2001 From: Arthur Chiao Date: Mon, 15 Jul 2019 13:49:48 -0500 Subject: [PATCH 10/68] Remove junction_annotation --- text/0000-quality-assurance-pipeline.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 5c4b330..6c3252c 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -75,10 +75,9 @@ GC profiles are typically remarkably stable. Even small/minor deviations could i # Proposals -- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`] and [`junction_annotation`]. +- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. [`infer_experiment`]: http://rseqc.sourceforge.net/#infer-experiment-py -[`junction_annotation`]: http://rseqc.sourceforge.net/#junction-annotation-py - Include md5 hash as an annotation property for vended files. @@ -96,7 +95,6 @@ The end workflow (covering both our current process and the addition of the new | `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | | `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | | `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `rseqc junction_annotation` | Compare detected splice junctions to reference gene model | | `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. 
From 0495c999167739df04231e5d7c4d042741f57613 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 23 Oct 2019 07:44:31 -0500 Subject: [PATCH 11/68] Update 0000-quality-assurance-pipeline.md --- text/0000-quality-assurance-pipeline.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 6c3252c..6530466 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -10,17 +10,17 @@ # Introduction -This RFC seeks to establish an automated pipeline workflow around how genomic data on St. Jude Cloud is vetted, covering both existing data and new uploads to the platform. The end goal for this would be to publish results from various tools, but it hopes to draw discussion around what metrics and statistics are important to the community as a whole. +This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing data and new data uploads to the platform. The end goal is to publish results from various tools, but currently we hope to discuss which quality metrics and statistics are important to the community as a whole. # Motivation -With the introduction of Real-Time Clinical Genomics, there exists a need for an automated quality assurance pipeline guaranteeing any uploaded data meets predefined standards. By guaranteeing the integrity of our data and the reproducibility of these results, it would allow St. Jude to publish statistics about the genomics data hosted on our platform that might be of interest to other scientists and researchers. +Since introducing Real-Time Clinical Genomics, there is a need for an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing the data integrity and the reproducibility of these results allows St. 
Jude to publish statistics that are of interest to scientists and researchers. -The ultimate goal is to be able to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) separated by dataset and sequencing type (and ideally, also on a sample level). This would aid in visibility into the quality and type of data hosted to researchers and scientist. This RFC hopes to present a form for open discussion to the community regarding what type of other properties/attributes would be helpful and practical. +The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make visible the quality of data offered to researchers and scientists. We hope this RFC becomes a forum for open community discussion of the quality properties/attributes are helpful and practical. # Current Process -Because St. Jude Cloud currently provides three-platform whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) sequencing data, it is important to differentiate how we currently run our current quality control workflow on each type of sequencing. +Currently, St. Jude Cloud provides three sequencing data types: whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) data. It is important to differentiate our quality control workflows for each type of sequencing. 
Our current process to vet and screen data consists of the following tools: @@ -43,7 +43,7 @@ Our current process to vet and screen data consists of the following tools: - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) -The "Per Base Sequence Quality" module from FastQC will show the distribution of quality scores across all bases at each position in the reads. In our case, this is just for informational purposes to our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. +The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just for informational purposes to our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. - Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) @@ -51,7 +51,7 @@ The "Overrepresented Sequences" module from FastQC displays sequences (at least - Reads Genomic Origin ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -The "Reads Genomic Origin" from Qualimap is able to determine how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped to. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Any abnormal results could indicate issues such as DNA contamination. +The "Reads Genomic Origin" from Qualimap is able to determine how many alignments fall into exonic, intronic, and intergenic regions. 
Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped to. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. - rRNA Content (?) From 1d4507ab3623a67d27630f4979212f55e676d8f0 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 23 Oct 2019 09:59:36 -0500 Subject: [PATCH 12/68] Updated language for clarity --- text/0000-quality-assurance-pipeline.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 6530466..c114c26 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -16,7 +16,7 @@ This RFC documents an automated pipeline workflow for vetting St. Jude Cloud gen Since introducing Real-Time Clinical Genomics, there is a need for an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing the data integrity and the reproducibility of these results allows St. Jude to publish statistics that are of interest to scientists and researchers. -The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make visible the quality of data offered to researchers and scientists. We hope this RFC becomes a forum for open community discussion of the quality properties/attributes are helpful and practical. +The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). 
This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. # Current Process @@ -47,11 +47,11 @@ The "Per Base Sequence Quality" module from FastQC shows the distribution of qua - Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) -The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify any sort of contamination (vector, adapter sequences, etc.). +The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). - Reads Genomic Origin ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -The "Reads Genomic Origin" from Qualimap is able to determine how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped to. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. +The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. - rRNA Content (?) 
@@ -59,7 +59,7 @@ Verify that excess ribosomal content is filtered/normalized across samples to en - Transcript Coverage and 5’-3’ Bias ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -Libraries prepared with polyA selection have the possibility to lead to high expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting material was of low RNA quality. +Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting material was of low RNA quality. - Junction Analysis ([Qualimap](http://qualimap.bioinfo.cipf.es/)) @@ -101,10 +101,12 @@ Note: Specific options such as memory size thresholds and thread count have been # Items Still In-Progress -- [ ] Analysis tools for other types of sequencing +- [ ] Analysis tools for other types of sequencing (ChIP seq) - [ ] Useful metadata from various stages (sample collection, laboratory, pre-sequencing, sequencing, post-sequencing) # Outstanding Questions - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? +- What is the best way to define and handle outliers? +- What is the best way to examine cohort integrity, meaning category-based tests of samples to find experimental outliers that may be of sufficient quality if examined alone? 
From cc3db2b6a46e9ec781bf27416018447f14258cab Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 23 Oct 2019 10:55:14 -0500 Subject: [PATCH 13/68] Further clarification /language edits --- text/0000-quality-assurance-pipeline.md | 43 +++++++++++++------------ 1 file changed, 22 insertions(+), 21 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index c114c26..32be810 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -1,12 +1,13 @@ # Table of Contents -- [Introduction](#Introduction) -- [Motivation](#Motivation) -- [Current Process](#Current-Process) -- [Proposals](#Proposals) -- [Workflow Description](#Workflow-Description) -- [Items Still In-Progress](#Items-Still-In-Progress) -- [Outstanding Questions](#Outstanding-Questions) +- [Introduction](#introduction) +- [Motivation](#motivation) +- [Current Process](#current-process) +- [Important Metrics](#important-metrics) +- [Proposals](#proposals) +- [Workflow Description](#workflow-description) +- [Items Still In-Progress](#items-still-in-progress) +- [Outstanding Questions](#outstanding-questions) # Introduction @@ -43,7 +44,7 @@ Our current process to vet and screen data consists of the following tools: - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) -The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just for informational purposes to our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. +The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. 
In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. - Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) @@ -59,7 +60,7 @@ Verify that excess ribosomal content is filtered/normalized across samples to en - Transcript Coverage and 5’-3’ Bias ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting material was of low RNA quality. +Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. - Junction Analysis ([Qualimap](http://qualimap.bioinfo.cipf.es/)) @@ -85,17 +86,17 @@ GC profiles are typically remarkably stable. 
Even small/minor deviations could i The end workflow (covering both our current process and the addition of the new tool) would be as following: -| Command | Purpose | -| --------------------------- | ---------------------------------------------------------- | -| `samtools quickcheck` | Validate BAM headers and EOF block existence | -| `md5sum` | For comparison to md5 vended file property | -| `picard ValidateSamFile` | Ensure validity of file | -| `samtools flagstat` | Generate flag statistics | -| `fastqc` | Screen for GC content and adapter contamination | -| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | -| `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | -| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `multiqc` | Report aggregation | +| Command | Purpose | +| ------------------------ | ---------------------------------------------------------- | +| `samtools quickcheck` | Validate BAM headers and EOF block existence | +| `md5sum` | For comparison to md5 vended file property | +| `picard ValidateSamFile` | Ensure validity of file | +| `samtools flagstat` | Generate flag statistics | +| `fastqc` | Screen for GC content and adapter contamination | +| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | +| `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | +| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | +| `multiqc` | Report aggregation | Note: Specific options such as memory size thresholds and thread count have been left out. @@ -109,4 +110,4 @@ Note: Specific options such as memory size thresholds and thread count have been - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? 
-- What is the best way to examine cohort intergrity, meaning category based tests of samples to find experiemtal ouliers tha may be of sufficent quality if examined alone? +- What is the best way to examine cohort intergrity, meaning category based tests of samples to find experimental outliers that are of sufficent quality if examined alone? From 6201ab20ccb419d4fad0eec095ea6cb1e5658271 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 23 Oct 2019 12:49:03 -0500 Subject: [PATCH 14/68] Adjust outline to look like rnseq workflow --- text/0000-quality-assurance-pipeline.md | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 32be810..1c31578 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -2,10 +2,12 @@ - [Introduction](#introduction) - [Motivation](#motivation) -- [Current Process](#current-process) -- [Important Metrics](#important-metrics) -- [Proposals](#proposals) -- [Workflow Description](#workflow-description) +- [Discussion](#discussion) + - [Current Process](#current-process) + - [Important Metrics](#important-metrics) + - [Proposals](#proposals) +- [Specification](#specification) + - [Workflow Description](#workflow-description) - [Items Still In-Progress](#items-still-in-progress) - [Outstanding Questions](#outstanding-questions) @@ -19,7 +21,9 @@ Since introducing Real-Time Clinical Genomics, there is a need for an automated The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. 
We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. -# Current Process +# Discussion + +## Current Process Currently, St. Jude Cloud provides three sequencing data types: whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) data. It is important to differentiate our quality control workflows for each type of sequencing. @@ -40,7 +44,7 @@ Our current process to vet and screen data consists of the following tools: [v2.20.2]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php [v1.7]: https://multiqc.info/ -# Important Metrics +## Important Metrics - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) @@ -74,7 +78,7 @@ Verification/sanity check of how reads were stranded for the RNA sequencing (str GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). -# Proposals +## Proposals - Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. @@ -82,7 +86,9 @@ GC profiles are typically remarkably stable. Even small/minor deviations could i - Include md5 hash as an annotation property for vended files. 
-# Workflow Description +# Specification + +## Workflow Description The end workflow (covering both our current process and the addition of the new tool) would be as following: From e9337d9fa957544fb01199c16377673d69ade2d7 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Thu, 24 Oct 2019 08:10:16 -0500 Subject: [PATCH 15/68] Spelling Corrections outstanding questions --- text/0000-quality-assurance-pipeline.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 1c31578..b82ef01 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -23,6 +23,8 @@ The ultimate goal is to present a comprehensive report much like the [example Mu # Discussion +The quality metrics discussed here are sequence and mapping quality metrics. Other metrics related to nucleic acid integrity or library quality are not part of the current process. + ## Current Process Currently, St. Jude Cloud provides three sequencing data types: whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) data. It is important to differentiate our quality control workflows for each type of sequencing. @@ -116,4 +118,4 @@ Note: Specific options such as memory size thresholds and thread count have been - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? -- What is the best way to examine cohort intergrity, meaning category based tests of samples to find experimental outliers that are of sufficent quality if examined alone? +- What is the best way to examine cohort integrity, meaning category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone? 
From 2c98ad267b94d1ea04f07c71edde4ac875d1f34b Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Fri, 25 Oct 2019 10:45:32 -0500 Subject: [PATCH 16/68] Added draft Metrics for WGS,WES,RNAseq --- text/0000-quality-assurance-pipeline.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index b82ef01..45dd368 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -5,6 +5,9 @@ - [Discussion](#discussion) - [Current Process](#current-process) - [Important Metrics](#important-metrics) + - [Metrics for WGS](#metrics-for-wgs) + - [Metrics for WES](#metrics-for-wes) + - [Metrics for RNAseq](#metrics-for-rnaseq) - [Proposals](#proposals) - [Specification](#specification) - [Workflow Description](#workflow-description) @@ -48,6 +51,8 @@ Our current process to vet and screen data consists of the following tools: ## Important Metrics +- Mapping Percentage + - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. @@ -80,6 +85,18 @@ Verification/sanity check of how reads were stranded for the RNA sequencing (str GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). +## Metrics for WGS + +The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. 
+ +## Metrics for WES + +The quality metrics of special concern for WES include depth of coverage in exomic regional coverage. Mapping quality is also critical. + +## Metrics for RNAseq + +The quality metrics of special concern for RNAseq include depth of mapping percentage and exomic regional coverage. Mapping quality is also critical. + ## Proposals - Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. @@ -118,4 +135,4 @@ Note: Specific options such as memory size thresholds and thread count have been - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? -- What is the best way to examine cohort integrity, meaning category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone? +- What is the best way to examine cohort integrit? This means category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone. From 7b02f767637a11dc43c49e466567ac3c4611101d Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 30 Oct 2019 07:51:14 -0500 Subject: [PATCH 17/68] Edited language in "Metrics for" sections --- text/0000-quality-assurance-pipeline.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 45dd368..8b58ad1 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -87,15 +87,15 @@ GC profiles are typically remarkably stable. Even small/minor deviations could i ## Metrics for WGS -The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. 
+The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. ## Metrics for WES -The quality metrics of special concern for WES include depth of coverage in exomic regional coverage. Mapping quality is also critical. +The quality metrics of special concern for WES include depth of coverage in exomic regions. Mapping quality, % mapped and duplication rate are also important. ## Metrics for RNAseq -The quality metrics of special concern for RNAseq include depth of mapping percentage and exomic regional coverage. Mapping quality is also critical. +The quality metrics of special concern for RNAseq include mapping percentage, and exomic regional coverage. Mapping quality is also critical. ## Proposals From fed96b77852cbe2d7293526fe31948d5b2cb9a41 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 30 Oct 2019 09:21:36 -0500 Subject: [PATCH 18/68] Correct spelling errors "Outstanding Questions" --- text/0000-quality-assurance-pipeline.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 8b58ad1..2a55f6d 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -95,7 +95,7 @@ The quality metrics of special concern for WES include depth of coverage in exom ## Metrics for RNAseq -The quality metrics of special concern for RNAseq include mapping percentage, and exomic regional coverage. Mapping quality is also critical. +The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. 
Mapping quality is also critical. ## Proposals @@ -135,4 +135,4 @@ Note: Specific options such as memory size thresholds and thread count have been - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? -- What is the best way to examine cohort integrit? This means category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone. +- What is the best way to examine cohort integrity? This means category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone. From 00baeecd464b6d1a44ad471a17cbed94be9031bc Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 30 Oct 2019 09:28:38 -0500 Subject: [PATCH 19/68] Mapping Percentage to % Aligned --- text/0000-quality-assurance-pipeline.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 2a55f6d..c602542 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -51,7 +51,9 @@ Our current process to vet and screen data consists of the following tools: ## Important Metrics -- Mapping Percentage +- Percent Aligned + +Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. 
- Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) From f69d00831c111eba599c8927a3e1f00da8765f39 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 4 Nov 2019 12:37:42 -0600 Subject: [PATCH 20/68] Expanded introduction --- text/0000-quality-assurance-pipeline.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index c602542..8a7eb50 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -16,11 +16,11 @@ # Introduction -This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing data and new data uploads to the platform. The end goal is to publish results from various tools, but currently we hope to discuss which quality metrics and statistics are important to the community as a whole. +This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing data and new data uploads to the platform. The end goal is to publish results from various tools. But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. # Motivation -Since introducing Real-Time Clinical Genomics, there is a need for an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing the data integrity and the reproducibility of these results allows St. Jude to publish statistics that are of interest to scientists and researchers. +Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. 
Guaranteeing the data integrity and the reproducibility of these results allows St. Jude to publish statistics that are of interest to scientists and researchers. The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. @@ -137,4 +137,4 @@ Note: Specific options such as memory size thresholds and thread count have been - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? -- What is the best way to examine cohort integrity? This means category-based tests of samples to find experimental outliers that are of sufficient quality if examined alone. +- What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? 
From aa8e0195ac5b0e0a6e02deb21ad1c1f43f98b2b7 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 4 Nov 2019 15:03:26 -0600 Subject: [PATCH 21/68] Edits Intro, Motivation and Discussion --- text/0000-quality-assurance-pipeline.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 8a7eb50..78f9ee8 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -16,17 +16,18 @@ # Introduction -This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing data and new data uploads to the platform. The end goal is to publish results from various tools. But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. +This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing and new data uploads to the platform. The end goal is to publish the quality control results from various tools. But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. # Motivation -Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing the data integrity and the reproducibility of these results allows St. Jude to publish statistics that are of interest to scientists and researchers. +Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. 
Guaranteeing data integrity and the reproducibility of these results allows St. Jude to assure scientists and researchers that the data we provide is useful. The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. # Discussion -The quality metrics discussed here are sequence and mapping quality metrics. Other metrics related to nucleic acid integrity or library quality are not part of the current process. +The quality metrics discussed here are sequence and mapping quality metrics. Other metrics related to nucleic acid integrity or library quality are not part of the current process. Pre-sequencing quality metrics, however, are clearly important and part of our long term interests. Thus, we invite comments on those metrics as well. + ## Current Process From 7f4e92d0a6bb8455283b9460f08a0950a8ab1b29 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 4 Nov 2019 15:12:56 -0600 Subject: [PATCH 22/68] Two spaces after period --- text/0000-quality-assurance-pipeline.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 78f9ee8..0017cf5 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -16,13 +16,13 @@ # Introduction -This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing and new data uploads to the platform. The end goal is to publish the quality control results from various tools. 
But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. +This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing and new data uploads to the platform. The end goal is to publish the quality control results from various tools. But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. # Motivation -Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing data integrity and the reproducibility of these results allows St. Jude to assure scientists and researchers that the data we provide is useful. +Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing data integrity and the reproducibility of these results allows St. Jude to assure scientists and researchers that the data we provide is useful. -The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. +The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). 
This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. # Discussion From 6ff0cb5bf6a3d1ab226adbd7f3f43abce440c377 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 4 Nov 2019 15:14:39 -0600 Subject: [PATCH 23/68] Two spaces after periods full doc --- text/0000-quality-assurance-pipeline.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 0017cf5..b6b5130 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -58,7 +58,7 @@ Also known as mapping percentage, this indicator of quality, when high, verifies - Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) -The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream, so there is no changing it at this point. +The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. 
- Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) @@ -66,7 +66,7 @@ The "Overrepresented Sequences" module from FastQC displays sequences (at least - Reads Genomic Origin ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. +The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. - rRNA Content (?) @@ -74,7 +74,7 @@ Verify that excess ribosomal content is filtered/normalized across samples to en - Transcript Coverage and 5’-3’ Bias ([Qualimap](http://qualimap.bioinfo.cipf.es/)) -Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. +Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. - Junction Analysis ([Qualimap](http://qualimap.bioinfo.cipf.es/)) @@ -86,7 +86,7 @@ Verification/sanity check of how reads were stranded for the RNA sequencing (str - GC Content Bias (?) 
-GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). +GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). ## Metrics for WGS From cfc7d847844cc57d9f7a1e299bcb69f21300f236 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Wed, 6 Nov 2019 10:58:58 -0600 Subject: [PATCH 24/68] Added Thresholds section --- text/0000-quality-assurance-pipeline.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index b6b5130..714a4ed 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -100,6 +100,11 @@ The quality metrics of special concern for WES include depth of coverage in exom The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. + +## Thresholds + + To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thrresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. + ## Proposals - Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. 
From 9c454251230c25d3cd36bfd6af8b6e646533b3f7 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Thu, 14 Nov 2019 15:26:28 -0600 Subject: [PATCH 25/68] Added commands for QC steps --- text/0000-quality-assurance-pipeline.md | 162 ++++++++++++++++++++++-- 1 file changed, 148 insertions(+), 14 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 714a4ed..7d8c4d1 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -5,12 +5,24 @@ - [Discussion](#discussion) - [Current Process](#current-process) - [Important Metrics](#important-metrics) - - [Metrics for WGS](#metrics-for-wgs) - - [Metrics for WES](#metrics-for-wes) - - [Metrics for RNAseq](#metrics-for-rnaseq) - - [Proposals](#proposals) + - [Thresholds and Metrics for Specific Applications](#thresholds-and-metrics-for-specific-applications) + - [Metrics for WES](#metrics-for-wes) + - [Metrics for RNAseq](#metrics-for-rnaseq) + - [Proposals](#proposals) - [Specification](#specification) - [Workflow Description](#workflow-description) +- [The St Jude Genomics QC Process](#the-st-jude-genomics-qc-process) + - [Installation](#installation) + - [Anaconda Environment](#anaconda-environment) + - [Samtools Quickcheck](#samtools-quickcheck) + - [Picard ValidateSamFile](#picard-validatesamfile) + - [Sambamba Flagstat](#sambamba-flagstat) + - [FASTQC](#fastqc) + - [Qualimap Bam QC](#qualimap-bam-qc) + - [Qualimap RNA seq QC](#qualimap-rna-seq-qc) + - [md5sum](#md5sum) + - [RseQC infer_experiment.py](#rseqc-infer_experimentpy) + - [Report Aggregation](#report-aggregation) - [Items Still In-Progress](#items-still-in-progress) - [Outstanding Questions](#outstanding-questions) @@ -22,7 +34,7 @@ This RFC documents an automated pipeline workflow for vetting St. 
Jude Cloud gen Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing data integrity and the reproducibility of these results allows St. Jude to assure scientists and researchers that the data we provide is useful. -The ultimate goal is to present a comprehensive report much like the [example MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. +The ultimate goal is to present a comprehensive report much like the example [ MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. # Discussion @@ -88,24 +100,22 @@ Verification/sanity check of how reads were stranded for the RNA sequencing (str GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). -## Metrics for WGS +## Thresholds and Metrics for Specific Applications + + To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thrresholds, we are referring to distributional tests that formally define outliers. 
By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost.## Metrics for WGS The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. -## Metrics for WES +### Metrics for WES The quality metrics of special concern for WES include depth of coverage in exomic regions. Mapping quality, % mapped and duplication rate are also important. -## Metrics for RNAseq +### Metrics for RNAseq The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. -## Thresholds - - To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thrresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. 
- -## Proposals +### Proposals - Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. @@ -125,13 +135,137 @@ The end workflow (covering both our current process and the addition of the new | `md5sum` | For comparison to md5 vended file property | | `picard ValidateSamFile` | Ensure validity of file | | `samtools flagstat` | Generate flag statistics | +| `sambamba flagstat` | Generate flag statistics using a samtools alternative | | `fastqc` | Screen for GC content and adapter contamination | | `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | | `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | | `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | | `multiqc` | Report aggregation | -Note: Specific options such as memory size thresholds and thread count have been left out. +Note: Specific options such as memory size thresholds and thread count are below. + +# The St Jude Genomics QC Process + +These are generic instructions for running each of the tools in our pipeline. We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead we've supplied examples of the commands used to each package. Our default memory is 80G and we employ 4 threads for these processes. + +## Installation + +We presume anaconda is available and installed. If not please follow the link to [anaconda](https://www.anaconda.com/) first. 
+ +#### Anaconda Environment + + +```bash +conda create --name bio-qc \ + --channel bioconda \ + fastqc==0.11.8 \ + picard==2.20.2 \ + qualimap==2.2.2c \ + rseqc==3.0.0 \ + sambamba==0.6.6 \ + samtools==1.9 \ + -y + +conda activate bio-qc +``` +### Samtools Quickcheck + +Very basic BAM file validation: + +```bash +# Should not be relied on for file integrity, only checks for header and EOF +samtools quickcheck -v *.bam > bad_bams.txt && echo "all ok" || echo "some files failed check, see bad_bams.txt" +``` +### Picard ValidateSamFile + +BAM file validation: + +```bash +# This method is used to assess file integrity + +"picard ValidateSamFile \ + I=$BAM \ # specify bam file + MODE=SUMMARY\ # concise output + INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time + OUTPUT=$OUTDIR/$BAM_BN.validate.txt" # output directory and file +``` + +### Sambamba Flagstat + +Very basic BAM file validation: + +```bash +# Sambamba includes faster reliable implementation of samtools commands + +"sambamba flagstat -t $NUM_THREADS $BAM \ # number of threads and bam filename + > $OUTDIR/$BAM_BN.flagstat.txt" # output directory and file +``` + +### FASTQC + +Standard sequence quality check: + +```bash + +"fastqc $BAM \ # bam filename + -o $OUTDIR " # output directory + +``` + +### Qualimap Bam QC + +Comprehensive QC statistics includes read stats, coverage, mapping quality, insert size, mismatches etc. + +```bash + +"qualimap bamqc -bam $BAM \ # bam filename + --java-mem-size=$MEM_SIZE \ # memory + -nt $NUM_THREADS \ # threads requested + -nw 400 \ # number of windows + -outdir $QBAMQC_OUT" # output directory +``` +### Qualimap RNA seq QC + +Comprehensive QC statistics tailored for RNA seq files. 
+ +```bash + +"qualimap rnaseq -bam $BAM \ # bam filename + -gtf $GTF_REF # transcript definition file + --java-mem-size=$MEM_SIZE \ # memory + -pe # specify paired end + -outdir $QBAMQC_OUT" # output directory +``` + +### md5sum + +Check size and integrity of files. + +```bash + +"md5sum $BAM \ # bam filename + > $OUTDIR/$BAM.md5" # output directory + +``` + +### RseQC infer_experiment.py + +Python script that tests for strandedness. + +```bash + +"infer_experiment.py -i $BAM \ # bam filename + -r $BED_REF \ # reference in bed format + > $OUTDIR/$BAM_BN.infer_experiment.txt" # output directory and filename + +``` + +### Report Aggregation + +```bash +multiqc /path/to/outdir +``` + # Items Still In-Progress From c69f7521d34c419244d8d5f1e8aa7fe0d93fbee0 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 18 Nov 2019 12:22:34 -0600 Subject: [PATCH 26/68] Changed bash code syntax --- text/0000-quality-assurance-pipeline.md | 28 ++++++++++++------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 7d8c4d1..cfae0e7 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -183,11 +183,11 @@ BAM file validation: ```bash # This method is used to assess file integrity -"picard ValidateSamFile \ +picard ValidateSamFile \ I=$BAM \ # specify bam file MODE=SUMMARY\ # concise output INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time - OUTPUT=$OUTDIR/$BAM_BN.validate.txt" # output directory and file + OUTPUT=$OUTDIR/$BAM_BN.validate.txt # output directory and file ``` ### Sambamba Flagstat @@ -197,8 +197,8 @@ Very basic BAM file validation: ```bash # Sambamba includes faster reliable implementation of samtools commands -"sambamba flagstat -t $NUM_THREADS $BAM \ # number of threads and bam filename - > $OUTDIR/$BAM_BN.flagstat.txt" # output directory 
and file +sambamba flagstat -t $NUM_THREADS $BAM \ # number of threads and bam filename + > $OUTDIR/$BAM_BN.flagstat.txt # output directory and file ``` ### FASTQC @@ -207,8 +207,8 @@ Standard sequence quality check: ```bash -"fastqc $BAM \ # bam filename - -o $OUTDIR " # output directory +fastqc $BAM \ # bam filename + -o $OUTDIR # output directory ``` @@ -218,11 +218,11 @@ Comprehensive QC statistics includes read stats, coverage, mapping quality, inse ```bash -"qualimap bamqc -bam $BAM \ # bam filename +qualimap bamqc -bam $BAM \ # bam filename --java-mem-size=$MEM_SIZE \ # memory -nt $NUM_THREADS \ # threads requested -nw 400 \ # number of windows - -outdir $QBAMQC_OUT" # output directory + -outdir $QBAMQC_OUT # output directory ``` ### Qualimap RNA seq QC @@ -230,11 +230,11 @@ Comprehensive QC statistics tailored for RNA seq files. ```bash -"qualimap rnaseq -bam $BAM \ # bam filename +qualimap rnaseq -bam $BAM \ # bam filename -gtf $GTF_REF # transcript definition file --java-mem-size=$MEM_SIZE \ # memory -pe # specify paired end - -outdir $QBAMQC_OUT" # output directory + -outdir $QBAMQC_OUT # output directory ``` ### md5sum @@ -243,8 +243,8 @@ Check size and integrity of files. ```bash -"md5sum $BAM \ # bam filename - > $OUTDIR/$BAM.md5" # output directory +md5sum $BAM \ # bam filename + > $OUTDIR/$BAM.md5 # output directory ``` @@ -254,9 +254,9 @@ Python script that tests for strandedness. 
```bash -"infer_experiment.py -i $BAM \ # bam filename +infer_experiment.py -i $BAM \ # bam filename -r $BED_REF \ # reference in bed format - > $OUTDIR/$BAM_BN.infer_experiment.txt" # output directory and filename + > $OUTDIR/$BAM_BN.infer_experiment.txt # output directory and filename ``` From 7859a58f461ac9b124dec4e26e99656e8bd25954 Mon Sep 17 00:00:00 2001 From: dfinkels <52255956+dfinkels@users.noreply.github.com> Date: Mon, 18 Nov 2019 13:15:32 -0600 Subject: [PATCH 27/68] Edited text under St Jude Genomics QC Process --- text/0000-quality-assurance-pipeline.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index cfae0e7..48c4f1f 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -192,7 +192,7 @@ picard ValidateSamFile \ ### Sambamba Flagstat -Very basic BAM file validation: +Summary statistics of read counts and mapping status ```bash # Sambamba includes faster reliable implementation of samtools commands @@ -262,6 +262,8 @@ infer_experiment.py -i $BAM \ # bam filename ### Report Aggregation +Package combines output from other QC tools in easily reviewed html format. 
+ ```bash multiqc /path/to/outdir ``` From f1091d0fc919e1739b09ee224602385bd2ee0f45 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 18 Feb 2020 15:35:17 -0600 Subject: [PATCH 28/68] Various updates to the QC RFC --- text/0000-quality-assurance-pipeline.md | 302 ++++++++++-------------- 1 file changed, 123 insertions(+), 179 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 48c4f1f..284f377 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -3,102 +3,69 @@ - [Introduction](#introduction) - [Motivation](#motivation) - [Discussion](#discussion) - - [Current Process](#current-process) - - [Important Metrics](#important-metrics) - - [Thresholds and Metrics for Specific Applications](#thresholds-and-metrics-for-specific-applications) - - [Metrics for WES](#metrics-for-wes) - - [Metrics for RNAseq](#metrics-for-rnaseq) - - [Proposals](#proposals) - [Specification](#specification) - - [Workflow Description](#workflow-description) -- [The St Jude Genomics QC Process](#the-st-jude-genomics-qc-process) - - [Installation](#installation) - - [Anaconda Environment](#anaconda-environment) - - [Samtools Quickcheck](#samtools-quickcheck) - - [Picard ValidateSamFile](#picard-validatesamfile) - - [Sambamba Flagstat](#sambamba-flagstat) - - [FASTQC](#fastqc) - - [Qualimap Bam QC](#qualimap-bam-qc) - - [Qualimap RNA seq QC](#qualimap-rna-seq-qc) - - [md5sum](#md5sum) - - [RseQC infer_experiment.py](#rseqc-infer_experimentpy) - - [Report Aggregation](#report-aggregation) - [Items Still In-Progress](#items-still-in-progress) - [Outstanding Questions](#outstanding-questions) # Introduction -This RFC documents an automated pipeline workflow for vetting St. Jude Cloud genomic data, covering both existing and new data uploads to the platform. The end goal is to publish the quality control results from various tools. 
But currently, we hope to discuss which quality metrics and statistics are important to the bioinformatics community. Further, we invite the community to comment on the best methods for selecting thresholds for quality metrics. +This RFC documents an automated workflow for assessing the integrity and quality +of St. Jude Cloud genomics data. The end goal is to publish a collection of +metrics that users can leverage to assess the quality of the data available. +Furthermore, we outline the method used internally to vet the data before we +publish it. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). # Motivation -Since introducing Real-Time Clinical Genomics, we need an automated quality assurance pipeline that guarantees uploaded data meets predefined standards. Guaranteeing data integrity and the reproducibility of these results allows St. Jude to assure scientists and researchers that the data we provide is useful. - -The ultimate goal is to present a comprehensive report much like the example [ MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html) for each dataset and sequencing type (and ideally, also on a sample level). This would make the quality of data offered to researchers and scientists accessible. We hope this RFC becomes a forum for open community discussion of quality properties and attributes that are helpful and practical. +Since the introduction of uploading clinical genomics data in real-time (the +"Real-Time Clinical Genomics" initiative), we need an automated quality +assurance pipeline that guarantees uploaded data meets predefined standards. +Guaranteeing data integrity and the reproducibility of these results allows St. +Jude to assure scientists and researchers that the data we provide is useful. As +much as possible, we'd like to automate this process to ensure it scales. 
The end-goal is to present a comprehensive report, much like the example [ +MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html), for +each dataset + sequencing type tuple. # Discussion -The quality metrics discussed here are sequence and mapping quality metrics. Other metrics related to nucleic acid integrity or library quality are not part of the current process. Pre-sequencing quality metrics, however, are clearly important and part of our long term interests. Thus, we invite comments on those metrics as well. - +The quality metrics discussed here are sequence and mapping quality metrics. +Other metrics related to nucleic acid integrity or library quality are typically +done upstream in the genomics lab contributing the data. Pre-sequencing quality metrics, however, are clearly important and part of our long term interests. ## Current Process -Currently, St. Jude Cloud provides three sequencing data types: whole-genome (WGS), whole-exome (WES), and transcriptome (RNA-Seq) data. It is important to differentiate our quality control workflows for each type of sequencing. +Currently, St. Jude Cloud provides three sequencing data types: whole-genome +(WGS), whole-exome (WES), and transcriptome (RNA-seq) data. It is important to +differentiate our quality control workflows for each type of sequencing. At the +time that the RFC was authored, we use the following tools to _manually_ vet +data. -Our current process to vet and screen data consists of the following tools: +> Note that one of the goals of this RFC is to constantly update these +versions, so they are likely not what will be used in the pipeline. 
-| Tool | Version | -| ------------------------ | --------- | -| `samtools flagstat` | [v1.9] | -| `fastqc` | [v0.11.8] | -| `qualimap bamqc` | [v2.2.2] | -| `qualimap rnaseq` | [v2.2.2] | -| `picard ValidateSamFile` | [v2.20.2] | -| `multiqc` | [v1.7] | +| Tool | Version | Website | +| ------------------------ | ------- | ---------------- | +| `samtools flagstat` | v1.9 | [link](samtools) | +| `fastqc` | v0.11.8 | [link](fastqc) | +| `qualimap bamqc` | v2.2.2 | [link](qualimap) | +| `qualimap rnaseq` | v2.2.2 | [link](qualimap) | +| `picard ValidateSamFile` | v2.20.2 | [link](picard) | +| `multiqc` | v1.7 | [link](multiqc) | -[v1.9]: http://www.htslib.org/doc/samtools.html -[v0.11.8]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ -[v2.2.2]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html -[v2.20.2]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php -[v1.7]: https://multiqc.info/ ## Important Metrics -- Percent Aligned - -Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. - -- Per Base Sequence Quality ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) - -The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. 
- -- Overrepresented Sequences ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)) - -The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). - -- Reads Genomic Origin ([Qualimap](http://qualimap.bioinfo.cipf.es/)) - -The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. - -- rRNA Content (?) - -Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. - -- Transcript Coverage and 5’-3’ Bias ([Qualimap](http://qualimap.bioinfo.cipf.es/)) - -Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. - -- Junction Analysis ([Qualimap](http://qualimap.bioinfo.cipf.es/)) - -Analysis of known, partly known, and novel junction positions in spliced alignments. - -- Strand Specificity ([RSeQC](http://rseqc.sourceforge.net/)) - -Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). - -- GC Content Bias (?) - -GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). 
+| Name | Produced By | Description | +| ---------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| % Aligned | [Samtools][samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. | +| Per Base Sequence Quality | [FastQC][fastqc] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. | +| Overrepresented Sequences | [FastQC][fastqc] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | +| Reads Genomic Origin | [Qualimap][qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | +| rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. 
| +| Transcript Coverage and 5’-3’ Bias | [Qualimap][qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | +| Junction Analysis | [Qualimap][qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | +| Strand Specificity | RSeQC | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | +| GC Content Bias | ? | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | ## Thresholds and Metrics for Specific Applications @@ -125,36 +92,16 @@ The quality metrics of special concern for RNAseq include mapping percentage, pe # Specification -## Workflow Description - -The end workflow (covering both our current process and the addition of the new tool) would be as following: - -| Command | Purpose | -| ------------------------ | ---------------------------------------------------------- | -| `samtools quickcheck` | Validate BAM headers and EOF block existence | -| `md5sum` | For comparison to md5 vended file property | -| `picard ValidateSamFile` | Ensure validity of file | -| `samtools flagstat` | Generate flag statistics | -| `sambamba flagstat` | Generate flag statistics using a samtools alternative | -| `fastqc` | Screen for GC content and adapter contamination | -| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | -| `qualimap rnaseq` | Screen for RNA-Seq bias and junction analysis | -| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `multiqc` | Report aggregation | - -Note: Specific options such as memory size thresholds and thread count are below. +These are generic instructions for running each of the tools in our pipeline. 
+We run our pipeline in a series of QC scripts that are tailored for our compute +cluster, so those commands may not apply elsewhere. Instead we've supplied +examples of the commands used to each package. Our default memory is 80G and we +employ 4 threads for these processes. -# The St Jude Genomics QC Process - -These are generic instructions for running each of the tools in our pipeline. We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead we've supplied examples of the commands used to each package. Our default memory is 80G and we employ 4 threads for these processes. - -## Installation +## Dependencies We presume anaconda is available and installed. If not please follow the link to [anaconda](https://www.anaconda.com/) first. -#### Anaconda Environment - - ```bash conda create --name bio-qc \ --channel bioconda \ @@ -168,106 +115,97 @@ conda create --name bio-qc \ conda activate bio-qc ``` -### Samtools Quickcheck - -Very basic BAM file validation: - -```bash -# Should not be relied on for file integrity, only checks for header and EOF -samtools quickcheck -v *.bam > bad_bams.txt && echo "all ok" || echo "some files failed check, see bad_bams.txt" -``` -### Picard ValidateSamFile - -BAM file validation: - -```bash -# This method is used to assess file integrity - -picard ValidateSamFile \ - I=$BAM \ # specify bam file - MODE=SUMMARY\ # concise output - INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time - OUTPUT=$OUTDIR/$BAM_BN.validate.txt # output directory and file -``` - -### Sambamba Flagstat - -Summary statistics of read counts and mapping status - -```bash -# Sambamba includes faster reliable implementation of samtools commands - -sambamba flagstat -t $NUM_THREADS $BAM \ # number of threads and bam filename - > $OUTDIR/$BAM_BN.flagstat.txt # output directory and file -``` -### FASTQC +## Workflow -Standard sequence quality check: 
+The end workflow (covering both our current process and the addition of the new tool) would be as following: -```bash +| Command | Purpose | +| ------------------------ | ---------------------------------------------------------- | +| `samtools quickcheck` | Validate BAM headers and EOF block existence | +| `md5sum` | For comparison to md5 vended file property | +| `picard ValidateSamFile` | Ensure validity of file | +| `samtools flagstat` | Generate flag statistics | +| `sambamba flagstat` | Generate flag statistics using a samtools alternative | +| `fastqc` | Screen for GC content and adapter contamination | +| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | +| `qualimap rnaseq` | Screen for RNA-seq bias and junction analysis | +| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | +| `multiqc` | Report aggregation | -fastqc $BAM \ # bam filename - -o $OUTDIR # output directory - -``` - -### Qualimap Bam QC +Note: Specific options such as memory size thresholds and thread count are below. -Comprehensive QC statistics includes read stats, coverage, mapping quality, insert size, mismatches etc. +1. Run `samtools quickcheck` to ensure that input BAMs are relatively + well-formed (for instance, to ensure a header and EOF marker exist). -```bash + ```bash + samtools quickcheck $BAM + ``` -qualimap bamqc -bam $BAM \ # bam filename - --java-mem-size=$MEM_SIZE \ # memory - -nt $NUM_THREADS \ # threads requested - -nw 400 \ # number of windows - -outdir $QBAMQC_OUT # output directory -``` -### Qualimap RNA seq QC +2. Use Picard's `ValidateSamFile` tool to ensure the inner contents of the BAM + file are well-formed. -Comprehensive QC statistics tailored for RNA seq files. + ```bash + picard ValidateSamFile \ + I=$BAM \ # specify bam file + MODE=SUMMARY\ # concise output + INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time + IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. 
+ ``` -```bash +3. Run `samtools flagstat` to gather general statistics such as alignment + percentage. -qualimap rnaseq -bam $BAM \ # bam filename - -gtf $GTF_REF # transcript definition file - --java-mem-size=$MEM_SIZE \ # memory - -pe # specify paired end - -outdir $QBAMQC_OUT # output directory -``` + ```bash + samtools flagstat $BAM + ``` -### md5sum +4. Run `fastqc` to collect sequencing and library-related statistics. These are + only for informational purposes — as stated above, we typically do not remove + samples based on this information (with rare exception), as the + sequencing-related QC work was done upstream in the genomics lab. -Check size and integrity of files. + ```bash + fastqc $BAM + ``` -```bash +5. Run `qualimap bamqc` to gather more in-depth statistics about read stats, + coverage, mapping quality, mismatches, etc. -md5sum $BAM \ # bam filename - > $OUTDIR/$BAM.md5 # output directory - -``` + ```bash + qualimap bamqc -bam $BAM \ # bam filename + -nt $NUM_THREADS \ # threads requested + -nw 400 # number of windows + ``` -### RseQC infer_experiment.py +6. If RNA-seq data, run `qualimap rnaseq` to gather QC statistics that are + tailored for RNA-seq files. -Python script that tests for strandedness. + ```bash + qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory + -bam $BAM \ # bam filename + -gtf $GTF_REF # transcript definition file + -pe # specify paired end if paired end + ``` -```bash +7. If RNA-seq, run `ngsderive strandedness` to determine a backwards-computed + strandedness of the RNA-seq experiment. -infer_experiment.py -i $BAM \ # bam filename - -r $BED_REF \ # reference in bed format - > $OUTDIR/$BAM_BN.infer_experiment.txt # output directory and filename - -``` + ```bash + ngsderive strandedness + ``` -### Report Aggregation +8. Compute the md5 checksum of the file. -Package combines output from other QC tools in easily reviewed html format. + ```bash + md5sum $BAM + ``` -```bash -multiqc /path/to/outdir -``` +9. 
Combine all of the above metrics using `multiqc`. + ```bash + multiqc . # recurse all files in '.' + ``` # Items Still In-Progress @@ -280,3 +218,9 @@ multiqc /path/to/outdir - What other metrics or properties would be valuable? - What is best way to define and handle outliers? - What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? + +[samtools]: http://www.htslib.org/doc/samtools.html +[fastqc]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ +[qualimap]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html +[picard]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php +[multiqc]: https://multiqc.info/ \ No newline at end of file From 4c0a2d891964c6b8d07b0834ca22e434361ded78 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 18 Feb 2020 15:35:32 -0600 Subject: [PATCH 29/68] Change all 'RNA-Seq' to 'RNA-seq' --- .../GenomeComparison.ipynb | 8 ++--- text/0001-rnaseq-workflow-v2.0.md | 34 +++++++++---------- 2 files changed, 21 insertions(+), 21 deletions(-) diff --git a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb index e42bfb6..8fff488 100644 --- a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb +++ b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb @@ -139,7 +139,7 @@ "\n", "The ENCODE project stores a reference to all of it's currently used reference files [here](https://www.encodeproject.org/data-standards/reference-sequences/). From that page, you can see that the base reference genome can be downloaded [here](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz). 
However, we'd like to do a complete analysis including of all of the sequences they use for decoys/viruse/etc. if we want to use them later. Here's the steps I took to find the complete set of reference FASTAs they used in their STAR index:\n", "\n", - "1. Searching for their RNA-Seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-seq)).\n", + "1. Searching for their RNA-seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-seq)).\n", "2. The pipeline we are looking for is [this one](https://www.encodeproject.org/pipelines/ENCPL002LPE/).\n", "3. At the bottom of the page, you will see a PDF that is a comprehensive overview of their pipelines and contains a list to all of the current ENCODE reference accessions ([link](https://www.encodeproject.org/documents/6354169f-86f6-4b59-8322-141005ea44eb/@@download/attachment/Long%20RNA-seq%20pipeline%20overview.pdf)).\n", "4. In that document, you find that the link to their most currently built `STAR` genome is [here](https://www.encodeproject.org/references/ENCSR314WMD/). \n", @@ -287,11 +287,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## TOPMed + GTEx RNA-Seq pipeline\n", + "## TOPMed + GTEx RNA-seq pipeline\n", "\n", - "The GTEx consortium and TOPMed program both use the [GTEx RNA-Seq pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) developed by the Broad Institute. This workflow processes a high number of samples and has high reputation, so it's worth taking a look at.\n", + "The GTEx consortium and TOPMed program both use the [GTEx RNA-seq pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) developed by the Broad Institute. 
This workflow processes a high number of samples and has high reputation, so it's worth taking a look at.\n", "\n", - "Following the \"reference genome and annotation\" [section](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation) of their `README.md`, you are directed to the [TOPMed RNA-Seq pipeline harmonization](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) page. Reading the \"Reference files\" section of that documentation essentially lays out that they use the Broad Insitute's version of `GRCh38` and add the `ERCC SpikeIn` sequences. They provide both [a link to the Broad's original FASTA](https://software.broadinstitute.org/gatk/download/bundle) and [a link to their built FASTA](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) (although, given it points to a personal page, I'm not sure how long this link will be valid. For now, we will use the personal link." + "Following the \"reference genome and annotation\" [section](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation) of their `README.md`, you are directed to the [TOPMed RNA-seq pipeline harmonization](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) page. Reading the \"Reference files\" section of that documentation essentially lays out that they use the Broad Insitute's version of `GRCh38` and add the `ERCC SpikeIn` sequences. They provide both [a link to the Broad's original FASTA](https://software.broadinstitute.org/gatk/download/bundle) and [a link to their built FASTA](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) (although, given it points to a personal page, I'm not sure how long this link will be valid. For now, we will use the personal link." 
] }, { diff --git a/text/0001-rnaseq-workflow-v2.0.md b/text/0001-rnaseq-workflow-v2.0.md index 425df26..c2fed40 100644 --- a/text/0001-rnaseq-workflow-v2.0.md +++ b/text/0001-rnaseq-workflow-v2.0.md @@ -8,30 +8,30 @@ # Introduction -This RFC lays out the specification for the RNA-Seq mapping pipeline v2.0. The improvements contained within are largely based on (a) new version of tools/reference files and (b) feedback from the community. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/1). +This RFC lays out the specification for the RNA-seq mapping pipeline v2.0. The improvements contained within are largely based on (a) new version of tools/reference files and (b) feedback from the community. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/1). # Motivation - **Tool additions and updates.** The tools we use are woefully out of date (2 years old). We should reap the benefits of new tools if possible. Additionally, there is some new functionality in the area of QC and validation that I'd like to add. See the [section below](#Tool-additions-updates) for more details. - - Note that all of the tools used in the RNA-Seq Workflow v1.0 were the latest available version. -- **Updated reference files.** No changes have really been made to the `GRCh38_no_alt` analysis set FASTA. However, three major releases of the GENCODE gene model have transpired since we released the first revision of the RNA-Seq workflow ([GENCODE v31](https://www.gencodegenes.org/human/release_31.html) is now out). + - Note that all of the tools used in the RNA-seq Workflow v1.0 were the latest available version. +- **Updated reference files.** No changes have really been made to the `GRCh38_no_alt` analysis set FASTA. 
However, three major releases of the GENCODE gene model have transpired since we released the first revision of the RNA-seq workflow ([GENCODE v31](https://www.gencodegenes.org/human/release_31.html) is now out). - **QC and quality of life improvements based on feedback from the community.** Many interactions with the community have impacted the thoughts in this release: - A primary driver for the rewrite of the pipeline is the feedback we heard about the `ERCC SpikeIn` sequences. - Popular tools such as `GATK` and `picard` are generally unhappy if the sequence dictionaries don't match perfectly. - - The inclusion of the External RNA Controls Consortium (ERCC) Spike-in Control RNA sequences in the alignment reference file we used for RNA-Seq mapping was hence causing issues when using mapped RNA-Seq BAM files in conjunction with other non-RNA-Seq BAM files in downstream analysis using these tools. - - Last, many of our RNA-Seq samples were not generated using 'ERCC' spike-in control sequences. + - The inclusion of the External RNA Controls Consortium (ERCC) Spike-in Control RNA sequences in the alignment reference file we used for RNA-seq mapping was hence causing issues when using mapped RNA-seq BAM files in conjunction with other non-RNA-seq BAM files in downstream analysis using these tools. + - Last, many of our RNA-seq samples were not generated using 'ERCC' spike-in control sequences. - After some discussion internally, we decided the best thing to do was to remove the ERCC genome by default. We are considering providing an ERCC version of the BAM for samples containing these sequences, but there is no consensus on whether it's worth it yet. - - One of the most important themes in the RNA-Seq Workflow v2.0 proposal is the emphasis on QC and quality of life improvements (e.g. `fq lint`, generation and publication of md5sums). + - One of the most important themes in the RNA-seq Workflow v2.0 proposal is the emphasis on QC and quality of life improvements (e.g. 
`fq lint`, generation and publication of md5sums). # Discussion ## Tool additions and upgrades -As part of the RNA-Seq workflow v2, multiple tools will be added and upgraded: +As part of the RNA-seq workflow v2, multiple tools will be added and upgraded: - `fq v0.2.0` ([Released](https://github.com/stjude/fqlib/releases/tag/v0.2.0) November 28, 2018) will be added. This tool will be used to validate the output of `picard SamToFastq`. `picard SamToFastq` does not currently catch all of the errors we wish to catch at this stage (such as duplicate read names in the FastQ file). Thus, we will leverage this tool to independently validate that the data is well-formed by our definition of that phrase. - `rseqc v3.0.0` ([Source](http://rseqc.sourceforge.net/#download-rseqc)) will be added. We have started using `infer_experiment.py` to infer strandedness from the data and ensure that the data matches what information we get from the lab. -- Added `qualimap v.2.2.2` ([Source](https://bitbucket.org/kokonech/qualimap/)). Although we have been using `qualimap` quite heavily in our QC pipeline, we are formally adding this to the end of the RNA-Seq alignment workflow. The `bamqc` and `rnaseq` subcommands are both used. +- Added `qualimap v.2.2.2` ([Source](https://bitbucket.org/kokonech/qualimap/)). Although we have been using `qualimap` quite heavily in our QC pipeline, we are formally adding this to the end of the RNA-seq alignment workflow. The `bamqc` and `rnaseq` subcommands are both used. - Update `STAR 2.5.3a` ([Released](https://github.com/alexdobin/STAR/releases/tag/2.5.3a) March 17, 2017) to `STAR 2.7.1a` ([Released](https://github.com/alexdobin/STAR/releases/tag/2.7.1a) May 15, 2019). Upgraded to receive the benefits of bug fixes and software optimizations. - Update `samtools 1.4.0` ([Released](https://github.com/samtools/samtools/releases/tag/1.4) March 13, 2017) to `samtools 1.9` ([Released](https://github.com/samtools/samtools/releases/tag/1.9) July 18, 2018). 
Updating the samtools version whenever possible is of particular interest to me due to the historical fragility of the samtools code (although it has seemed to get better over the last year or so). - Update `picard 2.9.4` ([Released](https://github.com/broadinstitute/picard/releases/tag/2.9.4) June 15, 2017) to `picard 2.20.2` ([Released](https://github.com/broadinstitute/picard/releases/tag/2.20.2) May 28, 2019). Upgraded to receive the benefits of bug fixes and software optimizations. @@ -49,9 +49,9 @@ First, we researched what some of the projects we respect in the community are d | Pipeline | Reference Genome | Reference Genome Patch | Gene Model | Gene Model Patch | | ------------------------------------------------------------------------ | -------------------------------------------------------------------- | ---------------------- | -------------------------- | ---------------- | -| GDC's [mRNA-Seq pipeline][gdc-mrnaseq-pipeline] | [`GRCh38_no_alt`-based w/ decoys + viral][gdc-reference-genome] | `GRCh38.p0` | [GENCODE v22][gencode-v22] | `GRCh38.p2` | -| ENCODE's [RNA-Seq pipeline][encode-rnaseq-pipeline] | [`GRCh38_no_alt`-based w/ SpikeIns][encode-reference-genome] | `GRCh38.p0` | [GENCODE v24][gencode-v24] | `GRCh38.p5` | -| Broad Institute's [GTEx + TOPMed RNA-Seq pipeline][gtex-rnaseq-pipeline] | [Broad's `GRCh38` w/ ERCC SpikeIn][broad-institute-reference-genome] | `GRCh38.p0` | [GENCODE v26][gencode-v26] | `GRCh38.p10` | +| GDC's [mRNA-seq pipeline][gdc-mrnaseq-pipeline] | [`GRCh38_no_alt`-based w/ decoys + viral][gdc-reference-genome] | `GRCh38.p0` | [GENCODE v22][gencode-v22] | `GRCh38.p2` | +| ENCODE's [RNA-seq pipeline][encode-rnaseq-pipeline] | [`GRCh38_no_alt`-based w/ SpikeIns][encode-reference-genome] | `GRCh38.p0` | [GENCODE v24][gencode-v24] | `GRCh38.p5` | +| Broad Institute's [GTEx + TOPMed RNA-seq pipeline][gtex-rnaseq-pipeline] | [Broad's `GRCh38` w/ ERCC SpikeIn][broad-institute-reference-genome] | `GRCh38.p0` | [GENCODE 
v26][gencode-v26] | `GRCh38.p10` | [gdc-mrnaseq-pipeline]: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/ [gdc-reference-genome]: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files @@ -86,7 +86,7 @@ Given that there was no perfect option, we decided to stick with option #3. Originally, I had posed this question to the group: -> - Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model. This was due to best practices outlined during our RNA-Seq Workflow v1.0 discussions. I propose we revert this for the following reasons: +> - Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model. This was due to best practices outlined during our RNA-seq Workflow v1.0 discussions. I propose we revert this for the following reasons: > - The first sentence in section 2.2.2 of the [STAR 2.7.1.a manual](https://github.com/alexdobin/STAR/blob/2.7.1a/doc/STARmanual.pdf): "The use of the most comprehensive annotations for a given species is strongly recommended". So it seems the author recommends you use the most comprehensive gene model. > - Here is what [the GENCODE FAQ](https://www.gencodegenes.org/pages/faq.html) has to say about the level 3 annotations: "Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found". Given that the GENCODE geneset is the union of automated annotations from the `Ensembl-genebuild` and manual curation of the `Ensembl-Havana` team, this level should be harmless in the event that levels 1 & 2 don't apply. 
> - Last, the various other pipelines in the community don't tend to remove these features: @@ -142,7 +142,7 @@ cargo install --git https://github.com/stjude/fqlib.git --tag v0.3.1 ## Reference files -The following reference files are used as the basis of the RNA-Seq Workflow v2.0: +The following reference files are used as the basis of the RNA-seq Workflow v2.0: - Similarly to all analysis pipelines in St. Jude Cloud, we use the `GRCh38_no_alt` analysis set for our reference genome. You can get a copy of the file [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). Additionally, you can get the file by running the following commands: @@ -177,12 +177,12 @@ The following reference files are used as the basis of the RNA-Seq Workflow v2.0 --runThreadN $NCPU \ # Number of threads to use to build genome database. --genomeFastaFiles $FASTA \ # A path to the GRCh38_no_alt.fa FASTA file. --sjdbGTFfile $GENCODE_GTF_V31 \ # GENCODE v31 gene model file. - --sjdbOverhang 125 # Splice junction database overhang parameter, the optimal value is (Max length of RNA-Seq read-1). + --sjdbOverhang 125 # Splice junction database overhang parameter, the optimal value is (Max length of RNA-seq read-1). ``` ## Workflow -Here are the resulting steps in the RNA-Seq Workflow v2.0 pipeline. +Here are the resulting steps in the RNA-seq Workflow v2.0 pipeline. 1. Run `samtools quickcheck` on the incoming BAM to ensure that it is well-formed enough to convert back to FastQ. 2. Split BAM file into multiple BAMs on the different read groups using `samtools split`. See [the samtools documentation](http://www.htslib.org/doc/samtools.html) for more information. @@ -283,7 +283,7 @@ Here are the resulting steps in the RNA-Seq Workflow v2.0 pipeline. -outdir $OUTPUT_DIR \ # Output directory. -oc qualimap_counts.txt \ # Counts as calculated by qualimap. 
-p $COMPUTED \ # Strandedness as specified by the lab and confirmed by "infer_experiment.py" above. Typically "strand-specific-reverse" for St. Jude Cloud data. - -pe # All RNA-Seq data in St. Jude Cloud is currently paired-end. + -pe # All RNA-seq data in St. Jude Cloud is currently paired-end. ``` 11. Next, `htseq-count` is run for the final counts file to be delivered: @@ -297,7 +297,7 @@ Here are the resulting steps in the RNA-Seq Workflow v2.0 pipeline. --supplementary-alignments ignore \ # Elect to ignore supplementary alignments. Needs input from reviewers. $INPUT_BAM # Input BAM file. ``` -12. Generate the remaining files generally desired as output for the RNA-Seq Workflow. +12. Generate the remaining files generally desired as output for the RNA-seq Workflow. ```bash samtools flagstat $INPUT_BAM samtools index $INPUT_BAM From 75ec04cee554075aa91186bfa549d9c0f1c637ae Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 18 Feb 2020 16:29:20 -0600 Subject: [PATCH 30/68] improve: updates to the QC RFC --- text/0000-quality-assurance-pipeline.md | 74 ++++++++----------------- 1 file changed, 22 insertions(+), 52 deletions(-) diff --git a/text/0000-quality-assurance-pipeline.md b/text/0000-quality-assurance-pipeline.md index 284f377..61b473b 100644 --- a/text/0000-quality-assurance-pipeline.md +++ b/text/0000-quality-assurance-pipeline.md @@ -32,28 +32,20 @@ The quality metrics discussed here are sequence and mapping quality metrics. Other metrics related to nucleic acid integrity or library quality are typically done upstream in the genomics lab contributing the data. Pre-sequencing quality metrics, however, are clearly important and part of our long term interests. -## Current Process +## Tool additions and upgrades -Currently, St. Jude Cloud provides three sequencing data types: whole-genome -(WGS), whole-exome (WES), and transcriptome (RNA-seq) data. It is important to -differentiate our quality control workflows for each type of sequencing. 
At the -time that the RFC was authored, we use the following tools to _manually_ vet -data. +* `ngsderive v1.0.1` will be added for RNA-seq strandedness derivation, read + length derivation, and instrument derivation. +* `fastqc v0.11.8` will be upgraded to `fastqc v0.11.9`. Upgraded to receive the + benefits of bug fixes and software optimizations. +* `picard v2.20.2` will be upgraded to `picard v2.21.8`. Upgraded to receive the + benefits of bug fixes and software optimizations. +* `qualimap v2.2.2c` will be upgraded to `qualimap v2.2.2d`. Upgraded to receive + the benefits of bug fixes and software optimizations. +* `samtools v1.9` will be upgraded to `samtools v1.10.2`. Upgraded to receive + the benefits of bug fixes and software optimizations. -> Note that one of the goals of this RFC is to constantly update these -versions, so they are likely not what will be used in the pipeline. - -| Tool | Version | Website | -| ------------------------ | ------- | ---------------- | -| `samtools flagstat` | v1.9 | [link](samtools) | -| `fastqc` | v0.11.8 | [link](fastqc) | -| `qualimap bamqc` | v2.2.2 | [link](qualimap) | -| `qualimap rnaseq` | v2.2.2 | [link](qualimap) | -| `picard ValidateSamFile` | v2.20.2 | [link](picard) | -| `multiqc` | v1.7 | [link](multiqc) | - - -## Important Metrics +## Automated metrics comparison | Name | Produced By | Description | | ---------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | @@ -64,7 +56,7 @@ versions, so they are likely not what will be used in the pipeline. | rRNA Content | ? 
| Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | | Transcript Coverage and 5’-3’ Bias | [Qualimap][qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | | Junction Analysis | [Qualimap][qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | -| Strand Specificity | RSeQC | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | +| Strand Specificity | ngsderive | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | | GC Content Bias | ? | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | ## Thresholds and Metrics for Specific Applications @@ -81,15 +73,6 @@ The quality metrics of special concern for WES include depth of coverage in exom The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. - -### Proposals - -- Add [`RSeQC v3.0.0`](http://rseqc.sourceforge.net), specifically [`infer_experiment`]. - -[`infer_experiment`]: http://rseqc.sourceforge.net/#infer-experiment-py - -- Include md5 hash as an annotation property for vended files. - # Specification These are generic instructions for running each of the tools in our pipeline. @@ -105,12 +88,11 @@ We presume anaconda is available and installed. 
If not please follow the link to ```bash conda create --name bio-qc \ --channel bioconda \ - fastqc==0.11.8 \ - picard==2.20.2 \ - qualimap==2.2.2c \ - rseqc==3.0.0 \ - sambamba==0.6.6 \ - samtools==1.9 \ + --channel conda-forge \ + fastqc==0.11.9 \ + picard==2.21.8 \ + qualimap==2.2.2d \ + samtools==1.10.2 \ -y conda activate bio-qc @@ -118,22 +100,10 @@ conda activate bio-qc ## Workflow -The end workflow (covering both our current process and the addition of the new tool) would be as following: - -| Command | Purpose | -| ------------------------ | ---------------------------------------------------------- | -| `samtools quickcheck` | Validate BAM headers and EOF block existence | -| `md5sum` | For comparison to md5 vended file property | -| `picard ValidateSamFile` | Ensure validity of file | -| `samtools flagstat` | Generate flag statistics | -| `sambamba flagstat` | Generate flag statistics using a samtools alternative | -| `fastqc` | Screen for GC content and adapter contamination | -| `qualimap bamqc` | Screen for mapping quality, coverage, and duplication rate | -| `qualimap rnaseq` | Screen for RNA-seq bias and junction analysis | -| `rseqc infer_experiment` | Determine RNA-SEQ strandedness and reads | -| `multiqc` | Report aggregation | - -Note: Specific options such as memory size thresholds and thread count are below. +The workflow specification is as follows. Note that some arguments that are not +integral to the command (such as output directories) or arguments that can vary +between compute environements (such as memory thresholds or number of threads) +are not included. 1. Run `samtools quickcheck` to ensure that input BAMs are relatively well-formed (for instance, to ensure a header and EOF marker exist). From 580dc9b159580878d21a44a9fbda922a48968b29 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 18 Feb 2020 17:42:48 -0600 Subject: [PATCH 31/68] Assign RFC a number of 0002. 
--- resources/0002-quality-check-workflow/.gitkeep | 0 ...ality-assurance-pipeline.md => 0002-quality-check-workflow.md} | 0 2 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 resources/0002-quality-check-workflow/.gitkeep rename text/{0000-quality-assurance-pipeline.md => 0002-quality-check-workflow.md} (100%) diff --git a/resources/0002-quality-check-workflow/.gitkeep b/resources/0002-quality-check-workflow/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/text/0000-quality-assurance-pipeline.md b/text/0002-quality-check-workflow.md similarity index 100% rename from text/0000-quality-assurance-pipeline.md rename to text/0002-quality-check-workflow.md From 387051056a9c8c35fa8f7fda35e38773603eebd4 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 19 Feb 2020 15:57:50 -0600 Subject: [PATCH 32/68] Tool versions reverted back to what we use in RNA-seq v2. --- text/0002-quality-check-workflow.md | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 61b473b..8eb2f57 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -36,14 +36,6 @@ done upstream in the genomics lab contributing the data. Pre-sequencing quality * `ngsderive v1.0.1` will be added for RNA-seq strandedness derivation, read length derivation, and instrument derivation. -* `fastqc v0.11.8` will be upgraded to `fastqc v0.11.9`. Upgraded to receive the - benefits of bug fixes and software optimizations. -* `picard v2.20.2` will be upgraded to `picard v2.21.8`. Upgraded to receive the - benefits of bug fixes and software optimizations. -* `qualimap v2.2.2c` will be upgraded to `qualimap v2.2.2d`. Upgraded to receive - the benefits of bug fixes and software optimizations. -* `samtools v1.9` will be upgraded to `samtools v1.10.2`. Upgraded to receive - the benefits of bug fixes and software optimizations. 
## Automated metrics comparison @@ -89,10 +81,11 @@ We presume anaconda is available and installed. If not please follow the link to conda create --name bio-qc \ --channel bioconda \ --channel conda-forge \ - fastqc==0.11.9 \ - picard==2.21.8 \ - qualimap==2.2.2d \ - samtools==1.10.2 \ + fastqc==0.11.8 \ + picard==2.20.2 \ + qualimap==2.2.2c \ + samtools==1.9 \ + ngsderive==1.0.1 \ -y conda activate bio-qc From 5f87a254268b6a2ddacbe377d059ce1ad25e80a9 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 19 Feb 2020 16:08:56 -0600 Subject: [PATCH 33/68] Add fastq_screen tool --- text/0002-quality-check-workflow.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 8eb2f57..57551e8 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -35,7 +35,7 @@ done upstream in the genomics lab contributing the data. Pre-sequencing quality ## Tool additions and upgrades * `ngsderive v1.0.1` will be added for RNA-seq strandedness derivation, read - length derivation, and instrument derivation. +* `fastq_screen v0.13.0` will be added to estimate the percentage of material derived from different sources (human, mouse PhiX, etc). ## Automated metrics comparison @@ -53,7 +53,9 @@ done upstream in the genomics lab contributing the data. Pre-sequencing quality ## Thresholds and Metrics for Specific Applications - To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thrresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. 
We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost.## Metrics for WGS + To apply quality control metrics to vet data, we need reasonable thresholds that are practically achievable and neither too lax nor too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation is degraded. Statistical tests can be performed on large populations of QC data. We are already in a position to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. + + ## Metrics for WGS The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. 
@@ -84,7 +86,8 @@ conda create --name bio-qc \ fastqc==0.11.8 \ picard==2.20.2 \ qualimap==2.2.2c \ - samtools==1.9 \ + samtools==1.9 \ + fastq_screen==0.13.0 \ ngsderive==1.0.1 \ -y From ab8684a838e82793050f3828b85ad9e193f52667 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 19 Feb 2020 16:12:13 -0600 Subject: [PATCH 34/68] fix: correct conda package name --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 57551e8..a239841 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -87,7 +87,7 @@ conda create --name bio-qc \ picard==2.20.2 \ qualimap==2.2.2c \ samtools==1.9 \ - fastq_screen==0.13.0 \ + fastq-screen==0.13.0 \ ngsderive==1.0.1 \ -y From ead2d8ab056e356274fdddd5c696ffea91cfeb39 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 17:01:09 -0500 Subject: [PATCH 35/68] docs: first pass of QC updates --- text/0002-quality-check-workflow.md | 34 ++++++++++++++--------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index a239841..bb67769 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -10,31 +10,31 @@ # Introduction This RFC documents an automated workflow for assessing the integrity and quality -of St. Jude Cloud genomics data. The end goal is to publish a collection of -metrics that users can leverage to assess the quality of the data available. -Furthermore, we outline the method used internally to vet the data before we -publish it. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). +of St. Jude Cloud genomics data. The goal of this RFC is two-fold: + +1. 
Establish state of the art method for comprehensively evaluating genomics data quality at scale — both at time of receipt and after processing. +2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data. + +You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). # Motivation -Since the introduction of uploading clinical genomics data in real-time (the -"Real-Time Clinical Genomics" initiative), we need an automated quality -assurance pipeline that guarantees uploaded data meets predefined standards. -Guaranteeing data integrity and the reproducibility of these results allows St. -Jude to assure scientists and researchers that the data we provide is useful. As -much as possible, we'd like to automate this process to ensure it scales. The end-goal is to present a comprehensive report, much like the example [ -MultiQC report](https://multiqc.info/examples/rna-seq/multiqc_report.html), for -each dataset + sequencing type tuple. +St. Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole-genome, whole-exome, RNA-seq, and various omics-based assays each year. +A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to (1) define the standard set of QC tools used to evaluating omics-based data, (2) identify and implement key metrics that can be automated to assist in manual observation of the data, and (3) publish these results alongside the data already published in St. Jude Cloud so that end-users can similar view results. 
# Discussion -The quality metrics discussed here are sequence and mapping quality metrics. -Other metrics related to nucleic acid integrity or library quality are typically -done upstream in the genomics lab contributing the data. Pre-sequencing quality metrics, however, are clearly important and part of our long term interests. +## Types of QC + +There are (at least) two different types of QC typically carried out omics-based data. **Experiment** QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good" (by some definition of that word), **computational** QC examines the degree to which computational processing of that data was completed successfully. + +By the time data reaches the St. Jude Cloud team from various sources, extensive *experimental* and *computational* evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. + +Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the *computational* QC of the files produce for publication in St. Jude Cloud. While we do produce results that define *experimental* results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. 
-## Tool additions and upgrades +## Tools Used -* `ngsderive v1.0.1` will be added for RNA-seq strandedness derivation, read +* `ngsderive v1.0.2` will be added for RNA-seq strandedness derivation, read * `fastq_screen v0.13.0` will be added to estimate the percentage of material derived from different sources (human, mouse PhiX, etc). ## Automated metrics comparison From c348cbd6d60afb1ffa2b08117e5b6929c29e67ba Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 17:03:04 -0500 Subject: [PATCH 36/68] docs: further updates to QC pipeline RFC --- text/0002-quality-check-workflow.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index bb67769..8a3b763 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -10,9 +10,9 @@ # Introduction This RFC documents an automated workflow for assessing the integrity and quality -of St. Jude Cloud genomics data. The goal of this RFC is two-fold: +of St. Jude Cloud genomics data. The goal of this RFC is two-fold. -1. Establish state of the art method for comprehensively evaluating genomics data quality at scale — both at time of receipt and after processing. +1. Establish state of the art method for comprehensively evaluating genomics data quality at scale: both at time of receipt and after processing. 2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). @@ -20,7 +20,11 @@ You can find the relevant discussion on the [associated pull request](https://gi # Motivation St. Jude Cloud is one of the largest repositories of omics data available for request to date. 
As such, the project processes thousands of samples from whole-genome, whole-exome, RNA-seq, and various omics-based assays each year. -A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to (1) define the standard set of QC tools used to evaluating omics-based data, (2) identify and implement key metrics that can be automated to assist in manual observation of the data, and (3) publish these results alongside the data already published in St. Jude Cloud so that end-users can similar view results. +A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to + +* define the standard set of QC tools used to evaluating omics-based data +* identify and implement key metrics that can be automated to assist in manual observation of the data, and +* publish these results alongside the data already published in St. Jude Cloud so that end-users can similar view results. # Discussion From dfdc0cd85ddc5026d6bcf599c89942f7b5a9bcb7 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 17:06:39 -0500 Subject: [PATCH 37/68] docs: further refinement of QC RFC --- text/0002-quality-check-workflow.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 8a3b763..1fb4ad8 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -36,10 +36,10 @@ By the time data reaches the St. Jude Cloud team from various sources, extensive Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the *computational* QC of the files produce for publication in St. Jude Cloud. 
While we do produce results that define *experimental* results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. -## Tools Used +## Tools -* `ngsderive v1.0.2` will be added for RNA-seq strandedness derivation, read -* `fastq_screen v0.13.0` will be added to estimate the percentage of material derived from different sources (human, mouse PhiX, etc). +* `ngsderive v1.0.2` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. +* `fastq_screen v0.13.0` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). ## Automated metrics comparison From f9d612b978cc5edc48e50065d4ba41f6fe566264 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:02:31 -0500 Subject: [PATCH 38/68] docs: add tool sections to see how they render --- text/0002-quality-check-workflow.md | 31 +++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 1fb4ad8..338db82 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -36,10 +36,33 @@ By the time data reaches the St. Jude Cloud team from various sources, extensive Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the *computational* QC of the files produce for publication in St. Jude Cloud. While we do produce results that define *experimental* results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. 
We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. -## Tools +## Tools and Metrics -* `ngsderive v1.0.2` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. -* `fastq_screen v0.13.0` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). +Here, we outline each tool, what metrics are considered in an automated manner, and which metrics and require manual inspect. To keep from duplicating information and to ensure the RFC doesn't get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. + +## fastqc + +TODO + +## qualimap + +TODO + +## samtools + +TODO + +## ngsderive + +`ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. + +## picard + +`picard` is for several operations including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq`. + +## fastq-screen + +* `fastq_screen` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). 
## Automated metrics comparison @@ -92,7 +115,7 @@ conda create --name bio-qc \ qualimap==2.2.2c \ samtools==1.9 \ fastq-screen==0.13.0 \ - ngsderive==1.0.1 \ + ngsderive==1.0.2 \ -y conda activate bio-qc From 7b0f7720e94dab985d7095ce90f5f14ded6bf5ce Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:15:56 -0500 Subject: [PATCH 39/68] docs: change tools to h3 --- text/0002-quality-check-workflow.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 338db82..31d2683 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -40,29 +40,29 @@ Thus, the scope of this RFC, and the QC of samples on the project in general, is Here, we outline each tool, what metrics are considered in an automated manner, and which metrics and require manual inspect. To keep from duplicating information and to ensure the RFC doesn't get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. -## fastqc +### fastqc TODO -## qualimap +### qualimap TODO -## samtools +### samtools TODO -## ngsderive +### ngsderive `ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. -## picard +### picard `picard` is for several operations including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq`. -## fastq-screen +### fastq-screen -* `fastq_screen` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). +`fastq_screen` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). 
## Automated metrics comparison From cce8ee2c9c2af5d0a355cd437a9335f135589a2b Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:22:51 -0500 Subject: [PATCH 40/68] docs: add ngsderive metrics --- text/0002-quality-check-workflow.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 31d2683..cc4308a 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -56,6 +56,14 @@ TODO `ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. +In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-seq). + +| Name | Applicable Experiments | Check Type | Description | +| --------------------- | ---------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------- | +| Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | +| Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | +| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | + ### picard `picard` is for several operations including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq`. 
From ae4716fd8020f1dba7bf6f1543331ed408b9782d Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:24:31 -0500 Subject: [PATCH 41/68] docs: header changes --- text/0002-quality-check-workflow.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index cc4308a..cab11a1 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -58,11 +58,11 @@ TODO In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-seq). -| Name | Applicable Experiments | Check Type | Description | -| --------------------- | ---------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------- | -| Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | -| Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | +| Name | Experiments | Check | Description | +| --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | +| Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | +| Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). 
| +| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | ### picard From ed0cee7081ac648bc356dd58e8f756e7c8b32ca7 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:24:59 -0500 Subject: [PATCH 42/68] docs: add metrics header --- text/0002-quality-check-workflow.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index cab11a1..8ba54e9 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -58,6 +58,8 @@ TODO In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-seq). +#### Metrics + | Name | Experiments | Check | Description | | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | From 4d009026fb221d09f8fd937e8e697ad38d78f8fe Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Sun, 15 Mar 2020 21:26:17 -0500 Subject: [PATCH 43/68] Revert "docs: add metrics header" This reverts commit ed0cee7081ac648bc356dd58e8f756e7c8b32ca7. --- text/0002-quality-check-workflow.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 8ba54e9..cab11a1 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -58,8 +58,6 @@ TODO In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-seq). 
-#### Metrics - | Name | Experiments | Check | Description | | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | From 47c3e0f1f90f23c49fff4605ef10968d29e22771 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Mon, 23 Mar 2020 11:47:45 -0500 Subject: [PATCH 44/68] further WIP --- text/0002-quality-check-workflow.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index cab11a1..8356912 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -40,17 +40,17 @@ Thus, the scope of this RFC, and the QC of samples on the project in general, is Here, we outline each tool, what metrics are considered in an automated manner, and which metrics and require manual inspect. To keep from duplicating information and to ensure the RFC doesn't get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. 
-### fastqc +### samtools -TODO +TODO ### qualimap TODO -### samtools +### fastqc -TODO +TODO ### ngsderive From 06d1cfa034f4f74d35d8abba061ee8187175d66b Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Wed, 18 Nov 2020 16:09:52 -0500 Subject: [PATCH 45/68] update RFC to current workflow specs --- text/0002-quality-check-workflow.md | 234 ++++++++++++++-------------- 1 file changed, 116 insertions(+), 118 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 8356912..00605ba 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -7,52 +7,50 @@ - [Items Still In-Progress](#items-still-in-progress) - [Outstanding Questions](#outstanding-questions) -# Introduction +## Introduction -This RFC documents an automated workflow for assessing the integrity and quality -of St. Jude Cloud genomics data. The goal of this RFC is two-fold. +This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud genomics data. The goal of this RFC is two-fold. 1. Establish state of the art method for comprehensively evaluating genomics data quality at scale: both at time of receipt and after processing. 2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). -# Motivation +## Motivation -St. Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole-genome, whole-exome, RNA-seq, and various omics-based assays each year. 
-A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to +St. Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole-genome, whole-exome, RNA-seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to -* define the standard set of QC tools used to evaluating omics-based data -* identify and implement key metrics that can be automated to assist in manual observation of the data, and -* publish these results alongside the data already published in St. Jude Cloud so that end-users can similar view results. +- define the standard set of QC tools used to evaluate omics-based data, +- identify and implement key metrics that can be automated to assist in manual observation of the data, and +- publish these results alongside the data already published in St. Jude Cloud so that end-users can similarly view results. -# Discussion +## Discussion -## Types of QC +### Types of QC There are (at least) two different types of QC typically carried out omics-based data. **Experiment** QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good" (by some definition of that word), **computational** QC examines the degree to which computational processing of that data was completed successfully. -By the time data reaches the St. Jude Cloud team from various sources, extensive *experimental* and *computational* evaluation have already been carried out.
Each contributing project has its own thresholds for quality in both areas which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. +By the time data reaches the St. Jude Cloud team from various sources, extensive _experimental_ and _computational_ evaluation has already been carried out. Each contributing project has its own thresholds for quality in both areas, which depend on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. -Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the *computational* QC of the files produce for publication in St. Jude Cloud. While we do produce results that define *experimental* results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. +Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produced for publication in St. Jude Cloud. While we do produce results that define _experimental_ results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches. -## Tools and Metrics +### Tools and Metrics Here, we outline each tool, what metrics are considered in an automated manner, and which metrics and require manual inspect.
To keep from duplicating information and to ensure the RFC doesn't get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. -### samtools +#### samtools -TODO +Samtools is used both as a utility for file transformations and, through its `flagstat` command, to generate metrics for quality checking. Metrics include the number of duplicate reads and properly paired reads. -### qualimap +#### qualimap -TODO +Qualimap is used to find coverage across the genome, insert sizes, and GC content. -### fastqc +#### fastqc -TODO +FastQC generates metrics about read quality scores, sequence duplication levels, and length distributions, among others. -### ngsderive +#### ngsderive `ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. @@ -64,53 +62,49 @@ In the QC pipeline, we leverage all currently available subcommands to try to de | Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | | Inferred strandedness | RNA-seq | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -### picard +#### picard -`picard` is for several operations including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq`. +`picard` is used for several operations, including validating BAM files with `ValidateSamFile` and converting SAM to FastQ files with `SamToFastq`. -### fastq-screen +#### fastq-screen `fastq_screen` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc).
-## Automated metrics comparison +### Automated metrics comparison -| Name | Produced By | Description | -| ---------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| % Aligned | [Samtools][samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. | -| Per Base Sequence Quality | [FastQC][fastqc] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users — the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. | -| Overrepresented Sequences | [FastQC][fastqc] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | -| Reads Genomic Origin | [Qualimap][qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | -| rRNA Content | ? 
| Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | -| Transcript Coverage and 5’-3’ Bias | [Qualimap][qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3’ region. If reads primarily accumulate at the 3’ end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | -| Junction Analysis | [Qualimap][qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | -| Strand Specificity | ngsderive | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | -| GC Content Bias | ? | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | +| Name | Produced By | Description | +| ---------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. | +| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. 
| +| Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | +| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | +| rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | +| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | +| Junction Analysis | [Qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | +| Strand Specificity | ngsderive | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | +| GC Content Bias | ? | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | -## Thresholds and Metrics for Specific Applications +#### Thresholds and Metrics for Specific Applications - To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. 
By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. - - ## Metrics for WGS - -The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. +To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. -### Metrics for WES +#### Metrics for WGS + +The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. 
So metrics that are sensitive to impurity are valuable. + +#### Metrics for WES The quality metrics of special concern for WES include depth of coverage in exomic regions. Mapping quality, % mapped and duplication rate are also important. -### Metrics for RNAseq +#### Metrics for RNAseq -The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. +The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. -# Specification +## Specification -These are generic instructions for running each of the tools in our pipeline. -We run our pipeline in a series of QC scripts that are tailored for our compute -cluster, so those commands may not apply elsewhere. Instead we've supplied -examples of the commands used to each package. Our default memory is 80G and we -employ 4 threads for these processes. +These are generic instructions for running each of the tools in our pipeline. We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead we've supplied examples of the commands used to each package. Our default memory is 80G and we employ 4 threads for these processes. -## Dependencies +### Dependencies We presume anaconda is available and installed. If not please follow the link to [anaconda](https://www.anaconda.com/) first. @@ -123,105 +117,109 @@ conda create --name bio-qc \ qualimap==2.2.2c \ samtools==1.9 \ fastq-screen==0.13.0 \ - ngsderive==1.0.2 \ + ngsderive==1.1.0 \ + multiqc==1.9 \ -y conda activate bio-qc ``` -## Workflow +For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib) + +### Workflow + +The workflow specification is as follows. 
Note that some arguments that are not integral to the command (such as output directories) or arguments that can vary between compute environements (such as memory thresholds or number of threads) are not included. + +1. Compute the md5 checksum of the file. + + ```bash + md5sum $BAM + ``` -The workflow specification is as follows. Note that some arguments that are not -integral to the command (such as output directories) or arguments that can vary -between compute environements (such as memory thresholds or number of threads) -are not included. +2. Use Picard's `ValidateSamFile` tool to ensure the inner contents of the BAM file are well-formed. -1. Run `samtools quickcheck` to ensure that input BAMs are relatively - well-formed (for instance, to ensure a header and EOF marker exist). + ```bash + picard ValidateSamFile \ + I=$BAM \ # specify bam file + MODE=SUMMARY\ # concise output + INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time + IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. + ``` - ```bash - samtools quickcheck $BAM - ``` +3. Run `samtools flagstat` to gather general statistics such as alignment percentage. -2. Use Picard's `ValidateSamFile` tool to ensure the inner contents of the BAM - file are well-formed. + ```bash + samtools flagstat $BAM + ``` - ```bash - picard ValidateSamFile \ - I=$BAM \ # specify bam file - MODE=SUMMARY\ # concise output - INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time - IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. - ``` +4. Run `fastqc` to collect sequencing and library-related statistics. These are only for informational purposes -- as stated above, we typically do not remove samples based on this information (with rare exception), as the sequencing-related QC work was done upstream in the genomics lab. -3. Run `samtools flagstat` to gather general statistics such as alignment - percentage. 
+ ```bash + fastqc $BAM + ``` - ```bash - samtools flagstat $BAM - ``` +5. Run `ngsderive instrument` to infer sequencing instrument. -4. Run `fastqc` to collect sequencing and library-related statistics. These are - only for informational purposes — as stated above, we typically do not remove - samples based on this information (with rare exception), as the - sequencing-related QC work was done upstream in the genomics lab. + ```bash + ngsderive instrument $BAM + ``` - ```bash - fastqc $BAM - ``` +6. Run `ngsderive readlen` to infer read lengths. -5. Run `qualimap bamqc` to gather more in-depth statistics about read stats, - coverage, mapping quality, mismatches, etc. + ```bash + ngsderive readlen $BAM + ``` - ```bash - qualimap bamqc -bam $BAM \ # bam filename - -nt $NUM_THREADS \ # threads requested - -nw 400 # number of windows - ``` +7. Run `qualimap bamqc` to gather more in-depth statistics about read stats, coverage, mapping quality, mismatches, etc. -6. If RNA-seq data, run `qualimap rnaseq` to gather QC statistics that are - tailored for RNA-seq files. + ```bash + qualimap bamqc -bam $BAM \ # bam filename + -nt $NUM_THREADS \ # threads requested + -nw 400 # number of windows + ``` - ```bash - qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory - -bam $BAM \ # bam filename - -gtf $GTF_REF # transcript definition file - -pe # specify paired end if paired end - ``` +8. If WGS or WES data, run `fastq_screen`. For performance, we subsample the input BAM using `samtools view -s $computed_fraction` before running it through `picard SamToFastq`. The resulting fastqs are validated with `fq lint` provided by `fqlib`. -7. If RNA-seq, run `ngsderive strandedness` to determine a backwards-computed - strandedness of the RNA-seq experiment. + ```bash + cat $fastq_1 $fastq_2 > $combined_fastq + fastq_screen $combined_fastq + ``` - ```bash - ngsderive strandedness - ``` +9. 
If RNA-seq, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-seq experiment. -8. Compute the md5 checksum of the file. + ```bash + ngsderive strandedness + ``` - ```bash - md5sum $BAM - ``` +10. If RNA-seq data, run `qualimap rnaseq` to gather QC statistics that are tailored for RNA-seq files. -9. Combine all of the above metrics using `multiqc`. + ```bash + qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory + -bam $BAM \ # bam filename + -gtf $GTF_REF # transcript definition file + -pe # specify paired end if paired end + ``` - ```bash - multiqc . # recurse all files in '.' - ``` +11. Combine all of the above metrics using `multiqc`. -# Items Still In-Progress + ```bash + multiqc . # recurse all files in '.' + ``` + +## Items Still In-Progress - [ ] Analysis tools for other types of sequencing (ChIP seq) - [ ] Useful metadata from various stages (sample collection, laboratory, pre-sequencing, sequencing, post-sequencing) -# Outstanding Questions +## Outstanding Questions - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? - What is best way to define and handle outliers? -- What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? +- What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? 
-[samtools]: http://www.htslib.org/doc/samtools.html [fastqc]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ -[qualimap]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html +[multiqc]: https://multiqc.info/ [picard]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php -[multiqc]: https://multiqc.info/ \ No newline at end of file +[qualimap]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html +[samtools]: http://www.htslib.org/doc/samtools.html From 99476e6f5cb17d68c005d6d030794243d7c5e823 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Wed, 18 Nov 2020 16:19:55 -0500 Subject: [PATCH 46/68] white space changes --- text/0002-quality-check-workflow.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 00605ba..fca71f2 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -140,10 +140,10 @@ The workflow specification is as follows. Note that some arguments that are not ```bash picard ValidateSamFile \ - I=$BAM \ # specify bam file - MODE=SUMMARY\ # concise output - INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time - IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. + I=$BAM \ # specify bam file + MODE=SUMMARY\ # concise output + INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time + IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. ``` 3. Run `samtools flagstat` to gather general statistics such as alignment percentage. @@ -174,8 +174,8 @@ The workflow specification is as follows. Note that some arguments that are not ```bash qualimap bamqc -bam $BAM \ # bam filename - -nt $NUM_THREADS \ # threads requested - -nw 400 # number of windows + -nt $NUM_THREADS \ # threads requested + -nw 400 # number of windows ``` 8. If WGS or WES data, run `fastq_screen`. 
For performance, we subsample the input BAM using `samtools view -s $computed_fraction` before running it through `picard SamToFastq`. The resulting fastqs are validated with `fq lint` provided by `fqlib`. @@ -195,9 +195,9 @@ The workflow specification is as follows. Note that some arguments that are not ```bash qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory - -bam $BAM \ # bam filename - -gtf $GTF_REF # transcript definition file - -pe # specify paired end if paired end + -bam $BAM \ # bam filename + -gtf $GTF_REF # transcript definition file + -pe # specify paired end if paired end ``` 11. Combine all of the above metrics using `multiqc`. From ad88c8929e158763e7891648fbe2943d162611f8 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Sat, 21 Nov 2020 15:09:34 -0500 Subject: [PATCH 47/68] Apply suggestions from code review Co-authored-by: Andrew Thrasher --- text/0002-quality-check-workflow.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index fca71f2..4f4e34a 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -32,11 +32,11 @@ There are (at least) two different types of QC typically carried out omics-based By the time data reaches the St. Jude Cloud team from various sources, extensive _experimental_ and _computational_ evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. -Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produce for publication in St. Jude Cloud. 
While we do produce results that define _experimental_ results (such as `fastqc` ), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. +Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produced for publication in St. Jude Cloud. While we do produce results that define _experimental_ results (such as `fastqc` ), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. ### Tools and Metrics -Here, we outline each tool, what metrics are considered in an automated manner, and which metrics and require manual inspect. To keep from duplicating information and to ensure the RFC doesn't get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. +Here, we outline each tool, what metrics are considered in an automated manner, and which metrics require manual inspection. To keep from duplicating information and to ensure the RFC does not get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. #### samtools @@ -60,7 +60,7 @@ In the QC pipeline, we leverage all currently available subcommands to try to de | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | | Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). 
| -| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | +| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | #### picard @@ -74,10 +74,10 @@ In the QC pipeline, we leverage all currently available subcommands to try to de | Name | Produced By | Description | | ---------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consisitent with sample purity. | -| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is just to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. So, there is no changing it at this point. | +| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consistent with sample purity. | +| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. 
| | Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | -| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are being mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | +| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | | rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | | Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | | Junction Analysis | [Qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. 
| @@ -86,23 +86,23 @@ In the QC pipeline, we leverage all currently available subcommands to try to de #### Thresholds and Metrics for Specific Applications -To apply quality control metrics to vett data, we need reasonable thresholds that are practically acheivable and neither too lax or too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in postion to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. +To apply quality control metrics to vet data, we need reasonable thresholds that are practically achievable and neither too lax nor too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in a position to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. #### Metrics for WGS -The quality metrics of special concern for WGS include depth of coverage and genomic regional coverage. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. 
So metrics that are sensitive to impurity are valuable. +The quality metrics of special concern for WGS include depth of coverage and coverage distribution across genomic regions. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. #### Metrics for WES -The quality metrics of special concern for WES include depth of coverage in exomic regions. Mapping quality, % mapped and duplication rate are also important. +The quality metrics of special concern for WES include depth of coverage in exonic regions. Mapping quality, mapping percentage, and duplication rate are also important. #### Metrics for RNAseq -The quality metrics of special concern for RNAseq include mapping percentage, percentage properly paired reads, and exomic regional coverage. Mapping quality is also critical. +The quality metrics of special concern for RNA-seq include mapping percentage, percentage properly paired reads, and exonic region coverage. Mapping quality is also critical. ## Specification -These are generic instructions for running each of the tools in our pipeline. We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead we've supplied examples of the commands used to each package. Our default memory is 80G and we employ 4 threads for these processes. +These are generic instructions for running each of the tools in our pipeline. We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead, we have supplied examples of the commands used in each package. Our default memory is 80G and we employ 4 threads for these processes. ### Dependencies @@ -197,7 +197,7 @@ The workflow specification is as follows. 
Note that some arguments that are not qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory -bam $BAM \ # bam filename -gtf $GTF_REF # transcript definition file - -pe # specify paired end if paired end + [-pe] # specify paired end if paired end ``` 11. Combine all of the above metrics using `multiqc`. @@ -208,7 +208,7 @@ The workflow specification is as follows. Note that some arguments that are not ## Items Still In-Progress -- [ ] Analysis tools for other types of sequencing (ChIP seq) +- [ ] Analysis tools for other types of sequencing (ChIP-seq) - [ ] Useful metadata from various stages (sample collection, laboratory, pre-sequencing, sequencing, post-sequencing) ## Outstanding Questions From 885d20469d8adf2d5594624291281a96ed72d6be Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Sat, 21 Nov 2020 15:13:54 -0500 Subject: [PATCH 48/68] Normalize RNA-Seq, whole genome, and whole exome references --- text/0002-quality-check-workflow.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 4f4e34a..b66f7c0 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -18,7 +18,7 @@ You can find the relevant discussion on the [associated pull request](https://gi ## Motivation -St. Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole-genome, whole-exome, RNA-seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to +St. Jude Cloud is one of the largest repositories of omics data available for request to date. 
As such, the project processes thousands of samples from whole genome, whole exome, RNA-Seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to - define the standard set of QC tools used to evaluating omics-based data - identify and implement key metrics that can be automated to assist in manual observation of the data, and @@ -52,15 +52,15 @@ FastQC generates metrics about read quality scores, sequence duplication levels #### ngsderive -`ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. +`ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-Seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. -In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-seq). +In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-Seq). 
| Name | Experiments | Check | Description | | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | | Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -| Inferred strandedness | RNA-seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | +| Inferred strandedness | RNA-Seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | #### picard @@ -98,7 +98,7 @@ The quality metrics of special concern for WES include depth of coverage in exon #### Metrics for RNAseq -The quality metrics of special concern for RNA-seq include mapping percentage, percentage properly paired reads, and exonic region coverage. Mapping quality is also critical. +The quality metrics of special concern for RNA-Seq include mapping percentage, percentage properly paired reads, and exonic region coverage. Mapping quality is also critical. ## Specification @@ -185,13 +185,13 @@ The workflow specification is as follows. Note that some arguments that are not fastq_screen $combined_fastq ``` -9. If RNA-seq, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-seq experiment. +9. If RNA-Seq, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. ```bash ngsderive strandedness ``` -10. If RNA-seq data, run `qualimap rnaseq` to gather QC statistics that are tailored for RNA-seq files. +10. If RNA-Seq data, run `qualimap rnaseq` to gather QC statistics that are tailored for RNA-Seq files. 
```bash qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory From 9611f4c69a2593637d0d4494b1f87a7422cdcae5 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Sat, 21 Nov 2020 15:25:15 -0500 Subject: [PATCH 49/68] added typo-ci ignore file --- .typo-ci.yml | 20 ++++++++++++++++++++ text/0002-quality-check-workflow.md | 2 +- 2 files changed, 21 insertions(+), 1 deletion(-) create mode 100644 .typo-ci.yml diff --git a/.typo-ci.yml b/.typo-ci.yml new file mode 100644 index 0000000..8992888 --- /dev/null +++ b/.typo-ci.yml @@ -0,0 +1,20 @@ +dictionaries: + - en + - en_GB + +excluded_words: + - rnaseq + - ipynb + - bamqc + - qualimap + - GDC's + - gdc-mrnaseq-pipeline + - gdc-reference-genome + - ENCODE's + - encode-rnaseq-pipeline + - gtex-rnaseq-pipeline + - sjdbOverhang + - exome + - omics + - omics-based + - fastqc diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index b66f7c0..f0fcfba 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -40,7 +40,7 @@ Here, we outline each tool, what metrics are considered in an automated manner, #### samtools -Samtools is used both as a utility for file transformations, and its `flagstat` command is used to generate metricts for quality checking. Metrics include number of duplicate reads and properly paired reads. +Samtools is used both as a utility for file transformations, and its `flagstat` command is used to generate metrics for quality checking. Metrics include number of duplicate reads and properly paired reads. 
#### qualimap From fdf5be97c691019c7cdff110c159c5103260d2b4 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Sat, 21 Nov 2020 15:28:51 -0500 Subject: [PATCH 50/68] add to typo-ci ignore --- .typo-ci.yml | 13 +++++++++++++ text/0002-quality-check-workflow.md | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/.typo-ci.yml b/.typo-ci.yml index 8992888..4cc3f18 100644 --- a/.typo-ci.yml +++ b/.typo-ci.yml @@ -18,3 +18,16 @@ excluded_words: - omics - omics-based - fastqc + - ngsderive + - SamToFastq + - fastq + - fastq-screen + - unstranded + - conda + - bioconda + - conda-forge + - multiqc + - fastqs + - fqlib + - readlen + - gtf diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index f0fcfba..c895978 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -128,7 +128,7 @@ For linting created fastqs, `fqlib` must be installed. See installation instruct ### Workflow -The workflow specification is as follows. Note that some arguments that are not integral to the command (such as output directories) or arguments that can vary between compute environements (such as memory thresholds or number of threads) are not included. +The workflow specification is as follows. Note that some arguments that are not integral to the command (such as output directories) or arguments that can vary between compute environments (such as memory thresholds or number of threads) are not included. 1. Compute the md5 checksum of the file. 
From e5f1bb75e75681dcdb793c58d0296feeb449a723 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:50:39 -0400 Subject: [PATCH 51/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index c895978..cacfba6 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -185,7 +185,7 @@ The workflow specification is as follows. Note that some arguments that are not fastq_screen $combined_fastq ``` -9. If RNA-Seq, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. +9. If RNA-Seq data, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. ```bash ngsderive strandedness From 087d17dd65bf6e442249b00a0efe8e8ffafff4e1 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:50:47 -0400 Subject: [PATCH 52/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index cacfba6..77e787b 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -196,8 +196,8 @@ The workflow specification is as follows. Note that some arguments that are not ```bash qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory -bam $BAM \ # bam filename - -gtf $GTF_REF # transcript definition file - [-pe] # specify paired end if paired end + -gtf $GTF_REF \ # transcript definition file + [-pe] # specify paired end if paired end ``` 11. Combine all of the above metrics using `multiqc`. 
From 79a043d30e7f25537d3d246898bd7da19da5f266 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:50:55 -0400 Subject: [PATCH 53/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 77e787b..63f5027 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -182,7 +182,7 @@ The workflow specification is as follows. Note that some arguments that are not ```bash cat $fastq_1 $fastq_2 > $combined_fastq - fastq_screen $combined_fastq + fastq_screen $combined_fastq ``` 9. If RNA-Seq data, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. From 4b20e2eaf4501f8cf5ff260b94ba3365826f8b58 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:51:08 -0400 Subject: [PATCH 54/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 63f5027..898b4c7 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -106,7 +106,7 @@ These are generic instructions for running each of the tools in our pipeline. We ### Dependencies -We presume anaconda is available and installed. If not please follow the link to [anaconda](https://www.anaconda.com/) first. +We presume Anaconda is available and installed. If not, please follow the link to [Anaconda](https://www.anaconda.com/) first. 
```bash conda create --name bio-qc \ From 6365acd9391a127352d074a3427b471e35eabec1 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:51:15 -0400 Subject: [PATCH 55/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 898b4c7..84add88 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -124,7 +124,7 @@ conda create --name bio-qc \ conda activate bio-qc ``` -For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib) +For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib). ### Workflow From bcce81c4c17348c247bba650c5a4e31420abfc76 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:51:31 -0400 Subject: [PATCH 56/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 84add88..31620f4 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -128,7 +128,7 @@ For linting created fastqs, `fqlib` must be installed. See installation instruct ### Workflow -The workflow specification is as follows. Note that some arguments that are not integral to the command (such as output directories) or arguments that can vary between compute environments (such as memory thresholds or number of threads) are not included. +The workflow specification is as follows. 
Note that arguments that are not integral to the command (such as output directories) or vary between compute environments (such as memory thresholds or number of threads) are not included. 1. Compute the md5 checksum of the file. From efe0f7f6c752dd9b7f222b22a2e861a770e94608 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:51:38 -0400 Subject: [PATCH 57/68] Update text/0002-quality-check-workflow.md Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 31620f4..b826e12 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -215,7 +215,7 @@ The workflow specification is as follows. Note that arguments that are not integ - What thresholds or metrics differentiate a poor-quality sample from a high-quality one? - What other metrics or properties would be valuable? -- What is best way to define and handle outliers? +- What is the best way to define and handle outliers? - What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? 
[fastqc]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ From c0c372340fd641b58db36c3be613f47260a45b95 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 14:55:28 -0400 Subject: [PATCH 58/68] revert: change 'RNA-seq' back to 'RNA-Seq' --- .../GenomeComparison.ipynb | 8 ++--- text/0001-rnaseq-workflow-v2.0.md | 34 +++++++++---------- 2 files changed, 21 insertions(+), 21 deletions(-) diff --git a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb index 8fff488..a9d9543 100644 --- a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb +++ b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb @@ -139,7 +139,7 @@ "\n", "The ENCODE project stores a reference to all of it's currently used reference files [here](https://www.encodeproject.org/data-standards/reference-sequences/). From that page, you can see that the base reference genome can be downloaded [here](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz). However, we'd like to do a complete analysis including of all of the sequences they use for decoys/viruse/etc. if we want to use them later. Here's the steps I took to find the complete set of reference FASTAs they used in their STAR index:\n", "\n", - "1. Searching for their RNA-seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-seq)).\n", + "1. Searching for their RNA-Seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-Seq)).\n", "2. The pipeline we are looking for is [this one](https://www.encodeproject.org/pipelines/ENCPL002LPE/).\n", "3. 
At the bottom of the page, you will see a PDF that is a comprehensive overview of their pipelines and contains a list to all of the current ENCODE reference accessions ([link](https://www.encodeproject.org/documents/6354169f-86f6-4b59-8322-141005ea44eb/@@download/attachment/Long%20RNA-seq%20pipeline%20overview.pdf)).\n", "4. In that document, you find that the link to their most currently built `STAR` genome is [here](https://www.encodeproject.org/references/ENCSR314WMD/). \n", @@ -287,11 +287,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## TOPMed + GTEx RNA-seq pipeline\n", + "## TOPMed + GTEx RNA-Seq pipeline\n", "\n", - "The GTEx consortium and TOPMed program both use the [GTEx RNA-seq pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) developed by the Broad Institute. This workflow processes a high number of samples and has high reputation, so it's worth taking a look at.\n", + "The GTEx consortium and TOPMed program both use the [GTEx RNA-Seq pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) developed by the Broad Institute. This workflow processes a high number of samples and has high reputation, so it's worth taking a look at.\n", "\n", - "Following the \"reference genome and annotation\" [section](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation) of their `README.md`, you are directed to the [TOPMed RNA-seq pipeline harmonization](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) page. Reading the \"Reference files\" section of that documentation essentially lays out that they use the Broad Insitute's version of `GRCh38` and add the `ERCC SpikeIn` sequences. 
They provide both [a link to the Broad's original FASTA](https://software.broadinstitute.org/gatk/download/bundle) and [a link to their built FASTA](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) (although, given it points to a personal page, I'm not sure how long this link will be valid. For now, we will use the personal link." + "Following the \"reference genome and annotation\" [section](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation) of their `README.md`, you are directed to the [TOPMed RNA-Seq pipeline harmonization](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) page. Reading the \"Reference files\" section of that documentation essentially lays out that they use the Broad Insitute's version of `GRCh38` and add the `ERCC SpikeIn` sequences. They provide both [a link to the Broad's original FASTA](https://software.broadinstitute.org/gatk/download/bundle) and [a link to their built FASTA](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) (although, given it points to a personal page, I'm not sure how long this link will be valid. For now, we will use the personal link." ] }, { diff --git a/text/0001-rnaseq-workflow-v2.0.md b/text/0001-rnaseq-workflow-v2.0.md index c2fed40..ba2f2f9 100644 --- a/text/0001-rnaseq-workflow-v2.0.md +++ b/text/0001-rnaseq-workflow-v2.0.md @@ -8,30 +8,30 @@ # Introduction -This RFC lays out the specification for the RNA-seq mapping pipeline v2.0. The improvements contained within are largely based on (a) new version of tools/reference files and (b) feedback from the community. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/1). +This RFC lays out the specification for the RNA-Seq mapping pipeline v2.0. 
The improvements contained within are largely based on (a) new version of tools/reference files and (b) feedback from the community. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/1). # Motivation - **Tool additions and updates.** The tools we use are woefully out of date (2 years old). We should reap the benefits of new tools if possible. Additionally, there is some new functionality in the area of QC and validation that I'd like to add. See the [section below](#Tool-additions-updates) for more details. - - Note that all of the tools used in the RNA-seq Workflow v1.0 were the latest available version. -- **Updated reference files.** No changes have really been made to the `GRCh38_no_alt` analysis set FASTA. However, three major releases of the GENCODE gene model have transpired since we released the first revision of the RNA-seq workflow ([GENCODE v31](https://www.gencodegenes.org/human/release_31.html) is now out). + - Note that all of the tools used in the RNA-Seq Workflow v1.0 were the latest available version. +- **Updated reference files.** No changes have really been made to the `GRCh38_no_alt` analysis set FASTA. However, three major releases of the GENCODE gene model have transpired since we released the first revision of the RNA-Seq workflow ([GENCODE v31](https://www.gencodegenes.org/human/release_31.html) is now out). - **QC and quality of life improvements based on feedback from the community.** Many interactions with the community have impacted the thoughts in this release: - A primary driver for the rewrite of the pipeline is the feedback we heard about the `ERCC SpikeIn` sequences. - Popular tools such as `GATK` and `picard` are generally unhappy if the sequence dictionaries don't match perfectly. 
- - The inclusion of the External RNA Controls Consortium (ERCC) Spike-in Control RNA sequences in the alignment reference file we used for RNA-seq mapping was hence causing issues when using mapped RNA-seq BAM files in conjunction with other non-RNA-seq BAM files in downstream analysis using these tools. - - Last, many of our RNA-seq samples were not generated using 'ERCC' spike-in control sequences. + - The inclusion of the External RNA Controls Consortium (ERCC) Spike-in Control RNA sequences in the alignment reference file we used for RNA-Seq mapping was hence causing issues when using mapped RNA-Seq BAM files in conjunction with other non-RNA-Seq BAM files in downstream analysis using these tools. + - Last, many of our RNA-Seq samples were not generated using 'ERCC' spike-in control sequences. - After some discussion internally, we decided the best thing to do was to remove the ERCC genome by default. We are considering providing an ERCC version of the BAM for samples containing these sequences, but there is no consensus on whether it's worth it yet. - - One of the most important themes in the RNA-seq Workflow v2.0 proposal is the emphasis on QC and quality of life improvements (e.g. `fq lint`, generation and publication of md5sums). + - One of the most important themes in the RNA-Seq Workflow v2.0 proposal is the emphasis on QC and quality of life improvements (e.g. `fq lint`, generation and publication of md5sums). # Discussion ## Tool additions and upgrades -As part of the RNA-seq workflow v2, multiple tools will be added and upgraded: +As part of the RNA-Seq workflow v2, multiple tools will be added and upgraded: - `fq v0.2.0` ([Released](https://github.com/stjude/fqlib/releases/tag/v0.2.0) November 28, 2018) will be added. This tool will be used to validate the output of `picard SamToFastq`. `picard SamToFastq` does not currently catch all of the errors we wish to catch at this stage (such as duplicate read names in the FastQ file). 
Thus, we will leverage this tool to independently validate that the data is well-formed by our definition of that phrase. - `rseqc v3.0.0` ([Source](http://rseqc.sourceforge.net/#download-rseqc)) will be added. We have started using `infer_experiment.py` to infer strandedness from the data and ensure that the data matches what information we get from the lab. -- Added `qualimap v.2.2.2` ([Source](https://bitbucket.org/kokonech/qualimap/)). Although we have been using `qualimap` quite heavily in our QC pipeline, we are formally adding this to the end of the RNA-seq alignment workflow. The `bamqc` and `rnaseq` subcommands are both used. +- Added `qualimap v.2.2.2` ([Source](https://bitbucket.org/kokonech/qualimap/)). Although we have been using `qualimap` quite heavily in our QC pipeline, we are formally adding this to the end of the RNA-Seq alignment workflow. The `bamqc` and `rnaseq` subcommands are both used. - Update `STAR 2.5.3a` ([Released](https://github.com/alexdobin/STAR/releases/tag/2.5.3a) March 17, 2017) to `STAR 2.7.1a` ([Released](https://github.com/alexdobin/STAR/releases/tag/2.7.1a) May 15, 2019). Upgraded to receive the benefits of bug fixes and software optimizations. - Update `samtools 1.4.0` ([Released](https://github.com/samtools/samtools/releases/tag/1.4) March 13, 2017) to `samtools 1.9` ([Released](https://github.com/samtools/samtools/releases/tag/1.9) July 18, 2018). Updating the samtools version whenever possible is of particular interest to me due to the historical fragility of the samtools code (although it has seemed to get better over the last year or so). - Update `picard 2.9.4` ([Released](https://github.com/broadinstitute/picard/releases/tag/2.9.4) June 15, 2017) to `picard 2.20.2` ([Released](https://github.com/broadinstitute/picard/releases/tag/2.20.2) May 28, 2019). Upgraded to receive the benefits of bug fixes and software optimizations. 
@@ -50,12 +50,12 @@ First, we researched what some of the projects we respect in the community are d | Pipeline | Reference Genome | Reference Genome Patch | Gene Model | Gene Model Patch | | ------------------------------------------------------------------------ | -------------------------------------------------------------------- | ---------------------- | -------------------------- | ---------------- | | GDC's [mRNA-seq pipeline][gdc-mrnaseq-pipeline] | [`GRCh38_no_alt`-based w/ decoys + viral][gdc-reference-genome] | `GRCh38.p0` | [GENCODE v22][gencode-v22] | `GRCh38.p2` | -| ENCODE's [RNA-seq pipeline][encode-rnaseq-pipeline] | [`GRCh38_no_alt`-based w/ SpikeIns][encode-reference-genome] | `GRCh38.p0` | [GENCODE v24][gencode-v24] | `GRCh38.p5` | -| Broad Institute's [GTEx + TOPMed RNA-seq pipeline][gtex-rnaseq-pipeline] | [Broad's `GRCh38` w/ ERCC SpikeIn][broad-institute-reference-genome] | `GRCh38.p0` | [GENCODE v26][gencode-v26] | `GRCh38.p10` | +| ENCODE's [RNA-Seq pipeline][encode-rnaseq-pipeline] | [`GRCh38_no_alt`-based w/ SpikeIns][encode-reference-genome] | `GRCh38.p0` | [GENCODE v24][gencode-v24] | `GRCh38.p5` | +| Broad Institute's [GTEx + TOPMed RNA-Seq pipeline][gtex-rnaseq-pipeline] | [Broad's `GRCh38` w/ ERCC SpikeIn][broad-institute-reference-genome] | `GRCh38.p0` | [GENCODE v26][gencode-v26] | `GRCh38.p10` | [gdc-mrnaseq-pipeline]: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/ [gdc-reference-genome]: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files -[encode-rnaseq-pipeline]: https://www.encodeproject.org/pipelines/ENCPL002LPE/https://www.encodeproject.org/pages/pipelines/#RNA-seq +[encode-rnaseq-pipeline]: https://www.encodeproject.org/pipelines/ENCPL002LPE/https://www.encodeproject.org/pages/pipelines/#RNA-Seq [encode-reference-genome]: https://www.encodeproject.org/files/ENCFF742NER/ [gtex-rnaseq-pipeline]: 
https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation [broad-institute-reference-genome]: https://software.broadinstitute.org/gatk/download/bundle @@ -86,7 +86,7 @@ Given that there was no perfect option, we decided to stick with option #3. Originally, I had posed this question to the group: -> - Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model. This was due to best practices outlined during our RNA-seq Workflow v1.0 discussions. I propose we revert this for the following reasons: +> - Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model. This was due to best practices outlined during our RNA-Seq Workflow v1.0 discussions. I propose we revert this for the following reasons: > - The first sentence in section 2.2.2 of the [STAR 2.7.1.a manual](https://github.com/alexdobin/STAR/blob/2.7.1a/doc/STARmanual.pdf): "The use of the most comprehensive annotations for a given species is strongly recommended". So it seems the author recommends you use the most comprehensive gene model. > - Here is what [the GENCODE FAQ](https://www.gencodegenes.org/pages/faq.html) has to say about the level 3 annotations: "Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found". Given that the GENCODE geneset is the union of automated annotations from the `Ensembl-genebuild` and manual curation of the `Ensembl-Havana` team, this level should be harmless in the event that levels 1 & 2 don't apply. > - Last, the various other pipelines in the community don't tend to remove these features: @@ -142,7 +142,7 @@ cargo install --git https://github.com/stjude/fqlib.git --tag v0.3.1 ## Reference files -The following reference files are used as the basis of the RNA-seq Workflow v2.0: +The following reference files are used as the basis of the RNA-Seq Workflow v2.0: - Similarly to all analysis pipelines in St. 
Jude Cloud, we use the `GRCh38_no_alt` analysis set for our reference genome. You can get a copy of the file [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). Additionally, you can get the file by running the following commands: @@ -177,12 +177,12 @@ The following reference files are used as the basis of the RNA-seq Workflow v2.0 --runThreadN $NCPU \ # Number of threads to use to build genome database. --genomeFastaFiles $FASTA \ # A path to the GRCh38_no_alt.fa FASTA file. --sjdbGTFfile $GENCODE_GTF_V31 \ # GENCODE v31 gene model file. - --sjdbOverhang 125 # Splice junction database overhang parameter, the optimal value is (Max length of RNA-seq read-1). + --sjdbOverhang 125 # Splice junction database overhang parameter, the optimal value is (Max length of RNA-Seq read-1). ``` ## Workflow -Here are the resulting steps in the RNA-seq Workflow v2.0 pipeline. +Here are the resulting steps in the RNA-Seq Workflow v2.0 pipeline. 1. Run `samtools quickcheck` on the incoming BAM to ensure that it is well-formed enough to convert back to FastQ. 2. Split BAM file into multiple BAMs on the different read groups using `samtools split`. See [the samtools documentation](http://www.htslib.org/doc/samtools.html) for more information. @@ -283,7 +283,7 @@ Here are the resulting steps in the RNA-seq Workflow v2.0 pipeline. -outdir $OUTPUT_DIR \ # Output directory. -oc qualimap_counts.txt \ # Counts as calculated by qualimap. -p $COMPUTED \ # Strandedness as specified by the lab and confirmed by "infer_experiment.py" above. Typically "strand-specific-reverse" for St. Jude Cloud data. - -pe # All RNA-seq data in St. Jude Cloud is currently paired-end. + -pe # All RNA-Seq data in St. Jude Cloud is currently paired-end. ``` 11. 
Next, `htseq-count` is run for the final counts file to be delivered: @@ -297,7 +297,7 @@ Here are the resulting steps in the RNA-seq Workflow v2.0 pipeline. --supplementary-alignments ignore \ # Elect to ignore supplementary alignments. Needs input from reviewers. $INPUT_BAM # Input BAM file. ``` -12. Generate the remaining files generally desired as output for the RNA-seq Workflow. +12. Generate the remaining files generally desired as output for the RNA-Seq Workflow. ```bash samtools flagstat $INPUT_BAM samtools index $INPUT_BAM From 25db2d15477c2c11885ad733b17e8b964bd8e2aa Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 15:00:46 -0400 Subject: [PATCH 59/68] revert: fix from last commit --- resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb | 2 +- text/0001-rnaseq-workflow-v2.0.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb index a9d9543..e42bfb6 100644 --- a/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb +++ b/resources/0001-rnaseq-workflow-v2.0/GenomeComparison.ipynb @@ -139,7 +139,7 @@ "\n", "The ENCODE project stores a reference to all of it's currently used reference files [here](https://www.encodeproject.org/data-standards/reference-sequences/). From that page, you can see that the base reference genome can be downloaded [here](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz). However, we'd like to do a complete analysis including of all of the sequences they use for decoys/viruse/etc. if we want to use them later. Here's the steps I took to find the complete set of reference FASTAs they used in their STAR index:\n", "\n", - "1. 
Searching for their RNA-Seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-Seq)).\n", + "1. Searching for their RNA-Seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-seq)).\n", "2. The pipeline we are looking for is [this one](https://www.encodeproject.org/pipelines/ENCPL002LPE/).\n", "3. At the bottom of the page, you will see a PDF that is a comprehensive overview of their pipelines and contains a list to all of the current ENCODE reference accessions ([link](https://www.encodeproject.org/documents/6354169f-86f6-4b59-8322-141005ea44eb/@@download/attachment/Long%20RNA-seq%20pipeline%20overview.pdf)).\n", "4. In that document, you find that the link to their most currently built `STAR` genome is [here](https://www.encodeproject.org/references/ENCSR314WMD/). \n", diff --git a/text/0001-rnaseq-workflow-v2.0.md b/text/0001-rnaseq-workflow-v2.0.md index ba2f2f9..425df26 100644 --- a/text/0001-rnaseq-workflow-v2.0.md +++ b/text/0001-rnaseq-workflow-v2.0.md @@ -49,13 +49,13 @@ First, we researched what some of the projects we respect in the community are d | Pipeline | Reference Genome | Reference Genome Patch | Gene Model | Gene Model Patch | | ------------------------------------------------------------------------ | -------------------------------------------------------------------- | ---------------------- | -------------------------- | ---------------- | -| GDC's [mRNA-seq pipeline][gdc-mrnaseq-pipeline] | [`GRCh38_no_alt`-based w/ decoys + viral][gdc-reference-genome] | `GRCh38.p0` | [GENCODE v22][gencode-v22] | `GRCh38.p2` | +| GDC's [mRNA-Seq pipeline][gdc-mrnaseq-pipeline] | [`GRCh38_no_alt`-based w/ decoys + viral][gdc-reference-genome] | `GRCh38.p0` | [GENCODE v22][gencode-v22] | `GRCh38.p2` | | ENCODE's [RNA-Seq pipeline][encode-rnaseq-pipeline] | [`GRCh38_no_alt`-based w/ SpikeIns][encode-reference-genome] | `GRCh38.p0` | [GENCODE 
v24][gencode-v24] | `GRCh38.p5` | | Broad Institute's [GTEx + TOPMed RNA-Seq pipeline][gtex-rnaseq-pipeline] | [Broad's `GRCh38` w/ ERCC SpikeIn][broad-institute-reference-genome] | `GRCh38.p0` | [GENCODE v26][gencode-v26] | `GRCh38.p10` | [gdc-mrnaseq-pipeline]: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/ [gdc-reference-genome]: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files -[encode-rnaseq-pipeline]: https://www.encodeproject.org/pipelines/ENCPL002LPE/https://www.encodeproject.org/pages/pipelines/#RNA-Seq +[encode-rnaseq-pipeline]: https://www.encodeproject.org/pipelines/ENCPL002LPE/https://www.encodeproject.org/pages/pipelines/#RNA-seq [encode-reference-genome]: https://www.encodeproject.org/files/ENCFF742NER/ [gtex-rnaseq-pipeline]: https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation [broad-institute-reference-genome]: https://software.broadinstitute.org/gatk/download/bundle From 2387f8c30fc7205a152cbec02d9e780e9c638fd1 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 15:14:21 -0400 Subject: [PATCH 60/68] Apply suggestions from code review Co-authored-by: Michael Macias --- text/0002-quality-check-workflow.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index b826e12..4a7a1d9 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -11,7 +11,7 @@ This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud genomics data. The goal of this RFC is two-fold. -1. Establish state of the art method for comprehensively evaluating genomics data quality at scale: both at time of receipt and after processing. +1. 
Establish state of the art method for comprehensively evaluating genomics data quality at scale, both at time of receipt and after processing. 2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). @@ -20,9 +20,9 @@ You can find the relevant discussion on the [associated pull request](https://gi St. Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole genome, whole exome, RNA-Seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to -- define the standard set of QC tools used to evaluating omics-based data +- define the standard set of QC tools used to evaluating omics-based data, - identify and implement key metrics that can be automated to assist in manual observation of the data, and -- publish these results alongside the data already published in St. Jude Cloud so that end-users can similar view results. +- publish these results alongside the data already published in St. Jude Cloud so that end-users can view and use the results. ## Discussion @@ -30,9 +30,9 @@ St. Jude Cloud is one of the largest repositories of omics data available for re There are (at least) two different types of QC typically carried out omics-based data. **Experiment** QC attempts to identify the success of the assay(s) performed. 
Once one is sufficiently satisfied that the data generated by an experiment is "good" (by some definition of that word), **computational** QC examines the degree to which computational processing of that data was completed successfully. -By the time data reaches the St. Jude Cloud team from various sources, extensive _experimental_ and _computational_ evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. +By the time data reaches the St. Jude Cloud team from various sources, extensive _experimental_ and _computational_ evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas, which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. -Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produced for publication in St. Jude Cloud. While we do produce results that define _experimental_ results (such as `fastqc` ), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aide in decision-making about downstream analysis approaches. +Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produced for publication in St. Jude Cloud. While we do produce results that define _experimental_ results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. 
We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches. ### Tools and Metrics @@ -52,7 +52,7 @@ FastQC generates metrics about read quality scores, sequence duplication levels #### ngsderive -`ngsderive` is an in-house tool developed to backwards derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-Seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. +`ngsderive` is an in-house tool developed to backwards-derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-Seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-Seq). @@ -64,11 +64,11 @@ In the QC pipeline, we leverage all currently available subcommands to try to de #### picard -`picard` is for several operations including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq` . +`picard` is used for several operations, including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq` . #### fastq-screen -`fastq_screen` is used estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). +`fastq_screen` is used to estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). 
### Automated metrics comparison @@ -79,14 +79,14 @@ In the QC pipeline, we leverage all currently available subcommands to try to de | Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | | Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | | rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | -| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with polyA selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | +| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with poly(A) selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | | Junction Analysis | [Qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | | Strand Specificity | ngsderive | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | | GC Content Bias | ? | GC profiles are typically remarkably stable. 
Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | #### Thresholds and Metrics for Specific Applications -To apply quality control metrics to vet data, we need reasonable thresholds that are practically achievable and neither too lax nor too strict. Our preference is for statistically or empirically determined thresholds rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in a position to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. +To apply quality control metrics to vet data, we need reasonable thresholds that are practically achievable and neither too lax nor too strict. Our preference is for statistically or empirically determined thresholds, rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in a position to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. 
#### Metrics for WGS From 041c203a932142eb7d982099ab06f3f9fb1a0411 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Tue, 6 Jul 2021 17:19:17 -0400 Subject: [PATCH 61/68] docs: update to latest iteration of QC workflow --- text/0002-quality-check-workflow.md | 94 +++++++++++++++++------------ 1 file changed, 55 insertions(+), 39 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 4a7a1d9..5dacbde 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -52,19 +52,21 @@ FastQC generates metrics about read quality scores, sequence duplication levels #### ngsderive -`ngsderive` is an in-house tool developed to backwards-derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), and RNA-Seq strandedness. Please see [the repository](https://github.com/claymcleod/ngsderive/) for more information. +`ngsderive` is an in-house tool developed to backwards-derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), PHRED score encoding, and RNA-Seq strandedness. `ngsderive` is also used to annotate splice junctions in RNA-Seq data. Please see [the repository](https://github.com/stjudecloud/ngsderive/) for more information. -In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, and strandedness (if RNA-Seq). +In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, encoding, strandedness (if RNA-Seq), and to annotate junctions (if RNA-Seq). 
| Name | Experiments | Check | Description | | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | | Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -| Inferred strandedness | RNA-Seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | +| Inferred encoding | All | Manual | Ensure that the PHRED score ASCII encoding is "PHRED+33", which is synonomous with "Sanger/Illumina 1.8+ encoding". | +| Inferred strandedness | RNA-Seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | +| Junction Annotation | RNA-Seq | Manual | Ensure there is a sensible portion of novel, partial-novel, and annotated junctions. | #### picard -`picard` is used for several operations, including validating BAM files with `ValidateSam` and converting SAM to FastQ files with `SamToFastq` . +`picard` is used for several operations, including validating BAM files with `ValidateSamFile` and converting SAM to FastQ files with `SamToFastq`. 
#### fastq-screen @@ -72,17 +74,17 @@ In the QC pipeline, we leverage all currently available subcommands to try to de ### Automated metrics comparison -| Name | Produced By | Description | -| ---------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consistent with sample purity. | -| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. | -| Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | -| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | -| rRNA Content | ? 
| Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | -| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with poly(A) selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | -| Junction Analysis | [Qualimap] | Analysis of known, partly known, and novel junction positions in spliced alignments. | -| Strand Specificity | ngsderive | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | -| GC Content Bias | ? | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | +| Name | Produced By | Description | +| ---------------------------------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consistent with sample purity. | +| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. 
| +| Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | +| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | +| rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. | +| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with poly(A) selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | +| Junction Analysis | [ngsderive] | Analysis of known, partly known, and novel junction positions in spliced alignments. | +| Strand Specificity | [ngsderive] | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | +| GC Content Bias | [Qualimap] and [FastQC] | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | #### Thresholds and Metrics for Specific Applications @@ -102,7 +104,7 @@ The quality metrics of special concern for RNA-Seq include mapping percentage, p ## Specification -These are generic instructions for running each of the tools in our pipeline. 
We run our pipeline in a series of QC scripts that are tailored for our compute cluster, so those commands may not apply elsewhere. Instead, we have supplied examples of the commands used in each package. Our default memory is 80G and we employ 4 threads for these processes. +These are generic instructions for running each of the tools in our pipeline. We run our pipeline as a [WDL workflow](https://github.com/stjudecloud/workflows/blob/master/workflows/qc/quality-check-standard.wdl). We have supplied examples of the commands used for each package. For the typical memory requirements of each command, please see our [WDL repository](https://github.com/stjudecloud/workflows). ### Dependencies @@ -117,8 +119,8 @@ conda create --name bio-qc \ qualimap==2.2.2c \ samtools==1.9 \ fastq-screen==0.13.0 \ - ngsderive==1.1.0 \ - multiqc==1.9 \ + ngsderive==2.2.0 \ + multiqc==1.10.1 \ -y conda activate bio-qc @@ -140,17 +142,15 @@ The workflow specification is as follows. Note that arguments that are not integ ```bash picard ValidateSamFile \ - I=$BAM \ # specify bam file - MODE=SUMMARY\ # concise output - INDEX_VALIDATION_STRINGENCY=LESS_EXHAUSTIVE \ # lower stringency faster processing time - IGNORE=INVALID_PLATFORM_VALUE # Validations to ignore. + I=$BAM \ # specify bam file + MODE=SUMMARY # concise output ``` 3. Run `samtools flagstat` to gather general statistics such as alignment percentage. - ```bash - samtools flagstat $BAM - ``` + ```bash + samtools flagstat $BAM + ``` 4. Run `fastqc` to collect sequencing and library-related statistics. These are only for informational purposes -- as stated above, we typically do not remove samples based on this information (with rare exception), as the sequencing-related QC work was done upstream in the genomics lab. @@ -170,37 +170,52 @@ The workflow specification is as follows. Note that arguments that are not integ ngsderive readlen $BAM ``` -7. 
Run `qualimap bamqc` to gather more in-depth statistics about read stats, coverage, mapping quality, mismatches, etc. +7. Run `ngsderive encoding` to infer PHRED score encoding. + + ```bash + ngsderive encoding \ + -n -1 \ # parse the entire file + $BAM + ``` + +8. Run `qualimap bamqc` to gather more in-depth statistics about read stats, coverage, mapping quality, mismatches, etc. ```bash - qualimap bamqc -bam $BAM \ # bam filename - -nt $NUM_THREADS \ # threads requested - -nw 400 # number of windows + qualimap bamqc -bam $BAM \ # bam filename + -nt $NUM_THREADS \ # threads requested + -nw 400 # number of windows ``` -8. If WGS or WES data, run `fastq_screen`. For performance, we subsample the input BAM using `samtools view -s $computed_fraction` before running it through `picard SamToFastq`. The resulting fastqs are validated with `fq lint` provided by `fqlib`. +9. If WGS or WES data, run `fastq_screen`. For performance, we subsample the input BAM using `samtools view -s $computed_fraction` before running it through `picard SamToFastq`. The resulting fastqs are validated with `fq lint` provided by `fqlib`. ```bash cat $fastq_1 $fastq_2 > $combined_fastq - fastq_screen $combined_fastq + fastq_screen \ + --aligner bowtie2 \ + $combined_fastq + ``` + +10. If RNA-Seq data, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. + + ```bash + ngsderive strandedness $BAM ``` -9. If RNA-Seq data, run `ngsderive strandedness` to determine a backwards-computed strandedness of the RNA-Seq experiment. +11. If RNA-Seq data, run `ngsderive junction-annotation` to calculate the number of known, novel, and partial-novel junctions. ```bash - ngsderive strandedness + ngsderive junction-annotation $BAM ``` -10. If RNA-Seq data, run `qualimap rnaseq` to gather QC statistics that are tailored for RNA-Seq files. +12. If RNA-Seq data, run `qualimap rnaseq` to gather QC statistics that are tailored for RNA-Seq files. 
```bash - qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory - -bam $BAM \ # bam filename - -gtf $GTF_REF \ # transcript definition file - [-pe] # specify paired end if paired end + qualimap rnaseq --java-mem-size=$MEM_SIZE \ # memory + -bam $BAM \ # bam filename + [-pe] # specify paired end if paired end ``` -11. Combine all of the above metrics using `multiqc`. +13. Combine all of the above metrics using `multiqc`. ```bash multiqc . # recurse all files in '.' @@ -223,3 +238,4 @@ The workflow specification is as follows. Note that arguments that are not integ [picard]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php [qualimap]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html [samtools]: http://www.htslib.org/doc/samtools.html +[ngsderive]: https://github.com/stjudecloud/ngsderive/ From 6b59d0978ff29c1e4444c70b3c46bbac9ab50329 Mon Sep 17 00:00:00 2001 From: Andrew Frantz Date: Wed, 7 Jul 2021 10:57:45 -0400 Subject: [PATCH 62/68] fix: typo and CI exclusion --- .typo-ci.yml | 1 + text/0002-quality-check-workflow.md | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/.typo-ci.yml b/.typo-ci.yml index 4cc3f18..96aa87d 100644 --- a/.typo-ci.yml +++ b/.typo-ci.yml @@ -31,3 +31,4 @@ excluded_words: - fqlib - readlen - gtf + - Illumina diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 5dacbde..acc3f76 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -60,7 +60,7 @@ In the QC pipeline, we leverage all currently available subcommands to try to de | --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | | Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). 
| | Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -| Inferred encoding | All | Manual | Ensure that the PHRED score ASCII encoding is "PHRED+33", which is synonomous with "Sanger/Illumina 1.8+ encoding". | +| Inferred encoding | All | Manual | Ensure that the PHRED score ASCII encoding is "PHRED+33", which is synonymous with "Sanger/Illumina 1.8+ encoding". | | Inferred strandedness | RNA-Seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | | Junction Annotation | RNA-Seq | Manual | Ensure there is a sensible portion of novel, partial-novel, and annotated junctions. | From 21f1594c28c5b4e919c0527d34fe66112dda17fc Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 27 Jun 2023 15:08:16 -0500 Subject: [PATCH 63/68] revise: revises content for the QC RFC --- bin/local-build.sh | 15 ++ text/0002-quality-check-workflow.md | 219 ++++++++++++++++++++-------- 2 files changed, 171 insertions(+), 63 deletions(-) create mode 100755 bin/local-build.sh diff --git a/bin/local-build.sh b/bin/local-build.sh new file mode 100755 index 0000000..1ab6d46 --- /dev/null +++ b/bin/local-build.sh @@ -0,0 +1,15 @@ +#!/usr/bin/env bash + +[[ -d book/ ]] && rm -rf book/ +[[ -d src/ ]] && rm -rf src/ +mkdir -p src/ + +printf "# Summary\n\n[Introduction](introduction.md)\n\n" >src/SUMMARY.md +cp README.md src/introduction.md + +for RFC_FILE in $(ls text/* | sort); do + echo "- [$(basename ${RFC_FILE} ".md")]($(basename ${RFC_FILE}))" >>src/SUMMARY.md + cp ${RFC_FILE} src +done + +mdbook build diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index acc3f76..61627d2 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -18,73 +18,164 @@ You can find the relevant discussion on the [associated pull request](https://gi ## Motivation -St. 
Jude Cloud is one of the largest repositories of omics data available for request to date. As such, the project processes thousands of samples from whole genome, whole exome, RNA-Seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will attempt to - -- define the standard set of QC tools used to evaluating omics-based data, +St. Jude Cloud is a large repository of omics data available for request from +the academic and non-profit community. As such, the project processes thousands +of samples from whole genome, whole exome, RNA-Seq, and various omics-based +assays each year. A standard, robust method to assess pre-processing and +post-processing quality for samples has been developed in-house, but there are +some shortcomings with our current approach. In particular, this RFC will: + +- Define the standard set of QC tools used to evaluate omics-based data, - identify and implement key metrics that can be automated to assist in manual observation of the data, and -- publish these results alongside the data already published in St. Jude Cloud so that end-users can view and use the results. +- outline mechanisms to publish those QC results alongside the data already in + St. Jude Cloud so that end-users can leverage this information. ## Discussion ### Types of QC -There are (at least) two different types of QC typically carried out omics-based data. **Experiment** QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good" (by some definition of that word), **computational** QC examines the degree to which computational processing of that data was completed successfully. - -By the time data reaches the St.
Jude Cloud team from various sources, extensive _experimental_ and _computational_ evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas, which is dependent on the best practices at that point in time and the goals of the project. Most often, we take the computational data, revert it back to its raw form (such as FastQ files), and reprocess it using our harmonization pipeline. - -Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the _computational_ QC of the files produced for publication in St. Jude Cloud. While we do produce results that define _experimental_ results (such as `fastqc`), these are rarely used to decide which files pass or fail our QC. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches. - -### Tools and Metrics - -Here, we outline each tool, what metrics are considered in an automated manner, and which metrics require manual inspection. To keep from duplicating information and to ensure the RFC does not get out of sync, versions for each tool can be found in the [dependencies](#dependencies) section. - -#### samtools - -Samtools is used both as a utility for file transformations, and its `flagstat` command is used to generate metrics for quality checking. Metrics include number of duplicate reads and properly paired reads. - -#### qualimap - -Qualimap is used to find coverage across the genome, insert sizes, and QC content. - -#### fastqc - -FastQC generates metrics about read quality scores, sequence duplication levels and length distributions, among others. - -#### ngsderive - -`ngsderive` is an in-house tool developed to backwards-derive useful information from omics data. In this RFC, `ngsderive` is used to guess which instrument was used to sequence the data, the original read length (pre-read trimming), PHRED score encoding, and RNA-Seq strandedness. 
`ngsderive` is also used to annotate splice junctions in RNA-Seq data. Please see [the repository](https://github.com/stjudecloud/ngsderive/) for more information. - -In the QC pipeline, we leverage all currently available subcommands to try to determine read length, instrument, encoding, strandedness (if RNA-Seq), and to annotate junctions (if RNA-Seq). - -| Name | Experiments | Check | Description | -| --------------------- | ----------- | ------ | -------------------------------------------------------------------------------------------------------------------- | -| Inferred instrument | All | Manual | Ensure that the inferred instrument and confidence matches the reported instrument by the lab (if available). | -| Inferred read length | All | Manual | Ensure that the inferred read length (pre read trimming) matches the reported read length by the lab (if available). | -| Inferred encoding | All | Manual | Ensure that the PHRED score ASCII encoding is "PHRED+33", which is synonymous with "Sanger/Illumina 1.8+ encoding". | -| Inferred strandedness | RNA-Seq | Manual | Ensure that the inferred strandedness matches the reported strandedness by the lab (if available). | -| Junction Annotation | RNA-Seq | Manual | Ensure there is a sensible portion of novel, partial-novel, and annotated junctions. | - -#### picard - -`picard` is used for several operations, including validating BAM files with `ValidateSamFile` and converting SAM to FastQ files with `SamToFastq`. - -#### fastq-screen - -`fastq_screen` is used to estimate the percentage of material derived from different sources (human, mouse, PhiX, etc). 
- -### Automated metrics comparison - -| Name | Produced By | Description | -| ---------------------------------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| % Aligned | [Samtools] | Also known as mapping percentage, this indicator of quality, when high, verifies the mapping process/genome was correct and is consistent with sample purity. | -| Per Base Sequence Quality | [FastQC] | The "Per Base Sequence Quality" module from FastQC shows the distribution of quality scores across all bases at each position in the reads. In our case, this is to inform our end users -- the quality of the sequencing run has already been assessed by the lab upstream. | -| Overrepresented Sequences | [FastQC] | The "Overrepresented Sequences" module from FastQC displays sequences (at least 20bp) that occur in more than 0.1% of the total number of sequences and will help identify contamination (vector, adapter sequences, etc.). | -| Reads Genomic Origin | [Qualimap] | The "Reads Genomic Origin" from Qualimap determines how many alignments fall into exonic, intronic, and intergenic regions. Even if there is a high genomic mapping rate, it is necessary to check where the reads are mapped. It should be verified that the mapping to intronic regions and exons are within acceptable ranges. Abnormal results could indicate issues such as DNA contamination. | -| rRNA Content | ? | Verify that excess ribosomal content is filtered/normalized across samples to ensure that alignment rates and subsequent normalization of data is not skewed. 
| -| Transcript Coverage and 5'-3' Bias | [Qualimap] | Libraries prepared with poly(A) selection may have higher biased expression in 3' region. If reads primarily accumulate at the 3' end of transcripts (in poly(A)-selected samples), this might indicate the starting RNA was of low quality. | -| Junction Analysis | [ngsderive] | Analysis of known, partly known, and novel junction positions in spliced alignments. | -| Strand Specificity | [ngsderive] | Verification/sanity check of how reads were stranded for the RNA sequencing (stranded or unstranded protocol). | -| GC Content Bias | [Qualimap] and [FastQC] | GC profiles are typically remarkably stable. Even small/minor deviations could indicate a problem with the library used (or bacterial contamination). | +There are (at least) two different types of QC typically carried out on omics-based data. **Experimental** QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good", **computational** QC examines the degree to which computational processing of that data was completed successfully. + +By the time data reaches the St. Jude Cloud team from various sources, extensive +_experimental_ and _computational_ evaluation have already been carried out. +Each contributing project has its own thresholds for quality in both areas, +which is dependent on the best practices at that point in time and the goals of +the project. Most often, our team takes the computational data, reverts it back +to its raw form (such as FastQ files), and reprocesses the data using a +harmonization pipeline. + +Thus, the scope of this RFC, and the QC of samples on the project in general, is +limited to the _computational_ QC of the files produced for publication in St.
While we do produce results that define _experimental_ results (such +as `fastqc`), these are rarely used to decide which files pass or fail our +QC—this is in recognition of the fact that the data, while not always perfect, +is extremely valuable due to its relative scarcity. We hope that the inclusion +of these results will save end-users time and aid in decision-making about +downstream analysis approaches. + +| Name | Tool | Description | WGS | WXS | RNA-Seq | ChIP-Seq | +| ----------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | +| M Reads Mapped | [samtools] | Number of reads mapped in millions. This metric is useful | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| % Aligned | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | +| Median Insert Size | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | +| % Duplication | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![Yes](https://img.shields.io/badge/yes-green) | +| ≥ 30X Coverage | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | +| Predicted Instrument | [ngsderive] | The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Predicted Read Length | [ngsderive] | The predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Probable Encoding | [ngsderive] | The predicted encoding of the _original_ FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| % Reads attributed to Homo sapiens | [kraken] | The percentage of reads that were assigned as originating from Homo sapiens from [kraken].
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| % Reads attributed to top 5 species | [kraken] | The percentage of reads that were assigned to the top 5 reported species from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| % Unclassified | [kraken] | The percentage of reads that were not classified by [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | + + + +## General statistics + +### **M Reads Mapped** + +_Produced by [samtools]_ + +Number of reads mapped in millions. The number of reads in a whole genome +sequencing file can vary widely based on the experiment design and the +sequencing depth of the project. + * Typically, WGS files contain between 200M reads up to 2 billion reads (or + higher in some cases). Any deviation outside of this range should be + investigated further (particularly if the number is lower, which suggests + that the file may be truncated). + * This number is especially informative within the context of samples from the + same cohort. We expect a normal distribution for this metric within the same + sequencing cohort, special attention is required for any samples that do not + conform to this distribution. + +### **% Aligned** + +_Produced by [picard]_ + +Percentage of the number of reads that were able to be mapped to a given + reference genome. Within St. Jude Cloud, this metric is indicative of the + amount of non-human material which was sequenced during the experiment. + * This number is especially informative within the context of samples from the + same cohort. 
We expect the percentage of aligned reads to be tightly + clustered above 95% mapping rate. It is not uncommon for samples to go as + low as 80% mapping rate, and this should be considered within an acceptable + range. Occasionally you may find a small proportion of samples between 51%-79% + mapping rate: those samples should be investigated, but typically, you + should still include those samples in the release. Anything below 50% + mapping rate signals an issue with contamination, and anything less than 30% + mapping may be an issue with the mapper or the data integrity (e.g. + mismatched read pairs). + +### **Median Insert Size** + +_Produced by [picard]_ + +Size of the fragment inserted between the sequencing adapters. In whole genome + sequencing experiments, typically you have a target fragment length for each + sample (commonly ~2x the read length to maximize the amount of information + gathered per fragment). The median insert size has proven to be an incredibly + valuable statistic for identifying mapping problems and experimental + abnormalities. + * Typically, the median insert size should be between 1.5x - 2x the read + length of the experiment. This criterion may be evaluated loosely, as insert + sizes that are close to this range (anything between 1x - 5x read length) do + not necessarily indicate a major problem. Samples with an extremely low + median insert size (10-40 nucleotides) are likely the result of mapping + artifacts, particularly if `bwa mem` was used with a low seed size. + * Insert size distributions tend to follow a [gamma + distribution](https://en.wikipedia.org/wiki/Gamma_distribution) where long, + rolling peaks are interspersed with tight peaks close to the target insert + size. + * This number is especially informative within the context of samples from the + same cohort. We expect the median insert size across samples to be tightly + clustered within a cohort.
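
The mapping-rate and insert-size guidance above can be sketched as two small triage helpers. This is only an illustration, not code from the QC workflow: the function names and return labels are hypothetical, and the exact cut points where the text leaves a gap (for example between 79% and 80%, or how to treat an insert size of 40 nucleotides or less regardless of read length) are assumptions.

```python
def triage_mapping_rate(percent_aligned: float) -> str:
    """Bucket a sample's mapping rate using the ranges described above.

    Boundary handling between the stated ranges is an assumption.
    """
    if not 0.0 <= percent_aligned <= 100.0:
        raise ValueError("percent_aligned must be between 0 and 100")
    if percent_aligned >= 95.0:
        return "expected"
    if percent_aligned >= 80.0:
        return "acceptable"
    if percent_aligned >= 50.0:
        # Worth a closer look, but typically still released.
        return "investigate"
    if percent_aligned >= 30.0:
        return "possible contamination"
    return "possible mapper or data integrity issue"


def triage_insert_size(median_insert_size: float, read_length: int) -> str:
    """Compare the median insert size against the read length."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Assumption: anything at or below 40 nt counts as "extremely low".
    if median_insert_size <= 40:
        return "possible mapping artifact"
    ratio = median_insert_size / read_length
    if 1.5 <= ratio <= 2.0:
        return "expected"
    if 1.0 <= ratio <= 5.0:
        # The text allows this wider range to be evaluated loosely.
        return "acceptable"
    return "investigate"
```

Either helper is meant to flag samples for a human to review alongside the rest of the cohort, not to pass or fail a file on its own.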
+ +### **≥ 30X Coverage** + +_Produced by [mosdepth]_ + +Percentage of the genome which has at least 30 reads covering each position (on + average) for the whole genome, the exonic regions, and the coding regions + ("30x coverage"). 30x coverage at any location is widely considered to be the + minimum number of reads needed to generate a high-confidence genotype call. In + whole genome sequencing, it is desirable to have as much of the genome as + possible covered by 30 or more reads. Expected genome coverage is often + determined at the time of sequencing and may vary from project to project + depending on cost and experimental design. + * Target sequencing depth and fraction of the genome to be covered are + generally set on a project-by-project basis. Speak with someone on the + sequencing team to ensure you understand what sequencing depth and fraction + of the genome covered you should be expecting. + * At a minimum, we expect 80% of the whole genome to be covered by at least 30 + reads in high-quality whole genome sequencing experiments. For higher depth + coverage projects like our Clinical Genomics project, the standard is 60x + across 80% of the genome. + * Less coverage does not necessarily indicate a problem, but in whole genome + sequencing, anything below 65% of the genome at 30X signals that something + may be going wrong. Pay close attention to the distribution of the reads + across the genome by generating a coverage plot to see where sequencing bias + is being introduced. + * This number is especially informative within the context of samples from the + same cohort. We expect this metric to be tightly clustered across samples + within a cohort. + +### **GC Content** + +_Produced by [FastQC]_ + +Percentage of the nucleotides contained within reads which are Cs (Cytosine) or +Gs (Guanine). GC content is important because Gs pair with Cs in DNA with 3 +hydrogen bonds whereas As pair with Ts in DNA with 2 hydrogen bonds.
Ergo, DNA +is considered to be stronger when comprised of more GC bonds, and GC content is +a defining characteristic of different genomes. GC content for humans is +[estimated to be at around 41%](https://www.nature.com/articles/35057062#Sec15). + * GC content in whole genome sequencing is typically between 40%-60% depending + on sequencing bias. Any deviation outside of this range should be + investigated further. + * This number is especially informative within the context of samples from the + same cohort. We expect GC content across samples to be tightly clustered + within a cohort. + * If either of the above assumptions do not hold true, probable sources of + variance include library preparation abnormalities or contamination issues. #### Thresholds and Metrics for Specific Applications @@ -233,9 +324,11 @@ The workflow specification is as follows. Note that arguments that are not integ - What is the best way to define and handle outliers? - What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? 
-[fastqc]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ +[FastQC]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [multiqc]: https://multiqc.info/ -[picard]: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/picard_sam_ValidateSamFile.php +[picard]: https://broadinstitute.github.io/picard/ +[mosdepth]: https://github.com/brentp/mosdepth [qualimap]: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html [samtools]: http://www.htslib.org/doc/samtools.html [ngsderive]: https://github.com/stjudecloud/ngsderive/ +[kraken]: https://github.com/DerrickWood/kraken2 From 3f49f1c7b6907642a79a4bb7e91c95f8aa681551 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 27 Jun 2023 15:18:31 -0500 Subject: [PATCH 64/68] revise: further revisions to the QC RFC --- text/0002-quality-check-workflow.md | 44 +++++++++-------------------- 1 file changed, 14 insertions(+), 30 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index 61627d2..f3191e5 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -53,19 +53,19 @@ is extremely valuable due to its relative scarcity. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches. -| Name | Tool | Description | WGS | WXS | RNA-Seq | ChIP-Seq | -| ----------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | -| M Reads Mapped | [samtools] | Number of reads mapped in millions. 
This metric is useful | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| % Aligned | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | -| Median Insert Size | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | -| % Duplication | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![Yes](https://img.shields.io/badge/yes-green) | -| ≥ 30X Coverage | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | -| Predicted Instrument | [ngsderive] | The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Predicated Read Length | [ngsderive] | The predicted read length for the sequencing experiment. 
This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Probable Encoding | [ngsderive] | The predicted encoding of the _original_ FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| % Reads attributed to Homo sapiens | [kraken] | The percentage of reads that were assigned as originating from Homo sapiens from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| % Reads attributed to top 5 species | [kraken] | The percentage of reads that were assigned to the top 5 reported species from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| % Unclassified | [kraken] | The percentage of reads that were not classified by [kraken]. 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Name | Tool | Description | WGS | WXS | RNA-Seq | ChIP-Seq | +| ---------------------------------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | +| M Reads Mapped ([link](#m-reads-mapped)) | [samtools] | Number of reads mapped in millions. This metric is useful | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | +| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | +| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![Yes](https://img.shields.io/badge/yes-green) | +| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | +| Predicted Instrument ([link](#predicted-instrument)) | [ngsderive] | The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Predicted Read Length ([link](#predicted-read-length)) | [ngsderive] | The predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Probable Encoding ([link](#probable-encoding)) | [ngsderive] | The predicted encoding of the _original_ FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`. 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Percentage of Reads attributed to Homo sapiens ([link](#percentage-of-reads-attributed-to-homo-sapiens)) | [kraken] | The percentage of reads that were assigned as originating from Homo sapiens from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Percentage of Reads attributed to top 5 species ([link](#percentage-of-reads-attributed-to-top-5-species)) | [kraken] | The percentage of reads that were assigned to the top 5 reported species from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +| Percentage of Reads Unclassified ([link](#percentage-of-reads-unclassified)) | [kraken] | The percentage of reads that were not classified by [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | @@ -87,7 +87,7 @@ sequencing depth of the project. sequencing cohort, special attention is required for any samples that do not conform to this distribution. -### **% Aligned** +### **Percentage of Reads Aligned** _Produced by [picard]_ @@ -177,22 +177,6 @@ a defining characteristic of different genomes. GC content for humans is * If either of the above assumptions do not hold true, probable sources of variance include library preparation abnormalities or contamination issues. 
-#### Thresholds and Metrics for Specific Applications - -To apply quality control metrics to vet data, we need reasonable thresholds that are practically achievable and neither too lax nor too strict. Our preference is for statistically or empirically determined thresholds, rather than arbitrary estimates. By statistical thresholds, we are referring to distributional tests that formally define outliers. By empirical thresholds, we are referring to standards below which data analysis or interpretation are degraded. Statistical tests can be performed on large populations of QC data. We are already in a position to do that today. Empirical tests, however, require foreknowledge of the correct results. This requires experimental design and implementation through a laboratory at some cost. - -#### Metrics for WGS - -The quality metrics of special concern for WGS include depth of coverage and coverage distribution across genomic regions. Mapping quality is also critical. The analysis of whole genome sequencing to call variants depends on depth and sample purity. Accurate calls are made through replication and contamination creates false positives. So metrics that are sensitive to impurity are valuable. - -#### Metrics for WES - -The quality metrics of special concern for WES include depth of coverage in exonic regions. Mapping quality, mapping percentage, and duplication rate are also important. - -#### Metrics for RNAseq - -The quality metrics of special concern for RNA-Seq include mapping percentage, percentage properly paired reads, and exonic region coverage. Mapping quality is also critical. - ## Specification These are generic instructions for running each of the tools in our pipeline. We run our pipeline as a [WDL workflow](https://github.com/stjudecloud/workflows/blob/master/workflows/qc/quality-check-standard.wdl). We have supplied examples of the commands used for each package. 
For the typical memory requirements of each command, please see our [WDL repository](https://github.com/stjudecloud/workflows). From 850ecf6a3a7283c731a8eeff4672f12acb31cd1e Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Tue, 27 Jun 2023 20:19:52 -0500 Subject: [PATCH 65/68] revise: updates to the QC RFC --- text/0002-quality-check-workflow.md | 288 +++++++++++++++++++++++++--- 1 file changed, 257 insertions(+), 31 deletions(-) diff --git a/text/0002-quality-check-workflow.md b/text/0002-quality-check-workflow.md index f3191e5..3f6f31d 100644 --- a/text/0002-quality-check-workflow.md +++ b/text/0002-quality-check-workflow.md @@ -4,15 +4,18 @@ - [Motivation](#motivation) - [Discussion](#discussion) - [Specification](#specification) -- [Items Still In-Progress](#items-still-in-progress) -- [Outstanding Questions](#outstanding-questions) ## Introduction -This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud genomics data. The goal of this RFC is two-fold. +This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud omics data. The goal of this RFC is two-fold. -1. Establish state of the art method for comprehensively evaluating genomics data quality at scale, both at time of receipt and after processing. -2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data. +1. Establish state of the art method for comprehensively evaluating genomics + data quality at scale, both at time of receipt and after processing, for + whole genome, whole exome, and transcriptome sequencing. +2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage + to assess the quality of the data available. 
This context should save users + time computing the information themselves while also informing appropriate + use of the data. You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3). @@ -53,23 +56,33 @@ is extremely valuable due to its relative scarcity. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches. -| Name | Tool | Description | WGS | WXS | RNA-Seq | ChIP-Seq | -| ---------------------------------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | ---------------------------------------------- | -| M Reads Mapped ([link](#m-reads-mapped)) | [samtools] | Number of reads mapped in millions. This metric is useful | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | -| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | -| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![Yes](https://img.shields.io/badge/yes-green) | -| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![No](https://img.shields.io/badge/no-red) | ![No](https://img.shields.io/badge/no-red) | -| Predicted Instrument ([link](#predicted-instrument)) | [ngsderive] | The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Predicted Read Length ([link](#predicted-read-length)) | [ngsderive] | The predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Probable Encoding ([link](#probable-encoding)) | [ngsderive] | The predicted encoding of the _original_ FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`. 
| ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Percentage of Reads attributed to Homo sapiens ([link](#percentage-of-reads-attributed-to-homo-sapiens)) | [kraken] | The percentage of reads that were assigned as originating from Homo sapiens from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Percentage of Reads attributed to top 5 species ([link](#percentage-of-reads-attributed-to-top-5-species)) | [kraken] | The percentage of reads that were assigned to the top 5 reported species from [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | -| Percentage of Reads Unclassified ([link](#percentage-of-reads-unclassified)) | [kraken] | The percentage of reads that were not classified by [kraken]. | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | ![Yes](https://img.shields.io/badge/yes-green) | +## General Statistics + +General statistics are contained within the summary table at the top of the +MultiQC reports. These statistics give a quick overview of the cohort of +interest with respect to the defined metrics. Generally, they are summary +statistics rather than being comprehensive in nature (those statistics are +included in the sections that follow after). 
+ +### Summary Table + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| ---------------------------------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | +| M Reads Mapped ([link](#m-reads-mapped)) | [samtools] | Number of reads mapped in millions. This metric is useful | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Validation Status ([link](#validation-status)) | [picard] | The validation status of the file as determined by `picard ValidateSamFile`. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) | +| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. 
| ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) | +| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) | +| Predicted Instrument ([link](#predicted-instrument)) | [ngsderive] | The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Predicted Read Length ([link](#predicted-read-length)) | [ngsderive] | The predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Probable Encoding ([link](#probable-encoding)) | [ngsderive] | The predicted encoding of the _original_ FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Percentage of Reads attributed to Homo sapiens ([link](#percentage-of-reads-attributed-to-homo-sapiens)) | [kraken] | The percentage of reads that were assigned as originating from Homo sapiens from [kraken]. 
| ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Percentage of Reads attributed to top 5 species ([link](#percentage-of-reads-attributed-to-top-5-species)) | [kraken] | The percentage of reads that were assigned to the top 5 reported species from [kraken]. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | +| Percentage of Reads Unclassified ([link](#percentage-of-reads-unclassified)) | [kraken] | The percentage of reads that were not classified by [kraken]. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | -## General statistics ### **M Reads Mapped** @@ -177,6 +190,231 @@ a defining characteristic of different genomes. GC content for humans is * If either of the above assumptions do not hold true, probable sources of variance include library preparation abnormalities or contamination issues. +## Detailed Charts + +### mosdepth + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| -------------------------------- | ---------- | ---------------------------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------- | ---------------------------------------------------- | +| Cumulative coverage distribution | [mosdepth] | The proportion of bases within the genome with at least `X` coverage. | ![all](https://img.shields.io/badge/all-brightgreen) | ![exonic/cds](https://img.shields.io/badge/exonic/cds-orange) | ![no](https://img.shields.io/badge/no-red) | +| Coverage distribution | [mosdepth] | The proportion of bases within the genome with _exactly_ `X` coverage. 
| ![all](https://img.shields.io/badge/all-brightgreen) | ![exonic/cds](https://img.shields.io/badge/exonic/cds-orange) | ![no](https://img.shields.io/badge/no-red) | +| Average coverage per contig | [mosdepth] | Average coverage per contig or chromosome. | ![all](https://img.shields.io/badge/all-brightgreen) | ![all](https://img.shields.io/badge/all-brightgreen) | ![all](https://img.shields.io/badge/all-brightgreen) | + +For all target types, `mosdepth` will produce various charts for +the whole genome, exonic, and coding sequence ranges. The most important chart +to examine is the "cumulative coverage distribution" chart, a burndown chart +showing what percentage of the specified region is covered by at least a +particular read depth. In this way, you can assess how much of the specified +region reaches any coverage threshold you care about. + +When interpreting this chart, the most important things to examine are (a) the +chart's behavior with respect to the sequencing modality you are reviewing and +(b) the relative similarities of the plots within a cohort. For (a), you can +expect that + +- Whole genome data will show a significant cliff for which all places in the + genome meet a particular coverage metric. There should be a pronounced + downward curve that signifies the cliff of "normal" sites across the genome + being separated from highly-covered sites of the genome. +- Whole exome data will quickly drop off from left to right, as only about 3% of + the genome is targeted (the exonic regions). For those regions, however, you + can expect upwards of 1000x coverage. +- For RNA-Seq, this plot is not as useful. As such, you can ignore it for this + sequencing modality.
+ +### ngsderive + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| ------------------------------- | ----------- | ------------------------------------------------------------------------------ | ---------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | +| `ngsderive instrument` | [ngsderive] | The predicted sequencing instrument for this file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| `ngsderive readlen` | [ngsderive] | The predicted read length for the file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| `ngsderive encoding` | [ngsderive] | The predicted FASTQ quality score encoding scheme for the file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| `ngsderive strandedness` | [ngsderive] | The predicted strandedness of the RNA-Seq protocol for the file. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| `ngsderive junction-annotation` | [ngsderive] | The predicted breakdown of novel, partially novel, and known splice junctions. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![yes](https://img.shields.io/badge/yes-brightgreen) | + +`ngsderive` is a tool written by the St. Jude Cloud team to provide forensic +analysis techniques for evaluating next-generation sequencing data. 
The +following subcommands are particularly useful in the context of these QC +reports: + +- **`ngsderive instrument`.** Gives an estimate of which sequencing instrument + was used for a particular file. You should check to ensure the result you + receive from this subcommand matches your expectations of the data. +- **`ngsderive readlen`.** Gives an estimate of the pre-trimmed read length. For + files with significant adapter trimming, this may not provide conclusive + results (giving a result of `-1`). You should check to ensure the result you + receive from this subcommand matches your expectations of the data. +- **`ngsderive encoding`.** Guesses which [FASTQ + encoding](https://en.wikipedia.org/wiki/FASTQ_format#Encoding) was used for + the original FASTQ file. For most modern data, this should be `Sanger/Illumina + 1.8`. +- **`ngsderive strandedness`.** For RNA-Seq data, derives the strandedness of + the sequencing protocol. This is useful to understand the effectiveness of the + assay and also how strong downstream results will be (if you believe your data + is `Reverse-stranded`, but your data appears to be somewhere between + `Reverse-stranded` and `Unstranded`, then you may have issues with your + analysis). +- **`ngsderive junction-annotation`.** For RNA-Seq data, reports the number of + novel junctions, partially novel junctions, and known junctions. This is + incredibly helpful when comparing with other samples within the same cohort. + Unless you have reason to suspect otherwise, you should expect the majority of + junctions to be known, followed by novel, and then partially novel junctions. + An abundance of novel junctions can indicate issues with the sample. + +For each of these, we recommend you view the graphs provided to pull out any +outliers in the data. In particular, the quantitative measures like percent +strandedness and junction saturation can give you strong indications if your +library preparation acted as you expected. 
For any outliers, you should contact +your lab to report the issue and determine if they noted any quality control +issues from their side. + +### Qualimap RNA-Seq + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| ----------------------- | ---------- | --------------------------------------------------------------------------- | ------------------------------------------ | ------------------------------------------ | ---------------------------------------------------- | +| Genomic Origin of Reads | [qualimap] | The proportion of intronic, exonic, and intergenic reads from RNA-Seq data. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Gene Coverage Profile | [qualimap] | The distribution of read coverage across gene transcripts. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![yes](https://img.shields.io/badge/yes-brightgreen) | + +Qualimap RNA-Seq provides specific tools evaluating more straightforward metrics +from RNA-Seq data. In our QC pipeline, we use two graphs: (a) the genomic origin +of reads graph and (b) the gene coverage profile graph. + +- The **genomic origin of reads** graph shows the breakdown of reads from + intronic, intergenic, and exonic regions. Particularly for PolyA selected + data, you should expect to see exonic regions dominate the distribution. For + Total or ribosomal depletion methods, you can expect to see more intronic + regions (as pre-mRNA are typically captured with these library preparation + protocols). Note that a high proportion of intergenic reads is often a sign + that your library selection has gone awry. +- The **gene coverage profile** graph demonstrates the coverage across different + positions of gene transcripts. You should expect to see a gentle slope with a + peak towards the middle of each gene. 
With respect to the ends, the 3' end of + the gene typically drops off more quickly than the 5' end of the gene. If you + see irregular coverage of gene transcripts, such as erratic patterns in + coverage, this can indicate some bias in the fragment selection + of the sequencing experiment. + +### Picard + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| ----------------------------- | -------- | ---------------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | +| Alignment Summary | [picard] | Number of aligned versus unaligned reads for a given file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Mean Read Length Distribution | [picard] | Mean read length of the file. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | +| GC Coverage Bias | [picard] | Picard's interpretation of a GC bias distribution. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | +| Insert Size Distribution | [picard] | The distribution of computed insert sizes for the file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Deduplication stats | [picard] | Overview of duplication from reads within the file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![warn](https://img.shields.io/badge/warn-orange) | +| Base quality distribution | [picard] | The count of bases with each quality score.
| ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | +| SAM/BAM validation | [picard] | Validation of the SAM/BAM files. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | + +[Picard] provides a variety of important tools for our QC workflow, including: + +- An **alignment summary** chart, which illustrates the number of reads and the + relative breakdown of mapped versus unmapped reads. This is useful when + determining which files are over/under sequenced within a particular cohort + and for visually identifying samples with lower mapping rates (an indication + of a problem). +- The **mean read length** chart, which we typically ignore in favor of + `ngsderive`'s `readlen` command (note that `ngsderive` will give a result of + `-1` if a result cannot be determined, whereas Picard gives you the median + read length). +- The **GC coverage bias** chart, which we typically ignore in favor of the "Per + Sequence GC Bias" plot from FastQC. +- The **insert size distribution** chart, which is _incredibly_ helpful when + determining problems in your data. For example, if you have bacterial + contamination in your data, the bacterial reads will map with low quality + using `bwa mem` (and default parameters). This will result in a bimodal + distribution of the insert size estimate, which is highly indicative of some + issue in the data. For these plots, (a) ensure that all of the insert size + distributions match each other within a cohort and (b) watch out for bimodal + distributions with a small (or large) peak around ~20bp, which is an + indication of contamination. +- The **deduplication stats** chart, which gives an indication of duplication within + the sample (note that, for RNA-Seq data, if this plot is shown, biological + duplicates cannot be ruled out!).
+- The **base quality distribution** chart, which we typically ignore in favor of + FastQC's "Per Sequence Quality Scores" plot. +- The **file validation** chart, all of which should be green. + +### Samtools + +Samtools also outputs various distributions as a beeswarm plot. In our +experience, these are not as useful except when examining a particular +sample to see where it lies for all metrics at once; otherwise, the summary +statistics in the "General Statistics" section at the top do just as well. + +### Kraken + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| -------- | -------- | ---------------------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | +| Top Taxa | [kraken] | Proportion of reads which are assigned to the indicated species. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | + +[Kraken] is useful for determining the origin of reads within your sample of +interest. Since we mainly sequence human subjects, we typically use kraken +to give an indication of the level of contamination from various species within +our data. You should look to ensure that the vast majority (80% or higher) of +your sample originates from humans (_Homo sapiens_). If you are sequencing a +xenograft, pay close attention to the proportion of mouse (_Mus musculus_) reads +in the sample.
+ +### FastQC + +| Name | Tool | Description | WGS | WXS | RNA-Seq | +| ---------------------------- | -------- | ----------------------------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | +| Sequencing Counts | [FastQC] | Unique and duplicate sequences in the file. | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | +| Sequencing Quality Histogram | [FastQC] | Mean quality for every position in a read. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Per Sequence Quality Scores | [FastQC] | The number of reads with the indicated average quality score. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Per Base Sequence Content | [FastQC] | The bias towards a particular nucleotide at a given position of a read. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Per Sequence GC Content | [FastQC] | The GC content distribution. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Per Base N Content | [FastQC] | The percentile of base calls at each position which contained an N. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Sequence Length Distribution | [FastQC] | The distribution of read lengths in the file. 
| ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | ![no](https://img.shields.io/badge/no-red) | +| Sequence Duplication Levels | [FastQC] | The relative level of duplication for each sequence. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Overrepresented Sequences | [FastQC] | The sequences which may be overrepresented in your file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Adapter Content | [FastQC] | The presence or absence of adapter contamination in your file. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | +| Status Checks | [FastQC] | An overview of the pass/warn/fail status for all FastQC checks. | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | ![yes](https://img.shields.io/badge/yes-brightgreen) | + +[FastQC] rounds out the list of important tools to interrogate the quality of +the sequencing data itself. We leverage many of the plots produced by FastQC, +using it as the authority on most sequencing quality metrics, including: + +- The **sequencing counts** chart, which we typically ignore in favor of + [picard]'s mark duplicates plot. +- The **sequence quality histogram**, which gives an indication of the + sequencing quality across reads. Look for anything that fails FastQC's quite + stringent check to see if that correlates with any other failing metrics. +- The **per sequence quality scores** chart, which shows the mean sequence + quality scores across the sample (peaks for lower scores in the orange or red + areas of the graph are cause for concern). 
+- The **per base sequence content**, which shows if there are any biases in what + nucleotides are showing up at particular locations in the reads. For whole + genome and whole exome data, there should be relatively few biases. However, + for RNA-Seq, you will see a bias at the 5' region of the reads. +- The **per sequence GC content** chart, which is highly informative of + contamination in whole genome and whole exome sequencing data. In those two + sequencing modalities, you should check to ensure that all of the + distributions are tightly following one another; any deviation from the + expected distribution is a sign of contamination. Note that sometimes tools + like FastQC plot an "expected" distribution of the GC content; this is usually + the expectation of WGS data, but not WXS data, which has a different expected + distribution. For RNA-Seq, this graph is generally not used except to examine + extreme biases at particular positions. +- The **per base N content** chart, which should generally report very low + levels of Ns in modern data. +- The **sequence length distribution** chart, which we typically ignore in favor + of `ngsderive`'s `readlen` plot. +- The **sequence duplication level** chart, which gives an indication of the + percentage of sequences that are duplicated and by what factor (though we + often find that even good data fails FastQC's stringent test for this graph). +- The **overrepresented sequences** chart, which gives an indication if you're + sequencing the same sequence over and over. +- The **adapter content** chart, which we use to ensure there are minimal to no + signs of adapters in the data we receive. +- And finally, the **status checks** view, which gives you an overview of which + samples passed/warned/failed different checks.
Generally, this is a good + overview of the quality of the data, though we find that many of the charts + are often marked as "failed" for good data (in particular, the "Per Base + Sequence Content" [for RNA-Seq], the "Per Sequence GC Content", and the + "Sequence Length Distribution" graphs). + ## Specification These are generic instructions for running each of the tools in our pipeline. We run our pipeline as a [WDL workflow](https://github.com/stjudecloud/workflows/blob/master/workflows/qc/quality-check-standard.wdl). We have supplied examples of the commands used for each package. For the typical memory requirements of each command, please see our [WDL repository](https://github.com/stjudecloud/workflows). @@ -296,18 +534,6 @@ The workflow specification is as follows. Note that arguments that are not integ multiqc . # recurse all files in '.' ``` -## Items Still In-Progress - -- [ ] Analysis tools for other types of sequencing (ChIP-seq) -- [ ] Useful metadata from various stages (sample collection, laboratory, pre-sequencing, sequencing, post-sequencing) - -## Outstanding Questions - -- What thresholds or metrics differentiate a poor-quality sample from a high-quality one? -- What other metrics or properties would be valuable? -- What is the best way to define and handle outliers? -- What is the best way to examine cohort integrity? This means experimental category-based tests of samples to find outliers that are of sufficient quality if examined alone. Outliers in this case may indicate classification errors or rare biological conditions. Which metrics are best tested here? 
- [FastQC]: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [multiqc]: https://multiqc.info/ [picard]: https://broadinstitute.github.io/picard/ From 7850c1c44d531f1875f32ef8f365ef4ef6d509cf Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 28 Jun 2023 10:17:36 -0500 Subject: [PATCH 66/68] revise: updates QC RFC to not have an assigned number --- .../.gitkeep | 0 ...2-quality-check-workflow.md => XXXX-quality-check-workflow.md} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename resources/{0002-quality-check-workflow => XXXX-quality-check-workflow}/.gitkeep (100%) rename text/{0002-quality-check-workflow.md => XXXX-quality-check-workflow.md} (100%) diff --git a/resources/0002-quality-check-workflow/.gitkeep b/resources/XXXX-quality-check-workflow/.gitkeep similarity index 100% rename from resources/0002-quality-check-workflow/.gitkeep rename to resources/XXXX-quality-check-workflow/.gitkeep diff --git a/text/0002-quality-check-workflow.md b/text/XXXX-quality-check-workflow.md similarity index 100% rename from text/0002-quality-check-workflow.md rename to text/XXXX-quality-check-workflow.md From 45115c96ea27fa72c5b9bf20111a4c364d5dfee1 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 28 Jun 2023 10:25:54 -0500 Subject: [PATCH 67/68] revise: adds title to the RFC --- text/XXXX-quality-check-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/XXXX-quality-check-workflow.md b/text/XXXX-quality-check-workflow.md index 3f6f31d..667722f 100644 --- a/text/XXXX-quality-check-workflow.md +++ b/text/XXXX-quality-check-workflow.md @@ -1,4 +1,4 @@ -# Table of Contents +# QC Pipeline for Whole Genome, Whole Exome, and RNA Sequencing - [Introduction](#introduction) - [Motivation](#motivation) From 3a7f5b063f862a34a619bafe09b3ea14bd6ac507 Mon Sep 17 00:00:00 2001 From: Clay McLeod Date: Wed, 28 Jun 2023 10:30:08 -0500 Subject: [PATCH 68/68] revise: adds TOC back into the RFC --- text/XXXX-quality-check-workflow.md | 2 ++ 1 
file changed, 2 insertions(+) diff --git a/text/XXXX-quality-check-workflow.md b/text/XXXX-quality-check-workflow.md index 667722f..274c80a 100644 --- a/text/XXXX-quality-check-workflow.md +++ b/text/XXXX-quality-check-workflow.md @@ -1,5 +1,7 @@ # QC Pipeline for Whole Genome, Whole Exome, and RNA Sequencing +## Table of Contents + - [Introduction](#introduction) - [Motivation](#motivation) - [Discussion](#discussion)