-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path04-qc_module.Rmd
More file actions
83 lines (56 loc) · 3.43 KB
/
04-qc_module.Rmd
File metadata and controls
83 lines (56 loc) · 3.43 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
# QC module
CopyKit Quality Control Module consists of 3 main functions:
1) `runMetrics()`
2) `findOutliers()`
3) `findAneuploidCells()`.
## runMetrics()
`runMetrics()` adds basic quality control information to colData.
It returns sample-wise metrics of overdispersion and breakpoint counts.
```{r run_metrics}
tumor <- runMetrics(tumor)
```
The resulting information can be viewed with:
```{r rm_metadata}
colData(tumor)
```
## findAneuploidCells()
Datasets may contain euploid cells mixed with the aneuploidy cells.
To detect euploid cells `findAneuploidCells()` calculates the sample-wise coefficient of variation from the segment ratio means.
The expected coefficient of variation for euploid cells `N(0, 0.01)` is simulated for *x* data points, where *x* is equal to the number of cells within the dataset.
An expectation-maximization algorithm is used to fit a mixture of gaussian distributions to the coefficient of variation from the samples together with the simulated datasets.
The distribution containing the simulated dataset is inferred to be the euploid distribution.
Samples that group with the inferred euploid distribution and present coefficient of variation smaller than 5 standard deviations from the mean euploid distribution are classified as euploid samples.
The threshold can be changed from the automatic detection to a custom threshold with the argument `resolution`. For example, by setting a threshold = 0.1, findAneuploidCells will mark as euploid all cells with a coefficient of variation less or equal than 0.1.
```{r find_normal_cells}
tumor <- findAneuploidCells(tumor)
```
The results from `findAneuploidCells()` are stored within the colData in the column _is_aneuploid_.
We can visualize the results with `plotHeatmap()`:
```{r find_normal_heat, fig.height=8}
plotHeatmap(tumor, label = 'is_aneuploid', row_split = 'is_aneuploid', n_threads = 40)
```
The object is subsetted in the same way as with any R object, to keep only the aneuploid cells.
```{r find_normal_subset}
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
```
## findOutliers()
`findOutliers()` annotates low-quality cells according to a defined resolution threshold.
To detect low-quality samples, CopyKit calculates the Pearson correlation matrix of all samples from the segment ratio means.
Next, we calculate a sample-wise mean of the correlation between a cell and its k-nearest-neighbors (default = 5).
Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default = 0.9).
```{r filter_cells}
tumor <- findOutliers(tumor)
```
The default correlation cutoff for filtering can be adjusted with the argument 'resolution'.
For example, setting the resolution = 0.8 will mark all cells with a mean correlation smaller than 0.8 as low-quality cells.
Higher resolution values will result in stricter filtering criterias.
Results from `findOutliers()` are added to colData (column _outlier_) marking cells that can be removed or kept.
We can check the results with `plotHeatmap()`. To make visualization easier, rows can also be split according to elements of colData with the argument `row_split`.
```{r filter_cells_heat, fig.height=8}
plotHeatmap(tumor, label = 'outlier', row_split = 'outlier', n_threads = 40)
```
We remove the marked low-quality cells from the object with:
```{r filter_subset}
tumor <- tumor[,colData(tumor)$outlier == FALSE]
```
The dataset should be ready to proceed with the analysis.