CopyKit-UserGuide/06-analysis_visualization.Rmd at main · navinlabcode/CopyKit-UserGuide · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
# Analysis and Visualization module

The analysis and visualization module from CopyKit work in synergy to help you analyze your single cell copy number results.

## plotMetrics()

The `plotMetrics()` function can plot any information store in `colData()`, which is passed to the `metric` argument. If a `label` argument is supplied, the points will be colored based on that information.
```{r}
plotMetrics(tumor, metric = c("overdispersion",
                              "breakpoint_count",
                              "reads_total",
                              "reads_duplicates",
                              "reads_assigned_bins",
                              "percentage_duplicates"),
            label = "percentage_duplicates")
```

## plotRatio()

Visualizing the segmentation to ensure it follows the ratios as expected is crucial to a copy number analysis.
This helps to verify the accuracy of the segmentation an assess the quality of the data visually.

The `plotRatio()` function is a useful tool for this, offering two different modes. When the input is the CopyKit object, an interactive app will open, allowing you to choose which cell to visualize from the drop-down option menu.
```{r plot_ratio_app, eval = FALSE}
plotRatio(tumor)
```
<center>
![Ratio Plot App](images/plot_ratio_shiny.png)
</center>

Providing a sample name to the `plotRatio()` function with the argument `sample_name`, will display the plot only for the selected cell.

```{r plot_ratio_sample, warning = FALSE}
plotRatio(tumor, "PMTC6LiverC117AL4L5S1_S885_L003_R1_001")
```

## runUmap()

The `runUmap()` function generates a [UMAP](https://umap-learn.readthedocs.io/en/latest/) embedding, which is stored in the `reducedDim` slot.

`runUmap()` is an important pre-processing step for the `findClusters()` feature.

```{r run_umap}
tumor <- runUmap(tumor)
```

You can pass additional arguments to control UMAP parameters using the `'...'` argument. The full list of additional arguments that can be passed on to `uwot::umap`  with the '...' argument can be found in the [uwot manual](https://cran.r-project.org/web/packages/uwot/uwot.pdf) and information on their influence on clustering can be seen in the [UMAP webpage](https://umap-learn.readthedocs.io/en/latest/clustering.html)


## scquantum & calcInteger()

The scquantum package is integrated into CopyKit to infer absolute copy numbers. It can be used either within CopyKit or as a [standalone package](https://github.com/navinlabcode/scquantum) available on GitHub. Other methods for inferring absolute copy numbers are metadata and fixed. After running calcInteger(), CopyKit scales the segment ratios to integer values, which are stored in the integer slot and can be accessed using the assay() function.

To infer absolute ploidy values using scquantum, specify `scquantum` as the method argument in `calcInteger()`. The estimated computational ploidy is stored in `colData()` along with confidence_ratio, which compares the obtained score with the theoretical estimate, and ploidy_score, a transformation of confidence_ratio with values closer to 0 indicating better ploidy inference calls. We will use the smoothed bincounts from the `knnSmooth()` step to perform the inference.

```{r calc_integerscquantum, warning=FALSE}
tumor <- calcInteger(tumor, method = 'scquantum', assay = 'smoothed_bincounts')
```

The resulting `ploidy_score` scores can be used for further quality control of the single cell dataset.
```{r}
plotMetrics(tumor, metric = 'ploidy', label = 'ploidy_score')
tumor <- tumor[,colData(tumor)$ploidy_score < 0.2]
```

Setting the method argument to 'fixed' scales all cells to a fixed value of ploidy (generally determined by flow cytometry), which is passed to CopyKit using the ploidy_value argument:

```{r calc_integer_fixed, eval = FALSE}
# NOT RUN
tumor <- calcInteger(tumor, method = 'fixed', ploidy_value = 4.3)
```

If the integer assay exists, plotRatio() will plot it as a secondary axis.
```{r plot_ratio_integer, warning=FALSE}
plotRatio(tumor, "PMTC6LiverC117AL4L5S1_S885_L003_R1_001")
```

## Clustering

The `findClusters()` function performs clustering using the UMAP embedding generated by `runUmap()`. For detecting subclones, CopyKit uses
different clustering algorithms such as:

1) [hdbscan](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) (recommended)

2) [Leiden](https://www.nature.com/articles/s41598-019-41695-z)

3) [Louvain](https://arxiv.org/abs/0803.0476)

The hdbscan method is recommended and has been previously successfully applied in the work from Laks *et al.* and Minussi *et al.*.

### findSugestedK()

The `findSuggestedK()` is a helper function that provides guidance to choose the `k` parameter for clustering algorithms. The function
`findSuggestedK` bootstraps clustering over a range of k values, and returns the value that maximizes the jaccard similarity, with *median* as the default metric (argument `metric`).

While `findSuggestedK` does not guarantee optimal clustering. It  provides a guide to select k values.

```{r find_suggested_k}
tumor <- findSuggestedK(tumor)
```

The `plotSuggestedK()` function can be used to inspect the results of `findSuggestedK()`. It can plot a boxplot, heatmap (tile), or scatterplot

```{r plot_suggested_k_boxplot}
plotSuggestedK(tumor)
```

If the argument `geom` is set to *tile*, `plotSuggestedK()` plots a heatmap where each row is a detected subclone, each column is a k assessed during the grid search and the color represents the jaccard similarity for a given clone.

The suggested value can be accessed from the metadata:
```{r find_sug_meta}
S4Vectors::metadata(tumor)$suggestedK
```

### findClusters()

To run `findClusters()`, pass the CopyKit object as its input. If `findSuggestedK()` was run beforehand, `findClusters()` will automatically use the stored value from `findSuggestedK()` as the argument for k_subclones, unless otherwise specified.


```{r findCL_findSuggested}
tumor <- findClusters(tumor)
```

If hdbscan is used for clustering, singletons [outliers](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) may be idetified and added to subgroup `c0`.  Copykit will notify you of any cells classified as c0 after running findClusters(). To remove outliers, subset the CopyKit object with:

```{r filter_outliers}
tumor <- tumor[,colData(tumor)$subclones != 'c0']
```

### plotUmap()

The `plotUmap()` function can be used to visualize the reduced dimensional embedding. Points can be colored by any element of the `colData` with the argument 'label'.

```{r}
plotUmap(tumor)
```

```{r plot_umap_subclones}
plotUmap(tumor, label = 'subclones')
```

## runPca()

The `runPca()` function runs a Principal Component Analysis:
```{r run_pca}
tumor <- runPca(tumor)
```

The results are stored in the `reducedDim(tumor, 'PCA')` slot.

Principal Components can be used as an alternative pre-processing step to the `findClusters()` function by changing the `findClusters()` `embedding` argument to 'PCA' and setting the number of components with `ncomponents`. However, it's recommended to use the default UMAP embedding for better results. Visualize the number of useful components with plotScree() and the PCA plot with plotPca()

```{r}
plotPca(tumor, label = 'subclones')
```


## plotHeatmap()

Copy number profiles can be visualized with a heatmap using the plotHeatmap() function. The elements of `colData()` can be annotated on the heatmap with the argument `label`. To order subclones, a consensus phylogeny can be calculated with `calcConsensus()` and `runConsensusPhylo()`, if the argument `order_cells` is not provided, the order of the cells in the heatmap will follow the order of the cells within the CopyKit object. In the plotHeatmap() function, gene annotations can be added with the genes argument.

To visualize copy number profiles with a heatmap we can use `plotHeatmap()`.

The heatmap can be annotated with elements of colData.

To order subclones, one option is to calculate a consensus phylogeny, explained in later sections:
```{r plot_heatmap_sub, fig.height=8}
# First we calculate a consensus from the subclones
# and a consensus phylogeny
tumor <- calcConsensus(tumor)
tumor <- runConsensusPhylo(tumor)
# We pass 'consensus_tree' to plotHeatmap
# to order the cells by the stored consensus phylogeny
plotHeatmap(
  tumor,
  label = 'subclones',
  genes = c("TP53", "BRAF", "MYC"),
  order_cells = 'consensus_tree'
)
```

Integer copy number heatmaps can be plotted by passing assay = 'integer', with the integer matrix in the 'integer' slot. New information can be added to colData() and used in plotting functions. For example, spatial information encoded in the sample name can be extracted and added as a column to the metadata, and used to annotate the heatmap.
```{r plot_ht_int, fig.height=8}
# adding spatial information
colData(tumor)$spatial_info <- stringr::str_extract(colData(tumor)$sample, "L[0-9]")

# plotting the integer heatmap with the spatial and subclonal information
plotHeatmap(tumor, assay = 'integer', label = c("spatial_info", "subclones"), order_cells = 'consensus_tree')
```

The spatial information can now also be used to color the elements of the UMAP:
```{r umap_spatial}
plotUmap(tumor, label = 'spatial_info')
```

## runPhylo() & plotPhylo()

The `runPhylo()` function can be used to perform phylogenetic analysis of cells' copy number profiles. The available methods are [Neighbor Joining](https://pubmed.ncbi.nlm.nih.gov/3447015/) and [Balanced Minimum evolution](https://pubmed.ncbi.nlm.nih.gov/8412650/).The resulting tree is stored in the phylo slot within the CopyKit object.

The `plotPhylo()` function can be used to visualize the trees generated by `runPhylo()`. The leaves of the tree can be colored based on any element of the colData by using the argument `label`.

```{r run_phylo}
tumor <- runPhylo(tumor, metric = 'manhattan')
plotPhylo(tumor, label = 'subclones')
```

## calcConsensus()

The `calcConsensus()` function generates a consensus matrix that can help show the distinct segments among subclones. The function uses subclone information by default, but any `colData()` element can be used for the calculation.

The `plotHeatmap()` function can be used to plot a consensus heatmap and add metadata information if the label is the same as the one used to build the consensus matrix. The `group` argument can be passed to add additional annotation from `colData()` in the format of a frequency barplot.

```{r calc_consensus}
tumor <- calcConsensus(tumor)
# We use the plotHeatmap function to plot the
# consensus matrix by passing the argument `consensus`.
plotHeatmap(tumor,
            consensus = TRUE,
            label = 'subclones',
            group = 'spatial_info')
```

To calculate consensus from the integer assay, set the `calcConsensus()` argument `assay` to _'integer'_ and plot it with `plotHeatmap()`

```{r consensus_integer}
tumor <- calcConsensus(tumor, consensus_by = 'subclones', assay = 'integer')
plotHeatmap(tumor,
            consensus = TRUE,
            assay = 'integer',
            label = 'subclones',
            group = 'spatial_info')
```

## plotConsensusLine()

To compare the differences among subclones, `plotConsensusLine()` opens an interactive app where the consensus sequences are plotted as lines.

```{r plot_cons_line, eval=FALSE}
plotConsensusLine(tumor)
```

<center>
![Plot Consensus Line](images/plot_consensus_shiny.png)
</center>

## plotGeneCopy()

The `plotGeneCopy()` function can be used to view copy number states for genes. Two different geoms are available: "swarm" (default) or "violin". Points can be colored using the 'label' argument. To visualize across groups, a positional dodge can be added. Example:

```{r}
plotGeneCopy(tumor,
             genes = c("CDKN2A",
                              "FGFR1",
                              "TP53",
                              "PTEN",
                              "MYC",
                              "CDKN1A",
                              "MDM2",
                              "AURKA",
                              "PIK3CA",
                              "CCND1",
                              "KRAS"),
             label = 'subclones',
             dodge.width = 0.8)
```

A barplot geom can be used to visualize the integer data as a frequency barplot for each gene:
```{r plot_gene_copy_integer}
plotGeneCopy(tumor, genes = c("CDKN2A",
                              "FGFR1",
                              "TP53",
                              "PTEN",
                              "MYC",
                              "CDKN1A",
                              "MDM2",
                              "AURKA",
                              "PIK3CA",
                              "CCND1",
                              "KRAS"),
             geom = 'barplot',
             assay = 'integer')
```


## plotFreq()

The `plotFreq()` function can be used to visualize genomic gains or losses across the genome. This function calculates the frequency of gain or losses in each region of the genome based on a threshold applied to all samples. The `low_threshold` argument (below which values are counted as genomic losses) and `high_threshold` argument (above which values are counted as genomic gains) control the thresholds. It is recommended to set the thresholds based on the ploidy of the sample. To use the integer assay instead of the default segment_ratios assay, pass the `assay` argument. The visual representation of the frequencies can be shown as either an `area` or a `line` geom

```{r plot_freq_sample}
plotFreq(tumor)
```

Elements of `colData()` can be used to split the plot across different groups.

```{r plot_freq_label, fig.height=6}
plotFreq(tumor,
         group = 'subclones')
```


## plotAlluvial()

To visualize frequencies across elements of the metadata we can use `plotAlluvial()`
```{r plot_alluvial}
plotAlluvial(tumor, label = c("subclones", "spatial_info"))
```