Bug in getGeneLengthAndGCContent for reverse strand genes

This bug pertains to the Biomart section of the function only. I haven't tested if this is also a problem with the "org.db" mode. 

The issue is that when fetching sequences from Biomart, it returns sequences relative to whichever strand the gene is on. So a gene on the forward strand will get a sequence from the forward strand where position "1" is the first base, but a gene on the reverse strand will get the reverse complement of the forward strand so all the coordinates are backward (position "1" is the last base). However the _coordinates_ returned by Biomart are with respect to the forward strand regardless of where the gene is. The `getGeneLengthAndGCContent` function treats all genes as if position "1" is the first base, which means that for all negative-strand genes, the exon sequences extracted from the gene sequence are wrong.

As a working example: gene **ENSG00000012817** / exon **ENSE00000652508**, which is on the reverse strand:
If you go to ensembl.org, this exon is 78 bases long and the sequence listed is "GATTGGCAGC....TGAACTGGAG".

I found which set of coordinates in `ecoords` (line 136) belong to this exon and adjusted them as in lines 143-144. The sequence you get with the coordinates (`gseq[IRanges(39298, 39375)]`) is: "GTTTTTTTGT...TTTTCTCCCC", which is an obvious mismatch.

However, if you reverse-complement the sequence and _then_ get the sequence with the same coordinates:
`reverseComplement(gseq)[IRanges(39298, 39375)]`, you get "CTCCAGTTCA....GCTGCCAATC" which, if you reverse complement again, matches the original sequence listed in Ensembl. 

I ran this gene and one on the forward strand through the original function. GC content results: 
```
ENSG00000012817 ENSG00000067646 
      0.3774701       0.3820407 
```

Corrected results after my fixes below:
```
ENSG00000012817 ENSG00000067646 
      0.4540843       0.3820407 
```

Which is a pretty large difference. 


Here is the code needed to fix this bug. I can also fork and submit a pull request if that's easiest:

Line 112, add "strand" to `attrs`:
```
 attrs <- c(id.type, "ensembl_exon_id",
                  "chromosome_name", "exon_chrom_start", "exon_chrom_end",
                  "strand")
coords <- getBM(filters=id.type, attributes=attrs, values=id, mart=ensembl)
```

Line 114:
```
coords <- GRangesList(sapply(id,
            function(i)
            {
                i.coords <- coords[coords[,1]== i, 3:5]
                g <- GRanges(i.coords[,1], IRanges(i.coords[,2],i.coords[,3]))
                return(g)
            }), compress=FALSE)
```
can be replaced with:
```
strand <- c("-1" = "-", "1" = "+")
coords$strand <- strand[as.character(coords$strand)]
coords <- makeGRangesListFromDataFrame(coords, 
        split.field = "ensembl_gene_id",
        start.field = "exon_chrom_start",
        end.field = "exon_chrom_end")
```
Then, line 145 needs to be changed to:
```
if (all(strand(ecoords) == "+")) {
    eseq <- gseq[IRanges(start, end)]
} else {
    eseq <- reverseComplement(gseq)[IRanges(start, end)]
}
```
And that should now give the correct sequence and GC content for reverse-strand genes. I have spot-checked exons from multiple genes on the forward and reverse strands to make sure this is correct. 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in getGeneLengthAndGCContent for reverse strand genes #15

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Bug in getGeneLengthAndGCContent for reverse strand genes #15

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions