-
Notifications
You must be signed in to change notification settings - Fork 1
Description
This bug pertains to the Biomart section of the function only. I haven't tested if this is also a problem with the "org.db" mode.
The issue is that when fetching sequences from Biomart, it returns sequences relative to whichever strand the gene is on. So a gene on the forward strand will get a sequence from the forward strand where position "1" is the first base, but a gene on the reverse strand will get the reverse complement of the forward strand so all the coordinates are backward (position "1" is the last base). However the coordinates returned by Biomart are with respect to the forward strand regardless of where the gene is. The getGeneLengthAndGCContent function treats all genes as if position "1" is the first base, which means that for all negative-strand genes, the exon sequences extracted from the gene sequence are wrong.
As a working example: gene ENSG00000012817 / exon ENSE00000652508, which is on the reverse strand:
If you go to ensembl.org, this exon is 78 bases long and the sequence listed is "GATTGGCAGC....TGAACTGGAG".
I found which set of coordinates in ecoords (line 136) belong to this exon and adjusted them as in lines 143-144. The sequence you get with the coordinates (gseq[IRanges(39298, 39375)]) is: "GTTTTTTTGT...TTTTCTCCCC", which is an obvious mismatch.
However, if you reverse-complement the sequence and then get the sequence with the same coordinates:
reverseComplement(gseq)[IRanges(39298, 39375)], you get "CTCCAGTTCA....GCTGCCAATC" which, if you reverse complement again, matches the original sequence listed in Ensembl.
I ran this gene and one on the forward strand through the original function. GC content results:
ENSG00000012817 ENSG00000067646
0.3774701 0.3820407
Corrected results after my fixes below:
ENSG00000012817 ENSG00000067646
0.4540843 0.3820407
Which is a pretty large difference.
Here is the code needed to fix this bug. I can also fork and submit a pull request if that's easiest:
Line 112, add "strand" to attrs:
attrs <- c(id.type, "ensembl_exon_id",
"chromosome_name", "exon_chrom_start", "exon_chrom_end",
"strand")
coords <- getBM(filters=id.type, attributes=attrs, values=id, mart=ensembl)
Line 114:
coords <- GRangesList(sapply(id,
function(i)
{
i.coords <- coords[coords[,1]== i, 3:5]
g <- GRanges(i.coords[,1], IRanges(i.coords[,2],i.coords[,3]))
return(g)
}), compress=FALSE)
can be replaced with:
strand <- c("-1" = "-", "1" = "+")
coords$strand <- strand[as.character(coords$strand)]
coords <- makeGRangesListFromDataFrame(coords,
split.field = "ensembl_gene_id",
start.field = "exon_chrom_start",
end.field = "exon_chrom_end")
Then, line 145 needs to be changed to:
if (all(strand(ecoords) == "+")) {
eseq <- gseq[IRanges(start, end)]
} else {
eseq <- reverseComplement(gseq)[IRanges(start, end)]
}
And that should now give the correct sequence and GC content for reverse-strand genes. I have spot-checked exons from multiple genes on the forward and reverse strands to make sure this is correct.