Description
The FORMAT fields in the header describe how to parse the genotype columns for each row:
...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
...
So the per-genotype quality scores for the first variant at position 14370 are 48, 48, and 43.
By the way, HQ (Haplotype Quality) is declared with Number=2, so it should be represented as a pair of integers, but some of the sample fields are .,. or single integers?
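To make the row layout concrete, here is a minimal sketch of pairing the FORMAT keys with one sample column. `sample_fields` and `genotype_quality` are hypothetical helpers, not existing code in this repo. The sketch assumes that a sample may carry fewer values than the FORMAT column has keys (as the last sample of the second row does) and it leaves `.` values as raw strings.

```rust
/// Pair each FORMAT key with the corresponding raw value in one sample column.
/// Keys that have no value in the sample map to `None`.
/// (Hypothetical helper for illustration only.)
fn sample_fields<'a>(format: &'a str, sample: &'a str) -> Vec<(&'a str, Option<&'a str>)> {
    let mut values = sample.split(':');
    format.split(':').map(|key| (key, values.next())).collect()
}

/// Look up the GQ value of one sample and parse it as an integer.
/// Returns `None` if GQ is absent or is not a valid integer (for example `.`).
fn genotype_quality(format: &str, sample: &str) -> Option<u32> {
    sample_fields(format, sample)
        .into_iter()
        .find(|(key, _)| *key == "GQ")
        .and_then(|(_, value)| value)
        .and_then(|raw| raw.parse().ok())
}
```

With the first row above, calling `genotype_quality` on each of the three sample columns gives the three GQ values. For the sample `0/0:41:3`, the HQ key is paired with `None` rather than a value.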
The valid FORMAT key-value pairs are tightly specified and @MrCurtis has provided test cases for them: #20
The design decisions we can make now are:
- How should we store the header's FORMAT data?
- How should we use it to parse each of the body's rows?
Here's one possibility:
enum NumberField {
    Number(u32),
    A,   // The field has one value per alternate allele
    R,   // The field has one value for each possible allele
    G,   // The field has one value for each possible genotype
    Dot, // The number of possible values varies, is unknown or unbounded
}

enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

For each format ID (GT, GQ, DP, ...) we can store an InfoFormat struct that describes how to parse the row's fields. In this example,
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
would translate to:
InfoFormat {
    fieldtype: DataType::Integer(NumberField::Number(2)),
    description: "Haplotype Quality".to_string(),
    source: None,
    version: None,
}

And we'd associate this instance with the HQ ID using a hashmap or something.
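As a rough sketch of the storage side, the header's `Number=` value could be parsed with a `FromStr` impl and the resulting struct kept in a `HashMap` keyed by ID. The types repeat the proposal above with derives added. `example_formats` is a hypothetical function that hard-codes the HQ line instead of parsing the full header syntax, which is out of scope here.

```rust
use std::collections::HashMap;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

impl FromStr for NumberField {
    type Err = String;

    /// Parse the value of `Number=` from a header line.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "A" => Ok(NumberField::A),
            "R" => Ok(NumberField::R),
            "G" => Ok(NumberField::G),
            "." => Ok(NumberField::Dot),
            n => n
                .parse::<u32>()
                .map(NumberField::Number)
                .map_err(|e| format!("invalid Number value {:?}: {}", n, e)),
        }
    }
}

#[derive(Debug, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

#[derive(Debug, PartialEq)]
struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

/// Hypothetical: build the ID -> InfoFormat table for the HQ header line only.
fn example_formats() -> Result<HashMap<String, InfoFormat>, String> {
    let mut formats = HashMap::new();
    let number: NumberField = "2".parse()?;
    formats.insert(
        "HQ".to_string(),
        InfoFormat {
            fieldtype: DataType::Integer(number),
            description: "Haplotype Quality".to_string(),
            source: None,
            version: None,
        },
    );
    Ok(formats)
}
```

Keying the map by `String` keeps the sketch simple. Whether IDs should be owned strings or borrowed from the header buffer is a separate decision.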
Then, when we parse each row of the sample, we match on DataType to determine how to parse that field. This would be an opportunity to report a detailed error message.
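Here is one way the match could look, as a sketch rather than a settled design. `Value` and `parse_field` are hypothetical names. The sketch assumes that `.` marks a missing value, that only a fixed `Number(n)` is checked against the value count (A, R and G need the row's allele count, which is not modelled here), and that Flag is rejected for genotype fields.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

/// Hypothetical parsed value; `Missing` stands for a `.` entry.
#[derive(Debug, PartialEq)]
enum Value { Missing, Integer(i64), Float(f64), Character(char), Text(String) }

/// Parse one raw sample field (for example "51,51") according to its declared type.
fn parse_field(id: &str, fieldtype: DataType, raw: &str) -> Result<Vec<Value>, String> {
    let number = match fieldtype {
        DataType::Integer(n) | DataType::Float(n) | DataType::Character(n) | DataType::String(n) => n,
        // Assumption: Flag carries no value, so it is rejected for genotype fields.
        DataType::Flag => return Err(format!("{}: Flag is not valid for a genotype field", id)),
    };
    let mut values = Vec::new();
    for item in raw.split(',') {
        let value = if item == "." {
            Value::Missing
        } else {
            match fieldtype {
                DataType::Integer(_) => item.parse::<i64>().map(Value::Integer).map_err(|e| {
                    format!("{}: expected an integer, found {:?} ({})", id, item, e)
                })?,
                DataType::Float(_) => item.parse::<f64>().map(Value::Float).map_err(|e| {
                    format!("{}: expected a float, found {:?} ({})", id, item, e)
                })?,
                DataType::Character(_) => {
                    let mut chars = item.chars();
                    match (chars.next(), chars.next()) {
                        (Some(c), None) => Value::Character(c),
                        _ => return Err(format!("{}: expected one character, found {:?}", id, item)),
                    }
                }
                DataType::String(_) => Value::Text(item.to_string()),
                DataType::Flag => unreachable!("Flag was rejected above"),
            }
        };
        values.push(value);
    }
    // Only a fixed count can be checked without knowing the row's alleles.
    if let NumberField::Number(expected) = number {
        if values.len() != expected as usize {
            return Err(format!(
                "{}: expected {} value(s), found {} in {:?}",
                id, expected, values.len(), raw
            ));
        }
    }
    Ok(values)
}
```

Because the error strings carry the field ID, the offending text and the expected shape, a caller could prepend the row and sample position to get a fully located message.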
Contrast this with the dynamic dispatch solution:
trait InfoData {
    // Define methods that all InfoData must implement
}

struct InfoFormat {
    field_data: Box<dyn InfoData>,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

struct IntegerData {
    number: NumberField,
    // Other fields specific to IntegerData
}

impl InfoData for IntegerData {
    // Implement the methods from InfoData trait
}

In terms of performance, the choice is between matching on the field types at runtime and calling the InfoData parsing methods through a vtable. I think the enum approach gives the compiler more room to inline and optimise, since every variant is known at compile time (strictly speaking, monomorphisation only applies to generic code, not to a match on an enum). I don't know which strategy guarantees better performance, but my guess is static dispatch.
The appeal of dynamic dispatch is that it's much more flexible: we could add a new field type by just implementing InfoData on it. On the other hand, the VCF spec clearly limits the scope of what values we're expecting.
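To illustrate that flexibility, here is a small sketch in which the trait has a single `parse` method and each field type is a separate struct. The method signature, the `Value` enum, `StringData` and `example_parsers` are all assumptions made for the example. The proposal above leaves the trait's methods open.

```rust
use std::collections::HashMap;

/// Hypothetical parsed value shared by all implementations.
#[derive(Debug, PartialEq)]
enum Value { Missing, Integer(i64), Text(String) }

/// Assumed shape of the trait: one method that parses a raw sample field.
trait InfoData {
    fn parse(&self, raw: &str) -> Result<Vec<Value>, String>;
}

struct IntegerData;
struct StringData;

impl InfoData for IntegerData {
    fn parse(&self, raw: &str) -> Result<Vec<Value>, String> {
        raw.split(',')
            .map(|item| match item {
                "." => Ok(Value::Missing),
                _ => item
                    .parse::<i64>()
                    .map(Value::Integer)
                    .map_err(|e| format!("expected an integer, found {:?} ({})", item, e)),
            })
            .collect()
    }
}

// Supporting a new field type only needs another impl like this one.
impl InfoData for StringData {
    fn parse(&self, raw: &str) -> Result<Vec<Value>, String> {
        Ok(raw.split(',').map(|item| Value::Text(item.to_string())).collect())
    }
}

/// Hypothetical table mapping FORMAT IDs to boxed parsers.
fn example_parsers() -> HashMap<String, Box<dyn InfoData>> {
    let mut parsers: HashMap<String, Box<dyn InfoData>> = HashMap::new();
    parsers.insert("GQ".to_string(), Box::new(IntegerData));
    parsers.insert("GT".to_string(), Box::new(StringData));
    parsers
}
```

Note that the trait method still has to return one common type, so a `Value`-like enum (or some other type erasure) shows up in the dynamic design too.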
Are there other fields, besides FORMAT, that might better be parsed with dynamic dispatch? If this is the case then it would probably be best to only use one strategy.