Parsing the FORMAT field: static vs. dynamic dispatch #22

@jeff-k

The FORMAT fields in the header describe how to parse the genotype columns for each row:

...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
...

So the per-genotype quality scores for the first variant at position 14370 are 48, 48, and 43.
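As a rough sketch of how those values are located: the FORMAT column gives the position of each key, and every sample column is split on `:` in the same order. The function names `sample_value` and `genotype_qualities` are made up for illustration, and splitting the row on whitespace is an assumption that matches the excerpt above (real VCF rows are tab-separated).

```rust
/// Look up one FORMAT key in one sample column.
/// `format` is the row's FORMAT column, e.g. "GT:GQ:DP:HQ";
/// `sample` is a single sample column, e.g. "0|0:48:1:51,51".
/// Returns None if the key is not in FORMAT or the sample has fewer fields.
fn sample_value<'a>(format: &str, sample: &'a str, key: &str) -> Option<&'a str> {
    let index = format.split(':').position(|k| k == key)?;
    sample.split(':').nth(index)
}

/// Collect the raw GQ value of every sample in one data row.
/// Columns 0..=7 are CHROM..INFO, column 8 is FORMAT, samples start at 9.
fn genotype_qualities(row: &str) -> Vec<Option<&str>> {
    let columns: Vec<&str> = row.split_whitespace().collect();
    let Some(&format) = columns.get(8) else {
        return Vec::new();
    };
    columns
        .iter()
        .skip(9)
        .map(|&sample| sample_value(format, sample, "GQ"))
        .collect()
}
```

Calling `genotype_qualities` on the row at position 14370 should yield the three GQ strings, still unparsed; turning them into integers is the job of the FORMAT definitions discussed below.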

By the way, HQ (Haplotype Quality) is declared with Number=2, so it should be a pair of integers, but some samples have .,. instead, and the last sample on the second row (0/0:41:3) stops after DP and has no HQ value at all?

The valid FORMAT key-value pairs are tightly specified and @MrCurtis has provided test cases for them: #20

The design decisions we need to make now are:

  1. How should we store the header's FORMAT data?
  2. How should we use it to parse each of the body's rows?

Here's one possibility:

enum NumberField {
    Number(u32),
    A,   // The field has one value per alternate allele
    R,   // The field has one value for each possible allele
    G,   // The field has one value for each possible genotype
    Dot, // The number of possible values varies, is unknown or unbounded
}

enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

For each format ID (GT, GQ, DP, ...) we can store an InfoFormat struct that describes how to parse the row's fields. In this example,

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

would translate to:

InfoFormat { fieldtype: DataType::Integer(NumberField::Number(2)), description: "Haplotype Quality".to_string(), source: None, version: None }

And we'd associate this instance with the HQ ID in a HashMap keyed by format ID, or something similar.
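To make the header side concrete, here is a minimal sketch of turning `##FORMAT=<...>` lines into a `HashMap<String, InfoFormat>`. The helper names (`split_pairs`, `parse_format_line`, `parse_format_header`) and the `String` error type are assumptions for illustration, and the quote handling is deliberately naive: it ignores escaped quotes inside descriptions.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

#[derive(Debug, Clone, PartialEq)]
struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

/// Split `key=value` pairs on commas that are not inside double quotes.
fn split_pairs(body: &str) -> Vec<(&str, &str)> {
    let mut pairs = Vec::new();
    let (mut in_quotes, mut start) = (false, 0);
    for (i, c) in body.char_indices() {
        match c {
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => {
                pairs.extend(body[start..i].split_once('='));
                start = i + 1;
            }
            _ => {}
        }
    }
    pairs.extend(body[start..].split_once('='));
    pairs
}

fn unquote(s: &str) -> String {
    s.trim_matches('"').to_string()
}

/// Parse one `##FORMAT=<...>` line into its ID and definition.
fn parse_format_line(line: &str) -> Result<(String, InfoFormat), String> {
    let body = line
        .strip_prefix("##FORMAT=<")
        .and_then(|rest| rest.strip_suffix('>'))
        .ok_or_else(|| format!("not a FORMAT header line: {line}"))?;
    let fields: HashMap<&str, &str> = split_pairs(body).into_iter().collect();
    let get = |key: &str| {
        fields
            .get(key)
            .copied()
            .ok_or_else(|| format!("FORMAT line is missing {key}: {line}"))
    };
    let number = match get("Number")? {
        "A" => NumberField::A,
        "R" => NumberField::R,
        "G" => NumberField::G,
        "." => NumberField::Dot,
        n => NumberField::Number(
            n.parse::<u32>().map_err(|_| format!("invalid Number={n} in {line}"))?,
        ),
    };
    let fieldtype = match get("Type")? {
        "Integer" => DataType::Integer(number),
        "Float" => DataType::Float(number),
        "Flag" => DataType::Flag,
        "Character" => DataType::Character(number),
        "String" => DataType::String(number),
        other => return Err(format!("unknown Type={other} in {line}")),
    };
    let definition = InfoFormat {
        fieldtype,
        description: unquote(get("Description")?),
        source: fields.get("Source").map(|s| unquote(s)),
        version: fields.get("Version").map(|s| unquote(s)),
    };
    Ok((get("ID")?.to_string(), definition))
}

/// Build the ID -> definition map from all header lines, skipping non-FORMAT ones.
fn parse_format_header<'a>(
    lines: impl IntoIterator<Item = &'a str>,
) -> Result<HashMap<String, InfoFormat>, String> {
    lines
        .into_iter()
        .filter(|line| line.starts_with("##FORMAT="))
        .map(parse_format_line)
        .collect()
}
```

Collecting into `Result<HashMap<_, _>, String>` stops at the first malformed line, which may or may not be the behaviour we want; gathering all errors would be an alternative.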

Then, when we parse the sample columns of each row, we match on DataType to determine how to parse each field. This is also the natural place to report a detailed error message when a value doesn't fit its declared type.
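A sketch of what that match could look like. The `Value` enum, `parse_list`, and `parse_field` are hypothetical names, not part of the proposal above. Only fixed `Number=n` counts are checked here, since A, R, and G depend on the row's alleles, and a lone `.` is accepted in place of a whole list, which is my reading of the spec's missing-value rule.

```rust
#[derive(Debug, Clone, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

/// Hypothetical parsed value; `None` entries stand for "." (missing).
#[derive(Debug, PartialEq)]
enum Value {
    Integers(Vec<Option<i32>>),
    Floats(Vec<Option<f32>>),
    Flag,
    Characters(Vec<Option<char>>),
    Strings(Vec<Option<String>>),
}

/// Parse a comma-separated list, mapping "." to None and naming the
/// offending ID and item in any error.
fn parse_list<T>(
    id: &str,
    raw: &str,
    number: &NumberField,
    parse_one: impl Fn(&str) -> Option<T>,
) -> Result<Vec<Option<T>>, String> {
    let items = raw
        .split(',')
        .map(|item| {
            if item == "." {
                Ok(None)
            } else {
                parse_one(item)
                    .map(Some)
                    .ok_or_else(|| format!("{id}: cannot parse {item:?} in {raw:?}"))
            }
        })
        .collect::<Result<Vec<_>, String>>()?;
    if let NumberField::Number(n) = number {
        if items.len() != *n as usize && raw != "." {
            return Err(format!("{id}: expected {n} value(s), found {} in {raw:?}", items.len()));
        }
    }
    Ok(items)
}

/// Parse one raw field according to its declared DataType.
fn parse_field(id: &str, fieldtype: &DataType, raw: &str) -> Result<Value, String> {
    match fieldtype {
        DataType::Integer(n) => {
            parse_list(id, raw, n, |s| s.parse::<i32>().ok()).map(Value::Integers)
        }
        DataType::Float(n) => {
            parse_list(id, raw, n, |s| s.parse::<f32>().ok()).map(Value::Floats)
        }
        DataType::Flag => Ok(Value::Flag),
        DataType::Character(n) => parse_list(id, raw, n, |s| {
            // Accept exactly one character.
            let mut chars = s.chars();
            match (chars.next(), chars.next()) {
                (Some(c), None) => Some(c),
                _ => None,
            }
        })
        .map(Value::Characters),
        DataType::String(n) => {
            parse_list(id, raw, n, |s| Some(s.to_string())).map(Value::Strings)
        }
    }
}
```

Because the match is exhaustive, adding a variant to `DataType` becomes a compile error here until the new case is handled.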

Contrast this with the dynamic dispatch solution:

trait InfoData {
    // Define methods that all InfoData must implement
}

struct InfoFormat {
    field_data: Box<dyn InfoData>,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

struct IntegerData {
    number: NumberField,
    // Other fields specific to IntegerData
}

impl InfoData for IntegerData {
    // Implement the methods from InfoData trait
}

In terms of performance, the choice is between matching on the field type at runtime and calling the InfoData parsing methods through a vtable. I think the enum approach would benefit from inlining, since the compiler can see every variant (monomorphisation proper would only come into play if we used generics). I don't know which strategy guarantees better performance, but my guess is static dispatch.

The appeal of dynamic dispatch is that it's much more flexible: we could add a new field type by just implementing InfoData on it. On the other hand, the VCF spec clearly limits the scope of what values we're expecting.
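To illustrate that flexibility, here is a small sketch in which a second field type is added without touching the trait or the existing implementation. The `validate` method, `GenotypeData`, and `format_registry` are invented for this example, and `IntegerData` holds a plain count instead of a `NumberField` to keep it short.

```rust
use std::collections::HashMap;

trait InfoData {
    /// Hypothetical method: check that one raw field value is well-formed.
    fn validate(&self, raw: &str) -> Result<(), String>;
}

struct IntegerData {
    count: usize,
}

impl InfoData for IntegerData {
    fn validate(&self, raw: &str) -> Result<(), String> {
        let items: Vec<&str> = raw.split(',').collect();
        if items.len() != self.count {
            return Err(format!(
                "expected {} integer(s), found {}",
                self.count,
                items.len()
            ));
        }
        for item in items {
            // "." marks a missing value and is always accepted.
            if item != "." && item.parse::<i32>().is_err() {
                return Err(format!("{item:?} is not an integer"));
            }
        }
        Ok(())
    }
}

/// A field type added later: only this struct and its impl are new.
struct GenotypeData;

impl InfoData for GenotypeData {
    fn validate(&self, raw: &str) -> Result<(), String> {
        // Alleles are separated by '|' (phased) or '/' (unphased).
        let ok = raw
            .split(|c: char| c == '|' || c == '/')
            .all(|allele| allele == "." || allele.parse::<u32>().is_ok());
        if ok {
            Ok(())
        } else {
            Err(format!("{raw:?} is not a genotype"))
        }
    }
}

/// Map each format ID to a boxed trait object.
fn format_registry() -> HashMap<&'static str, Box<dyn InfoData>> {
    let mut registry: HashMap<&'static str, Box<dyn InfoData>> = HashMap::new();
    registry.insert("HQ", Box::new(IntegerData { count: 2 }));
    registry.insert("GT", Box::new(GenotypeData));
    registry
}
```

The cost is that the set of field types is no longer visible in one place, so the compiler can't tell us when a case is missing the way an exhaustive `match` does.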

Are there other fields, besides FORMAT, that might better be parsed with dynamic dispatch? If this is the case then it would probably be best to only use one strategy.
