Haplotype Region Handling¶
In this document, we will go into further detail about how regions are reported in the alternative allele field of an hVCF file and possible edge cases that may arise in hVCF creation.
Alternative allele fields - an overview¶
Note
For more information about this hVCF field and other general information about hVCF specifications, please review the haplotype VCF specification article.
The alternative allele (##ALT
) field contains key-value information
that represents the haplotype sequence and ID for a given
reference-range in the PHG. One key-value pair is the Regions
key.
This represents the physical location of the sequence in the genome
assembly FASTA file.
A common example of this would be the following example:
##ALT=<ID=57705b1e2541c7634ea59a48fc52026f,Description="haplotype data for line: LineA",Source="data/test/smallseq/LineA.fa",Regions=1:14-19,Checksum=Md5,RefRange=06ae4e937668d301e325d43725a38c3f>
...we have the following regions key:
...for the following assembly/sample:
This indicates that for LineA
, we have a haplotype sequence that
aligned against the given reference range (denoted with the MD5 hash,
06ae4e937668d301e325d43725a38c3f
) at chromosome 1 (1:
) between
the values of 14 and 19 (14-19
) base-pairs. We can represent this
graphically with a dot plot:
In the above example, LineA
(y-axis) is aligned against our
reference genome, in this case, identified as Reference
(x-axis).
The reference range boundary is located on chromosome 1 between the
base pairs of 3 and 8 (denoted by the dashed lines). Given our
hypothetical alignment scenario, our sample, LineA
, aligned against
the reference genome also on chromosome 1, but due to structural
variations between the two genomes, matches at different base-pair
positions (in this case between 14
and 19
base-pairs). Using
this region value, we can navigate back to LineA
's FASTA file and
identify the sequence region:
Possible edge cases - inversions¶
Given the possibility of structural diversity events between the reference and assembly genomes. There are edge cases where a reference range boundary can land on inversion regions. For example, let's illustrate this scenario with another dot plot:
Here, when we align LineA
against the reference, there is an
inversion at the end of this boundary. How do we represent this using
our Region
key-value pairing scheme in the alternate allele field?
The Region
key can house multiple region elements using comma
(,
) separated values. Using the above example, this can be
displayed as follows:
In the above example, we provide multiple regions (separated by commas) to indicate the normal and inverted portions of the alignment:
Region | Chromosome | Range (bp) | Type |
---|---|---|---|
1:13-16 |
1 | 13 to 16 | normal |
1:21-19 |
1 | 21 to 19 | inverted |
Since the last element of the region is inverted, the larger value
will be the first value left of the -
character and the smaller
value will be to the right of the -
character. This will indicate
to PHGv2 that reverse transcription of the sequence will need to
occur before the MD5 hashing step.