hVCF - haplotype Variant Call Format Specification¶

Specification version: v2.4
Date: 2024-11-19

Overview¶

hVCF stands for haplotype Variant Call Format. This format is used to store and encode haplotype information across samples from the PHG (Practical Haplotype Graph). An hVCF file is based on the standards of a VCF file, specifically, VCF v4.2. This format leverages VCF's symbolic allele information from the ALT field.

The hVCF specification¶

hVCF files can be broken into 3 main components: * Meta-information lines * Header line * Data lines containing information for each reference range * Fixed fields * Haplotype fields

An example¶

The following code block illustrates a formatted and merged example hVCF file:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=06ae4e937668d301e325d43725a38c3f,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:45001-49500,Checksum=06ae4e937668d301e325d43725a38c3f,RefChecksum=06ae4e937668d301e325d43725a38c3,RefRange=1:45001-49500>
##ALT=<ID=073286a82fe47d6a370e8a7a3803f1d3,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:39501-44000,Checksum=073286a82fe47d6a370e8a7a3803f1d3,RefChecksum=073286a82fe47d6a370e8a7a3803f1d,RefRange=1:39501-44000>
##ALT=<ID=105c85412229b45439db1f03c3f064f4,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:27501-28500,Checksum=105c85412229b45439db1f03c3f064f4,RefChecksum=105c85412229b45439db1f03c3f064f,RefRange=1:27501-28500>
##ALT=<ID=105e63346a01d88e8339eddf9131c435,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:50501-55000,Checksum=105e63346a01d88e8339eddf9131c435,RefChecksum=105e63346a01d88e8339eddf9131c43,RefRange=2:50501-55000>
##ALT=<ID=2c4b8564bbbdf70c6560fdefdbe3ef6a,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:34001-38500,Checksum=2c4b8564bbbdf70c6560fdefdbe3ef6a,RefChecksum=2c4b8564bbbdf70c6560fdefdbe3ef6,RefRange=2:34001-38500>
##ALT=<ID=347f0478b1a553ef107243cb60a9ba7d,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:11001-12000,Checksum=347f0478b1a553ef107243cb60a9ba7d,RefChecksum=347f0478b1a553ef107243cb60a9ba7,RefRange=2:11001-12000>
##ALT=<ID=39f96726321b329964435865b3694fd2,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:49501-50500,Checksum=39f96726321b329964435865b3694fd2,RefChecksum=39f96726321b329964435865b3694fd,RefRange=2:49501-50500>
##ALT=<ID=43687e13112bbe841f811b0a9de82a94,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:22001-23000,Checksum=43687e13112bbe841f811b0a9de82a94,RefChecksum=43687e13112bbe841f811b0a9de82a9,RefRange=2:22001-23000>
##ALT=<ID=546d1839623a5b0ea98bbff9a8a320e2,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:1-1000,Checksum=546d1839623a5b0ea98bbff9a8a320e2,RefChecksum=546d1839623a5b0ea98bbff9a8a320e2,RefRange=1:1-1000>
##ALT=<ID=57705b1e2541c7634ea59a48fc52026f,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:1001-5500,Checksum=57705b1e2541c7634ea59a48fc52026f,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>
##ALT=<ID=1bda8c63ae8e2f3678b85bac0ee7b8b9,Description="haplotype data for line: B97",Source="data/test/smallseq/B97.fa",SampleName=B97,Regions=1:1250-6750,Checksum=1bda8c63ae8e2f3678b85bac0ee7b8b9,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>
##ALT=<ID=5fedf293a1a5443cc896d59f12d1b92f,Description="haplotype data for line: CML231"Source="data/test/smallseq/CML231.fa",SampleName=CML231,Regions=2:22001-23000,Checksum=5fedf293a1a5443cc896d59f12d1b92f,RefChecksum=43687e13112bbe841f811b0a9de82a94,RefRange=2:22001-23000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##contig=<ID=1,length=55000>
##contig=<ID=2,length=55000>
##reference=https://s3.amazonaws.com/maizegenetics/phg/phgV2Test/Ref.fa
#CHROM POS   ID REF ALT                                                                   QUAL FILTER INFO      FORMAT Ref B97 CML231
1      1     .  G   <546d1839623a5b0ea98bbff9a8a320e2>                                    .    .      END=1000  GT     1   1   1
1      1001  .  A   <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> .    .      END=5500  GT     1   2   1
1      27501 .  G   <105c85412229b45439db1f03c3f064f4>                                    .    .      END=28500 GT     1   1   1
1      39501 .  G   <073286a82fe47d6a370e8a7a3803f1d3>                                    .    .      END=44000 GT     1   1   1
1      45001 .  G   <06ae4e937668d301e325d43725a38c3f>                                    .    .      END=49500 GT     1   1   1
2      11001 .  A   <347f0478b1a553ef107243cb60a9ba7d>                                    .    .      END=12000 GT     1   1   1
2      22001 .  A   <43687e13112bbe841f811b0a9de82a94>,<5fedf293a1a5443cc896d59f12d1b92f> .    .      END=23000 GT     1   1   2
2      34001 .  A   <2c4b8564bbbdf70c6560fdefdbe3ef6a>                                    .    .      END=38500 GT     1   1   1
2      49501 .  G   <39f96726321b329964435865b3694fd2>                                    .    .      END=50500 GT     1   1   1
2      50501 .  G   <105e63346a01d88e8339eddf9131c435>                                    .    .      END=55000 GT     1   1   1

Note

In the prior example, the hVCF output columns below the header line (e.g. below the line starting with #CHROM) are formatted for visual clarity. In a real example, delimiters are tab (\t) based.

Meta-information lines¶

The header portion of an hVCF file contain rows of "meta-information" , which are lines that start with ## and must appear first in the file. Like a VCF file, hVCF files can contain both unstructured and structured meta-information.

Unstructured meta-information is characterized by a straightforward pairing of key=value logic. For instance, in the previous illustration, the ##fileformat=VCF4.2 represents unstructured meta-information, with ##=fileformat serving as the key and VCF4.2 as the corresponding value.

Structured meta-information also consists of a key-value pair, but in this case, the value is a collection of additional key-value pairs separated by a comma (,) and enclosed with angle brackets (< and >). In our prior example, sequence information fields (e.g. ##contig=<ID=chr7,length=461>) represent structured meta-information where ##contig is the primary key and <ID=chr7,length=461> is the value containing nested key-value pairs:

ID=chr7
length=461

File format (`##fileformat`) field¶

Similar to VCF, a single line containing file format information (e.g. fileformat) must be the first line in the file. In the case of hVCF files, the version must be version 4.4 of the VCF specification (VCFv4.4).

Alternative allele (`##ALT`) field¶

The primary driver of the hVCF specification is information stored within the structured alternative allele field. At its core, the alternative allele field contains two primary key-values pairs, the ID and Description which describe symbolic alternate alleles in the ALT column of VCF records. While this field is usually used to describe possible structural variants and IUPAC ambiguity codes, here it is used to represent a haplotype sequence and ID for a given reference range. This is achieved by defining the ID value with an MD5 checksum of the given haplotype sequence and defining the Description value with information about the origin of the haplotype sequence.

Since this haplotype sequence is (I) derived from a particular sample, (II) related to reference range information, and (III) has its own positional information, we can populate the alternative allele field with additional key-value information. Take the following example:

##ALT=<ID=1bda8c63ae8e2f3678b85bac0ee7b8b9,Description="haplotype data for line: B97",Source="data/test/smallseq/B97.fa",SampleName=B97,Regions=1:1250-6750,Checksum=1bda8c63ae8e2f3678b85bac0ee7b8b9,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>

Here, we have the following information:

Key	Value	Description
`ID`	`1bda8c63ae8e2f3678b85bac0ee7b8b9`	Identifier for the given haplotype sequence. Can be either the MD5 checksum or a 1-based genomic range identifier format (i.e., `contig:start-end`)
`Description`	`"haplotype data for line: B97"`	Information about the origin of the haplotype sequence.
`Source`	`"data/test/smallseq/B97.fa"`	Fasta file ID and path containing haplotype sequence
`SampleName`	`B97`	The sample ID from which the haplotype originated.
`Regions`	`1:1250-6750`	List of genomic regions which make up the haplotype. Regions are also represented in a 1-based genomic range identifier format (i.e., `contig:start-end`). If sub-regions are present, they will be separated by commas (e.g., `1:100-200,1:205-300`).
`Checksum`	`1bda8c63ae8e2f3678b85bac0ee7b8b9`	The MD5 checksum for the haplotype sequence.
`RefChecksum`	`06ae4e937668d301e325d43725a38c3f`	The MD5 checksum for the reference range sequence. If the sample in question is the reference assembly, this value will be the same as the value found in the `Checksum` key.
`RefRange`	`1:45001-49500`	The genomic region the reference range sequence originate. Also represented in a 1-based genomic range identifier format (i.e., `contig:start-end`).

Individual format (`##FORMAT`) field¶

The meta-information contained in the individual format field closely adheres to the VCF specification. This structured field provides a description of the IDs found within the FORMAT column of the data rows. The necessary keys are as follows:

Key	Description
`ID`	Identifier for `FORMAT` entry
`Number`	Number (integer) of values representing `ID`
`Type`	Data type for `ID`
`Description`	Descriptive information about `ID`

Information (`##INFO`) field¶

Much like the ##FORMAT field, the ##INFO field is a structured meta-information field that provides details pertaining to each reference range and the corresponding haplotype data contained within those reference ranges. Similar to the ##FORMAT field, the necessary keys are as follows:

Key	Description
`ID`	Identifier for `FORMAT` entry. Defaults to `END`
`Number`	Number (integer) of values representing `ID`
`Type`	Data type for `ID`
`Description`	Descriptive information about `ID`

In order to properly represent the information regarding reference ranges, the following values are required:

Value	Description
`End`	End position of reference range (bp)

By combining the values identified within the POS column and the END value, we can specify the total length of the reference range along with assembly information.

Sequence information (`##contig`) field¶

The contig field is used to detail additional attributes for each sequence represented within the haplotype data. For now, this is a structured field requiring the identifier (ID) for the sequence and the length of the mentioned sequence (length)

Like the VCF specifications, the 8 mandatory tab-delimited (\t) column headers are required:

#CHROM
POS
ID
REF
ALT
QUAL
FILTER

Since genotype (i.e. haplotype information) data is also required, the FORMAT column is also required with the GT identifier. More information about each of these fields is discussed in the next section.

Note

The end of the line must have no tab characters (\t).

Data lines¶

Fixed fields¶

There are 8 fixed fields for each reference range record:

Field	Description
`CHROM`	sequence identifier
`POS`	start position of reference range
`ID`	an optional identifier for a reference range record
`REF`	The allele at the start position for reference haplotype sequence
`ALT`	MD5 hash sums for each possible haplotype sequence detected at a given reference range record
`QUAL`	quality score (needed to satisfy VCF specifications)
`FILTER`	filter status (needed to satisfy VCF specifications)
`INFO`	information field used to represent the end position (`END`) value for the reference range record

Haplotype fields¶

Haplotype information must be specified by first creating a format field (FORMAT) field along with a genotype (GT) identifier.

The following fields proceeding the FORMAT field are specified with the sample (e.g. taxa) identifier for each given sample referenced in the hVCF file. For example, let's take a look at a given record in the prior example (with added header for additional clarity):

#CHROM POS   ID REF ALT                                                                   QUAL FILTER INFO      FORMAT Ref B97 CML231
1      1001  .  A   <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> .    .      END=5500  GT     1   2   1

One thing you will notice is that there are no calls to the "reference" allele field; only calls to the alternate field since these allele values represent the haplotype sequence in MD5 hash form. Allele values, if using haploid path finding, are represented using singular values (e.g. 1, 2) which represent the indexed order of haplotype sequences in the ALT field. In other terms, if a sample has an allele value of 1, this would refer to the first symbolic allele in the ALT field for the haploid value.

Using this information with the prior example, we can infer the following haplotype sequence information for the given reference range record (1:1001-5500):

Sample ID	Allele values	MD5 symbolic allele
`Ref`	`1`	`57705b1e2541c7634ea59a48fc52026f`
`B97`	`2`	`1bda8c63ae8e2f3678b85bac0ee7b8b9`
`CML231`	`1`	`57705b1e2541c7634ea59a48fc52026f`

Alternatively, allele values in hVCF files can be generated using diploid path finding during the PHGv2 imputation process. Here is an example entry of this:

#CHROM POS   ID REF ALT                                                                   QUAL FILTER INFO      FORMAT Ref B97 CML231
1      1001  .  A   <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> .    .      END=5500  GT     1|1 2|1 1|1

Allele values are separated with a "phased" indicator (|) and never with an "unphased" indicator (/). Similar to haploid path finding, allele values represent the indexed order of haplotype sequences in the ALT field. In other terms, if a sample has an allele value of 2|1, this would refer to the second symbolic allele in the ALT field for the first gamete and the first symbolic allele for the second gamete.

Using this information with the prior example, we can infer the following haploid sequence information for the given reference range record (1:1001-5500) using diploid values:

Sample ID	Allele values	MD5 symbolic allele (gamete 1)	MD5 symbolic allele (gamete 2)
`Ref`	`1\|1`	`57705b1e2541c7634ea59a48fc52026f`	`57705b1e2541c7634ea59a48fc52026f`
`B97`	`2\|1`	`1bda8c63ae8e2f3678b85bac0ee7b8b9`	`57705b1e2541c7634ea59a48fc52026f`
`CML231`	`1\|1`	`57705b1e2541c7634ea59a48fc52026f`	`57705b1e2541c7634ea59a48fc52026f`