hVCF - haplotype Variant Call Format Specification
- Specification version: v2.4
- Date: 2024-11-19
Overview
hVCF stands for haplotype Variant Call Format. This
format is used to store and encode haplotype information across
samples from the PHG (Practical Haplotype Graph).
An hVCF file is based on the standards of a VCF file, specifically,
VCF v4.2. This
format leverages VCF's symbolic allele information from the ALT
field.
The hVCF specification
hVCF files can be broken into 3 main components:
* Meta-information lines
* Header line
* Data lines containing information for each reference range
    * Fixed fields
    * Haplotype fields
An example
The following code block illustrates a formatted and merged example hVCF file:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=06ae4e937668d301e325d43725a38c3f,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:45001-49500,Checksum=06ae4e937668d301e325d43725a38c3f,RefChecksum=06ae4e937668d301e325d43725a38c3f,RefRange=1:45001-49500>
##ALT=<ID=073286a82fe47d6a370e8a7a3803f1d3,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:39501-44000,Checksum=073286a82fe47d6a370e8a7a3803f1d3,RefChecksum=073286a82fe47d6a370e8a7a3803f1d3,RefRange=1:39501-44000>
##ALT=<ID=105c85412229b45439db1f03c3f064f4,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:27501-28500,Checksum=105c85412229b45439db1f03c3f064f4,RefChecksum=105c85412229b45439db1f03c3f064f4,RefRange=1:27501-28500>
##ALT=<ID=105e63346a01d88e8339eddf9131c435,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:50501-55000,Checksum=105e63346a01d88e8339eddf9131c435,RefChecksum=105e63346a01d88e8339eddf9131c435,RefRange=2:50501-55000>
##ALT=<ID=2c4b8564bbbdf70c6560fdefdbe3ef6a,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:34001-38500,Checksum=2c4b8564bbbdf70c6560fdefdbe3ef6a,RefChecksum=2c4b8564bbbdf70c6560fdefdbe3ef6a,RefRange=2:34001-38500>
##ALT=<ID=347f0478b1a553ef107243cb60a9ba7d,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:11001-12000,Checksum=347f0478b1a553ef107243cb60a9ba7d,RefChecksum=347f0478b1a553ef107243cb60a9ba7d,RefRange=2:11001-12000>
##ALT=<ID=39f96726321b329964435865b3694fd2,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:49501-50500,Checksum=39f96726321b329964435865b3694fd2,RefChecksum=39f96726321b329964435865b3694fd2,RefRange=2:49501-50500>
##ALT=<ID=43687e13112bbe841f811b0a9de82a94,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=2:22001-23000,Checksum=43687e13112bbe841f811b0a9de82a94,RefChecksum=43687e13112bbe841f811b0a9de82a94,RefRange=2:22001-23000>
##ALT=<ID=546d1839623a5b0ea98bbff9a8a320e2,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:1-1000,Checksum=546d1839623a5b0ea98bbff9a8a320e2,RefChecksum=546d1839623a5b0ea98bbff9a8a320e2,RefRange=1:1-1000>
##ALT=<ID=57705b1e2541c7634ea59a48fc52026f,Description="haplotype data for line: Ref",Source="data/test/smallseq/Ref.fa",SampleName=Ref,Regions=1:1001-5500,Checksum=57705b1e2541c7634ea59a48fc52026f,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>
##ALT=<ID=1bda8c63ae8e2f3678b85bac0ee7b8b9,Description="haplotype data for line: B97",Source="data/test/smallseq/B97.fa",SampleName=B97,Regions=1:1250-6750,Checksum=1bda8c63ae8e2f3678b85bac0ee7b8b9,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>
##ALT=<ID=5fedf293a1a5443cc896d59f12d1b92f,Description="haplotype data for line: CML231",Source="data/test/smallseq/CML231.fa",SampleName=CML231,Regions=2:22001-23000,Checksum=5fedf293a1a5443cc896d59f12d1b92f,RefChecksum=43687e13112bbe841f811b0a9de82a94,RefRange=2:22001-23000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##contig=<ID=1,length=55000>
##contig=<ID=2,length=55000>
##reference=https://s3.amazonaws.com/maizegenetics/phg/phgV2Test/Ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ref B97 CML231
1 1 . G <546d1839623a5b0ea98bbff9a8a320e2> . . END=1000 GT 1 1 1
1 1001 . A <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> . . END=5500 GT 1 2 1
1 27501 . G <105c85412229b45439db1f03c3f064f4> . . END=28500 GT 1 1 1
1 39501 . G <073286a82fe47d6a370e8a7a3803f1d3> . . END=44000 GT 1 1 1
1 45001 . G <06ae4e937668d301e325d43725a38c3f> . . END=49500 GT 1 1 1
2 11001 . A <347f0478b1a553ef107243cb60a9ba7d> . . END=12000 GT 1 1 1
2 22001 . A <43687e13112bbe841f811b0a9de82a94>,<5fedf293a1a5443cc896d59f12d1b92f> . . END=23000 GT 1 1 2
2 34001 . A <2c4b8564bbbdf70c6560fdefdbe3ef6a> . . END=38500 GT 1 1 1
2 49501 . G <39f96726321b329964435865b3694fd2> . . END=50500 GT 1 1 1
2 50501 . G <105e63346a01d88e8339eddf9131c435> . . END=55000 GT 1 1 1
Note
In the prior example, the hVCF output columns below the header line
(i.e. below the line starting with #CHROM) are formatted for
visual clarity. In a real file, fields are delimited by tab
characters (\t).
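The three components listed earlier can be separated with a few lines of code. The following is a minimal Python sketch rather than part of any PHG tooling; the function name `split_hvcf` and the shortened two-line example are illustrative assumptions. It relies only on the rules stated in this document: meta-information lines start with `##`, the header line starts with `#CHROM`, and fields are tab-delimited.

```python
def split_hvcf(text):
    """Split hVCF text into meta-information lines, header columns, and data rows."""
    meta, header, data = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            meta.append(line)                # meta-information line, kept verbatim
        elif line.startswith("#CHROM"):
            header = line.split("\t")        # header line, split into column names
        else:
            data.append(line.split("\t"))    # data line, split into fields
    return meta, header, data


# Shortened example built with real tab characters.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "##contig=<ID=1,length=55000>",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "Ref", "B97", "CML231"]),
    "\t".join(["1", "1", ".", "G", "<546d1839623a5b0ea98bbff9a8a320e2>",
               ".", ".", "END=1000", "GT", "1", "1", "1"]),
])

meta, header, data = split_hvcf(example)
print(len(meta), header[9:], data[0][7])
```

Splitting on `\t` only (instead of arbitrary whitespace) keeps quoted descriptions and other fields containing spaces intact.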
Meta-information lines
The header portion of an hVCF file contains rows of "meta-information",
which are lines that start with ## and must appear first in the
file. Like a VCF file, hVCF files can contain both unstructured
and structured meta-information.
Unstructured meta-information is characterized by a
straightforward pairing of key=value logic. For instance, in the
previous example, the line ##fileformat=VCFv4.2 represents
unstructured meta-information, with ##fileformat serving as the
key and VCFv4.2 as the corresponding value.
Structured meta-information also consists of a key-value pair,
but in this case, the value is a collection of additional key-value
pairs separated by commas (,) and enclosed in angle brackets
(< and >). In our prior example, sequence information fields
(e.g. ##contig=<ID=1,length=55000>) represent structured
meta-information where ##contig is the primary key and
<ID=1,length=55000> is the value containing nested key-value pairs:
ID=1
length=55000
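To make the distinction between unstructured and structured meta-information concrete, here is a hedged Python sketch of a parser for a single meta-information line. The function name `parse_meta_line` and the regular expression are assumptions made for illustration, not part of the specification. The pattern treats a value as either a double-quoted string or everything up to the next `,key=` boundary, so unquoted values that themselves contain commas (such as a multi-part `Regions` list) stay in one piece.

```python
import re

# A nested pair is key=value, where value is either a quoted string or
# any text up to the next ",key=" boundary (or the end of the string).
_PAIR = re.compile(r'(\w+)=("[^"]*"|.*?)(?=,\w+=|$)')


def parse_meta_line(line):
    """Return (key, value); value is a dict for structured lines, else a str."""
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with '##'")
    key, _, value = line[2:].partition("=")
    if value.startswith("<") and value.endswith(">"):
        fields = {}
        for name, raw in _PAIR.findall(value[1:-1]):
            # Drop the surrounding double quotes of quoted values.
            if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
                raw = raw[1:-1]
            fields[name] = raw
        return key, fields
    return key, value


print(parse_meta_line("##fileformat=VCFv4.2"))
print(parse_meta_line("##contig=<ID=1,length=55000>"))
```

All values are returned as strings; converting `length` to an integer is left to the caller.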
File format (##fileformat) field
Similar to VCF, a single line containing file format information
(i.e. ##fileformat) must be the first line in the file. In the
case of hVCF files, the version must be version 4.2 of the VCF
specification (VCFv4.2).
Alternative allele (##ALT) field
The primary driver of the hVCF specification is information stored
within the structured alternative allele field. At its core, the
alternative allele field contains two primary key-value pairs, the
ID and Description, which describe symbolic alternate alleles in
the ALT column of VCF records. While this field is usually used
to describe possible structural variants and
IUPAC ambiguity codes,
here it is used to represent a haplotype sequence and ID for a
given reference range. This is achieved by defining the ID value
with an MD5 checksum of the
given haplotype sequence and defining the Description value with
information about the origin of the haplotype sequence.
Since this haplotype sequence (I) is derived from a particular sample, (II) is related to reference range information, and (III) has its own positional information, we can populate the alternative allele field with additional key-value information. Take the following example:
##ALT=<ID=1bda8c63ae8e2f3678b85bac0ee7b8b9,Description="haplotype data for line: B97",Source="data/test/smallseq/B97.fa",SampleName=B97,Regions=1:1250-6750,Checksum=1bda8c63ae8e2f3678b85bac0ee7b8b9,RefChecksum=57705b1e2541c7634ea59a48fc52026f,RefRange=1:1001-5500>
Here, we have the following information:
Key | Value | Description |
---|---|---|
ID | 1bda8c63ae8e2f3678b85bac0ee7b8b9 | Identifier for the given haplotype sequence. Can be either the MD5 checksum or a 1-based genomic range identifier (i.e., contig:start-end). |
Description | "haplotype data for line: B97" | Information about the origin of the haplotype sequence. |
Source | "data/test/smallseq/B97.fa" | Path and file name of the FASTA file containing the haplotype sequence. |
SampleName | B97 | The sample ID from which the haplotype originated. |
Regions | 1:1250-6750 | List of genomic regions which make up the haplotype. Regions are also represented in a 1-based genomic range identifier format (i.e., contig:start-end). If sub-regions are present, they will be separated by commas (e.g., 1:100-200,1:205-300). |
Checksum | 1bda8c63ae8e2f3678b85bac0ee7b8b9 | The MD5 checksum for the haplotype sequence. |
RefChecksum | 57705b1e2541c7634ea59a48fc52026f | The MD5 checksum for the reference range sequence. If the sample in question is the reference assembly, this value will be the same as the value found in the Checksum key. |
RefRange | 1:1001-5500 | The genomic region from which the reference range sequence originates. Also represented in a 1-based genomic range identifier format (i.e., contig:start-end). |
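As an illustration of how these keys fit together, the sketch below computes an MD5 checksum for a sequence and assembles an `##ALT` line from it. This is a Python sketch under stated assumptions: the helper names (`md5_checksum`, `format_alt_line`) and the toy sequences are hypothetical, and the exact bytes that PHG hashes (for example, letter case or line-break handling) are not defined in this document, so the sequence string is hashed exactly as given.

```python
import hashlib


def md5_checksum(sequence):
    """MD5 hex digest of a sequence string (assumption: hashed as-is, ASCII)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def format_alt_line(sequence, ref_sequence, sample, source, regions, ref_range):
    """Assemble an ##ALT meta-information line with the keys described above."""
    checksum = md5_checksum(sequence)
    fields = [
        f"ID={checksum}",
        f'Description="haplotype data for line: {sample}"',
        f'Source="{source}"',
        f"SampleName={sample}",
        f"Regions={','.join(regions)}",   # sub-regions are comma-separated
        f"Checksum={checksum}",
        f"RefChecksum={md5_checksum(ref_sequence)}",
        f"RefRange={ref_range}",
    ]
    return "##ALT=<" + ",".join(fields) + ">"


# Toy sequences and a made-up path, for illustration only.
alt_line = format_alt_line(
    sequence="ACGTTGCA",
    ref_sequence="ACGTAGCA",
    sample="B97",
    source="example/B97.fa",
    regions=["1:1250-1257"],
    ref_range="1:1001-1008",
)
print(alt_line)
```

When the sample is the reference itself, passing the same string for `sequence` and `ref_sequence` yields identical `Checksum` and `RefChecksum` values, matching the rule in the table.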
Individual format (##FORMAT) field
The meta-information contained in the individual format field closely
adheres to the VCF specification. This structured field provides a
description of the IDs found within the FORMAT column of the data
rows. The necessary keys are as follows:
Key | Description |
---|---|
ID | Identifier for FORMAT entry |
Number | Number (integer) of values representing ID |
Type | Data type for ID |
Description | Descriptive information about ID |
Information (##INFO) field
Much like the ##FORMAT field, the ##INFO field is a structured
meta-information field that provides details pertaining to each
reference range and the corresponding haplotype data contained within
those reference ranges. Similar to the ##FORMAT field, the
necessary keys are as follows:
Key | Description |
---|---|
ID | Identifier for INFO entry. Defaults to END |
Number | Number (integer) of values representing ID |
Type | Data type for ID |
Description | Descriptive information about ID |
In order to properly represent the information regarding reference ranges, the following values are required:
Value | Description |
---|---|
END | End position of reference range (bp) |
By combining the value in the POS column with the END value, we can
specify where the reference range lies and its total length
(END - POS + 1 bp).
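The relationship between POS and END can be sketched in a few lines of Python. The helper names `parse_info` and `reference_range` are hypothetical; the arithmetic assumes 1-based, inclusive coordinates, as in the contig:start-end identifiers used elsewhere in this document.

```python
def parse_info(info):
    """Parse an INFO string such as 'END=5500' into a dict of strings."""
    fields = {}
    for entry in info.split(";"):      # VCF separates INFO entries with ';'
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def reference_range(chrom, pos, info):
    """Return (range identifier, length in bp) for one data line."""
    end = int(parse_info(info)["END"])
    length = end - pos + 1             # both ends are inclusive
    return f"{chrom}:{pos}-{end}", length


# Second record of the example file: CHROM=1, POS=1001, INFO=END=5500
print(reference_range("1", 1001, "END=5500"))  # → ('1:1001-5500', 4500)
```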
Sequence information (##contig) field
The contig field is used to detail additional attributes for
each sequence represented within the haplotype data. For now,
this is a structured field requiring the identifier (ID) for the
sequence and the length of that sequence (length).
Header
As in the VCF specification, the following 8 mandatory
tab-delimited (\t) column headers are required:
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
Since genotype data (i.e. haplotype information) is also required,
the FORMAT column must be present with the GT identifier, followed
by one column per sample.
More information about each of these fields is discussed in the
next section.
Note
The header line must not end with a tab character (\t).
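The header rules above lend themselves to a simple check. The following Python sketch is illustrative only (the names `FIXED_COLUMNS` and `sample_names` are assumptions): it verifies the 8 fixed columns, the FORMAT column, the absence of a trailing tab, and then returns the sample identifiers.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line):
    """Validate an hVCF header line and return its sample column names."""
    line = header_line.rstrip("\r\n")          # tolerate a trailing newline
    if line.endswith("\t"):
        raise ValueError("header line must not end with a tab character")
    columns = line.split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("the 8 fixed column headers are missing or out of order")
    if len(columns) < 10 or columns[8] != "FORMAT":
        raise ValueError("FORMAT and at least one sample column are required")
    return columns[9:]


header = "\t".join(FIXED_COLUMNS + ["FORMAT", "Ref", "B97", "CML231"])
samples = sample_names(header)
print(samples)  # → ['Ref', 'B97', 'CML231']
```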
Data lines
Fixed fields
There are 8 fixed fields for each reference range record:
Field | Description |
---|---|
CHROM | sequence identifier |
POS | start position of reference range |
ID | an optional identifier for a reference range record |
REF | the allele at the start position of the reference haplotype sequence |
ALT | MD5 hash sums for each possible haplotype sequence detected at a given reference range record |
QUAL | quality score (needed to satisfy VCF specifications) |
FILTER | filter status (needed to satisfy VCF specifications) |
INFO | information field used to represent the end position (END) value for the reference range record |
Haplotype fields
Haplotype information must be specified by first creating a format
(FORMAT) field along with a genotype (GT) identifier.
The fields following the FORMAT field are named with
the sample (e.g. taxa) identifier for each sample referenced
in the hVCF file. For example, let's take a look at a given record
in the prior example (with added header for additional clarity):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ref B97 CML231
1 1001 . A <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> . . END=5500 GT 1 2 1
One thing you will notice is that there are no calls to the
"reference" allele (index 0 in VCF); there are only calls to the
alternate alleles, since these allele values represent the haplotype
sequence in MD5 hash form. Allele values, if using haploid path
finding, are represented using single values (e.g. 1, 2) which
represent the indexed order of haplotype sequences in the ALT
field. In other terms, if a sample has an allele value of 1, this
refers to the first symbolic allele in the ALT field for the
haploid value.
Using this information with the prior example, we can infer the
following haplotype sequence information for the given reference
range record (1:1001-5500):
Sample ID | Allele values | MD5 symbolic allele |
---|---|---|
Ref | 1 | 57705b1e2541c7634ea59a48fc52026f |
B97 | 2 | 1bda8c63ae8e2f3678b85bac0ee7b8b9 |
CML231 | 1 | 57705b1e2541c7634ea59a48fc52026f |
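The lookup shown in the table can be expressed as a short Python sketch. The function names `alt_ids` and `resolve_haploid` are hypothetical; the only rule used is the one stated above, namely that a haploid allele value is a 1-based index into the comma-separated ALT field.

```python
def alt_ids(alt_field):
    """Turn '<a>,<b>' into ['a', 'b'] by stripping the angle brackets."""
    return [allele.strip("<>") for allele in alt_field.split(",")]


def resolve_haploid(alt_field, gt):
    """Map a haploid GT value (e.g. '2') to its MD5 symbolic allele."""
    ids = alt_ids(alt_field)
    index = int(gt)
    if not 1 <= index <= len(ids):
        raise ValueError(f"allele value {gt} does not index the ALT field")
    return ids[index - 1]              # GT values are 1-based


alt = "<57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9>"
calls = {"Ref": "1", "B97": "2", "CML231": "1"}
resolved = {sample: resolve_haploid(alt, gt) for sample, gt in calls.items()}
for sample, md5 in resolved.items():
    print(sample, md5)
```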
Alternatively, allele values in hVCF files can be generated using diploid path finding during the PHGv2 imputation process. Here is an example entry of this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ref B97 CML231
1 1001 . A <57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9> . . END=5500 GT 1|1 2|1 1|1
Allele values are separated with a "phased" indicator (|) and
never with an "unphased" indicator (/). Similar to haploid path
finding, allele values represent the indexed order of haplotype
sequences in the ALT field. In other terms, if a sample has an
allele value of 2|1, this refers to the second symbolic
allele in the ALT field for the first gamete and the first
symbolic allele for the second gamete.
Using this information with the prior example, we can infer the
following haplotype sequence information for the given reference
range record (1:1001-5500) using diploid values:
Sample ID | Allele values | MD5 symbolic allele (gamete 1) | MD5 symbolic allele (gamete 2) |
---|---|---|---|
Ref | 1\|1 | 57705b1e2541c7634ea59a48fc52026f | 57705b1e2541c7634ea59a48fc52026f |
B97 | 2\|1 | 1bda8c63ae8e2f3678b85bac0ee7b8b9 | 57705b1e2541c7634ea59a48fc52026f |
CML231 | 1\|1 | 57705b1e2541c7634ea59a48fc52026f | 57705b1e2541c7634ea59a48fc52026f |
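The diploid case can be sketched the same way in Python. The function name `resolve_phased` is hypothetical; it splits the GT value on the phased indicator, rejects the unphased indicator, and returns one MD5 symbolic allele per gamete. A value without any separator is treated as a single haploid call, which is an assumption of this sketch.

```python
def resolve_phased(alt_field, gt):
    """Map a phased GT value such as '2|1' to a tuple of MD5 symbolic alleles."""
    if "/" in gt:
        raise ValueError("hVCF allele values must be phased ('|'), not '/'")
    ids = [allele.strip("<>") for allele in alt_field.split(",")]
    resolved = []
    for value in gt.split("|"):        # one entry per gamete
        index = int(value)
        if not 1 <= index <= len(ids):
            raise ValueError(f"allele value {value} does not index the ALT field")
        resolved.append(ids[index - 1])  # GT values are 1-based
    return tuple(resolved)


alt = "<57705b1e2541c7634ea59a48fc52026f>,<1bda8c63ae8e2f3678b85bac0ee7b8b9>"
gametes = resolve_phased(alt, "2|1")   # B97 in the example above
print(gametes)
```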