PHG v2 Terminology

General terms

Term                 Definition
Reference genome     genome used for initial alignment and base coordinates
Reference range      segment of the reference genome
Haplotype            sequence of part of an individual chromosome, with its start and stop defined by the reference range
Reference haplotype  haplotype from the reference genome
Alternate genome     high-quality genome used to identify alternate haplotypes
Alternate haplotype  haplotype derived from a genome assembly
Composite genome     inferred genome based on its composite set of alternate and reference haplotypes
Haplotype ID         MD5 checksum of the haplotype sequence
Sample               genotype (haploid, diploid, or higher ploidy), taxon, or individual
Path                 phased set of haplotype IDs through the pangenome graph
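
As a rough illustration of the Haplotype ID and Path entries above, the following Python sketch hashes a haplotype sequence with MD5 and collects the resulting IDs into an ordered list. The function name `haplotype_id`, the example sequences, and the choice to hash each sequence exactly as given (no case or whitespace normalization) are assumptions made for illustration; PHG's own preprocessing may differ.

```python
import hashlib


def haplotype_id(sequence: str) -> str:
    """Return the MD5 hex digest of a haplotype sequence.

    Assumption: the sequence is hashed exactly as given, with no
    normalization of case or whitespace.
    """
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Hypothetical haplotype sequences for three consecutive reference ranges.
haplotype_sequences = ["ACGTACGT", "GGGCCCTT", "TTAACCGG"]

# One phase of a path, sketched here as the ordered list of haplotype IDs.
path = [haplotype_id(seq) for seq in haplotype_sequences]
```

Because the ID is derived from the sequence alone, two identical sequences always map to the same ID in this sketch, while any difference in the input (including letter case) produces a different ID.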

File types

File type  Acronym definition            Usage
.agc       Assembled Genomes Compressor  Efficient genome sequence compression.
.bam       Binary Alignment/Map          Binary representation of a SAM file; useful for efficient processing.
.bed       Browser Extensible Data       Genomic feature coordinate (e.g. reference ranges) storage.
.bcf       Binary Call Format            Binary representation of a VCF file; useful for efficient processing.
.fasta     FAST-All                      Sequence representation and storage.
.g.vcf     Genomic VCF file              Variant and non-variant genomic storage.
.h.vcf     Haplotype VCF file            Haplotype information representation and storage. More information can be found here.
.maf       Multiple Alignment Format     Multiple alignment storage; basis for gVCF and hVCF creation.
.sam       Sequence Alignment/Map        Sequence alignment to a reference sequence.
.vcf       Variant Call Format           Genetic variant representation and storage.
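
Since the table above lists .bed as the storage format for reference range coordinates, a minimal Python sketch of reading one BED line may help. BED start positions are 0-based and end positions are exclusive, so the length of a range is simply end minus start. The names `ReferenceRange`, `parse_bed_line`, and `range_length` are hypothetical, and the sketch assumes tab-separated fields and reads only the first three columns; any further columns in a given file are ignored.

```python
from typing import NamedTuple


class ReferenceRange(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)


def parse_bed_line(line: str) -> ReferenceRange:
    """Parse the first three tab-separated columns of a BED line."""
    fields = line.rstrip("\n").split("\t")
    return ReferenceRange(fields[0], int(fields[1]), int(fields[2]))


def range_length(ref_range: ReferenceRange) -> int:
    """Number of bases covered by a half-open BED interval."""
    return ref_range.end - ref_range.start


# Hypothetical BED line describing one reference range.
example = parse_bed_line("chr1\t0\t1000\tgene1\n")
```

With this example line, the range covers the first 1000 bases of `chr1`, which in 1-based inclusive coordinates (as used in VCF) corresponds to positions 1 through 1000.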


Software

Software    Purpose
agc         Performant FASTA genome compression
AnchorWave  Sensitive aligner for genomes with high sequence diversity
bcftools    Utilities for indexing VCF data
samtools    bgzip compression for VCF data
TileDB      Performant storage core for array data
TileDB-VCF  API for storing and querying VCF data