Skip to content

Quality Control Metrics

In this document, we will discuss the built-in quality control metrics available from the PHG. Metrics accompany several steps of building and using the PHG, and many can be run as either standalone commands or as part of a related command.

AnchorWave dot plots

Dot plots generated from the align-assemblies step are a visual representation of the alignment between two assemblies. They are useful for identifying large-scale structural differences between assemblies, such as inversions, translocations, and large deletions.

There are three methods to produce AnchorWave dot plots:

1. As part of the align-assemblies command:

./phg align-assemblies \
    --gff data/anchors.gff \
    --reference-file output/updated_assemblies/Ref.fa \
    --assemblies data/assemblies_list.txt \
    --total-threads 20 \
    --in-parallel 2 \
    -o output/alignment_files

The align-assemblies command runs code to create the dot plots from AnchorWave's .anchorspro files. For each .anchorspro file that is generated, a genome-wide dot plot image is created and stored in the same location where .maf and .anchorspro files are placed (-o parameter directory). Each image that is generated will have the assembly name found in the --assemblies parameter followed by the suffix, _dotplot.svg. * For example, if your assemblies list file (in my case, this would be assemblies_list.txt) has the following assemblies to align:

output/updated_assemblies/LineA.fa
output/updated_assemblies/LineB.fa
The plot output that is generated by default would be:
output/alignment_files/LineA_dotplot.svg
output/alignment_files/LineB_dotplot.svg

Note

As of PHGv2 version 2.2.74.123, the plots generated with the align-assemblies command are part of the default output and is not optional.

2. As a standalone command:

phg create-anchorwave-dotplot \
    --input-file my.anchorspro \
    --output-file myOutputFile.svg

The create-anchorwave-dotplot command creates a dot plot from a user provided .anchorspro file generated from AnchorWave. Users may use the output from either the align-assemblies command, or from any user run AnchorWave alignment as the --input-file parameter to create-anchorwave-dotplot.

Output from create-anchorwave-dotplot is written to the file specified by the --output-file parameter. Users may specify output files with extension of .svg or .png and the appropriate file type will be generated.

3. Using rPHG2:

For more "granular" inspection of our alignment data, we can also use the complimentary R package, rPHG2. Check out the "Load and visualize PHG metrics" section for more information.

VCF Metrics

Once the gVCF and hVCF files for the TileDB instances have been created, we can produce a table summarizing metrics related to the assemblies used to produce them. These metrics are useful for identifying low-quality assemblies which you may wish to exclude from the PHG, or detecting problems with the assembly alignments.

There are three methods to produce or visualize VCF metrics:

1. As part of create-maf-vcf:

phg create-maf-vcf \
    --db-path vcf_dbs \
    --bed output/ref_ranges.bed \
    --reference-file output/updated_assemblies/Ref.fa \
    --maf-dir output/alignment_files \
    -o output/vcf_files \
    --metrics-file output/vcf_files/VCFMetrics.tsv \
    --skip-metrics

By default, create-maf-vcf will write the VCF metrics table to the same output directory as the VCF files (set with flag -o) and will name the file VCFMetrics.tsv. The optional flag --metrics-file can be used to change the destination of the table.

If VCF metrics are not desired, the --skip-metrics flag can be used to skip calculating and writing the metrics table altogether. Generally, we recommend reviewing the built-in QC metrics at each step of building the PHG, so the default behavior is to calculate metrics.

2. As a standalone command:

phg calc-vcf-metrics \
    --vcf-dir output/vcf_files \
    --output VCFMetrics.tsv

The calc-vcf-metrics command creates the VCF metrics table from a directory of existing gVCF and hVCF files. It has two required parameters: * --vcf-dir - The path to the directory containing assembly VCF files. gVCF (.g.vcf) files are required, but the corresponding hVCF files are optional. * --output - The name for the output metrics table

Note

calc-vcf-metrics is not a general-purpose tool for VCF files. It was designed specifically for the files produced by create-maf-vcf, which are single-sample, haploid gVCFs and hVCFs. Metrics may not be accurate for VCF files that do not fit this format.

3. Visualize gVCF metrics: For more "granular" inspection of our gVCF data, we can also use the complimentary R package, rPHG2. Check out the "Load and visualize PHG metrics" section for more information.

Output

For each gVCF file in the input directory, both chromosome-level and assembly-level statistics are calculated, and each has its own row in the table. The columns are as follows:

Column Description
taxa The sample name in the gVCF file
chrom The reference chromosome name. For assembly-level statistics this value is ALL
refLength The length of the reference sequence
numSNPs The number of SNP records in the gVCF file for the given chromosome
numIns The number of insertion records in the gVCF file
numDel The number of deletion records in the gVCF file
numNs The number of N's/ambiguous bases in the assembly alignment
numBasesInserted The number of bases inserted relative to the reference sequence
numBasesDeleted The number of bases deleted relative to the reference sequence
percentIdentityWithRef The proportion of bases relative to refLength that are the same base as the reference base
percentMappedToRef The proportion of bases relative to refLength that are present in a gVCF record
meanInsertionSize The mean size of insertion records
medianInsertionSize The median size of insertion records
largestInsertion The size of the largest insertion
meanDeletionSize The mean size of deletion records
medianDeletionSize The median size of deletion records
largestDeletion The size of the largest deletion
refRangesWithHaplotype The number of reference ranges with a non-missing haplotype in the hVCF file
haplotypesIdenticalToRef The number of haplotypes identical to a reference haplotype in the hVCF file

The last two columns, refRangesWithHaplotype and haplotypesIdenticalToRef require hVCF files to calculate, and are omitted if no hVCF files are present in the given directory. See the hVCF specification documentation for more details about the hVCF format.

Terminology disambiguation

  • percentIdentityWithRef is roughly equivalent to the percentage of bases contained in gVCF reference blocks with non-missing genotypes. It also includes the padding base of each indel record if that base is the same as the reference
  • percentMappedToRef includes reference blocks, SNP records, and indel records. It is equivalent to the percentage of the reference covered by alignment blocks in the AnchorWave MAF files used to create the VCF files
  • refRangesWithHaplotype In general, most assemblies will produce haplotypes at most reference ranges. However, large deletions may span an entire reference range or more, resulting in a missing range. A large number of missing ranges may indicate a problem with the assembly or the alignment.