Quality Control Metrics¶
In this document, we will discuss the built-in quality control metrics available from the PHG. Metrics accompany several steps of building and using the PHG, and many can be run as either standalone commands or as part of a related command.
AnchorWave dot plots¶
Dot plots
generated from the align-assemblies
step
are a visual representation of the alignment between two assemblies.
They are useful for identifying large-scale structural
differences between assemblies, such as
inversions,
translocations,
and large deletions.
There are three methods to produce AnchorWave dot plots:
1. As part of the align-assemblies
command:
./phg align-assemblies \
--gff data/anchors.gff \
--reference-file output/updated_assemblies/Ref.fa \
--assemblies data/assemblies_list.txt \
--total-threads 20 \
--in-parallel 2 \
-o output/alignment_files
The align-assemblies
command runs code to create the dot plots
from AnchorWave's .anchorspro
files. For each .anchorspro
file that is generated, a genome-wide dot plot image is created
and stored in the same location where .maf
and .anchorspro
files are placed (-o
parameter directory). Each image that is
generated will have the assembly name found in the --assemblies
parameter followed by the suffix, _dotplot.svg
.
* For example, if your assemblies list file (in my case, this
would be assemblies_list.txt
) has the following assemblies
to align:
Note
As of PHGv2 version 2.2.74.123
, the plots generated with the
align-assemblies
command are part of the default output and
is not optional.
2. As a standalone command:
The create-anchorwave-dotplot
command creates a dot plot from
a user provided .anchorspro
file generated from AnchorWave.
Users may use the output from either the align-assemblies
command, or from any user run AnchorWave alignment as the
--input-file
parameter to create-anchorwave-dotplot
.
Output from create-anchorwave-dotplot
is written to the file
specified by the --output-file
parameter. Users may specify
output files with extension of .svg
or .png
and the
appropriate file type will be generated.
3. Using rPHG2
:
For more "granular" inspection of our alignment data, we can also use
the complimentary R package, rPHG2
.
Check out the "Load and visualize PHG metrics"
section for more information.
VCF Metrics¶
Once the gVCF and hVCF files for the TileDB instances have been created, we can produce a table summarizing metrics related to the assemblies used to produce them. These metrics are useful for identifying low-quality assemblies which you may wish to exclude from the PHG, or detecting problems with the assembly alignments.
There are three methods to produce or visualize VCF metrics:
1. As part of create-maf-vcf
:
phg create-maf-vcf \
--db-path vcf_dbs \
--bed output/ref_ranges.bed \
--reference-file output/updated_assemblies/Ref.fa \
--maf-dir output/alignment_files \
-o output/vcf_files \
--metrics-file output/vcf_files/VCFMetrics.tsv \
--skip-metrics
By default, create-maf-vcf
will write the VCF metrics table to the
same output directory as the VCF files (set with flag -o
) and will
name the file VCFMetrics.tsv
. The optional flag --metrics-file
can be used to change the destination of the table.
If VCF metrics are not desired, the --skip-metrics
flag can be used
to skip calculating and writing the metrics table altogether.
Generally, we recommend reviewing the built-in QC metrics at each
step of building the PHG, so the default behavior is to calculate
metrics.
2. As a standalone command:
The calc-vcf-metrics
command creates the VCF metrics table from a
directory of existing gVCF and hVCF files. It has two required
parameters:
* --vcf-dir
- The path to the directory containing assembly VCF
files. gVCF (.g.vcf
) files are required, but the
corresponding hVCF files are optional.
* --output
- The name for the output metrics table
Note
calc-vcf-metrics
is not a general-purpose tool for VCF files.
It was designed specifically for the files produced by
create-maf-vcf
, which are single-sample, haploid gVCFs and hVCFs.
Metrics may not be accurate for VCF files that do not fit this
format.
3. Visualize gVCF metrics:
For more "granular" inspection of our gVCF data, we can also use
the complimentary R package, rPHG2
.
Check out the "Load and visualize PHG metrics"
section for more information.
Output¶
For each gVCF file in the input directory, both chromosome-level and assembly-level statistics are calculated, and each has its own row in the table. The columns are as follows:
Column | Description |
---|---|
taxa |
The sample name in the gVCF file |
chrom |
The reference chromosome name. For assembly-level statistics this value is ALL |
refLength |
The length of the reference sequence |
numSNPs |
The number of SNP records in the gVCF file for the given chromosome |
numIns |
The number of insertion records in the gVCF file |
numDel |
The number of deletion records in the gVCF file |
numNs |
The number of N's/ambiguous bases in the assembly alignment |
numBasesInserted |
The number of bases inserted relative to the reference sequence |
numBasesDeleted |
The number of bases deleted relative to the reference sequence |
percentIdentityWithRef |
The proportion of bases relative to refLength that are the same base as the reference base |
percentMappedToRef |
The proportion of bases relative to refLength that are present in a gVCF record |
meanInsertionSize |
The mean size of insertion records |
medianInsertionSize |
The median size of insertion records |
largestInsertion |
The size of the largest insertion |
meanDeletionSize |
The mean size of deletion records |
medianDeletionSize |
The median size of deletion records |
largestDeletion |
The size of the largest deletion |
refRangesWithHaplotype |
The number of reference ranges with a non-missing haplotype in the hVCF file |
haplotypesIdenticalToRef |
The number of haplotypes identical to a reference haplotype in the hVCF file |
The last two columns, refRangesWithHaplotype
and
haplotypesIdenticalToRef
require hVCF files to calculate, and are
omitted if no hVCF files are present in the given directory. See the
hVCF specification documentation for more details about the hVCF
format.
Terminology disambiguation¶
percentIdentityWithRef
is roughly equivalent to the percentage of bases contained in gVCF reference blocks with non-missing genotypes. It also includes the padding base of each indel record if that base is the same as the referencepercentMappedToRef
includes reference blocks, SNP records, and indel records. It is equivalent to the percentage of the reference covered by alignment blocks in the AnchorWave MAF files used to create the VCF filesrefRangesWithHaplotype
In general, most assemblies will produce haplotypes at most reference ranges. However, large deletions may span an entire reference range or more, resulting in a missing range. A large number of missing ranges may indicate a problem with the assembly or the alignment.
Imputation Metrics¶
Calculate imputation metrics for an hvcf file created by the Imputation pipeline
Command - imputation-metrics
Example
phg imputation-metrics \
--sample-hvcf my/parent/hvcf/from/create-maf-vcf \
--imputation-hvcf my/imputation/h.vcf \
--bed-file path/to/ranges.bed \
--read-mapping-files my/readMappingList.txt \
--output-file output/imputation_metrics.txt
Parameters
Parameter name | Description | Default value | Required? |
---|---|---|---|
--sample-hvcf |
Path to hvcf for the sample expected to be selected most often. | "" |
|
--imputation-hvcf |
Path to hvcf created from imputation pipeline. | "" |
|
--bed-file |
Path to bed file created in create-ranges. | "" |
|
--read-mapping-files |
Path to a file that contains a list of read-mapping files, one per line, full path names. | "" |
|
--output-file |
Path to a file where metrics will be printed. | "" |
|
`--chrom' | Comma-separated list of contigs to include in output (optional) |
The imputation-metrics
command calculates metrics for a given h.vcf file created as the output of find-paths.
It requires input that is not available to the find-paths command, so must be run separately from
that command.
The intent of this command is to provide a way to evaluate the quality of the imputation process. The metrics allow the user to see the path the imputation has taken, where it has diverged from the parent (sample) haplotype, and shows the read counts from the read mappings files that support each transition. The output is a tab-delimited file with data arranged in reference range order that can be opened in a spreadsheet program for further analysis.