Quality Control Metrics¶

In this document, we will discuss the built-in quality control metrics available from the PHG. Metrics accompany several steps of building and using the PHG, and many can be run as either standalone commands or as part of a related command.

AnchorWave dot plots¶

Dot plots generated from the align-assemblies step are a visual representation of the alignment between two assemblies. They are useful for identifying large-scale structural differences between assemblies, such as inversions, translocations, and large deletions.

There are three methods to produce AnchorWave dot plots:

1. As part of the align-assemblies command:

./phg align-assemblies \
    --gff data/anchors.gff \
    --reference-file output/updated_assemblies/Ref.fa \
    --assemblies data/assemblies_list.txt \
    --total-threads 20 \
    --in-parallel 2 \
    -o output/alignment_files

The align-assemblies command runs code to create the dot plots from AnchorWave's .anchorspro files. For each .anchorspro file that is generated, a genome-wide dot plot image is created and stored in the same location where .maf and .anchorspro files are placed (-o parameter directory). Each image that is generated will have the assembly name found in the --assemblies parameter followed by the suffix, _dotplot.svg. * For example, if your assemblies list file (in my case, this would be assemblies_list.txt) has the following assemblies to align:

output/updated_assemblies/LineA.fa
output/updated_assemblies/LineB.fa

The plot output that is generated by default would be:

output/alignment_files/LineA_dotplot.svg
output/alignment_files/LineB_dotplot.svg

Note

As of PHGv2 version 2.2.74.123, the plots generated with the align-assemblies command are part of the default output and is not optional.

2. As a standalone command:

phg create-anchorwave-dotplot \
    --input-file my.anchorspro \
    --output-file myOutputFile.svg

The create-anchorwave-dotplot command creates a dot plot from a user provided .anchorspro file generated from AnchorWave. Users may use the output from either the align-assemblies command, or from any user run AnchorWave alignment as the --input-file parameter to create-anchorwave-dotplot.

Output from create-anchorwave-dotplot is written to the file specified by the --output-file parameter. Users may specify output files with extension of .svg or .png and the appropriate file type will be generated.

3. Using rPHG2:

For more "granular" inspection of our alignment data, we can also use the complimentary R package, rPHG2. Check out the "Load and visualize PHG metrics" section for more information.

VCF metrics¶

Once the gVCF and hVCF files for the TileDB instances have been created, we can produce a table summarizing metrics related to the assemblies used to produce them. These metrics are useful for identifying low-quality assemblies which you may wish to exclude from the PHG, or detecting problems with the assembly alignments.

There are three methods to produce or visualize VCF metrics:

1. As part of create-maf-vcf:

phg create-maf-vcf \
    --db-path vcf_dbs \
    --bed output/ref_ranges.bed \
    --reference-file output/updated_assemblies/Ref.fa \
    --maf-dir output/alignment_files \
    -o output/vcf_files \
    --metrics-file output/vcf_files/VCFMetrics.tsv \
    --skip-metrics

By default, create-maf-vcf will write the VCF metrics table to the same output directory as the VCF files (set with flag -o) and will name the file VCFMetrics.tsv. The optional flag --metrics-file can be used to change the destination of the table.

If VCF metrics are not desired, the --skip-metrics flag can be used to skip calculating and writing the metrics table altogether. Generally, we recommend reviewing the built-in QC metrics at each step of building the PHG, so the default behavior is to calculate metrics.

2. As a standalone command:

phg calc-vcf-metrics \
    --vcf-dir output/vcf_files \
    --output VCFMetrics.tsv

The calc-vcf-metrics command creates the VCF metrics table from a directory of existing gVCF and hVCF files. It has two required parameters: * --vcf-dir - The path to the directory containing assembly VCF files. gVCF (.g.vcf) files are required, but the corresponding hVCF files are optional. * --output - The name for the output metrics table

Note

calc-vcf-metrics is not a general-purpose tool for VCF files. It was designed specifically for the files produced by create-maf-vcf, which are single-sample, haploid gVCFs and hVCFs. Metrics may not be accurate for VCF files that do not fit this format.

3. Visualize gVCF metrics: For more "granular" inspection of our gVCF data, we can also use the complimentary R package, rPHG2. Check out the "Load and visualize PHG metrics" section for more information.

Output¶

For each gVCF file in the input directory, both chromosome-level and assembly-level statistics are calculated, and each has its own row in the table. The columns are as follows:

Column	Description
`taxa`	The sample name in the gVCF file
`chrom`	The reference chromosome name. For assembly-level statistics this value is `ALL`
`refLength`	The length of the reference sequence
`numSNPs`	The number of SNP records in the gVCF file for the given chromosome
`numIns`	The number of insertion records in the gVCF file
`numDel`	The number of deletion records in the gVCF file
`numNs`	The number of N's/ambiguous bases in the assembly alignment
`numBasesInserted`	The number of bases inserted relative to the reference sequence
`numBasesDeleted`	The number of bases deleted relative to the reference sequence
`percentIdentityWithRef`	The proportion of bases relative to refLength that are the same base as the reference base
`percentMappedToRef`	The proportion of bases relative to refLength that are present in a gVCF record
`meanInsertionSize`	The mean size of insertion records
`medianInsertionSize`	The median size of insertion records
`largestInsertion`	The size of the largest insertion
`meanDeletionSize`	The mean size of deletion records
`medianDeletionSize`	The median size of deletion records
`largestDeletion`	The size of the largest deletion
`refRangesWithHaplotype`	The number of reference ranges with a non-missing haplotype in the hVCF file
`haplotypesIdenticalToRef`	The number of haplotypes identical to a reference haplotype in the hVCF file

The last two columns, refRangesWithHaplotype and haplotypesIdenticalToRef require hVCF files to calculate, and are omitted if no hVCF files are present in the given directory. See the hVCF specification documentation for more details about the hVCF format.

Terminology disambiguation¶

percentIdentityWithRef is roughly equivalent to the percentage of bases contained in gVCF reference blocks with non-missing genotypes. It also includes the padding base of each indel record if that base is the same as the reference
percentMappedToRef includes reference blocks, SNP records, and indel records. It is equivalent to the percentage of the reference covered by alignment blocks in the AnchorWave MAF files used to create the VCF files
refRangesWithHaplotype In general, most assemblies will produce haplotypes at most reference ranges. However, large deletions may span an entire reference range or more, resulting in a missing range. A large number of missing ranges may indicate a problem with the assembly or the alignment.

Imputation metrics¶

Calculate imputation metrics for an hvcf file created by the Imputation pipeline

Command - imputation-metrics

Example

phg imputation-metrics \
    --sample-hvcf my/parent/hvcf/from/create-maf-vcf \
    --imputation-hvcf my/imputation/h.vcf \
    --bed-file path/to/ranges.bed \
    --read-mapping-files my/readMappingList.txt \
    --output-file output/imputation_metrics.txt

Parameters

Parameter name	Description	Default value
`--sample-hvcf`	Path to hvcf for the sample expected to be selected most often.	`""`
`--imputation-hvcf`	Path to hvcf created from imputation pipeline.	`""`
`--bed-file`	Path to bed file created in create-ranges.	`""`
`--read-mapping-files`	Path to a file that contains a list of read-mapping files, one per line, full path names.	`""`
`--output-file`	Path to a file where metrics will be printed.	`""`
`--chrom'	Comma-separated list of contigs to include in output (optional)

The imputation-metrics command calculates metrics for a given h.vcf file created as the output of find-paths. It requires input that is not available to the find-paths command, so must be run separately from that command.

The intent of this command is to provide a way to evaluate the quality of the imputation process. The metrics allow the user to see the path the imputation has taken, where it has diverged from the parent (sample) haplotype, and shows the read counts from the read mappings files that support each transition. The output is a tab-delimited file with data arranged in reference range order that can be opened in a spreadsheet program for further analysis.

Read mapping metrics¶

Return QC metrics for read mappings¶

Note

Need clarification!

Using FASTQ files as an input, this command creates a tab-delimited table reporting each k-mer generated from the FASTQ files and how well they map to haplotypes in the PHG database.

Command - qc-read-mapping

Example

phg qc-read-mapping \
    --hvcf-dir path/to/hvcf_directory \
    --output-dir path/to/output_directory \
    --read-files reads1.fastq,reads2.fastq \
    --kmer-index path/to/kmerIndex.txt \
    --num-reads 20

Parameters

Parameter name	Description	Default value
`--hvcf-dir`	Directory containing hVCF files used to build the HaplotypeGraph.	`""`
`--output-dir`	Output folder for the read mapping results.	`""`
`--read-files`	Comma separated list of FASTQ read files to map.	`""`
`--kmer-index`	K-mer index file created by `build-kmer-index`. If not provided, defaults to `<hvcfDir>/kmerIndex.txt`.	`<hvcfDir>/kmerIndex.txt`
`--num-reads`	Number of reads to process. This value controls how many reads are processed from each file.	`20`

Example output

Kmer                                Hash1                               Hash2                               RefRanges                      Hapids
ACGTACGTACGTACGTACGTACGTACGTACGT    d41d8cd98f00b204e9800998ecf8427e    0cc175b9c0f1b6a831c399e269772661    chr1:1000-2000,chr2:3000-4000    {0=[B73, Ki11]}
TGCACTGATGCACTGATGCACTGATGCACTGA    900150983cd24fb0d6963f7d28e17f72    f96b697d7cb7938d525a2f31aaf161d0    None                           {}

Kmer: A 32-nucleotide sequence extracted from the read.
Hash1 / Hash2: MD5 checksum–like hash values computed for the k-mer.
RefRanges: A comma-separated list of reference ranges (formatted as contig:start-end) where the k-mer was found.
Hapids: A mapping from reference range IDs to the corresponding haplotype IDs that were matched.

Return QC metrics for read mapping counts¶

Using a read mapping file as input, this command will generate a tab-delimited report of QC metrics of read mapping counts for a given sample ID found the PHG DB.

Command - read-mapping-count-qc

Example

phg read-mapping-count-qc \
    --hvcf-dir path/to/hvcf_directory \
    --read-mapping-file path/to/read_mapping.txt \
    --target-sample-name LineA \
    --output-dir path/to/output_directory

Parameters

Parameter name	Description	Default value
`--hvcf-dir`	Directory with Haplotype VCF files.	`""`
`--read-mapping-file`	Read mapping file containing the mapping data.	`""`
`--target-sample-name`	Target sample name for which counts will be computed.	`""`
`--output-dir`	Output directory for writing the count QC results.	`""`

Example output

refRange    LineA_HapID LineA_HapCount  HighestAltCount Difference  OtherHapCounts
1:1-1000    12f0cec9102e84a161866e37072443b7    853 0   853 0_4fc7b8af32ddd74e07cb49d147ef1938, 0_546d1839623a5b0ea98bbff9a8a320e2
1:1001-5500 3149b3144f93134eb29661bade697fc6    4378    70  4308    70_8967fabf10e55d881caa6fe192e7d4ca, 0_57705b1e2541c7634ea59a48fc52026f
1:5501-6500 1b568197f6f329ec5b71f66e49a732fb    855 47  808 47_05efe15d97db33185b64821791b01b0f, 0_d896e9cc56e74f39fd3f3c665191d727
1:6501-11000    369464a8743d2e40ad83d1375c196bdd    4355    101 4254    101_8f7de1a693aa15fb8fb7b85e7a8b5e95, 24_66465399052d8ebe48b06329c60fee03

refRange: The reference range.
LineA_HapID: The haplotype ID corresponding to the target sample (e.g., --target-sample-name LineA).
LineA_HapCount: The count of reads mapping to the target haplotype.
HighestAltCount: The highest count among non-target haplotypes.
Difference: The difference between the target haplotype count and the highest alternative count.
OtherHapCounts: A comma-separated list of counts and haplotype IDs for all other haplotypes in the reference range.