Convenience Commands¶
In addition to the primary commands for the build, imputation, and resequencing pipelines, PHGv2 also provides a suite of "convenience commands" for miscellaneous quality-of-life (QoL) improvements. In this document, we will discuss the currently available external commands for performing highly used tasks:

- **Conversion** - Commands for converting one file type to another.
- **Merging** - Commands for merging various PHG file types.
- **Statistics** - Commands for generating summary information in tabular format.
- **Experimental** - Prototype commands that are under construction and may change significantly in future updates.
Conversion¶
Convert gVCF files to hVCF files¶
Create hVCF files from existing gVCF files created by the PHG
Command - gvcf2hvcf
Example
phg gvcf2hvcf \
--bed my/bed/file.bed \
--reference-file my/updated/ref/fasta.fa \
--gvcf-dir gvcf/directory \
--db-path my/phg/db
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --bed | BED file with entries that define the haplotype boundaries. | "" | ✅ |
| --gvcf-dir | Directory containing bgzipped and CSI-indexed gVCF files. | "" | ✅ |
| --reference-file | Path to local reference FASTA file. | "" | ✅ |
| --conda-env-prefix | Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment. | Current active Conda environment | |
| --db-path | Folder name where TileDB datasets and AGC record are stored. If not provided, the current working directory is used. | Current working dir | |
Convert hVCF files to gVCF files¶
Create gVCF files from existing hVCF files created by the PHG
Command - hvcf2gvcf
Example
phg hvcf2gvcf \
--reference-file my/updated/ref/fasta.fa \
--hvcf-dir hvcf/directory \
--db-path my/phg/db \
--output-dir output/directory/for/gvcfs
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --hvcf-dir | Path to directory holding hVCF files. Data will be pulled directly from these files instead of querying TileDB. | "" | ✅ |
| --reference-file | Path to local reference FASTA file. | "" | ✅ |
| --conda-env-prefix | Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment. | Current active Conda environment | |
| --db-path | Folder name where TileDB datasets and AGC record are stored. If not provided, the current working directory is used. | Current working dir | |
| --output-dir | Output directory for the gVCF files. If not provided, the current working directory is used. | Current working dir | |
| --batch-size | Number of sample VCF files to export in a single batch from TileDB. | 5 | |
Create a GFF file from an imputed hVCF file¶
Create a path-specific GFF file from an imputed hVCF file and existing sample GFF files. Useful for creating a GFF file that contains only annotations for paths present in the imputed hVCF file.
Command - paths-to-gff
Example
phg paths-to-gff \
--hvcf-file my/hvcf/file.h.vcf \
--key-file my/samples/keyfile.txt \
--output-file output/path_specific.gff
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --hvcf-file | Path to hVCF file for which the GFF will be created. | "" | ✅ |
| --key-file | Path to key file containing 2 tab-delimited columns. | "" | ✅ |
| --output-file | Full path to the file where the new GFF3 file will be written. | "" | |
Advanced API use
For advanced users who would like to leverage this GFF-based data structure in memory for downstream Kotlin pipelines, instead of handling output files, the following example code may be used:
import net.maizegenetics.phgv2.utils.loadGFFsToGff3Feature
import net.maizegenetics.phgv2.utils.makeGffFromHvcf
// Same as CLI parameter inputs
val keyFile = "my/samples/keyfile.txt"
val hvcfFile = "my/hvcf/file.h.vcf"
// Create GFF 'TreeMap' object
val resTreeMap = loadGFFsToGff3Feature(keyFile)
// Create HTSJDK 'Gff3Feature' set object
val taxonPathGFF = makeGffFromHvcf(hvcfFile, resTreeMap)
In the above example, taxonPathGFF is an in-memory HTSJDK Gff3Feature object that can be used for downstream purposes. See the PathsToGff class source code for further details.
Create a PS4G file from read mapping data¶
Convert read mapping data into a PS4G (positional support for gamete) file.
Command - convert-rm2ps4g
Example
phg convert-rm2ps4g \
--read-mapping-file /path/to/readmapping.txt \
--output-dir /dir/for/ps4g/output/ \
--hvcf-dir /path/to/hvcf/files/
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --read-mapping-file | Path to read mapping file. | "" | ✅ |
| --output-dir | Output directory for the generated PS4G file. | "" | ✅ |
| --hvcf-dir | Directory containing hVCF files. | "" | |
Create a PS4G file from ropebwt3 BED data¶
Convert a ropebwt3 BED file into a PS4G (positional support for gamete) file.
Note
This command will only work with ropebwt3 files where the reads are aligned to the whole assembly chromosome using the mem command. MEMs (Maximal Exact Matches) are used to determine the optimal mapping. One downside to this approach is that if a SNP is in the middle of a read, its mappings will be ignored. We may integrate running this in conjunction with ropebwt3's Burrows-Wheeler Aligner's Smith-Waterman alignment (BWA-SW) approach (i.e., the sw command) in a future update.
Command - convert-ropebwt2ps4g
Example
phg convert-ropebwt2ps4g \
--ropebwt-bed /path/to/readmapping.txt \
--output-dir /dir/for/ps4g/output/ \
--hvcf-dir /path/to/hvcf/files/
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --ropebwt-bed | Path to ropebwt3 BED file. | "" | ✅ |
| --output-dir | Output directory for the generated PS4G file. | "" | ✅ |
| --hvcf-dir | Directory containing hVCF files. | "" | ✅ |
| --min-mem-length | Minimum length of a possible match to be considered a match. The default value is the average length of a short read (150 bp) minus 2 bp for possible variance. | 148 | |
| --max-num-hits | Maximum number of hits to report. | 50 | |
Note
ropebwt3 can return more hits than the value provided in the --max-num-hits parameter, but any alignment hitting more haplotypes than this value will be ignored.
Merging¶
Merge gVCF files¶
Merge multiple gVCF files into a single gVCF file
Command - merge-gvcfs
Example
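A sketch of a possible invocation (directory paths are placeholders; the flags mirror the Parameters table below):

phg merge-gvcfs \
--input-dir my/gvcf/directory \
--output-dir output/merged_gvcfs.g.vcf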
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --input-dir | Path to input gVCF file directory. | "" | ✅ |
| --output-dir | Path and/or filename for merged gVCF file. | "" | |
Merge hVCF files¶
Merge multiple hVCF files into a single hVCF file
Command - merge-hvcfs
Example
phg merge-hvcfs \
--input-dir my/hvcf/directory \
--output-file output/merged_hvcfs.h.vcf \
--id-format CHECKSUM \
--reference-file my/updated/ref/fasta.fa \
--range-bedfile my/bed/file.bed
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --input-dir | Path to input hVCF file directory. | "" | ✅ |
| --output-file | Path and/or filename for merged hVCF file. | "" | ✅ |
| --id-format | ID format for hVCF files. Options are CHECKSUM or RANGE_SAMPLE_GAMETE (see notes for further details). | RANGE_SAMPLE_GAMETE | |
| --reference-file | Path to reference FASTA file. | "" | |
| --range-bedfile | Path to reference range BED file. | "" | |
Note - id-format
If you select CHECKSUM for the --id-format parameter, the ID values in the ##ALT header will be MD5 checksums. If you select RANGE_SAMPLE_GAMETE, the ID values will instead follow a reference range/sample/gamete ID format.
Statistics¶
List Sample names from datasets¶
List the sample names from the AGC compressed file, the TileDB gVCF dataset, and/or the TileDB hVCF dataset
Command - list-samples
Example
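A sketch of a possible invocation (paths are placeholders; the flags mirror the Parameters table below):

phg list-samples \
--db-path my/phg/db \
--data-set all \
--output-file output/sample_names.txt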
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --db-path | Folder name where TileDB datasets and AGC record are stored. If not provided, the current working directory is used. | "" | |
| --output-file | Path and/or filename for the samples list file. | "" | ✅ |
| --data-set | Storage from which to pull sample names. Must be one of all, agc, gvcf, hvcf. | hvcf | |
| --conda-env-prefix | Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment. | Current active Conda environment | |
Example output
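Assuming the datasets contain samples named B73, Ki3, and Mo18W (placeholder names), the samples list file would hold one sample name per line:

B73
Ki3
Mo18W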
Create a table of haplotype IDs by reference range¶
Creates a tab-delimited table of haplotype IDs by reference range coordinates and sample IDs
Command - sample-hapid-by-range
Example
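A sketch of a possible invocation (paths are placeholders; the flags mirror the Parameters table below):

phg sample-hapid-by-range \
--input-dir my/hvcf/directory \
--output-dir output/hapid_table.txt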
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --input-dir | Path to directory holding hVCF files. | "" | ✅ |
| --output-dir | Path and/or filename for haplotype ID table. | "" | |
Example output
#CHROM START END B73 SEEDGWAS1 SEEDGWAS10
chr1 1 5000 <c9ecfe3967a71282f3ad7c41d48e0bbf> <b19364bc9a4c07a80986b1ee181446c2> <5c8e72b2e9f11ecc652d5b8e8d0e5bf3>
chr1 5001 6000 <f162e742c4d30f151ae6276fbebe762c> <fdfdaa361c39cf5b6f13fad195d0e519> <283a8261c193212fd5cf43d208673322>
chr1 6001 9000 <471d4abbf0545dede647e65915345648> <d6dd5ecea7fb4e6f77f9e630f601b7a8> <13e0ac1a8d12e1aedd6a5302d1e221fd>
Create a table of haplotype IDs to sample¶
Creates a tab-delimited table mapping haplotype IDs to sample gametes. One or multiple samples may map to each haplotype ID.
Command - hapid-sample-table
Example
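A sketch of a possible invocation (paths are placeholders; the flags mirror the Parameters table below):

phg hapid-sample-table \
--hvcf-dir my/hvcf/directory \
--output-dir output/hapid_sample_table.txt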
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --hvcf-dir | Path to directory holding hVCF files. | "" | ✅ |
| --output-dir | Path and/or filename for haplotype ID table. | "" | |
Note - --hvcf-dir
This is intended for use with hVCF files created from aligning assemblies. While this will work with hVCF files from imputation, all sample names will be the imputed file name rather than the sample names associated with the haplotype IDs when they were created.
Example output
"a81a7df7340ae0f14a6dccce0d9632db" Ki3
"c45452b07db68928da6f4e14d50ba1e3" Mo18W
"1b6d29dbb7b380e67b15c5a0f0142cf0" Ms71,R2D2
"a935ee46a1a1118df309fc34bdb9e5a5" B73,Ky21,Ki11
"b878dec3587e24c4714fec5131d4dbbb" C3PO
Experimental¶
Initialize custom TileDB instance for hVCFs¶
Creates a TileDB array instance to house hVCF header data
Command - init-hvcf-array
Example
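A sketch of a possible invocation (the path is a placeholder; the flag mirrors the Parameters table below):

phg init-hvcf-array \
--db-path my/phg/db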
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --db-path | Directory name under which TileDB datasets will be created. If this folder does not exist, it will be created. | Current working directory | |
Load hVCF data into custom TileDB instance¶
Loads hVCF data into the TileDB array instance that was created with the init-hvcf-array command.
Command - load-hvcf
Example
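A sketch of a possible invocation (paths are placeholders; the flags mirror the Parameters table below):

phg load-hvcf \
--db-path my/phg/db \
--hvcf-dir my/hvcf/directory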
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --db-path | Directory name under which TileDB datasets will be created. If this folder does not exist, it will be created. | Current working directory | |
| --hvcf-dir | Full path to an hVCF file directory. | "" | |
Query hVCF arrays¶
Query TileDB arrays created from load-hvcf.
Command - query-hvcf-arrays
Example
phg query-hvcf-arrays \
--db-path /path/for/tiledb/instance/ \
--query-type distinctSamples \
--array-type variants \
--ref-range-file /path/to/refrange/bed/file.bed \
--output-file query_results.txt
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
| --db-path | Directory name under which TileDB datasets will be created. If this folder does not exist, it will be created. | Current working directory | |
| --query-type | Type of query to perform (e.g., distinctSamples, as in the example above). | "" | |
| --array-type | Type of array to query (e.g., variants). | variants | |
| --ref-range-file | Full path to a BED file-formatted list of reference ranges to query. If no file is provided, all reference ranges are queried. | All reference ranges | |
| --output-file | Name of file for query results. | "" | |