Imputation using Machine Learning¶
Disclaimer
This Imputation using Machine Learning (ML) section is a work in progress. It is not yet complete and may contain inaccuracies or incomplete information.
In this document, we will discuss the steps needed to build PS4G files using the full assemblies and then to perform Machine Learning Based imputation using the PHG:
- Run ropebwt3 indexing by full length chromosomes
- Build Splines for the assemblies
- Align Reads against the Index
- Create a PS4G file from ropebwt3 BED data
- Run ML Model - Future Work
Note
The steps detailed in this document build on the materials from the "Building and Loading" documentation. Please review this if you have not worked with the PHG before!
Quick start¶
-
Run ropebwt3 indexing by full length chromosomes:
-
Build Splines for the assemblies:
shell phg build-spline-knots \ --vcf-dir /my/vcfs \ --vcf-type gvcf \ --output-dir /my/spline_output_dir \ --min-indel-length 10 \ --num-bps-per-knot 50000 \ --contig-list chr1,chr2,chr3
-
Align Reads against the Index:
-
Create a PS4G file from ropebwt3 BED data:
Detailed walkthrough¶
Run ropebwt3 indexing by full length chromosomes¶
Creates a ropeBWT3 index for a set of assemblies using the full length indexing method. Each FASTA is taken one at a time and is processed and indexed into the ropeBWT3 index. Once the initial index is finished, the
.fmr
index is converted to.fmd
and the suffix array is built.
Command - rope-bwt-chr-index
Example
phg rope-bwt-chr-index \
--keyfile keyfile.txt \
--output-dir /path/to/output/bed/files/ \
--index-file-prefix phgIndex \
--threads 20
Parameters
Parameter name | Description | Default value | Required? |
---|---|---|---|
--keyfile |
Tab-delimited file containing 2 columns: Fasta and SampleName . Fasta is the full path to the FASTA file, and SampleName is the name for assembly. |
None | |
--output-dir |
Output directory where the index files will be written. | None | |
--index-file-prefix |
Prefix for the ropebwt3 index file. This prefix will be added to the output directory and used for generated index files. | None | |
--threads |
Number of threads to use for index creation. | 3 |
|
--delete-fmr-index |
Delete the .fmr index file after converting to .fmd . |
true |
|
--conda-env-prefix |
Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment. | "" |
Note
--keyfile
is a tab-delimited file with 2 columns with namesFasta
andSampleName
.Fasta
needs the full path for each assembly FASTA file andSampleName
needs to be the name you want included in the contig name. The first step of the indexer is to open up each FASTA file and rename the contigs to include the provided sample name separated by an '_' (e.g.,lineA_chr1
).--index-file-prefix
is the prefix for all the output index files. This tool will make a number of files (some temporary) while it is running each with this prefix. There should not be an extension here as this will be added as need be.
Note
Be sure to include your reference fasta file in the --keyfile
as well.
Note
The SampleName should not have any underscores in it. We rename the contigs in the assembly fasta file temporarily so that RopeBWT3 can differentiate between contigs of the same name but coming from different assemblies(like chr1, chr2, ... etc.).
Create a PS4G file from ropebwt3 BED data¶
Convert a ropebwt3 BED file into a PS4G (positional support for gamete) file.
Note
This command will only work with ropebwt3 files where the reads
are aligned to the whole assembly chromosome using the
mem
command. MEMs (Maximal Exact Matches) are used
to determine what the optimal mappping is. One downside to this
approach is that if a SNP is in the middle of the read, the
mappings will be ignored. We may integrate running this in
conjunction with ropebwt3's Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW)
approach (i.e., the sw
command) in a future update.
Command - convert-ropebwt2ps4g
Example
phg convert-ropebwt2ps4g \
--ropebwt-bed /path/to/readmapping.txt \
--output-dir /dir/for/ps4g/output/ \
--hvcf-dir /path/to/hvcf/files/
Parameters
Parameter name | Description | Default value | Required? |
---|---|---|---|
--ropebwt-bed |
Path to ropebwt3 BED file. | "" |
|
--output-dir |
Output directory for the generated PS4G file. | "" |
|
--hvcf-dir |
Directory containing hVCF files. | "" |
|
--min-mem-length |
Minimum length of a possible match to be considered a match. Default value is the average length of a short read (150 bp) - 2 bp for possible variance. | 148 |
|
--max-num-hits |
Maximum number of hits to report. | 50 |
Note
ropebwt3 can hit more than the value provided in the
--max-num-hits
parameter but any alignment hitting more
haplotypes than this will be ignored.