Imputation using Machine Learning¶
Disclaimer
This Imputation using Machine Learning (ML) section is a work in progress. It is not yet complete and may contain inaccuracies or incomplete information.
In this document, we will discuss the steps needed to build PS4G files using the full assemblies and then to perform Machine Learning Based imputation using the PHG:
- Run ropebwt3 indexing by full length chromosomes
- Build Splines for the assemblies
- Align Reads against the Index
- Create a PS4G file from ropebwt3 BED data
- Run ML Model - Future Work
Note
The steps detailed in this document build on the materials from the "Building and Loading" documentation. Please review this if you have not worked with the PHG before!
Quick start¶
-
Run ropebwt3 indexing by full length chromosomes:
-
Build Splines for the assemblies:
!!! note "Getting gVCF files without reference ranges"
When not using reference ranges (skipping phg create-ranges), you cannot
run the standard phg create-ref-vcf and phg create-maf-vcf commands.
Instead, you can convert MAF alignment files directly to gVCF format using
the maf-to-gvcf-converter command from biokotlin-tools.
-
Find maximal exact matches between ropebwt3 index and reads:
-
Create a PS4G file from ropebwt3 BED data:
Detailed walkthrough¶
Run ropebwt3 indexing by full length chromosomes¶
Creates a ropeBWT3 index for a set of assemblies using the full length indexing method. Each FASTA is taken one at a time and is processed and indexed into the ropeBWT3 index. Once the initial index is finished, the
.fmrindex is converted to.fmdand the suffix array is built.
Command - rope-bwt-chr-index
Example
phg rope-bwt-chr-index \
--keyfile keyfile.txt \
--output-dir /path/to/output/bed/files/ \
--index-file-prefix phgIndex \
--threads 20
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
--keyfile |
Tab-delimited file containing 2 columns: Fasta and SampleName. Fasta is the full path to the FASTA file, and SampleName is the name for assembly. |
None | |
--output-dir |
Output directory where the index files will be written. | None | |
--index-file-prefix |
Prefix for the ropebwt3 index file. This prefix will be added to the output directory and used for generated index files. | None | |
--threads |
Number of threads to use for index creation. | 3 |
|
--delete-fmr-index |
Delete the .fmr index file after converting to .fmd. |
true |
|
--conda-env-prefix |
Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment. | "" |
Note
--keyfileis a tab-delimited file with 2 columns with namesFastaandSampleName.Fastaneeds the full path for each assembly FASTA file andSampleNameneeds to be the name you want included in the contig name. The first step of the indexer is to open up each FASTA file and rename the contigs to include the provided sample name separated by an '_' (e.g.,lineA_chr1).--index-file-prefixis the prefix for all the output index files. This tool will make a number of files (some temporary) while it is running each with this prefix. There should not be an extension here as this will be added as need be.
Note
Be sure to include your reference fasta file in the --keyfile as well.
Note
The SampleName should not have any underscores in it. We rename the
contigs in the assembly fasta file temporarily so that RopeBWT3 can
differentiate between contigs of the same name but coming from
different assemblies (e.g., chr1, chr2, ... etc.).
Build spline knots¶
Build spline knot points from gVCF or hVCF data
Command - build-spline-knots
Example
build-spline-knots \
--vcf-dir /data/hvcf_files \
--output-dir /results/spline_knots \
--vcf-type hvcf \
--num-bps-per-knot 75000 \
--random-seed 42
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
--vcf-dir |
Directory containing the hVCF or gVCF files. | "" |
|
--vcf-type |
Type of VCFs to build the splines from. Accepts "hvcf" or "gvcf". |
"hvcf" |
|
--output-dir |
Output directory to write the spline knots to. | "" |
|
--min-indel-length |
Minimum length of an indel to break up the running block for spline creation of gVCFs. Ignored if --vcf-type is hvcf. |
10 |
|
--num-bps-per-knot |
Maximum number of base pairs per knot for each contig’s spline. The actual number may be lower if a contig has fewer bases. | 50000 |
|
--contig-list |
Comma-separated list of chromosomes to include in spline generation. If not provided, all chromosomes will be included. | "" |
|
--random-seed |
Random seed used for downsampling the number of points per chromosome. Ensures reproducibility. | 12345 |
Align reads to ropebwt3 index¶
This command serves as a high-level wrapper around the
ropebwt3 memalgorithm, providing a streamlined interface for aligning short sequencing reads to a pre-built FM-index reference.This command identifies Super-Maximal Exact Matches (SMEMs) between query reads and the indexed reference, efficiently mapping reads to genomic coordinates. Results are exported in BED format for easy visualization and downstream PS4G construction (see next section).
Command - align-reads
Example
phg align-reads \
--index ropebwt.fmd \
--query query.fastq \
--min-smem-len 31 \
--threads 4 \
--output example.bed
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
--index |
Path to the ropebwt3 FM-index (.fmd) file used as the reference for read alignment. |
"" |
|
--query |
Input FASTQ file containing reads to align. Supports both uncompressed and .gz-compressed files. |
"" |
|
--min-smem-len |
Minimum SMEM (Super-Maximal Exact Match) length used as a seed for alignment. Larger values produce fewer but more specific matches. | 31 |
|
--threads |
Number of threads to use for parallel processing during alignment. Improves performance on multicore systems. | 1 |
|
--output |
Output file in BED format containing aligned read intervals, mapping quality, and strand information. | "" |
Create a PS4G file from ropebwt3 BED data¶
Convert a ropebwt3 BED file into a PS4G (positional support for gamete) file.
Note
This command will only work with ropebwt3 files where the reads
are aligned to the whole assembly chromosome using the
mem
command. MEMs (Maximal Exact Matches) are used
to determine what the optimal mappping is. One downside to this
approach is that if a SNP is in the middle of the read, the
mappings will be ignored. We may integrate running this in
conjunction with ropebwt3's Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW)
approach (i.e., the sw
command) in a future update.
Command - convert-ropebwt2-ps4g-file
Example
phg convert-ropebwt2-ps4g-file \
--ropebwt-bed /data/alignments/sample.mem.bed \
--output-dir /results/ps4g \
--spline-knot-dir /refs/spline_knots \
--min-mem-length 148 \
--max-num-hits 50 \
--sort-positions
Parameters
| Parameter name | Description | Default value | Required? |
|---|---|---|---|
--ropebwt-bed |
Path to the RopeBWT3 BED file (MEM hits per read) to convert into PS4G. | "" |
|
--output-dir |
Output directory where the generated PS4G file will be written. | "" |
|
--spline-knot-dir |
Directory containing Spline Knot lookup files (per contig / sample) used to transform MEM positions into consensus PS4G positions. | "" |
|
--min-mem-length |
Minimum length of a match (MEM) to be considered for consensus. Shorter MEMs are ignored. | 148 |
|
--max-num-hits |
Maximum total number of hits allowed (sum across best MEMs). If a read hits more haplotypes than this, it is ignored. | 50 |
|
--sort-positions |
Sort positions in the resulting PS4G file. Use --no-sort-positions to disable. |
true |
Note
ropebwt3 can hit more than the value provided in the
--max-num-hits parameter but any alignment hitting more
haplotypes than this will be ignored.