Imputation using Machine Learning¶

Disclaimer

This Imputation using Machine Learning (ML) section is a work in progress. It is not yet complete and may contain inaccuracies or incomplete information.

In this document, we will discuss the steps needed to build PS4G files using the full assemblies and then to perform Machine Learning Based imputation using the PHG:

Run ropebwt3 indexing by full length chromosomes
Build Splines for the assemblies
Align Reads against the Index
Create a PS4G file from ropebwt3 BED data
Run ML Model - Future Work

Note

The steps detailed in this document build on the materials from the "Building and Loading" documentation. Please review this if you have not worked with the PHG before!

Quick start¶

Run ropebwt3 indexing by full length chromosomes:

phg rope-bwt-chr-index \
    --keyfile /my/keyfile.txt \
    --output-dir /my/phgIndex/ \
    --index-file-prefix phg_index \
    --threads 20

Build Splines for the assemblies:

phg build-spline-knots \
    --vcf-dir /my/vcfs \
    --vcf-type gvcf \
    --output-dir /my/spline_output_dir \
    --min-indel-length 10 \
    --num-bps-per-knot 50000 \
    --contig-list chr1,chr2,chr3

!!! note "Getting gVCF files without reference ranges" When not using reference ranges (skipping phg create-ranges), you cannot run the standard phg create-ref-vcf and phg create-maf-vcf commands. Instead, you can convert MAF alignment files directly to gVCF format using the maf-to-gvcf-converter command from biokotlin-tools.

Find maximal exact matches between ropebwt3 index and reads:

phg align-reads \
    --index ropebwt.fmd \
    --query query.fastq \
    --min-smem-len 31 \
    --threads 4 \
    --output example.bed

Create a PS4G file from ropebwt3 BED data:

phg convert-bed-to-ps4g \
    --db-path /my/db/uri \
    --data-set hvcf \
    --sample-names /my/sample_names.txt \
    --output-dir /my/ps4g_output_dir

Detailed walkthrough¶

Run ropebwt3 indexing by full length chromosomes¶

Creates a ropeBWT3 index for a set of assemblies using the full length indexing method. Each FASTA is taken one at a time and is processed and indexed into the ropeBWT3 index. Once the initial index is finished, the .fmr index is converted to .fmd and the suffix array is built.

Command - rope-bwt-chr-index

Example

phg rope-bwt-chr-index \
    --keyfile keyfile.txt \
    --output-dir /path/to/output/bed/files/ \
    --index-file-prefix phgIndex \
    --threads 20

Parameters

Parameter name	Description	Default value
`--keyfile`	Tab-delimited file containing 2 columns: `Fasta` and `SampleName`. `Fasta` is the full path to the FASTA file, and `SampleName` is the name for assembly.	None
`--output-dir`	Output directory where the index files will be written.	None
`--index-file-prefix`	Prefix for the ropebwt3 index file. This prefix will be added to the output directory and used for generated index files.	None
`--threads`	Number of threads to use for index creation.	`3`
`--delete-fmr-index`	Delete the `.fmr` index file after converting to `.fmd`.	`true`
`--conda-env-prefix`	Prefix for the Conda environment to use. If provided, this should be the full path to the Conda environment.	`""`

Note

--keyfile is a tab-delimited file with 2 columns with names Fasta and SampleName. Fasta needs the full path for each assembly FASTA file and SampleName needs to be the name you want included in the contig name. The first step of the indexer is to open up each FASTA file and rename the contigs to include the provided sample name separated by an '_' (e.g., lineA_chr1).
--index-file-prefix is the prefix for all the output index files. This tool will make a number of files (some temporary) while it is running each with this prefix. There should not be an extension here as this will be added as need be.

Note

Be sure to include your reference fasta file in the --keyfile as well.

Note

The SampleName should not have any underscores in it. We rename the contigs in the assembly fasta file temporarily so that RopeBWT3 can differentiate between contigs of the same name but coming from different assemblies (e.g., chr1, chr2, ... etc.).

Build spline knots¶

Build spline knot points from gVCF or hVCF data

Command - build-spline-knots

Example

build-spline-knots \
  --vcf-dir /data/hvcf_files \
  --output-dir /results/spline_knots \
  --vcf-type hvcf \
  --num-bps-per-knot 75000 \
  --random-seed 42

Parameters

Parameter name	Description	Default value
`--vcf-dir`	Directory containing the hVCF or gVCF files.	`""`
`--vcf-type`	Type of VCFs to build the splines from. Accepts `"hvcf"` or `"gvcf"`.	`"hvcf"`
`--output-dir`	Output directory to write the spline knots to.	`""`
`--min-indel-length`	Minimum length of an indel to break up the running block for spline creation of gVCFs. Ignored if `--vcf-type` is `hvcf`.	`10`
`--num-bps-per-knot`	Maximum number of base pairs per knot for each contig’s spline. The actual number may be lower if a contig has fewer bases.	`50000`
`--contig-list`	Comma-separated list of chromosomes to include in spline generation. If not provided, all chromosomes will be included.	`""`
`--random-seed`	Random seed used for downsampling the number of points per chromosome. Ensures reproducibility.	`12345`

Align reads to ropebwt3 index¶

This command serves as a high-level wrapper around the ropebwt3 mem algorithm, providing a streamlined interface for aligning short sequencing reads to a pre-built FM-index reference.

This command identifies Super-Maximal Exact Matches (SMEMs) between query reads and the indexed reference, efficiently mapping reads to genomic coordinates. Results are exported in BED format for easy visualization and downstream PS4G construction (see next section).

Command - align-reads

Example

phg align-reads \
    --index ropebwt.fmd \
    --query query.fastq \
    --min-smem-len 31 \
    --threads 4 \
    --output example.bed

Parameters

Parameter name	Description	Default value
`--index`	Path to the ropebwt3 FM-index (`.fmd`) file used as the reference for read alignment.	`""`
`--query`	Input FASTQ file containing reads to align. Supports both uncompressed and `.gz`-compressed files.	`""`
`--min-smem-len`	Minimum SMEM (Super-Maximal Exact Match) length used as a seed for alignment. Larger values produce fewer but more specific matches.	`31`
`--threads`	Number of threads to use for parallel processing during alignment. Improves performance on multicore systems.	`1`
`--output`	Output file in BED format containing aligned read intervals, mapping quality, and strand information.	`""`

Create a PS4G file from ropebwt3 BED data¶

Convert a ropebwt3 BED file into a PS4G (positional support for gamete) file.

Note

This command will only work with ropebwt3 files where the reads are aligned to the whole assembly chromosome using the mem command. MEMs (Maximal Exact Matches) are used to determine what the optimal mappping is. One downside to this approach is that if a SNP is in the middle of the read, the mappings will be ignored. We may integrate running this in conjunction with ropebwt3's Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW) approach (i.e., the sw command) in a future update.

Command - convert-ropebwt2-ps4g-file

Example

phg convert-ropebwt2-ps4g-file \
    --ropebwt-bed /data/alignments/sample.mem.bed \
    --output-dir /results/ps4g \
    --spline-knot-dir /refs/spline_knots \
    --min-mem-length 148 \
    --max-num-hits 50 \
    --sort-positions

Parameters

Parameter name	Description	Default value
`--ropebwt-bed`	Path to the RopeBWT3 BED file (MEM hits per read) to convert into PS4G.	`""`
`--output-dir`	Output directory where the generated PS4G file will be written.	`""`
`--spline-knot-dir`	Directory containing Spline Knot lookup files (per contig / sample) used to transform MEM positions into consensus PS4G positions.	`""`
`--min-mem-length`	Minimum length of a match (MEM) to be considered for consensus. Shorter MEMs are ignored.	`148`
`--max-num-hits`	Maximum total number of hits allowed (sum across best MEMs). If a read hits more haplotypes than this, it is ignored.	`50`
`--sort-positions`	Sort positions in the resulting PS4G file. Use `--no-sort-positions` to disable.	`true`

Note

ropebwt3 can hit more than the value provided in the --max-num-hits parameter but any alignment hitting more haplotypes than this will be ignored.