PS4G - Positional Support for Gametes Specification¶
- Specification version: v2.0
- Date: 2025-10-24
Overview¶
A PS4G (Positional Support for Gametes) file is a standardized tab-delimited format that tracks genomic support for reference panel gametes across binned genomic positions. This format provides positional tracking for genomic data from multiple sources (e.g., variants, read alignments), thereby enhancing pathfinding support and enabling integration with machine learning-based imputation engines.
PS4G files aggregate evidence from sequencing reads or genotype calls to determine which reference panel haplotypes (gametes) are supported at different genomic locations. This compressed representation enables efficient downstream imputation and haplotype inference.
File Format¶
Structure¶
A PS4G file consists of:
- Header section - Metadata lines prefixed with #
- Column header line - Tab-delimited field names
- Data section - Tab-delimited data lines
Header Section¶
The header contains metadata about the file creation and reference gametes:
#PS4G
#version=2.0
#<metadata lines>
#Command: <CLI command used to generate this file>
#TotalUniqueCounts: <sum of all counts in the file>
#gamete gameteIndex count
#<SampleGamete> <index> <total_count>
...
Header fields:
| Field | Description |
|---|---|
| #PS4G | File format identifier (required first line) |
| #version | PS4G format version (2.0 for current specification) |
| #Command | Full CLI command used to generate the file |
| #TotalUniqueCounts | Sum of all unique position counts in the file |
| #gamete | Reference panel gamete identifier (format: SampleName or SampleName:gameteIndex) |
| gameteIndex | Zero-based integer index assigned to each gamete |
| count | Total number of observations supporting this gamete across all positions |
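The header layout described above can be read with a few lines of Python. This is a minimal sketch, not PHG code: the function name `parse_ps4g_header` and the returned structures are illustrative, and it assumes that metadata lines use either `key=value` or `key: value` after the `#` and that gamete lines are whitespace-separated.

```python
import io

def parse_ps4g_header(lines):
    """Split '#'-prefixed header lines into metadata and a gamete table."""
    meta = {}
    gametes = {}  # gameteIndex -> (gamete name, total count)
    in_gamete_table = False
    for i, raw in enumerate(lines):
        line = raw.rstrip("\n")
        if i == 0 and line != "#PS4G":
            raise ValueError("first line must be #PS4G")
        if not line.startswith("#"):
            break  # reached the column header line; the header is over
        body = line[1:]
        if body.split() == ["gamete", "gameteIndex", "count"]:
            in_gamete_table = True
        elif in_gamete_table:
            # Assumed layout: <gamete name> <index> <total count>
            name, index, count = body.split()
            gametes[int(index)] = (name, int(count))
        elif ": " in body:
            key, value = body.split(": ", 1)
            meta[key] = value
        elif "=" in body:
            key, value = body.split("=", 1)
            meta[key] = value
    return meta, gametes

example_header = (
    "#PS4G\n"
    "#version=2.0\n"
    "#Command: phg convert-rm2ps4g-file --output-dir output/\n"
    "#TotalUniqueCounts: 1234\n"
    "#gamete\tgameteIndex\tcount\n"
    "#LineA\t0\t853\n"
    "#LineB\t1\t381\n"
    "gameteSet\trefContig\trefPosBinned\tcount\n"
)
meta, gametes = parse_ps4g_header(io.StringIO(example_header))
```

Keying the gamete table by index rather than by name makes it straightforward to resolve the indices that appear in the data section.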
Data Section¶
After the header, the column names are defined, followed by data rows:
gameteSet refContig refPosBinned count
<comma-separated gamete indices> <contig> <binned position> <count>
Data fields:
| Field | Type | Description |
|---|---|---|
| gameteSet | String | Comma-separated list of gamete indices (from header) that are supported at this position |
| refContig | String | Reference contig/chromosome identifier |
| refPosBinned | Integer | Binned genomic position (actual position divided by 256) |
| count | Integer | Number of reads/variants supporting this gamete set at this position |
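As an illustration of the four data fields, the sketch below turns data lines into typed records. The names `Ps4gRow` and `parse_ps4g_rows` are hypothetical helpers, not part of PHG, and the code assumes strictly tab-delimited rows with exactly four fields.

```python
from typing import NamedTuple

class Ps4gRow(NamedTuple):
    gamete_set: frozenset   # gamete indices supported at this bin
    ref_contig: str
    ref_pos_binned: int
    count: int

def parse_ps4g_rows(lines):
    """Yield Ps4gRow records, skipping header and column-name lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # metadata line or blank line
        fields = line.split("\t")
        if fields[0] == "gameteSet":
            continue  # column header line
        gamete_set, contig, pos, count = fields
        yield Ps4gRow(
            frozenset(int(g) for g in gamete_set.split(",")),
            contig,
            int(pos),
            int(count),
        )

parsed_rows = list(parse_ps4g_rows([
    "#PS4G\n",
    "gameteSet\trefContig\trefPosBinned\tcount\n",
    "0,1\tchr1\t2000\t24\n",
    "1\tchr2\t500\t15\n",
]))
```

Storing the gamete set as a `frozenset` makes rows with the same set compare equal regardless of the order in which indices were written.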
Example¶
#PS4G
#version=2.0
#Command: phg convert-rm2ps4g-file --read-mapping-file input.txt --hvcf-dir /path/to/hvcfs --output-dir output/
#TotalUniqueCounts: 1234
#gamete gameteIndex count
#LineA 0 853
#LineB 1 381
#Ref 2 100
gameteSet refContig refPosBinned count
0 chr1 1000 853
0,1 chr1 2000 24
1 chr2 500 15
0,1,2 chr2 1500 5
In this example:
- Row 1: Gamete 0 (LineA) is supported at chr1 binned position 1000 (actual position ~256,000 bp) by 853 reads
- Row 2: Gametes 0 and 1 (LineA and LineB) are both supported at chr1 position 2000 (~512,000 bp) by 24 reads
- Row 3: Gamete 1 (LineB) is supported at chr2 position 500 (~128,000 bp) by 15 reads
- Row 4: All three gametes are supported at chr2 position 1500 (~384,000 bp) by 5 reads
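The row-by-row reading above amounts to two lookups: gamete index to gamete name, and binned position to approximate base-pair position. A small sketch follows, with `describe_row` and `gamete_names` as illustrative names that are not part of PHG:

```python
BIN_SIZE = 256  # bin width in bp, as stated in the specification

# Index-to-name table taken from the example header.
gamete_names = {0: "LineA", 1: "LineB", 2: "Ref"}

def describe_row(gamete_set, contig, ref_pos_binned, count, names=gamete_names):
    """Render one data row as a human-readable sentence."""
    labels = ", ".join(names[int(i)] for i in gamete_set.split(","))
    approx_bp = ref_pos_binned * BIN_SIZE  # start of the bin
    return f"{labels} supported at {contig} near {approx_bp:,} bp by {count} observations"

row2 = describe_row("0,1", "chr1", 2000, 24)
row4 = describe_row("0,1,2", "chr2", 1500, 5)
```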
Position Binning¶
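To reduce file size, genomic positions are binned into 256 bp windows.
Binning process:
The refPosBinned field stores the genomic position divided by 256 using integer division, that is, refPosBinned = floor(position / 256).
Converting back to approximate genomic position:
Multiplying by 256 recovers the start of the bin, that is, position ≈ refPosBinned × 256. The offset within the bin (0 to 255 bp) is not stored.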
Resolution and Compression
The 256 bp binning provides a balance between positional resolution and file compression. The actual genomic position is rounded down to the nearest 256 bp boundary during binning. This resolution is suitable for chromosome-scale imputation and pathfinding algorithms.
Binning Example
- Genomic position 256,000 bp → Binned position 1,000
- Genomic position 256,255 bp → Binned position 1,000 (same bin)
- Genomic position 256,256 bp → Binned position 1,001 (next bin)
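The binning arithmetic can be checked with a short sketch. The function names are illustrative; the specification does not state whether positions are 0-based or 1-based, so the code treats them as plain integers.

```python
BIN_SIZE = 256  # bin width in bp, from the specification

def bin_position(pos_bp: int) -> int:
    """Integer-divide a genomic position into its 256 bp bin."""
    return pos_bp // BIN_SIZE

def bin_range(ref_pos_binned: int) -> tuple:
    """Return the first and last genomic position covered by a bin."""
    start = ref_pos_binned * BIN_SIZE
    return start, start + BIN_SIZE - 1

same_bin = bin_position(256_000) == bin_position(256_255)
next_bin = bin_position(256_256)
covered = bin_range(1000)
```

Because the within-bin offset is discarded, `bin_range` gives the tightest interval that can be recovered from a stored refPosBinned value.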
Generation Methods¶
PS4G files can be generated from three different sources:
1. From Read Mapping Files (convert-rm2ps4g-file)¶
Converts PHG read mapping output (from k-mer or RopeBWT mapping) to PS4G format.
Input: Read mapping file with format:
Process:
- Maps haplotype IDs to reference ranges
- Identifies gametes at each reference range
- Determines position from reference range start
- Aggregates counts by gamete set and position
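The four steps above can be sketched as a single aggregation pass. This is not the PHG implementation: the input shapes are assumptions made for illustration. `read_mappings` is taken to be an iterable of (haplotype IDs, count) pairs, and `hap_info` a hypothetical lookup from haplotype ID to (contig, reference range start, gamete indices). The sketch also assumes that all haplotype IDs in one hit set belong to the same reference range.

```python
from collections import Counter

BIN_SIZE = 256  # bin width in bp, from the specification

def aggregate_read_mappings(read_mappings, hap_info):
    """Sum counts per (gamete set, contig, binned position)."""
    counts = Counter()
    for hap_ids, count in read_mappings:
        if not hap_ids:
            continue  # nothing to place
        gametes = set()
        for hap_id in hap_ids:
            # Assumed lookup: hap_id -> (contig, range start, gamete indices)
            contig, start, hap_gametes = hap_info[hap_id]
            gametes.update(hap_gametes)
        # Position comes from the reference range start, then is binned.
        key = (tuple(sorted(gametes)), contig, start // BIN_SIZE)
        counts[key] += count
    return counts

hap_info = {
    "h1": ("chr1", 256_000, [0]),
    "h2": ("chr1", 256_000, [1]),
}
rm_counts = aggregate_read_mappings(
    [(("h1",), 3), (("h1", "h2"), 2), (("h1",), 4)],
    hap_info,
)
```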
Command:
phg convert-rm2ps4g-file \
--read-mapping-file mapping.txt \
--hvcf-dir /path/to/hvcfs \
--output-dir output/
2. From RopeBWT BED Files (convert-ropebwt2ps4g-file)¶
Converts RopeBWT3 maximal exact match (MEM) alignments to PS4G format using spline-based coordinate transformation.
Input: BED file from RopeBWT3 with MEM alignments
Process:
- Loads spline knots for coordinate transformation from assembly to reference coordinates
- Groups MEMs by read
- Filters by minimum MEM length and maximum hits
- Uses spline interpolation to map assembly positions to reference positions
- Creates consensus position from multiple MEMs
- Aggregates gamete support by position
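A rough sketch of the filtering and coordinate-mapping steps above follows. It makes several assumptions that the specification does not confirm: MEMs are given as (read ID, MEM length, number of hits, gamete index, assembly position) tuples, piecewise-linear interpolation between knots stands in for the actual spline evaluation, the consensus position is taken as the median of the mapped positions, and the length and hit thresholds are inclusive. All names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median

BIN_SIZE = 256  # bin width in bp, from the specification

def interpolate(knots, asm_pos):
    """Piecewise-linear stand-in for spline evaluation.

    knots: sorted list of (assembly_pos, reference_pos), at least two entries.
    """
    xs = [x for x, _ in knots]
    i = min(max(bisect_right(xs, asm_pos), 1), len(knots) - 1)
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (y1 - y0) * (asm_pos - x0) / (x1 - x0)

def reads_to_support(mems, knots_by_gamete, min_mem_length=148, max_num_hits=50):
    """Map each read to (gamete set, binned consensus reference position)."""
    by_read = defaultdict(list)
    for read_id, mem_length, num_hits, gamete, asm_pos in mems:
        if mem_length >= min_mem_length and num_hits <= max_num_hits:
            by_read[read_id].append((gamete, asm_pos))
    support = {}
    for read_id, hits in by_read.items():
        ref_positions = [interpolate(knots_by_gamete[g], p) for g, p in hits]
        gametes = tuple(sorted({g for g, _ in hits}))
        # Assumption: the median serves as the consensus position.
        support[read_id] = (gametes, int(median(ref_positions)) // BIN_SIZE)
    return support

knots_by_gamete = {
    0: [(0, 256_000), (1000, 257_000)],
    1: [(0, 255_900), (1000, 256_900)],
}
mems = [
    ("r1", 150, 2, 0, 100),
    ("r1", 150, 2, 1, 200),
    ("r1", 100, 2, 1, 900),   # dropped: shorter than min_mem_length
    ("r2", 150, 60, 0, 100),  # dropped: more hits than max_num_hits
]
mem_support = reads_to_support(mems, knots_by_gamete)
```

Counting how many reads share each (gamete set, contig, bin) key would then give the aggregated PS4G rows; the contig lookup is omitted here for brevity.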
Command:
phg convert-ropebwt2ps4g-file \
--ropebwt-bed alignments.bed \
--spline-knot-dir /path/to/splines \
--output-dir output/ \
--min-mem-length 148 \
--max-num-hits 50
Parameters:
- --min-mem-length: Minimum MEM length to consider (default: 148)
- --max-num-hits: Maximum number of haplotype hits allowed (default: 50)
- --sort-positions: Sort output by genomic position (default: true)
3. From VCF Files (convert-vcf2ps4g-file)¶
Converts variant calls to PS4G format using a reference panel for gamete matching.
Input:
- Sample VCF file (to be imputed)
- Reference panel VCF file
Process:
- Builds allele-to-gamete lookup from reference panel
- For each variant in sample VCF:
    - Matches alleles to reference panel gametes
    - Records gamete support at that position
- Aggregates counts by gamete set and position
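The lookup-and-match process above can be sketched without a VCF library by using plain tuples. The input shapes are assumptions for illustration only: each reference panel record is (contig, position, {gamete index: allele}), and each sample record is (contig, position, allele). Real VCF parsing, multi-allelic sites, and missing genotypes are out of scope here.

```python
from collections import Counter, defaultdict

BIN_SIZE = 256  # bin width in bp, from the specification

def build_allele_lookup(ref_panel):
    """Map (contig, pos, allele) to the set of gametes carrying that allele."""
    lookup = defaultdict(set)
    for contig, pos, alleles in ref_panel:
        for gamete, allele in alleles.items():
            lookup[(contig, pos, allele)].add(gamete)
    return lookup

def vcf_to_support(sample_variants, lookup):
    """Count sample variants per (gamete set, contig, binned position)."""
    counts = Counter()
    for contig, pos, allele in sample_variants:
        gametes = lookup.get((contig, pos, allele))
        if not gametes:
            continue  # allele not seen in the reference panel
        counts[(tuple(sorted(gametes)), contig, pos // BIN_SIZE)] += 1
    return counts

panel = [
    ("chr1", 256_010, {0: "A", 1: "A", 2: "G"}),
    ("chr1", 256_100, {0: "C", 1: "T", 2: "T"}),
]
sample = [("chr1", 256_010, "A"), ("chr1", 256_100, "C"), ("chr1", 256_100, "N")]
vcf_counts = vcf_to_support(sample, build_allele_lookup(panel))
```

Both matched variants fall in bin 1000 but support different gamete sets, so they produce two separate rows rather than one.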
Command:
phg convert-vcf2ps4g-file \
--to-impute-vcf sample.vcf \
--ref-panel-vcf reference_panel.vcf \
--output-dir output/
Data Interpretation¶
Gamete Sets¶
The gameteSet field contains indices of gametes that share evidence at a position. Multiple gametes in a set indicate:
- From read data: Reads mapping ambiguously to multiple haplotypes
- From VCF data: Shared alleles across multiple reference samples
Position Accuracy¶
Due to the 256 bp binning:
- Positions represent approximate genomic locations
- Multiple nearby variants/reads may contribute to the same bin
- Suitable for chromosome-scale imputation, not for fine-scale variant calling
Count Interpretation¶
The count field represents:
- Read mapping: Number of reads supporting this gamete combination
- VCF conversion: Number of variants matching this pattern
- Higher counts indicate stronger evidence for those gametes
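One way to compare evidence across gametes is to tally, for each gamete, the counts of every row whose gameteSet contains it. The sketch below does this for a few made-up rows. Crediting the full count to every gamete in a shared set is an assumption; a downstream tool might instead split the count among set members.

```python
from collections import Counter

def per_gamete_support(rows):
    """Sum row counts for each gamete index appearing in gameteSet."""
    totals = Counter()
    for gamete_set, _contig, _pos, count in rows:
        for g in gamete_set.split(","):
            totals[int(g)] += count
    return totals

# Hypothetical rows: (gameteSet, refContig, refPosBinned, count)
tally_rows = [
    ("0", "chr1", 10, 7),
    ("0,1", "chr1", 11, 3),
    ("1", "chr2", 4, 2),
]
totals = per_gamete_support(tally_rows)
```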
Use Cases¶
PS4G files are designed for:
- Machine learning-based imputation: Provide feature vectors for ML models to predict haplotypes
- Pathfinding algorithms: Inform hidden Markov models about gamete support across the genome
- Quality control: Assess read mapping quality and reference panel coverage
- Comparative analysis: Compare imputation results across different methods
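As one possible reading of "feature vectors", the sketch below converts data rows into a per-bin support matrix with one row per (contig, bin) and one column per gamete. This layout is an assumption for illustration; the specification does not define how an imputation engine consumes PS4G data, and the function name is hypothetical.

```python
def support_matrix(rows, n_gametes):
    """Return (keys, matrix): keys are sorted (contig, bin) pairs,
    matrix[i][g] is the summed count supporting gamete g in bin keys[i]."""
    acc = {}
    for gamete_set, contig, pos, count in rows:
        vec = acc.setdefault((contig, pos), [0] * n_gametes)
        for g in gamete_set.split(","):
            vec[int(g)] += count
    keys = sorted(acc)
    return keys, [acc[k] for k in keys]

# Rows from the example file: (gameteSet, refContig, refPosBinned, count)
feature_rows = [
    ("0", "chr1", 1000, 853),
    ("0,1", "chr1", 2000, 24),
    ("1", "chr2", 500, 15),
    ("0,1,2", "chr2", 1500, 5),
]
keys, matrix = support_matrix(feature_rows, n_gametes=3)
```

Plain lists keep the sketch dependency-free; the same structure converts directly to an array type if a numerical library is available.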
File Naming Convention¶
PHG generates PS4G files with the naming pattern:
Examples:
- LineA_1_readMapping_ps4g.txt - From read mapping
- sample_alignments_ps4g.txt - From RopeBWT
- input_vcf_ps4g.txt - From VCF conversion
Related Commands¶
- rope-bwt-chr-index - Create RopeBWT index for alignment
- build-spline-knots - Generate spline knots for coordinate transformation
- map-reads - Align reads to generate mapping files
- See Imputation using Machine Learning for complete workflow
Specification Notes¶
Version History¶
- v2.0 (2025-10-24): Major format update - removed position encoding, split position into separate refContig and refPosBinned columns for improved readability and flexibility
- v1.0 (2025-02-19): Initial complete specification
- v0.1 (2025-02-19): Draft specification
Implementation¶
PS4G files are generated by PHGv2 commands in the net.maizegenetics.phgv2.pathing.ropebwt package:
- ConvertRm2Ps4gFile.kt - Read mapping conversion
- ConvertRopebwt2Ps4gFile.kt - RopeBWT conversion
- ConvertVcf2Ps4gFile.kt - VCF conversion
- PS4GUtils.kt - Shared utilities for file writing and data formatting