Create reference intervals
Legacy Documentation - PHG Version 1
This section contains documentation for PHG version 1, which is no longer actively developed. It is preserved here for archival and historical reference only. If you are looking to use the Practical Haplotype Graph, please refer to the PHG v2 documentation, which reflects the current version of the software.
SCRIPT PURPOSE
Generate reference intervals for the Practical Haplotype Graph (PHG)
NOTES
- This script assumes it is run inside the PHG Docker container with predefined I/O paths
- The .gff file is assumed to be in JGI format: gene model records are labeled "gene", and each annotation includes an "ID=..." field
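A quick pre-flight check of the annotation file can be sketched as follows. It assumes the "gene" label sits in the third (feature type) column and that the "ID=..." field sits in the ninth (attributes) column, as in standard GFF3. The helper name `count_gene_features` and the sample records are hypothetical and are not part of the PHG scripts.

```bash
# Hypothetical helper (not part of PHG): count GFF records whose feature
# type is "gene" and whose attribute column contains an ID=... field.
count_gene_features() {
    awk -F'\t' '!/^#/ && $3 == "gene" && $9 ~ /(^|;)ID=/ { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up annotation, used only to illustrate the check.
sample_gff=$(mktemp)
printf '##gff-version 3\n' > "$sample_gff"
printf 'chr1\tJGI\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=g1\n' >> "$sample_gff"
printf 'chr1\tJGI\tmRNA\t1000\t2000\t.\t+\t.\tID=gene0001.1;Parent=gene0001\n' >> "$sample_gff"
printf 'chr1\tJGI\tgene\t5000\t6000\t.\t-\t.\tName=g2\n' >> "$sample_gff"

# Only the first "gene" record carries an ID, so this reports 1.
count_gene_features "$sample_gff"
rm -f "$sample_gff"
```

A count of zero on a real annotation file would suggest that the file does not follow the layout assumed here.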
RUNNING THE SCRIPT
#!bash
docker run --rm \
-v /your_data_folder/:/tempFileDir/data \
maizegenetics/phg \
/CreateReferenceIntervals.sh -f your_reference.fasta -a your_reference.gene.gff3 [ ... optional parameters]
REQUIRED PARAMETERS
#!bash
-f <file name>
name of fasta file containing the reference sequence
-a <file name>
name of genome annotation file in .gff format containing gene model annotation
OPTIONAL PARAMETERS
#!bash
-k <integer>
Length of kmer used for determining repetitive regions
Default: 11
-e <integer>
Number of bases by which to expand gene models for initial reference interval selection
Default: 1000
-m <integer>
Distance (in bp) between genes below which gene models are merged
Default: 100
-p <double>
Proportion of kmers to be considered repetitive.
This sets the size of the high-count tail of the kmer distribution that is treated as repetitive (e.g. 0.05 selects the most frequent 5% of kmers)
Default: 0.1
-n <integer>
Number of kmer copies (genome-wide) above which a kmer is considered repetitive. Overrides -p
Default: none, -p is used by default
-l <integer>
The number of bases to consider when evaluating if a location in the genome is repetitive
Default: 100
-s <integer>
The step size (in bp) by which to proceed outward from a gene model when evaluating flanking regions
Default: 10
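To make the -m and -e settings more concrete, the sketch below merges gene intervals that lie closer than a given distance and then pads each merged interval on both sides. It is an illustration only, not the code used by CreateReferenceIntervals.sh: the function name `merge_and_expand`, the BED-style input (chrom, start, end, sorted by chromosome and start), and the merge-then-expand order are all assumptions, and the repeat-aware trimming controlled by -k, -p, -n, -l and -s is not modeled.

```bash
# Illustrative only (not the PHG implementation): merge neighbouring gene
# intervals, then pad each merged interval on both sides.
# Input: tab-separated chrom, start, end, sorted by chrom then start.
# Usage: merge_and_expand <bed_file> <merge_distance> <expand_bases>
merge_and_expand() {
    awk -F'\t' -v OFS='\t' -v m="$2" -v e="$3" '
        function emit() {
            s = start - e
            if (s < 0) s = 0                  # do not run past the chromosome start
            print chrom, s, end + e           # the chromosome end is not checked here
        }
        NR > 1 && $1 == chrom && $2 - end < m {
            if ($3 > end) end = $3            # close neighbour: extend the current interval
            next
        }
        NR > 1 { emit() }                     # gap is large enough: flush the previous interval
        { chrom = $1; start = $2; end = $3 }  # start a new interval
        END { if (NR > 0) emit() }
    ' "$1"
}

# Made-up gene coordinates, using the default values of -m (100) and -e (1000).
genes=$(mktemp)
printf 'chr1\t1000\t2000\nchr1\t2050\t3000\nchr1\t10000\t11000\n' > "$genes"
merge_and_expand "$genes" 100 1000
rm -f "$genes"
```

In this toy input the first two genes are 50 bp apart, so they are merged before padding, while the third gene stays separate. Padded intervals that end up overlapping are left as they are in this sketch.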
SCRIPT RESULTS
Output location (subfolder in the input data folder)
- genomic_intervals_unique-timestamp
Relevant output contents
- reference_intervals_run.log -- a log file summarizing parameters for the run
- your_fasta.gene.expand.trimmed.summary_report.tsv -- a summary of seed gene model expansion
- your_fasta.kmer_count.tsv -- a complete list of kmer counts for kmers with count > 1
- your_fasta.gene.expand.trimmed.bed -- the final reference intervals, in BED format
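One simple way to sanity-check the final BED file is to count its intervals and total the bases they cover. The sketch below assumes a plain tab-separated BED file with chrom, start and end in the first three columns and 0-based, half-open coordinates, so each interval spans end minus start bases. The helper name `summarize_bed` and the sample data are hypothetical.

```bash
# Hypothetical helper (not part of PHG): print the number of intervals and
# the total number of bases covered, separated by a tab.
summarize_bed() {
    awk -F'\t' '
        /^(#|track|browser)/ { next }         # skip optional BED header lines
        NF >= 3 { n++; total += $3 - $2 }     # BED is half-open: length = end - start
        END { printf "%d\t%.0f\n", n, total }
    ' "$1"
}

# Made-up intervals standing in for your_fasta.gene.expand.trimmed.bed.
bed=$(mktemp)
printf 'chr1\t0\t4000\nchr1\t9000\t12000\n' > "$bed"
summarize_bed "$bed"
rm -f "$bed"
```

For the two sample intervals this reports 2 intervals covering 7000 bases. The sketch adds up interval lengths directly, so overlapping intervals would be counted twice.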