Resequencing / Rare Alleles Pipeline¶
In this document, we will discuss the proposed pipeline to identify rare alleles in a PHG population. This pipeline is invoked once a user has successfully run the PHG imputation pipeline and has secured a haplotype VCF (hVCF) file.
From the imputation hVCF file, a composite FASTA can be created. The pipeline begins by re-aligning the WGS reads used in the imputation pipeline against the composite FASTA derived above. The BAM file created by this second alignment is filtered, then run through a variant caller. We suggest using DeepVariant or Octopus for this purpose. The output of the variant caller is a VCF file that contains the variants called against the composite fasta.
Additional filtering is then performed to select variants based on
VCF quality scores, depth or other measures. This final VCF provides
a list of SNPs that may be translated to haplotype coordinates. To
perform this transformation, the user can use the
composite-to-haplotype-coords
command.
Quick start¶
-
Create a composite FASTA file
-
Align short reads to composite FASTA
-
Filter BAM files
samtools fixmate -m LineA_B_composite_align.bam LineA_B_composite_fm.bam samtools sort --threads 8 -T ./tmp -o LineA_B_composite_sorted_fm.bam LineA_B_composite_fm.bam samtools markdup -r LineA_B_composite_sorted_fm.bam LineA_B_composite_dedup.bam samtools view -F 4 -f 2 -b LineA_B_composite_dedup.bam > LineA_B_composite_filtered.bam samtools index LineA_B_composite_filtered.bam
-
Perform variant calling
PHG_DIR=phg_v2_example/ docker run \ -v ${PHG_DIR}:/workdir \ google/deepvariant:1.6.1 \ /opt/deepvariant/bin/run_deepvariant \ --model_type=WGS \ --ref=/workdir/output/composite_assemblies/LineA_B_composite.fa \ --reads=/workdir/output/minimap2/LineA_B_composite_filtered.bam \ --output_vcf=/workdir/output/vcf_files/LineA_B_composite_reseq.vcf \ --output_gvcf=/workdir/output/vcf_files/LineA_B_composite_reseq.g.vcf \ --intermediate_results_dir=/workdir/intermediate_results \ --num_shards=4
-
Filter variants
-
Create final VCF in haplotype coordinates
Detailed walkthrough¶
Create a composite FASTA file¶
Using the prior example outputs from the imputation section,
we can first convert the hVCF output from the find-paths
command
to a FASTA format via the create-fasta-from-hvcf
command:
phg create-fasta-from-hvcf \
--hvcf-file output/vcf_files_imputed/LineA_B.h.vcf \
--fasta-type composite \
-o output/composite_assemblies/
This command takes 3 necessary parameters:
--hvcf-file
- an hVCF file from the imputation pipeline--fasta-type
- how should the output for the FASTA file look like?composite
- concatenates haplotype sequences together at the chromosome level (needed for this pipeline)haplotype
- each FASTA entry is a haplotype sequence
-o
or--output-dir
- an existing output directory for the FASTA file
After running this command, our example directory (based on the
prior pipelines) will now have a FASTA file (LineA_B_composite.fa
)
in the composite_assemblies/
directory:
phg_v2_example/
├── data
│ └── short_reads
│ │ ├── LineA_LineB_1.fq
│ │ └── LineA_LineB_2.fq
│ ├── anchors.gff
│ ├── Ref-v5.fa
│ ├── LineA-final-01.fa
│ └── LineB-final-04.fa
├── output
│ ├── alignment_files/
│ ├── composite_assemblies *
│ │ └── LineA_B_composite.fa *
│ ├── ref_ranges.bed
│ ├── updated_assemblies/
│ ├── vcf_files/
│ ├── vcf_files_imputed
│ │ └── LineA_B.h.vcf
│ └── read_mappings/
└── vcf_dbs
├── assemblies.agc
├── gvcf_dataset/
├── hvcf_dataset/
├── hvcf_files/
├── reference/
└── temp/
Align short reads to composite FASTA¶
After creating the composite FASTA, we can re-align the WGS reads used for the imputation steps to this new composite assembly. This can be accomplished using a short-read aligner such as minimap2 (recommended) or other comparable software. Since this step is dependent on the choices made by the end-user or group, we will provide an example of how to run this via minimap2 in the following code block:
# Run minimap2 to align the reads with the composite genome
minimap2 -a -x sr -t 20 --secondary=no --eqx \
output/composite_assemblies/LineA_B_composite.fa \
data/short_reads/LineA_LineB_1.fq \
data/short_reads/LineA_LineB_2.fq | \
samtools view -b > output/minimap2/LineA_B_composite_align.bam
In summary, this command is using the following minimap2
parameters:
-a
- output alignments in SAM format-x sr
- alignment mode is short-read alignment-t 20
- run this alignment using 20 threads on our example machine (this can be modified to leverage the full potential of your machine)--secondary=no
- suppress secondary alignments and report only the primary alignment--eqx
- output the CIGAR string in "extended" format (=
for matches andX
for mismatches)
This minimap2 procedure will create a SAM file that is piped (|
)
into the SAMtools program which is converted (view -b
) and
saved (> output/minimap2/LineA_B_composite_align.bam
) as a BAM
format for more efficient storage and downstream procedures.
Filter BAM files¶
BAM files can be filtered using SAMtools (recommended) or other SAM/BAM processing tools. We suggest running the following SAMtools commands:
-
Fix potential issues in paired-end reads of a BAM file by adjusting and correcting mate information by ensuring each pair of reads has accurate information about its mate score (
-m
): -
Sort alignments by reference sequence coordinates which is necessary before deduplication and indexing:
-
Identify and remove duplicate reads:
-
Exclude unmapped reads or those that are not properly paired:
-
Index alignments for fast random access:
Perform variant calling¶
After the BAM files have been filtered, we can now run variant calling. Since there are many different variant callers available, this step can become "highly opinionated" for various groups and users. In our current tests, we have found the results produced by DeepVariant to be of high quality.
Note
For more information on comparisons between different variant callers we have tested, please refer to the notes found on this page.
Here we show an example of running DeepVariant on the prior BAM file. The output of DeepVariant is a VCF file that contains the variants called against the composite fasta made in the first step. Assuming you have retrieved the Docker image for DeepVariant, the following command is an example of the recommended parameters to use for the program:
PHG_DIR=phg_v2_example/
docker run \
-v ${PHG_DIR}:/workdir \
google/deepvariant:1.6.1 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \
--ref=/workdir/output/composite_assemblies/LineA_B_composite.fa \
--reads=/workdir/output/minimap2/LineA_B_composite_filtered.bam \
--output_vcf=/workdir/output/vcf_files/LineA_B_composite_reseq.vcf \
--output_gvcf=/workdir/output/vcf_files/LineA_B_composite_reseq.g.vcf \
--intermediate_results_dir=/workdir/intermediate_results \
--num_shards=4
If you have limited experience in what Docker and/or DeepVariant commands are, here we explain in further detail the rationale for each of these parameters:
Docker commands:
docker run
: base Docker command to run a container-v ${PHG_DIR}:/workdir
: This mounts the host system's (i.e., your machine) working PHG directory (specified as the variable,PHG_DIR
) to the/workdir
directory inside the container. This would be the path on your machine that contains the necessary files (discussed below). Here, I am mounting it to the "example" PHG directory that was used previously.
DeepVariant commands:
google/deepvariant:1.6.1
: This specifies the version of the DeepVariant Docker image to use (1.6.1
in this case since this is the latest version as of writing this documentation)./opt/deepvariant/bin/run_deepvariant
: This specifies the path inside the Docker container to the DeepVariant executable script that orchestrates the variant calling pipeline.
DeepVariant input parameters:
--model_type=WGS
: Specifies the model type to use for variant calling. In this case,WGS
stands for "Whole Genome Sequencing".--ref=/workdir/output/composite_assemblies/LineA_B_composite.fa
: The reference genome in FASTA format that DeepVariant will use for alignment and variant calling. The reference file in this case will be the composite FASTA file created earlier.--reads=/workdir/output/minimap2/LineA_B_composite_filtered.bam
: Specifies the prior sorted and indexed BAM file containing the aligned sequencing reads.--output_vcf=/workdir/output/vcf_files/LineA_B_composite_reseq.vcf
: The output VCF file path where DeepVariant will write the called variants.--output_gvcf=/workdir/output/vcf_files/LineA_B_composite_reseq.g.vcf
: The output gVCF file path.--intermediate_results_dir=/workdir/intermediate_results
: The directory where DeepVariant will write intermediate files generated during the analysis. These files are useful for debugging or optimizing performance, but can be deleted after run if not needed.--num_shards=4
: This specifies the number of CPU cores or threads to parallelize the job across. In this case, we are using4
CPU threads. The higher number of shards, the faster the job, but will require more computational resources.
Filter variants¶
To identify rare alleles, the VCF file output by the variant caller should be filtered to select variants based on quality scores, depth, or other measures. The following is an example of how to use bcftools to filter the VCF file created above:
bcftools view -m2 -M2 -v snps \
-i 'QUAL>20 && GT="1/1" && DP>4' \
output/vcf_files/LineA_B_composite_reseq.vcf \
-o output/vcf_files/LineA_B_composite_reseq_filt.vcf
-m2
: Specifies that only diploid variants should be included. The value2
represents the ploidy level (diploid).-M2
: Ensures that variants with more than two alleles are excluded. Using this in conjunction with-m2
ensures that only biallelic diploid variants are retained.-v snps
: This option restricts the output to only SNPs.-i 'QUAL>20 && GT="1/1" && DP>4'
: This specifies the filtering conditions in which variants have the qualities:QUAL>20
: only include variants with a quality score greater than 20. The quality score indicates the confidence of the variant call, and this filter ensures that only high-confidence variants are retained.GT="1/1"
: this ensures that only homozygous alternate variants are included (where the genotype is1/1
, indicating) both alleles are the alternate allele).DP>4
: retain only variants with a depth of coverage greater than 4. The depth of coverage (DP
) indicates the number of reads covering the variant position.
output/vcf_files/LineA_B_composite_reseq.vcf
: the input VCF file that contains the variants to be filtered.-o output/vcf_files/LineA_B_composite_reseq_filt.vcf
: the output file where the filtered variants will be saved.
Create final VCF in haplotype coordinates¶
Now that we have retained "high quality" variants, we can finally
convert the coordinates from the composite FASTA file to coordinates
found in the haplotypes within PHG population. To perform this, we
can use a PHGv2 command called composite-to-haplotype-coords
:
Note
The VCF coordinates created in this step will not be based on reference coordinates or coordinates relative to any single assembly. They will be relative to the haplotypes used to make the composite assembly!
phg composite-to-haplotype-coords \
--path-hvcf output/vcf_files_imputed/LineA_B.h.vcf \
--variant-vcf output/vcf_files/LineA_B_composite_reseq_filt.vcf \
--sample-name LineA_B \
--output-file output/vcf_files/LineA_B_reseq_hap_coords.vcf
In summary, this command takes 4 parameters:
--path-hvcf
: path to the imputed hVCF file created during the imputation steps.--variant-vcf
: path to the filtered VCF file created by the prior variant caller (e.g., DeepVariant).--sample-name
: sample ID to include in the VCF file.--output-file
: path and name of coordinate converted VCF file.
Miscellaneous¶
In the following sections, we present possible alternate workflows that may arise due to the attributes of your starting data:
WGS reads coming from multiple libraries:¶
In certain cases, you may have reads that come from multiple flowcell
or sequencing libraries. To align short reads to the composite, you
can add read grouping information (-R
) to minimap2. In the
following example, we have three WGS libraries for the sample P39
:
# Run minimap2 to align the reads with the composite genome
FASTA_DIR=/workdir/lcj34/phg_v2/testing/fastas
FASTQ_DIR=/workdir/lcj34/phg_v2/testing/fastqFiles
WORK_DIR=/workdir/lcj34/phg_v2/testing
minimap2 -a -x sr -t 20 --secondary=no --eqx -R '@RG\tID:READ_GRP1\tSM:P39' \
${FASTA_DIR}/P39wgs_composite.fa \
${FASTQ_DIR}/P39/P39_HBEN2ADXX_GTGAAA_R1.fastq.gz \
${FASTQ_DIR}/P39/P39_HBEN2ADXX_GTGAAA_R2.fastq.gz | \
samtools view -b > ${WORK_DIR}/p39top39composite_bams/P39_HBEN2ADXX_GTGAAA.ori.bam
minimap2 -a -x sr -t 20 --secondary=no --eqx -R '@RG\tID:READ_GRP2\tSM:P39' \
${FASTA_DIR}/P39wgs_composite.fa \
${FASTQ_DIR}/P39/P39__HFFGKADXX_GTGAAA_R1.fastq.gz \
${FASTQ_DIR}/P39/P39__HFFGKADXX_GTGAAA_R2.fastq.gz | \
samtools view -b > ${WORK_DIR}/p39top39composite_bams/P39_HFFGKADXX_GTGAAA.ori.bam
minimap2 -a -x sr -t 20 --secondary=no --eqx -R '@RG\tID:READ_GRP3\tSM:P39'
${FASTA_DIR}/P39wgs_composite.fa \
${FASTQ_DIR}/P39/P39_HL5WNCCXX_GTGAAA_R1.fastq.gz \
${FASTQ_DIR}/P39/P39_HL5WNCCXX_GTGAAA_R2.fastq.gz | \
samtools view -b > ${WORK_DIR}/p39top39composite_bams/P39_HL5WNCCXX_GTGAAA.ori.bam
In the above example, read groupings are added using the -R
parameter found in minimap2. Each WGS library will get its own
read group:
ID:READ_GRP1
-ID:READ_GRP3
: specifies that the read group identifier isREAD_GRP1
throughREAD_GRP3
SM:P39
: The sample name. In this case, it isP39
After alignment, the BAM files can be merged into one using the
merge
command from samtools:
-
-b
: A list of BAM files to merge. This would be a plain-text list of file paths. For example:
After merging, the BAM file can be sorted and indexed using the prior guidelines: