Exporting Data¶
In this document, we will discuss general strategies for exporting data from a PHG database.
Note
This section will assume you have a pre-existing database loaded with haplotype data. See the "Building and Loading" documentation for further information.
Quickstart¶
- Export hVCF files from database
Note
--sample-names can be replaced with --sample-file <file_name.txt>
where file_name.txt is a file containing the sample names, one
per line.
- Create FASTA files from hVCF data or database
- Data retrieval using BrAPI endpoints and rPHG2
Detailed walkthrough¶
Export VCF data¶
In PHGv2, we leverage TileDB and
TileDB-VCF
for efficient storage and querying of VCF data. For this example,
let's assume I have a pre-existing PHG database (located in my
vcf_dbs directory) that contains several
samples:
- LineA
- LineB
- LineC
If I want to export hVCF files for a given set of samples
(LineA and LineC) to an output directory (in my case
output/hvcf_files), I can use the export-vcf command:
phg export-vcf \
--db-path vcf_dbs \
--dataset-type hvcf \
--sample-names LineA,LineC \
-o output/hvcf_files
This command uses several parameters:
- --db-path - path to directory storing the TileDB instances.
- --dataset-type - what type of data do you want to extract? This can either be:
    - hVCF (hvcf) data (default parameter)
    - gVCF (gvcf) data
- --sample-names - a comma (,) separated list of sample IDs.
- -o - output directory of VCF data.
- --regions-file - a file of positions to be exported. Can be a BED file or a VCF file.
Note
Make sure there is no whitespace between sample IDs. For example:
- LineA,LineB ✅
- LineA , LineB ❌
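As a minimal sketch of the whitespace rule above, a sample list can be cleaned in the shell before it is passed to --sample-names. The helper name normalize_samples is hypothetical, and the sketch assumes that sample IDs never contain whitespace themselves.

```shell
# Hypothetical helper: remove every whitespace character from a
# comma-separated sample list (assumes IDs contain no internal spaces).
normalize_samples() {
    printf '%s' "$1" | tr -d '[:space:]'
}

SAMPLES=$(normalize_samples "LineA , LineC")
echo "$SAMPLES"

# The cleaned value can then be used with the command shown earlier:
# phg export-vcf --db-path vcf_dbs --dataset-type hvcf \
#     --sample-names "$SAMPLES" -o output/hvcf_files
```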
Users may instead use the --sample-file parameter to specify a file
that contains the sample names, one per line. For example, if I have
a text file called sample_names.txt listing the same two samples as
above, the contents of the file would look like the following:
LineA
LineC
...and would be passed to the export-vcf command:
phg export-vcf \
--db-path vcf_dbs \
--dataset-type hvcf \
--sample-file sample_names.txt \
-o output/hvcf_files
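If the sample IDs already exist as a comma-separated list, the one-name-per-line file can be generated rather than typed by hand. The following is a small sketch using standard shell tools; the variable name SAMPLES is illustrative only.

```shell
# Convert a comma-separated list into the one-name-per-line layout
# described for --sample-file.
SAMPLES="LineA,LineC"
printf '%s\n' "$SAMPLES" | tr ',' '\n' > sample_names.txt

# Inspect the result: each sample ID should sit on its own line.
cat sample_names.txt
```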
If you provide the --regions-file parameter, only
variants overlapping those positions will be exported.
The regions file must be either a
BED file or a
VCF file, and
must have either a .bed or a .vcf extension.
For example, if I want the regions from 1 to 5000 base pairs (bp) on
chromosome 3 (in my case the ID would be chr03), I could make a BED
file:
Note
BED files are 0-based, so plan accordingly!
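A minimal sketch of the BED file described above, assuming the standard BED convention of a 0-based start and an exclusive end, so that positions 1 to 5000 become start 0 and end 5000. The file name regions.bed is an arbitrary choice for illustration.

```shell
# Region of interest, given in 1-based inclusive coordinates.
CHROM="chr03"
START_1BASED=1
END_1BASED=5000

# BED uses a 0-based start and an end-exclusive stop, so only the
# start coordinate is shifted down by one.
printf '%s\t%d\t%d\n' "$CHROM" "$((START_1BASED - 1))" "$END_1BASED" > regions.bed
cat regions.bed

# The file could then be supplied to the export command, for example:
# phg export-vcf --db-path vcf_dbs --dataset-type hvcf \
#     --sample-names LineA,LineC --regions-file regions.bed \
#     -o output/hvcf_files
```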
...or this could be a VCF file containing a data line for chr03,
where the CHROM, POS, and INFO columns carry the region information
and the INFO column contains an END field. For
example:
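The following sketch writes one possible VCF regions file for the same 1 to 5000 bp region on chr03. Only CHROM, POS, and the END field in INFO come from the description above. The header lines, the placeholder REF value (N), the missing-value dots, and the file name regions.vcf are assumptions made so the file is syntactically valid VCF, and PHG may expect additional header lines.

```shell
# Write a small VCF whose single data line describes chr03:1-5000.
# REF=N and the "." fields are placeholders (assumptions), not values
# taken from the PHG documentation.
{
    printf '##fileformat=VCFv4.2\n'
    printf '##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the region">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr03\t1\t.\tN\t.\t.\t.\tEND=5000\n'
} > regions.vcf

cat regions.vcf
```

Unlike BED, VCF positions are 1-based, so POS is 1 rather than 0 here.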
Create FASTA data¶
While haplotype sequences are abstracted to MD5 hashes in hVCF
files, sequence information can be recapitulated from these hash
values using the create-fasta-from-hvcf command:
phg create-fasta-from-hvcf \
--hvcf-file my_sample.h.vcf \
--fasta-type composite \
-o /path/to/outputFolder
As the name of this command implies, we are creating FASTA files
of nucleotide sequence data from a single hVCF file or a collection
of hVCF files by specifying a directory. The output FASTA files will
be written (one FASTA per hVCF file) to the specified output directory
(-o). The format of the file names will be sample_name_type.fa,
where sample_name is the name of the sample from the hVCF file name
and type is the type of FASTA file created (composite or
haplotype). The following parameters may be used:
- input type (you can only select one):
    - --hvcf-file - path to an hVCF file. Can be substituted with --hvcf-dir.
    - --hvcf-dir - path to a directory containing hVCF files. Can be substituted with --hvcf-file.
- --fasta-type - what type of FASTA format do you want to use?
    - composite - generate a FASTA file that contains all haplotypes concatenated together by consecutive reference ranges. This composite or "pseudo" genome can be used for the resequencing pipeline.
    - haplotype - generate a FASTA file where each haplotype is a separate FASTA entry. Useful for read mapping, imputation or simple haplotype sequence retrieval.
    - pangenomeHaplotype - generate a FASTA file that contains all the haplotypes from all the hVCF files in the directory.
    - rangeFasta - outputs one file per reference range specified by the BED file. Each file contains the haplotype sequences for each sample for the specified range.
- -o - output path to directory for the created FASTA files.
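To illustrate the sample_name_type.fa naming pattern, the sketch below predicts the output file name for a given hVCF file and FASTA type. The helper expected_fasta_name is hypothetical, and it assumes the sample name is the hVCF file name with its directory and a trailing .h.vcf or .h.vcf.gz removed; the names the command actually writes may differ.

```shell
# Hypothetical helper: predict the FASTA file name for one hVCF file.
# Assumption: sample name = file name minus directory and .h.vcf(.gz).
expected_fasta_name() {
    base=$(basename "$1")
    base=${base%.gz}
    base=${base%.h.vcf}
    printf '%s_%s.fa\n' "$base" "$2"
}

expected_fasta_name my_sample.h.vcf composite
expected_fasta_name hvcf_files/LineA.h.vcf.gz haplotype
```

A check like this can be handy in a pipeline that needs to know which files to expect in the output directory before running downstream steps.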
Data retrieval using BrAPI endpoints and rPHG2¶
While the above commands allow for individual-oriented access to PHG data, another option is to start a "RESTful" web service. This service can provide access to a centralized PHG database, allowing multiple individuals in a team to simultaneously retrieve PHG-relevant information. The following web service leverages the Breeding API (BrAPI), which provides a standard, community-driven collection of web retrieval calls relevant to plant breeding.
To create a web service for serving PHG data, we can use the
start-server command:
This command takes only two arguments:
- --db-path - path to directory storing the TileDB instances.
- --port - web server port for the network connection. Defaults to 8080.
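Based on the two arguments listed above, an invocation might look like the following sketch. It only assembles and prints the command rather than launching it, since the server would keep the terminal occupied; the vcf_dbs path is reused from the earlier examples, and the variable names are illustrative.

```shell
# Assemble the start-server command from the two documented arguments.
DB_PATH="vcf_dbs"   # directory storing the TileDB instances
PORT=8080           # documented default port

START_CMD="phg start-server --db-path ${DB_PATH} --port ${PORT}"
echo "$START_CMD"

# To launch the service, run the printed command in a terminal that
# can stay open for as long as the server is needed.
```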
Once this command is run, a web service will start on localhost
and data can be retrieved:
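As a hedged sketch of what retrieval could look like from the command line, the snippet below builds request URLs for a locally running server. The /brapi/v2 base path and the serverinfo and samples endpoint names follow the general BrAPI v2 specification and are assumptions here; the endpoints actually served by PHG should be confirmed against its own documentation. The helper brapi_url is hypothetical.

```shell
# Assumed base URL: localhost, default port, BrAPI v2 path prefix.
HOST="localhost"
PORT=8080
BASE_URL="http://${HOST}:${PORT}/brapi/v2"

# Hypothetical helper: build the full URL for a BrAPI endpoint name.
brapi_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

brapi_url serverinfo
brapi_url samples

# With the server running, an endpoint can be queried with curl, e.g.:
# curl -s "$(brapi_url samples)"
```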