Exporting Data
In this document, we will discuss general strategies for exporting data from a PHG database.
Note
This section will assume you have a pre-existing database loaded with haplotype data. See the "Building and Loading" documentation for further information.
Quickstart
- Export hVCF files from database
Note
--sample-names can be replaced with --sample-file <file_name.txt>, where file_name.txt is a file containing the sample names, one per line.
- Create FASTA files from hVCF data or database
- Data retrieval using BrAPI endpoints and rPHG2
Detailed walkthrough
Export VCF data
In PHGv2, we leverage TileDB and TileDB-VCF for efficient storage and querying of VCF data. For this example, let's assume I have a pre-existing PHG database (located in my vcf_dbs directory) that contains several samples:
- LineA
- LineB
- LineC
If I want to export hVCF files for a given set of samples (LineA and LineC) to an output directory (in my case output/hvcf_files), I can use the export-vcf command:
phg export-vcf \
--db-path vcf_dbs \
--dataset-type hvcf \
--sample-names LineA,LineC \
-o output/hvcf_files
This command uses several parameters:
- --db-path - path to the directory storing the TileDB instances.
- --dataset-type - the type of data to extract. This can be either:
  - hVCF (hvcf) data (default parameter)
  - gVCF (gvcf) data
- --sample-names - a comma (,) separated list of sample IDs.
- -o - output directory for the VCF data.
- --regions-file - a file of positions to be exported. Can be a BED file or a VCF file.
Note
Make sure there is no whitespace between sample IDs. For example:
- LineA,LineB ✅
- LineA , LineB ❌
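The whitespace rule above can be checked before the command is run. The following Python sketch is illustrative only: the helper names parse_sample_names and build_export_vcf_args are hypothetical and not part of PHG, and the argument list uses only the flags described in this section.

```python
def parse_sample_names(raw):
    """Split a comma-separated sample list, rejecting whitespace around IDs."""
    names = raw.split(",")
    for name in names:
        # An ID is rejected if it is empty or padded with whitespace.
        if not name or name != name.strip():
            raise ValueError(f"invalid sample ID {name!r}: remove whitespace around commas")
    return names


def build_export_vcf_args(db_path, sample_names, out_dir, dataset_type="hvcf"):
    """Assemble the export-vcf argument list (hypothetical helper)."""
    return [
        "phg", "export-vcf",
        "--db-path", db_path,
        "--dataset-type", dataset_type,
        "--sample-names", ",".join(sample_names),
        "-o", out_dir,
    ]


if __name__ == "__main__":
    samples = parse_sample_names("LineA,LineC")
    print(" ".join(build_export_vcf_args("vcf_dbs", samples, "output/hvcf_files")))
```

The resulting list could be handed to subprocess.run on a machine where phg is installed; that step is left out here so the sketch runs anywhere.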
Users may instead use the --sample-file parameter to specify a file that contains the sample names, one per line. For example, if I have a text file called sample_names.txt, the contents of the file would look like the following:
LineA
LineC
...and would be passed to the export-vcf command:
phg export-vcf \
--db-path vcf_dbs \
--dataset-type hvcf \
--sample-file sample_names.txt \
-o output/hvcf_files
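A sample file in the one-name-per-line layout can also be generated from a list. This is a minimal Python sketch; sample_file_text and write_sample_file are hypothetical helper names, not PHG functions.

```python
from pathlib import Path


def sample_file_text(sample_names):
    """Return the file body: one sample name per line, blanks dropped."""
    cleaned = [name.strip() for name in sample_names if name.strip()]
    return "\n".join(cleaned) + "\n"


def write_sample_file(path, sample_names):
    """Write the sample names to `path` and return the path."""
    Path(path).write_text(sample_file_text(sample_names))
    return path


if __name__ == "__main__":
    write_sample_file("sample_names.txt", ["LineA", "LineC"])
```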
If you specify the --regions-file parameter, only variants overlapping those positions will be exported. The regions file must be either a BED file or a VCF file, and must have either a .bed or a .vcf extension.
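The extension requirement can be verified up front. The Python sketch below is an assumption-laden illustration: regions_file_kind is a hypothetical helper, and it performs an exact, case-sensitive match on the final extension, which may be stricter or looser than what PHG itself accepts.

```python
from pathlib import Path


def regions_file_kind(path):
    """Classify a regions file by extension: 'bed' or 'vcf'."""
    suffix = Path(path).suffix  # final extension only, e.g. ".bed"
    if suffix == ".bed":
        return "bed"
    if suffix == ".vcf":
        return "vcf"
    raise ValueError(f"{path!r} must end in .bed or .vcf")


if __name__ == "__main__":
    print(regions_file_kind("regions.bed"))
```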
For example, if I want the regions from 1 to 5000 base pairs (bp) on chromosome 3 (in my case the ID would be chr03), I could make a BED file:
Note
BED files are 0-based, so plan accordingly!
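To make the 0-based convention concrete, the Python sketch below converts a 1-based, inclusive range into a BED line. BED intervals use a 0-based start and an exclusive end, so base pairs 1 to 5000 become start 0 and end 5000. The function name bed_line is hypothetical.

```python
def bed_line(chrom, start_1based, end_1based):
    """Convert a 1-based inclusive range to a tab-separated BED line."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based end.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}"


if __name__ == "__main__":
    with open("regions.bed", "w") as handle:
        handle.write(bed_line("chr03", 1, 5000) + "\n")
```

For the example region, the file contains a single line with the three tab-separated fields chr03, 0, and 5000.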
...or this could be a VCF file with a data line for chr03 that gives the region information in the CHROM, POS, and INFO columns, where the INFO column contains an END field. For example:
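As an illustration of how such a data line encodes a region, the Python sketch below reads CHROM, POS, and the END value from a tab-separated VCF data line. The sample line is an assumption made for this sketch (the placeholder REF base N and the missing-value dots are not taken from PHG output), and region_from_vcf_line is a hypothetical helper.

```python
def region_from_vcf_line(line):
    """Return (chrom, pos, end) from a VCF data line with INFO/END."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    # INFO is a semicolon-separated list of KEY=VALUE entries (or bare flags).
    info_items = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "END" not in info_items:
        raise ValueError("INFO column has no END field")
    return chrom, pos, int(info_items["END"])


if __name__ == "__main__":
    # Assumed example line: CHROM POS ID REF ALT QUAL FILTER INFO
    example = "chr03\t1\t.\tN\t.\t.\t.\tEND=5000"
    print(region_from_vcf_line(example))
```

Unlike BED, VCF positions are 1-based, so this line describes the same 1 to 5000 bp region without any offset.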
Create FASTA data
While haplotype sequences are abstracted to MD5 hashes in hVCF files, sequence information can be recovered from these hash values using the create-fasta-from-hvcf command:
phg create-fasta-from-hvcf \
--hvcf-file my_sample.h.vcf \
--fasta-type composite \
-o /path/to/outputFolder
As the name of this command implies, we are creating FASTA files of nucleotide sequence data from a single hVCF file or, by specifying a directory, from a collection of hVCF files. The output FASTA files will be written (one FASTA per hVCF file) to the specified output directory (-o). The file names follow the format sample_name_type.fa, where sample_name is the name of the sample from the hVCF file name and type is the type of FASTA file created (composite or haplotype). The following parameters may be used:
- input type (you can only select one):
  - --hvcf-file - path to an hVCF file. Can be substituted with --hvcf-dir.
  - --hvcf-dir - path to a directory containing hVCF files. Can be substituted with --hvcf-file.
- --fasta-type - what type of FASTA format do you want to use?
  - composite - generate a FASTA file that contains all haplotypes concatenated together by consecutive reference ranges. This composite or "pseudo" genome can be used for the resequencing pipeline.
  - haplotype - generate a FASTA file where each haplotype is a separate FASTA entry. Useful for read mapping, imputation, or simple haplotype sequence retrieval.
  - pangenomeHaplotype - generate a FASTA file that contains all the haplotypes from all the hVCF files in the directory.
  - rangeFasta - output one file per reference range specified by the BED file. Each file contains the haplotype sequences for each sample for the specified range.
- -o - output path to the directory for the created FASTA files.
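To inspect an exported haplotype FASTA and relate its entries back to hash values, something like the following Python sketch may help. It is not PHG's implementation: read_fasta and md5_of_sequence are hypothetical helpers, and hashing the raw sequence string as ASCII is an assumption about how the hVCF hashes are derived, so the digests may not match the IDs in a real hVCF file.

```python
import hashlib


def read_fasta(text):
    """Parse FASTA text into a dict of header -> concatenated sequence."""
    records = {}
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header")
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def md5_of_sequence(sequence):
    """Return the hexadecimal MD5 digest of a nucleotide string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


if __name__ == "__main__":
    demo = ">hap1\nACGT\nACGT\n>hap2\nGGCC\n"
    for name, seq in read_fasta(demo).items():
        print(name, len(seq), md5_of_sequence(seq))
```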
Data retrieval using BrAPI endpoints and rPHG
While the above commands allow for individual-oriented access to PHG data, another option is to start a "RESTful" web service. This service can provide access to a centralized PHG database, allowing multiple individuals in a team to simultaneously retrieve PHG-relevant information. The web service leverages the Breeding API (BrAPI), which provides a standard, community-driven collection of web retrieval calls relevant to plant breeding.
To create a web service for serving PHG data, we can use the start-server command:
phg start-server \
--db-path vcf_dbs \
--port 8080
This command takes only two arguments:
- --db-path - path to the directory storing the TileDB instances.
- --port - web server port for the network connection. Defaults to 8080.
Once this command is run, a web service on localhost will start and data can be retrieved:
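As a rough sketch of what retrieval could look like from Python, the helpers below build a BrAPI-style URL and fetch JSON from it. The /brapi/v2/ prefix and the serverinfo call follow the general BrAPI v2 convention and are assumptions here; which endpoints the PHG server exposes is not covered in this section. brapi_url and fetch_json are hypothetical names.

```python
import json
from urllib.request import urlopen


def brapi_url(path, host="localhost", port=8080):
    """Build a URL under the assumed /brapi/v2/ prefix."""
    return f"http://{host}:{port}/brapi/v2/{path.lstrip('/')}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (requires a running server)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(url) once the server is running.
    url = brapi_url("serverinfo")
    print(url)
```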