Skip to content

Exporting Data

In this document, we will discuss general strategies for exporting data from a PHG database.

Note

This section will assume you have a pre-existing database loaded with haplotype data. See the "Building and Loading" documentation for further information.

Quickstart

  • Export hVCF files from database
    phg export-vcf \
        --db-path /path/to/dbs \
        --dataset-type hvcf \ # can also be 'gvcf'
        --sample-names LineA,LineB \ # comma separated list of sample names
        -o /path/to/output/directory
    

Note

--sample-names can be replaced with --sample-file <file_name.txt> where file_name.txt is a file containing the sample names, one per line.

  • Create FASTA files from hVCF data or database

    phg create-fasta-from-hvcf \
      --hvcf-dir my/hvcf_dir \ # can also be an individual file ('--hvcf-file')
      --fasta-type composite \ # can also be 'haplotype'
      -o /path/to/output_folder
    

  • Data retrieval using BrAPI endpoints and rPHG2

    phg start-server \
        --db-that /path/to/dbs \
        --port 8080
    
    library(rPHG2)
    
    PHGServerCon("localhost", 8080) |>
        readHaplotypeData()
    

Detailed walkthrough

Export VCF data

In PHGv2, we leverage TileDB and TileDB-VCF for efficient storage and querying of VCF data. For this example, let's assume I have a pre-existing PHG database (located in my vcf_dbs directory) that contains several samples:

  • LineA
  • LineB
  • LineC

If I want to export hVCF files for a given set of samples (LineA and LineC) to an output directory (in my case output/hvcf_files), I can use the export-vcf command:

phg export-vcf \
    --db-path vcf_dbs \
    --dataset-type hvcf \
    --sample-names LineA,LineC \
    -o output/hvcf_files

This command uses several parameters:

  • --db-path - path to directory storing the TileDB instances.
  • --dataset-type - what type of data do you want to extract? This can either be:
    • hVCF (hvcf) data (default parameter)
    • gVCF (gvcf) data
  • --sample-names - a comma (,) separated list of sample IDs.
  • -o - output directory of VCF data.
  • --regions-file - a file of positions to be exported. Can be a BED file or a VCF file.

Note

Make sure there is no whitespace between sample IDs. For example:

  • LineA,LineB
  • LineA , LineB

Users may instead use the --sample-file parameter to specify a file that contains the sample names, one per line. For example, if I have a text file called sample_names.txt, the contents of the file would look like the following:

LineA
LineB

...and would be passed to the export-vcf command:

phg export-vcf \
    --db-path vcf_dbs \
    --dataset-type hvcf \
    --sample-file sample_names.txt \
    -o output/hvcf_files
If input is specified for the --regions-file parameter, only variants overlapping those positions will be exported. The regions-file must be either a BED file or a VCF file, and must have either a .bed or a .vcf extension.

For example, if I want the regions from 1 to 5000 base pairs (bp) on chromosome 3 (in my case the ID would be chr03), I could make a BED file:

chr03 0 5000

Note

BED files are 0-based, so plan accordingly!

...or this could be a VCF file that contains a data line for chr03 region information for the CHROM, POS, and INFO columns with the INFO column containing a END field. For example:

#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  ...
chr03   1   r1  A   T   50    PASS    END=5000

Create FASTA data

While haplotype sequences are abstracted to MD5 hashes in hVCF files, sequence information can be recapitulated from these hash values using the create-fasta-from-hvcf command:

phg create-fasta-from-hvcf \
  --hvcf-file my_sample.h.vcf \
  --fasta-type composite \
  -o /path/to/outputFolder

As the name of this command implies, we are creating FASTA files of nucleotide sequence data from a single hVCF file or a collection of hVCF files by specifying a directory. The output FASTA files will be written (one FASTA per hVCF file) to the specified output directory (-o). The format of the file names will be sample_name_type.fa where sample_name is the name of the sample from the hVCF file name and type is the type of fasta file created (composite or haplotype). The following parameters may be used:

  • input type (you can only select one):
    • --hvcf-file - path to an hVCF file. Can be substituted with --hvcf-dir.
    • --hvcf-dir - path to a directory containing hVCF files. Can be substituted with --hvcf-file.
  • --fasta-type - what type of FASTA format do you want to use?
    • composite - generate a FASTA file that contains all haplotypes concatenated together by consecutive reference ranges. This composite or "pseudo" genome can be used for the resequencing pipeline.
    • haplotype - generate a FASTA file where each haplotype is a separate FASTA entry. Useful for read mapping, imputation or simple haplotype sequence retrieval.
    • 'pangenomeHaplotype' - generate a FASTA file where we output all the haplotypes from all the hvcf files in the directory
    • 'rangeFasta' - outputs one file per reference range specified by the bedfile. Each file contains the haplotype sequences for each sample for the specified range.
  • -o - output path to directory for the created fasta files.

Data retrieval using BrAPI endpoints and rPHG

While the above commands allow for individual-oriented access to PHG data, another option is to start a "RESTful" web service. This service can provide access to a centralized PHG database, allowing multiple individuals in a team to simultaneously retrieve PHG-relevant information. The following web service leverages the Breeding API (BrAPI which provides a standard, community-driven collection of web retrieval calls relevant to plant breeding.

To create a web service for serving PHG data, we can use the start-server command:

phg start-server \
    --db-path vcf_dbs \
    --port 8080

This command takes only two arguments:

  • --db-path - path to directory storing the TileDB instances.
  • --port - web server port for the network connection. Defaults to 8080.

Once this command is run, a web service to localhost will start and data can be retrieved:

  • manually using BrAPI endpoints and cURL:

    # An example pointing to the 'samples' BrAPI endpoint
    $ curl http://localhost:8080/brapi/v2/samples
    
    # An example pointing to a composite hVCF file
    $ curl http://localhost:8080/brapi/v2/variantsets
    

  • using the R package, rPHG2. Since this is a separate library, more information about the library and retrieval methods can be found here.