SLURM Usage Guidelines for align-assemblies
When running align-assemblies on a single machine, the --total-threads
and --in-parallel parameters determine how many threads to allocate
for the alignment step and how many genomes to run in parallel. Users
only need to call the align-assemblies command once, and no
pre-processing steps are necessary.
However, on HPC systems that use the SLURM job scheduler, parallel
processes operate differently, which requires slight modifications to
the alignment step. This document describes how to use the
align-assemblies command with the SLURM job scheduler.
Workflow Overview
1 - Reference preparation
When running jobs in a SLURM array, the first step is to run the
align-assemblies command with the --just-ref-prep parameter set to
true. This runs the preliminary steps of aligning the reference
genome to the GFF CDS, creating the reference-sam and
reference-cds-fasta files needed for aligning the individual
assemblies. The output of this step is written to the user-supplied
output directory.
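As a rough illustration of this step, the sketch below assembles a reference-preparation command and prints it instead of running it. The paths are the illustrative ones used elsewhere in this guide. Passing --just-ref-prep as a bare switch is an assumption; if your phg version expects an explicit value, append true after it (check phg align-assemblies --help).

```shell
#!/bin/sh
# Dry-run sketch: build the reference-preparation command and print it.
# All paths are the illustrative ones from this guide; adjust them to your layout.
GFF=data/anchors.gff
REF=output/updated_assemblies/Ref.fa
OUT_DIR=output/alignment_files

# ASSUMPTION: --just-ref-prep is written as a bare switch here. If your phg
# version requires a value, use "--just-ref-prep true" instead.
REF_PREP_CMD="./phg align-assemblies --gff ${GFF} --reference-file ${REF} --output-dir ${OUT_DIR} --total-threads 20 --just-ref-prep"

# Print the command so it can be inspected before running it for real
echo "${REF_PREP_CMD}"
```

Once the printed command looks right, it can be run directly (or via eval "${REF_PREP_CMD}") before any array jobs are submitted.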
2 - SLURM data array file creation
The second step is to create the SLURM data array file. This is a
text file containing one align-assemblies command per line, one for
each assembly that will be run. Each command takes the reference-sam
and reference-cds-fasta files created in the first step as input.
Users can create the file manually if they prefer, but for long lists
of assemblies, manual curation can be tedious and error-prone. PHGv2
provides a convenience command (prepare-slurm-align-file) to generate
these files automatically. See the "Prepare SLURM File for Alignments"
section for further details.
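For readers who do want to build the file by hand, the following is a minimal sketch of the idea: loop over a list of assemblies and write one align-assemblies command per line. It is not how prepare-slurm-align-file is implemented. The assembly paths (LineA.fa, LineB.fa) are hypothetical stand-ins, the files are written to a temporary directory, and the flags mirror the ones shown in this guide.

```shell
#!/bin/sh
# Sketch: write one align-assemblies command per assembly into a command file.
WORK_DIR=$(mktemp -d)
ASSEMBLIES_LIST="${WORK_DIR}/assemblies_list.txt"
SLURM_FILE="${WORK_DIR}/slurm_align_file.txt"

# Hypothetical assembly paths, one per line
printf '%s\n' \
    output/updated_assemblies/LineA.fa \
    output/updated_assemblies/LineB.fa > "${ASSEMBLIES_LIST}"

# Start with an empty command file
: > "${SLURM_FILE}"

# Emit one command line per non-empty entry in the assembly list
while IFS= read -r ASSEMBLY || [ -n "${ASSEMBLY}" ]; do
    if [ -z "${ASSEMBLY}" ]; then
        continue
    fi
    echo "./phg align-assemblies --gff data/anchors.gff --output-dir output/alignment_files --reference-file output/updated_assemblies/Ref.fa --reference-sam output/alignment_files/Ref.sam --reference-cds-fasta output/alignment_files/ref.cds.fasta --assembly-file ${ASSEMBLY} --total-threads 20 --in-parallel 1 --ref-max-align-cov 1 --query-max-align-cov 1" >> "${SLURM_FILE}"
done < "${ASSEMBLIES_LIST}"

# Count the generated commands (tr strips padding that some wc versions add)
N_COMMANDS=$(wc -l < "${SLURM_FILE}" | tr -d ' ')
echo "Wrote ${N_COMMANDS} commands to ${SLURM_FILE}"
```

The number of commands written should always equal the number of assemblies in the list, which is a quick sanity check before submitting the array.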
3 - Job submission
The third step is to submit the SLURM job. When running a SLURM job,
the --in-parallel parameter is not used, because each align-assemblies
command is assigned to a separate compute node in the HPC system.
The number of threads to use, however, is still specified by the
--total-threads parameter.
Prepare SLURM File for Alignments
The prepare-slurm-align-file command is a convenience method for
creating a file that can be submitted to SLURM as a data array job:
phg prepare-slurm-align-file \
--phg-location /path/to/phg \
--gff data/anchors.gff \
--reference-file output/updated_assemblies/Ref.fa \
--reference-sam output/alignment_files/Ref.sam \
--reference-cds-fasta output/alignment_files/ref.cds.fasta \
--assemblies data/assemblies_list.txt \
--ref-max-align-cov 1 \
--query-max-align-cov 1 \
--total-threads 20 \
--conda-env-prefix /path/to/conda/env \
--slurm-file output/slurm_align_file.txt \
-o output/alignment_files
Parameters
This command uses several parameters:
* --phg-location - The location of the phg executable. The full path should be provided. This is needed to run the align-assemblies command. If it is not specified, the current directory, i.e. ./phg, will be assumed.
* --gff - GFF file for the reference genome. This is used to identify full-length coding sequences to use as anchors.
* --reference-file - The reference genome in FASTA format.
    ℹ️ Note
    The path to the reference genome should be the updated version that was created by the prepare-assemblies command.
* --reference-sam - Optional parameter. If this is specified, the optional parameter --reference-cds-fasta must also be supplied. When both are supplied, the software skips the creation of these files and uses those supplied by the user. This is desirable when the user is running multiple assembly alignments from a SLURM data-array option and does not wish to realign the reference multiple times. If specified, but --reference-cds-fasta is not, the software will throw an exception.
* --reference-cds-fasta - Optional parameter. If this is specified, the optional parameter --reference-sam must also be supplied. When both are supplied, the software skips the creation of these files and uses those supplied by the user. This is desirable when the user is running multiple assembly alignments from a SLURM data-array option and does not wish to realign the reference multiple times. If specified, but --reference-sam is not, the software will throw an exception.
* --assemblies - A text file containing a list of annotated assembly genomes (see the "Prepare Assembly FASTA files" section for further details). The contents of the assembly list file should be either full or relative paths to each uncompressed assembly you would like to align. For example, since I am following the steps laid out in the "Build and Load Documentation", I can create a text file called assemblies_list.txt (placed in the data/ subdirectory) and populate it with one path per line. Here, I am planning on aligning two genomes called LineA and LineB. Since these are created with the prepare-assemblies command and the output is located in a subdirectory called output/updated_assemblies/ relative to my working directory, I will also add that to the path.
    ⚠️ Warning
    This text list should not contain the path to the reference genome, since it is supplied through the --reference-file flag.
* --ref-max-align-cov - The maximum reference genome alignment coverage. This is used in the proali command. The default value is 1.
* --query-max-align-cov - The maximum query genome alignment coverage. This is used in the proali command. The default value is 1.
* --total-threads - The number of threads to allocate for the alignment step.
* --conda-env-prefix - Optional parameter that specifies the path to the Conda directory that contains the conda environment needed to run phg. If not set, the conda environment phgv2-conda in the default location will be used.
* --slurm-file - The name of the file that will be created, containing the align-assemblies command for each assembly in the list of assemblies.
* -o, --output-dir - The name of the directory for the alignment outputs.
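The assembly list passed to --assemblies can be generated rather than typed by hand. The sketch below is one possible approach, assuming the updated assemblies are FASTA files ending in .fa and that the reference is named Ref.fa; the file names LineA.fa and LineB.fa are hypothetical stand-ins created in a temporary directory so the example can run anywhere.

```shell
#!/bin/sh
# Sketch: build data/assemblies_list.txt from a directory of updated assemblies,
# leaving out the reference genome.
WORK_DIR=$(mktemp -d)
mkdir -p "${WORK_DIR}/output/updated_assemblies" "${WORK_DIR}/data"

# Create empty stand-in FASTA files (hypothetical names)
for NAME in Ref LineA LineB; do
    : > "${WORK_DIR}/output/updated_assemblies/${NAME}.fa"
done

(
    # Work from the project root so the listed paths are relative to it
    cd "${WORK_DIR}" || exit 1
    # List every .fa file, then drop the reference, which is passed
    # separately through --reference-file
    printf '%s\n' output/updated_assemblies/*.fa \
        | grep -v '/Ref\.fa$' > data/assemblies_list.txt
)

cat "${WORK_DIR}/data/assemblies_list.txt"
```

With the stand-in files above, the resulting list holds two relative paths, output/updated_assemblies/LineA.fa and output/updated_assemblies/LineB.fa, and the reference is excluded.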
Example output
Following along with the example data shown in the "Build and Load"
documentation, my example SLURM data-array file
(output/slurm_align_file.txt) will look like the following, since I
have two samples:
./phg align-assemblies --gff data/anchors.gff --output-dir output/alignment_files --reference-file output/updated_assemblies/Ref.fa --reference-sam output/alignment_files/Ref.sam --reference-cds-fasta output/alignment_files/ref.cds.fasta --assembly-file data/test/smallseq/LineA.fa --total-threads 20 --in-parallel 1 --ref-max-align-cov 1 --query-max-align-cov 1
./phg align-assemblies --gff data/anchors.gff --output-dir output/alignment_files --reference-file output/updated_assemblies/Ref.fa --reference-sam output/alignment_files/Ref.sam --reference-cds-fasta output/alignment_files/ref.cds.fasta --assembly-file data/test/smallseq/LineB.fa --total-threads 20 --in-parallel 1 --ref-max-align-cov 1 --query-max-align-cov 1
Note
The output file contains one align-assemblies command per assembly
listed in the file given to --assemblies. If that file lists 10
assemblies, the output file will contain 10 lines; if it lists 100
assemblies, the output file will contain 100 lines.
Integrating into SLURM jobs
Since the output from prepare-slurm-align-file is simply a list of
individual align-assemblies commands (each representing an individual
assembly), we must pass it along to an actual SLURM array job. Below
is an example SLURM script showing some parameters and code setup you
may want to use for your applications:
#!/bin/bash
#SBATCH --time=10:30:00 # walltime limit (HH:MM:SS)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=40 # 40 processor core(s) per node X 2 threads per core
#SBATCH --mem=200G # maximum memory per node
#SBATCH --partition=short # standard node(s)
#SBATCH --job-name="10T_anchorwaveV2"
#SBATCH --mail-user=lcj34@cornell.edu # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --array=0-4 # one task per line of the command file, 0-based (5 lines here)
# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
module load miniconda
module load java/17
eval "$(conda shell.bash hook)"
conda activate /project/buckler_lab_panand/lynn.johnson/phgv2-conda
echo "All jobs in this array have:"
echo "- SLURM array job id: ${SLURM_ARRAY_JOB_ID}"
echo "- SLURM array task count: ${SLURM_ARRAY_TASK_COUNT}"
echo "- SLURM array starting task: ${SLURM_ARRAY_TASK_MIN}"
echo "- SLURM array ending task: ${SLURM_ARRAY_TASK_MAX}"
echo "This job in the array has:"
echo "- SLURM job id: ${SLURM_JOB_ID}"
echo "- SLURM array task id: ${SLURM_ARRAY_TASK_ID}"
INPUTFILE=<your_align_command_input>
# Read the command file into a bash array, one command per element
IFS=$'\n' read -d '' -r -a LINES < "${INPUTFILE}"
# Pick the command whose 0-based line index matches this array task
LINE="${LINES[$SLURM_ARRAY_TASK_ID]}"
eval "${LINE}"
if [ $? -eq 0 ]
then
    echo -e "$(date +"%D %r")\tSuccess: ${LINE}"
    exit 0
else
    echo -e "$(date +"%D %r")\tFailed\t${LINE}"
    echo -e "$(date +"%D %r")\tJobID\t${SLURM_JOB_ID}"
    echo -e "$(date +"%D %r")\tTaskID\t${SLURM_ARRAY_TASK_ID}"
    exit 1
fi
In this script, the INPUTFILE variable should be set to the path of
the prepare-slurm-align-file command output (in our case, this would
be output/slurm_align_file.txt). For more information about how to
get started with SLURM, please check out the official guides.
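Because the script indexes the command file with a 0-based SLURM_ARRAY_TASK_ID, the --array range should run from 0 to the number of commands minus one. The sketch below derives that range from the command file and prints the matching sbatch call rather than submitting it. The script name align_array.sh is hypothetical, and the command file here is a two-line stand-in written to a temporary location.

```shell
#!/bin/sh
# Sketch: derive the 0-based --array range from the number of commands.
SLURM_FILE=$(mktemp)

# Stand-in content: two placeholder command lines
printf 'command for LineA\ncommand for LineB\n' > "${SLURM_FILE}"

# One array task per line (tr strips padding that some wc versions add)
N_TASKS=$(wc -l < "${SLURM_FILE}" | tr -d ' ')
ARRAY_RANGE="0-$((N_TASKS - 1))"

# align_array.sh is a hypothetical name for your SLURM batch script.
# An --array value given on the command line overrides the #SBATCH line.
echo "sbatch --array=${ARRAY_RANGE} align_array.sh"
```

For the two-sample example in this guide, this gives a range of 0-1, so exactly one array task is started per assembly.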