Population Scale NextGen Sequencing
Next Generation Sequencing (NGS) technology is now routinely used for sequencing human genomes. The current Illumina HiSeq X can sequence a human genome in a few days, and NGS instruments running in parallel can effectively sequence several human genomes per day. Efforts are underway to scale this up to tens of thousands of human genomes per year.
This unprecedented capability is bringing NGS technology into the realm of population-scale research. Numerous projects are underway to study the genetic diversity of populations, including the 1000 Genomes Project, Human Longevity Inc., the Genome 10K Project, the i5K Initiative, and others. Some of these target human genetic diversity while others focus on non-human species. With the rapid decline in sequencing costs and shorter sequencing run times, we can expect to see more population-level NGS projects in the future.
While large-scale human genome sequencing offers great promise for population-scale research, it also creates a mild crisis for bioinformatics analysis of the resulting datasets. One key bottleneck is the time-consuming sequence alignment step, in which the DNA reads generated by the sequencer are mapped against a known reference genome. The current human reference genome contains about 3.1 billion nucleotide bases, which is a fairly large genome for data analysis (although there are considerably larger genomes in other species). Sequence alignment of NGS sample reads against the human reference genome can take several hours to run on a reasonably high-end server or cluster. Although this is not excessive for a single sample, with population-scale samples the aggregate time spent in the sequence alignment step can be very significant.
For the purposes of this discussion, we recognize two basic types of parallel computing:
- Algorithmic parallelism
- Job level parallelism
Algorithmic parallelism refers to thread-level or task-level parallelism in which multiple threads or tasks are split across several processor cores and executed simultaneously across those cores. The threads or tasks run independently of one another for some period of time, thus resulting in a performance speedup that depends on the exact methods of code parallelization.
Job level parallelism refers to the simultaneous execution of multiple jobs across several compute nodes. The jobs themselves run entirely independently from all other jobs. We often see speedups that are proportional to the number of concurrent jobs.
Here, we’ll combine algorithmic parallelism with job level parallelism. Let’s call it Parallel2. The results described below are derived from the XSEDE project “Multi-GPU Human Genome Sequence Alignments” (TG-ASC150023), which used Parallel2 methods for exceptional speedups in population scale human genome sequence alignments.
We obtained sample data from the 1000 Genomes Project, which contains 1,000+ sequenced human genomes derived from a variety of NGS instruments, vendors, and sequencing centers. To test the Parallel2 methodology, we aligned each of the 1,000+ sample human genomes against the current build of the human reference genome, hg38.
Internet file transfers can be painfully slow, especially if you’re transferring large files or large numbers of files. Fortunately, Globus provides a simple, fast file transfer service, which we employed here to move human genome files from the European Bioinformatics Institute (EBI) to the Texas Advanced Computing Center (TACC).
If you want to try Globus, follow these steps:
- Obtain a free Globus account
- Login to your account and use the File Transfer option
- Enter Source and Target endpoint names. Source = “ebi#public”. Target = “xsede#stampede”.
- Navigate to appropriate Source and Target directories. Source = “1000g/ftp/phase3/data/…..”. Target = “/work/acctid/acctname/…..”
- Select Source files and “drag-and-drop” to Target directory
Globus endpoint names are usually provided by the Source and Target organizations, or check the Globus website for instructions on managing endpoints. The Source directory is on the EBI server hosting the 1000 Genomes project. The Target directory is a Stampede directory of your choice within your account. A typical file transfer session is shown below.
Globus file transfer screenshot.
Typical file transfer speeds between the EBI and TACC sites were around 150 Mbytes/sec., or roughly 500 Gbytes/hr. In the Globus Transfer Settings options list, choose “verify file integrity after transfer”, which compares file checksums between source and destination, ensuring correct file transmission. We recommend Globus for fast Internet file transfers.
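Globus can also be driven from the command line instead of the web interface. The sketch below uses the Globus CLI; the endpoint IDs, paths, and sample directory are placeholders, and the flags should be checked against globus transfer --help for your CLI version.

# Hypothetical Globus CLI session; endpoint IDs and paths are placeholders.
globus login                                  # authenticate in a browser
globus endpoint search "ebi#public"           # look up the Source endpoint ID
globus endpoint search "xsede#stampede"       # look up the Target endpoint ID
globus transfer --recursive --verify-checksum \
    $EBI_ENDPOINT_ID:/1000g/ftp/phase3/data/SAMPLE_DIR \
    $STAMPEDE_ENDPOINT_ID:/work/acctid/acctname/1000genome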
The Stampede supercomputer is a flagship computing resource hosted by TACC. It currently ranks #10 on the Top500 list and is part of the XSEDE (Extreme Science and Engineering Discovery Environment) consortium. For our purposes, Stampede is interesting because it hosts 128 Nvidia K20 GPU accelerators (see below). Each GPU-enabled compute node hosts a single K20 GPU and dual Xeon E5-2680 processors (16 cores total) with 32 GB RAM. Jobs are submitted to the SLURM scheduler for batch execution. In order to fairly balance resource utilization among all users, TACC policies limit users to 32 GPU-enabled compute nodes per batch job. Therefore, even though Stampede hosts 128 GPUs in total, we can only run batch jobs with up to 32 compute nodes at once.
Simplified Stampede architecture.
Nvidia Tesla K20 GPU
Stampede currently hosts Nvidia Tesla K20 GPUs, and all of the sequence alignment metrics described here were derived from K20 runs. Note that the K20 is two product generations behind the current Tesla K80 GPU; K80s run about 2X faster than K20s. (We encourage TACC to upgrade to K80s!) The K20 specifications are:
- 2,496 GPU cores
- 5 GB GDDR5 RAM
- 208 GB/sec. memory bandwidth
- 225 W power consumption
Stampede hosts a single K20 GPU per compute node.
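If you want to confirm the GPU model and memory on a compute node yourself, nvidia-smi can report them from an interactive session on a GPU node (for example via TACC's idev utility):

# Report the GPU model, total memory, and driver version on the current node
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv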
As mentioned in an earlier blog post, we use the NVBIO suite for sequence alignment runs. nvBowtie is a GPU-enabled sequence alignment tool in the NVBIO suite, designed for highly parallel, GPU-only sequence alignments and for maximizing GPU utilization per compute node. In the context of Stampede, we run a single sequence alignment job per compute node: a single CPU thread acts as a master thread controlling up to 2,496 GPU cores running up to 2,496 worker threads.
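The alignment commands shown later reference a prebuilt hg38 index (the ../hg38-index prefix). With NVBIO, that index would typically be built once up front with the suite's nvBWT tool; the sketch below assumes the hg38 FASTA has already been downloaded, and the exact options should be checked against your NVBIO version.

# Build the hg38 FM-index once; every nvBowtie job then reuses it.
# hg38.fa is a placeholder path to the downloaded reference FASTA.
nvBWT hg38.fa ../hg38-index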
NVBIO includes a programmatic occupancy calculator that optimizes GPU core utilization. (A manual Excel version is also available.) The occupancy calculator correlates the projected thread count per streaming multiprocessor, register file size, shared memory utilization, the GPU compute capability, and other parameters to arrive at an optimal thread count per CUDA kernel. It then proposes an optimal kernel launch configuration, including the number of threads per block and the number of blocks per grid. Thus, with the occupancy calculator we expect optimal GPU utilization per compute node based on the dataset requirements of each job and each human genome.
NVBIO addresses the algorithmic parallelism of Parallel2.
Like many other data centers, TACC and Stampede use Linux environment modules to simplify user shell management. Stampede includes two modules that are useful for Parallel2 computing.
The “cuda” module loads the Nvidia CUDA toolkit and updates several key Linux environment variables, providing access to CUDA libraries, binaries, and files. This is required for running parallelized CUDA-enabled code on the K20 GPUs.
The “launcher” module, developed by TACC staff, supports submitting multiple serial applications across multiple compute nodes. It sets TACC_LAUNCHER_DIR, which points to a set of shell scripts for managing job submission.
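A quick interactive check confirms that both modules set up the environment as expected (variable names follow TACC's conventions at the time of writing):

module load cuda/6.0
module load launcher
echo $TACC_LAUNCHER_DIR   # directory containing the launcher shell scripts
which nvcc                # confirms the CUDA toolkit is on the PATH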
The “launcher” scripts require a “paramlist” input file as shown below:
time nvBowtie -x ../hg38-index -1 ../1000genome/ERR013138_1.filt.fastq.gz -2 ../1000genome/ERR013138_2.filt.fastq.gz --device 0 -S bam/ERR013138.bam
time nvBowtie -x ../hg38-index -1 ../1000genome/ERR015758_1.filt.fastq.gz -2 ../1000genome/ERR015758_2.filt.fastq.gz --device 0 -S bam/ERR015758.bam
etc. - 32 total records
Sample “paramlist” input file.
Each record in the paramlist file specifies a single nvBowtie job, and each job runs on a single compute node. On Stampede, you can include up to 32 records, one record per job. Each record includes the nvBowtie executable, the human genome index, the left and right paired-end read files, the GPU device ID (always 0 on Stampede, since each compute node has a single GPU), and a BAM (or SAM) output file. We also time each job to keep track of how long it runs. Thus, in this example, we’ll run 32 human genome sequence alignments simultaneously across 32 GPUs and 32 compute nodes.
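Writing 32 of these records by hand is tedious, so the paramlist can be generated from a list of run IDs. The helper below (make_paramlist.sh) is a minimal sketch under our own naming assumptions: samples.txt holds one 1000 Genomes run ID per line, and the paired FASTQ files follow the naming pattern shown above.

#!/bin/bash
# make_paramlist.sh -- emit one nvBowtie record per run ID read from the file given as $1.
while read -r sample; do
    echo "time nvBowtie -x ../hg38-index -1 ../1000genome/${sample}_1.filt.fastq.gz -2 ../1000genome/${sample}_2.filt.fastq.gz --device 0 -S bam/${sample}.bam"
done < "$1"

Running ./make_paramlist.sh samples.txt > paramlist then produces a paramlist like the one above.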
We run all of our Parallel2 jobs through the batch queues. A basic SLURM batch job script for running simultaneous jobs is shown here:
#SBATCH -J parallel2          # Job name
#SBATCH -N 32                 # Total number of nodes
#SBATCH -n 32                 # Total number of tasks
#SBATCH -p gpu                # Queue name
#SBATCH -o humangenome.o%j    # Name of stdout output file (%j expands to jobid)
#SBATCH -t 02:00:00           # Run time limit (hh:mm:ss)

module load launcher
module load cuda/6.0

setenv EXECUTABLE $TACC_LAUNCHER_DIR/init_launcher
setenv CONTROL_FILE paramlist
setenv WORKDIR .

$TACC_LAUNCHER_DIR/paramrun $EXECUTABLE $CONTROL_FILE
SLURM batch script.
The script is surprisingly simple. On Stampede the maximum number of simultaneous GPU compute nodes per job is 32. We run a single task per node, so the number of nodes equals the number of tasks. The GPU production queue name is “gpu”. We include a job name, a standard output file name, and a run time limit. We load the “launcher” and “cuda” Linux modules and set a few required environment variables. Finally, one of the “launcher” scripts, “paramrun”, builds a parallel computing environment and launches the 32 parallel jobs.
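Submitting and monitoring the job then follows the usual SLURM workflow; assuming the script above is saved as parallel2.slurm:

sbatch parallel2.slurm    # submit the 32-node job; SLURM prints the job ID
squeue -u $USER           # monitor the job in the gpu queue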
Stampede’s “launcher” module addresses the job level parallelism of Parallel2.
We ran nvBowtie with the following system and run configurations:
- Stampede supercomputer
- Dual Intel Xeon E5-2680 CPUs per compute node
- 32 GB CPU RAM per compute node
- One Nvidia Tesla K20 GPU per compute node
- Up to 2,496 GPU threads per K20
TACC data center batch queue policies limit the number of GPU runs to 32 simultaneous compute nodes. Thus, we’re limited to 32 simultaneous genome sequence alignments per batch job.
TACC: Starting up job 5857559
TACC: starting parallel tasks...
TACC Launcher -> 32 processors allocated.
TACC Launcher -> 32 total tasks to complete.
LOTS of output here
TACC: Shutting down parallel environment.
TACC: Shutdown complete. Exiting.
Typical standard out for 32 human genome sequence alignments.
At job completion we see that 32 processors were each allocated a single task, i.e., 32 simultaneous instances of nvBowtie. The key result:
- A typical runtime for 32 human genome sequence alignments = 25 minutes
For the 1000 Genomes Project, on Stampede, we need about 32 separate batch jobs (1,000+ genomes / 32 genomes per batch job ≈ 32 jobs) to run sequence alignments on all sample datasets. Assuming 25 minutes per job, this equates to roughly 800 minutes (25 min. × 32 jobs); a submission sketch for the full dataset is shown below. Therefore we see:
- About 14 hours to finish sequence alignments on the entire 1,000+ human genome dataset
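One straightforward way to submit the full set of jobs is to split the master run list into chunks of 32 and submit one launcher job per chunk. The sketch below reuses the hypothetical make_paramlist.sh helper and samples.txt list from earlier, and assumes the batch script is saved as parallel2.slurm.

#!/bin/bash
# Split the full run list into chunks of 32 and submit one launcher job per chunk.
split -l 32 samples.txt chunk_
for chunk in chunk_*; do
    ./make_paramlist.sh "$chunk" > paramlist."$chunk"
    # Point each submission at its own paramlist by rewriting the CONTROL_FILE line.
    sed "s|^setenv CONTROL_FILE .*|setenv CONTROL_FILE paramlist.$chunk|" parallel2.slurm > parallel2."$chunk".slurm
    sbatch parallel2."$chunk".slurm
done
# Note: the final chunk may hold fewer than 32 records; reduce -N/-n accordingly if desired.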
There are several impediments to optimal throughput in this project.
GPUs. Current generation K80 GPUs are about 2X faster than K20 GPUs. In the near future, Nvidia will announce its Tesla Pascal architecture, which will supplant the Kepler architecture behind the K20 and K80. Pascal will feature NVLink interconnects, 3D-stacked high-bandwidth memory, and a substantially larger transistor count, and should provide substantial performance improvements over both the K80 and the K20.
Batch queue limits. TACC batch queue policies limit jobs to 32 GPUs, even though there are 128 GPUs in the current system. Running jobs on all 128 GPUs would improve throughput by about 4X, since the job-level component of Parallel2 computing scales essentially linearly with the number of GPUs.
GPU count. Stampede contains 128 GPUs, but systems are available with hundreds to thousands of GPUs. With Parallel2 computing, simply scaling up the number of GPUs beyond 128 would increase aggregate throughput substantially.
Busy system. Stampede is a busy machine, with the GPU batch queues mostly fully occupied; typical wait times between job runs are several hours. On a dedicated GPU cluster, Parallel2 jobs could run back-to-back with no delays between jobs, improving overall throughput.
The Parallel2 computing method enables population-scale sequence alignments on NextGen Sequencing datasets. We could eliminate the remaining performance bottlenecks with a dedicated, purpose-built Pascal GPU cluster running the NVBIO suite. Algorithmic and job level parallelism are a powerful combination for tackling population-scale NextGen Sequencing projects.