Assembling genomes with little memory.
De novo assembly is the construction of a genome without reference to an existing, known genome. A wide variety of de novo assembly tools is available, and it’s well known that many of them require substantial amounts of memory when assembling “large” genomes. For these assemblers, a “large genome” usually means anything greater than about 1 Gb (gigabase, one billion nucleotides), although some genomes exceed this length by a considerable amount.
From experience, we know that some assemblers require several hundred GB (gigabytes) up to 1 TB (terabyte) of RAM per compute node in a cluster computing environment in order to assemble large genomes. This can be very expensive. We are therefore evaluating a relatively new de novo assembler, Minia, as an alternative to some of the traditional assemblers. Reportedly, Minia can assemble human-scale genomes with about 5 GB of RAM. Let’s see what our evaluation found.
We’ll assemble the genome of James D. Watson. Watson’s genomic DNA sequences can be obtained from the NCBI Sequence Read Archive (SRA) via DNAnexus Sequence Read Archive+ and the NCBI FTP server, as follows:
- SRA Project ID: SRP000095
- NCBI download: ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson (login as Guest)
- Read info: 74,198,831 DNA reads
- Project title: “The complete genome of an individual by massively parallel DNA sequencing.” See Wheeler et al.
(The Wheeler et al. paper claims ca. 106.5M reads were generated from their NextGen Sequencing experiments. However, we only count ca. 74M reads in the NCBI repository.)
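For readers who prefer to script the download, a minimal sketch using Python’s standard ftplib is shown below. The host and directory come from the NCBI location listed above, and the anonymous login mirrors the “Guest” access; the file names and layout on the server are assumptions, so inspect the listing before fetching everything.

```python
from ftplib import FTP
import os

FTP_HOST = "ftp.ncbi.nih.gov"
FTP_DIR = "pub/TraceDB/Personal_Genomics/Watson"   # directory from the listing above

with FTP(FTP_HOST) as ftp:
    ftp.login()                      # anonymous ("Guest") login
    ftp.cwd(FTP_DIR)
    names = ftp.nlst()               # list the files in the Watson directory
    print(len(names), "files found")
    for name in names:
        local = os.path.basename(name)
        if os.path.exists(local):    # crude resume: skip files already fetched
            continue
        with open(local, "wb") as fh:
            ftp.retrbinary("RETR " + name, fh.write)
        print("downloaded", local)
```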
We ran Minia on the Colorado State University Cray supercomputer with the following system configuration and Minia run parameters:
- Cray XT6m
- Two 12-core AMD Opteron 6100 CPUs per compute node
- 32 GB RAM per compute node
- 1 CPU-thread
- k-mer size = 31
- min abundance = 3
We ran Minia single-threaded on one CPU core of one compute node. The goal of our benchmark runs was to test RAM utilization, not wall-clock performance or scalability per se, hence the choice of single-core runs. The chosen k-mer size is based on the recommendation in the Minia documentation for a ~3 Gb target genome. The “min abundance” parameter removes erroneous, low-abundance k-mers; the documentation suggests a value of 3.
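To make those two parameters concrete: Minia builds its de Bruijn graph from k-mers (length-k substrings of the reads) and discards any k-mer observed fewer than min-abundance times, on the assumption that rare k-mers are sequencing errors. The following is a toy Python sketch of that counting-and-filtering idea; it is purely illustrative and not Minia’s actual on-disk, memory-frugal algorithm.

```python
from collections import Counter

def count_kmers(reads, k=31):
    """Count every k-mer (length-k substring) across a collection of reads.
    Real assemblers also fold in reverse complements; omitted here for brevity."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_abundance=3):
    """Keep only 'solid' k-mers seen at least min_abundance times; rarer
    k-mers are treated as likely sequencing errors and discarded."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

# Toy example with tiny reads and a small k so the filtering is visible.
reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTTCGTAC"]
print(solid_kmers(count_kmers(reads, k=5), min_abundance=2))
```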
Side note: perusing the source code, we see some OpenMP pragmas:
```
#if OMP
#pragma omp parallel
if(use_compressed_reads && ! separate_count
```
but it’s not clear whether parallelization is still in development or production-ready. Therefore, we thought it prudent to run on a single CPU core, rather than multi-core, for the RAM tests.
The results of the Minia de novo assembly run were:
- Total wall clock time: 40.9 hrs.
- Max. memory: 2.1 GB
Watson’s genome assembly required 40.9 wall-clock hours of run time. Minia reported several key phases during the run, including “counting k-mers” (4.1 hrs.), “writing k-mers” (2.3 hrs.), “Debloom step” (8.3 hrs.), “assembly” (24.0 hrs.), and some miscellaneous steps. This is a fairly long run, but keep in mind that we used only a single CPU core for the assembly.
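The “Debloom step” points to the design that makes Minia so memory-frugal: per Chikhi and Rizk, the de Bruijn graph is stored as a Bloom filter over the solid k-mers, plus a small explicit list of “critical false positives” computed during debloom. A Bloom filter answers set-membership queries in a handful of bits per element, at the cost of occasional false positives. The toy Python sketch below illustrates the data structure only; the hash choices and sizing are ours, not Minia’s.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: membership queries in a few bits per item,
    with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits=1_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Insert solid k-mers, then query putative graph neighbours during traversal.
bf = BloomFilter()
kmer = "ACGT" * 7 + "ACG"        # an example 31-mer
bf.add(kmer)
print(kmer in bf)                # True
print(("T" * 31) in bf)          # almost certainly False
```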
The reported RAM usage was 2.1 GB. The Minia documentation estimates ca. 1 – 2 GB of RAM consumption per 1 Gb of target genome size. The human genome is ca. 3.1 Gb, so we would expect roughly 3 – 6 GB of RAM utilization for Watson’s assembly. The measured 2.1 GB came in somewhat below that predicted range.
We’ll use Quast to examine the Watson genome assembly metrics.
Minia Assembly Metrics:

| Metric | Value |
| --- | --- |
| Longest contig | 9,223 bp |
| Total # contigs | 8,213,731 |
| Total # contigs >= 1,000 bp | 218,513 |
| Total length | 2,088,316,147 bp |
| Total length >= 1,000 bp | 289,355,173 bp |
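Quast derives these figures (and the N50 discussed below) from the contig FASTA that Minia writes. As a quick sanity check, the basic length statistics can also be computed directly from the contigs; the following is a minimal sketch, where the file name contigs.fa is only illustrative.

```python
def contig_lengths(path):
    """Yield the length of each contig in a FASTA file."""
    length = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

def n50(lengths):
    """Length of the shortest contig among the longest contigs that
    together cover at least half of the total assembly length."""
    lengths = sorted(lengths, reverse=True)
    half, running = sum(lengths) / 2.0, 0
    for length in lengths:
        running += length
        if running >= half:
            return length

lengths = list(contig_lengths("contigs.fa"))        # illustrative file name
big = [x for x in lengths if x >= 1000]
print("Total # contigs:", len(lengths), "Total length:", sum(lengths))
print("# contigs >= 1,000 bp:", len(big), "Length >= 1,000 bp:", sum(big))
print("Longest contig:", max(lengths), "N50:", n50(lengths))
```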
Comparing these results with Table 2 of Chikhi and Rizk, we see some interesting trends: a) the N50 is similar to the Minia, ABySS, and SOAPdenovo assemblies of another human genome reported there, b) the longest contig, at roughly 9 kb, is decent but somewhat shorter than those reported for Minia, C&B (now Gossamer), and ABySS, c) the total number of contigs was somewhat greater than in the C&R paper, and d) the total length was about the same as in the C&R paper. Overall, the metrics for the Minia Watson assembly were broadly in line with the assemblers covered in the C&R paper.
If your computational systems have limited amounts of RAM, you may want to consider Minia for de novo assembly. We assembled a human-genome-scale NGS dataset (ca. 3 Gb) with ca. 2 GB of RAM, one to two orders of magnitude less RAM than most assemblers require.
We encourage the Minia developers to include a scaffolder tool and an assembly metrics tool in the Minia package. Hunt et al. offer a comparative analysis of candidate scaffolding tools. The Assemblathon proceedings (Earl et al., Bradnam et al.) offer some recommendations on assembly metrics. These two additions would be highly beneficial for end users.
Based on the Minia source code, there appears to be an effort to parallelize the application. The documentation does not mention parallelization, so we have to assume it is not ready for production release. A faster, parallelized version of Minia would be useful for assembling large genomes.
Given the present state of development for Minia, we would recommend that it be used with some caution, and that any results be independently validated. However, the very low RAM requirements make Minia an intriguing alternative to other assemblers.