(Figure from Marchetti et al. 2012)
Metatranscriptomics is the study of RNA expression and regulation in communities of organisms. It extends RNA-Seq from individual species to communities of species. As you might imagine this significantly compounds the difficulty of data analysis; handling one species is challenging enough, let alone tens to potentially hundreds (thousands?) of species. Metatranscriptomics for NextGen sequencing has been developing steadily over the past few years but there are relatively few studies and papers available in this area. It’s not a stretch to say that with metatranscriptomics we’re working at the leading edge of NextGen sequencing technology. It’s exciting and challenging at the same time.
Recently we were asked to run NextGen sequencing on samples of bacterial/viral populations and develop metatranscriptomics pipelines for data analysis of the results. We ran the sample sequencing on Life Technologies Ion Proton sequencer, which yielded about 30M cDNA reads per sample for transcript analysis. This is a brief discussion of our experience with the data analysis part.
We developed a metatranscriptomics pipeline derived in part from a paper by Haas et al. The required software applications are listed in the paper. To get the pipeline to work, you’ll want to use the versions listed here. We discovered irreconcilable incompatibilites when we tried mixing versions other than those shown below.
- Trinity RNA-Seq, v. r2013-02-25
- Bowtie, v. 0.12.9 (not Bowtie2)
- Samtools, v. 0.1.18
- Blast+, v. 2.2.27
- R, v. 2.15
After installing the R statistical package, you’ll need to install
within an R session using something like the following commands:
> biocLite (′edgeR′)
With everything installed, you’ll need some fairly hefty computational resources to handle the lengthy sequence alignment and de novo assembly steps. We used Cray XT6m and Cray XE6 supercomputers for this but any Linux cluster with an adequate number of CPU cores and RAM should suffice. (The required compute systems for the pipeline could increase substantially in the near future with new NextGen sequencers pumping out 109 reads.)
Here is a pipeline flowchart. Computationally intensive steps are shown in pink.
The main steps in the pipeline are:
- Install all software components
- Gather FASTQ files of cDNA reads from NextGen sequencers
- If the sequencer run used paired-end or mate-pair protocol, combine left and right reads in separate files; otherwise continue for single-end reads
- Assemble reads into transcripts using Trinity RNA-Seq application
- Check statistical metrics from assembly step
- Estimate transcript abundance with RSEM application
- Identify significantly differentially expressed genes with edgeR
- Perform TMM (trimmed mean of M-values) normalization
- Perform hierarchical clustering of transcripts
Paired-end or mate-pair reads are preferable for the Trinity de novo assembly step, but single-end reads will also yield reasonably good transcripts. The statistical analysis step will give some guidance on how good the assembly step was. RSEM yields estimates of transcript abundance and gene expression levels, which are then used in edgeR to determine significantly differentially expressed genes. The normalization step is important to adjust expression levels for differences in sequencing depth (number of reads) across samples. Hierarchical clustering of transcripts is an optional step if you want to group sets of transcripts with similar expression levels.
The two computationally intensive steps involve Trinity and Bowtie. There are parallelized versions of both applications. We encourage continued efforts to parallelize and optimize these two applications. These steps will become real bottlenecks in the pipeline with the massive datasets now coming from NextGen sequencers.
MA-plots show RNA transcript fold change as a function of transcript abundance. Significantly differentially expressed genes (p < 0.05) are shown in red. Up-regulated genes are shown as y > 0; down-regulated genes are y < 0. High abundance transcripts appear on the right x-axis; low abundance transcripts appear on the left x-axis. Housekeeping genes are mostly shown in black. In whole transcriptome research on single species, MA-plots show transcripts just for that individual species. However, for metatranscriptome studies, MA-plots show transcripts from all species constituting the original samples. Clearly, in this experiment, there were many significantly up-regulated / down-regulated genes.
Volcano plots show statistical significance levels (false discovery rates (FDR); p-values) as a function of fold change. Significantly differentially expressed genes with FDR < 0.1% are shown in red. For metatranscriptomic studies, volcano plots show highly expressed genes from multiple species in the original samples.
Several issues became apparent while developing the pipeline.
- What species are in these samples? Whenever you’re handling communities of species, which is the case for any of the “meta-” studies – metagenomics, metatranscriptomics, metaproteomics – the species composition of the samples is an open question. NextGen researchers are often surprised at the extreme diversity of species found in their samples. Typically, researchers are interested in a handful of target species and consider everything else to be “contaminating” species or of secondary interest. However, the metatranscriptomics pipeline is non-discriminatory and will process all the RNA (cDNA) from all species in the samples. This is an important distinction from more traditional RNA-Seq research.
- Alternative splicing. In the study mentioned here we believe the majority of species were of prokaryotic bacterial and viral origin. Alternative splicing and isoform detection was thus not much of an issue. However, in other studies, there may be a substantial number of eukaryotic species present in the samples. Alternative splicing then becomes a real issue, especially for the de novo assembly step. Trinity apparently detects alternatively spliced isoforms for single species samples, but how well does it perform with multi-species samples, especially when there are closely related species in the samples? This is probably an open question that needs more research and validation.
- Ribosomal RNA. rRNA often constitutes >90% of sample RNA. We use various pre-sequencing ribodepletion methods to eliminate as much rRNA as possible. Nevertheless, there is almost always some remaining post-sequencing rRNA. In data analysis these are usually easy to identify as they show huge RPKM values, and BLAST’ing those reads against rRNA databases reveal their identity. In the pipeline described here we had minimal post-sequencing rRNA removal, so rRNA assembly was included in the analysis. In a (near) future blog I’ll describe some computational methods under development for post-sequencing rRNA extraction. We’ll update the pipeline with these new methods.