DNA-Seq is a Next Generation Sequencing (NGS) method for identifying and analyzing genomic mutations. It is used for the study of genetic disease states, population and evolutionary relationships, heredity and breeding characteristics, and the effects of genetic mutations and genomic structural variations on molecular functions. DNA-Seq is one of the most common types of sequencing runs performed in our NGS Core. We’ll take a brief look at it here.
DNA-Seq runs are typically used for three types of research protocols:
- Whole genome sequencing
- Targeted resequencing
- Exome sequencing
As the name implies, whole genome sequencing involves the sequencing of entire genomes regardless of size or complexity. In contrast, targeted resequencing involves sequencing only specific regions of interest within a genome. These regions include genes, sets of genes, exons, exomes, enhancers, promoters, telomeres, amplicons and other uniquely identifiable structural elements. Exome sequencing is a subset of targeted resequencing. It involves sequencing all protein-coding exons while ignoring everything else within a genome.
The various steps, processes and methods involved in DNA-Seq runs are often collated into standard workflows and pipelines. A typical DNA-Seq pipeline is shown here:
The main steps in the pipeline are:
- Perform sequencing run on DNA sequencer
- Generate FASTQ files containing DNA fragments (“reads”)
- Filter out adapter sequences
- If the sequencing run included multiple samples per chip/flowcell, filter out barcode/index sequences
- Perform quality control check
- Trim and filter reads based on QC results
- Identify and obtain appropriate reference genome sequence
- Align sample reads to reference genome
- Generate various reports for SNP and structural rearrangement analysis
- Optionally compare SNP/rearrangement reports to published databases
Sequencing runs generate FASTQ files with several million to several billion DNA reads, depending on the instrument used for the run. Adapter sequences are trimmed from the reads with tools like Trimmomatic, FASTX-Toolkit and others. If the sequencing run was multiplexed, i.e. included multiple samples per chip or flowcell, then the barcode or index sequences are used to separate the samples and are removed from the reads. Ion Torrent instruments use barcode sequences and Illumina instruments use index sequences to identify and separate mixed samples; barcoding and indexing are the same concept under different names. A quality control check then identifies short reads and low-quality reads. FastQC is a popular tool for the QC step. Based on the QC results, reads are trimmed and filtered to remove very short sequences (typically < 25 bp) and low-quality reads (typically Phred quality scores < 20, i.e. an estimated base-call error rate above 1 in 100). QC, trimming and filtering will be covered in more detail in a separate blog post.
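To make the trimming and filtering step concrete, here is a minimal Python sketch of the idea. It is not how Trimmomatic or FastQC work internally. The function names (`parse_fastq`, `trim_3prime`, `filter_reads`) are made up for this illustration, and it assumes Phred+33 quality encoding and exactly four lines per FASTQ record. The 25 bp and Q20 cutoffs are the typical values quoted above.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 25    # drop reads shorter than this after trimming (typical cutoff)
MIN_QUALITY = 20   # Phred cutoff (typical value)
PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # the '+' separator line
        qual = next(it, "").strip()
        yield header, seq, qual


def phred_scores(qual: str) -> List[int]:
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def trim_3prime(seq: str, qual: str) -> Tuple[str, str]:
    """Remove bases from the 3' end while their score is below MIN_QUALITY."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < MIN_QUALITY:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records: Iterable[Record]) -> Iterator[Record]:
    """Trim each read, then drop reads that are too short or too noisy."""
    for header, seq, qual in records:
        seq, qual = trim_3prime(seq, qual)
        if len(seq) < MIN_LENGTH:
            continue  # too short after trimming
        scores = phred_scores(qual)
        if sum(scores) / len(scores) < MIN_QUALITY:
            continue  # mean quality still below the cutoff
        yield header, seq, qual
```

Calling `filter_reads(parse_fastq(open_file))` on an open FASTQ file handle yields only the reads that survive both checks. Real tools add sliding-window trimming, adapter matching and paired-end handling on top of this.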
An appropriate reference genome is obtained from one of many sources. NCBI GenBank is the most common source, but many others are available. Perhaps the most important step in the DNA-Seq pipeline is the alignment of sample reads against the chosen reference genome. This step locates genomic mutations, SNPs/MNPs, and any structural rearrangements that may be present in each sample. Depending on the size and complexity of the reference genome, and the number of samples in the sequencing run, this step can be computationally demanding, which is why it is highlighted in pink in the pipeline figure. Numerous reports are then generated, depending on the bioinformatics tools in use. SNPs, MNPs and genomic rearrangements can be compared to published databases, such as dbSNP, to identify known variants.
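For a feel of what "aligning a read to a reference" means, here is a deliberately naive Python sketch. It slides each read (and its reverse complement) along the reference and keeps the placement with the fewest mismatches. This is a brute-force, ungapped stand-in chosen for readability. Production aligners rely on indexed data structures and handle gaps, which is what makes whole-genome alignment feasible. The function names and the `max_mismatches` parameter are assumptions made up for this example.

```python
from typing import Optional, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_ungapped_hit(read: str, reference: str,
                      max_mismatches: int) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def align_read(read: str, reference: str,
               max_mismatches: int = 2) -> Optional[Tuple[int, str, int]]:
    """Try both strands; return (position, strand, mismatches) or None."""
    candidates = []
    forward = best_ungapped_hit(read, reference, max_mismatches)
    if forward is not None:
        candidates.append((forward[1], forward[0], "+"))
    reverse = best_ungapped_hit(reverse_complement(read), reference, max_mismatches)
    if reverse is not None:
        candidates.append((reverse[1], reverse[0], "-"))
    if not candidates:
        return None  # read does not map within the mismatch budget
    mismatches, pos, strand = min(candidates)  # fewest mismatches wins
    return pos, strand, mismatches
```

The cost of this approach grows with reference length times read length for every read, which gives an idea of why alignment dominates the compute budget of a run.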
Note that there is no formal standard workflow for DNA-Seq pipelines. Rather, there is a community consensus about key steps that should be included in most workflows. There are many variations around these depending on the specific goals of a research project.
A typical DNA-Seq sequence alignment is shown here:
The chosen reference genome sequence is shown in the first line with colored nucleotide bases “ACGT”. The length of the reference genome can be very short, say a few kilobases for viral genomes, or fairly long, around 3.1 gigabases for human genomes, or well beyond this length for some plants and other eukaryotes. Individual sample reads are shown as grey bars. The 5′ end of a read is depicted with a blunt face and the 3′ end of a read is depicted with a chevron face. Some reads map to the reference genome in a “forward” direction (5′ to 3′) and some map in a “reverse” direction (3′ to 5′), hence the grey bars are shown mapped in either direction. Due to the random nature of DNA shearing, sample reads map across the reference genome in a staggered fashion. Coverage is the number of sample reads mapped to a particular locus in the reference genome. At this position in the reference genome, average coverage appears to be around 30X. In the depiction above, only variant basecalls are demarcated; the remaining basecalls match the reference nucleotide. One locus appears to be heterozygous, with about half the basecalls guanine “G” against a reference nucleotide adenine “A”. The sequence alignment thus clearly identifies mutant alleles in sample reads relative to the chosen reference genome.
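Coverage and the heterozygous locus described above can be illustrated with a small Python sketch. It stacks ungapped, already-aligned reads into per-position base counts (a "pileup"), reads coverage off the counts, and labels a site by the fraction of non-reference basecalls. The function names, the minimum depth of 4 and the 25% to 75% heterozygous window are illustrative assumptions, not values used by any particular variant caller.

```python
from collections import Counter
from typing import List, Tuple


def pileup(alignments: List[Tuple[int, str]], ref_length: int) -> List[Counter]:
    """Count basecalls per reference position from (start, read) pairs (0-based)."""
    columns = [Counter() for _ in range(ref_length)]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_length:
                columns[pos][base] += 1
    return columns


def coverage(columns: List[Counter]) -> List[int]:
    """Coverage is the number of reads overlapping each position."""
    return [sum(col.values()) for col in columns]


def call_site(ref_base: str, column: Counter, min_depth: int = 4,
              het_range: Tuple[float, float] = (0.25, 0.75)) -> str:
    """Label one position using hypothetical allele-fraction thresholds."""
    depth = sum(column.values())
    if depth < min_depth:
        return "no-call"  # too few reads to say anything
    alt_counts = {b: n for b, n in column.items() if b != ref_base}
    if not alt_counts:
        return "hom-ref"
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = alt_n / depth
    if fraction < het_range[0]:
        return "hom-ref"  # a few stray basecalls, likely sequencing errors
    if fraction <= het_range[1]:
        return f"het {ref_base}/{alt_base}"
    return f"hom-alt {alt_base}"
```

With four reads covering a reference "A" where two of them carry "G", `call_site` reports a heterozygous A/G site, mirroring the locus in the figure. Real callers also weigh base qualities and mapping qualities rather than raw counts.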
DNA-Seq sequence alignments can yield a wealth of information about gene variants and changes in genome structure. In a typical DNA-Seq experiment, researchers are usually interested in the following genome variants:
- SNPs (single nucleotide polymorphisms)
- MNPs (multiple nucleotide polymorphisms)
- Small indels (insertions, deletions)
- Structural rearrangements (inversions, duplications, translocations, copy number variation)
SNPs are single nucleotide differences between a sample genome and the chosen reference genome. SNPs derived from DNA-Seq runs can be cross-referenced with databases such as dbSNP, ClinVar and others to identify known disease-associated SNPs. A natural extension of SNPs is multiple nucleotide polymorphisms, where a group of two to several adjacent nucleotides differs from the reference genome. While not as common as SNPs, MNPs are still an important element in disease research. Small indels are nucleotide insertions and deletions of one to several base pairs, which may result in deleterious frameshift mutations. A variety of genomic structural rearrangements can seriously impact normal gene functions. Inversions, duplications, translocations, and copy number variation are among several such genomic restructurings.
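The small-variant categories above can be told apart from nothing more than the reference and alternate alleles. The Python sketch below assumes the alleles are written in a normalized reference/alternate form, as in a typical variant report; the function names are made up for this example. Structural rearrangements are out of scope here, since they need evidence beyond a single allele pair.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles."""
    if len(ref) == len(alt):
        if ref == alt:
            return "no variant"
        # Same length: one substituted base is a SNP, several are an MNP.
        return "SNP" if len(ref) == 1 else "MNP"
    # Different lengths: bases were gained or lost relative to the reference.
    return "insertion" if len(alt) > len(ref) else "deletion"


def is_frameshift(ref: str, alt: str) -> bool:
    """True if the length change is not a multiple of three.

    Only meaningful for variants inside a protein-coding sequence.
    """
    return (len(alt) - len(ref)) % 3 != 0
```

For example, `classify_variant("A", "ATG")` reports an insertion, and `is_frameshift("A", "ATG")` is true because two extra bases shift the reading frame, while a three-base insertion keeps the frame intact.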
SNPs/MNPs, indels, and structural rearrangements can be linked to Gene Ontology (GO) analysis and Biochemical Signaling Pathway analysis. GO analysis identifies which cellular, molecular and biological processes are potentially affected by point mutations and genome rearrangements. Pathway analysis identifies which cell signaling, metabolic and biological pathways are altered by gene variants and rearrangements. Both methods yield a more complete picture of the impact of gene mutations and genome structural alterations on biological functions.
There is now a vast array of bioinformatics tools available for data analysis of DNA-Seq experiments. Some of these are cloud-based and some are more traditional workstation-based tools. To get a feel for what’s available see:
I’ll give a shout-out to a few key tools that we use locally in our NGS Core:
We’ll review other types of NGS experiments in future blogs.