Whole Genome Sequencing: Principle, Types, Process, Uses

Whole genome sequencing (WGS) is a method that determines the complete DNA sequence of an organism. It covers the entire genome, including both coding and non-coding regions, using automated DNA sequencing techniques and bioinformatic tools. Unlike targeted methods such as exome sequencing, which focus on specific regions of the genome, WGS provides a comprehensive view of the entire genome.


The first widely adopted DNA sequencing method was Sanger sequencing, developed in the late 1970s. It was initially performed manually and was automated from the late 1980s onward, which made sequencing whole genomes practical. The shotgun method, which fragments DNA, sequences the pieces, and reassembles them computationally, was introduced in 1979 for small genomes. The first free-living organism to have its genome completely sequenced was the bacterium Haemophilus influenzae, in 1995. The sequencing of the human genome was completed in 2003. Early WGS techniques were slow and expensive, but the development of next-generation sequencing (NGS) techniques has made whole genome analysis faster and more affordable.


Principle of Whole Genome Sequencing

The principle of whole genome sequencing is to determine the complete DNA sequence of an organism’s genome, including non-coding regions. This provides detailed information about the genes, regulatory elements, and variations in the genome. The process involves extracting DNA from the organism, constructing a sequencing library, sequencing the DNA fragments, and analyzing the sequence data to identify genetic variations.

WGS uses two main strategies: shotgun sequencing and paired-end sequencing. In shotgun sequencing, the DNA is randomly cut into smaller fragments. Each fragment is sequenced individually, and the resulting reads are compared to find overlaps between them so that the original sequence can be reconstructed. Paired-end sequencing reads both ends of each DNA fragment. Because the two reads come from the same fragment, this provides additional positional information and allows more accurate reconstruction of the DNA sequence.
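
The two strategies can be illustrated with a toy sketch. It is not how a sequencer or assembler actually works: the fixed fragment length, the random seed, and the helper names (`shotgun_fragments`, `suffix_prefix_overlap`, `paired_end_reads`) are all assumptions made for illustration.

```python
import random

def shotgun_fragments(genome, n_fragments, frag_len, seed=0):
    """Sample fixed-length fragments starting at random positions."""
    rng = random.Random(seed)
    max_start = len(genome) - frag_len
    starts = [rng.randint(0, max_start) for _ in range(n_fragments)]
    return [genome[s:s + frag_len] for s in starts]

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def paired_end_reads(fragment, read_len):
    """Read both ends of one fragment: the first read runs forward from
    one end, the second is the reverse complement of the other end."""
    return fragment[:read_len], reverse_complement(fragment[-read_len:])

overlap = suffix_prefix_overlap("ACGTTG", "TTGCA")   # the reads share "TTG"
read_1, read_2 = paired_end_reads("AAACCCGGGT", 3)
```

Here `overlap` is 3, and the pair is `("AAA", "ACC")`. Both reads of a pair are known to come from the same fragment, and that linkage is the extra information mentioned above.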

Types of Whole Genome Sequencing

Whole Genome Sequencing can be classified into two categories:

De novo genome sequencing is used for assembling new genomes with no prior reference sequence. It is particularly useful for sequencing newly studied species or genomes that are highly variable. It provides the foundational sequence data necessary for further genomic studies and is used to create reference genomes for new species. This process can be demanding due to the complexity of the genome and the requirement for extensive bioinformatics resources and expertise.

Whole-genome resequencing (WGR) involves sequencing the genome of an individual or population and comparing it to an existing reference genome to identify variants. This method requires a reference genome for read mapping and variant identification. WGR is widely used to identify genetic variants and study genetic diversity. 

Process of Whole Genome Sequencing (WGS)

1. Sample Preparation

  • The first step in the WGS process is sample preparation, which involves obtaining high-quality nucleic acid samples.
  • Biological samples are obtained from the organism of interest. The cells are lysed using physical or chemical methods to release DNA.
  • The DNA is then purified from proteins, lipids, and other cellular debris using different extraction methods.

2. Library Construction

  • Once the nucleic acids are purified, the next step is constructing a sample library containing short fragments of DNA.
  • The genomic material is fragmented into required lengths using mechanical shearing or enzymatic digestion.
  • The fragmented DNA undergoes end repair followed by the ligation of adapters to the ends of the DNA fragments. These adapters contain sequences necessary for sequencing.
  • The adapter-ligated library is enriched to ensure a high concentration of DNA fragments for sequencing. The library constructed is validated for quality to meet the requirements for sequencing instruments.

3. Sequencing 

  • After library construction, the samples are sequenced. The prepared library is loaded onto the chosen sequencing platform.
  • Different sequencing technologies are available. At present, next-generation sequencing platforms such as Illumina, PacBio, and Oxford Nanopore are popular for WGS. Illumina platforms generate massive quantities of short reads, while PacBio and Oxford Nanopore platforms generate much longer reads.
  • The sequencing output is written to standardized files (commonly FASTQ), which are then used for alignment and further analysis.
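
As an illustration of what such a standardized file looks like, the sketch below reads records in the FASTQ layout, assuming the common four-lines-per-record form and the Phred+33 quality encoding. The function names `read_fastq` and `phred_scores` are hypothetical, and real pipelines would normally use an established parser instead.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ stream
    that uses exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode a quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]

example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
scores = phred_scores(records[0][2])
```

Each base has one quality character, so the sequence and quality lines must have equal length. The sketch enforces that and nothing else.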

4. Alignment and Assembly

  • The process of alignment maps the short nucleotide reads to a reference genome. This step is computationally intensive and time-consuming due to the vast number of possible positions in a reference genome. Some of the tools used for alignment are mrFAST, SHRiMP, BOWTIE/BOWTIE2, SOAP2, and BWA.
  • Sequence assembly reconstructs the genome sequences into large contiguous segments. There are two main methods of sequence assembly: reference-based assembly and de novo assembly.
  • Reference-based assembly aligns the reads to an existing reference genome sequence and produces a sequence that closely matches the reference. It requires fewer computational resources but it cannot generate novel sequences that are not present in the reference genome. Examples of tools for this type of assembly include MAQ, SeqMap, and RMAP.
  • De novo assembly uses computational methods to align overlapping reads and assemble them into contigs. It does not rely on the reference genome. This method is necessary for discovering new sequences and requires significant computational resources to process and assemble large amounts of data. Tools used include Arachne, Velvet, SOAPdenovo, and ABySS.
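
The difference between the two approaches can be sketched on toy data. The code below is illustrative only and does not reproduce the algorithms of any tool named above. It assumes error-free or nearly error-free reads from one strand, a k-mer seed index for mapping, and a greedy overlap merge for de novo assembly. All function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Reference-based placement: seed with the read's first k-mer, then
    verify the full read. Returns (position, mismatches) or None.
    Reads with an error inside the seed are missed by this simple scheme."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

def overlap(a, b, min_len):
    """Longest suffix of `a` matching a prefix of `b`, at least min_len long."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """De novo assembly: repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)                 # no reference needed
reference = "ATGGCGTACCTGA"
index = build_kmer_index(reference, 4)
hit = map_read("GCGTTCC", reference, index, 4)   # one mismatch vs. reference
```

With these toy reads, `contigs` is `["ATGGCGTACCTGA"]` and `hit` is `(3, 1)`. The mapper can only place reads where the reference already has sequence, while the greedy merge builds sequence from overlaps alone, at the cost of comparing every pair of contigs on each pass.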

5. Quality Control

  • Sequencing platforms may generate errors, including poor-quality reads, base calling errors, and PCR duplicates. Quality control ensures that sequencing errors are minimized, leading to more accurate biological analyses.
  • The quality control process involves checking the raw sequencing data for various quality metrics, including read length, primer contamination, adapter contamination, and read quality. Low-quality reads and those containing adapters are identified and removed.
  • Different metrics are used to assess the quality of assembled genomes, including the N50 or N90 statistic, assembly size, contig numbers, and the number of mismatches.
  • Some commonly used tools for quality control include FastQC, FASTX-Toolkit, PRINSEQ, and QUAST.
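
A minimal sketch of two of these checks follows: a read-level filter and the N50/N90 assembly statistic. The thresholds, the Phred+33 encoding, and the function names are assumptions for illustration. They are not the defaults of FastQC, QUAST, or any other tool listed above.

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def trim_adapter(seq, qual, adapter):
    """Cut the read at the first full occurrence of the adapter sequence."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def passes_qc(seq, qual, min_mean_q=20, min_len=4):
    """Keep reads that are long enough and have adequate mean quality."""
    return len(seq) >= min_len and mean_phred(qual) >= min_mean_q

def nx(contig_lengths, x=50):
    """N50 (x=50) or N90 (x=90): the length L such that contigs of length
    at least L together contain at least x percent of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0

lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # total assembly size: 54
n50 = nx(lengths, 50)
n90 = nx(lengths, 90)
```

For these lengths, `n50` is 8 because 10 + 9 + 8 = 27 reaches half of 54, and `n90` is 4. A higher N50 for the same assembly size means the sequence sits in fewer, longer contigs.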

6. Variant Calling

  • The variant calling process identifies differences between the sequenced genome and the reference genome. The detected variants and mutations can then be studied for their associations with diseases.
  • Variant calling tools can be categorized based on the type of variants they detect, such as single-nucleotide polymorphisms (SNPs), insertions or deletions (indels), structural variants (SVs), and copy number variations (CNVs).
  • Some of the variant calling tools are GATK, SOAPsnp, SAMtools, iSVP, SvABA, and SVMerge.
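
The core idea of single-nucleotide variant calling can be sketched as a pileup: count the bases that aligned reads place at each reference position and report positions where the majority base differs from the reference. This is a deliberately naive illustration, not the statistical model used by GATK or SAMtools. It assumes ungapped alignments given as `(start, read)` pairs with 0-based positions, a haploid sample, and arbitrary `min_depth` and `min_alt_fraction` thresholds.

```python
from collections import Counter, defaultdict

def call_snps(reference, alignments, min_depth=3, min_alt_fraction=0.8):
    """Return (position, ref_base, alt_base, depth) for each position where
    the most common read base differs from the reference."""
    pileup = defaultdict(Counter)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (base != reference[pos]
                and depth >= min_depth
                and support / depth >= min_alt_fraction):
            calls.append((pos, reference[pos], base, depth))
    return calls

reference = "ACGTACGT"
alignments = [(0, "ACGAAC"), (1, "CGAACG"), (2, "GAACGT")]
snps = call_snps(reference, alignments)
```

All three reads carry `A` where the reference has `T` at position 3, so `snps` is `[(3, "T", "A", 3)]`. Indels, structural variants, and copy number changes need different evidence (gapped alignments, split reads, read depth) and are outside this sketch.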

7. Annotation

  • After the variant calling step, the next step is genome annotation. Annotation involves adding biological information to the sequenced data and the identified variants.
  • Structural annotation predicts the locations and structural components of genes and other genomic elements, in part by mapping these segments to known gene sequences from existing databases. This involves identifying open reading frames (ORFs), which are stretches of sequence that can potentially encode proteins. Tools commonly used for structural annotation include AUGUSTUS and GeneMark.
  • Functional annotation assigns functions to the predicted genes by comparing them to existing databases. Tools like BLAST and InterProScan are commonly used. This provides information about the gene functions and regulatory elements.
  • Several tools are available for variant annotation such as ANNOVAR, VAT (Variant Annotation Tool), GATK (Genome Analysis Toolkit), and VEP (Variant Effect Predictor).
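
As a small illustration of ORF identification, the sketch below scans the three forward reading frames for an `ATG` start codon followed in-frame by a stop codon. It is a simplification under stated assumptions: only the forward strand is scanned, the standard stop codons are used, and introns are ignored. Gene predictors such as AUGUSTUS and GeneMark use statistical models, not this rule. The name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.
    `start` is the index of the A in ATG; `end` is exclusive and includes
    the stop codon. `min_codons` counts codons before the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

orfs = find_orfs("CCATGAAATTTTAGGG")
```

The example sequence contains `ATG AAA TTT TAG` starting at index 2, so `orfs` is `[(2, 14, 2)]`. A fuller version would also scan the reverse complement to cover the other strand.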

8. Analysis

  • The final step in WGS is interpreting and analyzing the annotated data to translate the sequencing data into meaningful biological insights.
  • This step involves several analyses to validate the findings of the annotation process and understand the significance of the identified variants.
  • Pathway analysis identifies the biological pathways affected by variants and helps explain their functional impact, using databases like KEGG and Reactome.
  • Population genetics analyzes genetic diversity and provides information about the evolutionary history and genetic risk factors for diseases.
  • Comparative genomics compares genomes and constructs phylogenetic trees to study evolutionary relationships using tools like OrthoMCL, MEGA, and PhyML.
  • Gene expression analysis uses RNA-seq data to study gene expression patterns, while epigenetic analysis studies modifications such as DNA methylation (using techniques like bisulfite sequencing) and histone modifications (using techniques like ChIP-seq).
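
One simple building block of comparative analysis is a pairwise distance between already-aligned sequences, which distance-based tree methods take as input. The sketch below computes the p-distance (proportion of differing sites). The sample names and sequences are invented, and this is not the method of any specific tool named above.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(aligned):
    """Pairwise p-distances for a dict of {name: aligned_sequence}."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }

# Hypothetical aligned sequences from three samples.
samples = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}
distances = distance_matrix(samples)
```

Here samples `a` and `c` differ at two of four sites (0.5), while `b` sits between them at 0.25 from each. Real analyses usually apply a substitution model that corrects for multiple changes at the same site.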


Advantages of Whole Genome Sequencing

  • Whole Genome Sequencing detects a wide range of genetic variations and mutations, including those missed by targeted techniques. SNVs and small indels are detected with high accuracy, providing reliable data for genetic analysis.
  • Whole Genome Sequencing detects variants in both protein-coding and non-coding regions, providing information about gene expression and regulatory mechanisms.
  • Whole Genome Sequencing provides large volumes of data quickly, which is useful in the assembly of novel genomes and genetic analyses.
  • Whole Genome Sequencing allows quick identification and tracking of pathogens during outbreaks.

Limitations of Whole Genome Sequencing

  • Whole Genome Sequencing often generates many variants of uncertain significance. The vast amount of data makes clinical interpretation difficult.
  • Certain regions of the genome, such as those with repetitive elements, may not be accurately analyzed, leading to potential gaps in the genomic data.
  • Despite significant cost reductions, Whole Genome Sequencing is still expensive, especially for large-scale studies or clinical use.
  • The large volume of data generated by Whole Genome Sequencing requires powerful computational resources for data processing and analysis.
  • The extensive genetic data also raises ethical issues regarding privacy, informed consent, and the potential for misuse of genetic information.

Applications of Whole Genome Sequencing

  • Whole Genome Sequencing plays an important role in understanding genetic variations including SNPs, indels, and CNVs. These genetic variants can influence the risk of common and rare diseases.
  • Whole Genome Sequencing is used in research to identify new genes or mutations associated with rare diseases or different types of cancer.
  • Whole Genome Sequencing is an important tool for clinical diagnosis. It can be used to detect infectious organisms or pathogens in clinical settings. WGS allows for personalized or tailored treatments of rare diseases.
  • Whole Genome Sequencing helps identify genes associated with desirable traits, such as disease resistance and drought tolerance, which can be used in the development of improved crop varieties.
  • Sequencing the whole genomes of various species also helps in conservation efforts to protect endangered species by revealing genetic diversity.

