Editor: This article is the second part of an article series. Check out Part One here.
Biological foundations of bioinformatics
Nucleic acids and proteins are important macromolecules that form the basis of life. Nucleic acids encompass both deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is the carrier of genetic information whilst RNA is involved in the biosynthesis of proteins. Finally, proteins control the cellular processes of life.
Evolution has devised an ingenious way of storing information about these macromolecules in sequences of nucleotides and amino acids. In bioinformatics, this modular biological information is stored, retrieved, manipulated and distributed computationally. This contributes to understanding the structure and function of proteins as well as their evolutionary relationships.
DNA and RNA
DNA occurs as a double-stranded helix whilst RNA is single-stranded. However, they are both made up of the same monomeric unit, nucleotides. The constituents of nucleotides are a pentose sugar, a phosphate group, and a heterocyclic base. Notably, the pentose sugar in DNA is deoxyribose whilst that in RNA is ribose, giving rise to their respective names.
The pentose sugar and the phosphate are critical in connecting one nucleotide to the next. More specifically, the phosphate is bonded to the 5’ hydroxyl (OH) group of the pentose sugar in each nucleotide; it also forms an ester bond to the 3’ OH group of the next sugar residue, thus forming the sugar-phosphate (phosphodiester) backbone that joins DNA or RNA polymers. An upshot of this is that the polynucleotide (i.e. a DNA or RNA polymer consisting of repeating nucleotide units) will have an unreacted 5’ phosphate at one end and a 3’ OH at the other, termed the 5’ and 3’ ends respectively. This asymmetry gives nucleic acids an intrinsic directionality. In fact, RNA is synthesized in the 5’ to 3’ direction during transcription, and mRNA is read in the 5’ to 3’ direction during translation. Therefore, nucleic acid sequences in bioinformatics are also written and read from 5’ to 3’.
So far, we’ve introduced the phosphodiester backbone, which does not seem to carry diverse biological information. Indeed, the main source of variation is the bases in the nucleotides. There are five different bases: cytosine (C), uracil (U), thymine (T), adenine (A), and guanine (G). Uracil occurs only in RNA and thymine only in DNA. Genetic information is stored in unique arrangements of bases.
Figure 1: Structure of DNA and RNA. DNA and RNA strands each have a 5’ end and a 3’ end. Whilst DNA forms a double helix, RNA usually occurs as a single strand. C bases pair with G bases, and A with T (or U in RNA). This stabilizes the double-stranded structure of DNA and allows DNA to serve as a template when transcribed into RNA.
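The directionality and base-pairing rules above can be illustrated with a short Python sketch. It assumes sequences are plain strings written 5’ to 3’ using only the letters A, C, G and T; the function names are hypothetical and chosen for illustration.

```python
# Base pairing as described above: A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite DNA strand, written 5' to 3'."""
    seq = seq.upper()
    unknown = set(seq) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected characters in DNA sequence: {sorted(unknown)}")
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(DNA_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

The reversal step is needed because the two strands of a double helix run in opposite directions, so the complement of a 5’ to 3’ sequence would otherwise be written 3’ to 5’.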
Proteins
Proteins are macromolecules that are made up of the 20 naturally occurring amino acids. The amino acid sequence of a protein is known as its primary structure. The primary structure of a protein allows it to fold into characteristic three-dimensional structures, which dictate its biological properties and functions.
The structure of natural amino acids is characterized by an amino group and a carboxyl group bonded to a central α-carbon atom. The α-carbon is also bonded to one of 20 side chains (or R groups), which determines the properties of the amino acid, such as whether it is hydrophobic, polar, acidic, or basic. Individual amino acids are connected via peptide bonds in a polypeptide chain, which can contain from three up to several hundred or more amino acids. Finally, each amino acid in the polypeptide chain is abbreviated by either a three-letter or one-letter code. For example, glycine can be represented with Gly (three-letter code) or G (one-letter code). These abbreviations are especially important in storing and presenting large amounts of protein sequence data.
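As a small illustration of the two abbreviation schemes, the following Python sketch converts between the standard three-letter and one-letter codes for the 20 amino acids. The helper names are hypothetical, and the input is assumed to contain only these 20 residues.

```python
# Standard three-letter to one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues):
    """Convert a list such as ["Met", "Gly"] into the string "MG"."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def to_three_letter(sequence):
    """Convert a string such as "MG" into the list ["Met", "Gly"]."""
    return [ONE_TO_THREE[c] for c in sequence.upper()]


print(to_one_letter(["Met", "Gly", "Ala"]))  # MGA
```

The one-letter form is the one normally used in sequence files, since it stores one character per residue.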
As mentioned previously, the primary structure is essential for determining the secondary and tertiary structures of a protein. The secondary structure of a protein describes the ordered folding patterns of a polypeptide chain into regular helices (α-helices) and sheet structures (built from β-strands), as well as irregular loops and turns. These three secondary structural elements each typically span around 10 amino acid residues and are the building blocks of the protein’s tertiary structure.
The tertiary structure describes the three-dimensional shape of a protein. Polypeptide chains longer than about 200 amino acids usually fold into several units termed domains. The tertiary structure of a protein, especially its domains, specifies its function. For example, immunoglobulin domains in antibodies can bind antigens, whilst different catalytic domains can catalyse enzymatic reactions. Thus, identifying domains in proteins is of great interest in bioinformatics.
Lastly, the quaternary structure is the assembly of several polypeptide subunits through noncovalent interactions. Mapping these interactions is also a branch of bioinformatics that is under active research since proteins rarely function in isolation but constantly bind to and dissociate from other proteins.
The storage of genetic information
You may already be familiar with the concept that genes in the DNA encode specific proteins. In fact, the sequence of nucleotides determines the sequence of amino acids in proteins, with every three consecutive bases, known as a codon, coding for one corresponding amino acid.
This information flow from DNA to RNA to protein is described as the central dogma of molecular biology. The genetic information encoded in the nucleotide sequence of the DNA is first transcribed into messenger RNA (mRNA). Proteins are subsequently built based on the mRNA sequence via the process of translation.
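The flow from DNA to mRNA to protein can be sketched in a few lines of Python. This is a simplified illustration, not a full model of gene expression: it assumes the input is the coding (sense) strand written 5’ to 3’, so the mRNA is obtained by replacing T with U; it uses the standard genetic code; and it starts translation at the first AUG. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with codons ordered by base U, C, A, G
# at each of the three positions; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """mRNA has the coding strand's sequence, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGGATGCTAA")
print(mrna)             # AUGGGAUGCUAA
print(translate(mrna))  # MGC
```

Reading the sequence in steps of three from the start codon is what defines the reading frame; shifting the start by one base would yield entirely different codons.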
Furthermore, the organization of gene regions differs between eukaryotes and prokaryotes. Eukaryotes are organisms (e.g. humans, other animals and plants) in which DNA is enclosed by a nuclear membrane, whilst prokaryotes (e.g. bacteria) lack a nucleus. In eukaryotes, gene regions that code for proteins (exons) are interrupted by non-coding introns. In contrast, genetic information is encoded on a continuous DNA stretch in prokaryotes. This means that gene expression in eukaryotes involves an extra step: the introns are removed from the initial RNA transcript and the exons are joined together in a process called splicing. You may have realised that different combinations of exons could be joined together depending on which segments are removed. This is known as alternative splicing, where different mRNA transcripts and, consequently, different proteins can be produced from the same gene. As a result, the number of genes in the human genome is low compared with the number of distinct proteins produced.
Figure 2: Overview of the Central Dogma of Molecular Biology. The coding region of the template strand of DNA is transcribed into mRNA. mRNA, in turn, is translated into a polypeptide (protein). Additional elements, such as the promoter and terminator and the start and stop codons, specify the start and end of transcription and translation, respectively. This image was taken from https://www.flickr.com/creativecommons/
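A toy Python sketch can make splicing and alternative splicing concrete. The transcript and exon coordinates below are invented purely for illustration; real exon boundaries come from genome annotation, and the function name is hypothetical.

```python
# Hypothetical pre-mRNA: three exons separated by two introns.
PRE_MRNA = (
    "AUGGCC"      # exon 1, positions 0-6
    "GUAAGUCAG"   # intron 1
    "GAUUUC"      # exon 2, positions 15-21
    "GUGAGUUAG"   # intron 2
    "UGGUAA"      # exon 3, positions 30-36
)
EXONS = [(0, 6), (15, 21), (30, 36)]  # (start, end) with end exclusive


def splice(pre_mrna, exons):
    """Join the given exon segments, discarding everything in between."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Constitutive splicing keeps all three exons.
print(splice(PRE_MRNA, EXONS))                  # AUGGCCGAUUUCUGGUAA
# Alternative splicing: skipping exon 2 gives a shorter transcript.
print(splice(PRE_MRNA, [EXONS[0], EXONS[2]]))   # AUGGCCUGGUAA
```

Both transcripts come from the same gene, yet they would be translated into different proteins, which is how one gene can give rise to several protein products.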
The growing collection of sequence data and its associated biological information, together with the demand for access to it, created the need for an organized data storage system: databases. Based on the types of biological data stored, biological databases are categorized mainly into primary, secondary and specialized databases.
But first, data needs to be stored in an accessible format for any of these databases. Biological databases use three types of database management structures: flat files, relational, and object-oriented. In particular, flat-file format (usually in the form of structured ASCII text files) is most widely used for biological databases for storing sequence data and its associated information. This is because ASCII text files allow the data to be manipulated without an expensive and complicated database system. ASCII text files also make data exchange between scientists relatively simple.
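To show why plain text files are easy to work with, here is a minimal Python sketch that reads sequence records from a flat file without any database system. It assumes the widely used FASTA layout (a header line starting with ">" followed by sequence lines), which is chosen here only as an example of a flat-file format; the record names and sequences are made up.

```python
import io


def parse_fasta(handle):
    """Read FASTA-style text into a dict of identifier -> sequence."""
    records = {}
    identifier = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after ">".
            identifier = (line[1:].split() or [""])[0]
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical flat-file content; io.StringIO stands in for an open file.
EXAMPLE = ">seq1 hypothetical example\nATGGCC\nTTTAAA\n>seq2\nGGGCCC\n"
print(parse_fasta(io.StringIO(EXAMPLE)))
# {'seq1': 'ATGGCCTTTAAA', 'seq2': 'GGGCCC'}
```

The same function works on a real file opened with `open(path)`, since it only needs an object that yields lines of text.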
One drawback of using ASCII text files is that keyword searching within a dataset is laborious and time-consuming. To overcome this, indexed flat-file databases have been developed. Because of their index register system, keyword-based searches are much faster.
Primary databases, as the name implies, are databases that contain original biological data. These databases contain the primary nucleotide or protein sequences from experiments such as DNA sequencing or protein structure determination. Depending on the information deposited, these databases can be divided into nucleotide sequence databases and protein databases.
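The idea behind an index can be sketched in Python as follows. This is an illustration under assumptions rather than a description of how any particular database implements it: the flat file is assumed to be FASTA-style text, the index is a plain dictionary mapping each record identifier to its position in the file, and the function names are hypothetical.

```python
import io


def build_index(handle):
    """Scan the file once, recording where each record starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                index[fields[0]] = offset
    return index


def fetch(handle, index, identifier):
    """Jump straight to one record instead of scanning the whole file."""
    handle.seek(index[identifier])
    handle.readline()  # skip the header line
    chunks = []
    for line in iter(handle.readline, ""):
        if line.startswith(">"):
            break
        chunks.append(line.strip())
    return "".join(chunks)


# Hypothetical flat-file content; io.StringIO stands in for an open file.
flat_file = io.StringIO(">seq1 first record\nATGGCC\nTTTAAA\n>seq2\nGGGCCC\n")
index = build_index(flat_file)
print(fetch(flat_file, index, "seq2"))  # GGGCCC
```

The file is scanned only once to build the index; each later lookup is a dictionary access followed by a single seek, which is why indexed searches scale much better than rereading the file.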
Nucleotide sequence databases
Nucleic acid sequence data produced by researchers are submitted to three major public sequence databases. These are GenBank, the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ), all of which are freely available on the internet. GenBank is the nucleotide sequence database hosted by the U.S. National Center for Biotechnology Information (NCBI).
Together these three databases make up the International Nucleotide Sequence Database Collaboration. They exchange new data daily, synchronizing every 24 hours. This ensures that the nucleotide sequence of a particular organism (whole or part) remains the same across the three databases. However, slight differences remain in the format used to represent the data.
Protein databases
Recall that the three-dimensional structure of proteins is essential to their function. Hence, databases containing both the amino acid sequence and the structure of proteins are required.
A centralized database, the Protein Data Bank (PDB), stores the three-dimensional structures of proteins (as well as some nucleic acids). Protein structures can be determined by a range of methods, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy and cryo-electron microscopy (cryo-EM). Information on the atomic coordinates, which are the relative positions of atoms in a macromolecule, is then deposited in this database.
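As an illustration of what atomic coordinates look like in practice, the Python sketch below reads one atom record in the legacy fixed-column PDB text format. The column positions follow that format as commonly documented, the example record and its coordinate values are made up, and the `Atom` class and function name are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Extract fields from a fixed-column ATOM or HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38, in angstroms
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


# Made-up example record, assembled field by field to show the columns.
EXAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" " " "   "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67"
)
print(parse_atom_record(EXAMPLE_LINE))
```

Fields are sliced by position rather than split on whitespace because adjacent columns in this format are not guaranteed to be separated by spaces.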
The NCBI Protein database is another common protein sequence database. It has data entries from SWISS-PROT, the PIR database, PDB, protein translations from the GenBank database, and several other sequence databases.
Secondary databases contain processed sequence information derived from the primary databases. The raw sequence information is annotated to reveal higher levels of biological information, such as manually curated domain structures and related variants. A common example of such databases is SWISS-PROT. This database provides detailed sequence annotation on the structure and function of protein sequences.
Specialized databases are databases that provide information, such as sequence data and its associated biological knowledge, on a particular organism. These databases are usually created from primary and sometimes secondary databases. Examples include FlyBase and WormBase, which contain genome information on the model organisms Drosophila (fruit flies) and C. elegans (roundworm), respectively.
In addition, there are also specialized databases that contain relationships between genes and the biological characteristics of organisms. These types of databases are sometimes referred to as genotype-phenotype databases. A good example is the Online Mendelian Inheritance in Man (OMIM) database of the NCBI.
In nature, biological information is stored in intricate arrangements of nucleic acid and protein sequences. Bioinformatics seeks to understand this using computational methods. For ease of analysis, the vast amount of sequencing and protein structure data needs to be organised in databases.
Primary databases are the fundamental databases where raw sequences, structures and information are stored. However, a single database is often insufficient, so curated secondary and specialized databases are also needed. During bioinformatics analysis, more than one database is usually consulted. Hence, knowledge of how each database works and what types of data it stores is crucial to effectively retrieve the information required.
Bayero University, Kano