Part V.1 – A Primer of Human Molecular Biology



A Primer of Human Molecular Biology

Robert Roberts, M.D.

A. The DNA Double Helix and its Transcription

This chapter will discuss a few of the general features of molecular biology that are fundamental to understanding regenerative medicine, gene therapy, and the development of drugs. For many of you this will simply refresh your memory, though not in the detail that some of you may like. There will be other specific lectures that will deal in detail with the biology of drugs and regenerative medicine.

So let’s start out with the DNA double helix (Figure 1). We live in a world in which molecular biology is used on a daily basis. Biotechnology now makes up about one-third of Wall Street. Just 25 years ago it was more like 5%. All of that has happened because we now live in an era in which a major focus of science is on the biology of human beings and its application to modern medicine. Biotechnology, whether it is developing drugs or other therapies, will depend on utilizing some aspect of the DNA helix. The DNA helix contains the fundamentals of life not just for humans, animals, plants, or fish, but all forms of life on this planet. All forms of life use either RNA or DNA for replication, development, and maintenance of life.

Figure 1. This shows the double helix bound together by hydrogen bonds. It is self-evident from the structure that it can separate into two strands as it does during replication. Replication is said to be semi-conservative since each strand serves as the template for a new complementary strand, so that each daughter double helix contains one original strand and one newly synthesized strand.

Many of you were taught that the central dogma of DNA and gene expression is as shown in Figure 2. The DNA in the nucleus forms messenger RNA (mRNA), referred to as transcription, which leaves the nucleus and goes out into the cytoplasm as the template for the synthesis of protein (translation). This dogma is still correct with only about 2% or less of the DNA being utilized to make mRNA and protein. We now know that essentially all of the DNA, probably over 90%, is transcribed. Most of it is not translated into protein but transcribed into non-coding RNAs (ncRNA). These ncRNAs, while not translated into protein, significantly influence mRNA which indirectly influences protein synthesis (Figure 3).

Figure 2. The expression of DNA into protein is referred to as the central dogma. This is the dogma that is followed for replication, development and maintenance of all forms of life on this planet.

Figure 3. Gene expression refers to the expression of a gene into protein. Data indicates less than 2% of DNA is used to code for proteins. Recent data indicates over 90% of DNA is transcribed. The remaining DNA sequences are primarily transcribed into non-coding RNAs referred to as micro, long and intermediate non-coding RNAs. The process from RNA to protein is referred to as translation, since it translates from the language of nucleic acids in mRNA to the language of amino acids which are used to form proteins.

The ncRNAs are classified into long, intermediate, and micro ncRNAs (Figure 4). The intermediate non-coding RNAs are RNAs of greater than 30 nucleotides but less than 200. The long non-coding RNAs are RNAs of more than 200 nucleotides. These ncRNAs act through modification of mRNA to alter expression of protein. For example, micro ncRNAs bind to the 3’ end of the mRNA and determine whether it will be translated into protein, how much, and for how long. They may also determine the half-life of the mRNA, which in turn determines the number of copies of the protein expressed. It is important to keep in mind that DNA stays in the nucleus. It plays no functional role with respect to the cell outside of the nucleus. The mRNA makes the protein and ultimately it is the protein that determines the phenotype (observable characteristics). The phenotype is determined by the genotype (one’s genes), but it is the protein that determines what you do and how you do it. These non-coding RNAs, whether they be the micro, long or intermediate forms, affect proteins through their actions on the mRNA. The mRNA is the template: the sequence of its codons decides which protein is synthesized by determining the sequence of the amino acids in the protein.

Figure 4. Classification of non-protein-coding RNAs.
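
The size classes described above can be written down as a small sketch. The function name classify_ncrna is hypothetical, and two boundaries are assumptions: the text gives no explicit upper limit for micro ncRNAs (here anything of 30 nucleotides or fewer is treated as micro), and an RNA of exactly 200 nucleotides is placed in the intermediate class.

```python
def classify_ncrna(length_nt):
    """Classify a non-coding RNA by its length in nucleotides.

    Thresholds follow the text: intermediate is more than 30 and less than
    200 nucleotides, long is more than 200. Assumptions (not in the text):
    30 or fewer counts as micro, and exactly 200 counts as intermediate.
    """
    if length_nt <= 0:
        raise ValueError("length must be a positive number of nucleotides")
    if length_nt > 200:
        return "long"
    if length_nt > 30:
        return "intermediate"
    return "micro"


for n in (22, 100, 1500):
    print(n, classify_ncrna(n))
```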

B. The Composition and Compaction of DNA into Nucleosomes

The DNA of a human cell consists of 46 chromosomes: 22 pairs of autosomes plus two sex chromosomes. Sex is determined by the X and Y chromosomes, with the female having two X chromosomes and the male having an X and a Y chromosome. Each chromosome is essentially a long linear DNA molecule, and all 46 are compactly tucked away in the nucleus. If the DNA of all 46 chromosomes were joined end to end it would measure roughly two meters in each cell. Thus, a compact storage system is very important. The way that the cell compacts and stores DNA is very precisely regulated such that when the body needs to synthesize proteins it does so by unravelling and making the appropriate DNA sequence available to the factors that regulate transcription. I emphasize the storage system because, ultimately, one has to have some understanding of how this DNA is stored and how it can unfold and be transcribed specifically as you need it. The DNA is wrapped around the core histone proteins (H2A, H2B, H3, and H4) together with the linker histones (H1 and H5). This combination of DNA and protein compacted together is referred to as chromatin. The histone proteins compact the DNA into units of approximately 150 base pairs referred to as nucleosomes. A chromosomal region which is unfolded and available for transcription into mRNA is referred to as euchromatin and that which is more compact and not available for transcription is referred to as heterochromatin. This process of chromosomal unfolding and transcription is regulated by a group of proteins referred to as transcription factors. These proteins bind to specific DNA sequences referred to as the DNA elements. This will be subsequently discussed in greater detail.

So a chromosome is simply a linear DNA molecule made of nucleic acids. Here is the key issue: it has a simple alphabet of only four letters. The English alphabet has 26 letters, but DNA is made of only four bases: Guanine (G), Adenine (A), Thymine (T), and Cytosine (C), as shown in Figure 5. You may not have to remember those bases but you do need to remember that DNA is simply a specific sequence of these four bases. The sequence of the bases determines the sequence of the amino acids in the protein. This is a linear relationship in which the order of the amino acids in the protein follows the order of the bases in the mRNA, three bases per amino acid. By convention, DNA is always read from left to right, from the 5’ end to the 3’ end. DNA consists of two strands joined together by hydrogen bonds. DNA replication is said to be semi-conservative since each strand serves as the template for a new partner strand, so every daughter molecule keeps one original strand and gains one new strand.

Figure 5. Illustration of the four bases which are the repeating units of DNA. Each base is bound to a sugar and a phosphate group, and the combination is referred to as a nucleotide. The sugar is ribose in RNA and deoxyribose in DNA.

The binding of the two DNA strands into a double helix is highly specific since A can only bind with T and C can only bind with G (Figure 6). This observation is crucial to the development of DNA drugs and DNA diagnostic tests, both of which utilize this specific property. If one wants to develop a drug to block the synthesis of a certain protein, the drug molecule would be designed to bind in a complementary fashion to the mRNA or DNA that encodes it. If the target DNA has the sequence A, C, T, one would develop a sequence of T, G, A so that it would bind to the A, C, T sequence (in mRNA, uracil (U) takes the place of T and pairs with A). This is referred to as complementary base pairing, a crucial and unique feature of DNA that is used routinely in the development of drugs and diagnostic tests.

Figure 6. Shown here is complementary base pairing. A can only bind with T and C with G. This specificity is an essential feature of replication, development and maintenance of all forms of life. It is also this specificity that is utilized in highly specific DNA diagnostic assays and therapies. It is an essential feature of drugs developed for diseases such as cancer.
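
The A, C, T to T, G, A example above can be reproduced in a few lines. This is only a sketch; the names DNA_PAIRS, complement and reverse_complement are illustrative and not taken from any particular library. Because the two strands of the helix run in opposite directions, a binding sequence written in the usual 5’ to 3’ direction is the complement read backwards, which is what the second function returns.

```python
# Pairing rules for DNA: A with T, C with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Base-by-base partner of a DNA strand, in the same left-to-right order."""
    return "".join(DNA_PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Partner strand written 5' to 3' (the strands are antiparallel)."""
    return complement(strand)[::-1]


print(complement("ACT"))          # TGA
print(reverse_complement("ACT"))  # AGT
```

Taking the complement twice returns the original sequence, which is the property that lets either strand serve as the template for the other.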

C. The Human Genome

The human genome contains 3.2 billion bases (Figure 7). Each base has attached to it a sugar (in DNA it is deoxyribose, in RNA it is ribose) and a phosphate group, and the combination is called a nucleotide, the building block of nucleic acids. Thus, the words base and nucleotide are often used interchangeably even though the latter has the additional feature of being bound to a sugar and phosphate. The number of genes is estimated to be about 20,000. I suspect as time goes by we will, in fact, recognize that there are only about 18,000 or 19,000 genes. However, it is important to keep in mind that each gene has several forms referred to as alleles. The predominant allele is the form that occurs more commonly, referred to as the major allele, and the least common is referred to as a minor allele. The acronym MAF refers to the minor allele frequency, that is, how often the less common allele occurs. Thus, the 20,000 genes have several forms (alleles), each of which may have a slightly different function from the major allele. Counting these functionally distinct alleles, it is estimated that there are over 100,000 gene forms.

Figure 7. The human genome has 3.2 billion bases of which less than 2% are genes translated into protein.

D. Phenotype and the Role of Protein

We come back to the fact that the DNA is transcribed into mRNA, and it is the mRNA that is translated into protein. Proteins are the molecules responsible for performing the cell’s functions. Whether it is an enzyme catalyzing a reaction or the molecule that stores your fat or glycogen, the work is always performed by a protein. Proteins determine your phenotype. The two other major substances crucial for the body’s structure and function are carbohydrates and fats. Carbohydrates serve as fuel for energy and are synthesized and used as needed from moment to moment. There are no long-term stores of glucose. Glucose is stored as glycogen and the amount of storage is minimal and varies from organ to organ. In the heart, the glycogen stores only last for 10-15 minutes and in skeletal muscle they may last for 20-30 minutes. The brain has no storage of glucose and thus is totally dependent on the blood glucose, although at times of starvation the brain will switch over to utilizing ketone bodies, which the liver makes from fatty acids, as its energy source. In contrast to carbohydrates, the body can store fat in large amounts, as triglycerides built from fatty acids. The heart normally utilizes fatty acids as its primary source of fuel. These fatty acids are released from the fat stores and oxidized to provide a source of energy. There is only one form of energy for all life forms on this planet and that is high energy phosphate. High energy phosphate is stored as creatine phosphate (CP) and is spent as ATP, which is broken down to ADP or AMP as its energy is used. Thus, in contrast to the many forms of currencies that exist in the world, all living things on this planet have only one currency, the adenine nucleotides, with ATP as the fully charged form.

It should be evident by analogy to computers that DNA represents the software and proteins the hardware. The DNA and the mRNA are made in the nucleus. Only the mRNA leaves the nucleus and goes to the cytoplasm of the cell to be translated into protein. Now, once proteins are formed there is another process called post-translational processing whereby moieties (peptides or metals) are added to these proteins to signal where the protein should go (e.g., membrane, Golgi apparatus) and its specific function. About 60 different forms of moieties are known to be added to proteins. Most common would be sulfhydryl and methyl groups; others would be acetyl groups or various ions such as iron or magnesium. All of these modifications are necessary for the proteins to perform their functions at the specific locations for which they are dedicated. For example, the iron in hemoglobin is essential to its oxygen-carrying capacity.

E. The Universal Genetic Code

We have discussed the transfer from nucleic acids that make up DNA and RNA to that of amino acids which make up protein. The code that makes this translation possible is referred to as the genetic code. It is necessary to our understanding of gene and protein function as well as genetic defects. It took scientists more than ten years to break the code. There are only four bases or nucleotides so the number of combinations will be limited by the number of nucleotides used to form the code. The code uses three nucleotides, referred to as a codon, to code for each amino acid. Since there are four bases the number of combinations would be 4 × 4 × 4 (4 to the third power), giving you 64 codons. There are only 20 amino acids which means several amino acids have more than one codon. Indicated in Figure 8 are examples of the codons. AUG codes for methionine. AUG is also the codon for the start site for translation of all mRNAs. All proteins start out with the amino acid methionine as their first amino acid and then in a linear fashion the mRNA is decoded from left to right (5’ to 3’), one codon per amino acid, until it comes to a stop codon, which is either UAA, UAG or UGA. The genetic code is universal for all forms of life on this planet. If a codon is changed, such as due to a mutation, you may end up with a different amino acid. It is also possible that a mutation could change a codon for an amino acid to a stop codon, which would truncate the protein and deprive it of its normal function.

Figure 8. Shown at the top of the diagram is the central dogma of genetics, namely DNA coding for protein. To the right are shown examples of codons that code for their respective amino acids.
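
The arithmetic and the reading of codons described in this section can be illustrated with a short sketch. The codon table below is deliberately partial (only the start codon, two other amino acid codons, and the three stop codons), and the function name translate and the two toy mRNA sequences are assumptions made for illustration. The second sequence differs from the first by a single base that turns the tyrosine codon UAU into the stop codon UAA, showing how a mutation can truncate a protein.

```python
from itertools import product

BASES = "UCAG"

# Every possible three-letter codon: 4 * 4 * 4 = 64.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Partial codon table for illustration only; the full code has 64 entries.
CODON_TABLE = {
    "AUG": "Met",  # also the start codon
    "UAU": "Tyr",
    "GGC": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna):
    """Read codons 5' to 3' from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = not in this partial table
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


normal = translate("AUGUAUGGCUAA")  # Met, Tyr, Gly, then stop
mutant = translate("AUGUAAGGCUAA")  # UAU mutated to UAA: premature stop

print(len(ALL_CODONS))  # 64
print(normal)           # ['Met', 'Tyr', 'Gly']
print(mutant)           # ['Met']
```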

F. Formation of mRNA

What you see here (Figure 9) is a gene being transcribed into mRNA. The yellow boxes refer to that part of the DNA named introns since they are spliced out and remain in the nucleus. The red boxes, referred to as exons, are named this because they exit the nucleus. The red boxes are joined together to give you the mRNA which codes for the protein. The left end of the gene is referred to as the 5’ end and the right end as the 3’ end. Transcription is always initiated from the 5’ end. Once the mRNA is completed, it is capped with a methylated guanosine on the 5’ end and a string of adenosines, the poly(A) tail, on the 3’ end. These caps are added to protect the mRNA in the cytoplasm, since all cells are loaded with enzymes called nucleases, which digest mRNA into small pieces. Protein synthesis is very fast, at a rate of 3 to 5 amino acids per second. Messenger RNA will last for several hours and then is degraded.

Figure 9. The DNA sequences that form the gene comprise introns (remain in the nucleus) and exons (exit the nucleus). The yellow boxes (introns) are those sequences that will be spliced out and remain in the nucleus. The red boxes (exons) are those sequences which will be spliced together to form the mRNA, which leaves the nucleus to provide the template for protein synthesis. The process of forming the mRNA is referred to as transcription. The mRNA is capped with a methyl group on the 5’ end and with adenosine groups on the 3’ end. This is to protect the mRNA from digestion by nucleases that are in great abundance in the cytoplasm.
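
The splicing step shown in Figure 9 amounts to keeping the exons in order and discarding what lies between them. The following is a toy sketch only: the function name splice, the sequence, and the exon coordinates are invented for illustration, and the intron is written in lower case purely so it is easy to see.

```python
def splice(pre_mrna, exons):
    """Join exon segments in order; everything between them (introns) is dropped.

    Each exon is a (start, end) pair using 0-based positions, end not included.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, an intron (lower case), exon 2.
pre_mrna = "AUGUAU" + "guaaguuuuag" + "GGCUAA"
exons = [(0, 6), (17, 23)]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)  # AUGUAUGGCUAA
```

In the cell the 5’ cap and the 3’ adenosines are then added to this spliced message; the sketch leaves them out.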

G. The Structure of a Gene

The gene is to biology what the atom is to matter. It is the unit of DNA sequence which encodes the structure and function of all forms of life including humans. A gene on average consists of about 20,000 nucleotides but could have as many as 1 million. A gene has three components: the 5’ regulatory sequences, a middle component called the protein-coding sequences, and a 3’ stability sequence (Figure 10).

Figure 10. Genes have three components: the 5’ regulatory region, the protein-coding sequences and the stability sequences.

The protein coding sequences are formed from the exons that are joined together, leaving the introns to remain in the nucleus (Figure 9). The mRNA leaves the nucleus to code for the sequence of the amino acids utilized to provide the desired protein. On each end of the coding region are regulatory sequences. These sequences are highly specific and referred to as DNA elements to which specific proteins bind. These proteins are referred to as transcription factors, and they determine the rate of transcription. Similar sequences are present on the 3’ end, to which proteins bind to give stability rather than to regulate transcription. Remember everything in DNA starts at the 5’ end and moves to the 3’ end. There are three types of regulatory elements, namely promoters, enhancers and silencers, and the transcription factors are the proteins that bind to the specific DNA sequences of these elements on the 5’ end of the gene to regulate transcription.

Figure 11. The rate of transcription is determined by the transcription factors shown here on the 5’ end of the gene.

Transcription factors bound at the promoter promote the transcription of the mRNA, and this is further increased by those bound at enhancers. Those bound at silencers repress transcription, resulting in decreased mRNA and protein formation. All three of these inputs are integrated to determine whether the gene will make mRNA and its subsequent protein. The expression of the gene refers to the processes of transcription and translation to form protein. Figure 11 shows the silencer, enhancer, and promoter to the left of the transcription initiation site, which is marked by the TATA box. On the 3’ end are the attached adenines referred to as the poly(A) tail. The mRNA is capped and transported to the cytoplasm as the template for a protein. It might be worth noting that just as the gene has a 5’ and a 3’ end, so do all proteins have a distinguishing initial and terminal end. All proteins begin with an amino group (NH2) and end with a carboxyl group (COOH).

H. Cardiac Growth and the Application of DNA Therapy

The most potent cardiac growth factor is angiotensin II. Angiotensin II not only stimulates growth of cardiac muscle, but also increases expression of transforming growth factor beta (TGFβ). TGFβ is the most potent stimulus of collagen and fibrosis. Cardiac hypertrophy is usually a compensatory feature due to increased expression of angiotensin II and TGFβ. Unfortunately, this stimulates not only cardiac muscle but also fibroblasts, which secrete collagen and other extracellular components. Replacement of cardiac muscle with fibroblasts (scar tissue), which is in essence fibrosis, will ultimately lead to permanent cardiac failure. The most commonly used therapies in heart failure are inhibitors of angiotensin II formation (ACE inhibitors) and blockers of the angiotensin II receptor. Several drugs have been developed to inhibit the effect of angiotensin II and TGFβ, such as the receptor blockers losartan, olmesartan and telmisartan and the ACE inhibitors captopril, enalapril, ramipril and lisinopril. Another class of drugs inhibits aldosterone. Aldosterone is also a potent direct stimulant of fibroblasts and its secondary effect is increased expression of TGFβ. Two major drugs that are used in heart failure to inhibit aldosterone are spironolactone and eplerenone. All of these drugs decrease cardiac hypertrophy and, perhaps more importantly, the development of fibrosis.

I. Cancer and the Effect on DNA Therapy

Cancer, by definition, is an uncontrolled form of growth due to a defect or multiple defects in one’s DNA. In fact, the drugs utilized for cancer mainly target either transcription or translation. The aim is to prevent the abnormal growth and to do so one must manipulate the DNA or the RNA, the regulators of growth. Many drugs act to repress transcription or translation. Specificity of the drug will be developed such that it only binds to the specific sequences of the elements in the 5’ regulatory region of that gene. This is further summarized in Figures 12 and 13.

Figure 12. This shows the gene with its introns and exons together with the transcription factors and capping molecules in preparation for the splicing out of the introns and preparation for its exit out of the nucleus.

Figure 13. The whole process of gene expression into protein is illustrated by both transcription and translation.

J. Unique Features of DNA Utilized in Diagnostic DNA Assays and Therapy

It is appropriate to emphasize three unique features of DNA which are utilized in all DNA diagnostic tests and in the development of drugs acting on DNA. These three features are: the property of the DNA strands to separate at high temperatures and reanneal upon restoration to a lower temperature; the specificity of complementary base pairing whereby A binds only to T and C to G; and lastly, the phosphate groups, which bestow a negative charge on the DNA molecule, enabling DNA molecules of different sizes to be separated by electrophoresis.

The double stranded DNA survives in your body at a temperature of 37°C. Heating DNA to 100°C induces denaturation, which breaks the hydrogen bonds, and the DNA separates into two strands. However, when you decrease the temperature back to 37°C the two strands will reanneal identical to their previous form due to the specificity of complementary base pairing. When you cook an egg and denature the proteins, reducing the temperature does not restore it to its liquid form. Thus, reannealing is a unique feature of DNA and one that is exploited in DNA diagnostic assays and the development of DNA drugs. For example, let’s suppose you want to test for the gene that makes β-hemoglobin. One would obtain either a blood or saliva sample and extract the DNA. One can synthesize a single strand of nucleotides, the probe, whose sequence is complementary to the DNA sequence that forms a portion of the β-hemoglobin gene. Attached to your probe is a protein which changes color once the probe binds to the β-hemoglobin sequence. You now add your probe to the DNA sample, which is heated to 100°C. The DNA of the β-hemoglobin gene, if present in the sample, will denature and separate into its two strands. The temperature is reduced to 37°C and the single strands of DNA will reanneal, and some, by chance, will bind to the complementary sequence on your probe. The intensity of the color change will indicate the concentration of the β-hemoglobin sequence in the sample. In an analogous manner we exploit the negative charge provided by the phosphate groups. Based on this negative charge, it is possible to separate different sizes of DNA by electrophoresis. Insertion of a DNA sample into a gel attached to a positive and a negative electrode results in the negatively charged DNA migrating to the positive electrode. Gels can be developed with different pore sizes, enabling detection of DNA molecules of a specific size. The gel can be stained, showing smaller molecules migrating ahead of the larger molecules as shown in Figure 14.

Figure 14. This is an electrophoretogram showing separation of DNA into different sizes. The smaller particles move faster and farther with the intermediate and larger ones delayed in their migration.

Such analysis using DNA is referred to as a Southern blot. Utilizing RNA is referred to as a Northern blot and utilization of protein is a Western blot. Often, you will see on multiple choice exams “What is an Eastern?” but there is no Eastern. We just have DNA, RNA, and protein.
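
Two of the ideas in this section, a probe finding its complementary sequence and fragments separating by size on a gel, can be sketched as follows. All names (probe_binding_sites, gel_order) and the toy sequences are hypothetical, and the sketch assumes a perfect match is required for binding; real hybridization tolerates some mismatch depending on temperature and salt conditions.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Complementary strand written 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(seq))


def probe_binding_sites(strand, probe):
    """0-based positions on a denatured single strand where the probe can anneal.

    Assumes perfect complementarity is required (a simplification).
    """
    target = reverse_complement(probe)
    sites = []
    pos = strand.find(target)
    while pos != -1:
        sites.append(pos)
        pos = strand.find(target, pos + 1)
    return sites


def gel_order(fragment_lengths):
    """Fragment sizes listed from farthest migrated (smallest) to least (largest)."""
    return sorted(fragment_lengths)


sample_strand = "GGATCCAAGCTTGG"         # toy single strand after heating
probe = reverse_complement("ATCCA")      # probe designed against ATCCA

print(probe)                                     # TGGAT
print(probe_binding_sites(sample_strand, probe)) # [2]
print(gel_order([1200, 150, 600]))               # [150, 600, 1200]
```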

K. Genotype vs Phenotype

To recap, it is important to recognize that DNA is the software that stores the information required to make a cell. DNA has little influence per se on cellular processes. This is the property of the proteins. Proteins determine the shape and structure of the cell and are responsible for molecular recognition and catalysis and all other features. So when we refer to the phenotype of someone we are referring to the observable characteristics, almost all of which are due to proteins. Yes, it is true when you have that fat belly at middle age it is determined by fat; however, the fat is transported and stored by proteins. It is worth keeping in mind that the average cell is only about 10 to 20 micrometers in diameter. You cannot see it with the naked eye. You have to magnify about 400-fold. The average cell contains at least 750 kinds of small molecules and about 2,000 kinds of large molecules. All of this is coordinated and brought together by an overarching mechanism, namely your DNA.

It is perhaps appropriate and informative to ask the question “How did we discover the gene sequences for many of the proteins that we know today?” We indicated in the beginning that the genetic dogma is going from DNA to mRNA to protein. Deducing the DNA sequence that corresponds to an mRNA is not self-evident, since the mRNA was formed by joining together the exons and splicing out the introns. This was made feasible in 1970 with the discovery of the enzyme reverse transcriptase. This meant we could go back from mRNA to DNA. The DNA copied from mRNA by reverse transcriptase is referred to as cDNA. Surprisingly, we discovered that the cDNA, when injected into cells or animals, induced the synthesis of the same protein observed in humans. This had great implications since it would now be possible to do gene therapy utilizing only the cDNA. It also indicated that the introns, which had been excised from the protein coding sequences of the gene, were not needed for translation of the mRNA into protein (Figure 15).

Figure 15. Reverse transcriptase was discovered by two independent investigators, both of whom got Nobel prizes. The ability to go backwards in the central dogma really opened up the field for gene therapy.

There are a few terms worthy of repeat definitions. A genome refers to all the genes responsible for an organism. Proteome refers to all the proteins derived from the genome. The word omics has now been accepted into the English language and is broadly utilized both in medicine and in industry. Nearly every area of study, depending on your interests, is now referred to as an omic. For example, the study of metabolism is designated metabolomics.

