Open Reading Frames (ORF)

2. Open Reading Frames (ORF)

2.1 Definition

A reading frame is one of the ways of dividing a DNA or RNA sequence into consecutive, non-overlapping three-nucleotide codons (triplets). A codon is a set of three adjacent nucleotides that either encodes a particular amino acid or tells the ribosome to stop translation (figure 3). Table 1.1 shows the 61 triplets that code for the 20 standard amino acids, together with the 3 stop codons (labelled Stop).
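
As a small illustration of how the choice of reading frame changes the codons, the following Matlab sketch splits a short sequence into triplets for each of the three forward frames. The sequence and the variable names are made up for this example, and only base Matlab functions are used.

```matlab
% Split a short RNA sequence into codons for each of the three forward frames.
seq = 'AUGCCCCUUAAAG';                   % illustrative sequence (13 bases)
frames = cell(1, 3);
for f = 1:3
    sub = seq(f:end);                    % frame f starts at base f
    nCodons = floor(length(sub) / 3);    % an incomplete last codon is ignored
    % reshape fills column by column, so each column of the 3-by-n matrix
    % is one codon; after transposing, each row is one codon
    frames{f} = cellstr(reshape(sub(1:3*nCodons), 3, nCodons)')';
end
disp(strjoin(frames{1}, ' '))            % AUG CCC CUU AAA
disp(strjoin(frames{2}, ' '))            % UGC CCC UUA AAG
disp(strjoin(frames{3}, ' '))            % GCC CCU UAA
```

The same bases give three entirely different codon lists, which is why the frame has to be fixed (by the start codon) before a sequence can be translated.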

A nucleotide sequence that begins with a start codon (initiation codon, typically AUG) and continues in the same reading frame up to a stop codon (termination codon: UAA, UAG or UGA) is called an open reading frame (ORF).

An ORF is hence a portion of an organism's genome which contains a sequence of bases that could potentially encode an individual protein. Once one knows the ORF for a gene or its mRNA, one can translate a nucleotide sequence into its corresponding amino acid sequence, based on the processes described in figure 3.
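
The definition can be turned into a short search: find the first AUG, then step through the sequence three bases at a time until a stop codon appears. The sketch below does this for a made-up sequence; it is a minimal illustration under that simple definition, not a full gene finder.

```matlab
% Locate the first ORF in an mRNA: first AUG, then in-frame codons up to a stop.
mrna  = 'GCAUGGCUAAAUGAUC';              % illustrative sequence (assumption)
stops = {'UAA', 'UAG', 'UGA'};

startPos = strfind(mrna, 'AUG');
startPos = startPos(1);                  % position of the first start codon
stopPos  = [];
for p = startPos:3:length(mrna)-2        % step through the frame codon by codon
    if any(strcmp(mrna(p:p+2), stops))
        stopPos = p;                     % first base of the stop codon
        break
    end
end
orfSeq = mrna(startPos:stopPos+2);       % ORF including start and stop codon
fprintf('ORF from base %d to %d: %s\n', startPos, stopPos+2, orfSeq);
```

In this sequence the triplet UAA at bases 8 to 10 is not in frame with the AUG at base 3, so it is skipped; the ORF ends at the in-frame UGA (bases 12 to 14).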


	Figure 3: Cells whose genetic information is stored on DNA use a two-step process to read each gene (transcription) and produce the string of amino acids that makes up a protein (translation). In RNA retroviruses the RNA is first transcribed into double-stranded DNA by reverse transcription before transcription and translation can take place. The positive-sense single-stranded RNA (ss+) genome of SARS-CoV can be translated directly, like an mRNA; it is also copied into a complementary −strand, which serves as template for transcription and replication (see also figure 2).

Exercise 1: Determine the amino acid sequence of the gene product (protein) that follows from the given segment of double-stranded DNA. The two strands are antiparallel and their nucleotide sequences are complementary.

5’GGGATCGATGCCCCTTAAAGAGTTTACATATTGCTGGAGGCGTTAACCCCGGA3’ strand 1
3’CCCTAGCTACGGGGAATTTCTCAAATGTATAACGACCTCCGCAATTGGGGCCT5’ strand 2

Hint: The DNA template strand (strand 2 above) is transcribed by RNA polymerase into a complementary mRNA strand. In RNA synthesis the new nucleotide chain grows from its 5’ end to its 3’ end while the template is read from 3’ to 5’, and A in the template is paired with U (instead of T as in DNA replication). The newly synthesized strand is antiparallel to the DNA template strand. The triplets on the mRNA strand are then translated into an amino acid sequence, based on the universal genetic code (table 1.1).
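
The base-pairing rule in the hint can be sketched in Matlab as follows. The example uses a short fragment of a template strand written 3’ to 5’, as in the exercise; the variable names are chosen for this illustration only.

```matlab
% Transcribe a DNA template strand (written 3' -> 5') into mRNA (5' -> 3').
template = 'TACGGGGAA';                  % short template fragment, 3' -> 5'
% Pairing rule for RNA synthesis: A->U, T->A, C->G, G->C
pairs = containers.Map({'A','T','C','G'}, {'U','A','G','C'});
mrna = blanks(length(template));
for k = 1:length(template)
    mrna(k) = pairs(template(k));        % complementary RNA base
end
disp(mrna)                               % AUGCCCCUU
```

Because the template is already written 3’ to 5’, complementing it base by base yields the mRNA in the usual 5’ to 3’ direction. For a template written 5’ to 3’, the result would additionally have to be reversed, e.g. with fliplr.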

Solution to Exercise 1:
1. Transcribe the DNA template strand (strand 2) into the corresponding mRNA. The mRNA has the same sequence as strand 1, with U in place of T.

5’GGGAUCGAUGCCCCUUAAAGAGUUUACAUAUUGCUGGAGGCGUUAACCCCGGA3’

2. Identify the start codon (AUG, bases 8 to 10), i.e. the beginning of the open reading frame.

5’GGGAUCG AUG CCCCUUAAAGAGUUUACAUAUUGCUGGAGGCGUUAACCCCGGA3’

3. Break the sequence into three-base codons until you reach a stop codon.

5’GGGAUCG AUG CCC CUU AAA GAG UUU ACA UAU UGC UGG AGG CGU UAA CCCCGGA3’

4. Translate each codon into its corresponding amino acid.

AUG	CCC	CUU	AAA	GAG	UUU	ACA	UAU	UGC	UGG	AGG	CGU	UAA
Met	Pro	Leu	Lys	Glu	Phe	Thr	Tyr	Cys	Trp	Arg	Arg	stop
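
Step 4 can also be checked with a few lines of Matlab. The sketch below stores the standard genetic code as a 64-letter string in one-letter amino acid code (a common compact encoding; the variable names are assumptions of this example) and translates the ORF codon by codon. If the Bioinformatics Toolbox is installed, its function nt2aa should perform the same translation.

```matlab
% Standard genetic code as a 64-letter string: codons are ordered with the
% bases U, C, A, G at the 1st, 2nd and 3rd position ('*' marks a stop codon).
bases = 'UCAG';
aa    = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

orfSeq  = 'AUGCCCCUUAAAGAGUUUACAUAUUGCUGGAGGCGUUAA';   % ORF of exercise 1
protein = '';
for p = 1:3:length(orfSeq)-2
    codon = orfSeq(p:p+2);
    [~, i] = ismember(codon, bases);     % index of each base within 'UCAG'
    letter = aa(16*(i(1)-1) + 4*(i(2)-1) + i(3));
    if letter == '*'
        break                            % stop codon: translation ends
    end
    protein(end+1) = letter;
end
disp(protein)                            % MPLKEFTYCWRR
```

The result is given in the one-letter code; table 1.2 lists the corresponding three-letter abbreviations.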

Note: ORFs are usually encountered when sifting through pieces of DNA while trying to locate a gene. ORFs are identified in the same way in organisms whose slightly altered genetic codes use different start codons.

Table 1.1 Genetic mRNA code: 61 triplets code for the 20 standard amino acids and 3 triplets are stop signals. Codons are read in the 5' to 3' direction.
1st base	2nd base: U	2nd base: C	2nd base: A	2nd base: G	3rd base
U	UUU Phenylalanine (Phe/F)	UCU Serine (Ser/S)	UAU Tyrosine (Tyr/Y)	UGU Cysteine (Cys/C)	U
U	UUC Phenylalanine (Phe/F)	UCC Serine (Ser/S)	UAC Tyrosine (Tyr/Y)	UGC Cysteine (Cys/C)	C
U	UUA Leucine (Leu/L)	UCA Serine (Ser/S)	UAA Stop (Ochre)	UGA Stop (Opal)	A
U	UUG Leucine (Leu/L)	UCG Serine (Ser/S)	UAG Stop (Amber)	UGG Tryptophan (Trp/W)	G
C	CUU Leucine (Leu/L)	CCU Proline (Pro/P)	CAU Histidine (His/H)	CGU Arginine (Arg/R)	U
C	CUC Leucine (Leu/L)	CCC Proline (Pro/P)	CAC Histidine (His/H)	CGC Arginine (Arg/R)	C
C	CUA Leucine (Leu/L)	CCA Proline (Pro/P)	CAA Glutamine (Gln/Q)	CGA Arginine (Arg/R)	A
C	CUG Leucine (Leu/L)	CCG Proline (Pro/P)	CAG Glutamine (Gln/Q)	CGG Arginine (Arg/R)	G
A	AUU Isoleucine (Ile/I)	ACU Threonine (Thr/T)	AAU Asparagine (Asn/N)	AGU Serine (Ser/S)	U
A	AUC Isoleucine (Ile/I)	ACC Threonine (Thr/T)	AAC Asparagine (Asn/N)	AGC Serine (Ser/S)	C
A	AUA Isoleucine (Ile/I)	ACA Threonine (Thr/T)	AAA Lysine (Lys/K)	AGA Arginine (Arg/R)	A
A	AUG Methionine (Met/M), Start¹	ACG Threonine (Thr/T)	AAG Lysine (Lys/K)	AGG Arginine (Arg/R)	G
G	GUU Valine (Val/V)	GCU Alanine (Ala/A)	GAU Aspartic acid (Asp/D)	GGU Glycine (Gly/G)	U
G	GUC Valine (Val/V)	GCC Alanine (Ala/A)	GAC Aspartic acid (Asp/D)	GGC Glycine (Gly/G)	C
G	GUA Valine (Val/V)	GCA Alanine (Ala/A)	GAA Glutamic acid (Glu/E)	GGA Glycine (Gly/G)	A
G	GUG Valine (Val/V)	GCG Alanine (Ala/A)	GAG Glutamic acid (Glu/E)	GGG Glycine (Gly/G)	G

¹AUG encodes N-formylmethionine at the beginning of mRNA in Bacteria and methionine in Archaea and Eukarya

Amino acid sequences can also be written in the one-letter code, which is included in table 1.1 and listed again in alphabetical order in table 1.2. For example, the nucleotide sequence of exercise 1 (omitting the stop triplet) corresponds to the following amino acids:

AUG	CCC	CUU	AAA	GAG	UUU	ACA	UAU	UGC	UGG	AGG	CGU
M	P	L	K	E	F	T	Y	C	W	R	R

Table 1.2 One- and three-letter abbreviations of amino acids
Amino acid	1-letter	3-letter
Alanine	A	Ala
Cysteine	C	Cys
Aspartic acid	D	Asp
Glutamic acid	E	Glu
Phenylalanine	F	Phe
Glycine	G	Gly
Histidine	H	His
Isoleucine	I	Ile
Lysine	K	Lys
Leucine	L	Leu
Methionine	M	Met
Asparagine	N	Asn
Proline	P	Pro
Glutamine	Q	Gln
Arginine	R	Arg
Serine	S	Ser
Threonine	T	Thr
Selenocysteine	U	Sec
Valine	V	Val
Tryptophan	W	Trp
Tyrosine	Y	Tyr
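
Table 1.2 can be used as a lookup table in Matlab. The following sketch converts a peptide given in three-letter abbreviations into the one-letter code; the map name to1 and the example peptide are assumptions of this illustration.

```matlab
% Three-letter and one-letter abbreviations in the order of table 1.2
three = {'Ala','Cys','Asp','Glu','Phe','Gly','His','Ile','Lys','Leu','Met', ...
         'Asn','Pro','Gln','Arg','Ser','Thr','Sec','Val','Trp','Tyr'};
one   = 'ACDEFGHIKLMNPQRSTUVWY';
to1   = containers.Map(three, num2cell(one));   % 'Ala' -> 'A', 'Cys' -> 'C', ...

peptide = {'Met','Pro','Leu','Lys','Glu','Phe','Thr','Tyr','Cys','Trp','Arg','Arg'};
letters = cellfun(@(x) to1(x), peptide);        % one character per residue
disp(letters)                                   % MPLKEFTYCWRR
```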

2.2 Determining ORFs on the SARS genome

The complete SARS-CoV genome was published in April 2003 by a Canadian group of researchers. It is a single-stranded RNA of 29571 bases. Details can be found at
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=29826276&dopt=GenBank
The full nucleotide sequence for this exercise is provided in the data file sars_can.

Exercise 2: With the following steps we will define and characterize ORFs on the SARS genome.
Load the complete SARS-CoV sequence into the Matlab workspace:

load sars_can

The detection of genes in viruses is complicated by overlapping ORFs and other problems. The simplest approach to finding ORFs in the SARS-CoV sequence is to select the statistically significant ORFs among all those found by the function seqorfs. This function determines the ORFs in a sequence and in its reverse complement. A forward mRNA strand has 3 possible reading frames; taking the reverse complement into account as well gives 6 different reading frames. To decide which ORFs are significant, the code below also searches a randomly permuted copy of the genome, in which any ORF arises purely by chance, and takes the 95th percentile of those ORF lengths as an empirical threshold. With this procedure, most of the genes can be identified. Make sure that the m-file “seqorfs” is in your working directory.

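% ORFs in all six frames of the real genome (no length filter)
[orf, n] = seqorfs(sars_can, 'MINIMUMLENGTH', 1);
% ORFs in a randomly permuted genome: ORF lengths expected by chance
[orfr, nr] = seqorfs(sars_can(randperm(length(sars_can))), 'MINIMUMLENGTH', 1);
% Pool the ORF lengths of the six frames of the permuted sequence
ORFLengthr = [];
for i = 1:6
    ORFLengthr = [ORFLengthr; orfr(i).Length'];
end
% 95th percentile of the chance ORF lengths (prctile: Statistics Toolbox)
empirical_threshold = prctile(ORFLengthr, 95)
% Repeat the search, keeping only ORFs above the threshold
[orf, n] = seqorfs(sars_can, 'MINIMUMLENGTH', empirical_threshold/3);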
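
The idea behind the empirical threshold can be reproduced without seqorfs. The sketch below is a simplified stand-in, not the implementation of seqorfs: it uses a random toy sequence instead of sars_can, scans only the three forward frames, defines an ORF as the stretch from a start codon to the next in-frame stop codon, and estimates the 95th percentile by sorting instead of calling prctile. All names are assumptions of this example.

```matlab
rng(1);                                          % repeatable random numbers
alphabet = 'ACGT';
seq      = alphabet(randi(4, 1, 3000));          % toy stand-in for sars_can
shuffled = seq(randperm(length(seq)));           % same composition, order destroyed

lens = [];                                       % chance ORF lengths in nucleotides
for f = 1:3                                      % forward frames only, for brevity
    n = floor((length(shuffled) - f + 1) / 3);   % number of complete codons
    codons  = cellstr(reshape(shuffled(f:f+3*n-1), 3, n)');
    isStart = strcmp(codons, 'ATG');
    isStop  = ismember(codons, {'TAA', 'TAG', 'TGA'});
    openAt  = 0;                                 % codon index of the open ORF, 0 = none
    for c = 1:n
        if openAt == 0 && isStart(c)
            openAt = c;                          % an ORF opens at this start codon
        elseif openAt > 0 && isStop(c)
            lens(end+1) = 3 * (c - openAt + 1);  % length including the stop codon
            openAt = 0;
        end
    end
end
sorted    = sort(lens);
threshold = sorted(ceil(0.95 * numel(sorted)))   % simple 95th-percentile estimate
```

ORFs in the real sequence that are longer than this threshold are unlikely to have arisen by chance alone, which is the criterion used in the exercise.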

The number “n” is the number of ORFs found in the forward sequence and in its reverse complement. The variable “orf” returns a structure with the start and stop positions and the length of each ORF, and it identifies the reading frame each ORF is in. Since we also consider the reverse complement of the mRNA sequence of SARS-CoV, there are 6 different reading frames in our example. To obtain, for example, the third one, type

SARS_ORF=[orf(3).Start' orf(3).Stop' orf(3).Length' orf(3).Frame']

This is a matrix whose first column contains the start positions of the ORFs found, the second column the stop positions, the third column the lengths of the ORFs and the last column the corresponding frame (possible indices: {-3,-2,-1,1,2,3}).
The universal genetic code (table 1.1) with start codon AUG and stop codons UAA, UAG and UGA is used by default.
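
To see where the six frame indices come from, the following sketch builds the reverse complement of a toy DNA sequence and lists the codons of all six frames. It assumes that frames -1, -2 and -3 are frames 1, 2 and 3 of the reverse complement; this is a common convention, but it has not been checked against seqorfs. If the Bioinformatics Toolbox is available, seqrcomplement should return the same reverse complement.

```matlab
seq  = 'ATGGCCTAA';                      % toy DNA sequence (assumption)
comp = containers.Map({'A','C','G','T'}, {'T','G','C','A'});
n  = length(seq);
rc = blanks(n);
for k = 1:n
    rc(k) = comp(seq(n - k + 1));        % complement while reading backwards
end

frameIdx = [1 2 3 -1 -2 -3];
frames   = cell(1, 6);
for j = 1:6
    f = frameIdx(j);
    if f > 0
        s = seq(f:end);                  % forward frames start at base 1, 2, 3
    else
        s = rc(-f:end);                  % reverse frames start at base 1, 2, 3 of rc
    end
    s = s(1:3*floor(length(s)/3));       % drop an incomplete last codon
    frames{j} = strjoin(cellstr(reshape(s, 3, [])')', ' ');
    fprintf('frame %2d: %s\n', f, frames{j});
end
```

For this sequence the reverse complement is TTAGGCCAT, so frame 1 reads ATG GCC TAA and frame -1 reads TTA GGC CAT.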

Additional exercises: How many ORFs can be identified from the entire SARS-CoV genome?

Compare your result with those given in the NCBI database. To access it, type

web('http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=30248028&dopt=GenBank&WebEnv=0jFo_p0HwjP18h4EdA2Wei4-5l%4026414B3D6A627CB0_0063SID&WebEnvRq=1')

into the Matlab command window. How many ORFs code for known proteins? Can you find the ORF that codes for the spike glycoprotein product of SARS-CoV (see figure 1)?