Protein is Good for You (Matt Gertz)
As I was preparing to graduate from the University of Michigan way back in the late eighties, I had a big decision to make regarding grad school – robotics at Carnegie Mellon, or biology at Washington State? On the one hand, biology was something I’d always really loved, having even intended to go to med school at one point. On the other hand, robotics was more likely to help me pay off my debts… er, I mean, a very exciting field with a lot of challenges still ahead of it. Ultimately, I chose robotics (gaining a degree that, to this day, I have never used), but I often wonder what it might have been like to go into biology.
Anyway, back when I was dating my future wife (who was in the molecular biology graduate program), I wrote a quick’n’dirty program to translate DNA coding sequences to chains of amino acids for her advisor. That was fun, and I got to feel like I was participating in the research (in a very teeny-tiny way). Beyond that, I haven’t had much interaction with hands-on biology work in many years, although I try to keep up with what’s going on. Recently, though, I’ve been scrambling to come up with new ideas for blog articles, and that program I wrote nearly 18 years ago came to mind. I’d never been happy with the visualization of the data, so I decided to give it a second try, this time using the WPF designer to help me out.
In this blog, I’ll cover the creation of a program to translate DNA into proteins, and tomorrow I’ll talk about visualizing the results using StackPanel controls. The overall example requires VS2008 or later to code up, although today’s blog code is mostly machinery that would pretty much work on either WinForms or WPF.
“Captain… the alien virus is rewriting his DNA! He’s changing!”
One of the problems in being both a computer specialist and also somewhat knowledgeable about biology is that it’s very difficult to make it through your average Star Trek episode, between the bad computer science and the bad biology. (For the record, changing your DNA won’t change your appearance, since the protein structures it codes for already exist and represent roughly seven years of dead-end energy investment on your part. If you’re lucky, any changes will have no impact at all or even be beneficial; if you’re less lucky, the cell will die or cause some deleterious behavior.) At any rate, this will be a simple DNA/RNA/protein visualization program, and no DNA altering will be allowed. :-)
I’m not going to give a big overview on DNA transcription; if you don’t remember enough from school to follow along, the Wikipedia article on DNA is pretty good for refreshing one’s memory (as I found out). For the purposes of this exercise, we’ll just note that DNA is used to determine which proteins are created for a cell:
(1) DNA is composed of four bases (A, T, C, G), connected longitudinally by a sugar/phosphate backbone and latitudinally, in pairs, by hydrogen bonds. A (adenine) always connects latitudinally to T (thymine), and C (cytosine) always connects to G (guanine).
(2) During transcription, the DNA (a double helix) in a given chromosome is split down the middle.
(3) An mRNA (“messenger RNA”) string is built up from the DNA side which contains the appropriate information for that part of the string. (RNA is very similar to DNA, except that the connective sugar is ribose instead of deoxyribose and thymine is replaced with uracil (U).) The resulting mRNA string is a complementary (“inverse”) copy of the strand it was built from. I’ll be using the terms mRNA and RNA interchangeably for the purposes of this blog.
(4) Amino acids are assembled together based on the sequence of the mRNA. It takes three bases to code for one amino, so given four bases, there are 4^3 = 64 possible combinations for a given triplet (codon) of bases. Some codons code for the same amino; there are 20 standard aminos mapping to 61 codons. (Three of the possible codons simply indicate the end of a sequence; one codon indicates the start of a sequence and also codes for a specific amino, methionine.)
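As a rough sketch of step (3), and not necessarily what VBProtein will end up doing, here is one way the base-by-base “inverse copy” could be written. The name Transcribe is hypothetical, and the code assumes the input is the template strand, upper-case, containing only the four bases.

```vb
Imports System
Imports System.Text

Module TranscriptionSketch
    ' Hypothetical helper: builds the mRNA strand complementary to a DNA strand.
    ' Assumption: the input is the template strand, upper-case, bases only.
    Function Transcribe(ByVal dna As String) As String
        Dim rna As New StringBuilder(dna.Length)
        For Each b As Char In dna
            Select Case b
                Case "A"c
                    rna.Append("U"c)   ' RNA uses uracil where DNA would use thymine
                Case "T"c
                    rna.Append("A"c)
                Case "C"c
                    rna.Append("G"c)
                Case "G"c
                    rna.Append("C"c)
                Case Else
                    Throw New ArgumentException("Unexpected base: " & b)
            End Select
        Next
        Return rna.ToString()
    End Function

    Sub Main()
        Console.WriteLine(Transcribe("TACAAAATT"))   ' prints AUGUUUUAA
    End Sub
End Module
```

The sample input was chosen so that the result reads as three codons: AUG (start), UUU, and UAA (stop).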
Basically, the plan for the program is this: allow the user to read in a sequence of DNA, automatically convert it to an mRNA sequence, and then convert it to zero or more sequences of amino acids. We’ll then throw a visual representation of all strings onto the form.
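To make the “zero or more sequences” part of the plan concrete, here is a minimal sketch under some loud assumptions: a six-entry table stands in for the full 64 codons, a chain only counts if it runs from AUG to a stop codon, and the names BuildTable and Translate are made up for illustration. The real program may well handle these cases differently.

```vb
Imports System
Imports System.Collections.Generic

Module TranslationSketch
    ' Illustrative subset only; the real genetic code has 64 entries.
    Private ReadOnly Codons As Dictionary(Of String, String) = BuildTable()

    Function BuildTable() As Dictionary(Of String, String)
        Dim table As New Dictionary(Of String, String)
        table.Add("AUG", "Met")
        table.Add("UUU", "Phe")
        table.Add("UUC", "Phe")
        table.Add("UAA", "Stop")
        table.Add("UAG", "Stop")
        table.Add("UGA", "Stop")
        Return table
    End Function

    ' Hypothetical helper: returns one amino-acid chain per AUG..stop stretch.
    Function Translate(ByVal rna As String) As List(Of String)
        Dim chains As New List(Of String)
        Dim pos As Integer = rna.IndexOf("AUG", StringComparison.Ordinal)
        While pos >= 0
            Dim chain As New List(Of String)
            Dim i As Integer = pos
            Dim stopped As Boolean = False
            ' Read three bases at a time, starting at the start codon.
            While i + 3 <= rna.Length
                Dim name As String = Nothing
                If Not Codons.TryGetValue(rna.Substring(i, 3), name) Then Exit While
                i += 3
                If name = "Stop" Then
                    stopped = True
                    Exit While
                End If
                chain.Add(name)
            End While
            ' Assumption: a chain with no stop codon is discarded.
            If stopped Then chains.Add(String.Join("-", chain.ToArray()))
            pos = rna.IndexOf("AUG", i, StringComparison.Ordinal)
        End While
        Return chains
    End Function

    Sub Main()
        For Each chain As String In Translate("GGAUGUUUUUCUAACCAUGUUUUGA")
            Console.WriteLine(chain)   ' prints Met-Phe-Phe, then Met-Phe
        Next
    End Sub
End Module
```

An RNA string with no AUG in it yields an empty list, which is the “zero” case.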
The basic application
First, I’ll create a new WPF application called “VBProtein.” On its grid, I’m going to throw three main controls:
(1) A ScrollViewer control for displaying the sequences graphically. I’ll set mine to be 112 pixels high (enough for three rows of sequences of height 32 pixels plus the scrollbar). In the properties of the ScrollViewer, I’ll set “HorizontalScrollBarVisibility” to “Visible” and “VerticalScrollBarVisibility” to “Disabled,” since the sequences will be listed left-to-right. I’ve also set its “TabIndex” to “0”.
(2) A Button control labeled “Load” (“TabIndex” = “1”).
(3) A Button control labeled “Save” (“TabIndex” = “2”).
I’ve also added a few label controls and changed a few colors, but that’s all of the important stuff. Everything else gets added in code, so let’s take a look at that. Double-click on the window frame to generate the Window1_Loaded event handler. We’ll populate it later, but for the moment we’ll concentrate on the members we’ll need for the application. These are:
Public Translations As New Microsoft.VisualBasic.Collection
Public DNA As String
Public RNA As String
Public Proteins As New List(Of String)
We’ll load in the value of DNA, translate it to RNA, and then translate that to the Proteins – in other words, we’ll deal with those later. Let’s worry about the Translations instead.
For the translations, I decided to go with a Collection since they are easy to work with, they support keys for lookup, and I’m not dealing with too many objects – just the 64 possible codons. There’s a lot of information I’ll want to keep with each Translation:
Public Enum Sequence
    Normal = 0
    SequenceStart = 1
    SequenceStop = 2
End Enum

Public Class Translation
    Public Sub New(ByVal Triplet As String, ByVal Acid As String, _
                   ByVal Mnem As Char, ByVal Clue As Sequence)
        Codon = Triplet
        Amino = Acid
        shortAmino = Mnem
        Usage = Clue
    End Sub

    ' The codon doubles as the lookup key in the collection.
    Public Overrides Function ToString() As String
        Return Codon
    End Function

    Public Codon As String
    Public Amino As String
    Public shortAmino As Char
    Public Usage As Sequence
End Class
Note that I’m overriding the “ToString” method to return the Codon value, which I’ll be using as a key in the collection. With this structure, I can initialize the translation collection (abridged from the actual code for the purposes of legibility):
Private Sub InitializeTranslations()
    Translations.Add(New Translation("UUU", "Phe", "F"c, Sequence.Normal), "UUU") ' Phenylalanine
    Translations.Add(New Translation("UUC", "Phe", "F"c, Sequence.Normal), "UUC") ' Phenylalanine
    Translations.Add(New Translation("UAA", "OCH", "."c, Sequence.SequenceStop), "UAA") ' Ochre stop sequence
    Translations.Add(New Translation("UAG", "AMB", "."c, Sequence.SequenceStop), "UAG") ' Amber stop sequence
    Translations.Add(New Translation("UGA", "OPA", "."c, Sequence.SequenceStop), "UGA") ' Opal stop sequence
    Translations.Add(New Translation("AUG", "Met", "M"c, Sequence.SequenceStart), "AUG") ' Methionine
End Sub
Each translation has a codon, the abbreviated name of the matching amino, the one-character name of the matching amino (which I never use, but what the heck), and a setting which determines if this is a normal codon, a starting codon, or a stopping codon.
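For readers who haven’t used a keyed Collection before, here is a small self-contained sketch of looking a codon back up. The Translation class is cut down to three fields for brevity, and the module and variable names are only for illustration.

```vb
Imports System
Imports Microsoft.VisualBasic

Module LookupSketch
    Enum Sequence
        Normal = 0
        SequenceStart = 1
        SequenceStop = 2
    End Enum

    ' Cut-down version of the Translation type, for illustration only.
    Class Translation
        Public Codon As String
        Public Amino As String
        Public Usage As Sequence

        Public Sub New(ByVal triplet As String, ByVal acid As String, _
                       ByVal clue As Sequence)
            Codon = triplet
            Amino = acid
            Usage = clue
        End Sub
    End Class

    Sub Main()
        Dim table As New Collection
        table.Add(New Translation("AUG", "Met", Sequence.SequenceStart), "AUG")
        table.Add(New Translation("UAA", "OCH", Sequence.SequenceStop), "UAA")

        ' Contains avoids the exception that a keyed lookup throws for a missing key.
        If table.Contains("AUG") Then
            ' Collection stores Object, so the item has to be cast back.
            Dim t As Translation = DirectCast(table("AUG"), Translation)
            Console.WriteLine(t.Amino & " " & t.Usage.ToString())   ' prints Met SequenceStart
        End If

        Console.WriteLine(table.Contains("XYZ"))   ' prints False
    End Sub
End Module
```

The cast is the price of using the untyped Collection; a Dictionary(Of String, Translation) would avoid it, at the cost of departing from the choice made above.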
Now, in the Window1_Loaded code, I’ll add the following:
InitializeTranslations()
SaveResultsBtn.IsEnabled = False
(The second line is unrelated to the previous code and just disables the Save button until we have something to save.)
I can now start writing the functional code. Back on the grid, I’ll double-click the “Load” button, and in the resulting LoadSequenceBtn_Click event, I’ll add the following code:
' Load in the file
Dim dlg As New OpenFileDialog
dlg.Filter = My.Resources.FILT_FileFilter
If dlg.ShowDialog() = True Then
    ' FileName is the full path; SafeFileName omits the directory.
    DNA = My.Computer.FileSystem.ReadAllText(dlg.FileName)
End If
As you can see, first I’m throwing up a file dialog to get the name of the file to load (which is just a TXT file fille