Due: Thursday 3/28 10:00am

Submission name: w15-proteins

Background Genetics Information

DNA is made up of strands of nucleotides, of which there are 4 types: adenine, thymine, cytosine and guanine. Because of this, DNA sequences can be represented as strings like this:

tcgcagctcgaaccactatg

When two strands of DNA match, then adenine is paired with thymine, and cytosine is paired with guanine. So the following two DNA strands match:

 tcgcagctcgaaccactatg
 agcgtcgagcttggtgatac
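
The pairing rule above can be sketched in a few lines of plain Java. This is only an illustration: the class name DnaComplement and the methods pairOf and complement are made up for this example and are not part of the assignment.

```java
class DnaComplement {
    // Returns the nucleotide that pairs with base (a<->t, c<->g).
    // Returns '?' for any character that is not a, t, c, or g.
    static char pairOf(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:  return '?';
        }
    }

    // Builds the matching strand by pairing each nucleotide in order.
    static String complement(String strand) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < strand.length(); i++) {
            sb.append(pairOf(strand.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the second strand from the example above.
        System.out.println(complement("tcgcagctcgaaccactatg"));
    }
}
```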

Generally, DNA is translated into RNA, and then RNA is used to create proteins using amino acids. In DNA, 3 nucleotides together represent a single amino acid. We refer to sequences of 3 nucleotides as codons. Additionally, the sequence atg represents the start of a protein, and taa, tga, and tag represent the end of a protein. We will be working on a Processing program to help visualize different information about DNA strands.
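
To make the idea of codons concrete, here is a minimal sketch that cuts a DNA string into groups of 3 and recognizes the start and end codons. The class CodonSplit and its methods are hypothetical names chosen for this example, and the sketch assumes codons are read starting at index 0; leftover nucleotides at the end are ignored.

```java
import java.util.ArrayList;
import java.util.List;

class CodonSplit {
    // Splits dna into consecutive 3-letter codons, starting at index 0.
    // Any 1 or 2 leftover nucleotides at the end are dropped.
    static List<String> codons(String dna) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            out.add(dna.substring(i, i + 3));
        }
        return out;
    }

    // True if codon marks the start of a protein.
    static boolean isStartCodon(String codon) {
        return codon.equals("atg");
    }

    // True if codon marks the end of a protein.
    static boolean isEndCodon(String codon) {
        return codon.equals("taa") || codon.equals("tga") || codon.equals("tag");
    }

    public static void main(String[] args) {
        System.out.println(codons("atgcgataa")); // [atg, cga, taa]
    }
}
```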

Useful Java Stuff

  • If you want to look at a working version of the previous assignment, check the solutions GitHub repository.
  • indexOf(String sub) is a Java String method that returns either:
    • The index of the start of the first occurrence of sub in the calling string.
    • -1 if sub does not appear in the calling string.
  • substring() is a Java String method that returns a portion of a String, as a new String object. There are 2 versions of substring():
    1. s.substring(start)
      • Returns a String made from the characters in s starting at index start and going to the end of s.
    2. s.substring(start, end)
      • Returns a String made from the characters in s starting at index start and ending at index end-1.
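
As a quick illustration of how indexOf() and the two versions of substring() work together, here is a small sketch. The class SubstringDemo and the helpers before and after are hypothetical and exist only for this example.

```java
class SubstringDemo {
    // Hypothetical helper: the part of s before the first occurrence of sub,
    // or all of s if sub does not appear.
    static String before(String s, String sub) {
        int i = s.indexOf(sub);
        if (i == -1) {
            return s;
        }
        return s.substring(0, i); // characters at indices 0 through i-1
    }

    // Hypothetical helper: the part of s after the first occurrence of sub,
    // or the empty string if sub does not appear.
    static String after(String s, String sub) {
        int i = s.indexOf(sub);
        if (i == -1) {
            return "";
        }
        return s.substring(i + sub.length()); // from just past sub to the end
    }

    public static void main(String[] args) {
        String s = "ccatgggg";
        System.out.println(s.indexOf("atg")); // 2
        System.out.println(s.indexOf("ttt")); // -1
        System.out.println(before(s, "atg")); // cc
        System.out.println(after(s, "atg"));  // ggg
    }
}
```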

Task at Hand

Create a copy of your work from yesterday, then add the following methods:

  • int findProteinEnd(String strand)
    • Returns the index of the first end codon in strand.
    • Returns -1 if there is no end codon in strand.
  • boolean containsProtein(String dna)
    • Returns true if dna contains at least one full exon.
    • For our purposes, a DNA sequence contains an exon if:
      • It has a start codon
      • It has an end codon
      • The number of nucleotides between the start and end is a multiple of 3 (i.e. there are no nucleotides unattached to a codon)
      • It has at least 5 other codons between those 2 (this is not biologically accurate; in reality this is closer to 430 codons).
  • String getProtein(String dna)
    • Returns the first protein-encoding (exon) portion of dna. It should not include the start or end codons.
    • If there are no exons in dna, return the empty string.
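
As a starting point, here is one possible sketch of findProteinEnd, written as plain Java inside a hypothetical class named ProteinEndSketch. It assumes that "first end codon" means the end codon with the lowest index anywhere in the string, without checking whether that index lines up with a codon boundary. If your reading of the task differs, adjust accordingly.

```java
class ProteinEndSketch {
    // Returns the lowest index at which taa, tga, or tag appears in strand,
    // or -1 if none of them appears.
    // Assumption: codon alignment is NOT checked here.
    static int findProteinEnd(String strand) {
        String[] endCodons = {"taa", "tga", "tag"};
        int best = -1;
        for (String codon : endCodons) {
            int i = strand.indexOf(codon);
            // Keep i if it was found and is earlier than anything seen so far.
            if (i != -1 && (best == -1 || i < best)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(findProteinEnd("atgccctag")); // 6
        System.out.println(findProteinEnd("cccc"));      // -1
    }
}
```

The loop compares the result of indexOf() for each end codon and ignores any -1, so a strand that contains several end codons still reports the earliest one.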

Here are a series of useful test cases for this assignment:

println("protein end in [" + protein1 + "] (21): " + findProteinEnd(protein1));
println("protein end in [" + noProtein0 + "] (-1): " + findProteinEnd(noProtein0));
println("protein end in [" + noProtein2 + "] (3): " + findProteinEnd(noProtein2));

println("protein in [" + protein0 + "] (true): " + containsProtein(protein0));
println("protein in [" + protein1 + "] (true): " + containsProtein(protein1));
println("protein in [" + protein2 + "] (true): " + containsProtein(protein2));
println("protein in [" + noProtein0 + "] (false): " + containsProtein(noProtein0));
println("protein in [" + noProtein1 + "] (false): " + containsProtein(noProtein1));
println("protein in [" + noProtein2 + "] (false): " + containsProtein(noProtein2));
println("protein in [" + noProtein3 + "] (false): " + containsProtein(noProtein3));
println("protein in [" + noProtein4 + "] (false): " + containsProtein(noProtein4));

println();
println("protein in [" + protein0 + "] " + getProtein(protein0));