Tuesday, September 15, 2009

Homework 3: DNA

The instructions for building any living organism is encoded in its DNA. We now have the ability to sequence (that is, get a full listing) of a person's DNA and look to see which particular genes that person carries, genes which might correspond to specific characteristics or diseases. A DNA sequence is a very long string composed only of the letters, A, C, G, T. For this homework you will use use following sample sequence:

String dna = 
   "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC" +
   "CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC" +
   "CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG" +
   "AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC" +
   "CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG" +
   "TTTAATTACAGACCTGAA";

Your program will ask the user for a query, which will be a combination of the letters ACGT, and you will program will then print out the index position (in the dna string) of all the places where that query matches. But, you will also need to allow for one character errors. That is, if there is only one letter difference, any letter, then the match should still count. Allowing for these errors is important in Biology because the DNA sequences do contain many errors because of limitations on the sequencing methods.

Below are a few examples searches on the dna string above. The user types in the query and the program prints out the match positions.
Enter query:ACAAGAT
Match at position 0

Enter query:ACAAGAA
Match at position 0

Enter query:CCTGGAGGG
Match at position 70

Enter query:CCTGGGGGG
Match at position 70

Enter query:ACAA
Match at position 0
Match at position 96
Match at position 125
Match at position 131
Match at position 253
Match at position 270
Match at position 272
Match at position 315
Match at position 319
Match at position 321
Match at position 345
Match at position 357

Enter query:CCCCCCC
Match at position 82
Match at position 243
Match at position 244
Match at position 245
Match at position 246

Hint: You will need to use a nested loop.

This homework is due Tuesday, September 22 @noon. As always, email your program to your TAs.

For Hackers only: If you feel this problem was too easy and want a little bit of a challenge, you can try to solve the welcome to code jam problem from the Google Code Jam 2009. It is similar to our problem but yet soooo much harder to solve. And, NO, you will not get any bonus points for solving it. If you can solve it, you should not be taking 145. You should be in 350.

2 comments:

chevytrk24 said...

any more hints?

Jose M Vidal said...

hint 2: compare the query string character-by-character, that way you can count mismatches.