CSci 150: Foundations of computer science I

Assignment 3: DNA forensics

Due: 8:00am, Thursday, February 11. Value: 30 pts. Submit to Sauron.

A detective arrives at a crime scene and finds a DNA sample thought to be from the culprit. The detective collects DNA from several suspects and sends it to a lab, which determines which suspect's DNA (if any) matches the DNA found at the crime scene. In this assignment we'll write a program implementing a crude approximation for measuring the similarity of DNA sequences. (The same program might also be useful in determining how species are related.)

DNA is a sequence built from four nucleotides: adenine, cytosine, guanine, and thymine, commonly abbreviated as A, C, G, and T. In our crude similarity approximation, we'll simply count the number of positions where the two sequences have the same nucleotide. As an example, consider the two sequences below.

GTGAAGTCCG
GGGTGCAACC

Our measure of similarity would be 3, since they have three nucleotides in common: the first (G for both), the third (G), and the next-to-last (C).
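The counting rule above can be sketched in a few lines. The assignment does not name a programming language, so Python is assumed here, and the function name count_matches is a hypothetical choice for illustration. This is one possible approach, not a required one; it relies on the assignment's guarantee that both strings have the same length.

```python
def count_matches(first, second):
    """Count the positions where two equal-length sequences hold the same letter."""
    matches = 0
    # zip pairs up the characters at position 0, position 1, and so on.
    for a, b in zip(first, second):
        if a == b:
            matches += 1
    return matches


# The two example sequences from the text above.
print(count_matches("GTGAAGTCCG", "GGGTGCAACC"))  # prints 3
```

Walking the two strings in lockstep compares only characters at the same position, which is exactly the measure described above.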

Your assignment is to create a program in which the user first enters the DNA sequence from the crime scene, then the number of suspects, and then the DNA sequence for each suspect. As each suspect's DNA is read, the program should display how many nucleotides it has in the same position as in the crime scene DNA. The following illustrates how your program should interact with the user. User input is the text typed after each prompt; the remarks to the right are annotations, not program output.

Crime scene: GTGAAGTCCG        Your output should match this example exactly.
How many suspects? 5
Suspect 1: GGGTGCAACC
Shares 3 nucleotides           This is the example shown above.
Suspect 2: CCACGACCGC
Shares 1 nucleotide            Note that nucleotide is in the singular.
Suspect 3: GTCACGACAG
Shares 6 nucleotides
Suspect 4: GAGCCGACCA
Shares 5 nucleotides
Suspect 5: GCGACACCCA
Shares 5 nucleotides

Your program may assume that all strings have the same length (though the shared length may not be 10!), and that the only characters in each string are A, C, G, and T. This means your program does not need to verify that the inputs are valid and contain only these characters.

Note: If two sequences share just one nucleotide, the output should use the word nucleotide rather than the plural nucleotides.
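One way to handle the singular case is to choose the word before building the output line. The sketch below assumes Python, and the function name describe_shared is hypothetical. It also assumes that a count of zero takes the plural form, which the assignment does not state explicitly.

```python
def describe_shared(count):
    """Build the output line, using the singular word only when count is 1."""
    # Assumption: zero takes the plural ("Shares 0 nucleotides").
    if count == 1:
        word = "nucleotide"
    else:
        word = "nucleotides"
    return "Shares " + str(count) + " " + word


print(describe_shared(1))  # prints Shares 1 nucleotide
print(describe_shared(3))  # prints Shares 3 nucleotides
```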

Suggestion: Build your program in two steps.

  1. First, write a program that reads in the crime scene DNA and the DNA for just one suspect, then displays how many nucleotides they have in common. Before you go on, make sure this part works. Be sure to test the case where the two DNA sequences share exactly one nucleotide, where the output should use the singular nucleotide.

  2. Once you have that working, modify the program so that it deals with multiple suspects. This will largely be a matter of wrapping most of your step-1 code into a loop.
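The two steps above might fit together as in the following sketch. It assumes Python. The names count_matches and run_session are hypothetical, as are the read and write parameters, which default to the built-in input and print and exist only so the loop can be exercised without a keyboard. Treat it as an outline of the structure, not as the expected solution.

```python
def count_matches(first, second):
    """Count the positions where two equal-length sequences hold the same letter."""
    return sum(1 for a, b in zip(first, second) if a == b)


def run_session(read=input, write=print):
    """Read the crime scene DNA, then report on each suspect in turn."""
    crime = read("Crime scene: ")
    num_suspects = int(read("How many suspects? "))

    # Step 2: the body of this loop is the step-1 program for one suspect.
    for number in range(1, num_suspects + 1):
        suspect = read("Suspect " + str(number) + ": ")
        shared = count_matches(crime, suspect)
        word = "nucleotide" if shared == 1 else "nucleotides"
        write("Shares " + str(shared) + " " + word)
```

Calling run_session() with no arguments runs the interactive version. The crime scene sequence is read once, before the loop, because every suspect is compared against the same sample.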