CS 190C, Spring 2009: Problem Set 3

Posted: Friday, January 30, 2009

Due: Friday, February 6, 2009, 10pm (electronic submission via Blackboard)

Your code for each problem should be in a separate file. Name the files problem1.py and problem2.py and make sure to follow any additional instructions given within each problem.

Problem 1: Manipulating DNA

Deoxyribonucleic acid (DNA) consists of two long chains of nucleotides twisted into a double helix and joined by hydrogen bonds between the complementary bases adenine and thymine or cytosine and guanine. The main role of DNA is the storage of information and DNA is often compared to code containing instructions. Each nucleotide has a complement: for 'a' it is 't' (and vice versa) and for 'c' it is 'g' (and vice versa). In this problem we view DNA as a string of characters over a four-letter alphabet, 'a', 'c', 'g', and 't'. Using operations on strings, write functions to do the following:

  • DNA_complement(str): Function DNA_complement returns, for a given string str, the complement of the string. For example, DNA_complement(“aaggttcc”) returns the string “ttccaagg”.
  • DNA_heavy_max(str): Function DNA_heavy_max determines, for a given string str, which nucleotide occurs most frequently and returns that nucleotide. If there is a tie for most frequent occurrence the function returns the nucleotide with the greatest molecular mass. Adenine has a molecular mass of 135.128, Guanine is 151.128, Cytosine is 111.103, and Thymine is 125.107. For example, DNA_heavy_max(“agatc”) returns “a” and DNA_heavy_max(“ccattg”) returns “t”.

Use string operations to implement functions DNA_complement and DNA_heavy_max. Appendix A in Zelle contains a good summary of the string operations you may want to consider. Write and test each function individually. Then put the two functions into file problem1.py which should read strings from an input file (one string per line) and invoke each function on the string. Here is sample files: dnastrings1.txt, dnastrings2.txt.

Your code in file problem1.py should be organized as followed:

# CS190C: Spring 2009
# FirstName LastName
# Problem Set 3, Problem 1
import string
def DNA_complement(str):
    #returns complemented string
    #write code
def DNA_heavy_max(str):
    #returns most frequently occurring nucleotide
    #write code
if __name__ == '__main__':
    f = open('filename', 'r')
    # open a file containing DNA strings, one per line 
    for line in f:
        print 'DNA sequence =', line
        # call function DNA complement
        # call function DNA_heavy_max
        # print complemented string and most frequently occurring nucleotide

The format

 if  __name__ == '__main__': 

is generally used when writing code with functions (instead of using a main function). It allows one to have code act as either reusable modules or as standalone programs. You can find out more at http://effbot.org/pyfaq/tutor-what-is-if-name-main-for.htm.

Problem 2: Spatial Queries

Many applications dealing with data sets containing points in a given dimension need queries determining spatial relationships. This problem considers 2-dimensional points given by their (x,y) coordinates and asks you to write a number of functions implementing spatial queries. You are given code for a number of “helper” function related to I/O and data generation:

  • read_points(filename): opens and reads a specified file and generates a list consisting of all points in the file (i.e., a list containing pairs of points).
  • save_points(points, fname): given a list of points, the function saves the points to the file named fname. These points can be read back in with read_points(fname).
  • show_points(points, color=(0,1,0)): points is a list of points; show_points draws them as spheres in the active VPython window. If no color is given (i.e., no second argument is given), the spheres will be green. To use this function, VPython must be installed.
  • random_points(n,d): returns a list of n random points.

These functions are in file ps3_09_p2_lib.py. You can use them without making any changes or you can change them. In either case, you need to submit this file - without changing its name - along with file problem2.py.

The functions to be written

  • distance(p, q): has two points as arguments and returns the Euclidean distance between the two points (see External Link for a definition, if needed).
  • closest(points, q): has two arguments, a list of points and an arbitrary point q. Point q may or may not be in list points. The function returns a point p such that distance(p,q) is a minimum over all points in the list. The function should consider every point in list points to determine q.
  • nearest_neighbors(points): determines, from among all points in list points, a pair of points p and q, p different from q, such that distance(p,q) is a minimum over all other distances between two points in list points. Points p and q are returned. If there is more than one pair having the same minimum distance, any pair achieving the minimum distance can be returned. The function should compute a pair giving the minimum distance by considering all possible pairs in the list.

A skeleton of problem2.py is available here: problem2.py and it is also shown below. The functions in the skeleton program have place-holders with return statements so you can run the program before completing the functions.

Write and test each function individually and start with small and simple data sets. Examples of data files are available here: testdata0.txt, testdata1.txt. Note that one can generate an input file easily using functions random_points and save_points.

Skeleton of problem2.py you can use:

# CS190C: Spring 2009
# FirstName LastName
# Problem Set 3, Problem 2
from ps3_09_p2_lib import *  #import helper functions
def distance(p, q):
    return 0  #return distance; write code
def closest(points, p):
    return [5,5]  #return closest point; write code
def nearest_neighbors(points):
    return  [[0,0], [5,5]]  #return pair of points; write code 
if __name__ == '__main__':
    # Create a list of 100 random points
    myPoints = random_points(100, 2)
    # Save the points to a file
    save_points(myPoints, "points.txt")
    # Plot the points
    show_points(myPoints, (0,1,0))
    # Create three extra points
    pointA = [1.0, 0.0]
    pointB = [1.0, 1.0]
    pointC = [0.0, 0.0]
    # Print the distance between two points
    print "The distance between", pointA, "and", pointB,"is:", distance(pointA, pointB)
    # Find the point in myPoints closest to point B
    closeToB = closest(myPoints, pointB);
    print "The point closes to", pointB, "is:", closeToB
    # Find the nearest neighbors in myPoints
    pointX, pointY = nearest_neighbors(myPoints)
    print "The nearest neighbors are:", pointX, "and", pointY
cs190c/problemset3_09.txt · Last modified: 2009/01/31 08:14 by seh
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki