Posted: Monday, April 20, 2009
Due date: Friday, May 1, 2009, 10pm
In Part I of the Protein-Protein Interaction Project you used the MCL algorithm to find clusters which were then evaluated by computing coverage and performance. These two metrics are based on properties of the graph. In Part II, you will evaluate the quality of the clustering using biological information obtained from the Gene Ontology (GO). GO is a publicly available set of structured vocabularies that describe various aspects of what is known about molecules in a cell. See GO and the class notes.
For the two biological data sets, we have mapped proteins to GO terms and computed GO-based matrices of pairwise similarities between these proteins (using the cellular compartment vocabulary of GO). This information is not available for all nodes in the graph G and only edges having similarity values from GO will be used to evaluate the quality of the clusters.
For four of the five data sets from part 1, weighted graphs are available from the same page. For Gavin and Ito, these matrices measure the similarities between proteins. The following describes the tasks to be completed for these two data files. For gavin.txt and ito.txt data sets, generate a weighted graph GW. The weight between two nodes represents the similarity measure from GO. Function import_matrix which has been added to bio_provided_sp09.py. The function reads a matrix and generates the corresponding weighted, complete graph. Note that graph GW does not have the same nodes as G or GC (what holds is that every node in GW is in G, but a node in GC may not be in GW). Each edge weight in GW is a score between 0 and 1 (it is the Lin score described in the class notes).
The following tasks operate on the weighted graph GW and a clustering C for graph GC. We again refer to good and bad edges in GC with respect to the clustering generated. For any two nodes u and v in GC, if both u and v are in GW, then the edge (u,v) participates in the cluster evaluation. If one or both nodes is missing, the edge does not participate. Using only part of the data in an evaluation is not uncommon for biological data sets. Assume that both u and v are nodes in graph GW. The weight of a good or bad edge is the weight of this edge in GW. Note that in the lecture notes, good edges are called within-cluster edges and bad edges are called between-cluster edges.
multi_boxplot(point_sets, colors = None) given in bio_provided_sp09.py to generate a figure showing a boxplot for every cluster in C. For the discussion of a boxplot see the lecture notes orhttp://en.wikipedia.org/wiki/Box_plot. The figure generated by using multi_boxplot will be similar to that shown on slide 20 from the April 8 lecture. The function is given a list of lists (point_sets) where one element in the list contains the edge weights of the good edges in a cluster. Note that the code we give does not use the built-in boxplot feature of pylab (it does not support drawing multiple boxplots). Make sure to test the use of function multi_boxplot separately with your own test data.A good graph clustering will result in clusters of proteins with high values of similarity metrics within a cluster (which means values based on the good edges should be large) and low values between the clusters (which means the values based on bad edges should be small).
The page we provide for sample graphs also contains four matrices, two for synthetic data and two for the biological data sets. For a graph having associated matrices, use function import_matrix to create each graph GW, use function mcl_clustering to determine a clustering of the nodes in graph GC, and use graph GW to evaluate the clusters.
Note: If mcl_clustering is taking too long to run on your computer, the clustering is provided for you on the page with the graphs.
Summarizing what is to be done for each data set:
Submit all the code in a file called my_bio_analysis.py. Note that some of the code will be from part I or it may be from the solution posted. Make sure to include everything needed. may