CS 190C, Spring 2009: Project 4, Part II

Protein-Protein Interactions: Evaluating Clusters using Gene Ontology Data

Posted: Monday, April 20, 2009

Due date: Friday, May 1, 2009, 10pm

In Part I of the Protein-Protein Interaction Project you used the MCL algorithm to find clusters which were then evaluated by computing coverage and performance. These two metrics are based on properties of the graph. In Part II, you will evaluate the quality of the clustering using biological information obtained from the Gene Ontology (GO). GO is a publicly available set of structured vocabularies that describe various aspects of what is known about molecules in a cell. See GO and the class notes.

For the two biological data sets, we have mapped proteins to GO terms and computed GO-based matrices of pairwise similarities between these proteins (using the cellular compartment vocabulary of GO). This information is not available for all nodes in the graph G and only edges having similarity values from GO will be used to evaluate the quality of the clusters.

For four of the five data sets from part 1, weighted graphs are available from the same page. For Gavin and Ito, these matrices measure the similarities between proteins. The following describes the tasks to be completed for these two data files. For gavin.txt and ito.txt data sets, generate a weighted graph GW. The weight between two nodes represents the similarity measure from GO. Function import_matrix which has been added to bio_provided_sp09.py. The function reads a matrix and generates the corresponding weighted, complete graph. Note that graph GW does not have the same nodes as G or GC (what holds is that every node in GW is in G, but a node in GC may not be in GW). Each edge weight in GW is a score between 0 and 1 (it is the Lin score described in the class notes).

The following tasks operate on the weighted graph GW and a clustering C for graph GC. We again refer to good and bad edges in GC with respect to the clustering generated. For any two nodes u and v in GC, if both u and v are in GW, then the edge (u,v) participates in the cluster evaluation. If one or both nodes is missing, the edge does not participate. Using only part of the data in an evaluation is not uncommon for biological data sets. Assume that both u and v are nodes in graph GW. The weight of a good or bad edge is the weight of this edge in GW. Note that in the lecture notes, good edges are called within-cluster edges and bad edges are called between-cluster edges.

  • Generate the information shown in the two boxplots shown on slide 19 from the April 8 lecture. The left boxplot uses the weights of all good edges in the graph and the right boxplot uses the weights of all the bad edges in the graph. Also show the distribution of the values as a graph or histogram (good and bad edges can be shown in the same or in different plots).
  • Use function multi_boxplot(point_sets, colors = None) given in bio_provided_sp09.py to generate a figure showing a boxplot for every cluster in C. For the discussion of a boxplot see the lecture notes orhttp://en.wikipedia.org/wiki/Box_plot. The figure generated by using multi_boxplot will be similar to that shown on slide 20 from the April 8 lecture. The function is given a list of lists (point_sets) where one element in the list contains the edge weights of the good edges in a cluster. Note that the code we give does not use the built-in boxplot feature of pylab (it does not support drawing multiple boxplots). Make sure to test the use of function multi_boxplot separately with your own test data.

A good graph clustering will result in clusters of proteins with high values of similarity metrics within a cluster (which means values based on the good edges should be large) and low values between the clusters (which means the values based on bad edges should be small).

Submission instructions

The page we provide for sample graphs also contains four matrices, two for synthetic data and two for the biological data sets. For a graph having associated matrices, use function import_matrix to create each graph GW, use function mcl_clustering to determine a clustering of the nodes in graph GC, and use graph GW to evaluate the clusters.

Note: If mcl_clustering is taking too long to run on your computer, the clustering is provided for you on the page with the graphs.

Summarizing what is to be done for each data set:

  1. Read the graph G and generate graph GC. Perform mcl_clustering on GC, compute coverage and performance of the clustering. Generate a description of the clusters found: number of clusters and their sizes (you can reuse what you did in aprt 1 or use code from the posted solution).
  2. Generate the weighted, complete graph GW and produce the multi boxplot and the visualization of good and bad edges (as described further up).
  3. For the two biological data sets, prepare a description on the qualities of the clustering with respect to know protein interactions (in a .txt file). Include a discussion on whether the clusters found suggest meaningful protein complexes based on the metric used to evaluate them. You may also want to consider the sizes of the clusters in your answer.

Submit all the code in a file called my_bio_analysis.py. Note that some of the code will be from part I or it may be from the solution posted. Make sure to include everything needed. may

cs190c/project4c_09.txt · Last modified: 2009/04/21 15:19 (external edit)
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki