**Posted:** Monday, April 20, 2009

**Due date:** Friday, May 1, 2009, 10pm

In Part I of the Protein-Protein Interaction Project you used the MCL algorithm to find clusters which were then evaluated by computing coverage and performance. These two metrics are based on properties of the graph. In Part II, you will evaluate the quality of the clustering using biological information obtained from the Gene Ontology (GO). GO is a publicly available set of structured vocabularies that describe various aspects of what is known about molecules in a cell. See GO and the class notes.

For the two biological data sets, we have mapped proteins to GO terms and computed GO-based matrices of pairwise similarities between these proteins (using the cellular compartment vocabulary of GO). This information is not available for all nodes in the graph G and only edges having similarity values from GO will be used to evaluate the quality of the clusters.

For four of the five data sets from part 1, weighted graphs are available from the same page. For Gavin and Ito, these matrices measure the similarities between proteins. The following describes the tasks to be completed for these two data files. For gavin.txt and ito.txt data sets, generate a weighted graph GW. The weight between two nodes represents the similarity measure from GO. Function `import_matrix`

which has been added to bio_provided_sp09.py. The function reads a matrix and generates the corresponding weighted, complete graph.
Note that graph GW does **not** have the same nodes as G or GC (what holds is that every node in GW is in G, but a node in GC may not be in GW). Each edge weight in GW is a score between 0 and 1 (it is the Lin score described in the class notes).

The following tasks operate on the weighted graph `GW`

and a clustering C for graph GC. We again refer to good and bad edges in GC with respect to the clustering generated. For any two nodes u and v in GC, if both u and v are in GW, then the edge (u,v) participates in the cluster evaluation. If one or both nodes is missing, the edge does not participate. Using only part of the data in an evaluation is not uncommon for biological data sets. Assume that both u and v are nodes in graph GW.
The weight of a good or bad edge is the weight of this edge in GW. Note that in the lecture notes, good edges are called within-cluster edges and bad edges are called between-cluster edges.

- Generate the information shown in the two boxplots shown on slide 19 from the April 8 lecture. The left boxplot uses the weights of all good edges in the graph and the right boxplot uses the weights of all the bad edges in the graph. Also show the distribution of the values as a graph or histogram (good and bad edges can be shown in the same or in different plots).

- Use function
`multi_boxplot(point_sets, colors = None)`

given in bio_provided_sp09.py to generate a figure showing a boxplot for every cluster in C. For the discussion of a boxplot see the lecture notes orhttp://en.wikipedia.org/wiki/Box_plot. The figure generated by using`multi_boxplot`

will be similar to that shown on slide 20 from the April 8 lecture. The function is given a list of lists (point_sets) where one element in the list contains the edge weights of the good edges in a cluster. Note that the code we give does not use the built-in boxplot feature of pylab (it does not support drawing multiple boxplots). Make sure to test the use of function`multi_boxplot`

separately with your own test data.

A good graph clustering will result in clusters of proteins with high values of similarity metrics within a cluster (which means values based on the good edges should be large) and low values between the clusters (which means the values based on bad edges should be small).

The page we provide for sample graphs also contains four matrices, two for synthetic data and two for the biological data sets. For a graph having associated matrices, use function `import_matrix`

to create each graph GW, use function `mcl_clustering`

to determine a clustering of the nodes in graph GC, and use graph GW to evaluate the clusters.

**Note: If mcl_clustering is taking too long to run on your computer, the clustering is provided for you on the page with the graphs.**

Summarizing what is to be done for each data set:

- Read the graph G and generate graph GC. Perform mcl_clustering on GC, compute coverage and performance of the clustering. Generate a description of the clusters found: number of clusters and their sizes (you can reuse what you did in aprt 1 or use code from the posted solution).
- Generate the weighted, complete graph GW and produce the multi boxplot and the visualization of good and bad edges (as described further up).
- For the two biological data sets, prepare a description on the qualities of the clustering with respect to know protein interactions (in a .txt file). Include a discussion on whether the clusters found suggest meaningful protein complexes based on the metric used to evaluate them. You may also want to consider the sizes of the clusters in your answer.

Submit all the code in a file called `my_bio_analysis.py`

. Note that some of the code will be from part I or it may be from the solution posted. Make sure to include everything needed. may