Table of Contents

CS 190C, Spring 2009: Project 4, Part II

Protein-Protein Interactions: Evaluating Clusters using Gene Ontology Data

Posted: Monday, April 20, 2009

Due date: Friday, May 1, 2009, 10pm

In Part I of the Protein-Protein Interaction Project you used the MCL algorithm to find clusters which were then evaluated by computing coverage and performance. These two metrics are based on properties of the graph. In Part II, you will evaluate the quality of the clustering using biological information obtained from the Gene Ontology (GO). GO is a publicly available set of structured vocabularies that describe various aspects of what is known about molecules in a cell. See GO and the class notes.

For the two biological data sets, we have mapped proteins to GO terms and computed GO-based matrices of pairwise similarities between these proteins (using the cellular compartment vocabulary of GO). This information is not available for all nodes in the graph G and only edges having similarity values from GO will be used to evaluate the quality of the clusters.

For four of the five data sets from part 1, weighted graphs are available from the same page. For Gavin and Ito, these matrices measure the similarities between proteins. The following describes the tasks to be completed for these two data files. For gavin.txt and ito.txt data sets, generate a weighted graph GW. The weight between two nodes represents the similarity measure from GO. Function import_matrix which has been added to bio_provided_sp09.py. The function reads a matrix and generates the corresponding weighted, complete graph. Note that graph GW does not have the same nodes as G or GC (what holds is that every node in GW is in G, but a node in GC may not be in GW). Each edge weight in GW is a score between 0 and 1 (it is the Lin score described in the class notes).

The following tasks operate on the weighted graph GW and a clustering C for graph GC. We again refer to good and bad edges in GC with respect to the clustering generated. For any two nodes u and v in GC, if both u and v are in GW, then the edge (u,v) participates in the cluster evaluation. If one or both nodes is missing, the edge does not participate. Using only part of the data in an evaluation is not uncommon for biological data sets. Assume that both u and v are nodes in graph GW. The weight of a good or bad edge is the weight of this edge in GW. Note that in the lecture notes, good edges are called within-cluster edges and bad edges are called between-cluster edges.

A good graph clustering will result in clusters of proteins with high values of similarity metrics within a cluster (which means values based on the good edges should be large) and low values between the clusters (which means the values based on bad edges should be small).

Submission instructions

The page we provide for sample graphs also contains four matrices, two for synthetic data and two for the biological data sets. For a graph having associated matrices, use function import_matrix to create each graph GW, use function mcl_clustering to determine a clustering of the nodes in graph GC, and use graph GW to evaluate the clusters.

Note: If mcl_clustering is taking too long to run on your computer, the clustering is provided for you on the page with the graphs.

Summarizing what is to be done for each data set:

  1. Read the graph G and generate graph GC. Perform mcl_clustering on GC, compute coverage and performance of the clustering. Generate a description of the clusters found: number of clusters and their sizes (you can reuse what you did in aprt 1 or use code from the posted solution).
  2. Generate the weighted, complete graph GW and produce the multi boxplot and the visualization of good and bad edges (as described further up).
  3. For the two biological data sets, prepare a description on the qualities of the clustering with respect to know protein interactions (in a .txt file). Include a discussion on whether the clusters found suggest meaningful protein complexes based on the metric used to evaluate them. You may also want to consider the sizes of the clusters in your answer.

Submit all the code in a file called my_bio_analysis.py. Note that some of the code will be from part I or it may be from the solution posted. Make sure to include everything needed. may