Table of Contents

**Posted:** March 28, 2008

**Due dates:**

- Part 1: Friday, April 4, 10pm
- Part 2: Friday, April 11, 10pm

Most cellular processes are carried out by complexes of multiple physically interacting proteins. Identifying and analyzing the components of these complexes provides insights into the functional mechanisms in cells. Modern experimental technologies, such as yeast-to-hybrid and co-precipitation combined with mass spectrometry, allow us to investigate the components of the complexes on a large scale. However, data from these technologies are inherently imperfect and tend to include false positive and false negative observations. The large-scale and noisy nature of these experiments makes computational analysis crucial for interpreting the results.

In this project, you will analyze data sets from large-scale experiments characterizing protein-protein interactions in S. cerevisiae (baker’s yeast), a model system relevant to human biology. This research is described in Anne-Claude Gavin et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes”, Nature, January 2002 (available also at http://www.nature.com/nature/journal/v415/n6868/full/415141a.html). You are not expected to read the paper, the reference is given for information only. Using data sets produced by Gavin's research group, you will generate graphs and use graph algorithms and graph clustering to predict the quality of the experimentally observed protein complexes.

In this project you will perform graph manipulations and use the Python NetworkX library to solve problems on the graphs. You will use the Cytoscape application to visualize large graphs. You will be working with synthetic data as well as real-life data generated by bioinformatics researchers. You will interpret results of graph clustering using standard metrics as well as biological knowledge.

In Part 1 of the project, you will perform a sequence of operations on graphs, starting with an initial graph that you read from a file. As you generate each new graph, do not destroy the previous graph on which it is based. Part 1 makes heavy use of the NetworkX library and graph visualization tools.

Here is a list of graph operations you may find useful (note that you can use any graph operation you find useful) and the bio_provided.py file.

The graphs you manipulate are either read from a file or are generated by one of the NetworkX graph generators. The input graphs are directed, their edges have no weights, but nodes have names. You will be given a number of input files to process. Function `import_edge_list(file_name)`

opens file `file_name`

, which is assumed to contain a directed, unweighted graph. The function returns a NetworkX object that represents this graph.

Graphs are represented in files as a series of lines, with each line representing one edge in the graph. An edge is given by two entries representing the node names (there is a space between two node names). The order of the node names indicates the direction of the edge. For the files that contain biological data, entries are of the form `EBI-4766 EBI-6256`

, indicating a directed edge from node `EBI-4766`

to node `EBI-6256`

. The Gavin data sets have thousands of entries and they should only be used once your code has been tested.

NetworkX allows you to generate graphs (use methods in the package `generators`

) that may be useful when testing various parts of your code.

After you read (or generate) the initial graph G, generate a reduced graph GR. The reduction process consists of removing nodes from the graph according to specified rules. In the biological data, nodes removed by this process correspond to “noise” in the experiment data or to a part of the graph that is not very “connected” to the rest. The reduction operates on a directed, simple graph G (what NetworkX calls a DiGraph).

A directed graph G is *strongly connected* if for any two of its nodes a and b, there exists a directed path from node a to node b and one from node b to a. Intuitively, in a strongly connected graph one can travel along the edges from any node to any other node. If a graph is not strongly connected, we are interested in identifying the largest subgraphs that are strongly connected. We say two nodes a and b are in the same strongly connected component if and only if there exists a path from a to b and one from b to a. NetworkX contains a number of algorithms to determine the strongly connected components (SCC) of a directed graph (see submodule `networkx.component`

). Thus, you do not need to write code for these algorithms, just use them from the NetworkX library. The library functions return a list of connected components. Each connected component is itself a list that contains the names of the nodes (`EBI-4766`

, `EBI-6256`

, etc.) that are strongly connected.

The reduction process uses an integer r, r >= 0, called the *reduction factor*. For every input graph, you will be given a reduction value (the default value is r = 0). To generate the reduced graph GR from the directed input graph G, do the following:

- Determine the strongly connected components of G (using NetworkX).
- Create a copy of G, called GR.
- Remove from GR every node that is part of a strongly connected component consisting of r or fewer nodes. Removing a node from GR removes not only the node, but all edges connected to that node.

Note that for r = 0, no reduction takes place. Create a function called `reduce_graph(G, r)`

to do this reduction, returning the reduced graph GR.

Below is an example of a directed graph G that consists of four strongly connected components. When r = 1, strongly connected components consisting of a single node get removed. Since the smallest SCC contains two nodes, no nodes are removed in this case. When r = 2, nodes 6 and 7, which form one SCC of size two, get removed. Removing these two nodes also removes a total of four edges.

The contraction creates, from graph GR, a new graph, GRC, in which each node corresponds to a strongly connected set of nodes in the original graph GR.

Observe the following graph properties:

- Every node in the graph GR is in exactly one strongly connected component.
- Consider two strongly connected components C1 and C2. There can be edges from nodes in C1 to nodes in C2. However, there cannot be an edge from a node in C1 to one in C2
**and**an edge from a node in C2 to a node in C1. If there were, these two components would be one SCC

The contracted graph GRC is generated from GR by representing every strongly connected component in GR as a single node in GRC. Hence, this new graph GRC has as many nodes as GR has SCCs. GRC is a directed graph. We add an edge from node s1 to node s2 in GRC, if and only if in GR there exists an edge from a node in the strongly connected component s1 to a node in strongly connected component s2.

Consider the example graph G, above. For r = 1, no nodes and edges are deleted by the reduction step, so G = GR. Graph G has four strongly connected components, so the contraction step returns this list of lists, called here SCC:

SCC = [[16, 12, 15, 11, 14, 13], [10, 8, 9], [6, 7], [2, 1, 4, 3, 5]]

You create graph GRC (using NetworkX) based on the list SCC. In this example, SCC is a list of four elements and the resulting graph GRC consists of four nodes as shown below. Graph GRC shows the interaction between strongly connected components.

Printing the list of edges that make up a graph or printing its adjacency matrix gives very little insight into the structure of the graph, even when the graph has only a few nodes. It is, thus, important to visualize graphs by drawing them according to some criteria. Criteria include the physical area needed to draw the graph, how edges are drawn between nodes (straight lines or bent), the number of edge crossings, separating nodes and edges so they can be distinguished visually, and showing properties like symmetry and distance. Drawing graphs is a challenging data visualization problem. There is no single “best” way to draw graphs: what constitutes a good drawing depends on the features one chooses to emphasize.

For this project, you have two tools with which to visual your graphs: NetworkX drawing routines and the Cytoscape application. There exist other visualization tools to display directed graphs in more attractive ways, however, we will only have you use these two. If you are interested in using other graph visualization software, please contact Sagar Mittal or John Valko.

NetworkX contains a number of drawing and visualization modules that you can call from within a Python program. They provide good visualization for smaller graphs, but not for the graphs that result when using the large Gavin data sets. Graphs can be shown in a Matplotlib window and can be saved in various formats. Here is an example:

# Be careful about doing an import * when using these libraries. # They both have draw functions! import networkx import pylab G_lad = networkx.circular_ladder_graph(10) networkx.draw(G_lad) pylab.show()

Unfortunately, not only does NetworkX not handle large graphs well, it also does not draw directed edges nicely.

Cytoscape is a state-of-the art software tool for visualizing graphs. The tool is designed for bioinformatics researchers to visualize molecular interaction networks and to integrate these networks with gene expression profiles. Cytoscape allows effective visualization of large graphs. It is interactive and gives a user control over layout specifics. Cytoscape is easy to download and to install.

A brief tutorial is available on the basics of how to import and visualize graphs. It is not meant to replace the Cytoscape on-line tutorial. Cytoscape reads a graph from a file in the same format described above: each line in the file corresponds to an edge in the graph, in the form of ”`src dest`

” where `src`

is the source node and `dest`

is the destination node. Function `export_edge_list()`

given in `provided_code.py`

creates a file that can be input to Cytoscape.

After you generate graphs GR and GRC, you should visualize them. For graph GR, assign colors to nodes such that all nodes belonging to the same strongly connected component have the same color. This coloring will allow you to easily see the SCCs and help you insure that your graph is correct. Graphing GRC will allow you to verify that the strongly connected components in GR are all correctly represented in GRC.

As you learned in Part 1, clustering in graphs is the process of grouping nodes into sets, called clusters, so that nodes grouped into the same set are “similar” in some sense and nodes belonging to different sets are “dissimilar”. You also saw that each node belongs to exactly one cluster. The definition of “similar” is often application-dependent, but a common goal is to separate sparsely connected subgraphs from dense ones. This separation can be done by identifying edges that disconnect the graph or by performing random walks in the graph to identify more densely connected subgraphs.

You will be given code for a very effective and relatively simple clustering algorithm. The clustering algorithm operates on an undirected graph. For this project, converting the graphs from directed to undirected is not a problem: The directed edges in the original graph G represent artifacts of the experiment that do not have a relevant biological meaning that the clustering algorithm can make use of anyway. We'll call the graph that results from converting the edges in GR from directed to undirected edges, GRU. Note that GRU is a simple graph and two edges of opposite direction between two nodes represent one undirected edge.

The clustering algorithm used is known as the Markov Cluster (MCL) algorithm. It is based on making random walks in the graph to identify dense subgraphs. This algorithm is function `mcl_clustering(G)`

in `provided_code.py`

(do not change the code of `mcl_clustering(G)`

). Those interested in reading how it works can find information at http://micans.org/mcl. As before, function `mcl_clustering`

returns a list of lists, with each sublist representing a cluster made up of the given nodes.

Consider again the example graph used earlier. When the input graph is reduced with r = 1, no nodes are removed. Graph GRU is shown below and the colors indicate the four clusters found. Not surprisingly, for this example, the clusters correspond to the strongly connected components in the directed version of the graph.

After a clustering of the nodes in graph GRU has been found, you will evaluate the quality of the clustering. For synthetic data sets, you will only evaluate clusters using graph-based evaluation metrics. For the biological data sets, however, you will perform additional processing. Evaluating clusters in protein interaction graphs typically has two goals. One goal is to determine functional modules, that is, find groups of interacting proteins that participate in a same or similar biological function. A second goal is to find novel protein complexes that have not been observed before.

Given a graph and a clustering on it (represented as a list of lists), the edges of the graph can be classified into “good” edges and “bad” edges. A *good edge* is an edge between two nodes in the same cluster. A *bad edge* is an edge between two nodes in different clusters. Two standard measures of the quality of a clustering are the *coverage* and *performance*.

Consider the cluster in our example graph that consists of nodes 1, 2, 3, 4, and 5. It contains seven good edges. Node 4 is connected to a bad edge (edge (4,6)).

For a clustering C determined for graph GRU, let `total_good`

be the number of good edges in all the clusters in C and `total_bad`

be the number of bad edges. The *coverage* of clustering C is defined as

coverage = total_good / (total_good + total_bad)

Observe that the quantity `total_good + total_bad`

corresponds to the total number of edges in the graph. Intuitively, the larger the value of `coverage`

, the better the quality of a clustering.

The second metric we consider is called the *performance* of a clustering C. In some sense, performance is a measure of the “correctly interpreted nodes.” Let `missing_bad`

edges be the number of edges between nodes in different clusters **not** in GRU. Let `n`

be the number of nodes in GRU. Then,

performance = (total_good + missing_bad) / (n*(n-1)/2)

Note that an undirected, simple graph consisting of n nodes can have at most n*(n-1)/2 edges. The example clustering has a coverage of 0.8696 and a performance of 0.9 (shown in the plot below).

Write a function `coverage(G, C)`

to determine the coverage and a function `performance(G, C)`

to compute the performance.

In this last step, you are to evaluate the quality of the clustering generated by using biological information obtained from the Gene Ontology (GO). GO is a publicly available set of structured vocabularies that describe various aspects of what is known about molecules in a cell. See GO and the class notes.

For the two biological data sets, we have mapped the proteins to GO terms and computed GO-based matrices of pairwise similarities between these proteins (using the cellular compartment vocabulary of GO). This work was done only for nodes (proteins) present in the reduced graph GR. Hence, if you run this step with biological data sets with a smaller reduction value r, you will not find all needed entries in the GO matrices. The similarity values generated from GO will be used to evaluate the quality of the clusters.

For each of the gavin.txt and ito.txt data sets, you first need to generate a weighted graph. The weight between two nodes represents the similarity measure from GO. The weighted graph has the same nodes as graph GRU, but it is a complete graph. That is, every possible edge exists in the graph. Each edge weight is a score between 0 and 1 (this is the Lin score in the class notes). Function `import_matrix`

in `provided_code.py`

reads a matrix and generates the corresponding weighted, complete graph. Invoke it to generate the graph `w_complete_CC`

. The two matrices you will need (one for each biological data set) are available from graphs and matrices.

You are to complete the following tasks, which operate on the weighted graph `w_complete_CC`

and a clustering C for graph GRU. We again refer to good and bad edges in GRU with respect to the clustering generated. The weight of a good or bad edge is the weight of this edge in the complete, weighted graph generated from the matrix. (Note that good edges are called within-cluster edges and bad edges are called between-cluster edges in the class notes.)

- Use function
`multi_boxplot(point_sets, colors = None)`

given in`provided_code.py`

to generate a figure showing a boxplot for every cluster in C. For the discussion \ of a boxplot see the March 31 class notes (also http://en.wikipedia.org/wiki/Box_plot). The figure generated by using`multi_boxplot`

will be similar to that shown on slide 27. The function is given a list of lists (point_sets) where one element in the list contains the edge weights of the good edges in a cluster. Note that the code we give does not use the built-in boxplot feature of pylab (it does not support drawing multiple boxplots). Make sure to test the use of function`multi_boxplot`

separately with your own test data.

- Generate the information shown in the two boxplots shown on slide 25. The left boxplot uses the weights of all good edges in the graph and the right boxplot uses the weights of all the bad edges in the graph. You do not need to show the distribution of values using a boxplot; a histogram can be used as well.

A good graph clustering will result in clusters of proteins with high values of similarity metrics within a cluster (which means values based on the good edges should be large) and low values between the clusters (which means the values based on bad edges should be small).

Read a graph from a file, prompt the user to enter an r value, and generate G by using function `import_edge_list`

from `provided_code.py`

. To test your code, you can also use graph generators to produce graphs. Then, write the code necessary to perform the following tasks:

- Generate the reduced graph GR (write function
`reduce(G,r)`

, which returns GR) - Generate the contracted graph GRC from GR (write function
`make_contracted_graph(G)`

, which returns GRC) - Visualize all graphs in Cytoscape or NetworkX
^{1)}. For the Gavin and Ito datasets, visualize G, GR, and GRC in Cytoscape. Visualize the strongly connected components in graph GR and the connected components in graph GRU with different colored nodes. Save the visualizations for Gavin and Ito in files with appropriate names and refer to them in your writeup.

Run your code on graphs we provide after testing it using smaller graphs you create. For the graphs in files synthetic2.txt, gavin02.txt, and ito.txt generate a report on the size changes in terms of the number of nodes and edges in G, GR, and GRC. Include a discussion on strongly connected components of GR.

Submit a file called `my_graph_operations.py`

along with a folder containing your discussion and graph visualization files. Make sure this material is organized well (e.g., one subfolder per graph) and files of visuals have names that clearly identify what they represent. Here is a template on how the operations may be used to help you test your code my_graph_operations.py.

For all graphs considered in Part 1, use function `mcl_clustering`

to determine clustering of the nodes in graph GRU.
Write a function `coverage(G, C)`

to compute the coverage and a function `performance(G, C)`

to compute the performance of a clustering. Visualize coverage and performance for all graphs in a single plot with appropriate labels. For each biological data set, in addition, use the corresponding matrix generated from the Gene Ontology to evaluate the clusters. Use function `import_matrix`

to create each graph. Generate the two figures based on boxplots as described above.

Submit all the code written performing the above tasks. In addition, hand in two images showing the clustering (use Cytoscape, show clusters by coloring the nodes), one for each biological data set. For these two data sets also include discussion on whether the clusters found suggest meaningful protein complexes based on the metric used to evaluate them. You may also want to consider the sizes of the clusters in your answer.

Summarizing what you are expected to do for a biological data set:

- Read the graph and generate G, generate the reduced graph GR, generate the undirected version GRU. This was included in part 1.
- Perform mcl_clustering on GRU, compute coverage and performance of the clustering and show both quantities in a plot. Visualize the clustering in Cytoscape and save an image.
- Generate the weighted, complete graph w_complete_CC and produce the multi boxplot and the visualization of good and bad edges (described in detail further up). Prepare a description on the qualities of the clustering with respect to know protein interactions.

Write a revised version of `multi_boxplot`

that shows the
information about the weights of the good edges in clusters in an easier to understand way. Possible approaches you may want to consider are arranging the clusters in the figure by their median value or by the number of nodes in a cluster. It may be desirable to show the information about the quality of the clusters in more than one figure.