**Posted:** Monday, April 6, 2009

**Due date:** Friday, April 17, 10pm

Most cellular processes are carried out by complexes of multiple physically interacting proteins. Identifying and analyzing the components of these complexes provides insights into the functional mechanisms in cells. Modern experimental technologies, such as yeast-to-hybrid and co-precipitation combined with mass spectrometry, allow us to investigate the components of the complexes on a large scale. However, data from these technologies are inherently imperfect and tend to include false positive and false negative observations. The large-scale and noisy nature of these experiments makes computational analysis crucial for interpreting the results.

In this project, you will analyze data sets from large-scale experiments characterizing protein-protein interactions in S. cerevisiae (baker’s yeast), a model system relevant to human biology. This research is described in Anne-Claude Gavin et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes”, Nature, January 2002 (available also at http://www.nature.com/nature/journal/v415/n6868/full/415141a.html). You are not expected to read the paper, the reference is given for information only. Using data sets produced by Gavin's research group, you will generate graphs and use graph algorithms and graph clustering to predict the quality of the experimentally observed protein complexes.

In this project you will perform graph manipulations and use the Python NetworkX library to solve problems on the graphs. You will use the Cytoscape to visualize large graphs. You will be working with synthetic data as well as real-life data generated by bioinformatics researchers. You will interpret results of graph clustering using standard metrics as well as biological knowledge.

The first part of this project focuses on performing various operations on graphs, starting with an initial graph that you read from a file. As you generate each new graph, do not destroy the previous graph on which it is based. All operations make use of the NetworkX library and graph visualization tools. See Lab 11 for graph operations that may be useful (note that you can use any graph operation you find useful). bio_provided_sp09.py is the provided file.

The graphs you manipulate are either read from a file or are generated by one of the NetworkX graph generators. The input graphs are undirected graphs, their edges have no weights, but nodes have names. You will be given a number of input files to process. Function `import_edge_list(file_name)`

opens file `file_name`

. The function returns a NetworkX object that represents this graph.

Graphs are represented in files as a series of lines, with each line representing one edge in the graph. An edge is given by two entries representing the node names (there is a space between two node names). For the files that contain biological data, entries are of the form `EBI-4766 EBI-6256`

, indicating an edge from node `EBI-4766`

to node `EBI-6256`

. The Gavin data sets have thousands of entries and they should only be used once your code has been tested.

NetworkX allows you to generate graphs (use methods in the package `generators`

). These graphs may be useful when testing various parts of your code.

After you read (or generate) the initial graph G, find the connected components of the graph.
An undirected graph G is *connected* if for any two of its nodes a and b, there exists a path from node a to node b (which also implies a path from node b to a). Intuitively, in a connected graph one can travel along the edges from any node to any other node.

If a graph is not connected, we are interested in identifying the largest subgraphs that are connected. We say two nodes a and b are in the same connected component if and only if there exists a path from a to b. NetworkX contains a number of algorithms to determine the connected components of a graph (see submodule `networkx.component`

). Thus, you do not need to write code and can use methods from the NetworkX library. The library functions return a list of connected components. Each connected component is itself a list that contains the names of the nodes (`EBI-4766`

, `EBI-6256`

, etc.) that are connected. Below is an example of an undirected graph G consisting of 20 nodes and four connected components.

The connected component algorithm will be applied to a given graph to determine the connected component having largest size in the graph. In the example, the connected subgraph containing node 11 represent the largest component (it has size 10). If there is more than one connected component of maximum size, the algorithm should not continue. In the data sets you will be given, having two connected components of the same size indicates a serious problem in the data sets.

Printing the list of edges that make up a graph or printing its adjacency matrix gives very little insight into the structure of the graph, even when the graph has only a few nodes. It is, thus, important to visualize graphs by drawing them according to some criteria. Criteria include the physical area needed to draw the graph, how edges are drawn between nodes (straight lines or bent), the number of edge crossings, separating nodes and edges so they can be distinguished visually, and showing properties like symmetry and distance. Drawing graphs is a challenging data visualization problem. There is no single “best” way to draw graphs: what constitutes a good drawing depends on the features one chooses to emphasize.

For this project, you have two tools with which to visual your graphs: NetworkX drawing routines and the Cytoscape application. NetworkX provides only very basic drawing support and drawing larger graphs (more than 40 nodes) can be very slow. We suggest using NetworkX only testing purposes and not for the biological data sets.

NetworkX contains a number of drawing and visualization modules that you can call from within a Python program. They provide good visualization for smaller graphs, but not for the graphs that result when using the large Gavin data sets. Graphs can be shown in a Matplotlib window and can be saved in various formats. Here is an example:

# Be careful about doing an import * when using these libraries. # They both have draw functions! import networkx import pylab G_lad = networkx.circular_ladder_graph(10) networkx.draw(G_lad) pylab.show()

Unfortunately, not only does NetworkX not handle large graphs well, it also does not draw directed edges nicely.

Cytoscape is a state-of-the art software tool for visualizing graphs. The tool is designed for bioinformatics researchers to visualize molecular interaction networks and to integrate these networks with gene expression profiles. Cytoscape allows effective visualization of large graphs. It is interactive and gives a user control over layout specifics. Cytoscape is easy to download and to install. Use Cytoscape to visualize larger graphs (over 30 nodes). Assign colors to nodes such that all nodes belonging to the largest connected component have the same color. Cytoscape also allows you to visualize clusters found in large graphs.

A brief tutorial is available on the basics of how to import and visualize graphs. It is not meant to replace the Cytoscape on-line tutorial. Cytoscape reads a graph from a file in the same format described above: each line in the file corresponds to an edge in the graph, in the form of “`src dest`

” where `src`

is the source node and `dest`

is the destination node. Function `export_edge_list()`

given in the provided file creates a file that can be input to Cytoscape.

Clustering in graphs is the process of grouping nodes into sets, called clusters, so that nodes grouped into the same set are “similar” in some sense and nodes belonging to different sets are “dissimilar”. A node belongs to exactly one cluster. The definition of “similar” is often application-dependent, but a common goal is to separate sparsely connected subgraphs from dense ones. This separation can be done by identifying edges that disconnect the graph or by performing random walks in the graph to identify more densely connected subgraphs.

You will be given code for a very effective and relatively simple clustering algorithm. The clustering algorithm operates on an undirected graph. The clustering algorithm used is known as the Markov Cluster (MCL) algorithm. It is based on making random walks in the graph to identify dense subgraphs. This algorithm is function `mcl_clustering(G)`

in `provided_code.py`

(do not change the code of `mcl_clustering(G)`

). Those interested in reading how it works can find information at http://micans.org/mcl. Function `mcl_clustering`

returns a list of lists, with each sublist representing a cluster made up of the given nodes.

For any given input graph G, first find the connected components of graph G. Let GC be the subgroup of G corresponding to the largest connected component. The MCL clustering algorithm will be applied to graph GC. All other connected components of G will be ignored.

Applying MCL clustering to graph GC generates a clustering CL represented as a list; each element of the list corresponds to one cluster. A cluster can contain a single node. Applying the MCL algorithm to the largest connected component of the above example generates two clusters, one consisting of six nodes and one consisting of four nodes. These two clusters are shown below.

Given a clustering CL, the edges of the graph GC can be classified into “good” edges and “bad” edges. A *good edge* is an edge between two nodes in the same cluster. A *bad edge* is an edge between two nodes in different clusters.

Two standard measures of the quality of a clustering are the *coverage* and *performance*.
For a clustering CL determined for graph GC, let `total_good`

be the number of good edges in all the clusters in CL and `total_bad`

be the number of bad edges. The *coverage* of clustering CL is defined as

coverage = total_good / (total_good + total_bad)

Observe that the quantity `total_good + total_bad`

corresponds to the total number of edges in the graph. Intuitively, the larger the value of `coverage`

, the better the quality of a clustering. Consider the cluster in our example graph: it contains 16 good edges and 2 bad edges. Hence, the coverage of the clustering is 16/18 = 0.89.

The second metric we consider is called the *performance* of a clustering CL. In some sense, performance is a measure of the “correctly interpreted nodes.” Let `missing_bad`

edges be the number of possible edges between nodes in different clusters **that do not exist in GC**. The larger the number of missing_bad edges, the better the clustering. Let `n`

be the number of nodes in GC.

performance = (total_good + missing_bad) / (n*(n-1)/2)

Note that an undirected, simple graph consisting of n nodes can have at most n*(n-1)/2 edges, since in a simple graph, any pair of nodes can have at most one edge between them and no node may have an edge to itself. For our example graph we have `missing_bad`

= 22 and we get performance = (16+22)/45 = 0.84. There are a total of 24 potential edges between clusters in GC (each node in the cluster of six can connect to four different nodes in the other cluster), but only 2 of these edges are actually in GC; thus, we derive `missing_bad = 24 - 2 = 22`

. The plot below shows one possible way of illustrating coverage and performance.

The first task is to read a graph G from a file. To test your code, use graph generators to produce graphs or generate your own graphs. Write code performing the following:

- Generate an analysis of graph G with respect to its connected components. This should include
- the number of nodes and the number of edges in G
- the number of connected components of G and the size of the largest connected component
- if G contains more than one connected component, generate a plot showing the sizes of the connected components ignored in later processing steps. This can be a histogram on the number of nodes in these components, or another presentation of this information you find illustrative.

- Generate graph GC representing the largest connected component of G
- Visualize graphs G and GC in Cytoscape or NetworkX. Be sure to use Cytoscape for larger graphs. Visualize the largest connected component in graph G with differently colored nodes. Save the visualizations for Gavin and Ito (given below) in files with appropriate names and refer to them in your writeup.
- Use function
`mcl_clustering`

to determine a clustering of the nodes in graph GC. Generate an analysis of the clusters with respect to the number of clusters and the number of nodes in each cluster. - Write a function
`coverage(G, C)`

to compute the coverage and a function`performance(G, C)`

to compute the performance of a clustering C. Visualize coverage and performance in a single plot with appropriate labels.

Run your code on graphs we provide after testing it using smaller graphs you create.

Submit a file called `my_graph_operations.py`

along with a folder containing your graph visualization files. Make sure this material is organized well (e.g., one subfolder per graph) and files of visuals have names that clearly identify what they represent. As a reminder, see Lab 11 for examples of graph operations and make use of the file bio_provided_sp09.py.