Posted: Monday, April 14, 2008

Due: Wednesday, April 23, 2008, 10pm

In Problem 2 of Problem Set 3, you wrote three functions that implemented queries on points. Each function scanned a list of points stored in arbitrary order. This problem set revisits spatial queries for 2-dimensional data sets, but uses a more efficient technique for storing and accessing the data. We consider three queries (`find_point`, `in_rect`, and `closest`) along with two data structures for their implementation: lists and kd-trees. kd-trees are implemented using a binary tree structure and support efficient spatial query processing. They are commonly used in real applications. While kd-trees can be used for data in any dimension, we will consider only two dimensions. You are given class definitions for each of the two data structures. The goals are to:

- complete the methods implementing the queries
- compare the performance of the kd-tree and list implementations
- work through and use given classes and methods

The queries you will run on a set of points include `in_rect` and `closest` from Problem Set 3 and a new query, `find_point`. In Problem Set 3 you implemented queries `in_rect` and `closest` using functions on lists of points. Now they are implemented as methods in a class named `PointList`. File pointlist.py contains the class definitions. You have to write method `find_point`.

See the lecture notes from April 14 for details on kd-trees. Recall that a kd-tree is a binary tree-based data structure for efficiently managing a set of points. Each point of the data set is stored in a node object, `KDTreeNode`. When a node is at an even level in the tree, a vertical cut is made: the x-coordinates of all points in the left subtree are smaller than the x-coordinates of all points in the right subtree. When a node is at an odd level, a horizontal cut is made, which splits the points on their y-coordinates.

The class definition for the kd-tree implementation is in file kdtree.py. The file contains two classes: `KDTree` and `KDTreeNode`. It is important that you read and understand the code in this file before you start adding your methods. You have to write methods `find_point`, `in_rect`, and `visualize`.

You are to implement two versions of each method, one for class `PointList` and one for class `KDTreeNode`:

- `find_point(self, p)`: Returns `True` if point `p` is in the set and `False` otherwise.
- `in_rect(self, lower, upper)`: Returns a list of the points that lie inside the rectangle defined by points `lower` and `upper`.
- `closest(self, p)`: Returns a point `q` from the set such that the Euclidean distance between `p` and `q` is minimal over all points in the set.
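
For reference, the list-based versions of these queries amount to linear scans. The following is a minimal sketch, assuming points are stored as `(x, y)` tuples; the provided `PointList` in pointlist.py may represent points differently:

```python
import math

# Sketch of the list-based queries; each one scans the whole list,
# so every query is O(n). Not the code in pointlist.py.
class PointList:
    def __init__(self, points):
        self.points = list(points)   # points stored in arbitrary order

    def find_point(self, p):
        # Linear membership scan.
        return p in self.points

    def in_rect(self, lower, upper):
        # Keep the points inside the axis-aligned rectangle [lower, upper].
        return [q for q in self.points
                if lower[0] <= q[0] <= upper[0]
                and lower[1] <= q[1] <= upper[1]]

    def closest(self, p):
        # Minimize Euclidean distance over all stored points.
        return min(self.points,
                   key=lambda q: math.hypot(q[0] - p[0], q[1] - p[1]))
```

The kd-tree versions answer the same three queries but can prune entire subtrees using the cut stored at each node.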

Here are some examples of using these classes and methods. To create an instance of a `KDTree` called `tree` and an instance of a `PointList` called `plist`, both on a list `points`, write

```python
tree = KDTree(points)
plist = PointList(points)
```

Methods on `tree` and `plist` are written as

```python
tree.find_point(p)
tree.in_rect(l, u)
tree.closest(p)
plist.find_point(p)
plist.in_rect(l, u)
plist.closest(p)
```

The init method for class `KDTree` converts the points into VPython vectors, which allows more convenient operations on points. The root of the tree is an instance of class `KDTreeNode`. The init method of `KDTreeNode` builds the tree recursively, as described in class.

You will notice that methods `find_point`, `in_rect`, and `closest` in class `KDTree` generate a vector and invoke a method on the root, which is an instance of `KDTreeNode`. The actual work of the three queries is done in the corresponding methods in class `KDTreeNode`.
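
This delegation pattern can be sketched as follows; the names mirror the provided classes, but the real code uses VPython vectors and builds the tree itself in `__init__`, so treat everything below as illustrative:

```python
# Sketch of the delegation pattern: the KDTree wrapper forwards each
# query to the root node, where the recursive logic lives.
# Illustrative only; not the code in kdtree.py.

class KDTreeNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

    def find_point(self, p):
        if p == self.point:
            return True
        # Follow the side of the cut that could contain p.
        child = self.left if p[self.axis] < self.point[self.axis] else self.right
        return child is not None and child.find_point(p)

class KDTree:
    def __init__(self, root):
        self.root = root             # the real __init__ builds the tree

    def find_point(self, p):
        return self.root.find_point(p)   # delegate to the root node

tree = KDTree(KDTreeNode((7, 2), 0,
                         KDTreeNode((5, 4), 1,
                                    KDTreeNode((2, 3), 0),
                                    KDTreeNode((4, 7), 0)),
                         KDTreeNode((9, 6), 1,
                                    KDDTreeNode((8, 1), 0)
                                    if False else KDTreeNode((8, 1), 0))))
```

Because each comparison discards one side of the cut, a balanced tree answers `find_point` in O(log n) time instead of the list's O(n) scan.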

For the initial testing of your methods, create small point sets and your own simple test queries. You can use the function `random_points` provided for Problem Set 3 to generate point sets from a uniform distribution. Files testpoints.txt and testsession.txt contain points and queries you should also use to test your code (note that function `read_points` was given in Problem Set 3).

To determine the performance of the queries on lists versus kd-trees, use function `generate_points(num_clusts, points_per_clust)` given in file bench.py. Function `generate_points` generates `num_clusts * points_per_clust` points, with the cluster centers uniformly distributed and the points around each cluster center distributed according to a Gaussian distribution.
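
The generation scheme can be sketched as follows; the actual `generate_points` in bench.py may use different coordinate ranges and a different cluster spread, so the numbers here are illustrative:

```python
import random

# Sketch of clustered point generation: uniform cluster centers,
# Gaussian scatter around each center. Not the code in bench.py.
def generate_points(num_clusts, points_per_clust, spread=0.05):
    points = []
    for _ in range(num_clusts):
        # Cluster center drawn uniformly from the unit square.
        cx, cy = random.random(), random.random()
        for _ in range(points_per_clust):
            # Points scattered around the center with Gaussian noise.
            points.append((random.gauss(cx, spread),
                           random.gauss(cy, spread)))
    return points

pts = generate_points(3, 5)      # 3 clusters of 5 points = 15 points
```

Clustered data is a harder case for spatial queries than uniform data, which is why the benchmark uses it.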

The code you are given contains two visualization routines that were used in class to explain the queries:

- `showpoints(points, style='bo')` in file bench.py plots the points with the given style. If there are fewer than 20 points, it also adds text labels with rounded coordinates. Use `showpoints` when testing your code.
- `closest_vis(self, p, upper_dist, lower, upper)` in file kdtree.py visualizes method `closest`. Its structure is similar to that of `closest`. You do not need to use it for the problem set, but you may find it helpful in understanding query `closest` as described in class.

You are to complete method `visualize(self, lower, upper)` in class `KDTreeNode`. It is given a bounding rectangle defined by two points, `lower` and `upper`; it draws the partition induced by the node it is invoked on and then recursively draws the partitions of the node's descendants. Note that the initial rectangle is set by method `visualize` in class `KDTree`.

You will compare the performance of the data structures using two point sets, both generated by function `generate_points(num_clusts, points_per_clust)` in file bench.py. Data set 1 should be generated with `num_clusts = 200` and `points_per_clust = 300`. Data set 2 should be generated with `num_clusts = 400` and `points_per_clust = 75`. For each data set, run `n = 200` iterations. One iteration consists of generating three points, `p`, `lo`, and `up`, and using them to invoke the three queries on the list and then on the kd-tree. A framework for the performance comparison is given in functions `benchmark` and `test` in file bench.py.

Note that after building the list and the kd-tree, function `test` makes `n` iterations, with each iteration generating three points used in six method calls. The times returned from the benchmark are accumulated in two arrays, each of size 3. You should augment function `test` so that it not only outputs the sums, but generates a more interesting analysis of the running times of the three queries on the two data structures. For example, compute the median and standard deviation for each query type, or show the `n` times obtained for one query in a way that allows an effective comparison between the two data structures. Your visualization can use any of the approaches seen throughout the semester.
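
The summary statistics can be computed with the standard `statistics` module, for example; the function name, argument layout, and labels below are illustrative, not the arrays used in bench.py:

```python
import statistics

# Sketch: summarize per-query timings. `times` is assumed to hold one
# list of n measured durations per query type (an assumption, not the
# layout bench.py actually uses).
def summarize(times, labels=("find_point", "in_rect", "closest")):
    rows = []
    for label, t in zip(labels, times):
        rows.append((label,
                     sum(t),                    # total time
                     statistics.median(t),      # robust to outliers
                     statistics.stdev(t)))      # spread across iterations
    return rows

rows = summarize([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [0.5, 1.5, 1.0]])
```

Reporting the median alongside the sum matters here because a few slow iterations (e.g., cache effects or an unlucky query point) can dominate a plain total.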

Submit the (revised) files `kdtree.py`, `pointlist.py`, and `bench.py` with your methods `find_point`, `in_rect`, and `visualize` completed and your changes to `bench.py`. Include plots showing the results of your performance comparison (one for each data set) along with a brief discussion. Also include an output generated by `visualize` on a smaller data set (e.g., `generate_points(50, 10)`).

The provided files are available here: kdtree.py, pointlist.py, bench.py.

The performance analysis described above ignores the cost of building the kd-tree. While this cost can be ignored in many applications, it is interesting to analyze the performance when it is included. For very few queries, the advantages of a kd-tree are likely to be lost once the cost of building the tree is added. Using the two data sets from the performance comparison study, determine the number of iterations (i.e., `n` in `test(n)`) needed so that the kd-tree implementation is superior to the list even when the cost of building the tree is included.
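
One way to frame the question: the list total grows linearly with `n`, while the kd-tree total is a one-time build cost plus a (smaller) linear term, so there is a break-even `n`. A sketch with hypothetical timing numbers (not measurements from the data sets above):

```python
# Sketch of the break-even calculation. All numbers are hypothetical;
# plug in your own measured build and per-iteration times.
def break_even(build_cost, tree_per_iter, list_per_iter):
    # list total:    n * list_per_iter
    # kd-tree total: build_cost + n * tree_per_iter
    if tree_per_iter >= list_per_iter:
        return None            # the kd-tree never catches up
    n = 1
    while build_cost + n * tree_per_iter >= n * list_per_iter:
        n += 1
    return n                   # smallest n where the kd-tree wins

# e.g., 2.0 s build, 0.001 s/iter on the tree, 0.05 s/iter on the list
n = break_even(2.0, 0.001, 0.05)
```

With these illustrative numbers the build cost is amortized after a few dozen iterations; your measured values will determine the actual crossover for each data set.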

Note: The cost of building the kd-tree in the code given is dominated by the sorting done to determine the median coordinate before partitioning. There exist ways to build a kd-tree more efficiently by avoiding repeated sorts; we chose a simple way to build the tree, not the most efficient one.