Strict Standards: Declaration of Doku_Renderer_metadata::table_open() should be compatible with Doku_Renderer::table_open($maxcols = NULL, $numrows = NULL, $pos = NULL) in /home2/cp-wiki/htdocs/inc/parser/metadata.php on line 480

Strict Standards: Declaration of Doku_Renderer_metadata::table_close() should be compatible with Doku_Renderer::table_close($pos = NULL) in /home2/cp-wiki/htdocs/inc/parser/metadata.php on line 480

CS 190C, Spring 2008: Problem Set 4

Posted: Monday, April 14, 2008

Due: Wednesday, April 23, 2008, 10pm

Introduction

In Problem 2 of Problem Set 3, you wrote three functions that implemented queries on points. The function scanned a list of points stored in arbitrary order. This problem set revisits spatial queries for 2-dimensional data sets, but uses a more efficient technique for storing and accessing the data. We consider three queries–find_point, in_rect, and closest–along with two data structures for their implementation–lists and kd-trees. kd-trees are implemented using a binary tree structure and support efficient spatial query processing. They are commonly used in real applications. While kd-trees can be used for data in any dimension, we will consider only two dimensions. You are given class definitions for each of the two data structures. The goals are to:

  • complete the methods implementing the queries
  • compare the performance of kd-trees and list implementation
  • work through and use given classes and methods

Class Definitions

The queries you will run on a set of points include in_rect and closest from Problem Set 3 and a new query, find_point. In Problem Set 3 you implemented queries in_rect and closest using functions on lists of points. Now, they are implemented as methods in a class named PointList. File pointlist.py contains the class definitions. You have to write method find_point.

See the lecture notes from April 14 for details on kd-trees. Recall that a kd-tree is a binary tree-based data structure for efficiently managing a set of points. A node object, KDTreeNode, stores each point of the data set. When the node is at an even level in the tree, a vertical cut is made. This case means that the x-coordinates of all nodes in the left subtree are smaller than all x-coordinates of the nodes in the right subtree. If the node is at an odd level in the tree, it means that a horizontal cut is made, which splits nodes on their y-coordinates.

The class definition for the kd-tree implementation is in file kdtree.py. The file contains two classes: KDTree and KDTreeNode. It is important that you read and understand the code in this file before you start adding your methods. You have to write methods find_point, in_rect, and visualize.

Queries and Classes

You implement are two implement two versions of each method, one for class PointList and one for class KDTreeNode:

  • find_point(self, p): Returns true if point p is in the set and false otherwise.
  • in_rect(self, lower, upper): Returns a list of points that lie inside the rectangle created by points lower and upper
  • closest(self, p): Returns a point q from the set such that the Euclidean distance between p and q is a minimum over all distances between q and other points.

Here are some examples of using these classes and methods. To create an instance of a KDTree called tree and an instance of a PointList called plist, both on a list points, write

tree = KDTree(points)
plist = PointList(points) 

Methods on tree and plist are written as

tree.find_point(p)
tree.in_rect(l,u)
tree.closest(p)
plist.find_point(p)
plist.in_rect(l,u)
plist.closest(p)

The init method for class KDTree converts the points into VPython vectors, which allows a more convenient use of operations on points. The root of the tree is an instance of class KDTreeNode. The init method of KDTreeNode builds the tree in a recursive way as described in class.

You will notice that methods find_point, in_rect, and closest in class KDTree generate a vector and invoke a method on the root, which is an instance of KDTreeNode. The actual work of the three queries is done in the corresponding methods in class KDTreeNode.

Generating Point Sets

For the initial testing of your methods, create small point sets and your own simple test queries. You can use the function random_points provided for Problem Set 3 to generate point sets from a uniform distribution. Files testpoints.txt and testsession.txt contain points and queries you should also use to test your code (note that function read_points was given in Problem Set 3).

To determine the performance of the queries on lists versus kd-trees, use function generate_points(num_clusts, points_per_clust) given in file bench.py. Function generate_points generates num_clusts * points_per_clust points, with the centers of each cluster uniformly distributed and points around the center of each cluster distributed according to a Gaussian distribution.

Visualization of the kd-tree

The code you are given contains two visualization routines that were used in class to explain the queries:

  • showpoints(points, style='bo') in file bench.py plots the points with the given style. If there are fewer than 20 points, it also adds text labels with rounded coordinates. Use showpoints when testing your code.
  • closest_vis(self, p, upper_dist, lower, upper) in file kdtree.py performs a visualization of method closest. Its structure is similar to that of closest. You do not need to use it for the problem set, but may find it helpful in understanding query closest as described in class.

You are to complete method visualize(self, lower, upper) in class KDTreeNode. It is given a bounding rectangle defined by two points, lower and upper, and draws the partition induced by this node it is invoked on, then makes a recursive call to draw the partitions of its decedents. Note that the initial rectangle is set by method visualize in class KDTree.

Performance Comparison

You will compare the performance of the data structures by using two point sets, both generated by function generate_points(num_clusts, points_per_clust) in file bench.py. Data set 1 should be generated with num_clust = 200 and points_per_clust = 300. Data set 2 should be generated with num_clust = 400 and points_per_clust = 75. For each data set, run n = 200 iterations. One iteration consists of generating three points, p, lo, and up, and using them to invoke the three queries on the list and then on the kd-tree. A framework for the performance comparison is given in functions benchmark and test in file bench.py.

Note that after building the list and the kd-tree, function test makes n iterations, with each iteration generating three points used in six function calls. The times returned from the benchmark are accumulated in two arrays, each of size 3. You should augment function test so that it not only outputs the sums, but generates a more interesting analysis of the running times of the three queries on the two data structures. For example, compute the median and standard deviation for each query type, show the performance of each of the n times obtained for one query in a way that allows an effective comparison. Your visualization can use any of the approaches used and seen throughout the semester.

Submission Instructions

Submit the (revised) files kdtree.py, pointlist.py, and bench.py with your methods find_point, in_rect, and visualize completed and your changes to bench.py. Include plots showing the results of your performance comparison (one for each data set) and include a brief discussion. Include an output generated by visualize on a smaller data sets (e.g., generate_points(50,10)).

The provided files are available here: kdtree.py, pointlist.py, bench.py.

Extra Credit

The performance analysis described above ignores the cost of building the kd-tree. While this cost can be ignored in many applications, it is interesting to analyze the performance when this cost is to be included. For very few queries, the advantages of a kd-tree are likely to be lost when the cost of building the tree is added. Using the two data sets from the performance comparison study, determine the number of the iterations needed (i.e., n in test(n)) so that the kd-tree implementation is superior to the list when the cost of building the tree is included.

Note: The cost of building the kd-tree in the code given is dominated by the sorting done to determine the median coordinate before partitioning. There exist ways to build a kd-tree more efficiently by avoiding repeated sorts. We chose a simple way to build the tree and it is not the most efficient one.

 
cs190c/problemset4.txt · Last modified: 2008/04/14 08:55 by seh
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki