CS 190C LAB 13: Friday, April 17, 2009

Python Dictionaries Review

Recall that dictionaries are Python are used to associate data with a key that can be of any time (as opposed to a list, where a key is always an integer). In this lab, you will use a dictionary that has a string as a key and an integer as the associated data. In the lab, whenever you come across a word, you want to add one to the associated number. The problem is, the word is not always in the dictionary already (which means it has occurred 0 times). We can use the get method to give a default value if the key does not exist in the dictionary. For example, if we wanted to multiply value with the key “foo” by 2, it would be done like this:

words['foo'] = words['foo'] * 2

However, in the event that foo is not in the dictionary, we end up with this:

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    words['foo'] = words['foo'] + 1
KeyError: 'foo'

This is because on the right hand side of the assignment, we try to retrieve the value associated with the key foo, which does not exist (note that the left hand side doesn't cause problems). To solve this, we can use the get method, which accepts an argument for the key, as well as a default value to return if the key doesn't exist. So if we now do this:

>>> words['foo'] = words.get('foo', 1) * 2
>>> print words['foo']

The operation then works as expected. Since foo does not exist in the dictionary, the get method returns 1, which is multiplied by 2 and stored in words with a key of foo. You can use this method for both creating and querying the dictionary, but it's not the only way. See class notes if you wish to do it another way, or need more reference for using get.

In-Lab Problem: Searching and Computing Word Frequencies

In this lab you will investigate the performance of three different implementations searching for words in files and determining their frequency in a given data set. You will be reading data (on which to execute the queries) from a file and the following describes what is done with one data set.

Before you read any of the data to query, a query list is generated for you. This is done by reading the entries in file query_file.txt which contains words, one per line. Each word appears in lower case. The words are read and stored in the list query_list.

Having creating list query_list, you are ready to read a document file. A line in a document file contains words, separated by blanks. All words are lower case and punctuation has been removed. The file is read line by line, and used to generate the following three representations:

  1. doc_list stores the words in the same order as they appear in the file. This part is done for you.
  2. doc_list_sorted should store the words in doc_list, only in sorted order (increasing). This part is done for you.
  3. doc_dict is a dictionary that uses the words from doc_list as the keys and the number of occurrences of each word as the value related to the key.

Record the time it takes to create each structure (again, the first two are done for you). For that reason you are not creating the three structures simultaneously (which would be more efficient), but one after the other. This means opening and closing the document file three times. The time taken for creating each structure should be stored in the time_list, time_list_sorted, and time_dict lists. These lists are already created for you, so you just need to call the appropriate function to append the times to the lists.

Next, consider one of the structures. For each, you should:

  • set start = clock()
  • iterate over query_list and for each entry determine the number of times the entry occurs in the document list
    • For doc_list, the count method is used.
    • For doc_list_sorted, the provided sorted_count method is used.
    • For doc_dict, the number of occurrences is already provided from when you generated the structure, so you just need to reference this data.
  • since you are only interested in the running time, no action is taken with the generated results
  • set end = clock() and use end - start to record the time
    • Store the resulting time in the appropriate list (it will end in _query). Again, these lists are created for you.

After having completed all searches for one file, you have two running times for each of the three structures. As stated above, each value is appended to a different list, which is used for plotting later on. This structure generation and timing should be done for all files provided below. The final step is plotting the collected running times. You should use one line style (e.g., dashed) for plotting lines that represent times to create structures and another (e.g., solid) for plotting lines that represent query times. The graph will look something like this:

You should notice that the query time for an unsorted list by far dominates the time for every other recorded time. To get an idea of how the other times compared to each other, remove the query time for the unsorted list from the plot and generate a new plot or use the zoom tool to focus on the other lines. An example of the second graph is shown below.. You should submit both plots with your code. You should also write a comment in your code that explains why the graph is as you observe. That is, explain why some structure is faster than another structure for querying but slower for creation, etc.

Provide data for testing: query file with 10 words, document file with 20 words.

Use lab13.py as a starting point for your code. Here is the query list and three files you should test on:

If you wish to test with a smaller set:


import pylab
import sys
from time import time, clock
from bisect import bisect_right, bisect_left
if not sys.platform == 'win32':
    clock = time
def sorted_count(doc_list_sorted, x):
    Given a sorted list and an element x, returns the number of occurences of
    x in the sorted list.
    return bisect_right(doc_list_sorted, x) - bisect_left(doc_list_sorted, x)
if __name__ == '__main__':
    files = ['the_prince.txt', 'frankenstein.txt', 'war_and_peace.txt']
    querylist = 'querylist.txt'
    # If you want a shorter set of data to test, comment out the above two lines
    # and uncomment the below two. Depending on your graph style, you may need
    # to change them, since matplotlib will not draw a line with a single point.
    #files = ['shortfile.txt']
    #querylist = 'shortquery.txt'
    query_list = []
    time_list = []
    time_list_query = []
    time_list_sorted = []
    time_list_sorted_query = []
    time_dict = []
    time_dict_query = []
    x_axis = []
    f = open(querylist, 'r')
    for line in f:
    for file in files:
        # The below code should require no modification. It creates the unsorted
        # list and sorted list and performs the query for both.
        start = clock()
        f = open(file, 'r')
        doc_list = []
        for line in f:
            words = line.split()
            for word in words:
        end = clock()
        time_list.append(end - start)
        start = clock()
        for query in query_list:
        end = clock()
        time_list_query.append(end - start)
        start = clock()
        f = open(file, 'r')
        doc_list_sorted = []
        for line in f:
            words = line.split()
            for word in words:
        end = clock()
        time_list_sorted.append(end - start)
        start = clock()
        for query in query_list:
            sorted_count(doc_list_sorted, query)
        end = clock()
        time_list_sorted_query.append(end - start)
        # Put your code for creating and querying the dictionary here. See the
        # lab spec for how to deal with creating and querying the dictionary
        # structure, if you are having problems with it. You should re-open
        # the file and use the pattern above for creating the dictionary. You
        # should also record the number of words that you come across and store
        # it in the x_axis list.
        start = clock()
        doc_dict = {}
        f = open(file, 'r')
        for line in f:
            words = line.split()
            for word in words:
                doc_dict[word] = doc_dict.get(word, 0) + 1
        end = clock()
        time_dict.append(end - start)
        start = clock()
        for query in query_list:
            doc_dict.get(query, 0)
        end = clock()
        time_dict_query.append(end - start)
    # Put your explanation here as to why the times are as observed. Also put
    # Your graph drawing here. Ideally you should include a legend to show
    # what lines are what timings.
    pylab.plot(x_axis, time_list, 'r--', label='doc creation')
    pylab.plot(x_axis, time_list_sorted, 'g--', label='sorted creation')
    pylab.plot(x_axis, time_dict, 'b--', label='dict creation')
    pylab.plot(x_axis, time_list_query, 'r-', label='doc query')
    pylab.plot(x_axis, time_list_sorted_query, 'g-', label='sorted query')
    pylab.plot(x_axis, time_dict_query, 'b-', label='dict query')
cs190c/lab13_09.txt · Last modified: 2009/04/29 23:02 by tang
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki