The dictionary is a very useful data structure in Python. The easiest way to conceptualize a dictionary is that it's like a list, except you don't look up values in a dictionary by their index in a sequence---you look them up using a "key," or a unique identifier for that value. The closest analogue to the dictionary in the non-programming word is a two-column spreadsheet: one column for the things you want to keep track of, and the other column for whatever value you're keeping track of for that thing.
We're going to focus here on learning how to use dictionaries to keep track of the parts of a text that are interest to us. We're also going to omit some of the nitty-gritty details about how dictionaries work internally. Some of what I'm going to tell you will seem weird and magical. Be prepared!
There is a whole category of problems that are difficult to solve without dictionaries. (Or problems that you CAN solve without dictionaries, but that become overly complicated or brittle.) In particular, dictionaries are dynamic, meaning we can add values to them while the program is running, instead of having to pre-define variables.
As an illustration of this, think about a program that reads in text and counts how many times each word in the text occurs. I.e., running the program on 'Two roads diverged in a yellow wood' might return:
Two: 2
one: 3
down: 1
another: 1
roads: 2
fair,: 1
grassy: 1
Though: 1
there: 1
long: 1
way: 1
travelled: 1
... along with a line for all of the other words in the text and their corresponding count. How would you implement this using only strings? You might come up with something like this:
two_count = 0
roads_count = 0
diverged_count = 0
in_count = 0
a_count = 0
yellow_count = 0
wood_count = 0
for line in sys.stdin:
line = line.strip()
words = line.split()
for word in words:
if word == 'two':
two_count = two_count + 1
elif word == 'roads':
roads_count = roads_count + 1
# ... etc etc etc ...
The drawback of this method should be obvious: we would have to pre-program our Python script with a list of all the words in the text! Clearly we need a different kind of data structure, different from a string, that will allow us to set a value for a particular word, without having to define ahead of time which words we want to set values for.
(We'll talk more about this particular example below.)
Dictionaries are written with curly brackets, surrounding a series of comma-separated pairs of keys and values. Here's a very simple dictionary, with one key, Obama, associated with a value, Hawaii:
>>> {'Barack Obama': 'Hawaii'} {'Barack Obama': 'Hawaii'}
Here's another dictionary, with two more entries:
>>> {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} {'Bill Clinton': 'Arkansas', 'George W. Bush': 'Texas', 'Barack Obama': 'Hawaii'}
As you can see, we're building a simple dictionary that associates the names of presidents to the home states of those presidents. The association of a key with a value is sometimes called a mapping. (In fact, in other programming languages like Java, the dictionary data structure is called a "Map.") So, in the above dictionary for example, we might say that the key Bill Clinton
maps to the value Arkansas
.
A dictionary is just like any other Python value. It has a type:
>>> type({'Barack Obama': 'Hawaii'}) <type 'dict'>
And you can assign a dictionary to a variable:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} print type(president_states) <type 'dict'>
Keys and values in dictionaries can be of any data type, not just strings. Here's a dictionary, for example, that maps integers to lists of floating point numbers:
>>> print {17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]} {17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]}
A dictionary can also be empty, containing no key/value pairs at all:
>>> {} {}
The primary operation that we'll perform on dictionaries is writing an expression that evaluates to the value for a particular key. We do that with the same syntax we used to get a value at a particular index from a list. Except with dictionaries, instead of using a number, we use one of the keys that we had specified for the value when making the dictionary. For example, if we wanted to know what Bill Clinton's home state was, or, more precisely, what the value for the key Bill Clinton
is, we would write this expression:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> president_states['Bill Clinton'] 'Arkansas'
If we put a key in those brackets that does not exist in the dictionary, we get an error similar to the one we get when trying to access an element of an array beyond the end of a list:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> print president_states['Benjamin Franklin'] Traceback (most recent call last): File "<console>", line 1, in <module> KeyError: 'Benjamin Franklin'
As you might suspect, the thing you put inside the brackets doesn't have to be a string; it can be any Python expression, as long as it evaluates to something that is a key in the dictionary:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> president = 'Barack Obama' >>> president_states[president] 'Hawaii'
You can get a list of all the keys in a dictionary using the dictionary's .keys() method:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> president_states.keys() ['Bill Clinton', 'George W. Bush', 'Barack Obama']
Once you've assigned a dictionary to a variable, you can add another key/value pair to the dictionary by assigning a value to a new index, like so:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> president_states['Ronald Reagan'] = 'California' >>> print president_states {'Bill Clinton': 'Arkansas', 'George W. Bush': 'Texas', 'Barack Obama': 'Hawaii', 'Ronald Reagan': 'California'}
As you can see in the last line above, you can put print
in front of an expression evaluating to a dictionary to get Python to display a representation of the dictionary's state (i.e., what's in the dictionary).
Here's a list of the methods and other syntax supported by dictionaries in Python. I want to talk about a few of these in particular. First, the in
operator (which we've used previously to check to see if there's a substring in a string, or a particular item in a list), also works with dictionaries! It checks to see if a particular key exists in the dictionary:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> 'Barack Obama' in president_states >>> 'Ross Perot' in president_states True False
A dictionary can also go in a for
loop, in the spot between in
and the colon (where you might normally put a list or sys.stdin
). If you write a for
loop like this, the loop will iterate over each key in the dictionary:
>>> president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'} >>> for president in president_states: ... print president Bill Clinton George W. Bush Barack Obama
You may have noticed something in the previous examples, which is that sometimes the order in which we wrote our key/value pairs in our dictionaries is NOT the same order that those key/value pairs come out as when evaluating the dictionary as an expression or when using the .keys()
methods. That's because dictionaries in Python are unordered. A dictionary consists of a number of key/value pairs, but that's it---Python has no concept of which pairs come "before" or "after" other the pairs in the dictionary.
Here's a more explicit demonstration:
>>> print {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10} {'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'j': 10}
Chances are that when you run that code, you'll get back a different ordering than the ordering you'd originally written. If you quit the Python interactive intepreter and start it again, you might get an ordering back that is different from that.
A better way of phrasing the idea that dictionaries are unordered is to say instead that "two dictionaries are considered the same if they have the same keys mapped to the same values."
Another important fact about dictionaries is that you can't put the same key into one dictionary twice. If you try to write out a dictionary that has the same key used more than once, Python will silently ignore all but one of the key/value pairs. For example:
>>> {'a': 1, 'a': 2, 'a': 3} {'a': 3}
Similarly, if we attempt to set the value for a key that already exists in the dictionary (using =), we won't add a second key/value pair for that key---we'll just overwrite the existing value:
>>> test_dict = {'a': 1, 'b': 2} >>> print test_dict['a'] >>> test_dict['a'] = 100 >>> print test_dict['a'] 1 100
At this point, we can start making the program that counts words in a text. But first, let's imagine how we might accomplish that task by hand, without writing any Python code. You might follow these steps:
Once you've completed that process, you'll have a list of every word in the text, along with the number of times that word occurs in the text! Here's an animation showing how you might do that with a Google Docs spreadsheet and a snippet of a Poe poem:
When we write this computer program in Python, we won't use a spreadsheet to store the words and counts; we're going to use a dictionary instead. Think of the dictionary as a kind of two-column spreadsheet, with the word in the first column and the count in the second column. The word column in this case are the "keys" and the count column are the "values."
import sys count = dict() # create an empty dictionary for line in sys.stdin: line = line.strip() words = line.split() for word in words: if word not in count: # if we haven't seen this word before count[word] = 1 # set its count to 1 else: count[word] = count[word] + 1 # otherwise, increment its count # now, iterate over the keys and print out the key and value for item in count.keys(): print item + ": " + str(count[item])
Run the program and you'll get something like this:
$ python word_count.py <frost.txt all: 1 looked: 1 just: 1 less: 1 both: 2 Had: 1 yellow: 1 leads: 1 not: 1 trodden: 1 Oh,: 1 had: 1 should: 1 better: 1 to: 1 sorry: 1 has: 1 them: 1 hence:: 1 far: 1 wood,: 2 telling: 1 Because: 1 morning: 1 by,: 1 where: 1 difference.: 1 other,: 1 sigh: 1 wear;: 1 really: 1 stood: 1 for: 2 ever: 1 day!: 1 knowing: 1 be: 2 step: 1 wanted: 1 come: 1 on: 1 about: 1 could: 2 way,: 1 passing: 1 black.: 1 Yet: 1 first: 1 equally: 1 Somewhere: 1 Two: 2 one: 3 down: 1 another: 1 roads: 2 fair,: 1 grassy: 1 Though: 1 there: 1 long: 1 way: 1 travelled: 1 was: 1 I---: 1 that: 3 took: 2 traveler,: 1 same,: 1 with: 1 And: 6 made: 1 this: 1 worn: 1 leaves: 1 and: 3 bent: 1 ages: 2 it: 2 To: 1 as: 5 diverged: 2 in: 3 undergrowth;: 1 if: 1 Then: 1 claim,: 1 no: 1 perhaps: 1 travel: 1 how: 1 shall: 1 I: 8 lay: 1 a: 3 kept: 1 back.: 1 doubted: 1 In: 1 the: 8 having: 1
The value for a key in a dictionary can be any type---even a list! This is helpful when we want to keep track of more than one value per key. In the interactive interpreter, you might make a dictionary with a list for a value like so:
>>> foo = {'a': [1, 2, 3]}
The above code creates a dictionary foo
with a single key, 'a'
, whose value is a list. Let's say you wanted to get the first element of that list. The expression to do so looks like this:
>>> foo = {'a': [1, 2, 3]} >>> foo['a'][0] 1
I.e., use the expression foo['a']
, which evaluates to the value for key 'a'
(in this case, a list), then use [0]
directly afterward to grab the first item in the list.
Likewise, we might add another element to the list like so:
>>> foo = {'a': [1, 2, 3]} >>> foo['a'].append(17) >>> foo {'a': [1, 2, 3, 17]}
That is: the expression foo['a']
evaluates to a list. Because that expression evaluates to a list, we can follow it immediately with a method call for a list, in this case .append()
.
To get a random item from the list:
>>> import random >>> foo = {'a': [1, 2, 3]} >>> random.choice(foo['a']) 3
The following program goes through a text word by word and checks to see which letter a particular word begins with. It keeps track of all initial letters in the text, storing them as a dictionary, with a list of all of the letters that begin with that letter as its value. It then uses this information to display a randomly-generated list of words beginning with each letter. I call it an "alpha poem."
import sys import random letters = {} for line in sys.stdin: line = line.strip() words = line.split() for word in words: first_letter = word[0] if first_letter not in letters: letters[first_letter] = [] letters[first_letter].append(word) for letter in 'abcdefghijklmnopqrstuvwxyz': if letter in letters: print random.choice(letters[letter])
Let's run it on some Shakespearean sonnets:
$ python alpha_poems.py <sonnets.txt aggravate bestow'st, cure despite end, friend, green, hath in judgment keep'st legacy? may no one precious queen rare, soil thought untutor'd verse, where your zealous
Nice!
EXERCISE #1: Rewrite the program above so that it keeps track of the length of each word, instead of its first letter. The first line of the output should consist of a single letter word; the second, a two-letter word; the third, a three-letter word; and so forth.
EXERCISE #2: Create a version of the program above that gets its dictionary of words from one text, and then performs a transformation on a second text using that dictionary. Specifically: for each word in the second text, randomly replace it with a corresponding word from the original text that begins with the same letter.
Now you know a bit about dictionaries!