So far, we've been working with programs that examined just one line of a file at a time. During this session, we'll be expanding our scope a little bit: we want to make programs that can build a picture of how an entire text looks, seems and behaves. In order to facilitate that, we're going to learn about lists---a simple Python data structure.
A list is a kind of value that contains other values---potentially many other values. Once you've created a list and put values in it, you can get values back out of the list using a syntax similar to the syntax we used to get particular characters (or slices) from strings.
Before we talk about how to use a list in a program, we're going to work with lists in the interactive interpreter.
Here's how you write a list in Python:
>>> ["hydrogen", "helium", "lithium", "beryllium"]
['hydrogen', 'helium', 'lithium', 'beryllium']
That is: a left square bracket, followed by a series of comma-separated expressions, followed by a right square bracket. When you write out the list, the items in it can be expressions, not just plain values. Python will evaluate those expressions and put the resulting values in the list.
>>> element = "hydrogen"
>>> [element[:2], element[:3], element[:4]]
['hy', 'hyd', 'hydr']
Lists can hold any type of value:
>>> [5, 10, 15, 20, 25, 30]
[5, 10, 15, 20, 25, 30]
... and you can mix-and-match different types of value in the same list:
>>> [5, "harold", 7.6]
[5, 'harold', 7.6]
Lists can have an arbitrary number of values. Here's a list with only one value in it:
>>> ["hello"]
['hello']
And here's a list with no values in it:
>>> []
[]
Another way of making an empty list is to call the built-in list() function with no parameters:
>>> list()
[]
Here's what happens when we ask Python what type of value a list is:
>>> type([1, 2, 3])
<type 'list'>
It's a value of type list. Values of this type have their own methods, which you can see with the built-in dir function (just as we did a while ago with strings)---the interesting ones are at the end of the output:
>>> dir([1, 2, 3])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
These methods---append, count, extend, index, insert, pop, remove, reverse, sort---are all things that we can do with lists. We'll talk about some of them below.
Once we have a list, we might want to get values out of the list. You can write a Python expression that evaluates to a particular value in a list using square brackets to the right of your list, with a number representing which value you want, numbered from the beginning (the left-hand side) of the list. Here's an example:
>>> ["hydrogen", "helium", "lithium"][1]
'helium'
If we were to say this expression out loud, it might read, "I have a list of three things: hydrogen, helium and lithium. Give me back the second item in the list." Python evaluates that expression to helium, the second item in the list.
This is very similar to how we got individual characters out of a string! And yes, Python uses zero-based indexes for lists, just as it does for strings.
Also---just as with strings---if you attempt to use a value for the index of a list that is beyond the end of the list (i.e., the value you use is higher than the last index in the list), Python gives you an error:
>>> ["hydrogen", "helium", "lithium"][127]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
IndexError: list index out of range
Note that while the type of a list is list, the type of an expression using index brackets to get an item out of the list is the type of whatever was in the list to begin with. To illustrate:
>>> type([1, 2, 3][0])
<type 'int'>
Oh, and just as with any other kind of value, you can assign a list to a variable. Then you just use the index bracket syntax to get back an element from the list later:
>>> elements = ["hydrogen", "helium", "lithium"]
>>> elements[0]
'hydrogen'
Just as with strings, you can use negative numbers with list indexes to start counting from the end of the list, instead of the beginning. For example:
>>> ["hydrogen", "helium", "lithium", "beryllium"][-2]
'lithium'
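One way to picture negative indexes is as a shorthand for counting back from the length of the list. Here's a small sketch of that idea; the variable names are ours, chosen for illustration, and it leans on len(), which we've used with strings and which works the same way on lists:

```python
elements = ["hydrogen", "helium", "lithium", "beryllium"]

# Counting from the right-hand side: -1 is the last item, -2 the one before it.
last = elements[-1]                                 # 'beryllium'
second_to_last = elements[-2]                       # 'lithium'

# The same two items, reached by counting from the left instead.
also_last = elements[len(elements) - 1]             # 'beryllium'
also_second_to_last = elements[len(elements) - 2]   # 'lithium'
```

The negative form is easier to read, and you don't need to know how long the list is.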
Lists also support slice syntax, just like strings---except instead of using slice syntax to get substrings, you use slice syntax to get a smaller "slice" of the list:
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][1:3]
['helium', 'lithium']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][2:]
['lithium', 'beryllium', 'boron']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][:3]
['hydrogen', 'helium', 'lithium']
Important to note: a slice of a list itself has a type of list:
>>> elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> type(elements[:3])
<type 'list'>
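It may help to see that taking a slice leaves the original list alone: the slice is a separate, smaller list. A quick sketch (the names first_two and middle are made up for this example):

```python
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]

first_two = elements[:2]   # ['hydrogen', 'helium']
middle = elements[1:4]     # ['helium', 'lithium', 'beryllium']

# Slicing didn't remove anything: elements still has all five items.
original_length = len(elements)   # 5
```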
EXERCISE: Use list slices to write an expression that takes the elements variable as defined in the example above and evaluates to a new list that has all of the items in elements except for the last.
Because lists are so central to Python programming, Python includes a number of built-in functions that allow us to write expressions that evaluate to interesting facts about lists. For example, try putting a list between the parentheses of the len() function. It will evaluate to the number of items in the list:
>>> len([10, 20, 30, 40])
4
>>> len(["whatever"])
1
>>> len([])
0
The in operator, which we've previously used to ask whether or not a given substring occurs in a larger string, can also be used to check to see if an element matching a particular value is present in a list:
>>> "lithium" in ["hydrogen", "helium", "lithium", "beryllium"]
True
Likewise, the .index() method evaluates to the index position of a given value, if it occurs in the list:
>>> ["hydrogen", "helium", "lithium", "beryllium"].index("helium")
1
(Note that .index() raises a ValueError, which stops your program, if it can't find the value you've asked for! Always use in to check for a value's presence before you call .index().)
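Here's a rough sketch of that check-first pattern. The variable names, and the choice to use None to mean "not found", are our own conventions for this example, not anything Python requires:

```python
elements = ["hydrogen", "helium", "lithium", "beryllium"]
wanted = "carbon"  # hypothetical value to look up; it isn't in the list

if wanted in elements:
    # Safe: we already know the value is present.
    position = elements.index(wanted)
else:
    # Our own convention for "not found"; .index() is never called here.
    position = None
```

Change wanted to "lithium" and position ends up as 2 instead.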
The max() function will evaluate to the highest value in the list:
>>> max([9, 8, 42, 3, -17, 2])
42
... and the min() function will evaluate to the lowest value in the list:
>>> min([9, 8, 42, 3, -17, 2])
-17
The sum() function evaluates to the sum of all values in the list:
>>> sum([2, 4, 6, 8, 80])
100
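These functions combine nicely. As a small illustration (the names average and spread are ours), here's one way you might compute the average of a list of numbers and the distance between its largest and smallest values:

```python
values = [2, 4, 6, 8, 80]

# float() keeps the division from rounding down under Python 2.
average = float(sum(values)) / len(values)   # 20.0

# Distance between the highest and lowest value.
spread = max(values) - min(values)           # 78
```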
Finally, the sorted() function evaluates to a copy of the list, sorted from smallest value to largest value:
>>> sorted([9, 8, 42, 3, -17, 2])
[-17, 2, 3, 8, 9, 42]
This works with strings as well!
>>> sorted(["hydrogen", "helium", "lithium", "beryllium", "boron", "carbon"])
['beryllium', 'boron', 'carbon', 'helium', 'hydrogen', 'lithium']
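The word "copy" matters here: sorted() gives you a new list and leaves the one you passed in untouched. A minimal sketch, with in_order as a name we made up:

```python
numbers = [9, 8, 42, 3, -17, 2]

in_order = sorted(numbers)
# in_order is [-17, 2, 3, 8, 9, 42]
# numbers is still [9, 8, 42, 3, -17, 2], in its original order
```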
Often we'll want to make changes to a list after we've created it---for example, we might want to append elements to the list, remove elements from the list, or change the order of elements in the list. Python has a number of methods for facilitating these operations.
The first method we'll talk about is .append(), which adds an item on to the end of an existing list.
>>> ingredients = ["flour", "milk", "eggs"]
>>> ingredients.append("sugar")
>>> ingredients
['flour', 'milk', 'eggs', 'sugar']
Notice that invoking the .append() method doesn't itself evaluate to anything! (Technically, it evaluates to the special value None.) Unlike many of the methods and syntactic constructions we've looked at so far, the .append() method changes the underlying value---it doesn't return a new value that is a copy with changes applied.
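This trips people up often enough to be worth a quick sketch. If you store what .append() evaluates to, you get None, not the list (the name result is just for illustration):

```python
ingredients = ["flour", "milk", "eggs"]

# .append() changes the list in place...
result = ingredients.append("sugar")

# ...so result is None, while ingredients is now
# ['flour', 'milk', 'eggs', 'sugar'].
```

So a line like ingredients = ingredients.append("sugar") would throw your list away and leave you with None.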
There are two methods to facilitate removing values from a list: .pop() and .remove(). The .remove() method removes from the list the first value that matches the value in the parentheses:
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.remove("flour")
>>> ingredients
['milk', 'eggs', 'sugar']
(Note that .remove(), like .append(), doesn't evaluate to anything---it changes the list itself.)
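"The first value that matches" is worth taking literally. In this sketch (the shopping list is invented for the example), the value appears twice and only the first occurrence goes away:

```python
shopping = ["eggs", "milk", "eggs", "sugar"]

# Removes only the first matching item, at index 0.
shopping.remove("eggs")

# shopping is now ['milk', 'eggs', 'sugar']; the second "eggs" survives.
```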
The .pop() method works slightly differently: give it an expression that evaluates to an integer, and it evaluates to the item at the index named by that integer. But it also has a side effect: it removes that item from the list:
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.pop(1)
'milk'
>>> ingredients
['flour', 'eggs', 'sugar']
EXERCISE: What happens when you try to .pop() a value from a list at an index that doesn't exist in the list? What happens when you try to .remove() an item from a list if that item isn't in that list to begin with?
ANOTHER EXERCISE: Write an expression that .pop()s the second-to-last item from a list. SPOILER: (Did you guess that you could use negative indexing with .pop()?)
The .sort() and .reverse() methods do the same jobs as their function counterparts sorted() and reversed(), with the main difference being that the methods don't evaluate to anything, instead opting to change the list in-place. (One wrinkle: reversed() evaluates to an iterator rather than a list, so wrap it in list() if you need a list back.)
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.sort()
>>> ingredients
['eggs', 'flour', 'milk', 'sugar']
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.reverse()
>>> ingredients
['sugar', 'eggs', 'milk', 'flour']
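To make the contrast concrete, here's a sketch that puts the functions and the methods side by side. The variable names are ours, and list() is wrapped around reversed() because reversed() on its own gives back an iterator rather than a list:

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# The functions build new lists and leave ingredients alone.
alphabetical = sorted(ingredients)        # ['eggs', 'flour', 'milk', 'sugar']
backwards = list(reversed(ingredients))   # ['sugar', 'eggs', 'milk', 'flour']
untouched = ingredients[:]                # copy taken before sorting in place

# The method changes ingredients itself and evaluates to None.
ingredients.sort()
# ingredients is now ['eggs', 'flour', 'milk', 'sugar']
```

Reach for the function when you still need the original order later, and for the method when you don't.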
Python's random library provides several helpful functions for performing chance operations on lists. The first is shuffle, which takes a list and randomly shuffles its contents:
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.shuffle(ingredients)
>>> ingredients
['flour', 'milk', 'eggs', 'sugar']
The second is choice, which returns a single random element from a list.
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.choice(ingredients)
'milk'
Finally, the sample function returns a list of values, selected at random, from a list. The sample function takes two parameters: the first is a list, and the second is how many items should be in the resulting list of randomly selected values:
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.sample(ingredients, 2)
['eggs', 'milk']
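Because the results are random, your output will differ from run to run. The sketch below shows the three functions together. The call to random.seed() is an optional extra that isn't needed for anything in this tutorial; it just makes the "random" results repeat between runs, which can be handy while debugging. The variable names are ours.

```python
import random

random.seed(42)  # optional: makes the results repeatable from run to run

ingredients = ["flour", "milk", "eggs", "sugar"]

one = random.choice(ingredients)       # a single item from the list
pair = random.sample(ingredients, 2)   # a new list of two different items

# shuffle() changes the list in place (and, like .sort(), evaluates to None).
random.shuffle(ingredients)
```

Note that choice() and sample() leave the list alone, while shuffle() rearranges it.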
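At last, we now know enough about lists to start doing interesting creative stuff with them. But first, let's talk about what those programs will look like.
The first kind of program we'll write using lists will look like this, from a schematic perspective: first read every line of input into a list, then, once all of the lines have been collected, do something with that list and print the result.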
So: let's write a simple program that reads in some lines, then prints out a random line from all of the lines in the input.
import sys
import random
all_lines = list() # create an empty list
# use our stdin loop to collect lines into a list---but don't print them!
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)
# after all the lines have been collected, print one out at random.
print random.choice(all_lines)
Now we'll run the program, using "Sea Rose" as an input:
$ python random_line.py <sea_rose.txt
hardened in a leaf?
EXERCISE: Modify the program above so that it only chooses randomly from lines that aren't empty. (Hint: You may need to be selective about which lines you add to the list.)
Let's modify this program to print out not one randomly selected line from our input, but three:
import sys
import random
all_lines = list() # create an empty list
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)
# use random.sample() to get three lines
selected = random.sample(all_lines, 3)
# print out each line
print selected[0]
print selected[1]
print selected[2]
Now we'll run the program, using "Sea Rose" as an input:
$ python three_random_lines.py <sea_rose.txt
you are caught in the drift.
marred and with stint of petals,
Note that we needed to write print selected[0], print selected[1], etc. If we'd just written print selected, we would have gotten something like this:
['you are lifted', 'Rose, harsh rose,', 'that drives in the wind.']
... as our output, which isn't what we want! Remember that the print statement, if you give it something other than a string, will print out a "representation" of that value---basically, Python's best guess about how you'd like to see that value displayed. You probably don't want your poem to have brackets and commas in it (or maybe you do?), so make sure that you always tell Python to print strings, not different kinds of values. (Unless you're debugging, of course, in which case Python's attempts to make values into strings for you can be very useful.)
EXERCISE: Try running three_random_lines.py above with an input of fewer than three lines. What happens? Make a program that checks to make sure that at least three lines have been gathered from input, and fails gracefully otherwise. (You can decide what "fails gracefully" means---maybe it prints an error message? Or maybe it uses smaller values for random.sample() depending on how many lines are in the input?)
You may have noticed in the above example that we wrote an expression for every index of the selected list that we wanted to print out (i.e., selected[0], selected[1], selected[2]). This works fine for a small number of items! But there are several problems here.
What if we wanted a hundred random lines instead of three? Would we really want to write a print statement for each of those expressions?
If we want to change the number of lines, we have to change not just the call to random.sample() but also the number of print statements. It's easy to mess this up.
Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. So common, in fact, that Python has some built-in syntax to make the task easy: the for loop.
Here's how a for loop looks:
for temp variable name in expression that evaluates to list:
    one or more statements
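The words for and in just have to be there---that's how Python knows it's a for loop. Here's what each of those parts means.
temp variable name: a variable name of your choosing. Each time through the loop, it holds the current item from the list.
expression that evaluates to list: the list whose items you want to work through.
one or more statements: the statements indented beneath the for will be executed once for each item in the list.
Here's a for loop, next to the same code if you tried to write it without using loops: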
With loop
numbers = [1, 2, 3, 4, 5]
for item in numbers:
    print item * item
Without loop
numbers = [1, 2, 3, 4, 5]
item = numbers[0]
print item * item
item = numbers[1]
print item * item
item = numbers[2]
print item * item
item = numbers[3]
print item * item
item = numbers[4]
print item * item
As you can see, the solution with the loop is much more succinct! It's also more powerful: the same code will work even if you add more elements to the list later, or if the list itself is made when the program runs (e.g., you make the list from lines of text being passed into the program).
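Loops aren't only for printing. A common pattern, sketched here with a variable we've called total, is to start with a value before the loop and update it once per item. (We've written print with parentheses so that this sketch runs under either Python 2 or Python 3.)

```python
numbers = [1, 2, 3, 4, 5]

# Start the running total at zero, before the loop begins.
total = 0

for item in numbers:
    # Runs once per item: add that item's square to the total.
    total = total + item * item

# After the loop, total holds 1 + 4 + 9 + 16 + 25.
print(total)   # 55
```

Add a 6 to the end of numbers and the loop handles it without any other changes.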
With our knowledge of for loops firmly established, we can now easily write a program that reads in an entire file, and then prints out the lines of that file in random order:
import sys
import random
# create an empty list
all_lines = list()
# add each line of input to that list
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)
# shuffle the lines randomly
random.shuffle(all_lines)
# now, print each item in the shuffled list
for random_line in all_lines:
    print random_line
EXERCISE: Write a program that prints out 10% of the lines of the original text, sampled randomly. (I.e., if the original input contains 100 lines, it would output 10 lines selected at random.)
A powerful thing we can do with strings is "split" them up into lists. One easy way to do this is by passing a string to the list() built-in function:
>>> chars = list("hello there")
>>> chars
['h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
This merely breaks the string up into individual characters, each of which ends up as an individual item in the list.
What if we want bigger, more meaningful chunks, not just individual characters? For that, Python gives us the .split() method. The .split() method takes a string to its left (before the .) and a string inside of its parentheses. It returns a list of strings, carved out of the string on the left, "split" into pieces by breaking it wherever it finds the string on the right. For example, to break a US phone number into its constituent parts:
>>> "212-555-1212".split("-")
['212', '555', '1212']
In the above example, the .split() method "splits" the string 212-555-1212 into a list containing three strings: 212, 555, and 1212.
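The string in the parentheses doesn't have to be a hyphen, or even a single character. Here are a few more illustrations; the example strings and variable names are invented for this sketch:

```python
# Split on a hyphen, as in the phone number example.
date_parts = "2014-06-30".split("-")        # ['2014', '06', '30']

# Split on a comma.
fields = "flour,milk,eggs".split(",")       # ['flour', 'milk', 'eggs']

# The separator can be longer than one character.
halves = "ebb and flow".split(" and ")      # ['ebb', 'flow']

# The result is an ordinary list, so you can index it right away.
area_code = "212-555-1212".split("-")[0]    # '212'
```

Notice that every piece is still a string, even when it looks like a number.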
The .split() method is the easiest way for our programs to start working with words, not just individual characters or entire lines. The easiest way to break a line into words is by calling .split() with a string containing a single space as its parameter. So:
>>> sentence = "Now is the winter of our discontent."
>>> words = sentence.split(" ")
>>> words
['Now', 'is', 'the', 'winter', 'of', 'our', 'discontent.']
Nice! Let's use this functionality to write a program that counts how many words there are in an entire text file:
import sys
# create an integer variable to accumulate
# word count
word_count = 0
for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    line_word_count = len(words)
    word_count = word_count + line_word_count
# print the total number of words
print word_count
In the above program, we use the .split() method to split the current line into words, then get the count of the number of words by passing the list to the len() function. Run this program with "Sea Rose" as input:
$ python word_count.py <sea_rose.txt
68
EXERCISE: Compare the output of the above program to the output of the UNIX command line utility wc -w. Is the number different? Why? Can you put debug statements in the program above to diagnose why the two programs might have different outputs?
EXERCISE 2: Write a Python program that prints out the last word of every line of input. (Make sure to check to see if there are any words on the line before trying to get the last word!)
The list that you get from calling the .split() method is just like any other list! And like any list, you can iterate over it using a for loop. Here's another program---it reads in some input, then outputs five words, randomly chosen from the entire text:
import sys
import random
all_words = list()
for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    for individual_word in words:
        all_words.append(individual_word)
random_words = random.sample(all_words, 5)
for word in random_words:
    print word
Here's the output:
$ python five_random_words.py <sea_rose.txt
on
drift.
leaf?
a
of
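The heart of that program is a loop inside a loop: the outer loop visits each line, and the inner loop visits each word on that line. Here's a sketch of the same idea that you can try without a file. It uses a small list of lines written directly into the code (two lines from "Sea Rose") as a stand-in for sys.stdin:

```python
# Stand-in for the lines we'd normally read from sys.stdin.
lines = ["Rose, harsh rose,", "marred and with stint of petals,"]

all_words = list()

for line in lines:
    # The outer loop runs once per line...
    for word in line.split(" "):
        # ...and the inner loop runs once per word on that line.
        all_words.append(word)

# all_words now holds 3 + 6 = 9 words, in their original order.
```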
EXERCISE: Modify the example above to use the list method .extend() in place of the inner for loop.
ANOTHER EXERCISE: Modify the example above to include only words with five or more characters in the pool for potentially selected words. (HINT: Use an if statement in the inner for loop.)
Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:
>>> element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> glue = ", and "
>>> glue.join(element_list)
'hydrogen, and helium, and lithium, and beryllium, and boron'
The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:
>>> words = ["this", "is", "a", "test"]
>>> " ".join(words)
'this is a test'
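Two details are worth a quick sketch: the glue can be any string (including an empty one), and .join() only works when every item in the list is a string. The variable names here are ours, and the str() conversion is one possible way to prepare a list of numbers for joining:

```python
words = ["this", "is", "a", "test"]

spaced = " ".join(words)     # 'this is a test'
dashed = "-".join(words)     # 'this-is-a-test'
squashed = "".join(words)    # 'thisisatest'

numbers = [1, 2, 3]
# ", ".join(numbers) would raise a TypeError, because the items aren't strings.
# One workaround: build a list of strings first, then join that.
as_strings = []
for n in numbers:
    as_strings.append(str(n))
listed = ", ".join(as_strings)   # '1, 2, 3'
```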
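When we're working with .split() and .join(), our workflow usually looks something like this: split a string into a list of words, make some changes to that list, then join the list back together into a single string.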
With this in mind, here's a program that splits each line of input into a list, randomizes the order of that list, then prints out the results:
import sys
import random
for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    random.shuffle(words)
    output = " ".join(words)
    print output
Run this program with "Sea Rose" and you get...
$ python randomize_words.py <sea_rose.txt
rose, harsh Rose,
petals, with stint of marred and
meagre flower, thin,
of spare leaf,

precious more
a rose wet than
a -- on single stem
you the in drift. caught are

with small leaf, Stunted,
you on sand, are flung the
lifted you are
sand crisp in the
wind. in that the drives

spice-rose the Can
acrid drip fragrance such
leaf? hardened a in
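Shuffling is only one of the things you can do in the middle step. Here's the same split, change, join pattern with a change that isn't random, so you can check the result by eye: reversing the order of the words. The line of text is taken from "Sea Rose"; the variable names follow the program above.

```python
line = "you are caught in the drift."

# Step 1: split the string into a list of words.
words = line.split(" ")

# Step 2: change the list (here, reverse it in place).
words.reverse()

# Step 3: join the list back into a single string.
output = " ".join(words)   # 'drift. the in caught are you'
```

Swap step 2 for .sort(), a slice, or random.shuffle() and the rest of the pattern stays the same.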
EXERCISE: Use UNIX pipes to combine randomize_words.py and randomize_lines.py (i.e., you should end up with all of the lines of an input in random order, with all of the words on each line in random order).
EXERCISE 2: Write a Python program that prints out the last three words of each line of input. (Use .split() to split each line into words, and then .join() to join a slice of that list back into a string.)
Whew! That was a lot of stuff to absorb. Here's some more information and reading: