So far, we've been working with programs that examined just one line of a file at a time. During this session, we'll be expanding our scope a little bit: we want to make programs that can build a picture of how an entire text looks, seems and behaves. In order to facilitate that, we're going to learn about lists---a simple Python data structure.
A list is a kind of value that contains other values---potentially many other values. Once you've created a list and put values in it, you can get values back out of the list using a syntax similar to the syntax we used to get particular characters (or slices) from strings.
Before we talk about how to use a list in a program, we're going to work with lists in the interactive interpreter.
Here's how you write a list in Python:
>>> ["hydrogen", "helium", "lithium", "beryllium"]
['hydrogen', 'helium', 'lithium', 'beryllium']
That is: a left square bracket, followed by a series of comma-separated expressions, followed by a right square bracket. When you write out the list, the items in it can be expressions, not just plain values. Python will evaluate those expressions and put the results in the list.
>>> element = "hydrogen"
>>> [element[:2], element[:3], element[:4]]
['hy', 'hyd', 'hydr']
Lists can hold any type of value:
>>> [5, 10, 15, 20, 25, 30]
[5, 10, 15, 20, 25, 30]
... and you can mix-and-match different types of value in the same list:
>>> [5, "harold", 7.6]
[5, 'harold', 7.6]
Lists can have an arbitrary number of values. Here's a list with only one value in it:
>>> ["hello"]
['hello']
And here's a list with no values in it:
>>> []
[]
Another way of making an empty list is to call the built-in list() function with no parameters:
>>> list()
[]
Here's what happens when we ask Python what type of value a list is:
>>> type([1, 2, 3])
<type 'list'>
It's a value of type list. Values of this type have their own methods, which you can see with the built-in dir function (just as we did a while ago with strings)---scroll all the way to the right to see the interesting ones:
>>> dir([1, 2, 3])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
The names at the end of that list---append, count, extend, index, insert, pop, remove, reverse and sort---are all things that we can do with lists. We'll talk about some of them below.
Once we have a list, we might want to get values out of the list. You can write a Python expression that evaluates to a particular value in a list using square brackets to the right of your list, with a number representing which value you want, numbered from the beginning (the left-hand side) of the list. Here's an example:
>>> ["hydrogen", "helium", "lithium"][1]
'helium'
If we were to say this expression out loud, it might read, "I have a list of three things: hydrogen, helium and lithium. Give me back the second item in the list." Python evaluates that expression to helium, the second item in the list.
This is very similar to how we got individual characters out of a string! And yes, Python uses zero-based indexes for lists, just as it does for strings.
Also---just as with strings---if you attempt to use a value for the index of a list that is beyond the end of the list (i.e., the value you use is higher than the last index in the list), Python gives you an error:
>>> ["hydrogen", "helium", "lithium"][3]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
IndexError: list index out of range
Note that while the type of a list is list, the type of an expression using index brackets to get an item out of the list is the type of whatever was in the list to begin with. To illustrate:
>>> type([1, 2, 3][0])
<type 'int'>
Oh, and just as with any other kind of value, you can assign a list to a variable. Then you just use the index bracket syntax to get back an element from the list later:
>>> elements = ["hydrogen", "helium", "lithium"]
>>> elements[0]
'hydrogen'
Just as with strings, you can use negative numbers with list indexes to start counting from the end of the list, instead of the beginning. For example:
>>> ["hydrogen", "helium", "lithium", "beryllium"][-2]
'lithium'
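One way to read a negative index is as a shorthand for counting back from the length of the list. The sketch below checks that reading on a hypothetical list that reuses the element names from the examples above. The variable names are illustrative only, and print() is written with parentheses around a single value so that it behaves the same under the Python 2 print statement used in this tutorial and under Python 3.

```python
# Hypothetical list, reusing the element names from the examples above.
elements = ["hydrogen", "helium", "lithium", "beryllium"]

# Counting from the end with a negative index...
last = elements[-1]
second_to_last = elements[-2]

# ...picks out the same items as counting from the start using len().
same_last = elements[len(elements) - 1]
same_second_to_last = elements[len(elements) - 2]

print(last)            # beryllium
print(second_to_last)  # lithium
```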
Lists also support slice syntax, just like strings---except instead of using slice syntax to get substrings, you use slice syntax to get a smaller "slice" of the list:
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][1:3]
['helium', 'lithium']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][2:]
['lithium', 'beryllium', 'boron']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][:3]
['hydrogen', 'helium', 'lithium']
Important to note: a slice of a list itself has a type of list:
>>> elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> type(elements[:3])
<type 'list'>
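Because a slice is itself a list, a few consequences follow that are easy to check. The sketch below is only an illustration on the same five elements (the variable names are made up for this example): the slice is a new list, the original is left alone, a one-item slice is still a list, and a slice that starts past the end gives an empty list rather than an error.

```python
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]

# A slice evaluates to a new, shorter list...
first_two = elements[:2]
print(first_two)        # ['hydrogen', 'helium']

# ...and the original list still has all five items.
print(len(elements))    # 5

# A slice that covers one item is still a list, not a bare string.
single = elements[1:2]
print(single)           # ['helium']

# Unlike a single index, a slice that starts past the end is not an error.
past_the_end = elements[10:]
print(past_the_end)     # []
```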
EXERCISE: Use list slices to write an expression that takes the elements variable as defined in the example above and evaluates to a new list that has all of the items in elements except for the last.
Because lists are so central to Python programming, Python includes a number of built-in functions that allow us to write expressions that evaluate to interesting facts about lists. For example, try putting a list between the parentheses of the len() function. It will evaluate to the number of items in the list:
>>> len([10, 20, 30, 40])
4
>>> len(["whatever"])
1
>>> len([])
0
The in operator, which we've previously used to ask whether or not a given substring occurs in a larger string, can also be used to check to see if an element matching a particular value is present in a list:
>>> "lithium" in ["hydrogen", "helium", "lithium", "beryllium"]
True
The .index() method evaluates to the index position of a given value, if it occurs in the list:
>>> ["hydrogen", "helium", "lithium", "beryllium"].index("helium")
1
But note that .index() causes your program to stop with an error (a ValueError) if it can't find the value you've asked for! Always use in to check for a value's presence before you call .index().
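Here is a minimal sketch of that check-first habit, assuming a hypothetical list of elements and two made-up search values. It only calls .index() when in says the value is present, and falls back to None otherwise. Using None as the "not found" marker is a choice made for this example, not something the tutorial prescribes.

```python
elements = ["hydrogen", "helium", "lithium", "beryllium"]

# A value that is in the list: in is True, so .index() is safe to call.
wanted = "lithium"
if wanted in elements:
    found_position = elements.index(wanted)
else:
    found_position = None
print(found_position)    # 2

# A value that is not in the list: we never reach .index(), so no error.
missing = "neon"
if missing in elements:
    missing_position = elements.index(missing)
else:
    missing_position = None
print(missing_position)  # None
```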
The max() function will evaluate to the highest value in the list:
>>> max([9, 8, 42, 3, -17, 2])
42
... and the min() function will evaluate to the lowest value in the list:
>>> min([9, 8, 42, 3, -17, 2])
-17
The sum() function evaluates to the sum of all values in the list.
>>> sum([2, 4, 6, 8, 80])
100
Finally, the sorted() function evaluates to a copy of the list, sorted from smallest value to largest value:
>>> sorted([9, 8, 42, 3, -17, 2])
[-17, 2, 3, 8, 9, 42]
This works with strings as well!
>>> sorted(["hydrogen", "helium", "lithium", "beryllium", "boron", "carbon"])
['beryllium', 'boron', 'carbon', 'helium', 'hydrogen', 'lithium']
Often we'll want to make changes to a list after we've created it---for example, we might want to append elements to the list, remove elements from the list, or change the order of elements in the list. Python has a number of methods for facilitating these operations.
The first method we'll talk about is .append(), which adds an item on to the end of an existing list.
>>> ingredients = ["flour", "milk", "eggs"]
>>> ingredients.append("sugar")
>>> ingredients
['flour', 'milk', 'eggs', 'sugar']
Notice that invoking the .append() method doesn't itself evaluate to anything! (Technically, it evaluates to the special value None.) Unlike many of the methods and syntactic constructions we've looked at so far, the .append() method changes the underlying value---it doesn't return a new value that is a copy with changes applied.
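A common slip that follows from this is assigning the result of .append() back to the variable, which throws the list away. The sketch below shows both the normal use and the mistake on hypothetical ingredient lists. The variable names result and mistake are made up for the example.

```python
ingredients = ["flour", "milk", "eggs"]

# .append() changes the list in place; what it evaluates to is None.
result = ingredients.append("sugar")
print(result)        # None
print(ingredients)   # ['flour', 'milk', 'eggs', 'sugar']

# The mistake: assigning the result of .append() back to the variable
# replaces the list with None.
mistake = ["flour"]
mistake = mistake.append("milk")
print(mistake)       # None
```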
There are two methods to facilitate removing values from a list: .remove() and .pop(). The .remove() method removes from the list the first value that matches the value in the parentheses:
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.remove("flour")
>>> ingredients
['milk', 'eggs', 'sugar']
(Note that .remove(), like .append(), doesn't evaluate to anything---it changes the list itself.)
The .pop() method works slightly differently: give it an expression that evaluates to an integer, and it evaluates to the item at the index named by that integer. But it also has a side effect: it removes that item from the list:
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.pop(1)
'milk'
>>> ingredients
['flour', 'eggs', 'sugar']
EXERCISE: What happens when you try to .pop() a value from a list at an index that doesn't exist in the list? What happens when you try to .remove() an item from a list if that item isn't in that list to begin with?
ANOTHER EXERCISE: Write an expression that .pop()s the second-to-last item from a list. SPOILER: (Did you guess that you could use negative indexing with .pop()?)
The .sort() and .reverse() methods do the same job as their function counterparts sorted() and reversed(), with the only difference being that the methods don't evaluate to anything, instead opting to change the list in-place.
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.sort()
>>> ingredients
['eggs', 'flour', 'milk', 'sugar']
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.reverse()
>>> ingredients
['sugar', 'eggs', 'milk', 'flour']
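To make the difference concrete, here is a small side-by-side sketch on the same hypothetical ingredients list. The functions leave the original list alone and hand back something new, while the method changes the list and evaluates to None. One detail worth flagging: reversed() hands back an iterator rather than a list, so this sketch wraps it in list() to see the items.

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# The functions build something new and leave the original untouched.
sorted_copy = sorted(ingredients)
backwards_copy = list(reversed(ingredients))  # list() turns the iterator into a list
print(sorted_copy)      # ['eggs', 'flour', 'milk', 'sugar']
print(backwards_copy)   # ['sugar', 'eggs', 'milk', 'flour']
print(ingredients)      # ['flour', 'milk', 'eggs', 'sugar']

# The method changes the list itself and evaluates to None.
result = ingredients.sort()
print(result)           # None
print(ingredients)      # ['eggs', 'flour', 'milk', 'sugar']
```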
The random library provides several helpful functions for performing chance operations on lists. The first is shuffle, which takes a list and randomly shuffles its contents:
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.shuffle(ingredients)
>>> ingredients
['flour', 'milk', 'eggs', 'sugar']
The second is choice, which returns a single random element from a list.
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.choice(ingredients)
'milk'
Finally, the sample function returns a list of values, selected at random, from a list. The sample function takes two parameters: the first is a list, and the second is how many items should be in the resulting list of randomly selected values:
>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.sample(ingredients, 2)
['eggs', 'milk']
At last, we now know enough about lists to start doing interesting creative stuff with them. But let's talk about what those programs will look like.
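The first kind of program we'll write using lists will look like this, from a schematic perspective: create an empty list; read the input line by line, adding each line to the list; then, once all of the input has been read, do something with the list and print the result.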
So: let's write a simple program that reads in some lines, then prints out a random line from all of the lines in the input.
import sys
import random

all_lines = list() # create an empty list

# use our stdin loop to collect lines into a list---but don't print them!
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)

# after all the lines have been collected, print one out at random.
print random.choice(all_lines)
Now we'll run the program, using "Sea Rose" as an input:
$ python random_line.py <sea_rose.txt
hardened in a leaf?
EXERCISE: Modify the program above so that it only chooses randomly from lines that aren't empty. (Hint: You may need to be selective about which lines you add to the list.)
Let's modify this program to print out not one randomly selected line from our input, but three:
import sys
import random

all_lines = list() # create an empty list

for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)

# use random.sample() to get three lines
selected = random.sample(all_lines, 3)

# print out each line
print selected[0]
print selected[1]
print selected[2]
Now we'll run the program, using "Sea Rose" as an input:
$ python three_random_lines.py <sea_rose.txt
you are caught in the drift.
marred and with stint of petals,
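Note that we needed to write print selected[0], print selected[1], etc. If we'd just written print selected, we would have gotten something like this:
['you are lifted', 'Rose, harsh rose,', 'that drives in the wind.']
... as our output, which isn't what we want! Remember that the random.sample() function evaluates to a list, and printing a list directly shows it the way Python writes lists: brackets, quotes and all.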
EXERCISE: Try running three_random_lines.py above with an input of fewer than three lines. What happens? Make a program that checks to make sure that at least three lines have been gathered from input, and fails gracefully otherwise. (You can decide what "fails gracefully" means---maybe it prints an error message? Or maybe it uses smaller values for random.sample() depending on how many lines are in the input?)
You may have noticed in the above example that we wrote an expression for every index of the selected list that we wanted to print out (i.e., selected[0], selected[1] and selected[2]). This works fine for a small number of items! But there are several problems here.
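For one thing, if we want to print a different number of lines, we have to change not just the second parameter to random.sample() but also the number of print statements.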
Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. So common, that Python has some built-in syntax to make the task easy: the for loop.
Here's how a for loop looks:
for temp variable name in expression that evaluates to list:
    one or more statements
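The words for and in just have to be there---that's how Python knows it's a for loop. Here's what each of the other parts means.
temp variable name: a name for a variable. Inside the loop, this variable holds the current item of the list.
expression that evaluates to list: the list whose items you want to work through, one at a time.
one or more statements: Python code, indented underneath the for, that will be executed once for each item in the list.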
Here's an example of a for loop, next to the same code if you tried to write it without using loops:
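A minimal sketch of that comparison, assuming a hypothetical three-item list named elements and a made-up task (collecting the length of each name). Neither the list nor the variable names come from the original example. The first half spells out one statement per index; the second half does the same work with a for loop.

```python
elements = ["hydrogen", "helium", "lithium"]

# Without a loop: one statement for every index we care about.
lengths_by_hand = []
lengths_by_hand.append(len(elements[0]))
lengths_by_hand.append(len(elements[1]))
lengths_by_hand.append(len(elements[2]))

# With a for loop: the indented statement runs once for each item,
# and element holds the current item each time around.
lengths_by_loop = []
for element in elements:
    lengths_by_loop.append(len(element))

print(lengths_by_hand)   # [8, 6, 7]
print(lengths_by_loop)   # [8, 6, 7]
```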
As you can see, the solution with the list is much more succinct! It's also more powerful: the same code will work even if you add more elements to the list later, or if the list itself is made when the program runs (e.g., you make the list from lines of text being passed into the program.)
With our knowledge of for loops firmly established, we can now easily write a program that reads in an entire file, and then prints out the lines of that file in random order:
import sys
import random

# create an empty list
all_lines = list()

# add each line of input to that list
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)

# shuffle the lines randomly
random.shuffle(all_lines)

# now, print each item in the shuffled list
for random_line in all_lines:
    print random_line
EXERCISE: Write a program that prints out 10% of the lines of the original text, sampled randomly. (I.e., if the original input contains 100 lines, it would output 10 lines selected at random.)
A powerful thing we can do with strings is "split" them up into lists. One easy way to do this is by passing a string to the list() built-in function:
>>> chars = list("hello there")
>>> chars
['h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
This merely breaks the string up into individual characters, each of which ends up as an individual item in the list.
What if we want bigger, more meaningful chunks, not just individual characters? For that, Python gives us the .split() method. The .split() method takes a string to its left (before the .) and a string inside of its parentheses. It returns a list of strings, carved out of the string on the left, "split" into pieces by breaking it wherever it finds the string in the parentheses. For example, to break a US phone number into its constituent parts:
>>> "212-555-1212".split("-")
['212', '555', '1212']
In the above example, the .split() method "splits" the string 212-555-1212 into a list containing three strings: 212, 555 and 1212.
The .split() method is the easiest way for our programs to start working with words, not just individual characters or entire lines. The easiest way to break a line into words is by calling .split() with a string containing a single space as its parameter. So:
>>> sentence = "Now is the winter of our discontent."
>>> words = sentence.split(" ")
>>> words
['Now', 'is', 'the', 'winter', 'of', 'our', 'discontent.']
Nice! Let's use this functionality to write a program that counts how many words there are in an entire text file:
import sys

# create an integer variable to accumulate
# word count
word_count = 0

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    line_word_count = len(words)
    word_count = word_count + line_word_count

# print the total number of words
print word_count
In the above program, we use the .split() method to split the current line into words, then get the count of the number of words by passing the list to the len() function. Run this program with "Sea Rose" as input:
$ python word_count.py <sea_rose.txt
68
EXERCISE: Compare the output of the above program to the output of the UNIX command line utility wc -w. Is the number different? Why? Can you put debug statements in the program above to diagnose why the two programs might have different outputs?
EXERCISE 2: Write a Python program that prints out the last word of every line of input. (Make sure to check to see if there are any words on the line before trying to get the last word!)
The list that you get from calling the .split() method is just like any other list! And like any list, you can iterate over it using a for loop. Here's another program---it reads in some input, then outputs five words, randomly chosen from the entire text:
import sys
import random

all_words = list()

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    for individual_word in words:
        all_words.append(individual_word)

random_words = random.sample(all_words, 5)
for word in random_words:
    print word
Here's the output:
$ python five_random_words.py <sea_rose.txt
on
drift.
leaf?
a
of
EXERCISE: Modify the example above to use the list method .extend() in place of the for loop on line 9.
ANOTHER EXERCISE: Modify the example above to include only words with five or more characters in the pool for potentially selected words. (HINT: Use an if statement in the inner for loop.)
Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:
>>> element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> glue = ", and "
>>> glue.join(element_list)
'hydrogen, and helium, and lithium, and beryllium, and boron'
The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:
>>> words = ["this", "is", "a", "test"]
>>> " ".join(words)
'this is a test'
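When we're working with .join(), our workflow usually looks something like this: split a string into a list of words, change that list somehow (shuffle it, sort it, slice it), join the list back together into a single string, then do something with that string, such as printing it.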
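As a sketch of that workflow on a single string, assuming a made-up variable named line and using .reverse() as the "change the list" step (any of the list operations above would do in its place):

```python
line = "Now is the winter of our discontent."

# 1. Split the string into a list of words.
words = line.split(" ")

# 2. Change the list somehow; here we reverse the order of the words.
words.reverse()

# 3. Join the list back together with a space as the glue.
output = " ".join(words)

# 4. Do something with the resulting string.
print(output)   # discontent. our of winter the is Now
```

Reversing is used here instead of shuffling only so that the result is the same every time the code runs.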
With this in mind, here's a program that splits each line of input into a list, randomizes the order of that list, then prints out the results:
import sys
import random

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    random.shuffle(words)
    output = " ".join(words)
    print output
Run this program with "Sea Rose" and you get...
$ python randomize_words.py <sea_rose.txt
rose, harsh Rose,
petals, with stint of marred and
meagre flower, thin,
of spare leaf,
precious more
a rose wet than
a -- on single stem
you the in drift. caught are
with small leaf, Stunted,
you on sand, are flung the
lifted you are
sand crisp in the
wind. in that the drives
spice-rose the Can
acrid drip fragrance such
leaf? hardened a in
EXERCISE: Use UNIX pipes to combine randomize_words.py and randomize_lines.py (i.e., you should end up with all of the lines of an input in random order, with all of the words on each line in random order.)
EXERCISE 2: Write a Python program that prints out the last three words of each line of input. (Use .split() to split each line into words, and then .join() to join a slice of that list back into a string.)
Whew! That was a lot of stuff to absorb. Here's some more information and reading: