Lists and loops

So far, we've been working with programs that examined just one line of a file at a time. During this session, we'll be expanding our scope a little bit: we want to make programs that can build a picture of how an entire text looks, seems and behaves. In order to facilitate that, we're going to learn about lists---a simple Python data structure.

Lists

A list is a kind of value that contains other values---potentially many other values. Once you've created a list and put values in it, you can get values back out of the list using a syntax similar to the syntax we used to get particular characters (or slices) from strings.

Before we talk about how to use a list in a program, we're going to work with lists in the interactive interpreter.

Here's how you write a list in Python:

>>> ["hydrogen", "helium", "lithium", "beryllium"]
['hydrogen', 'helium', 'lithium', 'beryllium']

That is: a left square bracket, followed by a series of comma-separated expressions, followed by a right square bracket. When you write out the list, the items in it can be expressions, not just plain values. Python will evaluate those expressions and put the resulting values in the list.

>>> element = "hydrogen"
>>> [element[:2], element[:3], element[:4]]
['hy', 'hyd', 'hydr']

Lists can hold any type of value:

>>> [5, 10, 15, 20, 25, 30]
[5, 10, 15, 20, 25, 30]

... and you can mix-and-match different types of value in the same list:

>>> [5, "harold", 7.6]
[5, 'harold', 7.6]

Lists can have an arbitrary number of values. Here's a list with only one value in it:

>>> ["hello"]
['hello']

And here's a list with no values in it:

>>> []
[]

Another way of making an empty list is to call the built-in list() function with no parameters:

>>> list()
[]

Here's what happens when we ask Python what type of value a list is:

>>> type([1, 2, 3])
<type 'list'>

It's a value of type list. Values of this type have their own methods, which you can see with the built-in dir function (just as we did a while ago with strings)---scroll all the way to the right to see the interesting ones:

>>> dir([1, 2, 3])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

These methods---append, count, extend, index, insert, pop, remove, reverse, sort---are all things that we can do with lists. We'll talk about some of them below.

Getting values out of lists

Once we have a list, we might want to get values out of the list. You can write a Python expression that evaluates to a particular value in a list using square brackets to the right of your list, with a number representing which value you want, numbered from the beginning (the left-hand side) of the list. Here's an example:

>>> ["hydrogen", "helium", "lithium"][1]
'helium'

If we were to say this expression out loud, it might read, "I have a list of three things: hydrogen, helium and lithium. Give me back the second item in the list." Python evaluates that expression to helium, the second item in the list.

This is very similar to how we got individual characters out of a string! And yes, Python uses zero-based indexes for lists, just as it does for strings.

Also---just as with strings---if you attempt to use a value for the index of a list that is beyond the end of the list (i.e., the value you use is higher than the last index in the list), Python gives you an error:

>>> ["hydrogen", "helium", "lithium"][127]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
IndexError: list index out of range

Note that while the type of a list is list, the type of an expression using index brackets to get an item out of the list is the type of whatever was in the list to begin with. To illustrate:

>>> type([1, 2, 3][0])
<type 'int'>

Oh, and just as with any other kind of value, you can assign a list to a variable. Then you just use the index bracket syntax to get back an element from the list later:

>>> elements = ["hydrogen", "helium", "lithium"]
>>> elements[0]
'hydrogen'

Negative indexes and slices

Just as with strings, you can use negative numbers with list indexes to start counting from the end of the list, instead of the beginning. For example:

>>> ["hydrogen", "helium", "lithium", "beryllium"][-2]
'lithium'

Lists also support slice syntax, just like strings---except instead of using slice syntax to get substrings, you use slice syntax to get a smaller "slice" of the list:

>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][1:3]
['helium', 'lithium']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][2:]
['lithium', 'beryllium', 'boron']
>>> ["hydrogen", "helium", "lithium", "beryllium", "boron"][:3]
['hydrogen', 'helium', 'lithium']

Important to note: a slice of a list itself has a type of list:

>>> elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> type(elements[:3])
<type 'list'>

EXERCISE: Use list slices to write an expression that takes the elements variable as defined in the example above and evaluates to a new list that has all of the items in elements except for the last.

Operations on lists

Because lists are so central to Python programming, Python includes a number of built-in functions that allow us to write expressions that evaluate to interesting facts about lists. For example, try putting a list between the parentheses of the len() function. It will evaluate to the number of items in the list:

>>> len([10, 20, 30, 40])
4
>>> len(["whatever"])
1
>>> len([])
0

The in operator, which we've previously used to ask whether or not a given substring occurs in a larger string, can also be used to check to see if an element matching a particular value is present in a list:

>>> "lithium" in ["hydrogen", "helium", "lithium", "beryllium"]
True
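One difference from strings is worth seeing for yourself: with a list, in looks for a whole item that equals the value, not for a piece of an item. A small sketch (the variable names are only for illustration):

```python
elements = ["hydrogen", "helium", "lithium", "beryllium"]

# a whole item that matches exactly: found
whole_item = "lithium" in elements    # True

# only part of an item: not found, even though "lith" is inside "lithium"
partial_item = "lith" in elements     # False

# with a string, by contrast, in checks for substrings
in_string = "lith" in "lithium"       # True

# not in gives the opposite answer
missing = "neon" not in elements      # True
```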

Likewise, the .index() method evaluates to the index position of a given value, if it occurs in the list:

>>> ["hydrogen", "helium", "lithium", "beryllium"].index("helium")
1

(Note that .index() raises a ValueError, which stops your program, if it can't find the value you've asked for! Always use in to check for a value's presence before you call .index().)
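Here is a minimal sketch of that check-first pattern. The name wanted and the fallback value of None are arbitrary choices for this example, not anything Python requires.

```python
elements = ["hydrogen", "helium", "lithium", "beryllium"]
wanted = "neon"

# only call .index() once we know the value is present
if wanted in elements:
    position = elements.index(wanted)
else:
    position = None  # our own signal for "not found"

print(position)  # None, because "neon" isn't in the list
```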

The max() function will evaluate to the highest value in the list:

>>> max([9, 8, 42, 3, -17, 2])
42

... and the min() function will evaluate to the lowest value in the list:

>>> min([9, 8, 42, 3, -17, 2])
-17

The sum() function evaluates to the sum of all values in the list.

>>> sum([2, 4, 6, 8, 80])
100

Finally, the sorted() function evaluates to a copy of the list, sorted from smallest value to largest value:

>>> sorted([9, 8, 42, 3, -17, 2])
[-17, 2, 3, 8, 9, 42]

This works with strings as well!

>>> sorted(["hydrogen", "helium", "lithium", "beryllium", "boron", "carbon"])
['beryllium', 'boron', 'carbon', 'helium', 'hydrogen', 'lithium']

Making changes to lists

Often we'll want to make changes to a list after we've created it---for example, we might want to append elements to the list, remove elements from the list, or change the order of elements in the list. Python has a number of methods for facilitating these operations.

The first method we'll talk about is .append(), which adds an item on to the end of an existing list.

>>> ingredients = ["flour", "milk", "eggs"]
>>> ingredients.append("sugar")
>>> ingredients
['flour', 'milk', 'eggs', 'sugar']

Notice that invoking the .append() method doesn't itself evaluate to anything! (Technically, it evaluates to the special value None.) Unlike many of the methods and syntactic constructions we've looked at so far, the .append() method changes the underlying value---it doesn't return a new value that is a copy with changes applied.
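This trips people up often enough that it's worth a sketch. Because .append() changes the list and evaluates to None, storing what it evaluates to gives you None, not the list (result is just a name picked for this example):

```python
ingredients = ["flour", "milk", "eggs"]

# correct: call .append() on its own line; the list itself changes
ingredients.append("sugar")

# mistake: storing what .append() evaluates to
result = ingredients.append("butter")

print(ingredients)  # ['flour', 'milk', 'eggs', 'sugar', 'butter']
print(result)       # None
```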

There are two methods to facilitate removing values from a list: .pop() and .remove(). The .remove() method removes from the list the first value that matches the value in the parentheses:

>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.remove("flour")
>>> ingredients
['milk', 'eggs', 'sugar']

(Note that .remove(), like .append(), doesn't evaluate to anything---it changes the list itself.)

The .pop() method works slightly differently: give it an expression that evaluates to an integer, and it evaluates to the item at the index named by that integer. But it also has a side effect: it removes that item from the list:

>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.pop(1)
'milk'
>>> ingredients
['flour', 'eggs', 'sugar']
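The index is optional. If you leave the parentheses empty, .pop() removes and evaluates to the last item in the list, which makes it a natural partner for .append(). A quick sketch:

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# no index given: take the item off the end
last = ingredients.pop()

print(last)         # sugar
print(ingredients)  # ['flour', 'milk', 'eggs']

# .append() puts it back where it was
ingredients.append(last)
```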

EXERCISE: What happens when you try to .pop() a value from a list at an index that doesn't exist in the list? What happens when you try to .remove() an item from a list if that item isn't in that list to begin with?

ANOTHER EXERCISE: Write an expression that .pop()s the second-to-last item from a list. (SPOILER: Did you guess that you could use negative indexing with .pop()?)

The .sort() and .reverse() methods do much the same job as the built-in functions sorted() and reversed(). The difference is that the methods don't evaluate to anything; instead, they change the list in place.

>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.sort()
>>> ingredients
['eggs', 'flour', 'milk', 'sugar']
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> ingredients.reverse()
>>> ingredients
['sugar', 'eggs', 'milk', 'flour']
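To see the difference side by side, here is a small sketch (the variable names are just for illustration). sorted() leaves the original list alone and hands back a new one; .sort() rearranges the list you already have.

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# sorted() evaluates to a new list; ingredients is unchanged
alphabetical = sorted(ingredients)
print(alphabetical)  # ['eggs', 'flour', 'milk', 'sugar']
print(ingredients)   # ['flour', 'milk', 'eggs', 'sugar']

# .sort() changes ingredients itself and evaluates to None
result = ingredients.sort()
print(ingredients)   # ['eggs', 'flour', 'milk', 'sugar']
print(result)        # None
```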

Lists and randomness

Python's random library provides several helpful functions for performing chance operations on lists. The first is shuffle, which takes a list and randomly shuffles its contents:

>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.shuffle(ingredients)
>>> ingredients
['milk', 'sugar', 'flour', 'eggs']

The second is choice, which returns a single random element from a list.

>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.choice(ingredients)
'milk'

Finally, the sample function returns a list of values, selected at random, from a list. The sample function takes two parameters: the first is a list, and the second is how many items should be in the resulting list of randomly selected values:

>>> import random
>>> ingredients = ["flour", "milk", "eggs", "sugar"]
>>> random.sample(ingredients, 2)
['eggs', 'milk']
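Since the results are random, your output will differ from run to run, but some things always hold. The sketch below checks a few of them: choice and sample leave the original list alone, while shuffle rearranges it in place without adding or dropping anything. (The names one and two are just for this example.)

```python
import random

ingredients = ["flour", "milk", "eggs", "sugar"]

one = random.choice(ingredients)     # a single item from the list
two = random.sample(ingredients, 2)  # a new list of two items

# neither call changed the original list
print(ingredients)  # ['flour', 'milk', 'eggs', 'sugar']

# shuffle changes the list itself: the same items, in some order
random.shuffle(ingredients)
print(len(ingredients))  # 4
```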

Making programs with lists, part one

At last, we now know enough about lists to start doing interesting creative stuff with them. But first, let's talk about what those programs will look like.

The first kind of program we'll write using lists will look like this, from a schematic perspective:

  1. Make an empty list
  2. Loop over every line in input, adding lines to the list as we go
  3. After we've processed all of the input, perform some operation on the list and display the result.

So: let's write a simple program that reads in some lines, then prints out a random line from all of the lines in the input.

import sys
import random

all_lines = list() # create an empty list

# use our stdin loop to collect lines into a list---but don't print them!
for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)

# after all the lines have been collected, print one out at random.
print random.choice(all_lines)
Program: random_line.py

Now we'll run the program, using "Sea Rose" as an input:

$ python random_line.py <sea_rose.txt
hardened in a leaf?

EXERCISE: Modify the program above so that it only chooses randomly from lines that aren't empty. (Hint: You may need to be selective about which lines you add to the list.)

Let's modify this program to print out not one randomly selected line from our input, but three:

import sys
import random

all_lines = list() # create an empty list

for line in sys.stdin:
    line = line.strip()
    all_lines.append(line)

# use random.sample() to get three lines
selected = random.sample(all_lines, 3)

# print out each line
print selected[0]
print selected[1]
print selected[2]
Program: three_random_lines.py

Now we'll run the program, using "Sea Rose" as an input:

$ python three_random_lines.py <sea_rose.txt

you are caught in the drift.
marred and with stint of petals,

Note that we needed to write print selected[0], print selected[1], etc. If we'd just written print selected, we would have gotten something like this:

['you are lifted', 'Rose, harsh rose,', 'that drives in the wind.']

... as our output, which isn't what we want! Remember that the print statement, if you give it something other than a string, will print out a "representation" of that value---basically, Python's best guess about how you'd like to see that value displayed. You probably don't want your poem to have brackets and commas in it (or maybe you do?), so make sure that you always tell Python to print strings, not different kinds of values. (Unless you're debugging, of course, in which case Python's attempts to make values into strings for you can be very useful.)

EXERCISE: Try running three_random_lines.py above with an input of fewer than three lines. What happens? Make a program that checks to make sure that at least three lines have been gathered from input, and fails gracefully otherwise. (You can decide what "fails gracefully" means---maybe it prints an error message? Or maybe it uses smaller values for random.sample() depending on how many lines are in the input?)

Looping over lists

You may have noticed in the above example that we wrote an expression for every index of the selected list that we wanted to print out (i.e., selected[0], selected[1], selected[2]). This works fine for a small number of items! But there are several problems here.

  1. It's kind of verbose---we're doing exactly the same thing multiple times, only with slightly different expressions. Surely there's an easier way to tell the computer to do this?
  2. It doesn't scale. What if we wrote a program that we want to produce hundreds or thousands of lines? Would we really need to write a print statement for each of those expressions?
  3. It's brittle. If we wanted to change our program so that we printed out four lines sampled at random, or two, we'd have to change not just the number in random.sample() but also the number of print statements. It's easy to mess this up.
  4. It requires us to know how many items are going to end up in the list to begin with.

Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. So common that Python has some built-in syntax to make the task easy: the for loop.

Here's how a for loop looks:

for temp variable name in expression that evaluates to list:
    one or more statements

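The words for and in just have to be there---that's how Python knows it's a for loop. Here's what each of the other parts means:

  1. temp variable name: a variable name of your choosing. Each time through the loop, Python assigns the next item from the list to this variable.
  2. expression that evaluates to list: any expression that evaluates to a list (often just a variable that holds one).
  3. one or more statements: the indented statements that Python runs once for each item in the list.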

Here's a for loop, next to the same code if you tried to write it without using loops:

With loop

numbers = [1, 2, 3, 4, 5]
for item in numbers:
  print item * item

Without loop

numbers = [1, 2, 3, 4, 5]
item = numbers[0]
print item * item
item = numbers[1]
print item * item
item = numbers[2]
print item * item
item = numbers[3]
print item * item
item = numbers[4]
print item * item

As you can see, the solution with the loop is much more succinct! It's also more powerful: the same code will work even if you add more elements to the list later, or if the list itself is made when the program runs (e.g., you make the list from lines of text being passed into the program).
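A loop body can also build up a result as it goes. As a sketch, here's a loop that adds up the squares of the numbers, using a variable (total, a name chosen for this example) that starts at zero and grows each time through:

```python
numbers = [1, 2, 3, 4, 5]

total = 0  # running total, updated inside the loop
for item in numbers:
    total = total + item * item

# the loop body ran five times, once per item
print(total)  # 55
```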

With our knowledge of for loops firmly established, we can now easily write a program that reads in an entire file, and then prints out the lines of that file in random order:

import sys
import random

# create an empty list
all_lines = list()

# add each line of input to that list
for line in sys.stdin:
  line = line.strip()
  all_lines.append(line)

# shuffle the lines randomly
random.shuffle(all_lines)

# now, print each item in the shuffled list
for random_line in all_lines:
  print random_line
Program: randomize_lines.py

EXERCISE: Write a program that prints out 10% of the lines of the original text, sampled randomly. (I.e., if the original input contains 100 lines, it would output 10 lines selected at random.)

Split: Making lists from strings

A powerful thing we can do with strings is "split" them up into lists. One easy way to do this is by passing a string to the list() built-in function:

>>> chars = list("hello there")
>>> chars
['h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']

This merely breaks the string up into individual characters, each of which ends up as an individual item in the list.

What if we want bigger, more meaningful chunks, not just individual characters? For that, Python gives us the .split() method. The .split() method takes a string to its left (before the .) and a string inside of its parentheses. It returns a list of strings, carved out of the string on the left, "split" into pieces by breaking it wherever it finds the string on the right. For example, to break a US phone number into its constituent parts:

>>> "212-555-1212".split("-")
['212', '555', '1212']

In the above example, the .split() method "splits" the string 212-555-1212 into a list containing three strings: 212, 555, and 1212.
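The separator doesn't have to be a single character, and it never shows up in the resulting list. A small sketch, using a made-up shopping list:

```python
shopping = "flour, milk, eggs, sugar"

# split on comma-plus-space; the separator itself is discarded
items = shopping.split(", ")
print(items)       # ['flour', 'milk', 'eggs', 'sugar']
print(len(items))  # 4

# if the separator never occurs, you get a one-item list
whole = shopping.split(";")
print(whole)       # ['flour, milk, eggs, sugar']
```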

The .split() method is the easiest way for our programs to start working with words, not just individual characters or entire lines. The easiest way to break a line into words is by calling .split() with a string containing a single space as its parameter. So:

>>> sentence = "Now is the winter of our discontent."
>>> words = sentence.split(" ")
>>> words
['Now', 'is', 'the', 'winter', 'of', 'our', 'discontent.']

Nice! Let's use this functionality to write a program that counts how many words there are in an entire text file:

import sys

# create an integer variable to accumulate
# word count
word_count = 0

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    line_word_count = len(words)
    word_count = word_count + line_word_count

# print the total number of words
print word_count
Program: word_count.py

In the above program, we use the .split() method to split the current line into words, then get the number of words by passing the list to the len() function. Run this program with "Sea Rose" as input:

$ python word_count.py <sea_rose.txt
68

EXERCISE: Compare the output of the above program to the output of the UNIX command line utility wc -w. Is the number different? Why? Can you put debug statements in the program above to diagnose why the two programs might have different outputs?

EXERCISE 2: Write a Python program that prints out the last word of every line of input. (Make sure to check to see if there are any words on the line before trying to get the last word!)

The list that you get from calling the .split() method is just like any other list! And like any list, you can iterate over it using a for loop. Here's another program---it reads in some input, then outputs five words, randomly chosen from the entire text:

import sys
import random

all_words = list()

for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    for individual_word in words:
        all_words.append(individual_word)

random_words = random.sample(all_words, 5)
for word in random_words:
    print word
Program: five_random_words.py

Here's the output:

$ python five_random_words.py <sea_rose.txt
on
drift.
leaf?
a
of

EXERCISE: Modify the example above to use the list method .extend() in place of the for loop on line 9.

ANOTHER EXERCISE: Modify the example above to include only words with five or more characters in the pool for potentially selected words. (HINT: Use an if statement in the inner for loop.)

Join: Making strings from lists

Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:

>>> element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
>>> glue = ", and "
>>> glue.join(element_list)
'hydrogen, and helium, and lithium, and beryllium, and boron'

The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:

>>> words = ["this", "is", "a", "test"]
>>> " ".join(words)
'this is a test'
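One thing to watch for: .join() only works on a list of strings. If your list holds numbers, convert each one with the built-in str() function first. A minimal sketch, building the list of strings with a for loop (the variable names are just for illustration):

```python
numbers = [5, 10, 15]

# ", ".join(numbers) would fail with a TypeError,
# because the items are integers, not strings
as_strings = []
for n in numbers:
    as_strings.append(str(n))

joined = ", ".join(as_strings)
print(joined)  # 5, 10, 15
```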

When we're working with .split() and .join(), our workflow usually looks something like this:

  1. Split a string to get a list of units (usually words).
  2. Use some of the list operations discussed above to modify or slice the list.
  3. Join that list back together into a string.
  4. Do something with that string (e.g., print it out).
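Those four steps can be sketched in a few lines. This example reverses the order of the words in a sentence; the variable names are just for illustration.

```python
sentence = "Now is the winter of our discontent."

words = sentence.split(" ")   # 1. split into a list of words
words.reverse()               # 2. modify the list
output = " ".join(words)      # 3. join it back into a string
print(output)                 # 4. do something with the string
# prints: discontent. our of winter the is Now
```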

With this in mind, here's a program that splits each line of input into a list, randomizes the order of that list, then prints out the results:

import sys
import random

for line in sys.stdin:
  line = line.strip()
  words = line.split(" ")
  random.shuffle(words)
  output = " ".join(words)
  print output
Program: randomize_words.py

Run this program with "Sea Rose" and you get...

$ python randomize_words.py <sea_rose.txt
rose, harsh Rose,
petals, with stint of marred and
meagre flower, thin,
of spare leaf,

precious more
a rose wet than
a -- on single stem
you the in drift. caught are

with small leaf, Stunted,
you on sand, are flung the
lifted you are
sand crisp in the
wind. in that the drives

spice-rose the Can
acrid drip fragrance such
leaf? hardened a in

EXERCISE: Use UNIX pipes to combine randomize_words.py and randomize_lines.py (i.e., you should end up with all of the lines of an input in random order, with all of the words on each line in random order.)

EXERCISE 2: Write a Python program that prints out the last three words of each line of input. (Use .split() to split each line into words, and then .join() to join a slice of that list back into a string.)

Conclusion

Whew! That was a lot of stuff to absorb. Here's some more information and reading: