Text mashups: working with multiple files

So far, all of our programs have operated on one source of input: standard input from the UNIX command-line. In this lesson, we'll learn how to use Python to read text from multiple files in the same program, allowing us to mash up text from more than one source.

Open sesame

The key to working with multiple files is Python's open() function. The open() function, in its simplest form, looks like this:

open(file_name_str)

... where file_name_str is some expression that evaluates to a string that names the file that you want to open. (I.e., if you wanted to work with text in a file called foo.txt, you would write open("foo.txt").)

The open() function evaluates to a value of type file. You can read more about what you can do with file values if you'd like, but in this lesson I'm going to show you a few patterns to use with open() that you can just drop into your own code.
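
One thing the patterns in this lesson leave out is closing the file when you're done with it. That's usually harmless in a short script, but here is a minimal sketch of a tidier pattern using Python's with statement. The file name and its contents are made up for illustration (the sketch writes its own temporary file so it can run anywhere), and it uses print() with parentheses so that it also runs under Python 3.

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

# open() hands back a file object; "with" closes it when the block ends.
with open(path) as f:
    contents = f.read()

print(contents)
print(f.closed)  # the file object still exists, but it has been closed
```

The shorter open("foo.txt") patterns used in the rest of this lesson work the same way for reading; they just leave the closing to Python.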

Iterating over open()

The first thing you can do with open() is iterate over it in a for loop, the same way you would iterate over a list. Here's an example, which does the same work as a normal standard-input program, but using an explicitly named file instead:

for line in open("sea_rose.txt"):
    line = line.strip()
    print line
Program: open_file.py

Run this program (making sure you have a file named sea_rose.txt in the same directory) and you'll get the following output:

$ python open_file.py
Rose, harsh rose,
marred and with stint of petals,
meagre flower, thin,
spare of leaf,

more precious
than a wet rose
single on a stem --
you are caught in the drift.

Stunted, with small leaf,
you are flung on the sand,
you are lifted
in the crisp sand
that drives in the wind.

Can the spice-rose
drip such acrid fragrance
hardened in a leaf?

Notice one thing about the command line: we didn't redirect any input (i.e., there's no <file.txt). That's because the script doesn't include our standard sys.stdin loop; all of the text is read in by the open() function instead.

We can include as many calls to open() in our program as we'd like. And, of course, we can use the for loop to do things other than just print! Here's a program that reads in two files, frost.txt and sea_rose.txt, and performs an unusual juxtaposition:

import random

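# collect the non-empty lines of sea_rose.txt
rose_lines = list()
for line in open('sea_rose.txt'):
    line = line.strip()
    if len(line) > 0:
        rose_lines.append(line)

# collect the non-empty lines of frost.txt
frost_lines = list()
for line in open('frost.txt'):
    line = line.strip()
    if len(line) > 0:
        frost_lines.append(line)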

for i in range(10):
    random_rose = random.choice(rose_lines)
    random_frost = random.choice(frost_lines)
    print random_rose[:len(random_rose)/2] + random_frost[len(random_frost)/2:]
Program: halfsies.py

Here's the output:

$ python halfsies.py
in the cly about the same,
you are flungsy and wanted wear;
in the cr, as just as fair,
hardened  the better claim,
hardened e as far as I could
you areall the difference.
hardened n the undergrowth;
Can the sall the difference.
hardened  the passing there
marred and with n the undergrowth;

This program reads in all of the lines from two files (sea_rose.txt and frost.txt), and puts the lines into separate lists (rose_lines and frost_lines, respectively). It then executes some code at the end of the program ten times: choosing a random line from Sea Rose, a random line from Frost, and then printing out half of the Sea Rose line next to half of the Frost line.
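
The "half of a line" trick is just string slicing with the midpoint as the boundary. Here's a small sketch of that step on its own. The helper names first_half and second_half are made up for this example, and it uses // (integer division) rather than /, on the assumption that you may want the same code to run under Python 3, where / would produce a float that can't be used as a slice index.

```python
def first_half(s):
    # // is integer division in both Python 2 and Python 3
    return s[:len(s) // 2]

def second_half(s):
    return s[len(s) // 2:]

rose = "you are flung on the sand,"
frost = "And that has made all the difference."

# first half of one line glued to the second half of the other
mashed = first_half(rose) + second_half(frost)
print(mashed)
```

Because both helpers cut at the same index, first_half(s) + second_half(s) always gives back the original string.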

EXERCISE: Write a version of halfsies.py that prints out half of the words from the randomly selected line from Sea Rose, followed by half of the words from the randomly selected line from Frost.

Read all contents

The open() function also allows us to slurp up all of a file at once into a big string. Here's how to do it:

open(file_name_str).read()

... where file_name_str is some expression that evaluates to a string that names a file. The entire expression above evaluates to a string. Let's test it out in the interactive interpreter:

>>> open('sea_rose.txt').read()
'Rose, harsh rose, \nmarred and with stint of petals, \nmeagre flower, thin, \nspare of leaf,\n\nmore precious \nthan a wet rose \nsingle on a stem -- \nyou are caught in the drift.\n\nStunted, with small leaf, \nyou are flung on the sand, \nyou are lifted \nin the crisp sand \nthat drives in the wind.\n\nCan the spice-rose \ndrip such acrid fragrance \nhardened in a leaf?\n'
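
To see how the two patterns relate, here's a rough sketch that reads the same file both ways. It writes a small temporary file of its own (the path and contents are stand-ins, not one of the lesson's files) so that it runs without any setup.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with open(path, "w") as out:
    out.write("Rose, harsh rose,\nmarred and with stint of petals,\n")

with open(path) as f:
    lines = [line for line in f]   # each item keeps its trailing "\n"

with open(path) as f:
    whole = f.read()               # one string, newlines included

print(len(lines))
print("".join(lines) == whole)     # the lines glued together are the whole file
print(whole.split("\n"))           # note the empty string after the last "\n"
```

Iterating gives you the file in line-sized pieces; .read() gives you the same characters in a single piece.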

What can we do with the entire file in one big string? Well, we can grab big chunks of it for one thing, and make a kind of glitchy mashup of two different files:

import random

# read file contents into strings
sea_rose = open('sea_rose.txt').read()
frost = open('frost.txt').read()

for i in range(10):
    rose_start = random.randrange(len(sea_rose))
    rose_length = random.randrange(8, 20)
    rose_fragment = sea_rose[rose_start:rose_start+rose_length]

    frost_start = random.randrange(len(frost))
    frost_length = random.randrange(8, 20)
    frost_fragment = frost[frost_start:frost_start+frost_length]

    print rose_fragment + frost_fragment

Program: glitch.py

Here's the output:

$ python glitch.py
 stem -- 
you are cby,
And that has m

you areelled by,
you  I---
I took th
meagre flot travel both
drip se other, as

marred agh
 such acrid fragrawing how way le
drip such acd down one 

you are er come bac
are of leafdifference.
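
Some of the fragments in the output above are shorter than eight characters. One likely reason: a slice that runs past the end of a string doesn't raise an error in Python, it just stops at the end. A quick sketch, using one line of the poem as a stand-in for the whole file:

```python
text = "hardened in a leaf?"
start = len(text) - 5

# asks for 12 characters, but only 5 remain after start
fragment = text[start:start + 12]
print(fragment)
print(len(fragment))
```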

Reading in the contents of a file as a string also allows us to easily extract all of the words from the text. We can use that property to write a program that produces output that contains words from two different files:

import random

words = list()

sea_rose = open('sea_rose.txt').read()
frost = open('frost.txt').read()

# add every word from both texts to one list
for item in sea_rose.split():
    words.append(item)

for item in frost.split():
    words.append(item)

for i in range(10):
    num_words_this_line = random.randrange(1, 8)
    words_this_line = random.sample(words, num_words_this_line)
    print ' '.join(words_this_line)
Program: word_mashup.py

And the output:

$ python word_mashup.py
it had drip Then
the and
the there bent come
I--- first
a Rose, and And really
travel bent
acrid in And Yet
about sand, I --
as in
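
A note on random.sample(), which the program uses to pick each line's words: it chooses items without putting them back, so a single call never returns the same list position twice, and it raises a ValueError if you ask for more items than the list holds. Here's a small sketch; the word list is a made-up stand-in for the combined list in word_mashup.py.

```python
import random

words = "drip such acrid fragrance".split()  # hypothetical short word list

picked = random.sample(words, 3)   # three different positions from the list
print(" ".join(picked))

# asking for more items than the list holds is an error
try:
    random.sample(words, 10)
    too_many = False
except ValueError:
    too_many = True
print(too_many)
```

With two whole poems in the list this limit isn't a problem, but it's worth keeping in mind if you try the program on very short files.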

Split with no parameters

The program above used the .split() method in a new way---we didn't pass a string inside the parentheses. It turns out that .split(), when used without any parameters, does something interesting: it splits the string up on any whitespace (space characters, tabs, newlines). This is more versatile than splitting on a single space, especially when we're working with big strings that have newline characters in them.

Here's the difference between .split(" ") and .split(), illustrated in the interactive interpreter. First, we'll make a string with a bunch of weird whitespace in it:

>>> original = "This is\na test\n\ta very lovely test"
>>> print original
This is
a test
    a very lovely test

What does it look like when we split on " "?

>>> original = "This is\na test\n\ta very lovely test"
>>> print original.split(" ")
['This', 'is\na', 'test\n\ta', 'very', 'lovely', 'test']

It treats sequences like is\na as a single unit---not ideal! If we use .split() with no parameters instead:

>>> original = "This is\na test\n\ta very lovely test"
>>> print original.split()
['This', 'is', 'a', 'test', 'a', 'very', 'lovely', 'test']

Much better!
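
There's one more difference worth knowing about: .split() with no parameters also ignores leading and trailing whitespace and treats a run of several whitespace characters as one separator, while .split(" ") produces an empty string for every extra space. A short sketch, with a made-up test string:

```python
messy = "  two  spaces \n"

on_space = messy.split(" ")   # every single space is a separator
on_any = messy.split()        # runs of whitespace count once

print(on_space)
print(on_any)
```

The first list comes out as ['', '', 'two', '', 'spaces', '\n'] and the second as ['two', 'spaces'].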

Another example: replacing words

The program below reads in one file (poe.txt) and creates a list from its words. It then reads in a second file (frost.txt) and iterates over it line by line, replacing a randomly chosen word in the line with another word from poe.txt.

import random

poe_string = open("poe.txt").read()
poe_words = poe_string.split()

for line in open("frost.txt"):
    line = line.strip()
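    if len(line) == 0:
        # blank line between stanzas: print it unchanged
        print line
    else:
        line_words = line.split()
        random_poe_word = random.choice(poe_words)
        random_frost_word = random.choice(line_words)
        # swap the chosen Frost word for the chosen Poe word
        line = line.replace(random_frost_word, random_poe_word)
        print line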
Program: replacer.py

Here's what it looks like when you run it:

$ python replacer.py
appliances roads diverged in a yellow wood,
And dissolution. I could not travel both
ingress be one traveler, long I stood
And looked the one as far as I could
To where it bent of the undergrowth;

Then took the and as just as fair,
And having perhaps the better out
Because it was grassy and such wear;
Though as the that the passing there
Had worn shutm really about shut same,

And both that courtiers, equally lay
In leaves no step had girdled black.
his I kept the first for another day!
Yet knowing how way leads on to But
I doubted if I should blood. come back.

I shall sympathy telling this with a sigh
Somewhere ages A ages hence:
Two roads diverged abroad, a wood, and I---
I took the close less travelled by,
And The has made all the difference.
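
You may have spotted the odd word "shutm" in the output above. That looks like a side effect of .replace(), which swaps every occurrence of a string, including occurrences inside longer words. Here's a sketch that reproduces the effect, assuming the randomly chosen words on that run were "the" (from Frost) and "shut" (from Poe):

```python
line = "Had worn them really about the same,"

# "the" occurs twice: once on its own and once inside "them"
print(line.count("the"))

replaced = line.replace("the", "shut")
print(replaced)
```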

EXERCISE: Rewrite the program so that it uses something other than the .replace() method, and replaces random words rather than matching strings. (Hint: you'll need to .split() the line from frost.txt into words.)

open() vs. sys.stdin

So open() is pretty rad! Why not use it ALL the time, instead of bothering with sys.stdin? It's really a trade-off: open() allows you the flexibility of being able to work with multiple sources of input, but doesn't interoperate well with other programs. On the other hand, sys.stdin limits you to one source of input, but that source of input can be anything: a file (using redirection), or the output of another UNIX program (using pipes).
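
One way to soften the trade-off is to put your text processing in a function that accepts any sequence of lines, so the same code works whichever source you pick. The sketch below is only an illustration: the function name shout is invented, and io.StringIO stands in for a real file or for sys.stdin so that the example runs without any input.

```python
import io

def shout(lines):
    """Return the stripped, upper-cased version of each line."""
    return [line.strip().upper() for line in lines]

# Stand-in for a file or for sys.stdin, so the sketch runs without input.
fake_input = io.StringIO(u"spare of leaf,\nmore precious\n")
result = shout(fake_input)
for line in result:
    print(line)

# The same function would accept either source:
#   shout(open("sea_rose.txt"))
#   shout(sys.stdin)
```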

Using both

Occasionally, it can make sense to use both open() and sys.stdin in the same program. Take, for example, this program, which prints out any line from standard input that contains a word from frost.txt that is six or more characters long:

import sys

# read in a string with everything from frost.txt
frost_str = open('frost.txt').read()

# create an empty list
frost_words = []

# iterate over each word in frost_str; check to see if the word is of length
# equal to or greater than 6; add to the list if so
for word in frost_str.split():
    if len(word) >= 6:
        frost_words.append(word)

# loop over every line in stdin
for line in sys.stdin:
    line = line.strip()
    # set found to false on each iteration
    found = False
    # check for each word in frost_words: is it found in the line? if so, set
    # found to True
    for word in frost_words:
        if word in line:
            found = True
    # after all that, if found is True, print the line.
    if found:
        print line
Program: frostify.py

Run this program, using sonnets.txt (for example) as input:

$ python frostify.py <sonnets.txt
But as the riper should by time decease,
Now is the time that face should form another;
For having traffic with thy self alone,
Sap checked with frost, and lusty leaves quite gone,
That's for thy self to breed another thee,
Then what could death do if thou shouldst depart,
And having climb'd the steep-up heavenly hill,
From his low tract, and look another way:
In singleness the parts that thou shouldst bear.
Mark how one string, sweet husband to another,
Resembling sire and child and happy mother,
Which to repair should be thy chief desire.
Make thee another self for love of me,
If all were minded so, the times should cease
Which bounteous gift thou shouldst in bounty cherish:
Thou shouldst print more, not let that copy die.
When lofty trees I see barren of leaves,
Against this coming end you should prepare,
So should that beauty which you hold in lease
When your sweet issue your sweet form should bear.
So should the lines of life that life repair,
Though yet heaven knows it is but as a tomb
So should my papers, yellow'd with their age,
You should live twice,--in it, and in my rhyme.
Then look I death my days should expiate.
Great princes' favourites their fair leaves spread
The dear respose for limbs with travel tir'd;
To march in ranks of better equipage:
But since he died and poets better prove,
Full many a glorious morning have I seen
And make me travel forth without my cloak,
Though thou repent, yet I have still the loss:
Though in our lives a separable spite,
Lest my bewailed guilt should do thee shame,
When thou art all the better part of me?
Both find each other, and I lose both twain,
Injurious distance should not stop my way;
Or heart in love with sighs himself doth smother,
When what I seek, my weary travel's end,
From where thou art why should I haste me thence?
Then should I spur, though mounted on the wind,
Thy edge should blunter be than appetite,
Being your slave what should I do but tend,
Though you do anything, he thinks no ill.
I should in thought control your times of pleasure,
Wh'r we are mended, or wh'r better they,
Is it thy will, thy image should keep open
Dost thou desire my slumbers should be broken,
Hath travell'd on to age's steepy night;
Ah! wherefore with infection should he live,
That sin by him advantage should achieve,
Why should false painting imitate his cheek,
Why should poor beauty indirectly seek
Why should he live, now Nature bankrupt is,
Ere beauty's dead fleece made another gay:
Making no summer of another's green,
Then thou alone kingdoms of hearts shouldst owe.
If thinking on me then should make you woe.
When I perhaps compounded am with clay,
Lest the wise world should look into your moan,
O! lest the world should task you to recite
What merit lived in me, that you should love
And so should you, to love things nothing worth.
When yellow leaves, or none, or few, do hang
My spirit is thine, the better part of me:
Then better'd that the world may see my pleasure:
So is my love still telling what is told.
These vacant leaves thy mind's imprint will bear,
Knowing a better spirit doth use your name,
Though I, once gone, to all the world must die:
Some fresher stamp of the time-bettering days.
In true plain words, by thy true-telling friend;
And their gross painting might be better us'd
Which should example where your equal grew.
Though words come hindmost, holds his rank before.
I was not sick of any fear from thence:
Thy self thou gav'st, thy own worth then not knowing,
Comes home again, on better judgement making.
As I'll myself disgrace; knowing thy will,
Lest I, too much profane, should do it wrong,
All these I better in one general best.
Thy love is better than high birth to me,
And having thee, of all men's pride I boast:
I see a better state to me belongs
That in thy face sweet love should ever dwell;
Thy looks should nothing thence, but sweetness tell.
Though to itself, it only live and die,
That leaves look pale, dreading the winter's near.
One blushing shame, another white despair;
Because he needs no praise, wilt thou be dumb?
Because I would not dull you with my song.
That having such a scope to show her pride,
Three beauteous springs to yellow autumn turn'd,
One thing expressing, leaves out difference.
And for they looked but with divining eyes,
Though absence seem'd my flame to qualify,
Like him that travels, I return again;
These blenches gave my heart another youth,
That did not better for my life provide
My most full flame should afterwards burn clearer.
Wherein I should your great deserts repay,
Which should transport me farthest from your sight.
That better is, by evil still made better;
'Tis better to be vile than vile esteem'd,
For why should others' false adulterate eyes
That every tongue says beauty should look so.
Whilst my poor lips which should that harvest reap,
Had, having, and in quest, to have extreme;
One on another's neck, do witness bear
And truly not the morning sun of heaven
Though in thy store's account I one must be;
Why should my heart think that a several plot,
If I might teach thee wit, better it were,
Though not to love, yet, love to tell me so;--
For, if I should despair, I should grow mad,
Who leaves unsway'd the likeness of a man,
The better angel is a man right fair,
Tempteth my better angel from my side,
I guess one angel in another's hell:
Why so large cost, having so short a lease,
Lest eyes well-seeing thy foul faults should find.
With others thou shouldst not abhor my state:

The program above is tricky! Read it carefully. Here's how to get a handle on the tricky parts.
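
One way to study the trickiest part, the inner loop with the found variable, is to pull it out into a function and try it on a few lines by hand. The sketch below does that. The function name line_matches and the two-word list are invented for the example; the real program builds its list from frost.txt. Notice that "word in line" is a substring test, which would explain why lines containing "shouldst" show up in the output above.

```python
def line_matches(line, words):
    """Return True if any of the words occurs anywhere in the line."""
    found = False
    for word in words:
        if word in line:   # substring test, not a whole-word test
            found = True
    return found

frost_words = ["should", "another"]  # hypothetical short list

print(line_matches("Thou shouldst print more,", frost_words))
print(line_matches("Shall I compare thee", frost_words))

# any() expresses the same check in a single expression
print(any(word in "Make thee another self" for word in frost_words))
```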

EXERCISE: (advanced!) Modify the program above so that, instead of printing every matching line, it prints every instance of a matching string, along with ten characters of surrounding context (i.e., the ten characters before the match, and the ten characters afterward).

Command-line parameters with sys.argv

Many UNIX utilities take arguments on the command line: grep takes a pattern to search for, for example. We can read command-line parameters from Python as well, using the sys.argv list. This list contains all of the parameters passed on the command line, including the name of the script itself.

For example, take the following script, called argv_reader.py:

import sys

for arg in sys.argv:
    print arg
Program: argv_reader.py

The output:

$ python argv_reader.py anteater bonobo cockatoo
argv_reader.py
anteater
bonobo
cockatoo

The element at index 0 is always the name of the Python program. The elements afterward are whatever strings are typed on the command-line. Handy! Here's a version of glitch.py above that reads from two filenames that you can specify on the command-line, instead of being hard-coded in the file itself:

import random
import sys

# read file contents into strings
left_file = open(sys.argv[1]).read()
right_file = open(sys.argv[2]).read()

for i in range(10):
    left_start = random.randrange(len(left_file))
    left_length = random.randrange(8, 20)
    left_fragment = left_file[left_start:left_start+left_length]

    right_start = random.randrange(len(right_file))
    right_length = random.randrange(8, 20)
    right_fragment = right_file[right_start:right_start+right_length]

    print left_fragment + right_fragment
Program: glitch-argv.py

I chose the words "left" and "right" arbitrarily---they don't have a special meaning here. Running the program:

$ python glitch-argv.py frost.txt sonnets.txt
 to way,
I doubty feeding;
In leaves orm happy show
t the same,

Ande unear'd womb
ges and age debarre'd the bene
ar as I coul to his s
at morninat all the 
as for that the paars not polic
equally lay
In may, yet

And both ow,
They liv
s made all the diffh
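
If you run glitch-argv.py without giving it two filenames, Python stops with an IndexError, because sys.argv[1] or sys.argv[2] doesn't exist. Here's one possible way to guard against that, as a sketch only: the helper name pick_filenames and the wording of the usage message are made up for this example.

```python
import sys

def pick_filenames(argv):
    """Return the two filenames from an argv-style list, or None."""
    if len(argv) < 3:
        return None
    return argv[1], argv[2]

names = pick_filenames(sys.argv)
if names is None:
    # not enough parameters: explain how to run the program
    print("usage: python glitch-argv.py LEFT_FILE RIGHT_FILE")
else:
    print("left: " + names[0])
    print("right: " + names[1])
```

Checking len(sys.argv) before indexing lets the program print a helpful message instead of a traceback.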

EXERCISE: Rewrite any of the other examples in this lesson that use open() to use sys.argv instead of a hard-coded filename.