Natural Language with TextBlob

TextBlob is a library for natural language processing, or NLP. Natural language processing techniques can give us access to units and aspects of the language that underlie the text (like sentences, parts of speech, sentiment, etc.).

Of course, a computer can't fully "understand" human language, so NLP techniques are always somewhat inaccurate. But often even inaccurate results are "good enough."

I've pre-installed the TextBlob library on the sandbox server. If you want to use it locally on your own computer, come see me for help!

Parsing sentences

TextBlob allows you to take some text and get the sentences inside of the text. Here's an example from an interactive interpreter session:

>>> from textblob import TextBlob
>>> blob = TextBlob(open("poe.txt").read())
>>> for item in blob.sentences:
...     print(item.replace('\n', ' '))
The "Red Death" had long devastated the country.
No pestilence had ever been so fatal, or so hideous.
Blood was its Avatar and its seal--the redness and the horror of blood.
There were sharp pains, and sudden dizziness, and then profuse bleeding at the pores, with dissolution.
The scarlet stains upon the body and especially upon the face of the victim, were the pest ban which shut him out from the aid and from the sympathy of his fellow-men.
And the whole seizure, progress and termination of the disease, were the incidents of half an hour.
But the Prince Prospero was happy and dauntless and sagacious.
When his dominions were half depopulated, he summoned to his presence a thousand hale and light-hearted friends from among the knights and dames of his court, and with these retired to the deep seclusion of one of his castellated abbeys.
This was an extensive and magnificent structure, the creation of the prince's own eccentric yet august taste.
A strong and lofty wall girdled it in.
This wall had gates of iron.
The courtiers, having entered, brought furnaces and massy hammers and welded the bolts.
They resolved to leave means neither of ingress nor egress to the sudden impulses of despair or of frenzy from within.
The abbey was amply provisioned.
With such precautions the courtiers might bid defiance to contagion.
The external world could take care of itself.
In the meantime it was folly to grieve, or to think.
The prince had provided all the appliances of pleasure.
There were buffoons, there were improvisatori, there were ballet-dancers, there were musicians, there was Beauty, there was wine.
All these and security were within.
Without was the "Red Death".
It was towards the close of the fifth or sixth month of his seclusion, and while the pestilence raged most furiously abroad, that the Prince Prospero entertained his thousand friends at a masked ball of the most unusual magnificence. 

Here's how the above example works. First, we import the TextBlob class from the textblob library with this line:

from textblob import TextBlob

(The from module import thing syntax used above makes a single item from the named module available by name. If you use this syntax, you don't have to type the name of the module every time you want to reference the thing you've imported. Another example: if you wanted to import just the choice() function from the random module, you could write: from random import choice. Then, when you wanted to use the function, you could just type choice() instead of having to type random.choice().)
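
To make the difference concrete, here is a small sketch comparing the two import styles using the random module mentioned above. The list named colors is made up for illustration and doesn't come from any of the examples in this tutorial.

```python
import random                # import the whole module
from random import choice    # import just one function from it

colors = ["red", "scarlet", "crimson"]  # hypothetical example data

# With a plain import, the module name is required as a prefix.
first_pick = random.choice(colors)

# With "from random import choice", the bare name works on its own.
second_pick = choice(colors)

# Both names refer to the very same function.
print(choice is random.choice)
```

Either style works; the from form just saves some typing when you use the same item many times.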

On the second line above, we create a TextBlob object and pass in a string. I used the open() function to read in the contents of a file, but you can put any expression that evaluates to a string there. We assign the object to a variable named blob.

The blob variable has a number of interesting methods and attributes. The .sentences attribute is a list of sentences in the text. In the third line of the example above, we loop over the list of sentences and print them out.

We need to replace \n with a space character, because even though TextBlob parses sentences from the text, it doesn't remove linebreaks.
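
A quick way to see what that replacement does is to try it on an ordinary Python string. This sketch uses a hand-typed string standing in for a sentence that was wrapped across two lines in a source file (where the line break actually falls in poe.txt is an assumption), and the variable names are made up for illustration.

```python
# A sentence as it might appear in a text file, wrapped over two lines.
wrapped = "A strong and lofty wall\ngirdled it in."

# replace() returns a new string with every newline swapped for a space.
one_line = wrapped.replace("\n", " ")

print(one_line)
```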

Parsing words

>>> from textblob import TextBlob
>>> blob = TextBlob(open("sea_rose.txt").read())
>>> for word in blob.words:
...     print(word)
Rose
harsh
rose
marred
and
with
stint
of
petals
meagre
flower
thin
spare
of
leaf
more
precious
than
a
wet
rose
single
on
a
stem
you
are
caught
in
the
drift
Stunted
with
small
leaf
you
are
flung
on
the
sand
you
are
lifted
in
the
crisp
sand
that
drives
in
the
wind
Can
the
spice-rose
drip
such
acrid
fragrance
hardened
in
a
leaf

This example demonstrates the .words attribute of TextBlob objects: it parses individual words from the text, taking into account punctuation (and not including the punctuation in the words).

Sentiment

TextBlob can calculate the "sentiment" of a sentence. "Sentiment" is a measurement of the emotional content of the sentence: the number is positive (between 0 and 1) if the sentence says something "good" and negative (between 0 and -1) if the sentence says something "bad."

You can access the sentiment of a sentence in TextBlob by looping over the .sentences attribute of a TextBlob object, then checking the .sentiment.polarity attribute of each item in the loop. The following example prints only those sentences from poe.txt that have a positive sentiment (according to TextBlob):

>>> from textblob import TextBlob
>>> blob = TextBlob(open("poe.txt").read())
>>> for item in blob.sentences:
...     if item.sentiment.polarity > 0:
...         print(item.replace('\n', ' '))
And the whole seizure, progress and termination of the disease, were the incidents of half an hour.
But the Prince Prospero was happy and dauntless and sagacious.
When his dominions were half depopulated, he summoned to his presence a thousand hale and light-hearted friends from among the knights and dames of his court, and with these retired to the deep seclusion of one of his castellated abbeys.
This was an extensive and magnificent structure, the creation of the prince's own eccentric yet august taste.
A strong and lofty wall girdled it in.
It was towards the close of the fifth or sixth month of his seclusion, and while the pestilence raged most furiously abroad, that the Prince Prospero entertained his thousand friends at a masked ball of the most unusual magnificence. 

And the following example prints only those sentences from poe.txt that have a negative sentiment:

>>> from textblob import TextBlob
>>> blob = TextBlob(open("poe.txt").read())
>>> for item in blob.sentences:
...     if item.sentiment.polarity < 0:
...         print(item.replace('\n', ' '))
The "Red Death" had long devastated the country.
There were sharp pains, and sudden dizziness, and then profuse bleeding at the pores, with dissolution.
The scarlet stains upon the body and especially upon the face of the victim, were the pest ban which shut him out from the aid and from the sympathy of his fellow-men.
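
The two loops above differ only in the comparison they apply to the polarity number. As a rough sketch of that idea, the helper below (label_polarity is a made-up name, not part of TextBlob) sorts a polarity score into one of three buckets. The scores passed in at the bottom are invented for illustration; in a real program you would pass in item.sentiment.polarity instead.

```python
def label_polarity(score):
    """Turn a polarity score between -1 and 1 into a rough label."""
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        # A score of exactly 0 passes neither the > 0 nor the < 0 test.
        return "neutral"

# Invented scores, just to exercise each branch.
for score in [0.8, -0.25, 0.0]:
    print(score, label_polarity(score))
```

The third bucket is worth noticing: a sentence whose polarity is exactly 0 shows up in neither the "positive" nor the "negative" listing.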

Getting noun phrases

A "noun phrase" is a kind of phrase you find in a sentence. It consists of a noun and all of that noun's "surrounding matter," such as any adjectives that modify the noun. TextBlob makes it very easy to extract noun phrases from a given text, using its .noun_phrases attribute:

>>> from textblob import TextBlob
>>> blob = TextBlob(open("poe.txt").read())
>>> for item in blob.noun_phrases:
...     print(item)
death
blood
avatar
sharp pains
sudden dizziness
scarlet stains
pest ban
whole seizure
prospero
deep seclusion
magnificent structure
prince 's
own eccentric
august taste
lofty wall
massy hammers
sudden impulses
such precautions
bid defiance
external world
death
prospero
unusual magnificence

Here we're looping over the noun phrases and printing them out.

Parts of speech

TextBlob can also tell us what part of speech each word in a text corresponds to. It can tell us if a word in a sentence is functioning as a noun, an adjective, a verb, etc. In NLP, associating a word with a part of speech is called "tagging." Correspondingly, the attribute of the TextBlob object we'll use to access this information is .tags.

>>> from textblob import TextBlob
>>> blob = TextBlob("I have a lovely bunch of coconuts.")
>>> for word, pos in blob.tags:
...     print(word, pos)
I PRP
have VBP
a DT
lovely JJ
bunch NN
of IN
coconuts NNS

This for loop is a little weird, because it has two temporary loop variables instead of one. (The underlying reason for this is that .tags evaluates to a list of two-item tuples, which we can automatically unpack by specifying two items in the for loop. Don't worry about this if it doesn't make sense. Just know that when we're using the .tags attribute, you need two loop variables instead of one.) The first variable, which we've called word here, contains the word; the second variable, called pos here, contains the part of speech.
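
If the two-variable loop still seems mysterious, it may help to see it with a plain Python list. The sketch below builds by hand the same kind of list of two-item tuples that .tags evaluates to (the pairs are copied from the output above), so it runs without TextBlob. The variable names tagged and nouns are just for illustration.

```python
# A hand-made stand-in for blob.tags: a list of (word, tag) tuples.
tagged = [("I", "PRP"), ("have", "VBP"), ("a", "DT"), ("lovely", "JJ"),
          ("bunch", "NN"), ("of", "IN"), ("coconuts", "NNS")]

# One loop variable: each item is a whole tuple, indexed by position.
for pair in tagged:
    print(pair[0], pair[1])

# Two loop variables: Python unpacks each tuple for us.
nouns = []
for word, pos in tagged:
    if pos.startswith("NN"):   # NN and NNS are both noun tags
        nouns.append(word)

print(nouns)
```

Both loops visit the same items; the second form just gives each half of the tuple its own name.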

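What the tags mean: the tags are abbreviations from the Penn Treebank tag set. In the output above, PRP is a personal pronoun, VBP is a present-tense verb (not third-person singular), DT is a determiner, JJ is an adjective, NN is a singular noun, IN is a preposition, and NNS is a plural noun.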

Pluralization

TextBlob's Word class has a .pluralize() method that returns the plural form of a word:

>>> from textblob import Word
>>> w = Word("university")
>>> print(w.pluralize())
universities

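The .lemmatize() method returns the word's base form, with inflectional morphology (suffixes, etc.) removed. By default it treats the word as a noun, which is why "running" comes back unchanged in the example below; passing a part of speech, as in w.lemmatize("v") for a verb, changes how the word is treated.

>>> from textblob import Word
>>> w = Word("running")
>>> print(w.lemmatize())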
running

Examples

from textblob import TextBlob
import random
import sys

# stdin's read() method reads in all of standard input as a single string
text = sys.stdin.read()
blob = TextBlob(text)

# keep only the sentences that are five words long or shorter
short_sentences = list()
for sentence in blob.sentences:
    if len(sentence.words) <= 5:
        short_sentences.append(sentence.replace("\n", " "))

# print up to ten of them at random; sample() raises an error if asked
# for more items than the list holds, so cap the count at the list's length
for item in random.sample(short_sentences, min(10, len(short_sentences))):
    print(item)

Program: hemingwayize.py
$ python hemingwayize.py < austen.txt
How will a conundrum reckon?"
Could there be finer symptoms?
what do you mean?"
replied Elinor.
"Oh!
Adopt her, educate her."
but what shall you do?
cried Harriet, colouring, and astonished.
"I had none.
who can require it?"
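
One thing to watch for in a program like this: random.sample() raises a ValueError if you ask for more items than the list contains, which will happen with any text that has fewer than ten short sentences. Here is a minimal sketch of one way around that, using a hypothetical helper named sample_up_to and a made-up list of short sentences:

```python
import random

def sample_up_to(items, count):
    """Return `count` random items, or all of them if there are fewer."""
    return random.sample(items, min(count, len(items)))

# Only three short "sentences" here, so asking for ten returns all three
# (in random order) instead of raising an error.
short_sentences = ["Oh!", "I had none.", "Could there be finer symptoms?"]
picked = sample_up_to(short_sentences, 10)
print(len(picked))
```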

Turn any text into a list of instructions

from textblob import TextBlob
import sys
import random

# read all of standard input as a single string
text = sys.stdin.read()
blob = TextBlob(text)

noun_phrases = blob.noun_phrases

# collect every word tagged VB (a verb in its base form)
verbs = list()
for word, tag in blob.tags:
    if tag == 'VB':
        verbs.append(word.lemmatize())

# build ten steps, each pairing a random verb with a random noun phrase
for i in range(1, 11):
    print("Step " + str(i) + ". " + random.choice(verbs).title() + " " +
          random.choice(noun_phrases))

Program: instructify.py
$ python instructify.py < poe.txt
Step 1. Take prince 's
Step 2. Leave lofty wall
Step 3. Leave prospero
Step 4. Take pest ban
Step 5. Close deep seclusion
Step 6. Take massy hammers
Step 7. Take external world
Step 8. Leave sudden dizziness
Step 9. Leave sudden impulses
Step 10. Close pest ban

Create a poor summary of a text

from textblob import TextBlob, Word
import sys
import random

# read all of standard input as a single string
text = sys.stdin.read()
blob = TextBlob(text)

# collect every word tagged NN (a singular noun)
nouns = list()
for word, tag in blob.tags:
    if tag == 'NN':
        nouns.append(word.lemmatize())

print("This text is about...")
# pick up to five nouns at random and print their plural forms
for item in random.sample(nouns, min(5, len(nouns))):
    word = Word(item)
    print(word.pluralize())

Program: summarize_poorly.py
$ python summarize_poorly.py < poe.txt
This text is about...
ingress
bodies
halves
victims
walls