A (Really Brief) Intro to Natural Language Processing

Hacker School: Week 10, Day 2

It’s been far too long since the last post. I promised my check-in group — our daily time for talking about what we’re doing here — every day last week that I would write a blog post. I didn’t. But here it is, in all its glory.

What is Natural Language Processing?

My brief sojourn into working with words — specifically, the Academy Award speeches — sparked an interest in all of the possibilities of combining programming and language. Everything I read and heard about this idea led me back to natural language processing. So, as is my custom, I went straight to good old Wikipedia.

“Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.”

So NLP combines a few things — CS, AI, linguistics — but ultimately concerns how computers work with natural languages (aka our human languages). Don’t worry, it doesn’t take a background in CS, AI, or linguistics to get the basics.

Why Do I Care?

I care because NLP has a huge number of potential applications, many of which are challenging, fascinating, and (eventually) world-changing (if computers could actually understand anything, _anything_, we ask them). But more immediately, I care because NLP is actually tangible and accessible, even for beginners. Much of what I’ve learned in programming is conceptual, or mathematical, or involves a large number of moving pieces. NLP (at its most basic) is taking organized text, a thing I already understand and _can see_, and manipulating and analyzing it in a way that I could talk about with anyone who has read or written text.

For this (again, really brief) intro, I’m going to use a Python library called Natural Language Toolkit (NLTK). I’ll try to assume as little knowledge as possible, but a basic understanding of how to write Python and interact with the terminal will definitely be helpful. I’m essentially following along with the start of this book (or this version if you like Python 3). Just the first section of the Preface gave me a much better understanding of why NLP is important and uniquely challenging. It’s written by the creators of NLTK, and gives a much more thorough and understandable intro to using the NLTK library than I ever could. I’ll use two of their intro examples to show what you could do right away.

Getting Set Up

I’ll assume you already have Python and some kind of interactive terminal (if you have a Mac, both should already be there). If you’re completely new to Python, but are interested in learning about NLP in Python (or in general), the book I referenced before does a much better job than I ever could easing a newcomer into both Python and NLP/NLTK.

In fact, I’ll just point you to this section for getting set up with NLTK (just those few paragraphs up to ‘Searching Text’). If you have some experience with pip and/or virtual environments, it can make installing NLTK easier (it’s what I used).

Once you have your terminal open, here’s what you’ll type:

$ python

This should open the Python interactive interpreter (or REPL). You can type in Python code and the results of the code will appear immediately. I find it’s a great way to try out my own code, or try using a new library, package, or API, to actually see the results of particular code. I also just love typing some code in and getting immediate feedback. If you have NLTK installed, try (any text following “>>>” is Python code you can enter):

>>> from nltk.book import *

This’ll import a bunch of books! NLTK has a number of text sources (books, news stories, chat logs; anonymized, don’t worry) to analyze/use in your code. You should see the name of each text print as it loads.

Now you can use them!

What Can I Even Do?

Try typing the name of one of the loaded texts:

>>> text1

_text1_, the full text of _Moby Dick_, is what you just loaded from NLTK above, and is now accessible/usable by Python. More specifically, it is an object of type ‘Text’; this is specific to NLTK and allows you to perform a bunch of cool operations on it. In fact, try this:

>>> dir(text1)

You should get back a long list of names.

dir() is a really helpful Python function: it effectively shows all of the attributes associated with the object you put in the parentheses (important to note that everything in Python is an object, so you can use _dir()_ with everything, even dir() itself). Put more simply, it’s everything you can do or access for a given object. It might seem overwhelming, but for now I’ll ignore everything with trailing and/or leading underscores (these are for the most part Python’s standard special methods, the ones that sit behind things like operators) and focus on the other stuff, specific to Text objects in NLTK (like ‘collocations’, ‘plot’, and the one we’re about to use).
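If the underscore names get in the way, a small helper can hide them. This is just a sketch: `public_names` is a name I made up, and it only filters on a leading underscore. It runs on a plain string here so it works without NLTK, but the same call should work on text1.

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Demonstrated on an ordinary string so this runs without NLTK installed.
names = public_names("whale")
print(names)
```

In the REPL with NLTK loaded, public_names(text1) should leave you with just the Text-specific names.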

Searching Text

Promised we would, so let’s search the text!

>>> text1.concordance('whale')

_concordance_ is one of the actions (aka methods) of Text objects. So you have a Text object, _text1_, call a method on it, .concordance(), and put some argument in the parentheses, ‘whale’. The argument is a string you want to search for in the text. The result of this method is every place in the text where the word ‘whale’ appears, along with its context (the words surrounding it). It’s limited to displaying the first 25 (of 1226; unsurprisingly, lots of whale talk in _Moby Dick_). This is kind of amazing! It’s probably one of the simplest features of NLTK, but I can imagine a ton of applications, not the least of which is that it just looks cool: you could make a web app just to do this and show the context for any word in some text!
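To get a feel for what a concordance involves, here is a minimal pure-Python sketch. It is not NLTK’s implementation: the function, its `width` and `lines` parameters, and the sample sentence are all made up for illustration, and it assumes the text has already been split into a list of words.

```python
def concordance(tokens, word, width=3, lines=25):
    """Return up to `lines` (left, match, right) tuples showing `word` in context."""
    target = word.lower()
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Grab up to `width` words on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            results.append((left, token, right))
            if len(results) == lines:
                break
    return results


# A made-up sample sentence, not a quote from Moby Dick.
sample = "Call me Ishmael . The whale swam past the ship and the whale dived".split()
for left, match, right in concordance(sample, "whale"):
    print(f"{left:>20} {match} {right}")
```

The idea is just a loop over the words that keeps a few neighbors on each side of every match.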

Sidenote: ‘Concordance’ comes from publishing, and refers to exactly what we did above (list words in a text with their contexts). Its Wikipedia entry highlights what makes NLTK, and computing in general, so amazing: ‘Because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era, only works of special importance, such as the Vedas, Bible, Qur’an or the works of Shakespeare or classical Latin and Greek authors, had concordances prepared for them.’ Now we can do it with one command! That’s just the best.

Finding Similar Words

Let’s try another cool method — similar:

>>> text1.similar('whale')

According to the NLTK book, _similar_ finds ‘other words [that] appear in a similar range of contexts’ as the given word. So ‘ship’ shares more contexts with ‘whale’ than any other word, and because of this I can reasonably conclude that the two words are used similarly in _Moby Dick_. Already I can see how I’d apply this elsewhere: e.g., I could find words similar to a particular topic, which could form the basis for more complex sentiment analysis. Or, again, I could just make a fun web app that allows you to type in a word and get similar words in a body of text.
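Here is one rough way to picture ‘appearing in similar contexts’. I’m assuming a word’s context is the pair of words immediately before and after it, and ranking other words by how many of those pairs they share with the target. This is my own simplified sketch with a made-up function name and sample text, not how NLTK actually does it.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=5):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    lowered = [t.lower() for t in tokens]
    target = word.lower()

    # Map each word to the set of (previous word, next word) pairs it appears between.
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))

    target_contexts = contexts.get(target, set())
    scores = Counter()
    for other, seen in contexts.items():
        if other == target:
            continue
        shared = len(seen & target_contexts)
        if shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# A made-up sample, not text from Moby Dick.
sample = ("the whale sank . the ship sank . the whale rose . "
          "the ship rose . the crew slept").split()
print(similar_words(sample, "whale"))
```

In this toy sample, ‘whale’ and ‘ship’ both appear between ‘the’ and ‘sank’ and between ‘the’ and ‘rose’, so ‘ship’ is the only word that comes back.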

This obviously barely scratches the surface of natural language processing and the Natural Language Toolkit, but hopefully gets you a little interested. I’d love to write more about some of the other features/possibilities, and maybe even walk through a project I’ll create with it (though I have to do it first). I’m sure I’ll have more to say soon.

Resources:

NLTK book: The better-written basis for this post. Go try it out!

NLTK: The home for the Natural Language Toolkit. Has some simple/cool examples and additional resources.