When I blog about my own research, I usually describe work I’ve already completed and focus on the results. This post is about a recent effort that ended in frustration, and it focuses on the process. In writing about this aborted project, I have two hopes: 1) to reassure other researchers (and myself) that this kind of failure is normal, and 2) if I’m lucky, to get some help with this task.
This particular ball got rolling a couple of days ago when I read a blog post by Dan Drezner about one aspect of the Obama administration’s new National Security Strategy (NSS) report. A few words in the bits Dan quoted got me thinking about the worldview they represented, and how we might use natural-language processing (NLP) to study that:
At first, I was just going to drop that awkwardly numbered tweetstorm and leave it there. I had some time that afternoon, though, and I’ve been looking for opportunities to learn text mining, so I decided to see what I could do. The NSS reports only became a thing in 1987, so there are still just 16 of them, and they all try to answer the same basic questions: What threats and opportunities does the US face in the world, and what should the government do to meet them? As such, they seemed like the kind of manageable and coherent corpus that would make for a nice training exercise.
I started by checking to see if anyone had already done with earlier reports what I was hoping to do with the latest one. It turned out that someone had, and to good effect:
I promptly emailed the corresponding author to ask if they had replication materials, or even just clean versions of the texts for all previous years. I got an autoreply informing me that the author was on sabbatical and would only intermittently be reading his email. (He replied the next day to say that he would put the question to his co-authors, but that still didn’t solve my problem, and by then I’d moved on anyway.)
Without those materials, I would need to start by getting the documents in the proper format. A little Googling led me to the National Security Strategy Archive, which at the time had PDFs of all but the newest report, and that one was easy enough to find on the White House’s web site. Another search led me to a site that converts PDFs to plain text online for free. I spent the next hour or so running those reports through the converter (and playing a little Crossy Road on my phone while I waited for the jobs to finish). Once I had the reports as .txt files, I figured I could organize my work better and do other researchers a solid by putting them all in a public repository, so I set one up on GitHub (here) and cloned it to my hard drive.
At that point, I was getting excited, thinking: “Hey, this isn’t so hard after all.” In most of the work I do, getting the data is the toughest part, and I already had all the documents I wanted in the format I needed. I was just a few lines of code away from the statistics and plots and that would confirm or infirm my conjectures.
From another recent collaboration, I knew that the next step would be to use some software to ingest those .txt files, scrub them a few different ways, and then generate some word counts and maybe do some topic modeling to explore changes over time in the reports’ contents. I’d heard several people say that Python is really good at these tasks, but I’m an R guy, so I followed the lead on the CRAN Task View for natural language processing and installed and loaded the ‘tm’ package for text mining.
And that’s where the wheels started to come off of my rickety little wagon. Using the package developers’ vignette and an article they published in the Journal of Statistical Software, I started tinkering with some code. After a couple of false starts, I found that I could create a corpus and run some common preprocessing tasks on it without too much trouble, but I couldn’t get the analytical functions to run on the results. Instead, I kept getting this error message:
Error: inherits(doc, "TextDocument") is not TRUE
By then it was dinner time, so I called it a day and went to listen to my sons holler at each other across the table for a while.
When I picked the task back up the next morning, I inspected a few of the scrubbed documents and saw some strange character strings—things like ir1 instead of in and â€™ where an apostrophe should be. That got me wondering if the problem lay in the encoding of those .txt files. Unfortunately, neither the files themselves nor the site that produced them tell me which encoding they use. I ran through a bunch of options, but none of them fixed the problem.
“Okay, no worries,” I thought. “I’ll use gsub() to replace those funky bits in the strings by hand.” The commands ran without a hiccup, but the text didn’t change. Stranger, when I tried to inspect documents in the R terminal, the same command wouldn’t always produce the same result. Sometimes I’d get the head, and sometimes the tail. I tried moving back a step in the process and installed a PDF converter that I could run from R, but R couldn’t find the converter, and my attempts to fix that failed.
At this point, I was about ready to quit, and I tweeted some of that frustration. Igor Brigadir quickly replied to suggest a solution, but it involved another programming language, Python, that I don’t know:
To go that route, I would need to start learning Python. That’s probably a good idea for the long run, but it wasn’t going to happen this week. Then Ken Benoit pointed me toward a new R package he’s developing and even offered to help me :
That sounded promising, so I opened R again and followed the clear instructions on the README at Ken’s repository to install the package. Of course the installation failed, probably because I’m still using R Version 3.1.1 and the package is, I suspect, written for the latest release, 3.1.2.
And that’s where I finally quit—for now. I’d hit a wall, and all my usual strategies for working through or around it had either failed or led to solutions that would require a lot more work. If I were getting paid and on deadline, I’d keep hacking away, but this was supposed to be a “fun” project for my own edification. What seemed at first like a tidy exercise had turned into a tar baby, and I needed to move on.
This cycle of frustration –> problem-solving –> frustration might seem like a distraction from the real business of social science, but in my experience, it is the real business. Unless I’m performing a variation on a familiar task with familiar data, this is normal. It might be boring to read, but then most of the day-to-day work of social science probably is, or at least looks that way to the people who aren’t doing it and therefore can’t see how all those little steps fit into the bigger picture.
So that’s my tale of minor woe. Now, if anyone who actually knows how to do text-mining in R is inspired to help me figure out what I’m doing wrong on that National Security Strategy project, please take a look at that GitHub repo and the script posted there and let me know what you see.