A couple of weeks ago, I blogged about a failed attempt to do some exploratory text-mining on the US National Security Strategy reports (here). That project was supposed to give me a fun way to learn the basics of text mining in R, something I’ve been eager to do of late. In writing the blog post, I had two motives: 1) to help normalize the experience of getting stuck and failing in social science and data science, and 2) to appeal for help from more experienced coders who could help get me unstuck on this particular task.
The post succeeded on both counts. I won’t pepper you with evidence on the commiseration front, but I am excited to share the results of the coding improvements. In addition to learning how to text-mine, I have also been trying to learn how to use RStudio and Shiny to build interactive apps, and this project seemed like a good way to do both. So, I’ve created an app that lets users explore this corpus in three ways:
- Plot word counts over time to see how the use of certain terms has waxed and waned over the 28 years the reports span.
- Generate word clouds showing the 50 most common words in each of the 16 reports.
- Explore associations between terms by picking one and seeing which 10 others are most closely correlated with it across the entire corpus.
For example, here’s a plot of change over time in the relative frequency of the term ‘terror’. Its usage spikes after 9/11 and then falls sharply when Barack Obama replaces George W. Bush as president.
That pattern contrasts sharply with references to climate, which are rare until the Obama presidency, when they spike upward. (Note, though, that the y-axis has been rescaled from the previous chart: even after this large increase, ‘climat’ appears only about half as often as ‘terror’.)
And here’s a word cloud of the 50 most common terms from the first US National Security Strategy, published in 1987. Surprise! The Soviet Union dominates the monologue.
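For readers curious about the mechanics, here is a minimal sketch in base R of the three calculations behind those views. It is not the app's actual code: the toy corpus is invented, the names `docs`, `dtm`, `rel_freq`, `top_terms`, and `assoc` are hypothetical, and it skips steps such as stop-word removal and stemming (which is presumably why the app shows ‘climat’ rather than ‘climate’).

```r
# Toy corpus: three invented "reports", NOT text from the actual strategy documents.
docs <- c(
  "1987" = "soviet threat soviet forces defense",
  "2002" = "terror threat terror networks defense",
  "2010" = "climate change terror climate economy"
)

# Tokenize on whitespace and count each report's words against a shared vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(x) tabulate(match(x, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab  # rows are reports, columns are terms

# 1) Relative frequency: a term's count divided by that report's total word count.
rel_freq <- dtm / rowSums(dtm)

# 2) Most common terms in one report (the app's word clouds show the top 50).
top_terms <- function(report, n = 3) {
  head(sort(dtm[report, ], decreasing = TRUE), n)
}

# 3) Terms whose counts are most correlated with a chosen term across reports.
assoc <- function(term, n = 3) {
  r <- cor(dtm)[term, ]
  head(sort(r[names(r) != term], decreasing = TRUE), n)
}
```

On this toy data, `rel_freq["2002", "terror"]` is 2 of 5 words, and `assoc("climate", 2)` picks out ‘change’ and ‘economy’, the two terms that only ever appear alongside it. Dividing by each report's length is what makes the over-time plots comparable across reports of different sizes.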
When I built an initial version of the app a couple of Sundays ago, I promptly launched it on shinyapps.io to show it off. Unfortunately, that service only gives you 25 hours of free usage per billing cycle, and when I tweeted about the app, it got so much attention that those hours disappeared in a little over a day!
I don’t have my own server to host this thing, and I’m not sure when Shiny’s billing cycle refreshes. So, for the moment, I can’t link to a permanently working version of the app. If anyone reading this post is interested in hosting the app on a semi-permanent basis, please drop me a line at ulfelder <at> gmail. Meanwhile, R users can launch the app from their terminals with these two lines of code, assuming the ‘shiny’ package is already installed: