Over the next year, I plan to learn how to write code to do text mining.
I’m saying this out loud for two reasons. The first is self-centered; I see a public statement about my plans as a commitment device. By saying publicly that I plan to do this thing, I invest some of my credibility in following through, and my credibility is personally and professionally valuable to me.
I’m also saying this out loud, though, because I believe that the thinking behind this decision might interest other people working in my field. There are plenty of things I don’t know how to do that would be useful in my work on understanding and forecasting various forms of political instability. Three others that spring to mind are Bayesian data analysis, network theory, and agent-based modeling.
I’m choosing to focus on text mining instead of something else because I think that the single most significant obstacle to better empirical analysis in the social sciences is the scarcity of data, and I think that text mining is the most promising way out of this desert.
The volume of written and recorded text we produce on topics of interest to social scientists is incomprehensibly vast. Advances in computing technology and the growth of the World Wide Web have finally made it possible to access and analyze those texts—contemporary and historical—efficiently and at scale. This situation is still new, however, so most of this potential remains unrealized. There is a lot of unexplored territory on the other side of this frontier, and that territory is still growing faster than our ability to map it.
Lots of other people in political science and sociology are already doing text mining, and many of them are probably doing it better than I ever will. One option would be to wait for their data sets to arrive and then work with them.
My own restlessness discourages me from following that strategy, but there’s also a principled reason not just to take what’s given: we do better analysis when we deeply understand where our data come from. The data sets you know the best are the ones you make. The data sets you know second-best are the ones someone else made with a process or instruments you’ve also used and understand. Either way, it behooves me to learn what these instruments are and how to apply them.
Instead of learning text mining, I could invest my time in learning other modeling and machine-learning techniques to analyze available data. My modeling repertoire is pretty narrow, and the array of options is only growing, so there’s plenty of room for improvement on that front, too.
In my experience, though, more complex models rarely add much to the inferential or predictive power we get from applying relatively simple models to the right data. This may not be true in every field, but it tends to be true in work on political stability and change, where the phenomena are so complex and still so poorly understood. On these topics, the best we can usually do is to find gross patterns that recur among data representing theoretically coherent processes or concepts.
Relatively simple models usually suffice to discover those gross patterns. What’s harder to come by are the requisite data. I think text mining is the most promising way to make them, so I am now going to learn how to do it.
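To make the idea of "making data from text" concrete, here is a minimal, hypothetical sketch—not anyone's actual pipeline, and the two headlines are invented—of one of the simplest text-mining steps: counting term frequencies across a small corpus using only the Python standard library. Real projects would go far beyond this, but even raw word counts are already structured data derived from unstructured text.

```python
import re
from collections import Counter

# A few common stopwords to skip; a real project would use a fuller list.
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in", "as", "after"})

def term_frequencies(text):
    """Count how often each word appears in a text, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Toy corpus: two invented headlines standing in for real news text.
docs = [
    "Protests erupt in the capital as the government falls",
    "The government declares a state of emergency after protests",
]

# Aggregate counts across documents -- a first, crude data set made from text.
counts = Counter()
for doc in docs:
    counts += term_frequencies(doc)

print(counts.most_common(3))
```

From here, the per-document counts could become rows in a document-term matrix, which is exactly the kind of input that the relatively simple models described above can work with.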