What Darwin Teaches Us about Political Regime Types

Here’s a paragraph, from a 2011 paper by Ian Lustick, that I really wish I’d written. It’s long, yes, but it rewards careful reading.

One might naively imagine that Darwin’s theory of the “origin of species” to be “only” about animals and plants, not human affairs, and therefore presume its irrelevance for politics. But what are species? The reason Darwin’s classic is entitled Origin of Species and not Origin of the Species is because his argument contradicted the essentialist belief that a specific, finite, and unchanging set of categories of kinds had been primordially established. Instead, the theory contends, “species” are analytic categories invented by observers to correspond with stabilized patterns of exhibited characteristics. They are no different in ontological status than “varieties” within them, which are always candidates for being reclassified as species. These categories are, in essence, institutionalized ways of imagining the world. They are institutionalizations of difference that, although neither primordial nor permanent, exert influence on the futures the world can take—both the world of science and the world science seeks to understand. In other words, “species” are “institutions”: crystallized boundaries among “kinds”, constructed as boundaries that interrupt fields of vast and complex patterns of variation. These institutionalized distinctions then operate with consequences beyond the arbitrariness of their location and history to shape, via rules (constraints on interactions), prospects for future kinds of change.

This is one of the big ideas to which I was trying to allude in a post I wrote a couple of months ago on “complexity politics”, and in an ensuing post that used animated heat maps to trace gross variations in forms of government over the past 211 years. Political regime types are the species of comparative politics. They are “analytic categories invented by observers to correspond with stabilized patterns of exhibited characteristics.” In short, they are institutionalized ways of thinking about political institutions. The patterns they describe may be real, but they are not essential. They’re not the natural contours of the moon’s surface; they’re the faces we sometimes see in them.


Mary Goodden’s Taxonomy of Video Games

If we could just twist our mental kaleidoscopes a bit, we might find different things in the same landscape. One way to do that would be to use a different set of measures. For the past 20 years or so, political scientists have relied almost exclusively on the same two data sets—Polity and Freedom House’s Freedom in the World—to describe and compare national political regimes in anything other than prose. These data sets are very useful, but they are also profoundly conventional. Polity offers a bit more detail than Freedom House on specific features of national politics, but the two are essentially operationalizing the same assumptions about the underlying taxonomy of forms of government.

Given that fact, it’s hard to see how further distillations of those data sets might surprise us in any deep way. A new project called Varieties of Democracy (V-Dem) promises to bring fresh grist to the mill by greatly expanding the number of institutional elements we can track, but it is still inherently orthodox. Its creators aren’t trying to reinvent the taxonomy; they’re looking to do a better job locating individual cases in the prevailing one. That’s a worthy and important endeavor, but it’s not going to produce the kind of gestalt shift I’m talking about here.

New methods of automated text analysis just might. My knowledge of this field is quite limited, but I’m intrigued by the possibilities of applying unsupervised learning techniques, such as latent Dirichlet allocation (LDA), to the problem of identifying political forms and associating specific cases with them. In contrast to conventional measurement strategies, LDA doesn’t oblige us to specify a taxonomy ahead of time and then look for instances of the things in it. Instead, LDA assumes there is some mixture of overlapping but latent categories out there—we choose only how many to look for, not what they contain—and these latent categories are partially revealed by characteristic patterns in the ways we talk and write about the world.
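As a toy illustration of the idea, here is a minimal sketch using scikit-learn’s implementation of LDA. The six one-line “documents” are invented stand-ins for country reports; nothing about them, or the two-topic choice, comes from any real corpus.

```python
# A minimal sketch of topic discovery with LDA via scikit-learn.
# The "documents" are invented stand-ins for country reports.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "election parliament opposition coalition vote",
    "vote election campaign parliament turnout",
    "army coup junta tanks curfew",
    "tanks coup army colonels curfew",
    "election vote ballot opposition",
    "junta army coup martial law",
]

counts = CountVectorizer().fit_transform(docs)

# We fix only the *number* of latent categories, not their content.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a per-document topic mix summing to 1

print(doc_topics.shape)  # (6, 2)
```

With real inputs—say, decades of annual country reports—those per-document topic mixtures, rather than a predefined taxonomy, would be the raw material for grouping regimes.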

Unsupervised learning is still constrained by the documents we choose to include and the language we use in them, but it should still help us find patterns in the practice of politics that our conventional taxonomies overlook. I hope to be getting some funding to try this approach in the near future, and if that happens, I’m genuinely excited to see what we find.

A Coda on Pattern Recognition

Last week, I aired my skepticism on the power of (non-parametric) pattern-recognition techniques to predict many political events, and Phil Schrodt responded with a more optimistic take (see here and here). Since then, Phil and I have carried the conversation a bit further via email. As a coda to this discussion, I thought I would post the relevant pieces of that correspondence.

I wrote to Phil:

At the end of your post, you address my point about rare events by saying that sample size doesn’t matter so much for training PR tools, as long as there are recognizable clusters of features in the examples that are available.

Isn’t that a big “if,” though? For most big political events, my prior belief would be that there would be recognizable clusters of features for the event itself, but many possible clusters or sequences of events or other features preceding those. The analogy that comes to mind is the difference between a bobsled track and a sledding hill. Where political phenomena work like a bobsled track, following regular patterns to a common destination, we should be able to learn their patterns from a small number of examples and extrapolate successfully to future instances. My guess, though, is that they’re often more like a sledding hill, where the approaches to the bottom are more varied and conditional (on things like the starting point), even though the destination is ultimately the same. In situations where that’s right, we would need a much larger set of examples to identify any regularities and extrapolate successfully from them.

I realize this is, for the most part, an empirical question. Still, I would wager that my original point about the importance of sample size will apply to at least a fair number of the things we might want to forecast in political science.

Phil responded:

I sort of agree with you here and sort of not.

On the “bobsled track vs. sledding hill” analogy — which is nice — that is the point (and I realized at the time I needed to elaborate on it) I meant when I used the phrase “clusters and the variance around them.” The question is not just what something looks like and how many points are close to it, but *how* close they are (plus, of course, the issue of the false positives: how many other points are in the vicinity that shouldn’t be?). So normal elections in established democracies probably all look pretty much the same (in some coding scheme); the collapse of coalition governments in established parliamentary democracies probably has more variance than that, but is still reasonably regular (opposition parties signal their intentions before they actually break, etc.). Something like genocide is, as you note, probably on the far end of unpredictable, which is to say that there is a great deal of variance and a large number of false positives. So in those instances, one needs to invest a whole lot of effort to try to find those few things that are in common.

Where I’d still make the case for the clustering/PR approach is that I think it usually can work on *less* data than a corresponding frequentist method, particularly one that depends on large-sample properties (e.g. approximations to the Normal distribution via the Central Limit Theorem). And, I suppose, the PR methods are probably more robust against violations of assumptions (or, more generally, they tend to be non-parametric) than are the statistical methods. But they will do better with more information than with less.

And that’s where we’ll leave it for now.
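The bobsled-vs-sledding-hill distinction above can be made concrete with a toy simulation: two event types whose examples cluster around fixed “destinations,” classified by nearest class mean, under small versus large within-class variance. All numbers here are invented for illustration.

```python
# Toy simulation: how within-class variance affects what a small
# training sample can teach a nearest-mean classifier.
import numpy as np

rng = np.random.default_rng(0)

def accuracy(n_train, spread):
    """Mean accuracy over 200 trials with n_train examples per class."""
    centers = np.array([[0.0, 0.0], [3.0, 3.0]])  # the two "destinations"
    acc = []
    for _ in range(200):
        # Draw training examples around each center, estimate class means.
        train = [c + spread * rng.standard_normal((n_train, 2)) for c in centers]
        means = np.array([t.mean(axis=0) for t in train])
        hits = 0
        for label, c in enumerate(centers):
            test = c + spread * rng.standard_normal(2)
            pred = np.argmin(((means - test) ** 2).sum(axis=1))
            hits += (pred == label)
        acc.append(hits / 2)
    return float(np.mean(acc))

# Tight "bobsled" patterns vs. diffuse "sledding hill" patterns,
# learned from just five cases per class.
print(accuracy(5, 0.5), accuracy(5, 3.0))
```

When the spread is small relative to the distance between the class means, a handful of cases suffices; as the spread grows, accuracy falls unless the sample grows with it—which is the crux of the exchange above.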

Maybe Pattern Recognition Will Work Better Than I Thought

The eminent Phil Schrodt, master of methods and pater familias of political-science programming, read yesterday’s bit on applied pattern recognition and had some things to say about it. Phil and I know each other from my years working with the Political Instability Task Force, and I know he knows a lot more about this topic than I do, so I asked him to make a guest post out of it. Here’s what Phil wrote:

Three responses to Jay’s recent post on pattern recognition (PR).

First, as I argued rather extensively a couple decades back in an unpublishable book, PR is at the core of much of political analysis; in fact probably most of the political analysis that we call “qualitative” or “case study.” Statistical studies are a tiny and somewhat awkward subset of the means by which we could formally assess political behavior for most categorical political events (as opposed to natural quantitative measures such as public opinion polls), though I will be the first to acknowledge that given the tremendous amount of work over the past decade or so, statistical methods are “unreasonably effective” at this task.

But the bottom line is if not PR, what else is going on? We know that the human brain is not only extraordinarily effective at PR in general, but there is increasing evidence that it is hardwired for various forms of socially-relevant PR.

Take, for example, the fact that you can probably recognize any of the songs played at your high school senior prom after the first couple of bars of music. (Presupposing, of course, that you went to a high school senior prom; like most of the socially dysfunctional geeks reading this blog, I didn’t…I digress…) This ability is closely related to the fact that you could recognize the voice of your (counter-factual) date at that prom within seconds if he/she phoned you after twenty years. Those features, in turn, are related to how we evolved (yes, evolved…sorry, Republican presidential primary contenders except Huntsman) as social primates. There is also evidence — which I could dig up except that I’m writing for a blog — that episodic (“story telling”) PR is another one of those highly evolved features, quite possibly due to the need for social primates to figure out who they should and should not behave altruistically towards — play the C/D option frequently enough in the iterated Prisoner’s Dilemma, and you die. Or at least don’t reproduce. (I hate to think where proms fit in this framework…though this might provide some insight…)

So for me the issue is figuring out how a computer can simulate that PR. This is a difficult problem which, I would suggest, we’ve spent almost no time at all on — the number of PR articles in the whole of the published political science literature probably numbers two or three dozen at best, plus probably a comparable number by computer scientists who happen to choose politics as a domain for demonstrating machine learning algorithms (see here, for example). The vast bulk of the systematic, data-based work in political science has either been the aforementioned statistical studies, or else (following far behind in numbers, though still much larger than the PR work) approaches that are loosely — at times really, really loosely — based on game theory (see here), another highly simplified approximation to human reasoning that sometimes kinda sorta maybe works in some situations if we don’t look too closely (but I digress…).

Now, on the issue of data: yes, right now we don’t have a whole lot of machine-readable data, but that is a temporary situation and will only improve. The existing event data sets are quite limited and consequently rare events are an issue. This, however, is changing rapidly. For example, if the new DARPA ICEWS global data set being produced by Lockheed is made available (likely a DARPA decision), we will have another data set containing on the order of 10 million events (the Lockheed ICEWS data covering only Asia had around 4 million events, so extrapolating, the global set might have around 25 million, just as a guess), and with far greater detail on substate actors (an ICEWS specialty) compared to the VRA/King-Lowe “10-million events” set. Meanwhile Peter Nardulli’s SPEED project is using a variety of text-recognition and natural language processing methods to generate a dense global data set on political events going back to 1945.

But more generally, the availability and reliability of this data — machine coded from machine-readable sources — is only going to improve. At present, the low-hanging fruit — news reports we can (relatively) easily download — goes back only to around 1990 (conveniently, about the time of the end of the Cold War). But that period of data is increasing at the exact rate of one year per year (and, to a much more limited extent, some efforts such as Nardulli’s are also pushing this backwards in time as well), and the number of sources available online is now in the thousands (compared to the single-source Reuters and AFP-based data sets of the VRA and KEDS projects), and is compiled by aggregators such as Google News and the European Media Monitor. Machine coding methods are also improving, through the development of better ontologies, hugely more extensive dictionaries with tens of thousands of actors (ICEWS again), the development of a wide variety of tools such as parsers, automated translation and named-entity-recognition software in computational linguistics, and integration of these into third-generation automated coders such as Lockheed’s JABARI-NLP (link). In all likelihood, machine coding is already more accurate than actual human coding, though it is not at the level of the claimed human accuracy, which tends to be about twice as high (see this fascinating study by Mikhaylov, Laver, and Benoit on attempting to replicate the coding of the Comparative Manifestos Project, a task very similar to event data coding), and it is only going to get better.

Finally, the issue of rare events (and human PR more generally) is one of identifying archetypes — a radical new idea Max Weber suggested a mere century ago — then selectively getting the data that instantiates the archetype and determining how much variance there is around it. Suppose tomorrow (it would be nice…) we saw the following three things happen in Syria:

1. Government media go off the air.

2. Tanks surround major government buildings in Damascus and other major cities.

3. Vague reports emerge about an “Emergency Committee for National Salvation and Unity” consisting largely of colonels.

That’s probably sufficient information for us to conclude a “coup” has occurred in Syria. This is based on a combination of how a “coup” has been defined and the various instantiations of “coups” that we have seen in the historical record. In human PR, that historical record goes back considerably further than what we have in the event data record, and also “selects on the dependent variable” — if we want examples of coups, we look at coup-prone countries (e.g. Latin America prior to the 1990s, Africa in the 1970s and 1980s, Thailand, Turkey), not at the historical record generally. Most of the evidence from the cognitive sciences indicates that the brain does this automatically — memories, and particularly episodic memories, are stored in networks of similar events. “If it fires together, it wires together” (link). Approximating this in a computational environment is a difficult task — the structures and intrinsic capabilities of carbon-based and silicon-based memories differ dramatically — but there is plenty of work in this area, and it is certainly becoming easier and more efficient with high-memory, high-speed parallel computer clusters.
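The three-feature Syria scenario amounts to nearest-archetype matching, which can be caricatured in a few lines. The feature names and archetypes below are invented for illustration, not drawn from any real event-coding scheme.

```python
# A caricature of archetype matching: events coded as binary feature
# sets, and a new case assigned to the archetype it overlaps most.
# All feature names and archetypes are invented.
coup_archetype = {"media_off_air", "tanks_in_capital", "junta_announcement"}
election_archetype = {"campaign_rallies", "polls_open", "results_announced"}
archetypes = {"coup": coup_archetype, "election": election_archetype}

def jaccard(a, b):
    """Overlap between two feature sets, 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def classify(observed):
    # Pick the archetype with the greatest feature overlap.
    return max(archetypes, key=lambda name: jaccard(observed, archetypes[name]))

syria_reports = {"media_off_air", "tanks_in_capital", "junta_announcement"}
print(classify(syria_reports))  # prints "coup"
```

Real event-data features would be far noisier than this, which is exactly where the variance-around-the-archetype question from the earlier exchange bites.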

Archetypes, in turn, generally depend on clustering, not the sample-size-dependent features of traditional statistics. This is why problems such as IBM Watson/Jeopardy, the Netflix film-preference algorithm, and the WalMart shopping basket co-purchase (e.g. beer and diapers) algorithms work. If there are a couple other people out there who really like vampire films, John Ford Westerns, and classical Looney Tunes animation, Netflix just needs to find that cluster; it doesn’t need to be a big cluster (though that one probably is big…). I can pull out a dozen historical analogues of the political contagion process of the Arab Spring and this gives me plenty to work with from the perspective of figuring out what is typical or not about the current situation; I don’t need a sample of 1,000.

In short, we’re unquestionably not there yet on computational pattern recognition of political activities, and arguably we’ve hardly even started. But both the data side and the computational side have changed exponentially in the past decade — and these trends continue — so this would seem to be a promising avenue of research.

PS. Phil and I wrap up the discussion (for now) with one more exchange, here.

Can Google Translate Predict Politics?

The UK’s Independent newspaper ran a story this week explaining how the fantastic Google Translate works:

Google has created an automatic translation tool that is unlike all others. It is not based on the intellectual presuppositions of early machine translation efforts – it isn’t an algorithm designed only to extract the meaning of an expression from its syntax and vocabulary. In fact, at bottom, it doesn’t deal with meaning at all. Instead of taking a linguistic expression as something that requires decoding, Google Translate (GT) takes it as something that has probably been said before. It uses vast computing power to scour the internet in the blink of an eye, looking for the expression in some text that exists alongside its paired translation. Drawing on the already established patterns of matches between these millions of paired documents, GT uses statistical methods to pick out the most probable acceptable version of what’s been submitted to it. Much of the time, it works.

The emphasis in that quote is mine. The point is that GT operates through pattern recognition, comparing the current “problem” to vast numbers of past examples in order to identify not the single correct answer, but the answer that is most likely. If I’m not mistaken, this strategy is similar to the one IBM uses with Watson, the “supercomputer” that wowed us last year by beating a few top Jeopardy! champions at their own game.
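The look-it-up-in-precedent strategy the article describes can be caricatured in a few lines: tally previously seen translation pairs and return the most frequent match for a phrase. The phrase pairs below are invented stand-ins for a parallel corpus.

```python
# A toy version of translation-by-lookup: pick the most frequently
# observed paired translation for a phrase. The pairs are invented.
from collections import Counter

# Pretend corpus of aligned phrase pairs harvested from parallel documents.
aligned_pairs = [
    ("bonjour le monde", "hello world"),
    ("bonjour le monde", "hello world"),
    ("bonjour le monde", "good day world"),
    ("merci beaucoup", "thank you very much"),
]

def translate(phrase):
    """Return the most common paired translation, or None if unseen."""
    candidates = Counter(target for source, target in aligned_pairs if source == phrase)
    return candidates.most_common(1)[0][0] if candidates else None

print(translate("bonjour le monde"))  # prints "hello world"
```

The real system is vastly more sophisticated—statistical models over n-grams rather than whole-phrase lookup—but the principle is the same: the most probable precedent, not a decoded meaning.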

The fantastic success of these pattern-recognition programs in other domains makes me think again about how the same strategy might be used to predict political events. (Apparently, Uncle Sam is doing the same; see this recent solicitation from the Director of National Intelligence’s IARPA shop.) So why haven’t we heard about any big breakthroughs in forecasting political events using these same techniques? If pattern-recognition programs can knock down the Tower of Babel and steamroll Jeopardy!’s greatest champions, can’t they also tell us where the next riot or coup attempt is going to happen?

Attempts to use pattern recognition to predict politics face two major obstacles. One of them, we can (and maybe already have) overcome; the other, however, we cannot.

The first hurdle has to do with building libraries of examples. For pattern recognition to work well, we need to have lots of processed examples in which patterns can be identified for comparison to new cases. Traditionally, political scientists trying to apply this strategy to political forecasting have looked for patterns in sequences and combinations of events, and the event data on which those searches depended was coded by humans reading news stories. This approach was labor intensive and costly, and to my knowledge it has never produced any great results.

The growth of the World Wide Web and improvements in computer hardware and software may finally have solved this problem. Programs have been developed that can now produce reliable event data automatically at a fraction of the cost of the traditional human-coding efforts. (For examples, see here, here, and here.)  Improvements in automated content analysis are also allowing researchers to break out of the event-analysis box and think about other aspects of relevant texts that might also contain signals about the future. (This post of mine from a few days ago describes one such effort.)

Even as our ability to process the historical record takes great leaps, however, there’s still a second hurdle that no software script can overcome. The more fundamental problem is that most political events of interest occur very rarely. On average, there are fewer than 10 coup attempts each year in countries worldwide. In the past two decades, countries have gone to war with each other only a handful of times. Over the past half-century, the number of transitions to democracy that have occurred is more like 100 than 1,000 or 10,000. In the entire history of the United States, there have only been 56 presidential elections.

The rarity of these events gives pattern-recognition techniques very little to train on. We can now produce fantastic quantities of data in which to search for patterns, but we still have very few examples from which those patterns might be identified. The power of statistical analysis depends, in large part, on the size of the sample from which we’re trying to extrapolate. Without stacks and stacks of relevant historical examples, pattern-recognition techniques will forever struggle to get the traction they need to produce accurate predictions of political phenomena.
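The sample-size point is just the usual square-root law: the standard error of an estimated event rate shrinks with the square root of the number of cases, so a record of roughly 100 events pins down a pattern far less precisely than 10,000 would. A quick back-of-the-envelope:

```python
# Standard error of an estimated proportion shrinks with sqrt(n):
# quadrupling the sample only halves the uncertainty.
import math

def se_of_proportion(p, n):
    """Standard error of a sample proportion p estimated from n cases."""
    return math.sqrt(p * (1 - p) / n)

# p = 0.5 is the worst case (largest standard error).
for n in (56, 100, 10_000):
    print(n, round(se_of_proportion(0.5, n), 3))
```

With 56 cases (the count of U.S. presidential elections mentioned above), the uncertainty around any estimated rate is on the order of several percentage points; it takes thousands of cases to narrow that to a fraction of a point.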

PS. For a more optimistic view, see this response from the wise Phil Schrodt. Phil and I wrap up the exchange (for now) here.
