Halloween, Quantified

Some parents dress up for Halloween. Some throw parties. In our house, we—well, I, really; my wife was bemused, my younger son vaguely interested, and my elder son embarrassed—I collect and chart the data.

First, the flow of trick-or-treaters. The figure below shows counts in 15-minute bins of kids who came to our door for candy. The first arrival, a little girl in a fairy/princess costume, showed up around 5:50 PM, well before sunset. The deluge came an hour later, when a mob from a party next door blended with an uptick in other arrivals. The other peak came almost an hour after that and probably had a much higher median age than the earlier one. The final handful strolled through around 8:40, right when we were shutting down so we could go pick up our own teenage boys from other parts of town.

trickortreat.2015

This year, I also tallied which candy the trick-or-treaters chose. The figure below plots the resulting data. If the line ends early, it means we ran out of that kind of candy. As my wife predicted, the kids’ taste is basically the inverse of ours, which, as one costumed adult chaperoning his child pointed out, is “perfect.”

halloween.2015.candy

To collect the data, I sat on my front porch in a beach chair with a notepad, greeted the arriving kids, asked them to pick one, and then jotted tick marks as they left. Colleague Ali Necamp suggested that I put the candies in separate containers to make it easier to track who took what; I did, and she was right. Only a couple of people asked me why the candies were laid out in bins, and I clearly heard one kid approaching the house ask, “Mommy, why is that man sitting on the porch?”

Which NFL Teams Are the Biggest Surprises of 2015 So Far?

We’re now 4.0625 weeks into the NFL’s 2015 regular season. (If you don’t know what the NFL is, you should probably stop reading now.) That’s about one-quarter of the whole 256-game shebang, enough to start taking stock of preseason predictions. So I got to wondering: Which teams have been the biggest surprises so far?

To get one answer to this question, I downloaded game results from Pro-Football-Reference.com (here) and compared them to the central tendencies of my preseason predictive simulations (here). The mean error of the predictions for each team so far is plotted below. The error in this case is the difference between the number of points by which the team was expected to win or lose each game and the number of points by which it actually won or lost. For example, my simulations had the Colts, on average, winning this week’s Thursday-night game against the Texans by 4, but they actually won by 7. That’s an error of +3 for the Colts and -3 for Houston. The mean error is the average of those errors across all games played so far. So, a positive mean error (blue dots) means the team is over-performing relative to the preseason predictions, while a negative mean error (red dots) means it’s under-performing.

team.error.lollipops.20151009
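The mean-error calculation described above is simple to reproduce. Here is a minimal sketch in Python with made-up margins (the actual analysis was done in R against Pro-Football-Reference results):

```python
import pandas as pd

# Hypothetical team-game margins, one row per team per game. "predicted" is
# the expected margin of victory (negative = expected loss); "actual" is the
# realized margin.
games = pd.DataFrame({
    "team":      ["IND", "HOU", "IND", "HOU"],
    "predicted": [4, -4, 3, -6],
    "actual":    [7, -7, -1, -2],
})

# Error = actual minus predicted margin; positive means the team
# over-performed the preseason prediction in that game.
games["error"] = games["actual"] - games["predicted"]

# Average the errors across each team's games played so far.
mean_error = games.groupby("team")["error"].mean()
print(mean_error)
```

Note how the Thursday-night example from the text works out: a predicted +4 and an actual +7 for the Colts gives an error of +3 in that game, and the mirror-image -3 for the Texans.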

Most of those results won’t surprise regular NFL watchers. The New York Football Jets finished 4–12 last year and ranked near the bottom in my preseason wiki survey, but they’re off to a 3–1 start this year. The Falcons, who went 6–10 in 2014 and garnered a low-middle score in the wiki survey, are undefeated after four weeks. At the other end of the scale, the Dolphins got a high-middle score in the preseason survey, but they have stumbled to a 1–3 start.

It’s also interesting (to me, anyway) to note how the team-specific errors are only loosely correlated with differences between predicted and observed records. For example, the Giants are only 2–2 so far this year, but they show up as one of the biggest over-performers of the first four weeks. That’s partly because both of those losses were close games that could easily have flipped the other way. The Giants were expected to be on the bad side of mediocre, but they’ve been competitive in every game so far. Ditto for the Ravens, who show up as only mild under-performers despite a 1–3 record (sob). At least three of those four games were expected to be close, and all of them turned on late scores; unfortunately, only one of those four late turns broke in Baltimore’s favor.

This exercise is only interesting if the preseason predictions on which we’re basing the calls about over- or under-performance are sound. So far, they look pretty solid. After four weeks, the root mean squared error for the predicted net scores is 12.8, and the mean squared error is 165. Those look large, but I think they’re in line with other preseason score forecasts. If we convert the predicted net scores to binary predicted outcomes, the model is 40–23 after four weeks, or 41–23 if we include last night’s Colts-Texans game. That’s not exactly clairvoyant, but it beats eight of ESPN’s 13 experts and matches one more, and they make their predictions each week with updated information.
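Both accuracy summaries are easy to compute from the same predicted and actual net scores. A quick Python sketch with invented numbers (the real figures came from the R simulations):

```python
import numpy as np

# Hypothetical predicted and actual net scores (home margin) for four games.
predicted = np.array([4.0, -2.5, 7.0, -3.0])
actual    = np.array([7.0,  3.0, 10.0, -6.0])

errors = actual - predicted
mse = np.mean(errors ** 2)   # mean squared error
rmse = np.sqrt(mse)          # root mean squared error

# Convert net-score predictions to binary picks: did the predicted winner win?
correct = np.sign(predicted) == np.sign(actual)
record = (int(correct.sum()), int((~correct).sum()))
print(rmse, record)
```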

Yes, By Golly, I Am Ready for Some Football

The NFL’s 2015 season sort of got underway last night with the Hall of Fame Game. Real preseason play doesn’t start until this weekend, and the first kickoff of the regular season is still a month away.

No matter, though — I’m taking the Hall of Fame Game as my cue to launch this year’s football forecasting effort. As it has for the past two years (see here and here), the process starts with me asking you to help assess the strength of this year’s teams by voting in a pairwise wiki survey:

In the 2015 NFL season, which team will be better?

That survey produces scores on a scale of 0–100. Those scores will become the crucial inputs into simulations based on a simple statistical model estimated from the past two years’ worth of survey data and game results. Using an R function I wrote, I’ve determined that I should be able to improve the accuracy of my forecasts a bit this year by basing them on a mixed-effects model with random intercepts to account for variation in home-team advantages across the league. Having another season’s worth of predicted and actual outcomes should help, too; with two years on the books, my model-training sample has doubled.

An improvement in accuracy would be great, but I’m also excited about using RStudio’s Shiny to build a web page that will let you explore the forecasts at a few levels: by game, by team, and by week. Here’s a screenshot of the game-level tab from a working version using the 2014 data. It plots the distribution of the net scores (home – visitor) from the 1,000 simulations, and it reports win probabilities for both teams and a line (the median of the simulated scores).

nfl.forecasts.app.game.20150809

The “By team” tab lets you pick a team to see a plot of the forecasts for all 16 of their games, along with their predicted wins (count of games with win probabilities over 0.5) and expected wins (sum of win probabilities for all games) for the year. The “By week” tab (shown below) lets you pick a week to see the forecasts for all the games happening in that slice of the season. I also plan to add annotations to the plot reporting the lines those forecasts imply (e.g., Texans by 7).

nfl.forecasts.app.week.20150809
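The two win tallies on the team tab differ in a simple way. A sketch in Python rather than the app’s R, with made-up probabilities (a real team would have 16, one per game):

```python
# Hypothetical win probabilities for a team's games.
win_probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.3, 0.2]

# Predicted wins: count of games in which the team is favored.
predicted_wins = sum(p > 0.5 for p in win_probs)

# Expected wins: the sum of the win probabilities themselves.
expected_wins = sum(win_probs)
print(predicted_wins, expected_wins)
```

Expected wins will usually be the less extreme of the two, because near-coin-flip games contribute about half a win each instead of a full win or none.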

Of course, the quality of the forecasts displayed in that app will depend heavily on participation in the wiki survey. Without a diverse and well-informed set of voters, it will be hard to do much better than guessing that each team will do as well this year as it did last year. So, please vote here; please share this post or the survey link with friends and family who know something about pro football; and please check back in a few weeks for the results.

ACLED in R

The Armed Conflict Location & Event Data Project, a.k.a. ACLED, produces up-to-date event data on certain kinds of political conflict in Africa and, as of 2015, parts of Asia. In this post, I’m not going to dwell on the project’s sources and methods, which you can read about on ACLED’s About page, in the 2010 journal article that introduced the project, or in the project’s user guides. Nor am I going to dwell on the necessity of using all political event data sets, including ACLED, with care—understanding the sources of bias in how they observe events and error in how they code them and interpreting (or, in extreme cases, ignoring) the resulting statistics accordingly.

Instead, my only aim here is to share an R script I’ve written that largely automates the process of downloading and merging ACLED’s historical and current Africa data and then creates a new data frame with counts of events by type at the country-month level. If you use ACLED in R, this script might save you some time and some space on your hard drive.

You can find the R script on GitHub, here.

The chief problem with this script is that the URLs and file names of ACLED’s historical and current data sets change with every update, so the code will need to be modified each time that happens. If the names were modular and the changes to them predictable, it would be easy to rewrite the code to keep up with those changes automatically. Unfortunately, they aren’t, so the best I can do for now is to give step-by-step instructions in comments embedded in the script on how to update the relevant four fields by hand. As long as the basic structure of the .csv files posted by ACLED doesn’t change, though, the rest should keep working.

[UPDATE: I revised the script so it will scrape the link addresses from the ACLED website and parse the file names from them. The new version worked after ACLED updated its real-time file earlier today, when the old version would have broken. Unless ACLED changes its file-naming conventions or the structure of its website, the new version should work for the rest of 2015. In case it does fail, instructions on how to hard-code a workaround are included as comments at the bottom of the script.]
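The scrape-and-parse step that update describes amounts to pulling the link addresses out of the page’s HTML and splitting the file names off the ends. A minimal Python sketch (the script itself is in R, and the URL and file name below are invented for illustration):

```python
import re

# A snippet of HTML standing in for the ACLED download page.
html = ('<p><a href="http://www.example.org/data/'
        'ACLED-All-Africa-File_20150101-to-20150704.csv">Download</a></p>')

# Scrape the .csv link addresses from the page...
links = re.findall(r'href="([^"]+\.csv)"', html)

# ...and parse the file name out of each address.
fname = links[0].rsplit("/", 1)[-1]
print(fname)
```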

It should also be easy to adapt the part of the script that generates country-month event counts to slice the data even more finely, or to count by something other than event type. To do that, you would just need to add variables to the group_by() part of the block of code that produces the object ACLED.cm. For example, if you wanted to get counts of events by type at the level of the state or province, you would revise that line to read group_by(gwno, admin1, year, month, event_type). Or, if you wanted country-month counts of events by the type(s) of actor involved, you could use group_by(gwno, year, month, interaction) and then see this user’s guide to decipher those codes. You get the drift.
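In pandas terms, the country-month aggregation looks like the sketch below (the script itself uses R and dplyr; the gwno codes and event types here are toy values):

```python
import pandas as pd

# Toy event records: gwno is the country code, one row per event.
acled = pd.DataFrame({
    "gwno":       [432, 432, 432, 501],
    "year":       [2015, 2015, 2015, 2015],
    "month":      [1, 1, 1, 2],
    "event_type": ["Battle", "Battle", "Riots/Protests", "Riots/Protests"],
})

# Count events by country, month, and type -- the analogue of
# group_by(gwno, year, month, event_type) in dplyr.
counts = (acled.groupby(["gwno", "year", "month", "event_type"])
               .size()
               .reset_index(name="events"))
print(counts)
```

Slicing more finely works the same way here as in the R version: add a column such as an admin1 field to the grouping list.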

The script also shows a couple of examples of how to use ggplot2 to generate time-series plots of those monthly counts. Here’s one I made of monthly counts of battle events by country for the entire period covered by ACLED as of this writing: January 1997–June 2015. A production-ready version of this plot would require some more tinkering with the size of the country names and the labeling of the x-axis, but this kind of small-multiples chart offers a nice way to explore the data before analysis.

Monthly counts of battle events, January 1997-June 2015

If you use the script and find flaws in it or have ideas on how to make it work better or do more, please email me at ulfelder <at> gmail <dot> com.

One Measure By Which Things Have Recently Gotten Worse

The United Nations’ refugee agency today released its annual report on people displaced by war around the world, and the news is bad:

The number of people forcibly displaced at the end of 2014 had risen to a staggering 59.5 million compared to 51.2 million a year earlier and 37.5 million a decade ago.

The increase represents the biggest leap ever seen in a single year. Moreover, the report said the situation was likely to worsen still further.

The report focuses on raw estimates of displaced persons, but I think it makes more sense to look at this group as a share of world population. The number of people on the planet has increased by more than half a billion in the past decade, so we might expect to see some growth in the number of forcibly displaced persons even if the amount of conflict worldwide had held steady. The chart below plots annual totals from the UNHCR report as a share of mid-year world population, as estimated by the U.S. Census Bureau (here).

unhcr.refugee.trends
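The adjustment behind that chart is just a year-by-year division. Sketched in Python with the UNHCR totals quoted above and rough stand-ins for the Census Bureau’s mid-year world population estimates (the population figures below are approximations, not the numbers used for the actual chart):

```python
# Forcibly displaced persons (UNHCR) and approximate mid-year world population.
displaced = {2004: 37.5e6, 2013: 51.2e6, 2014: 59.5e6}
world_pop = {2004: 6.4e9, 2013: 7.1e9, 2014: 7.2e9}  # rough stand-ins

# Displaced persons as a share of world population.
share = {yr: displaced[yr] / world_pop[yr] for yr in displaced}
print({yr: round(s, 5) for yr, s in share.items()})
```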

The number of observations in this time series is too small to use Bayesian change point detection to estimate the likelihood that the upturn after 2012 marks a change in the underlying data-generating process. I’m not sure we need that kind of firepower, though. After holding more or less steady for at least six years, the share of world population forcibly displaced by war has increased by more than 50 percent in just two years, from about 1 of every 200 people to 1 of every 133 people. Equally important, reports from field workers indicate that this problem only continues to grow in 2015. I don’t think I would call this upturn a “paradigm change,” as UN High Commissioner for Refugees António Guterres did, but there is little doubt that the problem of displacement by war has worsened significantly since 2012.

In historical terms, just how bad is it? Unfortunately, it’s impossible to say for sure. The time series in the UNHCR report only starts in 2004, and a note warns that methodological changes in 2007 render the data before that year incomparable to the more recent estimates. The UNHCR describes the 2014 figure as “the highest level ever recorded,” and that’s technically true but not very informative when recording started only recently. A longer time series assembled by the Center for Systemic Peace (here) supports the claim that the latest raw estimate is the largest ever, but as a share of world population, it’s probably still a bit lower than the levels seen in the post–Cold War tumult of the early 1990s (see here).

Other relevant data affirm the view that, while clearly worsening, the intensity of armed conflict around the world is not at historically high levels, not even for the past few decades. Here is a plot of annual counts of battle-related deaths (low, high, and best estimates) according to the latest edition of UCDP’s data set on that topic (here), which covers the period 1989–2013. Note that these figures have not been adjusted for changes in world population.

Annual estimates of battle-related deaths worldwide, 1989-2013 (data source: UCDP)

We see a similar pattern in the Center for Systemic Peace’s Major Episodes of Political Violence data set (second row here), which covers the whole post-WWII period. For the chart below, I have separately summed the data set’s scalar measure of conflict intensity for two types of conflict, civil and interstate (see the codebook for details). Like the UCDP data, these figures show a local increase in the past few years that nevertheless remains well below the prior peak, which came when the Soviet Union fell apart.

Annual intensity of political violence worldwide, 1946-2014 (data source: CSP)

And, for longer-term perspective, it always helps to take another look at this one, from an earlier UCDP report:

PRIO battle death trends

I’ll wrap this up by pinning a note in something I see when comparing the shorter-term UCDP estimates to the UNHCR estimates on forcibly displaced persons: adjusting for population, it looks like armed conflicts may be killing fewer people but displacing more of them than they used to. That impression is bolstered by a glance at UCDP data on trends in deaths from “intentional attacks on civilians by governments and formally organized armed groups,” which UCDP calls “one-sided violence” (here). As the plot below shows, the recent upsurge in warfare has not yet produced a large increase in the incidence of these killings, either. The line is bending upward, but it remains close to historical lows.

Estimated annual deaths from one-sided violence, 1989-2013 (Source: UCDP)

So, in the tumult of the past few years, it looks like the rate of population displacement has surged while the rate of battle deaths has risen more slowly and the rate of one-sided violence targeting civilians hasn’t risen much at all. If that’s true, then why? Improvements in medical care in conflict zones are probably part of the story, but I wonder if changes in norms and values, and in the international institutions and practices instantiating them, aren’t also shaping these trends. Governments that in the past might have wantonly killed populations they regarded as threats now seem more inclined to press those populations by other means—not always, but more often. Meanwhile, international organizations are readier than ever to assist those groups under pressure by feeding and sheltering them, drawing attention to their miseries, and sometimes even protecting them. The trend may be fragile, and the causality is impossible to untangle with confidence, but it deserves contemplation.

Visualizing Strike Activity in China

In my last post, I suggested that the likelihood of social unrest in China is probably higher than a glance at national economic statistics would suggest, because those statistics conceal the fact that economic malaise is hitting some areas much harder than others and local pockets of unrest can have national effects (ask Mikhail Gorbachev about that one). Near the end of the post, I effectively repeated this mistake by showing a chart that summarized strike activity over the past few years…at the national level.

So, what does the picture look like if we disaggregate that national summary?

The best current data on strike activity in China come from China Labour Bulletin (CLB), a Hong Kong–based NGO that collects incident reports from various Chinese-language sources, compiles them in a public data set, and visualizes them in an online map. Those data include a few fields that allow us to disaggregate our analysis, including the province in which an incident occurred (Location), the industry involved (Industry), and the claims strikers made (Demands). On May 28, I downloaded a spreadsheet with data for all available dates (January 2011 to the present) for all types of incidents and wrote an R script that uses small multiples to compare strike activity across groups within each of those categories.

First, here’s the picture by province. This chart shows that Guangdong has been China’s most strike-prone province over the past several years, but several other provinces have seen large increases in labor unrest in the past two years, including Henan, Hebei, Hubei, Shandong, Sichuan, and Jiangsu. Right now, I don’t have monthly or quarterly province-level data on population size and economic growth to model the relationship among these things, but a quick eyeballing of the chart from the FT in my last post indicates that these more strike-prone provinces skew toward the lower end of the range of recent GDP growth rates, as we would expect.

sparklines.province

Now here’s the picture by industry. This chart makes clear that almost all of the surge in strike activity in the past year has come from two sectors: manufacturing and construction. Strikes in the manufacturing sector have been trending upward for a while, but the construction sector really got hit by a wave in just the past year that crested around the time of the Lunar New Year in early 2015. Other sectors also show signs of increased activity in recent months, though, including services, mining, and education, and the transportation sector routinely contributes a non-negligible slice of the national total.

sparklines.industry

And, finally, we can compare trends over time in strikers’ demands. This analysis took a little more work, because the CLB data on Demands do not follow best coding practices in which a set of categories is established a priori and each demand is assigned to one of those categories. In the CLB data, the Demands field is a set of comma-delimited phrases that are mostly but not entirely standardized (e.g., “wage arrears” and “social security” but also “reduction of their operating territory” and “gas-filing problem and too many un-licensed cars”). So, to aggregate the data on this dimension, I created a few categories of my own and used searches for regular expressions to find records that belonged in them. For example, all events for which the Demands field included “wage arrear”, “pay”, “compensation”, “bonus” or “ot” got lumped together in a Pay category, while events involving claims marked as “social security” or “pension” got combined in a Social Security category (see the R script for details).
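That matching step can be sketched as follows, in Python rather than the R script’s grepl calls; the demand strings below are drawn from the examples in the text, and the patterns only approximate the categories I actually used:

```python
import re

demands = [
    "wage arrears, severance pay",
    "social security",
    "pension, wage arrears",
    "reduction of their operating territory",
]

# Patterns approximating the categories described above; \b keeps the
# overtime abbreviation "ot" from matching inside longer words.
pay_re = re.compile(r"wage arrear|pay|compensation|bonus|\bot\b")
ss_re = re.compile(r"social security|pension")

pay = [bool(pay_re.search(d)) for d in demands]
social_security = [bool(ss_re.search(d)) for d in demands]
print(pay, social_security)
```

One record can land in more than one category this way (the third string matches both), which is appropriate, since strikers often make multiple demands at once.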

The results appear below. As CLB has reported, almost all of the strike activity in China is over pay, usually wage arrears. There’s been an uptick in strikes over layoffs in early 2015, but getting paid better, sooner, or at all for work performed is by far the chief concern of strikers in China, according to these data.

sparklines.demands

In closing, a couple of caveats.

First, we know these data are incomplete, and we know that we don’t know exactly how they are incomplete, because there is no “true” record to which they can be compared. It’s possible that the apparent increase in strike activity in the past year or two is really the result of more frequent reporting or more aggressive data collection on a constant or declining stock of events.

I doubt that’s what’s happening here, though, for two reasons. One, other sources have reported that the Chinese government has actually gotten more aggressive about censoring reports of social unrest in the past two years, so if anything we should expect the selection bias from that process to bend the trend in the opposite direction. Two, theory derived from historical observation suggests that strike activity should increase as the economy slows and the labor market tightens, and the observed data are consistent with those expectations. So, while the CLB data are surely incomplete, we have reason to believe that the trends they show are real.

Second, the problem I originally identified at the national level also applies at these levels. China’s provinces are larger than many countries in the world, and industry segments like construction and manufacturing contain a tremendous variety of activities. To really escape the ecological fallacy, we would need to drill down much further to the level of specific towns, factories, or even individuals. As academics would say, though, that task lies beyond the scope of the current blog post.

Polity Meets Joy Division

The Center for Systemic Peace posted its annual update of the Polity data set on Friday, here. The data set now covers the period 1800–2014.

For those of you who haven’t already fled the page to go download the data and who aren’t political scientists: Polity measures patterns of political authority in all countries with populations larger than 500,000. It is one of the most widely used data sets in the fields of comparative politics and international relations. Polity is also tremendously useful in forecasts of rare political crises—partly because it measures some very important things, but also because it is updated every year on a fairly predictable schedule. Thanks to PITF and CSP for that.

I thought I would mark the occasion by visualizing Polity in a new way (for me, at least). In the past, I’ve used heat maps (here and here) and line plots of summary statistics. This time, I wanted to try something other than a heat map that would show change over time in a distribution, instead of just a central tendency. Weakly inspired by the often-imitated cover of Joy Division’s 1979 album, here’s what I got. Each line in this chart is a kernel density plot of one year’s Polity scores, which range from -10 to 10 and are meant to indicate how democratic a country’s politics are. The small number of cases with special codes that don’t fit on this scale (-66, -77, and -88) have been set aside.

polity.meets.joy.division
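Each curve in that chart is a kernel density estimate of one year’s scores. Here is a minimal numpy sketch of how one such curve can be computed, using a Gaussian kernel with a fixed bandwidth and hypothetical score lists (the chart itself was drawn in R):

```python
import numpy as np

def kde(scores, grid, bw=1.0):
    # Gaussian kernel density estimate evaluated on a fixed grid of points.
    scores = np.asarray(scores, dtype=float)
    z = (grid[:, None] - scores[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(scores) * bw * np.sqrt(2 * np.pi))

# Polity scores run from -10 (most autocratic) to 10 (most democratic).
grid = np.linspace(-10, 10, 201)

# Hypothetical score lists: one authoritarian-heavy year, one democratic-heavy.
curves = {
    1975: kde([-9, -7, -7, -5, -3, 0, 6, 8], grid),
    2014: kde([-7, -4, 0, 4, 6, 7, 8, 9, 9, 10], grid),
}
```

Stacking one such curve per year, each offset a little vertically, produces the ridgeline look of the album cover.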

The chart shows once again that the world has become much more democratic in the past half-century, with most of those gains occurring in the past 30 years. In the early 1960s, the distribution of national political regimes was bimodal, with authoritarian regimes outnumbering the more-democratic ones. As recently as the early 1970s, most regimes still fell toward the authoritarian end of the scale. Starting in the late 1980s, though, the authoritarian peak eroded quickly, and the balance of the distribution shifted toward the democratic end. Despite continuing talk of a democratic recession, the (political) world in 2014 is still mostly composed of relatively democratic regimes, and this data set doesn’t show much change in that basic pattern over the past decade.


An Updated Look at Trends in Political Violence

The Center for Systemic Peace (CSP) has just posted an updated version of its Major Episodes of Political Violence data set, which now covers the period 1946-2014. That data set includes scalar measures of the magnitude of several forms of political violence between and within states. Per the codebook (PDF):

Magnitude scores reflect multiple factors including state capabilities, interactive intensity (means and goals), area and scope of death and destruction, population displacement, and episode duration. Scores are considered to be consistently assigned (i.e., comparable) across episode types and for all states directly involved.

For each country in each year, the magnitude scores range from 0 to 10. The chart below shows annual global sums of those scores for conflicts between and within states (i.e., the INTTOT and CIVTOT columns in the source data).

mepv.intensity.by.year
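The annual sums behind that chart are straightforward to reproduce. A pandas sketch with invented magnitude scores (the source data are country-years with INTTOT and CIVTOT columns):

```python
import pandas as pd

# Toy country-year records carrying the two magnitude columns.
mepv = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year":    [2013, 2013, 2013, 2014, 2014, 2014],
    "INTTOT":  [0, 1, 0, 0, 0, 1],
    "CIVTOT":  [2, 0, 5, 3, 0, 6],
})

# Global annual sums of interstate and civil conflict magnitude.
annual = mepv.groupby("year")[["INTTOT", "CIVTOT"]].sum()
print(annual)
```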

Consistent with other measures, CSP’s data show an increase in violent political conflict in the past few years. At the same time, those data also indicate that, even at the end of 2014, the scale of conflict worldwide remained well below the peak levels observed in the latter decades of the Cold War and its immediate aftermath. That finding provides no comfort to the people directly affected by the fighting ongoing today. Still, it should (but probably won’t) throw another blanket over hyperbolic statements about the world being more unstable than ever before.

If we look at the trends by region, we see what most avid newsreaders would expect to see. The chart below uses the U.S. State Department’s regional designations. It confirms that the recent increase in conflict within states (the orange lines) has mostly come from Africa and the Middle East. Conflicts persist in the Americas and East and South Asia, but their magnitude has generally diminished in recent years. Europe and Eurasia supplies the least violent conflict of any region, but the war in Ukraine—designated a civil conflict by this source and assigned a magnitude score of 2—increased that supply in 2014.

mepv.intensity.by.year.and.region

CSP saw almost no interstate conflict around the world in 2014. The global score of 1 accrues from U.S. operations in Afghanistan. When interstate conflict has occurred in the post–Cold War period, it has mostly come from Africa and the Middle East, too, but East Asia was also a major contributor as recently as the 1980s.

For a complete list of the episodes of political violence observed by CSP in this data set, go here. For CSP’s analysis of trends in these data, go here.

A Note on Trends in Armed Conflict

In a report released earlier this month, the Project for the Study of the 21st Century (PS21) observed that “the body count from the top twenty deadliest wars in 2014 was more than 28% higher than in the previous year.” They counted approximately 163 thousand deaths in 2014, up from 127 thousand in 2013. The report described that increase as “part of a broader multi-year trend” that began in 2007. The project’s executive director, Peter Apps, also appropriately noted that “assessing casualty figures in conflict is notoriously difficult and many of the figures we are looking at here are probably underestimates.”

This is solid work. I do not doubt the existence of the trend it identifies. That said, I would also encourage us to keep it in perspective:

That chart (source) ends in 2005. Uppsala University’s Department of Peace and Conflict (UCDP) hasn’t updated its widely used data set on battle-related deaths for 2014 yet, but from last year’s edition, we can see the tail end of that longer period, as well as the start of the recent upward trend PS21 identifies. In this chart—R script here—the solid line marks the annual, global sums of their best estimates, and the dotted lines show the sums of the high and low estimates:
Annual, global battle-related deaths, 1989-2013 (Data source: UCDP)

If we mentally tack that chart onto the end of the one before it, we can also see that the increase of the past few years has not yet broken the longer spell of relatively low numbers of battle deaths. Not even close. The peak around 2000 in the middle of the nearer chart is a modest bump in the farther one, and the upward trend we’ve seen since 2007 has not yet matched even that local maximum. This chart stops at the end of 2013, but if we used the data assembled by PS21 for the past year to project an increase in 2014, we’d see that we’re still in reasonably familiar territory.

Both of these things can be true. We could be—we are—seeing a short-term increase that does not mark the end of a longer-term ebb. The global economy has grown fantastically since the 1700s, and yet it still suffers serious crises and recessions. The planet has warmed significantly over the past century, but we still see some unusually cool summers and winters.

Lest this sound too sanguine at a time when armed conflict is waxing, let me add two caveats.

First, the picture from the recent past looks decidedly worse if we widen our aperture to include deliberate killings of civilians outside of battle. UCDP keeps a separate data set on that phenomenon—here—which they label “one-sided” violence. If we add the fatalities tallied in that data set to the battle-related ones summarized in the previous plot, here is what we get:

Annual, global battle-related deaths and deaths from one-sided violence, 1989-2013 (Data source: UCDP)

Note the difference in the scale of the y-axis; it is an order of magnitude larger than the one in the previous chart. At this scale, the peaks and valleys in battle-related deaths from the past 25 years get smoothed out, and a single peak—the Rwandan genocide—dominates the landscape. That peak is still much lower than the massifs marking the two World Wars in the first chart, but it is huge nonetheless. Hundreds of thousands of people were killed in a matter of months.

Second, the long persistence of this lower rate does not prove that the risk of violent conflict on the scale of the two World Wars has been reduced permanently. As Bear Braumoeller (here) and Nassim Nicholas Taleb (here; I link reluctantly, because I don’t care for the scornful and condescending tone) have both pointed out, a single war between great powers could end or even reverse this trend, and it is too soon to say with any confidence whether or not the risk of that happening is much lower than it used to be. Like many observers of international relations, I think we need to see how the system processes the (relative) rise of China and declines of Russia and the United States before updating our beliefs about the risk of major wars. As someone who grew up during the Cold War and was morbidly fascinated by the possibility of nuclear conflagration, I think we also need to remember how close we came to nuclear war on some occasions during that long spell, and to ponder how absurdly destructive and terrible that would be.

Strictly speaking, I’m not an academic, but I do a pretty good impersonation of one, so I’ll conclude with a footnote to that second caveat: I did not attribute the idea that the risk of major war is a thing of the past to Steven Pinker, as some do, because as Pinker points out in a written response to Taleb (here), he does not make precisely that claim, and his wider point about a long-term decline in human violence does not depend entirely on an ebb in warfare persisting. It’s hard to see how Pinker’s larger argument could survive a major war between nuclear powers, but then if that happened, who would care one way or another if it had?

The Stacked-Label Column Plot

Most of the statistical work I do involves events that occur rarely in places over time. One of the best ways to get or give a feel for the structure of data like that is with a plot that shows variation in counts of those events across sequential, evenly-sized slices of time. For me, that usually means a sequence of annual, global counts of those events, like the one below for successful and failed coup attempts over the past several decades (see here for the R script that generated that plot and a few others and here for the data):

Annual, global counts of successful and failed coup attempts per the Cline Center’s SPEED Project, 1946-2005

One thing I don’t like about those plots, though, is the loss of information that comes from converting events to counts. Sometimes we want to know not just how many events occurred in a particular year but also where they occurred, and we don’t want to have to query the database or look at a separate table to find out.

I try to do both in one go with a type of column chart I’ll call the stacked-label column plot. Instead of building columns from bricks of identical color, I use blocks of text that describe another attribute of each unit—usually country names in my work, but it could be lots of things. In order for those blocks to have comparable visual weight, they need to be equally sized, which usually means using labels of uniform length (e.g., two- or three-letter country codes) and a fixed-width font like Courier New.

I started making these kinds of plots in the 1990s, using Excel spreadsheets or tables in Microsoft Word to plot things like protest events and transitions to and from democracy. A couple decades later, I’m finally trying to figure out how to make them in R. Here is my first reasonably successful attempt, using data I just finished updating on when countries joined the World Trade Organization (WTO) or its predecessor, the General Agreement on Tariffs and Trade (GATT).

Note: Because the WordPress template I use crams blog-post content into a column that’s only half as wide as the screen, you might have trouble reading the text labels in some browsers. If you can’t make out the letters, try clicking on the plot, then increasing the zoom if needed.

Annual, global counts of countries joining the global free-trade regime, 1960-2014


Without bothering to read the labels, you can see the time trend fine. Since 1960, there have been two waves of countries joining the global free-trade regime: one in the early 1960s, and another in the early 1990s. Those two waves correspond to two spates of state creation, so without the labels, many of us might infer that those stacks are composed mostly or entirely of new states joining.

When we scan the labels, though, we discover a different story. As expected, the wave in the early 1960s does include a lot of newly independent African states, but it also includes a couple of European Communist countries (Yugoslavia and Poland) and some middle-income cases from other parts of the world (e.g., Argentina and South Korea). Meanwhile, the wave of the early 1990s turns out to include very few post-Communist countries, most of which didn’t join until the end of that decade or early in the next one. Instead, we see a second wave of “developing” countries joining on the eve of the transition from GATT to the WTO, which officially happened on January 1, 1995. I’m sure people who really know the politics of the global free-trade regime, or of specific cases or regions, can spot some other interesting stories in there, too. The point, though, is that we can’t discover those stories if we can’t see the case labels.

Here’s another one that shows which countries had any coup attempts each year between 1960 and 2014, according to Jonathan Powell and Clayton Thyne’s running list. In this case, color tells us the outcomes of those coup attempts: red if any succeeded, dark grey if they all failed.

Countries with any coup attempts per Powell and Thyne, 1960-2014

One story that immediately catches my eye in this plot is Argentina’s (ARG) remarkable propensity for coups in the early 1960s. It shows up in each of the first four columns, although only in 1962 are any of those attempts successful. Again, this is information we lose when we only plot the counts without identifying the cases.

The way I’m doing it now, this kind of chart requires data to be stored in (or converted to) event-file format, not the time-series cross-sectional format that many of us usually use. Instead of one row per unit–time slice, you want one row for each event. Each row should have at least two columns: the case label and the time slice in which the event occurred.
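If your data start out in country-year format with an event count, exploding them into event format takes only a line or two in R. Here’s a toy sketch (the data and variable names are made up for illustration, not from my actual script):

```r
# Toy country-year counts (time-series cross-sectional format):
# one row per country-year, with a count of coup attempts.
tscs <- data.frame(
  year = c(1962, 1962, 1963),
  code = c("ARG", "SYR", "ARG"),
  n    = c(2, 0, 1)
)

# Event format: repeat each row as many times as it has events, then
# drop the count column. Rows with n == 0 disappear entirely.
events <- tscs[rep(seq_len(nrow(tscs)), times = tscs$n), c("year", "code")]

events  # three rows: ARG in 1962 (twice) and ARG in 1963
```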

If you’re interested in playing around with these types of plots, you can find the R script I used to generate the ones above here. Perhaps some enterprising soul will take it upon him- or herself to write a function that makes it easy to produce this kind of chart across a variety of data structures.

It would be especially nice to have a function that worked properly when the same label appears more than once in a given time slice. Right now, I’m using the function ‘match’ to assign y values that evenly stack the events within each bin. That doesn’t work for the second or third or nth match, though, because the ‘match’ function always returns the position of the first match in the relevant vector. So, for example, if I try to plot all coup attempts each year instead of all countries with any coup attempts each year, the second or later events in the same country get placed in the same position as the first, which ultimately means they show up as blank spaces in the columns. Sadly, I haven’t figured out yet how to identify location in that vector in a more general way to fix this problem.
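One approach that seems to work in a quick test: replace ‘match’ with ‘ave’ and ‘seq_along’, which number the events within each time slice, so the nth repeat of a label gets position n instead of colliding with the first. A sketch with toy data (again, the names here are just for illustration):

```r
# 'match' always returns the position of the first match, so two ARG
# events in the same year would land on the same y value. 'ave' with
# FUN = seq_along assigns a running index within each year instead.
coups <- data.frame(
  year = c(1962, 1962, 1962, 1963),
  code = c("ARG", "ARG", "SYR", "ARG")  # ARG appears twice in 1962
)

# Within-year running index: repeated labels get distinct stack heights.
coups$ypos <- ave(seq_along(coups$code), coups$year, FUN = seq_along)
coups$ypos  # 1 2 3 1

# Plot the labels as a stacked-label column chart in a fixed-width font.
plot(coups$year, coups$ypos, type = "n", xlab = "Year", ylab = "Count")
text(coups$year, coups$ypos, labels = coups$code, family = "mono")
```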
