A Few Rules of Thumb for Data Munging in Political Science

1. However hard you think it will be to assemble a data set for a particular analysis, it will be exponentially harder, with the size of the exponent determined by the scope and scale of the required data.

  • Corollary: If the data you need would cover the world (or just poor countries), they probably don’t exist.
  • Corollary: If the data you need would extend very far back in time, they probably don’t exist.
  • Corollary: If the data you need are politically sensitive, they probably don’t exist. If they do exist, you probably can’t get them. If you can get them, you probably shouldn’t trust them.

2. However reliable you think your data are, they probably aren’t.

  • Corollary: A couple of digits after the decimal point is plenty. With data this noisy, what do those thousandths really mean, anyway?

3. Just because a data transformation works doesn’t mean it’s doing what you meant it to do.

4. The only really reliable way to make sure that your analysis is replicable is to have someone previously unfamiliar with the work try to replicate it. Unfortunately, a person’s incentive to replicate someone else’s work is inversely correlated with his or her level of prior involvement in the project. Ergo, this will rarely happen until after you have posted your results.

5. If your replication materials will include random parts (e.g., sampling) and you’re using R, don’t forget to set the seed for random number generation at the start. (Alas, I am living this mistake today.)
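
Rule 5 names R, where the relevant call is set.seed(). The same habit carries over to other languages; here is a minimal sketch in Python, assuming a made-up list of country codes and an arbitrary seed value. The helper draw_sample is hypothetical, written only for this illustration.

```python
import random


def draw_sample(population, k, seed):
    """Draw k items without replacement using a generator seeded explicitly."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    return rng.sample(population, k)


# Illustrative inputs only: the codes and the seed are assumptions, not real project data.
countries = ["AFG", "BRA", "CHN", "DEU", "EGY", "FRA", "GHA", "IND"]

first = draw_sample(countries, 3, seed=20130101)
second = draw_sample(countries, 3, seed=20130101)

# Same seed, same inputs, same draw: this is what makes the step replicable.
print(first == second)  # True
```

Passing the seed in as an argument, rather than relying on global state, keeps the random step reproducible even if other code draws random numbers in between.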

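Rule 3 is easiest to see with a merge that runs without error but quietly fails to match some rows. The sketch below uses plain Python and invented values (the GDP figures are placeholders, not real data); left_join is a hypothetical helper, not a library function.

```python
def left_join(rows, lookup, key):
    """Attach a 'gdp' value from lookup to each row; unmatched rows get None."""
    return [dict(row, gdp=lookup.get(row[key])) for row in rows]


# Placeholder data: the same country appears under two different names.
events = [
    {"country": "Ivory Coast", "year": 2010},
    {"country": "Ghana", "year": 2010},
]
gdp_by_country = {"Cote d'Ivoire": 1.0, "Ghana": 2.0}

merged = left_join(events, gdp_by_country, "country")

# The join "worked" (no error, same row count), so check what it actually did.
assert len(merged) == len(events)
unmatched = [row["country"] for row in merged if row["gdp"] is None]
print(unmatched)  # ['Ivory Coast']
```

Checking row counts and listing unmatched keys after every merge is cheap, and it catches exactly the kind of transformation that works without doing what you meant.
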
Please use the Comments to suggest additions, corrections, or modifications.

13 Comments

  1. Great list, thanks Jay. Obvious addition: Comment your code. Extensively.
  2. Another one: Countries never have the same names across datasets, so use country codes.
  3. Modification to 5: You can (and should) set your random seed in almost any programming language.
  4. On the corollary to #2: Do political scientists use the concept of significant figures? It’s how physical scientists figure out how many digits after the decimal point to use.
    • I see the discipline through a soda straw, but in my experience, the concept of significant digits is taught but not widely practiced.
      • It’s a very useful concept, more quantitative for the physical sciences, but could be applied as you do above. It’s adhered to in serious calculations, not so much in (and by) the media.

  5. However long you think it’ll take to build your own dataset, add a year to that estimate.
