A few years ago I launched GovTrack.us. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. At the time what I was thinking was just collecting information about Congress from various sources (THOMAS, the Senate website, and the House website) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation that was made available through THOMAS (as I understand it thanks to the Republican take-over in 1994) that inspired me to try to put the data to new uses. It started with updates by email of what your congressmen were up to each day, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized email update for anyone who wanted it.
The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but it’s not out there in a way that makes it easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The trouble is this: While people have no trouble browsing and searching THOMAS (for instance) for the information they need, we can’t make computers do the same thing automatically without much difficulty. To take an example, if I want to have my computer automatically fetch for me a list of all bills that were acted on the previous day (and in fact this is something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:
“Eleven bills and one resolution were introduced, as follows: S. 360-370 and S. Res. 37.”
I have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill S. 365 introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?
Let’s go back to the goal of this. Certainly I don’t think it’s the government’s job to necessarily provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hits are. There are a million and one things that one can do with information about the status of legislation, and someone will want each of them. So the question is this: How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?
To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do anything with it that someone might want?
This is a question that I’ll probably blog more than once about on this site in the next few months. The answer is what’s called structured (or “machine-readable”) data, and it comes down to publishing information twice, once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.