Committees: The most important not-understood aspect of Congress

When I first started working on what would become GovTrack six years ago, as a college sophomore who had up to that point zero interest in politics, I had no idea what congressional committees were all about. I don’t think they ever came up in any civics-related classes through high school. Really, we’re all led to believe that what we can see, on C-SPAN and in the records of votes, is where the action in Congress actually takes place. As it turns out, that’s pretty far from the truth.

It’s really no coincidence that C-SPAN airs basically the same camera angle all day. It’s not because the House and Senate camera operators are lazy. It’s because there’s no one else in the room besides the person talking (and the presiding officer and clerks, etc.). So what’s the camera guy supposed to show: a room full of some 500 empty chairs? Real debate doesn’t take place on the House and Senate floors. Nor are real outcomes the results of votes on legislation. By and large, bills don’t even come to a vote unless the outcome is clear beforehand.

The real legislating takes place off camera, before votes, within committees. I don’t think that’s bad at all, mind you. I only wish committees got the attention they deserve. Information on what the committees are actually doing is particularly difficult to get a handle on.

So now six years later, I still hardly know what committees do. They vote on things. They hold hearings. A bill is made or broken in committee. But how does that process work? Committees need to refocus their websites to make it clear to us everyday people what they are doing: what legislation they are considering, what their votes are on (in clear terms) and what the outcomes are, and what is happening to each bill assigned to them. An “amendment in the nature of a substitute” needs explaining, and needs to be followed up with what was changed, and why.

States are leading the way with downloadable legislative databases

I’ve blogged here before (1, 2) about how publishing raw, structured data that can be processed by computers can have unpredictable benefits, and I feel strongly that Congress should provide a raw database download of the status of all legislation. (They have the database already; it’s what powers THOMAS.) I didn’t realize, though, that a number of state legislatures are already leading the way in this regard.

First, for some background, other federal entities have embraced this notion of providing raw databases of information. To name a few: the House of Representatives itself (insofar as it provides voting records as XML), the Census Bureau (e.g. the census itself and TIGER/Line), the SEC, and the FEC. Providing the public with unfettered access to the raw data the government has is not a new or controversial idea.

So for legislative data, it seems some state legislatures are ahead of Congress. Thanks to this links page of state legislature websites, I was able to compile this list of what the states are doing (modulo anything I missed):

Five states provide structured legislative databases (i.e. this is excellent): Illinois (XML wow!!), Connecticut (CSV format), Minnesota, Oregon, and Texas (I think; browsing the FTP site doesn’t work with Firefox). California does as well, but its data is really only semi-structured.

Eleven states provide direct-to-you updates of bill status (i.e. this is excellent too, but not raw data). By email: Alaska, Florida (but non-anonymous registration required!!), Kentucky, Michigan, Minnesota, Nebraska (but some things are not free!), New Mexico, and Wisconsin. With RSS: Delaware, Michigan, Texas (including committee meetings!), and Utah.

All of the other states, and Congress itself, are in the category of providing neither a raw data download, nor RSS feeds, nor any other customized form of legislative tracking.

Final remarks: All of the states had web interfaces to the status of their local legislation, and I have to say that some, like Florida’s, were actually very impressive. Iowa even has bill version tracking.

(And lastly, Alabama’s Legislative Information System shamefully doesn’t even let users in who aren’t using Microsoft Internet Explorer, so I have no way to know if they have a data download! And Kansas charges for multi-bill tracking.)

Bill Versioning: Unintended consequences of data openness

After a bill is introduced in Congress, we know it may change in a whole host of ways before it is finally passed. But if you’ve read the bill at one point in time and want to know how it’s changed since you last read it, until recently your only hope was to scan through all of the amendments and committee/conference reports that occurred in the meantime and try to figure out what things like this mean out of context:

On page 9, strike line 24.

On page 10, and 18, strike “intervention.”, and insert the following: intervention; and

(13) no United States military forces should be deployed to Iraq after the date of the enactment of this Act unless the Secretary of Defense certifies to Congress before such deployment that such forces are adequately equipped and trained for the missions to be discharged by such forces in Iraq.

Things could be easier. You’re probably familiar with Microsoft Word’s Track Changes feature, or its Compare Documents tool, which will tell you how a document changed between versions. Finding changes is something computers are good at. Let’s unleash them on bills in Congress to make our lives easier. Earlier this year I did just that on my site, GovTrack. See the text of this bill for an example of how tracking changes can be useful.

Tracking changes to the text of bills, an idea possibly first due to chapter1 on dailykos, is one of the multitude of possibilities that I don’t expect Congress to necessarily provide to us itself, but it is something that I think Congress should enable the public to do for itself. From a technical perspective I can tell you what’s needed: we need the text of the bills (obviously), but PDFs won’t do because PDFs are (in a sense) like images of text to a computer: completely un-understandable. A text-only version of the bill is really needed, like what the GPO provides, here for S. 1. But the GPO’s text-only pages for bills are problematic too, because computers can’t easily tell the difference between section and paragraph numbering (insignificant for the purposes of comparison) and significant numbers within bills, like dollar amounts. The trouble is that if the comparison flagged every line number throughout the entire bill as changed just because a single line was inserted or removed, the list of changes would be so overwhelming that it would be useless. So ideally, bill version tracking would use some text-only version of bills that separated the insignificant formatting, changes to which would be ignored, from the significant content.

Fortunately, exactly this type of text-only format of bills is something Congress is already creating in their bill drafting work-flow, so we don’t need to ask Congress to spend any money for us to be able to see changes to bills. In fact, every bill is drafted in this form precisely because the bill drafters don’t want to number paragraphs themselves. They write something akin to “###” as a placeholder for a number, and it gets inserted later in the printing process. That’s exactly the kind of red flag a computer can use to distinguish what is insignificant formatting.
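To make the idea concrete, here is a minimal sketch in Python of comparing two versions of a bill while ignoring numbering. It is not the code GovTrack runs. It assumes, purely for illustration, that every line begins with a print line number, optionally followed by an enumerator like "(b)", and that anything after that is significant content. The sample bill text and the function names are made up.

```python
import difflib
import re

# Assumed layout (hypothetical): "<line number> (<enumerator>) text".
# Only numbering at the very start of a line is treated as insignificant.
LEADING_NUMBERING = re.compile(r"^\s*\d+\s+(\(\w+\)\s+)?")


def normalize(line):
    """Drop the print line number and turn a leading enumerator into '###'."""
    def placeholder(match):
        return "### " if match.group(1) else ""
    return LEADING_NUMBERING.sub(placeholder, line)


def significant_changes(old_lines, new_lines):
    """Return only the added ('+ ') and removed ('- ') lines after normalizing."""
    old = [normalize(line) for line in old_lines]
    new = [normalize(line) for line in new_lines]
    return [d for d in difflib.ndiff(old, new) if d.startswith(("+ ", "- "))]


old_version = [
    "1 (a) The Secretary shall report to Congress.",
    "2 (b) There is authorized $5,000,000 for this purpose.",
]
new_version = [
    "1 (a) The Secretary shall report to Congress.",
    "2 (b) The report shall be made public.",
    "3 (c) There is authorized $5,000,000 for this purpose.",
]

changes = significant_changes(old_version, new_version)
for change in changes:
    print(change)
```

Only the inserted paragraph is reported. The renumbering of the paragraph after it is ignored, while the dollar amount inside the text is left alone and would still be compared. With the drafters' placeholder files, the guesswork in the regular expression would not be needed at all.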

Funny thing, though. These files that the legislative drafters produce are not freely available. Other versions of the same thing are. For instance, most bills in the House these days are drafted first in XML and then converted to the above plain-text format. Those XML files are public and freely available. Likewise, the GPO uses the plain-text bills to generate the PDFs automatically, and the PDFs are public and freely available too. THOMAS uses them to create nice-looking HTML-formatted web versions (i.e. displayed formatted nicely within your web browser without the need of Adobe and a PDF), and of course THOMAS is public and free. Lastly, these text-only files are publicly available from the GPO if you really want them, so there’s no underlying policy issue here about whether the files should be publicly viewable, but you have to subscribe to them from the GPO for nearly $8,000/year. (This is far, far beyond the actual cost to the GPO of distributing the files, which means the GPO is using the sales of these files to compensate for other expenses so they stay fiscally neutral.)

These plain-text files were designed to be created by people, but their purpose was to be fed into a printer to be turned into a pretty, printable bill. Really, these files had no use to anyone but the GPO because why would anyone before 2007 want to see the text of a bill in a format other than PDF or formatted-plain-text as they already provide? So from that perspective it’s easy to see why the GPO wouldn’t see the need to post the files freely.

For the “bill diff’er” on GovTrack, it was fortunate that the web versions of bill text on THOMAS were sufficiently the same as (i.e. true to) the underlying text-only bill files that it was possible to use the pages on THOMAS in their place. But this was really an accident: THOMAS didn’t mean to do that at all. Still, it shows precisely the positive unintended consequences of putting on the Web the raw materials that Congress already has, in a form that computers can manipulate for us. In this case, THOMAS provided a little loophole for getting (something close to) those plain-text files, and out of that we get an entirely new tool for transparency: tracking bill changes.

This really goes to the same point that I blogged about last time, which is that what it means to be on the Web in 2007 is to publish things in parallel: once for people to view within their web browser as they surf the web (the PDFs, GPO formatted text, and THOMAS HTML formatted text), and again (the THOMAS HTML text as a proxy for the original plain-text files) for computers to mix and match to provide people with useful new ways to appreciate the same information.

The Open House Project

(Just here for archival purposes…)

On my GovTrack blog: It’s rare when Congress asks the people for help being transparent, and so I’m particularly pleased to announce the formation of The Open House Project, a Sunlight Foundation-sponsored project with the encouragement of Speaker Pelosi that will be making specific proposals about how The House can better use the Internet in the interests of transparency. Various people, including myself, will be blogging on that site over the next few weeks about some ideas on this point. Feel free to contribute your ideas by commenting on the TOHP website, joining the project’s mail list, or talking on GovTrack’s own mail list.

And on the TOHP blog:

Mash-ups for government transparency

January 25th, 2007 by Joshua Tauberer

A few years ago I launched GovTrack. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. At the time what I was thinking was just collecting information about Congress from various sources (THOMAS, the Senate website, and the House website) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation that was made available through THOMAS (as I understand it thanks to the Republican take-over in 1994) that inspired me to try to put the data to new uses. It started with updates by email of what your congressmen were up to each day, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized email update for anyone who wanted it.

The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but it’s not out there in a way that makes it easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The trouble is this: While people have no trouble browsing and searching THOMAS (for instance) for the information they need, we can’t make computers do the same thing automatically without much difficulty. To take an example, if I want to have my computer automatically fetch for me a list of all bills that were acted on the previous day (and in fact this is something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:

“Eleven bills and one resolution were introduced, as follows: S. 360-370 and S. Res. 37.”

I have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill S. 365 introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?
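To give a feel for the kind of reverse-engineering involved, here is a rough Python sketch that expands the bullet quoted above into an explicit list of bills. It is only an illustration, not GovTrack's actual parser. The pattern covers just a handful of bill-type abbreviations and this one phrasing, and the function name is invented for the example. Real Daily Digest prose varies far more than this.

```python
import re

# Longer abbreviations come first so that "S. Res." is not read as "S.".
# This list of bill types is an assumption and is deliberately incomplete.
ITEM = re.compile(r"(S\. Res\.|H\. Res\.|H\.R\.|S\.)\s*(\d+)(?:-(\d+))?")


def expand_bill_list(text):
    """Turn prose like 'S. 360-370 and S. Res. 37' into (type, number) pairs."""
    bills = []
    for bill_type, first, last in ITEM.findall(text):
        start = int(first)
        end = int(last) if last else start
        # A range such as 360-370 implicitly names every number in between.
        for number in range(start, end + 1):
            bills.append((bill_type, number))
    return bills


digest_line = ("Eleven bills and one resolution were introduced, "
               "as follows: S. 360-370 and S. Res. 37.")
bills = expand_bill_list(digest_line)

print(len(bills))            # eleven bills plus one resolution
print(("S.", 365) in bills)  # S. 365 is covered by the range
```

Even this tiny case needs special handling for ranges and for abbreviations that share a prefix, and it breaks as soon as the Digest words something differently.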

Let’s go back to the goal of this. Certainly I don’t think it’s the government’s job to necessarily provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hits are. There are a million and one things that one can do with information about the status of legislation, and someone will want each of them. So the question is this: How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?

To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do anything with it that someone might want?

This is a question that I’ll probably blog more than once about on this site in the next few months. The answer is what’s called structured (or “machine-readable”) data, and it comes down to publishing information twice, once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.
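As a small preview of what "boring, explicit tables" might look like, here is a hedged Python sketch. The CSV layout and column names are invented for illustration and are not any format THOMAS publishes. The point is only that when each bill gets its own row, answering a question about it takes no language understanding at all.

```python
import csv
import io

# Hypothetical machine-readable listing: one explicit row per bill.
# Column names and values are illustrative only.
STRUCTURED_LISTING = """bill_type,number,action
s,364,introduced
s,365,introduced
s,366,introduced
sres,37,introduced
"""


def load_rows(text):
    """Parse the CSV text into a list of dictionaries, one per bill."""
    return list(csv.DictReader(io.StringIO(text)))


def was_introduced(rows, bill_type, number):
    """Check for an explicit 'introduced' row for the given bill."""
    return any(
        row["bill_type"] == bill_type
        and int(row["number"]) == number
        and row["action"] == "introduced"
        for row in rows
    )


rows = load_rows(STRUCTURED_LISTING)
print(was_introduced(rows, "s", 365))
print(was_introduced(rows, "s", 400))
```

The same rows could just as easily feed an email update, an RSS feed, or a calendar, which is what makes the format purpose-neutral.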