Bill Versioning: Unintended consequences of data openness

After a bill is introduced in Congress, it may change in a whole host of ways before it is finally passed. But if you’ve read the bill at one point and want to know how it has changed since, until recently your only hope was to scan through all of the amendments and committee and conference reports issued in the meantime and try to figure out what instructions like these mean out of context:

On page 9, strike line 24.

On page 10, and 18, strike “intervention.”, and insert the following: intervention; and

(13) no United States military forces should be deployed to Iraq after the date of the enactment of this Act unless the Secretary of Defense certifies to Congress before such deployment that such forces are adequately equipped and trained for the missions to be discharged by such forces in Iraq.

Things could be easier. You’re probably familiar with Microsoft Word’s Track Changes feature, or its Compare Documents tool, which will tell you how a document changed between versions. Finding changes is something computers are good at. Let’s unleash them on bills in Congress to make our lives easier. Earlier this year I did just that on my site GovTrack.us. See the text of this bill for an example of how tracking changes can be useful.

Tracking changes to the text of bills (an idea possibly first due to chapter1 on dailykos) is one of the multitude of possibilities that I don’t expect Congress to necessarily provide to us itself, but it is something that I think Congress should enable the public to do if the public wants to. From a technical perspective I can tell you what’s needed. We need the text of the bills (obviously), but PDFs won’t do, because to a computer PDFs are (in a sense) like images of text: completely un-understandable. What is really needed is a text-only version of the bill, like what the GPO provides, here for S. 1. But the GPO’s text-only pages for bills are problematic too, because computers can’t easily tell the difference between section, paragraph, and line numbering (insignificant for the purposes of comparison) and the numbers within the text of bills (significant), like dollar amounts. The trouble is that if the comparison flagged every line number throughout the entire bill as changed just because a single line was inserted or removed, the list of changes would be so overwhelming that it would be useless. So ideally, bill version tracking would use some text-only version of bills that separates the insignificant formatting, changes to which would be ignored, from the significant content.
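To make the numbering problem concrete, here is a minimal sketch in Python using the standard `difflib` module. The sample bill lines, the `LINE_NUMBER` pattern, and the helper names are all made up for illustration; this is not the code GovTrack runs, and real GPO text is messier than this.

```python
import difflib
import re

# Assumption: each printed line starts with its line number, then the text.
LINE_NUMBER = re.compile(r"^\s*\d+\s+")

def strip_line_numbers(lines):
    """Drop the leading printed line number from each line."""
    return [LINE_NUMBER.sub("", line) for line in lines]

def changed_lines(old, new):
    """Return only the added ('+ ') and removed ('- ') lines of a diff."""
    return [d for d in difflib.ndiff(old, new) if d.startswith(("+ ", "- "))]

# Hypothetical bill text, before and after one line is inserted.
old_version = [
    "1 The Secretary shall submit a report.",
    "2 The report shall cover $500 in costs.",
    "3 This Act takes effect in 2008.",
]
new_version = [
    "1 The Secretary shall submit a report.",
    "2 The report shall be made public.",
    "3 The report shall cover $500 in costs.",
    "4 This Act takes effect in 2008.",
]

# Comparing the printed text directly: every renumbered line looks changed.
naive = changed_lines(old_version, new_version)

# Comparing with the line numbers removed: only the real insertion remains.
cleaned = changed_lines(
    strip_line_numbers(old_version), strip_line_numbers(new_version)
)

print(len(naive), "lines flagged when numbering is compared")
print(cleaned)
```

Note that the pattern only removes a number at the very start of a line, so the dollar amount and the year inside the text survive. That works in this toy case, but it is exactly the kind of guess that breaks on real bills, which is why a source format that marks the numbering explicitly is so much better.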

Fortunately, exactly this type of text-only format is something Congress already creates in its bill drafting workflow, so we don’t need to ask Congress to spend any money for us to be able to see changes to bills. In fact, every bill is drafted in this form precisely because the bill drafters don’t want to number paragraphs themselves. They write something akin to “###” as a placeholder for a number, and the actual number gets inserted later in the printing process. That’s exactly the kind of marker a computer can use to recognize what is insignificant formatting.
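Here is a rough sketch of why the placeholder form is so convenient for comparison. The literal `###` marker, the `render` function, and the sample paragraphs are assumptions for illustration only; the drafters' real markup is only "something akin to" this, and the real printing process is far more involved.

```python
import difflib

# Assumption: "###" stands in for whatever marker the drafting format uses.
PLACEHOLDER = "###"

def render(source_lines):
    """Fill each placeholder with the next paragraph number, as a printer might."""
    rendered, counter = [], 0
    for line in source_lines:
        if line.startswith(PLACEHOLDER):
            counter += 1
            line = f"({counter})" + line[len(PLACEHOLDER):]
        rendered.append(line)
    return rendered

def changed_lines(old, new):
    """Return only the added ('+ ') and removed ('- ') lines of a diff."""
    return [d for d in difflib.ndiff(old, new) if d.startswith(("+ ", "- "))]

# Hypothetical drafting source, before and after one paragraph is added.
old_source = [
    "### funds are appropriated;",
    "### reports are filed; and",
    "### audits are completed.",
]
new_source = [
    "### funds are appropriated;",
    "### oversight is established;",
    "### reports are filed; and",
    "### audits are completed.",
]

# Diffing the placeholder form reports just the new paragraph.
source_changes = changed_lines(old_source, new_source)

# Diffing the printed form also reports every paragraph that was renumbered.
rendered_changes = changed_lines(render(old_source), render(new_source))

print(source_changes)
print(len(rendered_changes), "lines flagged in the printed form")
```

Because the numbers are not in the source at all, there is nothing to strip out and nothing to guess about: the comparison sees only what the drafters actually changed.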

Funny thing, though: these files that the legislative drafters produce are not available for free, even though other versions of the same thing are. For instance, most bills in the House these days are drafted first in XML and then converted to the above plain-text format. Those XML files are public and freely available. Likewise, the GPO uses the plain-text bills to generate the PDFs automatically, and the PDFs are public and freely available too. THOMAS uses them to create nice-looking HTML-formatted web versions (i.e., displayed nicely formatted within your web browser without the need for Adobe and a PDF), and of course THOMAS is public and free. Lastly, these text-only files are publicly available from the GPO if you really want them, so there’s no underlying policy issue here about whether the files should be publicly viewable, but you have to subscribe to them from the GPO for nearly $8,000/year. (This is far, far beyond the actual cost to the GPO of distributing the files, which means the GPO is using the sales of these files to compensate for other expenses so that it stays fiscally neutral.)

These plain-text files were designed to be created by people, but their purpose was to be fed into a printer and turned into a pretty, printable bill. Really, these files had no use to anyone but the GPO, because why would anyone before 2007 want to see the text of a bill in a format other than the PDF or formatted plain text that the GPO already provides? So from that perspective it’s easy to see why the GPO wouldn’t see the need to post the files freely.

For the “bill diff’er” on GovTrack, it was fortunate that the web versions of bill text on THOMAS were close enough to (i.e., true to) the underlying text-only bill files that it was possible to use the pages on THOMAS in their place. This was really an accident; THOMAS didn’t mean to do that at all. Yet it shows precisely the positive unintended consequences of putting on the Web the raw materials that Congress already has, in a form that computers can manipulate for us. In this case, THOMAS provided a little loophole for getting (something close to) those plain-text files, and out of that we get an entirely new tool for transparency: tracking bill changes.

This really goes to the same point that I blogged about last time, which is that what it means to be on the Web in 2007 is to publish things in parallel: once for people to view within their web browser as they surf the web (the PDFs, GPO formatted text, and THOMAS HTML formatted text), and again (the THOMAS HTML text as a proxy for the original plain-text files) for computers to mix and match to provide people with useful new ways to appreciate the same information.
