Legislative XML: What we have and what we’re seeking

John asked me to clarify what legislative information already exists in XML and what more Congress should provide (this is the subject of the first chapter of our report, Legislative Databases).

What exists now, publicly, is an XML markup of the text of some legislation. First the counts, and then I’ll explain what’s in these files. The House has revised its bill drafting process, and, by my count, currently 97% of House bills (3481 out of the 3558 so far this year) are prepared and published publicly as XML. The Senate, I am told, is well into the process of using a similar (or the same?) system, but they are not yet ready to make their XML files available to the public, so there are no such files for Senate bills. (Thus, XML bill files are available for 63% of the bills introduced this year so far.) Also, since this process is relatively new, the availability of XML for bills only goes back to 2003. The files are available on THOMAS (here or by clicking the XML Display link on bill text pages, where available), and are described at http://xml.house.gov.

XML is a type of structured data format that our report urges using in a few sections, such as for publishing committee schedules (e.g. in RSS, a flavor of XML). But the potential uses of XML depend wholly on what information you encode in it.

These bill XML files are markups of the text of the bill in a structured data format, which means that the organization of the bill into titles, sections, paragraphs, etc., and other formatting considerations like quoted text, are explicitly represented. This is useful when applications want to control how the text of a bill is formatted, rather than using the GPO’s PDF (i.e. display-exactly-as-it-prints) or text-only versions (i.e. no formatting allowed), which don’t look nice if you try to embed them in a web page. For instance, the markup in bill XML files is probably how Sunlight’s LOUIS renders the text of bills in a way that makes it visually pleasing. Marking up the text this way also makes it easier to write applications that tag certain sections with annotations, like an “earmark guide” of some sort. Additionally, references to Members of Congress, like those in the list of a bill’s sponsors, and references to existing law (by name and U.S.C. references) are marked up in the XML files. This means that sites like LOUIS can (and do) turn those words into hyperlinks to relevant information about the people or laws, something you can’t get from the PDF or text forms of bills. The benefit of the existing files to the public has primarily to do with making a user-friendly display of the text, as well as indexing and searching the text.
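To make that concrete, here is a minimal sketch of what an application can pull out of bill text once it is marked up this way. The sample fragment is made up, and the element and attribute names in it (`section`, `enum`, `header`, `sponsor`, `name-id`, `external-xref`, `parsable-cite`) are my assumptions about the general shape of the markup; check them against the documentation at http://xml.house.gov before relying on them.

```python
import xml.etree.ElementTree as ET

# A made-up fragment. The element and attribute names are assumptions
# about the general shape of bill markup, not copied from a real bill.
SAMPLE_BILL = """
<bill>
  <form>
    <action-desc>
      <sponsor name-id="X000001">Mr. Example</sponsor> introduced the following bill
    </action-desc>
  </form>
  <legis-body>
    <section>
      <enum>1.</enum>
      <header>Short title</header>
      <text>This Act may be cited as the <quote>Example Act</quote>.</text>
    </section>
    <section>
      <enum>2.</enum>
      <header>Records</header>
      <text>Section 552 of <external-xref legal-doc="usc" parsable-cite="usc/5/552">title 5, United States Code</external-xref>, is amended.</text>
    </section>
  </legis-body>
</bill>
"""


def outline(root):
    """Return (number, heading) pairs for every section, in document order."""
    return [
        (s.findtext("enum", default="").strip(),
         s.findtext("header", default="").strip())
        for s in root.iter("section")
    ]


def sponsors(root):
    """Return (identifier, displayed name) pairs for marked-up sponsors."""
    return [(e.get("name-id"), (e.text or "").strip()) for e in root.iter("sponsor")]


def law_references(root):
    """Return (citation, displayed text) pairs for references to existing law."""
    return [
        (e.get("parsable-cite"), "".join(e.itertext()).strip())
        for e in root.iter("external-xref")
    ]


root = ET.fromstring(SAMPLE_BILL)
print(outline(root))
print(sponsors(root))
print(law_references(root))
```

The identifiers and citations are what let a site turn a name or a U.S.C. reference into a hyperlink; none of this can be recovered reliably from the PDF or plain-text versions.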

What the House has done for bill XML is useful and important, as can be seen from LOUIS’s use of it to make reading bills easier than it could have been without XML. However, there is much more information about legislation than the text, and that information is also very important to the public. That is what our report urges the House to make available in a structured data format.

This additional information — what is called bill summary and status at the Library of Congress — is made available to the public through the THOMAS website (administered by the LOC). THOMAS has this information going back to 1989, for every bill. However, that information is not made available in a structured data format, limiting the ability of the public to reuse, transform, and mix it to create new views into the Congress. And that is what we’re asking for.

That information includes (besides what is in the existing bill XML files) CRS summaries of bills, a list of every action taken on each bill (votes, motions, referrals, etc.), a list of all titles a bill goes by, committee assignments, a list of related bills, a list of amendments on the bills (incl. title, sponsor, and legislative activity on them), a list of (LIV) subject terms assigned to the bill by CRS (which is very helpful for the public), and related committee documents.
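As a rough illustration of what a structured representation of this status information could look like, here is a sketch that models a few of the fields listed above and serializes them to XML. Everything in it is hypothetical: the field names, the `bill-status` element layout, and the example values are mine, not a format that THOMAS or the Library of Congress publishes.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Action:
    date: str          # ISO date of the legislative action
    description: str   # e.g. a referral, motion, or vote


@dataclass
class BillStatus:
    # Hypothetical record covering a subset of the fields described above.
    bill_id: str
    titles: list = field(default_factory=list)
    committees: list = field(default_factory=list)
    subjects: list = field(default_factory=list)
    related_bills: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    summary: str = ""


def to_xml(status):
    """Serialize a BillStatus to an XML string (made-up element names)."""
    root = ET.Element("bill-status", {"id": status.bill_id})
    for tag, values in (
        ("title", status.titles),
        ("committee", status.committees),
        ("subject", status.subjects),
        ("related-bill", status.related_bills),
    ):
        for value in values:
            ET.SubElement(root, tag).text = value
    for action in status.actions:
        ET.SubElement(root, "action", {"date": action.date}).text = action.description
    if status.summary:
        ET.SubElement(root, "summary").text = status.summary
    return ET.tostring(root, encoding="unicode")


# Invented example values, for illustration only.
example = BillStatus(
    bill_id="hr0000",
    titles=["Example Act"],
    committees=["Example Committee"],
    subjects=["Example subject term"],
    actions=[
        Action("2007-01-01", "Referred to committee."),
        Action("2007-02-01", "Passed by recorded vote."),
    ],
    summary="A placeholder summary.",
)
print(to_xml(example))
```

With records like this, anyone could list every bill a committee is handling or follow a bill's actions over time without scraping web pages.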

The information on THOMAS is really crucial for tracking legislation, especially the list of legislative actions, like votes, and amendments. Without it, that is, with just the text of legislation, the public gets a very static view of the process and is left in the hands of the media to be told whether a bill passed or which committees are responsible for it.

Our recommendation in the report was thus that to the extent such a database already exists in the Library of Congress (it does — it’s how the THOMAS website manages to exist at all), giving the public a structured data representation of the database should be easy, relatively noncontroversial, and a big tip of the hat to transparency.
