I was asked yesterday if I would collect some examples of the use of structured data, or more specifically XML, for government legislative information. Here’s what I can think of off the top of my head:
About structured data
- In late 2006 I wrote "Prose is Poetry to a Computer: What is structured data?" It could use some updating.
Structured data in the U.S. Congress
- The House drafts most of its legislation in XML now, and these files are shared with the public. Unfortunately, the Senate may be drafting legislation in the same format but does not share its files with the public, seriously undermining the usefulness of the House files to the public. These XML files are the text of legislation, so they aid in creating a nice visual display of the text, though the markup is too complicated for me to want to work with it. The files were first systematically shared with the public in 2004, as far as I can see.
- The House publishes its votes in XML (example). This is an interesting case because the XML is actually the primary way it is published to the public. When visitors view the page, they see a visual or HTML rendering of the underlying XML, but technical users can inspect the XML behind the page. It’s completely transparent. This started around 2004-2005, I believe.
- The Senate makes its list of membership and contact information available in XML. They have much more XML than they share. The Library of Congress’s Legislative Information System, which is used internally in the Capitol, has XML data for Senate committee membership, for instance, but the Senate web team was not permitted to publish it (and LIS does not have a public face itself).
- The Senate also recently started publishing their committee hearing schedule in XML. This could have been done with RSS plus some custom tags, but they chose a custom format to more precisely mark up information specific to their needs, which is great. (Unfortunately there will be no data in that file if there are no upcoming meetings.) This feed began in 2008 (afaik).
- The Senate’s lobbying disclosure database is a collection of XML files representing filed forms. It is made available to the public on a timely basis. The records go back to 1999, but were first published only in February 2008.
- Various committees publish RSS feeds for their news and events. RSS is a flavor of XML.
- Behind the scenes, the Library of Congress’s LIS unit maintains a rich database of legislative information in XML, but they do not share it with anyone (inside or outside of the Capitol), as far as I am aware.
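To make the value of published XML concrete, here is a minimal sketch of what a technical user can do once a vote record is available as XML rather than only as a rendered page. The sample document, its element names (`vote`, `question`, `ballot`) and its attributes are invented for illustration; they are not the House's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample record. Element and attribute names are made up
# for this sketch and do not reflect any real congressional schema.
SAMPLE = """\
<vote number="42" date="2008-06-12">
  <question>On Passage</question>
  <ballot member="A. Example" state="NY">Yea</ballot>
  <ballot member="B. Sample" state="TX">Nay</ballot>
  <ballot member="C. Placeholder" state="CA">Yea</ballot>
</vote>
"""

def tally(xml_text):
    """Count how many ballots were cast for each position."""
    root = ET.fromstring(xml_text)
    return Counter(ballot.text.strip() for ballot in root.iter("ballot"))

totals = tally(SAMPLE)
print(totals["Yea"], totals["Nay"])
```

Because the positions are marked up explicitly, the tally needs no guessing about page layout. That is the practical difference between data published as XML and data that is only rendered for reading.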
Structured data made independently
- I, of course, try to fill in the gaps in what Congress provides in a structured format, using whatever I can find that Congress provides in a non-structured format. This process of screen-scraping is inexact and brittle, a short-term imperfect solution to a problem with an easy long-term remedy. My GovTrack.us Source Data covers the status of legislation (example), voting records (for both chambers in a common format; example), the text of the Congressional Record, Congressional membership, committee membership (example), etc. I’ve been doing this since 2004.
- The Cornell Legal Information Institute produces an XML version of the U.S. Code, based on some structured but difficult-to-use data files made available by the House. I think they’ve been doing this since around 2004. (more info; example not easily available)
- The Sunlight Labs API provides congressional membership and data-linking information.
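The brittleness of screen-scraping mentioned above can be shown with a small sketch. Both HTML fragments and the pattern are hypothetical and are not taken from any real congressional page; the point is only that a cosmetic change in markup silently breaks the extraction.

```python
import re

# Hypothetical page fragments: the same information, before and after
# a purely cosmetic redesign (<b> replaced by <strong>).
PAGE_V1 = '<p><b>Latest Action:</b> 6/12/2008 Passed House.</p>'
PAGE_V2 = '<p><strong>Latest Action:</strong> 6/12/2008 Passed House.</p>'

# A scraper written against the first layout.
PATTERN = re.compile(
    r"<b>Latest Action:</b>\s*(\d{1,2}/\d{1,2}/\d{4})\s+(.+?)</p>"
)

def scrape_latest_action(html):
    """Return (date, action) if the expected markup is found, else None."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(scrape_latest_action(PAGE_V1))  # works on the layout it was written for
print(scrape_latest_action(PAGE_V2))  # finds nothing after the redesign
```

A structured feed avoids this failure mode because the field is identified by name and not by how it happens to be displayed.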
State-level legislative information
- In early 2007 I surveyed all of the state legislatures and found that four states provided legislative information in a structured data format, plus California with some semi-structured data. See the links therein for more.
- Richmond Sunlight, an independent site like GovTrack for the Virginia state legislature, provides some structured data based on what it collects, since 2008 (afaik).
Federal non-legislative data
To quickly list some other sources of structured data at the federal level:
- FEC’s electronic filings for campaign contributions and related data (XML and flat fixed-width)
- SEC’s EDGAR system for corporate public filings (XML/SGML), and their recent decision to require documents to be submitted to them in XBRL, a dialect of XML.
- From the Census Bureau, essentially the whole census (flat fixed-width) and geographic data (various formats)
- USDA’s nutrition database (extremely comprehensive and crucially helpful for public health; XML if I recall right; it’s downloadable in bulk somewhere)
- USGS’s Earthquake Hazards Program (Atom, KML, XML, CSV)
- The Bureau of Labor Statistics’s datasets
- The National Center for Education Statistics provides survey data in both SAS and non-proprietary flat file formats. (description copied from elsewhere)
- EPA’s Envirofacts Data Warehouse
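Several of the sources above are published as flat fixed-width files, where each field occupies a fixed range of columns. A minimal sketch of reading such a record follows. The layout, field names and sample values are hypothetical; a real file's column positions come from its accompanying documentation.

```python
# Hypothetical record layout: (field name, start column, end column),
# zero-based with an exclusive end, as in Python slicing.
LAYOUT = [
    ("state", 0, 2),
    ("county", 2, 5),
    ("name", 5, 25),
    ("population", 25, 34),
]

def parse_line(line):
    """Slice one fixed-width line into a dict of named fields."""
    record = {name: line[start:end].strip() for name, start, end in LAYOUT}
    record["population"] = int(record["population"])
    return record

# Made-up sample line, built from padded pieces so the columns line up.
SAMPLE = "99" + "001" + "Example County".ljust(20) + "12345".rjust(9, "0")

print(parse_line(SAMPLE))
```

Fixed-width data is structured in the sense that matters here: once the layout is known, every record can be read mechanically, with no guessing.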
Other notable government structured data
- Washington DC sets a real example with its Data Catalog. It covers data produced by many aspects of its local government.