More Data that Changed the World

Continuing from my last post on this subject, I found some more examples of influential data sets from a page on FlowingData.com. I’m expanding beyond government data in this post.

“Baseball Statistics: In 2003, Michael M. Lewis’ book, Moneyball: The Art of Winning an Unfair Game, was released. As a result, the way baseball teams were built changed completely. Before Moneyball, teams relied on insider information and the choice of players was highly subjective. However, in 2002, a year before the book was published, the Oakland A’s had $41 million in salary and had to figure out how to compete against teams like the New York Yankees and the Boston Red Sox who spent over $100 million in salaries.”

“Megan’s Law: Since 1994, those who have been convicted of sex crimes against children have been required to register with local law enforcement. That data is made public so that people know about sex offenders in their area. Mash that data with Google Maps. Lo and behold, parents became instantly aware of caution areas and some might never look at their neighbor the same way ever again, while sex offenders start declaring themselves homeless.”

PulseAudio sound forwarding across a network

Here is a command-line-only way to forward audio over a network in Ubuntu 8.10+. Ubuntu uses PulseAudio as a sound server. It can be configured (esp. using the padevchooser package) to do this “properly”, but I wanted something fast that didn’t require configuration (and, uhm, I didn’t know about padevchooser).

The first step is to establish a secure connection between the machines, using SSH to forward a TCP port on the local machine (the one with the sound application) to a TCP port on the remote machine (the one with the speakers):

(on the computer with the sound application)
ssh -L4000:localhost:4000 remotehost

PulseAudio is a server that listens for connections on a Unix domain socket by default, which means 1) it can’t be accessed remotely, and 2) ssh can’t redirect a TCP port to it directly. One could configure PulseAudio to use a TCP port instead, but then you have to worry about security. So instead, use the socat tool (you might need to install the socat package) on the remote machine to forward the remote machine’s local port 4000 to its PulseAudio Unix socket:

(on the computer with the speakers, e.g. inside the SSH session)
socat TCP-LISTEN:4000,bind=127.0.0.1,fork UNIX-CONNECT:/tmp/pulse-$USER/native

The “bind=127.0.0.1” option makes socat accept only local connections (without it, socat listens on all network interfaces), which is why we need to SSH to the remote machine to connect. That’s a good thing if you like security.

The “,fork” option has socat listen for multiple connections. Otherwise it’ll quit after the first connection (it’ll play one sound file and exit). You can run socat once you’ve logged in with SSH. You might think you can do it all on one line (because ssh takes a second command line argument for a command to run), but it doesn’t work well with this “,fork” option because when you CTRL+C SSH to end the session, socat keeps running (which might be fine, I guess, but it will prevent you from running it a second time since something will already be listening on port 4000).
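The two halves of this setup have to agree on the port number, so it can help to generate both commands from one place. Here is a minimal sketch, assuming a hypothetical helper named print_forwarding_commands. The bind=127.0.0.1 option is an addition that keeps socat listening on the loopback interface only, and the socket path is the one used in this post, which may differ on other systems.

```shell
#!/bin/sh
# Hypothetical helper: print the matching ssh and socat commands for one port.
# Usage: print_forwarding_commands REMOTEHOST [PORT] [SOCKET]
print_forwarding_commands() {
    if [ -z "${1:-}" ]; then
        echo "usage: print_forwarding_commands REMOTEHOST [PORT] [SOCKET]" >&2
        return 2
    fi
    host=$1
    port=${2:-4000}
    # Single quotes keep $USER literal so it is expanded on the speaker machine.
    socket='/tmp/pulse-$USER/native'
    if [ -n "${3:-}" ]; then
        socket=$3
    fi
    printf '%s\n' "# on the computer with the sound application:"
    printf 'ssh -L%s:localhost:%s %s\n' "$port" "$port" "$host"
    printf '%s\n' "# on the computer with the speakers:"
    # bind=127.0.0.1 is an assumption added here: loopback-only listener.
    printf 'socat TCP-LISTEN:%s,bind=127.0.0.1,fork UNIX-CONNECT:%s\n' "$port" "$socket"
}
```

Running print_forwarding_commands remotehost 4001 prints the same two commands with port 4001 on both sides, which is handy if something else already holds port 4000.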

Now PulseAudio-enabled programs can play sound remotely if you tell them to send it to local port 4000 instead of the default local sound server. The paplay program plays a wav file:

paplay soundfile.wav                     # plays it locally
paplay -s localhost:4000 soundfile.wav   # plays it remotely

Or equivalently by setting an environment variable:
PULSE_SERVER=localhost:4000 paplay soundfile.wav

The environment variable method should work for any other program that plays sound with PulseAudio.

The tricks begin when programs don’t support PulseAudio and instead use OSS or ALSA to play sounds. In principle, the padsp command is able to redirect the use of /dev/dsp (i.e. OSS output) to PulseAudio. Likewise, for programs that output sound using ALSA, you can redirect the output to PulseAudio by setting the environment variable ALSA_PCM_NAME=pulse.

PULSE_SERVER=localhost:4000 padsp application_using_oss
PULSE_SERVER=localhost:4000 ALSA_PCM_NAME=pulse application_using_alsa
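The three variants (native PulseAudio, OSS via padsp, ALSA via the environment variable) differ only in their prefix, so they can be wrapped in one function. This is a rough sketch. The names run_remote, PULSE_FORWARD_PORT and DRY_RUN are made up for illustration, and whether a given program honors these redirections depends on the program.

```shell
#!/bin/sh
# Hypothetical wrapper: run_remote BACKEND COMMAND [ARGS...]
# BACKEND is one of: pulse, oss, alsa.
# PULSE_FORWARD_PORT (assumed name) overrides the forwarded port, default 4000.
# DRY_RUN (assumed name), when non-empty, prints the command instead of running it.
run_remote() {
    backend=$1
    shift
    server="localhost:${PULSE_FORWARD_PORT:-4000}"
    case $backend in
        pulse) set -- env PULSE_SERVER="$server" "$@" ;;
        oss)   set -- env PULSE_SERVER="$server" padsp "$@" ;;
        alsa)  set -- env PULSE_SERVER="$server" ALSA_PCM_NAME=pulse "$@" ;;
        *)     echo "unknown backend: $backend" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Show what would be executed, joined by spaces.
        echo "$*"
    else
        "$@"
    fi
}
```

For example, run_remote pulse paplay soundfile.wav plays the file through the forwarded port, and setting DRY_RUN=1 first shows the exact command line without running anything.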

I wanted OSS support for running Festival, the speech synthesis program, but this method doesn’t work in the versions of everything I have: Festival segfaults if you use padsp on it. So instead there are more tricks. You can override how Festival plays sounds with some Festival commands. Here’s how to have it output with the PulseAudio sound player (and with the command to redirect the output to the SSH forwarded port):

(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Command "paplay -s localhost:4000 $FILE")
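To avoid retyping these commands in every Festival session, one option is to append them to a startup file. The sketch below assumes Festival reads a per-user file at ~/.festivalrc (worth checking against your version) and uses a hypothetical helper name, write_festival_audio_config. Note the straight quotes, which Festival's Scheme reader needs.

```shell
#!/bin/sh
# Hypothetical helper: append the audio settings to a Festival startup file.
# Usage: write_festival_audio_config [FILE] [PORT]
# FILE defaults to ~/.festivalrc (an assumption about where Festival looks).
write_festival_audio_config() {
    target=${1:-$HOME/.festivalrc}
    port=${2:-4000}
    # \$FILE stays literal so Festival, not the shell, substitutes the wav path.
    cat >>"$target" <<EOF
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Format 'wav)
(Parameter.set 'Audio_Command "paplay -s localhost:$port \$FILE")
EOF
}
```

Calling write_festival_audio_config with no arguments appends the three lines for port 4000. Because it appends, running it twice leaves duplicate lines in the file.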

I also wanted this to work with the Praat phonetics program, which currently outputs using ALSA, but it did not recognize the ALSA environment variable setting as described above, so it may not be possible to redirect its sound output this way.

Comparing stimulus bill text versions side-by-side

One of the concrete benefits of open government data is that third parties can use the data to do something useful that no one in government has the mandate, resources, or insight to do. If you think what I am about to tell you below is cool, and helpful, then you are a supporter of open government data.

On my site GovTrack, you can now find comparisons of the text of H.R. 1, the stimulus bill, at different stages in its legislative life — including the House version (as passed) and the current Senate version (amendment 570).

The main page on GovTrack for HR 1 is: here

Here’s a direct link to the comparison:

Comparisons are possible between any two versions of the bill posted by GPO. Comparisons are available for any bill.

If you find this useful, please take a moment to consider that something like this is possible only when Congress takes data openness seriously. When GPO went online and THOMAS was created in the early 90s, they chose good data formats and access policies (mostly). But the work on open government data didn’t end 15 years ago. As “what’s hot” shifts to video and Twitter, the choices made today are going to impact whether or not these sources of data empower us in the future, whether or not we miss exciting opportunities such as having tools like the one above.

(Thanks to John Wonderlich and Peggy Garvin for some side discussion about this before my post. GovTrack wasn’t initially picking up the latest Senate versions because GPO seems to have gone out of its way to accommodate posting the latest versions before they were passed by the Senate, which is great, but caught GovTrack by surprise.)

Open Government Data that Changed the World

I want to make the case that open government data has value not just for geeks, but has the power to change lives in significant ways. I spend a lot of time convincing government managers and staffers that open government data is a good thing, but sometimes we get caught up in the technical details. It’s easy to say that legislative data is an important component of maintaining an educated public, or that open and reusable bits are important for the media to be able to make compelling cases, but it’s all very abstract. So I asked my Open House Project friends: what open government data has changed the world?

Here’s what I got:

Weather data from the NOAA plays an important role in the agricultural sector (hat tip: Clay Shirky, David Weller) and, for that matter, has a lot to do with the weather reports we all use to plan our daily lives. (I tried to get some info on this from NOAA but they ignored my email, ah well.)

Information on publicly traded companies reported to the SEC plays a vital role in the public’s ability to trade fairly. The fact that the SEC continues to break ground on even more comprehensive data requirements for reporting signals that the public availability of these files is extraordinarily important. (Hat tip to Clay for the pointer, and to Carl Malamud for spearheading the effort to get these files online in the first place.) Data from other agencies like BLS and USDA affect the trading of other commodities. (Hat tip: Philip Kromer)

The Social Security Death Index has been a tool for genealogy research (hat tip: Tom Bruce).

NASA’s photos of Earth from space are part of the bedrock of inspiration of the country. Can you imagine how different the world might be if NASA kept the photos to itself? The Library of Congress publishes digital versions of historical artifacts, like the founding documents; this too is a critical part of inspiring Americans to strive for an ideal. (Hat tip: Clay.)

Geospatial data from the USGS and the Census Bureau have made mapping applications like Google Maps and in-car GPS devices like TomTom possible or at least cheaper to make. (Hat tip: Philip Kromer. Francis Irving notes that the UK is a counterexample. OK.)

Census statistics, epidemiology data, and many state-funded survey projects have played crucial roles in public health and economic research. No doubt CDC data has saved lives, though I don’t know any specifics (hat tip: many).

If you have other examples, or can help me flesh out these examples, please send something my way. To reiterate: I’m looking for open data that changed lives — please tell me what the data is and how it changed lives.