Try hacking for government transparency in GSoC

Does the thought of “hacking Congress” entice you? I don’t mean breaking into U.S. Capitol servers, of course, but putting your l33t hacking skillz to use to improve government transparency and civic engagement. The Sunlight Foundation (I have no affiliation) is a mentoring organization in Google Summer of Code 2009. Check it out.

Shameless plug:

Civic Hacking, the Semantic Web, and Visualization

Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):

Who I Am: I run GovTrack, the site which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, plus some additional related data that I find interesting.

Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that would be too expensive to build for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (my site GovTrack is one of those projects). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.

Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in congress with members of congress, what districts they represent, etc. We establish relations like sponsorship, represents, voted.
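As a sketch of what that web of knowledge looks like on disk, here are a few triples in Turtle notation. The URIs and predicate names here are invented for illustration; they don’t match my actual GovTrack vocabulary:

```turtle
@prefix ex: <http://example.org/congress/> .

# A bill is sponsored by a person, who represents a district
# and cast votes — relations, not rows in one table.
ex:bill_hr1424  ex:sponsoredBy  ex:person_412 .
ex:person_412   ex:represents   ex:district_NY2 .
ex:person_412   ex:votedOn      ex:vote_2008_681 .
```

Each line is one subject–predicate–object statement, and statements about the same entity can come from entirely different data sets.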

Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)

Then there are some slides on URIs and RDF.

My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.

Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.

Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements pointing to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI I created for a bill in Congress. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query, as I explained in the Census case study on my site.
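Such a <link/> element looks something like this (the href is a made-up example, not one of my real URIs):

```html
<link rel="alternate" type="application/rdf+xml"
      title="RDF data about this bill"
      href="http://example.org/data/bills/hr1424.rdf" />
```

A semantic-web-aware client fetching the page can follow that link to get machine-readable statements about the page’s primary topic.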

SPARQL: The query language, the SQL equivalent, for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.
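To give a flavor of the syntax, here is a minimal SPARQL query. As above, the prefix and predicate names are invented for illustration, not my real schema:

```sparql
PREFIX ex: <http://example.org/congress/>

SELECT ?bill ?sponsor
WHERE {
  ?bill    ex:sponsoredBy ?sponsor .
  ?sponsor ex:represents  ?district .
}
LIMIT 10
```

The WHERE clause is a graph pattern: the variables get bound to every combination of nodes in the data that matches the pattern.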

Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 bailout bill? I showed the partial RDF graph related to this question and how the graph relates to the SPARQL query. First I showed an example SPARQL query, then the real one. The real one is complicated not because RDF or SPARQL is complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.

My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, GGobi, and mapping tools should go from SPARQL query to visualization in one step.

Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.

Other notes: My SPARQL server uses my own .NET/C# RDF library, SemWeb. That provides a “triple store”, the equivalent of an RDBMS for the semantic web. Under the hood, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R Server for getting existing data online.
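A minimal version of that underlying table might look like the following. This is a sketch of the idea only; the actual SemWeb schema surely differs in details like column types, node interning, and indexes:

```sql
-- One row per RDF statement: a plain table of triples.
CREATE TABLE triples (
    subject   VARCHAR(255) NOT NULL,
    predicate VARCHAR(255) NOT NULL,
    object    TEXT         NOT NULL,
    INDEX (subject),
    INDEX (predicate)
);
```

Answering a SPARQL graph pattern then amounts to self-joining this table once per triple pattern in the query.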

oGosh! IRC Meeting Aug 16 4pm EDT

Join me at an IRC chat to talk about open source civic technology projects, on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix between seeing what various civic technology projects are up to like GovTrack (my site, powered by Mono), OpenCongress, and any others run by people who show up, and getting new people involved in ongoing projects. “oGosh” is Open Government Open Source Hacking (wiki | Facebook), what I’m calling the loose community that binds these projects together.

The chat will be in the #transparency channel on Freenode. For more information on the meeting (and on how to get to the chat), see the oGosh wiki linked above.

Suggestions for agenda topics are most welcome either to me directly or by revising the wiki page above. Hope to see you there.

Embedding a JavaScript interpreter with Mono

With the semester coming to a close and free time becoming scarce, new forms of pseudo-work procrastination were bound to pop up — the kind of procrastination where you do something constructive, just not what you’re supposed to be doing. Getting to the point, this semester I’ve been using a program called Praat, which is a tool phoneticians use to analyze sound files. Praat is good for its ability to run macro-like scripts that can automate tasks, but the programming language used in Praat was custom-made for Praat and defies normal conventions in so many ways it’s fairly frustrating to use if you’re already familiar with programming.

So I decided I would try to replace Praat’s scripting language with JavaScript, a language most programmers are either familiar with already or can pick up quickly because of its similarity to all C-like languages. A difficult task? It turns out this was one of the most rapid programming experiences I’ve ever had. I was able to do it in about three hours and less than 200 lines of code, thanks to all of the work that has gone into four open-source projects, Rhino, GNU Classpath, IKVM, and Mono.

Praat is written in C, and so the first requirement was that I find a scripting language runtime that I can embed in C. I didn’t know of any, but the one runtime that I knew was easily embeddable was Rhino, a JavaScript runtime written in Java, created by the Mozilla team. It’s possible to link native programs to Java programs, but as far as I know, it’s not particularly pretty. What is pretty is embedding the Mono runtime in native applications, so I went that route.

Step 1 was converting Rhino to .NET using IKVM, the Java-bytecode-to-.NET-CIL converter. (Brief aside: One of my first C# projects was attempting to do exactly that, before IKVM existed. I didn’t get very far and definitely couldn’t have made anything as comprehensive as Jeroen has done with IKVM.) IKVM relies on GNU Classpath, so hats off to them too. The conversion goes like this:

ikvmc js.jar

The program creates js.dll, the equivalent .NET assembly. Easy enough.

The next step was to create a simple wrapper around the Rhino interpreter. I just adapted a sample from the Rhino docs. I also added a way for the script to call back into Praat’s C routines, using special DllImport’s to call exported functions in the main executable (rather than an external library):

[DllImport ("__Internal", EntryPoint="mono_embed_echo")]
public static extern void Echo (string message);

Okay, now how to get this integrated with Praat? In the end, the Mono runtime will get embedded in the Praat executable. The simple way to do that is to follow Mono’s embedding guidelines. The slight drawback with that method for my purposes is that the resulting program will have an external dependency on the .NET assemblies I need to run the scripts (my wrapper, Rhino as js.dll, and supporting IKVM assemblies). Normally I wouldn’t mind, but Praat compiles to a single standalone file, and to keep deployment just as simple I wanted to keep it that way.

Mono has a tool called mkbundle that, wonderfully enough, can take a bunch of assemblies and the runtime itself and bundle them into a single native library. mkbundle was envisioned to create a single executable file out of a Mono application to make deployment easy. That’s just what I needed, except I didn’t want mkbundle to generate a native program, but a native library that I could embed in Praat. The output of mkbundle was easy to alter by hand to get that, but I’ve since patched mkbundle in Mono’s SVN repository to make it do this on its own. The bundling step goes like this:

mkbundle -c --nomain -z -o mono_embed_host.c -oo mono_embed_libs.o \
PraatMonoEmbed.dll js.dll IKVM.GNU.Classpath.dll IKVM.Runtime.dll

This creates two files, mono_embed_host.c and mono_embed_libs.o. The .o file is a native file that has the four assembly files packaged within it. The assemblies aren’t compiled to native code; they’re just stored as data within the library. It could also have the Mono runtime itself statically linked in, so that the final product will have no external dependencies at all, although I haven’t tried that yet, so right now the final product will need Mono to be installed. The .c file is a wrapper to set up the embedded Mono runtime to use the bundled assemblies. We’ll link these files with Praat.

We’re almost there. Next, we need a C routine to pass a script to the C# wrapper around Mono. This routine will initialize the Mono runtime and call a C# method passing the script as an argument. The Mono embedding guide explains how to do that. It basically goes like this:

void run_script (char *code)
{
    MonoDomain *domain = mono_jit_init ("main-domain");
    MonoAssembly *assembly = mono_domain_assembly_open (domain, "PraatMonoEmbed.dll");
    MonoMethodDesc *driver_desc = mono_method_desc_new ("PraatMonoEmbed:RunScript(string)", 0);
    MonoMethod *driver = mono_method_desc_search_in_image (driver_desc, mono_assembly_get_image (assembly));
    void *params[] = { mono_string_new (domain, code) };
    MonoObject *ret = mono_runtime_invoke (driver, NULL, params, NULL);
}

This file is compiled with:

gcc -c mono_embed.c -o mono_embed.o `pkg-config --cflags mono`

Lastly, Praat’s build script gets modified to link in all of the above files and link to the Mono embedding library (libmono). And that’s pretty much it. Praat now has a JavaScript interpreter built-in.

C# 3.0 – That’s Hot

When generics and yield were introduced in the C# 2.0 spec, I was quite impressed, but not as impressed as I am by the C# 3.0 spec (pointed out on the Nemerle blog via Monologue). I can’t wait to use these features.

Here’s an overview of what’s getting a lot better.

In C# today, I find myself using this pattern over and over again:
SomeType[] elements = (SomeType[])elemArrayList.ToArray(typeof(SomeType));
With generics (once the list is a List<SomeType>), I haven’t tried this, but I think at least we get to simplify it as:
SomeType[] elements = elemArrayList.ToArray();
That’s not bad. One repetition gone. With C# 3.0, we get to eliminate all repetition with implicitly typed local variables:
var elements = elemArrayList.ToArray();

I have wanted extension methods for so long. Extension methods let you add instance methods to types that you don’t have control over. That is, let’s say you want to implement the Merge Sort algorithm over ArrayLists. Today, you’d have to write it as a static method in a separate class and call it as:
SortHelpers.MergeSort(list);
(or whatever the helper class is called). Not so bad. In C# 3.0, you declare MergeSort as an extension method by adding the this keyword to the first parameter:
static void MergeSort(this ArrayList list) { … }
That makes it available to be called simply as:
list.MergeSort();

In C# 2.0 we got anonymous methods/delegates. I haven’t used them, but apparently you can simplify:
void Print(object x) { Console.WriteLine(x); }
MyFunc(mylist, new MyFuncHandler(Print));
down to:
MyFunc(mylist, delegate(object x) { Console.WriteLine(x); });
Or something like that. Still a bit too verbose for me. With C# 3.0, we get lambda expressions, which are tightened-up anonymous methods.
MyFunc(mylist, (x) => { Console.WriteLine(x); } );
Much nicer.

Next, we get object initializers, which are interesting… It’s a bit like the With statement in Visual Basic. If you have a regular Point class with properties X and Y, you now get to simplify creating a point from:
new Point(10, 20)
to:
new Point { X = 10, Y = 20 }

Object initializers are probably just there to support another more interesting (not really) new feature: anonymous types. The scenario where I think these could be most helpful is when you have a function that returns two or more things. An example off the top of my head is, say, a function that returns two factors of a number, e.g. for 10 it returns 2 and 5 (2*5=10), for 21 it returns 3 and 7 (3*7=21). Currently, we have two options. The first is to use out parameters:
void GetFactors(double val, out double firstFactor, out double secondFactor) { … }
The second option is to create a new data type for the return value:
struct GetFactorsReturn { public double FirstFactor, SecondFactor; }

GetFactorsReturn GetFactors(double val) {
    GetFactorsReturn retval = new GetFactorsReturn();
    retval.FirstFactor = …;
    retval.SecondFactor = …;
    return retval;
}
In C# 3, we can simplify this second option by not having to define the new data structure explicitly. Forget the struct GetFactorsReturn definition, and the return from GetFactors looks like:
return new { FirstFactor = …, SecondFactor = … };
There’s a problem, though. The name of this anonymous type is generated by the compiler and is not accessible from code, so we can’t declare methods that use this type as a parameter or return value. So our GetFactors function could only be declared as:
object GetFactors(double val) { … }
Also note that anonymous types are always reference types inheriting from Object. If we could reference these types in method signatures somehow, and if we could choose whether they should be value or reference types, this would be much more helpful.

We get some more sugar in the form of collection initializers and implicitly typed arrays:
var myList = new List<int> { 0, 1, 2, 3 };
var myArray = new[] { 0, 1, 2, 3 };

Lastly there are query expressions, which look like a way to use an SQL-like syntax to filter collections of objects. I’m not so sure this is really needed in the language, but it’s a whole new thing that needs to be looked at more closely.
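For example, filtering the myList collection above with a query expression would look roughly like this (this is my reading of the spec, not something I’ve actually compiled):

```csharp
var evens = from n in myList
            where n % 2 == 0
            select n * 10;
```

The compiler translates this into ordinary method calls (Where, Select, and friends) taking lambda expressions, so it’s sugar over the other new features rather than a separate mechanism.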

SemWeb RDF Library for C#

I’ve just posted the first release of my SemWeb library, written in C# for Mono and .NET. Some highlights:
* Simple API; easy to deploy; no platform-specific dependencies.
* Reading and writing RDF/XML, Turtle, NTriples, and most of Notation 3, at around 20,000 statements/second.
* All operations are streaming, so it should scale.
* Two built-in types of RDF stores: an in-memory hashtable-indexed store for small amounts of data and an SQL store with MySQL and SQLite backends for large amounts of data.
* Creating new SQL-based stores takes minutes, and implementing other types of stores is as simple as extending an abstract class.
* Statements are quads, rather than triples, with the fourth ‘meta’ field left for application-specific uses.
I’ve been using SemWeb to push around the 7 million triples created by
GovTrack (shameless plug).

Directory of C# Libraries

I had hoped to time the release of SemWeb with the debut of a new website I’m working on that will be a really great directory of Mono/.NET reusable, open source libraries. But, I need to get some other people’s libraries listed, besides my own, before finishing the site. If you’ve written a C# library that you think others would find useful, please let me know.

Real ID Act

Following up on Miguel’s post about the Real ID Act, which supposedly is passing through Congress via a clever hack: in fact, the Real ID Act is an act in its own right, not merely a hidden provision of a giant spending bill (although it does now seem to be attached to another bill).

It also hasn’t exactly been slipping through Congress unnoticed. On Feb. 10, the House voted on the bill. It passed 261/161, with 96% of Republicans in favor and 78% of Democrats against. It was also discussed in the House on at least seven occasions.

Now, while it may be true that senators won’t get a chance to vote on the bill separately (and, yes, thanks to the Republican leadership), in all likelihood it wouldn’t make a difference anyway.

What I find most interesting about some of this Real ID debate is that no one is linking anyone to the text of the legislation itself (see the first link), as if everyone wants to be able to make wild claims about the bill that support their side without any factual evidence. In fact there’s no need to rely on their spin. Read the act and form your own opinion.

Here’s an excerpt from the official *summary* of the bill:

Title II – Improved Security for Driver’s Licenses and Personal Identification Cards

Section 202 – Prohibits Federal agencies from accepting State issued driver’s licenses or identification cards unless such documents are determined by the Secretary to meet minimum security requirements, including the incorporation of specified data, a common machine-readable technology, and certain anti-fraud security features.

Sets forth minimum issuance standards for such documents that require: (1) verification of presented information; (2) evidence that the applicant is lawfully present in the United States; and (3) issuance of temporary driver’s licenses or identification cards to persons temporarily present that are valid only for their period of authorized stay (or for one year where the period of stay is indefinite).

Section 203 – Requires States, as a condition of receiving grant funds or other financial assistance under this title, to participate in the interstate compact regarding the sharing of driver’s license data (the Driver License Agreement).

Section 204 – Amends the Federal criminal code to prohibit trafficking in actual as well as false authentication features for use in false identification documents, document-making implements, or means of identification.

Requires the Secretary to enter into the appropriate aviation security screening database information regarding persons convicted of using false driver’s licenses at airports.

Section 205 – Authorizes the Secretary to make grants to assist States in conforming to the minimum standards set forth in this title.

Section 206 – Gives the Secretary all authority to issue regulations, set standards, and issue grants under this title. Gives the Secretary of Transportation all authority to certify compliance with such standards.
Authorizes the Secretary to grant States an extension of time to meet the minimum document requirements and issuance standards of this title, with adequate justification.

And, by the way, next time Miguel mentions Noam Chomsky’s political views, I’m going to follow up with a rant about his linguistic views. 🙂

mod_mono Control Panel

The latest release of Mono adds a little control panel to mod_mono, which I cooked up a while back. (Btw, thanks to Gonzalo for fixing it up and getting it in svn.) The control panel only lets you do one thing right now, which is restarting any mod-mono-server processes that it started. I use this whenever I update the code for GovTrack and need to have Mono reload the DLLs.

To activate the control panel at a URL, add something along these lines to your httpd.conf:

  <Location /mono>
    SetHandler mono-ctrl
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1  [or your IP address]
  </Location>

The Order/Deny/Allow directives make sure that you’re the only one who can restart the server. You can also put these directives in a .htaccess file in the directory where you want the control panel to show up, but omit the Location tags. (Restart Apache after you put this in httpd.conf.)

Now you can visit http://yourservername/mono and you’ll see the control panel, with a link to restart the mod-mono-servers. Clicking the link immediately restarts the servers. (Not Apache, just the mod-mono-servers.)

For those running virtual hosts: If you’ve placed your Mono* directives for mod_mono all in VirtualHost sections, then you’ve already got separate mod-mono-server instances for each virtual host. The control panel will only see the mod-mono-server(s) for the virtual host that’s serving the page, so you could put different access controls on the control panel for different vhosts to allow different people to restart only the mod-mono-servers that they should be allowed to restart.

Programming Language Syntax

If you know a little bit about how compilers work, you know that the syntax of programming languages is context-free. That is to say that each syntactic element of the language can be described as a list of sub-elements, regardless of what context it appears in. For example, a while-loop in C# is (roughly) the keyword ‘while’, followed by an expression, followed by a statement (or block of statements), and it doesn’t matter where the loop appears, that syntactic definition is always the same. This is basically the idea of a context-free grammar (CFG).
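In grammar-textbook notation, that while-loop rule could be written as a CFG production along these lines (simplified; the real C# grammar breaks statement and expression into many more cases):

```
while-statement → "while" "(" expression ")" statement
statement       → while-statement | block | expression ";" | …
```

The point is that the left-hand side expands the same way no matter where in the program the rule applies.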

Natural languages (i.e. human languages) are not context-free. It’s impossible to come up with a (concise) list of CFG rules, such as “a sentence is a noun phrase, followed by a verb, followed by a noun phrase; and a noun phrase is an article (a ‘determiner’ in the biz) followed by a noun” to describe English, for instance. That will work for simple sentences like “a man walked a dog”, but not for sentences like “which dog do you think a man walked?”

Now, this raises the question: why don’t we program in languages that closely resemble natural languages in their syntactic structure? Wouldn’t that make it easier to program? There’s a good reason why we don’t, actually: no one knows what the syntax of natural languages looks like. Try as we might, natural languages are still beyond our understanding.

The reason I’m writing this is that I just came back from a symposium in honor of one of my professors (I study the syntax of natural language, by the way) who invented Tree Adjoining Grammar. TAG is a type of syntactic formalism that can actually be used to describe English fairly well — in the way that CFGs don’t even come close. At a very high level, TAG adds to CFG the ability to splice together two units of structure. I was wondering whether a TAG-based programming language syntax would let us program with new types of syntactic sugar, although I think the answer is that nothing interesting would come out of it.

A Programming Project

A Programming Task for Someone Looking to Hack

The biggest thing that has helped me to program better is little programming projects. My first was a simple math tutoring program in GW-BASIC, written with the help of my dad back around third grade. I’ve almost always had a little project to keep me busy since then. Today, it’s creating an RDF library in C#.

I know that often people are looking for ideas for programs to write, so I thought I’d post a routine that someone might want to spend some time hacking. This is a mildly advanced routine, but anyway:

The goal is to parse an RDF/XML document using only XmlReader. That is, extract the RDF statements without loading the entire document into memory as an XmlDocument. As far as I know, this has never been programmed in C#, and it is really critical if semantic web applications are going to be built in .NET.

Getting the basics going isn’t too difficult a task. Getting the entire spec implemented is more of a challenge. But what’s life without challenges, eh? If you’re interested in taking a stab at this, drop me an email.

A Design Suggestion

When I was riding the train back from D.C. to Philly last week, the speaker in the car I was in wasn’t working, so no one could hear the conductor’s announcements. Probably no Amtrak person noticed the problem.

It made me think that we often build things that don’t notice when they’re not working. Speakers should be built with microphones that realize when the speaker isn’t emitting the sound it should be and, when that happens, send a signal back to… somewhere. Software should do the same thing. Applications should realize when things aren’t working right and, more importantly, send back a useful message that a problem occurred.

Here’s a for instance. I plugged in a printer to my Linux desktop this week, but I couldn’t print a test page. The only message I got back was that I should increase the debugging level and inspect the output. Well, this is not a useful signal. Even with debugging on, the message I got was that the driver couldn’t be loaded. Pretty vague. It turned out the driver wasn’t even present on my system because I didn’t have the RPM installed. This is a condition that the printing system should have been able to detect and inform me of.

The failure here is that there was no mechanism built into the system for passing useful error messages back to the user. If there was a useful message at some point, it was discarded before it reached me. Don’t write software like this.