Navigating legislation (after the fact, of course)

In May, Congress passed the 2008 Farm Bill, which regulates various food, nutrition, and apparently biofuel issues. Tufts food policy professor Parke Wilde writes on his blog today:

The 629-page text (.pdf) of the 2008 Farm Bill is so complex and unreadable that the U.S. food policy community has been on the edge of our seats waiting for the USDA/ERS side-by-side comparison unveiled today.

The ERS side-by-side tool compares the new Farm Bill with current law, title by title, so we can finally begin to understand what the law really means.

ERS is the USDA's Economic Research Service. Their side-by-side webpage, which I think was just published this week, shows the provisions of the previous and current bills side by side. (It's not a comparison of the bill text, but of summaries of the provisions.)

This is interesting on a number of counts. First, the fact that it is the USDA making this comparison suggests that everyone agrees the bill itself is effectively incomprehensible, even to professionals and scholars, on account of its size, and that summarizing it is costly enough that only the government would do it; the comparison took three months to prepare.

Second, if this is what was needed to understand the Farm Bill, was it passed without anyone understanding it?

Third, this comparison was made by and for professionals and scholars, not by tech geeks. Why aren't we talking to them?

The ERS tool comes complete with a seemingly unintentionally hilarious intro video — overly dramatic with background music fit for the Miss Universe competition. (Wilde likened it to “a documentary by Kenneth Burns or an account of a manned mission to the moon”.)

The Semantic Web’s Role in Dealing with Disasters

My Census RDF dataset is being used in a public health project:

On SemanticWeb.com: http://www.semanticweb.com/article.php/3764266

The Semantic Web’s Role in Dealing with Disasters
August 8, 2008
By Jennifer Zaino

The University of Southern California Information Sciences Institute and Childrens Hospital Los Angeles have been working together to build a software tool. Dubbed PEDSS (Pediatric Emergency Decision Support System), the tool is designed to help medical service providers more effectively plan for, train for, and respond to serious incidents and disasters affecting children.

The project, a part of the Pediatric Disaster Resource and Training Center (PDRTC), has been going on for about eight months.

Dr. Tatyana Ryutov, a research scientist at the USC Information Sciences Institute, is working on the system. Recently, the Institute contacted Joshua Tauberer, the creator of GovTrack.us and the man who maintains a large RDF (Resource Description Framework) data set of U.S. Census data, about making SPARQL queries to that data in conjunction with the PEDSS.

“Currently, demographic data (number of children in four age groups) is entered manually. We want the tool to calculate this information automatically based on a zip-code. Therefore, we extend the tool to query the RDF census data server to get this information,” Ryutov writes. Currently this is the only server the software queries, but Ryutov says they plan to add calls to other census data servers to improve reliability. Those servers do not have to be RDF databases.

(and it continues)
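
The lookup Ryutov describes maps naturally onto a single query over the standard SPARQL protocol (an HTTP GET with a query parameter, returning JSON results). Here is a minimal sketch of that kind of call; the endpoint URL and the predicate names are hypothetical placeholders, not the Census dataset's actual vocabulary.

    import json
    import urllib.parse
    import urllib.request

    # Minimal sketch of a SPARQL protocol request for children-by-age-group
    # counts for one zip code. The endpoint URL and the ex: vocabulary are
    # placeholders, not the Census dataset's real terms.
    ENDPOINT = "http://example.org/sparql"

    query = """
    PREFIX ex: <http://example.org/census/>
    SELECT ?ageGroup ?count WHERE {
        ?area ex:zipCode "90027" .         # the zip code entered in the tool
        ?area ex:childPopulation ?bucket .
        ?bucket ex:ageGroup ?ageGroup ;
                ex:count ?count .
    }
    """

    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)

    # Standard SPARQL JSON results layout: rows under results.bindings.
    for row in results["results"]["bindings"]:
        print(row["ageGroup"]["value"], row["count"]["value"])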

Berlin SPARQL Benchmarks for my SemWeb .NET Library

Chris Bizer and team have posted a benchmark specification for SPARQL endpoints, the Berlin SPARQL Benchmark (BSBM). They have "run the initial version of the benchmark against Sesame, Virtuoso, Jena SDB and against D2R Server, a relational database-to-RDF wrapper. The stores were benchmarked with datasets ranging from 50,000 triples to 100,000,000 triples" (announcement email).

I ran the benchmark against my SemWeb .NET library. Instructions for setting up the benchmark are here, and they turned out to be a good example of how to very quickly set up a SPARQL endpoint using my library, backed by your SQL database of choice (in this case MySQL). I had some trouble the first time I ran the benchmark, though:

  • The first time I ran the tests I found the library had several bugs/limitations: a bug preventing ORDER BY with dateTime values, an error parsing function calls in FILTER expressions, and a glitch in the translation of the query to SQL. I corrected these problems.
  • Query 10 must be modified to change the ordering to ORDER BY xsd:double(str(?price)), adding the xsd:double(str(…)) cast, since ordering by the benchmark's custom USD datatype is not supported by my library and is not required to be supported by the SPARQL specification. (A small illustration of why the cast matters follows this list.)
  • In the same query, in FILTER (?date > "2008-06-20"^^<http://www.w3.org/2001/XMLSchema#date>), xsd:date comparisons are not part of the SPARQL spec (as I understand it; dateTime comparisons, on the other hand, are required by the spec). Such comparisons weren't implemented in my library, but I went ahead and added support for them.
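
The Query 10 ordering fix is easier to see with a tiny, generic illustration (plain Python, nothing specific to the benchmark or my library): sorting price strings without a numeric cast gives the wrong order, which is exactly what the xsd:double(str(…)) cast avoids.

    # Why a numeric cast matters for ORDER BY on prices: comparing the
    # lexical form of the literals puts "100.00" before "9.50".
    prices = ["10.25", "9.50", "100.00"]

    print(sorted(prices))             # lexical order: ['10.25', '100.00', '9.50']
    print(sorted(prices, key=float))  # numeric order: ['9.50', '10.25', '100.00']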

I also have some concerns. First, I am not 100% sure the results from my library are actually correct. Query 4 seemed to always return no results. Second, queries are largely translated into SQL, and there is a good deal of caching going on at the level of MySQL. The benchmark results therefore say a lot about best-case run time, and indicate something about the overhead of SPARQL processing, but may not reflect performance in general use.
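
To see how much caching is flattering the numbers, one simple check is to time the same query against the endpoint cold and then warm. A rough sketch, assuming a local SPARQL endpoint at a placeholder URL (adjust to wherever your setup exposes it):

    import time
    import urllib.parse
    import urllib.request

    # Rough sketch: time one query twice against a local SPARQL endpoint to
    # see how much repeat-execution caching (e.g. MySQL's caches) helps.
    # The endpoint URL and the query are placeholders.
    ENDPOINT = "http://localhost:8888/sparql"

    def run(query):
        url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
        start = time.time()
        urllib.request.urlopen(url).read()
        return time.time() - start

    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100"

    cold = run(query)  # first execution: caches are empty
    warm = run(query)  # second execution: likely served largely from caches
    print("cold: %.3fs, warm: %.3fs" % (cold, warm))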

Benchmark results reported below are for my desktop: Intel Core2 Duo at 3.00GHz, 2 GB RAM, 32-bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly comparable to the machine used in the BSBM.

Load time (in seconds and triples/sec) is reported below for some of the different data set sizes.

Dataset size     50K    250K      1M      5M      25M
Time (sec)         -       -     224       -    16129
triples/sec        -       -    4441       -     1544

For comparison, load time for the 1M data set was 224 seconds. This is roughly 2 to 2.5 times worse than Jena SDB (Hash) with MySQL over Joseki3 (117s) and Virtuoso Open-Source Edition v5.0.6 and v5.0.7 (87s), as reported in the BSBM results. For the larger 25M dataset, the load time of about 4.5 hours was only 1.2 times slower than Jena SDB but 1.7 times faster than Sesame over Tomcat and 3 times faster than Virtuoso. (But, again, the machines were different.)
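
Those ratios follow directly from the numbers above; a quick arithmetic check (using only the load times quoted in this post, so the 25M comparisons against the other stores are not repeated here):

    # Quick arithmetic check of the load-time comparisons above, using only
    # the numbers quoted in this post.
    my_1m = 224        # my load time for the 1M dataset, in seconds
    jena_1m = 117      # Jena SDB (Hash) with MySQL over Joseki3, per BSBM
    virtuoso_1m = 87   # Virtuoso Open-Source Edition v5.0.6/v5.0.7, per BSBM

    print(my_1m / jena_1m)      # 224/117 = ~1.9x slower than Jena SDB
    print(my_1m / virtuoso_1m)  # 224/87  = ~2.6x slower than Virtuoso

    my_25m = 16129     # my load time for the 25M dataset, in seconds
    print(my_25m / 3600.0)      # ~4.5 hours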

Results for query execution are reported below: AQET (Average Query Execution Time, in seconds) for each of the queries at different data set sizes. The results were again roughly comparable to Jena and Virtuoso. But the three caveats above are worth restating: the query results have not been validated as correct, there is significant caching, and the machine was different from the one used in BSBM.

Dataset size     50K    250K          1M      5M         25M
Query 1            -       -    0.019184       -    0.049200
Query 2            -       -    0.051187       -    0.048590
Query 3            -       -    0.030508       -    0.079187
Query 4            -       -    0.032693       -    0.075603
Query 5            -       -    0.172283       -    0.342828
Query 6            -       -    0.102105       -    3.277656
Query 7            -       -    0.256491       -    1.108414
Query 8            -       -    0.175357       -    0.572258
Query 9            -       -    0.059674       -    0.088451
Query 10           -       -    0.089215       -    0.322246

oGosh! IRC Meeting Aug 16 4pm EDT

Join me at an IRC chat to talk about open source civic technology projects, on Saturday, August 16 at 4pm Eastern time! The agenda will be a mix of catching up on what various civic technology projects are up to, including GovTrack (my site, powered by Mono), OpenCongress, and any others run by people who show up, and getting new people involved in ongoing projects. "oGosh" is Open Government Open Source Hacking (wiki | Facebook), what I'm calling the loose community that binds these projects together.

The chat will be in the #transparency channel on Freenode. For more information on the meeting (and on how to get to the chat), see http://wiki.opengovdata.org/index.php/OGosh.

Suggestions for agenda topics are most welcome either to me directly or by revising the wiki page above. Hope to see you there.