SPARQL OLE DB Provider

Andy Gueritz announced on the mail list for my SemWeb RDF library for .NET that he has created an OLE DB provider for a SPARQL endpoint that is usable in Microsoft Excel. He wrote,

In a moment of insanity (but a great learning experience), I gave myself the challenge of writing an OLE DB provider for SPARQL. It is built on top of the SemWeb library, which has saved a substantial amount of effort and also brings some powerful functionality to the table very quickly (Thanks, Joshua!)

The provider as constructed implements a read-only OLE DB provider that supports all four SPARQL query types and interfaces to SemWeb through a COM-Callable Wrapper. It is not extensively tested yet but seems to work with most of the queries I have now put through it, and of course, being built on SemWeb, it is able to read both local and remote SPARQL sources.

Moral of the story: populate Excel tables with SPARQL queries.
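
For instance, a query along these lines could fill a worksheet with one row per result binding. This is only a sketch: the ex: vocabulary below is made up for illustration and is not anything from Andy’s provider or my own data set.

    PREFIX ex: <http://example.org/schema#>

    # Each result binding becomes one spreadsheet row.
    SELECT ?bill ?title ?introduced
    WHERE {
      ?bill a ex:Bill ;
            ex:title      ?title ;
            ex:introduced ?introduced .
    }
    ORDER BY DESC(?introduced)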

More here.

Civic Hacking, the Semantic Web, and Visualization

Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):

Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.

Data Isolation: What the semantic web addresses is data isolation. Take the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation: that is essentially a mash-up that would be too expensive to build for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. MAPLight is only possible because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This matters for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and the economy). And it’s important for the data to be mashable by the public, because uses of the data go beyond the resources, mission, and mandate of government agencies.

Beyond Metadata: We can think of the semantic web as going beyond metadata, if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, the districts they represent, and so on. We establish relations like sponsorship, represents, and voted.
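
As a sketch (with made-up ex: predicates, not my actual vocabulary), those relations become triple patterns you can chain together, walking from a bill to its sponsor to the district that sponsor represents:

    PREFIX ex: <http://example.org/schema#>

    SELECT ?bill ?sponsor ?district
    WHERE {
      ?bill    a ex:Bill ;
               ex:sponsor ?sponsor .        # sponsorship
      ?sponsor ex:represents ?district .    # represents
    }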

Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)

Then there are some slides on URIs and RDF.

My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack: 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), and so on. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to repeat the work that MAPLight did each time we want a new correlation.

Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.

Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements pointing to RDF URIs for the primary topic of each page. Such a URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from the RDF at a URI I created for a bill in Congress. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query, as I explained in the Census case study on my site rdfabout.com.
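
The query behind those pages is about as simple as SPARQL gets. A sketch, with a placeholder URI standing in for the bill’s real identifier:

    # DESCRIBE asks the endpoint for everything it knows about the resource.
    DESCRIBE <http://example.org/bills/hr1424>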

SPARQL: The query language, the SQL, of the semantic web. It is similar to SQL in its metaphors and in keywords like SELECT, FROM, and WHERE; it differs in just about every other way. Interestingly, there is also a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.
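
A minimal sketch of what that looks like (the ex: vocabulary and graph name are made up for illustration): the keywords read like SQL, but the WHERE clause matches patterns in a graph rather than rows in tables.

    PREFIX ex: <http://example.org/schema#>

    SELECT ?name ?party
    FROM <http://example.org/data/congress>    # optional: names the graph to query
    WHERE {
      ?person a ex:Senator ;
              ex:name  ?name ;
              ex:party ?party .
    }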

Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 financial rescue (bailout) bill? I showed the partial RDF graph related to this question and how the graph relates to the SPARQL query: first a simplified example query, then the real one. The real one is complicated not because RDF or SPARQL is complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.
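
Here is a simplified sketch of the shape of that query, not the real one from the slides: all of the ex: predicates are made up, and my real vocabulary is much more fine-grained, which is exactly why the real query is longer.

    PREFIX ex: <http://example.org/schema#>

    # For each senator who voted on the bill: how they voted and the
    # median household income of the state they represent.
    SELECT ?senator ?option ?medianIncome
    WHERE {
      ?vote    ex:onBill  ex:hr1424 ;
               ex:voter   ?senator ;
               ex:option  ?option .      # e.g. "Aye" or "Nay"
      ?senator ex:represents ?state .
      ?state   ex:medianHouseholdIncome ?medianIncome .
    }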

My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, GGobi, and mapping tools should go from SPARQL query to visualization in one step.

Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit an arbitrary SPARQL query and it creates an interactive map showing the information.
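
Again, this is only a sketch with made-up ex: predicates rather than the actual query I issue; in this version the per-ZIP totals would be summed up in the mapping step.

    PREFIX ex: <http://example.org/schema#>

    # Individual contributions to one candidate, with the contributor's ZIP code.
    SELECT ?zip ?amount
    WHERE {
      ?contribution ex:recipient       ex:steveIsrael ;
                    ex:amount          ?amount ;
                    ex:contributorZip  ?zip .
    }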

Other notes: My SPARQL server uses my own .NET/C# RDF library. That provides a “triple store”, the equivalent of an RDBMS for the semantic web. Underlyingly, though, it stores the triples in a MySQL database in a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R Server, for getting existing relational databases online as RDF.

More Data that Changed the World

Continuing from my last post on this subject, I found some more examples of influential data sets from a page on FlowingData.com. I’m expanding beyond government data in this post.

“Baseball Statistics: In 2003, Michael M. Lewis’ book, Moneyball: The Art of Winning an Unfair Game, was released. As a result, the way baseball teams were built changed completely. Before Moneyball, teams relied on insider information and the choice of players was highly subjective. However, in 2002, a year before the book was published, the Oakland A’s had $41 million in salary and had to figure out how to compete against teams like the New York Yankees and the Boston Red Sox who spent over $100 million in salaries.”

“Megan’s Law: Since 1994, those who have been convicted of sex crimes against children have been required to register with local law enforcement. That data is made public so that people know about sex offenders in their area. Mash that data with Google Maps. Lo and behold, parents became instantly aware of caution areas and some might never look at their neighbor the same way ever again, while sex offenders start declaring themselves homeless.”

Open Government Data that Changed the World

I want to make the case that open government data has value not just for geeks, but has the power to change lives in significant ways. I spend a lot of time convincing government managers and staffers that open government data is a good thing, but sometimes we get caught up in the technical details. It’s easy to say that legislative data is an important component of maintaining an educated public, or that open and reusable bits are important for the media to be able to make compelling cases, but it’s all very abstract. So I asked my Open House Project friends: what open government data has changed the world?

Here’s what I got:

Weather data from the NOAA plays an important role in the agricultural sector (hat tip: Clay Shirky, David Weller) and, for that matter, has a lot to do with the weather reports we all use to plan our daily lives. (I tried to get some info on this from NOAA but they ignored my email, ah well.)

Information on publicly traded companies reported to the SEC plays a vital role in the public’s ability to trade fairly. The fact that the SEC continues to break ground on even more comprehensive data requirements for reporting signals that the public availability of these files is extraordinarily important. (Hat tip to Clay for the pointer, and to Carl Malamud for spearheading getting these files online in the first place.) Data from other agencies like BLS and USDA affect the trading of other commodities. (Hat tip: Philip Kromer)

The social security death index has been a tool for genealogy research (hat tip: Tom Bruce).

NASA’s photos of Earth from space are part of the bedrock of inspiration for the country. Can you imagine how different the world might be if NASA had kept the photos to itself? The Library of Congress publishes digital versions of historical artifacts, like the founding documents; this too is a critical part of inspiring Americans to strive for an ideal. (Hat tip: Clay.)

Geospatial data from the USGS and the Census Bureau have made mapping applications like Google Maps and in-car GPS devices like TomTom possible, or at least cheaper to make. (Hat tip: Philip Kromer. Francis Irving notes that the UK is a counterexample. OK.)

Census statistics, epidemiology data, and many state-funded survey projects have played crucial roles in public health and economic research. No doubt CDC data has saved lives, though I don’t know any specifics (hat tip: many).

If you have other examples, or can help me flesh out these examples, please send something my way. To reiterate: I’m looking for open data that changed lives — please tell me what the data is and how it changed lives.

The Semantic Web’s Role in Dealing with Disasters

My Census RDF dataset is being used in a public health project:

On SemanticWeb.com: http://www.semanticweb.com/article.php/3764266

The Semantic Web’s Role in Dealing with Disasters
August 8, 2008
By Jennifer Zaino

The University of Southern California Information Sciences Institute and Childrens Hospital Los Angeles have been working together to build a software tool. Dubbed PEDSS (Pediatric Emergency Decision Support System), the tool is designed to help medical service providers more effectively plan for, train for, and respond to serious incidents and disasters affecting children.

The project, a part of the Pediatric Disaster Resource and Training Center (PDRTC), has been going on for about eight months.

Dr. Tatyana Ryutov, a research scientist at the USC Information Sciences Institute, is working on the system. Recently, the Institute contacted Joshua Tauberer, the creator of GovTrack.us and the man who maintains a large RDF (Resource Description Framework) data set of U.S. Census data, about making SPARQL queries to that data in conjunction with the PEDSS.

“Currently, demographic data (number of children in four age groups) is entered manually. We want the tool to calculate this information automatically based on a zip-code. Therefore, we extend the tool to query the RDF census data server to get this information,” Ryutov writes. Currently this is the only server the software queries, but Ryutov says they plan to add calls to other census data servers to improve reliability. Those servers do not have to be RDF databases.

(and it continues)

Berlin SPARQL Benchmarks for my SemWeb .NET Library

Chris Bizer and team have posted a benchmark specification for SPARQL endpoints, the Berlin SPARQL Benchmark (BSBM). They have “run the initial version of the benchmark against Sesame, Virtuoso, Jena SDB and against D2R Server, a relational database-to-RDF wrapper. The stores were benchmarked with datasets ranging from 50,000 triples to 100,000,000 triples” (announcement email).

I ran the benchmark against my SemWeb .NET library. Instructions for setting up the benchmark are here, and they turned out to be a good example of how to very quickly set up a SPARQL endpoint using my library, backed by your SQL database of choice (in this case MySQL). I had some trouble the first time I ran the benchmark, though:

  • The first time I ran the tests I found the library had several bugs/limitations: a bug preventing ORDER BY with dateTime values, an error parsing function calls in FILTER expressions, and a glitch in the translation of the query to SQL. I corrected these problems.
  • Query 10 must be modified to change the ordering to ORDER BY xsd:double(str(?price)), which adds the cast xsd:double(str(…)), since ordering by the custom USD datatype is not supported, and is not required to be supported by the SPARQL specification.
  • In the same query, in FILTER (?date > "2008-06-20"^^<http://www.w3.org/2001/XMLSchema#date>), xsd:date comparisons are not part of the SPARQL spec (as I understand it; dateTime comparisons, on the other hand, are required by the spec). Such comparisons weren’t implemented in my library, but I went ahead and added them. (Both changes are sketched just below.)
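
To make those two changes concrete, here is roughly the shape of the modified clauses. This is not BSBM Query 10 verbatim; the ex: patterns are placeholders, and only the ORDER BY cast and the date FILTER reflect the actual modifications described above.

    PREFIX ex:  <http://example.org/schema#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?offer ?price
    WHERE {
      ?offer ex:price   ?price ;
             ex:validTo ?date .
      FILTER ( ?date > "2008-06-20"^^<http://www.w3.org/2001/XMLSchema#date> )
    }
    ORDER BY xsd:double(str(?price))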

I also have some concerns. First, I am not 100% sure whether the results from my library are actually correct; Query 4 seemed to always return no results. Second, queries are largely translated into SQL, and there is a good deal of caching going on at the level of MySQL. The benchmark results therefore say a lot about best-case run time, and indicate something about the overhead of SPARQL processing, but may not reflect general-use performance.

Benchmark results reported below are for my desktop: Intel Core 2 Duo at 3.00GHz, 2 GB RAM, 32-bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly comparable to the machine used in the BSBM.

Load time (in seconds and triples/sec) is reported below for some of the different data set sizes.

             1M     25M
Time (sec)   224    16129
triples/sec  4441   1544

For comparison, load time for the 1M data set was 224 seconds. This is about 2 to 2.5 times slower than Jena SDB (Hash) with MySQL over Joseki3 (117s) and Virtuoso Open-Source Edition v5.0.6 and v5.0.7 (87s), as reported in the BSBM results. For the larger 25M data set, the load time of 4.5 hours was only 1.2 times slower than Jena SDB, but 1.7 times faster than Sesame over Tomcat and 3 times faster than Virtuoso. (But, again, the machines were different.)

Results for query execution are reported below. AQET (Average Query Execution Time, in seconds) is given for each query at the two data set sizes. The results were again roughly comparable to Jena and Virtuoso. But the three caveats above are worth restating: the query results have not been validated as correct, there is significant caching, and the machine was different from the machine used in the BSBM.

           1M        25M
Query 1    0.019184  0.049200
Query 2    0.051187  0.048590
Query 3    0.030508  0.079187
Query 4    0.032693  0.075603
Query 5    0.172283  0.342828
Query 6    0.102105  3.277656
Query 7    0.256491  1.108414
Query 8    0.175357  0.572258
Query 9    0.059674  0.088451
Query 10   0.089215  0.322246

SemWeb RDF Library for C#


I’ve just posted the first release of my SemWeb library, written in C# for Mono and .NET, at http://taubz.for.net/code/semweb.
Features:
* Simple API; easy to deploy; no platform-specific dependencies.
* Reading and writing RDF/XML, Turtle, NTriples, and most of Notation 3, at around 20,000 statements/second.
* All operations are streaming, so it should scale.
* Two built-in types of RDF stores: an in-memory hashtable-indexed store for small amounts of data and an SQL store with MySQL and SQLite backends for large amounts of data.
* Creating new SQL-based stores takes minutes, and implementing other types of stores is as simple as extending an abstract class.
* Statements are quads, rather than triples, with the fourth ‘meta’ field left for application-specific uses.
I’ve been using SemWeb to push around the 7 million triples created by GovTrack (shameless plug).

Directory of C# Libraries

I had hoped to time the release of SemWeb with the debut of a new website I’m working on that will be a really great directory of reusable, open-source Mono/.NET libraries. But I need to get some other people’s libraries listed, besides my own, before finishing the site. If you’ve written a C# library that you think others would find useful, please let me know.

A Programming Project

A Programming Task for Someone Looking to Hack

The biggest thing that has helped me to program better is little programming projects. My first was a simple math tutoring program in GW-BASIC, written with the help of my dad back around third grade. I’ve almost always had a little project to keep me busy since then. Today, it’s creating an RDF library in C#.

I know that people are often looking for ideas for programs to write, so I thought I’d post a task that someone might want to spend some time hacking on. This is a mildly advanced routine, but anyway:

The goal is to parse an RDF/XML document using only XmlReader. That is, extract the RDF statements without loading the entire document into memory as an XmlDocument. As far as I know, this has never been programmed in C#, and it is really critical if semantic web applications are going to be built in .NET.

Getting the basics going isn’t too difficult a task. Getting the entire spec implemented is more of a challenge. But what’s life without challenges, eh? If you’re interested in taking a stab at this, drop me an email (tauberer@for.net).

A Design Suggestion

When I was riding the train back from D.C. to Philly last week, the speaker in the car I was in wasn’t working, so no one could hear the conductor’s announcements. Probably no one at Amtrak noticed the problem.

It made me think that we often build things that don’t notice when they’re not working. Speakers should be built with microphones that detect when the speaker isn’t emitting the sound it should be and, when that happens, send a signal back to… somewhere. Software should do the same thing. Applications should realize when things aren’t working right and, more importantly, send back a useful message that a problem occurred.

Here’s a for instance. I plugged a printer into my Linux desktop this week, but I couldn’t print a test page. The only message I got back was that I should increase the debugging level and inspect the output. Well, that is not a useful signal. Even with debugging on, the message I got was that the driver couldn’t be loaded. Pretty vague. It turned out the driver wasn’t even present on my system because I didn’t have the RPM installed. This is a condition that the printing system should have been able to detect and inform me of.

The failure here is that there was no mechanism built into the system for passing useful error messages back to the user. If there was a useful message at some point, it was discarded before it reached me. Don’t write software like this.

Diffing and RDF

If you’re reading this, you’re probably reading this on Monologue, and that means I’ve successfully added myself to Monologue. 🙂

Recently I got a helpful bug report for my Diff library for C# which pointed out that my port of Perl’s Algorithm::Diff wasn’t generating the same diffs as the original module. I fixed the bug and reposted a new version of the library.

In unrelated news, I’m working on building the semantic web for information about the U.S. government. This is a spin-off of my work on GovTrack (which is powered by Mono). To get this web built, I’m in the position of having to convince people that RDF is the right way to approach the problem of distributed information — over, for instance, XML, XML Schema, and XQuery. The problem is that RDF is complicated and often misunderstood, and I hadn’t found a good document explaining what RDF is and why it should be used for this. So, I wrote one. I’m not a master of RDF by any means, so any corrections and suggestions are welcome.

By the way, if you’re interested in building this political semantic web, join the GovTrack mail list.

Lastly, with my new interest in RDF, I was looking for a good C# library for working with RDF data models. I didn’t find one that I particularly liked (there are a few out there, but for various reasons I just couldn’t see myself using them), so I’m working on my own. I’ll post the source in a few weeks, probably.