50% of the U.S. population lives in 1% of the land area

Everyone likes a good choropleth map, that is, a map with regions colored according to some variable. But when the variable is a function of a population of highly unevenly distributed individuals — as in maps of the United States — we can run into some problems:

Half of the population of the continental United States lives in just 1% of its land area. A quarter of the population lives in just 0.28% of the land area. 95% of the population lives in just 27% of the land area.

There are at least two problems with veridical (regular geographic) choropleth maps. First, in a rasterized choropleth map (i.e. one with finite resolution), entire cities can get squashed into a single pixel, and information is lost. If the variation occurs where the people are, a substantial proportion of the information the map is trying to show may not appear at all.

Second, and more familiar, veridical maps can misrepresent the aggregate. Individuals in low-density areas are given more space on the map than individuals in high-density areas, biasing aggregate inferences toward the values of individuals in low-density areas.

Coloring a map by district — like by county or congressional district — runs into the same problem. The smallest 50% of the 433 congressional districts in the continental U.S. occupy just 5% of the land area. Six congressional districts, all in New York City, are smaller than one pixel in a typically sized map!  (Where “typically sized” is 650px by 410px.)
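
For a sense of scale, here is a back-of-envelope sketch (my own illustration, not a calculation from the post) of how much land a single pixel of a 650px by 410px map must represent, using the continental land-area total reported in the methodology below:

    # Rough arithmetic behind the "smaller than one pixel" claim. The land-area
    # total is taken from the methodology below; the result is a lower bound,
    # since the country's bounding box, not just its land, has to fit the frame.

    MAP_WIDTH_PX, MAP_HEIGHT_PX = 650, 410
    CONUS_LAND_KM2 = 7_653_005                     # continental U.S. land area

    pixels = MAP_WIDTH_PX * MAP_HEIGHT_PX          # 266,500 pixels in the frame
    km2_per_pixel = CONUS_LAND_KM2 / pixels        # roughly 29 km^2 per pixel

    print(f"Each pixel covers at least {km2_per_pixel:.0f} km^2 of land")
    # Any district with less area than this cannot fill even a single pixel.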

When map data falls below the resolution of the map itself, one should be very concerned. It amounts to arbitrarily tossing out data, because those data points aren’t showing up at all. That would be considered academic fraud if the data were presented in a table; I’m not sure why we think it’s okay in map form.

It’s also mostly the urban population that gets squeezed into a small area. This is particularly concerning for politically themed maps, since the urban population leans left. All six of those too-small-to-be-seen New York districts are currently represented by Democrats, for instance. Republican-held congressional districts are on average 2.7 times larger than Democrat-held districts despite having equal weight in Congress, and so they take up disproportionate space in a veridical map. The same is likely true by county, too, if we were to look at presidential election results.
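
A minimal sketch of how a ratio like that 2.7x figure could be computed, assuming a hypothetical per-district table with a party column and an equal-area area column (such as the one produced in the methodology sketch further down):

    # Hypothetical input: one row per district, with columns "party" ("R"/"D")
    # and "area_km2" (equal-area district area; see the geopandas sketch below).
    import pandas as pd

    districts = pd.read_csv("districts.csv")                  # hypothetical file
    mean_area = districts.groupby("party")["area_km2"].mean()
    print(mean_area["R"] / mean_area["D"])                     # ratio of average district sizes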

Considering how much space on a map is taken up by essentially unpopulated land, these maps are also inefficient representations of the data. They give space to meaningless geographies while skipping meaningful ones.

It’s really time we stop using veridical maps to show population data. I get that cartograms are hard to construct and hard to read, but I would rather have no map at all than a map that misrepresents the data it purports to show.

Detailed data:

Here’s a table showing land area as a function of population:

% of Population     % of Land Area
  5%                 0.02%
 10%                 0.05%
 15%                 0.11%
 20%                 0.19%
 25%                 0.28%
 30%                 0.39%
 35%                 0.52%
 40%                 0.67%
 45%                 0.85%
 50%                 1.07%
 55%                 1.35%
 60%                 1.71%
 65%                 2.19%
 70%                 2.91%
 75%                 4.02%
 80%                 5.91%
 85%                 9.28%
 90%                15.44%
 95%                27.41%
100%                99.62%

Methodology:

To compute the land area resided in by the population, I used the 72,246 Census tracts in the 2010 census that make up the continental United States, meaning I excluded tracts in Alaska, Hawaii, and the five island territories. For land area I used the ALAND10 value in the Census’s shapefiles. The total population and land area of the tracts used were 306,675,006 people and 7,653,005 km^2, respectively.
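
For concreteness, here is a minimal sketch of how a table like the one above can be computed from tract-level data; the file name and column names are my assumptions, and the densest-first ordering is implied by the table rather than spelled out here.

    # Sketch of the cumulative population-vs-land-area calculation, assuming a
    # hypothetical CSV with one row per 2010 Census tract and columns
    # "population" and "aland10" (land area from the ALAND10 field, in m^2).
    import pandas as pd

    tracts = pd.read_csv("tracts.csv")                    # hypothetical input

    # Sort tracts from densest to least dense, so the cumulative sums answer:
    # "the most densely settled X% of people occupy Y% of the land."
    tracts["density"] = tracts["population"] / tracts["aland10"]
    tracts = tracts.sort_values("density", ascending=False)

    cum_pop = tracts["population"].cumsum() / tracts["population"].sum()
    cum_land = tracts["aland10"].cumsum() / tracts["aland10"].sum()

    # Report the land-area share at each 5% increment of population, as above.
    # The small tolerance guards against floating-point rounding at 100%.
    for p in [i / 20 for i in range(1, 21)]:
        land_share = cum_land[cum_pop >= p - 1e-9].iloc[0]
        print(f"{p:.0%} of population -> {land_share:.2%} of land area")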

For congressional districts, I used the 433 districts in the continental U.S. (that is, the states minus Alaska and Hawaii, plus the DC district). Their “land area” is their 2D area after being projected into EPSG:2163, an equal-area projection, using this Census GIS data. The total “land area” computed this way came out to 8,064,815 km^2, the difference being areas of water. For which party holds each district, I filled in the two currently vacant districts (AL-01 & FL-13) with the party of their most recent congressman.
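
A sketch of that equal-area computation using geopandas might look like the following; the shapefile path is hypothetical, while EPSG:2163 and the km^2 conversion follow the methodology above.

    # Compute district areas under an equal-area projection (EPSG:2163).
    import geopandas as gpd

    districts = gpd.read_file("districts_shapefile.shp")    # hypothetical path to the Census GIS data
    districts = districts.to_crs(epsg=2163)                 # reproject to the equal-area CRS (units: meters)
    districts["area_km2"] = districts.geometry.area / 1e6   # m^2 -> km^2

    print(districts["area_km2"].sum())    # the post reports 8,064,815 km^2 for the 433 districts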

Thanks to Matt Moehr, Lisa Wolfisch, and Pat Grady for some tips on identifying census tracts via Twitter.

Updates: Keith Ivey pointed out that I included Hawaii in the definition of the continental U.S. the first time around. Fortunately its land area is small enough that it only barely affected the numbers. Instead of 74,003 tracts and 435 congressional districts there are 72,246 and 433; 60% of the population lives in 1.71% of the land area, not 1.70%. Other numbers are unchanged.

I also changed the projection used to compute the land area of congressional districts from “web Mercator” to an equal-area projection, but the numbers (e.g. 50% cover 5% of land) didn’t change. While I was there, I also changed how the Republican/Democrat distortion was measured. I originally wrote “Republican-held congressional districts cover 3.2 times more land area than Democrat-held districts despite Republicans only having 1.2 times as many seats in Congress” but I think the way it’s phrased now is clearer.

– 12/23/2013

Updated Guidance for Federal Agencies’ Open Data Licensing

Eric Mill, Jonathan Gray, and I have updated Best-Practices Language for Making Data “License-Free,” which also has a new home at http://theunitedstates.io/licensing/.

We’re also adding a slew of new endorsements, bringing the list to the Sunlight Foundation (read Eric’s blog post), the Open Knowledge Foundation (read their blog post), CDT, EFF (read their blog post), Public Knowledge (read their blog post), the Free Law Project, the OpenGov Foundation, Carl Malamud at Public.Resource.Org, Jim Harper at WashingtonWatch.com, CREW, and MuckRock. Thanks go out to our contacts at all of those organizations and especially to Eric who spearheaded the effort to issue this update.

Redefining “Open”

Our guidance is for federal agencies and is related to the recent open data memo from the White House, M-13-13. That memo directed agencies to make data “open” but told agencies the wrong thing about what open data actually means. We’re correcting that with precise, actionable direction that, to summarize, says “Use CC0.” As we write in our guidance:

It is essential that U.S. federal government agencies have the tools to preserve the United States’ long legal tradition of ensuring that public information created by the federal government is exempt from U.S. copyright and remains free for everyone to use without restriction.

The White House memo opened the possibility that agencies could impose arbitrary licenses (e.g. in the form of terms-of-use agreements) on government data and still call that data open government data. Or, worse, that the government could copyright data through the backdoor of contract work and still call it open government data. Basically, it said open data can be “openly licensed.” In the U.S., where the public domain is often the default, licensing is a step backward.

Why It Matters

Why does it matter? Imagine if, after FOIA’ing agency deliberative documents, The New York Times were legally required to provide attribution to a contractor, or, worse, to the government itself. If the government doesn’t like the article, maybe it takes the Times to court on the grounds that the attribution wasn’t done correctly. There’s a reason we don’t let our government control access to information.

What We Recommend

In short, what we say is “Use Creative Commons Zero” (CC0), which is a public domain dedication. We provide recommended language to put on government datasets and software to place the data and code in the worldwide public domain, which means anyone can use the information without any capricious restriction. In a way, it’s the opposite of a license.

I previously wrote about this in August when we issued the first version of our guidance. Since then, our document has been effective in guiding the use of “open” in three government projects:

  1. OSTP’s Project Open Data re-licensed its schema for federal data catalog inventory files. It had been licensed under CC-BY because of non-governmental contributors to the schema, but now it uses CC0.
  2. CFPB followed our guidance and applied CC0 to their qu project
  3. …and their eRegs platform.

Our advice had already been followed by HHS for its ckanext-datajson project and by the Council of the District of Columbia for its Unofficial Code (disclaimer: I was involved in both of those projects). We’re glad our guidance has been useful so far, and we hope it continues to be useful as agencies work on compliance with M-13-13.

In this updated version, we cleaned up the suggested legal language, noted that our recommendations apply to software code as well as to data, and improved the introductory text, among other changes.

How We Wrote It

The process of updating the guidance was carried out mostly in the open on GitHub. Feel free to open an issue with questions or submit a pull request with suggested edits at https://github.com/unitedstates/licensing/issues.

PS.

Other coverage: