Open-source Geo Is Really Something Right Now

In March a helpful Internet person named Michal Migurski tweeted:

Want to help @openaddr derive a parcel dataset for the US? I know of funding for a quick development project.

Many people who work at Postlight are the kinds of people who sit up straight in their chairs when someone says “derive a parcel dataset.” So I pinged Michal and was introduced to Waldo Jaquith, another helpful Internet person and Director of U.S. Open Data. Then, as Waldo wrote:

U.S. Open Data is a long-time supporter of the Open Addresses project, a volunteer-run project that aggregates government-published address datasets to create a global repository of the coordinates of street addresses. Anecdotally, project volunteers had noticed that a fair number of the data sources contained not just the latitude and longitude of an address, but the boundaries of the parcel. That raised the question of how many of the indexed 257 million addresses might include boundary data that was going unused. Could we have accidentally collected millions of cadastral records?

So we hired Postlight to figure it out for us. Developer Bryan Bickford spent a little over a week creating a Python-based tool to find and extract parcel data from OpenAddresses’ records.

Bryan’s work gave us a hard number: of the 1,511 data sources ingested by OpenAddresses, 383 include parcel boundaries (or 25%). There are a total of 30,461,769 parcels included…

I realize that there are a lot of numbers in the paragraph above. The gist is that a nice person used some money from the Shuttleworth Foundation to hire us to look at some freely available data, and we found the pattern they suggested we’d find, and as a result the coordinates for 30 million parcels of land were added to the global geographic commons. Cool!
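
For the curious, here is roughly what "finding the pattern" can look like. This is a toy sketch, not Bryan's actual tool. It assumes each data source has already been loaded as a list of GeoJSON-style features, and the function names and sample data are made up for illustration. The only idea it shows is that an address with a point geometry is just a coordinate, while one with a polygon geometry carries a parcel outline.

```python
# Hypothetical sketch: decide whether an address source also carries parcel outlines.
# Assumes GeoJSON-style features; this is not the tool described in the quote above.

# GeoJSON geometry types that describe an area rather than a single point.
AREA_TYPES = {"Polygon", "MultiPolygon"}


def is_parcel(feature):
    """Return True if the feature's geometry is an area (a possible parcel)."""
    geometry = feature.get("geometry") or {}
    return geometry.get("type") in AREA_TYPES


def count_parcels(features):
    """Count the features in one source that carry an area geometry."""
    return sum(1 for feature in features if is_parcel(feature))


# Two made-up sources: one with plain address points, one with a parcel outline.
sources = {
    "points_only": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
         "properties": {"number": "101", "street": "Example St"}},
    ],
    "with_parcels": [
        {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]]},
         "properties": {"number": "7", "street": "Sample Ave"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [0.5, 0.5]},
         "properties": {"number": "9", "street": "Sample Ave"}},
    ],
}

# Sources that contain at least one parcel outline.
sources_with_parcels = [name for name, feats in sources.items()
                        if count_parcels(feats) > 0]
print(sources_with_parcels)

# The share of sources reported in the quote: 383 of 1,511.
share = 383 / 1511
print(f"{share:.1%} of sources include parcel boundaries")
```

Real sources are messier than this (different file formats, different coordinate systems), which is presumably where most of that week went.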

There are massive sets of data out there, floating around, released into the wild by the government, NGOs, and other kinds of organizations. They contain wonders and mysteries. All it takes is time and a little funding for them to share their secrets.

What’s exciting is that right now, while some days the world seems to be going to hell in a mechanically made basket, we’re in this good moment in the lifespan of Moore’s Law: lots of data to explore, relatively fast bandwidth to download it, fast processors to process it, big hard drives to store it, frameworks for collaborating, and tons of tools left over from the orgiastic explosion of “big data” interest throughout the tech industry.

The expensive part is still programmers, designers, and product managers, which, well, that’s Postlight’s business, so thank God. But once a programmer has done something, they can put it on GitHub and never have to do it for the first time again, which is also what happened here.

What can you do with millions of geographic parcels? You can map with them! The open mapping scene is wild right now. Not long ago I noticed lots of weird people started to leave their museum and publishing jobs to go work at Mapzen. (The person who wrote that first tweet is the VP of Mapzen.)

Mapzen is a very large open-sourced mapping stack—like Google Maps but you can download all the data (it’s funded by Samsung). There’s also another mapping tool called Mapbox, which mixes together open and proprietary data. The two platforms use a lot of OpenStreetMap data. You can use Mapzen data inside Mapbox. All these people go to the same conferences. They want to work together but also to make their own awesome things. As a result there’s an enormous amount of geographic data coming online, in increasingly useful formats, and the people bringing it online are getting salaries and making hard choices about how to bundle things up for future users, not just putting it all out there with good intentions.

All of these things are still at the toolkit phase—you can get the data and do things with it, build services atop it. Here are some of the things you can download right now from Mapzen:

Nicely structured, tidy digital city maps. Who knows when you’ll need them? Maybe you’ll make an amazing new app for pedestrians. Maybe you won’t. But, given time and resources, you could, and a lot of the hardest, most boring, expensive stuff has already been done.

And then you look at U.S. Open Data’s GitHub page and you’ll find a dashboard of their projects and you can see they’re doing all kinds of work at the meta level. Nice, steady progress.

At the turn of the millennium there was a big nerdy push for a “Semantic Web,” which was going to assign a URL to every statement about the world using a singular data model. When I talk about the Semantic Web today people throw tomato emojis at me. It turned out that the world doesn’t really want one way to represent knowledge. The people who want one way to represent knowledge are men who have shoulder-length hair and a notebook filled with Rudolf Carnap quotes. The rest of us want to send text messages and look at Chartbeat.

Anyway! There were already lots of standards and rules associated with geography, so the open data around geography is moving towards those standards, to increase re-use. At the same time people are starting to extract more structured information from the web, so you end up with things like Wikidata. The way the data commons seems to work now is that people go, “Oh, this database lets me store squigglejson files. I should go look through those ten superbytes of files I downloaded from the Department of Human-sized Birds and see if I can extract squigglejson. I bet people in the squiggles community would think that was cool.”

Who knows where it’s all headed? It’s pretty exciting to imagine that one day we’ll stumble into the one true universal database of all human knowledge. Of course we won’t. Pokémon epistemology (gotta capture it all) doesn’t really work, because humans. But it’s still fun to imagine.

For about a decade the geographic commons has been coming together but the data has been hard to use. But things are definitely moving along at a righteous clip. Right now the geo commons is at the “here’s a tool that lets you make a new tool” phase. There’s good data, and there’s software that lets you mess with it. Something interesting will happen next. You get the world in your hands.

We’re grateful to work on projects like this and we hope to give more back to the commons as we grow. Thanks to Waldo Jaquith, U.S. Open Data, and the Shuttleworth Foundation for letting us play in this gigantic new and ever-enlarging sandbox.