Ruminations on Open Geodata Taxonomy (or lack thereof)

By Bill Johnson  |  April 13, 2017

Taxonomy (noun) – a scheme of classification  

If you were at the recent Midyear conference, you may recall we had a panel of open data experts fielding questions from the audience. Yours truly asked for the microphone and confessed to being “stuck”, since there seems to be no good way to ensure that all of the various open data sites, platforms, and cataloging systems can share catalogs of their content, because they don’t use a common set of terms. This seems analogous to every library using its own unique system to catalog its books. That’s probably how libraries started, but we all know that today there are standards in place that make each library a searchable node in an extensive library network. We lack that today for open geodata. Typical open data sites rely on searchable metadata and keywords or tags to make data discoverable. But since any given dataset can be described in an almost endless variety of ways, the current situation supports only a limited (and unpredictable) amount of cross-platform catalog sharing. What we need is a common taxonomy.

Just for fun tonight, I tried a few keyword searches on data.gov.  The table below summarizes my results for two popular GIS datasets:

Keyword                     Results

Roads                         4,600
Streets                       2,642
Highways                      1,655
Transportation Network          327
Road Navigation                 163

Parcels                       1,794
Tax Parcels                     269
Land Parcels                  1,465
Real Property Boundaries      3,885

That last entry in the table was an eye-opener for me. I normally hear GIS people referring to parcel data or tax parcel data. Yet “real property boundaries”, which I seldom hear, yielded more than twice as many search results as “parcels”. So the choice of keyword is vital to finding data, even for fairly obvious data categories. And speaking of categories, it can get more complicated when we assign our datasets to categories. At least there is an existing standard for geodata, the ISO 19115 Topic Categories. But these can be confusing, with categories like “geoscientificInformation” or “imageryBaseMapsEarthCover”, along with simpler ones like “farming”, “health”, and “elevation”. Are you likely to refine your search to the geoscientificInformation category when you want soils data? Or how about the difference between “inlandWaters” and “oceans” in the ISO categories list? For the State of NY, where I live, I’m not sure I could correctly assign the Hudson River, which technically is a tidal estuary with brackish water for many miles inland and even a three-foot tide all the way to Albany, 150 miles from its mouth. Where does the Hudson become an inland water feature? So I’m not entirely convinced that the ISO topic categories are part of the solution set we need. Maybe. Maybe not. At best they (or another list) are just one element of a complete taxonomy.

It seems like there is an opportunity in front of us to see if we can help create a taxonomy for the geodata we manage as state and local GIS leaders. This would include master keywords, definitions, synonyms, categories, and rules for their use. I know from some earlier work in my career that this may sound easy, but it’s actually quite hard. A decade ago we put a lot of effort into building out a taxonomy for the GIS data we had assembled for homeland security purposes in New York, a total of about 500 datasets. Can the creation of a broader, more complete taxonomy be automated? We saw some interesting presentations at the Midyear conference involving machine learning, so perhaps that’s the approach to take.
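To make the idea a bit more concrete, here is a minimal sketch in Python of what one slice of such a taxonomy might look like. It is purely illustrative, not an existing NSGIC or ISO artifact: the `Term` and `Taxonomy` names, the definitions, the synonym lists, and the single matching rule (ignore case and extra spaces) are all my assumptions. The synonyms are simply the search terms from the table above, and the category values borrow two ISO topic category names only as placeholders.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One master keyword with its definition, category, and synonyms (hypothetical layout)."""
    master: str
    definition: str
    category: str
    synonyms: tuple = ()


class Taxonomy:
    """Maps any known tag (master keyword or synonym) back to its master keyword."""

    def __init__(self, terms):
        self.terms = {t.master: t for t in terms}
        self._index = {}
        for t in terms:
            # Index the master keyword and every synonym in normalized form.
            for label in (t.master, *t.synonyms):
                self._index[self._norm(label)] = t.master

    @staticmethod
    def _norm(label):
        # Assumed rule of use: compare case-insensitively and collapse extra spaces.
        return " ".join(label.lower().split())

    def resolve(self, keyword):
        """Return the master keyword for a tag, or None if the tag is unknown."""
        return self._index.get(self._norm(keyword))


# Illustrative entries only; definitions and groupings are assumptions.
taxonomy = Taxonomy([
    Term("parcels", "Boundaries of real property ownership units.", "planningCadastre",
         ("tax parcels", "land parcels", "real property boundaries")),
    Term("roads", "Centerlines of public and private roadways.", "transportation",
         ("streets", "highways", "transportation network", "road navigation")),
])

for tag in ["Tax Parcels", "Real Property Boundaries", "Streets", "Soils"]:
    print(tag, "->", taxonomy.resolve(tag))
```

With something like this in place, a search for “parcels” could also return datasets tagged “real property boundaries”, while an unknown tag such as “Soils” comes back empty and flags a gap in the taxonomy. The lookup is the easy part; agreeing on the master keywords, definitions, and synonyms is where the real effort goes.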

As NSGIC continues development of the GIS Inventory, a standard taxonomy for the state and local GIS data that we all develop and expose could have real and lasting value to the larger GIS and open data community. Imagine if all of the NSGIC state reps started using a standard taxonomy.  Harvesting from state and local data sites to the GIS Inventory would be a breeze. Sharing that catalog openly would enable other sites to readily participate in harvesting or searching. And the huge variations in search results like those I experienced tonight on data.gov would be a thing of the past. Then imagine if a NSGIC-led geodata taxonomy became widely adopted by the larger GIS community and implemented by vendors and the open source community in the software we all use. That would be a beautiful thing.  

Here’s hoping.