Friday, 10 October 2008

Community Schemas: Making sense out of disparate datasets

With so many organisations publishing geospatial datasets through standards-based web services, a raft of new opportunities for large-scale data analysis is opening up. The challenge now is integrating datasets that use different terms and attributes to describe the same data. For example, "water quality" (good, medium, bad) in one database might equate to "pollution level" (1, 2, 3, 4, 5) in another.

Communities, such as the hydrology community backing the Australian Water Data Infrastructure Project (AWDIP), are solving these data integration issues by defining a community schema for their domain, then ensuring all agencies publish data using that schema.

Community Schemas describe a rich set of semantics for a domain, using basic building blocks provided by Geography Markup Language (GML). This allows a community to define schemas that suit its data and to use them for data transfer between its members. The schemas can then be referenced to ensure consistent structure and taxonomy between related datasets, improving the community’s ability to share data. For example, a hydrology schema may define a class called Water Use with acceptable terms defined as irrigation, domestic, and industrial. A dataset published with an invalid Water Use of farming will not validate, and the publisher will know to correct the mistake.
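The Water Use check can be sketched in a few lines of Python. This is only an illustration of the idea: real community schemas express such rules in GML application schemas and shared vocabularies, and are checked with XML validation tools. The names `WATER_USE_TERMS` and `validate_water_use`, and the sample records, are hypothetical and not part of any schema or library.

```python
# Hypothetical controlled vocabulary for the "Water Use" class described above.
WATER_USE_TERMS = {"irrigation", "domestic", "industrial"}


def validate_water_use(records):
    """Return (index, value) pairs for records whose water_use term is not in the vocabulary."""
    errors = []
    for index, record in enumerate(records):
        value = record.get("water_use")
        if value not in WATER_USE_TERMS:
            errors.append((index, value))
    return errors


# Made-up sample data: the second record uses a term outside the vocabulary.
dataset = [
    {"site": "A", "water_use": "irrigation"},
    {"site": "B", "water_use": "farming"},
]

for index, value in validate_water_use(dataset):
    print(f"record {index}: invalid Water Use {value!r}")
```

Here the record with "farming" is reported and the one with "irrigation" passes, which is the feedback a publisher would get from validating against the community schema.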

Publishing data through a standardised community schema means:

  • More applications and data analysis projects get developed, because extensive, quality data becomes available and cost-effective

  • Clear data definitions reduce data misinterpretation

Defining a community schema for a domain is non-trivial, as it requires participating parties to create and agree upon a data architecture, vocabularies and an interchange protocol. Luckily, the first community schema projects have left a trail of reusable building blocks and processes that future efforts can draw on. For instance, specifications for Observations and Measurements (O&M) were developed as part of the Sensor Web Enablement (SWE) specification and have since been used as a component in Geoscience Markup Language (GeoSciML), Water Markup Language (WaterML) and others. Other building blocks include Geography Markup Language (GML), SensorML, CityGML and the ANZLIC profile of ISO 19115 for Metadata.

The other critical component in the development of a community schema is buy-in, governance and testing from the user community. This is often an international effort. GeoSciML, a schema for geology, has participants from BGS (United Kingdom), BRGM (France), CSIRO (Australia), GA (Australia), GSC (Canada), GSV (Australia), APAT (Italy), JGS (Japan), SGU (Sweden), USGS (USA) and the OGC (International).

Communities need to adopt a governance structure to resolve the inevitable disagreements over technical details. The GeoSciML community, which started in 2003 and is now on its third schema iteration, has organised working groups for information model development, computational model development, vocabulary definition, defining use cases, and testing the schemas in a formal test bed, plus an outreach working group to promote the schema. A key element in the success of GeoSciML is the fact that custodianship of geologic information is managed by similar agencies in most jurisdictions (geological surveys), and these have a history of collaboration.

Most agencies will first encounter a community schema when they are asked to deploy their datasets using one. Spatial data is collected by numerous agencies, for various purposes, following different collection guidelines. Storage models tend to reflect the original data use and are rarely designed for data exchange. When data is published through web services, the schema usually reflects the storage model. This works fine for the original application but is an integration nightmare when trying to share data between agencies. Changing the storage format usually isn’t desirable if it breaks legacy applications or introduces sub-optimal performance. Hence, it is necessary to differentiate between the storage and exchange formats. Storage format can be defined by the custodian who generates and maintains the data. The challenge is to map the storage model to a community schema. Again, prior projects have built a suite of tools to help out.
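Keeping storage and exchange formats separate comes down to a mapping layer applied at publication time. A minimal sketch in Python follows, assuming a hypothetical agency table with `site_id` and `wq` columns and a hypothetical community schema with `siteIdentifier` and `pollutionLevel` properties. The field names, the sample row and the particular good/medium/bad to 1-5 correspondence are all invented for illustration; in practice this mapping would be configured in a server such as GeoServer rather than hand-coded.

```python
# Hypothetical translation from one agency's storage terms to a community
# schema's 1-5 pollution scale. The numbers are assumptions; a real community
# would agree on them through its governance process.
QUALITY_TO_POLLUTION_LEVEL = {"good": 1, "medium": 3, "bad": 5}

# Hypothetical storage column names -> community schema property names.
FIELD_MAP = {"site_id": "siteIdentifier", "wq": "pollutionLevel"}


def to_community_record(storage_row):
    """Translate one storage-model row into a community-schema record."""
    record = {}
    for storage_name, schema_name in FIELD_MAP.items():
        record[schema_name] = storage_row[storage_name]
    # Replace the local vocabulary term with the community's scale value.
    record["pollutionLevel"] = QUALITY_TO_POLLUTION_LEVEL[record["pollutionLevel"]]
    return record


# Made-up storage row; "internal_notes" stands in for a legacy-only column.
row = {"site_id": "VIC-0042", "wq": "medium", "internal_notes": "legacy field"}
print(to_community_record(row))
```

The storage row is left untouched, so legacy applications keep working, while columns that are not part of the exchange format (here `internal_notes`) are simply not published.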

Funding from Australia’s National Collaborative Research Infrastructure Strategy (NCRIS), CSIRO and DPI Victoria has added community schema support to GeoServer, an open source WFS and WMS server. Deegree, another open source WFS/WMS server, is also being investigated. FullMoon, an open source tool managed by CSIRO, supports transforming UML data models into various GML application schemas. Duckhawk, an open source WFS and WMS robustness and validation testing tool, was developed for the Australian Water Data Infrastructure Project (AWDIP) for testing WaterML.

There is a common theme developing around community schemas: many of the tools being developed are open source. Spatial Data Infrastructures (SDIs) increase in value as more agencies contribute data to them. Also, the users who benefit most from the infrastructure are often not the agencies that collect or manage the data. This results in large organisations developing SDIs to aggregate and serve data from numerous smaller, more specialised agencies with different priorities, budgets and timelines. To encourage these smaller agencies to manage and publish their data using community schemas, SDI sponsoring organisations like NCRIS are developing open source tools that reduce the financial barriers small agencies face in getting their data online.

Australia, like the rest of the world, has a huge variety of datasets, each fulfilling its own purpose. While data integration is non-trivial, we have the knowledge, tools and processes to integrate disparate datasets. This will enable more powerful analysis and new business opportunities for all participating parties.

Credits:

I'd like to thank Stefan Hansen, Software Developer at LISAsoft and technical lead on the Duckhawk WFS conformance and performance testing framework, who helped research this blog post. Thanks also to Rob Atkinson and Simon Cox, who provided a lot of background on CSIRO's involvement with Community Schemas.

A version of this article will be published by Position Magazine in their December 2008 edition.

2 comments:

Cameron Shorter said...

Cameron, this problem is extant in the UK. You might usefully contact Ant Beck at Leeds University: [email address removed from comment for privacy].
He is a lead researcher on VISTA, a project being led by UKWIR, funded by Department of Trade and Industry money.

Comment by redmick.

Jody Garnett said...

Hey Cameron; every time you edit your page it jumps to the top of the osgeo blog feed.