Metadata standards for open data
With more and more datasets becoming open, it’s increasingly important for organisations, particularly government, to follow a metadata standard. Using standardised descriptions makes datasets more discoverable, more easily syndicated, more transferable, and ultimately makes it easier for datasets to be used in real-world situations to add value.
What are metadata standards?
Metadata is information that provides a description of a dataset in the form of fields. The fields may include things like title, format type, description, keywords, and so on.
Metadata standards provide a framework for consistency, so all datasets have fields labelled/described in the same way. This delivers many benefits to users and ultimately to citizens.
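To make this concrete, here is a minimal sketch of what a metadata record might look like, expressed as a plain Python dictionary. The dataset, its field names and values are all invented for illustration; real standards define the exact field names and formats.

```python
# A hypothetical metadata record for an open dataset. The field names
# (title, description, keywords, etc.) follow common conventions, and
# all values are invented for illustration.
dataset_metadata = {
    "title": "Public Transport Stops 2024",
    "description": "Locations and accessibility details of bus and tram stops.",
    "format": "CSV",
    "keywords": ["transport", "bus", "tram", "accessibility"],
    "publisher": "Department of Transport",
    "issued": "2024-03-01",
}

# A catalogue can only search, compare and combine records reliably if
# every dataset uses the same field names in the same way.
for field, value in dataset_metadata.items():
    print(f"{field}: {value}")
```

If every publisher names these fields differently (say, "name" vs "title", or "tags" vs "keywords"), a catalogue has to translate each record individually, which is exactly the rework a metadata standard avoids.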
Why adopt a standard?
If datasets use different fields and different descriptors (different metadata) it makes using the datasets (and particularly combining multiple datasets) difficult.
Consistency between datasets means datasets are:
- More discoverable
- Easily syndicated
- Easily combined with other datasets
And this means they can be used in real-world situations to add value. For example, Australia’s National Map uses many different datasets, and to be able to pull and display all this data, the datasets must all use the same metadata.
In Australia, following a metadata standard will help with discoverability on the various state-level data repositories and on federal catalogues such as data.gov.au. This discoverability extends to broader search platforms too, such as Google.
Salsa often has clients who want to use data from other organisations and agencies, and who also want their own data discovered by aggregation platforms run by other organisations. By conforming to a standard, they can distribute their data easily rather than adjusting it for each platform. If the metadata differs, data can lose context: a particular field may mean nothing to the destination platform. When organisations follow different standards, a lot of rework and customisation is needed to move data from one place to another.
Standardisation definitely brings many benefits but there are also some challenges. Firstly, you’ll need to look at the different standards available.
Which standard to use
The first decision you’ll need to make is which standard to use, and this means knowing what standards are available and then working out which standard suits your needs.
There are lots of options out there when it comes to open data standards. Some are quite general, while others deal with standards for specific areas, like health or transport. There are also standards based on region or country.
Standards are created using different approaches. For example, they may grow out of a community need, which was the case for OpenSpending and Open311. At the other end of the spectrum are standards established through a formal standardisation body, such as W3C (DCAT), ISO or IETF. This is how many web standards, such as HTML5 and RDF, were formed. Using standards established by more formal bodies can be beneficial because they have a framework for establishing and maintaining standards. However, these more formal standards can also be too rigid and might not meet your specific or changing needs. Often a combination of both approaches is used to create a standard.
Finding open data standards
While there are many open data standards available, there isn’t a simple way to find them. Standardisation bodies have dedicated websites and sometimes catalogues that list their standards. Open data standards often have dedicated websites, for example Open Referral, Open Contracting Data Standard and the Humanitarian Exchange Standard. However, there is no one-stop-shop for finding open data standards if you’re a data publisher wanting to publish standardised open data.
Profiling open data standards
The second challenge is how to choose which open data standard is appropriate for your needs. There are a number of factors that might influence your decision:
- Is the standard applicable to your domain/case?
- Who else is using the standard?
- What level of granularity does the standard address?
- Is the standard actively maintained?
It’s therefore important that this context, or metadata, is available along with the open data standard. Plus, this metadata should be described in a common way across all standards, so it’s easy for potential adopters to search and compare. Yes, a metadata standard for describing standards!
What is DCAT?
One of the most commonly used metadata standards is the W3C’s Data Catalog Vocabulary (usually referred to as DCAT). DCAT is a vocabulary built on the Resource Description Framework (RDF) that provides a standardised way of describing datasets (their metadata). It defines naming conventions for organisations to follow when describing datasets.
For example, some of the DCAT dataset properties are:
- dct:title
- dct:description
- dcat:keyword
- dct:temporal
- dct:spatial
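The properties above can be sketched as a simple record. In the snippet below, `dct:` prefixes properties borrowed from Dublin Core terms and `dcat:` prefixes properties from the DCAT vocabulary itself; the dataset and its values are hypothetical, and a real DCAT record would be serialised as RDF (for example Turtle or JSON-LD) rather than a plain dictionary.

```python
import json

# A sketch of a dataset description keyed by DCAT property names.
# The dataset and all values are invented for illustration.
dcat_record = {
    "dct:title": "Air Quality Monitoring Stations",
    "dct:description": "Hourly readings from fixed monitoring stations.",
    "dcat:keyword": ["air quality", "environment", "sensors"],
    "dct:temporal": "2023-01-01/2023-12-31",  # period the data covers
    "dct:spatial": "Victoria, Australia",     # region the data covers
}

# Because the property names are standardised, any DCAT-aware catalogue
# knows what each field means without per-publisher translation.
print(json.dumps(dcat_record, indent=2))
```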
Who’s adopting the standards?
Many government agencies are adopting the standards or using them as a basis for their datasets. However, there isn’t yet consistency across government, and many governments are still working toward implementing standards. For example, the EU Open Data Portal has been set up to bring European datasets together. Part of this process includes standardisation, with its datasets relying on the DCAT-AP specification.
Australia and data standards
Standardised metadata is essential for the centralisation of data at government level. It can be challenging to conform to the standards, but the benefits are worth it.
Data.gov.au has a guide for publishing data. This guide takes you through the process of publishing a dataset and includes recommendations across a variety of issues, from the preferred file formats to metadata, licensing and data tools. The guide stresses the importance of metadata for open government and there’s also a section on discovering metadata.
Data standards in action
Salsa is working with the Victorian Government on its Open Data Portal. The portal is managed by the Department of Premier and Cabinet (DPC) and as part of the new CKAN instance we’re building, DPC decided it needed a new schema to represent datasets and resources. DPC had lots of fields it wanted to capture and also wanted to be in line with DCAT. However, it’s a complicated metadata schema, and so as well as being based on DCAT, it also uses standards from Dublin Core and some customisation to meet its unique dataset needs.
Ideally, government departments across the state will all comply with the same standard so that datasets can be managed directly by each department in one data portal.
CKAN and DCAT
CKAN is a very common open data tool/platform (read What is CKAN? for more information). Many datasets sit on a CKAN portal, and the ability to push or pull datasets from one CKAN instance to another is extremely useful. If you don’t follow the same standards you can still transfer the data, but it may be stored in fields where it can’t be discovered in the same way, which makes it much harder to reuse.
CKAN recommends DCAT as the base standard and has set up a specific extension that provides tools and help for organisations publishing datasets that follow the DCAT metadata standards.
The extension includes:
- RDF DCAT Endpoints
- An RDF Harvester
- A JSON DCAT Harvester
And for implementation:
- A mapping between DCAT and CKAN datasets
- An RDF Parser
- An RDF Serialiser
The harvesters can be used to harvest data from one CKAN instance to another. They are configured to work with the DCAT schema and DCAT fields to simplify that process. If the data doesn’t comply with the DCAT schema, much more customisation work is needed to use a dataset.
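The mapping between DCAT and CKAN datasets can be sketched roughly as follows. CKAN really does store a dataset’s description in a field called "notes" and keywords as a list of tag objects, but the mapping function below is an illustrative simplification, not the ckanext-dcat extension’s actual code.

```python
# Illustrative sketch of mapping a DCAT-style record onto CKAN package
# fields. This mirrors the kind of translation the CKAN DCAT extension
# automates; the function and mapping table here are hypothetical.
DCAT_TO_CKAN = {
    "dct:title": "title",
    "dct:description": "notes",  # CKAN calls the description "notes"
}

def dcat_to_ckan(record):
    """Translate a DCAT-keyed record into a CKAN-style package dict."""
    package = {
        ckan_field: record[dcat_field]
        for dcat_field, ckan_field in DCAT_TO_CKAN.items()
        if dcat_field in record
    }
    # CKAN represents keywords as tag objects rather than plain strings.
    package["tags"] = [{"name": kw} for kw in record.get("dcat:keyword", [])]
    return package

package = dcat_to_ckan({
    "dct:title": "Air Quality Monitoring Stations",
    "dct:description": "Hourly readings from fixed stations.",
    "dcat:keyword": ["air quality", "environment"],
})
print(package["title"], "->", len(package["tags"]), "tags")
```

When source data already complies with DCAT, a fixed mapping like this is all a harvester needs; when it doesn’t, every non-standard field requires its own custom translation, which is where the extra work comes from.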
Salsa recommends using DCAT as the base of all government metadata schemas, customising only when (and if) absolutely necessary. In this way, a more standard approach can be followed across all levels of government.