What National Information Infrastructure

Jeni Tennison

You’ve probably heard people calling data “the new oil”. The analogy is appropriate in some ways: like oil, data is something you can process to create many different kinds of value. But what this analogy misses is the fact that you have to choose to collect data in the first place. Data deposits don’t just happen. They aren’t things we discover and extract. They are things we create. And the data we choose to collect determines the value that we eventually get.

So I prefer to liken datasets to roads. Roads are things we choose to build because they enable us to get from A to B, just like datasets enable us to find the information we need to make better decisions. We, as a society, have done a bit of thinking and planning and worked out which roads are really important. We’ve given them names like ‘M25’ and invested time, effort and resources in expanding or improving existing roads or adding new ones to make journeys easier.

But the transformational property of roads is that they have junctions. Roads that don’t join with other roads can only take you between two places, but a network of connected roads can take you anywhere the roads reach.

Photo of Spaghetti Junction by Highways Agency

If we were to look at the UK’s National Information Infrastructure as a road network, what we’d see at the moment is a mix of bumpy dirt tracks (data hidden in PDFs), slightly smoother byways (data presented in Excel files), and a few well-paved, high speed motorways (data provided as data), many of which charge a toll to use. There would be relatively few junctions (links between datasets), and what junctions there were would generally be hard to navigate.

We need more trunk roads in our National Information Infrastructure: more of the datasets we frequently use need to be easy to access and process. We need more junctions too: the core datasets should provide ways of travelling from one dataset to another. It’s not that other datasets aren’t important or interesting — they absolutely are, and in fact they’re probably more engaging than the monotony of a motorway — it’s just that it’s much harder to travel without trunk roads.

The equivalent to trunk roads with lots of junctions in the National Information Infrastructure is core reference data. Core reference data is data about things:

that are referenced from other information (these references are the junctions onto the byways of statistical and administrative data)
where each item is assigned an identifier, such as a number or code, to make it easy to reference and therefore to create a junction with other datasets
where lists of them are probably maintained through some defined process, which ensures that the roads and junctions themselves get maintained

Trunk roads need to be composed of the right materials to enable an easy journey: we don’t want our smooth tarmacked motorways to be covered in gravel. In the same way, core reference data needs to contain core information, which tends to

be non-numeric (not counts or percentages)
change infrequently

and not be combined with statistical and administrative information, which tends to

be numeric (counts or percentages)
change rapidly over time

Identifying the things about which core reference data needs to be made available isn’t hard. They are the things around us, that make society function. They are the things that we collect information about in registers. They are the things that all the information we collect and use is really about. For example:

registered companies
courts
the river network
schools & academies
power stations
polling stations
hospitals
job centres
bus stops

We’ve created a longer list on our wiki, which is still unlikely to be complete, based on the datasets whose release was targeted within the G8 Open Data Charter.

These datasets aren’t particularly big or complicated. Most of them will just have a few fields: an identifier, a name, a location or boundary, a category. But they are important as connective datasets that enable developers to traverse other, richer and deeper data.
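To make the idea of a junction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `School` record, the identifiers such as `SCH001`, the pupil counts and the `join_counts` helper are invented for illustration and do not describe any real dataset. It shows a small reference dataset with the few fields described above, and a separate statistical dataset that refers to schools only by identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    """One row of a hypothetical core reference dataset of schools."""
    identifier: str   # stable code that other datasets can reference
    name: str
    latitude: float
    longitude: float
    category: str


# Hypothetical reference data: few fields, slow to change.
SCHOOLS = {
    s.identifier: s
    for s in [
        School("SCH001", "Example Primary School", 51.50, -0.12, "primary"),
        School("SCH002", "Example Academy", 53.48, -2.24, "academy"),
    ]
}

# Hypothetical statistical data: numeric, changes every year,
# and points at schools only through their identifier.
PUPIL_COUNTS = [
    {"school_id": "SCH001", "year": 2013, "pupils": 210},
    {"school_id": "SCH002", "year": 2013, "pupils": 980},
    {"school_id": "SCH999", "year": 2013, "pupils": 45},  # no matching school
]


def join_counts(counts, schools):
    """Follow the identifier 'junction' from each count to its school.

    Returns the joined rows plus the identifiers that led nowhere.
    """
    joined, unmatched = [], []
    for row in counts:
        school = schools.get(row["school_id"])
        if school is None:
            unmatched.append(row["school_id"])
            continue
        joined.append({
            "name": school.name,
            "category": school.category,
            "year": row["year"],
            "pupils": row["pupils"],
        })
    return joined, unmatched
```

Calling `join_counts(PUPIL_COUNTS, SCHOOLS)` returns two joined rows and reports `SCH999` as unmatched. An identifier that appears in the statistics but not in the reference data is a junction that leads nowhere, which is why the reference list needs to be maintained.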

Like trunk roads, some core datasets might need to be built from scratch. New junctions might need to be created between datasets to make them more useful. Like trunk roads, these datasets need to be provided to a high standard; our ideal is for them to be published at the Expert level of the Open Data Certificate.

A plan to progress a National Information Infrastructure should consider these datasets and work out whether each exists in the first place, whether it’s published, how well it’s published (including how easy it is to find and whether it’s available as open data), and how that publication will be improved. Just as our road network requires continuing investment, so will our information infrastructure.
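One way to picture that assessment is as a simple checklist applied to each dataset in turn. The sketch below is an assumption about how such a checklist might be recorded, not a description of any existing process: the `DatasetStatus` fields, the `next_step` function and the wording of each step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetStatus:
    """Hypothetical record of where one core dataset stands."""
    name: str
    exists: bool
    published: bool = False
    open_data: bool = False
    easy_to_find: bool = False
    improvement_plan: Optional[str] = None


def next_step(status: DatasetStatus) -> str:
    """Return the first unanswered question from the checklist."""
    if not status.exists:
        return "build the dataset"
    if not status.published:
        return "publish the dataset"
    if not status.open_data:
        return "make it available as open data"
    if not status.easy_to_find:
        return "make it easier to find"
    if status.improvement_plan is None:
        return "agree how publication will be improved"
    return "carry out the improvement plan"


# Placeholder entries only; these are not assessments of real datasets.
EXAMPLES = [
    DatasetStatus("example dataset A", exists=False),
    DatasetStatus("example dataset B", exists=True, published=True),
]

for status in EXAMPLES:
    print(f"{status.name}: {next_step(status)}")
```

The order of the checks mirrors the order of the questions in the paragraph above: a dataset that does not exist cannot be published, and one that is not published cannot be open or easy to find.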

Creating a list of datasets isn’t hard, just like creating a list of roads isn’t hard. The real challenge for government is to construct a realistic and coherent plan for how our National Information Infrastructure should look — which roads should be trunk roads and where the junctions should be — and to see it through.