At the ODI, we’ve often talked about publishing data on GitHub; to us, it’s a nice solution that not only gives data publishers a convenient (and free) place to publish their data, but also encourages collaboration and provides a historical view of data over time.
Publishing data on GitHub is also a great solution for people and organisations that want to publish small amounts of data, but don’t necessarily have the tools or capacity to publish data on their main website or use a fully-featured data catalogue such as CKAN, Socrata or OpenDataSoft.
However, we’ve also recognised that GitHub can seem intimidating to many non-technical people. GitHub has done some great work to lower the barriers to adoption, such as its desktop clients, but the fact remains that GitHub is a technical tool, and the barriers are still there for those who don’t code.
With this in mind, I decided to use my recent innovation week to build a web-based front-end for publishing data on GitHub.
Using our best practice guidance on publishing data on GitHub as inspiration, I put together Git Data Publisher (the name needs some work). Currently, it allows users to sign in with their GitHub account, add some information about their dataset, and add any number of data files.
When a user submits the form, the tool generates a new GitHub repo that contains not only their data files but also an automatically generated datapackage.json file, with metadata about the dataset such as the licence and publication frequency, amongst other things.
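To give a flavour of what this looks like, here's a minimal sketch of the kind of datapackage.json the tool might write out. The "name", "title", "licenses" and "resources" fields come from the Data Package specification; the "publication_frequency" field and the specific values are hypothetical placeholders standing in for the extra metadata mentioned above, not necessarily what the tool emits.

```python
import json

# A hypothetical example of the generated metadata file. Field names follow
# the Data Package spec where they exist; "publication_frequency" is an
# assumed extension field for the publication-frequency metadata.
metadata = {
    "name": "example-dataset",
    "title": "Example dataset",
    "licenses": [
        {"id": "odc-pddl", "url": "http://opendatacommons.org/licenses/pddl/"}
    ],
    "publication_frequency": "monthly",
    "resources": [
        {"path": "data/example.csv", "format": "csv"}
    ],
}

# Serialise it as the tool might, ready to commit to the new repo.
print(json.dumps(metadata, indent=2))
```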
It also generates an HTML representation of the dataset (with DCAT metadata embedded inside), served via GitHub Pages (another wonderful GitHub service, which allows free publication of static HTML websites).
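Embedding DCAT metadata in HTML typically means annotating the markup with RDFa attributes. The fragment below is purely illustrative of that pattern, assuming the W3C DCAT vocabulary and Dublin Core terms; the tool's actual output may differ.

```html
<!-- Illustrative sketch: a dataset page with DCAT metadata as RDFa -->
<div vocab="http://www.w3.org/ns/dcat#" typeof="Dataset">
  <h1 property="http://purl.org/dc/terms/title">Example dataset</h1>
  <p property="http://purl.org/dc/terms/description">
    A small example dataset published via GitHub Pages.
  </p>
  <a property="distribution" typeof="Distribution" href="data/example.csv">
    Download (<span property="http://purl.org/dc/terms/format">CSV</span>)
  </a>
</div>
```

Because the metadata lives in attributes on the visible page, the same HTML serves both human readers and machine consumers that understand RDFa.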
The way this works fits in perfectly with the many-parts-loosely-joined approach we take to developing tools at the ODI. One of the things I’d like to add is the ability to automatically generate an Open Data Certificate for each dataset using the Open Data Certificates API, as well as automatic validation of CSV files using CSVlint.
The tool is by no means finished, and there are several features I’d like to add in future innovation weeks.
You can see more details of these features (and suggest new ones if you’re so minded) on the project’s issue tracker on GitHub, and please feel free to fork the repo and add your own features if you fancy helping out with the project!