Good afternoon everyone, welcome to the Open Data Institute. My name is Ellen Broad, I'm the head of policy here at the ODI. For those of you joining us for the first time, just a couple of housekeeping things. The hashtag, if you're following on Twitter and want to contribute, is #ODIFridays. This lecture is being live streamed and recorded, so be aware of that in case it affects the questions you might ask. There will be time for questions at the end.

I think I can now jump straight into introducing our speaker today, who I'm personally really excited about. It's John Griffin, the founder of ACHI, a digital consultancy, and of Data Seed, which is what he's going to talk to you about today. I first came across John via an introduction from James Cattel. What I love about Data Seed is that, A, it gives you a really simple way of visualizing data, but it has also been a really useful tool for showing back to people how they can improve their spreadsheet management, and what good data practice and bad data practice look like. I've found it incredibly useful not only as a tool but as an educational tool as well. So without further ado I'm going to turn straight over to John.

Thank you, Ellen. Hi everyone, thanks for coming. So I'm going to talk to you today mostly about some of the issues that our Data Seed customers have come to me with about using open data. I want to start by running through a specific example, which is quite a good one; actually, this is when things go quite well. This is a customer of ours called Kate, who runs a small business and wants to find out more about G-Cloud. For those who don't know, G-Cloud is a government procurement framework. The idea is that it should make it easier for small businesses to do business with government.
So in this scenario Kate wants to find out a bit more about who's selling on G-Cloud, what they're selling, to whom, and for how much. Let's go along this journey. The first thing she does is a Google search for "G-Cloud statistics". The first result here that's not an ad is this one, and going into this page, it's a nice clear GOV.UK page that gives us some totals at the top, and then straight away we've got lots of links to download CSVs. Also linked from this page are some of these charts here. These are performance dashboards, as they're called, and they're really good. For a start, you can see straight away the total amount that's being spent on the G-Cloud framework by different company sizes and so on. What it doesn't give you is the insight into exactly who the customers are, who the suppliers are, that sort of thing.

So we go back, we download one of these CSV files and we pull it into Excel. It looks like this: it's pretty regularly laid out, seems sane, but it's about 70,000 rows, so we want to visualize it. In this case we're going to use Data Seed to do that. It's a pretty simple process of just dragging the CSV file in. Other data visualization software is available, I should say, but I want to show you this specific example, and it applies all over the place.

So this is, excuse me, a visualization of this dataset that's been created with Data Seed. All these charts here are showing the total amount that's spent. In the top right, the line chart is showing us that the total amount spent is going up over time. In the bottom left-hand corner we can see SMEs have a slightly larger proportion of business than large companies, so that's good: G-Cloud is doing what it should be doing in that respect, to an extent.
So this is good. I'm going to pop out of the presentation now and go into the browser so that we can get a bit more interactive with this and follow it through. This is the same visualization, just the interactive version. What Kate might want to do at this point is look at some particular suppliers that are competitors of hers. Her business sells CRM software, so we might look up Salesforce in here. We find Salesforce and we can just filter down. All these charts are clickable, so we can filter and see straight away that there's really a lot of money being spent on Salesforce: 580k. The biggest customer is the London Borough of Hounslow Council, so they spent 621k in total. If we click on them we can see one transaction back here, 1st of June, 500k, so that was 2013 there. Another transaction here, 1st of June 2015, for 120k. Not really sure what the reason is behind that; maybe they just realized they were paying far too much in the first place. If you look at some of the other customers down here, they're generally paying 50k or 20k, down to a few thousand pounds.

So this is all really good information that Kate can use to make business decisions: should I go through what is a pretty epic sign-up process for G-Cloud? We can decide whether we're going to proceed with that based on the data that we can look at here. So this worked pretty well. This is an example of getting some open data, getting some answers out of it, and doing that all pretty quickly with minimal pain and friction.

But we need to ask some questions before we can draw conclusions. Is there any missing data? How is the data collected? Now, the page where we downloaded the CSVs is pretty good about giving us information on this. It tells us that the data was submitted by suppliers, so if you supply on the G-Cloud framework you have to self-report how much you're getting paid for things.
So okay, we trust that people are doing that; we have to. How are the terms defined? Well, they also define what an SME is. They don't really define any of the other terms, but it's pretty good. Sorry, go back: what's the license? It tells us it's under the Open Government Licence, which is a pretty permissive license and allows us to use this data in various ways. So this is by all means a pretty good example.

These are all issues of data quality, and data quality is not a perfectly defined term; people use it to mean different things. What we're looking at here now is the results of a study done by the ODI. I think the question here was to businesses who are making use of open data: which factors affect or influence your decision to use open data? That's by the look of this here; please feel free to correct me if I'm wrong about any of these things. These are all important factors, most of which I would put under the banner of data quality. The accuracy of the data: we can't do much if it's not accurate. Licensing: can we use it, and how can we use it? Ease of access: that comes down to how easy it is to find and then to reuse, and whether the way it's published changes over time. Provenance of data: where did it come from, and what process has it been through before it gets to us? And the format of the data: this is what I'm going to focus on primarily, and specifically how machine-readable the data is, because I feel like this really is low-hanging fruit, and it's specifically what our customers complain about the most, or where they get stuck when they're trying to get some insight out of open datasets.

So, a bit harsh, but what is standing in the way? Why can't we always just go through that process that we just did with the G-Cloud dataset and get some answers pretty quickly? I mean, there's lots of
problems obviously, but I'm going to focus on spreadsheets and the fact that they are basically just a collection of cells, and you can do some pretty horrendous things with that freedom. We talk about CSV files specifically because, well, there are lots of ways of publishing data, obviously, and I don't have a problem with CSV files. I've done a lot of work with linked data in the past, and there are some great formats out there which solve all these problems, but the problem is they're not really that well adopted. CSV's got a lot going for it: it's open, it's simple, everyone can open CSV files with software they've already got on their computer, and we're already using them. But they allow you to put anything in them, so that's the bane of our life, basically.

So I want to run through some specific issues which come up again and again. There are only five or six of these, and these are things that I feel could be quite easily solved, and indeed are being solved. I'm not proposing anything that's not already a solved problem here today; it's more about encouraging the adoption of the technology that solves these problems.

So, problem number one. Red is bad, by the way, and green is good; sorry if you're red-green colour blind, the top one is bad. Having to deal with multiple files is an issue for a lot of people; they simply won't go beyond that. So the first thing is, if we can have just one single file, that really helps. Of course it might be massive, and that's the reason people split things up: you might only want to look at a section of the data, and that's fine. Having both is preferable. But if people have to stitch files together, and in a lot of cases they have to write some code to do that, that's going to be a big blocker and they're just not going to get past it.

Second, simply, character encoding: "just use UTF-8" is really the short answer.
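
As a rough illustration of those first two problems, here is a minimal Python sketch of stitching several CSV parts into one file and re-encoding bytes as UTF-8. It assumes every part starts with the same header row; the function names `stitch_csv` and `to_utf8` are invented for this example and were not shown in the talk.

```python
import csv
import io


def stitch_csv(parts):
    """Concatenate CSV texts that share a header row into one CSV text.

    `parts` is a list of strings, each a complete CSV with its own header.
    The header is written once; a part with a different header is rejected.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in parts:
        reader = csv.reader(io.StringIO(text))
        part_header = next(reader)
        if header is None:
            header = part_header
            writer.writerow(header)
        elif part_header != header:
            raise ValueError(f"header mismatch: {part_header} != {header}")
        # Copy the remaining data rows unchanged.
        writer.writerows(reader)
    return out.getvalue()


def to_utf8(raw, source_encoding):
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")
```

The source encoding has to be known or guessed by whoever does the conversion, which is part of why publishing in UTF-8 in the first place saves everyone downstream the trouble.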
Moving swiftly on, I've called this "non-normalized schema", which is a bit of a mouthful, but what it really means is this. In the top example we've got a column for each year, and you see this quite a lot. What do these numbers mean? I need to go and look at some metadata or something to interpret what these numbers are telling me. Presumably they're sales in this case. So we could transform the data into how it looks beneath, where we have a column for year. This schema is then never going to change: the number of columns is never going to change. That means it's machine-readable in the sense that whatever we do to read that data doesn't have to change over time, which is really handy.

This is another one which is kind of annoying: introduction text, or any text, before the header row. It's a convention that you just have the header row at the start of your CSV file. Introductory text is essentially metadata, which should live alongside the dataset rather than in it.

Empty cells. See what I did there? This is probably the most egregious error, because it's a data accuracy issue. We could perhaps guess that the space there should say SME, but we're guessing, and we might get it wrong.

Duplicate terms. You'll see on the left-hand side we've got "education" spelt in various different ways. Presumably they should all be grouped together, they're all the same thing, but again some data cleanup's got to happen at some stage. We've got to group these things together somehow, and if you visualize this straight away these are all going to show up as different categories. It's just unnecessarily annoying, really.

So these are the kind of things that we see time and time again, and they're all really easy to solve, ish, or there are initiatives that solve them already. So, I mean, it's a practical sort
of stop-gap solution: yes, we can use tools like OpenRefine, so if you're coming across this issue now then OpenRefine's a great solution to that. But ultimately the situation that we're in is that everyone's doing this cleanup themselves and we're all duplicating this effort, and it would be much better to get this right at the source.

So I want to point out some of the ways in which people are already solving this problem. It's not a new problem by any means; to be fair, I imagine this problem has been around as long as data has been around, in some form or another. CSVLint is a really great tool. It's available online at this address, csvlint.io, but it's also open source, so you can download it and run it. Basically, what it works with is a schema: you have a separate file, which is a JSON file, that lives alongside the CSV file and describes what the CSV file should look like. Some things are simply best practices that people should be aware of already: header row at the top, the number of columns should be consistent throughout the table, it should be UTF-8, things like this. But it also does things like check the type of each field, so you might have a date field, or a text or a number field, and it'll check that that's consistent throughout. Or you might have fields which are required, so we don't get those empty cells. So this pretty much solves all the issues we just looked at, apart from the duplicate terms.

The duplicate terms really require us to have another list of which terms we're expecting to see in that column. Solutions here might be using registers, which are canonical lists of things (companies, buildings, etc.), or there are lots of other vocabularies from the linked data world that could be used. Again, this is really a problem that's existed forever in relational database design: it's really a
foreign key to another table. Without getting too technical, we basically just want to reference a list of other things from our dataset, so some way to do that would be really great. And of course tools like CSVLint could then check the integrity of that link and make sure that we are linking to genuine things, rather than making up terms on the fly so that the spreadsheet just gets out of order.

The other thing I want to mention is ODI certificates, which already cover all this stuff in a sense. They cover a lot of the data quality issues which we talked about earlier, and they also make reference to the machine readability of the data. So it seems like if we were to see more adoption of all of this stuff, then the quality would increase immensely overall.

Why is it important? Well, at the moment we've done really well on the quantity problem: we've got lots and lots of open data out there. But companies are actually making use of open data just to clean up that data and make it presentable to ordinary citizens, and that feels to me like something that should already be there. There are many important people, Francis Maude for example, and many pieces of legislation like this that talk about open data being for citizens: for citizens to be able to understand what the government's doing, and to have access to this information in order to make better decisions. At the moment I feel like these issues are really getting in the way. It's the low-hanging fruit, basically. It's kind of easy to solve these things, I feel, and that's good news. If we were to solve these things, we could make open data more accessible and more useful to millions more people. So that is all that I really want
to say. I'm going to wrap up there, and then we'll take questions afterwards. Thank you.
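
Two of the fixes discussed in the talk, reshaping one-column-per-year data into a single year column and checking a CSV against a separate schema, can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it does not use CSVLint or its schema format, and `SCHEMA`, `wide_to_long` and `validate`, along with the column names, are hypothetical. The `allowed` set stands in for a canonical list of terms such as a register.

```python
import csv
import io
from datetime import datetime


def wide_to_long(rows, id_field, value_name):
    """Turn rows with one column per year into (id, year, value) rows."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_field:
                long_rows.append(
                    {id_field: row[id_field], "year": column, value_name: value}
                )
    return long_rows


# Hypothetical schema: field type, whether it is required, and (optionally)
# the canonical list of terms the column is allowed to contain.
SCHEMA = {
    "supplier": {"type": "text", "required": True},
    "sector": {"type": "text", "required": True,
               "allowed": {"Education", "Health"}},
    "date": {"type": "date", "required": True},
    "amount": {"type": "number", "required": False},
}


def _type_ok(value, kind):
    """Return True if the text value parses as the given kind."""
    try:
        if kind == "number":
            float(value)
        elif kind == "date":
            datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate(csv_text, schema):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The first row must be the header row and must match the schema exactly.
    if reader.fieldnames != list(schema):
        return [f"header row {reader.fieldnames} does not match schema"]
    errors = []
    for line_no, row in enumerate(reader, start=2):
        if None in row:
            errors.append(f"line {line_no}: too many columns")
        for field, rule in schema.items():
            value = (row[field] or "").strip()
            if not value:
                if rule["required"]:
                    errors.append(f"line {line_no}: {field} is empty")
                continue
            if not _type_ok(value, rule["type"]):
                errors.append(f"line {line_no}: {field} is not a {rule['type']}")
            elif "allowed" in rule and value not in rule["allowed"]:
                errors.append(f"line {line_no}: {field} has unknown term {value!r}")
    return errors
```

Run against a file where "education" is spelt in lower case, the allowed-terms check reports it as an unknown term rather than silently creating a new category, which is the foreign-key style of check described above.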