So we're going to talk about open data and the question of quality. The spoiler is that open data sucks: we focus on publication and not on quality. But we'll get to that.

First of all, what's my interest? I'm the chief product officer at Open Knowledge International. We build technical products for open data. We do a bunch of other stuff, but that's the core of what we do. Our goal is to open up all essential public interest information and see it used to create insight and drive change. That's really critical. We don't want open data just for the sake of open data. We want to see insight derived from open data and we want to see change in the world, and that's key to the argument for quality open data.

Okay, so this talk. What I'm going to cover is that the open data ecosystem is pretty much exclusively about publication and data access. That focus on publication is by now, say ten years into open data, minimizing the impact of open data. In order to meet the promise of open data, what we all want to see from it, we need to focus on data quality. Towards the end I'll introduce some ideas, and some technology, around what's next: how we shift the focus to the quality of open data, and some concrete actions that we can all take, as activists and as people working in the open data ecosystem, to get governments to publish better data.

First of all, we should define open data. The Open Data Charter has a definition. The Open Data Charter, for anyone who doesn't know, is a set of principles for open data that is gaining traction in the open data ecosystem, and governments are starting to adopt the charter as defining principles for their open data publication and transparency efforts. So: open data is digital data that is made available with the technical and legal characteristics necessary for it to be freely used, reused, and redistributed. There are earlier definitions, including one that Open Knowledge and others worked on, called the Open Definition, but this is the working definition for our talk.

First we want to ask: is open data really important? It may be obvious to people at this conference that it is, but let's still go over why. It's important because it's our data as citizens. It's a type of accountability for our governments, a way that we can hold governments accountable for their actions. Fiscal data and procurement data are a classic example of that. Maybe more importantly, if good, quality open data is published, it can lead to new types of accountability and new types of participatory culture. New things emerge on top of open data. This can be ad hoc groups who start to build insights around data, and so forth. And most importantly, we want to see insight that leads to changes in the world. So that's why open data is important.

Where are we now with open data? The field, as some of you may have heard in earlier talks, is based around a few key aspects. Freedom of information is still quite an important tool for open data.
That's how we get access to data that is not being voluntarily published. Then, obviously, the open data portal: data.gov, or what's left of it, and data.gov.uk were some of the leaders in this space, built on top of CKAN, the open source software that came from Open Knowledge. Open data at present has become quite tied to transparency. In governments' transparency efforts, data is often used as a signifier of transparency. It's not necessarily the same thing, but that's how it has developed. And then there are partnerships and alliances, say the Open Government Partnership, where governments, as well as nonprofit organizations like ours, are starting to establish a kind of industry around what open data should look like. As I say, the main actors are NGOs, civic tech, government, and of course, mostly, the philanthropists who fund all of this stuff, or at least the work that we in the nonprofit sector do with open data.

The way that open data looks now is strongly influenced by the metrics and incentives that we as civil society identify for governments to publish data against. We all want data. We want access to data. At the beginning of the open data movement it was important that we just had raw access to data, and that as much data was published as possible. But the way the field currently looks is still highly influenced by that, and you see that in initiatives like, here, the Open Data Index. It's a crowdsourced project by Open Knowledge and the open data community.
Actually, the latest results were just released today. The Open Data Index ranks the level of openness of governments based on what data they publish and how they publish it. Australia and Taiwan, according to the results published today, are the most open governments in the world. The US is currently eighth. Who knows where it'll be next year, but it's currently the eighth most open government in the world according to the Open Data Index.

Importantly, what we find at Open Knowledge by running the index is that governments are really, really interested in what we're measuring, and it directly influences policy. Throughout the yearly cycle of building the index, people from governments are contacting us all the time, looking very closely at the methodology that we use and at the data sets that we identify as ones that should be open, or that are most valuable if they're open. They're actually designing their transparency efforts and their open data publication policies around the type of things we measure. We don't measure quality. But the attention governments give it is part of the reason why a platform like this is so influential.

There are similar efforts, like the Open Data Barometer. It's another type of index. What's different about the barometer? Whereas in the Open Data Index all the information is crowdsourced, contributed by the open data community, the Open Data Barometer has a significant aspect of government self-assessment. So they all think that they publish great data, obviously. And then there are newer incentives, like the Open Government Partnership. The eligibility criteria to be part of the Open Government Partnership are very low; it doesn't require that much in terms of producing good data. However, it's a badge of transparency.
So things like this, like the barometer and the index, really shape the way the publication of open data looks.

Okay. So most of the incentives in the field of open data at the moment are really around publication. And that means that we miss out on things; there are things that we don't see. There's very little emphasis, by NGOs or any actor within the open data space, on the usability of open data. We don't really measure, or attempt to measure, reuse of data. And those of us in civic tech or open data NGOs don't really expect data to be of high quality. A huge amount of the work that Open Knowledge does is actually just wrangling data to make it usable. I haven't put a monetary value on it, but we spend a huge amount of our funding just on making nice CSV files. We can talk about that a bit further when I look at the studies that I've got here, and we'll see the real cost of that. But the emphasis that we have in the system means that we don't even expect governments to produce high quality data, and obviously we don't incentivize them to strive for impact. We incentivize governments to publish data. We don't incentivize governments to publish quality data.

So I've got a few studies, from work that I and other people in our team have done over the last years with government open data, just to demonstrate how deep this goes: how bad the quality of open data is, even from governments who are apparently leading the world in transparency and open data.

So first of all, UK spend data. This is a project I worked on from the end of 2014 and into 2015. And we've recently updated it.
It's all recent data. We were working with the UK government, who at that time were considered the leading open data publisher, the most open government in the world in terms of transparency and open data publication. We decided, with the Cabinet Office, to build a project where we actually assess what they're publishing and what the quality is. We chose a particular set of data, called the 25k spending data. It's an interesting set to have chosen because there's actually an edict from the prime minister, from 2010, saying exactly how this data needs to be published. There's a schema for the data. It's written in plain text, but it's really clearly understandable how the data needs to be published: what the date field needs to be called, what the amount field needs to contain, and so on. The regulations are very straightforward, and all publishing bodies need to adhere to them, so all the municipalities and so forth.

As you see there, none of the data is valid. Well, there's room for error here: something like 0.5 percent or 0.3 percent of the data that we checked was actually valid according to the government's own specifications and standards, and the rest had some type of error or another. We got all the data off data.gov.uk, and even discovering the data was difficult enough. But yeah, the quality is very low, and that has a pretty serious impact on the ability to use the data. Even though we at Open Knowledge International have worked with the UK government quite a lot, it was extremely surprising to see how really, fundamentally bad the data they publish is, and this is one of the most, like I say, specified data sets that they publish.

So here we have a global leader in open data publication. We have a set of data with a very clear edict on the requirements for publishing it, including a very explicit and very simple standard to publish to. Hardly anything is valid. There are dire problems in simple file structure as well as in adherence to the standard, or the schema, and there are also dire problems in the timeliness of publication. Some departments still haven't published any files since 2013 or something like that. I'll post a link to this presentation on Twitter a bit later, and you can have a look at all the data, the scripts that we used, the dashboard that displays that information, and so forth.

More recently, the fiscal team at Open Knowledge worked on a project called Subsidy Stories. In many ways it's a similar case. We took structural funds data from the EU. We wanted to build a single database out of all the structural funds data, and we wanted to be able to build stories on top of that data. Sounds easy. We had to acquire data from 120 different websites just to build this single database. The data is published in various ways: CSV, PDF, Excel, and so forth. We literally spent several months ETL-ing data that, once again, has a published data standard and has regulations on how to release the data in a timely fashion. We spent months doing ETL to build a single database out of this data and make it usable.

The end result is a very simple website, subsidystories.eu, which in reality is just the very beginning of being able to get insight out of the data. There's no insight there. All we've done, at significant effort, is create a clean database so that other people can maybe now start to investigate the data, which would not have been possible before, and find insights and stories within it. We won't, because we ran out of money building it. That's part of the problem.
A huge amount of time and money goes into just wrangling this data. Okay. Again, there are some links here to more information: the data processing code, the app, and the data quality report on that project.

It's not just Open Knowledge. Obviously anyone here who's been wrangling US data has, I'm sure, had similar experiences. Transparency International recently released a report on the promise of open data, but also the complete lack of progress and complete lack of usability. There was an open letter to the open data community recently from Data-Smart City Solutions, which addressed many issues, one of them being the poor quality of the data that's actually published as open data. DIGIWHIST, a really interesting project in the EU at the moment, even goes as far as to say that, because of the serious quality problems they found, there should be penalties for non-compliance with the most basic quality standards. That's how hard it is to work with the data: according to them, unless we penalize governments for lack of quality, we're not going to get anywhere.

Okay, so those are two small studies. What can we learn? Governments who are leading in open data can't publish consistent CSV files. That's one thing. Maybe more surprising, even for me, because I work on data standards as well: standards and regulations don't actually lead to higher quality or reusable data. Standards, at least in the projects I've been involved in, don't really lead to better data at all. And huge amounts of time and money are required to actually gain insight out of open data.

So, quality. What do we actually want? In 2007 Rufus, the founder of Open Knowledge, said we want raw data and we want it now. This is a pretty famous quote in the open data space.
Back then, this was a great thing to say. But we're still saying the same thing. We're not really asking any more from our governments, and what they're providing is not only raw data, but really shitty raw data. So we need to get a bit beyond that, push the ball back to government, and ask for more. We need them to address the quality problems in the data they publish.

So what are the goals?

Plain text data. On one level, as a programmer, plain text data seems boring, but as a programmer who likes CSV I understand why we need it. We would think it's almost a non-goal, but it still doesn't happen. Kate did a presentation this morning showing how much of the data on data.gov.uk, or data.gov, is actually PDF.

We want structural integrity. A really, really simple goal. We want CSVs that don't have mess and notes and non-tabular information in them, and we want rows not to be empty, and so forth.

We want schematic consistency. This does not mean we want data standards, actually. It just means that if something's a date column, it should be dates the whole way down, and if something's a number column, it should have numbers the whole way down.
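To make the structural integrity and schematic consistency goals concrete, here is a minimal Python sketch. It is only an illustration, not a tool from the talk: the column names, the ISO date format, the sample file, and the `check_csv` helper are all assumptions.

```python
import csv
import io
from datetime import datetime


def is_date(value: str) -> bool:
    """True if the value parses as YYYY-MM-DD (the expected format is an assumption)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_number(value: str) -> bool:
    """True if the value parses as a number, allowing thousands separators."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


CHECKS = {"date": is_date, "number": is_number}


def check_csv(text: str, column_types: dict[str, str]) -> list[str]:
    """Return human-readable problems: empty rows, ragged rows, wrong cell types."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    # The header is row 1, so data rows are numbered from 2.
    for row_no, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {row_no}: empty row")
            continue
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        for name, cell in zip(header, row):
            kind = column_types.get(name)
            if kind and not CHECKS[kind](cell.strip()):
                problems.append(f"row {row_no}: {name!r} is not a {kind}: {cell!r}")
    return problems


# Hypothetical spending file with the kinds of mess described above.
sample = """date,amount,supplier
2015-01-05,1200.50,Acme Ltd
05/01/2015,300,Widgets Inc
2015-01-07,n/a,Foo LLP

Total for January,,
Source: finance team
"""

problems = check_csv(sample, {"date": "date", "amount": "number"})
for problem in problems:
    print(problem)
```

The sketch deliberately checks nothing beyond "rows are rows, dates are dates, numbers are numbers", which is the level of quality being asked for here.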
We don't need more than that. I mean, we'd love more than that, it'd be really good, but that can come after we basically have numbers that are numbers, dates that are dates, and amounts that are, you know, amounts. And of course, timely release.

Non-goals. Some of these non-goals are a bit controversial, even to myself. I work on data standards, like I said, and Open Knowledge does make some effort here. But what I'm seeing right now, especially in Europe and I guess in the US, mostly around linked data, though it's not necessarily about linked data, is a lot of emphasis on technical solutions to these types of problems: build a data standard and then the data will get better; have beautiful, abstract, common code lists and then we'll be able to do amazing stuff across all of the data. But as I've just shown, the data is just not even there yet. So these are really great things to have: data standards, code lists, comparison and linkage across data sets. But unless we address more fundamental quality issues, there's just no point. It's a waste. The other thing is that they all require a highly technological solution, so what we gain by focusing on this, we lose in terms of human-computer interoperability and so forth. So I really think these are non-goals at present, until we can make numbers be numbers.

So what can we actually do to change the situation? What can we do to make sure that governments start to publish higher quality data?

In non-technical terms, we can engage in direct dialogue with governments on the usability of data. I take Open Knowledge as an example. We do huge amounts of data wrangling, but we very rarely feed back to the governments and say, hey, your data is crappy.
I mean, we just need to let governments know, and understand, the impact that results from the quality of what they've produced. We need to build quality metrics into the tools and processes that incentivize our governments. So, for example, the Open Data Barometer, the Open Data Charter, and the index actually need to include quality as a metric that's measured. Otherwise we'll just keep focusing on access and publication at the expense of quality. And possibly in some situations, depending on the context, activists should actually just reject data that doesn't meet basic quality assurances. Just say: no, this is not data, we need something else.

Technically, we should start to focus on data validation and build tools for data validation. Open Knowledge has got some; again, there are links in the slides. We can build dashboards that engage internal stakeholders at governments, so that people who are working at the policy level can see that the data that results from their policies is not usable. In order to do that, we need to start building ways for them to see that and to understand that. Data portals, CKAN and the commercial solutions, need to shift a bit more emphasis onto publication workflows. I think there's a huge amount to be done here, a huge opportunity actually, to focus on the people who sit in an office and put data on a portal: to make it easier for them to fix data, to version and validate data, and so forth.

That's it. There are five minutes left.
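As a rough illustration of the dashboard idea mentioned above, the sketch below turns per-file validation results into a per-publisher summary that a policy-level stakeholder could read at a glance. The publisher names, the results, and the `validity_by_publisher` helper are hypothetical; this is not the code behind any Open Knowledge dashboard.

```python
from collections import defaultdict


def validity_by_publisher(results):
    """Summarize validation results per publisher.

    results: iterable of (publisher, is_valid) pairs, one per checked file.
    Returns {publisher: (valid_count, total_count, percent_valid)}.
    """
    counts = defaultdict(lambda: [0, 0])  # publisher -> [valid, total]
    for publisher, is_valid in results:
        counts[publisher][1] += 1
        if is_valid:
            counts[publisher][0] += 1
    return {
        name: (valid, total, round(100 * valid / total, 1))
        for name, (valid, total) in counts.items()
    }


# Hypothetical results, e.g. produced by running a validator over each file.
results = [
    ("Department A", True),
    ("Department A", False),
    ("Department A", False),
    ("Department A", False),
    ("Department B", False),
]

summary = validity_by_publisher(results)
for name, (valid, total, percent) in sorted(summary.items()):
    print(f"{name}: {valid}/{total} files valid ({percent}%)")
```

A per-publisher percentage is only one possible metric; the point is that quality becomes something measured and visible rather than assumed.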