I recommend Quantum of Sollazzo. It's the one newsletter I always actually read, and that's genuine. It's full of data journalism, tools and training tips. It's an amazing curated resource. Giuseppe works in open data in public services in the UK. He's worked for the Department for Transport, and he is currently Deputy Director of the AI Lab at the UK's National Health Service. It sounds like you get to lead a team playing around with artificial intelligence tools, scrutinising them, and basically being the open data activist within the public service. That's pretty cool. He's a self-confessed data nerd, an activist, and also a massive foodie, and he's here till Sunday. So if you've got questions, you can talk to him at the conference, but I think he's also willing to try and discover the city with people. So please talk to Giuseppe while he's here, and I hope you all enjoy his talk. Please, go ahead.

Thanks for the intro. Very kind. It's actually the first time I've travelled to an international conference to give a talk. Yay. Nice. Thanks for inviting me. Did you enjoy the reception yesterday? I hope you did. It was really good to relax and unwind, and we collected a few data points about Fernet-Branca and Coke, which is interesting, let's say. We also learned, by watching the stars, that the planet is not flat. I don't know if you had any doubts about that, which was really good. So, this talk: 40 minutes, and I will talk through a number of things. Of course, I'm Italian; I wouldn't want to keep you from lunch.
You're more than welcome to keep me talking later. But before we start: QR code and URL, if you want to download the slides; there are also the references of the talk, and a feedback form if you want to give me any feedback. It's all anonymous.

So, what this talk is about. I've worked with data in many ways. I've been a data wrangler, a data advocate, a piece of government data furniture, and so this is all about, let's say, my journey: the lessons I've learned, and the questions I've asked myself, along that journey. And you will see that the central theme is a theme of friction: how data works in the ideal world and how it works in the real world, and also the difficulty of communicating with data, of communicating with diverse communities, with their different needs, about what data can do.

So without further ado, a few words about myself at the moment. As Naomi said, I work for the National Health Service. I'm the head of AI Skunkworks, which is a very hipster-ish job title to say that, fundamentally, we bring experimentation and open source to trialling artificial intelligence in public health care. Before that I was at the Department for Transport. The job was all about increasing data maturity within the department, but also engaging with data users out there in the wild, in the transport sector, and working out how to support them with services. I spent 10 years in academia, in the IT crowd; I was a sysadmin, fundamentally, and data in that context meant using data for research: enabling researchers, building up high-performance computing clusters, sharing data with CKAN, but also working on policy. What does it mean to do data well? What does it mean to do data in a compliant way?
But my first encounter with data came from a job I had in Italy, actually, working for a software company doing software for labs in hospitals: connecting the machinery doing blood tests (and other, more grim things than blood) and producing data that then went into a database and was used by software to produce medical reports. An interesting reflection on those years: we were dealing with millions of records created per day, and no one would call it big data back then. Our software developers would work on modelling and prediction; they built a detection model for the emergence of antibiotic-resistant infections in hospitals, and no one would call this data science at the time. And that is one of the themes of this talk: hype.

I send out the newsletter, and I have a variety of side gigs. I've been a data wrangler, and I will show you some of this later. I do this because it keeps me up to date, it keeps me connected to the community of data practitioners, and there are many data practitioners. And that links up to my data advocacy, and I frame advocacy in two ways. The first one is bringing this very diverse community together; for example, I've been running something called Open Data Camp. It's an unconference where we bring together data journalists, data scientists and public officials, and we try, fundamentally, to advance data maturity through the cross-pollination that comes from bringing different disciplines together. The other thing I've been doing as an advocate is nudging the UK government, from the outside and then from the inside, towards doing right things, or good things, with data, especially with open data. My open data advocacy actually started at the time of this famous (in the UK) statement. David Cameron, at the time the leader of the opposition, who then became Prime Minister, famously described this army of "armchair auditors". It's a famous phrase.
What happened then is basically that the government started to launch data.gov.uk, one of the first official open data portals in the world. There had also been the parliamentary expenses scandal: MPs, Members of Parliament, had been inflating their expense claims to make money off the taxpayer, basically. And the government said: we should be releasing that data as open data, and we will enable scrutiny with it. Basically, the government of those years really bought into open data and brought it onto the political agenda. They did things like the code of practice on data sets, basically nudging freedom of information officers to respond to freedom of information requests with machine-readable, openly licensed data. They did regulations about the requirement to re-share public information as open data. And they also created links to the community, in a very government way: there was this panel called the Open Data User Group, of which I was a member. The idea, basically, was: can we use these people to connect to the community? The political idea behind this was that open data releases should be prioritised based on demand, and so the work was that we helped to prioritise data releases, but also the funding of open data projects; we kind of helped the government decide who was going to receive that funding. All of this was unpaid, but the chair of the panel was on the same salary, let's say, as the most senior civil servants in the country, and once again this kind of highlights how central open data was to the government agenda in those years.
We were kind of responsible for this. (This is not working again.) There was a data request system on data.gov.uk, and we helped decide which open data sets should be released; there was a way to really openly monitor all this through a dashboard, and that also gave some insight into what people were really asking for. If you stuck to that statement, you would imagine data sets about financial propriety, about democratic scrutiny, about transparency. The reality, which conveys a bit the privilege of living in a relatively well-functioning democracy, was actually way more mundane: one of the most requested data sets was the location of public toilets.

It's a bit of a punch line, it is a punch line, but a toilet is actually a serious topic. If you think about it, knowing the location, the features, the opening times of a public toilet can get people, maybe with a variety of accessibility needs, into a toilet when they need one. And this was actually the research topic of an academic called Gail Ramster, who is at the Royal College of Art; toilet accessibility, standards and data became one of her research topics. She was trying, fundamentally, to use data to do co-design of public toilets, which is a very important topic, and she realised that there was no publicly available data about this: there's no, let's say, authoritative source of data. So she set up this brilliant crowdsourced website, the Great British Public Toilet Map, still running today. I interviewed Gail about who she thinks this data should be owned by, and she would say: well, obviously local authorities, but local authorities don't know whether to assign that task to the data people, who don't know much about toilet needs, or to the toilet people, who don't know much about data standards. And that is one of the problems, and it sheds some light on what happens when there is a need for data, but the process that generates that data is completely disconnected from that need.
So, the first two lessons that I'd like to share with you today. The first one is that users have varied data needs, and those data needs are sometimes different from what we as data practitioners think. The other one is that open data works well when it's part of a process, a process that is connected to the user as it collects and uses data; data cannot be made open after the fact, fundamentally, and without this link it becomes very easy for the data to become disconnected from the user need.

When I was in the Open Data User Group, we tried to convince the government of the time to create an authoritative single source of truth of all the doctors' practices in the country. This is something that now exists, but at the time there was no single dataset. Once again, the problem was that there were multiple processes collecting different aspects of this data, and none of them was connected to the need of having one data source for this problem. But there was also a lack of definition of what a doctor's practice is. Is it the address where a doctor operates? Is it where a group of doctors work? What happens when doctors work in multiple practices, which can happen? Is there even a privacy concern in that? The other thing I learned from doing this was that, while it looked entirely uncontentious to me, well, not everyone was convinced by it. I had this interesting conversation in the good old times of Twitter. This is Sam. Sam is actually someone I really respect. He's an activist at something called medConfidential; he works on health care privacy. He was saying: look, this stuff you're doing is completely useless. It's pointless. You're wasting resources. This actually links to other aspects of my current work: data will often be contentious, because it occupies a special role in our society. Some people will think that the focus is wrong. Some people will say you're spending money in a useless way.
That it could be better spent elsewhere. Some people will think, rightly sometimes, that there might be nefarious consequences in collecting data, even before releasing it. That's the first part of this talk.

Now I would like to talk about data wrangling a bit. That's where I have fun. This is a project I launched about 10 years ago. It's called Parli-N-Grams. If you have ever used the Google Books Ngram Viewer: this is fundamentally a rip-off of the Google Books Ngram Viewer, but with a different data source. It uses parliamentary debates in the UK. What it shows is basically how the frequency of words in parliamentary debate changes over time. In the example here, the blue line is "terrorism" and the red line is "war". You can see that although we've been living through 30 years of obsession with terrorism in the political debate, that's actually dwarfed by mentions of war during the 1940s.

This thing, which I did entirely for fun, acquired a life of its own. Sometimes that happens. It was used on TV; ITV's political Sunday morning programme used it. It's been in the Financial Times and the Sunday Times, and this is all very flattering. I called my mum: I have become a source. This is flattering, but it is also very worrying, because I'm someone who cares about data quality and data provenance. I shouldn't be the source of this data. What's even more concerning is that there's no way to actually harvest and analyse this data from the actual authoritative source. The authoritative source, this is what it looks like: a pretty website with a great UI. If you try to download the data, you can download one statement at a time. It's not structured, so for example it confuses... oops, where am I? I skipped too much. It confuses the name with the statement; it's really hard to parse in an easy way. In fact, even the API is not particularly usable.
I haven't found, and that may be my fault, but I haven't found a way to actually download the full transcripts in bulk. I ended up using this other data source, which once again is not the authoritative data source. It's made by a civic tech organisation called mySociety. They have a fantastic, very easy-to-use XML format for downloading that data, but once again, this is not the authoritative source.

I'll say it again: I'm very obsessed with parliamentary data, for some reason. Now all the debates are broadcast live, they are recorded, and you can look at the website to find when a certain Member of Parliament was speaking. The API for the TV logging works relatively well. You can download bits of the video if you like, but it's not linked; it doesn't link very well to the other parts of the data, so it's not easy to go from an MP speaking on video to the transcript of that speech. And it's surprising, right? We are in a data-driven world. Why is this stuff all completely disconnected?

Another example: petitions. The UK government allows people to start petitions, and if these petitions get to a certain number of signatures, they are debated in Parliament. Wouldn't it be nice to basically link this all together? I tried, and failed. This is the collection of endpoints, some of which have different URLs, a list of URLs I only discovered because I asked someone who works in Parliament, who told me: oh no, you need to use this other domain name. And the reason for all this is that these systems were never designed to be data-first. There's actually a kind of conflict within the Parliamentary Digital Service between those who thought data is important, we should be data-first, and those who said: well, actually, each of these systems serves one purpose, and is disconnected from data sharing. So the lesson here is a question: what expectation should we have about the availability of authoritative data from public organisations?
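As an aside, the core of something like Parli-N-Grams is small enough to sketch. The snippet below is a toy version, not the real code: it assumes a hypothetical local directory of debate transcripts in a mySociety-style XML layout (files named like `debates1940-01-01a.xml`, with `<speech>` elements containing the text; the real schema is richer than this), and computes the relative frequency of one word per year.

```python
# Toy Parli-N-Grams: per-year relative frequency of a word across
# debate transcripts. File layout and element names are assumptions,
# not the real TheyWorkForYou schema.
import re
from collections import Counter
from pathlib import Path
from xml.etree import ElementTree


def ngram_series(transcript_dir: str, word: str) -> dict:
    """Map year -> relative frequency of `word` in that year's debates."""
    totals = Counter()  # total tokens seen per year
    hits = Counter()    # occurrences of `word` per year
    for path in Path(transcript_dir).glob("debates*.xml"):
        year = int(re.search(r"(\d{4})", path.name).group(1))
        root = ElementTree.parse(path).getroot()
        for speech in root.iter("speech"):
            text = " ".join(speech.itertext()).lower()
            tokens = re.findall(r"[a-z']+", text)
            totals[year] += len(tokens)
            hits[year] += tokens.count(word.lower())
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

From the resulting year-to-frequency dictionary you can plot the kind of "war" versus "terrorism" comparison I mentioned; the hard part, as I said, is not this code but getting the transcripts in bulk in the first place.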
And that's something that, of course, doesn't have an answer. It's something we can help steer in the right direction with advocacy, but it also means two things. It means understanding the reality of those organisations: what have they been set up for, and what are the goals they're trying to achieve with their systems and their data? But it also sometimes means understanding that the political cycle might not be on our side. And I have to say, all these discussions could happen 10 years ago; it's now very hard in the UK to really address these problems.

I did another experiment with parliamentary data. I developed this thing called They Look Like You, which was basically a fun, very fun thing. It allowed you to post your own picture and see which MP looked like you the most. And the idea was, you know: come on, they're not very different from us. Of course, they're not very different from us if you're a white man. And that got me thinking about representation and data. So I did this thing which went viral, which was the average face of the British Parliament, which of course is, you know, a relatively middle-aged white man. The BBC actually liked this, and they got me to do the average face of the US Congress. I did a few average faces at the time. This one is actually really interesting, because we printed the face; and interestingly, the US face is more smiley than the British face, whatever that means. We shipped a journalist to the US, and she went to the Senate and showed this picture to the senators. There's a video of this; it's in the links. Look at it, because it's quite interesting to see how senators react to the picture. And what's the lesson here? Well, of course this feels good. We're challenging the powerful, we are challenging the lack of diversity, the perpetuation of historical biases. But I think that now, a few years after this, there's some nuance to it.
The question I keep asking myself is: where are my ethical boundaries in doing something like this? Because of course, yes, I was using data to expose a lack of diversity, and that looks good. But on the other hand, this is in a context of increasing personal and political polarisation. In the UK, during those five years, two MPs were murdered. That's something that hadn't happened in the UK for 30 years. So things are changing. And I'd say also that my understanding of the use of personal data has evolved. I mean, would I want my face to be used for a purpose like this? This is a question I keep asking myself, and I don't think I have any set answer. We'll see something along those lines later, when I get to speak about what I do in my current job.

But before that, I'd like to mention why I moved into government. This article, which I read about 10 years ago, really got me going. It's the story of how New York City asked a group of data scientists, policy makers, advocates and wranglers to come into City Hall, and gave them access to data. They were let loose on the data, and one of the things they tackled was the problem of illegal dumping of oils from restaurants, which is an environmental health hazard. This group of people, co-producing a data-driven solution, managed to find a way to enforce this much more effectively. And I was kind of fascinated by this: bringing data and evidence to improve public services. I have to say I never worked on anything as sexy as that, but I did work on transport data. Specifically, one example I'd like to mention is something called NaPTAN. NaPTAN is one of those ugly government acronyms, so I'll spell it out for you: National Public Transport Access Nodes. I describe it as a little miracle. NaPTAN is the national data set of all bus stops and train stations in the country, and it has a very interesting story. It was created in the early 2000s to power the national journey planner.
This is before Citymapper, before Moovit, before Transit. The UK government thought it would be a good idea for central government to run a national journey planner, so they created the data set and they created the process to maintain that data set. Around 2012 the government realised that there were apps out there that did this job better than the government did, so they decided to shut down the service. But the data set survived. Why did it survive? It survived because it had acquired other uses, and some of those uses were about asset management of bus stops, or informing policy in central government. It's quite a journey. It's a big data set. It's been going for 15 years; it's probably the longest-standing open data set in government. There are more than 500,000 records in it, 95% of which are bus stops, with 386 local authorities feeding in that data. And it's now a legal requirement, a statutory requirement, for local authorities to produce that data, because it has become the infrastructure layer on which something called the Bus Open Data Service is built, another interesting transport data service, all about live data from buses.

It's a relatively simple data set. This is what it looks like as a CSV; there's also an XML version. It contains the stop name, the geographic coordinates, and the bearing of the street where the stop or the station is. It sounds really amazing, but of course it's got some issues, and I like to talk about those issues. For example, if you know the area a bit, you realise that that point shouldn't really be there. That's a bus stop at sea. This one is relatively innocent: it can be spotted by just looking at the data. But NaPTAN is also used by OpenStreetMap to create its transport layer, and if you look at the error log of the imported data, you start realising that there are ghost stops, and that there are streets with the wrong bearing, and this starts being a bit difficult. This is a journey planner.
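That "bus stop at sea" kind of error can be caught with a very cheap automated check. This is a sketch rather than production code: the column names (`ATCOCode`, `Latitude`, `Longitude`) are what I'd expect from the NaPTAN CSV, but verify them against the file you actually download, and the bounding box is a deliberately crude "is this plausibly in Great Britain?" test.

```python
# Flag NaPTAN-style stop records whose coordinates are obviously wrong.
# Column names are an assumption; check them against the real download.
import csv

# Very rough bounding box for Great Britain:
# (lat_min, lat_max, lon_min, lon_max)
GB_BOUNDS = (49.8, 61.0, -8.7, 1.8)


def suspicious_stops(csv_path: str) -> list:
    """Return stop codes whose coordinates fall outside the GB box."""
    lat_min, lat_max, lon_min, lon_max = GB_BOUNDS
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                lat = float(row["Latitude"])
                lon = float(row["Longitude"])
            except (KeyError, ValueError):
                # Missing or unparseable coordinates are suspicious too.
                bad.append(row.get("ATCOCode", "?"))
                continue
            if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
                bad.append(row["ATCOCode"])
    return bad
```

A check like this only catches the gross errors; the subtler ones, like the distance from the stop to the actual road, need a road network dataset to compare against, which is exactly the wrangling I'll come to next.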
If you have the wrong bearing for a street, that will send you in the wrong direction. I did a bit of wrangling myself; this is on my GitHub if you want to have a look. If you look at the distance of a bus stop or train station from the road, you start seeing a lot of variance. This one is 75 metres away from the street. What does that mean? An example: this is a station called Alexandra Parade, in Glasgow. Well, the coordinates of that station are there. That is not on the street, and it's not where the entrance is; it's probably on platform 1. I started looking, and there's no standard about this. The reality is that stations have that pin pretty much all over the place, and of course this might be okay for the original purpose: general-purpose journey planning. It's good enough to get you to roughly the right place. But first of all, journey planning is no longer the purpose, as we said. So what are the other purposes? For example, the Department for Transport is working on accessible travel. Is that good enough? Probably not. To properly build an accessible travel journey planner, or do policy making, you need a three-dimensional representation, for example a BIM file of the station. Another interesting policy driver at the DfT these days is connected and autonomous vehicles. How can you get a car to park automatically if you only have that point? You probably need a shape file, and then you also need to start thinking about precision. NaPTAN is only guaranteed to within about one metre, and one metre off will probably get you to crash into the car in front of you. It's not good. And once again, there is a disconnect with my local bus station. In the data, it's just a collection of bus stops. But it exists: I pass it every day, and there's a big sign, three metres by two, that names it. Why do I mention this?
Because during the pandemic, someone from the bus policy team came to me and said: can you give me the list of all the bus stations in the country, basically to print posters about social distancing and bring them to the bus stations? And of course we couldn't, because that concept does not exist in the data. No one had ever asked for it, it was not the original purpose for which NaPTAN was created, and there hadn't been an ongoing conversation to evolve that data. The lesson here is that data should evolve in line with its purpose. That purpose will change, and unless we monitor it somehow, there will be this kind of divergence between policy and data.

Now, what I've learned in my current job. I asked myself a question: I wanted to work with data applications, and AI was the thing to do. AI is controversial, and my plan was: can I get in and find a way to actually challenge AI, learn how to use it well, observe how it works, evaluate it? And of course my answer was open source, and open source, I think, is the best way to create confidence. Now, why is confidence important? If you were my traditional audience for a work talk, I would start, as I always do, with this slide. I tell the audience: these are the two objections that we most commonly receive about AI. "You can't explain how AI works" and "AI will take my job". As I see the audience nodding, I say: haha, that's a trick. These are not objections we receive. These are the objections that Dr Edward Jenner received 200 years ago when he was developing vaccination. The point is that the problem is not necessarily data, or what we do with data. The problem is that every new technology that arrives will be faced with the same sort of objections, especially when we're talking about people's health. So to me the solution is: let's try to understand how things work, and build openness and transparency. And that's what I did when I joined the NHS.
So, the AI Skunkworks programme, to give you a bit of a walkthrough of what we do. We seek out problems in the healthcare system: a hospital will come to us with a problem, and we say, okay, if you have data about the problem, we will help. But the way we help is very specific. We co-produce a solution, working towards a minimum viable product together. We bring data scientists, data wranglers, data technologists, data ethicists, experts in regulation. The hospital must bring someone who understands the problem: a doctor, a nurse, sometimes someone who works in business intelligence. We will sometimes get suppliers to work with us. There is only one rule: we co-produce everything, we document everything, and we release everything as open source. By creating the solution together, first of all I think we create an understanding of what it means to work with data, and at the same time we potentially open up the black boxes. If suppliers want to work with us, they need to be happy to open source the solution.

One example, very briefly. A hospital in Gloucestershire came to us with this interesting problem: we have this pattern of bed use. They define someone as a "long stayer" if they stay in hospital for more than 21 days, which is a bad thing that correlates with worse outcomes. Only 4% or 5%, say 4%, of their admissions, so 4% of the patients that come to hospital, will end up being long stayers. But if you look at beds, 34% of beds are occupied by long stayers, and that means a variety of things, including the problem that those beds can fill up very, very quickly. So the question was: can I predict how long a patient is going to stay in hospital the moment they arrive? Long story short: yes, bingo, we did. There's a model that was able to predict two thirds of long stayers, which is a very good result, and it is very complicated.
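To make the shape of the problem concrete, here is a deliberately tiny sketch, emphatically not the Gloucestershire model: a from-scratch logistic regression on two made-up admission-time features (scaled age and an emergency-admission flag, both invented for this example), predicting the probability of a stay longer than 21 days on synthetic data.

```python
# Toy long-stayer predictor: logistic regression via SGD, pure stdlib.
# Features and data are synthetic; this only illustrates the framing:
# features known at admission in, probability of a >21-day stay out.
import math
import random


def train_logreg(rows, labels, lr=0.1, epochs=200):
    """rows: list of feature vectors; labels: 1 if stay > 21 days."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log-loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Probability that this admission becomes a long stay."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic admissions: [age / 100, emergency admission flag].
random.seed(0)
data = [[random.random(), random.randint(0, 1)] for _ in range(500)]
labels = [1 if age > 0.7 and emerg else 0 for age, emerg in data]
w, b = train_logreg(data, labels)
```

The point of even a toy like this is the framing, not the model: everything on the input side must be known at the moment of admission, and the output is a probability that a ward process then has to act on, which, as you'll see, is where the real difficulty lies.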
The model is a convolutional network trained on about one million rows. Each row is an admission, and each admission comes with 300 data points. It's then filtered through a cumulative... it's very complicated. There are a lot of fireworks in it. What really matters, however, is how the hospital engages with a data-driven process. The question: what exactly do we do with this prediction? What was difficult was not the development of the model; it was understanding how to deploy it, because deploying it means embedding a data-driven process in a process that is not by nature data-driven, and working out how that affects, for example, the way observations of patients are done, or the way patients are communicated with. Because if patients are told "you're going home because an AI told me", that's going to cause a lot of trouble, for example. It can also cause the problem of returns: people sent home earlier than they expected; will they return even if they don't need to? The other side of the problem is how the technology is actually managed by the hospital, because the expectation is that this is just another IT product. In reality, it's not. For example, this is one of those problems that really lends itself to model decay: you get new data coming in all the time, and the model is not built to retrain itself. How do you manage all this? The lesson here was learning what the expectations are of someone trying to use AI, and telling them: look, no, AI is not something you can forget about the moment it is deployed.

I'll move away a bit from my work now, and I'd like to talk about what I've learned through my newsletter. I've seen a lot of evolving themes in data. One is new data visualisation approaches. This is the Berliner Morgenpost, which created this fantastic spike map to represent how support for one specific party changes over time. That's been very influential.
Many others have done it since. Replicable data journalism: journalists today are expected to publish their methodology, something that comes from academia, and it's really, really good; it's all about trustworthiness in journalism. Predictive data journalism: this is very exciting. Journalists now are often predicting sports competitions or political elections. This has a very interesting side about the representation of uncertainty in those predictions; uncertainty is a concept that should come together with data, and it's all about educating the public on this. Not everyone agrees that predictions are a good thing. Mona Chalabi, the Guardian US data editor, famously said: since when is it the journalist's job to make predictions? There's a link to the podcast where she talks about this; it's a very interesting speech that she gives, and I really recommend it. Interactive data journalism: I'm a big fan of this. Once again, the Berliner Morgenpost asked people to draw the boundaries of the Berlin Wall. This is great because it basically allows readers to engage with their own knowledge, and to question it, to challenge it sometimes.

Last, a point about optimism: I think that data literacy, by and large, is improving. Seen through the lens of someone working in the public sector, this is helping people challenge public authorities way, way more than they did 10 years ago. Inside those authorities, it's about making sure that we can become the data-driven and evidence-driven organisations that we want to be. I'm not saying this is easy or completely successful, but there is progress. This actually comes with an interesting converse effect, which is that, let's say, data literacy also makes data storytelling, and sometimes data manipulation, easier or more subtle. What you see here is a representation of the 2018 general election in Italy: two very different maps. The map here on my right, on your left, was done by the Financial Times.
It's a very nice, good-looking map. It's basically done by colouring each constituency by its winner, and it shows very clearly that Italy was a country divided in three. This other map, the ugly one, I made; I'm not very good at graphics. It's a dot map: each dot is 100,000 votes. What it shows is not so much a divided country as a messy country; Italy can be quite messy, politically. What I'm trying to say here is that these two maps are based on the same data, but they are telling you two different stories, and these two stories are both, in their own way, true and false. People say the map is not the territory. What I'm trying to say is that data-driven does not mean impartial. Although the data might be neutrally collected, the way we use data often has an agenda; sometimes we don't even realise it. When we say that we want to be driven by data, or by evidence, that doesn't mean the overall narrative will be neutral, and therefore the honest thing to do is to acknowledge that and try to be as open as we can about it.

This links loosely to the concept of bias in the data, something I'm seeing in my current job. This thing here is called the Fitzpatrick scale. It's the way dermatologists measure the darkness of human skin, which of course is important for a variety of medical reasons. We had a proposal for a project built around a very interesting question. When you go into hospital, let's say you are immobile in a bed for a long time, you can develop something called a pressure ulcer: a wound that happens because you are lying in a bed, immobile, for a long time. Humans struggle to assess those wounds the darker the skin is. This doctor came to us and said: there is a bias in humans; can we use AI to address that bias? Can we train a model to grade the ulcers so that the outcomes for people with a darker skin shade are the same? Sadly, the answer was no, because there is a pre-existing bias in the data sets.
Most of these data sets are collected about white people, I would say white men. There is a famous book by Caroline Criado Perez called Invisible Women, which is a very good read; it's all about how most medical science, sadly, is very biased towards men. So the answer here was no, we cannot run that project, but you can start collecting that data in a slightly less biased way and see where you get to in a few years' time. But this was an easy one, of course, because there are loads of biases that are visible. Some of them are not easy to detect. Some of them are not even present in the data set, because the data collection did not involve the right thinking. Some of these features might change over time, and the data set was not made to consider that variation over time. This is interesting for us, because doctors take something called the Hippocratic oath when they become doctors, which includes a commitment to doing no harm. Doing no harm when we work with data and healthcare seems like a very far-away problem, and it's not: it's actually part of what we do every day. Two websites that I really recommend are the Data Harm Record by the Data Justice Lab and the AI Incident Database, which really show how data is misused, deliberately or not, and ends up harming people. There are no easy recipes around this, other than considering intersectionality.
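A minimal sketch of the kind of check this story implies, run before any model training: count how each group is represented in a data set and flag the gaps. All the record counts and the 5% threshold below are invented for illustration (the Fitzpatrick scale runs from I, lightest, to VI, darkest); real clinical data work would need far more care than this.

```python
from collections import Counter

# Hypothetical training records, each tagged with a Fitzpatrick skin type.
# These counts are made up to illustrate a skewed data set.
records = (["I"] * 620 + ["II"] * 240 + ["III"] * 90 +
           ["IV"] * 30 + ["V"] * 15 + ["VI"] * 5)

counts = Counter(records)
total = len(records)

# Flag any group whose share of the data falls below a chosen threshold
# (5% here, an arbitrary illustrative cut-off, not a clinical standard).
THRESHOLD = 0.05
under = {g: c / total for g, c in counts.items() if c / total < THRESHOLD}

for group, share in sorted(under.items()):
    print(f"Fitzpatrick {group}: {share:.1%} of records - under-represented")
```

A check like this only catches the visible imbalances, of course; as the talk notes, the harder biases are the ones the data collection never recorded at all.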
Donata Columbro, an Italian author and journalist, has been described as a data humanizer, and she makes a point about applying intersectionality to data science. That basically means always questioning the data; it means endeavouring to approach the data through the lens of your own privilege, understanding that that privilege can have multiple facets, some of them historical, some of them structural, some of them contextual. Applying intersectionality means we stop patting ourselves on the back when we say we are data-driven, and understand that data-driven often is not neutral, because of the way the data is being collected. So yeah, lesson 10 is that data can be harmful. And I have to say, data harm has a sibling, which is hype, and hype is everywhere in data. This is famously a speech that the UK environment secretary gave a few years back: she committed to the release of 8,000 data sets, as if that in itself was a good thing. It was never explained what those data sets should be about, what their features were, what they were representing. I think hyping data means assigning data this special magic role: data release is going to be good, using data is going to be good, without thinking about the consequences. And I have to say, we have some responsibility in this happening, because we were pushing so hard for open data that it was very easy to get to a situation in which open data became the goal in itself and not the means. Big business, of course, also contributed to that: McKinsey here famously saying open data can help create 3 trillion dollars a year of value. How?
Who knows. And this links to the concept of snake oil. I really recommend this slide deck by an academic, Professor Arvind Narayanan at Princeton University, about how to detect and recognise AI snake oil. It brings together the concept of hype, trusting AI because AI is good, and the concept of harm, and what it says is that the ethical concerns depend on whether the application is snake oil or not. In the first bucket here, face recognition, there's very little snake oil: it works very well, it's very accurate, and that's exactly why we should be worried, because it can be misused, by authoritarian governments for example. In this second bucket, conversely, hate speech detection: there's a lot of inaccuracy there, and that's why it can be dangerous. And even more so in the third, which we actually saw yesterday, I don't know if you attended the talk by Adrienne Williams about the harm that AI is doing to Amazon drivers: when AI is used to predict social outcomes in general, it relies on assumptions that are not necessarily in the data. There's a lot of snake oil in there, a lot of inaccuracy, and I think as data practitioners we always need to remember this and to be really careful. So yeah, the lesson is: there is hype, and hype is not our ally, because hype will produce bad things; it will produce harm, it will destroy confidence. I have an example here, maybe entertaining or not, about critical friends and why hype is not our ally. I was once again on social media, which we should probably run away from, but I was asking on social media: can you come up with problems that we can use AI to address? Once again, this open source idea, that we can scrutinise AI. And this guy, whom I'm not going to name, came back with this: you have a crisis in primary care, GP suicides, chaos in secondary care, and a variety of other things, and absolutely, this is the time to ask stupid questions about what hashtag-AI can solve? Sort yourselves out. And the thing is that all of this is happening, and
of course my take here was: it's important to actually do AI, because if we don't do it in an open, ethical way, someone else will, and bad things are more likely to happen. But hype makes it really, really hard to make that argument, to communicate that, because of course you get associated with all the snake oil and all the bad things that AI is doing. I do have a positive example, which is something that happened while I was at the DfT. This guy reached out to me saying: you've been publishing wrong statistics for 10 years. That's a big claim, and it turns out he was right. Because I engaged with him, and helped the team engage with him, we found out what the problem was. I had left the DfT by the time it was fully completed, but they republished 10 years of public statistics, and they also made their process more solid. I think this is a beautiful example of how engagement, first of all, solves problems, and secondly, probably defuses the most famous concern of public authorities, which is bad PR, because I think the PR would have been much worse if we had not engaged and this thing had become something for tabloid media to focus on. So yeah, openness and engagement are something that I will keep doing as a public servant, because I think it's really positive; sometimes it requires some battles, but it's a good thing. And actually we have an example in the UK: this woman, Dame Clare Moriarty, used to be one of the most senior civil servants in the country and is now the CEO of a charity called Citizens Advice. She became an open data advocate, and she said that open data enabled all sorts of other good things, other dimensions of openness and accountability; it made people happy to be held accountable, which I think is a great thing, because accountability is a good thing for us public servants, if you consider that. So yeah, the lesson here is simply that open data and open source are transformational, and if you think of data and code
as ways to drive openness and transparency, I think they multiply that effect of building knowledge. So I don't have any real final lesson here; it's been just a journey of lessons. But I called this keynote Talking With Data because I think that a lot of the work that we do as data workers is about understanding connections between different aspects of data, and sometimes about telling stories with data. Some keywords that might bring this all together: it's about being interdisciplinary and holistic, zooming out, understanding how data impacts people, what's beyond the data, what the data is trying to represent. It's about being evolving and iterative: often data requires a lot of ongoing work to represent concepts in data or in services. And it's about being open, engaging, intersectional, consequence-aware: data work doesn't stop at creating the data, it extends to understanding how that data is going to serve or impact different communities. And with that, thank you very much. Wow, thank you so much. We do have five minutes or so; do you want to see if there are any questions in the room?
If there are, I'm happy to answer. Otherwise, has anyone got any questions for Giuseppe now? Yeah, we've got a hand up. Hi, you showed a bus stop that's around the corner from where I grew up. I was wondering, with the example of the NHS data, do you make recommendations or do you simply provide the data? No, so my job is about using the data as opposed to providing the data. Often what we do is try to understand, when people come to us with a problem, whether the data is suitable or not. As I say, we've run a total of about 18 projects over the past two years; the ones we have rejected number about 50-55, because there were a lot of cases in which the data was not suitable, like the dermatology problem, or the data was simply not there. The other aspect of this is about data and digital maturity: in some organisations there is this idea that AI is magic, so we saw a few applications coming to us with ideas that were not at all implementable, because the data did not exist. So in those cases, what we've done, the other thing I haven't spoken about, is the AI deep dives: over time we bring together a number of people working in a hospital, a mixture of clinicians and non-clinicians, and take them on a journey of understanding what AI really is. And I'm very agnostic about that label, by the way. I did university a long time ago, and AI was an entirely different thing from what it is today; to me it's more about the data angle, AI is something that uses data to produce an outcome. We try to take the hype out of that word and help them understand that maybe they need better data, or to collect data in the most ethical way. Hi, sorry, I have a question: you mentioned at some point, once you come up with a tool based on the data for a certain problem, what's the ethical use of this, like, what do we do with it? I was wondering if you have some people in the team who are more expert on the ethics side. Oh yeah, like, how do you decide these
things? So we do have an AI ethics team as well in the lab; the lab in reality is much bigger than the Skunkworks, there are other people. The ethics team has developed, for example, something called the Algorithmic Impact Assessment, together with an organisation called the Ada Lovelace Institute, if you've ever heard of them. There's guidance on how to really understand what the algorithm is trying to achieve, so we work together with ethicists, basically, to make sure that we don't run into problems. Hello, hi, so at some point you raised the question: is it our job to predict? I found that super cool and mega interesting, so have you ever been in a personal position where you built your whole system and then you were like, maybe I shouldn't predict? Oh, that's an interesting question. Probably I haven't, but what I can say is that often we focus a lot on the predictive power of AI, for example, and in many cases what we're actually trying to do is to provide, in our case, doctors and nurses with more information to take decisions that they've been trained to take. So to me, the way to go about prediction is often about understanding that it's an extra bit in someone's job, not something that is going to replace them. For journalists, of course, it's different: the evolving nature of data journalism now foregrounds uncertainty, and I'm really, really happy to see that, because I think it's maturing and helping people understand what the real power of data is. Thank you. It's almost lunch; I'm going to ask the last question as a privilege, I'm sorry, but Giuseppe will be available for more out there. Before we leave this moment: you've shown us a lot of data stories from your journey, and I'm wondering, from your first one and a half days at CSVConf this year, what have you learnt, what are you taking away? I mean, there's a lot, actually. I've learnt that I need to re-watch a lot of the videos, because there are a lot of things that I don't understand. And once again,
the community is varied, I'd say. There are a number of talks, especially the one about ethics, that I'd really like to reply to. I was just in a little talk about how to really do data work with communities, and it presented an interesting scale of how you go from being manipulative to being really inclusive. I think that is one of the major lessons for me in terms of working with data: how to get to that level, and who are the people you're trying to include in your data. Brilliant, thanks for helping us to feel less like imposters and for always learning along the journey. Brilliant, thank you so much, Giuseppe.