Welcome to our final seminar in our current run on FOI for campaigning and advocacy work. Thank you if you've come along to every one so far; we really hope they've been useful to you. Let's get started. Today we are hearing from two presenters who are going to cover different angles of data creation using FOI data. Our first presenter is Maya. Maya is the director of After Exploitation, a data mapping project that uses FOI extensively to investigate hidden slavery outcomes. Maya uses they/them pronouns, so when you're asking questions, please keep that in mind. Maya, please take it away.

Thank you so much for having me. To start with a case study, and to explain a bit of the context of what we do at After Exploitation to stitch together bits of data and information on modern slavery outcomes, I wanted to begin with an example of a data project we did using the WhatDoTheyKnow Projects platform to understand some of the longer-term outcomes after modern slavery survivors report instances of trafficking. We had a clear motivation for doing this. Firstly, there are a lot of assumptions around victims' experiences when they report to the police. There are a lot of purported reasons in circulation, including that survivors might be nervous engaging with the authorities, or might come from countries of origin where the authorities are potentially subject to corruption and might be working with traffickers. So there are a lot of assumptions around what investigations of modern slavery look like in practice. In reality, the data on the ground is quite poor. We don't actually know how many modern slavery investigations are being dropped because survivors aren't engaging with the process, versus how many are dropped because of resourcing implications within police forces, a lack of awareness within police forces, or other more nuanced factors that there is no data on. So we really wanted to understand the crime outcomes associated with modern slavery police reporting in the UK. The data we were looking for was police-force-level data: we wanted to understand, according to the crime codes already employed by police forces, what the outcomes were. Are cases being dropped because of a lack of evidence? Are cases not being followed up? Are cases being reported but not translating into a crime record, which is what allows the police to continue an investigation? Or are there other evidential difficulties that we just don't know about at the moment? Our audience for this was mostly the media, because we wanted to correct the record if the evidence showed us something practitioners weren't aware of, or that flew in the face of our expectations. But we also wanted to use this data to engage with practitioners, so that on a local level charities and other organisations could know regionally where to target training and other kinds of intervention. So, some really quick example FOI wording here. We asked: please provide the number of modern slavery and human trafficking cases reported to your force between this time period. Please disaggregate the data according to the crime outcome of the offences; for example, this includes but is not limited to outcomes such as cautions, charges, evidential difficulties, and "victim does not support the investigation".
In these cases, we used the exact wording that we knew was already used internally by police forces to capture the information we wanted, even though at the time that wording was not in the public domain. The outcome was really positive: we had a really good success rate. Some of the FOIs had to go to internal review, but I think over 80 percent of police forces responded with quality data that we could use for analysis. I've only included the top of the table, but we've got a really simple contingency table here with the top outcomes in the UK. As you can see, in a majority of cases the investigation was completed but no suspect was identified. There's nothing there that necessarily indicates it's because of a lack of the victim's involvement in the case, so it's slightly different from the narrative we tend to see around modern slavery and police reporting in the UK. It's very interesting for us to have that data, and for charities to have access to it as well.

So, thinking about how some of our learnings can potentially be applied to your work, which I hope they can: in our experience, a majority of the work genuinely goes into the planning rather than the analysis at the end. There are three key bits of planning. You need to think about your need and purpose before you even start to pick your targets and your public authorities; you need a clear audience for your findings; and you need to work with the stakeholders supporting the project to make sure the parameters are really clear. Need and purpose first. It sounds really obvious, but I've definitely fallen into this trap: is the data already collected? Is it collected as part of existing FOIs? Have you looked at charity reports, other community groups' reports, or academic research that may have used freedom of information requests to gather the data you need already? If you're not particularly well connected to the public authorities you're thinking of targeting, are there bigger organisations, or more old-school academics, who have those relationships and may have put internal data into the public domain already? And, most importantly, is any of the information you're looking to collect already in inspectorate or watchdog reports, or released as part of parliamentary inquiries? Those are really key sources of information. If any of that data is already in the public domain, it doesn't mean it's not worth using an FOI project to get more up-to-date or more detailed information, but it will make your life a lot easier if you have a clearer starting point. For example, say a select committee report says how many local councils have cut funding for after-school projects. They've got a total figure, but you then know the Department for Education has a way of collating that information from the local authorities, so there's nothing to stop you going to those same local authorities, referencing the report, and saying: can you give me a more detailed breakdown of what's happened in your region? The next consideration is: is there already a centralised source of the data? There might not actually be a need to do a really ambitious bit of project work.
What you might want to do is look at whether the public authority you're targeting could be approached in a centralised manner. What I mean is: if you want homelessness data, for example, you could go to every single local council and ask for a breakdown. But the Department for Levelling Up, Housing and Communities already does that work of collecting the councils' data quarterly, and they may have a more detailed breakdown that's not published which you could access. So rather than going to 300-plus local authorities, you could look at the public data that's already on offer and start with one targeted FOI request to a central government department instead. That isn't always possible, which is kind of why we're here today, but when you're looking into the feasibility of a project, sometimes there are simpler solutions under your nose. The next question is: what social need does the data serve, and can you justify the resource? Some projects are relatively easy. The police force project we did covered 43 police forces; it brought a bit of a time commitment, but it was much less labour-intensive than the local authority project we're doing at the moment. So it's always worth considering what the utility is going to be, and not just for your own organisation, for example if there's a commercial benefit because you're trying to prove a need for your training. Is there a social benefit beyond what you do that could help other groups, other communities, academics and journalists? The other consideration is: will the data be suitable for your analysis? It sounds obvious, but you need to make sure your requests yield the type of data you need. If you're running a stats test, are you asking for the data to be provided in the exact breakdown you need? An example: say you want to go to local authorities and ask for the number of programmes offered to children of different ages. You could say "I want these age groupings", or "I want the ages of the children affected, with a breakdown". When you come to do your analysis you start grouping, and then one council says these programmes are for 18-month-olds while another is going by years, and the intervals won't match up. So if you're running a statistical test at the end of your work, it's really important to plan backwards and be really precise in your wording so that there's no mismatch, especially if you're going to a really high volume of public authorities with a batch request, to make absolutely sure that if you have a stats test in mind, you can actually run it at the end.
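To make that age-groupings point concrete, here is a minimal Python sketch (entirely hypothetical authorities, ages and bands) of why it pays to fix the bands before the request goes out: if every authority returns raw ages, or counts in the bands you specified, grouping at analysis time is trivial; if one returns 18-month intervals and another returns years, no amount of code can reconcile them afterwards.

```python
import pandas as pd

# Hypothetical responses: each authority reports the raw ages of children
# in its programmes, because the FOI wording asked for raw ages.
responses = pd.DataFrame({
    "authority": ["Council A", "Council A", "Council B", "Council B"],
    "age_years": [1.5, 4.0, 7.0, 12.0],
})

# The bands we committed to before sending the request.
bins = [0, 5, 11, 16, 18]                    # edges, in years
labels = ["0-4", "5-10", "11-15", "16-17"]

responses["age_band"] = pd.cut(
    responses["age_years"], bins=bins, labels=labels, right=False
)

# A simple contingency table: authorities against age bands.
print(pd.crosstab(responses["authority"], responses["age_band"]))
```

Had one council instead returned pre-grouped counts in 18-month intervals, there would be no reliable way to map those onto these bands, which is exactly the mismatch being described.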
The next consideration is your audience. Again, it sounds really obvious, but it's easy to get stuck down the rabbit hole if you do really niche work: is the project interesting or important to your target audience? If you're engaging with MPs, do you have a sense of what MPs want from you? In our experience it will be a focus on localised data. If you know your targets are MPs, is a national picture always going to be relevant, or will you actually want to go to regional authorities, because that will let you reach that target audience? By the same token, if you're working with journalists, they tend to like big numbers and case studies. And if you don't have case studies, then maybe you should think about aiming for the big numbers, on topics you know are relevant to them. If you know there's evidence you need to gather, it's sometimes worth doubling up on your requests to make sure there's some element that serves a social function, or that interests other audience members beyond yourself, just to get a little bit of additional data that brings more people into the cause or into the research you're doing. The next consideration is: what is your target response rate? Just as editors and academics have rules on the point at which you can extrapolate to a general population (journalists, for example, will tend to say that for polling to be representative of the general public you need to be interviewing 2,000 people or more), the same sometimes applies to FOI response rates. If you're trying to make a claim about local authorities' performance and you only get 10 responses back, that probably isn't going to be that helpful. And although you might get some really detailed information that would be great for qualitative work, and that might show you in depth the practices going on, it might not hold much water with the stakeholders you're trying to engage or persuade. So before you even start, it's really worth having two numbers in mind: one is the minimum number of public authorities you need to hear back from, your lower limit; the other is your target, which you can really aim for, but which also sets expectations internally and externally about what will allow you to continue without losing legitimacy. The other consideration is: can or should stakeholders influence the data captured? You might want to check with colleagues, editors or organisations about the basic types of data you plan on capturing, and ask whether there are additional details you've missed that could add value to your work. For example, if you're asking about programmes offered to a certain community, is it the case that one charity has been asking for gender breakdowns for years and years? That would give them a huge opportunity to make use of your data, and all it would take is for you to add in a line saying we need gender breakdowns as well. So it's really worth having those conversations with the groups you want to engage with, and already partner with, to make sure that when you're doing big projects like this you're making the most of them. Parameters. It's really important to use existing measurements where you can: try to ask for data using the same, or similar, methodology the public authority already uses to collect information. Do you know their existing KPIs? Do you know the targets central government holds these organisations or authorities to? Are there other units of measurement you know the organisation or public authority uses to measure performance? If you know what the metrics of success are in the area you're asking for data on, then you should, in theory, be able to access data more easily, because those metrics are something they're already keeping tabs on.
The absolute best-case scenario is that there's already public guidance available via Google. If you type "frontline guidance" plus the public authority's name and the topic, you can sometimes access the guidance staff actually use, which captures the kind of information you want to be able to extract. You might also want to submit scoping requests to a handful of the public authorities in the first instance, just to get a sense of what measurements of success they use, what data they hold and how it's managed, before you develop the main request that goes out to more organisations and public authorities. The next thing is inclusion and exclusion criteria. Before you submit the FOI request, it's really important to understand when you're going to exclude totals, and when you're going to include them, if there's ambiguity. For example, in our case, what happens when we ask public authorities about modern slavery training and whether it mentions gender sensitivity, and the authority says it offers that training but won't provide the PDF of the training materials, so we can't cross-reference it? Do we take their word for it and caveat that in the research, or do we make sure that when we log that information, it's only on the basis of the materials? There's always some ambiguity, especially when you're dealing with big projects going out to really high numbers of public authorities, so it's worth thinking about how you're going to log data when the responses aren't black and white. Another thing you might come across is partial disclosure on the basis of data protection. Especially if you're getting data on potentially vulnerable groups, or groups that have been made vulnerable by policy, what you'll see if the groupings are quite small is data that says "under five" or "over 10", so you're not dealing with whole numbers. It's important to think about how you'll log information if you don't get the full picture. The next thing is guarding against inconsistency. When you're choosing what to record, you might want to log specific keywords in FOI responses and include a free-text box for trigger wording to be recorded. For example, if anything comes up around immigration, and you're doing work on racism and how issues of race are dealt with in certain contexts, you might want to set the Illegal Migration Act, or whatever else, as a trigger word, so that every time specific legislation or something relevant comes up, that additional, richer context is there and you can follow up at a later point. When it comes to data logging, you want to choose a platform, record the data, and ideally get a second pair of eyes on it. Choosing a platform for data logging can be a bit of a drawn-out process. I would definitely recommend WhatDoTheyKnow Projects if you can, because once you're really set on what to do and you've done all that planning, it's a really good tool: you can extract the data after the fact and access it really easily. But that might not be for everyone, so here are two criteria to think about. The first is accessibility: can your chosen platform be accessed by everybody you're working with, including people who might be new to the platform or work on different devices to you?
Especially if you're working with different communities, or with people who aren't paid, volunteers in particular, you shouldn't assume that because you have a laptop, everybody has a laptop. It goes without saying, but it sometimes gets missed. So really think about the accessibility of the software you want to use: can it be accessed on mobile phones, iPads and other operating systems, rather than just the ones that you and your team, if you're employed full-time, happen to have? The other thing to think about is security. There might be a point when you're bringing data together where, for a small window, information could be cross-referenced. I think it's very rare this happens, because usually public authorities are quite careful about not disclosing anything that gives information away, but there is a small chance. So depending on the sensitivity of the data, you might want to think about the security of the platform you use. For example, if you're using Google Sheets, is it appropriate to store the information in the cloud, or should you be working in a more secure way? Here are some example category fields, just for interest. This was taken from a police project on the WhatDoTheyKnow Projects site; it's generated automatically if you sign up to WhatDoTheyKnow Projects, but you can obviously replicate it manually in Excel. Having the request URL is really helpful, so you have a point of reference when you're marking each other's work and want to follow up for a richer analysis. The public authority name tells you who is saying what. Keeping tabs on the latest contributor means that when you're blind-marking each other's work, people aren't looking at their own entries, so a second pair of eyes is always on everything. And having a tally of whether the FOI was successful, unsuccessful or pending is really key too. When you're recording the data, it's important to ensure everyone has a go-to resource, even if it's just one page, sharing the exclusion and inclusion criteria so there's no ambiguity, and ensuring criteria and codes are applied consistently. Every now and then you might want to do a spot check, or have somebody else spot-check your work, just doing a handful of cases to see whether you're logging things the right way and everyone is on the same page. And then ensure there's no possibility of duplicate logging. I think this is going to be covered by the next speaker, from the ODI, in a bit more detail, but there are definitely ways to make sure things aren't duplicated; one is to make sure that specific people, volunteers, staff, individuals, are linked to certain FOIs, so you're not doubling up on each other. As for the point at which you get a second pair of eyes to look over everything: team support is really great if you have other people in your team supporting you, or it's joint work, and making sure you work from one unified spreadsheet is really key. Lots of problems arise if there are spreadsheets on spreadsheets and people are editing and sending them round by email; it's best to have one document that everybody works from if you can. WhatDoTheyKnow Projects makes life easier, but if you are using spreadsheets, you might want to think about Google Sheets or something similar.
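Pulling those logging ideas together, here is a minimal sketch of a single unified log, with hypothetical field names and values throughout. It mirrors the kind of columns just described (request URL, authority, latest contributor, status), keeps suppressed counts like "under five" as text rather than pretending they're whole numbers, and flags trigger wording for follow-up.

```python
import csv

# Trigger wording to flag for later follow-up (hypothetical examples).
TRIGGER_WORDS = ["illegal migration act", "immigration"]

def flag_triggers(response_text: str) -> str:
    """Return a semicolon-separated list of trigger words found in a response."""
    text = response_text.lower()
    return "; ".join(w for w in TRIGGER_WORDS if w in text)

# One row per request. Suppressed counts are logged verbatim ("<5") plus a
# flag, so nobody silently converts them into a misleading whole number.
FIELDS = ["request_url", "authority", "latest_contributor", "status",
          "count_reported", "count_suppressed", "trigger_words", "notes"]

rows = [{
    "request_url": "https://www.whatdotheyknow.com/request/example",  # hypothetical
    "authority": "Example Police Force",
    "latest_contributor": "volunteer_1",   # supports blind spot checks
    "status": "successful",                # successful / unsuccessful / pending
    "count_reported": "<5",                # verbatim from the disclosure
    "count_suppressed": "yes",
    "trigger_words": flag_triggers("Withheld citing the Illegal Migration Act."),
    "notes": "Partial disclosure; consider internal review.",
}]

with open("foi_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Keeping everything in one file like this, rather than per-person copies, is exactly the "one unified spreadsheet" point above.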
At this point, if you're thinking of buddying up with newspaper journalists, or you know you want the work to get into the public domain, you might want to involve the stakeholder you're trying to engage, to see if they want to play a role in fact-checking and verifying the work. In our case, we knew we were probably going to go to the BBC for an exclusive anyway, so we actually got support from the BBC to help with the data checking and verification, and that made life so much easier. It went by in half the time, because a lot of their staff already did this work day in, day out as data journalists. So when you're relatively confident about the findings, it can be beneficial at the last stage to involve another, trusted stakeholder in the verification. There is a chance that everything goes wrong. It happens; we've probably all been burnt before. But there's always something you can do, and in my opinion there's no such thing as a wasted FOI request. If everything goes wrong, here's some stuff to think about. Do you have capacity to chase no-replies or late replies? If you can take all of them to the Information Commissioner's Office and really push, it can really help. The only thing is, you might want to think about whether you'll lose legitimacy if you start raising challenges or complaints on an ad hoc basis for some public authorities but don't have time to do it for all of them. There's a question there about whether every public authority has had a fair hearing and been approached in the same way. Something we do is prioritise the public authorities most relevant to our work: for example, if you can't pursue complaints for everyone, you could go for the biggest local authorities, or the ones with the densest populations, and work in that systematic way to prioritise how you tackle late or missing responses. You might also want to think about alternative angles. Are there additional themes that have arisen in free-text fields which could be recorded manually, or could the scope of your work be narrowed? Could you narrow it down to just London, or Birmingham, or Manchester, or wherever you're based? Instead of logging all of the data, because you're not getting disclosure on everything, could you focus on the information that does tend to be disclosed? You could also change your methods. Can you change the stats test you run based on what you have? Sometimes contingency tables are just as good as something really fancy, because they set the scene for other academics, researchers or civil society to take that as a problematised issue, run with it and do their own work. It's worth thinking about whether you simply change the nature of your report based on what you get back, outline the limitations, and be upfront about it. You could also chase further information: can you offer authorities a chance to provide more detail through new requests that you know will have a good success rate?
Sometimes you just want to approach everything from scratch, because you've learned something, and it's better to do that quickly rather than dragging it all out and then realising: okay, I got the wording wrong, we're all stuck now. If you identify a really systematic issue with getting the information you absolutely need, it's much better to put in a fresh request and re-evaluate whether the project should go ahead in its current form or shift its focus. The last thing to think about is reframing with a transparency focus. If you know your request was worded fairly, and that in a functioning system you would probably be given the information you asked for, but you get a really low response rate, maybe because it's a contentious issue, or the public authorities you've approached are really cash-strapped and struggling for resource, you might want to shift to actually measuring the types of responses you get. If some public authorities, police, schools, anyone else, aren't able to answer the question, sometimes that can be just as revealing as what is disclosed through FOI. Storing rejection reasons can also be really helpful. For example, with modern slavery work, a lot of authorities say they don't hold information centrally, when arguably some of the data we're requesting should be really easily accessible, because it should be monitored regularly. So those are just some ways to pivot if you don't get the success you hoped for. The impact of FOI gathering can be really, really powerful. I'm biased, because so much of what After Exploitation does hinges on collecting data and evidence through freedom of information, but here are some quick examples of the impact we've managed to secure using exclusively freedom of information requests, to central government, law enforcement and local authorities. In 2019 we uncovered the wrongful detention of thousands of potential victims of human trafficking, at a time when the government denied this was a practice that was allowed to happen. That opened up a wider discussion about the frequency with which vulnerable people are locked up under immigration powers, and whether there are sufficient safeguards in place to recognise vulnerability, including indicators of modern slavery. We also recognised, by working with so many people with lived experience of modern slavery and exploitation, that often the first time survivors come into contact with a first responder tasked with identifying them, they slip through the net, because there is a lack of understanding, a lack of awareness, and sometimes a lack of resource to enable the survivor to be supported and get that referral to official decision-makers. Through FOI requests, we found there are thousands of cases every year where suspected survivors slip through the net because they're not passed on to decision-makers, for whatever reason. We also revealed that flawed decision-making was an issue within the Home Office when it came to modern slavery cases, because a vast majority, nearly 80 percent, of trafficking rejections that go to appeal side with the survivor.
So that shows us there are hundreds of cases where people who should have had an immediate right to some really basic safeguards, safe housing, counselling, medical assistance, and all the rights that come under international law if you're a recognised trafficking victim, were instead rejected out of the system and unable to access those rights for a really long time, until they were able to have their case revisited. This information should all be in the public domain, but it was only through FOI requests that we were able to track how some of these systems of support actually function in reality. So that's me. I think I've gone over time, but thank you so much for having me.

Thank you so much, Maya. That was really interesting, and an amazing walk-through of all the work you do at your organisation. I know we have one question in the chat already. Alison asks: in terms of data, bearing in mind what we know about certain central government departments having poor response times, have you had to go to individual public authorities because central government departments do not respond and have not published the data? And there's a linked question: do you ask central government about the methodology? For example, how do you know that local authorities classify data the same way as central government, so that central government is not required to do anything to make it fit their reporting?

Okay, that's really interesting; I'll answer it in two parts. It's a bit like asking your mum for permission, and when she says no, you go to your dad. There is an element of that with FOI, I think, and I think that's very valid. Often there is quite good communication, though, where central government expects a public authority to be collecting and providing something quarterly, for example. It doesn't always work, but if your scope is quite broad, you might get some pieces of information if you're asking about the kind of data that's stored. In our experience we have seen, or we suspect, some kind of communication between a public authority and central government, because suddenly they get back to us and the language is the same as when we've asked central government. So that's something to bear in mind: there is communication there, sometimes fairly, because of the need for clarification, other times because there's a bit of an overlap between PR, messaging and FOI. That's a convoluted way of saying it's definitely worth trying both, but you might want to keep the wording to central government much more formal and streamlined. For example, if you know there are certain standards central government has to meet for a data release, or to measure the effectiveness of a policy, parliamentary questions are a really good thing to look through; look through Hansard, and the written questions and answers website, to see if MPs have asked for data updates as well. When you're asking central government for that information, it's definitely better to ask: what are the official standards, what are the official metrics? You can also ask the creators of public data what standards their statisticians use, if you've got any questions around that.
But then when you get to a more local level, you might want to keep it broader if it's a fishing expedition: you might want to ask, what are the kinds of things you think about when you're collecting this data? So that's ad hoc advice, and I'm sure it doesn't always hold water, but I think it's definitely worth following up with both; your style might just change depending on what you're asking for. In terms of asking central government about the methodology, and how to know local authorities classify data the same way: I would say that nine times out of ten those standards should be in the public domain, and if they're not, you should definitely go to central government and ask what the reporting standards are for certain issues. Again, watchdogs and inspectorates are really great for that. For example, it was by looking at the inspectorate's work on police forces that we got an understanding of the Counting Rules 2016 (they've since changed), which govern the way all police forces have to record data on crime reports and how a crime report is translated into a crime record. That's all completely in the public domain, but I don't know what the disclosure would have been like if we'd gone to the Home Office and asked, what are the rules you use for police reporting and recording? So if possible, it's good to get those standards from public sources. You might also find answers in written parliamentary questions and answers.

Thanks. This sort of relates to the question you were just discussing. How do you deal with inconsistency when you get a sense from the data that something isn't quite right, where there's obviously inconsistent reporting across different areas? We tried to do a big project like this for the first time, and it just didn't really work. We tried to use the lovely mySociety technology to look at how people with dental problems were presenting at A&E, essentially looking at where a lack of dentistry provision was causing more people to have dental emergencies. And we found the way different hospital trusts were recording dentistry issues and dental emergencies was so wildly inconsistent that in the end we thought, I'm not sure we can reliably compare these things. Have you had any success going back and questioning that data? Because we looked at it and thought, we don't feel we've got confidence in this data as it stands. Do you sometimes find it just doesn't work, or are there ways you can set your own consistency standards, go back and query the data, or even force a consistency that doesn't exist? Does that make sense?

Yeah. I guess the problem is that not all public authorities, and not all central government departments, deal with data that well. There are statistics in the public domain that don't always come up to the standards you would expect, and there are still inconsistencies there. But what I would say, and I know I'm repeating myself, is: where possible, definitely try to use whatever standards are set by central government. For example, NHS Digital have really, really stringent rules on how things are recorded.
I'm sure they'll have some kind of breakdown, even if it's not as detailed as you'd want, of the types of admissions broken down by A&E, and from those category fields you can get a sense of where the more detailed versions of that information are stored by trust. I would always try to work backwards like that and reference the national publications some of that data might end up in, because it gives a bit of a wink-wink, nudge-nudge: I know some of this data will end up in national publications, so there must be some kind of consistent standard. That's what we did with the police force data, by using the exact terminology from the Counting Rules, the standard set by the Home Office, without referencing it explicitly. It did improve the disclosure, but like you said, there were still inconsistencies. Even though we said exactly how to break it down, to use crime outcomes, and a crime outcome is a recognised term that applies to 18 specific possible outcome types, it couldn't be simpler, we still had forces randomly deciding to bundle them up: "these are all the outcomes for 15 and 18". Why would you do that? We put in the wording that it should be disaggregated, but apparently that was not possible. So where possible, going back is worth it: we did make a decision to go back to all of the police forces that amalgamated their totals, to give them a chance to uncouple them. But how you deal with inconsistencies depends on the case and on your priorities. In one case, we asked for the number of charges as one of the outcomes, but a lot of police forces bundled together charges for alternative offences. We actually made the decision at the end to also bundle up charges for alternative offences, because we felt it would still give us a clear picture, and it's consistent with the way the CPS measures that data as well. So it wasn't the end of the world, but you might want to talk to your team to decide when you go back and pester everyone for a breakdown or more consistency, and when you just go with the flow and adopt their way of disclosing. Sorry, I feel like that was a really convoluted answer, but I hope you have some luck with it.

Hey, welcome back everyone, and thank you so much, Kay, for joining us; I know you had another training course that ended at 11. Everyone, Kay is our second speaker. She comes from the ODI, where she's a data trainer on the ODI's learning team, and she has a background in analysing geospatial data. She's going to take us through combining different sources of data to create data sets. Over to you, Kay.

Thank you, Jen, and thanks for having me, everyone. I'm really glad to be able to be here from the ODI, where I work. We are a nonprofit, and we work with governments and companies to try and create a world where data works for everyone. I work on the learning team, and we offer a number of training courses around things like how to open up data sets and publish them, for reasons like streamlining freedom of information requests, building trust with stakeholders, and increasing transparency.
We also run training courses on things like data ethics, making sure that algorithms aren't biased against certain groups within society, and on the more technical aspects of data analysis. We do have some free courses, and I'm going to send a link to Jen so she can pass them on to you, because I think one in particular might be really useful if you are combining data sets from different sources. I'll send that at the end, but I'll get right onto the material, because I know we haven't got a whole lot of time. I thought I would talk you through a case study we've been looking at in some of our other training courses, where researchers had to combine data sets from different sources. These aren't from freedom of information requests, but they highlight some of the issues you might face if you're getting a data set, or trying to create a larger data set, from multiple sources. The case study is some researchers who were developing an AI algorithm designed to get through chest X-rays quickly, because of course the medical system is under a lot of pressure, and they wanted to automate triage so that a sick person would be sent straight to a specialist to be treated right away. The concern was bias in underdiagnosis: if somebody is sick but not diagnosed, they're not going to be seen by a doctor, and that could cause them to die. So they wanted to look at the bias within this algorithm, to see whether it was going to underdiagnose people at higher rates in different demographic groups. They really wanted to make sure it was going to have a fair outcome, because we can't have the AI allowing certain people to get sicker and die while treating others. The way they developed it: they got chest X-rays from a diverse population, which had been examined by specialists. They developed an algorithm that learned to predict which people were healthy, which they called "no finding" of disease, and then tested the algorithm on images it hadn't seen before, to see if it could correctly predict what the specialists had already decided. To do this, they had to combine three big data sets. You can see we've got three columns here: a data set called CXR, a data set called CheXpert, and a data set from the National Institutes of Health. The first row says how many images were in each; the data came from over 700,000 real patients. Then, when they tried to combine them, look at the next row: the images were labelled by the specialists with diseases, or with "no finding". The first two data sets had 14 different disease labels plus one "no finding" label. The NIH data set had 15 diagnosis labels, but only eight of those diseases actually overlapped with the other data sets. So they couldn't really develop an algorithm that compared rates of specific diseases, because they couldn't be sure the three data sets could be combined in that way. When they decided to create this algorithm using these data sets, the only thing they could be really sure was comparable across all three was that "no finding" label.
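As a minimal sketch of that label-overlap check (illustrative label names, not the real lists), intersecting the three data sets' label sets shows which labels are even candidates for cross-dataset comparison. In the real case, only "No Finding" survived the further question of whether shared names were applied consistently.

```python
# Illustrative disease-label sets; the real data sets had 14-15 labels each.
cxr_labels = {"No Finding", "Pneumonia", "Edema", "Atelectasis"}    # hypothetical subset
chexpert_labels = {"No Finding", "Pneumonia", "Edema", "Fracture"}  # hypothetical subset
nih_labels = {"No Finding", "Pneumonia", "Mass", "Nodule"}          # hypothetical subset

# Only labels present in every data set are candidates for comparison
# across all three; even then, the same name may be applied differently.
comparable = cxr_labels & chexpert_labels & nih_labels
print(comparable)  # {'No Finding', 'Pneumonia'}
```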
So that was what they had the algorithm do: predict when somebody is healthy, when we don't think there's a disease, because due to the limitations of combining the data sets, that was the only label they could be sure about. They also wanted to know about the rates of underdiagnosis in different ethnic and racial groups. If you look at the race and ethnicity line, you can see it was self-reported in the CXR data set and not reported in the other two. That means they could only use the CXR data set to look at race and ethnicity; they really can't say anything about race and ethnicity from the other data sets, which is a shame, because that data would have helped. But with over 300,000 images they were still able to come to some conclusions; they just had to recognise that their data was a little bit limited. The other issue, of course, is that it's self-reported, and as many of you probably know, when you ask people to self-report race or ethnicity, some people choose not to disclose, which might bias your data set, and some people feel that very limited categories don't reflect who they are, so they might choose "other". So we can't even assume that data set was right, much less related to the others; we have to take that line with a grain of salt, but at least we've got some information. They had a similar issue with insurance type, which in America they used as a proxy for socioeconomic status: if people are on Medicaid, that's the low-income health insurance provided by the state. That was only collected if patients were admitted to intensive care, so it was only looking at the sickest people, possibly selecting a group within society that might not reflect the larger whole, and they couldn't get socioeconomic status from the other two data sets. So it had a similar problem to the race and ethnicity data, but they could still do their best with it. They were also looking at what type of hospital it was. In this case, two data sets were from normal hospitals, and one was from a hospital where patients were only admitted if their diseases were being studied. What you have to think about there is that, presumably, in that NIH data set everybody was already sick, so you weren't getting any "no findings" from there. If you were looking at what percentage of people were sick, the third NIH data set might make it look like people were sicker; it might inflate your average. So that's something you have to think very, very critically about: if your data sets are possibly selecting from different pools, is that going to affect your overall numbers? Maybe you need to look at each data set separately before you combine them, because the combination might not reflect real life if they're selecting from different pools. We can probably assume that in that one, every patient at the hospital was sick. The last thing I wanted to talk about here is the sex data: you can see that in the CheXpert data set, sex was assigned by clinicians.
So they were deciding what that person appeared to be, and we don't know what the categories were, whether it was just male or female, or whether there were other options. So we have to think: is that going to affect the outcomes of what we're trying to look at? In this case, how likely is misgendering, and how many patients would that realistically be out of 223,000? We don't know. If we were trying to get at some assessment of gender breakdown, we might want to leave that one out, on the assumption that self-reported data is more accurate. Obviously these sorts of questions will vary; every data set is different. But I would really encourage you to go through each category you've got and think very critically about where there might be missing information, where the data set might already be selecting from a biased population, and whether you can combine them well or not. And if you can't combine them, like the diseases line we talked about first, what can you extract that can be combined, even if the parts aren't perfectly comparable? In this case they were able to get a good result; they just had to take a lot of it with a grain of salt, if you know what I mean. If you're curious about the outcomes, and I know this isn't expressly about combining data sets, but this is probably a crowd that's interested in whether there were biased outcomes: they found that, purely by looking at the X-ray images, the algorithm was giving biased outputs, which probably reflects the bias in the medical system. A person was much more likely to be underdiagnosed if they were in a marginalised group. The y-axis here, "subgroup FPR", stands for false positive rate: basically, how often the model said "no finding" when the person was actually sick. The model was more likely to make that mistake with females; more likely with people aged zero to 20; most likely with Black people and least likely with white people; and much more likely if somebody had Medicaid, so a low socioeconomic group. And if somebody was a member of more than one marginalised group, that caused compound bias: if you were, for example, an 18-year-old Black woman on Medicaid, you would be much more likely to receive an underdiagnosis, to be told you were healthy when you were actually sick. This just goes to show why it's so important that we see these statistics, and why the work I'm sure you're all doing is so important to making society fairer. Once you do this type of investigation, you can hopefully put processes in place, for example adjusting the thresholds for certain demographic groups, so that they need to meet a lower threshold of illness before being sent to a specialist, because it's more likely that they might be underdiagnosed.
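To pin down what that y-axis measures: the "no finding" false positive rate per group is the share of genuinely sick patients the model wrongly calls healthy. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical evaluation results: one row per patient.
df = pd.DataFrame({
    "group":                ["female", "female", "female", "male", "male", "male"],
    "actually_sick":        [True,     True,     False,    True,   False,  False],
    "predicted_no_finding": [True,     False,    True,     False,  True,   True],
})

# Underdiagnosis rate = false positive rate of the "No Finding" label:
# among patients who are actually sick, how often did the model say healthy?
sick = df[df["actually_sick"]]
fpr_by_group = sick.groupby("group")["predicted_no_finding"].mean()
print(fpr_by_group)
# female    0.5   <- one of two sick women wrongly labelled "no finding"
# male      0.0
```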
So why did the bias happen? Probably a number of reasons. The disease tags came from natural language processing, the sort of AI behind tools like ChatGPT, so the labels might not have been a hundred percent accurate. But probably the main one is that the diagnoses themselves were biased: the AI was somehow picking up on underdiagnosis that happens in real life, maybe because people from certain marginalised groups aren't believed as much, or don't get as much time with a doctor. That medical bias is real, and the pattern this model showed was actually a perfect reflection of what happens in society, which is pretty striking when you consider the machine did not know the ethnic breakdown of these individuals. It was literally just looking at the picture and the names of the diseases, and yet that bias still emerged. The other reason is that some demographic groups probably had very low representation in the data set. For the younger people, from zero to 20, there probably weren't very many sick people in that age bracket, so the machine never got very good at learning what a sick young person looks like. So there are different reasons the bias might have emerged. That's not necessarily about combining data sets, but I thought it worth mentioning, because this bias is everywhere, AI is in the news all the time, and asking organisations that might be using algorithms for information like this is so important for shedding light on the types of bias that might be perpetuated if we never look at it. This is a big part of what we at the ODI do: we encourage organisations to look at these types of inequality in any data or algorithmic decision-making process, and then publish it for people to see. We work with a lot of government agencies, so we're always encouraging them to do this, and hopefully we'll see more and more of it as more algorithmic decision-making happens.

The next thing I wanted to talk about is that, once you've got your data sets back from different freedom of information requests, quite a lot of the time the data is going to be very messy. Here we have an example; it's made up, and it's supposed to be from a government open data site. I can see we've got the chat here, so I'll give you a minute: would anybody like to type in the chat any problems or mistakes in this data set that you would have to fix before you could actually analyse it? If you want to type your answer in the chat, you can; if you just want to look for yourself, you can. I see someone said empty fields, absolutely. One of the IDs has text, not numbers; yep, one row. And that whole row is shifted to the right, the one about halfway down, so we're going to have to fix that. Exactly, the dates are not 2013, even though it says it's spend data for 2013. Is the Ministry of Magic real? No, sadly. Anyone else see any others? There are quite a lot. Column D doesn't have a header; what even is column D? We don't know. Yes, there's narrative text where the date should be; that looks like the row that shifted over. Some of the spends are less than 25,000; yes, it specifically said you can download our data about spending over 25,000 pounds, and yet some of it isn't, so that's been put in by mistake. I'll circle all the ones we found to be a problem. Let's see, what else? I think we've got a lot of them. Oh, we've got a typo: "Ministry of Maggio" rather than "Ministry of Magic".
If you're trying to analyse a really large data set, just one little typo like that might throw you off, so you have to fix it. We've got a bunch of different "International Magic Corporation" variants. Ooh, we've got date format problems, story of my life, right? Some American dates, some British dates: 2/28 versus 28/2. Those might need to be fixed. Somebody said it's in an Excel format, not a CSV; yes, and sometimes it's even a PDF, which is not particularly open. You'll notice that grey and black logo just under the title "spend data": that's showing the licence for this, and it's actually not particularly openly licensed. You're not allowed to make any derivative works from this, which means you can't change it and republish it yourself; it isn't published under an open licence, which is a shame. I think we've got most of the rest. Some things we don't know what they are, like CNSS: what does that stand for? "research and development" is all in lower case. So a lot of the time you do have a dirty data set like this that you have to clean before you can use it. Oh, I just saw somebody ask if I can repeat what I said about the licence. Yes: when data is published openly, it should be openly licensed. You see that circle with CC in it; that's a Creative Commons licence, and there are a lot of different variations of Creative Commons licence. The little person that says BY means you have to attribute who it came from. The symbol with the dollar sign with a line through it is NC, non-commercial, so you can't then use it in, say, an app to make money, because you can't make money off it if it has a non-commercial limitation. And that equals sign, ND, means you're not allowed to change it and reshare it. So this is a pretty restrictive licence; you can't reuse this for much. I see Lawrence just commented that it can be problematic in some software. Yes, you would basically not be allowed to use this in software, because you'd have to create a derivative of it to use it. And if you are using openly licensed data sets, you can't really combine ones with different licensing conditions. Hannah says: so you couldn't make a cleaner version of this data and then share it with others either. Not the way this is licensed, no. But I'm not an expert on freedom of information requests, and I'm not sure quite what the licensing issues are around those, so I'd turn that over to somebody else in the group who knows more about the specifics. Most government data sets would not have that level of restriction; you do have to check the licensing, but most can be used commercially, and you can publish derivatives. It is worth checking, though. If you download a random data set off, say, Twitter, or what was Twitter, now X, and there is no licence on it, you are not allowed to reuse and republish it, because you have to assume that if there's no licence, it's probably copyrighted. So if you are just getting data sets off the internet, it's worth thinking about whether they're openly licensed for use, and looking for some kind of licence; there are lots of different variations, but the Open Government Licence would let you use it. So I thought I would quickly play you a little video.
I'm probably not going to play the whole thing, in the interest of time. It's about a free tool called OpenRefine, which started life as Google Refine, and which will solve all of these messy data problems for you. It is the most amazing tool; I personally heard imaginary choirs of angels singing when I first saw it, because it does the job so well: it picks these problems out for you and fixes them. So I'll quickly play a little bit so you can get the flavour of it, and then I'll send the link to the video to Jen so she can pass it on to everyone.

So you're looking at a data journalism project like ProPublica's Dollars for Docs and say, wow, that's cool: they're putting together data from seven different drug companies in order to discover which drug company paid which doctor to recommend their drugs. Nice. So you decide to dig up some public data yourself to do some data journalism on important social issues. Say you go to the government's IT dashboard and get data about projects the government has contracted out to private companies, download it, open it in a spreadsheet program, and check out the "type of contract" column. "Firm fixed price": what could that possibly mean? But whatever it means, "FFP" probably means the same thing. "Time and materials", "TNM": wouldn't those be the same as well? And should there be an "S" here or not? You would quickly discover that public, free, open data can be inconsistent and messy. Think of it as raw material that you have to refine before it's useful, and that's where Google Refine comes in: a free power tool for working with messy data. So let's load that same data into Google Refine and look at the "type of contract" column again. One core feature of Refine is the text facet. When created from a column, the text facet groups together identical cells in that column across rows and shows you the number of rows in each group. For example, 512 rows contain "FFP" in their "type of contract" cells. Clicking on FFP inside the text facet filters the data table on the right to show only those 512 rows, out of 5,200 rows in total. Now, there are two other groups that look like FFP as well. Clicking edit on the first one shows us that it has a trailing space; removing that space would merge it into the first group and increase the count to 513. By condensing all 5,200 rows into 800-something groups, Refine makes it easier to locate these inconsistencies. In fact, if we suspect there is trailing white space in other groups as well, we can apply a trimming transformation to the whole column to fix that whole family of problems in one shot, and now we're down to 785 groups. We could even rename FFP to "firm fixed price"; that performs a find-and-replace operation on 513 cells. And we can also change TNM to "time and materials", and so forth. You can sort the groups in the facet by count to find the biggest groups. "Firm fixed price" is the biggest group, consisting of 800-something rows, but it would be even bigger if its many alternative forms were written the same way. "Time and materials" has the very same inconsistency problem. Google Refine has a clustering feature that helps you fix this family of problems: essentially, it tries to group the groups based on heuristics, and you can pick different heuristics to adjust how aggressively the feature works. Select any group of groups that you want to merge and set the desired new cell value.

So, pausing it there.
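For anyone who'd rather script those steps, here is a rough pandas analogue of what the video demonstrates, with a hypothetical file and column name: a value count standing in for the text facet, a whitespace trim over the whole column, and a find-and-replace that merges abbreviations into canonical names. OpenRefine's clustering feature goes further by suggesting near-duplicate merges automatically.

```python
import pandas as pd

contracts = pd.read_csv("it_contracts.csv")  # hypothetical file and column name

# "Text facet": list every distinct value and how often it occurs.
print(contracts["type_of_contract"].value_counts())

# Trim leading/trailing whitespace across the whole column in one shot.
contracts["type_of_contract"] = contracts["type_of_contract"].str.strip()

# Merge abbreviations into their canonical forms (illustrative mapping).
canonical = {
    "FFP": "FIRM FIXED PRICE",
    "TNM": "TIME AND MATERIALS",
}
contracts["type_of_contract"] = contracts["type_of_contract"].replace(canonical)

# Re-run the "facet" to confirm the groups have merged.
print(contracts["type_of_contract"].value_counts())
```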
I feel like it gets kind of technical, but you get the idea of what it does. There is a bit of a learning curve, so I want to direct you to a free tutorial we have at the ODI that will walk you through how to use OpenRefine so that it works for you, and it's pretty readable and pretty easy to follow, I think. I have a little screenshot here of our website. If you go to learning.theodi.org, we have all these free courses: a data ethics course, a course about data visualization called Finding Stories in Data, but the one that's relevant here is Open Data Essentials. You click on that and it takes you to a bunch of different modules where you can learn about open data. The one this is about is called "How to clean your data". If you click on view and scroll all the way to the bottom, there's a section called "Get started with data cleaning in OpenRefine", where you can download a PDF and a practice dataset, and it walks you through all these different ways to clean it. So hopefully that will give you some tools you can take forward into your use of freedom of information datasets. I hope that's been helpful.

At the ODI, we are a nonprofit, and like most nonprofits we don't have a lot of spare capital, as you can imagine. We do charge for most of our courses, but we have these free ones that I would definitely encourage you to access. And if anybody is ever interested in attending any further training, we offer discounted rates to nonprofits, so please don't hesitate to get in touch. It's really important to us to work on making society better through good use of data. So thank you very much for the opportunity.

Thank you so much, it has been so, so interesting. I see Priya has asked a question in the chat. Oh yes: what advice do you have for sorting and analyzing cells that have comments, for instance from surveys? Oh, that is really difficult. I know there are very advanced techniques for trying to code comments as positive or negative, and if you had a very large dataset you'd need to automate the process somehow, but I don't know how a small organization could do that. My advice would be to broadly try to find some groups, like negative comment, positive comment, suggestion for improvement, and code them using specific words. So you could add another column where you either put a numeric code, like one equals positive comment, two equals negative comment, or put in a keyword that tries to summarize what the comment is about (there's a short illustrative sketch of this just below). That would be my best advice, but I do think it would be a tricky task.

Someone also asks, what are the concerns around privacy when using a Google tool for formatting data? Right, definitely talk to your IT specialist about this. My understanding of OpenRefine is that all the data is held on your computer, so if you are working with personal or private data, I don't think it sends anything to the cloud. Definitely double-check, but that's my understanding of how it works, because I've had people from organizations like the Home Office and DWP look into these things, and they felt they would probably be able to use it because it held the data on their own computers.
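To make the keyword-coding suggestion above a bit more concrete, here is a minimal sketch. The categories, keywords, and sample comments are all invented; a real survey would need a much richer keyword list, plus manual review of anything left uncoded.

```python
# A minimal sketch of keyword-based coding of free-text survey comments.
# Keywords, categories, and sample comments are invented for illustration.
import pandas as pd

comments = pd.Series([
    "Really helpful session, thank you",
    "The room was too cold and the slides were unreadable",
    "It would be better with more breaks",
])

KEYWORDS = {
    "positive": ["helpful", "great", "thank"],
    "negative": ["too cold", "unreadable", "poor"],
    "suggestion": ["would be better", "should", "more breaks"],
}

def code_comment(text: str) -> str:
    """Return the first category whose keywords appear in the comment."""
    lowered = text.lower()
    for category, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return category
    return "uncoded"  # flag this row for manual review

coded = pd.DataFrame({"comment": comments, "code": comments.map(code_comment)})
print(coded)
```

Because the first matching category wins, the order of the categories matters; a comment that is both negative and a suggestion would be coded with whichever category is checked first.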
Oh, we've got two questions here. Okay: do you have any links about the additional copyright issues that arise when data are combined? Ooh. Basically, if you are combining datasets that have very restrictive terms on them, you have to abide by the most restrictive term, so if one of them says you cannot create derivatives, you really can't combine it with anything. I'm trying to remember whether one of the other modules within Open Data Essentials talks about that; I'll double-check. I also remember seeing on Wikipedia, which is all Creative Commons licensed, a sort of chart showing how the different Creative Commons licenses interact and whether you can combine them or not, which is quite useful. I'll include that link as well when I message Jen afterwards.

Do you make use of combining datasets with code? I don't personally in this role now. I have in the past, but I was always writing bespoke code for that in MATLAB when I was working in geophysics. So people do it, and you certainly can, but I feel the learning curve is really steep when it's not your full-time job and you haven't got a data background. I think OpenRefine already does most of what you would want that code to do, so I would say before ever tackling bespoke code, have a go with OpenRefine first, because it'll be quicker, with less of a learning curve. Pleasure.

That makes so much sense. Okay, I think that is basically all we have time for today. I just want to say a huge thank you to both Maya and Kay for your presentations. They were so interesting and, I think, so useful. For someone like me who has literally no idea about data, it's been quite eye-opening; I have myself found it incredibly useful.
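As a short postscript to the question above about combining datasets with code: here is a minimal, hedged pandas sketch of the kind of join that bespoke scripts usually perform. The tables, the column names, and the shared "council" key are invented for illustration, and the earlier point about license compatibility still applies to anything you combine this way.

```python
# A rough postscript sketch: combining two datasets on a shared key column.
# Both tables and the shared 'council' column are invented for illustration.
import pandas as pd

spend = pd.DataFrame({
    "council": ["Leeds", "Bristol"],
    "spend_gbp": [120000, 95000],
})
outcomes = pd.DataFrame({
    "council": ["Leeds", "Bristol"],
    "cases_reported": [34, 21],
})

# An outer merge keeps rows that appear in only one dataset,
# which makes gaps in either source easy to spot.
combined = spend.merge(outcomes, on="council", how="outer")
print(combined)
```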