All right, good afternoon. My name is Katie Sill and I'm the director of the Waterpoint Data Exchange Program, or what we call WPDX, with Global Water Challenge. Today my colleague Adam Creave and I will present on challenges and opportunities in harmonizing data to help support evidence-based decisions for the rural water sector. The challenge is that the world is not on track to reach the Sustainable Development Goal target of providing universal access to water and sanitation by 2030, and rural access continues to lag urban. You can see here in the map that countries in sub-Saharan Africa and Oceania, excluding Australia and New Zealand, currently have the highest access gaps. A contributor to the lack of progress is that most countries don't have adequate finance to reach their WASH targets. So we either need to find a lot more money or make our money go further by improving the efficiency of the investments that we can make. One way to do this is to harness data to prioritize investments that reach as many unserved people as possible. However, in general, we don't have the type of data at the sub-national level that we need to complete these analyses. To give a quick sense of what we mean when we're talking about rural water and rural water access, here are some photographs from around the world of the types of water points that we're seeking to collect data about. These are not household systems, but instead public water points where individuals typically travel to collect water and then transport it back home. Management of these varies: sometimes it's the community, the local government, or a private company. And these are different kinds of water points: hand pumps, springs, rainwater catchment, and so on. There are a lot of groups working in this space and a growing recognition that evidence-based decisions will improve the impact of our rural water programs.
So more organizations than ever before are collecting monitoring and other routine data, and they're working towards using that data to inform decisions. However, if the data is not shared, then each entity only has its piece of a much larger puzzle. WPDX is a platform that brings these puzzle pieces together. Through a data standard and an ingestion engine that we'll go into in more detail later, WPDX harmonizes data, regardless of the organization or collection platform used, into a single dataset. This dataset is available online in our data repository and playground. With all of these different contributions, we get a much better sense of the landscape. And once these pieces are together, WPDX can provide analytical tools to give key insights and make sense of the picture. I think of it as like those Magic Eye puzzles from the 90s: once you put the pieces together, you still need specialized knowledge to get that 3D image to pop out. To jump to the end and demonstrate why we want to bring this data together, I want to show you a quick summary of our decision support tools. We have four tools in development to support four key questions that have been identified by government decision makers. The first is how many people lack access per district, so at that sub-national level. The second is which water point rehabilitation would reach the most people. The third is where the most efficient place is to build a new water point. And finally, which water points are at the highest risk of failure, and why. The first three of these tools use spatial analytics to determine the populations who are currently unserved, based on the functionality of the water points that we have within our dataset. With this approach, we can estimate water access by district, or actually sub-district, as shown here for Sierra Leone.
All of the tools have a map output along with a CSV download if you want to do further analysis. We can also use this approach to identify which water point rehabilitations, so which water points are broken and which ones we should fix first, would reach the highest number of people. This also includes a CSV output and an optional satellite view to get a better idea of what the community looks like. The larger the circle, the more people who could be served with this fix. Similarly, our new construction prioritization tool shows locations where newly constructed water points would have the potential to reach the greatest number of people who are not currently served by an existing water point. This tool doesn't yet include hydrogeological information, which we are hoping to add one day, but it does demonstrate where there are population centers in need of a water point. And finally, our predict water point status tool, which was developed in partnership with DataRobot, uses attributes from our standard and some external data, such as precipitation or land use information, paired with a machine learning model to determine which water points may be at higher risk of failure within relevant time frames. So we can get an idea of whether a point is likely to be working as of today, in one year, and in three years, based on our understanding of the system. Any one of these tools can provide valuable information, but taken together, they can provide support through the entire process. Starting here on the left, once the national water budget has been determined, we can use the district tool to identify which districts may need the most resources to catch up, based on their level of coverage. Then at the district level, district managers can use the tools to identify priority locations for rehabilitation, new construction, and even preventative maintenance.
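To make the rehabilitation question concrete, here is a minimal sketch of the kind of spatial calculation described above: find the people not within reach of any functional water point, then count how many of them each broken point would reach if it were fixed. This is an illustration only, not the WPDX implementation. The 1 km service radius, the point-based population data, and all function and field names are assumptions made for the example.

```python
from math import radians, sin, cos, asin, sqrt

SERVICE_RADIUS_KM = 1.0  # assumed service distance for illustration, not a WPDX parameter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def rank_rehabilitations(water_points, settlements, radius_km=SERVICE_RADIUS_KM):
    """Rank broken water points by the currently unserved population within reach."""

    def near(settlement, point):
        return haversine_km(settlement["lat"], settlement["lon"],
                            point["lat"], point["lon"]) <= radius_km

    functional = [w for w in water_points if w["functional"]]
    broken = [w for w in water_points if not w["functional"]]
    # A settlement is unserved if no functional water point is within the radius.
    unserved = [s for s in settlements if not any(near(s, w) for w in functional)]
    ranking = [(w["id"], sum(s["population"] for s in unserved if near(s, w)))
               for w in broken]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical data: point C is broken but sits next to functional point A.
water_points = [
    {"id": "A", "lat": 8.000, "lon": -11.000, "functional": True},
    {"id": "B", "lat": 8.050, "lon": -11.000, "functional": False},
    {"id": "C", "lat": 8.002, "lon": -11.000, "functional": False},
]
settlements = [
    {"lat": 8.001, "lon": -11.000, "population": 500},  # already served by A
    {"lat": 8.051, "lon": -11.000, "population": 300},  # unserved, near B
    {"lat": 8.055, "lon": -11.000, "population": 200},  # unserved, near B
]

ranking = rank_rehabilitations(water_points, settlements)
print(ranking)  # [('B', 500), ('C', 0)]
```

In this toy data, fixing C reaches no new people because its neighbours are already served by A, while fixing B reaches 500. That is the same effect the prioritization is meant to expose.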
In a recent desktop review of 12 districts in Sierra Leone, we found that, on average, using WPDX support tools would help reach about three times more people nationally and reduce costs by about a third. That's because a lot of times water points are put in or fixed in areas that are already served by existing points. So our analysis is really designed to help focus on the unserved population. Feedback from government partners makes us confident that bringing this data together will allow for meaningful analysis to help make better decisions and ultimately improve services. So we want to take a step back now and dive a bit deeper into some of the challenges that we've faced and talk through some of our approaches. First off, water point data does not come from a single source. We have contributions from governments, NGOs, academic researchers, and many more. And there is no agreement on a standard collection schema. People have a lot of different reasons for collecting data and a lot of different parameters that are defined slightly differently. For all of us who work with field data, the reality is that it's incomplete, noisy, duplicative, and fuzzy. So at the start, we have a lot of puzzle pieces that may or may not actually fit together. So as a first step, back in 2014, we collaborated with an international working group to define a data standard using parameters that were both widely collected and of interest at scale. This was a very iterative process that eventually resulted in six required parameters and 20 optional parameters, which are listed here. The standard is live; it evolves with the sector. But essentially we're looking at parameters that describe the location, the type, and the functionality of the water point. All right, and now over to Adam.
Okay, so Katie mentioned the standard, but even with the standard, and assuming everyone used it, we would still have big problems, because the data that is provided within the standard is inconsistent. It's not just about the fields; it's also about the values themselves. Here you have an example of basically the same technology being reported in different rows. The same kind of water point can be described in lots of different ways. So we need to understand the domain to develop heuristics to do that sort of mapping. I want to talk a bit about the technological solution that we implemented to tackle those challenges: the multiple organizations, the fuzzy data, and the fuzzy structure. One of the goals of that solution, and I hope we can go to the next slide, was to create something friendly enough that people in those organizations would be able to upload their data to the system without having to be extremely technical or have data wrangling capabilities. We don't want to require a specific structure or specific values, because if we impose these requirements, it just means that we won't get the data. Sometimes the files are just there, and either we do the wrangling or no one will. So we wanted to be able to get the data as is. Again, we didn't want to enforce any specific format. We wanted to be as accepting as possible and to give some agency to the people uploading the data, so they can see whether there are any validation errors and fix them themselves. So it really needed to be something that would tackle all the barriers that we usually encounter in these sorts of systems. And the way that we tackle that is basically using a platform called DGP.
I think that's in the next slide. DGP is a platform that is an evolution of work that I've been doing for quite some time. It started with OpenSpending, I don't know if you know it, which is basically a similar system that was focused on fiscal data. From there, I took these concepts and tried to build something more generic, built on top of the Frictionless Data framework and the dataflows Python library. The idea is that this platform basically gets a schema. In our case, this schema is more or less a transformation of the standard. Kate, if you can go to the previous slide. So the data standard is basically transformed into a schema, or a taxonomy, that says the same things but in a more technical way. Each one of those fields has a data type, it's defined as mandatory or optional, and it has some validation rules. Once that schema is put into the DGP platform, people can go to the platform, upload a file, and do a simple process of mapping the columns in their source file to the columns of that schema, where mandatory columns are mandatory and optional columns are the more the merrier but obviously not required. They can see the different validation errors interactively, in case they have them. But again, if some rows are failing, it doesn't mean that the entire file fails; it just gives them some errors. Kate, can you go to the next slide? Yeah. So they can provide all kinds of parameters, for example, undefined values: if there's some value that actually means that the cell is empty, they can specify that. Or, for example, they may have unique date formats or number formats, or different true/false values that are not exactly true and false. In our case, it would be, I don't know, functional and non-functional sometimes for water points.
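As a rough illustration of the process just described, the sketch below maps source columns onto a small schema, treats uploader-specified values as empty, reads custom true/false values, and reports errors row by row without rejecting the whole file. It is plain Python written for this example, not the DGP or dataflows API. The schema fields, column names, and the `harmonize` function are hypothetical and do not reproduce the actual WPDX standard.

```python
# Hypothetical schema: each field has a type and a mandatory/optional flag.
SCHEMA = {
    "lat": {"type": float, "required": True},
    "lon": {"type": float, "required": True},
    "functional": {"type": bool, "required": True},
    "water_tech": {"type": str, "required": False},
}


def cast(value, kind, true_values, false_values):
    """Convert a raw string to the schema type, using uploader-supplied booleans."""
    if kind is bool:
        lowered = value.lower()
        if lowered in true_values:
            return True
        if lowered in false_values:
            return False
        raise ValueError(value)
    return kind(value)


def harmonize(rows, mapping, missing_values=("",),
              true_values=("true",), false_values=("false",)):
    """Return (valid_records, errors); a failing row does not fail the file."""
    valid, errors = [], []
    for number, row in enumerate(rows, start=1):
        record, row_errors = {}, []
        for field, spec in SCHEMA.items():
            raw = row.get(mapping.get(field))
            value = None if raw is None else str(raw).strip()
            if value is None or value in missing_values:
                if spec["required"]:
                    row_errors.append(f"{field}: required value is missing")
                continue
            try:
                record[field] = cast(value, spec["type"], true_values, false_values)
            except ValueError:
                row_errors.append(
                    f"{field}: cannot read {value!r} as {spec['type'].__name__}")
        if row_errors:
            errors.append((number, row_errors))
        else:
            valid.append(record)
    return valid, errors


# Hypothetical upload with its own column names, empty marker, and status words.
rows = [
    {"Latitude": "8.48", "Longitude": "-13.23", "Status": "Functional", "Pump": "Hand pump"},
    {"Latitude": "n/a", "Longitude": "-11.74", "Status": "Non-functional", "Pump": ""},
    {"Latitude": "7.96", "Longitude": "-11.74", "Status": "Broken?", "Pump": "n/a"},
]
mapping = {"lat": "Latitude", "lon": "Longitude", "functional": "Status", "water_tech": "Pump"}

valid, errors = harmonize(rows, mapping, missing_values=("", "n/a"),
                          true_values=("functional",), false_values=("non-functional",))
print(valid)
for number, messages in errors:
    print(number, messages)
```

Here the first row is accepted, while rows 2 and 3 are reported with their own messages (a missing latitude and an unreadable status), so the uploader can fix or re-map just those values.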
And on top of that, we apply all kinds of custom cleaning and transformations. For example, the water tech and water source values that we saw earlier can come in lots of different flavors, and we map them to standard values. All right, so let's see how it actually works. We have a very nice video. Basically, you just log into the system using your Google account. You upload your source data to the system, which can be a CSV file or an Excel file. You can also provide a link to an API or a Google spreadsheet if you want. You need to provide some metadata, which is, again, configurable. This is metadata on the entire source. Once you provide that file, the system automatically detects the file format and whether you have some header rows. Then it gives you an opportunity to map columns from the file to the schema. It looks like a very long process, but once you've mapped a similar file, the system will try to automate it for you. So if you have a column that was already mapped at some time in the past to a specific column in the standard, it will automatically suggest that. At the end of this process, you submit this mapping for approval. What happens then is that Katie gets an email saying, okay, there's a new source uploaded. She can then take a look and see if she approves the file. If so, it goes and joins the rest of the water points in the harmonized dataset. Katie, you want to take it from here?
Sure. Thanks, Adam. This is just an example of us bringing in the data. You can see here on the left a list of different NGOs and organizations that have shared data with us, and here is how they appear on the map. You can begin to see each of these pieces coming together to present a much better overall view. What's neat is that at the end, the Ministry of Water, the government dataset, comes in.
But you can actually see that some of these NGOs are still working in spaces where the government didn't have that data. So it can even provide information to the government as well as to others working in the sector. Just to give a quick sense of where we're at and the road ahead: we have over 600,000 water point records currently uploaded from over 80 organizations. We are working to build new tools and improve upon our existing ones to allow for better visualizations, easier identification of the locations where you might want to work, figuring out which locations are likely to fail, and really digging into why and what can be done to help avoid that. And this is about, as data-driven decisions are, providing an objective perspective to balance against political pressures. Water is a very political area, so this is trying to balance that out and focus on where there are unserved people. Just a nod and appreciation to our very generous funders and key partners. Our contact details are here, and that's it. Thank you so much.
I think we have time for a question from the audience. I'd like to hear how you build relationships with local groups and partners to get work done, through to the end result of new water points. How do you build trust to encourage people to share data and work with you? Any stories to share?
Probably more stories than we have time for. But yes, great question. It's a big part of the challenge and opportunity, I think, in the open data space. Sharing water data can come very easily for some organizations, and for others it takes a longer-term relationship of trying to demonstrate the end result: here's what you can do with this data if you share it with us. And working with a lot of local groups.
So we are based in the US, but we work very closely with NGOs and local partners who can build those longer-term, trusting relationships with the governments to ensure that they feel comfortable, because making your data transparent can be a big step. So we've built longer-term relationships in Sierra Leone, which I mentioned, and we're also working in Ethiopia, Ghana, and Uganda, at various stages of developing those relationships and making sure that the data is well protected and safe, and also that the results are responsive to government needs or organizational needs. And that has so far brought us a fair number of organizations that are willing to share.