decisions. And so we want to highlight some of this. So we have two abstracts, just two. Unfortunately, our third presenter is not able to join us, so you have a little more time, which is good. So we have two presenters. First we have Tanya from HISP South Africa, who will be presenting on how they're using Superset to do some advanced analytics with DHIS2. And after that we have John Painter from CDC PMI. He'll be coming up and presenting on how he's using R to do some very advanced analytics as well, gleaning some interesting insights from the malaria data he's working with. So I think with that, Tanya, are you ready? Okay, I'll hand it over. Thank you. Can you guys hear me? Great. Good afternoon, everyone, and thank you for joining the session. I'm Tanya Govender, the projects portfolio manager from HISP South Africa. My colleague Comfort Manga, who is probably still having lunch as usual, will join us soon, and he will cover any technical aspects or technical queries that you may have following the presentation. Right. So for today's presentation, I will take you through the use case, introduce the Superset solution, talk a little about data analysis from multiple data sources, introduce the DHIS2 Superset portal app, and take you through the next steps that we envision for the future. Before I continue, I would like to get to know you a little better. So, by a show of hands, how many of you are developers? Great. Cool. And project managers, program managers, product implementers? Awesome. And how about people who are actually utilizing the data, the data analysts here? Awesome. That's really nice. I see some of you have your hand up for all three, which is super cool. So welcome to the session. I hope you will find it quite interesting.
So for the use case: while we were implementing a project in South Africa, we were faced with a challenge regarding reporting limitations. We knew what we wanted, right? We wanted reporting from multiple data sources with different data models. We needed to expand on the visualization capabilities, and we knew that we needed a business intelligence tool to do that. We wanted to analyze existing DHIS2 data together with non-DHIS2 data. We were quite aware of licensing costs and the cost of ownership. We also still wanted to utilize the existing DHIS2 systems, but we needed a tool that would complement that technology. And you're also aware that we collect all sorts of different data from different sources, and the need to bring that together, analyze it and present it from a central point, without having to reinvent the wheel, presented an interesting opportunity to the team. Of course, all of this needed to be done in the most cost-effective way, and the tools needed to be super user-friendly. We were also prioritizing the cost of ownership to the client, to make sure it is as low as possible to promote the sustainability and institutionalization of the product. So what did we do? We pushed on. We were looking for a solution and we knew we were going to find it. Our vision was clear: combining data from multiple data sources. We knew that business intelligence tools would provide the technology to combine multiple data sources and enable a single point of advanced analytics. And we were aware that proprietary business intelligence tools do carry quite heavy licensing fees, so we had to make it cost-effective and sustainable for implementation. With all of this in mind, HISP South Africa embarked on a process of setting up Apache Superset as a business intelligence tool. So why Apache Superset?
And apologies for the headings of my slides; I think the template doesn't render them correctly. So, why Apache Superset? One, it's free and open source, so we didn't have to be too concerned about licensing fees. Two, it has an awesome interactive filtering capability, so the user can actually interact with the data. It has wide SQL database support, which enables us to handle and integrate configurations from multiple different sources. It has its own user and role management component, which allows us to manage the access that users have to different reports, with the aim of replicating the user management from the user management app in DHIS2 and aligning it with the Superset software. The reports are easy to share, in that they are downloadable and can be shared on multiple communication platforms. And lastly, it has a community of practice to support us during the implementation. So how did we implement this? So far, HISP South Africa is utilizing Superset in two projects. The first is the South African health management information system, where we have set up a Superset dashboard for the Minister of Health. The second is the human resources information system, where we are combining human resources information with routine health data. Now, there are several ways in which Superset can connect with the source data, but for the purpose of this use case I will talk more about the human resources implementation. Apache Drill is an open-source database connection tool that we use as a connector between Superset and the source data. Apache Superset can also speak to your DHIS2 APIs and other systems such as Citus Data and FHIR, and it integrates with Python and Streamlit, which we are using for machine learning in the country.
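To make the kind of connection described above a bit more concrete: a BI tool can pull aggregate values straight out of the DHIS2 analytics Web API. The sketch below is illustrative only, not our actual Drill configuration; the data element and organisation unit identifiers are invented placeholders, and the response is a canned example in the shape the analytics endpoint returns.

```python
import json
from urllib.parse import urlencode

def analytics_url(base_url, data_elements, periods, org_units):
    """Build a DHIS2 aggregate analytics query (JSON format).
    The dx/pe/ou dimensions mirror what you pick in the Data Visualizer."""
    params = [
        ("dimension", "dx:" + ";".join(data_elements)),
        ("dimension", "pe:" + ";".join(periods)),
        ("dimension", "ou:" + ";".join(org_units)),
    ]
    return f"{base_url}/api/analytics.json?{urlencode(params)}"

def rows_to_records(payload):
    """Flatten the analytics response (parallel 'headers' and 'rows'
    arrays) into a list of dicts a BI tool can load as a table."""
    names = [h["name"] for h in payload["headers"]]
    return [dict(zip(names, row)) for row in payload["rows"]]

# A canned response in the shape the endpoint returns (IDs are made up).
sample = json.loads("""{
  "headers": [{"name": "dx"}, {"name": "pe"}, {"name": "ou"}, {"name": "value"}],
  "rows": [["deConfirmed", "202108", "ouCountyA", "1234.0"]]
}""")

url = analytics_url("https://play.dhis2.org/demo",
                    ["deConfirmed"], ["202108"], ["ouCountyA"])
records = rows_to_records(sample)
```

In our setup the queries actually go through the Apache Drill connector rather than hand-built HTTP calls, but the tabular shape of the data coming back is the same idea.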
This diagram shows how Superset pulls data from DHIS2 systems and other source data through the Apache Drill connector. All of the data that you are looking for is pulled into the Superset platform. You set up your dashboards in Superset, and what is really cool, and what we're so excited to share with you today, is the DHIS2 Superset portal app. This app allows you to interact with the dashboards that have been set up in Superset. So now that we've implemented it, what does the solution allow us to do? What are the benefits we are seeing, and how are we utilizing it? Firstly, it allows us to analyze data from different sources in a single dashboard. It provides visualization capabilities that are currently not possible in DHIS2. It is fast, lightweight and intuitive, with loads of analysis and visualization capabilities. It is easy for users to utilize, with everything from simple pie charts and graphs to detailed geospatial visualizations. Overall, we found that there's not really a need to pull all sorts of data into one huge warehouse; rather, the analytics can be displayed through the DHIS2 dashboards using Superset's linkage to your source data. This is the Superset dashboard, and here is why we would recommend it: it expands on the visualization options of DHIS2. It allows the user to interact, through the DHIS2 platform, with the Superset dashboards that have been created. You only need a superuser to set up the dashboards in Superset and then share those dashboards with users through the user management roles and groups in DHIS2. It allows for advanced Superset reporting capabilities from the DHIS2 interface, and the generic Superset functionality makes it available to any DHIS2 implementation. So what are the next steps that we envision for the future?
DHIS2 has a user management app, and Superset has a module for user management as well, so we are currently looking at aligning user access in DHIS2 and Superset for seamless integration of user management. Secondly, the Superset app is generic and can be plugged into any Superset installation, with documentation available for the implementation of Apache Drill. The portal can benefit lots of DHIS2 communities, and the app will be submitted to the DHIS2 App Hub for your consumption. We would also like to expand the Superset app to different projects that we are implementing in South Africa. The successful implementation of Superset will be expanded to other projects, bringing business intelligence to DHIS2: combining business analytics, data mining, data visualization, best practices and more data-driven decision-making, creating better lives for all. Obviously, this work would not be possible without our funders, our clients and all the contributors and the HISP teams, so we're really grateful and we thank all of the contributors. Thank you so much. A few minutes for questions? Yep. I'll invite my colleague, Comfort. Yep, please. I don't mind sharing the center stage. Let me hand this over to you, Comfort. So if someone asks a question, just repeat the question for everybody online. Okay. Thank you. You know, one of the really cool things about this is everybody is always saying, we can't make this chart in DHIS2, because you want something like box-and-whisker or wind rose, these really advanced charts. But now, look, you can. You just have to use the app. Superset. Okay. Any questions yet, Nora? One comment. When did we last have a woman developer standing up there? Thank you. Thank you, Nora. Sorry, I have to say, for everyone online, why we were applauding: I've just received a wonderful applause for being a woman representing software development teams. Thank you. Any other questions? Are you excited about the app?
I see a question at the top. Okay. The question is: I mentioned that we're using Python for machine learning, and how does that fit into the presentation? So I was saying that Superset has capabilities to connect with the DHIS2 API as well as other systems, including FHIR and Citus Data, and it can also connect with applications such as Python and Streamlit, which we are using for machine learning. Is that correct? Yes. In addition, Superset was developed in Python, so for configuration and other things you need a bit of Python. The question is, why did we use Streamlit for the visualizations? Oh, why did we not use Streamlit, but rather Superset? Yeah. I mean, when we started with this work, we looked at Power BI, but we had an issue with licensing, right? And even with publishing the dashboards. In one of Tanya's slides she was explaining the reasons why we went for Superset, like costs and all of those things. We were looking for a BI tool. So we found Superset and tested it out. And I think Streamlit is not really, well, I have a little knowledge of it, but I don't think it's a BI tool, right? You can do visualizations and things for your machine learning stuff, but it's not really a BI tool, and you need a BI tool to expand on what you already have in DHIS2. And for our use case, we needed data from multiple sources, and Superset was allowing us to make those connections and also to make the same visualizations available in DHIS2. So yeah. There was a question online: when will your app be on the App Hub? Okay, we are aiming to have it on the App Hub before the end of the year, because we've just put everything together now and it's coming together very well. So hopefully by the end of the year we'll have it in the App Hub, if not sooner. Thank you. There was another question. Oh, okay. Go ahead.
An example of, oh, we don't have an online question. Yeah, sure, Amri, you can take that. The question is: can we show an example, or speak to an example, of a Superset dashboard? Yeah, please join the universal health coverage session tomorrow to actually see that integration. The next question is: did the platform itself have sufficient visualization options, or did we need to further customize? Okay. In terms of visualizations, we didn't have to add in new ones. It's more about pulling in the right data, and then we have a wide variety of visualizations that we can pick from. So we didn't have to customize or add anything new. It's more about your data sources, making sure that you have all the data you need, and then, just like in DHIS2, you pick the right visualizations for your data. At this time, the visualizations are quite extensive; we haven't come across a situation where we needed to customize as yet. There are quite extensive capabilities. Question? Go ahead, ma'am. Yeah. So with Superset, the low cost and open source are great, but you mentioned that we need some coding experience. Are there any trade-offs people should take into account when they're comparing the other tools? Yeah. I mean, for us, it was more about the ministries of health and the clients that we work with not being restricted. Like I said, we started with Power BI and we had challenges with licenses, so we needed something we could implement for them without restrictions on licenses and all those things. Superset came in handy for us there. In terms of programming, it was just for the app, to make sure that the visualizations are also available in DHIS2. So we had to build the app and be able to interact with the Superset API. But once it's set up, like the app that we built, it's generic, and then you'll be able to pull anything from Superset.
So do you need any programming? For configuring Superset and setting permissions and all of those things, you will need your system admins to assist you. But once that is done, you just use it, and you manage permissions; I think that's the only additional thing you would need to do. Okay, there's another question from the gentleman. Yep. Okay, so are you saying that as we start pulling lots and lots of data into Superset for analysis, is it still able to function well with that load of information? Okay. Yeah. The first part, I think, is that Superset is not actually holding all the data; it pulls it when needed. Yes, for example, we have a centralized data analysis setup where only a small fraction of the data is needed in the centralized database, but 80 to 90% is located in local databases, and the data is partitioned vertically or horizontally and distributed across different locations. Okay, now I think I understand. In one of the slides that Tanya presented, in terms of connections, you can go via the connector, for example if you want to map the DHIS2 API, but you can also add different connections to your other databases, and then you'll be able to query across them. So yeah, it can do that as well. Not sure if that answers your question, but maybe you can look at your particular use case and then we discuss. Yeah. We can definitely have a side discussion to further unpack it. I'm sorry, I'm going to have to... Okay. Oh, okay. That's it. Okay, thank you. Thanks. Last question. We do have a tea break and lunch and we're all here together. Okay, if it's fast. It's fast. You have one minute. First of all, I really want to appreciate this presentation...
And I really like the idea that you are using open source to go down this path, so this is really appreciable. Actually, we have a product which is similar to that, but that product has a fee, and it is running on different platforms. So my question is: are you using the same database and same API, running on the same DHIS2 platform, or is it running on different platforms and different databases? Okay, we connect differently, right? For DHIS2, we use an account that's created on DHIS2, and then we map the DHIS2 API to Superset. So basically, in terms of the queries, Superset queries DHIS2 via the connector, using the API. Yes, it's through the API; that's how the query happens. So, for example, if you want to query organisation units, that happens through the connector. Yeah, yes. Yes, we use that. Then for the portal app to pull the visualizations in, you interact with the Superset API. Yeah, okay. Thank you very much. Okay, of course. Yep. That works. If anybody hiding in the back wants to actually sit on the front row and get a good view, that's a great time to come down. Calla, no, you've never hidden from anything in your life, so it's okay. There's a couple of slides where it's a little hard to see, so you really would benefit from being down front. Hi, good afternoon, everybody. I'm John Painter. I'm from the Centers for Disease Control, in the Malaria Branch. I work with the President's Malaria Initiative, largely in Sub-Saharan African countries, where we use DHIS2 to monitor our malaria programs. Brief outline: I'm going to give a moment of appreciation for DHIS2 before I go on to tell you why DHIS2 is not enough. This is particularly for the analysts in the group, but I hope the other folks will appreciate it as well.
And in order to provide what is at least maybe enough, we made a web-based app to help fill in some of the gaps that we see. Now, I won't be going through a live demo, because at least one wise programmer told me that that's a bad idea. So I've got some screenshots to show you as I talk through why we do what we do. So, a moment of appreciation. We have data from the most hard-to-reach health facilities all over the world, and DHIS2 has been really impressive. It's transformed our surveillance from very unmanageable paper to analyzable digital form. It's been a game changer. So it's really easy to access health facility data. But does it tell the whole story? No, that's the problem. To go from data to really good action, we need more than just the raw data that's in DHIS2, and I hope to show you some examples of why that's true. Ideally, an analytic plan should involve documenting what data we're going to use, controlling for reporting bias, removing outliers, and then fitting some statistical models to get the best idea of whether there is change, and whether that change is real or just some random variation. We've written a lot of code to do this, and we wanted to make it more available to most users, so we created an external app that has been dubbed by one of our partners "Magic Glasses". It is an intervention-effectiveness and trend analysis tool. It starts, of course, with a DHIS2 instance and then uses the API to connect to an instance of R that we run with RStudio. That instance hosts the app, which is made visible in a browser as a Shiny application. We use the API to access the metadata and the actual data, do the data cleaning and munging and various visualizations, and then, honestly, the advanced analytics, the time series models. So here's a quick screenshot of the app. It is set up with some tabs across the top, so you can go from setup to analysis across six pages.
It starts with putting in your credentials for the instance you want to connect to and a place where you want to store the data; all the data is stored locally. So the first issue that we wanted to solve was finding and documenting the data that's in DHIS2. We found that in a couple of the countries we work with, metadata is unavailable to most DHIS2 end users. Now, if you go to the DHIS2 demo as an admin, you see all the metadata and say, this is great. But when you're a regular end user, you don't see all that. Only three of the countries that I reviewed had a metadata app available where users could see some of the elements. The funny thing is, it's all actually available through the API. The problem is, it comes in a format that for most people is not readable, so we do the business of translating it into a readable form. Part of the reason for this is that we frequently find that different numbers are published for things like confirmed cases of malaria. If you go to the World Malaria Report, per country you see one number. If you go to the country's annual report, you see another number. If you go to PMI's report, you end up with a different number, because each time a different analyst has pulled a different number. And naively you'd say, oh, well, I want to see your number of confirmed cases. But the DHIS2 instance doesn't have a data element called confirmed cases. What it has is cases positive by malaria RDT, cases positive by microscopy, outpatient cases, inpatient cases, community health worker cases. And we don't know which ones were pulled, so it's often very hard to compare the data sets. We want data that's reproducible, so this first step is being able to expose to everyone what is available. For most of our countries, there are as many as 10,000 data elements. They're not all malaria data elements, but there are hundreds of those.
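The translation step described above can be illustrated with a toy example. This is a sketch only: it assumes the general shape of the DHIS2 `dataElements` metadata payload, and the element names and IDs are invented. The real app does this in R; Python is used here just to show the idea.

```python
import json

def readable_dictionary(payload):
    """Flatten raw dataElements metadata into a readable lookup:
    name -> (uid, value type). In practice you would also join in the
    owning data set to recover the period type (monthly vs weekly)."""
    return {de["name"]: (de["id"], de.get("valueType", "?"))
            for de in payload["dataElements"]}

# Raw API output is JSON along these lines (IDs and names invented):
raw = json.loads("""{
  "dataElements": [
    {"id": "uidRdtPos", "name": "Malaria cases positive (RDT)",
     "valueType": "INTEGER_ZERO_OR_POSITIVE"},
    {"id": "uidMicPos", "name": "Malaria cases positive (microscopy)",
     "valueType": "INTEGER_ZERO_OR_POSITIVE"}
  ]
}""")

dictionary = readable_dictionary(raw)
```

The point is simply that a machine-readable blob of UIDs becomes a table an analyst can browse and search by name.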
And if you've ever tried the visualizer app or the pivot table app, you've got a box this size to try and find your data element. So this is exposed to everyone. You can search by data element, by data set, by period type, et cetera, so that we don't confuse monthly data elements with weekly data elements. And then we expose the indicators as well, importantly showing what is in the numerator and the denominator, spelled out so that we can see it. In one country I went to, they were really aggravated that their indicator for confirmed cases did not match what they would have pulled by themselves, and they had no idea why, to the point where they came to say, we can't trust DHIS2, we should go back to our old system. We looked it up here, and we found that when the HMIS folks built the indicator, they'd accidentally written in one of the data elements twice. Therefore, the numbers didn't quite add up, but they had no way of seeing that themselves with their access to DHIS2. After browsing all the data elements, we ask people to make a data dictionary, defining exactly what they want to include; then you can say, I want to download this data for a set time period. You get a progress bar, the API is called, and after a bit of time you have all of that data. Once it's downloaded, you don't need any more internet connectivity, which is a great help for some folks. At that point, you can work with the data offline. All right, so now we have the data; let's analyze it. We're going to check for completeness and the potential for reporting bias, and we're going to check for outliers. So, completeness and outliers: we all know that when you go from a register, like you see on the left, over to a digital number, errors are going to happen. That's normal when you're doing that much data entry. And a huge investment is made in making sure that facilities do this correctly.
Those are the data quality checks. But we found that even when the data quality is high, there's a potentially much greater problem when we don't know how many facilities provided the data, and I'll give you an example of what I mean. That involves something we call reporting bias. It's potentially a huge problem with aggregated data, and if there's one thing you take away from this talk, it's this. So, an example. This is one of our countries with high levels of malaria. I selected a county, August 2021, and pulled out four data elements that I thought would be the typical things someone would want to review when looking at data from this county: first attendance at the outpatient clinics, the total number of suspected malaria cases, the total number tested for malaria, and the number confirmed with malaria. Typically what we do is take those first two values and say, okay, what percentage of patients coming to the facility were suspected of malaria? 13% in a highly endemic month in a highly endemic country is low. It calls into question whether providers were thinking about malaria or were dismissing fevers as something else. Okay, among those suspected of malaria, how many were tested? Our program spends roughly $750 million per year to ensure that every fever in a malaria-endemic country is tested, and this country receives a lot of that money. Only 35% of the suspected cases were tested, which is very low. And yet 126% of those tested were positive. Now, yesterday there was a great talk with a line on screen that I thought was a great line: the HMIS is the cornerstone of information, policy and planning in the country. Yes, that's why we have it; that's why I want to use DHIS2. But if this is the data the cornerstone rests on, what are people going to conclude? Was this program seriously underperforming, or is the data bad?
They're both. But this gets to what the real underlying problem is: how many facilities provided the data for these numbers? What I've put in the second column here is the number of facilities that provided the number in the first column. You'll notice that in this county, 219 facilities provided the number for first attendance, yet only 76 facilities provided a number for testing for malaria. And twice as many facilities gave a number for confirmed malaria, which, you may remember, is why we had more confirmed cases than tested. I think it's pretty obvious why that is now. So it wasn't that the data was bad or that this malaria program was seriously deficient; it's just that we only had partial data. Interestingly enough, you notice how there are different numbers for each of these elements. If I had gone to the program officer and asked, what's the reporting rate here in this county, he'd say, oh, it's fantastic. And indeed, the expected reports were 261 and 253 reports were received, a reporting rate of 97 percent. Oddly, not all of those even put down first attendance, because only 219 facilities filled that in. So the reporting rate for the data set just tells you that the form was sent in. Technically they reported, but it doesn't mean that all elements were filled in. So what we need is not the data set reporting rate but the data element reporting rate. That percent reporting rate is unfortunately very deceptive. So one of the things that the Magic Glasses app does is identify the facilities that are consistently reporting, so that we can look at the data from those facilities and not mix up those that report this month and not the next month. We find that if we restrict our analysis to these facilities, we effectively eliminate reporting bias.
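The consistent-reporter filter just described can be sketched in a few lines. This is a minimal illustration, not the app's actual R code; the facility names, months, and values are invented.

```python
def consistent_reporters(reports, months):
    """Return facilities that submitted a value for the data element in
    every month of the analysis window. Facilities that skip even one
    month are excluded, which is what removes the reporting bias."""
    seen = {}
    for facility, month, value in reports:
        if value is not None:
            seen.setdefault(facility, set()).add(month)
    return {f for f, m in seen.items() if m >= set(months)}

months = ["202101", "202102", "202103"]
reports = [
    ("Clinic A", "202101", 40), ("Clinic A", "202102", 35), ("Clinic A", "202103", 50),
    ("Clinic B", "202101", 10),                              ("Clinic B", "202103", 12),
    ("Clinic C", "202101", 25), ("Clinic C", "202102", None), ("Clinic C", "202103", 30),
]
print(sorted(consistent_reporters(reports, months)))  # ['Clinic A']
```

Clinic B never submitted February, and Clinic C submitted the form but left the element blank; both are dropped, which is exactly the data-set versus data-element distinction made above.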
Now, if you do remove some facilities from your analysis, you can't do things like compute the total number of cases. But what you can get is a better sense of ratios and trends over time. I was presenting the data for one month; the problem compounds when we're working over many months. On the left-hand side, you see a chart of the number of malaria cases over many years for the facilities that didn't report every month. This is actually the majority of the facilities, and it gives the impression that malaria is on the rise. Again, given our investment in malaria control, this is very disturbing. But if you restrict the analysis to the facilities that reported every month, you see it's basically a flat line, which is also a little bit disturbing, frankly, but it's a very different picture. Comparing the two, the facilities that didn't report every month show an increase in malaria, and that is probably just because over time more facilities started reporting, which skews the data. Another thing that skews the data is outliers. We go through several algorithms to try and screen out values that don't make sense, like when someone was supposed to put in 105 cases and it became 1,050; those kinds of errors. There are errors of enormous magnitude, and there are also errors that look like they're in roughly the right range, maybe 100 cases, but they shouldn't have that many during the driest months. So we can pick up both seasonal and overall outliers. Now, there are a lot of ways of doing outlier detection; I'm not going to compare the algorithms. But I want to point out that there really is a key difference between what we're doing here and what happens in the WHO data quality tool, and that is: after identifying these outliers, I can remove them from the analysis and document which ones I removed. The WHO data quality tool is great at flagging the outliers.
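One common magnitude screen for the 105-keyed-as-1,050 kind of error is a modified z-score built on the median absolute deviation. This is one of several possible algorithms, shown as an illustration; it is not claimed to be the exact screen the app uses, and the case counts are invented.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag indices whose Iglewicz-Hoaglin modified z-score,
    0.6745 * (x - median) / MAD, exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []          # no spread at all: nothing to flag
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

cases = [98, 110, 105, 1050, 102, 95]   # 1,050 is likely 105 mis-keyed
print(mad_outliers(cases))  # [3]
```

Running the same screen within each calendar month (January values against other Januaries, and so on) is one way to catch the seasonal outliers mentioned above, values plausible in the rainy season but not in the driest months.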
But there, it flags them so the facility can go back and change them; if they haven't changed the value, the data remains in the data set and continues to be used for the analysis. So all those erroneous values will screw up the finely tuned analysis that we want to do later on. All of this is to get to the final step of really being able to evaluate: are things getting better or worse? How well do our interventions work? The key is that we start with data from the facilities with the most reliable data, those that report every month, or sometimes we allow a little bit of wiggle room, maybe 11 of the 12 months, with the outliers censored. Okay. Displayed here is a time series of malaria cases from one of our countries, and you notice there was a big drop in 2019. To investigate that, we set an intervention date; in fact, there was an intervention, a large bed net campaign. Then we apply a number of time series models. I won't go into all the different ones, but there's not just a single one; there are many, and they may fit better or worse. We use a series of cross-validation techniques to find, for each facility, the best model fit. Then we use that best fitting model to forecast what we believe would probably have happened had there not been an intervention. That's what's shown here with the dotted line. You can't see it on your screen, but there's a gray shaded area around it, because this kind of forecast obviously has a confidence interval. From that best fitted model, we compare the expected values with the actual values, which are shown in red. In this case, the best fitted model estimates a 39% reduction over 12 months, which is a fantastic impact from that bed net campaign. It turns out that bed net campaign was actually a little bit more complicated.
It had a couple of different bed net types and also a few alternative vector control strategies, and we were able to split the data and compare the impact of each of those interventions over the same time period. Here's another example that takes an extra step: investigating an unexplained drop in cases in 2021 and 2022, actually going through to the present. What you see here is the number of confirmed cases of malaria. It's documented exactly which element this is, but for the purposes of this talk I'm just going to say confirmed cases of malaria. This is the raw data; this is what you would see in DHIS2. And you notice some amazing peaks and valleys. The impression that was given to me by the NMCP is that malaria just swings up and down, that it goes that way every couple of years. But here we have a couple of years that are down, and given there was a pandemic during this time period, one of my first thoughts was: is it possible that facilities aren't reporting anymore? That could be very problematic. So, is this decline in cases real? Is it due to a lack of reporting? Could it be due to decreased attendance, or perhaps a lack of testing? So again we applied some statistical techniques and recreated an adjusted, corrected confirmed-cases series that controls for lower testing rates or lower attendance. We then made a second time series that we believe is closer to the real number of confirmed cases. Even so, we saw there was still a large decline, a decline of 51%, that was not explained by lack of reporting, decreased care seeking, or decreased testing. This is an extremely useful result for the malaria program: it may indicate a real decline in malaria transmission. I'm going to skip ahead to one more example here, about something called minimal detectable change; maybe not the best phrase.
But the idea is that with the emphasis we have on data quality, sometimes you just have to ask: when is good, good enough? We always want to keep making sure facilities do well, but we often find there's always an example of a facility that had bad data or an outlier or something, and so it's easy to say we need to double down on data quality analysis. But you really have to ask: when is it finally going to be good enough? I'd say that depends on the size of the change you want to detect. For instance, going back to the data set I showed you earlier, the best fitting model I could find for the historic data, going up to 2018, had a residual error of about 29%. That means that unless a change was greater than 29%, I wouldn't be able to say it was a real change and not just random variation in the data. So if we had an intervention that was supposed to make a 20% impact, I don't know that I could actually detect it statistically. Similarly, if malaria cases swung up by 20%, I might not be able to say that that's really meaningful. In that same country, through the data quality improvement program, they've now brought that figure down to about 10%, and actually the last time we calculated it, though I don't show it here, it was about 5% to 7%. Which means that if they have an intervention they're trying to examine, they should be able to see any change greater than 5% to 7%. And honestly, they're unlikely to run any intervention that's supposed to have an impact smaller than that. So one could say that, for this country, the data is now probably good enough to do what they want to do. To summarize: data reported to DHIS2 provides a fantastic basis for evaluating program implementation and effectiveness. However, the uncorrected data is potentially really misleading.
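One simple way to formalize this "minimal detectable change" idea (our formulation, not necessarily the presenter's) is to take the spread of the best model's relative forecast errors and scale it to a ~95% band: changes smaller than that band are indistinguishable from noise.

```python
import statistics

def minimal_detectable_change(observed, predicted, z=1.96):
    """Rough minimal detectable change, in percent: the relative residual
    spread of the best-fitting model, scaled to a ~95% band. A change
    smaller than this can't be distinguished from random variation.
    (Hypothetical formulation for illustration.)"""
    rel_errors = [(o - p) / p for o, p in zip(observed, predicted)]
    return 100.0 * z * statistics.pstdev(rel_errors)

# Illustrative residuals of +/-5-10% around the model fit
mdc = minimal_detectable_change([110, 90, 105, 95], [100, 100, 100, 100])
print(round(mdc, 1))  # -> 15.5
```

By this logic, tighter model fits (smaller residuals) directly translate into the ability to detect smaller intervention effects, which is the argument for when data quality is "good enough."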
And the data doesn't need to be perfect, but there does need to be a systematic analytic process for using it. We hope that, to improve program evaluation, we could get external analysis apps like this installed alongside DHIS2 so that we can make the best use of the data we have. If anyone's interested in looking at the app, all of the R code is available at the GitHub site listed here. Thank you. So we will have a few minutes for questions as well. But one thing I just wanted to re-emphasize is that we've been talking about this for a couple of years now, and John made it extremely clear to us that data set reporting rate analysis is not enough. If that's all you're doing, you're not doing enough. You have to be looking at data element reporting rates, because people are gaming the system: they're submitting the data sets without filling them in, and they know you monitor for submission, because that's what they're being graded on. Yeah, and it's a perverse incentive, because they're also being monitored on getting it in on time, which means: I'd like to fill it in, but I'm better off getting credit for getting it in on time than for actually filling it out. John's presentation inspired us to write some guidance documents on how to use standard DHIS2 tools to look at data element reporting rates. I think I'm still probably the only person who's read it, but it is on our website; we have a data quality guidance document there. Please have a look at it and utilize some of the tactics. We got the inspiration from John and are hoping that countries actually start using it, though I don't think anyone has yet. Okay, Colin, you have a question. Yeah, two comments. One is that a lot of what you're talking about are things you can normally address through the better tools you're presenting, and also through systematic analysis and monitoring of what's happening.
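The distinction being drawn, a data set submitted on time versus its elements actually being filled in, is easy to check directly. A minimal sketch (function and variable names are hypothetical):

```python
def element_reporting_rate(values, expected_reports):
    """Fraction of expected facility-months where this data element
    actually has a value. A data set can be submitted on time with the
    element left blank, which data set reporting rates won't reveal."""
    filled = sum(1 for v in values if v is not None)
    return filled / expected_reports

# Data set submitted all 4 months, but the element was blank once.
# Note: a recorded 0 counts as filled; only a blank (None) does not.
print(element_reporting_rate([5, None, 0, 3], 4))  # -> 0.75
```

Comparing this per-element rate against the data set submission rate exposes exactly the gaming described: a 100% data set rate alongside a much lower element rate.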
But I'm also interested in why we have so many problems, because I've looked at lots of different countries and I see an enormous backlog in metadata quality and data quality, et cetera. Things that should be done, even the very basic things, are not being done, and I think it's worth analyzing why. I would like to mention two things. One, I think the problem is that DHIS2, being generally a centralized system, is very often under the real control of informatics people, computer science people, and not health managers. That means they simply don't see when the data don't make any sense from a health and health-management perspective. To take a practical example: I do a lot of work in Sierra Leone. I've been working with them for six years now, setting up their IDSR system, particularly the case-based IDSR system. What I find there is that the disease surveillance managers' thinking is still paper-based. They're thinking in terms of forms and data being sent from site to site, instead of thinking, oh, we have a central database. Their thinking is about exercising control through data forms. And what they're used to is weekly aggregate data. But the fact is that weekly aggregate data are always suspect, right? They're not confirmed cases. Now that we have a case-based system, which is actually slowly getting better, with higher reporting rates, we can see that we didn't have 93 cases of acute viral hemorrhagic fever in 2022; we had 10 confirmed Lassa fever cases. All the other suspected cases were, on lab analysis, found to be non-confirmed or false. So I'm just saying, a main problem here is actually changing the way health managers think about their data. And just to give you one thing you could do in any country instance: look at your data elements and indicators and see how many of them have proper definitions. I can tell you, you're going to find that in most countries, 80 percent have no definition at all.
Or the definition is just a copy of the name of the data element or indicator. And that reflects a deeper problem in how managers are interacting with data and information. Yeah, those are good comments, thank you. The name Magic Glasses actually came from a health manager who said she could finally see what was in the system. Yeah. Thank you. Yeah, very wise words from the father of DHIS; he wrote the original code for DHIS. Any actual questions? Yes. Like you explained, when you have partially complete data, you can analyze that data. My question is, can we use machine learning to predict the missing data? Maybe you can look at past data and say: okay, we don't have data from this location for this period; can you predict what it would have been? Yeah, it's a great question, which was: can we use machine learning or other techniques to impute the missing values rather than just censor them? And yeah, I spent a lot of time looking at that and thinking about it. I don't know that I necessarily came up with the best answer, but what I found was that it was very hard to predict what a missing value was. And in so doing, I was imposing my model onto the data, when really I want the data to tell me what the model is first. So I worried that the tail was wagging the dog. That said, there may be some more sophisticated ways of imputing values from neighboring facilities that could adjust for it. But in the end, I decided the simplest, or truest, method would be just to look at the facilities that reported every month, where there wasn't missing data. Hopefully that's a majority and a good representative sample of the facilities. If it's not, that's problematic, but in most instances where I work, it appears to be a very representative sample, both geographically and in terms of small clinics, large clinics, et cetera.
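The censor-don't-impute choice described above amounts to a simple filter over facilities, keep those with complete (or nearly complete) reporting and drop the rest. A hypothetical sketch of that filter:

```python
def complete_reporters(monthly_by_facility, months=12, wiggle=0):
    """Keep facilities that reported at least (months - wiggle) of the
    period, rather than imputing their missing values. `wiggle` is the
    'little bit of wiggle room' (e.g. 11 of 12 months) from the talk."""
    return sorted(
        fac for fac, vals in monthly_by_facility.items()
        if sum(v is not None for v in vals) >= months - wiggle
    )

data = {
    "Facility A": [1] * 12,            # complete
    "Facility B": [1] * 11 + [None],   # one month missing
    "Facility C": [None] * 4 + [1] * 8 # four months missing
}
print(complete_reporters(data))            # -> ['Facility A']
print(complete_reporters(data, wiggle=1))  # -> ['Facility A', 'Facility B']
```

As noted in the answer, this only works if the complete reporters are a representative sample of all facilities, which should be checked geographically and by facility size.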
John, there's a question online about differentiating between zeros and blanks. Yeah. That's another thing that people are able to see in Magic Glasses: zeros are sometimes not stored in DHIS2, but it's a setting configured per data element. It's not that Malawi doesn't store them and Zimbabwe does; it's a question of which data element has the setting turned on. That's one of the attributes that is visible when you look at the metadata for the data elements: it says whether or not zeros are stored. So it is something to be mindful of. Fortunately, in most instances, for things like malaria cases, I've seen that countries have set that flag so that zeros are recorded. We have a couple more questions, maybe. (The next audience comment was largely unintelligible in the recording; it concerned the volume of data and the complexity of data elements.) Thanks, I appreciate that. Scott and I had a conversation a couple of years ago, and the comment was about looking at the completeness of the form. I remember when people asked me, why don't they just require every field to have an entry? You can't submit the form unless there's an entry; that would do it. But when you look at that, it becomes a real barrier to getting the form in.
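The per-data-element zero-storage setting referred to here is exposed in DHIS2 metadata as the `zeroIsSignificant` flag on each data element (that field name is from the DHIS2 Web API; the audit function itself is a hypothetical sketch over exported metadata):

```python
def zero_storage_audit(data_elements):
    """Split data elements by whether DHIS2 stores zero values for them,
    using the `zeroIsSignificant` flag from data element metadata.
    Elements without the flag set will have zeros dropped as blanks."""
    stores_zeros = [d["name"] for d in data_elements
                    if d.get("zeroIsSignificant")]
    drops_zeros = [d["name"] for d in data_elements
                   if not d.get("zeroIsSignificant")]
    return stores_zeros, drops_zeros

elements = [
    {"name": "Confirmed malaria cases", "zeroIsSignificant": True},
    {"name": "Days out of stock"},  # flag unset: zeros not stored
]
print(zero_storage_audit(elements))
# -> (['Confirmed malaria cases'], ['Days out of stock'])
```

Running this kind of audit before an analysis tells you which elements' blanks might really be zeros and which are genuinely missing data.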
And the fact is, there are sometimes hundreds of boxes; in one country I work in, the monthly form has over 1,000 boxes on it. And rare things, it might be viral hemorrhagic fever, are going to be left blank. Another common example is stock data: there's a box for the number of days stocked out, and they often leave that blank because they didn't have any days stocked out. Yeah, it would be nice if they wrote in a zero, but at least they wrote in the other boxes. So there will always be some blanks that are implicit zeros, and if we require them to write a zero, I think there'll be a lot of pushback. We're going to have to end it here, guys. So let's give John another hand. Thank you.