 Everybody, it's Monica, and here we are. We're live, I'm starting up the live stream. Today we have a special treat for you. We're gonna be querying the Global Health Data Exchange, or DH, GHDX, I don't know, it's got a cute name. But anyway, I'm gonna show you this tool that one of my customers alerted me to. I kind of knew about it. You'll find that I actually know a lot about the organization that is hosting this tool. I don't think the tool has been used a lot by people outside the people connected with the project, but maybe it'll get used more if people show up to my live stream and I show them about the tool. So I'm just kind of saying stuff so we can get people to join. So there were sort of two reasons I wanted to focus on this tool. One is to just get the word out there that there is this online query tool where you can study, and it's basically data about countries, health data about countries. So it's generated, I'm sure, from individual data, but it's about countries, it's not about individuals, but it's good for like global health, right? So if you study, if you're in charge of a country, especially countries, like I work a lot with people in the Middle Eastern countries, it's nice when you compare one Middle Eastern country to another Middle Eastern country. In the US, we tend to like to compare states because we're so big, and, you know, because like if you think of the country of like Qatar or UAE, there, you know, that's a pretty small country, you know, smaller than like Texas. So, you know, we have different sort of ways of doing comparators globally, but it always is difficult to do it with countries. And so this helps us in a lot of ways. And so I just thought I'd start with introducing you all to it, and hopefully people will start joining. You know, one of the things I've noticed is there's this like counter ICN restream. I was gonna troubleshoot this, I forgot to do it. 
And it always seems to say zero, but then people are in the chat and so I don't really know how many people are ever on. But anyway, so I'm gonna wait a few minutes for more people to join, you know, and start right at noon Eastern time. And so this tool that I'm gonna demonstrate, you'll see I point out a lot of sort of difficulties I've had using it, you know, cause like I said, a customer came to me and she wanted to do some studies comparing a country with other countries. And so I started just saying, okay, well, what if I wanna do that? You know, how do I use this tool? And I thought, well, this is a good example of why you have to think about the design of these tools. You guys may know that I published a book about how to make a data warehouse in SAS. And it's more conceptual than it is like literal, like, you know, it's not really like an, this is how you build a database, IKEA style like instructions, but it's more like, what do you think of in their design? And it makes me wanna have these people read the book, like maybe I should just send them a free copy because there's some really significant issues with this user interface. Okay, that being said, one thing that's awesome about it is you don't need an account for it. It's all just online. You know how like you can just go online and look up tweets on Twitter and stuff like that, but you don't need an account on Twitter. Well, this is where the global burden of disease database or global health exchange, I'm not exactly sure what to call it. We'll sort of get into that. And yeah, so hopefully people will join and even if people don't join, I can always, you know, put this out there and people can see this demonstration because I'll tell you the customer I was working with is really intelligent. And she's like, I don't know how to use this thing. And I'm like, oh, it's nothing. We'll just do this. And I was like, I spent an hour fighting with it, trying to figure it out, you know. 
So I thought, well, why not, you know, sort of try, like hopefully people will join and be in the chat and then we can have some interactive thing where I can kind of fight with it in real time in front of you. And then we can demonstrate it and practice with it together. Okay, well, it's noon. So I'm gonna officially open the live stream. Welcome to everybody. And you know, I don't know if you're here unless you chat. So feel free to go in the chat. You don't have to talk to me. You can say hi or thank you or whatever, but you can talk to each other too because hopefully we can get some sort of brainstorming done here. So for those of you who don't know me, I'm a data scientist and I do a lot of education. I am a LinkedIn Learning author and I have a book on SAS data warehousing. So SAS is one of the things I teach, and R, and I'm an epidemiologist, those are sort of the things. So I'm into healthcare. And so today what we're gonna do is talk about this thing called the Global Health Data Exchange. And actually I'm gonna share my screen, start by sharing my screen. Let's see here. Chrome tab, let's see here. Here's where I'm gonna start. So let me see if that shared right, yep. Okay, so this is where I'm starting and this link was in the description to this live stream. So what is this? This is the page for the Lancet, which most of you will know as a journal, a peer-reviewed journal out of England. But what's kind of important about the Lancet that a lot of people might not remember is it's a country journal, right? So like the Saudi Medical Journal, the Qatar Medical Journal, the Journal of the American Medical Association, those are sort of country journals. And so when you have a country journal, there's something special about it in the sense that it will sometimes sort of be provincial. Sometimes it'll get a little country gossipy. And I live in New England, I live in Boston. 
Those of you who read the New England Journal of Medicine or the Massachusetts Medical Society Journal, occasionally see some of that local sort of gossipy sounding stuff about policy and you know, it's sort of localized. So the fact that the Lancet is the one that's sort of sponsoring this already tells you that this is coming out like an English thought, right? And to me, I'm not sure about that because if we're talking global stuff, do we really need that perspective? Maybe we should be thinking of an African journal or a Middle Eastern journal or just something from a different perspective, right? But I don't know how the Lancet got picked. I don't know why they're the one doing it but they say, welcome to the Lancet Global Burden of Disease Center, bringing together the most comprehensive data analysis of worldwide trends in global health, all of a lot. So if you click on this global burden of disease and just before I go too far, let me just make sure that I understand, we understand what global burden of disease means, like what even burden of disease means. And I'm like an epidemiologist when this customer came to me, I was like, let me Google this just to remind myself. So this is sort of the backstory on burden of disease. So a burden is like this heavy thing you have to carry in life. Like, oh, I dropped out of school and having to go back is such a burden. Like I didn't do that, but well, I kind of did with my PhD. But anyway, so that would be a burden to me to go back and try and get my PhD, okay? We didn't really care about burden in epidemiology and public health until like the 90s because what was happening is before that we had all this infectious disease. So it was acute and people were dying, blah, blah. We stopped that. In fact, we stopped a lot of heart attacks, but then we ended up with people who were post-heart attack who were like felt pretty beat up. 
And so we were saying, maybe living a long time after a heart attack where you're all beat up and just not feeling well, maybe that's not that cool. Maybe life expectancy isn't everything. So they started, and this was not just happening in the US, it was starting to happen in other countries where we had thought infectious disease was a big deal and we'd solved it, so we're all happy. And now we were realizing that this was turning into a problem. So we started having to worry about the global burden of disease, like living with diabetes, living post-heart attack and stuff. But it is really hard to measure that. Like it's pretty easy to measure births and deaths and all that, but how do you measure like living miserably, right? So that's when that message or that idea started coming out. And then, okay, so what happened was, I'm gonna click on this find out more about GBD. So that's just what global burden of disease means generically, but if you think about it, how would you actually do it? Like, so what happened was people came up with a set of metrics that you can calculate to calculate burden of disease, like these kind of algorithms to come up with a number. And I'll tell you, I'll show you them and I don't like those numbers, but they're necessary, okay? And I'm not gonna throw them out. Sorry, I had a dry throat here. My metaphor for those numbers, and I'll show you some of them, is that they're like the Akaike information criterion, or AIC, when you're comparing models. Excuse me. You wouldn't use the AIC alone. You wouldn't look at that number and say, this is a good or bad model, because on its own it's a meaningless number. What you do is use it to compare models you're fitting to see which one fits better. And that's how I would use these calculations about burden of disease and disability and stuff. It's to compare countries. I would not use it as meaning anything. Of course, if it's really, really high, you'd be scared. 
If it's really, really low, you'd probably think it was wrong, but I don't really use those numbers. Like I wouldn't wanna use them in a regression or anything like that. Like this is me personally. I know a lot of people just, their whole lives are based on those numbers. And I'm not throwing them away. I'm just saying that the reason, and I'll tell you the reason, the reason I don't like those equation numbers is because each variable that goes into them has so much measurement error in some countries and almost none in others. So they end up super biased. And that's why I just don't think that they're a good measurement. But I think they're great for comparing. And it's really hard to do that. So the first thing I wanna say is the fact that the Lancet's support of putting this together was really awesome because now they made it so these numbers, if you're an epidemiologist, you don't just have to go around to different websites and hope you find the data for these countries. They're sort of doing that work for us and pulling that together, right? I'm not sure where they got the money, but here's what I wanna say when you find out more about it. Global health data to drive change, improve lives. Global Burden of Disease study, the most comprehensive worldwide observational epidemiological study to date, and "to date" is a really important part. Led by the Institute for Health Metrics and Evaluation at the University of Washington, Seattle, USA. The GBD study offers a powerful resource to understand the changing health challenges facing people across the world in the 21st century. Okay, that tells you all you need to know, right? First of all, why is this UK journal working with a U.S. university study? Well, I don't know the answer to that, but I will tell you that the center has a reputation in the United States for being the center for COVID-19 predictions at the beginning of the epidemic, although in the U.S., we were not doing any testing for various reasons. 
When you do not have any data, predictions are going to be wrong and somebody might even say that that's unethical to predict anything, you might as well just roll the dice, but these guys jumped in on it and there's other things that IHME has done that I don't like. And so it didn't surprise me that I saw their name there. What IHME likes to do is get a whole bunch of data together and kind of control it so that they can get on papers and be part of things. This in and of itself, there's nothing wrong with that. There are other data centers. There's like data center at UT Houston that did WHI. There's the Coordinating Center for Biometric Research, CCBR at the University of Minnesota, which is awesome. There's Ann Arbor, ICPSR, and the stuff they do there. And these are great epidemiologic data centers. So they're not the only one. And if you're like writing a grant and you're like, okay, what if I don't wanna work with them? That's fine, you can work with someone else. But they're just kind of known for trying to get data. And I don't mind that, but I don't like the way they serve it up. And I've known them like since the 90s. I think they were called something different than 90s. And I'll always try to request data from them. It's a pain. And then when I get it, it's like not in good shape. Like I don't, I can't understand it. And because I'm a person who runs data centers, I'm like really picky about those things. So I feel like this was more like a power play from IHME with the Lancet. That being said, they have done an amazing thing, which is they put these data together. And so that allows us as consumers to be able to have access to it. They put it out on the web. They made it so we can get at it. So it's like such a mood swing to be in epidemiology. Because on one hand, you have these infrastructures led by like WHO kind of stuff happening that's led by these westernized places and they're putting it up and they're speaking for us. 
But yet they're giving us access. So if we empower ourselves, we can go and just get the data and just kind of do with it what we all want to do with it and just do our best with what they're doing for us because they're doing kind of a big service. So it's kind of like this mixed bag in epidemiology. It always is. And so, but anyway, so that's kind of the background with the the Lancet and IHME. Now, the next thing I'm gonna do is just go to their interface. So let me change the, I don't see anybody in the chat. So let me share, okay. So here is their query interface, right? And this is what I linked you to. And first, let me just say here, this is like, if you go here, you get to the IHME data set. They call this GHDX and I'm not really sure exactly why. They call it global health data exchange but it's not really an exchange. Actually, let me show you something from my book. Let me do this here. I pulled some images from my book that I wanna share with you just conceptually about data warehousing because I wanna make sure that you understand some principles before we look at this actual data warehouse. I'm sorry, somebody was texting me. So in chapter four, what I do is I create a diagram of a conceptual thing that you do when you run a data repository warehouse like this global health data exchange. So let's say that I'm a country and the global health data exchange goes to me and says, can you give us our data to put in this interface? I would say, okay, yeah, I'm the provider, I'm giving you a raw data set. So they, let's say I gave it to them in a CSV format. They would use SAS code and they converted to SAS because I'm pretty sure this back end is in SAS. So now it would be a SAS data set. Now, they would remove unneeded columns but they probably wouldn't have any unneeded columns. If I did a bad job supplying it, I might put some columns on that they don't need. 
The reason why that's usually your first ETL step is you wanna make your data as skinny as possible for the other steps. So then the next one is remove unneeded rows if there are any unneeded rows. Then the next step is you add columns. So I'm gonna give you an example. You know those calculations like disability-adjusted life years, or DALYs? That's one of those calculations I was talking about, and in order to make those calculations, they need the raw data. So that raw data comes in here to make the calculations, like maybe proportion of people who have diabetes is one of them, you know, they have these different numbers. They put it through this equation and then they get this column, they add a column. So what's important to me as a consumer of the data warehouse is I wanna know which of these are native variables that you got from the data provider and which of these are calculated or transformed variables that the warehouse made. And here's what's important about both sets of them. For the first set, it's what is this variable? Like where did it come from? You know, like for instance, if it's, I don't know, like a death rate, is it from the death index in that country or something like that? And here, if it's like disability-adjusted life years, DALYs, it's like, how did you calculate the DALYs? Because sometimes, you know, there's some debates about how you calculate things. It's like, what exact variables did you use to get that? So I need that in the documentation when I consume data from a data warehouse. So when I write the methods section of whatever paper I write, I know what to say here. So these are the columns added. This represents these columns being added by IHME to this. And then once they have that, they strip the unneeded columns. Like sometimes you need to add like kind of, you know, administrative columns as you're transforming it. 
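The ETL steps in that diagram (drop unneeded columns, drop unneeded rows, add calculated columns, strip the administrative columns, then load) can be sketched roughly like this. This is only an illustration in Python, not what IHME actually runs: the column names, the year cutoff, the `_row_ok` helper column, and all the numbers are made up, and the DALY column uses the textbook definition (years of life lost plus years lived with disability), which is much simpler than a real burden calculation. The data dictionary at the bottom is the piece a consumer would want, saying which variables are native and which are calculated.

```python
# Hypothetical provider rows; names and values are invented for illustration.
RAW_ROWS = [
    {"country": "A", "year": 2019, "yll": 1200.0, "yld": 800.0, "notes": "draft"},
    {"country": "A", "year": 1985, "yll": 1500.0, "yld": 700.0, "notes": ""},
    {"country": "B", "year": 2019, "yll": 300.0, "yld": 450.0, "notes": ""},
]

NEEDED_COLUMNS = ("country", "year", "yll", "yld")
FIRST_YEAR_KEPT = 1990  # assumed cutoff, only to demonstrate row filtering


def run_etl(raw_rows):
    """Apply the transform steps in order and return the rows ready to load."""
    loaded = []
    for raw in raw_rows:
        # Step 1: remove unneeded columns so later steps work on skinny data.
        row = {name: raw[name] for name in NEEDED_COLUMNS}
        # Step 2: remove unneeded rows.
        if row["year"] < FIRST_YEAR_KEPT:
            continue
        # Step 3: add columns. "_row_ok" is an administrative helper column,
        # "daly" is a calculated variable (simplified: DALY = YLL + YLD).
        row["_row_ok"] = row["yll"] >= 0 and row["yld"] >= 0
        if row["_row_ok"]:
            row["daly"] = row["yll"] + row["yld"]
        else:
            row["daly"] = None
        # Step 4: strip the administrative column before loading.
        del row["_row_ok"]
        loaded.append(row)
    return loaded


# Documentation a consumer needs: native versus calculated variables.
DATA_DICTIONARY = {
    "country": {"kind": "native", "source": "provider file"},
    "year": {"kind": "native", "source": "provider file"},
    "yll": {"kind": "native", "source": "provider file"},
    "yld": {"kind": "native", "source": "provider file"},
    "daly": {"kind": "calculated", "formula": "yll + yld"},
}

if __name__ == "__main__":
    for loaded_row in run_etl(RAW_ROWS):
        print(loaded_row)
```

The point of keeping `DATA_DICTIONARY` next to the transform is that the methods section of a paper can be written straight from it.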
And then you put it, you load it into the format, right? And so now we can go back here and see, oh, great. Now we got, sorry about that. We got a Zoom bomber. I don't get anybody watching, but I get a Zoom bomber, wonderful. All right, so I'm gonna stop sharing that. And I'm gonna go and share our query interface again. All right, so let's see here. So what is important to know is that, what that diagram I just showed you, data and apparently 2019 last time, which is like now, you know, two years ago at least, was given from all of these different countries to these people and they did that ETL. And now I'm here, they put in this query interface and now I'm trying to query it. So I'm trying to figure this out, right? Because my customer is saying, I wanna compare countries. I want these metrics. How do we get them out? And you know, I know in my head that they did what I just showed you in the diagram. So I'm like, well, can we download the data set? Can we download the data set that they loaded? You know what I mean? Is it that simple? Can I just download that exact data set? So that was sort of the first thing I was looking for. So up here, you'll see there's a menu. There's some menus here. There's a few things that are important here, but they don't relate to this query tool. And you'll see these breadcrumbs here. This is very confusing. This is a set of tabs, and this is just a template. It's got a few little pieces of information here. It's sort of randomly. I think they just keep adding tabs or whatever. But, and you can read this if you want. But here is what is sort of important is this is the query panel down here. And this is where the results show up. So you basically fill out this query panel and you click search, okay? And this is not that easy. So, but I'm sort of telling you, this stuff up here is not part of the query panel. This is part of, oh, and by the way, there's like this thing I'm telling you right now. 
I couldn't find any like educational resources or anything to explain this. I was just messing around and experimenting to figure it out. So this query panel is pretty complicated. But I'll show you how you can fill it out and hit search and it will post results below here. Now, as you can see, these results look weird already. Like what does Val mean? What does year mean? Like this looks odd. Then you'll see that there is a visualization panel down here and I guess you can toggle. So this whole panel looks is separated and you can set parameters on it. But I also want to call attention to the fact that it appears that there's a visualization tool in here somewhere. But I'm not showing you that today because I haven't had a chance to look at that. So first of all, let's see here. I'm surprised there's really nobody in the chat. Yeah, I guess there's nobody in the chat. Nobody's asking any questions. But anyway, okay, so let's go back here. So the first thing I want to point out, this is the query tool up here. I'll just show you each of these because they're super complicated. So we have base here and there's either single or change. And I didn't even get to the point of using change because the format of the results below are so complicated that what I would recommend is if anybody actually wants to calculate change, like for example, let's say they had multiple years, like 2017, 2018, 2019. And let's say some rate in 2017 was 50%. And that same rate in 2017 was like, or whatever, the year later was like 60%. And then you want to see what is the percent change? Just calculate it yourself, I would suggest, because it's really hard to tell what's going on in here. So the single then you should be able to get each one, like the 50 and the 60%. All right, so just leave base alone on single, I would say. Unless you get good at this tool and you think you can know what you're doing. Then the next thing is context. 
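Calculating the change yourself from two single-year values is short enough to sketch. This assumes you already pulled the two values out of the tool; the function names are made up for illustration. It also separates the two things people mean by "change" when the values are percentages: the difference in percentage points and the relative percent change.

```python
def percentage_point_change(old_pct, new_pct):
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_percent_change(old_value, new_value):
    """Change relative to the starting value, expressed as a percent."""
    if old_value == 0:
        raise ValueError("relative change is undefined when the old value is 0")
    return (new_value - old_value) * 100.0 / old_value


if __name__ == "__main__":
    # The example from the stream: a rate of 50% one year and 60% a year later.
    print(percentage_point_change(50.0, 60.0))   # 10.0 percentage points
    print(relative_percent_change(50.0, 60.0))   # 20.0 percent
```

So going from 50% to 60% is a 10 percentage point increase but a 20% relative increase, and it is worth saying which one you are reporting.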
Now, when I click on this, you will see it's very confusing. Some of the things look pretty obvious, like all cause mortality, like let's click on all cause mortality. See how this form changed? So what I realized is most epidemiologists, including my customer, were only interested in cause. And cause is the one where you're looking at like rate of diabetes or rate of kidney disease or something. These other ones, it's like sort of unclear, like this injury, you're gonna get different things. So it makes you feel like, well, wait a second here. These are basically switching which data set they're querying. So remember how I showed you the ETL thing? At the end of that, you get a certain data set with certain rows and certain columns. It's specified a certain way. Well, what it looks like is they're querying different data sets that are federated, they're not interconnected like in a warehouse format. So again, that is confusing. So if you're somebody who decides you're gonna use the impairment one or whatever, I again encourage you to sit down with that and study that. So what I did with my customer, first of all, I just limited it to just single, we're gonna just handle that. And we're gonna just study, how do you query these causes for each country? So now what it looked like is we could set parameters, and location makes some sense, right? So we open this and we see this dropdown. And it's a very interesting dropdown. Like it says, select only countries and territories, select only GBD regions, select only. But where you can really actually select anything is under these boxes, right? And if you select any of these, it looks like you're getting a group, but you're not really sure who's in the group, right? Then you come down here and if you select this, Central Europe, whatever, you're getting a group and you can tell what's in the group because there's a hierarchy. 
So there's this and then there's Central Asia, but within Central Asia is Armenia. And then within Central Europe. So for example, let's say that I was doing some analysis of Central Europe and I wanted to group these countries into a few different groups, right? Maybe by region. This query tool would not allow me to really do that. I can choose Central Europe as a grouping, but if I want Albania, Bosnia and Herzegovina, and Bulgaria for some reason in a group together and Croatia, Czechia and Hungary in a group together, it's gonna put them separately out here, right? So this is a classification issue, right? Because you can kind of imagine we might wanna do that kind of a classification. So, and actually let me show you, let me make sure there's nobody Zoom bombing me. Let me show you some issues that I tackle in my book. So we're back to the slides. So first of all, I wanna talk about this continuum. Over here on the left side is a data lake. So what's a data lake? Well, you could see it as a data repository, or basically it's a bunch of, like the thing I just showed you, when you get a final transformed data set in SAS, imagine you just set it on a server and people who have access to the server can use it read-only to make data sets. That's a data lake. They're all just sitting there, okay? And there's not a lot of processing that's been done. So the users are raw data analysts, people who are gonna make analytic data sets. So a data lake over here has a higher risk to privacy and confidentiality. There are fewer individuals using it but they need a high level of vetting and there's more of an honor system. The documentation must be very comprehensive and accessible because they've just got data sets sitting there. And the results from the analyses tend to be unique findings from focused investigations, right? 
So their primary support needs is raw code files, like code that you can use to load those data sets into SAS and analyze them and stuff and formatting, labeling and also the original documentation files for the source data sets. Like some of that data comes from the census, you know, from these different countries, that's what you need, okay? Now there's a continuum and this is on one side of the continuum is having a document, kind of like a data lake with documentation. We're over here totally on the other side at an app-based warehouse interface, okay? In here are things in between. Like if you work at a place, like I often give the example of Cognos. Cognos is an IBM implementation that you can put on like a star schema database. It is like a query tool, but it's not quite as locked down. Like people in Cognos usually have more access to the underlying data. So it's sort of in here. But I made this continuum because I wanted to talk about the user needs, right? Because these users at the data lake need this. Now us, we're over here, this is what we need. Because first of all, this is very low risk to privacy and confidentiality as you can see, right? Many individuals requiring basic vetting like us. Little reliance on honor system, lower documentation requires, but results from analysis said to be metrics from routine monitoring of a system that provide evidence on which the base system decision about system performance. Doesn't that sound like a country? Right, here's the application support. Manuals on how to use the application, right? Like there, I couldn't find any. Training materials for application helped us research, okay? I could not find any. Warehouse documentation, documentation to explain how the source variables resulted in what is served up in the warehouse. That, there's some of that and I'll show it to you, okay? 
I haven't gotten really to it yet, but except for that little panel at the top, but there really is missing just a lot of what you would need if you were gonna, you know, like actually use this for, you know, actually use this application. So why I wrote my book is I've seen so many SAS data warehouses where they really just don't think about the user or the application or the business proposition or anything before they make the warehouse. And then nobody uses it because it's too hard to use. And, you know, people are coming to me saying why isn't anybody using it? And I'm saying because it's too hard to use. And then they'll say, well, I'll say what do your users want? And, you know, it just needs to be redesigned. Like I don't know if they did any user studies or anything. So here we have a cause. So let's say that we're gonna compare maybe two countries in the Middle East, it's not logical, okay? So let's compare, let's see here, United Arab Emirates, okay? Well, first what I told, well, let's do that and let's do another small kind of, let's do like Qatar and Bahrain. A few small countries, right? Now, you can already see this is a very awkward interface. Like I can't see what I picked here, you know, I'll cut you the chase is why I think that is, okay, year. So I have all these years, but well, I'm gonna show you what happens if you pick just like the most recent three years. You know, it seems like it should be reasonable, right? Now cause, that's what we're looking at. Age, here again, we have this grouping issue. So let's just leave it on all ages. And metric, here we have number, percent and rate. And to me, this is kind of hard to answer. Let's pick our cause. So this is where my customer and I had a lot of trouble because of the way these are classified. You know, if you want, like we were looking under, I forgot what it was, something like cardiovascular disease. And we, you know, like here, liver cancer, liver cancer due to these specific diseases. 
What if you want just a few of these? Well, you're gonna get a separate metric for each of them. It's not usable. It's not very useful. So let's pick something that's a little simple, as simple as we can. Let's just pick one. And let's do, I wanted to do hypertension, but let's do hypertensive heart disease. I'm not really sure. Oh, it says two here. Oh, total causes. Let's just do heart. See that here? And so the measure is gonna be, see this is where you get these years with disability, you know, and this death. What I'm gonna wanna know is prevalence, right? The prevalence of hypertension. I'm trying to keep this simple. And now we can go back and look at the metric. And I feel like number isn't, maybe percent or rate, or should they be the same? I kind of don't know what to expect, right? Let me see if it is, no, looks like it's fine. So let me go back here. So now we've set a query, you know, single, we've got our three Middle Eastern countries, little ones. We got the three years in a row we wanna look at. All ages, we've got percent and rate. We've got this hypertension thing. We've got all sexes. And the measure is prevalence, like how common it is. So now I'm gonna go ahead, search. And what I wanted you to notice is this is not that complex of a query, but that took a kinda long time. All right, let's look at our query results here. Okay. Measure is prevalence, location is UAE, both sexes. Here we have the, oh, here we, this is 2018, 2017. I guess we don't have 2019 in here. Oh, yeah, it's down here. So this is very difficult, right? Like here's the value. The percent is 0.19 and the rate is 182. I look somewhere else. I think it's below here. See, you can do like this, right? And they're showing you this, right? UAE, both sexes, all ages, hypertensive heart disease. And this says 100 per 100K, right? And here's Qatar and Bahrain. This output is very difficult. 
Like, I mean, you could download the CSV, but just think about the data structure, right? Like, what is the entity here? You know, what are the attributes? This is not in the structure. And there's sort of this confidence interval, you know? You know, most data scientists would want like columns that said, you know, I don't know, country. And then you fill it in with Qatar or Bahrain or whatever. But you see what I'm saying. So this is not exactly the analysis my customer was trying to do, but it was similar. It was going to be something like that. And she wanted to compare like pediatric and, so what is the solution to something like that? Like I'm querying this. Now, realize that my customer actually has a relationship. Her organization has a relationship with this place, I'm not sure exactly why. I think that they were the data providers, like she works at the place that provided the data. And so she's from the firm that was the origination of one of those data sets that went through that ETL. So you'd think that they would just turn around and give them back a data set or give them, I don't know, just like when I ran a place that was a data lake. It was basically what I described in the data lake. And, you know, I'd say, well, what do you need? You know, and I'd make special data sets just for people's analytics. And I'd do data curation. For those of you who take my data curation course, I'd create a data dictionary that communicated to them which were the native variables and where I got them and what the source data sets were, and what were the calculated variables and how I calculated them. So they could go on and just do their analysis with the little data set I made. Well, the question is who's gonna do that for my customer out of those data? 
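To make the structure point concrete, here is a minimal sketch of reshaping a downloaded results file so that the entity is a country-year and each measure and metric combination becomes its own column. Everything here is an assumption for illustration: the column names (`measure`, `location`, `metric`, `year`, `val`, and so on) are guesses at what a download might contain, not the documented export format, and the values are placeholders, not real GBD numbers.

```python
import csv
import io

# Placeholder rows standing in for a downloaded CSV: one row per
# location, year, and metric. Names and values are invented.
SAMPLE_CSV = "\n".join([
    "measure,location,sex,age,cause,metric,year,val,lower,upper",
    "Prevalence,Country A,Both,All ages,Example cause,Percent,2018,0.10,0.08,0.12",
    "Prevalence,Country A,Both,All ages,Example cause,Rate,2018,100.0,90.0,110.0",
    "Prevalence,Country B,Both,All ages,Example cause,Percent,2018,0.20,0.15,0.25",
    "Prevalence,Country B,Both,All ages,Example cause,Rate,2018,200.0,180.0,220.0",
])


def to_country_year_table(csv_text):
    """Pivot long-format results into one row per (country, year)."""
    table = {}
    for record in csv.DictReader(io.StringIO(csv_text)):
        key = (record["location"], int(record["year"]))
        row = table.setdefault(key, {"country": key[0], "year": key[1]})
        # For example "prevalence_percent" or "prevalence_rate".
        column = record["measure"].lower() + "_" + record["metric"].lower()
        row[column] = float(record["val"])
    return [table[key] for key in sorted(table)]


if __name__ == "__main__":
    for wide_row in to_country_year_table(SAMPLE_CSV):
        print(wide_row)
```

With the country as a real column, comparing one country to another, or one year to the next, is an ordinary filter instead of a hunt through stacked result blocks. The confidence bounds are dropped here to keep the sketch short; they could be carried along as extra columns the same way.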
Who's gonna make that little data set so we can analyze, you know, is Bahrain more hypertensive than, you know, UAE, and is it getting worse or whatever, right? Well, you're gonna laugh, but what I said is we're gonna have to go through a pretty big process to figure out what's the best way to get the data out of here. And what I started with was spreadsheets. I was telling her, just set one country and one metric and one whatever, and then put it in a spreadsheet. And then at some point we're maybe gonna abstract data into the right format, like country-level formats. But then we're pretty much making our own little data warehouse, which defeats the purpose of the data warehouse that's there. So why did that happen? Why did all this money go to make this data warehouse that's impossible to query and impossible to get data out of? Well, this is actually kind of a hard question to answer, right? You know, like, how did nobody notice this was gonna be a huge problem? And you understand that whoever funds this wants people to use it. They don't want people to not use it. They want them to publish and they want them to do a good job using the data. So it's, like, not serving anybody. Why did it happen? Well, I will share with you a few thoughts on why it happened. And like I showed you, there's one way to get around it. We'll go back to my slide presentation. See, so I'm describing, when you have a data warehouse or a data lake, why not just go get the raw data? So why don't I just call up UAE and ask for the raw data on their hypertension? There's a lot of reasons. One is that it's a pain, right? Like, even if the UAE is really nice and gives me data, I don't know if it's the same as the data that I'm getting from Bahrain, right? And so doing the service of pulling together standardized data from different entities is really wonderful, whether you're a data lake or you're a data warehouse application. It's really nice that you're doing that.
But once you do that, all of the variables that you add that are not native variables should help with any analytics that are gonna be done by your users. The data warehouse's added value is that it provides analytic, visualization, and other tools. It provides application support and documentation support. Well, the problem is they're falling down on this in the GHDx, right? This is a pretty complicated diagram, but it shows you, like, cloud services here. Here is where we are, we're out here. You can't really look at this without seeing dollar signs everywhere. Look, each of these analysts costs money. All this data is eventually served up to private organizations, governments. Like, we're over here, we're one of these people. And this was really expensive. Now we're gonna have to do a whole lot of post-processing to be able to use that, okay? So you're probably like, well, why didn't they make an interface that was a little easier to use? Well, there's a few reasons. One is I do not think they read my book. If you read my book, especially if you read chapter 10, it gives advice, even if you're not using SAS. I mean, you're probably using SAS if you have a big data back end and it's epidemiology. You're probably using Snowflake to store it and SAS to do processing, because you kind of need SAS, right? But even if you're doing R integration or SQL integration or something like that, please read chapter 10 of my book, because it just tells you how to conceptually think about your users. Who are your users? In the case of this global burden of disease, we're users out here, but these are really the users, right? I'm an individual, but some of you might be at governments. My customer, I think, I don't know, sometimes she's at the government, sometimes she's at a private organization that's like a non-government organization helping the government. This is all about, like, health, right?
And so if you're gonna do this, you have to study those people's needs before you serve up the data, so they can use it right away and not have to make a spreadsheet and abstract the data out of your database, right? And so that's the first thing. The number one thing I would say to you before you build one of these things is figure out who your users are and what they're trying to do with the data, or what you think they'll try to do with the data, and just do a little survey for market research. And if you're like, I don't know how to do that, just contact me, like, I'll help you. I'll help you set up your data warehouse such that people, like, use it. Every time I've designed a data warehouse, people love it, because I'm a good designer. I make it so whatever user bases are using it, they find something for themselves. They enjoy it. They have a good user experience. I had a very stressful user experience and so did my customer. She had to come to me with this interface. So that's the first thing: I don't think market research was done. I don't think this group has a really clear idea of their users. The second thing is, I'm pretty sure the back end of this thing is SAS. And the problem with that is that SAS is not a good back end for serving up web-based front ends, right? Like, anybody who is listening to this now is probably thinking, isn't R the big elephant in the room? Like, couldn't all of this have been done better in R, which is by the way open source, or Python? I don't know Python, but I know it could do this better than whatever this is. So this comes with another issue of bias, right? So remember how I said there was bias in the Lancet doing this and the IHME doing this? Like, who picked them? I didn't pick them. The Middle Eastern countries picked them? Who picked them, right? And so their perspective then is gonna be cast all over this global burden of disease database, right?
And then the second thing is this software was picked and the software seemed to be driving the interface. And I like SAS. I mean, I just told you to learn it. I just told you to use it. I just told you it's awesome. But I want to use SAS. I don't wanna use this crappy interface. In SAS, you can use this thing called SAS/ACCESS, which allows you to open an ODBC connection to a back end, like a SQL back end, or even another server back end, and just tunnel through and go get your data. You can build views using PROC SQL in SAS. So if you build me a view in your interface and I go in through connect and I hit the view and I just pull a SAS dataset from you, this is the way I would suggest doing things. Because, like, think about it, this place has just zillions of rows of data. But I just queried. I only wanted maybe 20 rows of data, like UAE, whatever. So, you know, this is kind of how the US Census does their interface. You sort of click, you sort of reduce it. It's sort of like Crystal reporting, but SAS isn't native to that kind of web implementation. So you have to do design and you have to do R integration or something, and it just didn't happen with this one. And so as much as, like, SAS is cool and all that, please, for front-end stuff, always use your design thinking, where you get the best tool for what you're trying to do. And if it means you wanna integrate another software, Tableau or something for visualization, or something that helps you with query and download, you know, just do that. Don't just do the whole thing in SAS because, I don't know, because you like SAS or whatever. SAS is great for a lot of things, but not everything. So let me see here. So that was that. Now I wanted to show you, I think nobody's here or one person's here. Hi. Or at least in the counter.
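Here is a minimal sketch of the "build me a view and let me pull just my rows" idea. It is written in Python with an in-memory SQLite database as a stand-in, not SAS/ACCESS or ODBC, and the table name `estimates`, the view name `prevalence_percent`, and all values are hypothetical placeholders. The point is only that the provider defines the view once and the analyst pulls a couple dozen rows instead of the whole table.

```python
import sqlite3


def build_demo_db():
    """Create a stand-in warehouse table plus a provider-defined view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE estimates ("
        "country TEXT, year INTEGER, cause TEXT, "
        "measure TEXT, metric TEXT, val REAL)"
    )
    # Placeholder rows, not real estimates.
    conn.executemany(
        "INSERT INTO estimates VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("Country A", 2018, "Cardiovascular diseases", "Prevalence", "Percent", 3.3),
            ("Country A", 2019, "Cardiovascular diseases", "Prevalence", "Percent", 3.5),
            ("Country A", 2019, "Cardiovascular diseases", "Deaths", "Rate", 120.0),
            ("Country B", 2019, "Cardiovascular diseases", "Prevalence", "Percent", 4.4),
        ],
    )
    # The view hides measure/metric filtering from the analyst.
    conn.execute(
        "CREATE VIEW prevalence_percent AS "
        "SELECT country, year, cause, val FROM estimates "
        "WHERE measure = 'Prevalence' AND metric = 'Percent'"
    )
    return conn


def pull(conn, countries, years):
    """Fetch only the requested countries and years from the view."""
    country_marks = ",".join("?" * len(countries))
    year_marks = ",".join("?" * len(years))
    query = (
        "SELECT country, year, cause, val FROM prevalence_percent "
        f"WHERE country IN ({country_marks}) AND year IN ({year_marks}) "
        "ORDER BY country, year"
    )
    return conn.execute(query, [*countries, *years]).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in pull(db, ["Country A"], [2018, 2019]):
        print(row)
```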
I wanted to show you, what if you're like, okay, Monica, I'm gonna do it. I'm gonna explore this. I'm gonna make a spreadsheet and record my data. You could make really good time series images. In fact, that was the assignment I gave this customer. I said, you know, now that we fought with this, just see if you can get a little data and make a little graph out of it, just to get going. Now realize the customer has a relationship with this group and has been given permission to write a paper and whatever. If you just go get data from querying this and try to write a paper, you probably shouldn't do that without contacting them. They have some policies around using this data in papers. I mean, obviously if you're a student and you want to use this in some project at school, it's totally cool. Just make sure you reference it. You know, or you're just a data scientist or you're in healthcare and you just want to practice with the tool. It's definitely there for that. But if you actually want to do what my customer is doing, which is write a peer-reviewed article, they have strict rules and they want to be an author of the article, even though I don't know how they're going to help us, because it's not that easy to put in a data request. And that's kind of the next thing I'm going to tell you. Remember how I said I ran a data lake, and because I'm like the motherly type or sisterly type, I'd be like, oh honey, I'll cook you up a data set? Well, we had highly intelligent SAS analysts who were consultants, who were highly vetted, who were on the server. They were working with some consultants and they would just put together their own analytic data sets. They didn't need my help. I just made sure our curation was up to date and they'd just do it themselves, you know? So those were like high-end users. So to serve the high-end users, I made sure our curation was up to date, right? Like, they helped me. They're really cool people.
They were really amazing SAS analysts. But I started going, well, where's the curation for this? Like, let's imagine that somehow my customer and I actually write this paper. Where am I going to say each of our data points came from? And so I'm going to share with you, I found out where it is. So I'm sharing a website with you. This is where it is. You'll notice this is the same menu as you were in before. If you click here, that gets you here, okay? Oh, here's where this visualizations tool is. I'm not really sure what that is. I didn't go on it. What I was just demonstrating to you is the GBD results tool, okay? It says here what it includes, but we could kind of tell that. What it doesn't say is how they were calculated. And then it says, like, data input sources tool. I don't know what that is. I tried all of these things and I couldn't figure them out. The best I could do was this. This is their data curation and it is really difficult to use. And what I realized is each of these lines, like, see this line? This is one data set. This is another data set. Remember at the beginning where I said you could change that thing from cause to, like, disability, whatever, and it would change the form? Well, that's why it would change the form: because the data set would have different columns and there's different things. So, as you know, I was working with the customer looking into hypertension. Well, it wasn't hypertension, but it was something like hypertension. So I was trying, like, this one says demographics. This one says disease and disease burden. So there is this data release information sheet, table index, and select article tables, and I could not make head or tail of any of these things. I just couldn't make head or tail of them. They didn't make any sense to me. See how there's different curation documents about each of them? I just couldn't make sense of it. And so I said to her, you know, why don't we back into them?
Why don't we first do our analysis and then I'll figure out where we got our data from and which one of these apply. And since she has a relationship with them, I figured I could go back and, you know, make her ask them questions. You know, so when we're writing the paper, I can make sure to write the paper properly. But these people did not take my data curation class, the course on LinkedIn Learning, because if they had, they would have made data dictionaries for each of these. If they had made data dictionaries for them, there would be a data dictionary under each one. And I'm not the only one that does this. There are a lot of online repositories where you can't even query them. Like, you have to get permission to get the data, but they'll put the data dictionary online so you can use it like a sushi menu, order up the data you want, and they'll make you a data set like I used to do for my customers at the data lake. And so I'm still left with a lot of questions as to how this got here, or even just what they really envision for people using it. Like, how did they envision people would use it? And it all gets back to IHME, because if you work with a traditional data center like CCBR at the University of Minnesota, well, I'll speak for them even though I didn't work there ever, but I worked next to them and I worked on studies where they were the data coordinating center. So I know how they run, how they roll, how they rock and roll. And what they would do is, for each of these big studies, they would have a steering committee. So if the global burden of disease people were doing it like they did, there would be a group of people representing maybe multiple countries, you know, somebody from the Middle East, somebody from the UK probably, whatever. And they would be the steering committee and they'd be the people who decided who gets what data, right?
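As an illustration of the "sushi menu" idea, here is a minimal sketch in Python of a data dictionary that records, for each variable, whether it is native or calculated, where it came from, and how it was derived, plus a small function for ordering variables off that menu. Every variable name, source name, and derivation below is a hypothetical placeholder, not anything published by the project being demonstrated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    kind: str            # "native" or "calculated"
    source: str          # source data set the variable came from
    derivation: str = "" # formula, filled in for calculated variables


# Hypothetical dictionary entries for illustration only.
DICTIONARY = [
    Variable("country", "native", "hypothetical_registry"),
    Variable("year", "native", "hypothetical_registry"),
    Variable("cases", "native", "hypothetical_registry"),
    Variable("population", "native", "hypothetical_census"),
    Variable(
        "prevalence_pct",
        "calculated",
        "derived",
        "100 * cases / population",
    ),
]


def order_variables(requested):
    """Look up requested variables; fail loudly if any is not on the menu."""
    by_name = {v.name: v for v in DICTIONARY}
    missing = [name for name in requested if name not in by_name]
    if missing:
        raise KeyError(f"not in data dictionary: {missing}")
    return [by_name[name] for name in requested]


if __name__ == "__main__":
    for var in order_variables(["country", "year", "prevalence_pct"]):
        print(var.name, var.kind, var.source, var.derivation)
```

With something like this published under each data set, a paper can state where each data point came from and how each calculated value was produced.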
And they might have a subcommittee that deals with data requests. But I was never really part of these committees. I was always part of the people who worked to support these committees. So you'd get people applying to get the data and I'd talk to them, I'm like, this is a terrible application. You've got to write this and that. You know, this is what the committee's looking for. And I'd help them with their application. And then sometimes the committees want to have questions about it, because you don't want to approve using the data unless you know these people are going to publish, right? And if they don't have any money or they're kind of not really sure what they're going to do, they won't. So not unlike them, IHME likes to see proposals from people, but what's not clear is how you're supposed to explain getting the data for your project. Because when I would write proposals, what I would do is download the data dictionaries and I'd talk to the people and be like, okay, here's what we're going to do: I'm going to say from the data dictionaries what data I want, what rows and columns, and then they're going to give me just those, and they're de-identified or something. And then I'd write that in the proposal. I'd literally write that, because I interacted with them and that's what we were going to do, and that would meet the policies of whatever project I was on. I know these people have policies, but their policy somehow doesn't get into how data is going to be transferred from them to the researcher. Like, what format, when, how is the researcher supposed to steward it? Like, can they share it? I mean, I realize this is data about countries that's public. So it's not like I can steal the data and go open a bunch of credit cards, you know? No, I'll just steal the data and guess what?
Do a bunch of analysis, which is kind of like, those of you who know my intern, you know, we just published our chapter on a redesign of a public dashboard, where she went in and scraped the data off that dashboard because she says, this thing's ugly, let's make a better one. You know, I haven't shown her this. Don't show her this, please. Nobody show my intern this. But she thought that dashboard wasn't very good. I mean, I agree with her. She scraped the data out, put it in a nicer format, and we can user test it now, you know? Like, if I user tested this, I wouldn't even waste any money on user testing, it fails at everything. Even when I was working with that customer, actually let me see if I can get it to do it. I was working with the customer and I was doing these serial queries with it, because we were looking at these. So let's say that I go back here and I'm just gonna, let's pick on Qatar, poor Qatar. Okay, how do I get rid of the other one? Where's the other one? What else did I pick? I don't remember the other one I picked. Okay, already, like, F. Select all, select only countries. I don't even know how to, okay, global. No, I don't want that. How do I unselect? I can't see the other one I selected. Oh my God, this is so annoying. Oh look, there's all the states in there. Why did they put the states in there? I guess this is all under the Middle East. I checked two under the Middle East. Yeah, here they are. Okay, Bahrain. Okay, that was a fail. Okay, so this time we're gonna pick one year. I'm gonna pick the most recent year. Okay, we're just gonna simplify this. We're gonna pick percent with this hypertension. Okay, so I've done that. Now I'm gonna hit search. Okay, that's a fail. Okay, so here we have both, and the prevalence value is 0.05%, which seems kind of low for hypertension, doesn't it? Well, let me go back here and, I guess if we do cardiovascular diseases, let's change it and do that.
Okay, 3.5% of people in Qatar have cardiovascular disease. Have you ever been to Qatar? Okay, this is not right. I'm sorry, man. Like, that is not right. All right, so I hate to say it, but 3.5% of anybody anywhere, it's not that low anywhere. Just to give you an idea, it's usually about 10%, 14%, maybe it's 8% in Middle Eastern countries. So I just know this is not right, right? And they have this upper and lower confidence interval here, but how can you believe this? It doesn't seem right. Okay, but let's say I was gonna collect data, I wanna compare. This was 2019, we see this as 3.5. Let's do 2018, just change it, and then we do our search. Again, this just takes too long. So here we have this 3.3. Now, okay, this worked. So one of the things I did was I was doing some of this. Now let's change Qatar to Oman, because this is pretty much what we would need to do, and then hit search again, if you're trying to get data out of this and use it in anything, right? I mean, I don't know how you could set this up to come out right where you downloaded it. You maybe could do this, like, maybe my intern would have the patience to set up everything they wanted, download the CSV, bring this really badly formatted CSV into SAS, and spend the rest of their life putting it in some format you could use. What we were having trouble with is doing this over and over again. And actually, let me just, for an example, pick a big country like Saudi Arabia, where it might be hard, maybe not, might be hard to get the data. Because I have a feeling it's querying the original data. I see this 4.4. And I know from just analyzing Saudi data that it's like 10% or 15%. So this is also another reason why you have to do user studies, which I learned the hard way: if I'm taking other people's data, like when I was at the Army, I'm taking their data, putting it in my interface. Well, they know what their data says.
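The kind of gut check being done here can be written down as a simple plausibility filter. This is a minimal sketch in Python, assuming an expected range of 8 to 15 percent, which is only the rough expectation voiced in the session and not a reference value. The first three percents are the numbers read off the screen; the last row is a hypothetical in-range value added so both outcomes show up.

```python
# Rough expectation voiced in the session; an assumption, not a benchmark.
EXPECTED_LOW = 8.0
EXPECTED_HIGH = 15.0

OBSERVED = [
    {"label": "Qatar 2019", "percent": 3.5},
    {"label": "Qatar 2018", "percent": 3.3},
    {"label": "Saudi Arabia", "percent": 4.4},
    {"label": "hypothetical in-range value", "percent": 10.0},
]


def flag_implausible(rows, low=EXPECTED_LOW, high=EXPECTED_HIGH):
    """Return the rows whose percent falls outside the expected range."""
    return [row for row in rows if not (low <= row["percent"] <= high)]


if __name__ == "__main__":
    for row in flag_implausible(OBSERVED):
        print(f"check with the data owner: {row['label']} = {row['percent']}%")
```

A flag here does not prove the value is wrong. It may reflect a different case definition or denominator, which is exactly the kind of question the documentation and the data owners should be able to answer.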
They know how many people were in the Army and they know all of that stuff. And so if my numbers are off by an order of magnitude, they're gonna know, you know? So it's really important to be always interacting with your users if you're gonna have some big interface like this, where everybody's gonna use it. I mean, if you've got a mistake in your data, the whole world sees it. If you've got something where people are misinterpreting something because you used a weird word, you know, it just has a cascading effect. It's a huge deal. And so that's why, you know, I run my data warehouses kind of with an iron fist. I make a lot of policies. I demand a lot from every tier of worker on the database, you know, not like you work on weekends. I don't demand that. I just mean, like, you really are dedicated to really annoying things like data stewardship, like documentation. Like, I want people to just really care about these stupid little details, because they're not stupid. Basically, databases in research are just disasters waiting to happen. Like, they're naturally in a disaster state most of the time. They're naturally moving towards chaos constantly. So they're kind of like toddlers, you know? I don't have a toddler, I couldn't handle it. You have to be on top of that toddler. That toddler is getting into everything. Like, this idea that you could work with a toddler around, forget it, you could never do that, you know? And that's the way data are in research. You know, anything with data just wants to get everywhere and be messed up and cause people to fight and break and get breached and whatever. And so you just have to be constantly on it. Just like you're constantly on that toddler, you know, constantly redirecting them, constantly making sure they're not eating something sharp or whatever, you know?
And of course, if you put the toddler in a playpen before you do much, that's a good start. And so I always set up really strong policies and everybody understands what they are. They're kind of like an administrative playpen, because then you can play, right? You can play inside the playpen and have some fun, you know? And so what I've learned is that people actually do have fun at the data warehouse. You can have fun at the data warehouse. You can have fun doing the ETL back end and you can have fun playing with the interface. It's fun when you're well informed, you know what's going on, you know you're following the rules, you've got all these standards, everybody's got your back, everybody's speaking the same language, which is why I really emphasize data curation no matter what you're doing. Because if these people curated their data pretty well, but their interface kind of sucks, the curation can kind of paper over it. But their curation is also very challenging. So my bet is, IHME is sort of known for putting out data sets that people don't really use, you know, for these reasons. So it wouldn't surprise me if people don't really use this. But now I did this live stream and I promoted it, and I can show you there are ways of getting over that interface and actually getting some very useful data that was hard to put together, that we should probably use for looking at burden of disease, and we really ought to. All right, well, I guess I didn't see anybody at the live stream except for the Zoom bomber. I guess I should call them a YouTube bomber. And I didn't get any questions, but hopefully whoever watches this now knows how to use this database and knows about it. And maybe you can actually use it for your data science projects, just practicing with it, or maybe you'll even someday write a peer-reviewed article. All right, well, thanks to those of you who are watching.
If you're watching, like, a recording of this, or you did show up, you know, I'd really like it if you watch my other videos, you know, find something you like. This was more of an instructional thing. Sometimes I have tutorials on how to use SAS and R. I have lectures about public health, you know, so check it out. And I'm trying to increase my subscribers. So if you like what you see, go ahead and subscribe. And then it's easier for you to find out. You'll see more about my live streams when I hold them, how I can help you. All right, well, if you watch this on the weekend, have a good weekend. Bye bye.