This video is part of the Public Health to Data Science rebrand program. All right, so welcome to our meetup tonight. Thank you for coming. Right now we're doing portfolio projects, and we're looking at the data we're going to use. I just met with Mika, and we were talking about what we're going to try to do, which apparently is kind of confounded right now. Beth, I'm going to meet with you, I think on Sunday, and we'll nail down what you're going to do. And Sakeeb hasn't really even started. So is it okay if we start with you, Beth? Why don't you tell us, first of all, what data you're working with, what data you think you want to use, and how it's going? Sure. I would like to use the readily available NWSS wastewater data from the CDC, since we're doing it for much smaller facilities at my work right now. So it would be a great data set to work with, to do different ecological studies and comparisons, different statistical tests I'd like to run, and visualizations. I'm also very new to the field of environmental science, so I'm just trying to get up to speed and learn as much as I can, and also do some practical things with statistical analysis and designs that I haven't done in a while. Can you tell me what kind of fields are in your data? There are a lot of things, and a lot of the processes I'm still learning, but it's a lot of data, and mostly the data fields are... of course, there's a lot of epi data. When you say epi data, what do you mean? Like epidemiological data. Like in wastewater? Yes: where it has been studied and what population has been studied, why those facilities or that group of people, and what that population is like, for example the vaccination status of the patients within that study group. Oh, I see.
So Beth is on a team of people who are doing analysis of data where what the scientists have done is sample wastewater. When you think about that, it's so different from sampling people, right? Like, I'm right next to a river: if you go sample from that river, it really matters what day it is, it really matters where in the river you got it, it matters how big your sample was. It's not even like taking blood from somebody and then analyzing it. So, if I hear you correctly, Beth, you have these rows of data that are about the samples, but they'll also patch on information about things like the weather and the location, and it sounds like they'll even patch on information about epidemiologic rates around the region. That's usually provided by the participants, so we're not collecting it directly. And then the lab data is mostly what kind of controls are being used, what kind of concentration methods they used, and the lots of different tests they run on one particular sample and its replicates, and tracking them. It sounds like a puzzle. It sounds like it's very hard to even get an analytic data set together. Yes, and that's what I've been working on; I've been tasked with creating an analytic data set right now, plus some analysis. And it has not been documented, and does not have a data dictionary of any sort. It does have one when it comes out of whatever software they use, but it's not... That's an interesting point, and I want to stop here to emphasize it. I started making data dictionaries just by hand, because I had problems and I needed to keep track of stuff. Later, I was around people who were running SQL databases. I said to them, oh, I need to make a data use agreement with you.
Do you have any data dictionaries for any of your SQL tables? And they're like, oh, that's easy, we automate that. Well, you can't really automate that. I mean, in SAS and SQL and a lot of programs you can automate a function that goes through and gets all the field names in a table, their types, and whatever. But that's really not the information you need; you could do that by hand. What you want to know is what the field meant, or where it came from, all this other stuff. That's human stuff. So that can be a source of miscommunication: when I've gone to people and said, can I have your data dictionary, I get the SQL barf. And then I come back to them like, I don't understand what any of these fields are. I mean, they've got kind of revealing field names, but I don't really know what they mean. What I often do in that case, if there are people doing data entry, and I think you guys are using REDCap, right, is I'll take a screenshot of the data entry form. Then I'll put it on a PowerPoint slide, research each field, and put a text box on the slide, in red, naming the field each item lands in. That's always something that can help you navigate an undocumented backend. But if you're just getting an extract from lab software, that's not going to work. So, okay, it sounds like you're getting somewhere; you at least know how big the problem is. Very good. Do you have any specific questions, Beth, at this point, for me or the group, that you think we could help you with? Not really, not at this point. You're still sort of wading through it all. I have a few questions if you don't mind. Yeah, go ahead. So Beth, what is your goal? What are you looking for at the end? What do you mean, like what's my goal for the work project?
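The point about "we automate that" can be made concrete. Below is a minimal sketch, using an in-memory SQLite database and an entirely made-up table, of what automated metadata extraction actually gives you: field names and types, and nothing about meaning.

```python
import sqlite3

# Hypothetical table standing in for a lab-export backend; the table and
# column names here are illustrative, not from any real system.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE samples (
        sample_id   INTEGER PRIMARY KEY,
        coll_dt     TEXT,
        ph          REAL,
        n1_ct       REAL
    )
""")

# What "we automate that" usually produces: names and declared types only.
fields = [(row[1], row[2]) for row in con.execute("PRAGMA table_info(samples)")]
print(fields)
# [('sample_id', 'INTEGER'), ('coll_dt', 'TEXT'), ('ph', 'REAL'), ('n1_ct', 'REAL')]
```

The meaning of a cryptic field like `n1_ct` (which assay, which target, which units) is exactly what this query cannot tell you; that part still has to come from a human.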
Yes, so you put the data together, analyze it. What would you like to find? Yeah, like, what's the research question? Poor Beth, I've been asking her this, and it's like torture. But I'll tell you, Mika, this is what happens when epidemiologists don't design studies. Okay, go ahead, Beth. Some of the questions I'm trying to work through: just as an example, they have different methods that they use to detect a certain virus. So we have to do a lot of comparison to see which method was more effective at detecting the virus we were looking at. Or it would be doing a correlation between the water quality parameters and the virus, say whether the pH level or the temperature has some effect. Right now we don't really have a gold standard, so we're just keeping it descriptive, reporting what we're finding in the water. The hypothesis is that if the water is too hot, then, if we're using COVID, supposedly the virus would not survive, or you would not be able to detect a lot of it. But for now, we're just doing correlation and keeping it as what we see; we're not doing much with the water parameters. So basically the problem is they have some logical problems with however they designed this, right? If you're trying to detect COVID in wastewater, what you're probably trying to do is either predict what's happening in the population, or just diagnose what's happening in the population, right? Either you're doing something to try to predict what's going to happen, or you're trying to see what's happening now. But if that's why you're looking in there, because you don't know what's happening, and you're not sure of the right way to look, then you don't know what the optimal way to test this is.
Then my opinion is this should have been worked out in the lab, or 80% worked out in the lab, under great conditions, with just a few choices narrowed down in the lab. Because the problem, like Beth said, is they don't really have a gold standard. They take these different samples in different ways; if one method detects a lot of virus, does that mean it's right and the other ones are wrong? Well, maybe that one's not right and there's another that would detect more, and that one's right. How do you even know when you're right? So what I was thinking is that she's going to be living in ecological study land, where you've got rates in the population or in a location, and rates in the water, and you're just trying to correlate them. But I don't know how to answer anything that way. I mean, maybe a regression. Do you have an idea? Actually, I was thinking that San Diego has been taking wastewater samples and counting COVID copies, and they are researching it; they are sampling regularly. And if you visualize it, how many people are hospitalized, how many people tested COVID positive, and the wastewater count of COVID copies, it's just so interesting how it goes parallel. First the COVID copy numbers start to increase, and then the patient count increases; it goes almost parallel. One falls fast, and the other is sort of chasing it. You can see that if you visualize it. You can poke around the San Diego UCSD site; I might be able to find it. Yeah, if you can find it, go ahead and share your screen. So do you get what she's describing, Mika? I mean, Beth, what Mika's describing? Okay, I think what she's describing is: imagine a time series plot, right, over time.
Okay. So now let's put a line on it, which is the rate of COVID in this region where you're testing the wastewater, like people getting sick, right? Now on top of it, you put the rate of COVID in your wastewater. Then you have these other metrics. And what Mika was saying is you can watch these metrics go through time, one chasing the other: one goes up, and then the other goes up. It's a really cool pattern. Everybody's dying, but it's still a really cool pattern. And she's looking for the visualizations, because San Diego's already doing it, which is why I was a little confused, Beth, when I learned you were working on this project, because I was like, this sounds like a problem that has been solved. Obviously San Diego's doing crazy stuff with it. It sounds like it was solved, but if the CDC is saying, no, it's not solved, you guys need to solve it, then I guess it's not solved. You're right, it is solved. I think I didn't give the bigger picture; I just went into what I'm doing, or what I'm supposed to be doing. We are trying to establish a standard, the best method, that other state and local departments could adopt. Yeah, but what's wrong with doing what San Diego does? They already built a whole shop around it; they've got visualization, everything. Even if you come up with something so-called better, San Diego's already running with this, you know what I mean? So these tests that we're comparing, they're not new at all; they're tests that are already out there. We're just doing a comparison of which one is better so that others could adopt it, if that makes sense. I guess so. And this is at very small facilities, compared to community-wide: colleges and nursing homes and very small congregate settings.
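The "one chasing the other" pattern described above is a lead-lag relationship, and you can quantify it with a lagged correlation. Here is a minimal sketch on synthetic data (not real surveillance numbers) where a wastewater signal is constructed to lead a case curve by 7 days, and the lag with the highest correlation recovers that.

```python
import numpy as np

# Toy series: the wastewater signal leads reported cases by 7 days.
# Entirely synthetic, built so the answer is known in advance.
rng = np.random.default_rng(0)
t = np.arange(120)
wastewater = np.sin(2 * np.pi * t / 60) + 0.05 * rng.standard_normal(120)
cases = np.roll(wastewater, 7)  # cases "chase" the wastewater curve

def lagged_corr(lead, follow, lag):
    """Correlation between lead[t] and follow[t + lag]."""
    return np.corrcoef(lead[:-lag], follow[lag:])[0, 1]

# Scan candidate lags and pick the one with the strongest correlation.
best_lag = max(range(1, 15), key=lambda k: lagged_corr(wastewater, cases, k))
print(best_lag)  # 7
```

On real data the peak would be broader and noisier, but the idea is the same: the lag at which the two curves line up best is an estimate of how far the wastewater signal runs ahead of the clinical one.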
And then it's so that facilities could adopt wastewater surveillance as part of their overall services. Sort of like micro surveillance; they could do micro surveillance. I guess that's kind of what you're trying to do, right? Yeah. So what were you going to say? The second thing is that maybe you can narrow down what virus, what piece, you want to look at. Maybe hepatitis or whatever, I don't know, if that's what matters for a nursing facility. Well, I think they only did certain tests on that water, right? Is that right, Beth? Say that again? Like Mika was saying, you could look for hepatitis too, and I was thinking your protocol made them only test for COVID things, right? Or did they test for other stuff? Most of the analytical tests, our analysis right now, is on COVID, because we just finished that; it ran over a certain period of time and just concluded. The new project will be on AR, so we'll be looking at antibiotic-resistant bacteria. But yeah, you could test for drugs as well in the wastewater, and I think there have been a lot of studies done. Okay. It's an emerging field. Okay, I'm going to share my screen. Yeah, I guess I probably should read up on it. So this is what Mika sent everybody, right? Nice. Okay. Oh, look at this. I'm just staring at it, because I've never seen that shape before. Oh, look, this is really beautiful. See, Beth, you could just steal it. Here's Encina. Now I'm looking at how this looks, and I'm wondering, what do you think this is? Is this Python, or maybe R? It probably is Python. Yeah, something like that, I don't know. Because I've seen this before; see how it shifts. Sometimes it's really elegant when it moves like that. Look at how pretty this is. See how this reports up there. Yeah, you can kind of see.
Yeah, so we've got reported cases and viral load in wastewater right on top of each other. Now you see, Beth, what you could do is, if you read into this, figure out what method they used. And if you have samples that use that same method, whatever one it is, you could say, okay, just for fun, I'm going to take the ones that use the same method as this and try to make a graph like this. That's usually where I start: just copy someone else. This one, what's going on? Oh, these are so pretty. Aren't they pretty? Yeah, it's pretty, but I don't understand this. I can't understand it either, because what's going on here? This little yellow thing, how does it... Oh, I guess there are different... Yeah, you just have to look at a slice at a time, or else it will drive you crazy. If you go here, you're like, okay, I get it: it's basically a stacked bar chart not separated into bars. Okay, but it's beautiful, and we know that Mika's getting her taxpayer dollars' worth. Well, just be happy you don't work there, or she'd have you make that. All right, so that's a really great thing, in my opinion, that Mika shared that with you, because when I don't know how to visualize some data, I just start by trying to do it the way other people did. The only thing that ever stumped me was that whole antibiotic resistance in CAUTI data; there was so much data, how do you visualize it? I really didn't know what to do, but then I solved it: I did an UpSet plot, and that worked. I showed you, and Natasha, my dashboard. So yeah, you do kind of have to find the right visualization that really speaks to you, that answers the question, but you usually go through a whole bunch of them first that don't.
What I believe Beth will probably end up doing a lot of in the beginning is correlations, and heat maps, like correlation matrices. You all know what a heat map is, right? Because the issue with what she's got is that everything is going to be correlated. It was kind of like that article we were looking at, Mika: some things will be more correlated than others, some will be more outliery than others, and maybe that's interesting. But that's probably what we'll have to start doing at the beginning, to really understand how the data all behave with each other. That's my thought, anyway. So, Sakeeb, why don't you talk to us. I know you do statistics now. What kind of data do you work with at work? Oh, HCUP. Do you know HCUP? Yeah, Mika, have you ever used HCUP? I created it. You created HCUP? You developed it? Oh, that's so cool. How did you do that? Oh, no, no, I worked on it. There's a company that was contracted with AHRQ, and my company was the one that actually generated it every year, and I was part of the team generating the data. Is it any good, Mika? You were behind the scenes; is it any good? It was a lot of work, but it's generally pretty clean; we cleaned it as much as we could, ready for further research. It's research data. So yeah, I don't work with HCUP, because I don't know much about actual costs, and that's why people tend to work with HCUP: they like to use the money fields in there. That's so interesting that you prepared those files. I almost never analyze them, because I just don't need to do money things. But that might just be a bias I have. Sakeeb, what kind of analysis do you do with the HCUP data? And just before that, to make sure everybody understands: HCUP is the Healthcare Cost and Utilization Project data, so it's about healthcare utilization. Okay, go ahead. I'm sorry.
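The correlation-matrix starting point mentioned above for Beth's data is a one-liner once an analytic data set exists. A minimal sketch on synthetic data (the field names and the temperature effect are made up for illustration, not from the real NWSS files):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a wastewater analytic data set.
rng = np.random.default_rng(1)
n = 50
temp = rng.uniform(10, 30, n)
df = pd.DataFrame({
    "water_temp_c": temp,
    "ph": rng.uniform(6.5, 8.5, n),
    # Built to anti-correlate with temperature, echoing the "hot water kills
    # the signal" hypothesis from the discussion.
    "viral_copies": 1000 - 20 * temp + rng.normal(0, 30, n),
})

# The matrix you would feed to a heat map,
# e.g. sns.heatmap(corr, annot=True, vmin=-1, vmax=1) with seaborn.
corr = df.corr()
print(corr.round(2))
```

The point of looking at the whole matrix first is exactly the one made above: in this kind of data almost everything correlates with something, and the matrix shows you which relationships stand out before you commit to any model.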
I have just started, so I have heard that they use HCUP data, and I had to go through initial training on what it is. I know this is big data, so my other co-worker is dealing with it. And we don't do very hardcore analysis, just basic analysis, like t-tests or reports or ANOVA, that kind of thing. So I haven't gotten a project with this data yet, but I'm supposed to. My co-worker is trying to understand it, and he told me today that it's very confusing. So they're trying to get connected with the HCUP team to ask what things are. For example, when they ran the data in SAS, they found there is a term called number of clusters. My co-worker was having issues: what does this cluster mean, what is this cluster actually, is it a cluster from cluster sampling, or what is it? So in other words, HCUP gave him a field that was named something, but he didn't know what it meant; he didn't know what kind of cluster it was, and he couldn't find it in the documentation. Yeah, because even the principal investigator doesn't know. Nobody knows. I mean, you have to study the HCUP documentation a lot, like, for years. It's complicated. I thought that HCUP was maybe a particular source from where people can pull up data. My assumption was also that HCUP is maybe a particular source people can pull their data from. No, no. Here's what happens: a lot of people have medical records in the Epic system. If they're a participating HCUP facility, there's structured data they have to pull out of Epic and ETL; they've got to put it in a certain shape. There's usually an interface control document for HCUP.
Maybe Mika knows about this: you probably had instructions, or the agency had instructions, for everybody submitting their data in a certain format. What happens, Sakeeb, is you're not wrong: the data that lands in HCUP starts in a lot of places, in Epic, but the Epic structure is all different at different places. So somebody has to prepare the data into an HCUP structure to give to the agency, to give to Mika's old consulting group, so they can transform it into what you're analyzing now, which is a lot of work, like Mika was saying. I was in that situation when I was at the Army, where I ran a thing like HCUP; I had all these data sets that were ready for analysis. A bunch of universities decided they wanted to do a study with that data, but they decided to get the data originally from the agencies, or from the commands, rather than from my system. Why would you want to do that? It's going to take you so much time to learn about the data and figure out what to use. My data is all cleaned up and documented; why don't you just use mine? Well, they didn't, and then they spent six months in sheer confusion, just sheer confusion. Because, as Mika will know, since she did those data sets, you do a lot of cleaning up so that the data sets will fit together if you connect them. These poor people from all these different universities were getting all these data sets from the original sources, and they couldn't connect them. And that's what would happen, Sakeeb, if you went straight to Epic: that data is a mess. And for the people at each Epic location who transform that data into what they're supposed to submit to HCUP, that's a whole production. Yeah, I don't have any idea, because for most of the projects so far, maybe they pulled their data from REDCap, and I think they just give it to me, with the protocol, and then I do the analysis, that's it.
But there's no documentation. Maybe they shared SAS Studio, and the HCUP data is available there, so my assumption was that maybe I can just download the data and analyze it normally. But from this conversation, it seems like maybe it's not like that. Well, we'll figure it out. Mika, what were you going to say? First of all, HCUP is hospital discharge data. HCUP collects the data from each state: the state public health agencies gather their hospital data for their own purposes, we purchased those data from each state, and then put them together. So HCUP has a couple of different things: there's state inpatient data inside each state, and the nationwide inpatient data is the one where we pull all the state data together, put them together, and then do the statistics. That's the NIS, right, the Nationwide Inpatient Sample? Yeah, nationwide; we call it the Nationwide Inpatient Sample. That's the one that actually uses that stratified cluster design, as I remember. Oh yeah, so that's where you got those fields with the cluster number or whatever. Otherwise a big state like Texas or a big state like California dominates it, and we don't want that. So it's big data, and my co-worker who is dealing with it with a PI, what is happening is that whenever he tries to run it in SAS Studio, it takes a long time, like maybe a day or something.
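The stratified design is also why you cannot just average the NIS rows. A minimal sketch of the difference between an unweighted and a design-weighted estimate, on a tiny made-up table. The column names (`DISCWT` for the discharge weight, `NIS_STRATUM` for the sampling stratum, `totchg` for total charges) are my recollection of NIS-style naming and should be checked against the HCUP documentation; the numbers are invented.

```python
import pandas as pd

# Tiny made-up discharges table with NIS-style columns.
df = pd.DataFrame({
    "NIS_STRATUM": [101, 101, 102, 102],
    "DISCWT":      [5.0, 5.0, 2.0, 2.0],   # each sampled row stands for ~5 or ~2 discharges
    "totchg":      [1000, 3000, 2000, 4000],
})

# Unweighted mean treats every sampled row equally, which is what lets
# heavily sampled strata (the "big state" problem) dominate.
unweighted = df["totchg"].mean()

# Weighted mean respects the sampling design.
weighted = (df["totchg"] * df["DISCWT"]).sum() / df["DISCWT"].sum()
print(unweighted, weighted)
```

For real variance estimation you would hand the weight, stratum, and cluster fields to survey procedures (SURVEYMEANS in SAS, or a survey package elsewhere) rather than compute by hand; this sketch only shows why the design fields exist at all.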
So that's why the company recently had a conversation with SAS about Viya, which is kind of a cloud thing. They thought maybe they would go to the cloud system so they could get the computing power, but I don't have any idea, really. And I told you that even though I do have a stats background, I haven't had that much exposure to analyzing data like this. So I don't know. I just want to interrupt for a second, because I want to tell you: I'm in this invited SAS community, this insider community; I don't know, maybe because I wrote a book they got me in there. We were all on a discussion board, introducing ourselves, and everybody had been using SAS for like 20 years. One of the topics that came up, Sakeeb, was that a lot of people who have SAS servers are having the problem you're having, and they can't move to Viya. It's like trying to carry a rock up a ladder, getting all the data into the cloud; they're stuck. So this whole idea of, okay, you're out of room on your SAS server, just convert to Viya and put it in the cloud: a lot of people are running into real problems with that in real life. Now, Mika, what were you going to say? Back in the day, when I was working on that HCUP data, I was also doing user support, and there was exactly that problem of things running so slow, because it is so big. One of the things we suggested at the time was to select exactly the variables you need and then load those, so you don't load all the data onto your computer; that's overwhelming. That's one of the things we usually suggested; we occasionally wrote a program for selectively loading the HCUP data.
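The selective-loading advice translates directly to other tools. In pandas, for example, the `usecols` argument to `read_csv` parses and keeps only the named columns, which is the same idea as a selective SAS INPUT statement. A minimal sketch, with a small in-memory stand-in (made-up column names) for a huge delimited extract:

```python
import io
import pandas as pd

# Stand-in for a huge delimited extract; only two of the columns are needed.
raw = io.StringIO(
    "key,age,totchg,diag1,diag2,los\n"
    "1,34,1200,A10,B20,3\n"
    "2,61,8800,C30,D40,9\n"
)

# usecols makes pandas keep in memory only the named columns.
df = pd.read_csv(raw, usecols=["key", "totchg"])
print(list(df.columns))  # ['key', 'totchg']
```

On a file with hundreds of columns, this is often the difference between a data set that fits in memory and one that does not.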
Yeah, and I believe what you're talking about, Mika, is when you write a really complex INPUT statement, where you take a fixed-width file and carefully tell SAS what to skip over: read this part, then skip over, read this one, skip over. So you're only reading in the data you really need from the raw file. It's really picky, but we used to do that at the Army too. But there's another trick I've tried, and it works. You load all that data into a SQL server, and then you build a view in the SQL server, because SQL runs a lot better; most SQL engines have much better data management. You build a view that has exactly the data you want, exactly the rows and columns; a view is like a filter. Then you can use SAS/ACCESS: from your SAS environment you use an ODBC connection, and you have to log in, you have to authenticate, but then you can hit that view in a DATA step, and it comes in as a table. Yeah, or, so they totally changed over; they're using Python now. Python is reading that one row at a time. But SAS reads one row at a time too. What's that? SAS reads one row at a time too. My impression is, well, I don't know. Have you ever seen that diagram, Mika, about how SAS processes each row when it's reading it in? No, I've never seen that. I'll see if I can find it; people write about it all the time. I'm trying to remember: the PDV. Do you remember the PDV in SAS? The Program Data Vector; it's how SAS builds the data vector. I mean, Python apparently is operating like SAS. I didn't know it did that, that it was line by line like that, because SQL doesn't really run that way; it's got an optimizer, and it's using a lot of statistics.
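The view-as-filter trick described above can be sketched end to end. Here an in-memory SQLite database stands in for a real SQL Server reached over ODBC, and pandas stands in for SAS/ACCESS; the table and data are invented for illustration.

```python
import sqlite3
import pandas as pd

# Minimal sketch of "build a view, then hit only the view".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE discharges (id INTEGER, state TEXT, totchg REAL)")
con.executemany(
    "INSERT INTO discharges VALUES (?, ?, ?)",
    [(1, "TX", 1200.0), (2, "CA", 8800.0), (3, "VT", 500.0)],
)

# The view is the filter: exactly the rows and columns you want.
con.execute(
    "CREATE VIEW vt_only AS "
    "SELECT id, totchg FROM discharges WHERE state = 'VT'"
)

# The analysis tool queries the view like an ordinary table, so only the
# filtered slice ever travels to the client.
df = pd.read_sql_query("SELECT * FROM vt_only", con)
print(len(df))  # 1
```

The design point is that the filtering happens inside the database engine, where the optimizer lives, instead of after everything has been shipped to the analysis environment.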
But yeah, my opinion is: whoever said it, I think it was Mika, study your documentation, pick out what data you actually want to read in, and just read that in. Generically, that's really a good idea with big data. All right, well then, when we meet, Sakeeb, we can talk about what you want to do. HCUP is not private data, right? If you can get your hands on it and do an independent project, we can make really cool, interesting things with HCUP. Yeah, you can do it. It just depends on the particular HCUP data you want to use; some of them cost you, but it shouldn't be too expensive, at least back then. Well, we have to thank you, Mika, for making such a nice data set for it. I think he has a PI, and the PI had a grant, and they bought it, because he has it at work. It's just not as touchy as the data at your work, Mika, because HCUP is meant to be shared; it's just a matter of buying it. So I was thinking, whatever Sakeeb is assigned at work, we can still use HCUP to do something maybe more interesting. All right, we're coming to the end of our hour. Does anybody have any questions? Thank you for watching this video, which is part of the Public Health to Data Science rebrand program. If you are interested in joining the program, please sign up for a 30-minute Zoom interview using the link in the description.