Hello, everybody. This is Monika Wahee, and it's five minutes to noon Eastern time. I'm starting the live stream about five minutes early just to allow everybody to join. So now I'm going to do my little professor thing I do when I'm waiting for everybody to come into the classroom, where I just kind of repeat stuff that everybody should already know, but I'm going to say it anyway to make sure everybody feels welcome and knows what we're doing here. So if you're coming into the classroom and you're like, am I in the right class? If you're here for data curation, you're in the right class. Well, hello, Lungs Detoxification. Here's my fan that I love. Let me tell you, you should detoxify your lungs. I agree with lung detoxification. I guess, what would I name myself? I guess I would be like, get rid of your atherosclerotic plaques. That's what I would name myself instead of lung detoxification. But we all get to be, you know, some chronic disease thing that you should get rid of, right? Anyway, that's just epidemiologic jokes, because everything is so depressing in epidemiology. What do we have? M&M, M&M, right? It sounds like a candy, but it's really morbidity and mortality. So you just got to laugh at whatever you can find to laugh at. So I'm an epidemiologist and a data scientist. I do biostatistics, I do informatics, I make data warehouses, but today what we're talking about is data curation. And the reason why I wanted to hold the live stream on this is that, unlike other topics in data science, when I bring up this topic, a lot of people don't know what I'm talking about. Like, I'll say, do you know how to do regression? And people will be like, no, I don't know, or yes, I do know. Or I'll say, do you know Python? And they'll say yes or no.
But if I say, do you know data curation, they'll be like, what's that? And oh, I got to show you, it's Saba. I'm so glad to see you, Saba. It's been a few years, I guess, maybe a year or so. I hope you're doing well. So Saba's one of my customers and yeah, so say good things about me in the chat. And also, Saba, go ahead and ask me questions because you're comfortable with me. Like, you know me, you can ask me personal stuff. No, probably don't ask me personal questions. But anyway, I'm so happy to see you, Saba. I just love you. I had so much fun working with you. But you know, I'm kind of that way. Oh, it's good to see you too. That's sweet. So those of you who know me will know I'm kind of lovey. I get bonded with people. Like when I started my tutoring service, it was in person and in Boston, and I was tutoring statistics, you know, I didn't really have many customers at the beginning of my business. And my customers would be saying, you're not a normal tutor. Like most people sit down with us and teach us mean and median. You're like up all night trying to make sure we pass tests. So I decided maybe I'm not a tutor. Maybe I'm a mentor. And those of you who go to my web page will see that I write about mentoring. Because I started my business, now it's 10 years ago, and I thought I was a tutor, but I now have mentoring skills because I've been mentoring all these people. And now they're successful, like Saba was very successful. And oh, and then Lungs Detoxification is asking the perfect question, what is data curation, right? Can I call you Lungs? Let's just wait one more minute for the live stream to officially start. And then I'll start going on and on about data curation because, so like let's say that you sell brocade, okay? But nobody knows what brocade is. Then you have trouble with the demand, right? Just to let you know, brocade is a type of fabric. I used to be a fashion designer, okay?
So if you have brocade and you're trying to sell brocade and nobody knows what it is, you first have to educate people on what it is. So that's part of the reason why I'm holding this live stream: I sell data curation. So what does that mean? Well, I actually have a course on it in LinkedIn Learning, which, you know, I mean, if you don't know what it is, you could take that course or you could talk to me or whatever. But also what I sell is teaching people how to do it one-on-one and also just doing it for people. Because, like, I'm not that good of a programmer, but I can program, and pretty much anybody can learn to program. What I learned about data curation is it's a little different, it's harder. If you don't have a knack for it, it's harder to figure out than programming. So, okay, so now it's noon and I'm gonna officially open the live stream. Welcome, everybody. My name is Monika Wahee. I'm a data scientist and I'm holding this live stream today to talk about what data curation is. And my fan here, who I just love, Lungs Detoxification, is asking the quintessential question, what is data curation? So I'm gonna explain it to you, and you may not really get it from the definition, but then I'm gonna show you examples and then you'll get it, right, from the examples. But first I'm gonna explain it starting with the word curation. Now, where we normally hear the word curation is in museums, okay? So I used to be a fashion designer and I was interested in history, so I worked at the Museum of American History at the Smithsonian. And one of the things we had to do was study, like, historical dresses of the wives of our presidents, because we don't have female presidents, so wives. And study the dresses, gather information, but then we had to make a book or make little signs explaining about the dresses.
And I remember Abraham Lincoln's wife, Mary Lincoln, her dress was just a mess, it was just falling apart. The silk had rotted and everything, but it was so important historically. So we had to curate that, we had to explain this few scraps of silk that looked like nothing, that this was really historically important. So that's kind of what data curation is, is where you've got data. Now, if you just think about data, right, like think about a data table, what does it have? It's got columns and what are the columns full of? A lot of times they're full of like numbers that you don't get, like they're often indexes, like they might be one, two, three, four, but they mean like a state or a country, maybe there's letters, like you don't even know what the data means. Like how many times, those of you are actual data scientists already, how many times has somebody given you like a data set and there's a column for like gender or sex, and it says one or two, but you don't know if one male or is one female or what? And then what if it's not one or two, what if it's somebody who's not male or female because there are other genders? So the, and how do we handle that? You know, I mean, they exist. Did you just not put them in the data set or what? So it's like data curation solves all those questions. Data curation is the way that you go from, why are we putting a bunch of pieces of rotting silk on display to, oh, I get it, this is Mary Lincoln's dress and this is really historically important, whatever, right? Okay, so another one of my fans, I'm so happy you showed up, Alejandro, I believe is how I would pronounce your name, correct me if I'm wrong. Hi Monica, thanks for the share and teach. How do you explain the importance of data curation into the business context? That is a good question because it's just like, what I just described was so abstract. 
And if I explained that to, like if I went into a group and I said, you should make a contract with me, I'll curate all your data, pay me thousands of dollars, they'd be like, I need that? How it works is when people are in crisis, that's when I teach them about data curation. So what I mean is this: oftentimes how I'm introducing people I meet to data curation is they bring me a model, like a statistical model, and they're like, Monica, my model's a mess, like I don't get it, I don't know how to interpret it, I don't know if I'm fitting it right. And I'll be like, okay, let me take it apart here. So most regression models, like all regression models basically, are a dependent variable and a bunch of independent variables. So I'll study the dependent variable, study the independent variables, and I'll say, what are these variables? And that's when they'll be like, oh, okay, well here we have sex. And I'll say, okay, that's coded one, zero. What is one? And I'll be like, okay, that's okay, how do we figure out what one is? They don't know. Well, sometimes they're using maybe survey software. And so maybe we can break into SurveyMonkey and see what was coded one. But this is just an example of, like, now we're going down the rabbit hole, right? We're going down the rabbit hole with each of these variables. And every rabbit hole, like what's this next one? City, one, two, three, what city is three? Okay, going down the rabbit hole. So I gotta keep track of what happens in each rabbit hole. So what comes out of each rabbit hole? It might be a table, might be a figure, I mean, who knows. And that's data curation. And I can't tell you how many times we go, oh, that's what's happening, right? So you're probably thinking, well, why don't we prevent going down the rabbit hole? Well, that's data curation, right? It's like, let's say I'm making a survey, I actually write down what one means in gender, you know, before we program the survey. What a new idea.
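A minimal sketch of what "writing down what one means" could look like in practice. Python is used here purely for illustration (the stream itself works in R, SAS, and Excel), and every label in the codebook below is invented: the transcript never says which sex is coded one or which city is coded three.

```python
# Hypothetical codebook: all labels below are invented for illustration.
CODEBOOK = {
    "sex": {0: "female", 1: "male"},
    "city": {1: "Boston", 2: "Worcester", 3: "Springfield"},
}


def decode(variable, code):
    """Return the documented label for a coded value, or fail loudly."""
    labels = CODEBOOK.get(variable)
    if labels is None:
        raise KeyError(f"No codebook entry for variable {variable!r}")
    if code not in labels:
        raise ValueError(f"Code {code!r} is not documented for {variable!r}")
    return labels[code]


def undocumented_codes(variable, values):
    """List the distinct codes found in the data that the codebook does not explain."""
    labels = CODEBOOK.get(variable, {})
    return sorted(set(values) - set(labels))
```

Calling `undocumented_codes("sex", [0, 1, 1, 2])` would surface the `2` that nobody wrote down, which is the rabbit hole caught before the analysis starts rather than after.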
And so that's what data curation is. It's sort of anticipating all the problems you're gonna have, all the questions you're gonna have about your data, and then coming up with different kinds of files to answer them. Data dictionaries, you'll hear me talk about a lot, those I usually put in Excel. So you'll see tabular data, and it's like, if you've heard the term metadata, you know, data about data, that's kind of, you know, what curation is. Oh, all you have to do to be featured here is to compliment me. Thank you, Saba: great example, you're such a great teacher. Thank you. Oh, and thank you for showing up. Rothanam, is that how you pronounce your name? I appreciate you showing up. Thank you so much. We have another LinkedIn user who asks, how do you use data curation in real-life applications? I'm glad you asked and I'm gonna give you an example. Because I went into my customer, secret customer files and pulled up some sexy data curation files for you. Why do I call them sexy? Because I have not featured them in my other things, right? So let's see, how do you use it in real-life applications? Well, I'm gonna pick one application here and share it with you. Now, what happened here in this situation, and those of you who are statisticians will relate to this, is somebody came to me and they wanted to know sample size. They wanted me to estimate how much sample they needed for a study, right? And I wanted to keep track of all the estimates I gave them. Let's see here. So I'm gonna do the share screen. I'm not so good at this. Let me see here. Oh, there it is. Okay, so now I'm gonna go out here and yeah. So this isn't very big. Let me make this big. Okay. So I'm gonna just explain to you what I did. Oh, Dr. Heina's here. Hello, Dr. Heina. Thank you for showing up. You'll probably recognize some of these things. Hi, Chow. Is that your name? Hi, Chow.
All right, so somebody asked me, and those of you in oral health might recognize this terminology. BOP is not bebop; BOP is bleeding on probing, right? So if you probe someone's gingiva, like their gums, and it bleeds, it's pretty badly inflamed. Okay, so we had decided our assumptions for our sample size: alpha equals 0.05. Those of you who understand statistics will understand this. I apologize to those of you who don't. And then the power is at least 80%. And then the outcome that I was calculating these sample sizes for was difference in BOP. I typed "difference in different." You can make typos in your curation documents. Between visit one and visit three, and there were two different treatment groups for gingivitis, CX and NSM. I don't remember what those stood for. Okay, so what was I doing? I was calculating how much sample you need in each group. Oh, two-tailed. And I was using G*Power, which is actually a free application you can use for sample size calculation. But what I wanted to do was try different effect sizes. So if you see that column on the far left, it says effect size 0.1, effect size 0.2. So basically, for those of you who are into math but may not understand the sample size calculation, all I did was I had this equation where I was setting alpha at 0.05, power at least 80%, the outcome with two tails, whatever. And the effect size is what I was changing, and then I was trying to see, each time I changed that, how does it change the N in each group? And then times two would be the total N. I documented all of this. Now, what have I seen in the past? I've seen statisticians who make all these calculations just kind of write them down on a scrap of paper. And I'm like, no, let's just keep track of every single thing we did. Let's just put it in a table so that when I open it 500 years from now, I can look and remember exactly. I used G*Power. That's one of the things. Do you see that?
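The kind of table described here can be sketched in a few lines. This is not the G*Power calculation from the stream: it swaps in the textbook normal approximation for a two-tailed, two-group comparison of means, n = 2(z_{1-α/2} + z_{power})² / d², and it assumes the effect size is a standardized difference (Cohen's d), which the transcript does not confirm. Exact t-test software such as G*Power will typically report a slightly larger n. The function and column names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed, two-group comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def sample_size_table(effect_sizes, alpha=0.05, power=0.80,
                      tool="normal approximation"):
    """One row per scenario, with every assumption stored next to the result."""
    rows = []
    for d in effect_sizes:
        n = n_per_group(d, alpha, power)
        rows.append({
            "effect_size": d,
            "alpha": alpha,
            "power": power,
            "tails": 2,
            "tool": tool,            # record which tool or method produced the number
            "n_per_group": n,
            "n_total": 2 * n,        # two treatment groups
        })
    return rows
```

The point of the sketch is the shape of the rows rather than the formula: alpha, power, tails, and the tool travel with each estimate, so the table still explains itself when it is reopened months later.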
That's how I know: I put it on the tab. And actually, let me go and see what's on the other tab. So I did the simulation, right? Here, oh, let me make it big enough, okay? So in the simulation, BOP was the outcome, CX was one of the treatment groups, and we were thinking CX would be better than the other one, NSM or something like that. If CX reduces BOP by this much, and NSM reduces BOP by this much, here's the effect size, and here's the N in each group. So basically, I'm doing these simulations because what do I have? I have a dentist who doesn't wanna put a bunch of people in a study, right? So I'm just like, okay, whatever. So this is just the first example of a data curation document that I'm giving you. So if somebody came to me, one of my customers, and said, we have to calculate sample size, I'd probably be making a curation file like this for them. What often happens when you're calculating sample size is you're doing a grant or you're doing a protocol, and either way, there's gonna be a delay before you actually do the project. And you're gonna totally forget what you did. You're gonna forget why you picked what effect size, what you did, did you use G*Power, did you use SAS, you can do this in SAS, whatever. So that's what curation's about. So I'm anticipating all this, why? Because when you sit down to write up your report, you're like, how did we get the effect size? What was the effect size? You don't wanna forget that, right? All right, so if you have any questions about that one, let me know. I'm gonna trot out another one that's from a totally different world, okay? So you might hear me talk about my intern all the time because I just love her. Okay, my intern is so smart and she's also very conscientious.
What she did was she found a dashboard online that was published, that had some public data in it being displayed, and she didn't like the display. She said, Monica, I can do so much better. I can make a much better display of the dashboard data. And I said, okay, but how are we gonna get the data out? Can you download the data, whatever? And she fought with the thing. She couldn't get it and she's like, Monica, I'm gonna scrape it, I'm gonna scrape it. And she's very powerful. I can't do any front-end programming. I could never figure out how to scrape something, but she could, right? Well, that meant that we had to start cooperating. Like, if she's gonna scrape data, she's gonna put it in some sort of structure and I'm gonna have to sort of analyze it. So we needed some sort of guidance. So what I did was, let's see here, I'm gonna open this. So you'll find I use PowerPoint a lot. I teach it in my data curation course and I also use PowerPoint a lot to make diagrams. And so here I'm gonna be showing you this PowerPoint. Let's see how it is here. Okay, so this PowerPoint has three slides in it. The first one shows a screenshot of one of the reports that could be downloaded from this dashboard, that had some information about what was in it, right? And so you'll see here, in red it says hospital name and in blue it says hospital dot name. And then you'll also see it says, red is the name of the variable when scraped and blue is the name of the variable in the data frame. So you can probably already tell what's going on: data frame, that's R language, we're using R, and we're scraping data. And so she's scraping data into one name and I'm analyzing data under another name. So let's go to the next. So here again, red is the name of the variable when scraped, and here's the red names, and blue is the name of the variable in the data frame.
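The red-to-blue mapping on those slides is a crosswalk, and it can also be kept in a form that code can apply. A minimal sketch follows, in Python rather than the R used on the project; only the `hospital name` to `hospital.name` pair comes from the slide as described, and the second pair and the function name are hypothetical.

```python
# Crosswalk: scraped column name -> analysis (data frame) column name.
CROSSWALK = {
    "hospital name": "hospital.name",   # the pair shown on the slide
    "hospital city": "hospital.city",   # hypothetical, for illustration only
}


def rename_columns(rows, crosswalk):
    """Rename the keys of each scraped row; refuse columns the crosswalk does not cover."""
    renamed = []
    for row in rows:
        unmapped = sorted(set(row) - set(crosswalk))
        if unmapped:
            raise KeyError(f"Columns missing from crosswalk: {unmapped}")
        renamed.append({crosswalk[name]: value for name, value in row.items()})
    return renamed
```

Raising an error on an unmapped column is a deliberate choice in this sketch: a new scraped field forces a conversation and a crosswalk update instead of slipping into the analysis under a name nobody documented.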
So what was happening is my intern was using some sort of function that was scraping data into columns that were named these things. And I wasn't really sure about them. I didn't want those names in the data analysis. So I wanted to create a crosswalk, or no, actually I think she did it. She scraped this and this is what she named it, and I was mad at her because I couldn't figure out the names, so I made this with her basically. This was me interviewing her and making this while I was talking to her. I guess she gave me the data and I was like, I don't know where these data are from. And so we had this conversation, and I wrote it down. So there's kind of this theme here where, let me stop sharing. A lot of times the reason I'm doing data curation is because I'm trying to document some information, some business rules that a subject matter expert is explaining to me. Now, one of the things that I noticed in life is that nobody does that really. Like that is not people's natural inclination. I'll give you an example. A long time ago, like in the 90s, early 2000s, I worked on a flat database, right? Like an old mainframe database. And those of you who know about it would be familiar with it. It was called Carecast, IDX Carecast. And those of you who are familiar with flat databases know that it's basically like a big quilt in the sky, and there are different places on that big quilt that are designated as tables, sort of virtual tables. These were called dictionaries in IDX Carecast. And most of the time, you didn't have to do anything with these dictionaries. They were mostly picklists, like providers that worked at different places. But of course I had to enter providers. That was one of my jobs. So I would have to go into dictionary 471. That's why I still remember it. And edit dictionary 471. Well, dictionary 471 only had a few columns in it, like name and specialty and whatever.
But once you know it, I can't remember what those columns are. And so before I'd go in to go data entry, I wouldn't remember, I wouldn't have all the data prepared. So there are a few dictionaries like that in Carecast. There was 471 and then there were a few others like vendor dictionary, whatever, where you had to edit it once in a while when there was like a new customer or something. One day, all those dictionaries, I just made a data dictionary of them and I put them in tabs on Excel spreadsheet. I just said, okay, my little department like me, maybe three other people as me and my lead worker and a few other, like we're the only one who do these dictionaries. Every time we do it, I totally forget what I did last time and it takes me an hour to figure it out. I just made this doco and now we can have it. I just remember everybody's like blown away. They were like, oh my God, what a brilliant idea. And I was like, it's not that brilliant. I was irritated. That's why I made it. And then I started noticing this happening to me over and over. Like one of my friends is a transfusionist and he was working with a group that was trying to get data from their respective hospitals and put it together and do some benchmarking. And the group had been meeting like once a month for a few, like a year and a half. And I guess he had just joined it. He had just got a new job at some transfusion place and joined it. And he's like, Monica, we're not an interdiscipline. We're multiple sites. We have different sites, but we do the same thing. We do transfusion and we're all trying to get data and we do informatics and we're all trying to get data in one place and we're just not moving forward with the project. Can you meet with us or something? I said, sure. So I had this meeting with them that consisted of me really asking a million questions and I think about transfusion or blood or what they were trying to do anyway. 
And it turns out, just to cut to the chase, blood products that you store in a hospital or in a blood bank are scarce. So you have to prevent overusing them. So these hospitals wanted to benchmark their use because they didn't know what constitutes overuse, right? Like, they could just compare to each other. So they wanted to put de-identified data into some sort of central repository and have somebody like me analyze it. So after we met, and I had kind of asked everybody what they were thinking, I drew the simplest picture in the world, okay? And actually, let me see if I have an example of it. Oververyrecruitmentmodel.co, maybe this is it. Nope, this wasn't it. It was basically a flow chart of some sort, I don't have an example. I basically made the simplest flow chart in the world. It showed a few hospitals, you know, with little data sets with arrows going to a central repository, and then a person analyzing it. And I sent this image around. I said, this is what you guys are trying to do. And they were like, oh my God, this is it? This is it? This is wonderful. I'm like, you know, I just did that to make sure I understood what they were saying. They hugged me, they were so happy. And I'm like, okay, there's something special going on here. And that is that I think people have this idea that data should be intuitive, that you should just automatically know about it. Because one of the other things I did was when I was a civilian working for an Army data repository. And they decided to do an epidemiologic study involving some analysts from Harvard, but they weren't gonna use data from my data warehouse; they were gonna use data that my data warehouse had inside it. So I was familiar with those data sets. So I went to just help them, just teach them about the data sets.
But they went to the original data providers. They didn't come to my warehouse. At my warehouse, I curated everything. I had everything all ready for them. It was all nice, you know? But no, they had to go get the raw data. Like everybody always does that, right? So they had to go get the raw data, and then it was a mess and they had no curation, whatever. So I remember going to meet with them. Each analyst was assigned sort of a different data set. And the first thing they would do is sit down and present to me what they found from analyzing the data. And then they'd stop and I'd say, you know, you're pretty brilliant, because I've been sitting on those data in my repository and there's a whole bunch of fields I don't think I can even validate, that I don't think we can even use. And you were analyzing them, like, do you know more than I do? And then they would be really embarrassed. And then I'd say, you know, these data sets are very complicated at the Army because they were coming from production databases. Like, before a person gets deployed overseas from the US, they have to go through a particular process, and there's a data set that reflects data from that process, each time a person goes through that process. Well, I didn't know anything about that process. I had to study that process in order to be able to use the data from it, to understand the terminology. And I had to document it. You don't think I went to a meeting with somebody and they told me a whole bunch of stuff and I memorized it? No, I had to write it down. And what I found is that documenting these things just improves your communication, especially with subject matter experts who don't really know data science. I'm gonna show you.
So if you take my LinkedIn learning course, you'll see that any of my LinkedIn learning courses, you'll see that I emphasize data dictionaries and I'm actually gonna, let's see, show you one. Like one of the things that I did was I posted on my website or on my blog an example of a portfolio project. And the reason I did that is I wanted people to understand what you do if you do a portfolio project, right? And I did the portfolio project. I'm a healthcare data scientist, but I did the portfolio project on casinos. Why? Because I wanna get hired by our local casino and to do some data analytics because I'm a casino fan and I'm a customer. And so I know what it's like to be a fan and a customer. I didn't really know what it was like to be a business person who does data analytics for casinos trying to improve their business. So I did a portfolio project in it where I found some data online and I analyzed it. And actually, let me look it up here and just so I can link you to it and you can know what I'm talking about. Because I know a lot of you out there might be at that point in your data science learning journey where you want to apply like some of your programming ability, but you don't really have an opportunity. You're not really sure. So that's a link to this web or this blog post where I did a little analysis of some public data that was posted. And the existence of the three casinos in Massachusetts has something to do with taxes and everything, there's some governmental stuff. So they have an obligation to give the government some data each month and that's what I analyzed. So what I want to show you is when I went and analyzed that I kept a data dictionary and I just want to show you that data dictionary. And the reason I really want to show it to you is because I did this, like you can look on the blog posts. This was a few years ago I did this, but if I were to open this up today, it would be like, I'd know, see the field name. 
I've got the columns here, field name, order, ID, year cal, month cal, and then the source. Well, I put together different data sets in order to analyze this data. And that's like the hard part of data science, weaving together different data sets. Now, I'm just really good at it, but that's often hard. Like sometimes data scientists, they'll just have one table, like a survey, and they have trouble analyzing it. So I'm kind of advanced that way. I was trying to do, like, this, you know, stellar, perfect portfolio project, so people could see a great example, but also I wanted to demonstrate that a great portfolio project doesn't need regression or artificial intelligence. This is a descriptive analysis, right? And you can just get a lot out of that, because a lot of times people just don't bother to analyze the data. Like I'll bet you I'm the first one to analyze these data sets, but they're sitting out there, and there's an entire page for them. So I thought, what the heck? So you might be able to see at the bottom there's two tabs here. This one says casino and this one says by month. Why? Because there were two tables, my two analytic tables. One was the casino entity. Those of you who are familiar with normal form: the casino table, so documentation about the casino. And then there was monthly, let's see here, thank you for coming. But I don't know how to pronounce what you said. Thank you for being here. So this is an example of data dictionary documentation, backend documentation that will now help me. And one of the things that's really helpful in this documentation, let's see here. See the source. This says revenue reports. This says R. Why does it say R? Because I made it in R. I was using R. This is so important if you're running a data warehouse. Because the revenue report data, like tot underscore tax, that's just raw data, okay? Date underscore cal, which I made in R, I made that variable.
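A data dictionary like this one lives in Excel in the stream, but the same rows can be checked by code. The sketch below is an illustration under stated assumptions: the field names `tot_tax`, `date_cal`, `year_cal`, and `month_cal` and the two source values come from the description above, while the `order` values, the `derivation` column, the wording of the `date_cal` derivation, and the function name are hypothetical.

```python
# Each row documents one field: where it came from and, if derived, how it was made.
DATA_DICTIONARY = [
    {"field_name": "tot_tax", "order": 1, "source": "revenue reports",
     "derivation": ""},  # raw field, nothing to explain
    {"field_name": "date_cal", "order": 2, "source": "R",
     "derivation": "built in code from year_cal and month_cal (assumed)"},
]


def check_dictionary(columns, dictionary, derived_source="R"):
    """Return a list of documentation problems; an empty list means all clear."""
    problems = []
    documented = {entry["field_name"] for entry in dictionary}
    for column in columns:
        if column not in documented:
            problems.append(f"{column}: not in data dictionary")
    for entry in dictionary:
        if entry["source"] == derived_source and not entry["derivation"]:
            problems.append(f"{entry['field_name']}: derived field with no derivation")
    return problems
```

Running `check_dictionary` against the column names of each analytic table is one way to make sure a generated variable never ships without a note on how it was made.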
So when you're a person who's analyzing data from a data warehouse, when you download data from a data warehouse, like let's say I made a data warehouse and I served up this data. If you're gonna use the variables I generated, you're gonna wanna see what did I do? How did I make that variable? What does it actually mean, right? And so, see, if you look at towards the bottom, ggr underscore per underscore slot. So slot underscore ggr divided by number underscore slots from the casino table. Thank you, man, thank you. I can't tell you how much I've wanted to know stuff like that about data given to me, you know? Usually just a mystery. Like imagine you got a data set with just those field names and a bunch of numbers in it. Like you can't tell what's going on, you know? So I'm a big fan of data dictionaries and that's one of the things that I almost always insist that my learners or my customers make. Whether that we're just doing it for ourselves or we're doing it for somebody else, you know, because I'm never gonna remember what we did. If I need to help them again. And in my field of science, what we often do is do a study where we collect data and just the whole data collection thing has its own curation associated with it, you know, depending on how we do it. Maybe it's a survey, maybe we're measuring labs. What are we doing, right? So we've got all that curation. Then there's the analysis curation where we, you know, just like what I just showed you where we figure out what variables we're making and putting in our regressions or whatever. And then what happens? We submit it to a journal. Submit it and forget it, right? Submit it to a journal like six months later. They're like, okay, we want these changes. I'm like, who wrote this paper? Oh, I guess we did. I don't remember. I write papers like every day, you know, I don't remember, sometimes I'll be reading something. I'm like, this is really good. I was like, you wrote that. I don't remember. 
Obviously, if I like it, I must like what I'm writing. That's good, that's a good check, but anyway. So I'll get this paper back and maybe they're quibbling with my regression. Well, I'm not gonna remember any of this. So I pull out all this curation and in five minutes I've got my head in the game. And also there's something important that goes along with curation, and that is organizing your files. So those of you who take my LinkedIn Learning courses in, like, R and SAS, where I'm teaching programming, you'll see that I always make these folders, right? Or directories. One I call data, which is where you're gonna bring in the data and put out the data, and it's like, you could erase it. Like it's not your raw data from your survey that's stored in a lockbox somewhere. This is just kind of data in, data out, right? Now, those of you who are SAS programmers, I was always taught to use a different libname for in and out. I don't like that. I like in and out to be the same libname, you know? I just find it easier. And then, so you have the data, and then I'll have code, and that's whether it's SAS or R or whatever. I put my code there and I have code naming conventions, which are really important. I use numbers at the beginning of my code file names so we can run them in order, okay? And again, I encourage you to look at my LinkedIn Learning courses in data analytics and my book on data warehousing in SAS, which talk about these naming conventions. Because it's so important when you're on a team to use naming conventions and to use policies, organizational policies. People are like, I hate following policies. Well, you know what? I make these policies for myself and I follow them, you know? And my life is easier, you know what I mean? So it's kind of like, if you have a designated place for your keys every time you come home, you never lose your keys. What's wrong with that, right?
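One reason to put numbers at the front of code file names is that the run order can then be recovered mechanically. A minimal sketch, assuming file names that start with digits; the specific names and numbering scheme in the usage note are hypothetical, not the convention taught in the courses or the book.

```python
import re


def run_order(filenames):
    """Sort script names by their leading numeric prefix; reject names without one."""
    def prefix(name):
        match = re.match(r"(\d+)", name)
        if match is None:
            raise ValueError(f"{name!r} does not start with a number")
        return int(match.group(1))

    # Sorting on the integer value avoids the trap where "100_" sorts before "50_".
    return sorted(filenames, key=prefix)
```

For example, `run_order(["200_clean.R", "100_read.R", "50_setup.R"])` puts `50_setup.R` first, whereas a plain alphabetical sort would put it last.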
And so people are like, you know, I'm like, come on, what's wrong with consistent field names and stuff like that? It just makes everybody's life easier. It's just that what's hard is coming up with the policy. What's hard is coming up with something that's gonna work, you know, that you're gonna like in a year, and that works for you. I've worked for years and years and I found the best naming convention. So, you've got this data folder and you've got this code folder with the code naming conventions. Everybody knows what the code does. Everybody knows what order it's in. And then you've got doc, I call it doc, the documentation folder where you've got your curation. You've got your data dictionaries. Like if you're doing a survey, you've got a copy of the survey. Let me pull out a few different, let me pull out this one. So, one thing that you find yourself doing in healthcare a lot, and you may even do this in fintech, I don't know if you do it in other fields, is collecting data by looking up records online in some other application. It's called data abstraction. That's another term most people don't know. So, I'm gonna just show you an example of what I mean by that. Okay, so what we have here is not a real medical record. This is actually a screenshot from Wikimedia that I changed the data on. I made it like a real-looking fake record, right? So, this looks like a medical record that you might be able to look up, okay? And it's kind of small, but you see the name and there's stuff around it, okay? A lot of times in healthcare, we have this fantasy that if we have this medical records thing, we can just download the data from the backend, but most of the time you can't. The backend probably looks like a spiderweb, okay?
So, you usually have to come up with an abstraction thing where you're gonna just make a list of all the diabetics or whatever and look them up one by one and get the data out. And so, what I do is like here, this is an example of me trying to make an example. Like here I'm saying, this is where you go to gather the patient's current plan. This is the thyroxine plan. Like if you have to gather that data, it would be from there. And then here's another one that's curating, it says, this is where you get the date of birth and the gender, primary physician, medical record number. I remember working with an eye physician who was abstracting data. And one of the questions was, are they on a medication? And in her screenshots, there were actually three or four different places where that could be recorded. So it was really important for our abstraction to document that. Okay, I've got a question from the chat. How do you tackle a massive data set and what amount of data is useful to curate? This is a very good question. So this is how I do it. First, imagine that those casino data sets were not small. Imagine instead of having only about 10 variables that they had 100 variables. When they only had 10 variables, I just curated them all. It's like I saw some colored stones and I picked through them all and washed them all off and I just picked the pretty ones, right? Well, if you've got piles and piles of colored stones, you don't have time to go through and do that. So to carry the analogy, you just look for the shiny ones, right? So you kind of have to step back from your data set and ask yourself, well, what am I trying to do? And that actually can be a very hard question, again, for curation purposes. So one of the things I have is my LinkedIn Learning courses in big data healthcare study design, it's like one and two. They're based on epidemiology.
And the reason why I made them is I make stuff where I'm filling in the cracks of data science, where I'm seeing there's no instruction. And one was data curation, another was study design. How do you tackle a massive data set? So one of the things is, let's go back to this casino thing. Let's pretend I saw, oh my God, there's 100 columns. One thing that I would step back and say is, Monika, I know myself, I like to play the slot machines. That's what I like to do. I like to play the slot machines and I like to go to the restaurants, right? That's what I like to do at a casino. I don't go to the spa, it's too expensive. I don't know much about the spa. So I would start saying, I maybe need some research questions around, you know, just like around the, oh, great. Now how do you get rid of people? I don't have any block user on YouTube. Okay, sorry about that. I don't have any moderators, so I get these Zoom bombers. So let's say, so I like slot machines, right? And I know a lot about slot machines. I don't play the tables. I don't know a lot about poker. So I would have probably taken that big data set and tried to find the stuff about slot machines because I might understand that and I might be able to make a research question about it. And so it's kind of like you sort of throw together a little research design. And again, I emphasize descriptive is good. Like when I got to the Army to run that data warehouse, the data warehouse was centered around looking at rates of injury. And one of the injuries that's really important in the military is knee injury because you can't walk around and then you have to rehabilitate. So one of the things that we were really concentrating on preventing was knee injury, or dealing with knee injury. And I said to them, well, what are the rates of knee injury in the Army? Yearly or whatever. And nobody really knew. And I took over after a team had been running it for about 10 years.
So I looked at what that team had been doing. The team had been doing analytic studies, like with causal hypotheses. For instance, when women ovulate, their ACL I think gets stiffer or something and you might have more of a knee injury. Like they were looking at causal things. But I was like, but how much is there? Like what is the rate? And, you know, we want to make it go down. Where do we want to put it? And nobody had done that work. And that's not unusual. You know, even in 2008, like my transfusionist friend, my blood bank friend, you know, nobody had really pulled together the rates of people falling over or fainting, having a vasovagal reaction after giving blood. I mean, they had a lot of rates out there, but they hadn't really pulled them together into one publication. And we need to know that. And that's just descriptive. And so that's kind of what I would say in answer to the question: if you've got a big data set and you're thinking of analyzing it to answer a question or to answer some questions, first just sort of figure out what those questions exactly are and then sort of try and figure out somewhere in there what variables might be of interest. Just curate those. And even if you pull a variable that you think might be of interest and it turns out to be bad or something, document that. Like I remember I had a bunch of encounter data, clinical encounter data, or it was billing data. It was in Florida and surgeons gave it to me. And there was a flag for in-hospital mortality. And I didn't trust the flag because there were people who had a diagnosis of like dying and they didn't have in-hospital mortality. And there were people who were discharged where it said in-hospital mortality. So I just didn't trust the flag. So I said, yeah, I got this flag, but I don't want to use it. And I put that in the data dictionary. I put all that documentation.
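Here is a rough sketch of what checking a flag like that and writing the decision into the data dictionary could look like. Everything in it is hypothetical: the field names (`died_in_hospital_flag`, `discharge_status`), the four records, and the layout of the dictionary entry are made up for illustration and are not the actual Florida data or the documents from that project.

```python
# Hypothetical billing records; field names and values are invented for this sketch.
records = [
    {"id": 1, "died_in_hospital_flag": 1, "discharge_status": "expired"},
    {"id": 2, "died_in_hospital_flag": 0, "discharge_status": "home"},
    {"id": 3, "died_in_hospital_flag": 0, "discharge_status": "expired"},
    {"id": 4, "died_in_hospital_flag": 1, "discharge_status": "home"},
]


def contradictions(rows):
    """Return the ids where the mortality flag disagrees with the discharge status."""
    bad = []
    for row in rows:
        flagged = row["died_in_hospital_flag"] == 1
        expired = row["discharge_status"] == "expired"
        if flagged != expired:
            bad.append(row["id"])
    return bad


conflicting_ids = contradictions(records)

# The decision not to use the flag goes into the data dictionary, with the reason.
data_dictionary_entry = {
    "variable": "died_in_hospital_flag",
    "source": "billing data as received",
    "used_in_analysis": False,
    "note": f"{len(conflicting_ids)} of {len(records)} records disagree with discharge_status",
}

print(conflicting_ids)
print(data_dictionary_entry["note"])
```

The check itself is simple. What matters for curation is that the result and the decision end up written down next to the variable, so the question of why the flag was not used has an answer later.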
In fact, I made a whole bunch of curation documents just about that flag because they kept wanting me to use it. But when I would do some sort of sensitivity, validity, reliability checks, it looked like it wasn't right. Like it may have been a flag that was used for insurance purposes or something, but it wasn't accurate for the dataset. So yeah, so that's why curation is important is that, let's see here, I want to stop. Is that you want to make sure you understand your data and you want to pay attention to the data you're actually using in your analysis. Let me see here what other goodies I have here. I have a free course available online in how to set up data collection, which I made mainly for people who are in like a PhD program or a master's program where they have to go do a lot of data extraction, like I was saying. So this is an example of this form that I made for data abstraction. So you can kind of see where it says data panel laboratory. This is all an example I made for that course. I suppose I should probably give you a link to that course. Let me see here. Here's a blog post, I'll give you that. It's a free course in data collection. It leads to courses you can pay for so you can learn how to make the curation documents. But the free course just explains what they are. I'm sorry, my blog is a little slow to search so I'm looking for the post. Let's see here. And data collection is always this thing where people are like, oh, let's just get some students do data collection. You know what I mean? Like who cares about data collection? Data collection, data collection. Who cares, who cares? I'm sorry, I super care about data collection. That's why I made a whole course in it. What is data collection? It's measurement. It's a measurement of what you're gonna put in your model. So you don't want students who you're not supervising doing that. It's really important that you measure stuff right. And especially if you're doing a study. 
So all of this pickiness around data collection, I'm the only one you'll hear talking about that and it really bothers me. And one of the reasons is, like, for instance, Hosmer and Lemeshow, they're still professors of, you know, I think they're very legendary professors in statistics. They often fit their models with clinical data that probably is wrong. That's probably mismeasured. It has a lot of error in it. And I don't think they even realize how much error is in it. And I think they are overfitting stuff because historically men never did data collection. It was like, you know, when you have these jobs that only women do or only men do in some culture, I don't know, they just make up a rule that this is women's work and this is men's work. Well, they just made up a rule in society, like Western society, that women do key punch. They used to call it key punch or, you know, data entry, and men do data analysis. And that's why you have like Proc Lemeshow in SAS, you know, we have a Professor Lemeshow, but you have nothing named after women in SAS, you know, no Procs, nothing, because we were the ones doing double entry, you know, we were entering the data once and then entering it again and doing PROC COMPARE to make sure it was accurate, which is a waste of my time. And I proved that, I wrote a paper where we studied how much of a waste of time double entry is, you know. It's like so rude that these people were so worried about us measuring stuff and they didn't even curate their data. They didn't even bother to see if these measurements are any good. So that's part of why nobody knows what data curation is. You know, it's because we stratified these roles. The measurement role went to women and the analytic role went to men. And now that I'm in both, okay, the measurement matters, oh, not this guy yet. I'm gonna have to get some moderators. Well, at least I'm getting zoom bombed.
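For anyone who has never seen double entry, the idea mentioned a moment ago is that the same forms get keyed in twice and the two copies are compared field by field. The sketch below is a plain Python stand-in for that comparison, not SAS's PROC COMPARE, and the record ids, field names, and values are invented for illustration.

```python
# Two hypothetical passes of data entry for the same forms, keyed by record id.
first_entry = {
    101: {"age": 34, "sbp": 120},
    102: {"age": 51, "sbp": 135},
}
second_entry = {
    101: {"age": 34, "sbp": 120},
    102: {"age": 15, "sbp": 135},  # digits transposed on the second pass
}


def compare_entries(first, second):
    """List every (record id, field, first value, second value) that differs."""
    diffs = []
    for rec_id in sorted(set(first) | set(second)):
        a = first.get(rec_id, {})
        b = second.get(rec_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                diffs.append((rec_id, field, a.get(field), b.get(field)))
    return diffs


mismatches = compare_entries(first_entry, second_entry)
print(mismatches)
```

Note that a check like this only shows the two passes disagree. It does not say which one matches the paper form, and it says nothing about whether the measurement itself was any good.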
Let me see if there's another one I haven't shown you. Oh, here's a good one. Those of you who like to fit regression models, you'll like this one. So let me make this a little bit bigger. So this is where I keep track of the models I fit when I'm fitting like a logistic regression model, okay? So as you can see, this one, it's called badden because that's what I named the outcome. It's a binary outcome. And you can see under covariates, I list all the covariates that I put in the model and then I name the model with a number and then I make a list of the significant covariates. And I also keep notes about the covariates because I'm modeling and I wanna know why. Like here, I'm keeping track of some of them. Like here's a new working model. And you know, those of you who've taken my courses know how I fit my models, you know, that's this bidirectional stepwise or whatever, or stepwise selection. I guess Hosmer and Lemeshow call it stepwise selection. But anyway, so this keeps track of each step and each selection. Not this again. So then, let's see here. So I do that and I finally arrive at the final model, and then here is where I put that. And this was exported out of R. So you see, it's basically the same thing. I've got the estimate, the standard error. I've got everything on here and this is the odds ratio, which I got by exponentiating the estimate. And so I've got the actual model. This is ugly. This is not what I would put in the manuscript, right? But it just keeps track of what actually happened in the modeling and what my final model actually was. And if you look, there's four tabs. I was actually writing a paper where we had two outcomes. I made two separate models and so that's what I was doing. So this is an example of a curation file I use to keep track of statistical models and why I made modeling decisions, all right? So that was another one I hadn't shared with you yet.
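A model-tracking file like that can be sketched as a simple log. This is a minimal Python version under stated assumptions: the outcome name `badden` comes from the example on screen, but the covariates, the notes, and every number are made up, and the real file was a spreadsheet with output exported from R, not this code.

```python
import math

# One row per model fit: number, outcome, covariates tried, which were
# significant, and a note on why the step was taken.
model_log = []


def log_model(number, outcome, covariates, significant, note=""):
    """Append one modeling step to the log and return the entry."""
    entry = {
        "model": number,
        "outcome": outcome,
        "covariates": list(covariates),
        "significant": list(significant),
        "note": note,
    }
    model_log.append(entry)
    return entry


# Hypothetical steps; covariate names are invented for illustration.
log_model(1, "badden", ["age", "sex"], ["age"], "starting model")
log_model(2, "badden", ["age", "sex", "smoker"], ["age", "smoker"], "new working model")


def odds_ratio(estimate):
    """An odds ratio is the exponentiated logistic regression coefficient."""
    return math.exp(estimate)


# Final model table with made-up estimates on the log-odds scale.
final_model = [
    {"term": "age", "estimate": 0.05, "std_error": 0.01},
    {"term": "smoker", "estimate": 0.0, "std_error": 0.2},
]
for row in final_model:
    row["odds_ratio"] = odds_ratio(row["estimate"])

for entry in model_log:
    print(entry["model"], entry["significant"], entry["note"])
```

An estimate of 0 on the log-odds scale exponentiates to an odds ratio of exactly 1, which is a handy sanity check when reading a table like this six months later.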
I think I've shown people that in my LinkedIn Learning courses. Let's see here. Oh, oh, here. This is what I was looking for before. I wanna show you these. I was gonna show you flow documents and I couldn't find them and now I found them, right? So flow documents become really important. Whoops, what just happened? Flow depot, I'm sorry, clumsy. Flow documents become really important. Here, this is maybe not the greatest way of showing it but this is a flow document from a research study. I think it's that one I showed you earlier at the beginning of the stream where they were gonna do a study on two different treatments for gingivitis. So begin visit one. Participant completes participant questionnaire. Clinician completes gingiva observation form. So you can see that there's data collection going on. Like the participant's completing something, the clinician's completing something, I guess the clinician cleans their teeth or something and gives them oral health instructions. And then the clinician gives compliance form instructions to the participant. So the participant is basically being sent home with a treatment and they're supposed to keep track of their compliance with the treatment. Oh, I forgot. This was a randomized clinical trial where they randomized them to three treatments, three different groups. I think we ended up with two groups in this. This was one of the early curation documents. And so they don't know what they're getting but they're supposed to write down whenever they took it, right? So this is something that I would make if we were making a study and I would put it in the study protocol. So everybody in the world who's working on the study protocol remembers this is what we're doing. And so let's say you're a data scientist or a statistician and you're like, I don't want to have anything to do with your data collection and all that. I just want to analyze things.
I'm like, well, isn't it helpful to have this to know that the participant is filling out the questionnaire for themselves, but the clinician is completing the gingiva observation form? And often you just get the data as the data scientist, like you'll get some data set and somebody will tell you it's from the gingiva observation form. But then you also want to see that form. It's helpful to see that form. So the blank form itself is a curation file. And these things help, like let me show you. Maybe I pulled out another flow diagram. This is a little bit more complicated here. So this is the recruitment process for a study. Let me make this a little bit bigger. Okay, so you see how they're going through some sort of eligibility and if they don't meet it, do not recruit. Here's something else that happens, do not recruit. Those of you who are programmers use this to plan code or plan applications, and well, that's another thing that these are used for, application flow. If you can think about an application you use, like I used an Uber today. So I opened the Uber app and it says, where are you going? And I choose a location. And then it connects me with the driver. But let's say I didn't choose that. Let's say I chose that I wanted to add a different credit card, right? Well, you can make a flow diagram for each function separately. Like for adding a credit card, you'd have, you know, start and then, you know, click on add credit card. And then, you know, is the number entered valid, yes, no. And what happens? And remember, you can make an application flow diagram to design what the application is supposed to do. Or you can make it to document what the application actually does so you can modify it or so you can study it.
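The "add a credit card" flow described above can also be written down as data, which is one way a programmer might document it. This is a sketch only: the step names are invented, and what happens after an invalid number (show an error and go back to entry) is an assumption, since the stream leaves that part as "and what happens?".

```python
# Each node is a plain step, a yes/no decision, or an end point.
# Step names are hypothetical; the invalid-number branch is an assumption.
flow = {
    "start": {"type": "step", "next": "click_add_credit_card"},
    "click_add_credit_card": {"type": "step", "next": "enter_number"},
    "enter_number": {"type": "step", "next": "number_valid"},
    "number_valid": {"type": "decision", "yes": "card_saved", "no": "show_error"},
    "show_error": {"type": "step", "next": "enter_number"},
    "card_saved": {"type": "end"},
}


def walk(flow_map, answers):
    """Follow the flow from 'start', consuming one 'yes'/'no' answer per decision."""
    remaining = list(answers)
    path = []
    node = "start"
    while True:
        path.append(node)
        step = flow_map[node]
        if step["type"] == "end":
            return path
        if step["type"] == "decision":
            node = step[remaining.pop(0)]
        else:
            node = step["next"]


happy_path = walk(flow, ["yes"])        # valid number on the first try
retry_path = walk(flow, ["no", "yes"])  # one invalid entry, then a valid one
print(" -> ".join(happy_path))
print(" -> ".join(retry_path))
```

Written this way, the same structure works for both uses mentioned above: designing what the application is supposed to do, or recording what it actually does so the drop-off points can be studied.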
That's one of the things I learned from one of my customers. He worked at a place where people were filling out like applications to get on an insurance or something, but online. They'd submit their application, but they wouldn't all be approved, right? And he wanted to do studies of people who got halfway through the process, why they didn't finish. But, you know, the first thing I had to do was just really understand that. Like, it was sort of a proprietary app. So I had to actually be with him and watch him log in and try to map out this flow while he was doing it. So I figured out what the application flow was so that I could even recommend anything. He's a data scientist and they were asking him for recommendations. Should we change the flow or whatever? Well, what is it, you know? So we're coming towards the end of our time together. I wanted to make sure that I answered any questions, except from the Zoom bomber, from anybody who has data science questions about data curation. I wanted to make sure that you understand that there's backend curation where you have more tables, where you're talking more about the data that's in there, you know, how it got there and what it's about. And then you have more front end curation where you're looking at maybe like a medical record or a dashboard or something and you're trying to figure out what those fields mean. But sometimes you've got tables, sometimes you've got PowerPoint and you're, you know, documenting things. Like what's wonderful is you can always take a screenshot of something, throw it in PowerPoint and just put circles and arrows on top of it.
And another thing you can do, what's great about PowerPoint is if you've got a PowerPoint slide deck and you make a diagram on one of the slides, you can save as a JPEG and then it gives you an opportunity to choose to save the whole thing as JPEGs or just the current slide. You say just the current slide and then you can name it. And if it's like a square, like there's just too much slide around it, you can then open it up and just crop it. And then you can put that in a manual. You can put that in a protocol. And so that's why I use PowerPoint a lot to stage those, you know, because it's just so easy to plop the screenshot down on a PowerPoint and start putting arrows and stuff around it and, you know, text boxes. Usually, knowing you need some sort of curation starts to become easy. Like I gave that example of IBM Carecast. When I kept logging into dictionary 471 and it took me 20 minutes to figure out what I was doing, I was like, I cannot handle this. You know, I have to make some documentation so it doesn't take me 20 minutes to do two little edits each time I do this, you know? So you might be able to get to the point where you're like, I know we need curation. But what I have experienced when I've worked in IT departments, and I've worked in a few IT departments, is that once our group gets there, to "we need curation," no one knows what to do. Like no one knows what the next step is. I teach data curation, obviously, I have this course or whatever. Well, let me be honest with you. That part, that valley between "we need it" and "what do we need?", I don't know how to help you. Like I just know. And in the beginning, like I'll tell you, when I was doing my master's degree, I was trying to write in my master's thesis the situation whereby people were recruited for a clinical trial, but they didn't all qualify.
And my advisor taught me to make that flow diagram like the one I just showed you towards the end. Literally, she taught that to me. If she had not taught that to me, probably wouldn't know it. So after that, I was able to take that knowledge and then expand it, like apply it to other things like application flows, right? That's where I see a lot of times people are just not, like once, like some of my customers I've worked with for years, you know, I always do data curation with them. I always do it. What they tend to be able to adopt is the data dictionaries. They're good at adopting that. They're good at adopting my survey curation, which you can learn about if you take my data curation course or my data collection course. But they tend to be able to adopt those. It's the complex stuff where I'm designing dashboards using PowerPoint or I'm trying to explain the function of something using PowerPoint that becomes unintuitive. Like they don't just automatically know how to do that. Now I'm gonna just tell you something that is probably the reason. It took me five years to get my undergraduate degree, but I got a double major. My major is in costume design and my other majors in textiles and clothing. I had a core set of courses in design. I'm a designer, you know, I'm not really a programmer, you know, or whatever. So like I can, I design programs, you know. Because I'm a designer, a lot of the stuff just happens for me. And I'm sorry, you know, like that's data science. When I encounter engineers, they do stuff I could never do. Like that guy was talking about with that application flow. He's like an actuary. Like he does SQL, I could never do SQL like that. So we all bring something different to the table as data scientists. And I bring curation and I bring just, I'm naturally good at it. I can do the amban stuff very easily. I'm bad at programming. I don't even know how to use Python. 
Like I'm gonna learn Python one day when I have to, because that's the only way I learn the software is because I have to. Hello, LinkedIn user. I don't know who you are, but hello. Thank you for showing up. I'm towards the end. You'll have to watch the, you'll have to watch the rewind or whatever the re-stream of it. And those of you, if you're joining late, this is gonna be, this is on LinkedIn, but it's also on YouTube. And I actually try to go in after these and sort of put timestamps in there so people can get something out of them because I realized sometimes this is not a good time for everybody. But I like to hold them anyway because I don't know what questions everybody has. And it's really nice to just answer real people questions about this, that people just out there have. And so we're coming to the end of the stream. And I just wanted to thank everyone for showing up. Thank you to everyone who asked questions. No thank you to the Zoom bomber, but at least that's a little excitement. If you ever have questions or you want a free consultation, go ahead and contact me. We'll set up a Skype or a Zoom or whatever you like and talk about it. If I can give you quick advice that you can run with great, if it looks like you need to pay me and we need to set up some sort of, business relationship, we will. And then of course, thanks for sharing. Thank you Alejandro for showing up. Thanks for sharing this. We sure will do your data curations. It seems crucial if we want to get an order. I love it. Yes. You think of it as this is your new year's resolution is you're gonna get everything in order and you can take data curation course to help you with that. Well, thank you very much for showing up everybody. I appreciate you coming and please make sure that you feel free to connect with me on LinkedIn. I like connections and please subscribe to my YouTube. 
And so then you can know when I do these live streams and I'm gonna try and do them more often so I can get out there and answer those questions. All right. And if I don't see you before the weekend, have a good weekend.