Well, hello and welcome. It's good to see you, so I guess I'll get started. Welcome, everyone. I hope you're having a very good Tuesday, and hopefully I can share some knowledge with you. My name is Monica, and I'm an epidemiologist and biostatistician, but I'm also basically a data scientist. The lecture series I'm doing right now is about portfolio projects with pizzazz. I'm a LinkedIn Learning author of data science courses, not a ton of them, but a few, and I write books on data science, so I do a lot of teaching. And what I'm finding is that today we can learn a lot about data science on the internet: we can learn Python, we can learn R; it's a wonderful time to be involved. But the problem is, people get to a point where they've taken a lot of courses and they have these skills, but they can't really put it together. That's when I start recommending that you really should be doing portfolio projects. But what I started realizing is that's a very easy thing to say and not that easy to do. It's like anything where you learn certain skills: I used to be a fashion designer, and you can learn how to sew, you can learn how to knit, but how do you put that together into actually designing your own thing and making it from scratch without a pattern? People will say, oh, I love Kaggle challenges or things like that, which are not bad; definitely do them. And I don't say don't do courses; do courses. But eventually you have to come to a point where you're putting it all together, and what I'm going to talk about today is one of the tools in your toolbox for putting together a portfolio project, because in data science, one of the things we really like to do is this.
We like to use a lot of different data sets. It's very exciting to make a project where you combine data sets, because then you're making new knowledge, maybe like nobody ever did before. If you just analyze a survey data set, you're like, okay, everybody's done that, and probably everybody's asked the same questions. But when you connect data sets, you're coming up with something new. The reason I called today's event "poor woman's documentation of an ETL pipeline" is because I just want you to do it in PowerPoint. You don't have to worry about it or get anything fancy, and even if you don't have PowerPoint, you can use Google's version of PowerPoint. Anyway, that's what we're going to talk about today, and I really appreciate you showing up. I'm going to try to watch the chat, so if you have any questions, put them in the chat and I'll try to answer them. So today's lecture is about ETL doco. One of the things my customers ask me is, what do I mean by "doco"? Well, I literally mean the word documentation. But what do I mean by documentation? When you think about it, when you do a portfolio project where you take data, ask a research question, even a simple one, and analyze it, you have output, which is usually some tables and figures. But the doco, the documentation, is what you keep to remind yourself how you did that. And the reason you have to remind yourself how you did that, if you're doing scientific research like I do, is because you have to write about it later if you're going to put it in a scientific publication. But more than that, you just have to remember what you did, because if you go back to it later you'll wonder, okay, why did I include these in a certain group? You have to remember that.
Okay, it's science. And if you're making a portfolio project that you're going to be posting online, maybe on GitHub or WordPress, you're going to want to explain how you worked with your data, so that if somebody tries to use the same data and does what you did, they'll get the same thing. So it's basically replicability, whether it's you replicating your own work, you communicating with somebody else so they can replicate your work, or you just being on a team. There's a trick to making documentation that's easy to use. I've gone into shops, like data warehouses where they're doing a lot of transformation of data, and I'll ask, do you have any documentation? They'll say, here, and it reads like a story. I can't read a story; how am I going to look things up in a story? I appreciate people trying to at least keep some documentation, but if you don't keep documentation in a really efficient way, you can't look up what you're looking for. So doco really means this whole collection of curation files you create, what I call "backstage," while you're doing your data analytics and trying to answer your research question with tables and figures, so that later you can write about it. Some of this documentation can be diagrams; in fact, what we're talking about today is exactly that. Some of the documentation is tables, and some of it is writing. I don't want to make fun of people writing a story, but very little of it usually is written explanation; a lot of it is just some sort of systematic knowledge you keep. So that's what doco is. Now, what is ETL? Well, if you're in data warehousing, you're very familiar with the concept of ETL. It technically stands for extract, transform, and load. Say I've got a data warehouse here and I'm getting data from different places.
Let's say I'm getting data from Medicare. I have to go to Medicare, extract the data, then transform it to make sure it fits the shape of my warehouse, and then load it. And then let's say I go get data from Medicaid: I extract, transform, and load that too. That's ETL. So you might be thinking, wait a second, what if I'm doing a portfolio project? I'm not running a data warehouse. In fact, what if I'm getting data from a data warehouse? Well, you're still kind of doing extract, transform, and load; it's just that the load is a little different. Let's say you do something simple, like getting data from a surveillance data set like BRFSS, which I often demonstrate in my LinkedIn Learning courses, and you extract some of that data. (Oh, hello, welcome! I'm so glad to see all of you joining.) So you go to BRFSS and extract, say, educational level along with other things, because it's a survey of people. And then you realize, wait a second, this has too many levels, too many categories. I've got to recategorize, so I'll make a new variable, something like EDU2, with fewer categories. Well, that's you transforming it. So the extract is where you take just what you need from the original data sets, the transform is where you make the variables you want, and the load is basically producing your analytic data set. Now, I said BRFSS because that's kind of easy; it's survey data. Let's make it a little more complicated. Let's say you're taking data from two different places. Maybe you're doing a study of hospitals, so you have data from the hospitals, but the hospitals are in counties, and you have county-level data that you're going to put together too.
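As a minimal sketch of that transform step in code (the level codes, cutpoints, and variable names here are assumptions for illustration, not the official BRFSS codebook values):

```python
# Hypothetical "transform" step: collapse a many-level education
# variable into a smaller recode (the EDU2-style variable described
# above). The numeric codes below are made up for illustration.

def recode_education(educa):
    """Collapse a detailed education code into three broad groups."""
    if educa in (1, 2, 3):   # assumed: less than high school
        return "Less than HS"
    if educa == 4:           # assumed: high school graduate
        return "HS graduate"
    if educa in (5, 6):      # assumed: some college or college grad
        return "College"
    return None              # refused / missing

rows = [{"educa": 3}, {"educa": 4}, {"educa": 6}]
for row in rows:
    row["edu2"] = recode_education(row["educa"])

print([r["edu2"] for r in rows])
# → ['Less than HS', 'HS graduate', 'College']
```

In a real project you would document this mapping in your doco so anyone rerunning the code gets the same categories.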
Well, then you're going to extract, transform, and load the hospital data, and extract, transform, and load the county data, and see how I'm waving my hands in the air? That means it's complicated, and maybe you should make a diagram. But don't get wrapped around the axle; it's not that hard. You can just use PowerPoint, and most people can figure out PowerPoint, though I'll be honest: it takes a little practice to get good at using the shapes, knowing the real estate, knowing how to move things around, or even knowing exactly what to put in. That was one of the hard things for me. Sometimes you have a big-picture ETL and sometimes you have specific little ETLs, and eventually I realized you can make different pictures, some that are more bird's-eye view and some that are more specific. What I wanted to teach you today is that it's just easier to explain how to make ETL doco with examples. Whether you're doing a portfolio project you're going to publish online on your GitHub, or a scientific paper where you're going to write the methods section, it's super useful to keep these diagrams. I like PowerPoint, but you can use whatever you want. On Macintosh there's Keynote, which is an amazing program, don't get me wrong, a crazy cool program, but it doesn't work that well for this kind of documentation; it's more for making really cool presentations, with animation and all that. I recommend PowerPoint, or something PowerPoint-like, such as what Google has, for making these diagrams. So, as I was saying, doco is always short for documentation. There are other words for this: there's "data curation," which I also use, and there's "metadata." Especially, you know how I was just saying, what if you download two different data sets and use just pieces of them for your portfolio project?
Well, those data sets often come with documentation, just like BRFSS, where you can download the codebook and the actual survey they administered. That's metadata, or data curation files, and it all goes in your little backstage toolbox so you remember what you're doing, or are informed about what you're doing. And then ETL is this extract, transform, and load process. People in research don't usually use the term ETL, but I'm going to use it to mean the process I just defined. By the way, you'll want to download these slides, mainly because I put so many links in there. Let me put this link in the chat so you can download them (I'm getting better at this). It was simpler for me to just make the slides and put the links there. These are links to blog posts where I have all these different use cases, because it's a lot easier to just show you examples than to try to explain how to do it in the abstract. What usually happens with my customers is they say, well, I need to make documentation, or I'm trying to communicate with them and I realize, okay, we're not communicating. So then I start making an ETL diagram, and then we communicate. That's usually how it starts. Somebody asked me the other day, how do you know when you need a diagram? And my generic answer was: if you get tongue-tied and you start not communicating, you probably need a diagram. So we're going to talk about that. Flow charts are a really common thing to use generically in business and in data science. Not so much in epidemiology and biostatistics, but we're getting there. They're very common in informatics, because that's business. Flow charts are awesome. When I was growing up, I was aware that flow charts were used for planning your programming. But you can also use flow charts to explain how people go through a clinic, like clinic flow.
You can use them for ETL documentation like we're going to talk about today. You can make decision trees, like what diagnosis to give somebody under certain circumstances. So, since flow charts are so useful, it's good to make them, and one easy way is in PowerPoint, because most people have it. One thing I didn't put in the slides, but I do want to tell you: if you make a PowerPoint file, it obviously has a bunch of slides, like the one I'm literally showing you right now, and it's a .pptx file. If you make a slide with a diagram on it, you can do Save As, choose JPEG, and type a name. When you hit Save, PowerPoint will ask whether you want to save all of the slides as JPEGs or just this one, and if you pick just this one, there you have a JPEG. This is how a lot of people make memes: they'll put a photograph or some clip art on a PowerPoint slide and annotate it. That's basically what you're doing when you make ETL documentation or any of these flow charts: you arrange it in PowerPoint, and then you can do that Save As JPEG thing. What can you do with that? Well, you can post it on the internet as an image if you're making a blog post. But what I always end up doing is putting it in a manual, right? Because I'm always trying to do governance, trying to get people to do data entry correctly, or whatever. So I often put these things in a Word document, which you can do. I'm talking about PowerPoint because it's so easy and a lot of people have it, and even if somebody doesn't have PowerPoint, you can save the diagram, or the whole deck, as a PDF and show it to them.
There is other software out there for doing this flow charting, like Microsoft Visio, and I used to use an online tool called Gliffy. The reason was that I had trouble sharing the flow chart with my colleagues; I wanted them to be able to edit it. Well, you can make it in PowerPoint and send it to them, and then they can edit it. So just use PowerPoint for poor woman's documentation. You don't have to mess around, and everybody can edit it. You do have to know what you're doing, though. One of the things I realized is that not everybody knows that the official flow chart shapes, which I'll show you, actually mean something. If you choose this hot dog shape, that means a terminus, or terminal, which marks the beginning or the end of a process. If you choose something else to indicate the beginning or end of the process, I'll be confused, and so will a lot of other people. On the other hand, nowadays you can use richer items in your diagrams. You can use a picture of a programmer when you mean a programmer; you don't have to just use a dot or something. The goal of every diagram or visual you make should be visual communication. One of the things I sometimes end up doing is splitting diagrams. I'll have this big complex diagram and think, okay, that's too complex, and I'll figure out a way to split it into two diagrams, like a copy, paste, and then remove some stuff from one. Think of it like a map. If you look at a map and you want to know where the trees are, that's interesting. Okay, now remove the trees; I want to know where the railroad tracks are. That's interesting too. Sometimes you can't just look at everything at once. So it's actually harder than you think.
But what's cool about it is you can play with it and change it around. The first example is the super complex one here, which is warehouse-style ETL doco. I'm going to walk through this diagram. It's not from a real project, but, like all my diagrams, it's sort of based on a true story. It looks super complex, but when I read it to you, you'll be like, wait a second, that's not as complex as I thought. Okay. First of all, let me call out these flowchart shapes. See this hot dog shape? It says "start, original data set." Notice how it's green, so it's like saying, oh, okay, I get it. And it's in the upper left, so already you're going, okay, this is where it starts, right? (And by the way, this whole thing is ETL doco.) And here's the end. It's red, a kind of bright red, and it's your analytic data set. The definition of the analytic data set is the data set you're going to use to analyze: the one you're going to read in and do, you know, bivariate or multivariate analysis with, and answer your research question. It's the data set where you've already applied inclusion and exclusion criteria, so it's just the rows you're going to have, and you've already transformed your variables. That's why it's at the end, right? Because we just did all that. So those are the termini. Then look at these cans. These represent data sets. If you look up flowcharting shapes, there are other shapes for data, but lately we've all been using these cans, and it's good to use them for any sizeable data set. If you have just one row of data, or just one file, you could probably use one of the other shapes. And then you'll see this one here, the rectangle with the parallel lines, which means a predefined process.
Anybody who knows me knows I'm all about governance: you're required to do this, use this official code, or use this SAS macro; don't just get creative. That's why I use the predefined process shape. If it doesn't have the parallel lines, if it's just a rectangle, it just means a process, so if I feel like the analyst gets to be creative there, I don't use the parallel-lined one. Anyway, let's read this. Okay: start, original data set. So we're importing, and I guess the data set is called January 2023, so you're imagining a monthly data set. It says "code 100," and I'll get back to what this number is. Here, "code 100" identifies which code file we're talking about. So what are we doing? Well, we've imported this data, whatever it is. There are other curation documents that go with this, like a data dictionary and other things, but I'm focusing on the ETL. In this step, we would add a row ID as a primary key, plus file name, file year, and file month. One of the things you do a lot in warehouses is receive either monthly or quarterly data, and it's about the same thing each time. I knew a professor at USF who ran a county-based data warehouse, and every quarter he'd have to load county data. Think about it: counties are counties; they're the same all the time, with the same hospitals and the same zip codes. So he was always sort of updating data, and he could create a regular procedure like this that they could run every quarter, right? One of the things you want to keep track of when you're doing this is which quarter you're on, which quarter you're uploading. And sometimes in the first step, you're going to upload just the identity information about this version of the data. So I said the output data frame is month_A.
How we would always do it, and I think this is the only right way to do it, is this: when you're doing your transformations, you import a data set, like I was calling it month_A. Then I transform it and output month_B. In the next transformation step, I read in month_B, transform, and export month_C, so you can see what's going on. This is an idea I got from SQL; it's literally how structured query languages run. Every time they commit, or write data to disk, they set what's called a rollback point, so that if the electricity goes off in the middle of a query and then comes back, the database goes back to that rollback point. So, for example, here we add this, and here we're importing a crosswalk and merging it in, and dropping the SSN. See, this crosswalk is crosswalking on an ID and dropping the social security number, which is one of the things you do in warehousing: you de-identify data that way. And the output is month_C, so you can see that. Let's say I was the supervisor, and I looked at month_C and said, hey, the SSN is still in there; you didn't drop it. Then we'd know it happened somewhere in 105 or 100. You see what I'm saying? If later we see something downstream is screwed up, we know where that happened, and so you can troubleshoot. This is called modular code. When I name the code files, which you'll see on a subsequent slide, I name them by prepending those numbers, like 100_read_in_data and 105_add_crosswalk; I forgot exactly what I called it, but we'll see it on the slide. So as you can see, if I bothered to be so organized about setting this whole thing up, and realize I set this whole thing up first, to get the data going.
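The modular month_A → month_B → month_C pattern described above can be sketched in a few lines of code. This is a minimal Python illustration; all the field names, the crosswalk contents, and the step names are hypothetical, and a real pipeline would read and write files between steps rather than pass lists in memory:

```python
# Each numbered step reads the previous step's output and writes a
# new one (month_a -> month_b), so a bad result can be traced back
# to the step where it first appears: a "rollback point."
import copy

def step_100_read_in(raw_rows, file_year, file_month):
    """100: add a row ID primary key plus file identity columns."""
    month_a = []
    for i, row in enumerate(raw_rows, start=1):
        out = dict(row)
        out.update(row_id=i, file_year=file_year, file_month=file_month)
        month_a.append(out)
    return month_a

def step_105_deidentify(month_a, crosswalk):
    """105: merge a study-ID crosswalk on SSN, then drop the SSN."""
    month_b = []
    for row in copy.deepcopy(month_a):   # leave month_a untouched
        row["study_id"] = crosswalk[row["ssn"]]
        del row["ssn"]                   # de-identification step
        month_b.append(row)
    return month_b

raw = [{"ssn": "111-22-3333", "age": 67}]
xwalk = {"111-22-3333": "S001"}
month_a = step_100_read_in(raw, 2023, 1)
month_b = step_105_deidentify(month_a, xwalk)
print(month_b)   # SSN is gone; study_id and row_id are present
```

If a supervisor spots an SSN still sitting in month_C, this structure says exactly which numbered step to inspect.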
I made the diagram later. So if you're going to be this meticulous about setting up this ETL pipeline, you might as well make this diagram to showcase what you did, because then if you share it with an analyst, they're like, okay, I get how all this code fits together. It's obviously sorted in order, but now I get the picture, right? You can probably read the rest of this: here is "calculate and add age," here is "import age group crosswalk." So this is a crosswalk. If you follow me and come to the other events in this series, one of them is focused on how to make crosswalks; I really get into crosswalks, and I'll tell you, crosswalks make a great portfolio project, because you can add your own data by crosswalking in a cool way, so make sure you show up for that. Anyway, as you can see, you can just read this and see these steps are doing different operations, and then here's the big one: export analytic. And then we've got the analytic data set. Now, as you probably intuited, maybe I wouldn't always make this. You definitely want to make it if it's a warehouse, because everybody's doing the process, you want everybody to do it the same way, and you want it to be easier for the next person. I usually add curation documentation over time, because we're doing almost the same thing each time we load data into the warehouse. But if you're doing a one-time analysis with a team, like you got some data from somewhere and you're just doing this one analysis, you probably won't have this big diagram; it probably wouldn't be that complicated. For example, you might not put the number of the code file on each step. In fact, you might not even make a diagram if you were just using BRFSS data. You might be able to just look at your code, and its name, and remember what you did from your other curation documents.
So you don't always need this, but you usually end up making these if you're running something like a warehouse, because you're doing the same thing over and over and you keep making it easier. Now, I want to tell you: even if you're just using PowerPoint, and you make this nice ETL pipeline documentation, you might think, okay, this is a pretty diagram; what a waste of my time making a diagram. It is not a waste of your time, because that's actually how a lot of software works. This is a screenshot from a SAS white paper about SAS Data Integration Studio. For those of you not that familiar with SAS, it's a statistical program, like R or Python, but most of the old SAS users like me got used to using it just by coding. SAS has come out with, I mean they were always doing this, but now they're really emphasizing, platforms where you can drag and drop a lot. SAS Data Integration Studio, or DI Studio for short, is one of those drag-and-drop tools. If you actually use it, see these little objects in the figure? If you double-click on them, there's code in them, there's stuff in them. And yes, you can drag and drop, but you don't want to just sit down at SAS DI Studio and go, okay, I'm going to play with some data. You want to know what you're doing, and these diagrams are really helpful, because then you can program SAS DI Studio, and you probably have some data or code lying around that you worked on that you want to put in each of these steps. So your diagram can help you not be confused about what you're actually trying to do. Now, as I promised, we get to the code. This is the code, from GitHub, as you can see here.
See how it's "100 read in and convert to RDS," "105 de-identify," "110 add age variables," "115 finalize and export"? In order to run the code in order, what I do is prepend these numbers, but I also put a little description. And then, like I was saying, in the code, when I read in a data set, I'll read in underscore A and export underscore B, and that way I can do a rollback. If you organize your code this way, and your data input and output that way, and you've got a team of people all using SAS or all using R, you will keep yourselves organized, and it's nice to have this documentation with it. Again, with ETL you need to be able to make it replicable for everybody. So let's say we publish on the BRFSS data. There's a possibility somebody at the journal (they don't do this much anymore; they probably should) could say, hey, we want to make sure you're not screwing up your data analysis. You have the original data and your code, so they should be able to replicate your analysis; even if you just give them the documentation, they should be able to replicate it. A good example: I was hired for a project analyzing a national survey data set about business owners; it wasn't even epidemiology. The group hiring me wanted the analysis done for certain states each time. Somebody had done the project analyzing certain states, then that person left, and they were going to ask me to do some other states. That person had left a technical appendix in SAS that had their SAS code and some documentation. I was going to redo the analysis for the new states in R, because at the time I didn't have access to SAS; they didn't have SAS ODA yet. So I was just going to do it in R.
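The point of prepending those numbers is that plain alphabetical order becomes execution order. Here is a tiny Python sketch of that idea; the file names are hypothetical examples in the style described above, not the actual repository contents:

```python
# Numbered file names mean a lexicographic sort IS the run order,
# so any teammate (or script) can run the pipeline correctly.

files = [
    "110_add_age_variables.R",
    "100_read_in_and_convert_to_rds.R",
    "115_finalize_and_export.R",
    "105_deidentify.R",
]

run_order = sorted(files)   # alphabetical sort = pipeline order
for f in run_order:
    print("running", f)     # in real life: source() / execute each file
```

Numbering by fives or tens (100, 105, 110, ...) leaves room to insert a new step later without renaming everything.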
But because they included their SAS code and all this documentation, I could replicate their analysis. And honestly, their documentation was pretty good, because oftentimes you get nothing. But the documentation I give others is way clearer, because I don't want anybody miscommunicating with me. I don't want somebody doing something and saying, why did you get 100,000 records? I only get 50,000. That's pretty different. So I just do more due diligence about it. Here's an example that's more like research, not warehousing, and it's from NHANES. I was just talking about BRFSS; NHANES is like BRFSS, but it includes examination data. Because it includes examination data, it's got multiple data sets. You've got the osteoporosis data set, where they ask about osteoporosis; there's a biochemistry panel; and so on. In this case, because there are so many different data sets just to assemble your analytic data set, even this simple diagram makes it easy to hand the project off to another analyst, or even to hand it off to yourself. A lot of times when you work on these things, you do a lot of work, like you might work all weekend, and you think you're never going to forget what you did, but sure enough, when you get back to it the next weekend, you can't remember. So you're going to love it if you keep this documentation, and your boss will love it too. And you especially want to keep this documentation if you have multiple source data sets. I did this analysis of hospitals with a colleague, and we were adding data about the hospitals from so many different sources. See the analytic data set in the middle of the diagram, with arrows going toward it? It's a very big-picture diagram.
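Assembling one analytic data set from several source files that share a respondent ID, as in the NHANES example above, can be sketched like this. This is a minimal Python illustration; the variable names and values are made up, and I'm using "seqn" only because NHANES files share a respondent sequence number by that name:

```python
# Hypothetical left-join of several source data sets (demographics,
# a questionnaire file, a lab panel) into one analytic data set,
# keyed on a shared respondent ID.

demo = {1: {"age": 70}, 2: {"age": 55}}   # everyone appears here
osteo = {1: {"osteo_dx": 1}}              # only some respondents
labs = {2: {"alk_phos": 88}}              # only some respondents

analytic = {}
for seqn, row in demo.items():            # start from demographics
    merged = {"seqn": seqn, **row}
    merged.update(osteo.get(seqn, {}))    # left-join the other files
    merged.update(labs.get(seqn, {}))
    analytic[seqn] = merged

print(analytic[1])
# → {'seqn': 1, 'age': 70, 'osteo_dx': 1}
```

Each arrow pointing at the analytic data set in the diagram corresponds to one of these joins, which is exactly why the diagram reorients you so quickly.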
But even this was helpful to make, just to remember what was going on whenever I returned to this project; it quickly reoriented me. It's almost like a map: just a quick look and I go, oh, I remember this project, because you'll end up doing a lot of projects. And imagine putting something like this in a portfolio project and publishing it online. Employers would be like, this is awesome; I could communicate with this person. That's one thing I really like about it: it makes communication easier. So, we've been talking about documenting data, and about using data from different sources. I've been talking about BRFSS and NHANES, which are research sources, but we don't always get data from research sources. We're used to research data: it's prospectively gathered and it's well documented; if you go to BRFSS, you can just download the documentation. But more and more, we're getting data from applications, like EMR applications or health apps on people's iPhones. And sometimes you're in a research group and you get that data, and it's not even from your own organization. I know when I worked at the state department of health, we often gave out data sets to people in academia that we extracted from our public health systems, and they were like, what is this? This would happen to me when I worked at the Army, too: people didn't understand data coming out of HR systems, and they were like, how are you going to analyze this? So I'm holding a workshop called Application Basics: Adding Analytics to Pipelines; that's our theme this month. If you come to this workshop, you'll be able to understand the data sets that come from applications well enough to analyze them and produce results, so you won't be stuck at square one, not knowing what to do.
First, the workshop is about computer applications: how they're designed, the design approaches, how teams work together to build them, and how the data are actually stored in those applications, because that's usually what we're dealing with. I go over a lot of terminology in application development. If you come to this workshop and get this knowledge, you can break through communication barriers, get the answers you need to complete your analysis, and be seen as an expert with the data you're given from these applications. So here are the details of the workshop. Again, the title is Application Basics: Adding Analytics to Pipelines. This month, it's Saturday and Sunday, December 16 and 17. Each session starts at noon Eastern time and runs about three hours, and we have a lot of different interactive activities. I looked it up: comparable data science workshops, where you actually gain a skill and learn a lot, can run from about $250 to $750 for a weekend workshop. But today, it's free to sign up for this workshop. I'm going to put the link in the chat, and you can sign up if you're interested. What's great about learning this kind of information in a workshop is that you have experiences with the other people there, and when you have experiences and actually interact about these things, it sticks in your mind. You're more likely to say, hey, I spent a few hours on this and I remember what I learned, than if you just take an online course. So if you're interested at all, I encourage you to sign up. Well, thank you very much for coming to my event today about the poor woman's ETL documentation. I hope you learned something, and I hope you feel inspired to open up PowerPoint and just start practicing by making diagrams, because the more you practice, and the more you show them to people and get feedback, the better you'll get at it. All right.
Thank you very much for showing up on this Tuesday, and I hope you have a very good week. If you're interested in joining the program, please sign up for a 30-minute Zoom interview using the link in the description.