This video is part of the Public Health to Data Science Rebrand program.

Hi, everyone. It's Monika the health data scientist here, and today we are going to talk about ETL pipelines. Here are the topics I'm going to cover in this video. First, if you clicked on this video and you didn't know what an ETL pipeline is, never fear: the first thing I will explain is what an ETL pipeline actually is. Next, I'm going to explain why you want to document your ETL pipeline. I'm going to focus on why you would do this when you are working on a research project, but I will also give a shout-out to why you might do it in other circumstances. Actually, the best reason I can give you to document your ETL pipeline is communicated by the image on the slide, which is a picture of two people being very friendly with each other. If you want to make friends and not enemies in the research field, please take me seriously: document your ETL pipeline. Third, I'm going to explain exactly how to document your ETL pipeline. There are different ways, but I'll show you my style. I'll explain the different files you can make to document your ETL pipeline and walk you through an example based on some real work I did at the US Army. Finally, if you are a SAS user, there is a bonus for you at the end of the video, so please keep watching.

Okay, let's move on to answering the question: what is an ETL pipeline? Let's start by considering what the E and the T and the L in the abbreviation ETL actually mean, specifically in the context of a research study.

Let's start with the E, which stands for extract. Extract refers to the original data you received from the data provider, or original extract. So if you are on a research team analyzing clinical trial data, the extract would be the raw data you took out of the data system that holds the live study data from your clinical trial. As you know if you are in this business of data analysis, when you extract data from a data system, it has a lot of problems with it from the standpoint of a research analyst. First, the original extract likely has confidential identifiers or other private data in it that you need for your analysis, like to link study visit data together, but that you do not want to share with the entire research team. Also, your extract might have native field names you do not like. When I get data from a SQL database, those field names are often very long, with many underscores. I do not want to type all of that when I'm analyzing data, so those native field names could be a problem for me, too. Another problem you could have with your extract is that there are fields in it that have issues for various reasons, like messy or inconsistent coding, or they are easily misinterpreted. They might be fields that you can clean up, so to speak, and then share the clean version with others, but you don't want to give them the raw variable because it's a mess and it's easy to mishandle in an analysis.

Okay, let's move on from E, extract, to T, which stands for transform. Transform refers to the process of taking the extract and doing transformation steps with your code to turn it into the analytic data set, or the data set on which you will base your research analysis. I say transformation steps because if you do this stuff, you will know that even in relatively simple cases, you often have to do a lot of coding to go from the original variables you got to the analytic variables
you actually want. All that coding is called transformation code, and each change you make to go from the native extract to the analytic data set is called a transformation step.

Finally, after E for extract and T for transform, we have L for load. In the case of a research project, where we are using a static analytic data set, the term load is kind of a funny word to use, because we are not really loading our analytic data set into any data system. Load in our case technically means publishing the analytic data set to the group of analysts, or basically sharing the analytic data set with the research team and pointing out that this is the official data set that will now be used for the research project. If you are on a research team analyzing clinical trial data, you will appreciate why load is so important. If you are writing three different papers off of one analytic data set, you need to make sure all the n's and percentages agree across each paper. Basically, you have to keep your story straight, because if anyone reads all three papers, they will see discrepancies if everyone on the different writing teams was not on the same page with your ETL pipeline.

Okay, now that I've explained what an ETL pipeline is, I will delve deeper into why you need to document your ETL pipeline. First, I'll focus on why you want to do it in a research study. Basically, if more than one person doing ETL has access to the E, the raw data extract from the system, the ETL pipeline documentation is a way to keep all those people on the same page as to how the data are supposed to be transformed from the E, the extract, to the final published analytic data set. Usually in that case you have official load code that you all maintain, and you keep the ETL pipeline well documented. But let's say you are doing a research study and only one person has access to the raw data. This is not unusual, for privacy and confidentiality purposes. How will the research team understand the data you create for analysis if they do not know how it turned into your analytic data set?
Non-analysts on the research team are often involved in data collection, so they remember some of the data they collected and entered, and they will be suspicious if they do not recognize the results coming out of your analytic data set. ETL pipeline documentation is a way to answer their queries. Even other analysts may have experience with the data set, and they will want to know how you went from the original variables you received to the ones you are publishing to share with them in the analytic data set.

Now, this video is not focused on documenting ETL pipelines as part of running a data system. However, I did want to mention that if you want to automate a data system, you really need to get good at ETL pipeline documentation. If you are running a data warehouse and regularly load data into it, you have to do a good job of documenting your ETL pipeline. But even in a small research study, documenting your ETL pipeline is just a part of good data stewardship and governance. So you really should learn the skills of documenting your ETL pipeline, and you should always try to do a good job of it on every project where you have an ETL pipeline.

Now I'm going to show you how to document your ETL pipeline. First, we are going to go over the minimum documentation you should develop, which consists of two things, in my opinion, as you can see on the left side of the slide. The first thing you need to create is a complete data dictionary. And when I say complete, I mean that you need to make available all crosswalk and pick list documentation, as well as documentation of all the fields in both the extract and the analytic data set. If you are not sure what crosswalk and pick list documentation is, or you basically just didn't understand what I just said, don't worry, I'll show you an example. The second file you should provide in your minimal documentation is the actual analytic data set, but that's kind of a gimme, because you would be sharing that anyway with the team. Before we get into the additional ETL pipeline documentation you can do, which is on the right side of the slide, let's just go over examples of these two items.

Let's start by looking at my computer and seeing how I organize my directories for this project. Okay, you will see I have three folders here. The first one is called analytic. That is my analytic environment. I described in my book, Mastering SAS Programming for Data Warehousing, how and why I set up this environment, but I'll explain it quickly now. You will see there are three subfolders: code, data, and doco, which stands for documentation. Let's see what's in the code folder. Wow, look at that. It's our transformation code. Notice the naming conventions. The first one is named 100, followed by an underscore, followed by a label that gives a hint as to what the code does, which is read in and convert to RDS. By naming the files with numbers first, I can adjust the numbers so that I can deliberately line the code up in the order it should be run.
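For example, a code folder following this convention might look roughly like the listing below. Only the first two names reflect files actually discussed in this project; the rest, and the .R extensions, are placeholders I'm making up for illustration:

```r
# What the code folder might look like with number-first naming
list.files("analytic/code")
#> "100_read_in_and_convert_to_rds.R"  "105_deidentify.R"
#> "110_decode_mos_group.R"            "115_make_age_groups.R"
```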
That way I can keep my transformation steps in order. Notice I keep some space between the code numbers, like 105, 110, and 115. That way, if I screw up and need to sneak in another transformation step between code 105 and 110, I can name it with a leading 107 or 108 to keep it in the right order of operation. Putting a little label after the number in the name also helps: it helps whoever is using the code to keep track of what it is doing.

Okay, now let's click on the data folder and see what's in it. Hmm, look at all these data sets. The data folder in the analytic environment is a special folder. If you are a SAS user, this would be considered both the infile and the output folder for data transformation steps. This is the folder where we would put our original extract. It is also the folder where we would save any incremental data sets we made while executing the transformation steps. Also, if we wanted to patch any data on from external sources, like if we wanted to decode some diagnostic codes using an external table with code classifiers and descriptors, the raw external table needs to be in this folder so it can be imported for processing. And finally, this would be the folder where the final analytic data set would land after processing. The basic rule for the data folder in the analytic environment is that you should theoretically be able to erase all the data in the folder and not lose any original copies of data. In other words, this data folder just holds copies of original data: a copy of your extract, a copy of your external diagnostic code table, and a copy of your analytic data set. Of course, data sets written to this folder during transformation steps can be regenerated by just running the code, but you have to make sure you store the official copy of the other data sets somewhere else for safekeeping.

Okay, now let's click on the doco folder and see what's in there. Well, it looks like we have two files in our documentation folder. The first is a data dictionary in Excel, and the second looks like a diagram in PowerPoint. We will look at these momentarily, but let's look at the other folders that are outside the analytic environment first.

Okay, it looks like we have two different folders to look at outside the analytic environment. Let's look in the folder called crosswalks. We see three crosswalk files in there. The first is called age group crosswalk, the next is MOS group crosswalk, and the final one is called SSN-to-DEID crosswalk. These are examples of external tables you can use in transformation steps to help decode original variables you get, like the diagnostic code example I gave. The original crosswalk files are kept here.

Okay, now let's look in the last folder, called original and final data sets. It looks like we have three data sets here. The first one, labeled one, is the original data set from the provider. Of course, this is just fake data I made up, but I wanted to make it realistic. Actually, let's open this file up and look at it. This is our E, the extract. Wow, this is a very small data set. It only has 10 rows; that's because I was lazy and just wanted to make up a small data set for demonstration. In real life this could have thousands or millions of rows. But notice it also only has three columns. Yet you'll see that to get to the final analytic data set, we do multiple transformation steps. That's not unusual.
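By the way, one payoff of the number-first naming, combined with the rule that the data folder only holds copies, is that in principle you could wipe everything in the data folder and rebuild it just by running the numbered scripts in order. Here is a rough R sketch of what I mean, assuming the folder layout from my example and .R script files:

```r
# Rebuild the data folder by running the numbered transformation scripts in order.
# Sorting the file names works because every script starts with its step number.
scripts <- sort(list.files("analytic/code", pattern = "\\.R$", full.names = TRUE))
for (s in scripts) {
  message("Running ", basename(s))
  source(s)
}
```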
Let's look at the three fields we got from the data provider. See this variable called SSN? That stands for Social Security number, which is a private identifier used in the United States for each individual. You can easily tell these are fake data because my SSNs here only have six digits, and real SSNs are nine digits. But the reality was that when I worked at the Army, we would use private identifiers like SSN to link records. We didn't want to share data with SSNs in it, but we wanted people to be able to link records across data sets. I will show you how we solved that problem through our transformation steps.

Here is our next column, date of birth, abbreviated DOB. This is another very private piece of information that no one analyzing these data really needs to know. However, you can see a use for us knowing what age each individual is, and since we have a date of birth, we could calculate age at any known date. So we definitely could make an age variable to give out to analysts; that's not very private. We just have to choose a date from which to subtract the DOB to calculate age. But if you just think about having age as a variable, what's the first thing we always do with a variable like age? We put it into groups. Before I can get that far, though, I'm going to have to convert this field. Right now it is a character field with sort of a date format, but it is not an official date field. See this first record: the date of birth of that person is in the year 2001, followed by the month of February, followed by the day, 14. Hmm, I wonder whose birthday is on Valentine's Day in real life. In this fake data, it is military member 123456. In order to figure out this member's age and age group, we will have to do a lot of processing on this native DOB field, which is a date stored as a character string.

The last of these three fields we got from the data provider is named MOS. I use this particular field deliberately to illustrate an example I ran into at the Army. I have also run into this problem in other big data contexts, where a system has data over a long period of time. First, let me explain that these MOS codes refer to the occupations of service members in the Army. As you can see, the first three individuals have the MOS code of 11B. The way this works is that I know the 11 means they are part of infantry. But the B part, the letter after the code, that's where the debate starts. Let me give you more backstory. See this web page that says ASVAB? This is the test people take in the Army so the Army can decide what occupational class to train them in. The classes have what are called MOS codes; here is a succinct list. Remember how I just said that 11B meant 11, and that meant infantry? We can see here that there are a lot of these MOS classifications. Some are ones you'd think of with an army, like 13 for artillery and 25 for signals communication. But there are others you might not think of, like 27 for legal services, and one of my personal favorites, 68 for medical. These are really enduring classifications. If someone was in the US Army during World War II and they were a medic, they would have had an MOS classification of 68. Pretty crazy, huh?
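Before we dig into the letter part of the MOS code, here is a rough R sketch of the kind of DOB processing I just described: converting the character date into a real date and calculating age at a chosen reference date. The field names, file name, and reference date are made up for illustration, and the parsing function would have to match however your extract actually stores DOB:

```r
library(lubridate)  # for ymd() and date intervals

# Hypothetical incremental data set with a character DOB column like "2001-02-14"
month_a <- readRDS("analytic/data/month_a.rds")

# Convert the character DOB into a proper Date (assumes year-month-day ordering)
month_a$dob_date <- ymd(month_a$dob)

# Calculate age in whole years as of a chosen reference date (a business rule)
ref_date <- as.Date("2023-01-31")
month_a$age <- floor(interval(month_a$dob_date, ref_date) / years(1))
```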
Now let's get into what's wrong with the letter after the MOS code. Let's look at the subclassifications for the artillery class, which is 13. We see 13B is a cannon crew member. Maybe this has been around since World War II, because we definitely had cannons then. But then let's look at 13P: multiple launch rocket system automated tactical data system specialist. Okay, I assure you there were no multiple launch rocket system automated tactical data system specialists working during World War II. But here's the kicker: that does not necessarily mean there were no occurrences of the code 13P in the data from World War II service members. And why is that? Well, over the years they recycled the codes, and they did not keep track of which years they recycled which codes. So let's say that during World War II they had some old-fashioned job class based on old technology. As an example, with communication back then, they had telegraph operators in the Army that you would not have today. So for certain years, 13P probably meant something pretty old-fashioned, like something to do with cannons or other old technology, as we saw with 13B. But then when we started getting into multiple launch rocket systems and needed tactical data system specialists, we needed a code to assign to that new job class. So 13P could have meant something to do with cannons or other old technology during World War II, but then when we retired that technology and started using this new technology, we needed a code, so 13P gets reassigned. For that reason, I did not want to serve up the full MOS code to anyone, because the codes with the letters on them were so unstable. I just wanted to provide everyone with the MOS classification and a way to decode it, like that list I was showing you online.

All right, now that you were able to take a peek at my example extract with all the private data in it, let's turn to one of the minimally necessary curation files: the data dictionary. Okay, let's get our bearings in this data dictionary before I explain it to you. We have three tabs in this data dictionary. The first one I called main; this is the main tab to document fields. The other two tabs, age group and MOS group, are documented pick lists, which I'll show you shortly.

Now let's focus on what the columns are in the main part of the data dictionary. Let's start with this order column. The purpose of this column is to use it in sorting. I'll give you an example. See these two fields, SSN followed by DEID? Right now these are on rows five and six of the data dictionary. Imagine I wanted to add another ID to the data set, and I wanted the ID to fall in the data set between SSN and DEID. I'll show you how I would use the order column. Let's pretend this new ID is called military ID. I'll pretend to document it on the last row of the data dictionary. Now remember how I wanted it between rows five and six? Well, I can make the order value be 5.5. Then I can select the entire sheet and sort it by order, and the field will come into the right place. But I'm not going to do this just now.
I just wanted to show you this. Let's cancel out of this and erase that military ID field.

Next we have an original field name column. At the Army, we liked those native field names because they were short and easy to use in programming. But sometimes when you receive data, you do not like the original field names. If that is the case, you can easily mass rename all the native fields in a transformation step. What you would do is make an extra column in your data dictionary where you indicate the new field name you want to use for each variable. But we are not going to do that, so let's undo out of this. By the way, how do you feel about the native field names you get in data? Do you like to mass rename them? And if you do, what kind of naming conventions do you use? Go ahead and share your experience in the comments.

Next we have the description column. You know what that is; it describes the data element, like date of birth. But then we come to the source column. This column is really important. I use a code in this column. The word native means that the variable was given to me in the E, the extract from the data provider. Remember, we only have three variables in our source data set. Can you find them labeled in the source column? Good job. It's our old friends SSN, DOB, and MOS that we just talked about. What that means is that the other variables, labeled R, are ones that I literally manufactured in R. Look at all of them. I literally added nine variables based on the three I got. So you can see why you need to meticulously keep track of all this.

Let's move to the values column. I basically put two types of information in the values column, depending upon whether the variable is a code that can be decoded, like the MOS classification, or the variable is just a straight-up value, like date of birth. Actually, in our case the first nine rows are for variables that are just a straight-up value. The first variable, row ID: that column is just filled with integers. For the next variable, file name, you can see that I provided an example of a value, which is 2023 Jan, meaning the 2023 January monthly file. This might even be the actual file name of the E extract. In my SAS data warehousing book, I demonstrate how, if you rename your extract files strategically and use code arrays and macros, you can automate reading in multiple month and year files at a time, so long as you strategically name the files. In the automation, you can ask SAS to copy the file name into one of the fields, which is what we did here. The next field, file year, is the year of the file, which would be 2023, and the next variable, file month, includes integers for months, as you can see in the values column. Then on the next row we document SSN, and we show an example SSN, 123456, in the values column. So you can see how I just put an example of a value from the variable in the values column.

But now let's look at variable 10, which is age group. Notice how the term in the values column is AGEGRP. Basically, it's referring you to the age group pick list, which is documented on the second tab. After that we have MOS and an example value in the values column, which is 11B. But then for our last variable, MOSGRP, we have the same situation, where the values column refers us to a pick list documented on the third tab. You are probably curious about what is on those tabs; we'll look at those in a moment.
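First, a quick aside on that mass rename idea from a minute ago. If you did add a column of new field names to the data dictionary, the rename step might look roughly like this in R. The workbook file name and the data dictionary column names (original_field_name, new_field_name) are my own made-up examples, not the actual project files:

```r
library(readxl)
library(dplyr)

# Read the main tab of the data dictionary (hypothetical file and column names)
dd <- read_excel("analytic/doco/data_dictionary.xlsx", sheet = "main")

# Keep only rows where a new name was supplied, then build a named vector:
# names are the new field names, values are the original field names
dd_renames  <- filter(dd, !is.na(new_field_name))
rename_map  <- setNames(dd_renames$original_field_name, dd_renames$new_field_name)

# Rename every flagged native field in the extract in one step
extract <- readRDS("analytic/data/month_a.rds")
extract <- rename(extract, any_of(rename_map))
```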
Now let me point out this next column, which is called include in analytic. This is a really important column if your E, the extract, has a lot of private data in it. Notice how I put an x in the rows indicating which variables I want to serve up in the analytic file. Do you notice any patterns in how I flag these fields? First, I want you to notice that I left out variables because they are private, like SSN, or because they are messy and I don't want you to have the variable; that's the MOS code. We can't decode 11B or 13P accurately over time, so why should we let you have that code? That was my thinking. The other fields are there to basically add value to the data set. I'll explain more when I go over our actual ETL code later.

For now, let's move on to looking at our pick lists, starting with age group. Let's look at how this pick list is documented. We see four columns: AGEGRP, which holds the value of the age group variable; a pair of columns labeled low age and high age; and the description. If you wonder what the low and high columns are, they are limits that can be used in programming. If we turn this pick list into an actual data set that we read into R and use in transformation steps, we can use these variables to classify raw age values into the groups defined under description, as you can see on the right. Now, this is the Army, which is why we have just a few younger age groups. It's a business rule. Actually, let's take a quick peek at the other pick list, MOS group. We'll click on the third tab. Okay, you recognize this list; I just showed it to you on that web page.

All right, we've got our data dictionary. Let's go back to our slides and see if we are missing anything. Well, let's see. For our minimum necessary list, I just showed you a complete data dictionary with the pick lists documented, and I showed you the original crosswalk files in the crosswalks folder. You would not give out the SSN-to-DEID crosswalk, because the SSNs are private, but you could give out the other crosswalks and pick lists, so long as they don't have private information in them. So that's number one of the two minimum necessary files to document your ETL pipeline. What's the second one? It's your actual analytic data set. Let's look at ours. Remember our folder titled original and final data sets? That's where we will click to look for our analytic data set. See the data set labeled two, final analytic data set? We'll open this one. All right, here it is. Are you surprised by these variables? No, you should not be surprised at all. You should have expected all of these variables. Why? Because we just went over each one in the data dictionary. You know what DEID is: it's a de-identification variable, so I don't have to use SSN. And you know what MOS group is, and why MOS is not there.
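Since I mentioned that those low age and high age columns can be used in programming, here is a rough R sketch of how a pick list like that might be read in as a lookup table and used to classify raw ages into groups. The workbook file name and the column names (agegrp, low_age, high_age) are just my illustrative guesses at how the tab might be laid out:

```r
library(readxl)

# Read the age group pick list tab from the data dictionary (hypothetical layout)
age_grp <- read_excel("analytic/doco/data_dictionary.xlsx", sheet = "age group")

# Classify each raw age by the band it falls in. With contiguous bands, the low
# ends are enough to place each age; high_age documents the upper limits.
classify_age <- function(age, picklist) {
  # findInterval() needs low_age sorted ascending, and assumes no age falls
  # below the lowest band (a reasonable assumption for Army data)
  picklist$agegrp[findInterval(age, picklist$low_age)]
}

month_b <- readRDS("analytic/data/month_b.rds")   # hypothetical incremental data set
month_b$agegrp <- classify_age(month_b$age, age_grp)
```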
So imagine you were a person receiving this analytic data set so you could analyze it. If you were already familiar with SSN and how it is used to link files, you'd be looking for a linkage variable in your analytic data set. And if you are familiar with MOS, you'd be looking for MOS in your analytic data set. Can you see how simply giving someone analyzing the data a complete data dictionary along with the analytic data set can save a lot of communication trouble? If the data receiver said they needed a linkage variable, I'd point them to the DEID variable. Or if they asked why I only gave them the MOS grouping and not the code, I'd tell them why. So this is very helpful for communication. In fact, when I make HIPAA data use agreements, I usually include data dictionaries like this in our agreements as an attachment, just to make sure we know exactly what data fields we are talking about in our agreement.

Okay, now I'm going to get to the additional documentation you can do, which is on the right side of the slide. I put the documentation in order from most helpful to least helpful, but trust me, all of these have at least some amount of helpfulness. Let's start with the first item, well-organized ETL code. Let's take a look at our code folder. Okay, this looks like well-organized code, doesn't it? Let's open one of these. We'll just be lazy and open it in Notepad. Let's open code 105, which is labeled deidentify. Don't feel bad if you don't know R and can't read this code; I'll tell you what it says, and you can probably follow along. First, we import a data set in RDS format called month_a. Next, we import our SSN-to-DEID crosswalk. Then we merge the two data sets, joining on SSN. Finally, we drop the SSN. So basically we replace the SSN with the DEID, and then we export an RDS data file called month_c.

This is an example where the data receiver could receive this code file but not receive the actual SSN field or the SSN-to-DEID crosswalk. And this code file could convince a skeptical data receiver that they actually are getting a real linkage variable based on SSN. So this is why it can be helpful to provide your well-organized load code to the data receiver: so they totally get what you did to manufacture the data set.

Okay, that code was nice. But let's admit it: it was pretty hard to pick through that code and figure out what it did. And let's say this is a really big project; sometimes there are a zillion code files. So you might be thinking, do I really need to pick through all that code? Is all that code really helpful? Well, the answer to that is yes, the code is helpful. But you can make that code even more helpful if you include item two on the right side of the slide, which is an ETL pipeline diagram. Let me show you that. You might remember this PowerPoint file called pipeline diagram. This was the only other file in the doco folder, along with the data dictionary. Let's just stare at this diagram for a moment. Can you identify the transformation step
we just looked at in the code, where we patch on the SSN-to-DEID crosswalk to de-identify the data and remove the SSN? Well, there you go. See that? You zoomed right in on the transformation steps happening in code number 105. Remember, we just looked at code 105, where we imported the private SSN-to-DEID crosswalk, then merged it with the main data set and output it, keeping the new identifier and dropping the SSN. So you can see how, if you were a data receiver and you received the code and also this nifty diagram, you would basically have a map you could look at before you decided to actually open a code file and look at the code.

All right, so over on the right side of the slide, I showed you item one, well-organized ETL code, and item two, the ETL pipeline diagram. You might think, zoinks, that's already an awful lot of curation files. Do we really need any more? Well, once in a while I get really, really confused about data. I remember a few times in my life where I got data from a provider with identifiers and linkage variables that I really did not understand. In those cases, the data provider gave me the last item on the list, number three, which is an example of de-identified source data. Let's look at this.

Okay, here we are in our File Explorer. Let's click on the folder original and final data sets. See the file labeled three, de-identified sample of data? Let's open it and look at it. Remember much earlier in this video where I showed you the E, the extract, that only had three variables? Well, this is essentially the same thing, only it is deliberately faked. There are only five records; it's not unusual to receive just a few records. Let's look at what is in each field. You'll notice the SSN is obviously faked, but it is in the six-digit format. That communicates to me that the raw SSNs look like this: six digits. Now, if I'm the data receiver, I will definitely notice this when I get the data. But if I'm just negotiating a data use agreement and I don't have the data yet, this little fake extract can tell me a lot. Next we can look at DOB. These are obviously fake DOBs. Each person is born on January 1st, but in different years: 1999, 2000, 2001, and so on. If I get this fabricated data before I get the real data, like after our agreement is signed and before they have time to get me the extract, then I can use this fake DOB field to prepare how I will process the data. Remember, this is a date stored as a character, and there's going to be a lot of programming work done on that. So it is helpful to have fake data that is realistic, so I can begin to create my load code. And as you can see, we include some MOS codes to show an example of what is in that field.

Okay, here we are. Now you should completely understand everything on the slide. You should know how to prepare a complete data dictionary and give it to your data receiver along with the analytic data set, to help them understand how you transformed the native data into the analytic data set. You can do this efficiently without sharing private data. Hot tip: if you are doing a dissertation project, use the data dictionary to communicate with your committee and your advisors about how you transformed the analytic data set for your project. And if there is more confusion among your research group, or between you and a data provider or receiver, just make the other additional documentation.
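By the way, if you were curious what a little de-identification step like code 105 might look like written out, here is a rough R sketch. It follows the steps I described, import, merge on SSN, drop SSN, export, but the file and variable names are my own illustrative stand-ins, not the actual project code:

```r
library(dplyr)

# Import the incremental data set and the private SSN-to-DEID crosswalk
month_a   <- readRDS("analytic/data/month_a.rds")
crosswalk <- readRDS("analytic/data/ssn_deid_crosswalk.rds")   # columns: ssn, deid

# Merge on SSN to attach the de-identified ID, then drop SSN entirely
month_c <- month_a %>%
  inner_join(crosswalk, by = "ssn") %>%
  select(-ssn)

# Export the de-identified incremental data set
saveRDS(month_c, "analytic/data/month_c.rds")
```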
You may not be able to share data, but you can usually share your ETL code. You can always share an ETL pipeline diagram, so long as you don't put any private information on it. And you can always make a sample of fake data to communicate what the real native data look like.

Okay, SAS users, did you hang in with me until now? Wonderful. Here is your bonus, your special SAS surprise. SAS users, do you feel overwhelmed? The reason I ask is that I find it really challenging to program in SAS, so I get really exhausted. And if you are like me and are already exhausted from programming code in SAS, you are probably doubly exhausted by looking at all this curation documentation. But your reward, SAS user, for doing all this curation documentation is that you can actually implement it. If you carefully curate all these steps, you can automate your ETL code in SAS Data Integration Studio, known as DI Studio for those of us in the know. My point is that if you learn about SAS DI Studio, you'll realize that an ETL pipeline diagram, organized code, and most of the other curation files I mentioned in this video absolutely need to be in order before you even approach using DI Studio.

All right, here's the bonus part. Please follow my instructions and go to the link on the slide, which I also put in the description. It is a link to a SAS support page with some tutorials. Scroll down to the section that says videos and tutorials, and I want you to watch two videos. The first one is called Creating a New Job, and the second one is called Populating a Simple Job. They are really short, like less than five minutes each. Hey, if any of you out there watching this are working at a place that actually automates its jobs using SAS DI Studio, please make a comment and tell us what it's doing. I'm explaining using DI Studio for an ETL pipeline, but it can obviously be used for other types of pipelines as well.

Okay, I'm so excited about these two DI Studio videos, I can't leave it alone; I have to summarize what is in them. In the first video, the Creating a New Job video, they basically tell you that each time you build an ETL pipeline in DI Studio,
it's called a job. So basically you create a job and save it, and that is your ETL pipeline that you can keep working on until it's perfect. In this Creating a New Job video, the instructor shows a screen with a bunch of menus on the left, as I diagrammed here. Then the instructor drags a data set from one of the menus out to the main screen. This indicates the original data set at the start of the pipeline, basically the E, the extract. And then you save the job, meaning you save the pipeline so you can come back to DI Studio and work on it later.

Next we move to the Populating a Simple Job video, which picks up where we left off after the last video. In this video, the same instructor already has the saved job open. What he's doing in this transformation step is just sorting the data by some variable and then exporting it in the sorted format. So he basically has just one transformation in this step. The way he sets that up is he drags a sort function from the menus out to the main screen. If you actually watch the video, the instructor shows you how to make the connector arrow between the data set and the sort function, and he shows you how you can configure that arrow to do things a certain way, which is really cool. Then after that, the instructor goes back to the menus and drags out the final data set that will be exported; I called it B in the diagram. Again, he demonstrates connecting the sort function to the final data set, and how you can configure that connection.

So the take-home message is that whether you are a SAS user or not, ETL pipeline documentation is eminently useful. If you are a researcher, even a graduate student, it can help you communicate with your colleagues, supervisors, and professors. And if you are building a data warehouse, in the end, if you do a good job with your ETL documentation, you can even build a job in DI Studio. But the most important point I want you to learn from this video is that creating and sharing comprehensive data documentation is good governance. It's an example of excellent data stewardship, and it's the right way to communicate clearly about data without actually sharing the data.

Thanks for joining me, Monika Wahi, healthcare data scientist, for another adventure in data curation. I hope you had as much fun as I did. Bye-bye, see you soon.

Thank you for watching this video, which is part of the Public Health to Data Science Rebrand program. If you are interested in joining the program, please sign up for a 30-minute Zoom interview using the link in the description.