 Hello and welcome to this presentation, Cleaning and Organizing Data. This presentation is part of a larger project made possible by a grant from the U.S. Institute of Museum and Library Services. This presentation will cover some common strategies for how to turn your raw data that you may download from a survey tool or from a library system and turn it into useful data so you can analyze and visualize it. We'll start with a brief overview of data cleaning and organizing and why it's important. Then we'll talk through getting to know your data, followed by some demonstrations and guided practice in Excel to split and concatenate data, filter and remove duplicates, and create and use lookup tables. Everything that I'll demonstrate in Excel also matches how you'd go about this work if you're in Google Sheets with a few minor exceptions. In those cases, I'll also jump to Google Sheets and show you how things work in that tool as well. As we work through this presentation, I hope you won't find yourself overwhelmed with trying to remember formulas and details that you'll hear and see demonstrated today. Rather, my primary goals with this presentation are that you will develop better understanding about what goes into cleaning and organizing, understand how Excel or Google Sheets can make your life easier, and learn or at least become aware of some basic terminology and tools that will help you make your data ready for analysis. And remember, there are hundreds if not thousands of great tutorials online that can help you fill in the details once you're working on your own. So let's begin by talking just a bit about data cleaning and organizing. It's really a rare occasion when you get a raw data set and it's ready for analysis or visualization. Data cleaning often involves things like fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a data set. Data organizing can be a lot of different things. Some of the things I typically find myself doing as I organize data are connecting two or more data sets, creating new categories where you combine like items, changing text format into number format, modifying date and times, and splicing data from one column into two or more. Essentially data cleaning and organizing takes you from a raw data set, something like this, which is a survey file downloaded from a survey tool. I think this one was Qualtrics to a tidy, well organized data set that is standardized, files are properly formatted, corrupted, missing data is dealt with, duplicates are removed, meeting categories are created and aligned, etc. This then sets you up for visualization and analysis. So once you have a clean, well organized data set, you can easily create visualizations, things like charts, graphs, and interactive dashboards, such as this Tableau dashboard, with filters that allow you or your users to interact with the data in nuanced ways. And this also sets you up for much easier data analysis. To help bring the concepts to life, I'll be walking you through some data cleaning and organizing steps that are commonly needed or used through a guided practice scenario that you can follow along with on your own computer. The scenario we'll be working with is that your colleagues in the research and instruction department, I'm going to refer to them as RIS, I have four years worth of student engagement data that they like helping analyzing. Before beginning any analysis, I like to spend some time getting to know my data and asking some additional questions so I know what I'm working with in the end goals. Before digging in and getting started, I recommend talking further with your colleagues to understand what they really want to know and why they want to know it. How do they plan to use their data? Who are their audiences that they may share findings with? For example, they may want to know things like how many engagement sessions did they conduct in a year? What type of sessions? Who are they serving? Do they have a lot of repeat students? Are they increasing their reach over time? And while these findings are sure to be interesting to the RIS department, you'll also want to know how they plan to use that data. For example, will they use it to help make decisions about how they conduct their work? Are they planning to share this? Any findings with campus decision makers, such as the dean or provost, to advocate for more financial resources? Knowing this type of information will help to guide you in your next steps and determine if and what other types of data you may need to answer these questions. In the next series of slides, I'll be taking you through and demonstrating the guided practice scenario using two Excel data sets, the two data sets noted below, student engagement, Excel, XS, and student information. If you'd like to follow along on your own computers, I recommend pausing this video and downloading these two files to your computer. The download links should be available right under or next to this video. So let's say the RIS team gave you a student engagement spreadsheet that contains information like that, which is outlined on the left side of your screen about the types of engagements they had with students over a four year period of time. While this information is important, it is not enough to answer all of the questions they want to know about. To answer their questions, they may also need or want information about the student's college, their major, and their classification. And of course, you'll need a way to tie this information to each student who engaged with the library in some sort of interaction. If you're looking to tie engagement data with some sort of outcome measure like academic success, something like cumulative GPA, retention, or graduation data, you'll likely need to request this from your Office of Institutional Research. Before beginning any cleaning and organizing, I first like to take a quick tour of my data sets. So let's go ahead and do that. Let's begin with the student engagement file. So I find this helpful to see how the files are organized. I see I have an email address, a semester, year, and engagement type. It looks all very standard or very, very typical. I like to see how many rows we're dealing with. I see we have 732 records. 732 records. And let's also take a look at the student information file that this information is just something I downloaded from a patron database. And please know that the user names are really just made up user names. Nothing is actual user information. So in the student information data, I have a username, a student ID, college, major, and classification. As I'm scrolling through, so this information came from the patron database. So you would expect it to be pretty standardized. Everything already to be pretty clean, which it looks like it probably is. But it's still a good idea to look through things. So I'm seeing there are a few different versions of graduate. And sometimes the capitalization lower and uppercase are different. That tells me that if I was going to use this field, I would need to do a little bit of data cleanup. I'm going to jump back to the student engagement data. If you recall from an earlier slide, the RIS team had seven different types of engagements. But looking through, and I see those types listed here, but I see that there are a lot of inconsistencies with misspellings, abbreviations, etc. This tells me that we are likely to need to do some data cleanup for this particular field, since that's a primary field of interest for the RIS team. I'm going to navigate back to the presentation slides. So now that we have a sense of what we're working with, a first critical step is to figure out how to tie your two data sets together. For this, you'll need to have a unique identifier, something that goes to the individual student or user level, like their username, email address, or student ID number. Each file needs to have a field that is formatted exactly the same. Sometimes you need a little bit of work, need to do a little bit of work to make this happen, which is where splitting and or concatenating come in. As we look at the fields in the two data sets again, I see that the username over here in the right is actually the prefix of a student's email address. The easiest thing to do then to make these two fields compatible is to split this email address so that we're extracting the username from that email address. And the simplest way to do that is by a simple text to columns feature where Excel or Google Sheets parses the text in one cell or column into two or more columns using a delimiter. Delimiter, something like a comma, semi-colon, could be a space or a tab, or you could also identify other characters such as the at sign, which like what we have in our email addresses. The opposite of that is concatenating. That's where you're joining two or more text strings together into a single string. And concatenating also allows you to add could be spaces or text or commas or even a word or a dash if you want to. For example, in this example below, I have fall in column B, year in column C, and I concatenate bringing those two fields together and I add a space between them. So let's navigate back to the student engagement data set. And so it's in this field email that we're going to do a text to columns. So the first thing I like to do is I like to give myself a little bit of space to work in. So I'm just highlighting cells B, C and D with my left cursor, my left button on my mouse. Now I'm gonna right click and say insert. That just gives me a few blank rows. I'm actually going to duplicate the email address into column B. This is just a safeguard in case I make a mistake. Then I can go back and still have my original data intact. I am going to now highlight column B and then I'm gonna go to data up here at the top ribbon. And I am going to go about halfway across the screen in the ribbon to text to columns and I'm gonna click with my left key. I want it to break it apart using a delimiter. And then here are the most common delimiters. We're gonna click other and add that out sign. Down here you can see a data preview and this is what we're looking for. And then next and then finish. Then I get this pop up warning. There's already data here meaning in column B do you want to replace it? And we can go ahead and say, okay. So perfect, that's what we're looking to do. I'm gonna go ahead and delete these columns here. And I actually don't need the email address anymore either. But before I delete that, I'm going to actually at this point do a file save as. And I'm just gonna call this working. I should have done this a little earlier. I find it's always a good idea to keep your raw data or your original data intact in case you ever need to revert to it. Let's go ahead and I'll show you what concatenate is all about to. And again, it's a very simple process. I'm just gonna click on column E. Gonna right click and insert. Just giving myself a little room to work here. You could concatenate and put it into any cell. But we're gonna bring together semester and year and we're gonna call this new field term. So I'm simply gonna press equals that brings the formula functions to life in Excel. I'm gonna type the word concatenate and actually Excel gave me some helper text which I usually take advantage of. So I'm gonna double click. So it's added a parentheses and now it says text one. So I'm gonna click and sell C2. That's the first part of my concatenate operation. I'm gonna press comma, which just tells Excel that we're advancing to the next step. Quotation mark here, which means I'm gonna add my own information in here and I'm gonna add that space just hitting the space bar. Quotation mark means I'm done with my own text that I'm adding comma. And then the next text string, which is actually this year. So I'm clicking there and then I'm closing my parentheses. And then I hit enter. And you can see I have fall 2016 here, which is exactly what I was looking to do. Filling in the rest of the column here is really easy. If you just have a few hundred records, you could just drag this down like this. But the easiest thing to do, especially in cases where you may have thousands of records, is to just see this little crossbar at the lower right corner of the cell. I'm hovering and I have that and I'm just gonna double click. And that auto fills everything. And it is always a good idea just to do a quick scan of your data to make sure it did fill all the way down. Once in a while it doesn't and then you have to find where it broke and then proceed. So I feel good about that. But as I'm clicking in this first cell E2 and then looking up here, I can see that there's still a formula. That tells me that this information here is not really text yet, it's just a formula. And most typically you're going to wanna turn that into a value, something that's static and not an active formula. So to do that, I'm just gonna left click, column E. I'm gonna right click and say copy. And then I'm going to right click again and under paste, I'm gonna go to the second option, paste values. And then I'm just gonna hit escape to deactivate that cell was highlighted. And now you can see here, clicking in here and then looking up above, we're looking at text strings only, not formulas. So that's exactly what we're looking to do there. And then finally to facilitate the rest of our cleaning and organizing and bridging data, let's go ahead and bring in the student information into our working documents so that we're not working with two files, we're just working with our working document. So I'm gonna go back to Excel, finding the student information worksheet. And you can do two things. You can either click in the very upper left corner that's gonna highlight everything, including blank rows and columns. Or you can click in this upper left cell and do control A, which just highlights the rows and columns that are being actively used. So control A is control all or a select all. And I'm gonna do control C to copy. And then I'm just all tabbing back to my working spreadsheet working. I'm gonna click a plus sign down here at the bottom of the screen for a new sheet. I click in that upper left cell A1 and do control V to copy that data in. And then I'm just gonna do a little formatting cleanup so it's easier for you to see on your screen. I'm having a funny time. There we go. You may also find it helpful to freeze the top row. I'll show you what I mean here. Just gonna click in over here at the one to highlight the whole row. I'm going to go to view, freeze pains and freeze the top row. That way when you're scrolling, especially if you have like thousands and thousands of records, you know what you're looking at. It looks like I already had the participation worksheet top row frozen. I'm gonna just go ahead and close that student information file. We don't need that anymore. And then I'm going to rename the worksheet with the student information to information just to help us keep track. And then I'm gonna save that file. It's always helpful to save your files. The concatenate feature works the same in Google Sheets. And the text, the columns works the same in Google Sheets except you just have to click in a couple of separate locations, couple of different locations. So I'll show you how to do that. Text the columns in Google Sheets. Just one moment. Okay, so here we are in a mirror version of the files we've been working with except this is a Google Sheets version. So remember, we're looking to extract the username out of the email. I'm gonna go ahead and add a few blank columns just to be on the save side just like we did before. And then this time I'm gonna click activate column B just like we did before. I'm gonna select data. And then from the dropdown, you just select text to columns. And down here, it's a little bit hard to see it first. There is a little pop-up thing called separator and you can detect automatically or you can select custom. We're gonna do custom and enter in that at sign and press enter. And that is exactly what we were looking to happen. So now I'll go ahead and say username. I'll delete these columns, we don't need them anymore and delete this column. Okay. Let's see if we can find our way back to the slides. So sometimes we have to create categories to make our data meaningful. The categories will standardize information into tidy containers that will enable you to analyze and visualize your data much more easily. This often is facilitated by filtering and removing duplicates as important first steps. So let's go back to our working Excel document to the participation tab and I'll show you how to go about these tasks. So here we are in our participation tab. So let's take a look at column D engagement type. Remember the RIS team, our colleagues said there were seven different types of engagements. That means that we would only want seven options, not all of these variations of how things are spelled or listed. To see what we're working with, we can simply apply a filter. So I'm gonna highlight this column engagement type. I'm gonna go to data and select filter. So now the only thing that really looks different right now is we have this little dropdown carrot. Let's go ahead and click on that. And you can see here these are all the unique instances of each word. So it's removing the duplicates in our view. So I could, maybe I just wanna see how things look for the data literacy instruction. So I can deselect everything and then only look at data literacy things. So now all of the other items are just hiding in the background. I can maybe I wanted to see what's up with the blanks. Let's go ahead and see there. The blanks are just empty records at the end. Sometimes you need to get rid of blanks. Sometimes you need to add more information if it's available, such like that. I am going to deselect filter here. And you can filter everything on all the columns in the worksheet by just clicking the upper left corner and then selecting filter. And so now you can see, look at for semester, you can see that we have two semesters for year. We have four years and for term, that's that field we just concatenated. We have looks like about eight different things. I'm going to go ahead and deselect everything. And then let's talk briefly about removing duplicates. In this case, we're going to look at engagement type. We're going to click on that and we're going to click control C to copy it. And we are going to click on a new worksheet, paste it, let's call this engagement. We're going to use this later to turn this into a lookup table, but for now we're just going to practice removing duplicates. I often find it's easier on my eyes and brain if I sort it. So I've just highlighted everything here. I'm going to go under data to sort, make sure my data has headers is clicked and I'm just going to sort A to Z. So remember, we want to remove duplicates because we have lots and lots of repeats. That's really easy in Excel and Google Sheets alike. I'm going to go ahead and click to highlight. Still in the data part of the ribbon. I'm going to go all the way to the middle to select remove duplicates. It's just double checking what we want to remove duplicates from. And then I get this pop up, 953 duplicate values were found and removed, 24 values remain. So that's great. So now we have this tidy list. We're still going to need to standardize it, but this is going to make our life easier. I'll show you real quick about removing duplicates in Google Sheets. Concepts, oh, and the filters too. Concepts are exactly the same. Just you got to click in a couple of different places. To apply filters, you can again, you can just filter one column or all the columns, we'll just do one. I'm going to click data and then simple as that, creative filter. Up here in your menu, you can see it's here or you can get it back to it using data. So just like before, their filters have been activated. Instead of check boxes, we just have check marks. So I could deactivate and only select maybe data literacy just like we did the last time and turn off the filters for now. And then we could remove duplicates, highlighting that column. I'm copying it, giving my worksheet a name and I'm going to paste. Sorting is exactly the same. You just go to data, sort range. I usually select advanced range sorting options. Just got some safety checks in there that are sometimes helpful. Now everything's nicely sorted. You don't have to do that part. I just find this a little bit helpful. And then data, now we're going to remove duplicates. So you go to data, this time data cleanup and remove duplicates. And now we just have these duplicates left or these particular engagement types left. So doing all of this work then allows you to do what I think is the fun stuff creating and using lookup tables. Lookup tables allows you to tie data from different worksheets together into a single location which then makes your ability to analyze and visualize your data a thousand times better. So this work usually follows filtering and removing duplicates. Most typically one worksheet will be your main or destination worksheet where you want everything to go. And then the other worksheets with the supplemental information will service the lookup files. If you're creating a new categorical field you will need to create a lookup table with the definitions or your categories. And then in the main worksheet you'll use a V lookup formula. V lookup means vertical lookup to tie the new categories together to your existing data. For example, if you had age and then you wanted to put all of the ages into each ranges, you could create a lookup table for age and age range and then use the formula to bring the ranges into your destination or main working file. So I'll show you how to do that back in our working Excel file. Not that one, one second. Oh, okay. So here we are back in our working Excel file. We're using participation as our main worksheet or our destination. And we wanna bring in that student information things like their college and their major or their classification. Oh, let's go over to information. For the purposes of this example let's just work on college and major. We would do classification but it's still messy and we would need to want to standardize that first. So I'm just gonna highlight college and major and copy those into our participation worksheet just putting in some placeholders to help me stay organized. We don't need this email address anymore. I'm gonna delete that. This is actually our username now. So to bring in the information about college and major I'm gonna first start off with the equals telling Excel we're gonna do a formula and I'm gonna just type in vlookup. We've got this helper text. I'm gonna go ahead and double click that to take advantage of it. That puts in a parentheses that says lookup value. What we want the computer to do is lookup username, comma. Now it says table array. So we're telling the computer, okay computer go look in a particular array of data. In this case our array are all columns and rows in the student information worksheet. So I tapped on the information tab. I'm highlighting all of the columns. So I was telling the computer look anywhere in this worksheet that you find a match do something. So we're gonna do a comma and then we're gonna say if the computer finds a match with username I want the computer to deliver the information in the college row. So that's column C. Though the formula requires us to use a number for the column. So column C is the third column, comma. And then we're gonna use false. So false you see here is looking for exact matches only. True would be approximate matches which might confuse things. And then I'm doing a closed parenthesis to end the equation that pops us back over to our student participation file and it filled in exactly what we were looking for. I could scroll, I could get the double arrow here or what do we call that the cross and just scroll down or I could just double click to auto fill. Just making sure everything filled in like it should looks like it did. We can repeat that process for major equals B lookup looking cell column A starting with cell A2, comma. Table array, we're gonna go back to the student information and this time we want it to deliver major which is the fourth column, comma and then false. Perfect. And just take note here that even though it says liberal arts and sciences is actually a formula up here. Whoops. And then same for major. It's actually a formula. All of these information really are relying on formulas and to facilitate our work later we're gonna wanna turn these into static text fields not formulas. So I'm just gonna highlight F and G right click and say copy, right click again and under paste, I'm gonna go to the second option paste values. And then I'm gonna just hit escape to deactivate the highlighting. And now you can see that we've got text up here that matches the text down here. Finally, let's create an engagement lookup table so that we can turn this messy engagement type column into something that's standardized. And we already got started with our engagement work when we reduced that list of engagement types and got rid of all of the duplicates. So I'm gonna start a new column here, column B and I'm gonna create an engagement type that is standardized. You can really call this anything you want. I'm just cleaning up my formatting. So I'm going back over to the information the RIS team gave me. So they said they had seven different ways students could engage with them. So our job here is to look at the engagement type in the left column, the A column and find the relevant type of engagement, the standardized engagement. In this case, it's data literacy instruction. And we're gonna put that any place in here where we think that is the better choice. We've got end note workshops. Seems like I saw end note. I saw that down here also. We've got info, lit, information, literacy instruction. Just putting that here in all of the relevant locations through this library orientation. I assume that's information, literacy instruction. Now we're on tours. Let's see, down at the bottom, there was a tour. We've got patent workshops. Seems like I saw a couple of different instances of those. What's next? Maker orientation, training, maker space, maker space. So those are all maker space things. Then finally, we have reference consultations. Great. I'm just cleaning up the formatting, though it doesn't really matter. So now this is called our lookup table. So we're gonna do one last concatenate function. Let's go ahead and add a new column right next to the messy engagement type. So I'm gonna click on F, right click, insert. Let's call it engagement type standardized. We're gonna do very similar VLOOKUP like we did before. We're gonna go equals VLOOKUP. This time we want it to look in the engagement type column for that messy item, comma, table array. We're gonna go to our engagement worksheet, highlighting everything here. And this time we're telling the computer if there's a match, deliver the standardized information in column two. And then false and then close parentheses. Then I'm just gonna double click. This all looks pretty nice. So I'm going to do a quick scan. Actually, I see we've got some weird random spaces in front of patent workshops. There was another in library tours. So let's go back to our lookup table. Oh, sure enough, there's library tours. Just gonna delete that space. And then the other one was related to patent workshops. Everything would still work, but the formatting you may find irritating. So now I'm back in our participation worksheet. I can see those errors were corrected because everything is actually just a formula. Let's go ahead and copy and then paste special, paste as values only. That way we're working with static data when it comes time for analysis and visualization. So let's recap in this presentation we briefly touched on a variety of topics, including the necessity of data cleanup, getting to know your data, splitting and concatenating data, filtering and removing duplicates and creating and using lookup tables. While I know this is a lot to digest in one sitting, the main thing I hope you'll be able to take away is having some familiarity with basic concepts and ideas about what goes into data organizing and cleaning, how to go about this, or perhaps more importantly, knowing that Excel or Google Sheets formulas and tools can make your life a million times easier. And hopefully you're starting to build an awareness of the sort of things to search for on Google or elsewhere if you can't remember the exact details or exact steps you need to follow to accomplish these tasks. Thank you very much. And please, I hope you'll take a few minutes to respond to the follow-up survey. Thanks.