So this talk is the top 15 Python tips for data cleaning and understanding. So good luck, you're ready to go.

Thank you. Hi everyone, thanks for tuning in. For today's talk, I'll be talking about what I spend 70 to 80% of my time at work doing, because as a data scientist, as someone who works on data science projects, 70 to 80% of our time goes into data cleaning and data understanding. In a way, I feel this topic is underrated; a lot of people don't talk about it, and I guess that's because it's boring, you don't see any fantastic product come out of the process itself. But whenever we're learning how to work with data, how to understand and analyze data, these are the few commands that have been really useful for me, and I hope that by the end of this talk you'll find these commands, and the tasks I'm going through, useful for you too. You can think of this as a consolidation of my experience, and of course I'm happy to hear from you: if you have experience working on different projects and there's anything you feel I've missed out, I'm happy to add it in as well.

A little bit about me: I'm currently working at a global media agency called Essence, where I'm helping to build the data architecture. We're building a data warehouse and laying the foundation for more analytics capabilities, so this is where we do a lot of ETL (extract, transform and load), and there's a lot of data cleaning and data understanding involved as well: trying to make sense of which data are valid, which data we need to clean, and how to clean them. Outside of work, I run a data science blog called Data Double Confirm; if there's a little time towards the end, I'll share a bit about it.

I've actually been working with data since my undergrad days. Back then in school, the datasets we worked with were clean, partly because the focus of the curriculum was on statistical analysis and on modeling, so the focus on data cleaning and data understanding was slightly less. Most of my experience of how to clean data relates back to my work, right on the job: I learn about the data itself, how it's collected, and the rules governing it, for example why a dataset is collected at a certain frequency, or why it's coded in a certain way. A lot of this really depends on the business context, and I guess that's something we might not get to work with in school, because academic datasets aren't really business-centric; that's why for academic projects the data can be much cleaner than the real-world data we eventually work with. Over seven years of working with real-world data and also with open data, I kept going back to the same Stack Overflow answers to find out how to do certain tasks, certain operations, which is why I eventually came up with this framework, this process, this guide of 15 tips that I'm going to talk through later on.
So without further ado, I'll give a high-level overview of these 15 tasks before going down into the code.

Whenever we get a dataset, we're interested in knowing what data variables there are in it, so usually we query for the column names. Subsequently, we also want to know whether Python loaded the data correctly. This can be in various UIs, say Spyder, or in my case I like to use Jupyter Notebook: after I read the data in, I want to check that all the records were read in.

Next, I want to understand the data types of my variables. When Python reads the data in, or to be more specific, when we read it in using the pandas library, the library automatically tries to detect the data type of each variable. It's smart in a way: if a particular variable holds strings, it detects it as an object, and if you have integers or floats, it tries to detect them accordingly. But in certain cases it can go a bit wrong, which is why we also want to check whether the data types are correct.

Next, sometimes we want to understand the unique values of each variable we have, and this is really one way to identify and clean data. For example, for a particular categorical variable like gender, you might only expect female, male and undisclosed. But somehow, during the data collection process, say it's not collected through a system and allows manual entry, you get people keying in "M" or "F". "Male" and "M" refer to the same thing, but they're currently treated as two separate categories in terms of unique values. This is where we eventually want to do some correction, and first we need to know what unique values there are.

Similarly, for continuous variables, we want to understand the range of values we have for them. In certain cases a variable can only take positive values, but somehow its distribution shows a negative value. That's probably an incorrect input, or something happened along the way during data collection, and it's something we want to investigate further, which is why we do this task.

Subsequently, we also get counts of values by different groups, by different levels: instead of the univariate view, just getting the unique values at one level, you overlay another level and get a count on it. This is somewhat related to data understanding. And in certain cases you want to rename columns before you do any merging or appending, which is why I included this as a tip too.
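To make these first few tasks concrete, here's a minimal sketch in pandas. The frame and its column names ("campaign name", "spend", "market", "platform") are hypothetical stand-ins I've made up for illustration; they aren't the actual mock data from the talk.

```python
import pandas as pd

# Toy stand-in for a freshly loaded dataset (all names are made up)
campaigns = pd.DataFrame({
    "campaign name": ["launch_a", "launch_b", "launch_c"],
    "spend": ["$1,200", "$950", "$2,400"],
    "market": ["SG", "SG", "MY"],
    "platform": ["Facebook", "FB", "YT"],
})

print(campaigns.columns)   # what variables are in the dataset?
print(campaigns.shape)     # (rows, columns) -- did all records load?
print(campaigns.dtypes)    # types pandas inferred for each column

print(campaigns["platform"].unique())  # unique values of a categorical variable
print(campaigns.describe())            # summary stats (object columns here;
                                       # numeric stats once types are fixed)

# counts overlaid by more than one level
print(campaigns.groupby(["market", "platform"]).size())

# rename a column before any merging or appending
campaigns = campaigns.rename(columns={"campaign name": "campaign"})
```

Note how "spend" comes in as object because of the currency symbols, which is exactly the kind of thing checking dtypes surfaces.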
Then there are cases where people have input values in, say, an Excel spreadsheet and coded them as currency, so you have a dollar sign as well as commas. When you eventually read the dataset in using pandas, this variable is identified as an object; it's not identified as integer or float as you would like it to be. In these cases, we want to remove the dollar sign and the commas from our variable.

And relating back to data types: whenever needed, we want to convert the data types to the correct format. There can be cases where we want to convert string to numeric, or string to date, so this is one of the tips I'm going through as well. In other cases, you want to replace values with other values; back to the male and female example, you want to replace "M" with "male" and "F" with "female".

I'm also including a slightly more complex transformation here, where we identify which data variables are similar or different across datasets. This is also a way to understand which keys we can merge on, if we want to, and, if you do a left or right join, which records are going to drop because they don't appear in both datasets, only in one.

Similarly, if you have a lot of data collected across time, you'll want to append datasets to make your analysis more robust and more meaningful: you have quarter one, quarter two, quarter three, and each quarter the same kind of data is collected, so every time you have a new dataset, you want to add them together. This is very common when working on a data science project where you have a lot of data to work with. And if you're appending a lot of data, there can be cases of duplication, and these are the times when you want to deduplicate your data. There are different approaches, where we can choose to drop the earlier record or the later record, and pandas can handle this very easily.

Next, similar to appending: say you have data from different sources and you want to combine them. For example, you have demographic data from one database and transaction data from another database, and you want to merge them together based on a certain key, like a customer key. This is something very useful as well, which I'm also going to touch on. And the last tip before the two bonus tips is about recoding: we want to include a new variable based on existing variables, so we do some form of recoding, and pandas allows us to do this very easily too. Finally, there are two other tasks that I do sometimes, not so often, but I think they're very interesting and very useful, so I've included them in as well.
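Here's a quick sketch of the replacement and type-conversion tips just described; the frame and its columns ("gender", "amount", "joined") are again hypothetical examples, not the talk's data.

```python
import pandas as pd

# Hypothetical survey-style frame with the messy labels described above
df = pd.DataFrame({
    "gender": ["male", "M", "F", "female", "undisclosed"],
    "amount": ["1,000", "250", "80", "N.A.", "470"],
    "joined": ["2019-01-05", "2019-02-10", "bad date", "2019-03-01", "2019-04-22"],
})

# replace values with other values: collapse M/F into the full labels
df["gender"] = df["gender"].replace({"M": "male", "F": "female"})

# convert string to numeric and string to date; errors="coerce" turns
# anything unparseable ("N.A.", "bad date") into NaN / NaT
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False),
                             errors="coerce")
df["joined"] = pd.to_datetime(df["joined"], errors="coerce")
```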
So just to give a bit of context for what we're going to go through later on: I've created a scenario that's relevant to my current work at Essence, where we want to investigate, say, certain practices behind campaign success, digital advertising campaign success, or do some post-campaign analysis. We have data from different sources. We have data related to the campaign set-up, in terms of what market it's run in, how many days it runs, that kind of thing. Then we have viewability metrics, for example related to impressions and clicks. And we also have results from the different platforms that help us run brand lift studies. With data from so many different sources, there are going to be common issues, like inconsistent naming of variables or fields across datasets (fields can refer to the same thing but just be termed differently in different datasets), inappropriate data formats, and invalid, duplicate or missing values. All of these relate to how clean the data is.

The materials for today's talk can be found here; I'll be sharing this link, I mean, I'll be opening up this Google Drive later on, so you can check it out afterwards. The two libraries I'm going to go through today are mainly pandas and NumPy, and as you can see, if you search the web for how to learn Python for data science, these are, I think, the two most essential libraries you need. Before you go into fancy modeling or algorithms, you essentially need to be able to first process and clean the data, so these are the two libraries that lay the groundwork. pandas is built on NumPy, and NumPy is basically a library that handles different kinds of array-based data structures.

I've created three datasets, all mock data, and this is a preview of them. Mainly, the first dataset, campaigns, has information relating to the campaigns, the different advertising campaigns: you have spend, and information relating to which market and platform they're run on. Then there are two other datasets relating more to viewability metrics, so again impressions and different kinds of impressions, like measurable impressions, and they're collected across different periods: H1 you can think of as the first half of the year, and H2 is the second half of the year, but there's more data being collected in the second half. The thing is, there can be different people, different teams collecting the data differently, so there can be some inconsistencies here and there when it comes to naming. If you look at it, over here we have "campaign name", but in the metrics datasets we have "campaign", for example; they're similar, referring to the same thing. All of this is, I would say, a reflection of the data challenges I face at work. And in this case, due to confidentiality, I'm just creating mock data, fake data, because we can't share the data that we have.
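Since the real files sit in the speaker's Google Drive, here's a rough, hypothetical reconstruction of the three mock datasets as described; the exact columns and values are my guesses, built only so the later snippets have something runnable to work on.

```python
import pandas as pd

# campaigns: spend plus which market/platform each campaign ran on, with the
# deliberate messiness described above ($ and commas in spend, Facebook/FB
# and YouTube/YT recorded inconsistently)
campaigns = pd.DataFrame({
    "campaign name": ["brand push", "promo blast", "year end"],
    "spend": ["$10,000", "$5,500", "$8,250"],
    "market": ["SG", "MY", "SG"],
    "platform": ["Facebook", "FB", "YT"],
    "start date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "end date": ["2019-01-31", "2019-02-15", "2019-02-20"],  # one bad range
})

# viewability metrics for the first half of the year
metrics_h1 = pd.DataFrame({
    "campaign": ["brand_push", "promo_blast"],
    "impressions": [120000, 80000],
    "measurable impressions": [None, 75000],  # a missing value to impute later
})

# metrics for the second half: more fields were collected in H2
metrics_h2 = pd.DataFrame({
    "campaign": ["promo_blast", "year_end"],
    "impressions": [85000, 60000],
    "measurable impressions": [80000, 52000],
    "clicks": [900, 610],  # present in H2 only
})
```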
So for the rest of my talk today, I'll be using Jupyter to go through the code. Jupyter Notebook is a very friendly UI, and if you go to the Drive later on, there's an HTML version as well. Mainly, if you run through it, you can see all the different commands, structured and tied to the task numbers I went through just now in my presentation.

So very easily, after you run the Python code, the output shows you exactly which column names you have in each dataset. Subsequently, you can also run .shape to see the dimensions. All of these are commands tied to either NumPy or pandas, and once you import them at the start, you can run them. By default, if you install Anaconda, the distribution comes with Jupyter Notebook and Python directly, and NumPy and pandas are already installed, so you don't need to run any further pip install.

One thing I want to highlight: all these commands are repeated for all three datasets, so you'll keep seeing the same thing, but really it's a process of understanding our data. We want to be very clear that whenever we receive a dataset, we go through the same process, the same treatment, for every dataset.

One thing to note over here: I've created a scenario in my dataset where people have recorded the platforms in different ways, where "Facebook" and "FB" are actually the same thing, and for YouTube you can have people not capitalizing the T, or just abbreviating it as "YT". They all refer to the same thing, so this is something we want to clean as well.

Over here, we can make use of the describe function to look at variables that are numeric in nature, and for variables that are categorical in nature, we can make use of value_counts. Similarly, if you want to look at a more detailed breakdown by different levels, here by market and by vertical, you can make use of the count function after grouping. And instead of outputting all the columns (they're all the same, since there are three records for each of these columns), we can just choose a particular column to output; the numbers over here are the same as what you saw at the top.

In cases where we want to rename our column names, this is how we can do the renaming, directly replacing them in the dataset. And again, to remove symbols in values, we can make use of the replace function; now the spend data is clean, and we've removed those unnecessary symbols.
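Continuing on the reconstructed frames from the earlier sketch, this is roughly what that stretch of the notebook does; it's my approximation under those assumptions, not the speaker's exact code.

```python
# summary statistics (object columns for now; numeric stats once converted)
print(campaigns.describe())

# categorical variables: frequency of each level
print(campaigns["platform"].value_counts())

# breakdown by more than one level (the talk uses market and vertical;
# my toy frame only has market and platform)
print(campaigns.groupby(["market", "platform"]).count())

# rename so the key matches the metrics datasets
campaigns = campaigns.rename(columns={"campaign name": "campaign"})

# strip the currency symbols out of spend
campaigns["spend"] = (campaigns["spend"]
                      .str.replace("$", "", regex=False)
                      .str.replace(",", "", regex=False))

# harmonize the platform labels: FB -> Facebook, YT/Youtube -> YouTube
campaigns["platform"] = campaigns["platform"].replace(
    {"FB": "Facebook", "YT": "YouTube", "Youtube": "YouTube"})
```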
And similarly, for converting string to numeric and string to date, there are the to_numeric and to_datetime functions in pandas. In this part of the code, errors="coerce" means that whenever there's a missing value, or a format that cannot be changed to numeric, it will be output as missing.

Over here, we want to clean our values for Facebook and YouTube, and again we make use of the replace function. And in this particular line of code, what I'm trying to do is add underscores between the words in the values of the campaign column. I'm overwriting campaign over here because the keys I want to match on have to be corrected first, before I can match with the other datasets.

Next, I'm looking at which variables are similar or different across the datasets. The column name that's similar across all three datasets is campaign. And sorry, just to go back a little: clicks and measurable are present in metrics H2 but not in metrics H1, which you saw above just now.

Concatenating, or appending, is very straightforward: you just make use of the concat function, and now you can see there's a total of 36 rows. In fact, when I created these datasets, I made it a case where we have some form of duplicates, so I'm keeping the last, the latest, record; I don't want the earlier records, which are unclean to me. Next, we do some form of merging across the three datasets.

Then I want to add a new column variable called days, which is the number of days between the start and end dates. I want to do some form of checking, identifying different priority groups for example, so I do some form of recoding based on the number of days; this is where I cover recoding.

And this particular one is a new package which you have to install separately, and it's very powerful, I feel. It gives us an overview of all the variables and their distributions, and the unique counts, and from here we can get a sense of the data. For example, you can see that days has a negative value, which means something is wrong. From here you can identify invalid data, and it actually gives you a correlation matrix too, so I think this is really useful if you want some firsthand, preliminary understanding of the relationships between variables. Of course, some of the correlations here might not make sense, because it's mock data.

In terms of imputing missing values: if I'm interested to know which rows contain missing values, I can filter them using this particular piece of code. Then I impute the missing values based on the mean proportion of measurable impressions to impressions. And after I do the imputation, all these values are filled in; they were originally missing, as you can see from the top. A consolidated sketch of these remaining steps follows after the wrap-up below.

So let me jump back to my slides. Just to quickly wrap up: there are actually more exercises relating to Python and data that you can find on my blog if you're interested, so do check it out. Thank you.
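As referenced above, here's a consolidated sketch of those remaining notebook steps, still on the reconstructed toy frames. The profiling lines assume the pandas-profiling package, which matches the tool described (the talk doesn't name it), and all column names remain my guesses.

```python
import numpy as np
import pandas as pd

# fix the key and the types before matching across datasets
campaigns["campaign"] = campaigns["campaign"].str.replace(" ", "_", regex=False)
campaigns["spend"] = pd.to_numeric(campaigns["spend"], errors="coerce")
campaigns["start date"] = pd.to_datetime(campaigns["start date"])
campaigns["end date"] = pd.to_datetime(campaigns["end date"])

# which columns do the datasets share, and which are H2-only?
print(set(metrics_h1.columns) & set(metrics_h2.columns))
print(set(metrics_h2.columns) - set(metrics_h1.columns))  # clicks, H2 only

# append the half-yearly extracts, then keep only the latest record per key
metrics = pd.concat([metrics_h1, metrics_h2], ignore_index=True, sort=False)
metrics = metrics.drop_duplicates(subset="campaign", keep="last")

# merge campaign details with the metrics on the shared key
merged = campaigns.merge(metrics, on="campaign", how="left")

# new variable: campaign length in days, then a recode based on it
merged["days"] = (merged["end date"] - merged["start date"]).dt.days
merged["priority"] = np.where(merged["days"] >= 30, "long", "short")

# rows that contain any missing value
print(merged[merged.isnull().any(axis=1)])

# impute measurable impressions from the mean measurable-to-impressions ratio
ratio = (merged["measurable impressions"] / merged["impressions"]).mean()
mask = merged["measurable impressions"].isnull()
merged.loc[mask, "measurable impressions"] = merged.loc[mask, "impressions"] * ratio

# one-command overview of every variable, its distribution and correlations;
# assumes a separate `pip install pandas-profiling` first
# from pandas_profiling import ProfileReport
# ProfileReport(merged)
```

On the toy data, days comes out negative for the campaign whose end date precedes its start date, which is exactly the kind of invalid value the profiling overview is described as flagging.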