Hi everyone, my name is Olga. I'm a research associate at INSEAD Business School here in Singapore, and today I will be talking about data cleaning.

About one year ago, when moving to Singapore was just a dream for me, my husband told me an interesting story about the Singaporean MRT. Singaporean trains were stopping in tunnels seemingly for no reason, and the transport company couldn't figure out why it was happening until they started working with a group of data scientists. This group, using only a dataset provided by SMRT and data analysis, was able to solve the puzzle.

So if you want to do data analysis, you will probably go through these steps. First you collect data, which was just demonstrated, and you get raw data. After that comes the data cleaning phase, which is what I'm going to talk about. Only after that do you start data analysis.

You would probably want your data to look like this. This is an Excel spreadsheet where every single row is exactly one observation and every single column is exactly one variable. It is organized as a matrix, making it easy to import into R, Python, or another programming language for further analysis.

But usually the data you get is not so nice. Raw data might look like this. Raw data comes in many different forms: a complicated JSON from an API, an Excel spreadsheet you get from a company you're working with, a database dump, or even manually entered data. With raw data you need to figure out its structure and extract the relevant parts.

Today for data cleaning we will be using a Python library called Pandas. Here you can see Wes McKinney, the developer of Pandas. He sees Pandas as a tool more powerful than Excel but still accessible to people without a hardcore computer science background. Pandas is super popular: this diagram shows that 1% of all Stack Overflow questions are about Pandas. And one more thing before we start coding.
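The tidy layout described above can be sketched as a small DataFrame. All of the values below are invented purely to illustrate the shape:

```python
import pandas as pd

# Tidy layout: every row is one observation, every column one variable.
# The numbers here are made up, just to show the structure.
df = pd.DataFrame({
    "year":    [2012, 2012, 2012],
    "week":    [1, 2, 3],
    "disease": ["dengue", "dengue", "dengue"],
    "cases":   [75, 68, 104],
})
```

Each column holds a single variable with a single type, which is what makes the matrix-like import into analysis tools painless.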
The main data structure we are going to use today is the Pandas DataFrame. A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Remember the tidy data slide, with columns as variables and rows as observations? That's exactly what a Pandas DataFrame is.

Let's get down to coding. Please open your notebooks; I mean Jupyter notebooks. The first notebook we are going to use is DataCleaningSg. Is it too small, or should I make the screen bigger? The first thing I want to ask you to do is Restart & Clear Output, but only if you downloaded the files with pre-generated outputs; otherwise you won't have any outputs to clear. Now we have only code without any outputs.

The first dataset we are going to clean today is the Singapore disease dataset. Our goal is a table where for each year, week, and disease we have the total number of cases in Singapore. Any questions? Good. Let's run. We import Pandas first and then read the dataset. Here it is. It's almost what we need, but we don't have a country column, and the year and week are combined into a single year-week column.

The first step is very simple: we just create a new column, Singapore, and read the first five rows of the data frame. Here it is; you see the new column. The next step is a bit more involved: we need to split the year-week column into two columns. First let's practice with the split function. What does it do? It takes a string and splits it into two parts: everything to the left of the passed argument and everything to the right. When we run this cell, you see that variable a has the first part and variable b has the second.

Now let's do the same for our data frame. See, we have two new columns: the year column has everything to the left of "-W" and the week column has everything to the right. So now we know how to create new columns; let's see how to remove the unneeded ones.
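The steps so far can be sketched on a toy frame standing in for the real CSV; the column names here are assumptions, and the notebook's actual labels may differ:

```python
import pandas as pd

# Toy stand-in for the Singapore disease file; real column names may differ.
df = pd.DataFrame({
    "year_week": ["2012-W01", "2012-W02"],
    "disease":   ["dengue", "dengue"],
    "cases":     [75, 68],
})

# Step 1: add the constant country column.
df["country"] = "Singapore"

# Step 2: split the combined label on "-W" into separate year and week columns.
parts = df["year_week"].str.split("-W", expand=True)
df["year"] = parts[0]   # everything to the left of "-W"
df["week"] = parts[1]   # everything to the right of "-W"
```

`str.split(..., expand=True)` returns a frame with one column per split part, which is what lets us assign both new columns in one pass.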
We just call the drop function, and we do it in place because we don't want to create a new data frame; we change the existing one. One last thing: here you see I was printing not the first five rows but the rows starting from 110. I did this to show that there are some week values with leading zeros. We don't need these zeros, and we will remove them using the lstrip function.

Let's practice with it. I have created a test string with many, many zeros and then 65, and when I call lstrip and pass "0" as the argument, it removes every zero from the left. We do the same for our data frame. Here you can see the week column doesn't have leading zeros anymore, and we save the result into a CSV file. That's all; we have just cleaned the Singapore data frame (dataset, sorry).

Now let's open a slightly more complicated one. This is the dengue dataset for Malaysia, and our goal is a bit different: here we want a breakdown by region, and we have only one disease, dengue fever. We import Pandas again and take a look at this dataset. Pandas can also read Excel files. Here it is. This data frame is very long, so let me explain how the data is organized. For each year from 2010 to 2015 we have a rectangular block. Every row of a block is a region, every column is a week, and every cell is the number of cases in that region during that week. NA and NaN mean that the data is not available.

Today I'm going to show how to clean one block, for year 2010; the remaining years can be done similarly. First I defined the slice parameters. The parameter skiprows says how many rows to skip from the beginning of the file, skipfooter how many lines to skip from the bottom, and parse_cols (usecols in newer Pandas versions) how many columns to read. Here it is: now we have only the block for year 2010. We are going to reshape this data frame, and for reshaping we will use a function called melt.
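The drop and lstrip steps can be sketched like this, again on a toy frame with assumed column and file names rather than the notebook's actual ones:

```python
import pandas as pd

# Toy frame picking up where the split left off; names are assumptions.
df = pd.DataFrame({
    "year_week": ["2012-W01", "2012-W10"],
    "year": ["2012", "2012"],
    "week": ["01", "10"],
})

# Remove the combined column in place instead of building a new frame.
df.drop(columns=["year_week"], inplace=True)

# lstrip("0") removes leading zeros only: "01" -> "1", "10" stays "10".
df["week"] = df["week"].str.lstrip("0")

# Save the cleaned table; index=False keeps the row numbers out of the file.
df.to_csv("singapore_clean.csv", index=False)
```

Note that lstrip only touches the left edge of the string, so a week like "10" keeps its trailing zero.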
Let's practice with melt a bit first. Here I created a simpler data frame. Each row is a student, and there are three columns for three different subjects; every cell is the grade of that student in that subject. From this data frame we want to create another one with one row for every cell of the original. To do so we call melt with var_name "subject", value_name "grade", and id_vars "student". Melt turns a wide data frame into a tall one: for every cell of the original we get one row with three columns, student, subject, and grade.

Now we do the same for our original data frame. To repeat what we have: we call melt, but this time instead of subjects we have weeks, instead of grades we have cases, and instead of students we have regions. Here it is; only five rows are shown, but it is not wide anymore, it is tall.

Next we want to remove the not-available values. We just call dropna, in place again. Now you see there are no NAs anymore, only values. But as you can notice, the indexing changed: it used to start from zero, and now it starts from 105. To reset the indexing we call the reset_index function, and it starts from zero again.

Here we are just adding a bunch of new columns: year, country, and disease. They are static. One more thing is left: the week column values contain the word "week" together with the number, but we don't need the repeated word, so we apply a function that takes the last two characters and strips whitespace, which leaves just the week number. Here it is. Finally, we rename the negeri column to region. Here it is. It's done. Thank you. Questions?
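The melt-based pipeline above can be sketched end to end on a toy block; the region names, week labels, and counts are invented, and the real sheet's labels may differ:

```python
import pandas as pd

# Toy stand-in for one year-block of the Malaysia sheet; all values invented.
wide = pd.DataFrame({
    "region": ["Johor", "Kedah"],
    "week 1": [10.0, None],   # None plays the role of the sheet's NA cells
    "week 2": [12.0, 7.0],
})

# Wide -> tall: one row per (region, week) cell of the original block.
tall = wide.melt(id_vars="region", var_name="week", value_name="cases")

# Drop the not-available rows, then renumber the index from zero.
tall.dropna(inplace=True)
tall.reset_index(drop=True, inplace=True)

# Static columns, identical for the whole block.
tall["year"] = 2010
tall["country"] = "Malaysia"
tall["disease"] = "dengue"

# Keep only the trailing week number, dropping the repeated "week" word.
tall["week"] = tall["week"].str[-2:].str.strip()
```

After dropna the index skips the removed rows, which is why reset_index(drop=True) is needed before the frame looks tidy again.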