customer segmentation in Python, so the audience is people who are relatively new to data science in Python. I'll do a step-by-step walkthrough, but if you are already an expert you're welcome to stay as well. You can check out my GitHub account here and look directly at the slides, the sample data, and the code if you want to follow along as I'm talking. Customer segmentation is very important for every single business. Imagine if you can segment your user base into different groups: then you can provide them with customized products and services, and customization normally makes people happy, which probably means higher profit for your company, which is great for the business. Take the 2012 Obama re-election campaign for example: Obama's data science team used very tailored messages to target voters based on behavioral data, demographic data, and financial data, and that definitely helped Obama win a lot of undecided voters, and the election as well. So for the rest of the talk I'm going to pretend I'm a new data scientist at a company called Ypedia. It's one of the world's leading hotel booking companies, I'm really excited to join it, and it obviously has some competitors out there with similar names. The mission of the company is to revolutionize how people travel through data and technology. On my very first day of work the CEO gave me a huge assignment. He has a board meeting coming up and he needs to answer a few questions, such as: provide a report of underperforming and overperforming segments; how to tailor new marketing campaigns for different cities in the next few weeks; and, the most important question, how to improve the user booking rate, which is the bottom line of the business. As a newbie in the company I want to impress the CEO, so I say yes, I'm going to help you with that. So what should I do? I go to our data engineering team and they give me this CSV file.
They give me this CSV file with a lot of numbers, and I need to come up with a work plan very quickly to answer these very difficult questions for the board meeting. I'm going to first explore the data, to understand what kind of user information I have in the first place, and then I'm going to try to use some basic data tactics to answer the three questions. I know that in Python, NumPy and pandas can help me easily explore the data. First of all, I load the data: I load the CSV file into pandas using the read_csv function. This sample data is from the Ypedia homepage, where the data engineering team has logged basic information about what people are searching for and whether they have made a booking or not. After we load the data into a DataFrame, which is something similar to an Excel table in pandas, we can use a function such as head to take a look at the first rows of the data, here the first 20. So this is the kind of information we have in the database: we have date_time, which is the timestamp of when the user made the search; we have location data for where the user made the booking. It's worth noting that the data here is anonymized into integers, so instead of seeing the actual location name, the city name, we're seeing an integer corresponding to each different value. And we have information like whether the booking was made on mobile or not, whether there's a marketing package related to it, which marketing channel the booking came from, the check-in date, and the check-out date. These are really useful pieces of information. In pandas we can use a few different operations on a DataFrame. We can view, which is what we did just now. We can also select data: let's say I want to just look at the different site names we have; what I can do here is select that column, and in this way I'm only selecting one column instead of the whole DataFrame.
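A minimal sketch of this first step. The file contents here are a tiny inline stand-in for the real CSV, and the column names (date_time, site_name, channel, is_booking) are assumptions, not necessarily the real schema:

```python
import io
import pandas as pd

# A small inline CSV standing in for the real log file; with the real data
# you would call pd.read_csv("bookings.csv") instead.
csv_text = """date_time,site_name,channel,is_booking
2015-09-03 17:09:54,2,9,0
2015-09-04 08:32:10,2,3,1
2015-09-04 11:15:22,11,9,0
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_time"])

print(df.head())        # view: the first rows of the DataFrame
print(df["site_name"])  # select: one column instead of the whole frame
```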
We can also merge different DataFrames together into one DataFrame, and we can also group by a DataFrame to understand aggregate numbers based on the group values; that's something I'm going to demonstrate later on. Pandas has many functions that are very easy to use and directly give you exploratory data analysis results: for example min and max, nunique to give you the number of unique entries in a column, and the describe, info, and dtypes functions. These are also very easy-to-use exploratory data analysis tools; you see a lot of information about the DataFrame by just typing one word. After we understand what data we have, the next step is to validate some of the business logic. One of the biggest lessons I've learned so far is to never trust the raw data given to you, because whatever data you're given, it's very common to have programming bugs from when the data was logged, or human mistakes if the data was entered by human beings. In this case there are a few pieces of business logic I want to check. The first one: is the check-in date actually later than the booking date? You're booking a trip for some time in the future, not for some time that has already passed. And the check-out date should be greater than the check-in date, so you are staying for a positive number of days instead of a negative number of days. And the number of guests should be more than zero, so you are booking not for a ghost but for actual human beings. There are definitely a lot more checks you can think of, and I encourage everyone out there doing data analysis to perform as many business-logic checks as you can think of.
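The one-word exploratory helpers mentioned a moment ago look like this in practice, on a toy frame standing in for the real log (column names assumed):

```python
import pandas as pd

# Toy frame standing in for the real booking log.
df = pd.DataFrame({
    "site_name": [2, 2, 11, 2],
    "channel": [9, 3, 9, 0],
    "is_booking": [0, 1, 0, 0],
})

print(df.describe())            # count, mean, std, min, quartiles, max per column
print(df["channel"].nunique())  # number of unique entries in one column
print(df.dtypes)                # the inferred type of every column
df.info()                       # index, columns, non-null counts, memory usage
```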
For demonstration purposes, I'm just going to check whether there are any data entries where the check-in date is actually before the booking date. I'm performing a select operation on the DataFrame here: I select the rows where the check-in date is less than the booking date. And yes, we do have many entries that don't follow this business logic. It's interesting to note that with DataFrame operations you can directly add, subtract, and multiply, the same way we manipulate numbers, which is one of the beauties of DataFrames: for example, when I create new columns such as duration or days in advance, I'm directly using subtraction between columns. When we create new columns with more complicated business logic, for example when I want to exclude the data entries we just found invalid using the date check and assign a null value to them, that means I need to write some if/else logic. When the logic is more complicated we normally use a row operation instead of directly subtracting two columns: we define a function on a row, to be applied to every single row of the DataFrame, and then apply that function to the whole DataFrame to get the new column. That's another common operation on DataFrames. OK, with this understanding of the data, we can go ahead and answer the first question: what are some of our outperforming and underperforming segments? Here I'm going to introduce a different function called group by. What group by does is split the data points into groups based on certain columns, then you apply an aggregate function to each of the group values, and the last step combines the results into a new DataFrame.
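The pieces described so far, the validity check, the derived columns, and the split-apply-combine pattern, can be sketched like this. The column names (date_time, srch_ci for check-in, srch_co for check-out) are assumptions standing in for the real schema:

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names.
df = pd.DataFrame({
    "date_time": pd.to_datetime(["2015-09-03", "2015-09-04", "2015-09-05"]),
    "srch_ci":   pd.to_datetime(["2015-09-10", "2015-09-01", "2015-09-20"]),
    "srch_co":   pd.to_datetime(["2015-09-12", "2015-09-03", "2015-09-22"]),
    "channel":   [9, 9, 3],
    "is_booking": [1, 0, 1],
})

# Business-logic check: rows where check-in is before the booking date.
invalid = df[df["srch_ci"] < df["date_time"]]
print(len(invalid))  # entries violating the business logic

# Simple derived columns come from direct column arithmetic...
df["duration"] = (df["srch_co"] - df["srch_ci"]).dt.days

# ...while more complicated logic uses a row operation applied to every row.
def days_in_advance(row):
    if row["srch_ci"] < row["date_time"]:  # invalid entry: null it out
        return np.nan
    return (row["srch_ci"] - row["date_time"]).days

df["days_in_advance"] = df.apply(days_in_advance, axis=1)

# Split-apply-combine: booking rate and attempt count per channel.
rates = (df.groupby("channel")["is_booking"]
           .agg(["mean", "count"])
           .reset_index())  # turn the group key back into a normal column
print(rates)
```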
Let's see how it's done here. We can try to get the booking rate for every single channel. What I'm doing here is grouping the DataFrame by the channel values, then applying the aggregate functions mean and count to the is_booking column. is_booking here is a binary 0/1 column, so the mean of this column gives me the booking rate, and the count gives me the number of booking attempts. Let's see what we get: now we have the number of bookings and the booking rate for every single channel. It's worth noting that after you apply a group-by aggregate, the channel, which is the group name, becomes an index instead of a normal DataFrame column; what we normally do is reset the index so the group name becomes a column again. After that you can sort the values by booking rate or by channel name, up to your preference. So let's take a look at channel 0 here: it has about 12,000 bookings and its booking rate is 7.2%. Compare this to the overall booking rate, which is 7.9%. Can I say this channel has a lower booking rate than the rest of the channels? For the economists and statisticians out there, you might say that might not be true, because I'm simply comparing averages; what's the statistical significance behind it? That leads us to the next question: how do we test the statistical significance of the outperformance? I'm introducing the two-sample test here, a hypothesis test for the equality of two binomial samples. In this case I'm looking at the booking rate for each sample, and these are binomial samples because we are looking at data points where the value is either 0 or 1, which is why we call them binomial. In order to see whether there's a real difference in the probability distributions of the two samples,
we need to understand the concept of random sampling error. If we see a difference between the two samples, is it because of random sampling error, or because there is a real underlying difference between the two samples? So I'm introducing the concepts of z-score and p-value here, which some of you are probably familiar with. The p-value measures how likely it is that the difference we see is actually due to random sampling error while the underlying distributions are the same; so if the p-value is very small, that means the two distributions we are looking at are indeed different. Let's look at how we implement this in Python. I created a function here, a stats comparison: looking at the formula, I need n, the number of booking attempts, and p, the booking rate, for both samples. I calculate n and p for the first sample, then n and p for the rest of the bookings, and then I use the stats package from SciPy to calculate the z-score and the p-value. I add a new column, significant: if the p-value is less than 0.1, then I can confidently say the two samples are indeed different, so I assign 1, -1, or 0 depending on whether the segment is significantly larger, significantly smaller, or not significant at all. Let's apply this function to the channels we were looking at just now. That's the result we have; you can see all the intermediate values I use to calculate the z-score and p-value. Let's look at channel 0 again: its booking rate is 7.2 percent, and for bookings other than this sub-segment the average booking rate is 8.1 percent, so we can say this channel is performing worse than average with statistical confidence. And for some of the channels here where the number of bookings is very small, we can see that the sample size is not big enough to pass the test of statistical significance.
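A sketch of this kind of significance test for two binomial proportions; the function name and the sample numbers below are illustrative, not the talk's exact code:

```python
import numpy as np
from scipy import stats

def proportions_ztest(n1, p1, n2, p2):
    """Two-sample z-test for the equality of two binomial proportions.

    n1, p1: attempts and booking rate for the segment;
    n2, p2: attempts and booking rate for everything else.
    """
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)          # pooled booking rate
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))               # two-sided p-value
    return z, p_value

# A hypothetical channel: 12,000 attempts at a 7.2% booking rate, against
# 100,000 other attempts at 8.1%.
z, p = proportions_ztest(12_000, 0.072, 100_000, 0.081)
significant = 0
if p < 0.1:
    significant = 1 if z > 0 else -1  # 1 larger, -1 smaller, 0 not significant
print(z, p, significant)
```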
So with that, we can confidently answer the first question: here are the outperforming segments, here are the underperforming segments, and here are some segments where we simply don't have enough data to say whether they are outperforming or underperforming. With that we can move on to our next question: how do we tailor different marketing campaigns for different cities? Before we can tailor marketing campaigns, we need to understand what the different clusters of cities are and what their characteristics are; after understanding that, we can think about how to tailor the campaigns. So I'm going to introduce clustering here, which is what we call unsupervised learning in machine learning terms. Unsupervised learning means there is no label assigned to each city; you can consider each city as one data point, with no label attached, and clustering helps us find the underlying structure among these data points. In Python we can use the scikit-learn package, which has all the common machine learning estimators. For unsupervised estimators such as k-means, the common functions we use are fit and predict; in a clustering algorithm, predict gives us the cluster label for each point. The way you run a machine learning algorithm in Python is very standard: you create an object for the algorithm's estimator, then you use predict to get the labels. There are a few steps I normally follow when I do clustering in pandas. The first step is to understand what features we can use. In this case our goal is to distinguish different cities, so we need to understand what characteristics, what features, distinguish one city from another. This can come from your business sense, and it can also come from the exploratory analysis you did just now. I selected eight different features which I think are going to be relevant here; then
what I'm doing here is using the group-by function I explained just now to create city-level data for these metrics. The second step is to decide whether we should standardize the data. By standardizing the data I mean converting each value to the number of standard deviations from the average. This step is very important, especially for clustering algorithms like k-means, because when we calculate the distance between points, if the magnitude of one variable is a lot larger than another variable, we are putting a lot more weight on that variable. For example, if one variable is on the order of thousands whereas another variable is in single digits, the distance along the first variable will carry much larger weight in the overall distance than the second one. By standardizing the data we avoid this problem. The next step is to choose the clustering method. I'm using k-means here, and I also need the number of clusters. I'm choosing an ad hoc number, three: you can imagine the marketing budget only allows three different marketing campaigns for three different types of cities, so the choice is definitely subject to the business context and how many clusters you want to get. There are also a lot of methods out there to determine the optimal number k; one common methodology is what we call the elbow method, which helps you select the optimal k by increasing k up to the point where adding more clusters is not going to help you fit the model much better. As I mentioned just now, creating a clustering algorithm using scikit-learn is very easy: you create the object, you specify the parameter n_clusters=3, then you can immediately fit the data. So with two lines of code you've created a clustering model. Great, let's run this. Now we have created these three clusters in an eight-dimensional feature space.
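A minimal sketch of the standardize-then-cluster step, on toy city-level data; the feature names and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy city-level features; the real ones would come from the group-by step.
rng = np.random.default_rng(0)
city_df = pd.DataFrame({
    "booking_rate": rng.uniform(0.02, 0.15, size=30),
    "avg_duration": rng.uniform(1, 10, size=30),
    "avg_days_in_advance": rng.uniform(0, 120, size=30),
})

# Standardize: each feature becomes "number of standard deviations from the
# mean", so no single large-magnitude feature dominates the distance.
X = StandardScaler().fit_transform(city_df)

# Two lines: create the estimator with k=3, then fit and get cluster labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```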
How do we visualize the data? We know we can visualize 2-D data and 3-D data, but probably not 8-D data. One way is to use a technique called principal component analysis, which helps you reduce the eight-dimensional space to a two-dimensional space. You can imagine principal component analysis as something that reduces the dimensionality but still tries to maintain the variability within the raw data, so you get a two-dimensional graph that keeps a lot of the variation in the data. I'm using the decomposition module and its PCA class here, and again with one line of code you can easily reduce the dimension from 8 to 2. With that I can make a scatter plot based on the two dimensions we have. Let's see what this graph shows: these are the three clusters we created just now, one color per cluster, and it's a good illustration of the basic principle behind k-means: points that are very close to each other end up in the same cluster. So how would this be useful? We have these three different city clusters; how do we apply them in this marketing campaign context? We need to understand the exact differences in the business metrics for these three clusters, and for that I need to merge data first. Merge is another common operation on DataFrames in pandas. From the standardized data I used just now, I select user_location_city and cluster, then merge it with the original data from before standardization, using the common column, which is user_location_city here. Then I group by cluster and take the mean of all the business metrics for each cluster. So we have clusters zero, one, and two here; let's see how they differ. Let's look at cluster one: people in cluster one stay a lot longer, they book a lot further in advance, and they also travel a lot further.
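The PCA projection and the merge-and-compare step just described can be sketched like this, again on toy data with assumed column names:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Toy standardized city-level data: 30 cities, 8 features (names assumed),
# plus a cluster label from the k-means step.
X = rng.normal(size=(30, 8))
std_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
std_df["user_location_city"] = np.arange(30)
std_df["cluster"] = rng.integers(0, 3, size=30)

# PCA: reduce the 8-dimensional feature space to 2 dimensions while keeping
# as much of the variability in the raw data as possible.
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)  # two columns per city, ready for a scatter plot

# Merge the cluster labels back onto the pre-standardization metrics, then
# group by cluster to compare the raw business metrics of each cluster.
orig_df = pd.DataFrame({
    "user_location_city": np.arange(30),
    "avg_duration": rng.uniform(1, 10, size=30),
    "avg_days_in_advance": rng.uniform(0, 120, size=30),
})
merged = std_df[["user_location_city", "cluster"]].merge(
    orig_df, on="user_location_city")
print(merged.groupby("cluster")[["avg_duration", "avg_days_in_advance"]].mean())
```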
For cluster zero here, they tend to be larger groups, with a high number of adults, a high number of children, and more rooms booked; these are probably the family types who always travel in a larger group. By understanding the characteristics of the different clusters, your marketing campaigns can be tailored to the specific purpose for which people travel, and that means your campaigns are going to be more effective, because you know exactly how to target them. Besides marketing, there are a few other common uses of clustering. For example in insurance, by grouping policy holders into different groups you can understand which policy holders have a higher average claim cost, which might help you set the premium for different insurance policies. And for city planning, if you have a large number of households and you want to build community centers, it's important to understand the different clusters based on household type, value, and geographical location. With that we can go ahead and answer our next question, which is basically: how do we improve the chance of booking for individuals? To do that, we need to understand what triggers people to have a high chance of booking, what factors make people more likely to make a booking. I'm introducing the concept of supervised learning here, because when we look at individual user data we have a label associated with each data point; the label here is whether the person has made a booking or not. What a decision tree does is split the data into different subsets, trying to make each subset either all zeros or all ones, and the data is split by choosing the attribute that does the best job of explaining the data first.
For example, the split I show here is duration less than or equal to 2.5: people who only stay for one or two days are very different from people who stay longer than that, so that's how the data gets split up. Again we can use packages from scikit-learn: the tree module, cross-validation utilities, and so on. The steps are very similar to clustering, but one major difference is that here we are using a data sample with labels, so we can first split the data into train and test sets, build a model on the training data, and later test its accuracy on the test data. Using train_test_split, a single function, I split the data into train and test with the test size being 20 percent of the data. I then create an object using the DecisionTreeClassifier class, and again, similarly, I fit the data with the features I have and the outcome variable, which is the label. Let's see what that gives me. In this case I'm using cluster zero from the sample. After we create a decision tree, we can visualize it to see what it looks like. To visualize the decision tree I'm using pydot here: I convert the decision tree into pydot data using the export_graphviz function, and plot the graph of the decision tree after converting it to a PDF. OK, this result is expected, because the sample I'm using is only 200 rows. That's one of the limitations of a decision tree: if your sample size is very small and you set very stringent conditions, such as wanting six different leaf nodes with a minimum sample of at least 200, there's an incompatibility between your sample size and the parameters you are using. So in this case I was using cluster zero; let's see what the other clusters look like. I group the sample by cluster to see the count for each cluster, and we see cluster two has a much larger sample.
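A sketch of the decision-tree steps on a synthetic sample; the feature names and the label-generating rule below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

rng = np.random.default_rng(2)

# Toy user-level sample; the label is whether the person made a booking.
df = pd.DataFrame({
    "duration": rng.integers(1, 15, size=500),
    "days_in_advance": rng.integers(0, 180, size=500),
    "srch_adults_cnt": rng.integers(1, 5, size=500),
})
# Synthetic label: short, last-minute trips book more often.
y = ((df["duration"] <= 2) & (df["days_in_advance"] < 30)).astype(int)

# Hold out 20% of the data for testing, train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=6, min_samples_leaf=20,
                             random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out 20%

# Export the tree for visualization (render dot_text with graphviz/pydot).
dot_text = export_graphviz(clf, feature_names=list(df.columns), out_file=None)
```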
So let's use cluster two here and see what we get. Great, with the largest sample we get a much nicer tree. Let's look at the left side of the tree: for people who are staying only one or two days, with only one adult, booking zero days in advance, meaning a same-day booking, we can see the probability of booking. The first value here is the number of zeros and the second value is the number of ones, so the overall booking rate in this leaf is thirty-something percent, compared to an average booking rate of only eight percent. So you see: people traveling in very small groups who are not booking very much in advance have a much higher chance of booking. Similarly, you can see other branches that give you subgroups with a much lower chance of booking. With that in mind, you can customize your products and services to nudge people to book more. For example, if someone is booking 90 or 180 days in advance, you know this person is probably less likely to book; in that case your marketing campaign can target them with a promo code, and your product can nudge them with nicer features, for example a recommendation for the season in which they are going to travel. With this kind of customized product and service, built on understanding the actual chance of booking, you can improve the overall booking rate of your website. Besides product and marketing, another common use case of decision trees is weather forecasting: what kinds of signals are more likely to lead to rain? So, we have written probably less than 100 lines of code, and we've answered some of the really hard questions for the board meeting. This is some basic customer segmentation in Python, and I've listed some resources here that you can explore to learn more about it. My best advice is
to really dive into a problem: get a dataset you find interesting and analyze it. The way you improve your data analysis skills, and how you apply data to a business context, is just to practice. Thank you. Any questions?

[Audience question, PyCon SG 2016] How would you assess the effectiveness of your clustering or decision tree using only offline validation? What I mean is: say you got 80 percent, how do you assess it? Maybe you create five different models and you have the budget to invest in only one of them; how would you pick which one to go with?

Yeah, that's a very good question. I think before you decide which algorithm to use, you need to first understand the data. For example, the k-means algorithm I demonstrated just now is only suitable for certain types of data, because for every machine learning algorithm we are making huge assumptions about what kind of data we have; k-means, for instance, is very bad for elongated data, because it assumes the clusters are round. So if you know your dataset doesn't have round clusters, that's probably an algorithm you shouldn't use. And in order to validate the model without A/B testing, there are a few techniques you can use; for example cross-validation, which means you keep training your model on different splits of test and train, and then you pick the model with the highest accuracy. These are mechanics you can use to make sure the model and the parameters you are using are optimal. But I would say there is nothing that can replace A/B testing: A/B testing is a lot about behavioral response, about how people are going to respond after you make the intervention. Everything I demonstrated in this talk is only the very first part, about what kind of data insights you can get
to understand what actions might be likely to lead to a difference in behavior. For example, from the higher and lower chances of booking, you can come up with ideas for the intervention, but whether the intervention is effective, and how people are going to respond to it, is a whole different story. Thanks. [Thank you.]